Article

Open-Vocabulary Part-Level Detection and Segmentation for Human–Robot Interaction

School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6356; https://doi.org/10.3390/app14146356
Submission received: 19 June 2024 / Revised: 29 June 2024 / Accepted: 30 June 2024 / Published: 21 July 2024

Abstract

Object detection and segmentation have made great progress in robotic applications. However, intelligent agents require fine-grained recognition algorithms, rather than object-level ones, together with language instructions to enhance the interaction between humans and robots. To improve the robot's interactivity when responding to language instructions, we propose a method for part-level detection and segmentation that exploits vision language models. In this approach, Swin Transformer is introduced into the image encoder to extract image features, and the FPN (Feature Pyramid Network) is modified to better process the features from Swin Transformer. Next, the detection and mask decoders are proposed to align image features with text embeddings, enabling human–robot interaction via language. Finally, we verify that the text embeddings are affected by the input command and that different prompt templates also affect classification. The proposed method, validated on two datasets (PartImageNet and Pascal Part), is able to understand and execute part-level missions and segments and detects parts more accurately than existing interactive methods.

1. Introduction

With the development of deep learning, robots play an increasingly important role in human daily life owing to their more user-friendly and intelligent properties [1]. However, two critical challenges remain before the next generation of advanced agents can emerge. One challenge involves the ability of robots to understand natural language, which affects how efficiently robots respond to human commands and thus the seamlessness of the interaction. The other challenge involves the environmental perception of robots, including both object detection and part recognition during the interaction process. For example, when an intelligent agent receives a command about "the cap of the bottle" in a complex environment, it must comprehend the location of the part that it can twist, much as a person grasps a bottle cap to open it. In this paper, datasets containing fine-grained images and short commands are reconstructed, and their corresponding part-level segmentation masks and detection boxes are selected as the ground truths to evaluate the capability of intelligent agents to address these challenges.
Multimodal models have made significant progress in combining language and visual knowledge. Nevertheless, most existing studies based on the contrastive language–image pre-training (CLIP) model [2,3] focus on object-level recognition, and part-level recognition is rarely reported. For object recognition, current foundational vision language models (VLMs) can localize objects based on their labels. However, more fine-grained recognition is required for the intelligent agent to execute subsequent manipulations, such as twisting the cap of a bottle. Hence, it is necessary to recognize fine-grained parts for the development of the intelligent agent. Owing to the scarcity of part datasets, especially part-level segmentation datasets, studies on part-level recognition focus on open-world or open-vocabulary part-level segmentation [4,5]. In this paper, VLMs are explored by reconstructing datasets containing both language and fine-grained annotations, and they are constructed to segment fine-grained parts from language. To this end, the datasets are reconstructed based on Pascal Part [6] and PartImageNet [7]. Pascal Part includes 10,103 images, 20 object-level categories, and 93 part-level categories after reconstruction; PartImageNet contains approximately 24,000 images, 11 super-categories, and 40 part-level categories. The evaluation of the VLMs on the reconstructed datasets reveals that the text embedding, to a certain extent, improves the accurate detection and segmentation of different parts.
In summary, an integrated framework is presented in this paper for fine-grained segmentation and detection by exploring vision language models in the interaction between robots and humans. Different from conventional methods, this method highlights part-level understanding, which maximizes the chance of robots manipulating objects. Additionally, the proposed method captures the relationship between parts and their objects via the connection with text features. Part-level segmentation investigated in previous studies typically takes an image as input and outputs part names without their associated objects; from such output, one cannot know which objects the parts belong to, as the relation between parts and objects is neglected. Finally, experiments are performed on two datasets to verify the performance of the proposed method.

2. Related Works

2.1. Part Segmentation

Object segmentation aims to identify semantically meaningful pixels of objects in an image. Further, there is a growing demand for part-level segmentation, which provides a more fine-grained understanding of objects by assigning semantic labels to the different parts of an object [8,9,10,11]. Pioneering studies have mainly been conducted with fully supervised methods, trained on datasets containing part annotations [12], such as PartImageNet [7], Pascal Part [6], and ADE20K [13]. These datasets cover relatively common scenarios because they contain a variety of objects and parts. Some studies also explore more specific tasks, which are, however, constrained to the domains of their datasets. Special part-level datasets have therefore been established, such as those for humans [14,15], cars [16,17], and the fashion domain [18,19], which are limited in their application scenarios, especially for the interaction between robots and the environment. Therefore, the common datasets Pascal Part and PartImageNet were selected here for the interactive tasks of robots to train and evaluate the model.

2.2. Open-Vocabulary Segmentation

Conventional fully supervised methods for detection and segmentation have limited generalization capability and fail to process objects or parts from open-world categories not covered by the training set. Recently, vision-and-language representation learning has developed rapidly in various domains, such as visual reasoning [20,21], image–text retrieval [22,23,24], and visual question answering [25,26]. In open-vocabulary segmentation, the core idea is to bridge the gap between the vision and language embedding spaces. Most related methods use an existing image encoder, such as MaskFormer [27], as the backbone and apply CLIP for classification; their core is aligning the features from the image encoder with those of CLIP. For instance, OVSeg [28] crops the region proposals and fine-tunes CLIP via a mask prompt tuning mechanism. FC-CLIP [29] leverages a frozen convolutional network from CLIP as the backbone and aligns the region features from the visual backbone with CLIP. SAN [30] utilizes a side adapter network with a frozen CLIP to obtain the category of masks. Beyond object-level segmentation, part segmentation needs an enhanced visual representation of images. Hence, object detection based on region features has been investigated in several studies. For example, OSCAR [31] learns universal semantics from region information and object tags. UNITER [32] uses the supervision of similar outputs across multiple modalities to establish word–region alignments. ALBEF [33] aligns the image features and text features before fusing them with a multimodal encoder. SimVLM [34] leverages large-scale weak supervision to reduce the requirement for regional labels. In this study, the robot is empowered to acquire the coordinates and masks of specific parts of an object according to human commands.

3. Methods

Since the accurate segmentation and detection of the parts of different objects can effectively guide how a robot manipulates an object, part-level segmentation and detection constitute an important task in the interaction between robots and humans. In practical applications, the model needs to satisfy the requirements of accurate segmentation and detection at the same time, so that the robot can efficiently execute human commands. However, existing models cannot accomplish both part-level detection and segmentation for human–robot interaction. In this paper, an architecture is proposed for part-level detection and segmentation for robot interaction, which is composed of an image encoder, a text encoder, a detection decoder, and a mask decoder, as shown in Figure 1. In the image encoder stage, Swin Transformer is utilized as the backbone to extract features from images, an FPN is employed to generate multi-scale features, and a region proposal network (RPN) is used to obtain the regions of interest (RoIs). The text encoder from CLIP is employed to capture human instructions to enhance the part-level recognition capabilities. The detection decoder utilizes convolutional and regression computations to determine bounding boxes, while the categorization of parts relies on the similarity between image features and text embeddings. Meanwhile, the mask decoder follows the mask region-based convolutional neural network (Mask R-CNN) architecture. A simplified sketch of this data flow is given below.
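To make the overall data flow concrete, the following is a minimal PyTorch-style sketch of the architecture in Figure 1; the class and argument names (PartLevelDetector, image_encoder, and so on) are illustrative placeholders for the components described above, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PartLevelDetector(nn.Module):
    """Illustrative wiring of the four components described in Section 3."""

    def __init__(self, image_encoder, text_encoder, detection_decoder, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder          # Swin backbone + FPN + RPN
        self.text_encoder = text_encoder            # CLIP text encoder (frozen)
        self.detection_decoder = detection_decoder  # RoI Pooling + box/category predictors
        self.mask_decoder = mask_decoder            # RoI Align + mask predictor

    def forward(self, images, part_prompts):
        # 1. Multi-scale image features and region proposals.
        features, proposals = self.image_encoder(images)
        # 2. Text embeddings for every part-category prompt, e.g. "cap of the bottle".
        with torch.no_grad():
            text_embeddings = self.text_encoder(part_prompts)
        # 3. Boxes and part categories via similarity with the text embeddings.
        boxes, scores = self.detection_decoder(features, proposals, text_embeddings)
        # 4. Class-agnostic masks for the refined boxes.
        masks = self.mask_decoder(features, boxes)
        return boxes, scores, masks
```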

3.1. Image Encoder

The image encoder is responsible for extracting part-level features from the images: the backbone extracts visual features, the FPN provides multi-scale features, and the RPN generates the RoIs. In the image encoder, the backbone and FPN are modified, whereas the RPN does not undergo any improvement. Backbones can be broadly divided into convolution-based networks and transformer-based networks, such as ResNet and Swin Transformer. Compared with ResNet, Swin Transformer extracts image features using a self-attention mechanism based on shifted windows, enabling the model to provide a more compact feature representation by capturing global and local context information. Furthermore, the sizes of the pooling layers and convolutional kernels affect the convergence and computational efficiency of the model. Therefore, the backbone in the image encoder adopts Swin Transformer to extract visual features. The feature map obtained by the backbone alone has low resolution and thus lacks multi-scale information, especially for small parts. Therefore, an FPN is added to the image encoder to process the features from the backbone and generate multi-scale features. Many FPN variants exist owing to different design and improvement strategies, including the original FPN, FPN++, PANet, BiFPN, and NAS-FPN. As shown in Figure 2, an FPN is split into two paths: the bottom-up path and the top-down path. The bottom-up path is composed of transformer structures, which are responsible for acquiring features from the image. The top-down path fuses high-level semantic features with low-level detailed information.
Swin Transformer can enhance the ability of robots to extract fine-grained information while improving computational efficiency. Therefore, Swin Transformer is introduced into the backbone to facilitate fine-grained detection and segmentation when robots perform subtasks. The Swin Transformer architecture was proposed in [35], as shown in Figure 3A, and each Swin Transformer unit consists of two successive Swin Transformer blocks, as illustrated in Figure 3B. In Figure 2, Swin0, Swin1, Swin2, and Swin3 are features of different scales generated by the backbone. The operation of Swin Transformer is divided into five stages. Suppose that the shape of the input image is (N × 3 × W × H), where N is the batch size, 3 is the number of channels of RGB images, W is the width of the input, and H is the height of the input. Across these stages, the spatial size of the feature maps is compressed while the receptive fields are expanded. Firstly, the output of stage 1, of shape (H/4 × W/4 × 48), is obtained by dividing the initial image acquired by the robot into non-overlapping patches. Secondly, the output of the previous stage is projected by a linear embedding layer to dimension C and then processed by several Swin Transformer blocks to obtain Swin0 (H/4 × W/4 × C). Thirdly, Swin1 (H/8 × W/8 × 2C) is constructed similarly to Swin0, except that an additional patch merging layer performs 2× down-sampling in space and doubles the feature dimension to generate hierarchical representations. Fourthly, Swin2 (H/16 × W/16 × 4C) is produced by a patch merging layer and Swin Transformer blocks, analogously to Swin1. Finally, Swin3 (H/32 × W/32 × 8C) is the final stage of Swin Transformer, and its generation follows the same pattern as Swin1 and Swin2. The output of Swin3 represents the high-level semantic features extracted from the input image acquired by the robot. A sketch of the patch merging operation is shown below.
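To illustrate the hierarchical down-sampling between stages, the sketch below implements the patch merging step described above (2× spatial reduction, channel doubling), following the formulation in [35]; it is a simplified illustration rather than the exact training code, and the example sizes are assumptions.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Patch merging between Swin stages: 2x spatial down-sampling, C -> 2C channels."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]     # top-left pixel of each 2x2 neighbourhood
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

# Shape check mirroring Swin1 -> Swin2 in Figure 2: (H/8, W/8, 2C) -> (H/16, W/16, 4C).
feat = torch.randn(1, 28, 28, 192)
print(PatchMerging(192)(feat).shape)  # torch.Size([1, 14, 14, 384])
```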
Following the aforementioned processing procedure, the feature expression ability of the algorithm is enhanced as the number of network layers increases [36]. These stages contain multiple Swin Transformer blocks, which mainly consist of Window Multi-head Self-attention (W-MSA), Shifted Window Multi-head Self-attention (SW-MSA), LayerNorm (LN) layers, and Multi-Layer Perceptron (MLP) layers. In each block, the (shifted) window MSA layer is placed between two LN layers, and an MLP layer follows the LN layer that comes after the improved MSA layer.
Different from MSA, W-MSA computes self-attention within non-overlapping local windows. Through W-MSA, the computational complexity of the model is effectively reduced compared with MSA. SW-MSA modifies W-MSA by introducing a shifted window partitioning approach that constructs cross-window connections while maintaining the efficient computation of non-overlapping windows, thus solving the problem of missing connections across windows. The shifted window partitioning approach utilizes a cyclic shift to keep the number of windows the same as in regular window partitioning, thereby maintaining efficient computation. Suppose that each window contains M × M patches and that the image contains h × w patches. For a fair comparison of computational complexity, MSA and W-MSA are evaluated on patches of the same shape. Compared with W-MSA, SW-MSA does not increase the computational complexity, as it only introduces a window-shift operation. The computational complexities of MSA and W-MSA can be expressed as follows:
$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C$
$\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC$
These formulas highlight a significant difference in the computational requirements of MSA and W-MSA. MSA entails substantial computation, especially for high-resolution images, while W-MSA demands comparatively minimal computational resources. This observation underscores the efficiency and scalability advantages of window-based approaches, particularly when processing large-scale visual data. Moreover, Swin Transformer blocks ensure linear computational complexity with respect to the image size, rather than quadratic complexity, thereby mitigating the computational burden of training models on higher-resolution images. This facilitates training on larger image datasets with higher resolution without excessively increasing computational demands.
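As a quick numerical check of the two complexity formulas, the snippet below evaluates them for an illustrative feature-map configuration; the values of h, w, C, and M are assumptions chosen for the example, not the settings used in the experiments.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention: 4*h*w*C^2 + 2*(h*w)^2*C."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention with M x M windows: 4*h*w*C^2 + 2*M^2*h*w*C."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# Example values (illustrative only): a 56 x 56 patch grid, C = 96 channels, 7 x 7 windows.
h, w, C, M = 56, 56, 96, 7
print(f"MSA:   {msa_flops(h, w, C):.2e}")     # ~2.0e9
print(f"W-MSA: {wmsa_flops(h, w, C, M):.2e}")  # ~1.5e8, roughly 14x cheaper
```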
In the FPN, the new feature maps P2, P3, P4, and P5 are obtained by up-sampling and fusing the high-level feature maps with the corresponding feature maps in the bottom-up branch, so that the semantic information is better represented for the decoders. More specifically, P5 is obtained from Swin3 through a lateral connection followed by a smooth layer. P4 is obtained by the pixel-wise addition of the up-sampled lateral feature of Swin3 and the lateral feature of Swin2, followed by a smooth layer. The generation of P3 and P2 follows the same process as P4. The lateral connection usually involves a 1 × 1 convolution, and its function is to connect the feature maps of different levels in the backbone to achieve cross-level information transmission and fusion. The smooth layer consists of a 3 × 3 convolutional layer and an activation function, which can reduce high-frequency noise, enhance the semantic information of the features, and adjust their scale and number of channels. The role of the smooth layer in the FPN is to further smooth and adjust the feature pyramid to extract more stable and useful feature representations, which contributes to improving the performance of the part-level segmentation and detection task. Owing to the difference between part-level and object-level recognition, we eliminate P6, which contains global information suitable for image classification and large-object recognition. A minimal sketch of this top-down fusion is given below.
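The following is a minimal sketch of the top-down fusion with P6 removed; the channel widths assume a Swin-T-style backbone (C = 96) and NCHW feature maps, and the module illustrates the lateral-connection/smooth-layer idea rather than the exact FPN used here.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of the top-down path in Figure 2 (P6 removed, as in the text)."""

    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        # Lateral connections: 1x1 convs projecting Swin0..Swin3 to a common width.
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # Smooth layers: 3x3 convs applied to each fused map.
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                      # feats = [Swin0, Swin1, Swin2, Swin3], NCHW
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: up-sample the coarser map and add it to the finer lateral map.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]   # [P2, P3, P4, P5]
```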

3.2. Text Encoder

As a large-scale pre-trained model, CLIP shows an impressive classification ability, and its application to detection and segmentation has recently been explored. In the text encoder, text embeddings are generated by the text encoder of CLIP from specifically designed texts, such as [an object part] and [part of the object]. To complete the part-level classification task, the prompts of all categories are fed into the text encoder of CLIP to obtain the text weighting of each category. The text weighting generation process for the part categories can be divided into three steps. Firstly, the text of a part category is encoded into tokens that the model can process, and a start-of-text (SOT) token and an end-of-text (EOT) token are added. Secondly, these tokens are projected to token tensors by an embedding layer (nn.Embedding), and a positional embedding is added to each token. Finally, text features are extracted from these token tensors via the Transformer and the LN layer. The text features are saved as text embeddings, which are employed in the category predictor of the detection decoder. A minimal sketch of this step is shown below.
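The snippet below is a minimal sketch of this step, assuming the openly available openai clip package is installed; the "ViT-B/32" weights and the example prompts are illustrative choices rather than the configuration used in this paper.

```python
import torch
import clip  # https://github.com/openai/CLIP (assumed available)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # model choice is illustrative

# Prompts built from the part categories, e.g. the [part of the object] template.
part_prompts = ["cap of the bottle", "head of the airplane", "arm of the person"]

with torch.no_grad():
    tokens = clip.tokenize(part_prompts).to(device)   # adds SOT/EOT tokens internally
    text_embeddings = model.encode_text(tokens)       # (num_parts, embed_dim)
    text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)

# text_embeddings then serves as the weight of the category predictor (Section 3.3).
```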

3.3. Detection Decoder

The detection decoder is composed of region-of-interest pooling (RoI Pooling), a box predictor, and a category predictor, as shown in Figure 1. Fine-grained localization is one of the most basic functions of the detection decoder. To promote the accuracy of fine-grained localization, Cascade R-CNN is adopted to refine the RoIs from the RPN over three stages [37]. RoI Pooling handles the varying sizes of the RoIs from the RPN by mapping them to fixed-size features. After RoI Pooling, the generated fixed-size features are flattened and passed through a series of fully connected (FC) layers. These FC layers serve as a feature transformation mechanism, enabling the model to learn discriminative representations for part-level classification and bounding box regression. The box predictor is an FC layer with a linear activation function, which is employed to predict the bounding box offsets of parts. In the category predictor, a dot product is adopted to calculate the similarity between the text embeddings from the text encoder and the region feature from RoI Pooling to obtain the score of the region, which can be expressed as follows:
$\mathrm{scores} = x \cdot E + b$
where x represents the region feature, E represents the text embeddings, and b represents the bias vector.
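The category predictor can be sketched as a linear layer whose weights are the frozen text embeddings, as in the score equation above; the dimensions and values in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CategoryPredictor(nn.Module):
    """scores = x . E + b, with the CLIP text embeddings E as fixed classifier weights."""

    def __init__(self, text_embeddings):              # text_embeddings: (num_parts, feat_dim)
        super().__init__()
        self.register_buffer("E", text_embeddings)    # frozen, not trained
        self.b = nn.Parameter(torch.zeros(text_embeddings.size(0)))

    def forward(self, region_features):               # (num_rois, feat_dim)
        return region_features @ self.E.t() + self.b  # (num_rois, num_parts)

# Usage: per-RoI scores over the part vocabulary, e.g. softmax for probabilities.
E = torch.randn(40, 512)   # 40 part categories, 512-d embeddings (illustrative)
x = torch.randn(8, 512)    # 8 region features from RoI Pooling
probs = CategoryPredictor(E)(x).softmax(dim=-1)
```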

3.4. Mask Decoder

The mask decoder is used to generate the mask of a part to implement part segmentation and is composed of region-of-interest align (RoI Align) and the mask predictor. RoI Align maps RoIs of different sizes into fixed-size features, and the mask predictor generates class-agnostic masks. As in the detection decoder, the RoIs generated by the RPN have different sizes. RoI Pooling relies on simple nearest-neighbor interpolation or average pooling, which may lead to precision loss owing to quantization. To solve this problem, RoI Align is implemented with a more accurate interpolation strategy that better retains the spatial information inside the RoI. RoI Align aligns RoIs into fixed-size feature maps for generating part masks; it eliminates the quantization operation and leverages bilinear interpolation to obtain the values at floating-point pixel locations on the feature map, thereby turning the entire feature aggregation process into a continuous operation. The fixed-size features are processed through a series of convolutional layers, which perform spatial convolutions to learn more abstract representations. After the convolutional layers, the features are up-sampled to increase their spatial resolution to match the size of the original RoIs. The up-sampling may involve techniques such as transposed convolution or bilinear interpolation to increase the feature map size. The up-sampled features then pass through a final convolutional layer, typically with a kernel size of 1 × 1, to generate the mask.
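A minimal sketch of the mask decoder is given below, using torchvision's roi_align; the layer counts and the 14 × 14 to 28 × 28 resolutions follow the standard Mask R-CNN mask head and are assumptions rather than the exact settings of this model.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class MaskHead(nn.Module):
    """RoI Align, conv layers, 2x transposed-conv up-sampling, 1x1 conv mask logits."""

    def __init__(self, in_channels=256, hidden=256):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(inplace=True)]
            channels = hidden
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(hidden, hidden, 2, stride=2)  # 14 -> 28
        self.predict = nn.Conv2d(hidden, 1, 1)       # class-agnostic mask logits

    def forward(self, feature_map, boxes, spatial_scale):
        # boxes: (K, 5) with (batch_idx, x1, y1, x2, y2) in image coordinates.
        rois = roi_align(feature_map, boxes, output_size=(14, 14),
                         spatial_scale=spatial_scale, sampling_ratio=2, aligned=True)
        x = self.convs(rois)
        return self.predict(torch.relu(self.upsample(x)))   # (K, 1, 28, 28)
```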

3.5. Loss Function

The loss function of part-level segmentation and detection includes three terms. The detection decoder contributes the classification loss and the bounding box regression loss, and the mask decoder contributes the mask loss. The bounding box regression loss ensures that the model can accurately predict the bounding box position of the part [38]. The formula can be expressed as follows:
$L_{reg} = \frac{1}{N_{reg}} \sum_i p_i^{*} \cdot \begin{cases} 0.5\,(t_i - t_i^{*})^{2} / \sigma^{2}, & \text{if } |t_i - t_i^{*}| < 1/\sigma^{2} \\ |t_i - t_i^{*}| - 0.5, & \text{otherwise} \end{cases}$
where $N_{reg}$ represents the number of RoIs, $p_i^{*}$ is an indicator for positive RoIs, $t_i$ represents the bounding box offset predicted by the model, and $t_i^{*}$ represents the ground-truth bounding box offset. The classification loss uses a cross-entropy loss function to calculate the difference between the category prediction and the ground-truth label for each RoI. The formula can be expressed as follows:
$L_{cls} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})$
$L_{cls}(p_i, p_i^{*}) = -\log\left[p_i^{*} p_i + (1 - p_i^{*})(1 - p_i)\right]$
where $N_{cls}$ represents the number of RoIs, $p_i$ represents the predicted category probability of the i-th RoI, $p_i^{*}$ represents its ground-truth label, and the sum runs over all RoIs.
The mask loss is used to ensure that the model can accurately segment the pixels of the part. The formula can be expressed as follows:
$L_{mask} = \frac{1}{m^{2}} \sum_{1}^{m^{2}} \left[-y\log(\mathrm{sigmoid}(x)) - (1 - y)\log(1 - \mathrm{sigmoid}(x))\right]$
where y represents the label value of the mask at the current location, which is 0 or 1; x represents the output value at the current location; and m × m is the resolution of the predicted mask.
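The three loss terms can be sketched as follows; the regression term uses the common beta = 1/σ² smooth L1 parameterization found in detection toolboxes such as MMDetection [38], so constant factors may differ slightly from the equation above, and all tensor shapes in the toy example are illustrative.

```python
import torch
import torch.nn.functional as F

def reg_loss(t_pred, t_gt, pos_mask, sigma=3.0):
    """Smooth L1 bounding-box regression loss over positive RoIs (L_reg)."""
    beta = 1.0 / sigma ** 2
    diff = (t_pred - t_gt).abs()
    per_coord = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return (per_coord.sum(dim=-1) * pos_mask).sum() / pos_mask.sum().clamp(min=1)

def cls_loss(p_pred, p_star):
    """Cross-entropy classification loss averaged over RoIs (L_cls)."""
    return F.binary_cross_entropy(p_pred, p_star)

def mask_loss(mask_logits, mask_targets):
    """Mean binary cross-entropy over the m x m pixels of each predicted mask (L_mask)."""
    return F.binary_cross_entropy_with_logits(mask_logits, mask_targets)

# Toy shapes: 8 RoIs, 4 box offsets each, 28 x 28 masks (values are illustrative only).
t_pred, t_gt = torch.randn(8, 4), torch.randn(8, 4)
pos_mask = torch.tensor([1, 1, 0, 1, 0, 0, 1, 0], dtype=torch.float)
total = (reg_loss(t_pred, t_gt, pos_mask)
         + cls_loss(torch.rand(8), torch.randint(0, 2, (8,)).float())
         + mask_loss(torch.randn(8, 1, 28, 28), torch.randint(0, 2, (8, 1, 28, 28)).float()))
```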

4. Results

4.1. Experimental Setup

In the experiments, the software environment consists of Python 3.8.13, PyTorch 1.7.1, and CUDA 11.3, and the hardware environment consists of an Intel Core i7-7800X CPU, 3 × NVIDIA GTX 1080 GPUs, and a 64-bit Ubuntu 16.04 operating system for model training and testing.

4.2. Datasets

In this study, two datasets related to part-level detection and segmentation, namely PartImageNet and Pascal Part, are used. To assess the performance of the proposed method, the model is trained and tested on the two datasets separately. To apply this method to the interaction tasks of robots, the part labels of the two datasets are modified. The distribution of each category in the two datasets is shown in Figure 4.
PartImageNet: PartImageNet includes 158 classes from 11 super-categories and offers 40 part categories, where the most frequent part category, [foot of the quadruped], has 13,990 samples and the least frequent, [head of the airplane], has 252.
Pascal Part: Pascal Part includes 20 object categories and 193 part categories. After the elimination process, the number of part categories is reduced to 93 for this study, where the most frequent part category, [arm of the person], has 9158 samples and the least frequent, [saddle of the motorbike], has 8.

4.3. Main Result

In this study, experiments are performed on the PartImageNet and Pascal Part datasets to evaluate the performance of the proposed method. Additionally, the proposed method is compared with representative methods using the average precision (AP), mean average precision (mAP), and mAP50 metrics, with higher values indicating better performance in part-level detection and segmentation. AP is defined as the average over 10 IoU (Intersection over Union) thresholds taken in steps of 0.05. mAP is obtained by averaging the AP computed separately for each category. mAP50 refers to the mAP at an IoU threshold of 0.5. The calculation of these evaluation indexes is shown in Equations (8) and (9). Given the difference in the number of parts per category, defining mAP as the mean of the per-category AP values may mitigate this influence to some extent. Moreover, mAP50 explicitly defines the IoU threshold, facilitating a more precise evaluation of model performance. The results are presented in Table 1. The part-level segmentation produced by the proposed method on the Pascal Part and PartImageNet datasets is visualized in Figure 5 and Figure 6, respectively. Although the proposed method is only trained with the ground-truth annotations from the PartImageNet and Pascal Part training sets, it generates high-quality part segmentation masks that align well with the ground-truth masks. On both datasets, the method learns better representations of part features for both detection and segmentation, which corroborates its good performance.
$AP = \int_{0}^{1} P(r)\,dr$
$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$
where P(r) represents the precision when the recall is r; n represents the total number of part classes in the dataset; and $AP_i$ represents the AP of the i-th part class.
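For illustration, Equations (8) and (9) can be approximated numerically as below; this simplified sketch integrates a sampled precision–recall curve and averages per-class AP values, omitting the full COCO-style matching across IoU thresholds, and all numbers are toy values.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (Equation (8)),
    approximated numerically from sampled (recall, precision) points."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def mean_average_precision(ap_per_class):
    """mAP as the mean of the per-class AP values (Equation (9))."""
    return float(np.mean(ap_per_class))

# Toy precision-recall samples (illustrative only, not from the experiments).
r = [0.0, 0.2, 0.5, 0.8, 1.0]
p = [1.0, 0.9, 0.75, 0.6, 0.4]
print(average_precision(r, p))                     # ~0.74
print(mean_average_precision([0.74, 0.55, 0.62]))  # mean over three toy classes
```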

4.4. Comparison of Results

In this study, the Swin Transformer and RetinaNet backbones are compared on the PartImageNet and Pascal Part datasets. The main difference between these methods is the way features are extracted from images. The results show that Swin Transformer outperforms RetinaNet in part-level tasks and takes less time in the training process. The results of part-level detection and segmentation with RetinaNet are shown in Table 2. The comparison reveals that the Swin Transformer backbone obtains more favorable results than the RetinaNet backbone on both the PartImageNet and Pascal Part datasets, as shown in Figure 7A. In both part-level detection and part segmentation, the image feature extraction ability of Swin Transformer is significantly better than that of RetinaNet. In the detection task, the model using the [part of the object] prompt template and extracting image features with Swin Transformer exhibits the best performance, with mAP50 reaching 75.46%, which is 9.78% higher than that of the model using the RetinaNet backbone on the PartImageNet dataset. Moreover, the best mAP50 of the Swin Transformer backbone is 52.35% on the Pascal Part dataset. In the part segmentation task, Swin Transformer also outperforms RetinaNet on both datasets, with the best-performing models achieving mAP50 values of 72.80% (PartImageNet) and 50.86% (Pascal Part). Additionally, the sizes of the Swin Transformer model and the RetinaNet model are 1.58 GB and 525.68 MB, respectively, but the Swin Transformer model spends less time than RetinaNet in the training process.

4.5. Ablation Study

To verify whether part-level detection and segmentation are related to the text embedding, which contributes to the part-level classification component, the effect of different text prompt templates on part-level detection is investigated, and the capability of part segmentation without text prompts is also explored. More specifically, two familiar expressions are used in this experiment, [part of the object] and [an object part], such as [cap of the bottle] and [a bottle cap]. As shown in Figure 7B and Table 3, the text embeddings of both prompts affect the accuracy of part localization and the precision of part segmentation to some extent. The mAP50 of [part of the object] is slightly superior to that of [an object part] on PartImageNet, and different prompt templates likewise have an impact on Pascal Part. This simply demonstrates that the object information also affects the results. For a fair comparison, the part categories in the no-prompt setting are labeled with both the part and its object, rather than with the part name alone. Different prompt templates are thus a key factor influencing the performance of part-level detection and segmentation. Hence, a generally better text prompt for part segmentation still needs to be verified on more datasets, and prompt engineering for part segmentation remains an open problem. The sketch below illustrates how the two templates are instantiated.
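For clarity, the snippet below shows how the two templates are built from (object, part) label pairs before being passed to the CLIP text encoder of Section 3.2; the article-selection rule and the listed pairs are illustrative assumptions rather than the exact strings used in training.

```python
# Example (object, part) pairs from the reconstructed datasets.
pairs = [("bottle", "cap"), ("airplane", "head"), ("person", "arm")]

def part_of_object(obj, part):            # [part of the object] template
    return f"{part} of the {obj}"

def an_object_part(obj, part):            # [an object part] template
    article = "an" if obj[0].lower() in "aeiou" else "a"
    return f"{article} {obj} {part}"

print([part_of_object(o, p) for o, p in pairs])
# ['cap of the bottle', 'head of the airplane', 'arm of the person']
print([an_object_part(o, p) for o, p in pairs])
# ['a bottle cap', 'an airplane head', 'a person arm']
```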
In the process of fusing image features from the image encoder, it can be verified that the multi-scale features generated from the FPN exert significant impacts on the results of part-level segmentation and detection, as shown in Table 4 and Table 5. These results indicate that regardless of which FPN is employed in the image encoder, the results with the FPN are consistently better than those without the FPN. The structure of the FPN is crucial in part-level recognition, as it can effectively fuse semantic information at different scales. Without the FPN structure in the image encoder, the lowest AP and mAP of the part-level detection and segmentation can be observed in both Pascal Part and PartImageNet.
In this study, the FPN with P6, the FPN without P6, and BiFPN are compared on the PartImageNet and Pascal Part datasets. The FPN with P6 and the FPN without P6 are evaluated to show that P6 is not suitable for part-level recognition, and BiFPN is used as a reference to compare the influence of different FPN structures. BiFPN adopts a bidirectional feature propagation and fusion structure that has not only a top-down feature fusion path but also a bottom-up path. Compared with the FPN with P6, the FPN without P6 achieves a 0.99% higher mAP50 on PartImageNet and a 0.98% higher mAP50 on Pascal Part in part-level segmentation. Compared with the FPN without P6, the mAP50 of BiFPN decreases by 8.13% and 2.21% for part-level segmentation on PartImageNet and Pascal Part, respectively. In part-level detection, the FPN without P6 also exhibits the best performance among the compared FPNs.

5. Conclusions

In this paper, a fine-grained segmentation method is proposed based on vision language models to ensure that robots can perceive their surrounding environment during interaction with humans. To enhance the feature extraction capability of this method, Swin Transformer is introduced into the backbone of the image encoder. Additionally, the influence of different text prompt templates on the detection and segmentation of parts is analyzed. Compared with different prompt types and other methods, the fine-grained segmentation effect of the proposed method is better on the two datasets. With the [part of the object] template, the mAP50 of the proposed method reaches 75.46% (PartImageNet) and 52.35% (Pascal Part) in the detection task and 72.39% (PartImageNet) and 50.86% (Pascal Part) in the segmentation task. The results of this study are expected to lay a foundation for the interaction between robots and humans and for the development of intelligent agents. In the future, this model may be applied in various applications, such as robot manipulation, robot obstacle avoidance, and robotic crop picking.

Author Contributions

Conceptualization, S.Y.; methodology, S.Y.; formal analysis, S.Y. and X.L.; investigation, X.L.; data curation, S.Y.; writing—original draft preparation, S.Y.; writing—review and editing, X.L. and W.W.; visualization, S.Y.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [grant Nos. 61573148 and 61603358] and the Science and Technology Planning Project of Guangdong Province, China [grant Nos. 2015B010919007 and 2019A050520001].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Matheson, E.; Minto, R.; Zampieri, E.G.; Faccio, M.; Rosati, G. Human–robot collaboration in manufacturing applications: A review. Robotics 2019, 8, 100. [Google Scholar] [CrossRef]
  2. Mogadala, A.; Kalimuthu, M.; Klakow, D. Trends in integration of vision and language research: A survey of tasks, datasets, and methods. J. Artif. Intell. Res. 2021, 71, 1183–1317. [Google Scholar] [CrossRef]
  3. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  4. Pan, T.Y.; Liu, Q.; Chao, W.L.; Price, B. Towards open-world segmentation of parts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15392–15401. [Google Scholar]
  5. Han, M.; Zheng, H.; Wang, C.; Luo, Y.; Hu, H.; Zhang, J.; Wen, Y. PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning. arXiv 2023, arXiv:2308.12757. [Google Scholar]
  6. Chen, X.; Mottaghi, R.; Liu, X.; Fidler, S.; Urtasun, R.; Yuille, A. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1971–1978. [Google Scholar]
  7. He, J.; Yang, S.; Yang, S.; Kortylewski, A.; Yuan, X.; Chen, J.N.; Liu, S.; Yang, C.; Yu, Q.; Yuille, A. Partimagenet: A large, high-quality dataset of parts. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 128–145. [Google Scholar]
  8. de Geus, D.; Meletis, P.; Lu, C.; Wen, X.; Dubbelman, G. Part-aware panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5485–5494. [Google Scholar]
  9. Michieli, U.; Borsato, E.; Rossi, L.; Zanuttigh, P. Gmnet: Graph matching network for large scale part semantic segmentation in the wild. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 397–414. [Google Scholar]
  10. Zhou, T.; Wang, W.; Liu, S.; Yang, Y.; Van Gool, L. Differentiable multi-granularity human representation learning for instance-aware human semantic parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1622–1631. [Google Scholar]
  11. Li, X.; Xu, S.; Yang, Y.; Cheng, G.; Tong, Y.; Tao, D. Panoptic-partformer: Learning a unified model for panoptic part segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 729–747. [Google Scholar]
  12. Sun, P.; Chen, S.; Zhu, C.; Xiao, F.; Luo, P.; Xie, S.; Yan, Z. Going denser with open-vocabulary part segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15453–15465. [Google Scholar]
  13. Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vis. 2019, 127, 302–321. [Google Scholar] [CrossRef]
  14. Li, J.; Zhao, J.; Wei, Y.; Lang, C.; Li, Y.; Sim, T.; Yan, S.; Feng, J. Multiple-human parsing in the wild. arXiv 2017, arXiv:1705.07206. [Google Scholar]
  15. Yang, L.; Song, Q.; Wang, Z.; Jiang, M. Parsing r-cnn for instance-level human analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 364–373. [Google Scholar]
  16. Reddy, N.D.; Vo, M.; Narasimhan, S.G. Carfusion: Combining point tracking and part detection for dynamic 3d reconstruction of vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1906–1915. [Google Scholar]
  17. Song, X.; Wang, P.; Zhou, D.; Zhu, R.; Guan, C.; Dai, Y.; Su, H.; Li, H.; Yang, R. Apollocar3d: A large 3d car instance understanding benchmark for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5452–5462. [Google Scholar]
  18. Zheng, S.; Yang, F.; Kiapour, M.H.; Piramuthu, R. Modanet: A large-scale street fashion dataset with polygon annotations. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1670–1678. [Google Scholar]
  19. Jia, M.; Shi, M.; Sirotenko, M.; Cui, Y.; Cardie, C.; Hariharan, B.; Adam, H.; Belongie, S. Fashionpedia: Ontology, segmentation, and an attribute localization dataset. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 316–332. [Google Scholar]
  20. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
  21. Gurari, D.; Li, Q.; Stangl, A.J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; Bigham, J.P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3608–3617. [Google Scholar]
  22. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  23. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Boston, MA, USA, 7–12 June 2015; pp. 2641–2649. [Google Scholar]
  24. Xu, J.; Mei, T.; Yao, T.; Rui, Y. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
  25. Girdhar, R.; Ramanan, D. Cater: A diagnostic dataset for compositional actions and temporal reasoning. arXiv 2019, arXiv:1910.04744. [Google Scholar]
  26. Suhr, A.; Lewis, M.; Yeh, J.; Artzi, Y. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 217–223. [Google Scholar]
  27. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  28. Liang, F.; Wu, B.; Dai, X.; Li, K.; Zhao, Y.; Zhang, H.; Zhang, P.; Vajda, P.; Marculescu, D. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Paris, France, 2–6 October 2023; pp. 7061–7070. [Google Scholar]
  29. Yu, Q.; He, J.; Deng, X.; Shen, X.; Chen, L.C. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 32215–32234. [Google Scholar]
  30. Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Paris, France, 2–6 October 2023; pp. 2945–2954. [Google Scholar]
  31. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137. [Google Scholar]
  32. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120. [Google Scholar]
  33. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
  34. Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv 2023, arXiv:2303.04671. [Google Scholar]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Cong, P.; Li, S.; Zhou, J.; Lv, K.; Feng, H. Research on instance segmentation algorithm of greenhouse sweet pepper detection based on improved mask RCNN. Agronomy 2023, 13, 196. [Google Scholar] [CrossRef]
  37. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef] [PubMed]
  38. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
Figure 1. The architecture of the proposed method for part-level segmentation and detection is composed of a text encoder, an image encoder, a detection decoder, and a mask decoder.
Figure 2. The architecture of an FPN, whose inputs are different-scale features from Swin Transformer denoted as Swin0, Swin1, Swin2, and Swin3, and the outputs are P2, P3, P4, and P5.
Figure 3. (A) The structure of Swin Transformer in the image encoder and (B) the structure of the Swin Transformer block.
Figure 4. The distribution of each category in the two datasets.
Figure 5. The qualitative results of part-level segmentation and part detection in Pascal Part. (A) The original images sourced from Pascal Part are used as the input for the model. (B) The prediction results from the model trained on Pascal Part illustrate the segmentation and detection effects.
Figure 6. The qualitative results of part-level segmentation and part detection in PartImageNet. (A) The original images sourced from PartImageNet are used as the input for the model. (B) The prediction results generated by the model trained on PartImageNet demonstrate the segmentation and detection effects.
Figure 7. The influence of different factors on part-level segmentation and detection results. (A) The impact of different backbones and (B) the impact of different prompts.
Table 1. The results of part-level detection and segmentation via Swin Transformer using two prompt templates in PartImageNet and Pascal Part.
PartImageNet        [Part of the Object]          [An Object Part]
                    AP      mAP50   mAP           AP      mAP50   mAP
Detection           47.63   75.46   46.04         47.40   74.92   46.42
Segmentation        44.36   72.39   42.05         44.29   72.80   42.53

Pascal Part         [Part of the Object]          [An Object Part]
                    AP      mAP50   mAP           AP      mAP50   mAP
Detection           25.27   52.35   35.34         23.01   49.71   32.62
Segmentation        24.04   50.86   35.05         22.53   48.41   32.85
Table 2. The results of part detection and segmentation via RetinaNet using two prompt templates based on PartImageNet and Pascal Part.
PartImageNet        [Part of the Object]          [An Object Part]
                    AP      mAP50   mAP           AP      mAP50   mAP
Detection           35.99   65.68   35.83         36.36   65.82   36.07
Segmentation        35.16   61.91   33.39         35.54   62.12   33.78

Pascal Part         [Part of the Object]          [An Object Part]
                    AP      mAP50   mAP           AP      mAP50   mAP
Detection           18.62   43.46   25.98         17.44   41.28   24.36
Segmentation        18.96   42.06   27.43         18.09   40.26   26.64
Table 3. The comparison results of different templates of text prompts via Swin Transformer.
PartImageNet            AP      mAP50   mAP       Pascal Part             AP      mAP50   mAP
None                    47.29   75.28   46.39     None                    22.22   48.35   31.73
[part of the object]    47.63   75.46   46.04     [part of the object]    25.27   52.35   35.34
[an object part]        47.40   74.92   46.42     [an object part]        23.01   49.71   32.62
Table 4. Evaluation results of performance indicators for BiFPN and the original FPN using [part of the object].
                    BiFPN                         FPN with P6
Detection           AP      mAP50   mAP           AP      mAP50   mAP
PartImageNet        41.27   66.31   38.90         46.09   73.92   44.76
Pascal Part         23.84   50.03   34.49         24.26   51.43   34.21
Segmentation        AP      mAP50   mAP           AP      mAP50   mAP
PartImageNet        38.81   64.26   35.90         43.15   71.40   41.12
Pascal Part         22.97   48.65   34.44         23.40   49.88   33.93
Table 5. Evaluation results of performance indicators without FPN using [part of the object].
Detection           AP      mAP50   mAP
PartImageNet        39.07   62.81   38.16
Pascal Part         13.29   26.73   27.68
Segmentation        AP      mAP50   mAP
PartImageNet        37.03   60.46   34.90
Pascal Part         13.15   26.29   29.01
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
