Article

PMG—Pyramidal Multi-Granular Matching for Text-Based Person Re-Identification

1 School of Intelligent Science and Control Engineering, Jinling Institute of Technology, Nanjing 211199, China
2 School of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(21), 11876; https://doi.org/10.3390/app132111876
Submission received: 2 October 2023 / Revised: 25 October 2023 / Accepted: 27 October 2023 / Published: 30 October 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Given a textual query, text-based person re-identification is supposed to search for the targeted pedestrian images from a large-scale visual database. Due to the inherent heterogeneity between different modalities, it is challenging to measure the cross-modal affinity between visual and textual data. Existing works typically employ single-granular methods to extract local features and align image regions with relevant words/phrases. Nevertheless, the limited robustness of single-granular methods cannot adapt to the imprecision and variances of visual and textual features, which are usually influenced by background clutter, position transformation, posture diversity, and occlusion in surveillance videos, thereby degrading cross-modal matching accuracy. In this paper, we propose a Pyramidal Multi-Granular matching network (PMG) that incorporates a gradual transition between the coarsest global information and the finest local information through a coarse-to-fine pyramidal method for multi-granular cross-modal feature extraction and affinity learning. For each body part of a pedestrian, PMG ensures the integrity of local information while minimizing the surrounding interference signals at a certain scale; it can thus capture discriminative signals of different body parts and achieve semantic alignment between image strips and relevant textual descriptions, suppressing the variances of feature extraction and improving the robustness of feature matching. Comprehensive experiments are conducted on the CUHK-PEDES and RSTPReid datasets to validate the effectiveness of the proposed method, and the results show that PMG outperforms state-of-the-art (SOTA) methods significantly and yields competitive cross-modal retrieval accuracy.

1. Introduction

Person re-identification (Re-ID) aims to retrieve a query person in a large image pool and is nowadays an extremely valuable task for its use in video surveillance and activity analysis [1,2,3,4,5]. With the proliferation of surveillance cameras in urban areas, massive amounts of raw video data containing various events and persons are generated every second. Obviously, it is impractical to manually search for the corresponding persons in such large-scale videos, which is extremely time-consuming and tedious. Thus, automatic retrieval methods are urgently needed to handle this task more efficiently. According to the query modality, existing methods can be mainly classified into those with an image-based query, an attribute-based query, or a text-based query. The major limitation of image-based person Re-ID methods is the requirement for at least one high-quality image of the queried person, a condition difficult to satisfy on many occasions. As for attribute-based methods, due to the limited descriptive capability of attributes, many candidate images with similar attributes are often matched for a query given as a set of predefined person attributes [6]. In contrast, text-based methods utilize verbal descriptions as queries for person search and can provide much more detailed information about the queried person. Moreover, a verbal description of the suspect may be the only accessible information in many real-life scenes. Although text-based person Re-ID has the above advantages and has been studied from various perspectives, the inherent ambiguity of natural language and the huge modality gap make it a challenging task that still needs to be addressed.
The central problem of text-based person Re-ID lies in properly bridging the gap between the visual and textual modalities. Technically speaking, “bridging the gap” relies on a two-stage process: cross-modal feature extraction and cross-modal similarity learning [7]. In fact, how to effectively extract visual and textual features for the subsequent cross-modal similarity measurement under the potential semantic relevance between modalities is the main challenge faced in the text-based person Re-ID task [8]. Nevertheless, due to the inevitable background clutter, position transformation, posture diversity, and occlusion in surveillance videos, visual features extracted from multiple images of the same pedestrian may differ significantly, even when a detection model is first utilized to delimit the region of the targeted pedestrian in an image by generating a bounding box. Meanwhile, the ambiguity of natural language means that textual descriptions of the same image may vary dramatically, which also leads to deviations in the extracted textual features. The imprecision and variances of visual and textual features make it difficult to identify the same person and distinguish different persons, diminishing the cross-modal matching accuracy. Therefore, robust visual and textual feature descriptors that effectively capture the discriminative characteristics of persons need to be well designed.
Nowadays, almost all text-based person Re-ID methods strive to alleviate the adverse effects caused by the aforementioned problems and to extract discriminative features with small intra-class and large inter-class variations as far as possible [9]. These methods typically employ deep neural networks to obtain global and local feature representations. Li et al. [10] introduced the concept of text-based person Re-ID in their initial work. They also proposed the use of VGG-16, a deep convolutional neural network model, to extract global visual features for this task. Inspired by this work, Chen et al. [11] devised an efficient patch-word matching model to accurately capture the local matching details between visual and textual data by computing the affinity of the best matching patch of an image toward a word. On the other hand, Jing et al. [8] incorporated pose information to aid in localizing the discriminative body parts for effective local visual feature extraction. However, utilizing prior knowledge such as poses can suffer from the inaccuracy of human pose estimation and a heavy computational burden. In the study conducted by Niu et al. [12], a Multi-granularity Image-text Alignments (MIA) model was proposed, in which a feature map is horizontally cropped into a group of non-overlapping image stripes for local visual feature extraction. Following this work, Ding et al. [13] proposed a Word Attention Module (WAM) to acquire correlations between words and image stripes, thus extracting semantically aligned part-level features. Briefly, existing works typically utilize partitioning methods to divide pedestrian images into multiple strips to capture more discriminative local visual features, and employ attention mechanisms for better matching of image strips with relevant textual descriptions.
Although existing methods have achieved significant progress in cross-modal retrieval, there still exist some limitations. Taking the SOTA method SSAN [13] as an example, it is difficult to determine how many stripes an image should be partitioned into. In fact, the coarser the image slicing, the more difficult it is to capture detailed information, while the finer the slicing, the more likely it is to compromise the integrity of local features. Intuitively, diverse partitioning scales are preferred, ensuring that for each body part of a pedestrian there always exists an appropriate scale that allows effective extraction of local features from a certain image strip and accurate alignment with relevant words/phrases at this scale. Figure 1 illustrates this concept: at scale 4, the weakened signal of “glasses” is greatly strengthened by minimizing the surrounding interference of excessive background while its integrity is well preserved, which facilitates extracting discriminative local features and aligning them with the relevant words/phrases highlighted by green rectangles.
Inspired by this idea, we propose a Pyramidal Multi-Granular matching network (PMG) with the overall architecture illustrated in Figure 2. PMG extracts eight representations with different granularities, including a global visual representation ($V_g$), a global textual representation ($T_g$), three local visual representations ($V_{l2}$, $V_{l4}$, $V_{l6}$), and three local textual representations ($T_{l2}$, $T_{l4}$, $T_{l6}$); four different levels of cross-modal similarities between visual and textual features, i.e., $V_g$-$T_g$, $V_{l2}$-$T_{l2}$, $V_{l4}$-$T_{l4}$, and $V_{l6}$-$T_{l6}$, are then calculated to better match visual and textual information. To be more specific, PMG horizontally partitions the feature maps of the visual backbone into multiple non-overlapping stripes at various pyramidal scales and employs a Word Attention Module to acquire correlations between words and image stripes, then extracts modality features at different scales independently, which incorporates a gradual transition between the coarsest global information and the finest local information. After training the combination of multi-granular visual and textual representations, PMG obtains the final cross-modal similarity by fusing the multi-granular cross-modal similarities. In contrast to existing methods, which extract local features at a single granularity for matching, our work employs a pyramidal method for the extraction and matching of features at four different granularities. Obviously, multiple granularities can better adapt to the extraction and matching of different features, even for pedestrian images under intensive changes in pose, viewpoint, etc., thus suppressing the variances of feature extraction and improving the robustness of feature matching. The main contributions of this work are threefold:
  • A coarse-to-fine pyramidal matching method is employed to handle the problem of ineffective local feature extraction from surveillance videos caused by the arbitrary partitioning of existing single-granular methods.
  • A Pyramidal Multi-Granular matching network (PMG) is proposed to learn multi-granular cross-modal affinities.
  • Comprehensive experiments are conducted on the CUHK-PEDES and RSTPReid datasets, which indicate that PMG outperforms other previous methods significantly and achieves state-of-the-art performance.

2. Related Works

2.1. Person Re-Identification

Person Re-ID has drawn remarkable attention due to its applicability and research significance. Early person Re-ID systems predominantly relied on hand-crafted features, focusing on manually extracting cues such as color and texture and on learning better similarity measures with classic supervised or unsupervised metric learning algorithms. However, such traditional methods have certain limitations and struggle to handle the varying poses, backgrounds, lighting, scales, and other issues of pedestrians captured by different cameras. Since deep learning technology was first introduced into this task by Yi et al. [14], person Re-ID methods based on deep learning have achieved great success over the last few years. Zheng et al. [15] designed a unified framework named DG-Net to enable end-to-end interaction between the generative module and the discriminative module for more robust and accurate Re-ID learning. Liu et al. [16] presented a two-branch Deep Joint Learning (DJL) network to enhance the discriminative capability of local and global visual features, in which a hierarchical feature aggregation mechanism was proposed to aggregate the learned hybrid features. To globally learn the attention for each feature node via a global view of feature relations, Zhang et al. [17] proposed a Relation-Aware Global Attention (RGA) network to excavate the global structural information, thus enhancing the capability of image feature representation. Li et al. [18] proposed a lightweight network, CDNet, to reduce the computational resource and time costs of person Re-ID via a novel search space called Combined Depth Space and a new search strategy called Top-k Sample Search. Bak et al. [19] proposed a one-shot learning method that decomposes the Re-ID metric into different components to reduce the amount of training data required.

2.2. Text-Based Person Re-Identification

Recently, due to the rapid development of natural language processing (NLP) technology, person Re-ID based on text descriptions has gradually become a focus of attention. Text-based person Re-ID can find the corresponding target person according to a given text query. This combination of text description and image not only improves the accuracy of pedestrian recognition but also realizes cross-modal person Re-ID. In recent years, the task of text-based person Re-ID has been studied from many perspectives. Li et al. [10] first put forward this task and further employed a deep neural network, GNA-RNN, for person search, which can extract the global visual feature and estimate the affinity between a query text and a person image. Following their work, an efficient patch-word matching model was proposed by Chen et al. [11], with the aim of accurately exploiting the cross-modal local matching details between image patches and text words. Liu et al. [20] proposed a novel deep adversarial graph attention convolution network (A-GANet) in which an adversarial learning module is developed to learn a joint textual-visual feature space for cross-modality matching. Sarafianos et al. [21] proposed a Text-Image Modality Adversarial Matching (TIMAM) approach which learns modality-invariant feature representations using adversarial and cross-modal matching objectives; in addition, BERT, a publicly available language model, was first applied to extract word embeddings in that work. Aggarwal et al. [22] proposed a novel framework to learn common representations for images and text, such that semantics are explicitly preserved in the features. More recently, both global and local cross-modal information are usually taken into consideration together by many works for further improvement in retrieval accuracy. Jing et al. [8] employed pose information to benefit localizing the discriminative regions, and then two alignment networks were proposed to learn the latent image-text alignments at global and local scales, respectively. Hao et al. [23] presented a novel Modality Confusion Learning Network (MCLNet), with the aim of confusing two modalities during optimization so as to focus solely on modality-irrelevant information. A novel cross-modal alignment model named MIA was proposed by Niu et al. [12]. By horizontally splitting the feature map into non-overlapping strips, MIA extracts fine-grained image features, and then the cross-modal similarities are evaluated from multi-granular image-text alignments via a cross-modal attention mechanism. A NAFS model was proposed by Gao et al. [24], in which both the visual and textual descriptions are decomposed at three scales, and then a contextual non-local attention mechanism is employed to discover latent alignments. With a Gumbel attention module, a hierarchical adaptive matching model was introduced by Zheng et al. [25] to tackle the problem of matching redundancy when aligning image regions and words/phrases. Wang et al. [26] proposed a Visual-Textual Attributes Alignment (ViTAA) model, which warps both visual and textual attributes into multiple categories and then learns to disentangle the feature space of a person into subspaces corresponding to attributes using a light auxiliary attribute segmentation layer. A DSSL model was proposed by Zhu et al. [27] along with a new RSTPReid dataset.
In order to obtain higher retrieval accuracy, DSSL employed a surroundings–person separation mechanism to effectively excavate person information and adopted diverse alignment paradigms to adequately utilize multi-modal and multi-granular information. Ding et al. [13] proposed the SSAN method to extract semantically aligned part-level features for the multi-modal data, in which a compound ranking loss is also employed to optimize global and part feature learning.
For a more comprehensive review of research works in the field of text-based person Re-ID, we summarize the local cross-modal alignments adopted by related works in Table 1, where $V_L^k$ denotes that the image is partitioned into $k$ strips/parts for local cross-modal alignment, and $T_L^p$/$T_L^w$ denotes that the textual description is split into phrases/words.

3. Proposed Method

In this section, we describe the proposed Pyramidal Multi-Granular matching network (PMG) in detail. First, we introduce the process of extracting pyramidal multi-granular visual and textual representations. Then, a coarse-to-fine pyramidal matching method is presented to learn the image-text similarities. Finally, a compound Re-ID loss and a two-stage training strategy are adopted for the training of PMG. The algorithm framework of PMG is shown in Figure 3.

3.1. Pyramidal Multi-Granular Feature Extraction

PMG extracts eight representations with different granularities: a global visual representation ($V_g$), local-2, local-4, and local-6 visual representations ($V_{l2}$, $V_{l4}$, $V_{l6}$), a global textual representation ($T_g$), and local-2, local-4, and local-6 textual representations ($T_{l2}$, $T_{l4}$, $T_{l6}$). Correspondingly, a visual pyramidal module and a textual pyramidal module are utilized to split the representation extraction into multiple parallel branches, yielding multi-granular visual and textual feature representations, respectively.

3.1.1. Visual Feature Extraction

In order to extract global and local visual features from a given image $I$, a ResNet-50 backbone pretrained on the ImageNet dataset is first employed to generate the shared visual feature map $\psi(I) \in \mathbb{R}^{w \times h \times c}$ with width $w$, height $h$, and $c$ channels. Considering that horizontally partitioning $\psi(I)$ into a fixed number of non-overlapping stripes may suffer from the misaligned bounding box problem discussed in the Introduction, we adopt a coarse-to-fine pyramidal matching method to obtain visual representations at various pyramidal scales.
Specifically, PMG introduces four different pyramidal scales by horizontally partitioning $\psi(I)$ into 1, 2, 4, or 6 spatial bins, respectively. Accordingly, the feature map is denoted as $\psi_K(I) \in \mathbb{R}^{w \times h_K \times c}$, where $K = 1, 2, 4, 6$ and $h_K = h/K$, meaning that $\psi(I)$ is horizontally partitioned into $K$ uniform parts. For each $\psi_K(I)$, the partitioning operation is carried out by an average pooling module
$$\mathrm{AvgPooling}_K = \frac{1}{h_K \times w} \sum_{i=1}^{h_K} \sum_{j=1}^{w} \psi_K^{[1:K]}(I)_{i,j,1:c},$$
and a maximum pooling module
$$\mathrm{MaxPooling}_K = \max_{1 \le i \le h_K,\ 1 \le j \le w} \psi_K^{[1:K]}(I)_{i,j,1:c},$$
hence the visual pyramidal module is defined as follows:
$$\Phi_K(I) = \mathrm{AvgPooling}_K(I) + \mathrm{MaxPooling}_K(I),$$
where $K \in \{1, 2, 4, 6\}$ denotes the pyramidal level and $\Phi_K(I) \in \mathbb{R}^{K \times 1 \times c}$. This fused pooling module takes advantage of the down-sampling characteristic of the average pooling layer to reduce computational complexity, as well as the capability of the maximum pooling layer to capture discriminative signals, which may otherwise be weakened when surrounded by relatively weak signals (e.g., background).
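As an illustration, this fused pooling over the four pyramidal scales could be implemented roughly as in the following PyTorch sketch; the function name and example shapes are ours for illustration and are not taken from any released code.

```python
import torch
import torch.nn.functional as F

def pyramidal_fused_pooling(feat_map, scales=(1, 2, 4, 6)):
    """Sum of average and max pooling over K horizontal bins for each scale.

    feat_map: visual feature map of shape (B, c, h, w), e.g. (B, 2048, 24, 8).
    Returns a dict mapping each scale K to a tensor of shape (B, K, c).
    """
    pooled = {}
    for K in scales:
        # Each output bin covers h/K rows and the full width w.
        avg = F.adaptive_avg_pool2d(feat_map, output_size=(K, 1))  # (B, c, K, 1)
        mx = F.adaptive_max_pool2d(feat_map, output_size=(K, 1))   # (B, c, K, 1)
        pooled[K] = (avg + mx).squeeze(-1).permute(0, 2, 1)        # (B, K, c)
    return pooled

# Example with a dummy ResNet-50 feature map of size 24 x 8 with 2048 channels.
phi = torch.randn(2, 2048, 24, 8)
out = pyramidal_fused_pooling(phi)
print({K: v.shape for K, v in out.items()})  # scale K -> (B, K, 2048)
```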
Then, the four different pyramidal scales are separately processed to obtain the final visual representations. To extract the global visual representation $V_g \in \mathbb{R}^{P}$, we reshape $\Phi_1(I) \in \mathbb{R}^{1 \times 1 \times c}$ into a $c$-dimensional vector and then pass it through a batch normalization (BN) layer followed by a $1\times 1$ convolutional layer consisting of $P$ kernels. As for the local visual representations $V_{l2}$, $V_{l4}$, and $V_{l6}$, we first create three different $K \times 1 \times c$ ($K = 2, 4, 6$) processing structures, in each of which a $1\times 1$ convolutional layer is adopted to reduce the dimensionality from $c$ to $c/2$. Then, these $K$ vectors with dimensionality $c/2$ are separately passed through a multi-layer perceptron containing a batch normalization layer and two fully-connected (FC) layers with a ReLU layer between them to obtain the local visual-medium vectors, which follows:
$$v_{lK}^{k} = W_{V_K}^{k}\,\mathrm{ReLU}\big(\widetilde{W}_{V_K}^{k}\,\mathrm{BN}(\widetilde{\Phi}_K^{k}(I))\big),$$
where $\widetilde{\Phi}_K^{k}(I)$ is the $k$-th vector of the modified visual feature map with reduced dimensionality $c/2$, $\widetilde{W}_{V_K}^{k} \in \mathbb{R}^{c \times \frac{c}{2}}$, $W_{V_K}^{k} \in \mathbb{R}^{P \times c}$, and $v_{lK}^{k} \in \mathbb{R}^{P}$ ($K \in \{2, 4, 6\}$, $k \in \{1, \dots, K\}$) is the visual representation of the $k$-th local strip at scale $K$. As a result, the final local visual representation $V_{lK} = \{v_{lK}^{1}, v_{lK}^{2}, \dots, v_{lK}^{K}\}$ is formed by concatenating the $v_{lK}^{k}$, so there are in total $\sum_{K \in \{2, 4, 6\}} K = 12$ $P$-dimensional vectors representing the local visual features.
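A minimal sketch of one local visual branch is given below, assuming $c = 2048$ and $P = 1024$ as in our setting; giving each strip its own BN + FC-ReLU-FC head is one interpretation of the per-strip weights above, and the module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class LocalVisualBranch(nn.Module):
    """One branch of the visual pyramidal module for a given scale K."""
    def __init__(self, K, c=2048, P=1024):
        super().__init__()
        self.reduce = nn.Conv1d(c, c // 2, kernel_size=1)    # shared 1x1 conv: c -> c/2
        # One BN + FC-ReLU-FC head per strip, producing a P-dimensional vector each.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm1d(c // 2),
                nn.Linear(c // 2, c), nn.ReLU(inplace=True),
                nn.Linear(c, P),
            ) for _ in range(K)
        ])

    def forward(self, phi_K):                                # phi_K: (B, K, c) pooled strips
        x = self.reduce(phi_K.transpose(1, 2))               # (B, c/2, K)
        parts = [head(x[:, :, k]) for k, head in enumerate(self.heads)]  # K tensors of (B, P)
        return torch.stack(parts, dim=1)                     # (B, K, P), row k plays the role of v_{lK}^k

# branch = LocalVisualBranch(K=4); V_l4 = branch(pooled_strips)  # pooled_strips: (B, 4, 2048)
```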

3.1.2. Textual Feature Extraction

To extract global and local textual representations, we begin by utilizing a pretrained BERT language model to embed each word within the textual description. Specifically, given a description of length $n$, we build an embedding matrix $W_e \in \mathbb{R}^{m \times n}$ and obtain the embedding of each word by $x_i = W_e \times w_i$, where $w_i \in \mathbb{R}^{n}$ is the $i$-th word in the description, and $x_i \in \mathbb{R}^{m}$ denotes that $w_i$ is embedded into a vector of dimensionality $m$. After that, we feed the embedded words through a bi-directional gated recurrent unit (Bi-GRU) to capture the relationships among them, which follows
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(x_i, \overrightarrow{h_{i-1}}), \quad 1 \le i \le n,$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(x_i, \overleftarrow{h_{i+1}}), \quad 1 \le i \le n,$$
where $\overrightarrow{h_i}, \overleftarrow{h_i} \in \mathbb{R}^{c}$ represent the forward and backward hidden states of the $i$-th word, respectively.
Then, we define the representation of the $i$-th word as the average of its forward and backward hidden states, namely $e_i = (\overrightarrow{h_i} + \overleftarrow{h_i})/2$, where $e_i \in \mathbb{R}^{c}$ ($1 \le i \le n$). To generate the representation $E$ for the entire textual description, we concatenate all $n$ word representations together as $E = \{e_1, e_2, \dots, e_n\}$.
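This word encoding step can be sketched as follows; the vocabulary size and dimensions are placeholders, and a plain embedding layer stands in for the BERT embeddings used in practice.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Embed words and fuse forward/backward GRU states: e_i = (h_fwd_i + h_bwd_i) / 2."""
    def __init__(self, vocab_size, m=768, hidden=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, m)              # stand-in for BERT embeddings
        self.bigru = nn.GRU(m, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                             # (B, n) word indices
        x = self.embed(token_ids)                             # (B, n, m)
        h, _ = self.bigru(x)                                  # (B, n, 2*hidden): [forward, backward]
        h_fwd, h_bwd = h.chunk(2, dim=-1)                     # each (B, n, hidden)
        return (h_fwd + h_bwd) / 2                            # (B, n, hidden), row i is e_i

# E = WordEncoder(vocab_size=30522)(torch.randint(0, 30522, (2, 26)))
```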
After obtaining the textual feature map $E$, we carry out a row-wise max pooling (RMP) operation on $E$ and then obtain the global textual representation $T_g$ through a $1\times 1$ convolutional layer as follows:
$$T_g = W_{T_g} \times \mathrm{softmax}\big(\max_{c} E\big),$$
where $W_{T_g} \in \mathbb{R}^{P \times n}$ and $T_g \in \mathbb{R}^{P}$. Next, we utilize a textual pyramidal module to obtain local textual representations according to the word-part correspondences, which extends the part-level textual feature extraction method in SSAN [13] to the various pyramidal scales corresponding to the $K$-partitioned feature maps $\psi_K(I)$ ($K = 2, 4, 6$). We first predict the probability that the $i$-th word belongs to the $k$-th strip by $s_K^{i,k} = \sigma(\omega_K^{k} e_i)$, where $\sigma$ denotes the Sigmoid function and $\omega_K^{k} \in \mathbb{R}^{1 \times c}$ stands for a linear transformation. Then, we modify the textual feature map $E = \{e_i\}_{i=1}^{n}$ as $E_K^{k} = \{s_K^{i,k} e_i\}_{i=1}^{n}$ to represent the textual description for the $k$-th local strip. Similar to the process of obtaining $T_g$, each modified textual description $E_K^{k}$ is passed through an RMP layer followed by a $1\times 1$ convolutional layer to obtain $t_{lK}^{k}$ as
$$t_{lK}^{k} = W_{T_K}^{k} \times \mathrm{softmax}\big(\max_{c} E_K^{k}\big),$$
where $W_{T_K}^{k} \in \mathbb{R}^{P \times n}$, and $t_{lK}^{k} \in \mathbb{R}^{P}$ represents the textual features of the $k$-th local strip. Finally, we stack $t_{lK}^{k}$ ($1 \le k \le K$) as $T_{lK} = \{t_{lK}^{1}, t_{lK}^{2}, \dots, t_{lK}^{K}\}$ to obtain the final local textual representation at a given pyramidal scale $K$.
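A rough sketch of this textual pyramidal branch for one scale K, following the equations above, is given below; it assumes a fixed maximum description length n, and the layer names are illustrative.

```python
import torch
import torch.nn as nn

class TextualPyramidBranch(nn.Module):
    """Word-to-strip attention and row-wise max pooling for one pyramidal scale K."""
    def __init__(self, K, n=26, c=768, P=1024):
        super().__init__()
        self.omega = nn.Linear(c, K, bias=False)              # rows play the role of w_K^k
        self.proj = nn.ModuleList([nn.Linear(n, P, bias=False) for _ in range(K)])  # W_{T_K}^k (P x n)

    def forward(self, E):                                     # E: (B, n, c) word features
        s = torch.sigmoid(self.omega(E))                      # (B, n, K): probabilities s_K^{i,k}
        parts = []
        for k, proj in enumerate(self.proj):
            E_k = E * s[:, :, k:k + 1]                        # re-weight each word for strip k
            pooled = torch.softmax(E_k.max(dim=-1).values, dim=-1)  # (B, n): max over c, then softmax
            parts.append(proj(pooled))                        # (B, P): t_{lK}^k
        return torch.stack(parts, dim=1)                      # (B, K, P)
```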

3.2. Image-Text Matching

After obtaining a total of eight representations of diverse modalities and granularities, four different cross-modal similarities between global and local features are computed to match visual and textual information at different granularities.
For each of the four cross-modal combinations, i.e., $V_g$-$T_g$, $V_{l2}$-$T_{l2}$, $V_{l4}$-$T_{l4}$, and $V_{l6}$-$T_{l6}$, the modified cosine metric is adopted to evaluate the similarity between the relevant representations by
$$\mathrm{simi}_{g} = \frac{V_g^{\mathrm{T}} \cdot T_g}{\lVert V_g \rVert \times \lVert T_g \rVert},$$
$$\mathrm{simi}_{l2} = \frac{1}{2} \sum_{k=1}^{2} \frac{(v_{l2}^{k})^{\mathrm{T}} \cdot t_{l2}^{k}}{\lVert v_{l2}^{k} \rVert \times \lVert t_{l2}^{k} \rVert},$$
$$\mathrm{simi}_{l4} = \frac{1}{4} \sum_{k=1}^{4} \frac{(v_{l4}^{k})^{\mathrm{T}} \cdot t_{l4}^{k}}{\lVert v_{l4}^{k} \rVert \times \lVert t_{l4}^{k} \rVert},$$
$$\mathrm{simi}_{l6} = \frac{1}{6} \sum_{k=1}^{6} \frac{(v_{l6}^{k})^{\mathrm{T}} \cdot t_{l6}^{k}}{\lVert v_{l6}^{k} \rVert \times \lVert t_{l6}^{k} \rVert},$$
where $\lVert \cdot \rVert$ denotes the 2-norm of a vector, and the cross-modal similarity $\mathrm{simi}_{lK}$ ($K = 2, 4, 6$) is calculated by averaging the $K$ part-level feature similarities of the image-text pair at a given pyramidal scale $K$.
Then, the final cross-modal similarity is obtained by fusing the above four similarities as follows:
$$\mathrm{simi}_{\mathrm{PMG}} = \mathrm{simi}_{g} + \lambda \cdot (\mathrm{simi}_{l2} + \mathrm{simi}_{l4} + \mathrm{simi}_{l6}),$$
where $\lambda$ is the fusion coefficient and is set to 0.5 in the testing stage, indicating that global and local cross-modal similarities are equally important to the final image-text matching.
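For a single image-text pair, this similarity computation and fusion can be sketched as follows; shapes and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def local_similarity(V_lK, T_lK):
    """Average of per-strip cosine similarities at one pyramidal scale K.

    V_lK, T_lK: tensors of shape (K, P) for one image-text pair.
    """
    return F.cosine_similarity(V_lK, T_lK, dim=-1).mean()

def pmg_similarity(V_g, T_g, V_l, T_l, lam=0.5):
    """Fuse the global similarity with the three local ones (scales 2, 4, 6)."""
    simi_g = F.cosine_similarity(V_g, T_g, dim=-1)
    simi_l = sum(local_similarity(V_l[K], T_l[K]) for K in (2, 4, 6))
    return simi_g + lam * simi_l

# Example with random features and P = 1024:
V_g, T_g = torch.randn(1024), torch.randn(1024)
V_l = {K: torch.randn(K, 1024) for K in (2, 4, 6)}
T_l = {K: torch.randn(K, 1024) for K in (2, 4, 6)}
print(pmg_similarity(V_g, T_g, V_l, T_l))
```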

3.3. Loss Functions and Training Strategy

For the training of PMG, we adopt a compound Re-ID loss, which combines an identification (ID) loss with loose constraints and a triplet ranking loss with hard mining. Furthermore, a two-stage training strategy is employed.
During Stage-I, the visual backbone parameters remain fixed, while the remaining components of PMG are optimized using the ID loss. This process involves grouping individuals into distinct clusters under the guidance of their IDs. As global representations provide more comprehensive information for this classification process, only the two global representations, namely $V_g$ and $T_g$, are utilized here. We first build a shared transformation matrix $W_{\mathrm{ID}} \in \mathbb{R}^{N \times P}$ via a bias-free fully-connected layer, where $N$ is the total number of distinct individuals present in the training set, and then calculate the two ID losses $\mathcal{L}_{\mathrm{ID}}^{V}$ and $\mathcal{L}_{\mathrm{ID}}^{T}$ for the visual and textual representations as
$$\mathcal{L}_{\mathrm{ID}}^{V} = -\frac{1}{N} \sum_{i=1}^{N} \log\big(\mathrm{softmax}(W_{\mathrm{ID}}^{i} \cdot V_g)\big),$$
$$\mathcal{L}_{\mathrm{ID}}^{T} = -\frac{1}{N} \sum_{i=1}^{N} \log\big(\mathrm{softmax}(W_{\mathrm{ID}}^{i} \cdot T_g)\big),$$
where $W_{\mathrm{ID}}^{i} \in \mathbb{R}^{1 \times P}$ is the $i$-th row vector of $W_{\mathrm{ID}}$. Here, the transformation matrix $W_{\mathrm{ID}}$ is shared so as to map the multi-modal ID representations into the same latent space. The integrated loss function for Stage-I is
$$\mathcal{L}_{\mathrm{stageI}} = \mathcal{L}_{\mathrm{ID}}^{V} + \mathcal{L}_{\mathrm{ID}}^{T}.$$
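A minimal sketch of the Stage-I objective is given below, assuming the ID loss is realized as the usual cross-entropy over identities with a shared, bias-free classifier; the class and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedIDLoss(nn.Module):
    """ID loss with a bias-free classifier W_ID shared by both modalities."""
    def __init__(self, num_ids, P=1024):
        super().__init__()
        self.W_ID = nn.Linear(P, num_ids, bias=False)   # plays the role of the N x P matrix W_ID

    def forward(self, V_g, T_g, labels):
        # Cross-entropy over identities for the visual and textual global features.
        loss_v = F.cross_entropy(self.W_ID(V_g), labels)
        loss_t = F.cross_entropy(self.W_ID(T_g), labels)
        return loss_v + loss_t                          # corresponds to L_stageI

# criterion = SharedIDLoss(num_ids=11003)
# loss = criterion(V_g_batch, T_g_batch, id_labels)    # V_g_batch, T_g_batch: (B, P); id_labels: (B,)
```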
In Stage-II, all parameters of PMG, including the visual backbone, are fine-tuned together. In addition to the ID loss used in Stage-I, the popular triplet ranking loss is also adopted to ensure fine-grained visual-textual matching, which follows:
$$\mathcal{L}_{\mathrm{rank}}^{K} = \sum_{\hat{T}_K} \max\{\alpha - \cos(V_K, T_K) + \cos(V_K, \hat{T}_K),\, 0\} + \sum_{\hat{V}_K} \max\{\alpha - \cos(V_K, T_K) + \cos(\hat{V}_K, T_K),\, 0\},$$
where $\mathcal{L}_{\mathrm{rank}}^{K} \in \{\mathcal{L}_{\mathrm{rank}}^{g}, \mathcal{L}_{\mathrm{rank}}^{l2}, \mathcal{L}_{\mathrm{rank}}^{l4}, \mathcal{L}_{\mathrm{rank}}^{l6}\}$ denotes the ranking loss of visual-textual matching at scale $K$. Correspondingly, the visual representation $V_K$ can be $V_g$, $V_{l2}$, $V_{l4}$, or $V_{l6}$, and the textual representation $T_K$ can be $T_g$, $T_{l2}$, $T_{l4}$, or $T_{l6}$, according to $\mathcal{L}_{\mathrm{rank}}^{K}$. Here, $(V_K, T_K)$ represents a matched visual-textual pair, while $(V_K, \hat{T}_K)$ and $(\hat{V}_K, T_K)$ represent mismatched pairs. The margin $\alpha$ is applied to ensure that the similarity score of matched pairs is larger than that of mismatched ones within a mini-batch. The joint loss function in Stage-II is
$$\mathcal{L}_{\mathrm{stageII}} = \mathcal{L}_{\mathrm{ID}}^{V} + \mathcal{L}_{\mathrm{ID}}^{T} + \mathcal{L}_{\mathrm{rank}}^{g} + \mathcal{L}_{\mathrm{rank}}^{l2} + \mathcal{L}_{\mathrm{rank}}^{l4} + \mathcal{L}_{\mathrm{rank}}^{l6}.$$
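The ranking loss at one scale can be sketched as an in-batch hinge over mismatched pairs, as below; keeping only the hardest negative per anchor instead of summing over all of them would give the hard-mining variant. The helper name is ours, and the inputs are assumed to be global or part-averaged features.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(V, T, alpha=0.2):
    """In-batch bidirectional ranking loss, a sketch of L_rank at one scale.

    V, T: (B, P) visual/textual features, where row i of V matches row i of T.
    """
    sims = F.normalize(V, dim=-1) @ F.normalize(T, dim=-1).t()   # (B, B) cosine matrix
    pos = sims.diag().unsqueeze(1)                               # cos(V_i, T_i), shape (B, 1)
    mask = ~torch.eye(len(V), dtype=torch.bool, device=V.device) # exclude matched pairs
    # Hinge over mismatched texts for each image, and mismatched images for each text.
    cost_t = (alpha - pos + sims).clamp(min=0)[mask]
    cost_v = (alpha - pos.t() + sims).clamp(min=0)[mask]
    return cost_t.sum() + cost_v.sum()
```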

4. Experiment Details

4.1. Experiment Setup

4.1.1. Experiment Preparations

In our experiments, we utilize Python environment 3.7.2, Anaconda3, Pycharm 2019.3, PyTorch 1.5.0, CUDA 10.2, cuDNN 7.65, and Windows 10 Professional OS. The experiments are executed on a Lenovo ThinkStation P520C workstation equipped with 128 GB DDR4 RAM, an Intel Xeon W-2245 processor, and an NVIDIA RTX A5000 GPU.

4.1.2. Dataset and Evaluation Metrics

To evaluate the performance of the proposed PMG method, we conduct extensive experiments on two text-based person Re-ID datasets: CUHK-PEDES [10] and RSTPReid [27]. CUHK-PEDES is currently the most popular dataset used for text-to-image person Re-ID and has become the de facto benchmark for text-based person retrieval tasks. Compared with CUHK-PEDES, the newly constructed RSTPReid is more adaptable to real application scenarios in that its images were captured by different cameras with complex scene transformations and backgrounds over various periods of time [27]. (1) CUHK-PEDES: We follow the official data split paradigm [10] for the training, validation, and testing of PMG. Specifically, the training set consists of 34,054 images representing 11,003 individuals, along with 68,126 query sentences. The validation set comprises 3078 images, 1000 individuals, and 6158 textual descriptions. The testing set contains 3074 images, 1000 individuals, and 6156 descriptions. (2) RSTPReid: The RSTPReid dataset used in this study consists of 20,505 images and 41,010 descriptions showcasing a total of 4101 individuals. Following the data split approach in [27], the whole dataset is divided such that 3701 identities are assigned to the training set, 200 identities to the validation set, and 200 identities to the testing set. For each identity in all sets, 5 images along with 10 annotated textual descriptions are attached to represent the corresponding identity.
We evaluate the performance using the Top-1, Top-5, and Top-10 accuracies, which are the most popular and crucial metrics in person Re-ID. Given a query sentence, all images in the test set are ranked by their similarities with the query text, and the Top-k images with the highest scores are selected as candidates. If the Top-k candidate images contain at least one image of the targeted pedestrian, the query is considered successful. The Top-k accuracy is then calculated as the ratio of successful queries to the total number of test queries. Figure 4 shows an example of Top-5 text-based person Re-ID results.
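For illustration, the Top-k accuracies can be computed from a query-gallery similarity matrix as in the following sketch; the function and variable names are ours.

```python
import numpy as np

def topk_accuracy(sim_matrix, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Compute Top-k accuracies for text-to-image retrieval.

    sim_matrix: (num_queries, num_images) similarity scores.
    query_ids / gallery_ids: person identity of each query text / gallery image.
    """
    ranks = np.argsort(-sim_matrix, axis=1)            # best-matching images first
    hits = gallery_ids[ranks] == query_ids[:, None]    # (num_queries, num_images) boolean matrix
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Example: 3 query sentences against 5 gallery images.
sim = np.random.rand(3, 5)
print(topk_accuracy(sim, np.array([0, 1, 2]), np.array([2, 0, 1, 1, 0])))
```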

4.1.3. Implementation Details

In our experiments, the representation dimension of the feature space $P$ is set to 1024 following [7], in which an ablation experiment was conducted on CUHK-PEDES for the best setting of $P$. As for the dimension of the embedded word vectors $c$, we maintain the existing setting of WAM [13] to ensure consistency by keeping $c$ at 768. The Natural Language ToolKit (NLTK) is employed to extract the noun phrases from each query sentence by syntactic analysis, word segmentation, and part-of-speech tagging. The number of phrases $n$ is kept flexible with an upper bound of 26, based on the length of the query sentences. A pretrained BERT language model [28], whose parameters are initialized and kept fixed, is also used to handle the textual input.
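One possible NLTK pipeline for this noun-phrase extraction is sketched below; the chunking grammar is an assumption for illustration and not necessarily the exact one used in our experiments.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_noun_phrases(sentence, max_phrases=26):
    tokens = nltk.word_tokenize(sentence)              # word segmentation
    tagged = nltk.pos_tag(tokens)                      # part-of-speech tagging
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                # optional determiner, adjectives, nouns
    tree = nltk.RegexpParser(grammar).parse(tagged)    # shallow syntactic analysis (chunking)
    phrases = [" ".join(w for w, _ in st.leaves())
               for st in tree.subtrees() if st.label() == "NP"]
    return phrases[:max_phrases]

print(extract_noun_phrases("The woman wears a white shirt and black glasses."))
# e.g. ['The woman', 'a white shirt', 'black glasses']
```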
For fair comparisons with previous works, we choose the pretrained ResNet-50 as our visual CNN backbone, whose parameters are initialized with weights pretrained on the ImageNet classification task. All input images are resized to a 384 × 128 × 3 resolution, and random horizontal flipping is applied for data augmentation before the images are fed into the backbone [12,13]. We extract the feature maps before the average pooling layer of ResNet-50, whose size is 24 × 8 × 2048.
An Adaptive Moment Estimation (ADAM) optimizer is utilized to train PMG with a batch size of 64, a moderate setting adapted to the processing capability of our hardware platform and widely employed in related works. In training Stage-I, the learning rate is set to $1 \times 10^{-3}$ for 10 epochs with all parameters of the visual backbone fixed. Then, in Stage-II, the learning rate is initialized to $2 \times 10^{-4}$ and PMG is further trained for 50 epochs. The settings of the learning rate and the number of training epochs are determined by ablation analysis, as shown in Table 2, Table 3 and Table 4. In the training stage, the margin $\alpha$ of the triplet ranking loss is used to enhance the discrimination between positive and negative pairs, and we set $\alpha$ to 0.2 following almost all research works [7,8,13,27,29].
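The two-stage schedule can be expressed roughly as follows; `model` and its `visual_backbone` attribute are assumed names for a PMG-like network.

```python
import torch

def configure_stage(model, stage):
    """Return an Adam optimizer and epoch budget for the given training stage."""
    if stage == 1:
        # Stage-I: freeze the visual backbone, train the rest at lr = 1e-3 for 10 epochs.
        for p in model.visual_backbone.parameters():
            p.requires_grad = False
        params = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.Adam(params, lr=1e-3), 10
    else:
        # Stage-II: unfreeze everything and fine-tune at lr = 2e-4 for 50 epochs.
        for p in model.parameters():
            p.requires_grad = True
        return torch.optim.Adam(model.parameters(), lr=2e-4), 50
```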

4.2. Comparison with SOTA

The performance comparisons on the CUHK-PEDES and RSTPReid datasets are presented in Table 5 and Table 6, respectively, with graphical comparisons in Figure 5 and Figure 6. It is noted that, since many previous methods do not utilize BERT, a powerful pretrained language model that can improve performance, we report the performance of PMG both with and without BERT for the fairness of the experimental comparisons.
As shown in Table 5 and Figure 5, the PMG model outperforms other methods by considerable margins in terms of the Top-1, Top-5, and Top-10 metrics. For instance, compared with SSAN, the SOTA method using both global and local features, PMG+BERT (ours) gains 3.22%, 3.04%, and 2.39% improvements in terms of the Top-1, Top-5, and Top-10 metrics, respectively. This is a significant gain for text-based person Re-ID, whose performance improvements have plateaued in the past few years, and it indicates the effectiveness of our proposed method on the CUHK-PEDES dataset. Instead of optimizing cross-modal matching from various aspects, such as network architecture, local representation extraction, loss functions, or aiding discriminative information, PMG employs a simple coarse-to-fine pyramidal matching method to extract local visual and textual features with different granularities, thereby suppressing feature variances and leading to more robust representations and better cross-modal matching accuracy. Additionally, it can be observed that, thanks to the inherent capability of BERT to retain correlations between noun phrases and the integrity of textual features, PMG achieves better performance when combined with BERT, which further optimizes the extraction of fine-grained local textual features.
To further validate the effectiveness of PMG, we also compare it with some other methods proposed in our previous works, including IMG-Net [29], AMEN [36], DSSL [27], and the latest CAIBC [9], on the RSTPReid dataset, and obtain similar results, as shown in Table 6 and Figure 6. It can be observed that, by utilizing the proposed multi-granular method, PMG+BERT (ours) outperforms CAIBC by 1.50%, 3.10%, and 2.30% on RSTPReid under the Top-1/5/10 accuracies, respectively. It is noted that, although the CAIBC method also utilizes the pretrained BERT model, PMG+BERT still shows better performance, indicating the superiority of PMG under the same conditions. On the other hand, PMG without BERT is inferior to CAIBC under the Top-1 metric but superior under the Top-5 and Top-10 metrics, which indicates that even without BERT, PMG is still superior to CAIBC in most cases.

4.3. Ablation Study

Comprehensive ablation studies are carried out to further analyze the components of the PMG model. As shown in Table 7 and Table 8, a digit '0' in the parentheses means that the corresponding pyramidal scale is not utilized, while '1' means that it is utilized. From left to right, the three digits denote pyramidal scales 2, 4, and 6, respectively. Several examples of Top-5 retrieval results are shown in Figure 4.

4.3.1. Pyramidal Scales

As shown in Table 7 and Figure 7, PMG achieves the best performance with all three pyramidal scales (denoted by "111") and the worst performance with only one scale ("001", "010", or "100"). Intuitively, the model solely utilizing scale 2 reaches the lowest accuracy because such coarse partitioning cannot provide detailed information. Among the single-scale models, however, the one utilizing scale 4 rather than scale 6 achieves the highest accuracy, which indicates that the image feature maps need not be partitioned as finely as possible: too many strips may compromise the integrity of visual features. Not unexpectedly, when two or all three scales are employed together, the varied pyramidal scales complement each other and help to capture more cross-modal matching details, thereby leading to superior accuracy. Compared to solely utilizing a single scale, PMG utilizing all three scales brings average improvements of 2.8%, 2.0%, and 2.3% in terms of the Top-1/5/10 accuracies, respectively. This demonstrates the advantages of the multi-scale approach and confirms that methods utilizing multi-granular representation extraction and matching outperform those utilizing a single granularity.

4.3.2. Fused Pooling Method

Table 8 shows the comparison of different pooling methods. We find that the model with the maximum pooling method performs slightly better than the one with the average pooling method. This is reasonable: although the average pooling method takes all contextual information into consideration while down-sampling, it may not perform properly if the discriminative signal is surrounded by unrelated signals. In contrast, the maximum pooling method catches the most salient signals from a local view. By fusing the two methods to take advantage of both contextual information and the most salient signals, PMG achieves better results than models using either of them alone.

5. Conclusions

In this paper, we propose a Pyramidal Multi-Granular matching network (PMG) for the text-based person Re-ID task. PMG utilizes a coarse-to-fine pyramidal matching method to learn cross-modal affinities by extracting and aligning multi-granular visual and textual features. The introduction of the novel gradual transition process between the coarsest global information and the finest local information can suppress the variances of feature extraction and improve the robustness of feature matching. Hence, the problem of imprecision and variances of visual and textual features, which commonly exists in methods utilizing a strict image partitioning strategy with a single granularity, is effectively alleviated. Comprehensive experiments are conducted on both the CUHK-PEDES and RSTPReid datasets, and the results clearly indicate that PMG surpasses state-of-the-art methods by a significant margin and yields competitive accuracy for text-based cross-modal retrieval.
The main advantage of PMG lies in its ability to achieve additional performance gains by simply incorporating multiple parallel processing branches. This versatility allows PMG to serve as a flexible network framework, easily integrated with other optimization components to enhance cross-modal retrieval accuracy. However, a notable drawback of PMG is its relatively low computational efficiency. For instance, when utilizing three partitioning scales (i.e., 2, 4, and 6), PMG divides an image feature map into a total of 12 strips. Although the most discriminative feature of a body part can be extracted from a specific image strip, computations are still conducted on the remaining 11 strips, leading to a considerable amount of unnecessary computation. To further enhance performance, we plan to explore the use of a deep neural network to identify unimportant image strips, retaining only the one most conducive to capturing essential features.

Author Contributions

Conceptualization, C.L.; data curation, J.X.; formal analysis, Z.W.; methodology, C.L. and Z.W.; software, J.X.; supervision, A.Z.; validation, J.X.; writing—original draft, C.L.; writing—review and editing, A.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Future Network Scientific Research Fund Project (Grant No. FNSRFP-2021-YB-21) and Postgraduate Research & Practice Innovation Program of Jiangsu Province, China (Grant No. KYCX23_1452).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are hugely grateful to the anonymous reviewers for their careful, unbiased, and constructive suggestions with respect to the original manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and A strong convolutional baseline). In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  2. Yao, H.; Zhang, S.; Hong, R.; Zhang, Y.; Xu, C.; Tian, Q. Deep Representation Learning with Part Loss for Person Re-identification. IEEE Trans. Image Process. 2019, 28, 2860–2871. [Google Scholar] [CrossRef] [PubMed]
  3. Xiong, M.; Gao, Z.; Hu, R.; Chen, J.; He, R.; Cai, H.; Peng, T. A Lightweight Efficient Person Re-Identification Method Based on Multi-Attribute Feature Generation. Appl. Sci. 2022, 12, 4921. [Google Scholar] [CrossRef]
  4. Xie, H.; Luo, H.; Gu, J.; Jiang, W. Unsupervised Domain Adaptive Person Re-Identification via Intermediate Domains. Appl. Sci. 2022, 12, 6990. [Google Scholar] [CrossRef]
  5. Wang, C.; Zhang, C.; Feng, Y.; Ji, Y.; Ding, J. Learning Visible Thermal Person Re-identification via Spatial Dependence and Dual-constraint Loss. Entropy 2022, 24, 443. [Google Scholar] [CrossRef] [PubMed]
  6. Jeong, B.; Park, J.; Kwak, S. ASMR: Learning attribute-based Person search with adaptive semantic margin regularizer. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 12016–12025. [Google Scholar]
  7. Wang, Z.; Zhu, A.; Xue, J.; Jiang, D.; Liu, C.; Li, Y.; Hu, F. SUM: Serialized Updating and Matching for text-based person retrieval. Knowl.-Based Syst. 2022, 248, 108891. [Google Scholar]
  8. Jing, Y.; Si, C.; Wang, J.; Wang, W.; Wang, L.; Tan, T. Pose-guided multi-granularity attention network for text-based Person search. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11189–11196. [Google Scholar]
  9. Wang, Z.; Zhu, A.; Xue, J.; Wan, X.; Liu, C.; Wang, T.; Li, Y. CAIBC: Capturing all-round information beyond color for text-based person retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5314–5322. [Google Scholar]
  10. Li, S.; Xiao, T.; Li, H.; Zhou, B.; Yue, D.; Wang, X. Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1970–1979. [Google Scholar]
  11. Chen, T.; Xu, C.; Luo, J. Improving text-based Person search by spatial matching and adaptive threshold. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikola, HI, USA, 21–15 March 2018; pp. 1879–1887. [Google Scholar]
  12. Niu, K.; Huang, Y.; Ouyang, W.; Wang, L. Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments. IEEE Trans. Image Process. 2020, 29, 5542–5556. [Google Scholar] [CrossRef] [PubMed]
  13. Ding, Z.; Ding, C.; Shao, Z.; Tao, D. Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification. arXiv 2021, arXiv:2107.12666. [Google Scholar]
  14. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Deep metric learning for person re-identification. In Proceedings of the 22nd IEEE International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 24–28 August 2014; pp. 34–39. [Google Scholar]
  15. Zheng, Z.; Yang, X.; Yu, Z.; Zheng, L.; Yang, Y.; Kautz, J. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2138–2147. [Google Scholar]
  16. Liu, Y.; Yang, H.; Zhao, Q. Hierarchical Feature Aggregation from Body Parts for Misalignment Robust Person Re-Identification. Appl. Sci. 2019, 9, 2255. [Google Scholar] [CrossRef]
  17. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195. [Google Scholar]
  18. Li, H.; Wu, G.; Zheng, W.S. Combined depth space based architecture search for Person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6729–6738. [Google Scholar]
  19. Bak, S.; Carr, P. One-shot metric learning for person Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2990–2999. [Google Scholar]
  20. Liu, J.; Zha, Z.J.; Hong, R.; Wang, M.; Zhang, Y. Deep adversarial graph attention convolution network for text-based person search. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 665–673. [Google Scholar]
  21. Sarafianos, N.; Xu, X.; Kakadiaris, I.A. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5814–5824. [Google Scholar]
  22. Aggarwal, S.; Radhakrishnan, V.B.; Chakraborty, A. Text-based Person search via attribute-aided matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 2617–2625. [Google Scholar]
  23. Hao, X.; Zhao, S.; Ye, M.; Shen, J. Cross-modality person re-identification via modality confusion and center aggregation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 16403–16412. [Google Scholar]
  24. Gao, C.; Cai, G.; Jiang, X.; Zheng, F.; Zhang, J.; Gong, Y.; Peng, P.; Guo, X.; Sun, X. Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search. arXiv 2021, arXiv:2101.03036. [Google Scholar]
  25. Zheng, K.; Liu, W.; Liu, J.; Zha, Z.J.; Mei, T. Hierarchical gumbel attention network for text-based person search. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3441–3449. [Google Scholar]
  26. Wang, Z.; Fang, Z.; Wang, J.; Yang, Y. Vitaa: Visual-textual attributes alignment in person search by natural language. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 402–420. [Google Scholar]
  27. Zhu, A.; Wang, Z.; Li, Y.; Wan, X.; Jin, J.; Wang, T.; Hu, F.; Hua, G. DSSL: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 209–217. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 15. [Google Scholar]
  29. Wang, Z.; Zhu, A.; Zheng, Z.; Jin, J.; Xue, Z.; Hua, G. IMG-Net: Inner-cross-modal Attentional Multigranular Network for Description-based Person Re-identification. J. Electron. Imaging 2020, 29, 043028. [Google Scholar] [CrossRef]
  30. Reed, S.; Akata, Z.; Lee, H.; Schiele, B. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 49–58. [Google Scholar]
  31. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  32. Li, S.; Xiao, T.; Li, H.; Yang, W.; Wang, X. Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1890–1899. [Google Scholar]
  33. Zheng, Z.; Zheng, L.; Garrett, M.; Yang, Y.; Xu, M.; Shen, Y.D. Dual-Path Convolutional Image-Text Embeddings with Instance Loss. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2020, 16, 1–23. [Google Scholar] [CrossRef]
  34. Chen, D.; Li, H.; Liu, X.; Shen, Y.; Shao, J.; Yuan, Z.; Wang, X. Improving deep visual representation for person re-identification by global and local image-language association. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 54–70. [Google Scholar]
  35. Zhang, Y.; Lu, H. Deep cross-modal projection learning for image-text matching. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 686–701. [Google Scholar]
  36. Wang, Z.; Xue, J.; Zhu, A.; Li, Y.; Zhang, M.; Zhong, C. AMEN: Adversarial multi-space embedding network for text-based Person re-identification. In Proceedings of the 4th Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Beijing, China, 29 October–1 November 2021; pp. 462–473. [Google Scholar]
Figure 1. Examples of multiple partitioning scales. For each body part of a pedestrian, there always exists an appropriate scale at which the integrity of the relevant local information is well-preserved while minimizing the surrounding interference signals. This facilitates the capture of highly discriminative local visual features and enhances the alignments between image regions and words/phrases. Consequently, a coarse-to-fine pyramidal method is employed to take advantage of varied partitioning scales.
Figure 2. The overall architecture of our proposed Pyramidal Multi-Granular matching network (PMG). It extracts eight representations with different granularities, including a global visual representation ($V_g$), local-2, local-4, and local-6 visual representations ($V_{l2}$, $V_{l4}$, $V_{l6}$), a global textual representation ($T_g$), and local-2, local-4, and local-6 textual representations ($T_{l2}$, $T_{l4}$, $T_{l6}$), and then calculates four different levels of cross-modal similarities between visual and textual features to better match visual and textual information.
Figure 3. The algorithm framework of PMG. Based on the pretrained ResNet-50 and Bi-GRU backbones, PMG extracts global and local features, respectively, from visual and textual modalities. For each modality, multiple parallel branches are utilized for extracting multi-granular representations. For simplicity, only the global branch and the K-strip branch are shown. A Word Attention Module is adopted to establish semantic correlations between the K-strip and relative noun phrases.
Figure 4. Examples of Top-5 text-based person Re-ID results by PMG. Given a query sentence, all images in the dataset are sorted based on their similarities to the query, and then the five images with the highest similarities are picked out as candidates. Images of the corresponding person contained in the candidates are marked by green rectangles.
Figure 5. Graphical comparison with SOTA on CUHK-PEDES. The data in the graph is consistent with those presented in Table 5.
Figure 6. Graphical comparison with SOTA on RSTPReid. The data in the graph is consistent with those presented in Table 6.
Figure 7. Graphical plot of the ablation analysis of PMG on CUHK-PEDES. The data in the graph are consistent with those presented in Table 7.
Table 1. Summary of related literature reviews.
| Method | Local Feature Alignments |
| --- | --- |
| SUM [7] | - |
| PMA [8] | $V_L^6$-$T_L^p$ |
| CAIBC [9] | $V_L^3$-$T_L^w$ |
| GNA-RNN [10] | - |
| PWM-ATH [11] | - |
| MIA [12] | $V_L^6$-$T_L^w$ |
| SSAN [13] | $V_L^3$-$T_L^p$ |
| A-GANet [20] | - |
| TIMAM [21] | - |
| CMAAM [22] | $V_L^2$-$T_L^p$ |
| MCLNet [23] | - |
| NAFS [24] | $V_L^3$-$T_L^{p,w}$ |
| HGAN [25] | - |
| ViTAA [26] | $V_L^5$-$T_L^p$ |
| DSSL [27] | $V_L^6$-$T_L^{p,w}$ |
Table 2. Ablation analysis of learning rate of Stage-I.
| Epoch | ID_Loss ($1 \times 10^{-2}$) | ID_Loss ($1 \times 10^{-3}$) | ID_Loss ($1 \times 10^{-4}$) |
| --- | --- | --- | --- |
| 1 | 56.04% | 56.04% | 56.04% |
| 5 | 27.94% | 24.02% | 27.80% |
| 10 | 17.01% | 12.10% | 17.49% |
Table 3. Ablation analysis of learning rate of Stage-II.
| Epoch | ID_Loss ($2 \times 10^{-2}$) | ID_Loss ($2 \times 10^{-4}$) | ID_Loss ($2 \times 10^{-5}$) |
| --- | --- | --- | --- |
| 20 | 6.24% | 8.47% | 9.57% |
| 30 | 5.43% | 4.32% | 5.80% |
| 40 | 4.36% | 3.50% | 4.14% |
Table 4. Ablation analysis of trained epoch.
| Epoch | Top-1 | Top-5 | Top-10 |
| --- | --- | --- | --- |
| 40 | 62.15% | 80.86% | 87.04% |
| 50 | 64.59% | 83.19% | 89.12% |
| 60 | 63.94% | 82.87% | 88.91% |
Table 5. Comparison with SOTA on CUHK-PEDES.
| Method | Top-1 | Top-5 | Top-10 |
| --- | --- | --- | --- |
| CNN-RNN [30] | 8.07 | - | 32.47 |
| Neural Talk [31] | 13.66 | - | 41.72 |
| GNA-RNN [10] | 19.05 | - | 53.64 |
| IATV [32] | 25.94 | - | 60.48 |
| PWM-ATH [11] | 27.14 | 49.45 | 61.02 |
| Dual Path [33] | 44.40 | 66.26 | 75.07 |
| GLA [34] | 43.58 | 66.93 | 76.26 |
| CMPM-CMPC [35] | 49.37 | 71.69 | 79.27 |
| MIA [12] | 53.10 | 75.00 | 82.90 |
| A-GANet [20] | 53.14 | 74.03 | 81.95 |
| PMA [8] | 54.12 | 75.45 | 82.97 |
| TIMAM [21] | 54.51 | 77.56 | 84.78 |
| ViTAA [26] | 55.97 | 75.84 | 83.52 |
| CMAAM [22] | 56.68 | 77.18 | 84.86 |
| HGAN [25] | 59.00 | 79.49 | 86.62 |
| NAFS [24] | 59.94 | 79.86 | 86.70 |
| DSSL [27] | 59.98 | 80.41 | 87.56 |
| SSAN [13] | 61.37 | 80.15 | 86.73 |
| PMG (ours) | 62.33 | 81.32 | 87.26 |
| PMG + BERT (ours) | 64.59 | 83.19 | 89.12 |
Table 6. Comparison with SOTA on RSTPReid.
| Method | Top-1 | Top-5 | Top-10 |
| --- | --- | --- | --- |
| IMG-Net [29] | 37.60 | 61.15 | 73.55 |
| AMEN [36] | 38.45 | 62.40 | 73.80 |
| DSSL [27] | 39.05 | 62.60 | 73.95 |
| CAIBC [9] | 47.35 | 69.55 | 79.00 |
| PMG (ours) | 46.60 | 70.85 | 79.55 |
| PMG + BERT (ours) | 48.85 | 72.65 | 81.30 |
Table 7. Ablation Analysis of PMG on CUHK-PEDES.
| Method | Top-1 | Top-5 | Top-10 |
| --- | --- | --- | --- |
| PMG (001) | 62.02 | 81.82 | 87.04 |
| PMG (010) | 62.18 | 81.93 | 87.17 |
| PMG (011) | 63.52 | 82.16 | 88.73 |
| PMG (100) | 61.08 | 79.98 | 86.65 |
| PMG (101) | 64.04 | 82.95 | 88.94 |
| PMG (110) | 64.20 | 83.13 | 89.08 |
| PMG (111) | 64.59 | 83.19 | 89.21 |
Table 8. Ablation Analysis of Pooling Methods on CUHK-PEDES.
| Method | Top-1 | Top-5 | Top-10 |
| --- | --- | --- | --- |
| PMG-AvgPooling | 63.47 | 83.02 | 89.10 |
| PMG-MaxPooling | 63.66 | 83.15 | 89.13 |
| PMG-FusedPooling | 64.59 | 83.19 | 89.21 |