Article

Prototype-Based Support Example Miner and Triplet Loss for Deep Metric Learning

1 Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
2 State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
3 Pengcheng Laboratory, Shenzhen 518055, China
4 State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(15), 3315; https://doi.org/10.3390/electronics12153315
Submission received: 13 June 2023 / Revised: 22 July 2023 / Accepted: 27 July 2023 / Published: 2 August 2023
(This article belongs to the Special Issue Machine Intelligent Information and Efficient System)

Abstract

Deep metric learning aims to learn a mapping function that projects input data into a high-dimensional embedding space, facilitating the clustering of similar data points while ensuring dissimilar ones are far apart. The most recent studies focus on designing a batch sampler and mining online triplets to achieve this purpose. Conventionally, hard negative mining schemes serve as the preferred batch sampler. However, most hard negative mining schemes search for hard examples in randomly selected mini-batches at each epoch, which often results in less-optimal hard examples and thus sub-optimal performances. Furthermore, Triplet Loss is commonly adopted to perform online triplet mining by pulling the hard positives close to and pushing the negatives away from the anchor. However, when the anchor in a triplet is an outlier, the positive example will be pulled away from the centroid of the cluster, thus resulting in a loose cluster and inferior performance. To address the above challenges, we propose the Prototype-based Support Example Miner (pSEM) and Triplet Loss (pTriplet Loss). First, we present a support example miner designed to mine the support classes on the prototype-based nearest neighbor graph of classes. Following this, we locate the support examples by searching for instances at the intersection between clusters of these support classes. Second, we develop a variant of Triplet Loss, referred to as a Prototype-based Triplet Loss. In our approach, a dynamically updated prototype is used to rectify outlier anchors, thus reducing their detrimental effects and facilitating a more robust formulation for Triplet Loss. Extensive experiments on typical Computer Vision (CV) and Natural Language Processing (NLP) tasks, namely person re-identification and few-shot relation extraction, demonstrated the effectiveness and generalizability of the proposed scheme, which consistently outperforms the state-of-the-art models.

1. Introduction

Deep metric learning aims to develop a similarity metric superior to traditional Euclidean distance by focusing on the derivation of high-dimensional data representations [1]. Contemporary research underscores the criticality of designing batch samplers [2,3,4] and employing online triplet mining [5,6,7,8], with hard negative mining schemes [9,10,11] traditionally serving as the preferred choice, proving their crucial role in significantly refining the similarity metric in deep metric learning. For an understanding of key concepts related to hard example mining and Batch Samplers, see Section 3. These techniques have found utility in various tasks, including face recognition [12,13,14,15,16], person re-identification (ReID) [2] within Computer Vision, and Natural Language Processing tasks such as few-shot relation extraction [17], few-shot text classification [18], text clustering [19], and sentence embedding [20], thereby boosting performance. Although certain methods, including the batch sampler and online triplet mining, have indeed demonstrated admirable performance, two significant challenges persist that invite further scholarly exploration.
Challenge 1: Traditional batch samplers struggle to identify hard examples situated at cluster intersections. To the best of our knowledge, existing types of batch samplers include the standard PK sampler (PK) [21] and the sampler associated with hard negative mining (HNM) schemes. The PK, while randomly selecting P classes and K examples from each class, experiences compromised efficiency in sampling hard examples due to the randomness of its approach. To circumvent this limitation, certain HNM schemes construct mini-batches for each epoch by mining hard examples from the entire training dataset, thereby achieving significant results [2,3,22,23]. These schemes primarily identify hard classes by creating Nearest Neighbor Graphs (NNGs) at instance-to-instance [3] or instance-to-class [22,23] (e.g., state-of-the-art Graph Sampling (GS) [2]) levels. Subsequently, examples (e.g., the examples in the red rectangular box as depicted in Figure 1) are randomly chosen from these hard classes to compose the mini-batch [2]. It is worth noting that the direct mining of hard examples, such as those enclosed by the rectangle in Figure 1 situated at the intersections of clusters (i.e., groups of data objects belonging to the same class, as represented by green circles in Figure 1), is often overlooked in much of the existing research due to the prevalence of random sampling from hard classes. Importantly, these examples pose significant classification challenges for the model. Although they may seem analogous to the ‘support vectors’ in SVM [24] due to their positions and influences on the model, they play a different role in our context because of their distinct definitions and usages. We refer to these hard examples as ‘support examples’ and their corresponding classes as ‘support classes’. The necessity of studying the construction of prototype-based NNGs arises, aiming to identify the support classes and select support examples located at the intersections of support class clusters. Details on prototype acquisition will be thoroughly discussed in Section 4. As depicted in Table 1, we compared the PK, GS [2], and our proposed Prototype-based Support Example Miner (pSEM), detailed subsequently. Since the PK and GS randomly sample hard examples, they are incapable of mining support examples. In contrast, our non-random pSEM focuses on intersections between clusters, directly mining support examples, as illustrated in Figure 1.
Challenge 2: Online Triplet Mining overlooks the issue of excessively loose clusters caused by outliers serving as anchors. The Online Triplet Mining (OTM) scheme, proposed by Triplet Loss [5], operates on triplets of examples (anchor, positive, negative) and generally aims to minimize the distance between the anchor and the positive example while maximizing the distance between the anchor and the negative example within the mini-batch [5,7,8,25]. Nonetheless, when the anchor is an outlier, it can lead to overly loose clusters, which are not preferable in deep learning due to their potential detriment to model classification. Specifically, the positive example distances itself from the class cluster as the distance between the outlier anchor and the positive example is minimized, as illustrated in Figure 2(1). This scenario may consequently result in an excessively loose cluster to which the positive example belongs. This adverse outcome also arises when the distance between the outlier anchor and the negative example is maximized. Inspired by stochastic optimization strategies, our approach integrates the concept of historical gradients—a technique that uses previously computed gradients to curb oscillation and stabilize the learning process [26]—to effectively address the challenges associated with outlier anchors and excessively loose clusters. To tackle these issues, we introduce the notion of ‘historical normal anchors’, which we define as the normal anchors derived from previous mini-batches during the training process. In an innovative twist, we merge these historical normal anchors with outlier anchors, utilizing the exponentially weighted average formula [26] to create a corrected, normal anchor. Through this merging process, the outlier anchors are infused with the intrinsic structure characteristic of historical normal anchors—a structure embodying inherent patterns, similarities, and clusters within the data. This maneuver not only brings outlier anchors closer to both historical normal anchors and cluster centers but also facilitates better alignment with the underlying structure of the data. Ultimately, this method significantly enhances the representational quality of outlier anchors within the intrinsic structure, demonstrating its robustness and innovation in addressing the outlined challenges. As depicted in Figure 2(2), the corrected outlier anchor pulls the positive example, alleviating the aforementioned adverse outcome.
To address the challenges encountered in the two stages of deep metric learning, we propose a Prototype-based Support Example Miner and Triplet Loss. Initially, the Prototype-based Support Example Miner (pSEM) identifies support classes by building the NNGs with the prototype and mines the support examples at the cluster intersections of support classes. Secondly, to alleviate the problem of excessively loose clusters caused by outlier anchors, we design the Prototype-based Triplet Loss (pTriplet Loss). pTriplet Loss can detect the outlier anchor and subsequently correct it by integrating the outlier anchor with the historical normal anchor. We showcase the effectiveness and generalizability of our schemes through experiments on Person Re-identification (ReID) and Few-shot Relation Extraction (FSRE), which are typical tasks in Computer Vision (CV) and Natural Language Processing (NLP), respectively.

2. Related Work

In the domain of deep metric learning, several studies focus on the design of batch samplers and OTM to learn a similarity metric function. These methods are specifically engineered to mine hard examples from the training set, thereby enabling the model to concentrate on learning these examples during the training process [1].
Batch sampler designing. The standard methodologies include typical PK samplers and HNM schemes. The PK sampler fails to mine hard examples, as elaborated in Section 1. HNM schemes, on the other hand, stabilize the deep neural network after a few iterations or each epoch and then leverage the updated model to detect hard examples across the entire training set. This model constructs hard examples within each mini-batch for training at each step. Bootstrapping [27] is a technique that employs the updated model every few iterations to find hard examples within the training set while training deep neural networks, although this dramatically decelerates the training progress. Recent work aims to expedite the training progress by adapting training strategies. SmartMining [3], for instance, establishes approximate Nearest Neighbor Graphs (NNGs) grounded in the instance-to-instance distance at the outset of each training epoch. However, because all examples must be considered, constructing instance-to-instance-level NNGs is resource-consuming. Wang et al. [22] suggest randomly sampling hard examples from hard classes determined by K-means clustering. Nonetheless, K-means can easily converge to a local optimal solution [28], implying that the hard class obtained might not necessarily be the support class. Suh et al. [23] procure hard classes from NNGs constructed based on instance-to-class distances, and then the hard example is obtained through random sampling from the hard class. As this operation is conducted at each iteration, it results in enormous computational costs. To mitigate these shortcomings, GS [2] randomly samples a single example per class for NNG construction based on instance-to-instance distance. Despite its effectiveness, it is not without flaws. For instance, when a class encompasses tens of thousands of training examples, a randomly selected example may poorly represent the class. Moreover, the selected hard examples, randomly sampled from the hard classes, may not be situated at the intersection between clusters, as discussed in Section 1. Hence, the hard examples obtained by such methods are often non-support examples. Distinct from the aforementioned methods, our proposed Prototype-based Support Example Miner (pSEM) builds the NNGs predicated on the prototype (i.e., class-to-class distance). It identifies the support classes across the entire training dataset and then selects examples at the intersection between clusters, thus effectively mining the support examples.
Online Triplet Mining. Schroff et al. [5] introduced OTM, which mines hard positives and hard negatives in each mini-batch based on the loss value to form a triplet (anchor, hard positive, hard negative). These triplets are then utilized to update the model. A substantial number of the remaining triplets do not participate in the model update process, significantly accelerating the model’s convergence rate. To address the ’local optimal solutions’ concern—which arises when the Triplet Loss method mines hard samples within a mini-batch, not the entire training set, potentially leading to the less accurate identification of truly hard samples or understating their hardness level—Ge [7] proposed a solution. They suggested constructing a hierarchical class tree encompassing all classes in the training set, thereby expanding the mining scope for a more accurate hard sample identification. Cai et al. [25] proposed the Hard Exemplar Reweighting Triplet Loss, which assigns weights to the triplets according to their difficulty level. The t-Distributed Stochastic Triplet Embedding (t-STE) [29] algorithm adeptly refines data embeddings by enforcing similarity triplet constraints while concurrently amalgamating similar instances and diverging dissimilar ones, thereby unearthing the genuine structure of underlying data. Vasudeva et al. [8] presented LoOp, which can be employed to alleviate the biased embedding caused by Triplet Loss. To the best of our knowledge, existing research has not addressed the problem of excessively loose clusters resulting from anchors being outliers. In response, we propose pTriplet Loss, which amalgamates outlier anchors and prototypes to generate a typical anchor near the cluster center, thereby alleviating this issue.

3. Background

To ensure our paper is readily comprehensible for the general reader, we provide definitions of key concepts in this section.

3.1. Batch Sampler

Batch Samplers, in the context of machine learning, act as components of a data loader that are responsible for grouping individual data points into batches and retrieving these batches of data points during each iteration of model training or inference.
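To make this concrete, the following minimal PyTorch-style sketch shows how a PK-style batch sampler might look; the class name, the default P and K values, and the with-replacement fallback for small classes are illustrative assumptions rather than the exact sampler configurations evaluated later in this paper.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler


class PKSampler(Sampler):
    """Minimal PK-style batch sampler: each batch holds P random classes with K examples each."""

    def __init__(self, labels, P=16, K=4):
        # labels: list of integer class ids, one per dataset index
        self.P, self.K = P, K
        self.index_by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.index_by_class[label].append(idx)
        self.classes = list(self.index_by_class)

    def __len__(self):
        return max(1, len(self.classes) // self.P)

    def __iter__(self):
        for _ in range(len(self)):
            batch = []
            for c in random.sample(self.classes, self.P):
                pool = self.index_by_class[c]
                # sample with replacement when a class has fewer than K examples
                picks = random.choices(pool, k=self.K) if len(pool) < self.K else random.sample(pool, self.K)
                batch.extend(picks)
            yield batch
```

Such a sampler would be passed to a DataLoader through its batch_sampler argument, so that every mini-batch contains P classes with K examples each.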

3.2. Hard Example Mining

Hard Example Mining refers to an insightful approach in deep metric learning, systematically identifying instances in the training set that pose significant challenges to accurate model classification. This approach plays a pivotal role in pushing the boundaries of model performance, fostering the cultivation of a superior similarity metric [30].
Hard Example. Hard examples, typically instances the model tends to misclassify in the training set, often lead to performance bottlenecks. By shifting the model’s learning emphasis towards these challenging cases, much in the same way human learners stress their own mistakes, we can enhance the model’s learning proficiency in the face of difficulties, thereby more effectively advancing the development of a superior similarity metric, a key objective in deep metric learning [31].
Hard Class. Hard classes can be defined as groups of examples that share common traits but are inherently complex or ambiguous to classify, often posing substantial challenges to the accuracy and efficiency of machine learning models.
Hard Negative Mining. Hard Negative Mining schemes construct mini-batches for each epoch by mining hard examples from the entire training dataset [27,32,33].

4. Methodology

Figure 3 illustrates the architecture of our proposed framework, comprising primarily the Prototype-based Support Example Miner (pSEM) and Triplet Loss components. In Step 1, we utilize pSEM to mine support examples, which, in the context of pedestrian re-identification, refer to pedestrians with similar appearances. Initially, we derive a prototype for each class within the training set using the model updated at the end of the previous epoch. Subsequently, we identify support classes by constructing Nearest Neighbor Graphs (NNGs) based on these prototypes. Finally, we mine support examples from the intersections between clusters of support classes, which are then used to assemble the mini-batch. In Step 2, we refine the anchor using pTriplet Loss. For the current epoch, the input mini-batch is processed through the backbone network to obtain the feature vectors of the examples. Our pTriplet Loss detects the outlier anchor and then amends it by fusing the outlier anchor with the historical normal anchor to generate a new, standard anchor. Ultimately, the pTriplet Loss integrates other loss functions (e.g., cross-entropy loss function) for back-propagation. Upon completion of the current epoch’s training, this process is repeated until model convergence.
Formally, given a training set $T = \{x_1, x_2, x_3, \ldots, x_n\}$ consisting of $C$ classes, we extract the support example set $H = \{h_1, h_2, h_3, \ldots, h_m\}$ from $T$. The pSEM identifies the support example set $H$ on the training set $T$ prior to the commencement of each epoch and utilizes these examples to construct the mini-batch for subsequent model training. The mini-batch encompasses $M$ classes, with $N$ examples from each class. Additionally, pTriplet Loss combines the outlier anchor $f_\theta(x^{(o)})$ and prototype $V_P$ to produce a new, standard anchor $V_u$.

4.1. Prototype-Based Support Example Miner

The pSEM mines support examples at the intersections of clusters belonging to support classes on NNGs to construct mini-batches for model training.
Formulation of Nearest Neighbor Graphs (NNGs) for Support Class Determination. NNGs, which are prevalent graph structures, are formulated for all classes within the training set using their respective prototypes. Each prototype is depicted as a node, with edges connecting it to its ‘nearest neighbors’, which are ascertained via a distance metric such as cosine similarity. To elaborate, the cosine similarity between classes is computed utilizing the class-specific prototype, conceived as the mean of the feature vectors corresponding to each class’s instances.
$$V_{P_i} = \frac{\sum_{j=1}^{k} f_\theta(x_{i,j})}{k}, \quad i \in [1, C]$$
where $f$ denotes the backbone, $i$ signifies the $i$-th class, and $k$ represents the number of examples in the $i$-th class. Moreover, $f_\theta(x_{i,j})$ is the feature vector of the $j$-th example in the $i$-th class, and $V_{P_i}$ represents the prototype of the $i$-th class.
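As a minimal sketch (assuming the features of all training examples have already been extracted with the backbone from the previous epoch, and using illustrative function and argument names), Equation (1) amounts to a per-class mean:

```python
import torch


def class_prototypes(features, labels):
    """Prototype of each class as the mean of its feature vectors (Equation (1)).

    features: (n, d) tensor of f_theta(x) for the whole training set.
    labels:   (n,) tensor of integer class ids.
    Returns a dict {class_id: (d,) prototype tensor}.
    """
    return {int(c): features[labels == c].mean(dim=0) for c in labels.unique()}
```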
For any class in the training set, we identify its nearest-neighbor class by comparing the cosine distance between the prototype of this class and the prototypes of all other classes. Each class and its nearest neighbors, being the closest in the feature vector space and challenging for the model to distinguish, are considered support classes and grouped into a set of support classes.
$$S_i = \{s_1, s_2, \ldots, s_l\}, \quad i \in [1, C]$$
where $S_i$ represents the set of support classes consisting of the $i$-th class and its neighbors, $l$ is the size of $S_i$, and $s_i$ denotes the $i$-th support class. Thus, the NNGs are constructed as
$$G = (V, E), \quad V = \{v_1, v_2, \ldots, v_C\}, \quad E = \{(v_i, v_j) \mid v_i \in S(v_j)\}.$$
In this case, $V$ refers to all classes in the training set, and $E$ refers to the set of edges. $S(v_j)$ represents the support classes of class $v_j$.
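A possible sketch of this construction is shown below; the number of neighbors kept per class is an assumed hyperparameter, and the dictionary representation of $G$ is purely illustrative:

```python
import torch
import torch.nn.functional as F


def build_prototype_nng(prototypes, num_neighbors=4):
    """Prototype-based NNG (Equations (2) and (3)): each class is linked to the classes
    whose prototypes are most similar to its own under cosine similarity.

    prototypes: dict {class_id: (d,) tensor}; returns {class_id: list of support-class ids}.
    """
    ids = sorted(prototypes)
    P = F.normalize(torch.stack([prototypes[c] for c in ids]), dim=1)
    sim = P @ P.t()
    sim.fill_diagonal_(-1.0)  # a class is never its own neighbor
    nn_idx = sim.topk(num_neighbors, dim=1).indices
    return {ids[i]: [ids[j] for j in row.tolist()] for i, row in enumerate(nn_idx)}
```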
Mining Support Examples. To identify the support examples, we explore the intersection between clusters of support classes on $G$. Specifically, for any two support classes that are nearest neighbors, we compute the vector $V_m$ of their midpoint based on the prototypes $V_P$.
$$V_m = \frac{V_{P_a} + V_{P_b}}{2}, \quad a, b \in S_i$$
where $V_{P_a}$ and $V_{P_b}$ denote the prototypes of support class $a$ and support class $b$, respectively.
We consider the sphere centered at $V_m$ with radius $\delta$ as the region encompassing the intersection between clusters of support classes. Subsequently, we calculate the cosine distance between $V_m$ and each example within the support classes. Ultimately, examples whose distance does not exceed the threshold $\delta$ are selected as support examples.
$$T_{ab} = T_a \cup T_b, \qquad H_{ab} = \{\, t \in T_{ab} \mid 1 - \cos(f_\theta(x_t), V_m) \le \delta \,\}$$
where $T_i$ represents the set of examples corresponding to the $i$-th class and $H_{ab}$ represents the set of support examples. By adjusting the value of $\delta$, we can effectively select support examples of varying degrees of difficulty. Refer to the hyper-parameter sensitivity analysis in Section 5.8.3 for further details.
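The miner can be sketched for a single pair of neighboring support classes as follows (the value of $\delta$ and all names are illustrative; in practice the step is repeated over every edge of the NNG to assemble the mini-batch):

```python
import torch
import torch.nn.functional as F


def mine_support_examples(features, labels, prototypes, class_a, class_b, delta=0.3):
    """Support examples of two neighboring support classes (Equations (4) and (5)):
    examples whose cosine distance to the midpoint V_m of the two prototypes is <= delta."""
    v_m = (prototypes[class_a] + prototypes[class_b]) / 2.0
    mask = (labels == class_a) | (labels == class_b)
    idx = mask.nonzero(as_tuple=True)[0]
    dist = 1.0 - F.cosine_similarity(features[idx], v_m.unsqueeze(0), dim=1)
    return idx[dist <= delta]  # indices of the mined support examples
```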

4.2. Prototype-Based Triplet Loss

The aim of pTriplet Loss is to enhance the efficacy of Triplet Loss by identifying and rectifying outlier anchors.
Identification of Outlier Anchors. We compute the distance d between each example in the cluster and the prototype, defining those examples with a cosine distance d exceeding the threshold λ as outlier anchors, while the remaining are treated as normal anchors.
$$d_j = 1 - \cos(V_{P_i}, x_j), \qquad O = \{\, o \in T_i \mid d_j > \lambda \,\}, \qquad N = \{\, n \in T_i \mid d_j \le \lambda \,\}$$
In this case, O and N correspond to the outlier anchor set and the set of normal anchors, respectively.
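A minimal sketch of this test for one class in the mini-batch, with an illustrative threshold value:

```python
import torch.nn.functional as F


def split_outlier_and_normal(features, prototype, lam=0.5):
    """Outlier-anchor detection (Equation (6)): examples whose cosine distance to the
    class prototype exceeds lambda are outliers; the rest are normal anchors."""
    d = 1.0 - F.cosine_similarity(features, prototype.unsqueeze(0), dim=1)
    outlier_mask = d > lam
    return outlier_mask, ~outlier_mask
```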
Prototype Updating. In each step of the model learning process, we employ an exponentially weighted average formula, a weighted averaging strategy [26]. This involves initializing an average value and a decay factor ($\alpha$); at each step, a new value is obtained and combined with the previous average using the decay factor. The exponentially weighted average assigns decreasing weights to values over time, giving more importance to recent values. Because the network's parameters improve with every update, the normal anchors obtained in the current step provide increasingly accurate representations; we therefore fuse the normal anchors in the mini-batch with the prototype, updating the prototype vector so that it captures these more precise representations.
$$V_{P_i}^{(t)} = \alpha \times V_{P_i}^{(t-1)} + (1 - \alpha) \times f_\theta\big(x_{i,j}^{(n)}\big), \quad \alpha \in [0, 1]$$
where $f$ represents the backbone, $i$ refers to the $i$-th class, $j$ signifies the $j$-th example, and $f_\theta(x_{i,j}^{(n)})$ is the feature vector of the normal anchor $x_{i,j}^{(n)}$. The term $V_{P_i}^{(t)}$ is the prototype of the $i$-th class after $t$ iterations, while $\alpha$ is the adjustment factor controlling the proportion of $V_{P_i}^{(t-1)}$ in the updated prototype $V_{P_i}^{(t)}$. This prevents the proportion of prototype $V_{P_i}^{(t-1)}$ from becoming too large, which could result in over-compaction of clusters and over-fitting. The prototype derived via pSEM prior to each epoch serves as the initial value of $V_P$ for the current epoch.
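The update can be sketched as below; whether the blend is applied once per normal anchor or once per step with the mean of the normal anchors is an implementation choice we leave as an assumption here:

```python
def update_prototype(prototype, normal_anchor_features, alpha=0.9):
    """Exponentially weighted prototype update (Equation (7)): the prototype is blended
    with each normal anchor of the current step, using decay factor alpha."""
    for f in normal_anchor_features:
        prototype = alpha * prototype + (1.0 - alpha) * f
    return prototype
```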
Outlier Anchor Correction. By employing the exponentially weighted average formula, we generate a corrected normal anchor, denoted as V u , through the weighted averaging of updated prototypes V P ( t ) and detected outlier anchors x ( o ) , where the prototypes are derived from historical normal anchors. The correction process imbues the outlier anchors with the intrinsic structure that is characteristic of normal anchors—a structure that encompasses inherent patterns, similarities, and clusters within the data. Consequently, this process propels outlier anchors closer to both normal anchors and cluster centers, thereby enabling better alignment with the inherent structure of the data. As a result, this methodology markedly enhances the representation quality of outlier anchors within the intrinsic structure.
$$V_u = \beta \times V_{P_i}^{(t)} + (1 - \beta) \times f_\theta\big(x_{i,j}^{(o)}\big), \quad x_{i,j}^{(o)} \in O, \quad \beta \in [0, 1]$$
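Correspondingly, Equation (8) reduces to a single weighted average; a sketch with illustrative names:

```python
def correct_outlier_anchor(prototype_t, outlier_feature, beta=0.5):
    """Outlier-anchor correction (Equation (8)): the corrected anchor V_u is a weighted
    average of the updated prototype and the detected outlier anchor."""
    return beta * prototype_t + (1.0 - beta) * outlier_feature
```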
Formulation of pTriplet Loss. Given the above discussion, the Prototype-based Triplet Loss can be formalized as follows:
$$L_{\mathrm{pTriplet}}(\theta; x) = \sum_{i=1}^{M} \sum_{a=1}^{N} \Big[\, m + \max_{p=1,\ldots,N} D\big(V_a^i, f_\theta(x_p^i)\big) - \min_{\substack{j=1,\ldots,M \\ n=1,\ldots,N \\ j \neq i}} D\big(V_a^i, f_\theta(x_n^j)\big) \Big]_{+}$$
where $D$ represents the Euclidean distance and $f_\theta(x_p^i)$ and $f_\theta(x_n^j)$ are the features of positive and negative examples, respectively. $V_a^i$ is either the corrected outlier anchor $V_u$ or the normal anchor $V_n$, and $m$ represents a margin between positive and negative pairs. The batch size is $M \times N$. pTriplet Loss seeks to minimize the distance between the anchor and the positive example and maximize the distance between the anchor and the negative example during training within the mini-batch.
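A batch-hard sketch of Equation (9) is given below; the arrangement of the mini-batch into M classes of N examples and the substitution of corrected anchors $V_u$ for detected outliers follow our reading of the text, and averaging rather than summing the per-anchor hinge terms is a common scale-equivalent choice:

```python
import torch


def p_triplet_loss(anchors, features, labels, margin=0.3):
    """Batch-hard form of Equation (9) on a mini-batch of M classes x N examples.

    anchors:  (B, d) anchor vectors V_a (corrected V_u for outliers, V_n otherwise).
    features: (B, d) example features f_theta(x).
    labels:   (B,) class ids.
    """
    dist = torch.cdist(anchors, features)              # Euclidean distances D(V_a, f_theta(x))
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) positive-pair mask
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.clamp(margin + hardest_pos - hardest_neg, min=0.0).mean()
```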

5. Experiment

To validate the effectiveness and generalizability of our approach, we performed a range of empirical experiments, including quantitative performance comparisons, qualitative analyses, ablation studies, and sensitivity analyses of hyperparameters. These experiments were carried out on typical Computer Vision and Natural Language Processing tasks, specifically person re-identification and few-shot relation extraction.
Annotation: In Tables 4–6, ① signifies the use of the PK sampler and Triplet Loss, while ② indicates the usage of the GS [2] sampler and Triplet Loss during the training phase.

5.1. Overview of the Person Re-Identification Experiment

5.1.1. Introduction of Hard Examples for ReID

People exhibiting similar appearances are categorized as hard examples.

5.1.2. Datasets, Evaluation Metrics, Baseline, Comparison Scheme, and Experimental Settings

Datasets: In person re-identification and generalizable person re-identification tasks, we employ four datasets to validate our proposed scheme: Market1501 [34], DukeMTMC-ReID [35], MSMT17 [36], and CUHK [37]. The specifics of these datasets are summarized in Table 2. Notably, the generalizable person-re-identification task can assess the model’s generalization ability using hard negative schemes and online Triplet Loss. Cross-dataset evaluation is conducted by training the model on the training set of the source dataset and then evaluating it on the test set of the target dataset.
Evaluation Metrics: We employ Rank-1 (R1) and mean average precision (mAP) as our evaluation metrics. R1 denotes the accuracy rate of the first result in the retrieved outcomes, while mAP measures the overall ranking accuracy of the retrieved results. Detailed descriptions of these two evaluation metrics are provided below.
mAP: 
Let us denote $query$ as the set of images to be retrieved, hypothetically containing $N$ images, where $q_i$ represents the $i$-th query image. Thus, the dataset is represented as $Q = \{q_1, q_2, \ldots, q_N\}$. Similarly, $gallery$ refers to the image database, hypothetically encompassing $M$ images, with $g_i$ standing for the $i$-th image. The dataset is represented as $G = \{g_1, g_2, \ldots, g_i, \ldots, g_M\}$. Let us denote the number of gallery images matching the query $q_i$ in $G$ as $K_{q_i}$. For each query, after extracting the feature vector from $q_i$, the algorithm computes the distance between this vector and the feature vectors of each image in $G$. Subsequently, the $gallery$ is sorted in ascending order of this distance, resulting in a sorted $gallery$ dataset represented as $G_{q_i}$. The dataset composed of the hit images for $q_i$ is denoted as $\hat{G}_{q_i} = \{\hat{g}_1, \hat{g}_2, \ldots, \hat{g}_j, \ldots, \hat{g}_{K_{q_i}}\}$, which is a subset of $G_{q_i}$. Let us assume $\hat{g}_j$ ranks as $r_j$ in $G_{q_i}$ and $\hat{r}_j$ in $\hat{G}_{q_i}$. By repeating this process for all query images, we can formulate the precision calculation as follows:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K_{q_i}} \sum_{j=1}^{K_{q_i}} \hat{r}_j / r_j$$
R1: 
For a given query, denoted as $q_i$, if the top-ranked image in $G_{q_i}$ (the sorted $gallery$) corresponds to the same ID/person, we say $q_i$ achieves first-rank accuracy. Let $\hat{N}$ represent the number of all queries in $Q$ that satisfy this first-rank accuracy condition. Then, the Rank-1 accuracy is calculated as
$$\mathrm{R1} = \hat{N} / N$$
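A small sketch of the two metrics as defined above, assuming the gallery ranking of each query has already been computed and ignoring the same-camera filtering customary in ReID evaluation; the function and argument names are illustrative:

```python
import numpy as np


def map_and_rank1(rankings, query_ids):
    """mAP and Rank-1 from per-query gallery rankings.

    rankings:  list of 1-D arrays of gallery person ids, sorted by ascending distance to the query.
    query_ids: array-like of the person id of each query.
    """
    aps, rank1_hits = [], 0
    for ranking, qid in zip(rankings, query_ids):
        ranks = np.where(ranking == qid)[0] + 1      # ranks r_j of the hit images (1-based)
        if ranks.size == 0:
            continue
        hit_ranks = np.arange(1, ranks.size + 1)     # ranks r_hat_j inside the hit subset
        aps.append(np.mean(hit_ranks / ranks))       # average of r_hat_j / r_j for this query
        rank1_hits += int(ranking[0] == qid)
    return float(np.mean(aps)), rank1_hits / len(query_ids)
```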
Baseline: We adopt the widely-used PK sampler as the HNM baseline, which randomly selects P pedestrians from the training set, with each pedestrian randomly picking K images to construct a mini-batch. For online hard example mining, we utilize Triplet Loss as the baseline [5].
Comparison Scheme: For the comparison of HNM strategies, we utilize Cluster [22] and the state-of-the-art (SOTA) GS [2]. It should be mentioned that as the Cluster [22] does not provide open-source code, we only reference its experimental results on the generalized person re-identification task from [2].
Experimental Settings: Our approach is based on the official PyTorch code of TransReID [38] (https://github.com/damo-cv/TransReID, accessed on 27 May 2021) and GS [39] (https://github.com/ShengcaiLiao/QAConv, accessed on 1 May 2022). In the person re-identification task, we employ ViT [40] as the backbone, and the data augmentation strategy comprises random horizontal flipping, padding, random cropping, and random erasing [41]. The backbone for the generalizable person re-identification task is ResNet50 [42], augmented with IBN-b layers. The data augmentation strategies involve random cropping, flipping, occlusion, and color jittering. The input image is resized to dimensions 384 × 128, and the loss function is either Triplet Loss or cross-entropy loss. The remaining hyperparameters, such as batch size and the learning rate of the optimizer, are set according to the settings of TransReID [38] and GS [39]. Some table columns are left empty, indicating that the method has not been tested on the corresponding datasets. The parameter values mentioned in Section 4, including δ, λ, α, and β, are empirically determined. For reference values, please consult the sample values provided in the project code linked in the Data Availability Statement.

5.2. Overview of the Few-Shot Relation Extraction Experiment

5.2.1. Introduction of Hard Examples for the Few-Shot Relation Extraction Task

During the training phase, we mine sentences from various categories that share similar semantics, classifying them as hard examples.

5.2.2. Datasets, Evaluation Metrics, Baseline and Comparison Scheme, Experimental Settings

Dataset: To validate our proposed approach, we utilize the extensively employed FewRel few-shot relation extraction dataset [43]. This dataset encompasses 100 relations and 70,000 instances extracted from Wikipedia and a concealed test set comprising 20 relations. Adhering to the procedure delineated in prior work [44], we reorganize the available 80 relations into 50 for training, 14 for validation, and 16 for testing.
Evaluation Metrics: The N-way-K-shot (N-w-K-s) paradigm is routinely used to simulate the distribution of few-shot relation extraction across diverse scenarios. Here, N and K denote the number of classes and the number of examples per class, respectively. In the N-w-K-s framework, accuracy is the prime performance measure.
Baseline: The PK sampler is widely used as the baseline for Hard Negative Mining (HNM), while Triplet Loss is adopted as the benchmark for online hard example mining [5].
Comparison Scheme: To the best of our knowledge, there are no specialized HNM and OTM strategies designed particularly for few-shot relation extraction. A majority of the HNM and OTM approaches employed in Natural Language Processing (NLP) originate from the field of Computer Vision (CV). Consequently, we have chosen the GS scheme for comparison. We also consider the noteworthy model ConceptFERE [44] as our principal point of comparison.
Experimental Settings: We initialize the parameters for BERT [45] using the BERT-base-uncased model with a hidden size of 768. Hyperparameters, including the learning rate, adhere to the settings prescribed in ConceptFERE [44].

5.2.3. Model Training Details

For instance, taking 5-w-1-s as an example, during the data sampling stage of mining support examples, our scheme randomly samples a target class. Subsequently, we identify the top-4 nearest neighbors in the training set to serve as the support class. Each of the sampled five classes discovers its support examples and then randomly assigns support examples to the query and support sets. Both our proposed scheme and GS are implemented on ConceptFERE (https://github.com/LittleGuoKe/ConceptFERE, accessed on 1 May 2022).

5.3. Comparison with State-of-the-Art HNM Schemes on the Generalizable ReID Task

Due to the limited size of the CUHK03-NP and Market1501 test sets, variations in the Rank-1 (R1) rate by a few percentage points cannot accurately reflect the model’s performance. Nevertheless, the mean average precision (mAP) metric measures the overall retrieval performance; thus, we prioritize mAP as the main evaluation metric on both datasets. Notably, our method outperforms all other approaches in terms of mAP, as illustrated in Table 3. Specifically, the PK sampler underperforms due to its reliance on random sampling for mini-batch construction, which yields minimal support examples for mining. On the other hand, the Cluster [22] and GS [2] schemes yield improved performance as they obtain non-random hard classes. Our scheme delivers a significant improvement over the state-of-the-art GS, with gains of 2.5% and 1.5% in R1, and 1.1% and 0.9% in mAP for Market1501→MSMT17 and CUHK03-NP→MSMT17 transfers, respectively (where A→B indicates training on dataset A and testing on dataset B). This demonstrates that our method can effectively enhance the baseline in large-scale settings. The superior performance of our scheme, particularly in terms of mAP, can be attributed to the following factors: our proposed pSEM can mine the support examples at cluster intersections, and the pTriplet Loss corrects outliers to compact clusters, simplifying classification. For an in-depth quantitative analysis of our proposed scheme’s superiority, readers are referred to the cases of hard example mining depicted in Figure 4 and the distribution of intra-class similarity, as detailed in Section 5.8.1. The presented results in Figure 4, obtained by random sampling from mini-batches constructed using the standard PK sampler, the state-of-the-art GS [2] sampler, and our pSEM, compare their effectiveness in mining support examples during the ReID task training. We employed a random sampling method to ensure a fair comparison. Owing to space limitations, only a portion of the selected results is displayed.

5.4. Performance Comparison of HNM and OTM Schemes on Person Re-Identification Tasks

As demonstrated in Table 4, our proposed scheme outperforms other approaches across various datasets. Specifically, we observe a performance improvement of 1.3% in Rank-1 accuracy (R1) and 1.6% in mean Average Precision (mAP) on the MSMT17 dataset when compared to the TransReID scheme. Furthermore, our method registers gains of 0.7% in R1 and 0.9% in mAP on the MSMT17 dataset relative to the GS scheme [2]. Two possible reasons contribute to our method’s superior performance over GS. First, our methodology builds the Nearest Neighbour Graphs (NNGs) based on the prototype, while GS forms the NNGs at the instance level. Hence, the nearest neighbors determined by our method are the support classes and not the non-support classes. Secondly, our proposed scheme pSEM focuses on support examples at the intersection between clusters, which are overlooked by GS.

5.5. Performance Comparison of HNM and OTM Schemes on General Person-Re-Identification Tasks

Table 5 presents the state-of-the-art (SOTA) direct cross-dataset evaluation results, with each group reflecting a model trained on the source dataset’s training set and tested on the target dataset’s test set. As can be seen, our proposed scheme delivers the most impressive performance across all groups. For instance, with the MSMT17(all)→CUHK03-NP transfer, our proposed scheme records gains of 2.6% in R1 and 2.3% in mAP over GS. The primary reason for our scheme’s superiority is its consideration of support examples at the intersection between clusters, which is not a feature of GS as it randomly samples examples from each class that are not necessarily support examples. Our proposed loss function, pTriplet Loss, corrects outlier anchors when computing loss, thereby encouraging clusters to converge more compactly. As a result, the model’s classification surface becomes more distinguishable and generalizable on the general person-re-identification task.

5.6. Performance Comparison of HNM and OTM Schemes on Few-Shot Relation Extraction Tasks

Table 6 presents the performance of different hard example mining schemes on the FewRel [43] dataset test set. Our proposed scheme outperforms all comparator methods, with gains of 1.03% and 3.13% in the scenarios of 5-way-1-shot (5-w-1-s) and 10-way-1-shot (10-w-1-s), respectively, compared to ConceptFERE, the best-performing model in the second group. The potential reason for this improvement is that the PK Sampler randomly selects classes and examples, making it virtually impossible to mine support examples. More importantly, our proposed scheme yields gains of 0.87% and 3.1% over GS in the 5-w-1-s and 10-w-1-s scenarios, respectively. This superior performance can be attributed to GS’s limitations in the NLP scene, where each class has thousands of examples and complex semantics. By building an NNG at the instance level, GS may not accurately model the inter-class relationships, making it challenging to mine support classes. Furthermore, it overlooks support examples at the cluster intersections.

5.7. Ablation Study

This section substantiates the efficacy of the proposed Prototype-based Support Example Miner (pSEM) and Prototype-based Triplet Loss (pTriplet Loss), elaborated in Section 4.1 and Section 4.2, respectively. As demonstrated in Table 7 and Table 8, the absence of pSEM and pTriplet Loss results in a significant performance drop for both TransReID and ConceptFERE. This evidence confirms that the pSEM effectively mines support examples and that pTriplet Loss aptly corrects outlier anchors, optimizing the performance of Triplet Loss.

5.8. Qualitative Analysis

5.8.1. Evidence of Cluster Compactness

To validate that our proposed pTriplet Loss engenders more compact clusters, we present the intra-class variance distributions resulting from both Triplet Loss and pTriplet Loss in Table 9. Lower variance indicates higher example similarity within a class, suggesting a closer spatial grouping and hence more compact clusters. The intra-class variance in the ReID task, defined as the variance of the cosine distance between any two examples within a class, is portrayed in Table 9. A perusal of this table confirms the superior compactness of clusters produced by pTriplet Loss.

5.8.2. Comparative Analysis of Predictive Results

To provide an intuitive understanding of the effects of different Hard Negative Mining (HNM) and Online Triplet Mining (OTM) schemes on the model’s ability to classify challenging examples, we use the person ReID task as an illustration. Specifically, we randomly select a group from the retrieval results, as displayed in Figure 5. These retrieval results have been randomly sampled from predictions made by models trained using the PK sampler, GS, and our method (pSEM), all showcasing their top five outcomes. The images marked with yellow, green, and red boxes denote the query, positive example, and negative example, respectively, with a random sampling approach adopted to ensure a fair comparison.
An examination of the case presented in Figure 5 reveals that models trained with our method demonstrate a superior ability to discriminate support examples. For a detailed theoretical analysis, refer to Section 5.8.1.

5.8.3. Hyper-Parameter Sensitivity Analysis

  • Sensitivity Analysis of δ in Equation (5) In our evaluation, the influence of δ in determining the difficulty level of the mined support examples is evident. As illustrated in Figure 6, decreasing the δ value causes pSEM to extract examples lying closer to the intersection of clusters, and thus examples of greater difficulty, thereby augmenting the challenge of identification. Accordingly, by adjusting the δ value, we can select support examples with varying difficulty levels. To ensure an unbiased comparison, the outcomes were randomly selected from pSEM, employing different δ values.
  • Sensitivity Analysis of  λ  in Equation (6)  λ is utilized to detect outliers at varying distances from the cluster center. As depicted in Figure 7, an increase in the value of λ incrementally enhances the model’s performance until it reaches a stable state. This behavior suggests that the identification of outliers distant from the cluster center contributes to the improvement in the model’s performance.
  • Sensitivity Analysis of  α  in Equation (7) In the process of prototype updating, we assess the impact of the proportion of the current normal anchor in the updated prototype V P i ( t ) on model performance. This is accomplished by adjusting the α value as demonstrated in Figure 8. An excessive proportion adversely impacts the model’s performance, leading to overfitting.
  • Sensitivity Analysis of  β  in Equation (8)  β determines the proportion of prototypes in the corrected outlier anchors. Figure 9 illustrates that an increment in β value results in diminished model performance, a consequence of overfitting.

5.9. Training Time Comparison

Table 10 presents a comparison of training times when employing the QAConv [39] backbone, following GS [2], while using PK, GS, and pSEM as samplers. Our pSEM significantly accelerates the training process compared to PK and GS samplers. This increased efficiency is attributed to QAConv’s ability to learn the support examples positioned at the intersection of clusters, selectively picked by our pSEM.

6. Conclusions

In this work, we examined the challenges associated with hard negative learning and online triplet mining schemes, both of which are integral components of deep metric learning. Our developed pSEM successfully mines support examples utilizing prototype-based NNGs. Additionally, the designed pTriplet Loss corrects outlier anchors for Triplet Loss, enabling the production of more compact clusters, which is advantageous for model classification. The efficacy and generalizability of our method are evident from the experimental results. As a future direction, we aim to develop a fine-grained discrimination module at the model level to distinguish hard examples more effectively.

Author Contributions

S.Y. performed the manuscript writing; Y.Z., Q.Z., Y.P. and H.Y. were responsible for manuscript review and editing; Y.Z. provided supervision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work received partial support from the National Natural Science Foundation of China (No. 62072022) and the Fundamental Research Funds for the Central Universities.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We will publish source codes of this work at https://github.com/LittleGuoKe/pSEM-with-pTriplet (accessed on 27 May 2021). Data integral to this study may be obtained by contacting the corresponding author, who will ensure accessibility upon a duly justified request.

Acknowledgments

We also wish to express our sincere thanks to Sibusiso Reuben Bakana from the School of Computer Science and Engineering at Beihang University for his thorough review and valuable suggestions on the language and grammar of this manuscript, contributing significantly to the clarity of our work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C.H. Deep Learning for Person Re-Identification: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2872–2893. [Google Scholar] [CrossRef] [PubMed]
  2. Liao, S.; Shao, L. Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2021; pp. 7349–7358. [Google Scholar]
  3. Harwood, B.; Kumar BG, V.; Carneiro, G.; Reid, I.; Drummond, T. Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2821–2829. [Google Scholar]
  4. Manmatha, R.; Wu, C.; Smola, A.; Krähenbühl, P. Sampling Matters in Deep Embedding Learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2859–2867. [Google Scholar]
  5. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  6. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
  7. Ge, W. Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 269–285. [Google Scholar]
  8. Vasudeva, B.; Deora, P.; Bhattacharya, S.; Pal, U.; Chanda, S. Loop: Looking for optimal hard negative embeddings for deep metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10634–10643. [Google Scholar]
  9. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Canévet, O.; Fleuret, F. Large scale hard sample mining with monte carlo tree search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5128–5137. [Google Scholar]
  11. Jin, S.; RoyChowdhury, A.; Jiang, H.; Singh, A.; Prasad, A.; Chakraborty, D.; Learned-Miller, E.G. Unsupervised Hard Example Mining from Videos for Improved Object Detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  12. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 539–546. [Google Scholar]
  13. Masi, I.; Tran, A.T.; Hassner, T.; Leksut, J.T.; Medioni, G. Do we really need to collect millions of faces for effective face recognition? In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Cham, Switzerland, 2016; pp. 579–596. [Google Scholar]
  14. Smirnov, E.; Melnikov, A.; Novoselov, S.; Luckyanets, E.; Lavrentyeva, G. Doppelganger mining for face representation learning. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 1916–1923. [Google Scholar]
  15. Smirnov, E.; Melnikov, A.; Oleinik, A.; Ivanova, E.; Kalinovskiy, I.; Luckyanets, E. Hard example mining with auxiliary embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 37–46. [Google Scholar]
  16. Qian, J.; Zhu, S.; Zhao, C.Y.; Yang, J.; Wong, W.K. OTFace: Hard Samples Guided Optimal Transport Loss for Deep Face Representation. IEEE Trans. Multimed. 2022, 25, 1427–1438. [Google Scholar] [CrossRef]
  17. Ren, H.; Cai, Y.; Chen, X.; Wang, G.; Li, Q. A Two-phase Prototypical Network Model for Incremental Few-shot Relation Classification. In Proceedings of the 28th International Conference on Computational Linguistics, (Online). Barcelona, Spain, 8–13 December 2020; pp. 1618–1629. [Google Scholar] [CrossRef]
  18. Wei, J.; Huang, C.; Vosoughi, S.; Cheng, Y.; Xu, S. Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Online, 6–11 June 2021. [Google Scholar]
  19. Dor, L.E.; Mass, Y.; Halfon, A.; Venezian, E.; Shnayderman, I.; Aharonov, R.; Slonim, N. Learning thematic similarity metric from article sections using triplet networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, 15–20 July 2018; pp. 49–54. [Google Scholar]
  20. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
  21. Hermans, A.; Beyer, L.; Leibe, B. In Defense of the Triplet Loss for Person Re-Identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  22. Wang, C.; Zhang, X.; Lan, X. How to train triplet networks with 100k identities? In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 1907–1915. [Google Scholar]
  23. Suh, Y.; Han, B.; Kim, W.; Lee, K.M. Stochastic class-based hard example mining for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7251–7259. [Google Scholar]
  24. Cortes, C.; Vapnik, V.N. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  25. Cai, S.; Guo, Y.; Khan, S.; Hu, J.; Wen, G. Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8391–8400. [Google Scholar]
  26. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  27. Sung, K.K. Learning and Example Selection for Object and Pattern Detection. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995. [Google Scholar]
  28. Bottou, L.; Bengio, Y. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems 7; MIT Press: Denver, CO, USA, 1994. [Google Scholar]
  29. van der Maaten, L.; Weinberger, K.Q. Stochastic triplet embedding. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2012, Santander, Spain, 23–26 September 2012; pp. 1–6. [Google Scholar] [CrossRef]
  30. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
  31. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 499–515. [Google Scholar]
  32. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  33. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  34. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  35. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland; pp. 17–35. [Google Scholar]
  36. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 79–88. [Google Scholar]
  37. Li, W.; Zhao, R.; Xiao, T.; Wang, X. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 23–28 June 2014; pp. 152–159. [Google Scholar]
  38. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 15013–15022. [Google Scholar]
  39. Liao, S.; Shao, L. Interpretable and generalizable person re-identification with query-adaptive convolution and temporal lifting. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 456–474. [Google Scholar]
  40. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  41. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008. [Google Scholar]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  43. Han, X.; Zhu, H.; Yu, P.; Wang, Z.; Yao, Y.; Liu, Z.; Sun, M. FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
  44. Yang, S.; Zhang, Y.; Niu, G.; Zhao, Q.; Pu, S. Entity Concept-enhanced Few-shot Relation Extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 1–6 August 2021; pp. 987–991. [Google Scholar] [CrossRef]
  45. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, (Long and Short Papers). pp. 4171–4186. [Google Scholar] [CrossRef]
  46. Zhuang, Z.; Wei, L.; Xie, L.; Zhang, T.; Zhang, H.; Wu, H.; Ai, H.; Tian, Q. Rethinking the distribution gap of person re-identification with camera-based batch normalization. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 140–157. [Google Scholar]
  47. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712. [Google Scholar]
  48. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  49. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195. [Google Scholar]
  50. Jin, X.; Lan, C.; Zeng, W.; Wei, G.; Chen, Z. Semantics-aligned representation learning for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11173–11180. [Google Scholar]
  51. Chen, X.; Fu, C.; Zhao, Y.; Zheng, F.; Song, J.; Ji, R.; Yang, Y. Salience-guided cascaded suppression network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3300–3310. [Google Scholar]
  52. Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. Abd-net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8351–8361. [Google Scholar]
  53. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551. [Google Scholar]
  54. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6449–6458. [Google Scholar]
  55. Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-guided human semantic parsing for person re-identification. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 346–363. [Google Scholar]
  56. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  57. Zhao, Y.; Zhong, Z.; Yang, F.; Luo, Z.; Lin, Y.; Li, S.; Sebe, N. Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6277–6286. [Google Scholar]
58. Qian, X.; Fu, Y.; Xiang, T.; Jiang, Y.G.; Xue, X. Leader-based multi-scale attention deep architecture for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 371–385. [Google Scholar] [CrossRef] [PubMed]
  59. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Learning generalisable omni-scale representations for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5056–5069. [Google Scholar] [CrossRef] [PubMed]
  60. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  61. Yuan, Y.; Chen, W.; Chen, T.; Yang, Y.; Ren, Z.; Wang, Z.; Hua, G. Calibrated domain-invariant learning for highly generalizable large scale re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 3589–3598. [Google Scholar]
  62. Jin, X.; Lan, C.; Zeng, W.; Chen, Z.; Zhang, L. Style normalization and restitution for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3143–3152. [Google Scholar]
  63. Satorras, V.G.; Estrach, J.B. Few-Shot Learning with Graph Neural Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  64. Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A Simple Neural Attentive Meta-Learner. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  65. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087. [Google Scholar]
  66. Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6407–6414. [Google Scholar]
  67. Ye, Z.X.; Ling, Z.H. Multi-Level Matching and Aggregation Network for Few-Shot Relation Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2872–2881. [Google Scholar]
  68. Gao, T.; Han, X.; Zhu, H.; Liu, Z.; Li, P.; Sun, M.; Zhou, J. FewRel 2.0: Towards More Challenging Few-Shot Relation Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6250–6255. [Google Scholar]
  69. Yang, K.; Zheng, N.; Dai, X.; He, L.; Huang, S.; Chen, J. Enhance Prototypical Network with Text Descriptions for Few-shot Relation Classification. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Galway, Ireland, 19–23 October 2020; pp. 2273–2276. [Google Scholar]
Figure 1. An illustrative comparison of the GS and our pSEM.
Figure 2. The comparison between the Triplet Loss and our pTriplet Loss.
Figure 3. Overview of our proposed approach: Prototype-based Support Example Miner and Triplet Loss.
Figure 4. Comparative effectiveness of standard PK, GS, and pSEM samplers in mining support examples for ReID task training.
Figure 5. Visual comparison of the top 5 retrieval results from models trained using PK, GS, and pSEM samplers.
Figure 6. Varying difficulty levels of support examples selected by adjusting δ value in pSEM.
Figure 7. Results of sensitivity analysis of λ in Equation (6) for the FSRE task.
Figure 8. Results of sensitivity analysis of α in Equation (7) for the FSRE task.
Figure 9. Results of sensitivity analysis of β in Equation (8) for the FSRE task.
Table 1. Comparison of methods for constructing mini-batch.
Scheme | Hard Class Selection | Hard Example Selection
PK | Random | Random
GS | Non-random | Random
Our pSEM | Non-random | Non-random
Table 2. The dataset statistics of the person ReID datasets a.
Dataset | ID_q | ID_g | ID_t | IMG_q | IMG_g | IMG_t | CAM_n
MSMT17 | 3060 | 3060 | 1041 | 11,659 | 82,161 | 32,621 | 15
Market1501 | 750 | 750 | 751 | 3368 | 19,732 | 12,936 | 6
DukeMTMC | 702 | 702 | 702 | 2228 | 17,661 | 16,522 | 8
CUHK03-NP | 700 | 700 | 767 | 1400 | 5332 | 7365 | 2
a ID_q, ID_g, and ID_t represent the number of IDs (pedestrians) in the query set, gallery set, and training set, respectively. IMG_q, IMG_g, and IMG_t denote the number of images in the query set, gallery set, and training set, respectively. CAM_n represents the number of cameras in the dataset.
Table 3. Comparison of different HNM schemes (R1 / mAP).
Method | Training Set | MSMT17 | Market1501 | CUHK03-NP
PK | Market1501 | 43.6 / 15.7 | - | 17.9 / 17.6
Cluster [22] | Market1501 | 44.0 / 15.8 | - | 18.4 / 17.3
GS [2] | Market1501 | 45.9 / 17.2 | - | 19.1 / 18.1
Ours | Market1501 | 48.4 / 18.3 | - | 19.2 / 18.5
PK | MSMT17 | - | 75.9 / 45.3 | 16.4 / 17.0
Cluster [22] | MSMT17 | - | 77.2 / 47.6 | 18.4 / 19.2
GS [2] | MSMT17 | - | 79.1 / 49.5 | 20.9 / 20.6
Ours | MSMT17 | - | 79.2 / 50.1 | 21.1 / 21.3
PK | MSMT17 (all) | - | 79.5 / 52.3 | 22.8 / 23.3
Cluster [22] | MSMT17 (all) | - | 80.4 / 54.2 | 26.3 / 26.3
GS [2] | MSMT17 (all) | - | 82.4 / 56.9 | 27.6 / 28.0
Ours | MSMT17 (all) | - | 82.6 / 58.1 | 30.2 / 30.3
GS [2] | CUHK03-NP | 46.9 / 15.4 | 68.2 / 37.3 | -
Ours | CUHK03-NP | 48.4 / 16.3 | 69.8 / 38.2 | -
Table 4. Comparison of various schemes’ evaluation results (mAP / R1).
Backbone | Method | Size | MSMT17 | Market a | Duke b
CNN | CBN [46] | 256 × 128 | 42.9 / 72.8 | 77.3 / 91.3 | 67.3 / 82.5
CNN | OSNet [47] | 256 × 128 | 52.9 / 78.7 | 84.9 / 94.8 | 73.5 / 88.6
CNN | MGN [48] | 384 × 128 | 52.1 / 76.9 | 86.9 / 95.7 | 78.4 / 88.7
CNN | RGA-SC [49] | 256 × 128 | 57.5 / 80.3 | 88.4 / 96.1 | -
CNN | SAN [50] | 256 × 128 | 55.7 / 79.2 | 88.0 / 96.1 | 75.7 / 87.9
CNN | SCSN [51] | 384 × 128 | 58.5 / 83.8 | 88.5 / 95.7 | 79.0 / 91.0
CNN | ABDNet [52] | 384 × 128 | 60.8 / 82.3 | 88.3 / 95.6 | 78.6 / 89.0
CNN | PGFA [53] | 256 × 128 | - | 76.8 / 91.2 | 65.5 / 82.6
CNN | HOReID [54] | 256 × 128 | - | 84.9 / 94.2 | 75.6 / 86.9
CNN | ISP [55] | 256 × 128 | - | 88.6 / 95.3 | 80.0 / 89.6
DeiT-B/16 | DeiT [56] + ① | 256 × 128 | 61.4 / 81.9 | 86.6 / 94.4 | 78.9 / 89.3
DeiT-B/16 | TransReID [38] + ① | 384 × 128 | 66.3 / 84.5 | 88.5 / 95.1 | 82.1 / 91.1
ViT-B/16 | ViT [40] + ① | 256 × 128 | 61.0 / 81.8 | 86.8 / 94.7 | 79.3 / 88.8
ViT-B/16 | TransReID [38] + ① | 384 × 128 | 69.4 / 86.2 | 89.5 / 95.2 | 82.6 / 90.7
ViT-B/16 | TransReID + ② | 384 × 128 | 70.0 / 86.9 | 89.6 / 95.6 | 83.4 / 91.2
ViT-B/16 | TransReID + Ours | 384 × 128 | 70.7 / 87.8 | 90.0 / 95.5 | 83.6 / 91.5
a Market refers to Market1501. b Duke refers to DukeMTMC.
Table 5. Direct cross-dataset evaluation results (R1 / mAP) a.
Method | Training Set | CUHK03-NP | Market1501 | MSMT17
M3L [57] | Multi | 33.1 / 32.1 | 75.9 / 50.2 | 36.9 / 14.7
MGN [48] | Market1501 | 8.5 / 7.4 | - | -
MuDeep [58] | Market1501 | 10.3 / 9.1 | - | -
QAConv [39] | Market1501 | 9.9 / 8.6 | - | 22.6 / 7.0
OSNet-AIN [59] | Market1501 | - | - | 23.5 / 8.2
CBN [46] | Market1501 | - | - | 25.3 / 9.5
QAConv [39] + ② | Market1501 | 19.1 / 18.1 | - | 45.9 / 17.2
QAConv [39] + Ours | Market1501 | 19.2 / 18.5 | - | 48.4 / 18.3
PCB [60] | MSMT17 | - | 52.7 / 26.7 | -
MGN [48] | MSMT17 | - | 48.7 / 25.1 | -
ADIN [61] | MSMT17 | - | 59.1 / 30.3 | -
SNR [62] | MSMT17 | - | 70.1 / 41.4 | -
CBN [46] | MSMT17 | - | 73.7 / 45.0 | -
QAConv [39] + ② | MSMT17 | 20.9 / 20.6 | 79.1 / 49.5 | -
QAConv [39] + Ours | MSMT17 | 21.1 / 21.3 | 79.2 / 50.1 | -
OSNet-IBN [47] | MSMT17 (all) | - | 66.5 / 37.2 | -
OSNet-AIN [59] | MSMT17 (all) | - | 70.1 / 43.3 | -
QAConv [39] + ① | MSMT17 (all) | 25.3 / 22.6 | 72.6 / 43.1 | -
QAConv [39] + ② | MSMT17 (all) | 27.6 / 28.0 | 82.4 / 56.9 | -
QAConv [39] + Ours | MSMT17 (all) | 30.2 / 30.3 | 82.6 / 58.1 | -
QAConv [39] + ② | CUHK03-NP | - | 68.2 / 37.3 | 46.9 / 15.4
QAConv [39] + Ours | CUHK03-NP | - | 69.8 / 38.2 | 48.4 / 16.3
a Note: “MSMT17 (all)” implies that the model utilizes both the test set and the training set for training. M3L selects three datasets from CUHK03, Market1501, DukeMTMC-ReID, and MSMT17 for training and leaves the remaining one for testing.
Table 6. Accuracies (%) of different models on the test set a.
Backbone | Sampler | Model | 5-w-1-s | 10-w-1-s
CNN | PK | GNN [63] | 67.30 | 54.10
CNN | PK | SNAIL [64] | 71.13 | 50.61
CNN | PK | Proto [65] | 74.29 | 61.15
CNN | PK | HATT-Proto [66] | 74.84 | 62.05
CNN | PK | MLMAN [67] | 78.21 | 65.70
BERT | PK | Bert-PAIR [68] | 82.57 | 73.37
BERT | PK | TD-Proto [69] | 84.76 | 74.32
BERT | PK | ConceptFERE (Simple) [44] | 84.28 | 74.00
BERT | PK | ConceptFERE [44] | 89.21 | 75.72
BERT | GS | ConceptFERE + ② | 89.37 | 75.75
BERT | Ours | ConceptFERE + Ours | 90.24 | 78.85
a Note: Owing to the inherent variability in the Few-Shot Relation Extraction (FSRE) experiment, we consider the average of ten experimental results as the final outcome.
Table 7. Ablation study results on the MSMT17 dataset for the person ReID task.
Scheme | R1 | mAP
TransReID + Ours | 87.8 | 70.7
w/o pSEM | 87.4 | 70.3
w/o pTriplet Loss | 87.3 | 70.2
w/o pSEM & pTriplet Loss | 86.2 | 69.4
Table 8. Ablation study results for the FSRE task. Accuracy (%) is used as the performance evaluation metric.
Scheme | 10-w-1-s
ConceptFERE + Ours | 78.85
w/o pSEM | 76.81
w/o pTriplet Loss | 77.12
w/o pSEM & pTriplet Loss | 75.72
Table 9. Variance distribution of intra-class similarity on MSMT17.
Model | Variance | Sampler | Online Hard Example Loss
TransReID | 45.2 | PK | Triplet Loss
TransReID | 32.9 | PK | pTriplet Loss
Table 10. Comparison of training times when QAConv employs the PK, GS, and pSEM samplers, respectively (R1 / mAP).
Sampler | Training Data | Hours | CUHK03-NP | Market | MSMT17
PK | Market | 1.07 | 13.3 / 14.2 | - | 40.9 / 14.7
GS | Market | 0.25 | 19.1 / 18.1 | - | 45.9 / 17.2
pSEM | Market | 0.24 | 19.2 / 18.5 | - | 48.4 / 18.3
PK | MSMT17 | 2.37 | 15.6 / 16.2 | 72.9 / 44.2 | -
GS | MSMT17 | 0.73 | 20.9 / 20.6 | 79.1 / 49.5 | -
pSEM | MSMT17 | 0.70 | 21.1 / 21.3 | 79.2 / 50.1 | -
PK | MSMT17 (all) | 17.85 | 25.1 / 24.8 | 79.5 / 52.3 | -
GS | MSMT17 (all) | 3.42 | 27.6 / 28.0 | 82.4 / 56.9 | -
pSEM | MSMT17 (all) | 3.31 | 30.2 / 30.2 | 82.6 / 58.1 | -