A Self-Adaptive Gallery Construction Method for Open-World Person Re-Identification

Person re-identification, or simply re-id, is the task of identifying again a person who has been seen in the past by a perception system. Multiple robotic applications, such as tracking or navigate-and-seek, use re-identification systems to perform their tasks. To solve the re-id problem, a common practice is to use a gallery with relevant information about the people already observed. The construction of this gallery is a costly process, typically performed offline and only once because of the problems associated with labeling and storing new data as they arrive in the system. The galleries resulting from this process are static and do not acquire new knowledge from the scene, which limits the applicability of current re-id systems to open-world settings. Different from previous work, we overcome this limitation by presenting an unsupervised approach to automatically identify new people and incrementally build a gallery for open-world re-id that continuously adapts prior knowledge with new information. Our approach compares the current person models with new unlabeled data to dynamically expand the gallery with new identities. We process the incoming information to maintain a small representative model of each person by exploiting concepts of information theory. The uncertainty and diversity of the new samples are analyzed to decide which ones should be incorporated into the gallery. Experimental evaluation on challenging benchmarks includes an ablation study of the proposed framework, an assessment of different data selection algorithms that demonstrates the benefits of our approach, and a comparative analysis of the obtained results with other unsupervised and semi-supervised re-id methods.


Introduction
Person re-identification, or simply re-id, addresses the problem of matching people across non-overlapping views in a multi-camera system [1,2]. Solutions to this problem benefit many robotic applications where people are involved, such as tracking [3,4], navigation [5] or searching [6,7]. An extensive number of studies have focused on obtaining the best feature representation in supervised closed-world scenarios (e.g., [8][9][10][11]), where the problem is narrowed to seeking a query person in an existing pool of labeled people images, generally called a gallery. While these methods obtain high performance on commonly used benchmarks, from the viewpoint of practical re-id systems, annotating people identities to obtain sufficient ground truth data can be extremely inefficient [12]. Hence, there is a tendency in the research community to address alternative and still open problems in re-identification, such as unsupervised re-id [13][14][15], domain adaptation [16][17][18] or open-set recognition in the open world [19][20][21]. The vast majority of these works use a static, preset gallery, which disregards the dynamic nature of the open world, where raw data from camera systems continually bring in new people, detection errors, and junk data. In order to solve problems related to open-world recognition, the system needs to deal with unknown classes but also be able to incrementally self-adapt by acquiring new knowledge [22,23]. Therefore, an open-world re-identification system should automatically evolve its gallery as new data arrives.

Figure 1. Simplified comparison between a large static gallery, traditionally used, and our small self-adaptive gallery. Both have a set of images representing each identity (ID0, ID1, . . . ), i.e., each person. The traditional gallery is the same for every person query that arrives at different times (t_i, t_f). However, because the adaptive gallery is built and updated as new data arrives, it becomes more comprehensive at later times (t_i < t_n < t_f).
The experiments section provides a detailed analysis of the main parameters defined in the method, along with a comparison of different data selection algorithms commonly used in incremental settings. A comparison with other unsupervised and semi-supervised re-id methods is also discussed.
The rest of the paper is organized as follows. Section 2 details the related work. Section 3 describes the problem addressed, along with the main stages of the proposed framework. Section 4 presents a complete evaluation of the presented method on two challenging benchmarks. The first subsection analyzes the influence of the key parameter defined in the algorithm. Then, a comparison of different data selection methods demonstrates the benefits of our approach, and a discussion compares the proposed method with traditional approaches to re-identification. Finally, Section 5 concludes the work.

Related Work
The problem of person re-identification has been widely studied through time, as shown in [24]. Early works defined the problem as tracking [25], then moved to image-based classification [26] and video-based classification [27]. With the success of deep learning, works have shifted from hand-crafted descriptors [28] to deep learning methods [29]. The next step in person re-identification research was the shift from the closed world (completely known classes and correctly annotated data) to the open world (multiple modalities, limited and noisy annotations, an undefined number of people, etc.), which has raised interesting new research challenges [22] relating the problem to other fields.

Unsupervised and Semi-Supervised Re-Id Methods
Several works attempt to tackle the re-id problem by building the re-id models in an unsupervised or semi-supervised manner. For example, Panda et al. [30] present a method to add a new camera to a multi-camera re-id system using unsupervised transfer learning from the knowledge obtained on the other cameras. Unsupervised algorithms typically focus on modeling spatiotemporal information to match person images across cameras [14,31], generating new data from unlabeled samples [32,33], or reducing the error of hard pseudo-labels using softer, adaptable pseudo-labels [15]. Semi-supervised methods leverage the available annotated information by gradually refining the descriptors with the unlabeled data most similar to the labeled data [34] or by generating virtual samples based on the annotated data [35]. Different from these, we propose a method that focuses on creating a gallery that incrementally incorporates new unsupervised data, and we do not retrain the feature descriptors.

Incremental Person Re-Id
Incremental person re-identification has been approached from two main perspectives. The first is the incremental adaptation of the learned model as new data arrive at the system [36]. This perspective trains the model in the same domain as the queries that will be analyzed later and uses a human in the loop to label the most representative data for model adaptation through active learning techniques. The second, instead of adapting the feature representation, performs a re-ranking of the gallery as new queries are matched with the labeled images [37]. Both perspectives use a large, static gallery that ensures a match for the query person.

Gallery Construction
The construction of the gallery is based on the principle that instances of the same class are close in the feature space. This problem is often solved using clustering algorithms [31], which have been studied thoroughly in the literature [38,39] and applied in many fields. Close to our approach, DeCann et al. [40] present a work that updates the reference database (gallery) by adding new users if the new data is not similar to any existing user. However, they focus on different multi-modal information (face and finger) and assume an unlimited amount of stored data. To deal with the gallery construction problem in incremental scenarios, the available system resources should be taken into account, since storing all the received information in a limitless fashion is not feasible. Therefore, the imposition of a bounded memory is commonly applied in many of these approaches [41,42]. Some works address the dynamic expansion of the classes aided by the labeling of the novel samples [43,44], while others also consider receiving new instances of already known classes, facing the challenges related to the update of existing class models [45,46]. They perform the update of each class model using a scoring system, controlling the size limit of each class by merging the most similar elements. This scenario is the most similar to our approach, but different from these existing works, our approach updates the model by analyzing not only the diversity of the samples but also the global uncertainty of the gallery. The result sought by combining both properties, obtaining a more varied model, is similar to that of prior work [47], which selects data with different levels of uncertainty from a set of labeled images. Different from all these methods, our approach deals with incremental and unlabeled information in an open-world scenario.

Method
This section describes in detail the addressed problem, the method overview, and the main stages of the proposed system.

Problem Description
We define the gallery as a set of classes, C = {C_1, . . . , C_N}, where each class, C_i ∈ C, represents one person. Each class is represented by a set of at most m features, F_i = {f_i^1, . . . , f_i^m}, with f_i^j the jth feature of the class. The features are extracted from sample images, named samples for simplicity, and comprise an appearance descriptor, x_i^j, obtained from a generic re-id neural network, and the skeleton joints visible in the sample, s_i^j. Specifically, in this work we select the OsNet re-identification model [9] to extract the appearance descriptors and the OpenPose network [48] to obtain the skeleton joints.
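To make the notation concrete, the gallery structure above could be sketched as follows. This is an illustrative data layout only; the class names, the use of Python dataclasses, and the `is_full` helper are assumptions rather than the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    """One feature f_i^j of a class: appearance descriptor plus skeleton joints."""
    appearance: list[float]            # descriptor x_i^j from a re-id network (e.g., OsNet)
    joints: list[tuple[float, float]]  # skeleton joints s_i^j visible in the sample

@dataclass
class PersonClass:
    """One class C_i, holding at most m features of the same person."""
    features: list[Feature] = field(default_factory=list)

    def is_full(self, m: int) -> bool:
        # each class stores at most m features (the memory budget)
        return len(self.features) >= m

@dataclass
class Gallery:
    """The gallery C = {C_1, ..., C_N}; N grows as new identities appear."""
    classes: list[PersonClass] = field(default_factory=list)
```

In practice, the appearance descriptors would be L2-normalized embeddings and the joints would come from OpenPose detections.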
The problem is to devise a method able to incrementally create the gallery from an empty initialization as new samples arrive in the system, considering an unknown (possibly unlimited) number of classes, N.

Method Overview
The overall idea of the proposed method is represented in Figure 2. First, whenever a new sample is acquired, the associated feature, f q , is obtained. Then, the method performs a classification by computing the class probability distribution of the new sample through a similarity evaluation. Based on the confidence of the classification, the system decides whether to conduct a dynamic expansion or not. Samples with high confidence enter the gallery, while samples with low confidence are sent to the unknown data manager for further analysis. The set of unknown data is periodically clustered to generate new potential classes that are compared with the existing ones to identify and initialize new classes. Finally, since there is a limit in the memory budget of m features per class, the gallery optimization handles the efficient use of memory resources by deciding the relevant data to keep.

Initialization Stage
In the initial phase of the gallery construction, the low number of initialized classes does not allow the system to work properly with the probability distributions used in the general regime. Therefore, the proposed system runs a short initialization stage. Following the incremental setup, a set of candidate-classes, B = {B_1, . . . , B_k}, is defined, where the first candidate-class is created with the arrival of the first sample, B_1 = {f_1^1}. Then, the similarity of incoming samples is evaluated by computing the cosine similarity between x_q and the appearance descriptors already included in B. If the maximum cosine similarity is greater than a threshold, ε, the sample is included in the corresponding candidate-class; otherwise, a new candidate-class is initialized. As soon as a candidate-class reaches a minimum size of l, it becomes a person-class, i.e., a real class belonging to the gallery, C = {C_1}. Once the gallery reaches a minimum number of person-classes, Q, the proposed decision making based on the class probability distribution of the samples is run, as detailed next.
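The initialization stage can be sketched as follows, assuming L2-normalized appearance descriptors so that cosine similarity reduces to a dot product. The function name and the simplified promotion logic are assumptions; in the paper's experiments, ε = 0.9 and l = 20:

```python
import numpy as np

def initialize_gallery(samples, eps=0.9, l=3):
    """Sketch of the initialization stage: group incoming unit-norm descriptors
    into candidate-classes by cosine similarity; promote a candidate-class to a
    person-class once it holds at least l samples (l = 20 in the experiments)."""
    candidates = []      # B = {B_1, ..., B_k}, each a list of descriptors
    person_classes = []  # promoted classes forming the initial gallery C
    for x in samples:
        best_sim, best_idx = -1.0, -1
        for idx, cand in enumerate(candidates):
            # cosine similarity reduces to a dot product on unit vectors
            sim = max(float(np.dot(x, y)) for y in cand)
            if sim > best_sim:
                best_sim, best_idx = sim, idx
        if best_sim > eps:
            candidates[best_idx].append(x)
            if len(candidates[best_idx]) == l:
                person_classes.append(candidates[best_idx])  # promote to person-class
        else:
            candidates.append([x])  # start a new candidate-class
    return person_classes, candidates
```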

General Regime
Once the gallery is initialized, the system evaluates the similarity of each new sample with the current gallery to obtain a probability distribution over the set of existing classes. This is accomplished using the softmax operator,

p_i(x_q) = exp(x̄_i · x_q / υ) / Σ_{j=1}^{N} exp(x̄_j · x_q / υ),   (1)

where υ is a temperature parameter that controls the softness of the probability distribution over classes [31], x_q is the normalized appearance descriptor of the new sample, and x̄_i is the weighted centroid of C_i. Working with normalized vectors, the product of both descriptors, x̄_i · x_q, is equivalent to the cosine similarity between them. In this work, the weighted centroid x̄_i is defined as a weighted average of the appearance descriptors stored in C_i. In a similar fashion to existing techniques for incremental learning [23,49], a threshold is used to control the dynamic expansion of the classes identified in the current gallery. More concretely, a simple and intuitive condition, with τ as the expansion threshold, is used to measure the classification confidence of x_q through its class probability distribution. Samples whose probability distribution does not comply with condition (3) are considered doubtful and go into the pool of unknown data. Conversely, if the confidence of the classification obtained with (1) is higher than or equal to τ, the pseudo-label assigned to the sample corresponds to the class with maximum probability, i* = arg max_i p_i(x_q), and the sample is considered part of its representation model, C_i*.
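A minimal sketch of this classification step: Equation (1) as a temperature-scaled softmax over centroid similarities, plus a confidence gate. Since the exact form of condition (3) is not reproduced in this text, the top-two probability ratio test below is an assumption consistent with an expansion threshold of τ = 2:

```python
import numpy as np

def class_probabilities(x_q, centroids, upsilon=0.1):
    """Equation (1): softmax over cosine similarities between the normalized
    query descriptor x_q and the weighted class centroids (all unit-norm here)."""
    sims = np.array([float(np.dot(c, x_q)) for c in centroids])
    logits = sims / upsilon        # temperature controls the softness
    logits -= logits.max()         # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def classify_or_defer(x_q, centroids, tau=2.0, upsilon=0.1):
    """Assumed confidence condition: the largest probability must exceed the
    second largest by a factor of tau (tau = 2 in the paper's experiments)."""
    p = class_probabilities(x_q, centroids, upsilon)
    order = np.argsort(p)[::-1]
    if len(p) > 1 and p[order[0]] < tau * p[order[1]]:
        return None                # doubtful: send to the unknown data pool
    return int(order[0])           # pseudo-label i* = argmax_i p_i(x_q)
```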

Unknown Data Manager
Samples that do not satisfy the classification confidence criterion (3) are defined as unknown. The role of the Unknown Data Manager is to identify new identities as well as to recover samples that could not previously be classified with enough certainty. To avoid initializing new classes from sets of poorly-explained features, e.g., images showing only one arm or one leg, all the unknown samples first undergo a quality filter to ensure that the appearance descriptors represent at least half of a person; formally, r ≥ 0.5, with r the ratio of visible skeleton joints.
The identification of new classes is tackled through periodic clustering of the unknown data. In open-world scenarios, the number of classes is unbounded, making clustering methods such as K-Means, which require a predefined number of clusters, unfeasible. Thus, to partition the set of unknown data, we use the DBSCAN algorithm [50], which is based on sample density and can deal with noisy information. The resulting clusters that reach the minimum size l are compared with the current classes in the gallery to check whether they belong to an existing class or represent a new one. Following the analysis performed in [31] on criteria to decide which pair of clusters to merge, the minimum distance criterion is used to verify whether a potential new class, C_w, shares its identity with any of those existing in the gallery. The minimum distance criterion takes the shortest distance between the samples of the new cluster, C_w, and all the elements of the gallery, C:

d_min(C_w, C) = min_{C_i ∈ C} min_{x ∈ C_w, f ∈ C_i} d(x, f).   (4)

Since the computational cost of this process is considerably high, we compute an approximation, limiting the number of existing classes compared with C_w from the full set of N to a subset of k. To select which classes are analyzed, for each x ∈ C_w we compute the k-nearest centroids of the gallery and then select the k most frequent classes among all of them. Using only these classes in the first minimum of (4), the computational cost remains constant with the size of the gallery.
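The approximate minimum-distance check could be sketched as follows (the clustering itself could be done with, e.g., `sklearn.cluster.DBSCAN`). The function name and the details of the voting scheme are assumptions; descriptors are taken to be unit-norm, so cosine distance is one minus the dot product:

```python
import numpy as np

def approx_min_distance(cluster, gallery_classes, centroids, k=3):
    """Sketch of the approximated minimum-distance criterion (4): rather than
    comparing the candidate cluster C_w against every gallery class, restrict
    the comparison to the k classes whose centroids appear most frequently
    among the k-nearest centroids of the cluster's samples (k = 3 in the
    paper's experiments)."""
    votes = {}
    for x in cluster:
        sims = np.array([float(np.dot(c, x)) for c in centroids])
        for idx in np.argsort(sims)[::-1][:k]:       # k-nearest centroids of x
            votes[int(idx)] = votes.get(int(idx), 0) + 1
    shortlist = sorted(votes, key=votes.get, reverse=True)[:k]
    # shortest cosine distance between the cluster and the shortlisted classes
    d_min = min(
        1.0 - float(np.dot(x, f))
        for i in shortlist
        for x in cluster
        for f in gallery_classes[i]
    )
    return d_min, shortlist
```

If the returned distance exceeds α (0.1 in the experiments), C_w would be initialized as a new class; otherwise it would be merged with the class containing the closest sample.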
Finally, if the approximated minimum distance is higher than α, the cluster C w is initialized in the gallery as a new class. Otherwise, the new cluster and the class with the closest sample represent the same identity and are merged, complying with the memory budget by means of the gallery optimization process.

Gallery Optimization
Our approach performs an intelligent decision-making process with the goal of storing representative features of each existing class and making efficient use of memory resources. In order to address this goal, we use two metrics that describe the relationship of each appearance descriptor with those in the same class and with all the rest.
The first metric is the intra-class diversity of the samples. For a descriptor, x, that belongs to class C_i, we define its diversity as the minimum cosine distance to all the other descriptors that belong to the same class:

D(x) = min_{y ∈ C_i, y ≠ x} d_cos(x, y).   (5)

The diversity of the whole class is then defined as the minimum diversity among all of its features, D(C_i) = min_{x ∈ C_i} D(x). This metric is useful to identify redundant information, i.e., similar samples within a class. Leveraging this information, when a new sample is classified and assigned to an existing class of the gallery, C_i, it is only added to the representation model of the class if its diversity is greater than the current diversity of the class, D(x_q) > D(C_i). The second metric is the uncertainty of the sample with respect to the whole gallery, measured through Shannon's entropy:

H(x) = − Σ_{i=1}^{N} p_i(x) log p_i(x),   (8)

where N is the number of classes currently in the gallery, and p_i(x) is the probability described in (1). High entropy values correspond to appearance descriptors that can easily be confused with those of other classes. In contrast, a feature with low entropy indicates high confidence in belonging to a certain class. Therefore, this metric provides an intuition of the relative distance between the feature and the rest of the classes of the gallery (inter-class).
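The two metrics can be sketched directly from their definitions; the function names are illustrative and vectors are assumed unit-norm:

```python
import numpy as np

def diversity(x, class_features):
    """Equation (5) sketch: intra-class diversity of descriptor x as the minimum
    cosine distance to the other descriptors of its class (unit-norm vectors)."""
    return min(1.0 - float(np.dot(x, y)) for y in class_features if y is not x)

def class_diversity(class_features):
    # Equation (6): diversity of the class = minimum diversity over its features
    return min(diversity(x, class_features) for x in class_features)

def entropy(p):
    """Equation (8): Shannon entropy of the class probability distribution p;
    high for descriptors easily confused between classes (inter-class)."""
    p = np.asarray(p)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())
```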
The dependency on all the classes in (8), together with the constant evolution of the class centroids required for (1), makes the computation of this metric expensive. For efficient computation, we keep a matrix for each class, R_i, with the cosine similarity between its samples, x_i^j, and all the weighted centroids of the gallery, as well as a list of the classes that have changed since the last gallery optimization of C_i was performed. This list is used to update only the columns associated with classes that have changed, noting that the other distances remain valid and can be reused. Note that the matrix R_i is the only changing element of (1), since υ is a constant value. Once the probability distribution of the samples belonging to C_i has been updated, obtaining the entropy with (8) is straightforward. When the memory budget of a class is exceeded, because of a merge caused by the Unknown Data Manager or the insertion of a new sample, an optimization process using both metrics is run to decide which sample to drop. In particular, the sample to drop is

x* = arg max_{x ∈ C_i} [ γ · H(x)/log(N) + (1 − γ) · (1 − D(x)) ],   (10)

where γ ∈ [0, 1] is a parameter that weighs the relevance of the uncertainty and diversity terms. The factor 1/log(N) normalizes the entropy to a value between zero and one, making it comparable to the diversity. The proposed optimization function seeks a balance between how much a given feature mixes the different classes (entropy) and how distinctive it is with respect to the rest of the features of the same class (diversity). Figure 3 shows a simplified example with two clusters, C_1 and C_2, where C_1 has exceeded its size constraint m = 3, and two examples of the final appearance models obtained with the proposed process. Note the balance between uncertainty and diversity even though the two identities look very similar.
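A hedged sketch of the drop decision: the exact functional form of (10) is not reproduced in this text, so the convex combination below, of normalized entropy and the complement of diversity, is an assumption consistent with the description (γ = 0.6 in the experiments):

```python
import numpy as np

def sample_to_drop(class_probs, diversities, gamma=0.6):
    """Assumed form of the optimization (10): for each stored sample, combine
    its normalized entropy (uncertainty) with the complement of its intra-class
    diversity, and drop the sample with the highest combined score, i.e., the
    one that is both confusing across classes and redundant within its class."""
    n_classes = len(class_probs[0])
    scores = []
    for p, d in zip(class_probs, diversities):
        p = np.asarray(p)
        p = p[p > 0]
        h = float(-(p * np.log(p)).sum()) / np.log(n_classes)  # entropy in [0, 1]
        scores.append(gamma * h + (1.0 - gamma) * (1.0 - d))
    return int(np.argmax(scores))
```

With a higher γ, the term favoring low-entropy samples dominates, matching the trend reported in the parameter evaluation.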

Experiments
This section analyzes the influence of the main parameters defined in the system and of the algorithm selected to model each person's appearance, and compares the performance of the proposed framework with other unsupervised and semi-supervised re-id approaches.

Experimental Setup
The evaluation is performed with two public benchmarks, MARS [51] and DukeMTMC-VideoReID [34]. In both of them, we use the official test set, which is split into the query set and the gallery set.
Two experiments are performed in this section. First, the analysis of the gallery construction process assesses the key aspects of our approach. The second experiment, query re-identification, runs a conventional evaluation for re-id methods in order to compare the proposed framework with other unsupervised and semi-supervised approaches. For both experiments, the settings for our approach are: similarity threshold in the initialization stage ε = 0.9, temperature parameter in the softmax operator υ = 0.1, k-nearest centroids with k = 3 used by the Unknown Data Manager, distance threshold to initialize a new cluster α = 0.1, gallery size to run the probabilistic decision making Q = 20, an OsNet model [9] trained on the MSMT17 benchmark [52] as the cross-domain re-identification network, and the OpenPose network [48] to obtain the skeleton joints. The setup for both experiments is detailed next.

Gallery Construction
The gallery set from both datasets is used to evaluate the self-adaptive gallery construction process. As in traditional incremental settings, the tracklets are randomly shuffled, and then, the images from each tracklet are provided one by one to simulate an incremental input to the self-adaptive gallery.
In order to evaluate the global performance of the proposed approach, we consider the following three metrics based on the classic precision, recall, and F1 score:
• Gallery Structure: The perfect gallery structure has one (and only one) class per ground-truth identity (GT-ID). The GT-ID of each class is set as the mode of the sample identities present at the class initialization. In order to evaluate the quality of the final gallery structure, we compute the precision (P), recall (R), and F1 score as P = TP/(TP + FP), R = TP/(TP + FN), and F1 = 2PR/(P + R), where we define the false negatives (FN) as those GT-IDs not associated with any class, i.e., identities not found; the true positives (TP) as all GT-IDs associated with at least one class, i.e., identities found; and the false positives (FP) as the additional classes sharing an already-associated GT-ID, i.e., two classes associated with the same GT-ID count as one TP and one FP.
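Under the definitions above, the gallery-structure metrics could be computed as follows; the function name and the input encoding (one GT-ID per created class) are assumptions:

```python
def gallery_structure_metrics(class_gt_ids, all_gt_ids):
    """Gallery-structure precision/recall/F1: class_gt_ids holds the GT-ID
    assigned to each created class (the mode of the sample identities at its
    initialization); all_gt_ids is the set of ground-truth identities.
    TP = identities covered by at least one class, FN = identities never found,
    FP = extra (redundant) classes sharing an already-covered GT-ID."""
    found = set(class_gt_ids)
    tp = len(found & set(all_gt_ids))
    fn = len(set(all_gt_ids) - found)
    fp = len(class_gt_ids) - len(found)   # each duplicate class counts as one FP
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```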

Query Re-Identification
In order to compare the proposed framework with other unsupervised and semi-supervised approaches, we use the query set to evaluate the gallery obtained at the end of the gallery construction process. Thus, the query set is matched with the limited-size gallery created in the previous experiment, which remains static during this evaluation. The conventional evaluation for re-identification [9] is performed, including the Rank-1 and Rank-5 metrics.
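For reference, Rank-k accuracy over a similarity matrix can be sketched as below. This is a simplified illustration; the official protocols for MARS and DukeMTMC-VideoReID additionally filter out same-camera matches:

```python
import numpy as np

def rank_k_accuracy(query_ids, gallery_ids, similarities, k):
    """Minimal CMC sketch: Rank-k is the fraction of queries whose true identity
    appears among the k most similar gallery entries. `similarities` is a
    (num_queries x num_gallery) matrix of query-gallery similarities."""
    hits = 0
    for q_id, sims in zip(query_ids, similarities):
        top_k = np.argsort(sims)[::-1][:k]          # indices of k best matches
        if any(gallery_ids[i] == q_id for i in top_k):
            hits += 1
    return hits / len(query_ids)
```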

Gallery Construction: Parameter Evaluation
We first study the effect of the three key parameters for the gallery construction process: (1) the weight used in Equation (10) to balance the influence of the uncertainty and the diversity, γ, (2) the expansion threshold, τ, in Equation (3), and (3) the minimum size required to initialize a class, l, used during the initialization stage and the clustering process, along with the memory budget per identity, m, defined in Section 3.1. In this evaluation, we use K = 4 for the sample classification F1. The goal of this analysis is to choose the parameters that yield balanced galleries based on the defined metrics.
The results of the analysis are shown in Figure 4. The influence of each parameter at the end of the process is analyzed in Figure 4a-c, where it can be seen that the trend of the gallery structure F1 is inverse to that of the class precision and the sample classification F1. Figure 4a shows the effect of weighting the uncertainty and diversity with γ, fixing all the other parameters to τ = 2, l = 20 and m = 50. Increasing γ favors the selection of samples with low entropy, but less diverse ones, in the appearance models. The balance between uncertainty and diversity in the gallery is attained at γ = 0.6.
The expansion threshold, τ, is analyzed in Figure 4b. We keep l = 20, m = 50, and, from the former analysis, γ is set to 0.6. When this parameter increases, more samples are sent to the Unknown Data Manager, resulting in the initialization of more classes. The trade-off between the metrics analyzed is accomplished at τ = 2. Finally, the influence of the minimum size to create a class, l, and the memory budget per identity, m, is evaluated in Figure 4c. The rest of the parameters are set to γ = 0.6, τ = 2. The increase in the gallery structure F1 is caused by the reduction in class initializations, leading to fewer redundant classes. This implies greater confidence in the classification of the samples as m increases. Therefore, the selected memory budget configuration is the one that generates the highest gallery structure F1, l = 20 and m = 50, while the influence on the other metrics analyzed is not highly significant. Figure 5 shows the evolution over time of the metrics with the final parameter set, γ = 0.6, τ = 2, l = 20, and m = 50. Since it is an evaluation over time, in this particular case we consider K = ∞ for the sample classification F1. All the metrics settle after processing 20% of the samples. Thus, it can fairly be assumed that the method's behavior is stable beyond that stage.

Gallery Construction: Data Selection Method Comparison
Following the analysis from the previous section, this experiment sets γ = 0.6, τ = 2, l = 20, and m = 50. We study different gallery optimization processes that decide which sample to remove from the appearance model when the memory budget is exceeded. The compared techniques are algorithms used in incremental clustering works that have to deal with memory budget requirements. They are evaluated at the end of the gallery construction process. The first method is uniform sampling (Uniform), which saves a new feature for every U = 5 instances. When the size limit is exceeded, the oldest data are dropped to make room for newer ones. Another typical process is random decision-making (Random), which removes a random index when the memory reaches its budget. Regarding more sophisticated methods, we compare the two closest approaches in the literature, the method proposed in [45], called Incremental Object Model (IOM), and the ExStream method [46]. In both cases, we use the implementation provided by the authors to evaluate the effect of the dropped data on the gallery in our overall method. Moreover, due to the influence of the data arrival order on the final results in incremental setups, three different iterations are run (i.e., three different random data arrival orders). To make a fair comparison, all five methods use the same features extracted from OsNet [9].
First, a comprehensive analysis of the final quality of the gallery structure is performed. The number of classes created per GT-ID and the gallery structure metrics are shown in Figure 6a,b, respectively. The results in Figure 6a indicate that the ExStream and Uniform algorithms create a high number of redundant classes in the gallery. This means that the appearance models resulting from these methods are significantly less representative, leading to more uncertain classifications. Thus, they send a high number of samples to the unknown pool and create new classes for already existing identities. The proposed optimization process (Ours) creates only one class for the same number of GT-IDs as IOM while identifying more people in the scene, which is reflected by a smaller number of GT-IDs with 0 classes created. Then, derived from this analysis and verified by the F1 results in Figure 6b, the methods that provide a gallery structure of better quality are IOM, Random, and Ours, with Ours identifying the most people in the scene among them, as measured by the gallery structure recall.
Second, Figure 6c shows the analysis of varying K in the sample classification F1, and Figure 6d shows the class precision results. As expected, the sample classification F1 improves in all algorithms with the increment of K. Comparing the methods that generate a gallery with a suitable structure, i.e., IOM, Random, and Ours, the results shown in Figure 6c,d demonstrate that the proposed gallery optimization process (Ours) outperforms IOM and Random in both metrics. Our approach is able to create more reliable people models without losing diversity, thus enhancing the classification of the samples. The ExStream and Uniform methods obtain high values in these metrics because of the large number of redundant classes, limiting in practice the actual ability to re-identify known people.
As a summary of the experiment, our algorithm is the one that maintains the best balance between having a good gallery structure and providing good classification metrics of the individual samples with it. The rest of the methods either generate galleries with worse quality structure, i.e., ExStream and Uniform, or obtain worse class precision and sample classification F1 results, i.e., IOM and Random.

Gallery Construction: Final Results
A detailed evaluation on MARS and DukeMTMC-VideoReID is provided using the same hyperparameter values from the previous section for both benchmarks. Table 1 shows the final results of the complete self-adaptive gallery construction approach on both datasets. In the gallery structure analysis, the table includes the number of GT-IDs, the number of classes created, and the gallery structure F1, precision, and recall. The larger number of people in DukeMTMC-VideoReID makes it more challenging to identify most of them, causing lower recall metrics than in the MARS dataset, i.e., 80.06% of the people are correctly identified in DukeMTMC-VideoReID against 89.43% in MARS (gallery structure recall). In terms of class precision, note that the proposed framework obtains similar and consistent results for both datasets, 76.69% in MARS and 80.1% in DukeMTMC-VideoReID. Thus, the method creates robust appearance models, being able to correctly distinguish the people in the scene, which in turn helps the sample classification, obtaining precision results of 72%. Finally, Figure 7 includes samples of the gallery for one identity per dataset at three different times during its construction, showing in each row the person model at different times. The left identity includes an example of the corruption that the gallery can suffer, marked by a dashed red line.
In both cases, the third row shows how our resulting gallery presents high variability of samples, resulting in a representative model for each identity. More detailed qualitative results of the proposed self-adaptive gallery can be seen in the Supplementary Material, where the identification of new classes and the evolution of the people's appearance models are shown.

Query Re-Identification
This final experiment performs the traditional evaluation of person re-id, i.e., it obtains the expectation that the true match is found within the first R ranks [53]. However, instead of matching the query set with a completely labeled gallery, the query set is matched with the gallery resulting from the gallery construction process. In this experiment, the gallery remains static. The proposed method obtains its results in an incremental unsupervised cross-domain setting (IUCD). Table 2 shows the results of this experiment, including the setting in which the different methods operate. Our offline baseline is the Full-gallery method, which has the whole gallery available and manually labeled, using the same descriptors as our approach. This method is our upper-bound result in the cross-domain setting. Moreover, due to the unsupervised component of our approach, we present the results of unsupervised and semi-supervised systems that perform offline training in the same domain as the query set. The unsupervised methods included (BUC [31], softened sim [15] and GLC+ [32]) do not use any labeled data in the whole process (None). The semi-supervised approaches use one labeled tracklet per identity (OneEx). Note that ours is the only algorithm working in the incremental unsupervised cross-domain (IUCD) setting, while the rest perform the entire process offline. Thus, although Table 2 is not a fair or direct comparison for our approach, we believe it is interesting to see how close the results of the proposed approach are to those of existing methods, despite the much more challenging and realistic scenario of our approach. The values reported for our approach are the average and standard deviation over the three random iterations performed previously, i.e., mean (±std). In addition, since the proposed gallery deals with memory requirements, the percentage of the gallery size used with respect to the total (GS) is shown.
In this case, the standard deviation is not included, but we remark that it is lower than 0.01 in all cases. The DukeMTMC-VideoREID results show the impact of the different goals sought. In our case, correctly identifying the 1110 people that compose the gallery is a highly challenging task, where some of the queries analyzed in this evaluation do not have corresponding models in the gallery. In contrast, the methods that focus on improving the feature representation obtain better results than on the MARS dataset due to the lack of distractors in the gallery. Regarding the MARS dataset, which is closer to an open-world scenario, our approach obtains results close to those of the unsupervised and semi-supervised approaches while storing two orders of magnitude less data in the gallery. Finally, considering the difference between the Full-gallery baseline and our approach, we see that the proposed approach achieves comparable performance despite using a gallery that is one to two orders of magnitude smaller and built without supervision.

Conclusions
This work has presented a novel framework to address the problem of person re-id in the open world, able to detect new identities and update the models of existing identities in the system. To deploy and evaluate intelligent systems in open-world settings, it is essential to bridge certain gaps, such as the lack of supervised data or limited computational resources. In particular, the proposed approach shows how to build a self-adaptive gallery for person re-identification in a fully unsupervised fashion, while managing limited memory resources. Low supervision and resource requirements are key for robotics applications in the real world, so our self-adaptive gallery can boost robotic tasks that involve people, such as information gathering or searching. The main limitations of the presented work are those inherent to re-identification systems, concerning long-term person re-id when there is a change of clothing or strong appearance changes in the people being monitored. In this situation, our system is likely to start a new class under the assumption that a new identity has appeared on the scene. Future steps to improve this aspect may include re-identification models focused on long-term robustness. In the short-term person re-identification problem, our framework can identify more than 80% of the people present in the challenging scenarios evaluated by comparing the new unlabeled data with the existing classes in the gallery. The existing classes in the gallery are modeled with an optimization process that selects the most representative samples for each class, balancing the uncertainty (inter-class) and the diversity (intra-class) of the samples.
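The uncertainty/diversity balance mentioned above can be illustrated with a minimal greedy selection sketch. This is not the paper's optimization process; the scoring function, the weighting parameter `alpha`, and all names are illustrative assumptions showing the general idea of keeping a small but representative per-class sample set.

```python
import numpy as np

def select_representatives(samples, other_class_centroids, budget, alpha=0.5):
    """Greedily keep up to `budget` samples for one class, scoring each
    candidate by (a) uncertainty w.r.t. other classes (samples close to
    other identities are informative) and (b) diversity w.r.t. the samples
    already kept (encourages intra-class variability)."""
    kept = [samples[0]]                    # seed with an arbitrary first sample
    remaining = list(samples[1:])
    while remaining and len(kept) < budget:
        best, best_score = None, -np.inf
        for s in remaining:
            # Uncertainty: inverse distance to the closest other-class centroid.
            unc = 1.0 / (1e-6 + min(np.linalg.norm(s - c)
                                    for c in other_class_centroids))
            # Diversity: distance to the closest sample already kept.
            div = min(np.linalg.norm(s - k) for k in kept)
            score = alpha * unc + (1.0 - alpha) * div
            if score > best_score:
                best, best_score = s, score
        kept.append(best)
        remaining = [r for r in remaining if r is not best]
    return np.stack(kept)
```

Capping each class at a fixed budget is what keeps the gallery one to two orders of magnitude smaller than a full labeled gallery, which is the memory trade-off evaluated in the experiments.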
The experiments carried out demonstrate that the proposed optimization process achieves a class precision of about 80% while encouraging variability inside the classes, generating better-balanced and more structured galleries than those of the similar existing methods analyzed. The high class precision maintained over time aids continuous person re-id, yielding an F1 sample classification score of 62.6% and 69.4% on the MARS and DukeMTMC-VideoREID datasets, respectively. Compared to existing re-id algorithms, our method obtains results similar to those using fully labeled galleries while storing one to two orders of magnitude less data.