Person Re-Identification across Data Distributions Based on General Purpose DNN Object Detector

Abstract: Solving the person re-identification problem involves making associations between the same person's appearances across disjoint camera views. Further, those associations have to be made on multiple surveillance cameras in order to obtain a more efficient and powerful re-identification system. The re-identification problem becomes particularly challenging in very crowded areas. This mainly happens for two reasons. First, visibility is reduced and occlusions of people can occur. Second, due to congestion, the number of possible matches increases and re-identification becomes harder to achieve. Additional challenges consist of variations of lighting, poses, or viewpoints.


Introduction
Nowadays, the surrounding world is full of surveillance cameras, so the data generated from video surveillance are continuously expanding. Therefore, it has become impossible for a human operator to track and verify people's identities across the entire database on their own. For this reason, the tendency is to use systems that are capable of achieving the recognition [1,2] and tracking [3,4] of people automatically. Automation in video processing became possible due to the recent evolution in this field. New methods have been implemented, offering remarkable performance. Simultaneously, hardware components have evolved and can achieve higher levels of parallel processing. Most algorithms in the field focus on detecting and recognizing people [5]. Although it faces many obstacles, recognition has evolved considerably over the last years.
When compared to face recognition, a more powerful application is person re-identification [6]. Implementing a person re-identification system is one of the most difficult and challenging tasks in video processing. This field is continually expanding, newer methods are being developed at an ever-accelerating pace, and the available computational power keeps increasing. One challenging aspect is that the same person may appear wearing different clothes, in different colors. Thus, the feature vectors extracted at the network's output will be quite different in such situations, making their association extremely challenging. Another aspect that influences the re-identification process is the variation of light. Light intensity, shadows, reflected light, or artificial lighting can cause a subject to appear differently on a surveillance camera.
Existing re-identification datasets. In the last decade, an increasing number of databases have been created for re-identification. Each database differs in terms of the locations, the number of identities or appearances, and the number of images it contains. The main difference is the way bounding boxes are generated: either manually, as in the VIPeR [12] or CUHK02 [13] datasets, or automatically, using a detector, as in the CUHK03 [14], Market-1501 [15], MARS [16], or DukeMTMC-reID [17] datasets. These are the most familiar and widely used databases in the existing literature. Two less-used databases, for both training and testing a re-identification model, are the RPIField [18] and PRW-v16.04.20 [19] datasets. Table 1 describes all of these databases.

Table 1. Comparing existing re-ID databases.

Database              Frames    ID's    Annotated Boxes    Boxes per ID    Cameras
VIPeR [12]            –         632     1264               2               2
CUHK02 [13]           –         1816    7264               4               10
CUHK03 [14]           –         1360    13,164             9.7             10
Market-1501 [15]      –         1501    25,…               …               …
PRW-v16.04.20 [19]    11,816    932     34,304             36.8            6

Existing methods. In general, in most of the existing works, the re-identification problem is treated as a recognition problem. Some other methods are based on appearance [20], but these features are not stable over a long period, because people dress differently from one day to another. The greatest evolution in the field of re-identification is closely related to deep learning, because it is known that the use of neural networks leads to the highest performance [21].
The main difference between the existing re-identification methods is related to how this problem is addressed: either as a classification problem or as a more general one, which is based on deep metric embedding learning. The first implemented versions treat re-identification as an extension of the classification method. One approach uses the Softmax function to decide whether or not two input images belong to the same person [14,22]. In order to improve the Softmax function, which does not take the distance into account, the idea of the logistic regression loss function was introduced in [23], which improves the ability of the network to learn features. Later, ref. [24] proposed a Gaussian-based Softmax function, which also takes into account the distribution of features (which is considered a Gaussian distribution).
Recently, a newer approach is to implement Siamese networks [25]. Although they also perform classification, unlike the other methods, Siamese networks remain efficient as the number of identities to recognize increases.
The most used approach is to implement deep metric embedding. In re-identification methods, the most commonly used loss function is the triplet loss [26]. This function is also used in the current paper. Over time, various approaches have been implemented on this topic, e.g., some papers have implemented new methods for generating several triplet types [27][28][29], while others have tried to improve the triplet loss function itself [30][31][32]. In addition to the triplet loss, other functions have been proposed in order to better separate the feature vectors, such as the quadruplet loss [33] or the sphere loss [34].
One of the things that all of these methods have in common is the use of databases as large as possible, with as many subjects as possible. In [35], all databases with fewer than 200 subjects are explicitly excluded. In this paper, we aim to implement a re-identification method based on one of those small databases (with a reduced number of subjects), which has not been used in the literature up to this point. Thus, we chose to train our neural network with a dataset from the RPIField database [18].
Furthermore, all of the existing methods train and test their models on datasets drawn from the same database. In this paper, we performed the test on a set that comes from another database, which is utterly different from the training one. Our primary purpose was to see how a re-identification method behaves when the two distributions are entirely different, thus proving the generalization capability of our method.

Theory Overview
In this section, we describe the theoretical aspects related to the proposed method. First, we present a short history of neural networks, along with a brief survey of CNNs [36], followed by the YOLO model, which represents the key component of our algorithm.

Fully Connected Neural Networks
The architecture of a Fully Connected Neural Network defines how neurons are organized and interconnected inside a network. Training a particular architecture involves several steps that aim to adjust the weights and thresholds of neurons.
A neural network is generally composed of three parts, which are referred to as layers. These are:
• The input layer. It receives information, signals, features, or measurements from the external environment. Network inputs are usually normalized with respect to their maximum values. As a result of the normalization stage, the numerical accuracy of the network's mathematical operations increases.
• The hidden, intermediate, or invisible layers. These layers are composed of neurons that are responsible for extracting the patterns associated with the analyzed process. Most of the internal processing of the neural network takes place at the level of these layers. There may be one or more hidden layers in a network.
• The output layer. This layer is responsible for producing and presenting the network's final outputs, which were obtained after the processing performed by the previous layers.
The fundamental component of a Fully Connected Neural Network is the perceptron [37]. The perceptron is an element that has a certain number of inputs, x_i, i = 1 … N, and it calculates the weighted sum of these inputs. The output y takes one of the values {0, 1}, depending on how this sum compares with a threshold, according to Equation (1):

y = 0, if Σ_i w_i x_i ≤ θ;  y = 1, if Σ_i w_i x_i > θ,   (1)

where w_i represents the input weights and θ is the threshold.
The condition from Equation (1) is cumbersome. Accordingly, two modifications can be made to simplify it. First, the sum can be written as a dot product, w · x = Σ_i w_i x_i, where w and x are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality and replace it with the perceptron's bias, b = −threshold. The bias represents a measure of how easily the perceptron will output 1. Using the bias rather than the threshold, the perceptron rule can be rephrased as Equation (2):

y = 0, if w · x + b ≤ 0;  y = 1, if w · x + b > 0.   (2)

The perceptron alone is not a complete model of decision making. Instead, a complex network of perceptrons can make better decisions. For this reason, Fully Connected networks are also known in the literature as Multi Layer Perceptron (MLP) networks.
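The perceptron rule with a bias can be sketched in a few lines (a toy illustration, not part of the paper's method; the AND weights below are our own choice):

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron rule with bias: output 1 when w . x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Toy example: a perceptron computing the logical AND of two binary inputs.
w = np.array([1.0, 1.0])
b = -1.5  # equivalent to a threshold of 1.5
print([perceptron(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 0, 0, 1]
```

Replacing the threshold with the bias term is what lets the decision rule be written as a single dot product, which generalizes directly to layers of neurons.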
Their goal is to approximate a function f, as defined in Equation (3). The network learns the values of the parameters θ that lead to the best approximation of the function:

y = f(x, θ).   (3)

Figure 1 shows the two types of Fully Connected networks: the single-layer network and the multilayer network. There are situations in which neural networks do not have all the connections; the lack of a connection is equivalent to a zero weight.
Concerning single-layer networks, it can be seen that the number of outputs of the network coincides with the number of neurons. This type of network is generally used in pattern classification and linear filtering problems. Unlike single-layer networks, multilayer networks have one or more hidden neural layers. They are used in more complex issues, such as system identification and optimization. A multilayer network can be seen as a cascade of multiple single-layer networks.
The level of complexity increases with the number of layers that are used. In 1989, the universal approximation theorem appeared, which states that there is a neural network large enough to achieve any desired degree of accuracy. The theorem, on the other hand, does not specify how extensive such a network might be. Whoever designs such a network must reach a compromise between the network's complexity and the accuracy it offers.
In order to design a model, it is necessary to select a cost function, which depends on the weight values. The Mean Squared Error (Equation (4)) is one of the most often used cost functions:

C(w, b) = 1/(2n) Σ_x ||y(x) − a||²,   (4)
where w represents the weights in the network, b is the bias, n is the number of inputs involved, and the vector a represents the vector of the outputs when the vector x is at the input of the network. Minimizing this cost function is the purpose of any neural network.
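Equation (4) can be illustrated directly (the function name and the sample vectors below are our own, for illustration only):

```python
import numpy as np

def mse_cost(targets, outputs):
    """Mean squared error cost from Equation (4): C = 1/(2n) * sum ||y(x) - a||^2,
    where n is the number of inputs and a is the network output for input x."""
    n = len(targets)
    return sum(np.sum((y - a) ** 2) for y, a in zip(targets, outputs)) / (2 * n)

# Two training examples with two outputs each:
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
outputs = [np.array([0.8, 0.1]), np.array([0.2, 0.7])]
print(mse_cost(targets, outputs))  # 0.045
```

Minimizing this quantity over the weights and biases is exactly the training objective stated in the text.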

Convolutional Neural Networks
A CNN receives an image at its input, assigns importance to the various objects within the image, and differentiates one from another. CNNs are composed of several layers of neurons that have different weights. The first layer of neurons receives the input data, while the performance offered by the network and the accuracy of the outputs are measured by a loss function (e.g., SVM, Softmax) applied to the last neural layer.

The difference between a CNN and Fully Connected networks lies in the input data, i.e., a CNN relies on the assumption that the input data are an image. This assumption leads to specific changes in the network architecture. In Fully Connected networks, the neurons are placed in two-dimensional layouts. Unlike this type of neural network, the layers of a CNN are built from neurons arranged in three-dimensional layouts. This essential difference can also be observed in Figure 2. In this way, the CNN transforms the original image from pixel values to the final probabilities of all classes. These neural networks are predominantly used in image processing because they allow for the use of information from the three image channels in order to extract essential characteristics of the images, which leads to high efficiency in video processing.
Subsequently, Region-based Convolutional Neural Networks (R-CNN) [38] were implemented, which involve two stages: identifying regions of interest (RoI) in the image containing various objects and independently extracting CNN features from each region. Several versions of these networks were developed, which aimed to improve the performance in terms of both the processing time and method accuracy [39,40].

YOLO Model
In this paper, we use the YOLOv3 model [10], which is the latest version of the original YOLO detection system [8]. YOLOv2 [41] was also implemented, an improved algorithm based on a smaller CNN with fewer convolutional layers. However, although this network, called Darknet-19, has fewer layers, its running time and computational costs are comparable to those of the Darknet-53 neural network. Furthermore, the detection accuracy increases when using the third YOLO model.
The YOLO model is a general object detection system, which is more efficient and faster than the other existing methods. As the name suggests, a single network predicts both the location of objects and the probabilities directly from the original image, which makes it more suitable for applications for which real-time running represents a priority.
It treats object detection as a unified regression problem, unlike R-CNN models, which are trained to perform classification. Additionally, YOLO sees the entire image during the training phase, which leads to better background recognition.
The YOLO model has several advantages when compared to other existing methods. Firstly, it is a faster system, since it treats the detection as a regression problem. Second, this model uses the entire input image when it performs final detection.
Unlike other techniques that use regions of interest, YOLO uses the full image in both the training and testing stages. The network uses the features of the entire image in order to predict each bounding box. Furthermore, all of the bounding boxes corresponding to all classes are predicted at the same time.
In Figure 3, the entire YOLO model is presented. The system divides the input image into S × S cells. The cell responsible for detecting an object is the cell located at the center of that object. Further, each cell predicts B bounding boxes and a confidence score for each bounding box. These scores reflect the system's certainty that a bounding box contains an object and how accurate the prediction is. For each predicted bounding box, the system generates five values:
• (x, y), which represent the coordinates of the center of the bounding box;
• (w, h), which represent the dimensions, i.e., the width and the height of the bounding box; and,
• the prediction's confidence, representing the Intersection Over Union (IOU) ratio between the predicted and the real bounding box.
Each cell also predicts C conditional probabilities, Pr(Class_i | Object), which depend on an object's existence in that cell. The number of probabilities is the same as the number of classes. Only one set of such probabilities is predicted per cell, regardless of the number of bounding boxes. In the testing phase, the system obtains:

Pr(Class_i | Object) · Pr(Object) · IOU = Pr(Class_i) · IOU.   (5)

Equation (5) defines the confidence, which represents the product between the probability of an object's existence and the IOU (calculated between the predicted and the real bounding box). If there is no object, then the confidence value is 0. This score represents both the probability that a particular class of objects exists in the predicted bounding box and the accuracy with which the box frames an object.
In the end, an image will contain S × S × B bounding boxes. Each bounding box corresponds to four location predictions and one confidence score, and each cell additionally predicts C conditional probabilities, where C represents the number of classes. This leads to a total of S × S × (B · 5 + C) predicted values for a single image.
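As a concrete check of this count, the configuration from the original YOLO paper [8] (S = 7, B = 2, C = 20) yields 1470 values per image:

```python
def yolo_output_size(S, B, C):
    """Total predicted values per image: S x S cells, each with B boxes
    (4 coordinates + 1 confidence each) and C class probabilities."""
    return S * S * (B * 5 + C)

# Grid and box settings from the original YOLO paper (S=7, B=2, C=20):
print(yolo_output_size(7, 2, 20))  # 7 * 7 * (2*5 + 20) = 1470
```

This is also why the last layer of the network must have exactly this many outputs, a constraint that reappears later for YOLOv3.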

Darknet Architecture
The Darknet neural network has multiple convolutional layers. Figure 4 presents the architecture of this network. The last two layers are fully connected. While the first layers extract the features of the input image, the last two layers predict the probabilities and the output coordinates. The width and the height of the predicted bounding boxes are normalized to the input image's size; therefore, the output values lie in the range [0, 1]. The YOLO model is trained in order to minimize a sum-squared error. Two scaling parameters are also introduced to control the loss function, which is computed using the equation:

L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj [(x_i − x̂_i)² + (y_i − ŷ_i)²]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj (C_i − Ĉ_i)²
  + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^noobj (C_i − Ĉ_i)²
  + Σ_{i=0}^{S²} 1_i^obj Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²,   (7)

where:
• 1_ij^obj refers to bounding box j in cell i being responsible for the prediction; this is the bounding box with the highest IOU relative to the real bounding box (described in Figure 5);
• C_i represents the confidence score of the responsible predictor from cell i, and Ĉ_i represents the predicted confidence score;
• p_i(c) defines the conditional probability that cell i contains an object from class c;
• λ_coord represents the scaling parameter that increases the weight of the bounding-box coordinate predictions in the loss function; the YOLO model uses λ_coord = 5 [8]; and,
• λ_noobj represents the scaling parameter that decreases the weight of the confidence predictions for bounding boxes that do not contain objects; this system uses λ_noobj = 0.5 [8].

According to Equation (7), the classification errors only appear when there is an object in cell i; otherwise, 1_i^obj = 0. Furthermore, the model makes predictions after a single network evaluation, unlike the R-CNN network, which requires thousands of evaluations for a single image. This also leads to the second advantage: the YOLO model is high-speed. It is 1000 times faster than the R-CNN network, and 100 times faster than the Fast R-CNN network [8].
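The coordinate and confidence terms of this sum-squared loss can be sketched in NumPy (an illustrative simplification: the class-probability term is omitted, and the array layout and function name are our own assumptions, not the paper's implementation):

```python
import numpy as np

def yolo_box_loss(pred, truth, responsible, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the coordinate and confidence terms of the sum-squared YOLO loss.
    pred, truth: arrays of shape (cells, B, 5) holding x, y, w, h, confidence.
    responsible: boolean mask of shape (cells, B), True for the predictor with
    the highest IOU in a cell that contains an object."""
    obj = responsible
    noobj = ~responsible
    # Coordinate errors, weighted up by lambda_coord; square roots on w, h
    # reduce the influence of large boxes, as in the original formulation.
    xy = np.sum(obj[..., None] * (pred[..., :2] - truth[..., :2]) ** 2)
    wh = np.sum(obj[..., None] * (np.sqrt(pred[..., 2:4]) - np.sqrt(truth[..., 2:4])) ** 2)
    # Confidence errors, weighted down by lambda_noobj where no object exists.
    conf_obj = np.sum(obj * (pred[..., 4] - truth[..., 4]) ** 2)
    conf_noobj = np.sum(noobj * (pred[..., 4] - truth[..., 4]) ** 2)
    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj

# Perfect predictions on a single cell give zero loss:
pred = np.zeros((1, 2, 5))
truth = np.zeros((1, 2, 5))
responsible = np.array([[True, False]])
print(yolo_box_loss(pred, truth, responsible))  # 0.0
```

Note how an error in the confidence of a non-responsible box contributes only lambda_noobj times its squared error, which keeps the many empty cells from dominating the gradient.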

YOLO Third Version
The newest version of the YOLO model trains a new and deeper neural network with more convolutional layers, and it also leads to more accurate detections. Although the number of layers in the network increases, it remains almost as fast as the previous versions [10].
The Darknet-53 neural network predicts three bounding boxes per cell, in order to frame objects of different dimensions. For each bounding box, five values are predicted, i.e., t_x, t_y, t_w, t_h, and t_o. Considering a cell displaced by (c_x, c_y) from the upper left corner of the image and a prior bounding box with dimensions p_w and p_h, respectively, the predictions are computed using Equations (10)-(14):

b_x = σ(t_x) + c_x,   (10)
b_y = σ(t_y) + c_y,   (11)
b_w = p_w · e^{t_w},   (12)
b_h = p_h · e^{t_h},   (13)
Pr(object) · IOU(b, object) = σ(t_o).   (14)

Additionally, Figure 6 describes the location prediction and the related parameters. The system also predicts all of the classes that a bounding box may contain, using multi-label classification. In the end, the network predicts N = 3 · (5 + C) values per cell. This model has 80 classes, which leads to 255 output values. This number also determines the number of filters in the last convolutional layer, which must always be equal to the neural network's number of outputs.
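The decoding described by Equations (10)-(13) can be sketched as follows (a minimal illustration; the parameter names follow the text, while the cell offset and anchor prior values are chosen arbitrarily):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw YOLOv3 predictions into an absolute bounding box: a sigmoid
    keeps the centre inside its cell offset (cx, cy), while the exponential
    scales the prior dimensions (pw, ph)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# A raw prediction of all zeros yields the cell centre and the unscaled prior:
print(decode_box(0, 0, 0, 0, cx=3, cy=2, pw=1.5, ph=2.0))  # (3.5, 2.5, 1.5, 2.0)
```

The sigmoid on t_x and t_y is what makes each cell responsible only for boxes centred inside it, which stabilizes training compared with predicting unconstrained offsets.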
One of the problems regarding the older versions of the YOLO model was the detection of small objects. The YOLOv3 model improves this detection. This model locates small objects with high accuracy due to the use of multi-scale predictions, which leads to a better and more efficient detection system.

Proposed Method
The primary purpose of this paper is to implement an efficient re-identification system. We started with the general detection system, YOLO, as described in the previous section. We aim to design a CNN that generates the feature vectors of each existing person in the databases, taking into account two essential elements. Firstly, the distance between the feature vectors that belong to persons with the same ID must be as small as possible, so that inner-class variation is not too significant. Furthermore, the feature vectors of different people must be as different as possible in order to separate the clusters.
Initially, we intended to implement a network with a small number of convolutional layers and to generate a feature vector of reduced dimension. We started by extracting various layers from the Darknet network. In order to choose those layers, we analyzed the feature maps generated using the tool described in [42]. We also extracted the associated pretrained weights from the YOLO model and continued training it with our dataset. We tried different losses, such as the quadruplet loss [33] or the center loss [43]. However, none of the versions led to the convergence of the re-identification system.
Further, we decided to use a deeper neural network and to divide the problem of re-identification into two stages. First, we aim to achieve a recognition system [2] for people from the database, followed by a generalization stage, which turns the recognition into a broader problem, i.e., person re-identification.

Algorithm Description
The first step of the algorithm is to train the CNN in order to obtain an accurate classifier, as can be seen in the diagram from Figure 7. The first stage has the primary purpose of initializing the weights used in convolutional layers. The network is initially trained on a small database for 30,000 iterations.
Further, from the obtained network, we generated more network architectures. Each network was trained using the same database, and, during the training phase, each network that did not converge was eliminated. The remaining networks were then modified in order to be suitable for the re-identification problem. For this purpose, we added several layers at the end of each CNN and continued to train them on a bigger database. We aimed to obtain as many architectures as possible, changing the last layers several times for each network that did not converge. Finally, after training all of the remaining networks (with the additional layers), we chose the CNN that led to the best re-identification results. In the end, we took the best network and continued to train it using a vast database with a large number of subjects.
All of the presented stages and obtained results are described later in this paper.

Image Databases
For the training stage, we used the RPIField database [18], which we cleaned in three steps:
1. eliminating all the images that contain two or more persons;
2. eliminating all the images in which a person is partially hidden by another object, or is not fully captured on camera; and,
3. eliminating all the blurred and out-of-focus images, in order to remove noise from the database.
After this step, we only kept the clear images, which completely capture a person. In Figure 8, some samples that are extracted from the RPIField database are presented.
From this database, we initially used only 900 images that belonged to 30 people. Those images were randomly chosen from the clean database. After that, we formed the triplets that were used in the second phase of training from this small subset of images. Later in the training stage, we expanded the database to 60 images per subject, forming 1800 images.
Regarding the testing stage, we first tested our method on a subset of images that were extracted from the same database. Most importantly, we tested it on a completely different database, i.e., PRW-v16.04.20 [19]. This image database contains frames that are captured with six different cameras. Each frame has several persons, which makes the re-identification process more difficult. The images represent crowded places and, inevitably, there are occlusions of people, thus making re-identification impossible in certain situations.
We used this second database in order to test our system's performances in a different situation, i.e., in a more general scenario. Figure 9 captures several frames that were extracted from this database.

First Training Phase
The first phase of the proposed method is to solve the recognition problem. As a starting point, we chose a CNN consisting of the first 80 layers of the Darknet-53 neural network, together with the associated weights. The purpose was to start from a trained set of weights and avoid random initialization. In this way, we increase the probability that the network converges, and we also avoid the problem of exploding gradients.
In this training stage, we consider a simple classification problem [10]. In order to do that, in the last layer, we use the logistic function for activation, computed according to Equation (15):

σ(x) = 1 / (1 + e^{−x}).   (15)

Further, for the prediction of classes, the binary cross-entropy loss is used, computed according to Equation (16):

L = −(1/N) Σ_{i=1}^{N} [y_i · log ŷ_i + (1 − y_i) · log(1 − ŷ_i)],   (16)

where N is the output size, ŷ_i represents the i-th scalar of the output, and y_i is the corresponding target value.
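A minimal sketch of the logistic activation (Equation (15)) and the binary cross-entropy loss (Equation (16)); the function names and the sample values are our own:

```python
import math

def logistic(x):
    """Logistic (sigmoid) activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy averaged over the N outputs:
    -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

print(logistic(0.0))                              # 0.5
print(binary_cross_entropy([1, 0], [0.9, 0.1]))   # ~0.105, i.e. -log(0.9)
```

The loss is small when each predicted probability agrees with its target, and grows without bound as a prediction approaches the wrong extreme.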
In this stage, we aim to perform the recognition of 30 classes. Accordingly, we chose from the database 30 subjects and 30 images for each subject, thus obtaining a small training database with 900 images in total.

Second Training Phase
During the second phase, we focus on converting the recognition problem into a more general one, i.e., person re-identification. To do so, we eliminated the last convolutional layer and the classification layer, and we replaced them with other layers.
We only added convolutional layers and fully connected (FC) layers. The FC layers' purpose was to reduce the feature vector's size until we reached the desired dimension. At the end of the network, we designed a brand new layer, which implements the triplet loss function inside the YOLO framework. This layer optimizes the network in order to minimize the triplet loss in a more general way. We excluded the concept of classes, which gives us a greater degree of freedom. In this way, we are able to re-identify many people without considering their belonging to a particular class.
Unlike the initial classification using the YOLO model, by adding the layer that we designed, the network architecture becomes invariant to the number of trained classes. Another advantage of the proposed method is its minimal computational requirements and the shorter time required to train each network. This is a consequence of the reduced number of additional layers added and trained, with the rest of the layers from the first phase of the algorithm being frozen.
Starting from the Darknet network's original parameters used in the first stage, we obtained 40 different variations. The most important parameters with the most significant impact on the network convergence are the learning rate and momentum. For these reasons, we varied both of the parameters according to Equations (17) and (18).
The main goal was to obtain more network architectures in order to determine which one leads to the fastest convergence and most accurate re-identification.

Proposed Neural Network Architecture
The proposed architecture is not very different from the one used in the first phase. We have excluded the last two layers. The last convolutional layer was used in order to obtain (at the end of the network) the specific number of values described in Section 3. The number of filters in this layer varies with the number of classes; in our case, because we started with 30 classes in the previous phase, we had N = 3 · (5 + 30) = 105 filters. We excluded both the classification layer and the last convolutional layer, because we do not want to further treat re-identification as a recognition problem. Instead, we added the layers described in Table 2 at the end of the neural network. As can be seen, we added only a few additional layers, because we tried to keep the computational power and the required additional memory at a minimum. We froze all of the previous layers during this second training stage, aiming to train only the new layers. The last layer, the re-identification layer, represents one of the novelties introduced by the proposed method. It implements several types of triplet losses in order to make the YOLO model suitable for person re-identification.

Triplet Loss
Triplet loss is a loss function that is often used in machine learning methods [44,45]. This function captures the similarity, respectively the dissimilarity, of images from a database. While classification methods require the number of classes to be known before training, this is no longer necessary when using the triplet loss. Thus, the training phase does not have to be resumed every time a new class is added.
In this way, re-identification is seen as a problem of similarity, not of recognition. The main difference is that, for each input image, two more images are needed, thus forming a triplet. The three images of the triplet represent:
• Anchor image. This is the image from the batch that passes through the network.
• Positive image. This is an image similar to the anchor image (practically, from the same class).
• Negative image. This must be dissimilar to the anchor image, belonging to another class.
The triplet loss is computed according to Equation (20):

L(A, P, N) = max(D(A, P) − D(A, N) + α, 0),   (20)

where D(x, y) represents a distance between two learned feature vectors, x and y. As distance, one of the distances presented in Table 3 can be used. The second parameter used in the triplet loss (i.e., the margin α) represents the desired separation between two clusters. The purpose of this loss function is to ensure that, for each anchor A, its feature representation is closer to the positive image's representation than to the negative image's representation, by at least α.

Table 3. Distance metrics that are used in triplet loss.

Distance Metric       Equation
Euclidean Distance    D(x, y) = (Σ_i (x_i − y_i)²)^{1/2}

The triplets can be chosen in many different ways, considering the extent to which the anchor and the positive image are similar, or the extent to which the negative image differs from the anchor. One of the most common ways to choose triplets is the so-called hard negative triplets. In this case, we choose, for every anchor, the most similar negative image, according to the condition presented in Equation (22). In the worst-case scenario, the similarity between the anchor and the positive image will be smaller than the similarity between the anchor and the negative image.
Another type of triplets is the semi-hard negative triplets. In this case, according to Equation (24), we choose, for every anchor, the negative sample that is most similar to the anchor, but still less similar than the corresponding positive image. Regarding the third type of triplets (i.e., the moderate positives), they are chosen based on both conditions from Equations (22) and (23).
In order to use the triplet loss in an optimization problem, a cost function can be obtained, which represents the sum of all the losses:

C = Σ_i L(A_i, P_i, N_i).   (25)

For this cost function to work, the network must be trained long enough to pass through all of the pairs (A, P), so that all of the images belonging to the same person are associated with each other. Eventually, all of the appearances of the same person will form a single cluster.
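The margin behaviour of the triplet loss can be illustrated with a small sketch (squared Euclidean distance is assumed here for simplicity; the function name and vectors are our own):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with squared Euclidean distance:
    L = max(D(A, P) - D(A, N) + margin, 0)."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([1.0, 1.0])   # far from the anchor
print(triplet_loss(a, p, n))  # 0.0 -> already separated by more than the margin
```

A triplet contributes to the cost only while the negative is not yet at least `margin` farther from the anchor than the positive, which is why training eventually collapses each identity into its own cluster.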

Re-Identification Layer
The proposed method's novelty is the new layer that we implemented and included in the YOLO system. The re-identification layer is based on deep metric embedding learning. We use the triplet loss as a metric function. In this second phase, the neural network is trained in order to minimize the cost function that is described in Equation (25).
Mathematically, the re-identification layer uses, as the metric distance, the l2-norm of a vector, given in Equation (26):

||x||₂ = (Σ_i x_i²)^{1/2}.   (26)

Further, the algorithm requires several vector computations, which are described in Equations (26)-(30).
When the neural network architecture is designed, the developer must provide the network with all of the parameters it needs. In the case of the re-identification layer, the necessary parameters are:
• margin. This parameter is used in the computation of the triplet loss, according to Equation (20). If not provided, it takes the default value 0.
• type of triplets. Three types of triplets can be used in our algorithm: hard triplets, semi-hard triplets, and moderate positive triplets. These three methods are described in Table 4. If not provided, this parameter takes the default value, i.e., hard triplets.
• number of samples. This parameter is used only for moderate positive triplets. In this case, the numbers of positive and negative samples differ from the number of batches.
Table 4. Types of triplets implemented in our proposed method.

Type of Triplets Algorithm
Hard negative
Initialization: Set network parameters: net_batch = 64, net_outputs = 128, margin = 0.2, triplets = HARD
For i = 1, 2, ..., number of batches:
    For j = 1, 2, ..., number of negative samples:
        Compute D(A_i, N_j) based on (28) and (26)
    Find for A_i the most similar negative sample N_j using (22)
    Compute D(A_i, P) based on (28) and (26)
    Compute the triplet loss L_i = L(A_i, P, N_j) using (20)
Compute the cost function based on (25)

Semi-hard negatives
Initialization: Set network parameters: net_batch = 64, net_outputs = 128, margin = 0.2, triplets = SEMIHARD
For i = 1, 2, ..., number of batches:
    Compute D(A_i, P) based on (28) and (26)
    For j = 1, 2, ..., number of negative samples:
        Compute D(A_i, N_j) based on (28) and (26)
    Choose all the triplets that satisfy (24)
    Compute the number of semi-hard negative triplets for A_i
    For k = 1, 2, ..., number of semi-hard triplets:
        Compute the triplet loss L_k = L(A_i, P, N_k) according to (20)
    Compute the partial cost function C_i based on (25)
Compute the cost function by summing all the partial functions

Moderate Positives
Initialization: Set network parameters: net_batch = 64, net_outputs = 128, margin = 0.2, net_samples = 100, triplets = MODERATE
For i = 1, 2, ..., number of batches:
    For j = 1, 2, ..., number of negative samples:
        Compute D(A_i, N_j) based on (28) and (26)
    Find for A_i the most similar negative sample N_j using (22)
    For k = 1, 2, ..., number of positive samples:
        Compute D(A_i, P_k) based on (28) and (26)
    Find for A_i the most different positive sample P_k using (23)
    Check whether the first part of the condition from (24) holds for D(A_i, P_k) and D(A_i, N_j)
    If the condition is met, compute the triplet loss L_i = L(A_i, P_k, N_j) using (20)
Compute the cost function based on (25)
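The hard and semi-hard selection rules from Table 4 can be sketched as follows. This is an illustrative NumPy implementation under our own naming assumptions (`mine_triplets`, the `mode` parameter), not the exact code of the re-identification layer; the distances correspond to the pairwise computations of Equations (26) and (28).

```python
import numpy as np

def mine_triplets(emb, labels, margin=0.2, mode="hard"):
    """Select triplets from a batch of embeddings (illustrative sketch).

    emb:    (N, d) array of embeddings
    labels: (N,)   array of identity labels
    Returns a list of (anchor, positive, negative) index triplets.
    """
    n_samples = len(labels)
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    triplets = []
    for a in range(n_samples):
        pos = np.where((labels == labels[a]) & (np.arange(n_samples) != a))[0]
        neg = np.where(labels != labels[a])[0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        if mode == "hard":
            # hardest positive and most similar (hardest) negative
            p = pos[np.argmax(dist[a, pos])]
            n = neg[np.argmin(dist[a, neg])]
            triplets.append((a, p, n))
        elif mode == "semihard":
            p = pos[np.argmin(dist[a, pos])]
            # negatives farther than the positive, but within the margin
            mask = (dist[a, neg] > dist[a, p]) & (dist[a, neg] < dist[a, p] + margin)
            for n in neg[mask]:
                triplets.append((a, p, int(n)))
    return triplets
```

For a batch with two identities that are well separated, hard mining yields one triplet per anchor, while semi-hard mining may yield none, since no negative falls inside the margin band.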

Experimental Results
This section describes the testing steps and the methods used to evaluate the performance of our proposed method after each training phase.

Testing Algorithm
In order to test the performance of the proposed method and be able to monitor the evolution of this system, it was necessary to design a test environment. At the end of each training stage, it is necessary to test the networks and the weights that were obtained to identify the best variants.
The testing stage involves passing each image from the test database through the CNN, thus obtaining the feature vectors. The ultimate goal is to compare each feature vector with all of the other vectors that belong to the previous images, and finally associate people based on the similarity between the vectors.
Running such a system requires storing a remarkably large number of feature vectors. The number of vectors is equivalent to the number of occurrences of all persons in the test database. There are millions of such occurrences in a real scenario, which demands a significant amount of storage space.
It was therefore necessary to store the feature vectors in a database so that the system can operate on a wide range of real-life scenarios.
The feature vectors are generated once the neural network receives an image at its input, and are then stored in the database. Finally, the vector with the most significant similarity is sought in order to achieve the association. In order to test the re-identification method, five parameters had to be stored in the database for each occurrence:
• ID. This column is generated automatically and incrementally. It represents a sequence of unique and consecutive numbers, present in any database.
• LABEL. This value is, in fact, a Universally Unique Identifier (UUID): a 128-bit identifier consisting of 32 hexadecimal digits. Each time the system generates a new label, it considers that a feature vector belonging to a new person is being introduced into the database. If the system matches an existing person in the database, the associated label is copied and assigned to the newly entered occurrence.
• TRUE_LABEL. This column represents the true identity of a person. The value is extracted from the .txt files that accompany the test images and must be provided to the system in the testing stage.
• NORM. This value represents the l2-norm of the feature vector and is computed based on Equation (26).
• FEATURES. This column stores the feature vector, i.e., the output generated by the neural network. It is a 128-dimensional vector, based on which the final association between the persons in the database is made.
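One occurrence record with the five columns above can be sketched as a small Python data class. The class and field names are our illustrative choices (the actual storage is a PostgreSQL table); the UUID label and the l2-norm are computed exactly as described.

```python
import uuid
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Occurrence:
    """One row of the feature database (illustrative names, not the real schema)."""
    features: np.ndarray   # 128-dimensional embedding produced by the network
    true_label: str        # ground-truth identity from the .txt annotation
    label: str = field(default_factory=lambda: str(uuid.uuid4()))  # 128-bit UUID
    norm: float = 0.0      # l2-norm of the feature vector
    id: int = 0            # auto-incremented primary key (set by the database)

    def __post_init__(self):
        # NORM column: l2-norm of the stored feature vector
        self.norm = float(np.linalg.norm(self.features))
```

When an association succeeds, the matched row's `label` would simply be copied into the new record instead of generating a fresh UUID.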
Table 5 describes the algorithm according to which the database realizes the association between the feature vectors. Mathematically, only relatively basic concepts are used, such as the L1-norm, the L2-norm, ascending ordering, and selections.

Initialization: Set parameters from the PostgreSQL database: min_distance = FLOAT_MAX, number_of_rows, norm_max, norm_min
Compute |f|, the L1-norm of the new feature vector, based on (27)
Sort the stored vectors ascending, according to the L1-norm
Select the k = number_of_rows most similar feature vectors, according to the L1-norm
For i = 1, 2, ..., k:
    Compute the distance D(f, f_i), based on (28) and (26)
Compute min_distance as the minimum distance between the new feature vector and the k selected vectors, based on (22)
Knowing the minimum distance between the new feature vector and all of the other vectors from the database, we further compare it with a previously set threshold. If the minimum distance is smaller than the threshold, we consider the two vectors to be similar, and the system decides that the new person belongs to the same class as the most similar one. On the other hand, if the minimum distance is greater than the threshold, the system decides that there is no sufficiently similar vector in the database; the new person is considered to be part of a new class, and a new label is assigned. The threshold represents one of the main parameters of our proposed method. We used a wide range of thresholds (in all the designed tests) to find its optimal value.
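The association rule above reduces to a nearest-neighbor search followed by a threshold test. A minimal sketch, assuming an in-memory list of (label, vector) pairs rather than the actual PostgreSQL backend, and using the 0.23 value toward which our threshold converges:

```python
import numpy as np

def associate(new_vec, stored, threshold=0.23):
    """Match a new feature vector against stored ones (illustrative sketch).

    stored: list of (label, vector) pairs already in the database.
    Returns the matched label, or None when a new identity (new UUID)
    should be created because no stored vector is close enough.
    """
    if not stored:
        return None
    # Euclidean distance to every stored vector, then take the minimum
    dists = [np.linalg.norm(new_vec - vec) for _, vec in stored]
    i = int(np.argmin(dists))
    return stored[i][0] if dists[i] < threshold else None
```

In the real system, the candidate set is first narrowed by sorting on the L1-norm, so the exact distance is computed only for the k most promising rows.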

Results
In order to observe how the CNN evolves, we created a series of scenarios, closely following the functioning of these networks. This helped us to intervene when the networks no longer converged, or when the gradients started to explode.
During the first phase of training, we trained the neural network on 30 classes, with a database of 900 images. Figure 8 presents some of the images used in this phase. This first step is only a classification one, whose purpose is to obtain the initial weights for the following stage. Table 6 presents the first results obtained after training all 40 variations of neural networks for several epochs.

All of the tests executed in this first scenario were performed under extreme and challenging conditions. While we trained our initial network on a particular distribution, using frames from the RPIField database, we changed all of the conditions in this testing phase. We wanted to see how the system behaves in a completely new environment, in which it was not trained at all. We aimed to obtain a system as general as possible, which does not work only on the training database or in similar conditions. That is the reason why we chose a second database, PRW-v16.04.20 [19]. This database presents a real-life scenario, capturing a crowded location with many people and objects that partially overlap.

We computed both the mean Average Precision (mAP) and the F-score in order to determine the performance of our method. While mAP considers only the positive detections (true positives and false positives), the F-score is calculated from mAP and recall; accordingly, it also takes the false negative detections into account. The F-score represents the harmonic mean of the two scores and is calculated with Equation (31). This score is commonly used in deep learning to measure the accuracy of a model on a given database.
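The combination of mAP and recall described above can be sketched directly. This is a generic harmonic-mean computation under the assumption, stated in the text, that Equation (31) combines the mAP and recall scores:

```python
def f_score(map_value, recall):
    """Harmonic mean of mAP and recall, as in Equation (31) (sketch)."""
    if map_value + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * map_value * recall / (map_value + recall)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model with high precision but poor recall (or vice versa) still receives a low F-score.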
The tests described in Table 6 were performed on a dataset of 2800 images, containing 6276 appearances of persons belonging to 621 classes. The main purpose of this first test was to identify the best values for the learning rate and momentum.
Of the 40 neural networks trained and tested, some did not converge. For this reason, only 13 neural networks are shown in Table 6. These are the networks that have managed to converge and optimize the classification problem. In order to differentiate the networks, each received a unique ID. This is represented in the first column of the table.
While we followed the variation of the mAP and its comparison with other existing methods, the best model in terms of total performance was chosen based on the F-score value. Following the first test results, we concluded that the CNN with ID = 5 achieves the highest performance (as marked in Table 6). For this reason, the following tests are performed using this CNN architecture: LR = 10^−4, momentum = 0.91, and batch_size = 64.
Furthermore, we focused on training this particular network over a more extended period in order to observe its behavior. Figure 10 shows the obtained results. After training our proposed model, we tested it using Rank-1, Rank-5, Rank-10, and Rank-20 accuracy. Our proposed method achieves a Rank-20 accuracy of up to 0.72.
As the rank increases, so does the performance of the model. However, it also increases both the computational complexity and running time. Thus, Rank-5 offers a good compromise.
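The Rank-k evaluation used above can be sketched as a standard CMC-style computation. The function name and the precomputed query-gallery distance matrix are our illustrative assumptions:

```python
import numpy as np

def rank_k_accuracy(dist_matrix, query_labels, gallery_labels, k=5):
    """Fraction of queries whose true identity appears among the k nearest
    gallery entries (illustrative Rank-k / CMC sketch).

    dist_matrix: (num_queries, num_gallery) distance matrix
    """
    hits = 0
    for i, q_label in enumerate(query_labels):
        nearest = np.argsort(dist_matrix[i])[:k]   # k closest gallery indices
        if any(gallery_labels[j] == q_label for j in nearest):
            hits += 1
    return hits / len(query_labels)
```

Increasing k can only keep or raise the score, which is why Rank-20 is an upper bound on Rank-1 in our results, at the cost of scanning more candidates per query.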
For each test presented, we computed the results using different thresholds. We varied the threshold in the range (0, 1], using a step ∆ = 0.01. Figure 11 shows the evolution of the threshold during the training of the proposed CNN. This parameter represents the minimum distance from which the algorithm separates two classes; below it, two embeddings are considered to belong to the same class. When the threshold is low, both the intra-class and inter-class variations are reduced, and the method does not efficiently separate different classes.
As the threshold increases, the inter-class variation increases, and the neural network learns the differences between the characteristics of each class more accurately. In the proposed method, we trained the network using triplet loss optimization, with a margin value of 0.2. Ideally, the threshold tends toward the margin value. In our case, the threshold tends to 0.23 as the training progresses, as can be seen in Figure 11.
Next, Figures 12 and 13 present the system's evolution in terms of the mAP and F-score. The evolution of the F-score is closely related to that of the threshold. As the number of training epochs increases, the maximum value of the score also increases. Furthermore, the maximum point of the function shifts toward the threshold's optimal value (0.2).
We focused on obtaining a general re-identification system, which is why all of these tests were realized across data distributions. While the whole training phase was performed on a small part of the RPIField database, the testing was performed entirely on a completely different database. Furthermore, the tests were realized in crowded places, with several occlusions of people. This leads to the hiding of some features, so the neural network does not receive the complete information that it needs.

In order to compare the proposed method's accuracy with other existing methods, we chose another method based on deep metric learning. According to [34], the SphereReID method has led to better results than the other methods. For this reason, we trained the network proposed in [34] on the same database used by our method, for 25,000 epochs. Subsequently, we tested both variants on a dataset extracted from the PRW-v16.04.20 database. As can be observed in Figure 14, our proposed method outperforms the second method in the scenario considered in this paper. Also, in terms of time performance, our proposed method is more efficient, tending to a duration of 120 ms/identity as the number of examples stored in the database increases. It can be seen that our proposed method leads to a Rank-1 accuracy of almost 0.5.

Conclusions and Future Work
This paper has shown that a general-purpose DNN object detector, like the YOLO model, can address the person re-identification problem. The main goal was to implement a method that would achieve re-identification with the minimum requirements for computing and hardware resources.
The first part of our proposed method treats re-identification as a classification problem, while the second part generalizes it. In this way, only five new layers had to be trained. The small number of supplementary layers leads to a high processing speed, reducing both the time and resources required. We aimed to design a system that runs in real time and generates as few computational costs as possible, while still achieving a high re-identification rate.
Considering all of these requirements, our proposed method implements a first version of a general re-identification system. After testing on a database utterly different from the one used in the learning stage, we obtained a Rank-1 re-identification rate of 50% and a Rank-20 rate of 72% (based on the F-score function). This is a promising result toward generalizing re-identification and achieving a system robust and invariant to variations of data distributions. Furthermore, besides recognizing a large number of people from an entirely new database, our algorithm is fast, providing the right balance between speed, cost, and re-identification rate.
Regarding future work, we intend to implement an improved version of the system that will increase the re-identification rate while maintaining speed and keeping additional costs to a minimum. We aim to obtain a fully automatic re-identification system, containing a person detection stage before the re-identification stage.
In conclusion, this paper demonstrates that it is possible to implement a re-identification system that runs on a wide range of real-life scenarios. A high re-identification rate can be obtained on a new distribution without collecting large volumes of data from security systems or retraining the network, while using a limited number of samples during the training stage. While, historically, person re-identification methods were dependent on a specific database, our paper shows that a system can still be usable on other distributions with minimal effort, even if the accuracy still needs some improvement.

Conflicts of Interest:
The authors declare no conflict of interest.