Comparative Analysis of Supervised and Unsupervised Approaches Applied to Large-Scale "In the Wild" Face Verification

Deep learning-based feature extraction methods and transfer learning have become common approaches in the field of pattern recognition. Deep convolutional neural networks trained using triplet-based loss functions allow for the generation of face embeddings, which can be directly applied to face verification and clustering. Knowledge about the ground truth of face identities might improve the effectiveness of the final classification algorithm; however, it is also possible to treat clusters previously discovered by an unsupervised approach as the ground truth. The aim of this paper is to evaluate the potential improvement in the classification results of state-of-the-art supervised classification methods trained with and without ground truth knowledge. In this study, we use two sufficiently large data sets, each containing more than 200,000 "taken in the wild" images with various resolutions, visual quality, and face poses which, in our opinion, guarantees the statistical significance of the results. We examine several clustering and supervised pattern recognition algorithms and find that knowledge about the ground truth has a very small influence on the Fowlkes–Mallows score (FMS) of the classification algorithm. In the case of the classification algorithm that obtained the highest accuracy in our experiment, the FMS improved by only 5.3% (from 0.749 to 0.791) on the first data set and by 6.6% (from 0.652 to 0.718) on the second data set. Our results show that, apart from highly secure systems in which face verification is a key component, face identities discovered by unsupervised approaches can be safely used for training supervised classifiers. We also found that the Silhouette Coefficient (SC) of unsupervised clustering is positively correlated with the Adjusted Rand Index, V-measure score, and Fowlkes–Mallows score; thus, the SC can be used as an indicator of clustering performance when the ground truth of face identities is not known.
All of these conclusions are important findings for large-scale face verification problems, as skipping the manual verification of people's identities before supervised training saves a great deal of time and resources.


Introduction
Mobile devices supply users with the possibility of instantly taking photos and uploading them to social media platforms. Every day, millions of new photos depicting everyday situations become available on the Internet. Among them are images containing human faces of unknown identity. The human face is a widely used biometric modality for revealing the identity of a person. In spite of a great deal of research on face recognition, it remains a challenging issue [1]. In real-life scenarios, due to the large amount of data, face verification systems deal with unlabeled images in which the identities of the people in the images are initially unknown. The typical approach to solving the face verification problem is to train a classification algorithm that can "learn" to assign an appropriate class label (identity) to a given face. The most popular and effective classification algorithms, such as neural networks, support vector machines, and k-nearest neighbors, require ground truth data for the training procedure. If the ground truth of the training data is not known, we can undertake one of two possible solutions to find it: we can manually or semi-manually group the persons in the training data by identity, or we can use unsupervised approaches based on clustering. In the case of supervised classifier training, a manually or semi-manually labeled data set is commonly considered more reliable than a data set labeled by an unsupervised method; however, depending on the problem we are dealing with, the efficiency might differ.

Background
In this subsection, we discuss what is already known about the subject, how it is related to this paper, and the open problems which we wish to solve.

State-of-the-Art
Convolutional Neural Networks (CNN) are now the state-of-the-art approach for generating numerical vectors that represent faces (so-called embeddings), which are later used as input for clustering and classification algorithms. The role of CNN-based features is to supply a computer system with a real-valued, vector-based, discriminative face representation. Although training such a network is a supervised procedure [2] (i.e., the input data need to have labeled identities), recent papers have introduced some heuristics that allow this important limitation to be partially overcome [3]. After generating a face embedding, face verification systems utilize classification algorithms to assign faces to identities. Among the most common classification approaches are k-nearest neighbors (KNN) [4,5], fully connected Neural Networks (NN) [6], and Support Vector Machines (SVM) [7].
Researchers have also addressed the problem of domain adaptation networks for face recognition "in the wild" (i.e., in a non-laboratory environment) [8][9][10][11]. Some researchers have recommended applying additional unsupervised face normalization (especially face frontalization) [12]. Training deep convolutional neural networks (CNNs) is often time-consuming, as it requires the optimization of a large number of network parameters. To overcome this limitation, attempts have been made to perform additional preprocessing of the data based on computing predefined convolution kernels from the training data [13]. Researchers have also reported the application of Principal Components Analysis (PCA) as an unsupervised dimensionality reduction algorithm for face recognition [14]. In this context, "unsupervised" means that, contrary to Linear Discriminant Analysis (LDA), PCA does not require prior knowledge about the identities of the people in the training data set. PCA has also been used, for example, to learn a filter bank for a convolutional layer [1].
Performance of a face recognition system may be conditioned by the quality of images in the data set. In [15], the authors proposed a quality assessment method aimed at estimating the suitability of a face image for recognition. Due to the complexity of CNN models and a lack of understanding of deep image features, there is still ongoing research into other feature discriminative methods [16,17].
Complex unsupervised systems that enable face identification have already been proposed in the literature. The solution proposed in [18] employs Deep Convolutional Neural Networks to extract features and an online clustering algorithm to determine face IDs. In [19], a graph-based unsupervised feature aggregation method for face recognition was proposed. The method uses the inter-connections between face pairs in a directed graph approach to refine the pairwise scores.
Large surveys on various aspects of deep learning applications for face recognition can be found in [20][21][22][23]. The survey [24] addressed the problems of occlusion, single sample per subject, and expression, while [25] addressed age invariant face recognition and [26] discussed face recognition under morphing attacks.

Motivation of This Paper
As can be seen in Section 1.1.1, researchers have addressed the problems of training face verification systems with and without knowledge about the ground truth; however, to the best of our knowledge, there has not been a comprehensive study on the influence of knowledge about the ground truth on the efficiency of the trained classifier. The aim of this paper is to evaluate the potential improvement in the classification results of state-of-the-art supervised classification methods trained with and without knowledge about the ground truth and to answer the question "Is ground truth data required to train an effective face verification system?" This issue is very important in practice: manual or even semi-manual face image labelling is very time-consuming, as a face data set might consist of hundreds of thousands of images. Furthermore, we also propose a method to estimate the quality of a clustering algorithm on unlabeled data and its influence on the effectiveness of the classifier. In our research, we use two sufficiently large data sets, each containing more than 200,000 "taken in the wild" images with various resolutions, visual quality, and face poses which, in our opinion, guarantees the statistical significance of our results.

Overview
Deep learning-based feature extraction methods and transfer learning have become common approaches in the field of pattern recognition. Deep convolutional neural networks trained using triplet-based loss functions allow for the generation of face embeddings, which can be directly applied to face verification and clustering. Knowledge about the ground truth of face identities might improve the effectiveness of the final classification algorithm; however, it is also possible to conduct supervised training utilizing knowledge about data clusters discovered by an unsupervised approach. The aim of this paper is to evaluate the potential improvement in the classification results of a state-of-the-art supervised classification method trained with and without knowledge of the ground truth.
We used an unsupervised clustering solution similar to that in [18] in order to discover groups of images with the same identities; however, the algorithm in [18] has a slightly different purpose than ours: it is devoted to video data with a much smaller number of identities, although it has a similar image processing pipeline to our approach. In this paper, we propose several important improvements. Contrary to the work in [18], among other possible clustering algorithms, we evaluate HDBSCAN instead of DBSCAN, as HDBSCAN requires fewer tuning parameters. Further, instead of video data, we evaluate our method on a data set which has about 6.3 times more identities than the YouTube data set used in [18]. We also propose the use of PCA-based dimension reduction for deep facial image features, which not only simplifies the computation by limiting the number of parameters but may also improve the face recognition results.
We evaluate the most popular supervised pattern recognition algorithms applied to face embedding classification, namely, KNN [4], NN [6], and SVM [7], with various adaptive parameter values. We evaluated all algorithm results in terms of clustering result quality measures, as it is not possible to obtain the classification accuracy directly when the classifier is trained on labels generated by an unsupervised method. In this paper, we also show that it is possible to accurately estimate the quality of the clustering algorithm on unlabeled data, as there is a moderate positive correlation between measures utilizing and not utilizing knowledge about the ground truth. This is a very important finding, as it allows for estimation of the algorithm's efficiency in real-world scenarios. To the best of our knowledge, a large-scale comparison of cluster quality measurements between classifiers trained on ground truth data and without this knowledge, applied to state-of-the-art deep learning face recognition algorithms, has not yet been published; therefore, we believe that the results presented in this paper will be useful to the applied computer science community.
In the following sections, we present the details of our evaluation procedure and we describe the training and validation data sets, which we use to perform the experiments. We evaluate various possible clustering and classification algorithms. In this research, we use two sufficiently large data sets containing more than 200,000 "taken in the wild" images each, with various resolutions, visual quality, and face poses. The first data set contains celebrity images, each with 40 attribute annotations. Both data sets contain images that cover large pose variations and background clutter. More details about the data can be found in Section 2.5. All of this, in our opinion, guarantees the statistical significance of the obtained results. Furthermore, both the source code and the data sets are available for download and our results can be reproduced.

Materials and Methods
In this section, we present the proposed methods, the data set we used for testing, and the scoring metrics we used for evaluation.

Face Verification without Knowing the Ground Truth
The following pipeline can be used for face verification without knowing the ground truth. An overview of the proposed method is presented in Figure 1. The system should be able to operate in a certain environment (e.g., a given social network), in which it can collect images that potentially contain face photos. At the "cold start", the system needs to gather a sufficiently large training data set of images containing faces. Depending on the environment it operates within, its size might differ. For example, when we consider a social network of several thousand people, we should gather at least several photos published by each user. Each of those photos might contain the faces of certain individuals that might appear in several other photos. It is also possible that an image does not contain any face; in this case, it has to be removed from later processing. As described in Section 2.5, we trained this algorithm on a training data set containing about 100,000 facial images of about 10,000 individuals; however, there are no obstacles to utilizing even more images, as the clustering and classification algorithms we use are scalable. We do not have to make any assumptions about how many images of certain individuals are present in the data set. Figure 2 presents a block diagram of the proposed research.
The training data set of images, after initial processing, is used as the training data set for a classification algorithm. The preprocessing step consists of face detection, deep feature generation from facial images (which might be followed by PCA dimension reduction [27]), and unsupervised clustering. After clustering, each image has a label of the cluster to which it has been assigned. These labels, together with the deep features, are used as input data for supervised training of the classification algorithm. After the classifier has been trained, the system is ready to operate and moves to the working phase. In this phase, the pipeline works as follows: face detection, deep feature generation from facial images (which might be followed by PCA dimension reduction), and, finally, classification. The PCA and classifier parameters are learned during the training phase. The trained classification algorithm assigns the class labels that were discovered by clustering. Classifiers allow for the assignment of new facial images to classes faster than applying unsupervised clustering each time a new image is discovered. Furthermore, depending on the classification algorithm, it might also allow for generalization of the obtained results. The system might be retrained/adapted after a certain number of images has been recognized, or when the scoring parameters described in Section 2.6.2, computed on the training data set enhanced by the newly acquired data, drop below a certain amount (in comparison to the initial value). Retraining of the system basically consists of repeating the training phase; the training data set should then contain all of the images that have been acquired so far. The exact retraining "trigger" is highly dependent on the characteristics of the image data source, which is not in the scope of this paper. An overview of the pipeline for face verification without knowing the ground truth is presented in Figure 1.
The top row presents the training procedure, while the bottom row shows the working phase. The system may be continuously improved by repeating the training procedure and adding the data gathered so far into the initial training data set.
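The two phases described above can be sketched in scikit-learn. This is a minimal illustration on synthetic random vectors standing in for facenet embeddings; the threshold and neighbor counts are illustrative assumptions, not the values used in our experiments:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for 128-D facenet embeddings of three identities.
centers = rng.normal(size=(3, 128))
train = np.vstack([c + 0.05 * rng.normal(size=(20, 128)) for c in centers])

# Training phase: discover pseudo-labels by unsupervised clustering...
pseudo_labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
).fit_predict(train)

# ...then train a supervised classifier on those cluster labels.
clf = KNeighborsClassifier(n_neighbors=3).fit(train, pseudo_labels)

# Working phase: a new image is embedded and classified directly,
# without re-running the clustering step.
new_faces = np.vstack([c + 0.05 * rng.normal(size=(2, 128)) for c in centers])
pred = clf.predict(new_faces)
print(len(set(pseudo_labels)), pred.shape)
```

Retraining amounts to re-running the clustering and classifier fit on the enlarged training set.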

Face Detection and Feature Generation
After an image has been acquired, we need to detect the region of interest that might contain a face. Most approaches to date also require that the face detection method perform additional alignment of the face. Alignment is responsible for positioning the face in the output image in such a way that all facial images aligned by the same algorithm should have a similar spatial positioning of the corresponding parts of the face. Face detection and alignment are challenging, due to the various poses, illuminations, and occlusions involved. For our purposes, we adapted a deep cascaded multi-task framework with three stages of deep convolutional networks which predict face and landmark locations in a coarse-to-fine manner (MTCNN), proposed in [28]. We used a pre-trained model implemented in Keras (https://github.com/ipazc/mtcnn).
After detecting and aligning a face, we can perform feature generation (embedding). Feature generation is among the most important steps in the image classification pipeline, as an incorrect choice of features may render the classification problem unsolvable. As already mentioned in Section 1.1.1, there exist many methods that can be used to generate features from facial images. Face embeddings should produce similar feature vectors when we compare photos of the same person and different vectors when we compare photos of two different persons. The similarity might be defined as the Euclidean distance between m-dimensional feature vectors. Furthermore, feature generation methods should be robust to the lighting conditions in the photo, the pose of the person (e.g., the direction of the head), facial expression, hair style, clothing, and so on. Among the most popular methods that satisfy these needs are deep learning-based methods that utilize neural network architectures. In this research, we chose the facenet architecture, initially described in [2]. Facenet is a convolutional neural network trained using a supervised Triplet Loss approach. Triplet Loss minimizes the distance between an anchor (image) and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. We used a pretrained facenet model (https://github.com/nyoki-mtl/keras-facenet) with over 22M trained parameters, as implemented in Keras and trained on the MS-Celeb-1M data set (https://www.microsoft.com/en-us/research/project/ms-celeb-1m-challenge-recognizing-onemillion-celebrities-real-world/). The input layer of the model has size 160 × 160 × 3 (as it uses color images), while the output layer is a real-valued vector with 128 dimensions. In recent years, facenet has been among the most popular face embedding methods [29][30][31][32][33].
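The Triplet Loss objective can be illustrated with a small numpy sketch. The margin value and the toy embeddings below are illustrative assumptions, not the actual training configuration of facenet:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embeddings: pull the positive (same identity)
    toward the anchor, push the negative (different identity) away
    by at least the margin."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy L2-normalized 128-D embeddings; in the paper they come from facenet.
rng = np.random.default_rng(1)
a = rng.normal(size=128); a /= np.linalg.norm(a)          # anchor
p = a + 0.01 * rng.normal(size=128); p /= np.linalg.norm(p)  # positive
n = rng.normal(size=128); n /= np.linalg.norm(n)          # negative

# A well-separated triplet yields zero loss; a swapped one does not.
print(triplet_loss(a, p, n), triplet_loss(a, n, p))
```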
We can reduce the dimension of the problem by applying a dimension reduction method. As, in a real-life scenario, we will not know the identities of the objects in the data set, we might apply Principal Components Analysis (PCA) [34] and estimate the amount of variance explained by the generated components. Reducing the number of dimensions can lower the computational burden and the size of the model without affecting the overall classification effectiveness.
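A minimal sketch of this step, assuming synthetic stand-in embeddings of known low rank: the PCA projection is fitted on the training data only and then reused to transform new (validation) data, as in the pipeline's working phase:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mixing = rng.normal(size=(32, 128))
train = rng.normal(size=(400, 32)) @ mixing  # stand-in training embeddings
valid = rng.normal(size=(200, 32)) @ mixing  # stand-in validation embeddings

# Fit the projection on training data only; reuse it for validation.
pca = PCA(n_components=32).fit(train)
train_red = pca.transform(train)
valid_red = pca.transform(valid)
print(train_red.shape, valid_red.shape)
# On this rank-32 toy data, 32 components capture nearly all variance.
print(float(pca.explained_variance_ratio_.sum()))
```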

Clustering
In a real-world scenario, it is hardly possible to estimate how many classes (identities) of faces are present in a given data set; thus, there is no point in using methods such as k-means clustering, which require prior knowledge about the number of classes. Due to this, we only took into account methods that do not require such knowledge. Each clustering algorithm has some parameters that govern its performance. The clustering scores we used are discussed in Section 2.6. We evaluated two clustering algorithms that make different assumptions about cluster structure. The first was Agglomerative (Hierarchical, also called Tree-based) Clustering, which assumes that clusters have a concentric structure. We chose the Ward cluster linkage and evaluated various distance threshold values on the linkage.
The second clustering algorithm we tested was Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which is a density-based method [35]. It performs DBSCAN clustering over varying epsilon (spatial distance threshold) values and integrates the results to find a clustering that gives the best stability over epsilon. Due to this, HDBSCAN may find clusters of varying densities and is more robust to input parameter selection than DBSCAN. The parameter of HDBSCAN is the minimum size of clusters (i.e., the minimal number of objects in the cluster).
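A small sketch of threshold-based agglomerative clustering, on synthetic blobs and with illustrative threshold values (an HDBSCAN run would be analogous, with `min_cluster_size` as the tunable parameter):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Three synthetic identity clusters in an 8-D embedding space.
X = np.vstack([rng.normal(loc=mu, scale=0.3, size=(50, 8))
               for mu in (-4.0, 0.0, 4.0)])

# n_clusters=None together with distance_threshold cuts the dendrogram
# by linkage distance, so the number of identities need not be known.
counts = {}
for thr in (1.0, 50.0):
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=thr, linkage="ward"
    ).fit_predict(X)
    counts[thr] = len(set(labels))
print(counts)  # a too-small threshold oversegments the identities
```

The threshold plays the role of the adaptive parameter swept in our experiments.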

Classification
Many classification algorithms are commonly used for face verification problems. Due to the massive amount of data and large number of classes (in comparison to the number of objects) in the data set, there are several algorithms that are used more often than others. Among them are the K-nearest neighbors approach (KNN), the linear SVM method, and neural networks.
K-nearest neighbors is among the most basic but, at the same time, most effective methods of classification. The main drawbacks of this method are its poor generalization ability and the necessity of keeping the whole data set in memory. The search for the nearest elements can be sped up by the application of k-d trees for hierarchical decomposition of the space along different dimensions [36].
Without the application of a kernel trick, SVM is a linear classifier. In the case of large problems, linear SVM requires fewer parameters than the kernel-based approach, and its convergence and final classification are faster. At present, multi-class SVM optimization problems are generally formulated as sequential dual methods [37]. Recent findings have allowed for the parallelization of coordinate descent methods, which further speeds up the convergence of this classifier [38,39].

Forward-only sequential neural networks are popular and well-established classifiers. Such networks are often composed of a set of dense (fully connected) neuron layers, positioned one after another, which play the role of partitioning the feature space by decision hyperplanes. Each unit (neuron) of a dense layer takes a linear combination of the input vector and the neuron's weights which, then, is an argument for an activation function:

y = A(x · w),

where x is an input vector, w is a vector of weights, · is the dot product, and A is an activation function. Among the most commonly used activation functions, we can mention the Rectified Linear Unit (ReLU), which is defined as max(x, 0), where x is an input value. Although this activation function seems very basic, studies have reported that some network architectures with ReLUs consistently learn several times faster than their equivalents using saturating neurons (e.g., with tanh as the activation function) [40]. The assignments to the classes are usually determined through a softmax transformation,

softmax(x)_i = exp(x_i) / ∑_{j=1}^{g} exp(x_j),

where g is the number of dimensions of the vector x. An object is assigned to the class whose index refers to the highest value of the softmax-transformed coordinates of the input vector. In this study, we trained the network using the Adam optimizer [41], which is a first-order gradient-based optimization method, with a categorical cross-entropy loss function.
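The three classifier families can be compared side by side on synthetic blob data with scikit-learn; the hyperparameter values below are illustrative assumptions, not those of our experiments:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

# Synthetic embedding-like data: 10 "identities", 64-D features.
X, y = make_blobs(n_samples=500, centers=10, n_features=64,
                  cluster_std=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"),
    "linear SVM": LinearSVC(),  # dual coordinate-descent solver
    "NN": MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                        solver="adam", max_iter=500, random_state=0),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
print(scores)  # all three separate these easy blobs well
```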

Data Sets
In order to perform our experiment, we selected the Large-scale CelebFaces Attributes (CelebA) data set (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), which is among the largest and most popular "in the wild" image sets containing facial images of people with information about their identities. The original data set contained 202,599 images. We performed face detection and alignment using the previously mentioned algorithm [28]. MTCNN detected faces in 202,039 images; however, some aligned images did not contain faces. We manually removed those images, and the final data set we used contained 201,804 objects with 10,177 unique identities (classes). The number of removed images was below 0.4% of the overall data, which should not disturb the algorithm evaluation process. The number of images of the same person differed, throughout the data set, from 1 to 35. In Figure 3, we present a histogram that summarizes the number of identities that have a certain number of images in the data set. We randomly split this data set into two halves: a training data set and a validation data set. The training data set contained 100,902 objects with 10,021 identities (classes), while the validation data set had 100,902 images with 10,004 identities (classes). Figures 4 and 5 present histograms that summarize the number of identities that had a certain number of images in the training and validation data sets. The use of random selection did not guarantee that each identity was represented in both the training and validation data sets. We did this on purpose, however, as this situation better represents a real-life scenario. With this experimental set-up, it is virtually impossible to obtain 100% recognition accuracy.
In the next step, we generated facenet features for all images in both data sets. In Figure 6, we visualize a fragment of the data set, selecting only those identities that have at least 31 instances in the data set. The visualization was done using the Gephi 0.9.2 software. As this application is graph visualization software, we represented this fragment of the data set as a fully connected graph, the nodes of which are face objects and whose undirected edges have weights equal to the Euclidean distances between the vectors representing the pairs of faces that they connect. To make the graph layout clearer, we also removed edges with weights above 11 and filtered out all nodes with degree below 3. To generate the layout, we used the Force Atlas 2 algorithm [42], which is a force-directed method. As can be seen in Figure 6, faces with the same identity seem to create clusters, as was expected.
We also used another large face data set, namely, CASIA-WebFace (Chinese Academy of Sciences) [43], to perform our experiment. We used about 40% of the original data (with the data ordered by identity id) in order to make this data set a similar size to the CelebA data set. After this selection, the CASIA-WebFace data set contained 206,458 objects with 3374 identities (classes). We performed face detection and alignment using the previously mentioned algorithm [28]. MTCNN detected faces in 205,312 images. The number of removed images was below 0.6% of the overall data, which should not disturb the algorithm evaluation process. We also randomly split this data set into two halves: a training data set and a validation data set. The training data set contained 102,656 objects with 3374 identities (classes), while the validation data set contained 102,657 images with 3373 identities (classes).

Evaluation of Clustering Results
Let us assume that we have an object set S = {O_1, . . . , O_n} and suppose that U = {u_1, . . . , u_l} and V = {v_1, . . . , v_k} are two different partitions of S; that is, u_1 ∪ . . . ∪ u_l = S = v_1 ∪ . . . ∪ v_k, with u_i ∩ u_{i′} = ∅ for 1 ≤ i ≠ i′ ≤ l and v_j ∩ v_{j′} = ∅ for 1 ≤ j ≠ j′ ≤ k. We can evaluate the performance of the partitioning done by the clustering algorithm using several indexing/scoring methods.

The Ground Truth Is Known
Let us assume that U is the ground truth and V is the partition we want to evaluate. The Rand Index is given by

RI = (a + b) / C(n, 2),

where a is the number of element pairs that are in the same set in both U and V, b is the number of element pairs that are in different sets in both U and V, and C(n, 2) is the number of all possible pairs. In order to make the score close to zero when the assignment is random, the Adjusted Rand Index [44] is defined as

ARI = (RI − E[RI]) / (max(RI) − E[RI]),

where E[RI] is the expected value of RI, while max(RI) is the maximal value of RI. ARI is scaled into the range [−1, 1] and measures the similarity of U and V, ignoring permutations. The homogeneity score is defined as [45]

HS = 1 − H(U|V) / H(U),

where H(U|V) is the conditional entropy of the class distribution given the proposed clustering and H(U) is the maximal reduction in entropy that the clustering information can provide. H(U) = 0 means that there is only a single class. The homogeneity score has the highest value (of 1) when each cluster contains only members of a single class.
The completeness score is defined as [45]

CS = 1 − H(V|U) / H(V),

where H(V|U) is the conditional entropy of the proposed cluster distribution given the classes of the data points and H(V) is the maximal reduction in entropy that the class information can provide. H(V) = 0 means that there is only a single cluster. The completeness score has the highest value (of 1) when all members of a given class are assigned to the same cluster. The V-measure score is defined as [45]

VMS = ((1 + β) · HS · CS) / (β · HS + CS).
The Fowlkes–Mallows score is defined as [46]

FMS = TP / √((TP + FP) · (TP + FN)),

where TP is the number of pairs of elements of the same class that were assigned to the same cluster, FP is the number of pairs of elements of the same class that were assigned to different clusters, and FN is the number of pairs of elements of different classes that were assigned to the same cluster. FMS = 1 means that U and V are equal (up to a permutation). The discovered clusters number ratio is k / l, i.e., the ratio of the number of discovered clusters to the number of ground truth classes.
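All of the above scores are available in scikit-learn; a toy example (with made-up label vectors) illustrates them and their invariance to a permutation of the cluster labels:

```python
from sklearn import metrics

# Ground truth identities vs. labels proposed by a clustering algorithm.
truth    = [0, 0, 0, 1, 1, 1, 2, 2]
clusters = [1, 1, 1, 0, 0, 2, 2, 2]

print("ARI:", metrics.adjusted_rand_score(truth, clusters))
print("HS: ", metrics.homogeneity_score(truth, clusters))
print("CS: ", metrics.completeness_score(truth, clusters))
print("VMS:", metrics.v_measure_score(truth, clusters))      # beta = 1
print("FMS:", metrics.fowlkes_mallows_score(truth, clusters))

# All of these scores ignore permutations of the cluster labels.
relabel = [2, 2, 2, 1, 1, 0, 0, 0]
print(metrics.adjusted_rand_score(truth, relabel)
      == metrics.adjusted_rand_score(truth, clusters))
```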

The Ground Truth Is Unknown
In the situation where there is no information about the ground truth, we have to use the V partition to evaluate it on itself. The Silhouette Coefficient for a single object O_g is given as [47]

SC(O_g) = (mdnnc(O_g, V) − mdsc(O_g, V)) / max(mdnnc(O_g, V), mdsc(O_g, V)),

where g ∈ [1, n], mdnnc(O_g, V) is the mean distance between the object O_g and all objects in the next nearest assignment, and mdsc(O_g, V) is the mean distance between the object O_g and all other objects in the same assignment. The Silhouette Coefficient for an assignment is defined as the mean of SC(O_g) over all objects. The Calinski–Harabasz Index [48] is defined as

CH = (tr(B_k) / tr(W_k)) · ((n − k) / (k − 1)),

where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix.
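A toy example (synthetic blobs, not our face data) shows how both of these ground-truth-free scores reward a correct grouping over a random assignment:

```python
import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(3)
# Three well-separated blobs standing in for face identity clusters.
X = np.vstack([rng.normal(loc=mu, scale=0.3, size=(40, 2))
               for mu in (-3.0, 0.0, 3.0)])
good = np.repeat([0, 1, 2], 40)      # the correct grouping
bad = rng.integers(0, 3, size=120)   # a random assignment

sc_good, sc_bad = silhouette_score(X, good), silhouette_score(X, bad)
ch_good, ch_bad = (calinski_harabasz_score(X, good),
                   calinski_harabasz_score(X, bad))
print(sc_good, sc_bad)  # the correct grouping scores much higher
print(ch_good, ch_bad)
```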

Results
Our experiments were implemented in Python 3.6. Among the most important packages used are Tensorflow 2.1 for machine learning, with configured GPU support in order to speed up network training; mtcnn 0.1.0 for face detection and segmentation; the deep neural network (DNN) library Keras 2.3.1; the sklearn package for KNN; and OpenCV-python 4.2.0.32 for general purpose image processing. For algorithm training and evaluation, we used a PC with an Intel i7-9700F 3.00 GHz CPU, 64 GB RAM, and an NVIDIA GeForce RTX 2060 GPU running the Windows 10 OS. All source code can be downloaded from our GitHub repository (https://github.com/browarsoftware/adlcc). We performed the evaluation on the data sets introduced in Section 2.5. In those particular data sets, the ground truth is known; however, we did not use this knowledge for unsupervised algorithm training purposes.
We performed a comparison of cluster quality measurements between classifiers trained on ground truth data and classifiers trained without this knowledge.

Face Detection and Embedding, PCA Analysis
After deep feature generation (see Section 2.2), each face was represented by a 128-dimensional real-valued vector. We performed a PCA analysis of these values. Table 1 and Figure 7 present the number of PCA components and the cumulative % of variance explained by them for the CelebA data set. The results for the CASIA-WebFace data set are presented in Table 2. As can be seen, for CelebA, 21 components were required to explain over 51% of the variance, 36 components to explain over 75%, 48 components to explain over 90%, 53 components to explain over 95%, and 62 components to explain over 99% of the variance. In the case of CASIA-WebFace, the results were nearly identical: 21 components were required to explain over 50% of the variance, 36 components to explain over 75%, 48 components to explain over 90%, 53 components to explain over 95%, and 62 components to explain over 99% of the variance. As can be seen, only half of the components were required to explain nearly all of the variance present in the data sets. We attempted to take advantage of this by evaluating not only the original 128-dimensional vectors but also the data projected onto a lower-dimensional space by PCA. In further analysis, the eigenvectors calculated with PCA on the training data sets were also used to project the validation data sets onto the lower-dimensional space. Limiting the dimensionality of the problem may reduce its complexity; for example, by reducing the number of coefficients that need to be calculated during training of the classification model.
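The "components needed per variance threshold" computation can be sketched as follows, with a synthetic low-rank matrix standing in for the real embedding matrix (the counts it prints are illustrative, not those of Table 1):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Correlated stand-in for the 128-D facenet features used in the paper.
X = rng.normal(size=(1000, 40)) @ rng.normal(size=(40, 128))

cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
# Smallest number of components explaining at least each threshold.
needed = {t: int(np.searchsorted(cumvar, t) + 1)
          for t in (0.50, 0.75, 0.90, 0.95, 0.99)}
print(needed)
```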

Clustering with HDBSCAN and Agglomerative Clustering
We examined two clustering algorithms, namely Agglomerative (hierarchical) Clustering and HDBSCAN [35,49], with various parameter values. In the case of Agglomerative Clustering, the adaptive parameter was the distance threshold, i.e., the linkage distance above which clusters will not be merged. In the case of HDBSCAN, the parameter was the minimal cluster size: single-linkage splits that contain fewer points than this value are considered points "falling out" of a cluster, rather than a cluster splitting into two new clusters. We chose these two particular clustering approaches as they cover both the centroid-based and density-based families. We evaluated various ranges of clustering parameter values and used both the full 128-dimensional feature set and the feature set projected onto lower dimensionality. The clustering results were evaluated using the scores described in Section 2.6, as presented in Table 3. In the first column, the number beside the clustering algorithm's name is its parameter value; the string "PCA" followed by a number indicates that PCA dimensionality reduction was applied and how many dimensions were retained. The parameter β in (8) was set to 1.
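The role of the distance threshold can be illustrated with a minimal sketch (synthetic data standing in for embeddings; we use sklearn's AgglomerativeClustering, while the analogous knob for HDBSCAN would be its min_cluster_size parameter):

```python
# Sketch (assumed API, synthetic data): Agglomerative Clustering controlled
# solely by the linkage distance threshold, as in the experiments.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Two well-separated blobs standing in for two face identities.
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 128)),
               rng.normal(5.0, 0.3, size=(50, 128))])

# distance_threshold: linkage distance above which clusters are not merged;
# n_clusters must be None so that the threshold alone controls the result.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=16.0)
labels = clusterer.fit_predict(X)
print(clusterer.n_clusters_)
```

A larger threshold merges more aggressively (fewer, larger clusters); a smaller one leaves many small clusters, which is the degenerate behavior discussed for low threshold values below.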
In Tables 3–10, bold font indicates the best result for each scoring metric in the experiment. In the case of CelebA, the highest ARI and FMS scores were obtained for Agglomerative Clustering with a threshold parameter of 16 (ARI = 0.786, FMS = 0.789); however, the ARI value for Agglomerative Clustering with the same threshold on the data set projected onto 53 dimensions differed by only 0.002, and on 62 dimensions by only 0.001 (in the case of FMS, by 0.03 and 0.01, respectively). As the ARI value was positive and close to 1, we can assume that the clustering was carried out successfully. The ARI is an important indicator, as it does not make any assumption about the cluster structure. The highest value of HS (equal to 1) was obtained for Agglomerative Clustering with a threshold of 2; however, this most likely occurred because each cluster contained very few elements (for example, one) and, due to this, the HS attained its highest possible value. The highest value of CS was obtained for Agglomerative Clustering 24. In this case, the algorithm generated very large clusters, each containing all objects of a single class but possibly also many objects of other classes. Due to this, HS and CS should not be considered separately, but jointly through the VMS.
The highest value of VMS was obtained for Agglomerative Clustering 16 (0.958) and Agglomerative Clustering 16/PCA 62. This is a very high value; however, we have to remember that this score is not normalized with regard to random labeling and, due to the large number of clusters present in our data set, it might not be as meaningful as the ARI.
FMS reached its highest value for Agglomerative Clustering 16 (0.789); however, for Agglomerative Clustering 16/PCA 62, the scoring differed by only 0.001. This is a very good result, assuring us that the labeling corresponded with the real classes.
The SC reached its highest value (0.229) for Agglomerative Clustering 14 and Agglomerative Clustering 16/PCA 62. Very similar results were obtained for Agglomerative Clustering 16 and Agglomerative Clustering 16/PCA 53, where the FMS differed by only 0.001. A CHS value an order of magnitude higher than all others was obtained by Agglomerative Clustering 2. A degenerate case like this, with many one-element clusters, ARI = 0.001, HS = 1, and SC = 0.004, is a strong indicator that this metric is not meaningful and does not correspond well with the other measures. Due to this, we omit the CHS in further evaluation and discussion. The CNR had its highest value for Agglomerative Clustering 16/PCA 36; the CNR for Agglomerative Clustering 16 was above 0.9, meaning there was not much difference between the actual numbers of classes and clusters.
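The degenerate all-singleton case can be reproduced on toy labels (synthetic example, not data from the paper): a clustering that puts every sample in its own cluster is perfectly homogeneous, yet the ARI reveals that it carries no information.

```python
# Toy illustration: an all-singleton clustering maximizes homogeneity while
# being useless, which is why HS must be read together with CS (via VMS) or ARI.
from sklearn.metrics import (homogeneity_score, completeness_score,
                             adjusted_rand_score)

labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_pred = list(range(9))  # every sample in its own cluster

print(homogeneity_score(labels_true, labels_pred))   # 1.0: each cluster is "pure"
print(completeness_score(labels_true, labels_pred))  # < 1: classes are shattered
print(adjusted_rand_score(labels_true, labels_pred)) # 0.0: no better than chance
```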
In the case of the CASIA-WebFace data set, we evaluated only Agglomerative Clustering. The highest ARI and FMS scores were obtained for Agglomerative Clustering with a threshold parameter of 34, equal to 0.643 and 0.649, respectively. The ARI and FMS values for Agglomerative Clustering with the same threshold on the data set projected onto 62 dimensions differed by only 0.005. The ARI value was positive and close to 1; thus, we can assume that the clustering was carried out successfully. The highest value of HS (0.996) was obtained for Agglomerative Clustering with threshold 8; however, this probably occurred because nearly all clusters contained very few elements (i.e., close to one) and, due to this, the HS attained a value close to its maximum.
The highest value of VMS appeared in Agglomerative Clustering 22 (0.875) and Agglomerative Clustering 24.
SC reached its highest value for Agglomerative Clustering 28 (0.126). Very similar results were obtained for Agglomerative Clustering 30, 32, and 34, where the FMS differed by only 0.001. The CNR had the highest value for Agglomerative Clustering 30 (0.973). In the case of Agglomerative Clustering 34, the CNR was 0.871, which means there was not much difference between the actual number of classes and clusters.
From Table 3, we can clearly see that all meaningful scores indicate that Agglomerative Clustering performed better than HDBSCAN; due to this, we did not evaluate HDBSCAN in Table 4. In the case of the CelebA data set, in all but one case, the best results were obtained when the threshold parameter was set to 16. Therefore, for further evaluation on the CelebA data set, we chose the assignments generated with Agglomerative Clustering 16 and its variations with varying numbers of PCA-projected dimensions. In the case of the CASIA-WebFace data set, the highest values of ARI, CS, and FMS were obtained for Agglomerative Clustering 34; therefore, for further evaluation on the CASIA-WebFace data set, we chose the assignments generated with Agglomerative Clustering 34 and its variations with varying numbers of PCA-projected dimensions.

Supervised Training of KNN, SVM, and NN on Face Clusters Discovered by Unsupervised Algorithm
In the next step of the evaluation, we used the clusters obtained by Agglomerative Clustering to train classification algorithms in a supervised manner. We have to remember that, during training, we did not use any knowledge about the ground truth classes, only the results of the previous unsupervised step. We selected the nearest neighbor approach (KNN) with k-d trees for hierarchical decomposition, a linear support vector machine (SVM) [50], and an artificial neural network (NN) with a single hidden fully-connected layer with ReLU activation and a softmax output. The classifiers were trained with a supervised algorithm using the same deep features as before.
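The training step above can be sketched as follows. The data, shapes, and pseudo-labels here are synthetic; the paper's NN is a Keras model with one ReLU hidden layer and a softmax output, for which sklearn's MLPClassifier stands in so that all three classifiers share one API.

```python
# Sketch (synthetic data): training supervised classifiers on the pseudo-labels
# produced by the unsupervised clustering step, not on ground-truth identities.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(0, 0.3, (60, 128)),
                     rng.normal(4, 0.3, (60, 128))])
cluster_labels = np.array([0] * 60 + [1] * 60)  # labels from Agglomerative
                                                # Clustering, not ground truth

knn = KNeighborsClassifier(n_neighbors=1, algorithm="kd_tree").fit(X_train, cluster_labels)
svm = LinearSVC().fit(X_train, cluster_labels)
nn = MLPClassifier(hidden_layer_sizes=(1024,), activation="relu",
                   max_iter=300, random_state=0).fit(X_train, cluster_labels)

# Unseen embeddings from the same two "identities".
X_val = np.vstack([rng.normal(0, 0.3, (10, 128)),
                   rng.normal(4, 0.3, (10, 128))])
preds = {name: clf.predict(X_val)
         for name, clf in (("knn", knn), ("svm", svm), ("nn", nn))}
```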
After training, the classifiers were used to assign classes to the validation data sets described in Section 2.5. As we cannot directly calculate the accuracy of each approach, due to the unknown mapping between ground truth classes and cluster indices, we once again performed the evaluation using the scores described in Section 2.6. The results are presented in Tables 5 and 6. In the first column, the number beside KNN indicates the number of neighbors considered in the classification, and the number beside NN is the number of neurons in the hidden layer. If PCA was applied, the number indicates how many dimensions were used.
We can see, from Table 5, that the highest ARI score was obtained for SVM/PCA 62 (0.769). This is a very good result, assuring us that the classification corresponded well to the real classes. According to the SC, the densest and best-separated clusters were obtained by KNN 1. The maximal value of CNR was obtained for KNN 5/PCA 53 (0.998); however, nearly every method had a value of at least 0.95 for this parameter, meaning that the number of classes to which elements were assigned by the classifiers was nearly the same as in the ground truth.
In Table 6, the highest ARI, HS, VMS, FMS, and CNR scores were obtained by KNN 1/PCA 62 (0.652, 0.875, 0.878, 0.654, and 0.983, respectively). This is also a very good result: the high CNR and ARI values assure us that classification corresponds with the real classes. KNN had the highest values of CS (0.882) and VMS (0.878).

Analysis of Linear Relationships between Cluster Quality Measurements
In situations where the ground truth classes of images are not known, we need to check whether it is possible to estimate the performance of the proposed solution using clustering quality measures that do not require knowledge of the ground truth. We performed a correlation analysis in order to investigate the linear relationships between the pairs of values presented in Tables 3 and 4. The correlation matrices are presented in Tables 7 and 8 and visualized in Figures 8 and 9.
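The quantities being correlated can be computed as below (synthetic data; the function names are sklearn's). The key contrast is that ARI, VMS, and FMS all require the ground truth labels, while the Silhouette Coefficient needs only the features and the predicted assignment.

```python
# Sketch of the scoring pipeline: ground-truth-dependent scores (ARI, VMS, FMS)
# alongside the Silhouette Coefficient, which is ground-truth-free.
import numpy as np
from sklearn.metrics import (adjusted_rand_score, v_measure_score,
                             fowlkes_mallows_score, silhouette_score)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 128)),
               rng.normal(4, 0.3, (50, 128))])
labels_true = np.array([0] * 50 + [1] * 50)
labels_pred = labels_true.copy()
labels_pred[:5] = 1  # imperfect clustering: five samples misassigned

ari = adjusted_rand_score(labels_true, labels_pred)
vms = v_measure_score(labels_true, labels_pred)
fms = fowlkes_mallows_score(labels_true, labels_pred)
sc = silhouette_score(X, labels_pred)  # no ground truth needed

# Across many clustering runs, the linear relationship between SC and the
# other scores would be measured with, e.g., np.corrcoef(sc_values, ari_values).
```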
As can be seen in Table 7, the ARI and FMS are strongly positively correlated, and there was a moderately positive correlation between those two scorings and the SC (0.649 and 0.654, respectively). There was also a strong positive correlation between the VMS and SC, equal to 0.913. This is very important information, as we can use the SC (which does not require ground truth knowledge) to estimate the ARI, FMS, and VMS.
The results in Table 8 confirm the previous observations. The ARI and FMS were strongly positively correlated, and there was a strong positive correlation between these two scores and the SC (0.739 and 0.796, respectively). Furthermore, there was a strong positive correlation between the VMS and SC (0.840). It seems that, in both CelebA and CASIA-WebFace, there are similar linear relationships between the cluster quality measurements.

Supervised Training of KNN, SVM, and NN on Ground Truth Data
In the last part of the experiment, we compared the effectiveness of previously trained classifiers with pattern recognition methods that were trained using ground truth data. We used the same training and evaluation data sets as above; however, this time, we utilized knowledge about the ground truth. We used the same classifier configurations as in [4] (KNN-1), [6] (NN with single fully connected layer with 1024 neurons and softmax), and [7] (linear SVM). As the data set projected by PCA onto 62-dimensional space obtained satisfactory results in our previous experiment, we evaluated the classifiers on both the original and projected data. The results are presented in Table 9 for CelebA and in Table 10 for CASIA-WebFace.
In the case of the CelebA data set, the highest accuracy for a classification algorithm trained on the ground truth data was obtained using the KNN-1 and SVM methods (equal to 0.875). Application of PCA dimensionality reduction did not affect the overall accuracy positively: it either remained the same as for the 62-dimensional data set or was smaller. The accuracy of the NN classifier was slightly smaller, equal to 0.851 or 0.853, depending on whether PCA was used.
In the case of the CASIA-WebFace data set, the highest accuracy for a classification algorithm trained on ground truth data was obtained for the SVM, equal to 0.838. Application of PCA dimensionality reduction slightly improved the accuracy in the case of the NN (by 0.001); in the other cases, it either remained the same as for the 62-dimensional data set or was smaller.

Comparison of Cluster Quality Measurements between Classifiers Trained on Ground Truth Data and without This Knowledge
As we were unable to calculate the actual accuracy of the classification algorithms trained without ground truth knowledge, we compared their effectiveness with that of pattern recognition methods trained with this knowledge, using the same metrics as before to evaluate the quality of clustering. As can be seen in Tables 9 and 10, there was not much difference between these two groups of methods. Among the available metrics, the Fowlkes–Mallows score (FMS) seems to be the most informative in terms of accuracy. In the case of the CelebA data set, application of PCA dimensionality reduction had a minimal impact on the overall accuracy: for KNN-1, the accuracy was reduced by 0.1%; for the NN, it increased by 0.2%; and it had no influence on the SVM. This was also true for the CASIA-WebFace data set: for the SVM, PCA reduced the accuracy by 0.3%; for the NN, it increased the accuracy by 0.1%; and it had no influence on KNN-1.
In the CelebA data set, the highest FMS score was obtained for KNN-1 (0.791); the FMS had the same value for the original data set and for that with PCA-reduced dimensionality. When KNN-1 was trained using data clustered by the Agglomerative Clustering method with a threshold value of 16, its FMS score was reduced by 5.3%. In the case of the NN, this reduction was equal to 3.6% (0.8% for PCA-processed data). In the case of the SVM, all cluster quality measures besides CS, VMS, and CNR increased when the algorithm was trained without knowledge of the ground truth.
In the CASIA-WebFace data set, the highest FMS score was obtained for SVM (0.718). When the SVM was trained using data clustered by Agglomerative clustering with a threshold value of 34 and reduced dimensionality, its FMS score was reduced by 6.6%. In the case of NN, this reduction was equal to 6.7% (6.6% for PCA-processed data). In the case of KNN-1, this reduction was equal to 4.6% (4.3% for PCA-processed data).

Discussion
It was expected that we would obtain better clustering results with a centroid-based approach than with a density-based approach, as the deep feature optimization method using the Triplet Loss algorithm results in the creation of centroid-based clusters of faces with the same identity. Based on the results in Table 3 (CelebA data set), we can state, without any doubt, that Agglomerative Clustering with a threshold of 16 maximized the most important clustering quality measures. We also obtained very good results for Agglomerative Clustering with a threshold of 16 and projection of the data set onto 62-dimensional space with PCA. As can be seen from Table 1, the 62-dimensional projection explained over 0.999 of the overall variance present in the data set while, at the same time, reducing the dimensionality of the problem by more than half. According to Table 5, the best results were obtained with KNN 1 and SVM/PCA 62; however, very similar results were found in the case of SVM without PCA. Although training a linear SVM classifier is usually more time-consuming than KNN, the SVM has two very important advantages over KNN: it allows for generalization of the problem and it can better deal with outliers.
These results were confirmed by the experiments on the CASIA-WebFace data set. In that case, however, Agglomerative Clustering with threshold 34 maximized the most important clustering quality measures (see Table 4). According to Table 6, the best results were obtained with KNN 1 and SVM/PCA 62; however, very similar results were obtained in the case of SVM without PCA.
As was shown in Tables 7 and 8, the SC was positively correlated with the ARI, VMS, and FMS; thus, we can use the SC as an indicator of clustering quality performance.
We can clearly see, from Tables 9 and 10, that the clusters discovered by unsupervised methods can be safely used to train a classifier without knowing the ground truth, as the lack of this knowledge does not greatly deteriorate the overall cluster assignments. This conclusion is especially important in real-world scenarios in which the ground truth can be obtained only by manual assignment, which is often very time-consuming and expensive for large data sets. We can conclude that, besides highly secure systems in which face verification is a key component, face identities discovered by unsupervised approaches can be safely used for training supervised classifiers. We can also observe that the CASIA-WebFace data set was more difficult to classify than CelebA: both the scoring parameters described in Section 2.6.2 and the accuracies given in Tables 9 and 10 had slightly lower values in the case of CASIA-WebFace. This may have been caused by the fact that the pictures in the CelebA data set have visually better image quality.

Conclusions
We found that knowledge of the ground truth improves the Fowlkes–Mallows score by only 5.3% (from 0.749 to 0.791) for the classification algorithm with the highest accuracy (namely, KNN-1 on the CelebA data set) and by 6.6% (from 0.652 to 0.718) for the SVM classifier on the CASIA-WebFace data set. As already mentioned, besides highly secure systems in which face verification is a key component, the face identity clusters discovered by an unsupervised method can be safely used to train a classifier. Furthermore, we found that the Silhouette Coefficient (SC) of unsupervised clustering was positively correlated with the Adjusted Rand Index, V-measure score, and Fowlkes–Mallows score; therefore, we can use the SC as an indicator of clustering performance when the ground truth of face identities is not known. All of these conclusions are important findings for large-scale face verification problems, as skipping the verification of identities before supervised training saves a great deal of time and resources.
Author Contributions: T.H. was responsible for conceptualization, proposed methodology, software implementation, and writing the original draft; P.M. was responsible for data curation and validation. All authors have read and agreed to the published version of the manuscript.