ILRA: Novelty Detection in Face-Based Intervener Re-Identification

Transparency laws allow citizens to monitor the activities of political representatives. To this end, parliamentary sessions must be diarized, either automatically or manually, the latter being time consuming. In the present work, this problem is addressed as a person re-identification problem. Re-identification is defined as the process of matching individuals under different camera views. This paper deals, in particular, with open world person re-identification scenarios, where the probe captured by one camera is not always present in the gallery collected by another one, so it must first be determined whether the probe belongs to a novel identity or not. This step is mandatory before matching the identity. In most cases, novelty detection is tackled by applying a threshold based on a linear separation of the identities. We propose a threshold-less approach to the novelty detection problem that is based on a one-class classifier and therefore does not need any user-defined threshold. Unlike other approaches that combine audio-visual features, an Isometric LogRatio transformation of a posteriori (ILRA) probabilities is applied to local and deep descriptors extracted from the face, which, unlike audio streams, exhibits symmetry that can be exploited in the re-identification process. These features are used to train the one-class classifier to detect the novelty of the individual. The proposal is evaluated on real parliamentary session recordings that exhibit challenging variations in the pose and location of the interveners. The experimental evaluation explores different configuration sets, where our system achieves a significant improvement on the given scenario, obtaining an average F measure of 71.29% for online analyzed videos. In addition, ILRA performs better than the face descriptors used in recent face-based closed world recognition approaches, achieving an average improvement of 1.6% with respect to a deep descriptor.


Introduction
Person re-identification is the process of recognizing an individual over different non-overlapping camera views [1][2][3][4][5][6]. Usually, probe refers to the image of the individual to be recognized, and gallery refers to the set of images of known people against which the probe has to be recognized. Re-identification problems can be classified into different categories depending on the considered dimension [2]: sample set, body model, etc. Bedagkar-Gala and Shah [5] propose a wider taxonomy based on whether the presence of the probe in the gallery is mandatory or not. Thus, a closed world, or closed set, scenario is similar to the classic matching problem with a fixed size gallery. In an open world, or open set, scenario, the probe does not necessarily belong to the gallery, which evolves dynamically, adding new identities as the re-identification process takes place.
In the open world re-identification scenario, it is first necessary to decide whether the probe belongs to the gallery or not. If the probe belongs to the gallery, a matching process is carried out; otherwise, the probe is added to the gallery as a new identity. The first stage in an open world re-identification scenario is very similar to the problem of novelty detection [7][8][9], which refers to the identification of new or unknown individuals who were not previously registered in the system. Those individuals are called atypical, in opposition to the registered ones, who are referred to as typical.
Speaker diarization [10] can be considered a similar problem to person re-identification. In the former, systems try to answer the questions of who spoke when. The difference lies in the scenarios where they are applied. Person re-identification is considered mostly in video surveillance scenarios where there is no audio, and coarse views of the people are obtained, so appearance based methods are widely used [2]. On the contrary, speaker diarization is carried out in video recordings (news, talk shows or television debates) where audio and close views of the participants are available. The availability of audio and images allows the application of techniques that combine both information sources [10,11]. In addition, the intervener views are normally close frontal views that allow information of the face to be extracted, instead of the general appearance of the intervener, allowing the exploitation of the facial features that are almost symmetrical and uniform [12,13].
In this paper, a face-based open world re-identification approach is presented in a parliamentary debate scenario. This is a challenging scenario because deputies can participate in the debate from different locations: speaker platform (top row in Figure 1), seats (second and third rows in Figure 1) and presidential table (bottom row in Figure 1). These locations impose appearance variations in terms of pose and distance to the camera; therefore, a frontal face is not always available for each intervener during the debate. Thus, the main difference between usual speaker diarization scenarios, e.g., TV talk shows, and parliamentary debates, and what makes the latter a challenging problem, is the higher variability in poses, from close-up frontal views of the intervener to general views where not only the intervener appears but also other deputies who are close to her/him (first image of the bottom row in Figure 1). In order to provide a solution to these situations, the contributions of this paper are threefold:
• We present a contextualization of open world re-identification problems.
• We propose a feature vector based on the Isometric LogRatio (ILR) transformation of the a posteriori probabilities of belonging to a known intervener, computed from a descriptor previously calculated only over the intervener face.
• A threshold-less approach is used to solve the novelty detection problem in an open world scenario. Thus, there is no need for any user-defined threshold.
The remainder of this paper is organized as follows: Section 2 presents a review of recent literature in both re-identification and speaker diarization. Section 3 describes our methodology. Section 4 contains the experiment designs to evaluate our proposal and includes the achievements of the experiments. Section 5 deals with the advantages and disadvantages of the proposal, and, finally, conclusions are drawn in Section 6.

Related Work
In recent years, a dual, i.e., audio-visual, methodology in diarization has become popular. Bredin and Gelly [14] use television series to evaluate their diarization method. Their proposal is based on applying a clustering technique over the face images to assign the most co-occurring face cluster with the corresponding audio cluster. The latter is extracted from the linear Bayesian Information Criterion (BIC) clustering of the audio stream. Lastly, regular BIC clustering is used to obtain the final diarization. Unlike the previous authors, a multiple speaker detection approach that uses the position of the audio signals sources was proposed in [15]. Other authors [16] use the LIUM system, to extract the audio diarization and deformable part-based model (DPM) to detect visual faces. Later, a conditional random field based multi-target tracking is adopted to track the interveners. Subsequently, a clustering technique based on the similarity distances and biometric measures is applied. To assign the names, One-to-One Speaker Tagging is computed to maximize the co-occurrence duration between clusters and the names provided by an Optical Character Recognition (OCR). As opposed to previous works, in [17], the authors do not detect the faces. Instead, skin blocks are detected using the chrominance coefficients of the skin-tone in the YUV color space, where motion vectors are obtained. The Mel Frequency Cepstral Coefficients (MFCCs) of the audio stream are combined with the visual representation using a log-likelihood from two Gaussian Mixture Models (GMM).
Given that our proposal is based on a re-identification approach, we summarize some related works. The approach by Bazzani et al. [18] consists of splitting the individual body parts of the pedestrians. Features are extracted from the HSV color space using weighted histograms. Other features are extracted using an agglomerative clustering of the image pixels and the computation of texture patches. Moreover, in recent years, some researchers have introduced the use of metric learning techniques in the field of people re-identification. The aim of these techniques is to project the representation of the individuals in a feature space where those of the same individual are closer and those of different individuals are further apart. Authors in [19] propose the Keep It Simple and Straightforward (KISS) learning, improving the method using a regularization in order to suppress the effect of larger eigenvalues in the covariance matrices. Moreover, in [20], the authors describe a technique to find a common space in different camera views in an unsupervised context. Thus, a k-means is used to cluster the person images from different views. Neural Networks are also commonly used to project the samples in a new sample space. In this sense, authors in [21] split the image into three grids and use this representation as input into a bilinear network to aggregate in a feature vector. These vectors are used to obtain a new embedding feature space using a Siamese network. This architecture is commonly used to verify the input samples. In [22], the authors add also an identification stage to the model.
As mentioned above, recent challenging scenarios in the re-identification field are those related to open world problems, where novelty detection is a must (Figure 2). Novelty detection is used in a wide variety of contexts, such as wildlife scenes [23], time series of vital signs in gastrointestinal cancer surgery [24], the diagnosis of dermal diseases and the analysis of lymphatic cancer [25], and robotics scenarios [26]. More closely related to people re-identification, but using audio cues, the authors in [27] propose a novelty detection approach in a speaker diarization system. A likelihood ratio thresholding is applied, depending on the speaker gender, and it is normalized using the mean and standard deviation. This thresholding determines typical/atypical speakers. In contrast to these approaches, we focus on visual re-identification problems. The authors in [28] propose a novel transfer ranking approach for two types of verification, multi-shot and one-shot verification, in a bipartite ranking problem. They applied RankSVM and probabilistic relative distance comparison to obtain a model, which optimizes a margin parameter based on the typical intra-class and inter-class variations, and the inter-class variations between typical and atypical images. The authors in [29] present a supervised subspace learning approach where a linear transformation of the features is learnt by optimizing a cost function related to the proportion of positive and negative misclassified pairs. In order to determine the presence of a probe person in a gallery, they introduce a margin parameter such that pairs whose distance is lower than the threshold are considered as belonging to the gallery, and as not belonging to the gallery otherwise. The authors in [30] introduce a new person re-identification search setting whose main features are a vast probe search population, fast disjoint-view search and sparse training person identities. Over this setting, they obtain a set of features from the cross-view identity correlation and identity discrimination verification. As in the previous works, the novelty detection is based on a threshold over the distance between individual representations.
Open world re-identification problems have dealt with deep learning in recent years; in particular, generative networks are used. For instance, an unsupervised domain adaptation approach that generates samples for effective target-domain learning is presented in [31]. This is done under the assumption that datasets in different re-identification domains have entirely different sets of identities. Thus, a translated image should be of a different identity from any target image. In this way, a Cycle Generative Adversarial Network (CycleGAN) [32] is used to translate images from a source to a target domain. Then, a Siamese network pushes two dissimilar images away and brings similar ones closer, with the aim of classifying a sample as typical or atypical. In addition, authors in [33] take advantage of the benefit of integrating generated people images. On the one side, they use a person discriminator to verify whether the generated image is a person or not. On the other side, a target discriminator identifies if a person belongs to the dataset or not. The feature vector is extracted from the last fully connected layer of the target discriminator and a threshold is used to determine the novelty of the person.
Unlike the previous approaches, most of which use a margin parameter to detect the novelty of an individual, our approach applies a one-class classifier [34] to determine the novelty of a person, without the need to tune a threshold. The advantage of this classifier is that only positive samples are needed to train it, unlike other classifiers that make use of positive and negative samples in the training process. Furthermore, we propose the use of a feature vector based on the ILR transformation of the a posteriori probabilities of belonging to a known intervener, computed from a descriptor calculated only over the intervener face, which fits the one-class classifier.

Method
In this section, we first outline the proposed approach and then explain in detail its two stages: initialization and ILR transformation of a posteriori (ILRA) probabilities (see Figure 3). Before the initialization stage, the video is pre-processed, keeping only frames that contain frontal faces.
A video is composed of a sequence of $I$ shots $(S_1, \dots, S_I)$, where a shot is defined as a sequence of frames with a single intervener (see Figure 3). At the initialization stage, the system assigns an identity $ID_1$ to the first shot ($K = 1$). Next, shots are processed for novelty detection one by one, as a single intervener is assumed in each shot. Therefore, the system has to recognize whether the intervener of the current shot has been seen in previous shots (typical) or not (atypical). This stage finishes when an atypical shot is detected, at which point the system knows two interveners ($K = 2$).
Once the system has registered two interveners, the following shots are processed to solve a new atypical detection problem. To this end, a novel modelling based on the a posteriori probabilities of the individuals is proposed. This modelling cannot be applied in the previous stage because the system needs to have registered at least two interveners. If the new shot is typical, a $K$-label classification is used to recognize which of the known interveners corresponds to the current shot. Otherwise, a new identity is assigned to the current shot. This procedure is repeated until no shots are left in the sequence $S_1, \dots, S_I$. In the following subsections, the details of the different stages are described.

Figure 3. A video is divided into shots $S_i$, which are composed of frames $fr^i$. Each shot contains a single intervener and is the input data of our proposed system. The system is mainly divided into two stages. Unlike the ILRA stage, the initialization stage is carried out without the modelling approach.
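The control flow outlined above can be sketched as a short loop. This is only an illustrative sketch, not the authors' implementation: the function name `diarize` and the three callbacks (`is_novel_init`, `is_novel_ilra`, `classify`) are hypothetical placeholders for the initialization-stage detector, the ILRA-stage detector and the identity classifier described in the following subsections.

```python
from typing import Callable, List, Sequence

Shot = Sequence[Sequence[float]]  # one descriptor vector per frame
Detector = Callable[[List[Shot], List[int], Shot], bool]
Classifier = Callable[[List[Shot], List[int], Shot], int]


def diarize(
    shots: List[Shot],
    is_novel_init: Detector,   # hypothetical: novelty detector used while K = 1
    is_novel_ilra: Detector,   # hypothetical: novelty detector used once K >= 2
    classify: Classifier,      # hypothetical: picks one of the known identities
) -> List[int]:
    """Assign an identity index (1, 2, ...) to every shot, in order."""
    if not shots:
        return []
    labels = [1]  # the first shot always receives ID_1
    k = 1         # number of identities registered so far
    for i in range(1, len(shots)):
        seen, seen_labels, current = shots[:i], labels[:i], shots[i]
        # Initialization stage while K = 1, ILRA stage once K >= 2.
        detector = is_novel_init if k == 1 else is_novel_ilra
        if detector(seen, seen_labels, current):
            k += 1
            labels.append(k)   # atypical shot: register ID_{K+1}
        elif k == 1:
            labels.append(1)   # typical shot and only ID_1 is known
        else:
            labels.append(classify(seen, seen_labels, current))
    return labels
```

Passing the detectors and the classifier as arguments keeps the sketch independent of any particular one-class classifier or descriptor.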

Video Pre-Processing
A video is a sequence of shots $S_1, \dots, S_I$, and each shot $S_i$ is composed of frames $fr^i_1, \dots, fr^i_{n_i}$ with a detected face; when multiple faces are detected, the largest one is selected. Before face detection, each frame is converted to grayscale because color information is not used by the face descriptors [35]. For each shot $S_i$ of the video, a matrix $X_i = [x^i_1, \dots, x^i_{n_i}]$ is obtained, $n_i$ being the number of frames of the $i$-th shot. The detected face of each frame $fr^i_j$ is represented by a descriptor computed on the face region, as proposed in [36]. Thus, each row $x^i_j$ of matrix $X_i$ corresponds to a descriptor of dimension $D$, resulting in a matrix of dimension $n_i \times D$:

$$X_i = \begin{bmatrix} x^i_{1,1} & \cdots & x^i_{1,D} \\ \vdots & \ddots & \vdots \\ x^i_{n_i,1} & \cdots & x^i_{n_i,D} \end{bmatrix} \quad (1)$$

Initialization Stage
Firstly, the system assigns identity $ID_1$ to the first shot $S_1$, obtaining the extended matrix, which appends the label of the shot intervener to every row of $X_1$:

$$X^e_1 = [\, X_1 \mid ID_1 \,]$$

From now on, we refer to the label ($ID_x$) given to each registered individual as its identity. Next, the system has to determine the identity of the intervener in the following shots until the first atypical shot is found. This stage has similarities with a One Vs. One (OVO) strategy because, so far, the system knows just one intervener. Therefore, the procedure has to detect whether the intervener in the next shot is the same intervener $ID_1$ (typical) or a different one (atypical). In terms of a classification problem, a one-class Support Vector Machine (SVM) [37] classifier is trained with the extended matrices $X^e_1, \dots, X^e_{i-1}$, and predictions are obtained for the input matrix $X_i$. In this way, for each frame in $X_i$, a prediction in terms of typical/atypical is obtained. However, not all frames necessarily receive the same predicted label, and it is reasonable to consider the whole shot $S_i$ as typical ($id(S_i) = ID_1$) if most of the $n_i$ frames in shot $S_i$ are predicted as typical, and otherwise as atypical ($id(S_i) = ID_2$), increasing the number of interveners $K$. Thus, we have decided to use the Winner-Takes-All (WTA) principle for this purpose.
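The shot-level WTA decision can be illustrated with a few lines of code. The sketch below assumes that the per-frame predictions are encoded as +1 (typical) and -1 (atypical), which is the convention used, for example, by scikit-learn's `OneClassSVM.predict`; the function name `shot_is_typical` is hypothetical, and ties are resolved as atypical because a tie does not satisfy "most of the frames".

```python
from typing import Sequence


def shot_is_typical(frame_predictions: Sequence[int]) -> bool:
    """Winner-Takes-All over frames: the shot is typical only when strictly
    more than half of its frames are predicted as typical (+1)."""
    if not frame_predictions:
        raise ValueError("a shot needs at least one frame")
    typical_votes = sum(1 for p in frame_predictions if p == 1)
    return 2 * typical_votes > len(frame_predictions)
```

For instance, a shot with predictions `[1, 1, -1]` is labelled typical, whereas `[1, -1]` is labelled atypical under the tie rule assumed here.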

ILRA Stage
Once the system has registered at least two individuals ($K \geq 2$), it is necessary to determine whether the individual of the next shot $S_i$ is registered or not. For this purpose, this stage comprises three main processes: modelling, novelty detection and, if the current shot is typical, classification. This stage has similarities with a One Vs. All (OVA) strategy. The data available at this stage are, on the one hand, the extended matrices $X^e_1, \dots, X^e_{i-1}$, which contain the descriptors of the frames of each previous shot plus the labels of their associated identities, and, on the other hand, the descriptors of the frames in shot $S_i$, i.e., $X_i$.
The aim of the modelling stage is to obtain the a posteriori probability, $p^i_{jk} = \mathrm{Prob}(ID_k \mid x^i_j)$, of each frame $j$ in shot $S_i$ belonging to each registered identity $k$. Thus, for the $i$-th shot, a matrix $P_i$ is computed:

$$P_i = \begin{bmatrix} p^i_{11} & \cdots & p^i_{1K} \\ \vdots & \ddots & \vdots \\ p^i_{n_i 1} & \cdots & p^i_{n_i K} \end{bmatrix}$$

where $\sum_{k=1}^{K} p^i_{jk} = 1$. On the one hand, for shots $S_1, \dots, S_{i-1}$, where an identity has been assigned, the a posteriori probabilities are estimated using a leave-one-out strategy. Therefore, for each frame $fr_j \in \{S_1, \dots, S_{i-1}\}$, the a posteriori probabilities are computed using a Naïve Bayes classifier trained with all the frames except frame $fr_j$, i.e., $\{S_1, \dots, S_{i-1}\} \setminus fr_j$. On the other hand, for each frame $fr_j \in S_i$, the a posteriori probabilities are computed using a Naïve Bayes classifier trained with all the frames of the previous shots, $\{S_1, \dots, S_{i-1}\}$.
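As an illustration of this modelling step, the following sketch estimates the a posteriori probabilities with a small Gaussian Naïve Bayes model and a leave-one-out loop. Several points are assumptions made only for the example: the text does not state which Naïve Bayes variant is used (a Gaussian likelihood with empirical priors is assumed here), the names `fit_gaussian_nb`, `posteriors` and `leave_one_out_posteriors` are hypothetical, and every identity is assumed to keep at least one frame when one frame is left out.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Dict[int, Tuple[float, List[float], List[float]]]


def fit_gaussian_nb(frames: List[Vector], labels: List[int], eps: float = 1e-6) -> Model:
    """Per-identity log-prior, feature means and variances (eps avoids zero variance)."""
    model: Model = {}
    for k in sorted(set(labels)):
        rows = [x for x, y in zip(frames, labels) if y == k]
        dim = len(rows[0])
        mean = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        var = [sum((r[d] - mean[d]) ** 2 for r in rows) / len(rows) + eps
               for d in range(dim)]
        model[k] = (math.log(len(rows) / len(frames)), mean, var)
    return model


def posteriors(model: Model, x: Vector) -> List[float]:
    """Prob(ID_k | x) for every registered identity, ordered by identity; sums to 1."""
    log_joint = []
    for log_prior, mean, var in model.values():
        log_lik = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                      for xi, m, v in zip(x, mean, var))
        log_joint.append(log_prior + log_lik)
    top = max(log_joint)  # log-sum-exp trick for numerical stability
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


def leave_one_out_posteriors(frames: List[Vector], labels: List[int]) -> List[List[float]]:
    """Row j is produced by a model trained on every labelled frame except frame j."""
    rows = []
    for j in range(len(frames)):
        model = fit_gaussian_nb(frames[:j] + frames[j + 1:], labels[:j] + labels[j + 1:])
        rows.append(posteriors(model, frames[j]))
    return rows
```

For the frames of the current shot, `posteriors` would simply be called with a model fitted on all frames of the previous shots. Note that well-separated identities can yield probabilities that underflow to exactly zero, which matters for any subsequent log-ratio transformation.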
Once the a posteriori probabilities are computed, the second step of the modelling process is carried out: the ILR transformation is applied to the matrices $P_1, \dots, P_i$. This is a well-known transformation in the field of Compositional Data, which obtains a real coordinate representation while preserving the Aitchison metric of the original space of the a posteriori probabilities [38]. It is formally defined as

$$\operatorname{ilr}(x) = \operatorname{clr}(x) \cdot V,$$

where clr is the Centered Log Ratio (CLR) transformation and $V$ is a matrix whose columns form an orthonormal basis of the CLR plane [38]. In summary, the $j$-th frame of shot $S_i$, with a posteriori probabilities $p^i_j = (p^i_{j1}, \dots, p^i_{jK})$, is transformed as follows:

$$z^i_j = \operatorname{ilr}(p^i_j) = \operatorname{clr}(p^i_j) \cdot V.$$

Then, all transformed vectors are organized by rows in a matrix $Z_i$, which is the matrix that characterizes shot $S_i$ when determining the identity of the intervener. The same transformation is applied to all frames in shots $S_1, \dots, S_{i-1}$, obtaining matrices $Z_1, \dots, Z_{i-1}$. To determine the novelty of shot $S_i$, a one-class SVM classifier is trained with the extended matrices $Z^e_1, \dots, Z^e_{i-1}$, and, as in the novelty detection approach of the initialization stage, predictions are obtained for the input matrix $Z_i$. Again, WTA is used to determine whether $S_i$ is atypical or typical. In the former case, $id(S_i) = ID_{K+1}$ is assigned and the number of identities known by the system increases. In the latter case, when $S_i$ is considered typical, a classifier is used to identify which of the known identities it belongs to. The classification module could be implemented with any classifier trained with the extended matrices $Z^e_1, \dots, Z^e_{i-1}$ to determine $id(S_i)$. Moreover, a WTA strategy is used to determine the identity that characterizes shot $S_i$.
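To make the transformation concrete, the sketch below computes ILR coordinates for one row of a posteriori probabilities. It is an illustration under stated assumptions rather than the authors' code: the text only requires $V$ to have orthonormal columns spanning the CLR plane, so a Helmert-type basis is chosen here as one valid option; the `floor` parameter that clips zero probabilities before taking logarithms is also an assumption, and the names `clr`, `helmert_basis` and `ilr` are hypothetical.

```python
import math
from typing import List, Sequence


def clr(p: Sequence[float]) -> List[float]:
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def helmert_basis(k: int) -> List[List[float]]:
    """K x (K-1) matrix whose columns are orthonormal and each sum to zero."""
    v = [[0.0] * (k - 1) for _ in range(k)]
    for i in range(1, k):
        scale = math.sqrt(i / (i + 1))
        for row in range(i):
            v[row][i - 1] = scale / i
        v[i][i - 1] = -scale
    return v


def ilr(p: Sequence[float], floor: float = 1e-12) -> List[float]:
    """ILR coordinates clr(p) . V of a probability vector with K >= 2 parts."""
    k = len(p)
    clipped = [max(v, floor) for v in p]   # assumption: zeros are floored
    total = sum(clipped)
    closed = [v / total for v in clipped]  # re-close so the parts sum to 1
    c = clr(closed)
    basis = helmert_basis(k)
    return [sum(c[r] * basis[r][j] for r in range(k)) for j in range(k - 1)]
```

With $K = 2$ the result is a single coordinate, $\ln(p_1 / p_2) / \sqrt{2}$, which is consistent with the requirement that at least two identities be registered before this stage can start. A different orthonormal basis would only rotate the coordinates, leaving distances between frames unchanged.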

ILRA Time Complexity
The time complexity for computing ILRA comprises the a posteriori probability computation, the ILR transformation, the novelty detection stage, and, in some cases, a classification.

Experimental Evaluation and Results
In order to evaluate our proposal, recordings from the Canary Islands Parliament (Santa Cruz de Tenerife, Canary Islands, Spain), which are publicly available on the Parliament web site [39], were processed on a workstation with an Intel Core i7-2600 at 3.40 GHz and 16 GB of RAM. The source code is available on GitHub [40]. For the experiments, we chose six videos with different characteristics, which are summarized in Table 1. The selected videos cover a wide range of interveners (5 to 21) and shots, so that the influence of the number of interveners could be evaluated. Shots shorter than 30 s were skipped, as they were considered not relevant for the diarization. In addition, frames without a detected face are discarded. To this end, a face detector based on Histogram of Oriented Gradients features and an SVM classifier is applied [41]; the face is then normalized by placing the vertical symmetry axis through the center of the eye positions estimated by the face detector. According to the number of interveners, the videos can be classified as short, with fewer than ten interveners (video identifiers 2771, 2918, 3015), and large, with more than ten (video identifiers 2792, 2907, 3011). First, a set of offline experiments was carried out to isolate and evaluate the different situations involved in the proposed approach. The evaluation comprised three main experiments: (1) novelty detection in the initialization stage; (2) novelty detection and (3) classification in the ILRA stage. In this way, the performance of the different stages of our approach can be evaluated. With this objective, the shots of the same ID were reorganized to carry out the experiments properly, as shown in Figure 4. As a result of the rearrangement of the samples, the training sets are unbalanced because some IDs are more present than others; to avoid that, 500 frames were randomly chosen per identity. When the number of frames for an identity was lower, all shot frames were used. To validate the process, we carried out 100 repetitions.
The dimensionality of the individuals was reduced, $\mathbb{R}^{w \times h} \to \mathbb{R}^{D}$, as mentioned in Section 3. This reduction is based on applying a descriptor to the intervener face area ($w \times h$), where $w$ and $h$ represent the width and height, respectively. Two descriptor types have been evaluated: local descriptors and a deep descriptor. The former type used a grid of 3 × 3 cells over an aligned image of 59 × 65. The following local descriptors were evaluated: Histogram of Oriented Gradients (HOG) [42], Local Binary Patterns (LBP) [43], LBP Uniform (LBPu2) [44], Neighborhood Intensity based LBP (NILBP) [45], and Weber Local Descriptor (WLD) [46], with a dimensionality of 81, 2304, 531, 531, and 2304, respectively. The latter type corresponds to a feature vector extracted from a deep network. In this case, a triplet network based on an Inception Resnet backbone (Resnet$_T$) [47,48] is used. In essence, a triplet network embeds the samples in a new feature space, where samples that belong to the same identity are close and samples from different identities are far apart. Thus, three instances of Inception Resnet that share the same weight matrix are used. The embedded space is represented by the last fully connected layer, with a dimensionality of 128 in our experiments. Resnet$_T$ is used due to its excellent scores in different kinds of problems in recent years. The network was trained on MS-Celeb-1M [49] because the dataset consists of 1 million identities, which allowed us to obtain a generalized model to extract the feature vectors from the faces. The network was trained with the following parameters: mini-batches of size 90 over 500 epochs; the initial learning rate was 0.1, and it was decreased by a factor of 10 after every 100 epochs. The margin between positive and negative pairs ($\alpha$) is set to 0.2. We evaluated multiple descriptors because it is important to assess the influence of different feature vectors on both stages of the algorithm.
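The two training settings quoted above, the margin $\alpha$ and the stepwise learning rate, can be illustrated as follows. The text does not spell out the loss, so the sketch assumes the commonly used triplet hinge loss on squared Euclidean distances; the function names `triplet_loss` and `learning_rate` are hypothetical, and this is not the training code of the network.

```python
from typing import Sequence


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    alpha: float = 0.2,
) -> float:
    """Assumed hinge form: max(0, ||a - p||^2 - ||a - n||^2 + alpha)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + alpha)


def learning_rate(
    epoch: int, initial: float = 0.1, step: int = 100, factor: float = 10.0
) -> float:
    """Initial rate divided by `factor` once for every `step` completed epochs."""
    return initial / factor ** (epoch // step)
```

Under this form, a triplet contributes no loss once the negative sample is farther from the anchor than the positive sample by at least the margin.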
Once the experimental setup is defined, it is necessary to adopt a metric. The accuracy (Acc.) is used to evaluate the offline experiments, being formally defined as

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP and FP are the number of true and false positives, respectively, and TN and FN are the number of true and false negatives, respectively. Accuracy is used to measure typical and atypical detections. Instead of calculating the mean of the typical and atypical values, the F measure is adopted to obtain a single measure that provides a trade-off between both accuracies. Its formal definition is presented in the following equation:

$$F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where

$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (8)$$

and

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

Precision is the fraction of relevant samples among the retrieved samples, and recall is the fraction of relevant samples that have been retrieved over the total amount of relevant samples. Below, we present and discuss the results obtained in the experiments.
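These measures translate directly into code. The helper names below are hypothetical, and returning 0 for an undefined ratio is an assumption made only to keep the sketch total.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP): fraction of retrieved samples that are relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN): fraction of relevant samples that were retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of two rates, e.g. precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Because the harmonic mean is dominated by the smaller of its two arguments, a detector that excels at only one of the two tasks still receives a low F measure.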

Evaluation of Novelty Detection in the Initialization Stage
The purpose of this first experiment is to evaluate the ability of the system to detect a novel identity when a single identity is known, i.e., K = 1. The typical or atypical detection was performed as follows: for each identity ID k , we considered its corresponding samples as a test set, and, to conform the training set, we considered two different situations.
In the first case, the training set was composed of those samples with identity $ID_j \neq ID_k$. In such a situation, the tested identity should be labelled as atypical (Figure 5a) and the number of different comparisons is $K^2 - K$. Note that, for each comparison, the individuals have to be detected as atypical for the detection to count as a success.
In the other case, the training set was composed of those samples with the same identity $ID_k$. To avoid having identical training and test sets, one third of the original samples of identity $ID_k$ is used as a test set and the remaining two thirds as a training set. In this situation, the individuals have to be detected as typical for the detection to count as a success (Figure 5b). We performed this experiment for all $K$ identities in the video.
The "novelty detection in the initialization stage" columns of Table 2 summarize the results of the initialization stage experiments. It can be observed that, in all videos, the best F measure is obtained using Resnet$_T$, with an average value of 97.66%. In general, the atypical detection results are greater than or equal to 90% in 30 of the 36 settings.

Evaluation of Novelty Detection in the ILRA Stage
The offline experiments related to the ILRA stage are motivated by the need to evaluate the capacity of the approach to detect the novel identity of a new shot when several identities are known. Therefore, two evaluations are considered for each identity: atypical and typical. The former places all samples of identity $ID_k$ in the test set, while the samples of the remaining identities, $ID_{j \neq k}$, are used for training (Figure 6a). This experiment is carried out to show the behaviour of the approach for atypical identity detection, as the tested identity $ID_k$ should be labelled as atypical. The latter includes all identities in both the training and test sets, splitting their respective samples randomly and in a balanced way, using one third for testing and the rest for training (Figure 6b). This experiment is carried out to show the behaviour of the approach for typical identity detection, as the tested identity $ID_k$ should be labelled as typical.
The "novelty detection in the ILRA stage" columns of Table 2 indicate that the descriptor with the highest F measure is HOG, reporting 78.14%. It is also observed that, when the number of interveners is low, the best descriptor is HOG, whereas, for a large number of interveners, WLD appears to behave better than the remaining descriptors.

Evaluation of Intervener Classification in the ILRA Stage
The purpose of this experiment is to evaluate the capacity of the approach to correctly assign the identity of a new intervener shot when multiple identities are known. That is, when the identity of the new shot ($id(S_i)$) is present among the known identities, the intervener has been considered typical in the ILRA stage, and the approach should match the shot to the identity it belongs to. Two classifiers are considered: the Maximum A Posteriori (MAP) probability extracted from the samples (see Figure 7a), and an SVM classifier, in order to keep using the same type of classifier used throughout this proposal (see Figure 7b). In the case of the SVM, a Radial Basis Function (RBF) kernel is selected with main parameters ν = 0.1, γ = 0.1 and C = 1. A repeated holdout validation is carried out using 100 repetitions with re-sampling of the individuals, using one third of the samples for testing and the remaining ones for training.
The results are summarized in the "intervener classification in the ILRA stage" columns of Table 2. Among the six descriptors, Resnet$_T$ yields the best accuracy in seven of the twelve experiments, giving an average value for the MAP and SVM classifiers of 88.69% and 97.36%, respectively.

Evaluation of the Proposed Online System
After evaluating the different offline stages, we carried out an online experiment. The number of frames per shot has been modified compared to the offline configuration: 200 frames per shot were used because the experiment comprises a larger number of shots, some of them containing a reduced number of frames. This situation brought about unbalanced shots that affect the performance of the algorithm. Given that the SVM classifiers provided the best performance in the previous offline experiments, SVM is adopted to identify the interveners in the case of typical individuals.
To evaluate the online system, we adopted the True Re-identification Rate (TRR) and True Distinction Rate (TDR) measures from [50]. TRR evaluates how well the method re-identifies interveners, while TDR evaluates how well the method distinguishes among the interveners. Both measures are formulated in terms of $\mathbf{1}_N$, a vector of dimension $N$ with all its elements equal to one, and $\operatorname{tr}(score)$, the trace of $score$, an $N \times N$ matrix that holds the result of comparing each proposed intervener shot identity with all proposed intervener shot identities: 1 is assigned to equal identities and 0 to different ones. Thus, a perfect score has 1 in the diagonal elements and 0 in the off-diagonal elements. To obtain a single measure, the F measure is adopted, relating TRR (considered as recall) and TDR (considered as precision). The last experiment is the online process, which comprises real online video processing and evaluates the same descriptor for each stage of the algorithm. The results of the experiments are summarized in Table 3. In most of the processed videos, one descriptor beats the others, but there is no common behaviour across the entire video collection. In this case, the best descriptor depends on the video, not on the number of interveners. On the one hand, we would like to highlight the F measure obtained in video 3011, 88.25%, covering a population of 21 interveners and two hours of recording in an open world problem, which is a really complex problem. On the other hand, the result achieved for recording 2907 is interesting because it exposes a deficiency in traditional feature vectors, caused by an occlusion issue: most of the interveners put their glasses on or take them off during the intervention. In this situation, Resnet$_T$ improves by at least 44.69% compared to the other descriptors, reaching 76.51% in recording 2907.
Furthermore, our system is compared with our previous work [51], which is, as far as we know, the only existing approach in this scenario, i.e., face-based intervener re-identification in open-world parliamentary debate sessions. Additionally, face recognition approaches focusing on the closed world are used to extend the comparison with the proposed ILRA approach. In particular, HOG, LBP, LBPu2, NILBP, WLD and Resnet T are used as feature vectors. To detect atypical samples, we use a threshold of 0.5: a sample is considered atypical when its value is larger than the threshold. If the sample is typical, a distance vector is calculated between the previously analyzed samples and the current sample, and the identity with the minimum distance is assigned to the current sample.
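The threshold-based baseline described above can be sketched as follows. This is only an illustrative reading of the procedure, under several assumptions that the text does not confirm: the value compared against the threshold is taken to be the minimum distance to the stored identities, the distance is Euclidean, and the feature vectors are scaled so that 0.5 is a meaningful cut-off. The names euclidean, assign_identity and gallery are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_identity(sample, gallery, threshold=0.5):
    """Return (identity, is_novel) for one sample.

    gallery maps each known identity to a reference feature vector.
    Assumption: a sample is atypical when its minimum distance to the
    gallery exceeds the threshold; otherwise the closest identity wins.
    """
    if not gallery:
        return None, True  # nothing registered yet, so the sample is novel
    distances = {name: euclidean(sample, ref) for name, ref in gallery.items()}
    best = min(distances, key=distances.get)
    if distances[best] > threshold:
        return None, True  # atypical sample: no identity is close enough
    return best, False  # typical sample: identity with minimum distance
```

A design consequence of this baseline is that its behaviour depends entirely on the user-defined threshold, which is the dependency the one-class classifier of the proposed method is meant to remove.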
Compared with the above methods, our method obtains the best F measure in most of the experiments across the different videos. These results are summarized in Table 4. On the one hand, the largest performance increase occurs in video 3015, with an improvement of 63.80% with respect to our previous work, ILRA being clearly superior to traditional face recognition methods. On the other hand, the recent technique, Resnet T, achieves a significant increase in results compared to the techniques mentioned above. However, it does not beat the proposed method, the average difference being 1.12% for the analyzed videos.

Discussion
In this paper, we analyzed the ILRA approach in offline and online contexts. On the one hand, offline experiments were carried out to evaluate the method in a controlled scenario. In this way, we could analyze each stage of the approach. On the other hand, online experiments allowed us to test the method in real conditions, where the system starts without any registered person.
A feature of the proposed method is the need for an initialization stage, because the ILRA cannot be calculated with fewer than three registered identities (Section 3). The performance in detecting the second identity to start the ILRA process affects the rest of the system. For this reason, we evaluated the initialization stage in an offline context (Section 4.1), where we found that the Resnet T descriptor achieves a better score than the local descriptors.
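To make the role of the number of registered identities more concrete, the sketch below applies an isometric log-ratio (ILR) transformation to a vector of a posteriori probabilities. It uses one common orthonormal basis from the compositional data literature; whether this is the exact basis used in Section 3 is an assumption, and the function name ilr is hypothetical. With D identities, the probability vector has D parts and the transformation yields D - 1 coordinates, so three identities produce a two-dimensional feature vector.

```python
import math


def ilr(probabilities):
    """Isometric log-ratio coordinates of a strictly positive composition.

    Coordinate i (1-based) is sqrt(i / (i + 1)) * ln(g_i / x_{i+1}),
    where g_i is the geometric mean of the first i parts.
    Assumption: this basis stands in for the one defined in the paper.
    """
    d = len(probabilities)
    if d < 2:
        raise ValueError("ILR needs at least two parts")
    if any(p <= 0 for p in probabilities):
        raise ValueError("all parts must be strictly positive")
    logs = [math.log(p) for p in probabilities]
    coords = []
    running = 0.0
    for i in range(1, d):
        running += logs[i - 1]      # sum of logs of the first i parts
        mean_log = running / i      # log of their geometric mean
        coords.append(math.sqrt(i / (i + 1)) * (mean_log - logs[i]))
    return coords
```

Because only ratios between parts enter the formula, the result is unchanged when the input is rescaled, and a uniform posterior maps to the origin.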
The modelling process is evaluated in the offline ILRA stage, which is split into two processes: novelty detection (Section 4.2) and classification of identities (Section 4.3). Firstly, the Resnet T descriptor is not the best descriptor for novelty detection in the ILRA stage. In this instance, a local descriptor, HOG, obtains the best average performance. Secondly, the Resnet T descriptor is better than the local descriptors for classification, with both the MAP and the SVM classifier.
This disaggregated analysis of the offline experiments shows that there is no common best descriptor for all stages. This issue carries over to the online experiments, where the average score of each descriptor decreases. This is due to recognition failures in the first stage, which generate more false positive identifications. One way to alleviate this issue is to choose a specific descriptor for each stage, since, as shown in Section 4, no single descriptor stands out in all stages. Selecting a single descriptor for every stage affects the system performance, making it less robust. Certainly the system is simpler, but using a single descriptor in all stages does not seem to be the ideal approach. Our observations suggest that Resnet T is well suited to detecting outliers in the one-class problem, HOG to novelty detection in the ILRA stage, and Resnet T to classification in the ILRA stage.
The existence of short videos with very few detected faces favors SVM over MAP, as can be observed in video 3011 (intervener classification in the ILRA stage, Table 2). This is due to the estimation of the Naïve Bayes parameters used in the MAP, because the average is more affected by unbalanced classes. Some authors have verified that SVM performs better than Naïve Bayes when dealing with unbalanced classes [52][53][54]. Moreover, the feature vector transformation using ILR alleviates the class imbalance problem, as suggested in [55].

Conclusions
A feasible face-based intervener re-identification solution for open-world scenarios has been presented, with the aim of applying it to diarization problems. We have evaluated the approach in parliamentary debate sessions, a challenging scenario in which people vary their pose and appearance and do not necessarily appear while speaking.
In this scenario, detecting novel interveners is relevant, as those identities must be properly registered. If novelty detection fails to detect a new intervener, s/he will be incorrectly assigned to a previously detected intervener. Conversely, if a previously detected intervener is considered a new one, the number of interveners will be erroneously increased. We have used and evaluated several descriptors for identity registration. In the one-class problem, Resnet T has shown good performance in novelty detection. HOG yields the highest accuracy for a low number of interveners, whereas WLD achieves the best results when the number of interveners is larger. The best configuration for the classification stage is Resnet T with an SVM classifier.
The experiments with the proposed system have shown good results, with an average F measure of 71.29% when the best descriptor is taken for each video. In addition, we have compared ILRA with different techniques used in closed-world face recognition, obtaining an increase of 1.6% with respect to the deep descriptor extracted from a triplet network based on an Inception Resnet backbone. In the offline experiments, novelty detection in the initialization stage reaches an average accuracy of 97.66% for the Resnet T descriptor. In the ILRA stage, novelty detection obtains an average accuracy of 78.14% with the HOG descriptor, and intervener classification obtains an average accuracy of 97.36% using the Resnet T descriptor with our method.
As future work, we plan to apply this approach using only audio features and the fusion of audio and video features. In this way, we could determine the influence of the audio over the image representation and verify if we obtain a better feature vector. Moreover, we intend to use deep learning techniques in order to replace the one-class SVM in the novelty detection module.