1. Introduction
Face recognition is a visual processing technology that automatically identifies individuals based on their facial features. In recent years, it has been widely applied in fields such as face-based payments, attendance systems, and identity verification [1]. Compared with traditional biometric technologies, face recognition offers multiple usable features, unobtrusive capture, non-contact operation, and low cost, enabling technologies originally reserved for stringent identity verification contexts to proliferate on mobile devices [2]. However, current face recognition systems are primarily static, requiring the person to be recognized to remain stationary in front of the camera during recognition; the device then matches the detected face against a database, a method that requires cooperation from the individual being recognized. In contrast, dynamic face recognition does not require such cooperation and can establish identity under unconstrained conditions [3]. In unconstrained scenarios, factors such as changes in the recognized person's facial expression, pose, and lighting intensity [4], the degree of occlusion, and the varying sizes of captured face images [5] interfere with the overall recognition process. In real-life settings, these factors act randomly and jointly, making it difficult to accurately determine the identity of the person being recognized in actual dynamic conditions. Deep learning is widely applied in the field of face recognition [6]. It can extract deep-level features, and training on vast datasets improves the model's robustness and generalization, yielding excellent detection and recognition performance. Recently, face recognition has made significant progress by drawing on general advances in object detection and deep learning. CNN-based object detectors can be divided into two-stage detection algorithms and regression-based one-stage detection algorithms [7]. Recent cutting-edge face recognition methods adopt dense sampling within a single-stage framework, which offers good performance in both speed and accuracy compared with two-stage methods [8].
With the advancement of technology, low-quality images are increasingly becoming an important component of face recognition datasets, as they are encountered in surveillance video and drone footage. Given that state-of-the-art face recognition (FR) methods [9,10] can achieve over 98% verification accuracy on relatively high-quality datasets such as LFW and CFP-FP [11,12], recent FR challenges have shifted to lower-quality datasets such as IJB-B, IJB-C, and IJB-S [13,14]. Despite the challenge of achieving high accuracy on low-quality datasets, most popular training datasets still consist of high-quality images [9,15,16], with only a small portion of the training data being of low quality. One issue with low-quality face images is that they are often unrecognizable: when the degradation is too severe, the relevant identity information disappears from the image. Such unrecognizable images are detrimental to training because the model will attempt to exploit other visual cues, such as clothing color or image resolution, to reduce the training loss. If these images dominate the distribution of low-quality images, the model's performance on low-quality test sets may be poor.
For such low-resolution images, two primary approaches have been explored: (1) construction-based and (2) projection-based methods. Construction-based methods enhance the visual quality of the low-resolution (LR) input prior to recognition, a step known as face super-resolution (FSR). In this way, the FR process is divided into two tasks: identity-preserving FSR and super-resolution face recognition (SRFR). Special attention has been given to Generative Adversarial Networks (GANs) [16,17,18] within the face generation module. Although GANs achieve remarkable results in terms of image quality and human perception, they add high-frequency components to the synthesized images, which adversely affect the recognition process [19]. Furthermore, FSR is an ill-posed problem because multiple high-resolution (HR) faces exist for each LR image [19]. In addition, face images are influenced by several other covariate factors, such as head pose, lighting, and expression. These factors result in a significant gap between the feature embeddings of HR and SR faces in the identity metric space, which severely degrades the final FR performance.
Projection-based methods aim to create a shared embedding space that accommodates both HR and LR images. To this end, synthetic LR data can be used to increase the resolution diversity of the dataset [20,21]. However, because of the fixed angular margin in traditional FR methods, convergence issues arise, and these methods do not adapt well to data augmentations such as downsampling or random cropping [22,23]. To address this problem, methods that adapt the margin according to sample difficulty have been proposed [23,24]. MagFace uses the feature norm as a measure of image quality and adjusts the margin accordingly. Adaptive margins have alleviated the convergence problem to some extent; however, performance still deteriorates significantly on LR images [25]. For instance, face verification accuracy on LFW is typically above 99%, whereas accuracy on TinyFace is only around 59%. Furthermore, Nourelahi et al. [26] demonstrated that training models on perturbed data results in poorer performance on original samples while improving robustness.
Many researchers have investigated face recognition solutions tailored for surveillance video. In pursuit of stable video surveillance, Baomansi et al. designed a review process based on the RPCA-PCP method [27], comparing 13 state-of-the-art RPCA methods on the BMC dataset. Zhang et al. proposed a framework for recognizing personnel in video surveillance scenarios that leverages heterogeneous contextual information together with facial features to address face recognition with low-quality data [28]. Mandal et al. introduced a robust visual analysis system for detecting driver fatigue in buses, which detects the driver's state based on head–shoulder detection, face detection, and eye detection [29]. Liu et al. presented a PRO framework based on deep neural networks [30], which not only utilizes multimodal data from large-scale video surveillance, such as visual features and camera locations, but also constructs its own surveillance-video dataset to ensure accuracy. Ding et al. proposed a trunk–branch ensemble CNN to enhance the robustness of CNN features against pose variations and occlusions [31]; this model extracts complementary information from the entire face image and from patches cropped around facial components, achieving state-of-the-art performance on three popular video face databases. Wang et al. introduced a deep learning-based method for face recognition in real-world surveillance videos [32]: through face detection, tracking, and labeling, they automatically and incrementally constructed a new dataset of real-world surveillance videos and then fine-tuned a convolutional neural network on the labeled data. Mahdi designed a system for real-time monitoring using cameras [33], which consists of two steps: face detection using the Viola–Jones method, and face recognition using the Kanade–Lucas–Tomasi algorithm as a feature tracker together with PCA for identifying specific individuals. In 2018, Deng et al. chose an AdaBoost-based face detection algorithm to detect faces [34] and implemented a face recognition algorithm based on LBPHFace to create a laboratory management system. Jose et al. implemented an intelligent multi-camera face recognition surveillance system using the FaceNet and MTCNN algorithms on a Jetson TX2 [35]; the proposed portable system tracks objects or suspects using camera IDs/locations and timestamps and records their status in a database through multiple camera installations. Wang et al. used an improved MTCNN algorithm for face detection [36], optimizing MTCNN and replacing the feature extraction network in FaceNet with MobileNet for face recognition; they also designed a face recognition-based laboratory access control system. In 2023, Dong et al. adopted the DRN algorithm for super-resolution of low-resolution images and then performed face recognition with ArcFace [37], designing a smart classroom management system.
With the frequent use of university laboratories, an increasing number of issues have emerged. To better implement open management of laboratories, it is necessary to identify individuals entering and exiting them. Traditional face recognition methods rely heavily on high-definition facial images as input, and when faced with low-resolution faces, recognition may fail due to insufficient features. Upgrading to higher-resolution cameras can alleviate the low-resolution problem, but it is limited by high equipment and maintenance costs. Another approach is to use super-resolution techniques to reconstruct low-resolution facial images into high-resolution ones, but as resolution decreases, facial features gradually diminish, leading to poorer reconstruction results. Furthermore, issues such as occlusion can arise from camera placement and personnel movement. Therefore, a better recognition solution is needed for face recognition in surveillance scenarios.
This paper proposes a low-resolution face recognition method for application in laboratory surveillance scenarios. In this work, the backbone feature extraction network is improved, and the latest face recognition techniques and multi-object tracking algorithms are integrated into the network to address face recognition under practical surveillance conditions with low resolution.
Currently, there are several main factors that hinder face recognition in laboratory surveillance: the complex classroom environment often results in excessively small face resolutions, making detection and recognition difficult. Additionally, there are issues of misrecognition caused by occlusion and pose changes due to personnel movement. To address these problems, the research is divided into two parts:
Small Face Detection: In this paper, the WiderFace dataset is split into training and testing sets. An improved face detection algorithm based on RetinaFace is employed to complete the face detection task in laboratory surveillance. The improvement mainly involves incorporating SPD-Conv into the original RetinaFace. Compared with traditional strided convolutions, this module maintains high performance with less computation and preserves finer-grained information, thereby improving small-target detection.
Small Face Recognition: In response to the decline in recognition accuracy on low-resolution faces observed in current face recognition algorithms such as FaceNet, AdaFace is selected as the face recognition algorithm. During recognition, misidentification may occur due to occlusion caused by personnel movement. To address this, the ByteTrack multi-object tracking algorithm is integrated into AdaFace: Kalman filtering and the Hungarian algorithm are used to track the IDs of already recognized individuals, avoiding the need for repeated recognition. Finally, comparative experiments demonstrate that the improved method outperforms existing common face recognition algorithms.
2. Materials and Methods
Face recognition is one of the important research topics in the field of computer vision. It consists of face location, face alignment, and face classification [38]. Specifically, the first step is to detect faces and locate their positions in the image. Then, the preprocessed and cropped main face region is input into the backend network for face feature extraction and face matching. The main issues with current face recognition are not only the loss of features due to low resolution but also the scarcity of low-resolution datasets and the difficulty of applying them in practical applications.
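For readers unfamiliar with this detect–crop–embed–match pipeline, the minimal Python sketch below illustrates the steps in order; the detector, embedder, gallery, and threshold are hypothetical placeholders, not the actual components used later in this paper.

```python
import numpy as np

def crop(image, box):
    """Crop the face region given an [x1, y1, x2, y2] box (alignment omitted)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return image[y1:y2, x1:x2]

def recognize(image, detector, embedder, gallery, threshold=0.4):
    """Illustrative detect -> crop -> embed -> match loop.

    `detector(image)` is assumed to yield face boxes, `embedder(face)` to return
    a feature vector, and `gallery` to map enrolled names to unit-norm embeddings;
    all three are placeholders, and the similarity threshold is arbitrary.
    """
    results = []
    for box in detector(image):
        emb = embedder(crop(image, box))
        emb = emb / np.linalg.norm(emb)                # normalize for cosine similarity
        name, score = max(((n, float(emb @ g)) for n, g in gallery.items()),
                          key=lambda t: t[1])          # best match in the database
        results.append((box, name if score > threshold else "unknown", score))
    return results
```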
2.1. Small-Sized Face Detection
RetinaFace is a single-stage face detection model proposed by the InsightFace team in 2019. The model is based on the structure of RetinaNet and utilizes deformable convolutions and a dense regression loss [8]. This paper builds on RetinaFace, which traditionally employs one of two backbone feature extraction networks: ResNet or MobileNet. ResNet outputs three effective feature layers {C3, C4, C5} from the convolutional blocks conv3_x, conv4_x, and conv5_x. The detection network is initialized with weights pre-trained on the ImageNet dataset. It adopts a Feature Pyramid Network (FPN) to extract features with rich semantic information through a top-down pathway and lateral connections. Learning from the successful designs of the Single Stage Headless face detector (SSH) [39] and PyramidBox [40], it places separate context modules after the FPN to expand the receptive fields of the pre-detection regions and enhance reasoning capability, thereby efficiently computing the corresponding multi-task losses.
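As a concrete illustration of taking the {C3, C4, C5} layers from an ImageNet-pretrained ResNet, the snippet below uses the layer names of tf.keras' ResNet50; the exact backbone depth and input size used in this paper may differ, so it should be read as a sketch rather than the reference configuration.

```python
import tensorflow as tf

def build_resnet_backbone(input_shape=(640, 640, 3)):
    """Return the {C3, C4, C5} feature maps of an ImageNet-pretrained ResNet50.

    The layer names below follow tf.keras conventions for ResNet50; the actual
    backbone and input resolution in the paper may differ.
    """
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    c3 = base.get_layer("conv3_block4_out").output   # stride 8
    c4 = base.get_layer("conv4_block6_out").output   # stride 16
    c5 = base.get_layer("conv5_block3_out").output   # stride 32
    return tf.keras.Model(inputs=base.input, outputs=[c3, c4, c5],
                          name="resnet_c3c4c5")

backbone = build_resnet_backbone()
c3, c4, c5 = backbone(tf.zeros([1, 640, 640, 3]))    # feature maps at strides 8/16/32
```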
This paper proposes to improve RetinaFace using Space-to-Depth Convolution (SPD-Conv). We reconstruct RetinaFace in TensorFlow and experimentally validate the choice of ResNet as the backbone network for the face detection model. Although ResNet's powerful feature extraction capability has been widely applied in face detection, its performance declines rapidly on low-resolution images and small objects, because strided convolutions and pooling layers inevitably discard fine-grained information and lead to the learning of less effective features.
Simply increasing the number of convolutional and pooling layers does not necessarily lead to better learning; instead, problems such as vanishing gradients, exploding gradients, and degradation arise, and prediction performance tends to worsen as the network deepens. ResNet addresses this by allowing the layers stacked onto a shallow network to learn nothing and merely replicate the features of the shallow network; the added layers then act as identity mappings, ensuring that the deep network performs at least as well as the shallow one and thereby mitigating the degradation problem. Compared with traditional convolutional networks, ResNet introduces shortcut connections that are added to the input of the second activation function. In ResNet, this operation, in which the output equals the input, is referred to as an identity mapping and is the key to the residual structure.
SPD-Conv [41] is a novel convolutional module whose primary purpose is to enhance performance on low-resolution images and small-sized objects. As shown in Figure 1, SPD-Conv takes a new approach by combining a space-to-depth (SPD) layer with a non-strided convolutional layer. It rearranges the original feature map so that the spatial resolution decreases while the number of channels increases, and then applies a non-strided convolutional layer to obtain more discriminative feature representations.
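The following is a minimal TensorFlow sketch of such an SPD block, based on our reading of [41]: a space-to-depth rearrangement halves the spatial size without discarding any pixels, and a stride-1 convolution then mixes the stacked channels. The kernel size, normalization, and activation are assumptions.

```python
import tensorflow as tf

class SPDConv(tf.keras.layers.Layer):
    """Space-to-Depth convolution block (sketch following the SPD-Conv idea [41]).

    Space-to-depth halves the spatial resolution losslessly (channels grow 4x),
    after which a stride-1 convolution produces the output features.
    """
    def __init__(self, filters, kernel_size=3, scale=2, **kwargs):
        super().__init__(**kwargs)
        self.scale = scale
        self.conv = tf.keras.layers.Conv2D(filters, kernel_size,
                                           strides=1, padding="same", use_bias=False)
        self.bn = tf.keras.layers.BatchNormalization()
        self.act = tf.keras.layers.ReLU()

    def call(self, x, training=False):
        x = tf.nn.space_to_depth(x, block_size=self.scale)   # (H, W, C) -> (H/2, W/2, 4C)
        return self.act(self.bn(self.conv(x), training=training))

# e.g. a 64x64x32 map becomes 32x32x64 with no information dropped by striding
y = SPDConv(64)(tf.zeros([1, 64, 64, 32]))
```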
To address the loss of information as network depth increases on low-resolution images, we improve the residual structure of ResNet, as shown in Figure 2. We select ResNet as the backbone feature extraction network and replace the convolutional layers with a stride of 2 with SPD-Conv. This modification prevents important information in low-resolution images and small-sized objects from being lost through downsampling.
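A possible form of the modified downsampling block is sketched below, reusing the SPDConv layer from the previous snippet; which convolutions are replaced and how the shortcut branch is downsampled follow our interpretation of Figure 2 rather than a verified reference implementation.

```python
import tensorflow as tf

def spd_residual_block(x, filters):
    """Downsampling residual block in which stride-2 convs are replaced by SPD-Conv.

    Sketch only: the exact layout (which convolutions are replaced, how the
    shortcut is handled) is our reading of Figure 2. Reuses the SPDConv layer
    defined above.
    """
    shortcut = SPDConv(filters, kernel_size=1)(x)            # shortcut downsampled losslessly

    y = SPDConv(filters, kernel_size=3)(x)                   # replaces the stride-2 3x3 conv
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)

    y = tf.keras.layers.Add()([y, shortcut])                 # identity/shortcut connection
    return tf.keras.layers.ReLU()(y)                         # activation after the addition

out = spd_residual_block(tf.zeros([1, 56, 56, 64]), filters=128)  # 56x56 -> 28x28
```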
After modifying the backbone feature extraction network, we obtain three feature maps with different shapes, as shown in Figure 3. RetinaFace constructs a Feature Pyramid Network (FPN) from these three effective feature layers. First, 1 × 1 convolutions adjust the number of channels of the three feature layers; then, upsampling and element-wise addition (Add) are performed for feature fusion. Finally, three fused feature layers are obtained, and the SSH (Single Stage Headless) module is used to enlarge the receptive field. Prediction results are then produced from these three effective feature layers. The face detection part corresponds to the RetinaFace-SPD section in Figure 4. Finally, in view of the low-resolution images characteristic of surveillance scenarios, we reconstruct RetinaFace in TensorFlow to accelerate training.
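The fusion just described can be sketched as follows; the output channel width and the nearest-neighbor upsampling are illustrative assumptions, and the SSH context modules and prediction heads that would follow each level are omitted.

```python
import tensorflow as tf

def build_fpn(c3, c4, c5, out_channels=256):
    """Top-down FPN fusion as described in the text: 1x1 convs to align channels,
    upsampling, and element-wise addition. Channel width and upsampling mode are
    illustrative assumptions; SSH modules would follow each output level."""
    conv1x1 = lambda name: tf.keras.layers.Conv2D(out_channels, 1,
                                                  padding="same", name=name)
    p5 = conv1x1("p5_in")(c5)
    p4 = conv1x1("p4_in")(c4)
    p3 = conv1x1("p3_in")(c3)

    p4 = tf.keras.layers.Add()([p4, tf.keras.layers.UpSampling2D(2)(p5)])  # fuse C5 into C4
    p3 = tf.keras.layers.Add()([p3, tf.keras.layers.UpSampling2D(2)(p4)])  # fuse C4 into C3
    return [p3, p4, p5]   # fed to the SSH context modules and prediction heads
```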
2.2. Quality-Adaptive Margin for Face Recognition and Motion Occlusion Issues
Face classification differs from general object classification because of the challenging distinction between intra-class and inter-class feature variations in practical applications. The biggest challenge in large-scale face classification is optimizing the loss function to enhance intra-class compactness and inter-class separability for highly similar faces [42]. The Additive Angular Margin Loss (ArcFace) aims to enhance the discriminative power of the learned deep features, thereby maximizing the separability of face classes [9]. However, it relies heavily on clear face images and low-noise scenarios, leading to poor identity recognition performance in surveillance settings, where the blurring and degradation of face images in low-quality images and videos cause relevant identity information to be lost. AdaFace [21] proposes an image quality-adaptive loss function that assigns different weights to samples of varying difficulty according to image quality. It adapts the margin function based on the observation that the angular margin scales with training difficulty, emphasizing hard samples when image quality is high and ignoring extremely hard samples when image quality is low. Moreover, it requires no additional module to estimate image quality but directly uses the feature norm, whose correlation with image quality is stronger than that of the probability output.
AdaFace adjusts its margin function adaptively based on an image quality indicator, as illustrated in the face classification section of Figure 4. When image quality is low, it does not emphasize hard samples, whereas when image quality is high, it emphasizes them. By using a margin-based loss function, the learned features are made sufficiently discriminative. The model automatically assesses image quality and differentiates between high-quality and low-quality images during recognition by assigning higher gradient scales to high-norm features far from the decision boundary and to low-norm features close to the decision boundary.
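The snippet below sketches the quality-adaptive margin as we understand it from the AdaFace paper [21]: the feature norm, normalized with running batch statistics, interpolates between an angular and an additive margin on the ground-truth logit. The hyperparameter values mirror the authors' reported defaults, and the running statistics are passed in as constants for brevity; none of this is the exact training configuration of this paper.

```python
import tensorflow as tf

def adaface_logits(embeddings, labels, weights, m=0.4, h=0.33, s=64.0,
                   batch_mean=20.0, batch_std=100.0):
    """Sketch of the AdaFace quality-adaptive margin [21] (our reading of the paper).

    The feature norm acts as an image-quality proxy; it is normalized with
    (running) batch statistics and used to set an angular margin g_angle and an
    additive margin g_add on the ground-truth class logit.
    """
    norms = tf.norm(embeddings, axis=1, keepdims=True)                     # ||z||, quality proxy
    z = tf.clip_by_value((norms - batch_mean) / (batch_std / h), -1., 1.)  # normalized quality

    cos = tf.matmul(tf.math.l2_normalize(embeddings, axis=1),
                    tf.math.l2_normalize(weights, axis=0))                 # cos(theta), all classes
    theta = tf.acos(tf.clip_by_value(cos, -1. + 1e-7, 1. - 1e-7))

    g_angle = -m * z                                                       # adaptive angular margin
    g_add = m * z + m                                                      # adaptive additive margin
    target = tf.cos(theta + g_angle) - g_add                               # margin on true class

    onehot = tf.one_hot(labels, depth=tf.shape(weights)[1])
    return s * tf.where(tf.cast(onehot, tf.bool), target, cos)             # scaled logits for softmax CE
```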
For laboratory surveillance footage with multiple individuals and low resolution, issues such as re-identification and recognition failure can arise when moving individuals occlude the recognition targets. To address these issues, the ByteTrack multi-object tracking algorithm is incorporated into the existing AdaFace face recognition pipeline. The ByteTrack algorithm takes as input a video sequence, a detector, and a detection score threshold. As shown in Figure 5, it outputs the trajectories of the video, with each frame containing the bounding boxes and IDs of the objects. For each frame, the detector (Det) first predicts bounding boxes and detection scores. Based on the detection score threshold, the bounding boxes are divided into two groups: Det(high) and Det(low). After this separation, a Kalman filter predicts the new position of each trajectory T in the current frame. The Intersection over Union (IoU) between the detected and predicted bounding boxes is computed, and the Hungarian algorithm then performs matching on the IoU cost, returning matched and unmatched trajectories. Instead of directly discarding low-score bounding boxes, which may result from occlusion, the algorithm performs a second round of matching with these low-score boxes, mitigating the ID switching caused by occlusion during tracking. This avoids having to re-match already recognized faces when they are temporarily occluded.
When dealing with severely occluded and overlapping trajectories, a series of strategies are employed:
1. The detection boxes in ByteTrack are classified by confidence into high-confidence and low-confidence groups. High-confidence boxes are used for the initial matching, while low-confidence boxes are used in the subsequent matching. Low-confidence detection boxes that have not been deleted continue to participate in the evaluation of subsequent frames, which helps maintain tracking even when the target is occluded. When a target is severely occluded or overlapped, the confidence of its detection box may decrease; by retaining low-confidence boxes and reassessing their status in later frames, ByteTrack helps recover the identity of occluded or overlapping targets.
2. ByteTrack assigns a life cycle to each trajectory. If a trajectory does not match any detection box within a certain period, it is deleted, avoiding fragmented trajectories caused by false or missed detections. For detection boxes that match no trajectory but have sufficiently high confidence, a new trajectory is created, which helps resume tracking after a target reappears from occlusion.
3. ByteTrack uses a Kalman filter to predict the motion of tracked objects, so the target's position in the next frame can be predicted to assist tracking even when the target is occluded or temporarily disappears.
4. ByteTrack minimizes the use of ReID models; rather than relying on appearance-based identity matching, it relies mainly on the positional overlap and motion continuity between detection boxes and trajectories.
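A simplified sketch of this two-stage association is given below; it uses SciPy's Hungarian solver on an IoU cost, and each track is assumed to expose a `predicted_box` attribute standing in for the Kalman filter prediction. Track creation, deletion, and the life-cycle bookkeeping described above are omitted, so this is an illustration of the matching logic rather than a full ByteTrack implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    a = np.asarray(boxes_a, dtype=float)[:, None, :]
    b = np.asarray(boxes_b, dtype=float)[None, :, :]
    ix1, iy1 = np.maximum(a[..., 0], b[..., 0]), np.maximum(a[..., 1], b[..., 1])
    ix2, iy2 = np.minimum(a[..., 2], b[..., 2]), np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / np.clip(area_a + area_b - inter, 1e-6, None)

def associate(tracks, det_boxes, det_scores, high_thr=0.6, iou_thr=0.3):
    """Two-stage ByteTrack-style association (simplified sketch).

    `tracks` is a list of objects exposing `predicted_box` (from a Kalman filter
    in a real system). High-score detections are matched first; leftover tracks
    get a second chance against low-score detections instead of those boxes
    being discarded.
    """
    high = [i for i, s in enumerate(det_scores) if s >= high_thr]
    low = [i for i, s in enumerate(det_scores) if s < high_thr]
    unmatched_tracks = list(range(len(tracks)))
    matches = []

    for det_idx in (high, low):                        # stage 1: high scores, stage 2: low scores
        if not det_idx or not unmatched_tracks:
            continue
        pred = [tracks[t].predicted_box for t in unmatched_tracks]
        cost = 1.0 - iou_matrix(pred, [det_boxes[d] for d in det_idx])
        rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm on IoU cost
        still_unmatched = set(unmatched_tracks)
        for r, c in zip(rows, cols):
            if cost[r, c] <= 1.0 - iou_thr:            # accept only sufficiently overlapping pairs
                matches.append((unmatched_tracks[r], det_idx[c]))
                still_unmatched.discard(unmatched_tracks[r])
        unmatched_tracks = sorted(still_unmatched)

    matched_dets = {d for _, d in matches}
    unmatched_dets = [i for i in high if i not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets   # new tracks spawn from unmatched high-score dets
```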
This paper proposes a multi-face recognition method tailored to laboratory surveillance scenarios. As shown in Figure 4, images of individuals captured in a self-constructed laboratory setting are input into the designed face recognition system for automatic processing. The obtained images are fed into the improved backbone feature extraction network for feature extraction; face detection is then performed through the Feature Pyramid Network (FPN) and the Single Stage Headless (SSH) detection modules, which locate the faces. AdaFace is used to classify the detected faces and match them against those recorded in the database. Finally, multi-object tracking is employed to track the classified faces, preventing issues such as misrecognition and re-identification caused by occlusion and the movement of individuals.