1. Introduction
Face recognition is one of the most widely adopted biometric techniques due to its non-intrusive nature and its applicability in authentication, surveillance, security, and human–computer interaction. Despite its growing importance, traditional face recognition systems still struggle to maintain high accuracy in unconstrained environments. Variations in pose, illumination, occlusion, and facial expression often distort facial appearance and affect the stability of extracted features, limiting system performance outside controlled acquisition conditions.
In parallel, Artificial Intelligence (AI) and Machine Learning (ML) are increasingly integrated into modern sectors such as education, healthcare, cybersecurity, and environmental monitoring, enhancing the adaptability and efficiency of intelligent systems. In education in particular, the expansion of online learning, virtual classrooms, and remote examinations has created new opportunities as well as new challenges. Ensuring reliable student identity verification and monitoring attentiveness during remote assessments have become essential to preserving academic integrity. However, existing recognition systems frequently underperform in real-world scenarios due to insufficient robustness and limited training data.
Deep learning approaches, especially Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in image-based tasks thanks to their ability to learn hierarchical and discriminative features directly from pixel data. Nevertheless, their generalization capabilities strongly depend on the availability of large and diverse training datasets. Controlled datasets such as the Extended Yale B (cropped) database remain valuable benchmarks but lack the variability required for training highly resilient models. Generative Adversarial Networks (GANs) offer a promising solution by generating realistic synthetic samples that replicate challenging facial variations, thereby enriching the training process and improving robustness.
Beyond accuracy, interpretability also plays a key role in automated proctoring systems. Providing transparent confidence indicators helps instructors assess the reliability of identity recognition during online examinations. Fuzzy logic addresses this need by transforming numerical CNN confidence values into intuitive linguistic categories such as Low, Medium, and High attentiveness.
Motivated by these challenges, this work introduces a sequential GAN–CNN–Fuzzy framework designed to enhance both the robustness and interpretability of face recognition in e-learning environments. The proposed model integrates CNN-based feature extraction, GAN-based data augmentation, and fuzzy inference to deliver accurate identity recognition as well as human-readable concentration assessments. Using the Extended Yale B (cropped) dataset, we demonstrate that the proposed system effectively handles variations in pose and illumination while providing meaningful confidence indicators suitable for real-world educational applications. The novelty of this work lies not in proposing a new CNN architecture, but in integrating a conventional CNN into a sequential GAN–CNN–Fuzzy framework for robust identity recognition and interpretable attentiveness analysis.
To this end, the present paper is organized as follows: Section 2 presents the problem statement, while Section 3 reviews the related works. Section 4 introduces the preliminaries and fundamental concepts underlying the proposed approach. Section 5 details the methodology and the implementation of the face recognition model, and Section 6 reports the experimental results. Section 7 provides a discussion of the findings, and finally, Section 8 concludes the paper and outlines future research directions.
3. Related Works
Face recognition has been widely studied using different approaches. Traditional methods relied on handcrafted features such as Principal Component Analysis (PCA), Local Binary Patterns (LBPs), and Histogram of Oriented Gradients (HOG) [1,2,3,4,5]. These techniques achieved good results in controlled conditions but lacked robustness against variations in pose, illumination, and occlusion.
With the rise of deep learning, Convolutional Neural Networks (CNNs) have shown remarkable performance in face recognition tasks [6,7,8,9]. CNNs can automatically extract hierarchical features from raw images, enabling state-of-the-art systems such as DeepFace [6], FaceNet [7], VGGFace [8], and ArcFace [9] to achieve near-human accuracy. However, CNNs require large-scale datasets to generalize well. In smaller or more controlled datasets such as Yale B, they are prone to overfitting, which limits their applicability in real-world scenarios.
To address this limitation, Generative Adversarial Networks (GANs) have been applied to generate synthetic training samples and increase data diversity [10,11,12]. The original GAN framework [10] and its later extensions, such as DR-GAN for pose-invariant recognition [11] and StarGAN for multi-domain facial variations [12], have demonstrated their ability to enrich datasets with realistic synthetic images, improving robustness to pose and illumination changes. In parallel, fuzzy logic has been explored in biometric systems to provide interpretable decision-making and confidence estimation [13,14,15]. By mapping prediction scores into linguistic categories, fuzzy systems make machine decisions more transparent and human-readable.
Indeed, the paper [16] proposes a method that combines synthetic data generated by Generative Adversarial Networks (GANs) with real facial datasets to enhance the accuracy and generalization of facial expression recognition models. The authors demonstrate that this hybrid training strategy significantly reduces overfitting and improves performance on both synthetic and real-world datasets. In Ref. [17], the authors present a three-stage GAN-based training approach to address multi-view facial expression recognition challenges, particularly pose variation. Their method integrates a pre-trained facial expression classifier into a generative framework, allowing for the synthesis of high-quality frontal faces while maintaining the original expression information, thereby improving recognition accuracy.
Moreover, the paper [18] introduces an innovative online learning platform that monitors students’ attention and emotions in real time. The proposed system integrates three deep learning models: ResNet50 for facial feature extraction, CBAM (Convolutional Block Attention Module) to focus on relevant facial regions, and TCN (Temporal Convolutional Networks) to analyze temporal changes in expressions. This architecture provides reliable insights into learners’ engagement and emotional dynamics in virtual classrooms.
The work [19] proposes a hybrid CNN model for recognizing cognitive states of learners from facial expressions in e-learning environments. This model combines handcrafted and CNN-extracted features to achieve high accuracy across multiple datasets, demonstrating its effectiveness for cognitive engagement recognition. Furthermore, the study [20] presents an intelligent hybrid system that combines Convolutional Neural Networks and fuzzy logic to interpret both cognitive and emotional responses of students. In this approach, CNNs are used for facial expression detection, while the fuzzy inference system determines appropriate learning levels based on emotions and performance, leading to a more adaptive and human-centered learning process.
In a recent study [21], the authors proposed a real-time visual attention estimation method based on evolving neuro-fuzzy models, demonstrating the relevance of fuzzy inference for interpreting human visual behavior. Their work highlights the potential of fuzzy logic to improve explainability in computer vision systems, especially in tasks involving attentiveness assessment.
Finally, the paper [22] focuses on the role of emotion recognition in distance learning, where direct teacher–student interactions are limited. The authors develop a CNN-based model trained on the FER2013 dataset to classify seven basic emotions in real time. Through data preprocessing and early stopping techniques, the model achieves robust and efficient performance suitable for intelligent online learning systems.
However, most previous works have treated these methods separately. To the best of our knowledge, no prior study has combined CNNs, GAN-based augmentation, and fuzzy logic into a single framework for both student identity verification and concentration measurement in online learning environments. This is the research gap that our work addresses.
5. Methodology
This section presents the proposed sequential GAN–CNN–Fuzzy framework designed to achieve robust face recognition and interpretable attentiveness estimation in e-learning environments. The methodology integrates four major components: (1) dataset preparation and preprocessing, (2) GAN-based data augmentation, (3) CNN-based identity recognition, and (4) a fuzzy logic system to map CNN confidence scores into qualitative attentiveness levels. The overall workflow is illustrated in Figure 3.
5.1. Dataset and Preprocessing
The experiments were conducted using the Extended Yale B (cropped) dataset, which includes 16,128 grayscale facial images of 28 subjects, captured under 64 illumination conditions and 9 pose variations. Each face image was automatically cropped to 192 × 168 pixels using a face detection algorithm, ensuring that only relevant facial regions were preserved.
Before feeding the images to the CNN, several preprocessing steps were applied:
All images were converted to grayscale (if not already) and resized to 128 × 128 × 1. This standardized format accelerates training and ensures fixed-size input tensors.
Haar Cascade classifiers were used to detect facial bounding boxes. Each detected face was tightly cropped to remove background noise, ensuring that the CNN focuses entirely on discriminative facial patterns.
Pixel values were normalized to the range [0, 1] by dividing each 8-bit intensity by 255 (I′ = I/255). This reduces illumination sensitivity and stabilizes gradient propagation during training.
To enhance robustness, real-time augmentation was applied:
Random rotations (±10°).
Horizontal flipping.
Width/height shifts.
Zoom in/out.
Contrast adjustments.
These transformations mimic natural variations encountered in real environments and reduce overfitting.
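The intensity-scaling and tensor-shaping steps above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming 8-bit grayscale inputs that have already been detected, cropped, and resized to 128 × 128 (those earlier steps, and the real-time augmentation, are omitted here):

```python
import numpy as np

def preprocess_face(image: np.ndarray) -> np.ndarray:
    """Normalize an 8-bit grayscale face crop and add a channel axis.

    `image` is assumed to be a 2-D uint8 array already resized to
    128 x 128; detection and cropping are handled upstream.
    """
    x = image.astype(np.float32) / 255.0   # scale intensities to [0, 1]
    return x[..., np.newaxis]              # shape (128, 128, 1) for the CNN

# Example with a dummy 128 x 128 frame
frame = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
tensor = preprocess_face(frame)
print(tensor.shape)  # (128, 128, 1)
```

The explicit channel axis matters because the CNN described in Section 5.3 expects fixed-size 128 × 128 × 1 input tensors.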
To provide a clear overview of the dataset used throughout this study, we summarize the distribution of the Extended Yale B (cropped) images, along with the number of GAN-generated samples incorporated into training. As shown in Table 1, the dataset is divided into training and testing partitions, and the synthetic images are included in controlled proportions to maintain stability while enhancing data diversity.
5.2. GAN-Based Data Augmentation
To address dataset limitations and enhance model robustness, a Generative Adversarial Network (GAN) was trained to generate synthetic facial images representing variations difficult to capture manually. The architecture is shown in Figure 4.
To increase dataset diversity and improve the robustness of the recognition system, a Generative Adversarial Network (GAN) was trained offline to generate additional facial images consistent with the visual characteristics of the Extended Yale B dataset. The generator takes a 100-dimensional latent noise vector and transforms it into a grayscale facial image of size 128 × 128 × 1.
This transformation begins with a dense projection and reshape operation that produces an initial spatial tensor. The tensor is then refined through a sequence of transposed convolution layers, each progressively enhancing the spatial resolution and visual quality of the synthesized face:
Dense + Reshape to initialize the spatial representation.
Conv2DTranspose (128 filters, ReLU) to introduce coarse structural patterns.
Conv2DTranspose (64 filters, ReLU) to enhance mid-level textures and shading.
Conv2DTranspose (1 filter, Sigmoid) to generate the final normalized grayscale image.
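The spatial-size arithmetic behind this upsampling path can be checked with plain Python. The seed resolution (16 × 16), kernel size (4), and stride (2) below are illustrative assumptions, since the text reports only the layer types and filter counts; with 'same' padding and stride 2, three transposed convolutions grow the seed tensor to the target 128 × 128 resolution:

```python
def conv2d_transpose_size(n: int, kernel: int, stride: int, padding: str) -> int:
    """Output length along one axis of a 2-D transposed convolution
    (following the usual Keras size conventions)."""
    if padding == "same":
        return n * stride
    return (n - 1) * stride + kernel  # 'valid' padding

# Hypothetical path: Dense + Reshape to a 16 x 16 seed tensor, then
# three stride-2, 'same'-padded transposed convolutions.
size = 16
for _ in range(3):
    size = conv2d_transpose_size(size, kernel=4, stride=2, padding="same")
print(size)  # 16 -> 32 -> 64 -> 128
```

Any seed/stride combination whose product of strides equals 8 would reach 128 from 16; the values here are only one consistent configuration.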
The resulting synthetic samples exhibit realistic variations in illumination, shadows, and subtle pose deviations, providing additional variability that is not fully captured in the original dataset. Although the discriminator is not depicted in Figure 4, it plays a central role in the adversarial learning process by distinguishing real Yale B images from generated ones, thereby guiding the generator toward higher realism.
The GAN was trained for 200 epochs with a batch size of 64, using the Adam optimizer (lr = 0.0002, β1 = 0.5). Training alternated between updating the discriminator with balanced batches of real and synthetic images, and updating the generator to improve its ability to fool the discriminator. Once convergence was reached, the generator produced synthetic faces of sufficient quality to be incorporated into the CNN training dataset.
Because the generated images already match the required 128 × 128 normalized grayscale format, no additional preprocessing was necessary. Approximately 10% of the final training data consisted of GAN-generated samples. This controlled integration significantly increased dataset variability and contributed to the improved generalization performance of the CNN under challenging illumination and pose conditions.
To illustrate the variations produced by the trained GAN, Figure 5 presents several synthetic face images generated during the augmentation process, showing changes in illumination, pose, and shading that closely resemble the real Yale B samples.
These synthetic samples were subsequently integrated into the training set, expanding the intra-class variability and improving the robustness of the CNN classifier.
5.3. CNN Architecture for Identity Recognition
The Convolutional Neural Network (CNN) is the core classifier responsible for recognizing each student’s identity. The complete architecture is shown in Figure 6.
The Convolutional Neural Network (CNN) constitutes the core classifier responsible for distinguishing the 28 subjects in the Extended Yale B dataset. Each normalized grayscale image of size 128 × 128 × 1 is processed through a hierarchical sequence of convolutional operations designed to extract discriminative facial features.
The feature extraction stage relies on four convolutional layers equipped with 3 × 3 filters and ReLU activation. As the depth increases—32, 64, 128, and 256 filters—the network gradually transitions from detecting low-level structures to encoding more abstract and semantically rich representations. A 2 × 2 max-pooling layer follows each convolutional block to reduce spatial resolution while preserving salient patterns.
Conv2D (32, 3 × 3, ReLU) captures low-level edges and corners.
Conv2D (64, 3 × 3, ReLU) extracts intermediate contours and textures.
Conv2D (128, 3 × 3, ReLU) encodes deeper structural patterns.
Conv2D (256, 3 × 3, ReLU) learns high-level semantic features.
MaxPooling (2 × 2) applied after each block to compress spatial information.
After convolutional processing, the feature maps are flattened into a one-dimensional representation and passed through fully connected layers. A Dense (512, ReLU) layer consolidates the extracted features, while a final Softmax layer of 28 units produces the class-wise identity probabilities.
Flatten: vectorizes feature maps.
Dense (512, ReLU): high-level identity representation.
Dense (28, Softmax): probability distribution for the 28 subjects.
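As a sanity check on this design, the feature-map dimensions can be traced with simple arithmetic. The padding mode is not stated in the text, so the sketch assumes 'same'-padded convolutions, in which case only the 2 × 2 max-pooling layers change the spatial resolution:

```python
def cnn_feature_shapes(input_size=128, filters=(32, 64, 128, 256)):
    """Spatial size after each conv block, assuming 'same'-padded
    convolutions followed by a 2 x 2 max-pool."""
    size, shapes = input_size, []
    for f in filters:
        size //= 2                     # each 2x2 max-pool halves the resolution
        shapes.append((size, size, f))
    return shapes

shapes = cnn_feature_shapes()
print(shapes)   # [(64, 64, 32), (32, 32, 64), (16, 16, 128), (8, 8, 256)]
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(flat)     # number of flattened features entering Dense(512)
```

Under these assumptions the Flatten layer would emit 8 × 8 × 256 = 16,384 features, which the Dense(512) layer then compresses into the identity representation.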
To stabilize learning and prevent overfitting, the CNN integrates Batch Normalization and a Dropout rate of 0.1 within its architecture. The model was trained for 150 epochs, with a batch size of 32, using the Adam optimizer (learning rate = 0.001) and the categorical cross-entropy loss function. Early stopping was applied based on validation performance to prevent unnecessary training iterations.
Batch Normalization: stabilizes internal activations.
Dropout = 0.1: limits overfitting.
Optimizer: Adam (lr = 0.001).
Loss function: categorical cross-entropy.
Batch size: 32.
Epochs: 150 + early stopping (validation-based).
To improve robustness, the CNN was trained on an enriched dataset combining preprocessed real images from the Extended Yale B database with synthetic samples generated by the GAN. The GAN images accounted for approximately 10% of the total dataset and introduced variations in illumination, pose, and shadowing that significantly enhanced generalization.
5.4. Fuzzy Logic for Attentiveness Assessment
To enhance interpretability, a Fuzzy Logic stage translates CNN confidence scores into human-understandable categories (Low, Medium, High). The architecture is shown in Figure 7.
The confidence score produced by the CNN for each input image reflects the reliability of the identity prediction. In the context of remote learning, this confidence value carries an additional semantic dimension: high and stable confidence levels generally occur when the student’s face is fully visible, well illuminated, and consistently detected across frames, whereas low values are often associated with partial occlusions, abrupt head movements, or poor lighting conditions. To make this information more interpretable for practical monitoring scenarios, a fuzzy logic module was introduced, following the conceptual structure illustrated in Figure 7.
The numerical confidence score, which ranges from 0 to 1, is first transformed into fuzzy linguistic variables. Three triangular membership functions were defined to represent varying degrees of attentiveness:
Scores below 0.60 are associated with Low attentiveness;
Scores between 0.60 and 0.85 correspond to Medium attentiveness;
Scores above 0.85 indicate High attentiveness.
This mapping smooths the transitions between categories and reduces the impact of minor confidence fluctuations that may arise from transient facial variations.
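The fuzzification step above can be sketched with NumPy. Only the triangular shape of the membership functions and the approximate boundaries (0.60 and 0.85) come from the text; the exact vertex positions below are illustrative assumptions:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function: feet at a and c, peak 1.0 at b."""
    return float(np.interp(x, [a, b, c], [0.0, 1.0, 0.0]))

def fuzzify(score):
    """Map a CNN confidence score in [0, 1] to linguistic memberships.

    Vertex positions are illustrative; the paper fixes only the
    category boundaries near 0.60 and 0.85.
    """
    return {
        "Low":    tri(score, -0.60, 0.00, 0.60),
        "Medium": tri(score,  0.55, 0.725, 0.90),
        "High":   tri(score,  0.80, 1.00, 1.20),
    }

memberships = fuzzify(0.72)
print(max(memberships, key=memberships.get))  # Medium
```

Because the triangles overlap near the boundaries, a score such as 0.62 retains partial membership in both Low and Medium, which is exactly the smoothing behavior described above.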
The threshold values were determined empirically based on the behavior observed during validation. High confidence scores were consistently linked to situations where the student’s face was stable and clearly captured by the camera. Intermediate scores typically emerged when the student looked away briefly, leaned slightly to one side, or momentarily exited the optimal detection zone. Conversely, low scores frequently occurred in configurations where the face was only partially visible, heavily shadowed, or intermittently lost by the detection module. By grounding the thresholds in these practical observations, the fuzzy classification aligns more closely with real conditions encountered in online examination settings.
Once the confidence score is fuzzified, the system evaluates a set of simple but effective if–then rules to determine the attentiveness category:
If the confidence is High, then the attentiveness level is High.
If the confidence is Medium, then the attentiveness level is Medium.
If the confidence is Low, then the attentiveness level is Low.
These rules encode the direct relationship between visual stability and the inferred level of concentration.
The inference engine aggregates the outputs of all rules activated by the current input. This combined fuzzy output is subsequently translated into a crisp attentiveness value through the centroid defuzzification method. The resulting score, expressed within the interval [0, 1], provides a continuous and interpretable measure of attentiveness.
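The aggregation and centroid defuzzification described above can be sketched as a small Mamdani-style inference step. The output membership functions and the discretization of the universe are illustrative assumptions; only the use of triangular sets and centroid defuzzification comes from the text:

```python
import numpy as np

# Discretized universe for the crisp attentiveness output in [0, 1].
u = np.linspace(0.0, 1.0, 1001)

def tri(x, a, b, c):
    """Vectorized triangular membership with feet a, c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Output fuzzy sets; vertex placement is illustrative, not from the paper.
OUT = {
    "Low":    tri(u, -0.60, 0.00, 0.60),
    "Medium": tri(u,  0.55, 0.725, 0.90),
    "High":   tri(u,  0.80, 1.00, 1.20),
}

def attentiveness(activations):
    """Clip each output set by its rule activation, aggregate with max,
    then defuzzify with the centroid method."""
    agg = np.zeros_like(u)
    for label, strength in activations.items():
        agg = np.maximum(agg, np.minimum(strength, OUT[label]))
    if agg.sum() == 0.0:
        return 0.0          # degenerate case: no rule fired
    return float((u * agg).sum() / agg.sum())

# A frame whose confidence mostly activates the "High" rule:
score = attentiveness({"Low": 0.0, "Medium": 0.2, "High": 0.9})
print(round(score, 3))  # crisp score in the upper part of [0, 1]
```

The centroid of the aggregated shape moves continuously as rule activations change, which is why small confidence fluctuations do not cause abrupt jumps in the attentiveness output.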
The final output of the fuzzy logic module is an attentiveness label—Low, Medium, or High—that complements the CNN’s identity recognition. While it does not modify the identity prediction, it offers a meaningful interpretation of how consistently the system was able to observe the student’s face. This additional layer is particularly relevant in e-learning contexts, where monitoring student engagement is an essential component of remote assessment integrity.
5.5. Training Strategy
To balance stability and diversity:
GAN images were incorporated at a controlled ratio (≈10%).
Real-time augmentation was used during all training phases.
Validation metrics included accuracy, precision, recall, F1-score, ROC–AUC, PR–AUC, and confusion matrices.
Saliency maps were used to visualize which facial regions the CNN relied on most.
This holistic training strategy ensures both high recognition accuracy and strong generalization under real-world variability.
5.6. Computational Study
All experiments were conducted on Google Colab to ensure reproducibility and accessibility. The computing hardware consisted of an NVIDIA Tesla T4 GPU with 16 GB VRAM, 25 GB system RAM, and an Intel Xeon CPU operating at 2.2 GHz. The proposed framework was implemented using TensorFlow and Keras with CUDA acceleration. This cloud-based configuration represents a realistic reference for evaluating the training and inference performance of the proposed GAN–CNN–Fuzzy framework in practical e-learning scenarios. The GAN and CNN–Fuzzy components were trained offline, while inference was performed online using the CNN–Fuzzy module.
The GAN module demonstrated moderate computational complexity and stable convergence, facilitated by controlled image resolution and adversarial learning stability. The CNN feature extractor combined with the fuzzy inference system achieved low inference latency, enabling real-time processing of student video streams. Importantly, the fuzzy logic module introduced negligible computational overhead, as it operates on low-dimensional CNN features rather than raw images.
From a deployment and scalability perspective, the proposed framework is computationally efficient and well suited for real-time online examination systems. The achieved inference speed allows simultaneous monitoring of a large number of students without violating real-time constraints. Moreover, since the GAN is used exclusively during training, the runtime memory footprint remains small, making the framework suitable for both cloud-based and edge deployments. These observations confirm that the proposed architecture successfully balances high recognition accuracy with practical computational requirements.
As summarized in Table 2, the total training time of the framework is under 1.5 h, while the average inference time per frame is approximately 13.2 ms, corresponding to about 75 frames per second. This performance is well below the real-time threshold of 33 ms per frame and ensures continuous monitoring in online proctoring environments. In addition, the compact model size and offline usage of the GAN further reduce memory overhead at deployment, reinforcing the applicability of the proposed system to scalable real-world e-learning platforms.
For the fuzzy inference stage, three triangular membership functions were selected due to their simplicity, computational efficiency, and interpretability. These membership functions enable a smooth mapping of CNN confidence scores into human-understandable concentration levels (low, medium, and high), while allowing for gradual transitions between categories for borderline cases. Such behavior is particularly appropriate in real-time educational contexts, where abrupt changes in attentiveness labels may be misleading.
Although alternative designs using trapezoidal membership functions could also be considered, triangular functions provide a comparable representation of uncertainty with fewer parameters and simpler implementation. The centroid defuzzification method was adopted as a standard and interpretable technique to convert fuzzy membership values into a crisp attentiveness score. This choice ensures a consistent and reproducible mapping from CNN confidence outputs to concentration levels, while effectively handling uncertainty at class boundaries. Consequently, the fuzzy inference system enhances the reliability and explainability of attentiveness estimation in real-time online learning applications.
6. Results
6.1. Preprocessing Steps
Preprocessing is essential to prepare the Yale B face dataset for robust training and testing of the GAN–CNN–Fuzzy model. The steps include face detection, cropping, resizing, and normalization to ensure uniformity and improve feature extraction. In addition, data augmentation strategies are used to model real-life changes in pose, illumination, and expression, enlarging the variety of training samples.
The analysis is based on the Yale B-Cropped-Full dataset, which provides 16,128 cropped face images derived from the Yale B Face Database. Each image has been automatically cropped to 192 × 168 pixels using a face detection algorithm. The images were acquired with due consent and in accordance with ethical guidelines. The resulting collection covers a range of facial variations appropriate for training and evaluating the proposed GAN–CNN–Fuzzy model.
Cropped face images are then scaled to a standard size of 128 × 128 pixels to provide uniform input to the CNN. Normalization maps pixel intensity values into the range [0, 1], converting the images to floating point. This normalization helps stabilize neural network training, accelerates convergence, and keeps gradient magnitudes consistent throughout backpropagation.
To replicate natural real-world variation and improve the CNN’s ability to generalize, a set of real-time data augmentation methods is applied. These include rotation (random rotations within ±10° to account for head turns), horizontal flipping (to mirror faces), width/height shifts (small translations to compensate for imperfect framing), and zoom transformations (random zoom in/out to imitate changes in camera distance). This augmentation expands the effective training set without requiring additional data, reducing overfitting and improving robustness to pose and orientation variations.
Label encoding assigns each subject a distinct numerical label. The filenames (e.g., subject01.centerlight) are translated into integer identifiers that serve as the target classes for CNN training. This is necessary for the categorical classification task and guarantees compatibility with the Softmax output of the network.
Although the basic training set is composed of real images, synthetic images produced by a trained GAN can be added as an optional supplement. These generated images introduce additional variation in lighting, pose, and subtle facial expression, increasing model robustness. Their proportion is nevertheless kept under control so as not to destabilize CNN training.
The processed (cropped, scaled, normalized, and possibly augmented) images are divided into training and validation/test sets. A 90/10 split is used, so the model is evaluated on unseen samples while retaining enough data for training. Images are reshaped to include a single-channel dimension (128, 128, 1), appropriate for grayscale CNN input.
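The split-and-reshape step described above can be sketched as follows. The array names and the fixed random seed are placeholders; the block assumes the images have already been preprocessed into a single NumPy array:

```python
import numpy as np

def split_dataset(images, labels, val_ratio=0.10, seed=42):
    """Shuffle and split into ~90% training / ~10% held-out data,
    reshaping grayscale images to (N, 128, 128, 1)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    n_val = int(len(images) * val_ratio)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    x = images.reshape(-1, 128, 128, 1).astype(np.float32)
    return (x[train_idx], labels[train_idx]), (x[val_idx], labels[val_idx])

# Dummy stand-in for the preprocessed Yale B arrays
imgs = np.zeros((100, 128, 128), dtype=np.float32)
labs = np.arange(100) % 28
(train_x, train_y), (val_x, val_y) = split_dataset(imgs, labs)
print(train_x.shape, val_x.shape)  # (90, 128, 128, 1) (10, 128, 128, 1)
```

Shuffling before the split matters here because the Yale B filenames group images by subject and illumination, and a sequential split would bias the held-out set.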
6.2. Experiment
The results of the proposed GAN–CNN–Fuzzy framework demonstrate strong performance on the Extended Yale B dataset, achieving near-perfect classification accuracy. The CNN backbone shows a robust capability in extracting hierarchical facial features, ranging from low-level patterns such as edges and contours to higher-level facial structures including eyes, nose, and mouth.
The performance of the proposed GAN–CNN–Fuzzy framework was evaluated using multiple metrics to ensure robustness and interpretability. As shown in Figure 8, the model achieved a training accuracy of 98.23% and a validation accuracy of 98.42%, with a very low validation loss of 0.06, indicating good convergence and minimal overfitting. The fuzzy confidence layer consistently assigned High confidence (100%) to the predictions, which highlights the model’s reliability and interpretability. Furthermore, the framework reached 98.35% precision, 98.20% recall, and 98.27% F1-score, confirming its balanced performance across all evaluation measures. These results demonstrate that the system is not only accurate in recognizing student identities but also provides trustworthy confidence estimations, which is critical for online exam applications.
Training and validation curves indicate smooth convergence, with training and validation accuracies of 98.23% and 98.42%, respectively, as shown in Figure 9, meaning that the gap between training and validation performance is small. In parallel, Figure 10 illustrates the loss curves for training and validation, which decrease rapidly before reaching a stable plateau, confirming the absence of overfitting and the robustness of the model. This consistency shows that data augmentation, synthetic GAN images, and CNN feature extraction form a highly efficient combination. The model also sustains performance on limited datasets, demonstrating its practical relevance to situations where large-scale labeled facial datasets are unavailable.
The confusion matrix is an important element of model performance analysis when examining classification at the class level, as it reveals the trends of misclassification. As shown in Figure 11, it provides insight into how many samples are correctly predicted for each subject and where misclassifications occur. This allows us to identify subjects with generally high performance but occasional errors, highlighting cases where the model struggles. The Yale B Extended dataset is particularly challenging due to its diversity, with 28 subjects photographed under nine poses and 64 illumination conditions, generating 16,128 grayscale images (192 × 168 resolution). Such variability creates significant intra-class differences, especially under extreme lighting angles or non-frontal positions. Despite these challenges, our sequential GAN–CNN–Fuzzy framework achieved an exceptionally high recognition rate of 98.23% ± 1.23, outperforming prior approaches that often failed to handle illumination and pose variations.
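For readers reproducing this class-level analysis, a confusion matrix of the kind discussed here can be assembled in a few lines; the example uses 3 classes instead of 28 for brevity:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=28):
    """Row r, column c counts samples of true class r predicted as class c."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: one sample of class 0 is misclassified as class 1
y_true = [0, 0, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm.trace() / cm.sum())  # per-sample accuracy = 0.8
```

The diagonal holds the correctly classified counts per subject, so off-diagonal entries directly localize the misclassification trends discussed above.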
An analysis at a more detailed class level indicated that the model achieved a consistently high recognition rate in all 28 subject classes, with most of the subjects recording near-perfect classification. There were only a few misclassifications, generally under very extreme side poses combined with strong shadows, where faces were barely visible. The fuzzy confidence mechanism, even under those difficult circumstances, added further stability by classifying uncertain predictions as medium or low, thus signaling cases that may require secondary validation.
This combination of high accuracy and controlled uncertainty demonstrates the strength of the proposed architecture: the CNN layers captured local spatial textures such as edges and contours, the GAN-based augmentation enriched the training data to promote invariance to pose and illumination, and the fuzzy logic ensured interpretable confidence scores. The system therefore not only exceeded current performance standards but also exhibited strong scalability across all 28 subjects, with only slight degradation under the most challenging illumination–pose combinations. These results indicate that the model can be applied in real-world face recognition scenarios where accuracy under varying conditions is of critical importance.
While the confusion matrix provides a quantitative view of class-level performance and misclassification patterns, it remains important to understand which visual cues the model relies on when making its decisions. To address this, we further analyze the model’s interpretability through saliency maps, as presented in the following section.
The saliency maps, shown in Figure 12, give a pixel-wise representation of the areas within the face images that contribute most to the CNN’s prediction for a specific subject. These maps are computed by taking the gradient of the output class score with respect to the input pixels, and therefore indicate the areas where varying pixel intensity would have the greatest impact on that class. Bright values in the saliency map mark the regions of greatest importance, i.e., the facial features the network depends on for discrimination.
Across the selected images, the saliency maps consistently highlight the central facial landmarks: the eyes, nose, and mouth. These regions carry the most discriminative information, confirming that the CNN focuses on the most informative facial areas. This behavior also matches human intuition, since these are the same features people rely on when identifying one another. The consistency of the highlighted regions across subjects further indicates the robustness of the learned features.
Although the saliency maps highlight the specific facial regions that most influence the CNN’s predictions, it is also important to examine the intrinsic visual characteristics of the dataset itself. To this end, we further analyze the intensity distribution and edge density of the facial images, which help explain the structural variations that affect recognition performance.
The visualizations highlight important pixel-level attributes of the Yale B (cropped) dataset. The mean intensity plot captures the overall brightness trends of each subject’s images and reflects variations caused by illumination conditions or skin tone. Subjects with consistently higher mean values appear brighter, while those with lower values appear darker. Such differences can affect CNN feature extraction if not properly normalized, underscoring the importance of preprocessing.
The edge density plot, computed using the Canny edge detector, illustrates the proportion of high-frequency content such as facial contours, eyes, and mouths. As shown in
Figure 13, subjects with higher edge density exhibit more pronounced structural details, providing the CNN with stronger discriminative cues for recognition. In contrast, lower edge density is typically associated with smoother or low-contrast images, which contain fewer distinctive features and are therefore more challenging for recognition models.
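The two per-image statistics can be sketched as follows. The paper uses the Canny detector for edge density; to keep this example dependency-free, the sketch substitutes a plain gradient-magnitude threshold (the `thresh` value is an assumption), which captures the same idea of measuring high-frequency content:

```python
import numpy as np

def mean_intensity(img):
    """Average pixel brightness of one grayscale image (values in [0, 255])."""
    return float(img.mean())

def edge_density(img, thresh=30.0):
    """Fraction of pixels with strong local gradients. The paper uses the
    Canny detector; this sketch thresholds the gradient magnitude instead
    (threshold value assumed) to approximate the same quantity."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    return float((mag > thresh).mean())

# Toy image: dark left half, bright right half -> edges along the centre column.
img = np.zeros((64, 64))
img[:, 32:] = 200.0
```

On this toy image the mean intensity is 100 and only the narrow band of pixels around the centre boundary counts as edge content, mirroring how smooth, low-contrast faces yield low edge-density scores.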
After examining the confusion matrix and the pixel-level characteristics such as mean intensity and edge density, it is also important to evaluate how well the model separates the classes in a more formal way. For this reason, the ROC–AUC and PR–AUC scores presented in
Table 3 provide a clear measure of the model’s discrimination ability and its precision–recall behavior across all 28 subjects.
ROC and Precision–Recall (PR) curves are essential for evaluating the discrimination ability and confidence behavior of the model. The ROC curves obtained for each class show an AUC of 1.0, indicating essentially perfect separation between positive and negative cases. These curves capture the trade-off between true-positive and false-positive rates, and the consistently maximal AUC values across all subjects confirm that the CNN learned highly discriminative and robust feature embeddings.
A detailed representation of the AUC-ROC curve for each subject is shown in
Figure 14. An AUC close to 1.0 reflects a nearly perfect fit, while any noticeable deviation indicates a weaker predictive capacity. In our case, all classes reach the optimal value of 1.0, confirming the stability and strong performance of the proposed model. Moreover, the joint analysis of ROC and PR curves also emphasizes the effectiveness of GAN-generated samples in extending the coverage of the decision boundary. The introduction of variability through GAN augmentation helps the classifier avoid overfitting to specific poses or expressions, thereby enhancing generalization.
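The per-class, one-vs-rest AUC underlying these curves can be sketched with synthetic scores. The data below are illustrative stand-ins, not the paper’s predictions; the pairwise (Mann–Whitney) formulation makes explicit why perfectly separated scores yield an AUC of exactly 1.0:

```python
import numpy as np

def ovr_auc(class_scores, y_true, c):
    """One-vs-rest ROC-AUC for class c in its pairwise (Mann-Whitney) form:
    the probability that a randomly chosen positive sample outranks a
    randomly chosen negative one (ties counted as half)."""
    pos = class_scores[y_true == c]
    neg = class_scores[y_true != c]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

rng = np.random.default_rng(1)
n_samples, n_classes = 200, 4               # small stand-in for the 28 subjects

y_true = rng.integers(0, n_classes, n_samples)
# Simulated well-separated classifier scores: the true class always wins.
scores = rng.random((n_samples, n_classes)) * 0.3
scores[np.arange(n_samples), y_true] += 0.7

per_class_auc = [ovr_auc(scores[:, c], y_true, c) for c in range(n_classes)]
macro_auc = float(np.mean(per_class_auc))   # 1.0 under perfect separation
```

In the paper’s setting, the same computation is applied to each of the 28 subject classes using the CNN’s SoftMax scores.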
After confirming the excellent separability of classes with the ROC curves, the analysis is complemented by the Precision–Recall curves, which focus on the balance between precision and recall.
The variation in precision and recall for each subject is shown in
Figure 15. Curves that lie closer to 1.0 indicate strong performance in correctly identifying positive cases. As observed, most classes achieve values close to the optimum, although subject 3 shows a weaker curve, mainly due to the smaller number of training samples available, which reduced the classifier’s ability to generalize.
In parallel, the integration of the fuzzy layer increases interpretability, allowing operators to evaluate not only the accuracy of the predictions but also their reliability and degree of confidence. This combination of quantitative evaluation (precision/recall) and qualitative assessment (fuzzy interpretability) makes the system more reliable and transparent for practical deployment.
6.3. Ablation Study
The ablation experiment quantifies the contribution of each model component. The baseline CNN achieved 92.5% accuracy, showing reasonable feature extraction but suffering from overfitting and weak generalization. GAN-based augmentation alone boosted accuracy to 97.4%, indicating that synthetic data substantially improve the model’s ability to handle facial variability. Adding only the Fuzzy Logic stage to the baseline CNN produced an accuracy of 95.6%, suggesting that fuzzy scoring increases interpretability with modest gains, but without replacing the need for diverse training data.
The full GAN–CNN–Fuzzy model, combining GAN augmentation and fuzzy logic, reached a near-perfect accuracy of 98.42%, demonstrating the combined effect of data diversity and interpretable confidence scoring. This shows that although CNNs offer powerful feature extraction, both data augmentation and fuzzy decision layers are necessary to ensure reliability and practical utility. The ablation experiment also confirms that each module tackles a specific problem: GANs reduce overfitting, while fuzzy scoring clarifies the model’s confidence.
The predicted versus actual classes, along with the associated confidence scores, are shown in
Figure 16. In several cases, the model assigned a confidence of 100%, reflecting very high certainty, although some of these high-confidence predictions were incorrect. This observation highlights that a high confidence score does not necessarily guarantee correct classification. Such cases emphasize the importance of interpretable confidence mechanisms, as they help identify situations where the model may be overconfident despite making errors.
The proposed sequential GAN–CNN–Fuzzy model demonstrates strong identity recognition performance across the five evaluated subjects, achieving an overall F1-score of 98.3%. As shown in
Table 4, misclassifications mainly occur in challenging conditions such as extreme illumination or atypical head poses, which remain difficult cases despite data augmentation. The GAN-generated synthetic samples effectively enrich the training set, enabling the CNN to generalize better to such variations, while the fuzzy inference system provides interpretable confidence estimates for identity recognition.
The fuzzy inference system further converts CNN confidence scores into human-understandable concentration levels (Low, Medium, and High), achieving high precision, recall, and F1-scores across all categories, as summarized in
Table 5. Most misclassifications are observed between the Medium and High concentration levels, which can be attributed to subtle head movements or minor eye motions. By incorporating behavioral cues such as gaze orientation, head steadiness, and facial dynamics, the proposed system becomes more robust than approaches relying solely on CNN confidence.
The high and balanced performance of the fuzzy system makes concentration estimation meaningful and practical in real online classroom environments, where students may not remain frontal or motionless. This level of interpretability is essential for real-world deployment, as it allows educators to monitor attentiveness reliably without depending exclusively on raw CNN probabilities. The consistency observed across concentration levels further reflects the robustness of the proposed sequential framework.
An error analysis identified several predictable failure scenarios, including extreme illumination, facial occlusions, and abnormal head poses, as detailed in
Table 6. Although these cases represent a limited portion of the data, they highlight potential areas for further improvement. GAN-based augmentation partially mitigates these issues by introducing diverse synthetic variations, thereby enhancing the robustness of the CNN to identity-related distortions.
Finally, the integration of the fuzzy inference system ensures that borderline concentration cases remain interpretable and less sensitive to subtle facial expressions. Compared to conventional CNN-only approaches, the proposed sequential architecture offers greater resilience, providing both accurate identity verification and reliable attentiveness estimation. Its ability to handle occlusions and pose variations also suggests promising applicability in secure biometric systems and anti-spoofing scenarios.
To evaluate the contribution of each component of our model, we conducted an ablation study comparing the baseline CNN, the CNN combined with GAN augmentation, the CNN with fuzzy logic, and the full sequential configuration. As shown in
Table 7, each added component leads to measurable performance improvements, with the full CNN–GAN–Fuzzy model achieving the highest accuracy and precision.
As a quantitative approach to evaluate the contribution of each component in the proposed GAN–CNN–Fuzzy framework, a systematic ablation study was conducted in which key modules were selectively removed while keeping the remaining architecture and training protocol unchanged. Three main variants were considered: (i) the removal of GAN-based data augmentation, (ii) the removal of the fuzzy inference system while retaining CNN-based feature extraction, and (iii) the removal of regularization techniques such as dropout, batch normalization, and data augmentation. Performance was evaluated using accuracy, ROC–AUC, precision, recall, and F1-score metrics, as reported in
Table 8. This study highlights the individual and combined contributions of each component within the proposed sequential architecture.
The results indicate that GAN-based data augmentation plays a crucial role in improving generalization across the 28 classes characterized by illumination and pose variations. Removing the GAN module leads to a noticeable performance degradation, with accuracy dropping to 96.87% and corresponding decreases in ROC–AUC and F1-score. This confirms that synthetic data generation is essential for handling challenging and underrepresented cases.
The omission of the fuzzy inference system results in a smaller but still significant performance decrease. Although the fuzzy module is primarily designed to enhance interpretability, its interaction with CNN confidence scores helps refine predictions in ambiguous cases, leading to improved precision and recall. Similarly, removing regularization techniques causes a more pronounced drop in performance (95.63% accuracy), highlighting the increased risk of overfitting in the absence of these stabilizing mechanisms.
Overall, the full model consistently outperforms all partially ablated variants across all evaluation metrics. The achieved ROC–AUC of 0.994 and F1-score of 98.9% demonstrate the high discriminative capability of the complete framework. These results validate the design choices of combining GAN-based data enrichment, fuzzy confidence refinement, and regularization to achieve a robust, accurate, and interpretable system for face recognition and student concentration assessment in real-world scenarios.
The evaluation metrics for the different model components are presented in
Figure 17, showing how performance varies across classes and highlighting the specific contribution of each module to the face recognition task. Consistent with the ablation results, the full GAN–CNN–Fuzzy configuration achieves the highest scores (98.42% accuracy), confirming that data augmentation and the fuzzy decision layer each address a distinct issue: GANs reduce overfitting, while fuzzy scoring provides insight into prediction confidence.
In practical terms, the ablation study can guide future research. It clearly shows that simply increasing CNN depth or complexity is not sufficient; instead, combining data augmentation, robust feature extraction, and interpretable decision layers offers a more effective and sustainable solution.
6.4. Baseline Model Comparison
The proposed model was compared against commonly used CNN architectures such as ResNet18 and VGG16 to put the results into perspective. ResNet18 achieved 94.8% accuracy and VGG16 reached 96.2%, but both fell short of the proposed approach: the GAN–CNN–Fuzzy model achieved an accuracy of 98.42%, outperforming the baseline models across every evaluation metric.
The comparative performance of the baseline models and the proposed sequential framework is presented in
Table 9, highlighting the improvements brought by GAN augmentation and the Fuzzy Logic stage.
The results presented in
Table 9 indicate that both ResNet18 and VGG16 achieve lower accuracy and reduced precision and recall compared to the proposed sequential framework. Occasional misclassifications in these baseline models suggest difficulties in distinguishing subjects with similar facial features. The GAN-based augmentation in our approach extends the coverage of facial variability, while the fuzzy layer introduces interpretability and resilience, minimizing errors even in edge cases. This combination proved essential to achieving virtually perfect classification performance.
The proposed CNN–GAN–Fuzzy framework was benchmarked against widely used CNN architectures, namely ResNet18 and VGG16, to evaluate both classification performance and deployment feasibility. While ResNet18 and VGG16 achieved accuracies of 94.8% and 96.2%, respectively, the proposed sequential model reached 98.42%, demonstrating its superior capability to distinguish subjects under variations in pose, illumination, and occlusion. Higher precision and recall values further indicate a reduced number of misclassifications. This improvement is largely attributed to GAN-based synthetic data augmentation, which increases training diversity and enhances generalization, as well as to the fuzzy inference layer, which translates CNN confidence scores into interpretable concentration levels and improves robustness in ambiguous cases.
From a computational perspective, the sequential framework was trained and evaluated on Google Colab using a 15 GB GPU, achieving an average inference time of approximately 12 ms per frame, which is suitable for real-time online proctoring. In contrast, deeper architectures such as VGG16 and ResNet18 typically require higher memory consumption and longer inference latency due to their larger parameter counts. With an optimized architecture of approximately 8 million parameters, the proposed model effectively balances accuracy and efficiency, making it well suited for live e-learning environments where both real-time performance and reliable identity verification and attentiveness monitoring are essential.
The performance of the baseline models across multiple evaluation metrics is shown in
Figure 18. The results illustrate how ResNet18 and VGG16 differ in recognition capability, while the proposed GAN–CNN–Fuzzy model consistently achieves superior performance. Beyond accuracy, the comparison also reveals clear benefits in efficiency and scalability: whereas deeper networks such as VGG16 and ResNet18 require longer training periods and greater computational resources, the proposed framework delivers its results with a comparatively compact and resource-efficient architecture.
High accuracy and interpretability, combined with efficiency, make the proposed model suitable for deployment in real-world applications such as security, authentication, and biometric verification.
In order to ensure that the synthetic face images generated by the GAN were both realistic and useful for training, a combination of quantitative and qualitative evaluation criteria was employed. Quantitatively, the Fréchet Inception Distance (FID) and Structural Similarity Index (SSIM) were computed between generated samples and real training images to measure distributional and structural similarity, respectively. Low FID values (≤35) and high SSIM scores (≥0.85), obtained across five randomly selected subjects, indicate that the generated images closely resemble real faces in terms of global structure, texture, and illumination patterns.
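The FID computation can be sketched as follows. The paper computes FID on Inception-network features; this NumPy-only sketch accepts any feature embedding and evaluates the trace of the matrix square root through eigenvalues, both of which are simplifications made here for illustration:

```python
import numpy as np

def fid(feat_a, feat_b):
    """Frechet distance between Gaussian fits of two feature sets, each of
    shape (n_samples, dim). The canonical FID uses Inception features; any
    embedding works for this sketch. Tr(sqrtm(cov_a @ cov_b)) is taken via
    eigenvalues (all non-negative in exact arithmetic, since both
    covariances are positive semi-definite)."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    eigs = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))                     # stand-in "real" features
fake = real + rng.normal(scale=0.1, size=(200, 8))   # nearby "synthetic" ones
score = fid(real, fake)                              # small for similar sets
```

Lower scores indicate that the synthetic feature distribution lies closer to the real one, which is the sense in which the reported FID ≤ 35 supports the realism of the GAN samples.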
Qualitatively, randomly sampled synthetic images were visually inspected with respect to pose variation, illumination diversity, and facial realism. The GAN was able to generate diverse yet realistic facial samples, including frontal, profile, and tilted poses under varying lighting conditions. This diversity proved particularly effective for augmenting CNN training in challenging scenarios such as strong illumination changes, partial occlusions, and abnormal head poses, thereby improving the robustness of both identity recognition and concentration estimation.
Beyond mapping numerical CNN confidence scores to discrete labels, the proposed framework integrates a Fuzzy Inference System (FIS) to enhance interpretability and robustness. In real online learning environments, CNN outputs are often affected by noise caused by pose variation, illumination changes, partial occlusions, and subtle facial expressions. The FIS employs overlapping triangular membership functions to model low, medium, and high concentration levels, enabling smooth transitions between categories and avoiding abrupt misclassifications near decision boundaries. Empirically determined thresholds (0.60 and 0.85), derived from observed CNN confidence distributions across multiple subjects and sessions, reflect natural confidence clustering. Compared to simple threshold-based mappings, the FIS better handles uncertainty and provides more human-interpretable attentiveness assessments, justifying its inclusion in the proposed sequential framework.
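A minimal sketch of such a fuzzy mapping follows. The 0.60 and 0.85 breakpoints come from the text, but the exact feet and peaks of each triangular membership function are assumptions made here for illustration:

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b.
    Degenerate edges (a == b or b == c) act as open shoulders."""
    left = 1.0 if x >= b else (x - a) / (b - a)
    right = 1.0 if x <= b else (c - x) / (c - b)
    return max(0.0, min(left, right))

def concentration_memberships(conf):
    """Overlapping Low/Medium/High memberships over a CNN SoftMax
    confidence in [0, 1]. The 0.60 / 0.85 breakpoints follow the paper;
    each triangle's exact shape is assumed for illustration."""
    return {
        "Low":    tri(conf, 0.00, 0.00, 0.60),
        "Medium": tri(conf, 0.40, 0.725, 0.85),
        "High":   tri(conf, 0.60, 1.00, 1.00),
    }

def concentration_level(conf):
    """Defuzzify by taking the level with the largest membership."""
    m = concentration_memberships(conf)
    return max(m, key=m.get)
```

Near a breakpoint such as 0.60, both the Low and Medium memberships are non-zero, which is what produces the smooth transitions described above rather than an abrupt threshold switch.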
6.5. Labeled Faces in the Wild (LFW) Evaluation
To evaluate the generalization capability of the proposed GAN–CNN–Fuzzy framework beyond the Extended Yale B (cropped) dataset, supplementary experiments were conducted on the Labeled Faces in the Wild (LFW) dataset. LFW consists of more than 13,000 color images of 5749 individuals captured under uncontrolled real-world conditions, including variations in pose, illumination, expression, occlusion, and background. Representative samples illustrating these challenging conditions are shown in
Figure 19. For consistency, all images were resized to 128 × 128 pixels and preprocessed to match the original input pipeline.
The sequential framework achieved an overall identity recognition accuracy of 92.87 ± 1.45% across five randomly selected subjects, compared to 98.42% on Extended Yale B (cropped), as reported in
Table 10. This performance degradation is expected due to the increased variability and noise present in real-world data. GAN-based synthetic augmentation contributed to improved adaptation to non-frontal poses and inconsistent lighting conditions, while the fuzzy inference layer continued to provide meaningful confidence estimates, reducing the impact of ambiguous predictions caused by extreme head rotations or severe occlusions.
For attentiveness estimation, CNN SoftMax confidence scores combined with fuzzy inference were used to classify concentration levels into Low, Medium, and High. On LFW, high-confidence predictions (>0.85) achieved an F1-score of approximately 0.91, whereas Medium and Low categories exhibited slightly lower performance (F1-score ≈ 0.82–0.85), mainly due to occlusions and partial facial visibility. These results indicate that the fuzzy logic module generalizes well across datasets, maintaining interpretability even under more challenging and unconstrained conditions.
Overall, the LFW experiments confirm that the proposed GAN–CNN–Fuzzy framework generalizes effectively to real-world datasets, preserving both identity recognition and attentiveness estimation capabilities, with expected performance reductions due to higher variability. These findings further support the applicability of the proposed approach to real online learning and e-proctoring scenarios.
6.6. Extended Baseline Comparison
Although the proposed GAN–CNN–Fuzzy sequential framework achieves a high identity recognition accuracy of 98.42% with near-perfect precision and recall on the Extended Yale B (cropped) dataset, its merits must also be assessed against both classical and more recent transfer learning-based architectures, as illustrated in
Figure 20. Models such as MobileNetV2, EfficientNet, and InceptionResNetV2 are representative transfer learning architectures that generalize well across diverse face datasets and are widely used in real-world face recognition and video analytics. Including them alongside the traditional baselines gives deeper insight into the performance trade-offs in identity verification and attentiveness monitoring, particularly when pose, illumination, and occlusion vary.
The results indicate that transfer learning models such as EfficientNet B0 and InceptionResNetV2 outperform shallow CNN baselines by leveraging pretrained features and architectural scaling, achieving accuracies between 97.6% and 97.9%. Their training on large-scale datasets (e.g., ImageNet) enables better robustness to variations in pose and illumination compared to vanilla CNNs. MobileNetV2 offers a favorable trade-off between accuracy and computational efficiency, making it attractive for edge deployment. However, none of these architectures match the overall effectiveness of the proposed sequential framework.
To ensure a fair comparison, several high-capacity architectures were evaluated under identical preprocessing and training conditions, including MobileNetV2, EfficientNet B0, InceptionResNetV2, and ResNet50, and were compared with baseline CNNs, GAN-augmented models, and the complete proposed framework, as summarized in
Table 11. All models were fine-tuned using the same training split and evaluated on a common test set.
As shown in
Table 11, the proposed CNN–GAN–Fuzzy model achieves the best overall performance, with an accuracy of 98.42% and near-perfect precision and recall. This gain is not solely attributable to network depth, but also to GAN-based synthetic augmentation, which enriches the training distribution, and to the fuzzy inference stage, which stabilizes decisions in borderline cases.
The confusion matrices further confirm the improved class-wise discrimination and reduced misclassification achieved by the proposed sequential framework.
When illumination or pose varies widely and fine-tuning data are scarce, standard transfer models may fail; GAN-generated synthetic samples supply the missing variability, resulting in better generalization, as illustrated in
Figure 21. Beyond classification accuracy, the sequential approach also provides interpretable predictions through fuzzy confidence mapping, which standard transfer models cannot offer. In online proctoring, this yields both strong identity verification and assessable attentiveness. The proposed framework achieves near-perfect ROC–AUC (~0.995) and is well suited to systems in which reliability and explainability are prerequisites, highlighting its applicability to real-world educational deployments.
6.7. Comparative Analysis
Several face recognition approaches combining classical models and deep learning techniques have been reported in the literature. Early methods based on Hidden Markov Models (HMMs) and shallow CNN architectures achieved reasonable accuracy while maintaining low computational cost, but their performance degraded under variations in illumination, pose, and occlusion. Handcrafted feature approaches such as overlapped Local Binary Patterns (LBPs) were designed for embedded and low-cost systems, offering favorable execution times but limited robustness in unconstrained environments. More recent optimization-driven deep learning approaches improved recognition performance, albeit at the cost of increased architectural and computational complexity.
A quantitative comparison with representative state-of-the-art methods is summarized in
Table 12, where the proposed sequential CNN–GAN–Fuzzy framework achieves the highest recognition accuracy on the Yale Face Recognition dataset. These results indicate that integrating deep feature learning with synthetic data augmentation and confidence-aware decision modeling provides a clear advantage over earlier HMM-, LBP-, and CNN-based approaches.
Beyond single accuracy values, a more comprehensive evaluation using micro, macro, and weighted averaging metrics is presented in
Table 13. The consistently high accuracy, precision, recall, and F1-score across all averaging schemes demonstrate the stability and robustness of the proposed framework across identity classes.
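The three averaging schemes can be reproduced with scikit-learn on toy predictions (the labels below are illustrative stand-ins, not the paper’s outputs):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(2)
n, k = 280, 28                       # toy stand-in: 10 samples per subject
y_true = np.repeat(np.arange(k), n // k)
y_pred = y_true.copy()
flip = rng.choice(n, size=4, replace=False)   # inject a few misclassifications
y_pred[flip] = (y_pred[flip] + 1) % k

results = {}
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    results[avg] = (float(p), float(r), float(f1))
```

With equal per-class support, the macro and weighted averages coincide and the micro average reduces to overall accuracy; the differences between the three schemes only emerge under class imbalance, which is why reporting all of them demonstrates stability across identity classes.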
The superior performance of the proposed model can be attributed to three main factors. First, GAN-based data augmentation effectively addresses data scarcity by enriching the training distribution with realistic facial variations, thereby reducing overfitting and improving generalization. Second, the deeper and carefully regularized CNN backbone enhances discriminative feature learning compared to earlier shallow or handcrafted feature pipelines. Third, the integration of fuzzy logic enables confidence-aware decision refinement by transforming raw SoftMax outputs into interpretable confidence levels (Low, Medium, High), improving reliability in borderline cases.
Unlike prior CNN-only or handcrafted feature methods, the proposed framework was systematically validated through ablation studies, demonstrating the individual and synergistic contributions of the CNN, GAN, and fuzzy inference components. While previous approaches typically achieved moderate accuracy (94–95%) under controlled conditions, they often lacked robustness, scalability, and interpretability. By contrast, the proposed sequential design bridges these gaps, achieving near-perfect accuracy while simultaneously enhancing transparency and practical reliability.
Although deep embedding models such as FaceNet and ArcFace report very high accuracy on standard benchmarks like LFW under controlled conditions, their performance often degrades in real-world scenarios involving illumination changes, pose variations, occlusions, and limited labeled data. These limitations further motivate the adoption of sequential frameworks that combine data augmentation strategies with interpretable decision-making mechanisms. In this context, the proposed CNN–GAN–Fuzzy framework represents a more practical and reliable solution for real-world face recognition and online proctoring applications.
6.8. Core Contributions
This work proposes a novel sequential CNN–GAN–Fuzzy framework for face recognition and attentiveness estimation, with the following core contributions:
- (1)
Sequential CNN–GAN–Fuzzy Architecture
A unified architecture is introduced that combines CNNs for hierarchical feature extraction, GANs for synthetic data augmentation, and a fuzzy inference system for confidence-aware decision-making. The integration of GAN-based augmentation significantly increases data diversity, enabling improved generalization under variations in pose, illumination, and facial expressions, while the fuzzy layer enhances interpretability and reliability.
- (2)
Interpretable Confidence Modeling via Fuzzy Logic
Unlike conventional CNN-based approaches that rely solely on raw SoftMax probabilities, the proposed framework converts prediction confidence into human-understandable levels (Low, Medium, High) using fuzzy logic. This design improves transparency and decision reliability, which is particularly important for security-sensitive and educational applications such as online proctoring.
- (3)
Systematic Ablation and Component-wise Validation
A comprehensive ablation study was conducted to quantify the individual and combined contributions of each module. Results demonstrate that GAN-based augmentation, fuzzy inference, and regularization techniques each contribute significantly to performance gains, with the complete framework achieving the highest accuracy and robustness compared to partially ablated variants and conventional CNN architectures.
- (4)
Robustness under Real-World Variability
The proposed model exhibits strong resilience to challenging real-world conditions, including pose variation, illumination changes, and partial occlusions. Extensive evaluation on the Extended Yale B (cropped) dataset, covering 28 illumination and pose classes, confirms the robustness of the approach with recognition accuracy exceeding 98%.
- (5)
Deployment Feasibility for Real-Time Applications
Despite its high accuracy, the framework maintains a compact and efficient architecture with low inference latency and moderate memory requirements. This makes it suitable for real-time deployment in biometric verification, online learning, and e-proctoring systems.
- (6)
Contribution to Future Face Recognition Research
This study provides a clear methodological roadmap for developing future face recognition systems that balance accuracy, robustness, and interpretability. By demonstrating the effectiveness of sequential architectures combining data augmentation, deep feature learning, and fuzzy confidence modeling, this work highlights a practical direction for real-world biometric applications.
7. Discussion
Our findings highlight the excellent performance and strength of the proposed sequential GAN–CNN–Fuzzy framework for face recognition in demanding real-life scenarios. Our model, based on convolutional neural networks to extract spatial features, GAN-based augmentation to increase variability, and fuzzy logic to provide interpretable confidence, consistently outperformed traditional deep learning frameworks and previously reported methods. On the Extended Yale B dataset (28 classes, 9 poses per subject, 64 illumination conditions, 16,128 grayscale images), our model achieved an impressive accuracy of 98.42% ± 1.23, with only a few misclassifications. This demonstrates not only strong classification ability but also high robustness to pose variation and illumination diversity, two of the most critical challenges in face recognition research.
Unlike previous papers, which mainly focused on handcrafted features (e.g., LBP) or standard CNN pipelines, our method provided significant improvements in handling non-frontal faces and extreme lighting conditions. For example, earlier approaches often failed when shadows masked important features or when off-angle postures reduced visibility of discriminative regions. In contrast, the CNN layers captured fine-scale local textures such as contours, edges, and shading variations, while the GAN component enriched the dataset with diverse synthetic examples, reducing class imbalance and improving generalization.
Although the CNN adopted in this work follows a conventional architecture composed of convolutional, pooling, and dense layers commonly used in face recognition, its novelty lies in its integration within a sequential framework. Rather than relying on a standalone CNN, the proposed system combines GAN-based data augmentation and a fuzzy logic module in a sequential yet complementary pipeline. This integration enables effective learning from limited training data while improving robustness to pose variation, illumination changes, and occlusions, and simultaneously provides interpretable student attentiveness scores—capabilities that a standard CNN alone cannot offer.
The CNN component further benefits from GAN-generated synthetic images, which enrich the training distribution with diverse facial variations absent from the original dataset. As a result, the CNN learns more generalized and robust feature representations, leading to improved recognition accuracy across varied subjects. In addition, regularization strategies such as dropout, batch normalization, and careful preprocessing are employed to mitigate overfitting while preserving discriminative power, which is essential when coupling CNN outputs with the fuzzy inference stage for downstream attentiveness estimation.
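The two regularization mechanisms mentioned above can be sketched in plain NumPy. This is a conceptual illustration of inverted dropout and batch normalization, not the framework-specific layers used in the actual model (which also maintain running statistics and learnable scale/shift parameters at inference time).

```python
import numpy as np

rng = np.random.default_rng(42)

def inverted_dropout(x, rate, training=True):
    """Inverted dropout: zero a fraction `rate` of activations during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def batch_norm(x, eps=1e-5):
    """Batch normalization (training-time sketch): normalize each feature
    over the batch to zero mean and unit variance."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

acts = rng.normal(size=(64, 128))          # a batch of 64 feature vectors
normed = batch_norm(acts)
dropped = inverted_dropout(normed, rate=0.5)
print(normed.mean().round(6), normed.std().round(3))
```

In the full model these operations sit between convolutional blocks; at test time dropout is disabled and batch normalization uses statistics accumulated during training.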
Finally, the proposed design is distinctive in that the CNN not only performs identity recognition but also provides confidence scores that are transformed by the fuzzy logic system into human-interpretable attentiveness levels. This unified treatment of identity verification and concentration monitoring is particularly valuable for real-time e-learning and online proctoring scenarios. Unlike previous studies that address face recognition and attentiveness estimation separately, the proposed methodology integrates both objectives within a single coherent framework, enhancing its practical applicability.
Another important contribution of our framework is the fuzzy confidence mechanism, which refines the interpretation of predictions. While most test cases were categorized with high fuzzy confidence, ambiguous samples—particularly side poses under deep shadows—were identified as medium or low. This adds practical protection in real-world applications, where the system can request further verification or trigger multi-modal checks when confidence is insufficient.
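A minimal sketch of such a fuzzy confidence mechanism is shown below, using triangular membership functions to fuzzify the CNN's top-class probability into Low/Medium/High degrees. The breakpoints here are hypothetical placeholders: as noted later in the paper, the actual thresholds were defined empirically.

```python
def tri(x, a, b, c):
    """Triangular membership function with peak at b and feet at a and c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_confidence(p_max):
    """Fuzzify the top-class probability into Low/Medium/High memberships
    and return the label with the highest membership degree.
    Breakpoints are illustrative, not the empirically tuned values."""
    memberships = {
        "Low":    tri(p_max, -0.01, 0.0, 0.55),
        "Medium": tri(p_max, 0.40, 0.65, 0.90),
        "High":   tri(p_max, 0.75, 1.0, 1.01),
    }
    label = max(memberships, key=memberships.get)
    return label, memberships

label, m = fuzzy_confidence(0.95)
print(label)  # -> High for a confidently recognized face
```

Because adjacent membership functions overlap, a borderline sample (e.g., a side pose under deep shadow with p_max ≈ 0.55) carries non-zero membership in two levels at once, which is precisely what lets the system flag ambiguous cases for further verification rather than forcing a hard decision.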
Notably, the system achieved a very high per-class recognition rate across all 28 subjects, demonstrating scalability and strong generalization. Overall, our research shows that CNNs, GANs, and fuzzy logic can be combined to provide a unified solution to the long-standing challenge of pose and illumination invariance in face recognition. With its high accuracy, low variance, and interpretable results, the proposed framework sets a new benchmark on the Extended Yale B dataset and supports the feasibility of real-world deployment in biometric systems. These findings provide solid grounds to believe that the future of face recognition research lies in sequential architectures that emphasize accuracy, robustness, and interpretability—beyond the limits of traditional deep learning pipelines.
Despite its strong performance, the proposed framework relies on empirically defined fuzzy thresholds and was evaluated on a limited number of datasets. Future work will focus on adaptive threshold learning and larger-scale real-world deployments.
The GAN–CNN–Fuzzy system presented shows near-perfect performance on the Extended Yale B dataset; however, there are several opportunities to extend this study toward broader scope and greater robustness:
The experiments in this work were conducted using the Extended Yale B dataset, which is composed of grayscale facial images. This choice ensured controlled lighting and pose variations but limits the diversity of emotional and color information. In future work, we plan to extend the study to color-based facial datasets that include a broader range of emotions and expressions. This will allow the system to better capture subtle emotional cues and differentiate between concentrated and distracted states in students during online learning sessions.
Future research could incorporate additional data modalities, including infrared images, depth maps, or thermal images, to improve recognition in challenging environments such as low light, occlusions, or extreme facial expressions. Multi-modal fusion would strengthen feature representation and enhance generalization, leading to a more resilient system for real-world deployment.
Although the current CNN structure efficiently captures hierarchical features, integrating attention-based or transformer layers could allow the network to dynamically focus on the most informative regions of the face. This would increase discriminative power for classes with subtle inter-class variations and reduce reliance on large-scale data augmentation.
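The attention mechanism envisioned above can be illustrated with a NumPy sketch of scaled dot-product self-attention, the building block of transformer layers. This is a generic sketch, not part of the proposed architecture: the token count and dimensionality are arbitrary placeholders standing in for flattened CNN feature-map cells.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention: each query attends over all keys,
    producing a weighted sum of values (weights sum to 1 per query)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(7)
# 16 spatial tokens (e.g., flattened feature-map cells), 32 dimensions each
tokens = rng.normal(size=(16, 32))
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)
```

Inserted after a convolutional block, such a layer would let each facial region reweight its representation based on the most informative other regions (eyes, mouth corners), rather than treating all spatial positions uniformly.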
Another future direction is optimizing the framework for deployment on edge devices and mobile platforms. Techniques such as model pruning, quantization, and lightweight CNN variants have the potential to reduce computational and memory costs without sacrificing accuracy, enabling real-time facial recognition in security, authentication, and IoT systems.
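As a concrete illustration of one such technique, the sketch below implements symmetric per-tensor int8 post-training quantization of a weight matrix in NumPy. Production toolchains (e.g., framework-native quantization) add per-channel scales and calibrated activation quantization; this minimal version only shows the core idea and its 4x memory saving over float32.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats in [-max|w|, max|w|]
    to integers in [-127, 127], keeping one float scale for dequantization."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = float(np.abs(w - w_hat).max())
print(q.dtype, round(err, 6))   # int8 storage: 1 byte/weight vs 4 for float32
```

The worst-case reconstruction error is half a quantization step (scale / 2), which for well-behaved CNN weight distributions is typically small enough to leave recognition accuracy nearly unchanged.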
8. Conclusions
In this research, we proposed a sequential GAN–CNN–Fuzzy model for face recognition, which effectively addresses the challenges of pose variation, illumination diversity, and limited sample availability in the Extended Yale B dataset. Our solution, consisting of GAN-based data augmentation, CNN-based spatial feature extraction, and fuzzy logic-based interpretable confidence scoring, achieved 98.42% ± 1.23 accuracy across 28 subjects, significantly higher than previous approaches. These findings demonstrate that sequential models not only increase recognition accuracy but also provide greater reliability and generalization in unconstrained environments, making them suitable candidates for deployment in real biometric and surveillance systems.
The proposed GAN–CNN–Fuzzy framework yielded an overall accuracy of 98.42% with near-perfect confidence across all classes. The CNN backbone automatically learned hierarchical facial representations, from low-level visual details to high-level semantic features. GAN-based augmentation contributed to improved generalization by synthesizing realistic variations in pose, illumination, and expression. The fuzzy inference layer offered an interpretable robustness measure, converting raw probability scores into qualitative levels (Low, Medium, High) that reflect the certainty of predictions.
Comparative results against baseline architectures such as ResNet18 and VGG16 confirmed that our sequential approach not only outperformed conventional CNNs in accuracy but also introduced trustworthy interpretability, reinforcing its potential for real-world biometric applications. Furthermore, the ablation study validated the complementary role of each module: CNN for feature extraction, GAN for data diversity and generalization, and fuzzy scoring for interpretability.
Overall, our study demonstrates that by carefully combining feature extraction, data augmentation, and decision-level interpretability, the long-standing challenges of face recognition can be addressed collectively.