1. Introduction
Face recognition is one of the most widely adopted biometric techniques due to its non-intrusive nature and its applicability in authentication, surveillance, security, and human–computer interaction. Despite its growing importance, traditional face recognition systems still struggle to maintain high accuracy in unconstrained environments. Variations in pose, illumination, occlusion, and facial expression often distort facial appearance and affect the stability of extracted features, limiting system performance outside controlled acquisition conditions.
In parallel, Artificial Intelligence (AI) and Machine Learning (ML) are increasingly integrated into modern sectors such as education, healthcare, cybersecurity, and environmental monitoring, enhancing the adaptability and efficiency of intelligent systems. In education in particular, the expansion of online learning, virtual classrooms, and remote examinations has created new opportunities as well as new challenges. Ensuring reliable student identity verification and monitoring attentiveness during remote assessments have become essential to preserving academic integrity. However, existing recognition systems frequently underperform in real-world scenarios due to insufficient robustness and limited training data.
Deep learning approaches, especially Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in image-based tasks thanks to their ability to learn hierarchical and discriminative features directly from pixel data. Nevertheless, their generalization capabilities strongly depend on the availability of large and diverse training datasets. Controlled datasets such as the Extended Yale B (cropped) database remain valuable benchmarks but lack the variability required for training highly resilient models. Generative Adversarial Networks (GANs) offer a promising solution by generating realistic synthetic samples that replicate challenging facial variations, thereby enriching the training process and improving robustness.
Beyond accuracy, interpretability also plays a key role in automated proctoring systems. Providing transparent confidence indicators helps instructors assess the reliability of identity recognition during online examinations. Fuzzy logic addresses this need by transforming numerical CNN confidence values into intuitive linguistic categories such as Low, Medium, and High attentiveness.
Motivated by these challenges, this work introduces a sequential GAN–CNN–Fuzzy framework designed to enhance both the robustness and interpretability of face recognition in e-learning environments. The proposed model integrates CNN-based feature extraction, GAN-based data augmentation, and fuzzy inference to deliver accurate identity recognition as well as human-readable concentration assessments. Using the Extended Yale B (cropped) dataset, we demonstrate that the proposed system effectively handles variations in pose and illumination while providing meaningful confidence indicators suitable for real-world educational applications. The novelty of this work lies not in proposing a new CNN architecture, but in integrating a conventional CNN into a sequential GAN–CNN–Fuzzy framework for robust identity recognition and interpretable attentiveness analysis.
To this end, the present paper is organized as follows: Section 2 presents the problem statement, while Section 3 reviews the related works. Section 4 introduces the preliminaries and fundamental concepts underlying the proposed approach. Section 5 details the methodology and the implementation of the face recognition model, and Section 6 reports the experimental results. Section 7 provides a discussion of the findings, and finally, Section 8 concludes the paper and outlines future research directions.
3. Related Works
Face recognition has been widely studied using different approaches. Traditional methods relied on handcrafted features such as Principal Component Analysis (PCA), Local Binary Patterns (LBPs), and Histogram of Oriented Gradients (HOG) [1,2,3,4,5]. These techniques achieved good results in controlled conditions but lacked robustness against variations in pose, illumination, and occlusion.
With the rise of deep learning, Convolutional Neural Networks (CNNs) have shown remarkable performance in face recognition tasks [6,7,8,9]. CNNs can automatically extract hierarchical features from raw images, enabling state-of-the-art systems such as DeepFace [6], FaceNet [7], VGGFace [8], and ArcFace [9] to achieve near-human accuracy. However, CNNs require large-scale datasets to generalize well. In smaller or more controlled datasets such as Yale B, they are prone to overfitting, which limits their applicability in real-world scenarios.
To address this limitation, Generative Adversarial Networks (GANs) have been applied to generate synthetic training samples and increase data diversity [10,11,12]. The original GAN framework [10] and its later extensions, such as DR-GAN for pose-invariant recognition [11] and StarGAN for multi-domain facial variations [12], have demonstrated their ability to enrich datasets with realistic synthetic images, improving robustness to pose and illumination changes. In parallel, fuzzy logic has been explored in biometric systems to provide interpretable decision-making and confidence estimation [13,14,15]. By mapping prediction scores into linguistic categories, fuzzy systems make machine decisions more transparent and human-readable.
Indeed, the paper [16] proposes a method that combines synthetic data generated by Generative Adversarial Networks (GANs) with real facial datasets to enhance the accuracy and generalization of facial expression recognition models. The authors demonstrate that this hybrid training strategy significantly reduces overfitting and improves performance on both synthetic and real-world datasets. In Ref. [17], the authors present a three-stage GAN-based training approach to address multi-view facial expression recognition challenges, particularly pose variation. Their method integrates a pre-trained facial expression classifier into a generative framework, allowing for the synthesis of high-quality frontal faces while maintaining the original expression information, thereby improving recognition accuracy.
Moreover, the paper [18] introduces an innovative online learning platform that monitors students’ attention and emotions in real time. The proposed system integrates three deep learning models: ResNet50 for facial feature extraction, CBAM (Convolutional Block Attention Module) to focus on relevant facial regions, and TCN (Temporal Convolutional Networks) to analyze temporal changes in expressions. This architecture provides reliable insights into learners’ engagement and emotional dynamics in virtual classrooms.
The work [19] proposes a hybrid CNN model for recognizing cognitive states of learners from facial expressions in e-learning environments. This model combines handcrafted and CNN-extracted features to achieve high accuracy across multiple datasets, demonstrating its effectiveness for cognitive engagement recognition. Furthermore, the study [20] presents an intelligent hybrid system that combines Convolutional Neural Networks and fuzzy logic to interpret both cognitive and emotional responses of students. In this approach, CNNs are used for facial expression detection, while the fuzzy inference system determines appropriate learning levels based on emotions and performance, leading to a more adaptive and human-centered learning process.
In a recent study [21], the authors proposed a real-time visual attention estimation method based on evolving neuro-fuzzy models, demonstrating the relevance of fuzzy inference for interpreting human visual behavior. Their work highlights the potential of fuzzy logic to improve explainability in computer vision systems, especially in tasks involving attentiveness assessment.
Finally, the paper [22] focuses on the role of emotion recognition in distance learning, where direct teacher–student interactions are limited. The authors develop a CNN-based model trained on the FER2013 dataset to classify seven basic emotions in real time. Through data preprocessing and early stopping techniques, the model achieves robust and efficient performance suitable for intelligent online learning systems.
However, most previous works have treated these methods separately. To the best of our knowledge, no prior study has combined CNNs, GAN-based augmentation, and fuzzy logic into a single framework for both student identity verification and concentration measurement in online learning environments. This is the research gap that our work addresses.
5. Methodology
This section presents the proposed sequential GAN–CNN–Fuzzy framework designed to achieve robust face recognition and interpretable attentiveness estimation in e-learning environments. The methodology integrates four major components: (1) dataset preparation and preprocessing, (2) GAN-based data augmentation, (3) CNN-based identity recognition, and (4) a fuzzy logic system to map CNN confidence scores into qualitative attentiveness levels. The overall workflow is illustrated in Figure 3.
5.1. Dataset and Preprocessing
The experiments were conducted using the Extended Yale B (cropped) dataset, which includes 16,128 grayscale facial images of 28 subjects, captured under 64 illumination conditions and 9 pose variations. Each face image was automatically cropped to 192 × 168 pixels using a face detection algorithm, ensuring that only relevant facial regions were preserved.
Before feeding the images to the CNN, several preprocessing steps were applied:
All images were converted to grayscale (if not already) and resized to 128 × 128 × 1. This standardized format accelerates training and ensures fixed-size input tensors.
Haar Cascade classifiers were used to detect facial bounding boxes. Each detected face was tightly cropped to remove background noise, ensuring that the CNN focuses entirely on discriminative facial patterns.
Pixel values were normalized to the range [0, 1] by dividing each 8-bit intensity by 255 (I′ = I/255). This reduces illumination sensitivity and stabilizes gradient propagation during training.
To enhance robustness, real-time augmentation was applied:
Random rotations (±10°).
Horizontal flipping.
Width/height shifts.
Zoom in/out.
Contrast adjustments.
These transformations mimic natural variations encountered in real environments and reduce overfitting.
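The intensity-scaling and tensor-shaping steps above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming 8-bit grayscale inputs that have already been detected, cropped, and resized to 128 × 128 (those earlier steps, and the real-time augmentation, are omitted here):

```python
import numpy as np

def preprocess_face(image: np.ndarray) -> np.ndarray:
    """Normalize an 8-bit grayscale face crop and add a channel axis.

    `image` is assumed to be a 2-D uint8 array already resized to
    128 x 128; detection and cropping are handled upstream.
    """
    x = image.astype(np.float32) / 255.0   # scale intensities to [0, 1]
    return x[..., np.newaxis]              # shape (128, 128, 1) for the CNN

# Example with a dummy 128 x 128 frame
frame = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
tensor = preprocess_face(frame)
print(tensor.shape)  # (128, 128, 1)
```

The explicit channel axis matters because the CNN described in Section 5.3 expects fixed-size 128 × 128 × 1 input tensors.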
To provide a clear overview of the dataset used throughout this study, we summarize the distribution of the Extended Yale B (cropped) images, along with the number of GAN-generated samples incorporated into training. As shown in Table 1, the dataset is divided into training and testing partitions, and the synthetic images are included in controlled proportions to maintain stability while enhancing data diversity.
5.2. GAN-Based Data Augmentation
To address dataset limitations and enhance model robustness, a Generative Adversarial Network (GAN) was trained to generate synthetic facial images representing variations difficult to capture manually. The architecture is shown in Figure 4.
To increase dataset diversity and improve the robustness of the recognition system, a Generative Adversarial Network (GAN) was trained offline to generate additional facial images consistent with the visual characteristics of the Extended Yale B dataset. The generator takes a 100-dimensional latent noise vector and transforms it into a grayscale facial image of size 128 × 128 × 1.
This transformation begins with a dense projection and reshape operation that produces an initial spatial tensor. The tensor is then refined through a sequence of transposed convolution layers, each progressively enhancing the spatial resolution and visual quality of the synthesized face:
Dense + Reshape to initialize the spatial representation.
Conv2DTranspose (128 filters, ReLU) to introduce coarse structural patterns.
Conv2DTranspose (64 filters, ReLU) to enhance mid-level textures and shading.
Conv2DTranspose (1 filter, Sigmoid) to generate the final normalized grayscale image.
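The spatial-size arithmetic behind this upsampling path can be checked with plain Python. The seed resolution (16 × 16), kernel size (4), and stride (2) below are illustrative assumptions, since the text reports only the layer types and filter counts; with 'same' padding and stride 2, three transposed convolutions grow the seed tensor to the target 128 × 128 resolution:

```python
def conv2d_transpose_size(n: int, kernel: int, stride: int, padding: str) -> int:
    """Output length along one axis of a 2-D transposed convolution
    (following the usual Keras size conventions)."""
    if padding == "same":
        return n * stride
    return (n - 1) * stride + kernel  # 'valid' padding

# Hypothetical path: Dense + Reshape to a 16 x 16 seed tensor, then
# three stride-2, 'same'-padded transposed convolutions.
size = 16
for _ in range(3):
    size = conv2d_transpose_size(size, kernel=4, stride=2, padding="same")
print(size)  # 16 -> 32 -> 64 -> 128
```

Any seed/stride combination whose product of strides equals 8 would reach 128 from 16; the values here are only one consistent configuration.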
The resulting synthetic samples exhibit realistic variations in illumination, shadows, and subtle pose deviations, providing additional variability that is not fully captured in the original dataset. Although the discriminator is not depicted in Figure 4, it plays a central role in the adversarial learning process by distinguishing real Yale B images from generated ones, thereby guiding the generator toward higher realism.
The GAN was trained for 200 epochs with a batch size of 64, using the Adam optimizer (lr = 0.0002, β1 = 0.5). Training alternated between updating the discriminator with balanced batches of real and synthetic images, and updating the generator to improve its ability to fool the discriminator. Once convergence was reached, the generator produced synthetic faces of sufficient quality to be incorporated into the CNN training dataset.
Because the generated images already match the required 128 × 128 normalized grayscale format, no additional preprocessing was necessary. Approximately 10% of the final training data consisted of GAN-generated samples. This controlled integration significantly increased dataset variability and contributed to the improved generalization performance of the CNN under challenging illumination and pose conditions.
To illustrate the variations produced by the trained GAN, Figure 5 presents several synthetic face images generated during the augmentation process, showing changes in illumination, pose, and shading that closely resemble the real Yale B samples.
These synthetic samples were subsequently integrated into the training set, expanding the intra-class variability and improving the robustness of the CNN classifier.
5.3. CNN Architecture for Identity Recognition
The Convolutional Neural Network (CNN) is the core classifier responsible for recognizing each student’s identity. The complete architecture is shown in Figure 6.
The Convolutional Neural Network (CNN) constitutes the core classifier responsible for distinguishing the 28 subjects in the Extended Yale B dataset. Each normalized grayscale image of size 128 × 128 × 1 is processed through a hierarchical sequence of convolutional operations designed to extract discriminative facial features.
The feature extraction stage relies on four convolutional layers equipped with 3 × 3 filters and ReLU activation. As the depth increases—32, 64, 128, and 256 filters—the network gradually transitions from detecting low-level structures to encoding more abstract and semantically rich representations. A 2 × 2 max-pooling layer follows each convolutional block to reduce spatial resolution while preserving salient patterns.
Conv2D (32, 3 × 3, ReLU) captures low-level edges and corners.
Conv2D (64, 3 × 3, ReLU) extracts intermediate contours and textures.
Conv2D (128, 3 × 3, ReLU) encodes deeper structural patterns.
Conv2D (256, 3 × 3, ReLU) learns high-level semantic features.
MaxPooling (2 × 2) applied after each block to compress spatial information.
After convolutional processing, the feature maps are flattened into a one-dimensional representation and passed through fully connected layers. A Dense (512, ReLU) layer consolidates the extracted features, while a final Softmax layer of 28 units produces the class-wise identity probabilities.
Flatten: vectorizes feature maps.
Dense (512, ReLU): high-level identity representation.
Dense (28, Softmax): probability distribution for the 28 subjects.
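As a sanity check on this design, the feature-map dimensions can be traced with simple arithmetic. The padding mode is not stated in the text, so the sketch assumes 'same'-padded convolutions, in which case only the 2 × 2 max-pooling layers change the spatial resolution:

```python
def cnn_feature_shapes(input_size=128, filters=(32, 64, 128, 256)):
    """Spatial size after each conv block, assuming 'same'-padded
    convolutions followed by a 2 x 2 max-pool."""
    size, shapes = input_size, []
    for f in filters:
        size //= 2                     # each 2x2 max-pool halves the resolution
        shapes.append((size, size, f))
    return shapes

shapes = cnn_feature_shapes()
print(shapes)   # [(64, 64, 32), (32, 32, 64), (16, 16, 128), (8, 8, 256)]
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(flat)     # number of flattened features entering Dense(512)
```

Under these assumptions the Flatten layer would emit 8 × 8 × 256 = 16,384 features, which the Dense(512) layer then compresses into the identity representation.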
To stabilize learning and prevent overfitting, the CNN integrates Batch Normalization and a Dropout rate of 0.1 within its architecture. The model was trained for 150 epochs, with a batch size of 32, using the Adam optimizer (learning rate = 0.001) and the categorical cross-entropy loss function. Early stopping was applied based on validation performance to prevent unnecessary training iterations.
Batch Normalization: stabilizes internal activations.
Dropout = 0.1: limits overfitting.
Optimizer: Adam (lr = 0.001).
Loss function: categorical cross-entropy.
Batch size: 32.
Epochs: 150 + early stopping (validation-based).
To improve robustness, the CNN was trained on an enriched dataset combining preprocessed real images from the Extended Yale B database with synthetic samples generated by the GAN. The GAN images accounted for approximately 10% of the total dataset and introduced variations in illumination, pose, and shadowing that significantly enhanced generalization.
5.4. Fuzzy Logic for Attentiveness Assessment
To enhance interpretability, a Fuzzy Logic stage translates CNN confidence scores into human-understandable categories (Low, Medium, High). The architecture is shown in Figure 7.
The confidence score produced by the CNN for each input image reflects the reliability of the identity prediction. In the context of remote learning, this confidence value carries an additional semantic dimension: high and stable confidence levels generally occur when the student’s face is fully visible, well illuminated, and consistently detected across frames, whereas low values are often associated with partial occlusions, abrupt head movements, or poor lighting conditions. To make this information more interpretable for practical monitoring scenarios, a fuzzy logic module was introduced, following the conceptual structure illustrated in Figure 7.
The numerical confidence score, which ranges from 0 to 1, is first transformed into fuzzy linguistic variables. Three triangular membership functions were defined to represent varying degrees of attentiveness:
Scores below 0.60 are associated with Low attentiveness;
Scores between 0.60 and 0.85 correspond to Medium attentiveness;
Scores above 0.85 indicate High attentiveness.
This mapping smooths the transitions between categories and reduces the impact of minor confidence fluctuations that may arise from transient facial variations.
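The fuzzification step above can be sketched with NumPy. Only the triangular shape of the membership functions and the approximate boundaries (0.60 and 0.85) come from the text; the exact vertex positions below are illustrative assumptions:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function: feet at a and c, peak 1.0 at b."""
    return float(np.interp(x, [a, b, c], [0.0, 1.0, 0.0]))

def fuzzify(score):
    """Map a CNN confidence score in [0, 1] to linguistic memberships.

    Vertex positions are illustrative; the paper fixes only the
    category boundaries near 0.60 and 0.85.
    """
    return {
        "Low":    tri(score, -0.60, 0.00, 0.60),
        "Medium": tri(score,  0.55, 0.725, 0.90),
        "High":   tri(score,  0.80, 1.00, 1.20),
    }

memberships = fuzzify(0.72)
print(max(memberships, key=memberships.get))  # Medium
```

Because the triangles overlap near the boundaries, a score such as 0.62 retains partial membership in both Low and Medium, which is exactly the smoothing behavior described above.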
The threshold values were determined empirically based on the behavior observed during validation. High confidence scores were consistently linked to situations where the student’s face was stable and clearly captured by the camera. Intermediate scores typically emerged when the student looked away briefly, leaned slightly to one side, or momentarily exited the optimal detection zone. Conversely, low scores frequently occurred in configurations where the face was only partially visible, heavily shadowed, or intermittently lost by the detection module. By grounding the thresholds in these practical observations, the fuzzy classification aligns more closely with real conditions encountered in online examination settings.
Once the confidence score is fuzzified, the system evaluates a set of simple but effective if–then rules to determine the attentiveness category:
If the confidence is High, then the attentiveness level is High.
If the confidence is Medium, then the attentiveness level is Medium.
If the confidence is Low, then the attentiveness level is Low.
These rules encode the direct relationship between visual stability and the inferred level of concentration.
The inference engine aggregates the outputs of all rules activated by the current input. This combined fuzzy output is subsequently translated into a crisp attentiveness value through the centroid defuzzification method. The resulting score, expressed within the interval [0, 1], provides a continuous and interpretable measure of attentiveness.
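The aggregation and centroid defuzzification described above can be sketched as a small Mamdani-style inference step. The output membership functions and the discretization of the universe are illustrative assumptions; only the use of triangular sets and centroid defuzzification comes from the text:

```python
import numpy as np

# Discretized universe for the crisp attentiveness output in [0, 1].
u = np.linspace(0.0, 1.0, 1001)

def tri(x, a, b, c):
    """Vectorized triangular membership with feet a, c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Output fuzzy sets; vertex placement is illustrative, not from the paper.
OUT = {
    "Low":    tri(u, -0.60, 0.00, 0.60),
    "Medium": tri(u,  0.55, 0.725, 0.90),
    "High":   tri(u,  0.80, 1.00, 1.20),
}

def attentiveness(activations):
    """Clip each output set by its rule activation, aggregate with max,
    then defuzzify with the centroid method."""
    agg = np.zeros_like(u)
    for label, strength in activations.items():
        agg = np.maximum(agg, np.minimum(strength, OUT[label]))
    if agg.sum() == 0.0:
        return 0.0          # degenerate case: no rule fired
    return float((u * agg).sum() / agg.sum())

# A frame whose confidence mostly activates the "High" rule:
score = attentiveness({"Low": 0.0, "Medium": 0.2, "High": 0.9})
print(round(score, 3))  # crisp score in the upper part of [0, 1]
```

The centroid of the aggregated shape moves continuously as rule activations change, which is why small confidence fluctuations do not cause abrupt jumps in the attentiveness output.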
The final output of the fuzzy logic module is an attentiveness label—Low, Medium, or High—that complements the CNN’s identity recognition. While it does not modify the identity prediction, it offers a meaningful interpretation of how consistently the system was able to observe the student’s face. This additional layer is particularly relevant in e-learning contexts, where monitoring student engagement is an essential component of remote assessment integrity.
5.5. Training Strategy
To balance stability and diversity:
GAN images were incorporated at a controlled ratio (≈10%).
Real-time augmentation was used during all training phases.
Validation metrics included accuracy, precision, recall, F1-score, ROC–AUC, PR–AUC, and confusion matrices.
Saliency maps were used to visualize which facial regions the CNN relied on most.
This holistic training strategy ensures both high recognition accuracy and strong generalization under real-world variability.
5.6. Computational Study
All experiments were conducted on Google Colab to ensure reproducibility and accessibility. The computing hardware consisted of an NVIDIA Tesla T4 GPU with 16 GB VRAM, 25 GB system RAM, and an Intel Xeon CPU operating at 2.2 GHz. The proposed framework was implemented using TensorFlow and Keras with CUDA acceleration. This cloud-based configuration represents a realistic reference for evaluating the training and inference performance of the proposed GAN–CNN–Fuzzy framework in practical e-learning scenarios. The GAN and CNN–Fuzzy components were trained offline, while inference was performed online using the CNN–Fuzzy module.
The GAN module demonstrated moderate computational complexity and stable convergence, facilitated by controlled image resolution and adversarial learning stability. The CNN feature extractor combined with the fuzzy inference system achieved low inference latency, enabling real-time processing of student video streams. Importantly, the fuzzy logic module introduced negligible computational overhead, as it operates on low-dimensional CNN features rather than raw images.
From a deployment and scalability perspective, the proposed framework is computationally efficient and well suited for real-time online examination systems. The achieved inference speed allows simultaneous monitoring of a large number of students without violating real-time constraints. Moreover, since the GAN is used exclusively during training, the runtime memory footprint remains small, making the framework suitable for both cloud-based and edge deployments. These observations confirm that the proposed architecture successfully balances high recognition accuracy with practical computational requirements.
As summarized in Table 2, the total training time of the framework is under 1.5 h, while the average inference time per frame is approximately 13.2 ms, corresponding to about 75 frames per second. This performance is well below the real-time threshold of 33 ms per frame and ensures continuous monitoring in online proctoring environments. In addition, the compact model size and offline usage of the GAN further reduce memory overhead at deployment, reinforcing the applicability of the proposed system to scalable real-world e-learning platforms.
For the fuzzy inference stage, three triangular membership functions were selected due to their simplicity, computational efficiency, and interpretability. These membership functions enable a smooth mapping of CNN confidence scores into human-understandable concentration levels (low, medium, and high), while allowing for gradual transitions between categories for borderline cases. Such behavior is particularly appropriate in real-time educational contexts, where abrupt changes in attentiveness labels may be misleading.
Although alternative designs using trapezoidal membership functions could also be considered, triangular functions provide a comparable representation of uncertainty with fewer parameters and simpler implementation. The centroid defuzzification method was adopted as a standard and interpretable technique to convert fuzzy membership values into a crisp attentiveness score. This choice ensures a consistent and reproducible mapping from CNN confidence outputs to concentration levels, while effectively handling uncertainty at class boundaries. Consequently, the fuzzy inference system enhances the reliability and explainability of attentiveness estimation in real-time online learning applications.
6. Results
6.1. Preprocessing Steps
Preprocessing is essential to prepare the Yale B face dataset for robust training and testing of the GAN–CNN–Fuzzy model. The steps include face detection, cropping, resizing, and normalization to ensure uniformity and improve feature extraction. In addition, data augmentation strategies are used to model real-life changes in pose, illumination, and expression, enlarging the variety of training samples.
The analysis is based on the Yale B-Cropped-Full dataset, which provides 16,128 cropped face images derived from the Yale B Face Database. Each image has been automatically cropped to 192 × 168 pixels using a face detection algorithm. The images were acquired with due consent and in accordance with ethical guidelines. The resulting collection covers a range of facial variations appropriate for training and evaluating the proposed GAN–CNN–Fuzzy model.
Cropped face images are then scaled to a standard size of 128 × 128 pixels to provide uniform input to the CNN. Normalization maps pixel intensity values into the range [0, 1], converting the images to floating point. This normalization helps stabilize neural network training, accelerates convergence, and keeps gradient magnitudes consistent throughout backpropagation.
To replicate natural real-world variation and improve the CNN’s ability to generalize, a set of real-time data augmentation methods is applied. These include rotation (random rotations within ±10° to account for head turns), horizontal flipping (to mirror faces), width/height shifts (small translations to compensate for imperfect framing), and zoom transformations (random zoom in/out to imitate changes in camera distance). This augmentation expands the effective training set without requiring additional data, reducing overfitting and improving robustness to pose and orientation variations.
Label encoding assigns each subject a distinct numerical label. The filenames (e.g., subject01.centerlight) are translated into integer identifiers that serve as the target classes for CNN training. This is necessary for the categorical classification task and guarantees compatibility with the Softmax output of the network.
Although the basic training set is composed of real images, synthetic images produced by a trained GAN can be added as an optional supplement. These generated images introduce additional variation in lighting, pose, and subtle facial expression, increasing model robustness. Their proportion is nevertheless kept under control so as not to destabilize CNN training.
The processed (cropped, scaled, normalized, and possibly augmented) images are divided into training and validation/test sets. A 90/10 split is used, so the model is evaluated on unseen samples while retaining enough data for training. Images are reshaped to include a single-channel dimension (128, 128, 1), appropriate for grayscale CNN input.
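The split-and-reshape step described above can be sketched as follows. The array names and the fixed random seed are placeholders; the block assumes the images have already been preprocessed into a single NumPy array:

```python
import numpy as np

def split_dataset(images, labels, val_ratio=0.10, seed=42):
    """Shuffle and split into ~90% training / ~10% held-out data,
    reshaping grayscale images to (N, 128, 128, 1)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    n_val = int(len(images) * val_ratio)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    x = images.reshape(-1, 128, 128, 1).astype(np.float32)
    return (x[train_idx], labels[train_idx]), (x[val_idx], labels[val_idx])

# Dummy stand-in for the preprocessed Yale B arrays
imgs = np.zeros((100, 128, 128), dtype=np.float32)
labs = np.arange(100) % 28
(train_x, train_y), (val_x, val_y) = split_dataset(imgs, labs)
print(train_x.shape, val_x.shape)  # (90, 128, 128, 1) (10, 128, 128, 1)
```

Shuffling before the split matters here because the Yale B filenames group images by subject and illumination, and a sequential split would bias the held-out set.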
6.2. Experiment
The results of the proposed GAN–CNN–Fuzzy framework demonstrate strong performance on the Extended Yale B dataset, achieving near-perfect classification accuracy. The CNN backbone shows a robust capability in extracting hierarchical facial features, ranging from low-level patterns such as edges and contours to higher-level facial structures including eyes, nose, and mouth.
The performance of the proposed GAN–CNN–Fuzzy framework was evaluated using multiple metrics to ensure robustness and interpretability. As shown in Figure 8, the model achieved a training accuracy of 98.23% and a validation accuracy of 98.42%, with a very low validation loss of 0.06, indicating good convergence and minimal overfitting. The fuzzy confidence layer consistently assigned High confidence (100%) to the predictions, which highlights the model’s reliability and interpretability. Furthermore, the framework reached 98.35% precision, 98.20% recall, and 98.27% F1-score, confirming its balanced performance across all evaluation measures. These results demonstrate that the system is not only accurate in recognizing student identities but also provides trustworthy confidence estimations, which is critical for online exam applications.
Training and validation curves indicate smooth convergence, with training and validation accuracies of 98.23% and 98.42%, respectively, as shown in Figure 9, meaning that the gap between training and validation performance is small. In parallel, Figure 10 illustrates the loss curves for training and validation, which decrease rapidly before reaching a stable plateau, confirming the absence of overfitting and the robustness of the model. This consistency shows that data augmentation, synthetic GAN images, and CNN feature extraction form a highly efficient combination. The model also sustains performance on limited datasets, demonstrating its practical relevance to situations where large-scale labeled facial datasets are unavailable.
The confusion matrix is an important element of model performance analysis when examining classification at the class level, as it reveals the trends of misclassification. As shown in Figure 11, it provides insight into how many samples are correctly predicted for each subject and where misclassifications occur. This allows us to identify subjects with generally high performance but occasional errors, highlighting cases where the model struggles. The Yale B Extended dataset is particularly challenging due to its diversity, with 28 subjects photographed under nine poses and 64 illumination conditions, generating 16,128 grayscale images (192 × 168 resolution). Such variability creates significant intra-class differences, especially under extreme lighting angles or non-frontal positions. Despite these challenges, our sequential GAN–CNN–Fuzzy framework achieved an exceptionally high recognition rate of 98.23% ± 1.23, outperforming prior approaches that often failed to handle illumination and pose variations.
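For readers reproducing this class-level analysis, a confusion matrix of the kind discussed here can be assembled in a few lines; the example uses 3 classes instead of 28 for brevity:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=28):
    """Row r, column c counts samples of true class r predicted as class c."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: one sample of class 0 is misclassified as class 1
y_true = [0, 0, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm.trace() / cm.sum())  # per-sample accuracy = 0.8
```

The diagonal holds the correctly classified counts per subject, so off-diagonal entries directly localize the misclassification trends discussed above.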
An analysis at a more detailed class level indicated that the model achieved a consistently high recognition rate in all 28 subject classes, with most of the subjects recording near-perfect classification. There were only a few misclassifications, generally under very extreme side poses combined with strong shadows, where faces were barely visible. The fuzzy confidence mechanism, even under those difficult circumstances, added further stability by classifying uncertain predictions as medium or low, thus signaling cases that may require secondary validation.
This combination of high accuracy and controlled uncertainty demonstrates the strength of the proposed architecture: the CNN layers captured local spatial textures such as edges and contours, the GAN-based augmentation enriched the training data to promote invariance to pose and illumination, and the fuzzy logic ensured interpretable confidence scores. The system therefore not only exceeded current performance standards but also exhibited strong scalability across all 28 subjects, with only slight degradation under the most challenging illumination–pose combinations. These results indicate that the model can be applied in real-world face recognition scenarios where accuracy under varying conditions is of critical importance.
While the confusion matrix provides a quantitative view of class-level performance and misclassification patterns, it remains important to understand which visual cues the model relies on when making its decisions. To address this, we further analyze the model’s interpretability through saliency maps, as presented in the following section.
The saliency maps, shown in Figure 12, give a pixel-wise representation of the areas within the face images that contribute most to the CNN’s prediction for a specific subject. These maps are computed by taking the gradient of the output class score with respect to the input pixels, and therefore indicate the areas where varying pixel intensity would have the greatest impact on that class. Bright values in the saliency map mark the regions of greatest importance, i.e., the facial features the network depends on for discrimination.
Across the selected images, the saliency maps consistently highlight the central facial landmarks: the eyes, nose, and mouth. These regions carry the most discriminative information, confirming that the CNN focuses on the most informative facial areas. This behavior also matches human intuition, since these are the same features people rely on when identifying one another. The consistency of the highlighted regions across subjects further indicates the robustness of the learned features.
Although the saliency maps highlight the specific facial regions that most influence the CNN’s predictions, it is also important to examine the intrinsic visual characteristics of the dataset itself. To this end, we further analyze the intensity distribution and edge density of the facial images, which help explain the structural variations that affect recognition performance.
The visualizations highlight important pixel-level attributes of the Yale B (cropped) dataset. The mean intensity plot captures the overall brightness trends of each subject’s images and reflects variations caused by illumination conditions or skin tone. Subjects with consistently higher mean values appear brighter, while those with lower values appear darker. Such differences can affect CNN feature extraction if not properly normalized, underscoring the importance of preprocessing.
The edge density plot, computed using the Canny edge detector, illustrates the proportion of high-frequency content such as facial contours, eyes, and mouths. As shown in
Figure 13, subjects with higher edge density exhibit more pronounced structural details, providing the CNN with stronger discriminative cues for recognition. In contrast, lower edge density is typically associated with smoother or low-contrast images, which contain fewer distinctive features and are therefore more challenging for recognition models.
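The two per-image statistics can be sketched as follows. The paper uses the Canny detector for edge density; to keep this example dependency-free, the sketch substitutes a plain gradient-magnitude threshold (the `thresh` value is an assumption), which captures the same idea of measuring high-frequency content:

```python
import numpy as np

def mean_intensity(img):
    """Average pixel brightness of one grayscale image (values in [0, 255])."""
    return float(img.mean())

def edge_density(img, thresh=30.0):
    """Fraction of pixels with strong local gradients. The paper uses the
    Canny detector; this sketch thresholds the gradient magnitude instead
    (threshold value assumed) to approximate the same quantity."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    return float((mag > thresh).mean())

# Toy image: dark left half, bright right half -> edges along the centre column.
img = np.zeros((64, 64))
img[:, 32:] = 200.0
```

On this toy image the mean intensity is 100 and only the narrow band of pixels around the centre boundary counts as edge content, mirroring how smooth, low-contrast faces yield low edge-density scores.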
After examining the confusion matrix and the pixel-level characteristics such as mean intensity and edge density, it is also important to evaluate how well the model separates the classes in a more formal way. For this reason, the ROC–AUC and PR–AUC scores presented in
Table 3 provide a clear measure of the model’s discrimination ability and its precision–recall behavior across all 28 subjects.
ROC and Precision–Recall (PR) curves are essential for evaluating the discrimination ability and confidence behavior of the model. The ROC curves obtained for each class show an AUC of 1.0, indicating essentially perfect separation between positive and negative cases. These curves capture the trade-off between true-positive and false-positive rates, and the consistently maximal AUC values across all subjects confirm that the CNN learned highly discriminative and robust feature embeddings.
A detailed representation of the AUC-ROC curve for each subject is shown in
Figure 14. An AUC close to 1.0 reflects a nearly perfect fit, while any noticeable deviation indicates a weaker predictive capacity. In our case, all classes reach the optimal value of 1.0, confirming the stability and strong performance of the proposed model. Moreover, the joint analysis of ROC and PR curves also emphasizes the effectiveness of GAN-generated samples in extending the coverage of the decision boundary. The introduction of variability through GAN augmentation helps the classifier avoid overfitting to specific poses or expressions, thereby enhancing generalization.
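The per-class, one-vs-rest AUC underlying these curves can be sketched with synthetic scores. The data below are illustrative stand-ins, not the paper’s predictions; the pairwise (Mann–Whitney) formulation makes explicit why perfectly separated scores yield an AUC of exactly 1.0:

```python
import numpy as np

def ovr_auc(class_scores, y_true, c):
    """One-vs-rest ROC-AUC for class c in its pairwise (Mann-Whitney) form:
    the probability that a randomly chosen positive sample outranks a
    randomly chosen negative one (ties counted as half)."""
    pos = class_scores[y_true == c]
    neg = class_scores[y_true != c]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

rng = np.random.default_rng(1)
n_samples, n_classes = 200, 4               # small stand-in for the 28 subjects

y_true = rng.integers(0, n_classes, n_samples)
# Simulated well-separated classifier scores: the true class always wins.
scores = rng.random((n_samples, n_classes)) * 0.3
scores[np.arange(n_samples), y_true] += 0.7

per_class_auc = [ovr_auc(scores[:, c], y_true, c) for c in range(n_classes)]
macro_auc = float(np.mean(per_class_auc))   # 1.0 under perfect separation
```

In the paper’s setting, the same computation is applied to each of the 28 subject classes using the CNN’s SoftMax scores.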
After confirming the excellent separability of classes with the ROC curves, the analysis is complemented by the Precision–Recall curves, which focus on the balance between precision and recall.
The variation in precision and recall for each subject is shown in
Figure 15. Curves that lie closer to 1.0 indicate strong performance in correctly identifying positive cases. As observed, most classes achieve values close to the optimum, although subject 3 shows a weaker curve, mainly due to the smaller number of training samples available, which reduced the classifier’s ability to generalize.
In parallel, the integration of the fuzzy layer increases interpretability, allowing operators to evaluate not only the accuracy of the predictions but also their reliability and degree of confidence. This combination of quantitative evaluation (precision/recall) and qualitative assessment (fuzzy interpretability) makes the system more reliable and transparent for practical deployment.
6.3. Ablation Study
The ablation experiment quantifies the contribution of each model component. The baseline CNN achieved 92.5% accuracy, showing reasonable feature extraction but suffering from overfitting and weak generalization. GAN-based augmentation alone boosted accuracy to 97.4%, indicating that synthetic data substantially improve the model’s ability to handle facial variability. Adding only the Fuzzy Logic stage to the baseline CNN produced an accuracy of 95.6%, suggesting that fuzzy scoring increases interpretability with modest gains, but without replacing the need for diverse training data.
The full GAN–CNN–Fuzzy model, combining GAN augmentation and fuzzy logic, reached a near-perfect accuracy of 98.42%, demonstrating the combined effect of data diversity and interpretable confidence scoring. This shows that although CNNs offer powerful feature extraction, both data augmentation and fuzzy decision layers are necessary to ensure reliability and practical utility. The ablation experiment also confirms that each module tackles a specific problem: GANs reduce overfitting, while fuzzy scoring clarifies the model’s confidence.
The predicted versus actual classes, along with the associated confidence scores, are shown in
Figure 16. In several cases, the model assigned a confidence of 100%, reflecting very high certainty, although some of these high-confidence predictions were incorrect. This observation highlights that a high confidence score does not necessarily guarantee correct classification. Such cases emphasize the importance of interpretable confidence mechanisms, as they help identify situations where the model may be overconfident despite making errors.
The proposed sequential GAN–CNN–Fuzzy model demonstrates strong identity recognition performance across the five evaluated subjects, achieving an overall F1-score of 98.3%. As shown in
Table 4, misclassifications mainly occur in challenging conditions such as extreme illumination or atypical head poses, which remain difficult cases despite data augmentation. The GAN-generated synthetic samples effectively enrich the training set, enabling the CNN to generalize better to such variations, while the fuzzy inference system provides interpretable confidence estimates for identity recognition.
The fuzzy inference system further converts CNN confidence scores into human-understandable concentration levels (Low, Medium, and High), achieving high precision, recall, and F1-scores across all categories, as summarized in
Table 5. Most misclassifications are observed between the Medium and High concentration levels, which can be attributed to subtle head movements or minor eye motions. By incorporating behavioral cues such as gaze orientation, head steadiness, and facial dynamics, the proposed system becomes more robust than approaches relying solely on CNN confidence.
The high and balanced performance of the fuzzy system makes concentration estimation meaningful and practical in real online classroom environments, where students may not remain frontal or motionless. This level of interpretability is essential for real-world deployment, as it allows educators to monitor attentiveness reliably without depending exclusively on raw CNN probabilities. The consistency observed across concentration levels further reflects the robustness of the proposed sequential framework.
An error analysis identified several predictable failure scenarios, including extreme illumination, facial occlusions, and abnormal head poses, as detailed in
Table 6. Although these cases represent a limited portion of the data, they highlight potential areas for further improvement. GAN-based augmentation partially mitigates these issues by introducing diverse synthetic variations, thereby enhancing the robustness of the CNN to identity-related distortions.
Finally, the integration of the fuzzy inference system ensures that borderline concentration cases remain interpretable and less sensitive to subtle facial expressions. Compared to conventional CNN-only approaches, the proposed sequential architecture offers greater resilience, providing both accurate identity verification and reliable attentiveness estimation. Its ability to handle occlusions and pose variations also suggests promising applicability in secure biometric systems and anti-spoofing scenarios.
To evaluate the contribution of each component of our model, we conducted an ablation study comparing the baseline CNN, the CNN combined with GAN augmentation, the CNN with fuzzy logic, and the full sequential configuration. As shown in
Table 7, each added component leads to measurable performance improvements, with the full CNN–GAN–Fuzzy model achieving the highest accuracy and precision.
As a quantitative approach to evaluate the contribution of each component in the proposed GAN–CNN–Fuzzy framework, a systematic ablation study was conducted in which key modules were selectively removed while keeping the remaining architecture and training protocol unchanged. Three main variants were considered: (i) the removal of GAN-based data augmentation, (ii) the removal of the fuzzy inference system while retaining CNN-based feature extraction, and (iii) the removal of regularization techniques such as dropout, batch normalization, and data augmentation. Performance was evaluated using accuracy, ROC–AUC, precision, recall, and F1-score metrics, as reported in
Table 8. This study highlights the individual and combined contributions of each component within the proposed sequential architecture.
The results indicate that GAN-based data augmentation plays a crucial role in improving generalization across the 28 classes characterized by illumination and pose variations. Removing the GAN module leads to a noticeable performance degradation, with accuracy dropping to 96.87% and corresponding decreases in ROC–AUC and F1-score. This confirms that synthetic data generation is essential for handling challenging and underrepresented cases.
The omission of the fuzzy inference system results in a smaller but still significant performance decrease. Although the fuzzy module is primarily designed to enhance interpretability, its interaction with CNN confidence scores helps refine predictions in ambiguous cases, leading to improved precision and recall. Similarly, removing regularization techniques causes a more pronounced drop in performance (95.63% accuracy), highlighting the increased risk of overfitting in the absence of these stabilizing mechanisms.
Overall, the full model consistently outperforms all partially ablated variants across all evaluation metrics. The achieved ROC–AUC of 0.994 and F1-score of 98.9% demonstrate the high discriminative capability of the complete framework. These results validate the design choices of combining GAN-based data enrichment, fuzzy confidence refinement, and regularization to achieve a robust, accurate, and interpretable system for face recognition and student concentration assessment in real-world scenarios.
The evaluation metrics for the different model components are presented in
Figure 17, showing how performance varies across classes and highlighting the specific contribution of each module to the face recognition task. Consistent with the ablation results, the full GAN–CNN–Fuzzy configuration achieves the highest scores (98.42% accuracy), confirming that data augmentation and the fuzzy decision layer each address a distinct issue: GANs reduce overfitting, while fuzzy scoring provides insight into prediction confidence.
In practical terms, the ablation study can guide future research. It clearly shows that simply increasing CNN depth or complexity is not sufficient; instead, combining data augmentation, robust feature extraction, and interpretable decision layers offers a more effective and sustainable solution.
6.4. Baseline Model Comparison
The proposed model was compared against commonly used CNN architectures such as ResNet18 and VGG16 to put the results into perspective. ResNet18 achieved 94.8% accuracy and VGG16 reached 96.2%, but both fell short of the proposed approach: the GAN–CNN–Fuzzy model achieved an accuracy of 98.42%, outperforming the baseline models across every evaluation metric.
The comparative performance of the baseline models and the proposed sequential framework is presented in
Table 9, highlighting the improvements brought by GAN augmentation and the Fuzzy Logic stage.
The results presented in
Table 9 indicate that both ResNet18 and VGG16 achieve lower accuracy and reduced precision and recall compared to the proposed sequential framework. Occasional misclassifications in these baseline models suggest difficulties in distinguishing subjects with similar facial features. The GAN-based augmentation in our approach extends the coverage of facial variability, while the fuzzy layer introduces interpretability and resilience, minimizing errors even in edge cases. This combination proved essential to achieving virtually perfect classification performance.
The proposed CNN–GAN–Fuzzy framework was benchmarked against widely used CNN architectures, namely ResNet18 and VGG16, to evaluate both classification performance and deployment feasibility. While ResNet18 and VGG16 achieved accuracies of 94.8% and 96.2%, respectively, the proposed sequential model reached 98.42%, demonstrating its superior capability to distinguish subjects under variations in pose, illumination, and occlusion. Higher precision and recall values further indicate a reduced number of misclassifications. This improvement is largely attributed to GAN-based synthetic data augmentation, which increases training diversity and enhances generalization, as well as to the fuzzy inference layer, which translates CNN confidence scores into interpretable concentration levels and improves robustness in ambiguous cases.
From a computational perspective, the sequential framework was trained and evaluated on Google Colab using a 15 GB GPU, achieving an average inference time of approximately 12 ms per frame, which is suitable for real-time online proctoring. In contrast, deeper architectures such as VGG16 and ResNet18 typically require higher memory consumption and longer inference latency due to their larger parameter counts. With an optimized architecture of approximately 8 million parameters, the proposed model effectively balances accuracy and efficiency, making it well suited for live e-learning environments where both real-time performance and reliable identity verification and attentiveness monitoring are essential.
The performance of the baseline models across multiple evaluation metrics is shown in
Figure 18. The results illustrate how ResNet18 and VGG16 differ in recognition capability, while the proposed GAN–CNN–Fuzzy model consistently achieves superior performance. Beyond accuracy, the comparison also reveals clear benefits in efficiency and scalability: whereas deeper networks such as VGG16 and ResNet18 require longer training periods and greater computational resources, the proposed framework delivers its results with a comparatively compact and resource-efficient architecture.
High accuracy and interpretability, combined with efficiency, make the proposed model suitable for deployment in real-world applications such as security, authentication, and biometric verification.
In order to ensure that the synthetic face images generated by the GAN were both realistic and useful for training, a combination of quantitative and qualitative evaluation criteria was employed. Quantitatively, the Fréchet Inception Distance (FID) and Structural Similarity Index (SSIM) were computed between generated samples and real training images to measure distributional and structural similarity, respectively. Low FID values (≤35) and high SSIM scores (≥0.85), obtained across five randomly selected subjects, indicate that the generated images closely resemble real faces in terms of global structure, texture, and illumination patterns.
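The FID computation can be sketched as follows. The paper computes FID on Inception-network features; this NumPy-only sketch accepts any feature embedding and evaluates the trace of the matrix square root through eigenvalues, both of which are simplifications made here for illustration:

```python
import numpy as np

def fid(feat_a, feat_b):
    """Frechet distance between Gaussian fits of two feature sets, each of
    shape (n_samples, dim). The canonical FID uses Inception features; any
    embedding works for this sketch. Tr(sqrtm(cov_a @ cov_b)) is taken via
    eigenvalues (all non-negative in exact arithmetic, since both
    covariances are positive semi-definite)."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    eigs = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))                     # stand-in "real" features
fake = real + rng.normal(scale=0.1, size=(200, 8))   # nearby "synthetic" ones
score = fid(real, fake)                              # small for similar sets
```

Lower scores indicate that the synthetic feature distribution lies closer to the real one, which is the sense in which the reported FID ≤ 35 supports the realism of the GAN samples.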
Qualitatively, randomly sampled synthetic images were visually inspected with respect to pose variation, illumination diversity, and facial realism. The GAN was able to generate diverse yet realistic facial samples, including frontal, profile, and tilted poses under varying lighting conditions. This diversity proved particularly effective for augmenting CNN training in challenging scenarios such as strong illumination changes, partial occlusions, and abnormal head poses, thereby improving the robustness of both identity recognition and concentration estimation.
Beyond mapping numerical CNN confidence scores to discrete labels, the proposed framework integrates a Fuzzy Inference System (FIS) to enhance interpretability and robustness. In real online learning environments, CNN outputs are often affected by noise caused by pose variation, illumination changes, partial occlusions, and subtle facial expressions. The FIS employs overlapping triangular membership functions to model low, medium, and high concentration levels, enabling smooth transitions between categories and avoiding abrupt misclassifications near decision boundaries. Empirically determined thresholds (0.60 and 0.85), derived from observed CNN confidence distributions across multiple subjects and sessions, reflect natural confidence clustering. Compared to simple threshold-based mappings, the FIS better handles uncertainty and provides more human-interpretable attentiveness assessments, justifying its inclusion in the proposed sequential framework.
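A minimal sketch of such a fuzzy mapping follows. The 0.60 and 0.85 breakpoints come from the text, but the exact feet and peaks of each triangular membership function are assumptions made here for illustration:

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b.
    Degenerate edges (a == b or b == c) act as open shoulders."""
    left = 1.0 if x >= b else (x - a) / (b - a)
    right = 1.0 if x <= b else (c - x) / (c - b)
    return max(0.0, min(left, right))

def concentration_memberships(conf):
    """Overlapping Low/Medium/High memberships over a CNN SoftMax
    confidence in [0, 1]. The 0.60 / 0.85 breakpoints follow the paper;
    each triangle's exact shape is assumed for illustration."""
    return {
        "Low":    tri(conf, 0.00, 0.00, 0.60),
        "Medium": tri(conf, 0.40, 0.725, 0.85),
        "High":   tri(conf, 0.60, 1.00, 1.00),
    }

def concentration_level(conf):
    """Defuzzify by taking the level with the largest membership."""
    m = concentration_memberships(conf)
    return max(m, key=m.get)
```

Near a breakpoint such as 0.60, both the Low and Medium memberships are non-zero, which is what produces the smooth transitions described above rather than an abrupt threshold switch.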
6.5. Labeled Faces in the Wild (LFW) Evaluation
To evaluate the generalization capability of the proposed GAN–CNN–Fuzzy framework beyond the Extended Yale B (cropped) dataset, supplementary experiments were conducted on the Labeled Faces in the Wild (LFW) dataset. LFW consists of more than 13,000 color images of 5749 individuals captured under uncontrolled real-world conditions, including variations in pose, illumination, expression, occlusion, and background. Representative samples illustrating these challenging conditions are shown in
Figure 19. For consistency, all images were resized to 128 × 128 pixels and preprocessed to match the original input pipeline.
The sequential framework achieved an overall identity recognition accuracy of 92.87 ± 1.45% across five randomly selected subjects, compared to 98.42% on Extended Yale B (cropped), as reported in
Table 10. This performance degradation is expected due to the increased variability and noise present in real-world data. GAN-based synthetic augmentation contributed to improved adaptation to non-frontal poses and inconsistent lighting conditions, while the fuzzy inference layer continued to provide meaningful confidence estimates, reducing the impact of ambiguous predictions caused by extreme head rotations or severe occlusions.
For attentiveness estimation, CNN SoftMax confidence scores combined with fuzzy inference were used to classify concentration levels into Low, Medium, and High. On LFW, high-confidence predictions (>0.85) achieved an F1-score of approximately 0.91, whereas Medium and Low categories exhibited slightly lower performance (F1-score ≈ 0.82–0.85), mainly due to occlusions and partial facial visibility. These results indicate that the fuzzy logic module generalizes well across datasets, maintaining interpretability even under more challenging and unconstrained conditions.
Overall, the LFW experiments confirm that the proposed GAN–CNN–Fuzzy framework generalizes effectively to real-world datasets, preserving both identity recognition and attentiveness estimation capabilities, with expected performance reductions due to higher variability. These findings further support the applicability of the proposed approach to real online learning and e-proctoring scenarios.
6.6. Extended Baseline Comparison
Although the proposed GAN–CNN–Fuzzy sequential framework achieves a high identity recognition accuracy of 98.42% with near-perfect precision and recall on the Extended Yale B (cropped) dataset, its merits must also be assessed against both classical and more recent transfer learning-based architectures, as illustrated in
Figure 20. Models such as MobileNetV2, EfficientNet, and InceptionResNetV2 are representative transfer learning architectures that generalize well across diverse face datasets and are widely used in real-world face recognition and video analytics. Including them alongside the traditional baselines gives deeper insight into the performance trade-offs in identity verification and attentiveness monitoring, particularly when pose, illumination, and occlusion vary.
The results indicate that transfer learning models such as EfficientNet B0 and InceptionResNetV2 outperform shallow CNN baselines by leveraging pretrained features and architectural scaling, achieving accuracies between 97.6% and 97.9%. Their training on large-scale datasets (e.g., ImageNet) enables better robustness to variations in pose and illumination compared to vanilla CNNs. MobileNetV2 offers a favorable trade-off between accuracy and computational efficiency, making it attractive for edge deployment. However, none of these architectures match the overall effectiveness of the proposed sequential framework.
To ensure a fair comparison, several high-capacity architectures were evaluated under identical preprocessing and training conditions, including MobileNetV2, EfficientNet B0, InceptionResNetV2, and ResNet50, and were compared with baseline CNNs, GAN-augmented models, and the complete proposed framework, as summarized in
Table 11. All models were fine-tuned using the same training split and evaluated on a common test set.
As shown in
Table 11, the proposed CNN–GAN–Fuzzy model achieves the best overall performance, with an accuracy of 98.42% and near-perfect precision and recall. This gain is not solely attributable to network depth, but also to GAN-based synthetic augmentation, which enriches the training distribution, and to the fuzzy inference stage, which stabilizes decisions in borderline cases.
The confusion matrices further confirm the improved class-wise discrimination and reduced misclassification achieved by the proposed sequential framework.
When illumination or pose varies widely and fine-tuning data are scarce, standard transfer models may fail; GAN-generated synthetic samples supply the missing variability, resulting in better generalization, as illustrated in
Figure 21. Beyond classification accuracy, the sequential approach also provides interpretable predictions through fuzzy confidence mapping, which standard transfer models cannot offer. In online proctoring, this yields both strong identity verification and assessable attentiveness. The proposed framework achieves near-perfect ROC–AUC (~0.995) and is well suited to systems in which reliability and explainability are prerequisites, highlighting its applicability to real-world educational deployments.
6.7. Comparative Analysis
Several face recognition approaches combining classical models and deep learning techniques have been reported in the literature. Early methods based on Hidden Markov Models (HMMs) and shallow CNN architectures achieved reasonable accuracy while maintaining low computational cost, but their performance degraded under variations in illumination, pose, and occlusion. Handcrafted feature approaches such as overlapped Local Binary Patterns (LBPs) were designed for embedded and low-cost systems, offering favorable execution times but limited robustness in unconstrained environments. More recent optimization-driven deep learning approaches improved recognition performance, albeit at the cost of increased architectural and computational complexity.
A quantitative comparison with representative state-of-the-art methods is summarized in
Table 12, where the proposed sequential CNN–GAN–Fuzzy framework achieves the highest recognition accuracy on the Yale Face Recognition dataset. These results indicate that integrating deep feature learning with synthetic data augmentation and confidence-aware decision modeling provides a clear advantage over earlier HMM-, LBP-, and CNN-based approaches.
Beyond single accuracy values, a more comprehensive evaluation using micro, macro, and weighted averaging metrics is presented in
Table 13. The consistently high accuracy, precision, recall, and F1-score across all averaging schemes demonstrate the stability and robustness of the proposed framework across identity classes.
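The three averaging schemes can be reproduced with scikit-learn on toy predictions (the labels below are illustrative stand-ins, not the paper’s outputs):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(2)
n, k = 280, 28                       # toy stand-in: 10 samples per subject
y_true = np.repeat(np.arange(k), n // k)
y_pred = y_true.copy()
flip = rng.choice(n, size=4, replace=False)   # inject a few misclassifications
y_pred[flip] = (y_pred[flip] + 1) % k

results = {}
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    results[avg] = (float(p), float(r), float(f1))
```

With equal per-class support, the macro and weighted averages coincide and the micro average reduces to overall accuracy; the differences between the three schemes only emerge under class imbalance, which is why reporting all of them demonstrates stability across identity classes.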
The superior performance of the proposed model can be attributed to three main factors. First, GAN-based data augmentation effectively addresses data scarcity by enriching the training distribution with realistic facial variations, thereby reducing overfitting and improving generalization. Second, the deeper and carefully regularized CNN backbone enhances discriminative feature learning compared to earlier shallow or handcrafted feature pipelines. Third, the integration of fuzzy logic enables confidence-aware decision refinement by transforming raw SoftMax outputs into interpretable confidence levels (Low, Medium, High), improving reliability in borderline cases.
Unlike prior CNN-only or handcrafted feature methods, the proposed framework was systematically validated through ablation studies, demonstrating the individual and synergistic contributions of the CNN, GAN, and fuzzy inference components. While previous approaches typically achieved moderate accuracy (94–95%) under controlled conditions, they often lacked robustness, scalability, and interpretability. By contrast, the proposed sequential design bridges these gaps, achieving near-perfect accuracy while simultaneously enhancing transparency and practical reliability.
Although deep embedding models such as FaceNet and ArcFace report very high accuracy on standard benchmarks like LFW under controlled conditions, their performance often degrades in real-world scenarios involving illumination changes, pose variations, occlusions, and limited labeled data. These limitations further motivate the adoption of sequential frameworks that combine data augmentation strategies with interpretable decision-making mechanisms. In this context, the proposed CNN–GAN–Fuzzy framework represents a more practical and reliable solution for real-world face recognition and online proctoring applications.
6.8. Core Contributions
This work proposes a novel sequential CNN–GAN–Fuzzy framework for face recognition and attentiveness estimation, with the following core contributions:
- (1)
Sequential CNN–GAN–Fuzzy Architecture
A unified architecture is introduced that combines CNNs for hierarchical feature extraction, GANs for synthetic data augmentation, and a fuzzy inference system for confidence-aware decision-making. The integration of GAN-based augmentation significantly increases data diversity, enabling improved generalization under variations in pose, illumination, and facial expressions, while the fuzzy layer enhances interpretability and reliability.
- (2)
Interpretable Confidence Modeling via Fuzzy Logic
Unlike conventional CNN-based approaches that rely solely on raw SoftMax probabilities, the proposed framework converts prediction confidence into human-understandable levels (Low, Medium, High) using fuzzy logic. This design improves transparency and decision reliability, which is particularly important for security-sensitive and educational applications such as online proctoring.
- (3)
Systematic Ablation and Component-wise Validation
A comprehensive ablation study was conducted to quantify the individual and combined contributions of each module. Results demonstrate that GAN-based augmentation, fuzzy inference, and regularization techniques each contribute significantly to performance gains, with the complete framework achieving the highest accuracy and robustness compared to partially ablated variants and conventional CNN architectures.
- (4)
Robustness under Real-World Variability
The proposed model exhibits strong resilience to challenging real-world conditions, including pose variation, illumination changes, and partial occlusions. Extensive evaluation on the Extended Yale B (cropped) dataset, covering 28 illumination and pose classes, confirms the robustness of the approach with recognition accuracy exceeding 98%.
- (5)
Deployment Feasibility for Real-Time Applications
Despite its high accuracy, the framework maintains a compact and efficient architecture with low inference latency and moderate memory requirements. This makes it suitable for real-time deployment in biometric verification, online learning, and e-proctoring systems.
- (6)
Contribution to Future Face Recognition Research
This study provides a clear methodological roadmap for developing future face recognition systems that balance accuracy, robustness, and interpretability. By demonstrating the effectiveness of sequential architectures combining data augmentation, deep feature learning, and fuzzy confidence modeling, this work highlights a practical direction for real-world biometric applications.
7. Discussion
Our findings highlight the excellent performance and strength of the proposed sequential GAN–CNN–Fuzzy framework for face recognition in demanding real-life scenarios. Our model, based on convolutional neural networks to extract spatial features, GAN-based augmentation to increase variability, and fuzzy logic to provide interpretable confidence, consistently outperformed traditional deep learning frameworks and previously reported methods. On the Extended Yale B dataset (28 classes, 9 poses per subject, 64 illumination conditions, 16,128 grayscale images), our model achieved an impressive accuracy of 98.42% ± 1.23, with only a few misclassifications. This demonstrates not only strong classification ability but also high robustness to pose variation and illumination diversity, two of the most critical challenges in face recognition research.
Unlike previous papers, which mainly focused on handcrafted features (e.g., LBP) or standard CNN pipelines, our method provided significant improvements in handling non-frontal faces and extreme lighting conditions. For example, earlier approaches often failed when shadows masked important features or when off-angle postures reduced visibility of discriminative regions. In contrast, the CNN layers captured fine-scale local textures such as contours, edges, and shading variations, while the GAN component enriched the dataset with diverse synthetic examples, reducing class imbalance and improving generalization.
Although the CNN adopted in this work follows a conventional architecture composed of convolutional, pooling, and dense layers commonly used in face recognition, its novelty lies in its integration within a sequential framework. Rather than relying on a standalone CNN, the proposed system combines GAN-based data augmentation and a fuzzy logic module in a sequential yet complementary pipeline. This integration enables effective learning from limited training data while improving robustness to pose variation, illumination changes, and occlusions, and simultaneously provides interpretable student attentiveness scores—capabilities that a standard CNN alone cannot offer.
The CNN component further benefits from GAN-generated synthetic images, which enrich the training distribution with diverse facial variations absent from the original dataset. As a result, the CNN learns more generalized and robust feature representations, leading to improved recognition accuracy across varied subjects. In addition, regularization strategies such as dropout, batch normalization, and careful preprocessing are employed to mitigate overfitting while preserving discriminative power, which is essential when coupling CNN outputs with the fuzzy inference stage for downstream attentiveness estimation.
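The two regularization mechanisms mentioned above can be sketched in plain NumPy. This is a conceptual illustration of inverted dropout and batch normalization, not the framework-specific layers used in the actual model (which also maintain running statistics and learnable scale/shift parameters at inference time).

```python
import numpy as np

rng = np.random.default_rng(42)

def inverted_dropout(x, rate, training=True):
    """Inverted dropout: zero a fraction `rate` of activations during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def batch_norm(x, eps=1e-5):
    """Batch normalization (training-time sketch): normalize each feature
    over the batch to zero mean and unit variance."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

acts = rng.normal(size=(64, 128))          # a batch of 64 feature vectors
normed = batch_norm(acts)
dropped = inverted_dropout(normed, rate=0.5)
print(normed.mean().round(6), normed.std().round(3))
```

In the full model these operations sit between convolutional blocks; at test time dropout is disabled and batch normalization uses statistics accumulated during training.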
Finally, the proposed design is distinctive in that the CNN not only performs identity recognition but also provides confidence scores that are transformed by the fuzzy logic system into human-interpretable attentiveness levels. This unified treatment of identity verification and concentration monitoring is particularly valuable for real-time e-learning and online proctoring scenarios. Unlike previous studies that address face recognition and attentiveness estimation separately, the proposed methodology integrates both objectives within a single coherent framework, enhancing its practical applicability.
Another important contribution of our framework is the fuzzy confidence mechanism, which refines the interpretation of predictions. While most test cases were categorized with high fuzzy confidence, ambiguous samples—particularly side poses under deep shadows—were identified as medium or low. This adds practical protection in real-world applications, where the system can request further verification or trigger multi-modal checks when confidence is insufficient.
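A minimal sketch of such a fuzzy confidence mechanism is shown below, using triangular membership functions to fuzzify the CNN's top-class probability into Low/Medium/High degrees. The breakpoints here are hypothetical placeholders: as noted later in the paper, the actual thresholds were defined empirically.

```python
def tri(x, a, b, c):
    """Triangular membership function with peak at b and feet at a and c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_confidence(p_max):
    """Fuzzify the top-class probability into Low/Medium/High memberships
    and return the label with the highest membership degree.
    Breakpoints are illustrative, not the empirically tuned values."""
    memberships = {
        "Low":    tri(p_max, -0.01, 0.0, 0.55),
        "Medium": tri(p_max, 0.40, 0.65, 0.90),
        "High":   tri(p_max, 0.75, 1.0, 1.01),
    }
    label = max(memberships, key=memberships.get)
    return label, memberships

label, m = fuzzy_confidence(0.95)
print(label)  # -> High for a confidently recognized face
```

Because adjacent membership functions overlap, a borderline sample (e.g., a side pose under deep shadow with p_max ≈ 0.55) carries non-zero membership in two levels at once, which is precisely what lets the system flag ambiguous cases for further verification rather than forcing a hard decision.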
Notably, the system achieved a very high per-class recognition rate across all 28 subjects, demonstrating scalability and strong generalization. Overall, our research shows that CNNs, GANs, and fuzzy logic can be combined to provide a unified solution to the long-standing challenge of pose and illumination invariance in face recognition. With its high accuracy, low variance, and interpretable results, the proposed framework sets a new benchmark on the Extended Yale B dataset and supports the feasibility of real-world deployment in biometric systems. These findings provide solid grounds to believe that the future of face recognition research lies in sequential architectures that emphasize accuracy, robustness, and interpretability—beyond the limits of traditional deep learning pipelines.
Despite its strong performance, the proposed framework relies on empirically defined fuzzy thresholds and was evaluated on a limited number of datasets. Future work will focus on adaptive threshold learning and larger-scale real-world deployments.
The GAN–CNN–Fuzzy system presented shows near-perfect performance on the Extended Yale B dataset; however, there are several opportunities to extend this study toward broader scope and greater robustness:
The experiments in this work were conducted using the Extended Yale B dataset, which is composed of grayscale facial images. This choice ensured controlled lighting and pose variations but limits the diversity of emotional and color information. In future work, we plan to extend the study to color-based facial datasets that include a broader range of emotions and expressions. This will allow the system to better capture subtle emotional cues and differentiate between concentrated and distracted states in students during online learning sessions.
Future research could incorporate additional data modalities, including infrared images, depth maps, or thermal images, to improve recognition in challenging environments such as low light, occlusions, or extreme facial expressions. Multi-modal fusion would strengthen feature representation and enhance generalization, leading to a more resilient system for real-world deployment.
Although the current CNN structure efficiently captures hierarchical features, integrating attention-based or transformer layers could allow the network to dynamically focus on the most informative regions of the face. This would increase discriminative power for classes with subtle inter-class variations and reduce reliance on large-scale data augmentation.
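The attention mechanism envisioned above can be illustrated with a NumPy sketch of scaled dot-product self-attention, the building block of transformer layers. This is a generic sketch, not part of the proposed architecture: the token count and dimensionality are arbitrary placeholders standing in for flattened CNN feature-map cells.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention: each query attends over all keys,
    producing a weighted sum of values (weights sum to 1 per query)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(7)
# 16 spatial tokens (e.g., flattened feature-map cells), 32 dimensions each
tokens = rng.normal(size=(16, 32))
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)
```

Inserted after a convolutional block, such a layer would let each facial region reweight its representation based on the most informative other regions (eyes, mouth corners), rather than treating all spatial positions uniformly.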
Another future direction is optimizing the framework for deployment on edge devices and mobile platforms. Techniques such as model pruning, quantization, and lightweight CNN variants have the potential to reduce computational and memory costs without sacrificing accuracy, enabling real-time facial recognition in security, authentication, and IoT systems.
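As a concrete illustration of one such technique, the sketch below implements symmetric per-tensor int8 post-training quantization of a weight matrix in NumPy. Production toolchains (e.g., framework-native quantization) add per-channel scales and calibrated activation quantization; this minimal version only shows the core idea and its 4x memory saving over float32.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats in [-max|w|, max|w|]
    to integers in [-127, 127], keeping one float scale for dequantization."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = float(np.abs(w - w_hat).max())
print(q.dtype, round(err, 6))   # int8 storage: 1 byte/weight vs 4 for float32
```

The worst-case reconstruction error is half a quantization step (scale / 2), which for well-behaved CNN weight distributions is typically small enough to leave recognition accuracy nearly unchanged.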
8. Conclusions
In this research, we proposed a sequential GAN–CNN–Fuzzy model for face recognition, which effectively addresses the challenges of pose variation, illumination diversity, and limited sample availability in the Extended Yale B dataset. Our solution, consisting of GAN-based data augmentation, CNN-based spatial feature extraction, and fuzzy logic-based interpretable confidence scoring, achieved 98.42% ± 1.23 accuracy across 28 subjects, significantly higher than previous approaches. These findings demonstrate that sequential models not only increase recognition accuracy but also provide greater reliability and generalization in unconstrained environments, making them suitable candidates for deployment in real biometric and surveillance systems.
The proposed GAN–CNN–Fuzzy framework yielded an overall accuracy of 98.42% with near-perfect confidence across all classes. The CNN backbone automatically learned hierarchical facial representations, from low-level visual details to high-level semantic features. GAN-based augmentation contributed to improved generalization by synthesizing realistic variations in pose, illumination, and expression. The fuzzy inference layer offered an interpretable robustness measure, converting raw probability scores into qualitative levels (Low, Medium, High) that reflect the certainty of predictions.
Comparative results against baseline architectures such as ResNet18 and VGG16 confirmed that our sequential approach not only outperformed conventional CNNs in accuracy but also introduced trustworthy interpretability, reinforcing its potential for real-world biometric applications. Furthermore, the ablation study validated the complementary role of each module: CNN for feature extraction, GAN for data diversity and generalization, and fuzzy scoring for interpretability.
Overall, our study demonstrates that by carefully combining feature extraction, data augmentation, and decision-level interpretability, the long-standing challenges of face recognition can be addressed collectively.