A Comparative Benchmark of Face Detection Models for Noisy and Dynamic Online Class Environments

Isaza, Cesar; Ibarra Tapia, Pamela Rocío; Ramirez-Gutierrez, Cristian Felipe; Zavala de Paz, Jonny Paul; Rizzo Sierra, Jose Amilcar; Anaya, Karina

doi:10.3390/fi18040208


by Cesar Isaza *,†, Pamela Rocío Ibarra Tapia, Cristian Felipe Ramirez-Gutierrez *,†, Jonny Paul Zavala de Paz, Jose Amilcar Rizzo Sierra and Karina Anaya

Cuerpo Académico de Tecnologías de la Información y Comunicación Aplicada, Universidad Politécnica de Querétaro, El Marqués, Querétaro 76240, Mexico

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Future Internet 2026, 18(4), 208; https://doi.org/10.3390/fi18040208

Submission received: 1 March 2026 / Revised: 10 April 2026 / Accepted: 12 April 2026 / Published: 15 April 2026

(This article belongs to the Special Issue Developments of Computer Vision and Image Processing: Methodologies and Applications—2nd Edition)


Abstract

Monitoring students’ on-screen availability is increasingly critical for analyzing participation patterns in synchronous online learning, especially under videoconferencing conditions characterized by compressed video streams, low-resolution face regions, fluctuating bandwidth, and dynamically reconfigured grid layouts. This study introduces a practical computer vision pipeline that integrates deep learning-based face detection, lightweight embedding-based identity matching, and frame-level temporal aggregation to estimate students’ visual presence (VP) during live online classes. A real-world dataset comprising 27 participants and 16,200 frames was collected under authentic conditions, including codec compression, variable image quality, and dynamic layout changes. Four widely used face detection models (Haar Cascade, DSFD, MTCNN, and YuNet) were benchmarked on noisy and low-quality images. Quantitative evaluation on a manually annotated subset of 270 frames demonstrates that MTCNN and YuNet yield lower average VP estimation errors (27.63% and 22.20%, respectively) compared to Haar Cascade (75.34%) and DSFD (47.14%), with YuNet also achieving the shortest average processing time of 0.069 s per frame. While the pipeline is intentionally streamlined to facilitate practical use by instructors, the study provides clearly defined steps and parameter settings, establishing a reproducible procedure for benchmarking face detection performance in synchronous online class environments.

Keywords: face detection; MTCNN; online learning; neural networks; virtual classrooms


1. Introduction

Within the broader context of the future internet, virtual educational systems are increasingly embedded in intelligent digital ecosystems characterized by distributed platforms, cloud-based infrastructures, and real-time analytics [1]. Such environments enable the creation of smart learning ecosystems [2] that orchestrate educational data flows, continuously adapt instructional content, and deliver immediate feedback at scale. Online assessments have become increasingly common in higher education, driven by technological advancements and the demand for flexible, accessible, and scalable evaluation methods. However, assessing crucial student variables, such as visual presence, which is closely linked to student attention and engagement, poses a unique challenge in these virtual environments [3]. While instructors can gauge attention in traditional in-person settings by interpreting sensory cues (i.e., vision, hearing, speech, and touch), these parameters become more complex in online settings. They are relatively understudied in the context of synchronous online classes. This gap highlights a broader need to understand how factors such as students’ readiness and acceptance influence their academic performance and satisfaction in online settings [4].

In online learning environments, students’ visual presence on screen is frequently used as a proxy for engagement, with recent systems modeling this through observable behavioral cues [5,6]. However, visual presence should not be equated with cognitive attention, which involves complex brain processes that enable adaptive and effective behavioral choices [7]. A student’s visible face on screen does not necessarily indicate genuine cognitive engagement. Nevertheless, visual presence functions as a low-level behavioral indicator that allows instructors to monitor student connectivity, facilitate prompt re-engagement, and provide guidance or redirection during synchronous sessions. In traditional classrooms, environmental and social cues support attention through direct interaction with instructors and peers [8,9,10]. In virtual settings, a reduction in these cues increases the challenge of maintaining focus. Therefore, although visual presence offers a limited perspective, it remains a practical and interpretable indicator within the broader context of supporting student attention in online learning.

Several recent developments and proxies have emerged to address the gap between psychological theory and practical technology. Most modern systems operationalize attentiveness not as a covert cognitive state, but as an overt, observable behavior that correlates with being on task [9]. These behavioral proxies include on- and off-screen gaze, head orientation, and eye state [11], which can be efficiently measured using computer vision pipelines. Such approaches often employ short temporal windows (e.g., 5–15 s) to capture state changes and micro-events while smoothing noisy data. Importantly, these methods avoid common pitfalls, such as assuming that a student’s gaze toward the camera directly reflects cognitive attention, and they acknowledge that facial emotion alone is a weak indicator of focus. Within this broader context, state-of-the-art engagement-detection frameworks increasingly rely on advanced facial behavior analysis toolkits such as OpenFace, which integrate facial landmarks, head pose, gaze estimation, and action unit recognition into multitask architectures [12], or on multimodal systems that combine student behavior cues with affective inference using DeepFace-based models [13]. These systems aim to estimate higher-level engagement or affective states, requiring richer input signals and more complex computational pipelines [14,15,16,17,18]. Taken together, these approaches go beyond simple attendance tracking, providing educators with real-time data on the duration and frequency of on-screen presence. By understanding these patterns of visual presence, instructors can adapt their teaching methods to better manage the learning process, ensuring that students remain behaviorally engaged and that instruction is more effective. Ultimately, these tools aim to enhance learning outcomes by directly addressing the fundamental challenge of maintaining student attention in dynamic educational environments [19,20,21]. However, the very platforms designed to facilitate this learning present significant technical challenges.

Video conferencing services, such as Google Meet, Zoom, Microsoft Teams, and Webex, host large sessions that often exceed the capacity of physical classrooms. To optimize performance and bandwidth, they compress video and display participants’ feeds in small, dynamic grids. This practice, however, poses a significant obstacle to computer vision algorithms that rely on high-resolution images for accurate tracking. The low-resolution input and constantly shifting layout, caused by participants joining, leaving, or speaking, make it challenging to track facial features and analyze student attention precisely. Recent advances in robust feature representation have explored novel strategies to stabilize deep models under visual degradation. For example, recent works have introduced a hybrid-resolution and boundary-aware architecture combined with a dual-task mutual learning framework that enhances robustness across increasingly complex scenes [22,23]. Their domain-transform approach improves feature stability even under corrupted or noisy conditions. However, this framework comes with notable limitations for face-centric online class environments, including its computational complexity, reliance on high-resolution inputs, and its focus on full-body semantic parsing rather than small-scale facial regions. Additionally, the method does not explicitly address dynamic grid layouts typical of videoconferencing platforms. Therefore, developing a robust system requires overcoming the dual hurdles posed by low-resolution video and highly dynamic visual structure. Existing face detection benchmarks have primarily evaluated models on curated, high-quality datasets such as WIDER FACE [24] or FDDB [25], or under controlled laboratory conditions with stable lighting, fixed camera positions, and uncompressed video streams. 
To our knowledge, there are no publicly available benchmarks that assess face detection performance in the conditions typical of real synchronous online classrooms, namely, low-resolution video tiles produced by platform-level compression, fluctuating bandwidth, and dynamically reconfigured grids triggered by speaking activity or student entry and exit. This gap is significant, as the visual distortions and layout changes introduced by platforms such as Google Meet or Zoom fundamentally alter the detection problem in ways not captured by existing datasets.

To address these challenges, this study proposes a computer vision pipeline specifically designed for noisy and dynamic online class grids, aligning with the goal of benchmarking multiple deep learning methods for robust face detection. The methodology begins with the acquisition of real-time video streams recorded through Open Broadcaster Software (OBS), from which frame sequences containing all participants are extracted. These images are processed through a multi-stage detection and identification pipeline that incorporates four face detection models: Haar Cascade (HAAR) [15,26], a classical machine-learning method; Dual Shot Face Detector (DSFD) [27], a two-stage deep detector; Multitask Cascaded Convolutional Network (MTCNN) [14], a lightweight cascaded architecture widely used for multi-scale face localization; and YuNet [28], a recent millisecond-level deep learning model optimized for low-resolution input. After detection, each face is normalized, embedded into a numerical feature vector, and matched against a reference dictionary to establish participant identity across all frames. This enables the computation of the visual presence (VP) metric, defined as the proportion of frames in which each student is successfully detected. By applying this pipeline to a dataset of 27 participants and 16,200 frames, the study not only quantifies visual presence but also explicitly compares the detection performance of HAAR, DSFD, MTCNN, and YuNet, demonstrating how different models behave under the noisy, compressed, and dynamically shifting grid layouts characteristic of modern videoconference platforms. This integrated approach aligns the manuscript with a benchmark of image-processing pipelines for face detection in real-world online classroom scenarios, providing a foundation for evaluating behavioral engagement in synchronous virtual learning.

Considering the above, the novelty of this study lies not only in introducing a deep learning-based approach for monitoring students’ visual presence, but in conducting a systematic benchmark of multiple face detection pipelines under noisy and dynamic online class conditions. Unlike previous works that focused on single models or relied exclusively on pre-recorded, idealized datasets, this research evaluates four distinct algorithms in a controlled yet realistic case study based on actual synchronous class sessions. This comparative design provides empirical evidence of how each method behaves under low-resolution grids, compression artifacts, and temporally shifting layouts, conditions typical of videoconferencing platforms. Beyond proving feasibility, the proposed framework offers a practical, replicable, and experimentally validated strategy for measuring behavioral engagement, highlighting the strengths and limitations of each detection model in scenarios closely aligned with real teaching practice.

2. Material and Methods

2.1. Models of Interaction Between Humans and Machines

Human–machine interaction describes the dynamics by which two or more people communicate and collaborate with a machine or technological system. This encompasses a diverse range of technologies that facilitate communication and teamwork, including computers, mobile devices, video conferencing systems, and online collaborative platforms. In synchronous online classes, interaction between students and instructors is mediated entirely through videoconferencing platforms, which impose technical constraints highly relevant to computer vision analysis. Unlike in physical classrooms, where multiple sensory cues are available, online interaction depends almost exclusively on low-resolution, compressed video tiles that are dynamically rearranged as participants join, leave, or speak. These conditions introduce variable frame quality, inconsistent face sizes, and frequent layout changes, all of which directly affect the stability of face detection pipelines. For this reason, the experimental design in the present study focuses specifically on evaluating detection methods under such platform-induced distortions rather than modeling higher-level pedagogical interaction structures.

Figure 1 illustrates four distinct scenarios of the human–machine–human interaction model.

This section details the materials and methods used to assess participants’ visual presence in online classes, with particular emphasis on the Multitask Cascaded Convolutional Network (MTCNN) algorithm. Figure 2 illustrates the overall framework. The approach analyzes image sequences using computer vision techniques. The experimental design ensured that all individuals participated voluntarily in refining the methodology for evaluating visual presence. Specifically, 27 participants were selected for data collection using a screen-capture tool.

The screen-capture tool recorded data for all group attendees. During data preprocessing, the data were screened for atypical, duplicate, or out-of-context items; any that occurred were inspected and normalized. A deep artificial neural network was selected as the learning strategy because of its strong performance on computer vision tasks, and face detection was performed with this method. Subsequently, each detected face was mapped to a numerical feature representation for identification, in which the similarity between that representation and the reference representations in the database was measured.

When creating the dataset, the confidentiality of sensitive information was prioritized. Informed consent was obtained from all participants, and data collection was carried out in strict adherence to ethical guidelines and regulatory standards. Students were aware of how their data would be used and had the autonomy to decide whether to share it publicly. Anonymization involves removing personal identifiers such as names, addresses, and contact information to protect individuals’ privacy. Ethical guidelines recommend sharing data in open repositories following the FAIR principles [29] (Findable, Accessible, Interoperable, Reusable) to enhance transparency and reproducibility in research. Respecting data ownership and intellectual property rights, acknowledging contributions, and fostering a collaborative research environment are also crucial.

The experiment was conducted in synchronous online classes, and the characteristic evaluated was oscillation, that is, the ability to shift attention efficiently and effectively from one stimulus to another. In an online class setting, oscillation is evident as students alternate their focus among various screen elements, such as teacher presentations, class chats, and individual assignments. In this context, visual presence is linked to the duration of their participation in the sessions. The first step is to broadcast the class in real time and record it with OBS, a popular open-source tool for live streaming and recording multimedia content, including classes, presentations, and gaming sessions.

2.2. Data Acquisition

Data were collected by accessing synchronous online class sessions conducted via Google Meet. The sessions were recorded in real time using OBS, which captures the full videoconferencing grid and preserves all participant video tiles as displayed on screen. Unlike the built-in Google Meet capture option, which prioritizes the active speaker or shared content, OBS records the complete layout of attendees throughout the session.

The resulting video file was stored in MP4 format with a spatial resolution of 640 × 360 pixels and a frame rate of 30 fps. The video stream was encoded in RGB24 format with an average bitrate of approximately 0.166 Mbps, reflecting the strong compression and resolution constraints imposed by the videoconferencing platform. These characteristics define the visual input conditions for all subsequent face detection and visual-presence analyses.
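To put the reported bitrate in perspective, the following back-of-the-envelope sketch compares it with the uncompressed data rate implied by the stated resolution and frame rate. It assumes 3 bytes per pixel for RGB24 and takes 1 Mbps as 10^6 bits per second; the variable names are illustrative and not part of the original pipeline.

```python
# Rough compression estimate from the recording parameters reported above.
WIDTH, HEIGHT = 640, 360   # spatial resolution in pixels
FPS = 30                   # frames per second
BYTES_PER_PIXEL = 3        # assumption: RGB24 stores 3 bytes per pixel
REPORTED_MBPS = 0.166      # average bitrate reported for the MP4 file

# Data rate the same stream would need without any compression.
raw_bits_per_second = WIDTH * HEIGHT * BYTES_PER_PIXEL * 8 * FPS
raw_mbps = raw_bits_per_second / 1_000_000  # assumption: 1 Mbps = 10**6 bit/s

# Ratio between the uncompressed rate and the reported average bitrate.
compression_ratio = raw_mbps / REPORTED_MBPS

print(f"Uncompressed rate: {raw_mbps:.1f} Mbps")
print(f"Approximate compression ratio: {compression_ratio:.0f}:1")
```

Under these assumptions the recorded stream carries roughly one thousandth of the raw pixel data rate, which is consistent with the strong compression described above.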

After recording, image sequences were extracted from the video using Google Colab, where execution and processing were performed in Python 3.12.13. A graphics processing unit (GPU) was used to accelerate frame extraction and downstream analysis. Since the video data were stored in the cloud, appropriate access permissions were required to retrieve them. Following extraction, standard preprocessing steps, including data cleaning, integration, transformation, and reduction, were applied. All frames retained identical spatial dimensions and resolution and were stored remotely for reproducibility.

The extracted images were handled as numerical arrays using the NumPy 2.0.2 library, while visualization and inspection were performed with the Pyplot module of the Matplotlib 3.10.0 library. Figure 3 presents an example frame obtained from the recorded session.

2.3. Face Detection Methods

Before the emergence of deep learning models, several traditional methods were developed for face detection, of which the Haar Cascade is among the most influential. Introduced by Viola and Jones [30], it uses Haar-like features to identify regions of an image that resemble a human face. The strategy begins by converting the input image to grayscale and computing rectangular features that capture differences in pixel intensity, such as edges, lines, and textures. The AdaBoost algorithm then selects the most informative features, each acting as a weak classifier, and combines them into strong classifiers that are arranged in a cascade so that non-face regions are rejected early. Despite its practicality, the method has been surpassed in accuracy and robustness by deep learning models such as the Dual Shot Face Detector (DSFD).
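As an illustration of the rectangular features described above, the sketch below computes a single two-rectangle (vertical edge) Haar-like feature on a toy grayscale patch. It relies on a summed-area table (integral image), the device used in the Viola–Jones framework to evaluate rectangle sums in constant time. The function names and the 4 × 4 patch are hypothetical and serve only as an example; this is not the detector used in the study.

```python
def integral_image(gray):
    """Return a summed-area table with an extra zero row and column."""
    height, width = len(gray), len(gray[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return table[y + h][x + w] - table[y][x + w] - table[y + h][x] + table[y][x]


def two_rect_edge_feature(table, x, y, w, h):
    """Vertical-edge Haar-like feature: left half minus right half."""
    half = w // 2
    left = rect_sum(table, x, y, half, h)
    right = rect_sum(table, x + half, y, half, h)
    return left - right


# Toy 4x4 grayscale patch: bright left half, dark right half.
patch = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
table = integral_image(patch)
feature_value = two_rect_edge_feature(table, 0, 0, 4, 4)
print(feature_value)
```

A large absolute value indicates a strong intensity difference between the two halves, which is the kind of response a weak classifier thresholds.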

The Dual Shot Face Detector (DSFD), introduced by Li et al. [27], is a deep learning-based two-stage detection framework designed to improve both feature learning and detection precision in face detection tasks. In the first stage, the model uses a Feature Enhancement Module (FEM) to enhance multi-scale feature representations by fusing contextual information via dilated convolutions, yielding more discriminative and robust features. The second stage refines these preliminary detections by applying a Progressive Anchor Loss (PAL) mechanism that initially uses smaller anchors and subsequently larger ones, gradually enhancing localization accuracy across hierarchical feature maps. Additionally, the Improved Anchor Matching (IAM) strategy integrates optimized anchor assignment with data augmentation to ensure a more balanced and accurate correspondence between anchors and ground-truth faces, stabilizing the training process. Although DSFD incurs a higher computational cost than traditional approaches such as Haar Cascades, its dual-shot design achieves state-of-the-art performance on benchmarks like WIDER FACE and FDDB, particularly under challenging conditions involving small, occluded, or low-quality facial images.

Like DSFD, the Multitask Cascaded Convolutional Network (MTCNN) has emerged as a strong choice for face detection compared with earlier methods [31]. Owing to its accuracy and versatility, it has been adopted as the preferred option in various applications. For instance, a survey by Sanchez [32] extensively explores face detection methods, emphasizing the critical role of accurate detection in tasks such as face recognition and video surveillance. That survey examines the evolution of face detection techniques and highlights the significance of MTCNN for precise detection in challenging scenarios characterized by low image resolution, severe occlusions, and varying lighting conditions, focus, and subject orientations.

Furthermore, studies that assess face detectors in terms of processing time and accuracy have included MTCNN among the methods evaluated. Chaves et al. [33] explicitly evaluate MTCNN and provide insights into the strengths and weaknesses of deep learning models for face recognition, particularly under image degradation, further supporting the effectiveness of MTCNN in face detection applications.

The MTCNN algorithm is based on pyramidal processing of the original image: the image is rescaled to multiple sizes so that faces of different sizes can be detected. The algorithm comprises three cascaded stages: P-Net, R-Net, and O-Net. The first stage, P-Net, processes 12 × 12 × 3 windows, and its first convolutional layer applies 3 × 3 filters with 10 output channels. P-Net proposes candidate face regions, R-Net rejects false candidates and refines the bounding boxes, and O-Net produces the final boxes together with facial landmarks. During this last phase, landmark points such as the eyes, nose, and mouth corners are extracted to aid in face identification. Ultimately, the MTCNN algorithm produces bounding boxes outlining detected faces and landmark points for each face.
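The pyramidal rescaling can be sketched as follows. The example lists the scale factors at which a frame would be resized so that faces of the minimum detectable size map onto the 12 × 12 input of P-Net. The minimum face size of 20 pixels and the 640 × 360 frame come from this study; the pyramid step of 0.709 is an assumption taken from common public MTCNN implementations, and the function name is hypothetical.

```python
def pyramid_scales(width, height, min_face_size=20, factor=0.709, net_input=12):
    """Return the scale factors of an MTCNN-style image pyramid.

    factor=0.709 is an assumed default (not stated in the paper).
    """
    scales = []
    # First scale maps a face of min_face_size pixels onto the 12-pixel P-Net input.
    scale = net_input / min_face_size
    min_side = min(width, height) * scale
    # Keep shrinking until the rescaled image is smaller than the P-Net input.
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


scales = pyramid_scales(640, 360)
for s in scales:
    print(f"scale={s:.4f} -> {int(640 * s)} x {int(360 * s)} px")
```

Each additional pyramid level lets the fixed-size P-Net window cover a larger face in the original frame, at the cost of one more forward pass.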

In 2023, Wei et al. [28] introduced a face detection framework, called YuNet, that builds upon deep convolutional neural network architectures to improve precision and robustness under complex visual conditions. The approach optimizes feature representation and detection accuracy by integrating multi-scale contextual information with a lightweight, anchor-free detection design. Specifically, it introduces mechanisms to refine feature learning, balance the distributions of positive and negative samples, and strengthen detection consistency across varying image resolutions and illumination conditions. By combining these components, YuNet achieves accurate and stable face localization in challenging scenarios such as occlusion, pose variation, and small-scale face detection.

Table 1 summarizes the four face detection models selected for this benchmark and clarifies why they constitute a representative cross-section of architectural paradigms. Haar Cascade exemplifies early handcrafted-feature methods with minimal computational requirements, providing a historical and computational baseline. DSFD represents high-capacity two-stage detectors designed for accuracy under ideal imaging conditions, allowing us to examine how such architectures degrade in the presence of online classroom distortions. MTCNN, a cascaded multi-stage model, is particularly well-suited for small faces and thus relevant for grid layouts where students occupy only a few pixels. Finally, YuNet illustrates modern lightweight, anchor-free detectors optimized for CPU environments, making it a strong candidate for practical deployment by instructors. Together, these models span classical ML, multi-stage deep pipelines, high-capacity anchor-based CNNs, and lightweight real-time architectures, enabling a systematic comparison across diverse detection strategies under compressed, low-resolution, and dynamically changing videoconferencing conditions.

In our implementation, MTCNN was imported directly from the standard Python/OpenCV-based library (OpenCV v4.10) and used with its reference configuration. No retraining or architecture modification was performed. The detector was instantiated with thresholds [0.6, 0.7, 0.7] for the three-stage cascade, a minimum detectable face size of 20 pixels, and an input normalization size of 160 × 160 pixels. These values define the valid detection range between 20 and 160 pixels, which was selected empirically through trial and error and reflects common practice in face detection under low-resolution and compressed video conditions.

2.4. Image Segmentation from Face Detectors

Before computing the visual presence, it is essential to convert the image to a bitmap format, which offers lossless compression and preserves image quality, enabling subsequent face identification and facilitating the tracking of each attendee’s session duration.

The proposed strategy uses deep neural network-based algorithms as the primary face detection method. The detector combines several neural network models that sequentially refine the generated detections. It runs on PyTorch 2.10.0+cpu, which builds dynamic rather than static computation graphs.

For each detection, the algorithm reports the probability that the region contains a face. The range of accepted probabilities depends on the detection thresholds of each face detection algorithm. Subsequently, the faces are extracted as shown in Figure 4. The generated images are stored in bitmap format, which offers lossless compression and preserves image quality.

2.5. Embedding of Faces and Similarity Between Faces

Face embeddings are vectors that represent a face’s key features in a fixed-dimensional space. During training, the neural network learns to map facial features so that similar faces cluster together. This process allows the network to capture relevant and distinctive information about faces. The closer two embeddings are in this space, the greater the similarity between the corresponding faces. This similarity is often measured using the Euclidean distance.

During this process, the algorithm-identified faces are separated and stored in a folder corresponding to the analyzed frame. This validates the number of faces detected.

2.5.1. Generation of Reference Dictionary

A reference dictionary was employed to support identity consistency during face recognition within a single online session. In this framework, the dictionary is constructed directly from the same Google Meet session used for data acquisition, without any separate or prior enrollment phase. For each participant, the system extracts the participant’s displayed name from the videoconferencing platform and associates it with the corresponding face regions detected in video frames where the camera is active. Specifically, between three and five representative face crops are automatically extracted from frames in which the face is clearly visible. These frames are selected opportunistically from the live stream, ensuring frontal or near-frontal views whenever available, while preserving the platform’s native imaging conditions (resolution, compression, illumination, and camera quality).

The role of the reference dictionary is not to perform cross-session biometric identification, but to provide a lightweight and session-specific identity mapping that enables consistent tracking of visual presence across frames. Instead of storing raw pixel data, each face crop is transformed into a compact embedding vector, yielding a normalized feature representation that supports efficient comparison using distance-based similarity measures. This design reduces memory requirements and avoids discrepancies between enrollment and recognition conditions, since both reference and query embeddings originate from the same session. As a result, identity matching remains constrained to intra-session analysis and can be performed without curated datasets or dedicated enrollment procedures, which aligns with the practical constraints of real-time online classroom environments.

2.5.2. Identity Recognition

A reference dictionary is a well-established approach for recognizing specific individuals. A detection pipeline was implemented for face detection and identification, incorporating the following key parameters: the minimum similarity threshold for matching identified faces, the minimum confidence required for a detected face to be included in the results, and the detection thresholds for each deep neural network in the face detector.
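The role of these parameters can be illustrated with a minimal sketch, assuming that each detection is a (confidence, embedding) pair and that the reference dictionary maps a displayed name to a few reference embeddings. The threshold values, the two-dimensional toy embeddings, and the names `identify`, `student_a`, and `student_b` are hypothetical; the paper does not report these specifics.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(detections, reference, min_confidence=0.9, max_distance=0.8):
    """Label each sufficiently confident detection with its nearest identity.

    min_confidence and max_distance are assumed values for illustration only.
    """
    labels = []
    for confidence, embedding in detections:
        if confidence < min_confidence:
            continue  # drop detections below the minimum confidence
        best_name, best_distance = None, float("inf")
        for name, ref_embeddings in reference.items():
            for ref in ref_embeddings:
                distance = euclidean_distance(embedding, ref)
                if distance < best_distance:
                    best_name, best_distance = name, distance
        # Accept the nearest identity only if it is close enough.
        labels.append(best_name if best_distance < max_distance else None)
    return labels


# Toy reference dictionary: displayed name -> reference embeddings.
reference = {
    "student_a": [[0.0, 0.0], [0.1, 0.0]],
    "student_b": [[1.0, 1.0]],
}
detections = [
    (0.99, [0.05, 0.0]),  # close to student_a
    (0.95, [0.9, 1.0]),   # close to student_b
    (0.50, [0.0, 0.0]),   # discarded: low detection confidence
    (0.97, [5.0, 5.0]),   # far from every reference embedding
]
labels = identify(detections, reference)
print(labels)
```

Keeping several reference embeddings per name and taking the nearest one makes the match less sensitive to a single poor reference crop, which matters under the compressed conditions described earlier.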

Figure 5 illustrates an example of the face identification process. This process is performed using the optimal network parameters. Subsequently, each detected face is identified, and the frequency of each student’s appearance across all frames is recorded. This frequency is used to determine whether a student is present and to compute the percentage of the session during which the student was connected. Students who appear in all frames are considered connected 100% of the time.

2.6. Quantifying Visual Presence and Algorithm

This study introduces the visual presence (VP) metric to quantify each participant’s on-screen presence during live online sessions. VP is defined as the proportion of video frames in which a student’s face is successfully detected, calculated as:

$$VP = \frac{i}{n} \times 100\% \qquad (1)$$

where i is the number of frames in which the student appears, and n is the total number of frames in the session. To compute this metric, the system captures video streams using OBS and processes each frame with a deep neural network-based face detection model. The model identifies facial regions and landmarks, extracts and normalizes cropped face images, and generates embeddings that are compared against a reference dictionary using Euclidean distance. If a match is found, the participant is identified; otherwise, a new entry is added to the dictionary.
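Equation (1) can be sketched in a few lines, assuming that the identification stage yields, for every frame, the set of student names recognized in it. The function name, the roster, and the toy frame data below are hypothetical and serve only to show the frame-level aggregation.

```python
from collections import Counter


def visual_presence(frame_identities, roster):
    """Return VP (in percent) per student: 100 * i / n, as in Equation (1)."""
    total_frames = len(frame_identities)  # n
    if total_frames == 0:
        raise ValueError("at least one frame is required")
    counts = Counter()  # i for each student
    for identities in frame_identities:
        # Count each student at most once per frame.
        counts.update(set(identities) & set(roster))
    return {name: 100.0 * counts[name] / total_frames for name in roster}


# Toy example: four frames, three enrolled students.
roster = ["student_01", "student_02", "student_03"]
frames = [
    {"student_01", "student_02"},
    {"student_01"},
    {"student_01", "student_02"},
    set(),  # no face recognized in this frame
]
vp = visual_presence(frames, roster)
print(vp)
```

A student who is never recognized receives a VP of 0%, and a student recognized in every frame receives 100%.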

In brief, the flow of Algorithm 1 is as follows. Once the facial regions are localized, individual faces are extracted from the detected bounding boxes. These cropped face images are then normalized by resizing them to a consistent dimension across the dataset. Next, facial embeddings are computed for each face using a trained deep neural network, encoding distinctive facial characteristics into high-dimensional feature vectors. These embeddings are compared against a set of pre-existing reference embeddings using the Euclidean distance as a similarity metric. When the computed distance exceeds a predefined threshold for every reference embedding, indicating a new, previously unregistered individual, the reference database is updated to include the new embedding. Recognition is then performed by matching embeddings from the current frame to those stored in the reference dictionary, thereby enabling the identification of known participants in real time.

Algorithm 1 Student visual presence detection algorithm

Start streaming the class.
Record the class with OBS.
Get the image sequence.
Segment the images using the deep neural network face detection.
while face detection is running do
    Obtain detection matrix with the following information:
    for all bounding boxes detected do
        Coordinates of the top-left and bottom-right corners
        Probability that the bounding box contains a face
        Coordinates of the eyes, nose, and corners of the mouth
    end for
end while
Extract multiple faces:
for each bounding box do
    Cropped Face ← CropFace(image, bounding box)
    Square Face ← ResizeToSquare(Cropped Face)
    Embed the faces:
    Embedding ← PerformEmbedding(Square Face)
    Find similarities between faces:
    for each embedding in the reference dictionary do
        Euclidean Distance ← CalculateEuclideanDistance(Embedding, Embedding in reference)
        if Euclidean Distance < Similarity threshold then
            Save to reference dictionary
        else
            The reference vector of an identity is not available
        end if
    end for
end for
Face Recognition:
for each face clipped in the sequences do
    Perform face embedding:
    Embedding ← PerformEmbedding(ResizeToSquare(Cropped Face))
    Compare with embeddings in the reference dictionary:
    for each (identity name, embedding in reference) in the reference dictionary do
        Euclidean Distance ← CalculateEuclideanDistance(Embedding, Embedding in reference)
        if Euclidean Distance < Similarity threshold then
            Identify face with identity name
        end if
    end for
end for
Quantify the Visual Presence:
function CalculateVisualPresence(NumberOfPeople, TotalSequence)
    VisualPresence ← $\frac{i}{n} \times 100$
    return VisualPresence
end function

In the final computational stage, the function CalculateVisualPresence receives two parameters: NumberOfPeople (the number of frames in which a given individual is detected, i in Equation (1)) and TotalSequence (the total number of frames, n in Equation (1)). The visual presence metric is computed by dividing the former by the latter and multiplying the result by 100. The resulting percentage quantifies the participant's on-screen presence in the captured session.
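Equation (1) can be written as a small function. The sketch below is illustrative only; the Python name `calculate_visual_presence`, its argument names, and the input checks are assumptions that do not appear in the original pipeline.

```python
def calculate_visual_presence(frames_with_student, total_frames):
    """Visual presence (VP) in percent, following Equation (1): VP = i / n * 100."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= frames_with_student <= total_frames:
        raise ValueError("frames_with_student must lie between 0 and total_frames")
    return frames_with_student / total_frames * 100.0


# Example: a student detected in 135 of the 270 analyzed frames.
vp = calculate_visual_presence(135, 270)
print(f"{vp:.2f}%")  # 50.00%
```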

The VP metric serves as a quantitative proxy for measuring participant presence during online sessions. It complements qualitative classifications of attention, ranging from “very noisy,” where many students are inattentive, to “generally listening” and “attention,” when most or all are focused. By integrating this value with facial detection and identification algorithms, the framework could quantify a student’s presence as a measurable variable, enabling precise evaluation of participants in virtual classroom environments.

3. Results

This section presents the experimental results in two parts. Experimental Design describes the design and setup of the experiments, and Experimental Results presents the results obtained.

The algorithm begins by identifying the optimal values, referred to as detection limits, for each of the three networks that comprise the face detector, thereby providing an optimization strategy to enhance the method’s efficacy. Subsequently, the algorithm is executed in a controlled setting, i.e., experiments conducted under meticulously structured conditions to ensure the reliability and reproducibility of the results.

3.1. Experimental Design

To run the experiments, twenty-seven students in the class participated in obtaining the data dictionary and identifying the optimal network configurations for the face identification and recognition algorithm. From this set, five students were carefully selected to participate in the first experimental stage, during which the algorithm’s performance was rigorously evaluated through a series of controlled tests. Conducting these experiments in a controlled environment ensured greater precision and reliability in assessing the algorithm’s effectiveness.

It is important to emphasize that the study was conducted in a controlled environment, with the camera activated and deactivated at specific times. However, the study primarily focuses on scenarios in which students participate in online learning under conditions that are not always optimal. Accordingly, background configuration, lighting, camera quality, and internet connectivity, among other parameters, were not controlled during the experiments.

Table 2 describes a scenario in which cameras are turned on and off in a controlled manner by different individuals over a period of nine minutes. This setup facilitates the analysis of controlled visual presence patterns under predefined conditions.

The original recording yielded a total of 16,200 frames at approximately 30 frames per second over a nine-minute session. However, preliminary empirical observations across multiple class sessions indicated that changes in students’ visual presence typically occur on timescales longer than a few seconds. Based on this heuristic evidence, and to reduce unnecessary computational overhead, the processing pipeline was adjusted to perform temporal subsampling by retaining one frame every two seconds. This resulted in a reduced dataset of 270 frames for analysis, while preserving representative visual presence patterns across the full session. Additional tests at higher sampling rates confirmed that this reduction did not materially affect the final visual presence estimates while substantially reducing processing costs. All experiments were executed using the standard Google Colab environment, equipped with 12 GB of RAM and approximately 100 GB of disk storage. Inference was performed on CPU under identical conditions for all evaluated detectors. The measured processing times were consistently lower than the 2 s temporal sampling interval, ensuring that detection, labeling, and visual presence counting were completed within each sampling window.
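The frame counts above follow from simple arithmetic, sketched below. The sketch assumes exactly 30 frames per second (the recording is described only as approximately 30 fps) and assumes that the first frame of every two-second window is the one retained; the constant names are illustrative.

```python
FPS = 30                   # assumed exact frame rate
SESSION_SECONDS = 9 * 60   # nine-minute session
SAMPLE_EVERY_SECONDS = 2   # one frame retained every two seconds

total_frames = FPS * SESSION_SECONDS          # frames in the full recording
step = FPS * SAMPLE_EVERY_SECONDS             # frames skipped between samples
kept_indices = list(range(0, total_frames, step))

print(total_frames, len(kept_indices))  # 16200 270
```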

3.2. Experimental Results

The experimental results shown in Figure 6 illustrate the comparative performance of the evaluated face detection algorithms in a virtual classroom environment. The first column contains example images captured during live online sessions, while the subsequent columns display detection results from Haar Cascade, Dual Shot Face Detector (DSFD), Multitask Cascaded Convolutional Network (MTCNN), and YuNet. The number of correctly detected faces indicates that more recent lightweight methods, such as YuNet, perform better in this setting than Haar Cascade and DSFD.

Subsequently, the percentage of visual presence was calculated using Equation (1). Once the faces were detected and each participant was identified through the face dictionary, the performance of each algorithm was evaluated. To this end, five selected subjects were manually analyzed across 270 frames. In each frame, visual presence was measured both manually and automatically using the various face detection and recognition methods described in Section 2.3.

Table 3 summarizes the performance of the proposed framework. The table includes the following attributes: Person (the assigned identifier for each participant) and % of Presence Identification (the percentage of visual presence obtained by manual annotation and by the Haar Cascade, DSFD, MTCNN, and YuNet algorithms). All percentages were computed based on a total of 270 frames, corresponding to 100%, which represents the time interval for the controlled experiment introduced in the previous section.

For each image sequence, the number of people present and the algorithm’s ability to recognize and identify them are evaluated. The algorithm performs identification by using the values stored in the reference dictionary. Although everyone was connected, the connection percentage was not 100%. This is because factors such as lighting, camera quality, and network quality can significantly affect computer vision results. Nevertheless, the attendance percentage remains high. Understanding these metrics is crucial for evaluating the performance and effectiveness of a face detection model in this practical application.

The statistical values in Table 4 are derived from the experimental results presented in Table 3 and illustrate apparent differences in the accuracy of the visual presence estimation methods. The manual measurements show the highest mean value (82.86%) with notable variability, reflecting the inherent variability in human annotation. In particular, algorithms such as HAAR and DSFD exhibit the lowest means (20.05% and 43.29%, respectively) and relatively small standard deviations, indicating consistent but significantly underestimated detection performance. In contrast, the deep learning-based methods MTCNN and YuNet exhibit higher mean values (58.68% and 64.63%) and tighter confidence intervals, suggesting a stronger agreement with the manual reference data. Among them, YuNet achieved the broadest detection range but also exhibited slightly higher variance due to its sensitivity to varying lighting and resolution conditions. In general terms, the results confirm that recent neural architectures, particularly MTCNN and YuNet, provide more accurate and reliable detection of visual presence.
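The mean values quoted above can be recomputed directly from the per-participant entries of Table 3. The following sketch uses only the Python standard library; the dictionary name `table3` is illustrative.

```python
from statistics import mean

# Per-participant visual presence (%) as listed in Table 3.
table3 = {
    "Manual": [98.42, 67.34, 89.17, 70.67, 88.70],
    "HAAR": [22.57, 21.46, 18.50, 18.13, 19.61],
    "DSFD": [48.84, 37.74, 40.70, 43.29, 45.88],
    "MTCNN": [59.20, 56.24, 60.68, 62.16, 55.13],
    "YuNet": [74.75, 50.85, 71.41, 52.89, 73.26],
}

means = {method: round(mean(values), 2) for method, values in table3.items()}
for method, value in means.items():
    print(f"{method}: {value:.2f}%")
```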

The performance of the proposed framework was evaluated using the percent error metric. In particular, the deviation between the manually computed visual presence (VP) values and those obtained using different deep neural network methods was calculated. Table 5 shows that the MTCNN and YuNet models achieved the lowest average percent errors of 27.63% and 22.20%, respectively, indicating superior robustness and precision compared to other approaches, such as HAAR and DSFD. These results confirm that deep learning-based detectors, particularly YuNet, provide more stable and reliable estimations of visual presence.

The HAAR and DSFD algorithms exhibit relatively high mean errors with low variability, reflecting consistently lower visual presence estimates under the evaluated conditions. In contrast, MTCNN and YuNet yield lower mean errors across the same dataset. Among the tested methods, YuNet presents the narrowest confidence interval (18.02–26.40%), indicating reduced variability in its error distribution relative to the other detectors within this specific experimental setup, rather than implying general superiority across broader or untested conditions. Table 6 reports the descriptive statistics of the accuracy-related metrics for each algorithm, computed under identical image frame conditions, providing a comparative summary of their performance behavior within the same acquisition context.
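The percent-error figures discussed above can be approximated from Table 3. The sketch below assumes the standard definition |manual − estimate| / manual × 100, computed per participant and then averaged, and a two-sided 95% Student-t interval with four degrees of freedom. Neither formula is stated explicitly in the text, so both are assumptions, although they give values close to the reported averages and to the YuNet interval.

```python
from math import sqrt
from statistics import mean, stdev

manual = [98.42, 67.34, 89.17, 70.67, 88.70]  # Manual VP, Table 3
estimates = {
    "HAAR": [22.57, 21.46, 18.50, 18.13, 19.61],
    "DSFD": [48.84, 37.74, 40.70, 43.29, 45.88],
    "MTCNN": [59.20, 56.24, 60.68, 62.16, 55.13],
    "YuNet": [74.75, 50.85, 71.41, 52.89, 73.26],
}

T_CRIT_95_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 d.o.f.


def percent_errors(reference, estimated):
    """Per-participant percent error relative to the manual reference."""
    return [abs(r - e) / r * 100.0 for r, e in zip(reference, estimated)]


summary = {}
for method, values in estimates.items():
    errors = percent_errors(manual, values)
    avg = mean(errors)
    half_width = T_CRIT_95_DF4 * stdev(errors) / sqrt(len(errors))
    summary[method] = (avg, avg - half_width, avg + half_width)

for method, (avg, low, high) in summary.items():
    print(f"{method}: mean error {avg:.2f}%, 95% CI [{low:.2f}, {high:.2f}]")
```

With only five participants, such intervals are wide and sensitive to individual values, which is consistent with the caution expressed above against generalizing beyond this experimental setup.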

4. Discussion

The results indicate that the proposed framework provides a functional approximation of students’ on-screen visual presence during synchronous online classes based on a deep neural network-driven face detection. The comparative analysis shows that MTCNN and YuNet produce lower VP errors than Haar Cascade and DSFD under identical videoconferencing conditions. This behavior is reflected in the percent-error measurements, where MTCNN and YuNet exhibit average deviations of 27.63% and 22.20%, respectively, relative to manual annotations. These values characterize the detection pipelines’ responses when faces are small, compressed, and subject to frequent layout rearrangements inherent in online classroom grids. Within these constraints, the framework yields a low-level indicator of visual availability, without aiming to provide high-precision behavioral or cognitive measurement.

An important observation from the experimental sessions is that although all participants were continuously connected to the online meeting, the measured VP did not reach 100% for any detector. This discrepancy reflects the inherent limitations of face detection pipelines operating on low-resolution, heavily compressed video streams, where faces may intermittently disappear due to lighting variations, camera angles, bandwidth fluctuations, or layout reconfigurations. Rather than indicating a failure of the pipeline, these effects illustrate the difficulty of maintaining consistent detection under realistic videoconferencing conditions. The resulting VP values, therefore, reflect the observable output of the detection process rather than an exact measure of user availability.

It is important to clarify that visual presence, as defined in this work, does not imply attentiveness, engagement, or learning effectiveness. The VP metric is intentionally designed as a simple, interpretable quantity that captures whether a participant’s face appears on screen, and it should not be interpreted as a proxy for cognitive or pedagogical constructs. No claims are made regarding correlations between VP and learning outcomes, instructional effectiveness, or student engagement, as establishing such relationships would require dedicated educational validation beyond the scope of this benchmarking study.

From a methodological perspective, the primary contribution of this work lies in evaluating how different face detection architectures behave under authentic online classroom constraints. The observed performance differences between classical methods (Haar Cascade), high-capacity detectors (DSFD), cascaded architectures (MTCNN), and lightweight models (YuNet) highlight the sensitivity of detection pipelines to platform-induced distortions such as compression artifacts, resolution reduction, and dynamically changing grids.

Several directions for future technical refinement are suggested by these findings. These include improved calibration of detection and identification thresholds, more robust embedding strategies for low-quality inputs, and integration of additional visual cues to mitigate intermittent detection loss. However, any extension toward higher-level behavioral or educational interpretation would require independent validation with appropriate human-subject studies and outcome measures.

Overall, the experimental results support the feasibility of the proposed framework as a lightweight, technically grounded benchmarking tool for estimating visual availability under real videoconferencing conditions. The study demonstrates how face detection models respond to the constraints imposed by online platforms, and it provides a reproducible reference for comparing detection robustness without extending claims beyond what is empirically validated.

Beyond the models evaluated in this study, several lightweight detectors, such as RetinaFace [34,35] and YOLO-Face [36], are widely adopted due to their strong balance between accuracy and efficiency, leveraging multi-scale feature pyramids, single-stage architectures, and compact backbones for real-time facial localization and fast inference in face-centric datasets. However, these approaches tend to rely on higher-resolution inputs and GPU-oriented pipelines, which limits their robustness when faces appear extremely small and highly compressed, as is typical in dynamic online classroom grids. Furthermore, adapting these models would require domain-specific retraining and threshold tuning to avoid false positives in densely packed layouts. For these reasons, they were not included in our benchmark, though future work will consider integrating such lightweight architectures to broaden the scope of comparison.

Finally, an important contextual limitation of this study concerns the rate at which visual conditions change during typical synchronous online classes. Based on empirical observation, sessions with camera-enabled participants are generally characterized by relatively stable lighting conditions and background configurations over short and medium time intervals. In contrast, the dominant sources of visual variability arise from camera activation and deactivation events and from dynamic layout reconfigurations imposed by the videoconferencing platform (e.g., speaker switching, participant entry or exit). For this reason, the proposed benchmark focuses on low-resolution and compressed grid dynamics rather than attempting to model rapidly fluctuating illumination or background changes. More extreme or highly dynamic visual scenarios are acknowledged as relevant but were not systematically evaluated and are therefore left for future work.

5. Conclusions

This study presented a comparative evaluation of four face detection pipelines, Haar Cascade, DSFD, MTCNN, and YuNet, applied to estimating visual presence in synchronous online classes under realistically degraded videoconferencing conditions. Rather than proposing a high-accuracy behavioral measurement system, the primary objective was to benchmark how distinct detection architectures respond to low-resolution video tiles, strong compression artifacts, dynamic grid rearrangements, and intermittent face visibility.

The experimental results reveal clear trade-offs between detection accuracy, computational efficiency, and robustness. Classical approaches such as Haar Cascade exhibit consistently low detection rates in this setting, confirming their limited suitability for modern videoconferencing environments. High-capacity deep detectors such as DSFD achieve moderate gains in accuracy but incur substantial computational overhead, making them less practical for real-time use. In contrast, cascaded architectures (MTCNN) and lightweight models (YuNet) achieve lower visual presence errors and significantly reduced processing times, illustrating a more favorable balance between robustness and efficiency. However, even for the best-performing model (YuNet), average errors remain above 20%, indicating that face detection-based estimation of visual presence under these constraints is inherently noisy and far from precise.

These findings suggest that, with a simple detection-and-embedding pipeline, visual presence can only be approximated as a low-level indicator of on-screen availability rather than a reliable or exhaustive measure of student behavior. The observed error rates highlight the limitations imposed by platform-level resolution reduction, compression, and layout dynamics, and they caution against interpreting VP as an accurate proxy for engagement or attention without additional validation.

From a practical perspective, the benchmark demonstrates that deploying face detection models in online classroom grids involves unavoidable compromises: lightweight models offer speed and stability at the cost of detection completeness, while more complex architectures improve detection quality but may be unsuitable for real-time or large-scale deployment. Accordingly, the main contribution of this work lies in characterizing these trade-offs and providing a reproducible reference for evaluating face detection behavior in noisy and dynamic online classroom environments.

Future research should explore more robust detection strategies, improved identity-matching schemes, and hybrid approaches that may mitigate intermittent detection failures. Any extension toward educational or behavioral interpretation, however, will require dedicated validation studies that combine visual presence with pedagogically meaningful outcome measures, which are beyond the scope of the present benchmarking study.

Author Contributions

Conceptualization, C.I., C.F.R.-G., J.A.R.S. and K.A.; Methodology, C.I., P.R.I.T., J.P.Z.d.P. and K.A.; Software, C.I. and P.R.I.T.; Validation, C.I., P.R.I.T., C.F.R.-G. and J.A.R.S.; Formal analysis, C.I., P.R.I.T., C.F.R.-G. and J.A.R.S.; Investigation, P.R.I.T. and J.P.Z.d.P.; Resources, C.I., J.P.Z.d.P. and K.A.; Data curation, P.R.I.T., J.P.Z.d.P. and J.A.R.S.; Writing—original draft, C.I., C.F.R.-G., J.A.R.S. and K.A.; Writing—review & editing, C.I. and C.F.R.-G.; Visualization, P.R.I.T. and J.P.Z.d.P.; Supervision, C.F.R.-G.; Project administration, C.F.R.-G. and K.A.; Funding acquisition, C.F.R.-G. and K.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study only involved the collection and analysis of visual interaction data captured from an educational platform. No biological samples, medical procedures, or invasive experiments were used with the participants. The data used were strictly limited to the visual information from students’ online learning sessions, in full compliance with the platform’s terms of use and standard educational practices. Our research is classified as Minimal-Risk Research and is therefore exempt from full review by an Institutional Review Board (IRB) or Ethics Committee. Nonetheless, the study was conducted with a commitment to transparency, data confidentiality, and respect for participant privacy.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to ethical considerations and the privacy of the participants.

Acknowledgments

The authors gratefully acknowledge the financial support of this work from the Ministry of Science, Humanities, Technology and Innovation (SECIHTI), received through the National Research System’s (SNII) program. C.F.R.G. acknowledges support from the Kempe Foundation (JCSMK24-0033). We disclose that Grammarly Pro, ChatGPT-5.3, and Microsoft Copilot were used solely to improve language clarity and readability. These tools did not contribute to the generation of scientific content, analysis, interpretation, or conclusions. All research activities were conducted entirely by the authors in accordance with MDPI’s ethical guidelines.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, M.; Yu, D. Towards intelligent E-learning systems. Educ. Inf. Technol. 2023, 28, 7845–7876.
2. Benita, F.; Virupaksha, D.; Wilhelm, E.; Tunçer, B. A smart learning ecosystem design for delivering Data-driven Thinking in STEM education. Smart Learn. Environ. 2021, 8, 11.
3. Liu, Q.; Hu, A.; Daniel, B. Online assessment in higher education: A mapping review and narrative synthesis. Res. Pract. Technol. Enhanc. Learn. 2025, 20, 007.
4. Martín-Bylund, A.; Stenliden, L. Closer to the senses in post-pandemic teacher training—Reclaiming the body in online educational encounters. Educ. Inf. Technol. 2024, 29, 3133–3154.
5. Rosengrant, D.; Hearrington, D.; O’Brien, J. Investigating Student Sustained Attention in a Guided Inquiry Lecture Course Using an Eye Tracker. Educ. Psychol. Rev. 2021, 33, 11–26.
6. Bradbury, N.A. Attention span during lectures: 8 seconds, 10 minutes, or more? Adv. Physiol. Educ. 2016, 40, 509–513.
7. Krauzlis, R.J.; Wang, L.; Yu, G.; Katz, L.N. What is attention? WIREs Cogn. Sci. 2023, 14, e1570.
8. Damián, A.R.; Roselló, E.G.; Paz, R.I.; Dacosta, J.G.; Heine, J. Las TIC en la educación superior: Estudio de los factores intervinientes en la adopción de un LMS por docentes innovadores. RELATEC Rev. Latinoam. Tecnol. Educ. 2009, 8, 35–51.
9. Das, S.; Chakraborty, S.; Mitra, B. I Cannot See Students Focusing on My Presentation; Are They Following Me? Continuous Monitoring of Student Engagement through “Stungage”. In UMAP’ 22: Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization; Association for Computing Machinery: New York, NY, USA, 2022; pp. 243–253.
10. Wang, R.; Chen, S.; Tian, G.; Wang, P.; Ying, S. Post-secondary classroom teaching quality evaluation using small object detection model. Sci. Rep. 2024, 14, 5816.
11. Al-Rahayfeh, A.; Faezipour, M. Eye Tracking and Head Movement Detection: A State-of-Art Survey. IEEE J. Transl. Eng. Health Med. 2013, 1, 2100212.
12. Hu, J.; Mathur, L.; Liang, P.P.; Morency, L.P. OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis. In Proceedings of the 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG); IEEE: Piscataway, NJ, USA, 2025; pp. 1–11.
13. Zhang, H.; Peng, Y.; Liu, Y. Multimodal fusion for real-time classroom engagement assessment using YOLOv9 and DeepFace. Vis. Comput. 2025, 41, 12325–12337.
14. Chandrappa, D.N.; Akshay, G.; Ravishankar, M. Face Detection Using a Boosted Cascade of Features Using OpenCV. In Proceedings of the Wireless Networks and Computational Intelligence; Venugopal, K.R., Patnaik, L.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 399–404.
15. Farokhah, L. Perbandingan Metode Deteksi Wajah Menggunakan OpenCV Haar Cascade, OpenCV Single Shot Multibox Detector (SSD) dan DLib CNN. J. RESTI (Rekayasa Sist. Teknol. Inf.) 2021, 5, 609–614.
16. Mehta, J.; Ramnani, E.; Singh, S. Face Detection and Tagging Using Deep Learning. In Proceedings of the 2018 International Conference on Computer, Communication, and Signal Processing (ICCCSP); IEEE: Piscataway, NJ, USA, 2018; pp. 1–6.
17. Anilkumar, C.; Venkatesh, B.; Annapoorna, S. Smart Attendance System with Face Recognition using OpenCV. In Proceedings of the 2023 Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS); IEEE: Piscataway, NJ, USA, 2023; pp. 1149–1155.
18. Thirugnanam, A.; Jayasekar, M.; Shreya, S.; Chandrasekaran, T. Face recognition attendance system using OpenCV. AIP Conf. Proc. 2025, 3343, 020048.
19. Abate, A.F.; Cascone, L.; Nappi, M.; Narducci, F.; Passero, I. Attention monitoring for synchronous distance learning. Future Gener. Comput. Syst. 2021, 125, 774–784.
20. Hwu, S.L. Developing SAMM: A Model for Measuring Sustained Attention in Asynchronous Online Learning. Sustainability 2023, 15, 9337.
21. Hossen, M.K.; Uddin, M.S. Attention monitoring of students during online classes using XGBoost classifier. Comput. Educ. Artif. Intell. 2023, 5, 100191.
22. Liu, Y.; Wang, C.; Lu, M.; Yang, J.; Gui, J.; Zhang, S. From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5449–5462.
23. Wang, C.; Zhang, Q.; Wang, X.; Zhou, L.; Li, Q.; Xia, Z.; Ma, B.; Shi, Y.Q. Light-Field Image Multiple Reversible Robust Watermarking Against Geometric Attacks. IEEE Trans. Dependable Secur. Comput. 2025, 22, 5861–5875.
24. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. WIDER FACE: A Face Detection Benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016.
25. Jain, V.; Learned-Miller, E. Fddb: A Benchmark for Face Detection in Unconstrained Settings; Technical Report; UMass Amherst: Amherst, MA, USA, 2010.
26. Javed Mehedi Shamrat, F.M.; Majumder, A.; Antu, P.R.; Barmon, S.K.; Nowrin, I.; Ranjan, R. Human Face Recognition Applying Haar Cascade Classifier. In Proceedings of the Pervasive Computing and Social Networking; Ranganathan, G., Bestak, R., Palanisamy, R., Rocha, Á., Eds.; Springer: Singapore, 2022; pp. 143–157.
27. Li, J.; Wang, Y.; Wang, C.; Tai, Y.; Qian, J.; Yang, J.; Wang, C.; Li, J.; Huang, F. DSFD: Dual Shot Face Detector. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2019; pp. 5055–5064.
28. Wu, W.; Peng, H.; Yu, S. Yunet: A tiny millisecond-level face detector. Mach. Intell. Res. 2023, 20, 656–665.
29. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018.
30. Viola, P.; Jones, M.J. Robust Real-Time Face Detection. Int. J. Comput. Vis. 2004, 57, 137–154.
31. Zhang, C.; Zhang, Z. Improving multiview face detection with multi-task deep convolutional neural networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2014; pp. 1036–1041.
32. Sanchez-Moreno, A.S.; Olivares-Mercado, J.; Hernandez-Suarez, A.; Toscano-Medina, K.; Sanchez-Perez, G.; Benitez-Garcia, G. Efficient Face Recognition System for Operating in Unconstrained Environments. J. Imaging 2021, 7, 161.
33. Chaves, D.; Fidalgo, E.; Alegre, E.; Alaiz-Rodríguez, R.; Jáñez-Martino, F.; Azzopardi, G. Assessment and Estimation of Face Detection Performance Based on Deep Learning for Forensic Applications. Sensors 2020, 20, 4491.
34. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020.
35. Ren, Z.; Liu, X.; Xu, J.; Zhang, Y.; Fang, M. LittleFaceNet: A Small-Sized Face Recognition Method Based on RetinaFace and AdaFace. J. Imaging 2025, 11, 24.
36. Chen, W.; Huang, H.; Peng, S.; Zhou, C.; Zhang, C. YOLO-face: A real-time face detector. Vis. Comput. 2021, 37, 805–813.

Figure 1. Conceptual representation of four human–machine interaction models relevant to online education: (1) direct human–human interaction, (2) human–machine interaction, (3) mediated human–machine–human interaction, and (4) synchronous online class interaction. These models illustrate the sensory channels involved in attention and perception, as well as the constraints imposed by virtual environments.

Figure 2. Overview of the proposed computer vision framework for detecting students’ visual presence during synchronous online classes. The methodology integrates face detection, identification, and temporal analysis to quantify visual presence.

Figure 3. Example frame captured during a synchronous online class session using OBS 32.1.1 software. The image shows all participants as they appear in the video conferencing interface and serves as input for face detection and presence analysis.

Figure 4. Extracted face images from session frames using deep neural network-based detectors. These cropped faces are used for identification and embedding generation to assess individual visual presence.

Figure 5. Examples of face identification results using the MTCNN algorithm. Each detected face is matched against a reference dictionary to determine the participant’s identity and to compute the frequency of visual presence across video frames. The bounding boxes indicate the localized face detections, while the numerical labels associated with each box correspond to the individual identifiers defined in the reference dictionary.

Figure 6. Comparative face detection results using four algorithms applied to frames from synchronous online class sessions. The first column displays the original input images, followed by detection outputs from Haar Cascade, Dual Shot Face Detector (DSFD), Multitask Cascaded Convolutional Network (MTCNN), and YuNet. The figure illustrates the relative performance of each method in identifying student faces under low-resolution and dynamic layout conditions, supporting the evaluation of visual presence metrics. The bounding green boxes indicate the detected and identified faces associated with entries in the reference dictionary.

Table 1. Summary of architectural paradigms, computational characteristics, and limitations of the evaluated face detection models.

Model	Architectural Paradigm	Strengths	Limitations in Online Class Grids
Haar Cascade	Classical ML (Haar features + AdaBoost cascade)	Very lightweight; CPU real-time baseline	Highly sensitive to compression, low resolution, pose changes, and grid fluctuations
DSFD	Two-stage deep detector (anchor-based CNN)	High accuracy on curated datasets; strong multi-scale features	Heavy model; slower inference; brittle under low-bitrate compression and small faces
MTCNN	Cascaded multi-stage CNN (P-Net, R-Net, O-Net)	Performs well on small faces; joint detection + landmarks	Multi-step pipeline increases latency; affected by noise and strong compression
YuNet	Lightweight anchor-free CNN	Real-time on CPU; robust for practical deployment	Slightly less accurate under extreme poses or illumination conditions

Table 2. Controlled camera activation schedule for five participants during a nine-minute online session. The table records the on/off status of each participant’s camera at 1 min intervals, enabling analysis of visual presence patterns under predefined conditions.

Num. Person	1 min	2 min	3 min	4 min	5 min	6 min	7 min	8 min	9 min
1	on	on	off	off	off	off	off	off	on
2	on	off	on	off	off	off	off	on	off
3	on	off	off	on	off	off	off	on	off
4	on	off	off	off	on	off	off	off	on
5	on	off	off	off	off	on	off	off	on
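As a small illustration of how the schedule in Table 2 could serve as a reference, the sketch below computes the share of 1 min intervals in which each camera is scheduled to be on. The dictionary layout and the helper name `scheduled_presence` are assumptions made for this example.

```python
# Table 2 transcribed as one list of nine on/off states per participant.
schedule = {
    1: "on on off off off off off off on".split(),
    2: "on off on off off off off on off".split(),
    3: "on off off on off off off on off".split(),
    4: "on off off off on off off off on".split(),
    5: "on off off off off on off off on".split(),
}

def scheduled_presence(states):
    """Percentage of intervals in which the camera is scheduled to be on."""
    return 100.0 * states.count("on") / len(states)

expected = {person: scheduled_presence(states) for person, states in schedule.items()}
for person, pct in expected.items():
    print(f"Participant {person}: {pct:.2f}% of intervals on")
```

Under this schedule every participant is on in three of the nine intervals, so the scheduled presence is the same for all five.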

Table 3. Percentage of visual presence detected for each participant using manual annotation and four face detection algorithms. The values are based on 270 frames per participant, which represents full-session coverage.

Person	Manual VP	HAAR VP	DSFD VP	MTCNN VP	YuNet VP
1	98.42	22.57	48.84	59.20	74.75
2	67.34	21.46	37.74	56.24	50.85
3	89.17	18.50	40.70	60.68	71.41
4	70.67	18.13	43.29	62.16	52.89
5	88.70	19.61	45.88	55.13	73.26

Table 4. Descriptive statistical summary of visual presence detection results across manual annotation and four algorithmic methods.

Method	Mean	Std. Dev.	Variance	Std. Error	95% CI Lower	95% CI Upper
Manual	82.86	13.28	176.37	5.93	66.37	99.35
HAAR	20.05	1.91	3.65	0.85	17.68	22.42
DSFD	43.29	4.33	18.75	1.93	37.91	48.66
MTCNN	58.68	2.95	8.73	1.32	55.01	62.35
YuNet	64.63	11.73	137.64	5.24	50.06	79.19
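The statistics in Table 4 can be approximately reproduced from the per-participant values in Table 3. The sketch below assumes sample (n − 1) variance and a two-sided 95% interval based on Student's t with 4 degrees of freedom (critical value 2.776). The source does not state these choices; they are inferred because they closely match the Manual row, and small last-digit differences due to rounding are possible.

```python
import math
import statistics

# Assumed: two-sided 95% critical value of Student's t for n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776

def describe(values, t_crit=T_CRIT_DF4):
    """Mean, sample standard deviation, sample variance, standard error, and 95% CI."""
    n = len(values)
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)      # sample standard deviation (n - 1)
    variance = statistics.variance(values)  # sample variance (n - 1)
    std_error = std_dev / math.sqrt(n)
    margin = t_crit * std_error
    return {
        "mean": mean,
        "std_dev": std_dev,
        "variance": variance,
        "std_error": std_error,
        "ci_lower": mean - margin,
        "ci_upper": mean + margin,
    }

# Manual VP column of Table 3.
manual_vp = [98.42, 67.34, 89.17, 70.67, 88.70]
summary = describe(manual_vp)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The same function can be applied to any other column of Table 3 by swapping the input list.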

Table 5. Error percentage in visual presence (VP) estimation for each participant, comparing manual annotation with algorithmic detection. The table also reports the average processing time per image for each method, highlighting computational efficiency.

Person	HAAR VP Error (%)	DSFD VP Error (%)	MTCNN VP Error (%)	YuNet VP Error (%)
1	77.07	50.38	39.85	24.05
2	68.13	43.96	16.48	24.49
3	79.25	54.36	31.95	19.92
4	74.35	38.74	12.04	25.16
5	77.89	48.28	37.85	17.41
Av. Error (%)	75.34	47.14	27.63	22.20
Av. processing time (s)	0.455	7.487	0.460	0.069
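The error values in Table 5 are consistent with a relative error computed against the manual annotation, that is, |manual − estimated| / manual × 100. The sketch below applies that formula to the values in Table 3. The formula is inferred from the reported numbers rather than quoted from the source, and the names `TABLE3` and `vp_error` are introduced only for illustration.

```python
# Visual presence percentages from Table 3, keyed by participant.
TABLE3 = {
    1: {"Manual": 98.42, "HAAR": 22.57, "DSFD": 48.84, "MTCNN": 59.20, "YuNet": 74.75},
    2: {"Manual": 67.34, "HAAR": 21.46, "DSFD": 37.74, "MTCNN": 56.24, "YuNet": 50.85},
    3: {"Manual": 89.17, "HAAR": 18.50, "DSFD": 40.70, "MTCNN": 60.68, "YuNet": 71.41},
    4: {"Manual": 70.67, "HAAR": 18.13, "DSFD": 43.29, "MTCNN": 62.16, "YuNet": 52.89},
    5: {"Manual": 88.70, "HAAR": 19.61, "DSFD": 45.88, "MTCNN": 55.13, "YuNet": 73.26},
}
METHODS = ["HAAR", "DSFD", "MTCNN", "YuNet"]

def vp_error(manual, estimated):
    """Relative VP error in percent, with the manual annotation as reference."""
    return abs(manual - estimated) / manual * 100.0

# Per-participant errors for each method, in participant order 1..5.
errors = {
    method: [vp_error(row["Manual"], row[method]) for row in TABLE3.values()]
    for method in METHODS
}
# Mean error per method across the five participants.
average_error = {method: sum(vals) / len(vals) for method, vals in errors.items()}

for method in METHODS:
    per_person = ", ".join(f"{e:.2f}" for e in errors[method])
    print(f"{method}: [{per_person}] mean = {average_error[method]:.2f}")
```

Because the error is normalized by the manual value, a participant with lower manual presence contributes a larger relative error for the same absolute gap.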

Table 6. Aggregate error metrics for visual presence estimation methods, including mean error, standard deviation, variance, standard error, and 95% confidence intervals. The results demonstrate the relative accuracy and stability of each algorithm under varying conditions.

Statistic	HAAR	DSFD	MTCNN	YuNet
Mean Error (%)	75.34	47.14	27.63	22.21
Std. Deviation	4.41	6.01	12.65	3.38
Variance	19.44	36.15	159.95	11.39
Std. Error	1.97	2.69	5.66	1.51
95% CI Lower	69.86	39.68	11.93	18.02
95% CI Upper	80.81	54.61	43.34	26.40

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Isaza, C.; Ibarra Tapia, P.R.; Ramirez-Gutierrez, C.F.; Zavala de Paz, J.P.; Rizzo Sierra, J.A.; Anaya, K. A Comparative Benchmark of Face Detection Models for Noisy and Dynamic Online Class Environments. Future Internet 2026, 18, 208. https://doi.org/10.3390/fi18040208


