1. Introduction
Within the broader context of the future internet, virtual educational systems are increasingly embedded in intelligent digital ecosystems characterized by distributed platforms, cloud-based infrastructures, and real-time analytics [
1]. Such environments enable the creation of smart learning ecosystems [
2] that orchestrate educational data flows, continuously adapt instructional content, and deliver immediate feedback at scale. Online assessments have become increasingly common in higher education, driven by technological advancements and the demand for flexible, accessible, and scalable evaluation methods. However, assessing crucial student variables, such as visual presence, which is closely linked to student attention and engagement, poses a unique challenge in these virtual environments [
3]. While instructors can gauge attention in traditional in-person settings by interpreting sensory cues (i.e., vision, hearing, speech, and touch), these parameters become more complex in online settings. They are relatively understudied in the context of synchronous online classes. This gap highlights a broader need to understand how factors such as students’ readiness and acceptance influence their academic performance and satisfaction in online settings [
4].
In online learning environments, students’ visual presence on screen is frequently used as a proxy for engagement, with recent systems modeling this through observable behavioral cues [
5,
6]. However, visual presence should not be equated with cognitive attention, which involves complex brain processes that enable adaptive and effective behavioral choices [
7]. A student’s visible face on screen does not necessarily indicate genuine cognitive engagement. Nevertheless, visual presence functions as a low-level behavioral indicator that allows instructors to monitor student connectivity, facilitate prompt re-engagement, and provide guidance or redirection during synchronous sessions. In traditional classrooms, environmental and social cues support attention through direct interaction with instructors and peers [
8,
9,
10]. In virtual settings, a reduction in these cues increases the challenge of maintaining focus. Therefore, although visual presence offers a limited perspective, it remains a practical and interpretable indicator within the broader context of supporting student attention in online learning.
Several recent developments and proxies have emerged to address the gap between psychological theory and practical technology. Most modern systems operationalize attentiveness not as a covert cognitive state, but as an overt, observable behavior that correlates with being on task [
9]. These behavioral proxies include on- and off-screen gaze, head orientation, and eye state [
11], which can be efficiently measured using computer vision pipelines. Such approaches often employ short temporal windows (e.g., 5–15 s) to capture state changes and micro-events while smoothing noisy data. Importantly, these methods avoid common pitfalls, such as assuming that a student’s gaze toward the camera directly reflects cognitive attention, and they acknowledge that facial emotion alone is a weak indicator of focus. Within this broader context, state-of-the-art engagement-detection frameworks increasingly rely on advanced facial behavior analysis toolkits such as OpenFace, which integrate facial landmarks, head pose, gaze estimation, and action unit recognition into multitask architectures [
12], or on multimodal systems that combine student behavior cues with affective inference using DeepFace-based models [
13]. These systems aim to estimate higher-level engagement or affective states, requiring richer input signals and more complex computational pipelines [
14,
15,
16,
17,
18]. However, these approaches go beyond simple attendance tracking, providing educators with real-time data on the duration and frequency of on-screen presence. By understanding these patterns of visual presence, instructors can adapt their teaching methods to better manage the learning process, ensuring that students remain behaviorally engaged and that instruction is more effective. Ultimately, these tools aim to enhance learning outcomes by directly addressing the fundamental challenge of maintaining student attention in dynamic educational environments [
19,
20,
21]. However, the very platforms designed to facilitate this learning present significant technical challenges.
Video conferencing services, such as Google Meet, Zoom, Microsoft Teams, and Webex, host large sessions that often exceed the capacity of physical classrooms. To optimize performance and bandwidth, they compress video and display participants’ feeds in small, dynamic grids. This practice, however, poses a significant obstacle to computer vision algorithms that rely on high-resolution images for accurate tracking. The low-resolution input and constantly shifting layout, caused by participants joining, leaving, or speaking, make it challenging to track facial features and analyze student attention precisely. Recent advances in robust feature representation have explored novel strategies to stabilize deep models under visual degradation. For example, recent works have introduced a hybrid-resolution and boundary-aware architecture combined with a dual-task mutual learning framework that enhances robustness across increasingly complex scenes [
22,
23]. Their domain-transform approach improves feature stability even under corrupted or noisy conditions. However, this framework comes with notable limitations for face-centric online class environments, including its computational complexity, reliance on high-resolution inputs, and its focus on full-body semantic parsing rather than small-scale facial regions. Additionally, the method does not explicitly address dynamic grid layouts typical of videoconferencing platforms. Therefore, developing a robust system requires overcoming the dual hurdles posed by low-resolution video and highly dynamic visual structure. Existing face detection benchmarks have primarily evaluated models on curated, high-quality datasets such as WIDER FACE [
24] or FDDB [
25], or under controlled laboratory conditions with stable lighting, fixed camera positions, and uncompressed video streams. To our knowledge, there are no publicly available benchmarks that assess face detection performance in the conditions typical of real synchronous online classrooms, namely, low-resolution video tiles produced by platform-level compression, fluctuating bandwidth, and dynamically reconfigured grids triggered by speaking activity or student entry and exit. This gap is significant, as the visual distortions and layout changes introduced by platforms such as Google Meet or Zoom fundamentally alter the detection problem in ways not captured by existing datasets.
To address these challenges, this study proposes a computer vision pipeline specifically designed for noisy and dynamic online class grids, aligning with the goal of benchmarking multiple deep learning methods for robust face detection. The methodology begins with the acquisition of real-time video streams recorded through Open Broadcaster Software (OBS), from which frame sequences containing all participants are extracted. These images are processed through a multi-stage detection and identification pipeline that incorporates four face detection models: Haar Cascade (HAAR) [
15,
26], a classical machine-learning method; Dual Shot Face Detector (DSFD) [
27], a two-stage deep detector; Multitask Cascaded Convolutional Network (MTCNN) [
14], a lightweight cascaded architecture widely used for multi-scale face localization; and YuNet [
28], a recent millisecond-level deep learning model optimized for low-resolution input. After detection, each face is normalized, embedded into a numerical feature vector, and matched against a reference dictionary to establish participant identity across all frames. This enables the computation of the visual presence (VP) metric, defined as the proportion of frames in which each student is successfully detected. By applying this pipeline to a dataset of 27 participants and 16,200 frames, the study not only quantifies visual presence but also explicitly compares the detection performance of HAAR, DSFD, MTCNN, and YuNet, demonstrating how different models behave under the noisy, compressed, and dynamically shifting grid layouts characteristic of modern videoconference platforms. This integrated approach aligns the manuscript with a benchmark of image-processing pipelines for face detection in real-world online classroom scenarios, providing a foundation for evaluating behavioral engagement in synchronous virtual learning.
Considering the above, the novelty of this study lies not only in introducing a deep learning-based approach for monitoring students’ visual presence, but in conducting a systematic benchmark of multiple face detection pipelines under noisy and dynamic online class conditions. Unlike previous works that focused on single models or relied exclusively on pre-recorded, idealized datasets, this research evaluates four distinct algorithms in a controlled yet realistic case study based on actual synchronous class sessions. This comparative design provides empirical evidence of how each method behaves under low-resolution grids, compression artifacts, and temporally shifting layouts, conditions typical of videoconferencing platforms. Beyond proving feasibility, the proposed framework offers a practical, replicable, and experimentally validated strategy for measuring behavioral engagement, highlighting the strengths and limitations of each detection model in scenarios closely aligned with real teaching practice.
2. Materials and Methods
2.1. Models of Interaction Between Humans and Machines
Human–machine interaction describes the dynamics by which two or more people communicate and collaborate with a machine or technological system. This encompasses a diverse range of technologies that facilitate communication and teamwork, including computers, mobile devices, video conferencing systems, and online collaborative platforms. In synchronous online classes, interaction between students and instructors is mediated entirely through videoconferencing platforms, which impose technical constraints highly relevant to computer vision analysis. Unlike in physical classrooms, where multiple sensory cues are available, online interaction depends almost exclusively on low-resolution, compressed video tiles that are dynamically rearranged as participants join, leave, or speak. These conditions introduce variable frame quality, inconsistent face sizes, and frequent layout changes, all of which directly affect the stability of face detection pipelines. For this reason, the experimental design in the present study focuses specifically on evaluating detection methods under such platform-induced distortions rather than modeling higher-level pedagogical interaction structures.
Figure 1 illustrates four distinct scenarios of the human–machine–human interaction model:
This section details the materials and methods used in the study, with particular attention to the Multitask Cascaded Convolutional Neural Network (MTCNN) algorithm. Because the research aims to assess participants’ visual presence in online classes, the section outlines the methodology proposed to achieve this objective.
Figure 2 illustrates the overall framework. The approach involves analyzing image sequences using computer vision techniques. The experimental design was carefully crafted to ensure voluntary participation from all individuals in refining the methodology for evaluating visual presence. Specifically, 27 participants were selected for data collection using a screen-capture tool.
The screen-capture tool recorded data for all group attendees. During data preprocessing, atypical, duplicate, or out-of-context data were screened out where possible; any that remained were inspected and normalized. A deep artificial neural network was selected as the learning strategy due to its strong performance on computer vision tasks, and face detection was performed using this method. Subsequently, facial characteristics were mapped to embeddings for face identification, in which the similarity between these numerical values and the reference representations in the database was measured.
When creating the datasets, the confidentiality of sensitive information was prioritized. Informed consent was obtained from all participants, and data collection adhered strictly to ethical guidelines and regulatory standards. Anonymization involves removing personal identifiers such as names, addresses, and contact information to protect individuals’ privacy. Students were made aware of how their data would be used and retained the autonomy to decide whether to share it publicly. Ethical guidelines recommend sharing data in open repositories following the FAIR principles [
29] (Findable, Accessible, Interoperable, Reusable) to enhance transparency and reproducibility in research. Respecting data ownership and intellectual property rights, acknowledging contributions, and fostering a collaborative research environment are crucial.
The experiment was conducted in synchronous online classes, with the characteristic evaluated being oscillation. Oscillation refers to the ability to efficiently and effectively shift attention from one stimulus to another. In an online class setting, oscillation is evident as students alternate their focus among various screen elements, such as teacher presentations, class chats, and individual assignments. In this context, visual presence is linked to the duration of their participation in the sessions. The first step is to broadcast the class in real time and record it with OBS. OBS is a popular and open-source tool for live streaming and recording multimedia content, including classes, presentations, and gaming sessions.
2.2. Data Acquisition
Data were collected by accessing synchronous online class sessions conducted via Google Meet. The sessions were recorded in real time using OBS, which captures the full videoconferencing grid and preserves all participant video tiles as displayed on screen. Unlike the built-in Google Meet capture option, which prioritizes the active speaker or shared content, OBS records the complete layout of attendees throughout the session.
The resulting video file was stored in MP4 format with a spatial resolution of 640 × 360 pixels and a frame rate of 30 fps. The video stream was encoded in RGB24 format with an average bitrate of approximately 0.166 Mbps, reflecting the strong compression and resolution constraints imposed by the videoconferencing platform. These characteristics define the visual input conditions for all subsequent face detection and visual-presence analyses.
After recording, image sequences were extracted from the video using Google Colab, where execution and processing were performed in Python 3.12.13. A graphics processing unit (GPU) was used to accelerate frame extraction and downstream analysis. Since the video data were stored in the cloud, appropriate access permissions were required to retrieve them. Following extraction, standard preprocessing steps, including data cleaning, integration, transformation, and reduction, were applied. All frames retained identical spatial dimensions and resolution and were stored remotely for reproducibility.
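The recording parameters above can be sanity-checked with simple arithmetic. The sketch below assumes a nine-minute recording (the duration of the controlled scenario used in this study) at the stated 30 fps and 0.166 Mbps; the function names are illustrative and are not part of the original pipeline.

```python
def expected_frame_count(duration_s: float, fps: float) -> int:
    """Number of frames in a constant-frame-rate recording."""
    return round(duration_s * fps)


def approx_size_megabytes(duration_s: float, bitrate_mbps: float) -> float:
    """Approximate stream size in megabytes (1 byte = 8 bits)."""
    return duration_s * bitrate_mbps / 8


duration_s = 9 * 60  # assumed session length: nine minutes
frames = expected_frame_count(duration_s, fps=30)
size_mb = approx_size_megabytes(duration_s, bitrate_mbps=0.166)
print(frames, round(size_mb, 1))
```

Under these assumptions the frame count comes to 16,200, which matches the dataset size reported for this study, and the whole stream occupies only about 11 MB, which gives a sense of how aggressively the video is compressed.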
The extracted images were handled as numerical arrays using the NumPy 2.0.2 library, while visualization and inspection were performed with the Pyplot module of the Matplotlib 3.10.0 library.
Figure 3 presents an example frame obtained from the recorded session.
2.3. Face Detection Methods
Before the emergence of deep learning models, several traditional methods were developed for face detection, of which the Haar Cascade is among the most influential. Introduced by Viola and Jones [30], it uses Haar-like features to identify regions of an image that resemble a human face. The method first converts the input image to grayscale and computes rectangular features that capture differences in pixel intensity, such as edges, lines, and textures. The AdaBoost algorithm then selects the most informative features as weak classifiers and combines them into strong classifiers, which are arranged in a cascade so that non-face regions are rejected early. Despite its practicality, the method has been surpassed in accuracy and robustness by deep learning models such as the Dual Shot Face Detector (DSFD).
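To make the rectangular features concrete, the following minimal sketch evaluates a two-rectangle (edge) Haar-like feature on a toy grayscale patch. It relies on the integral image (summed-area table) used in the Viola-Jones framework, which is not described in the text above; the helper names and the toy patch are illustrative assumptions.

```python
def integral_image(gray):
    """Summed-area table with an extra zero row/column for easy lookups."""
    h, w = len(gray), len(gray[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_edge_feature(ii, x, y, w, h):
    """Vertical-edge Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy 4 x 4 grayscale patch: bright left half, dark right half.
patch = [[200, 200, 10, 10] for _ in range(4)]
ii = integral_image(patch)
feature = two_rect_edge_feature(ii, 0, 0, 4, 4)
print(feature)
```

A large absolute value indicates a strong vertical intensity edge inside the window. Because every rectangle sum costs four table lookups regardless of its size, thousands of such features can be evaluated cheaply per window.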
The Dual Shot Face Detector (DSFD), introduced by Li et al. [27], is a deep learning-based two-stage detection framework designed to improve both feature learning and detection precision in face detection tasks. In the first stage, the model uses a Feature Enhancement Module (FEM) to enhance multi-scale feature representations by fusing contextual information via dilated convolutions, yielding more discriminative and robust features. The second stage refines these preliminary detections by applying a Progressive Anchor Loss (PAL) mechanism that initially uses smaller anchors and subsequently larger ones, gradually enhancing localization accuracy across hierarchical feature maps. Additionally, the Improved Anchor Matching (IAM) strategy integrates optimized anchor assignment with data augmentation to ensure a more balanced and accurate correspondence between anchors and ground-truth faces, stabilizing the training process. Although DSFD incurs a higher computational cost than traditional approaches such as Haar Cascades, its dual-shot design achieves state-of-the-art performance on benchmarks like WIDER FACE and FDDB, particularly under challenging conditions involving small, occluded, or low-quality facial images.
Like DSFD, the Multitask Cascaded Convolutional Neural Network (MTCNN) has emerged as a strong choice for face detection compared with classical methods [31]. Studies have highlighted its accuracy and versatility as reasons for preferring it in various applications. For instance, a survey by Sanchez [32] reviews face detection methods extensively, emphasizing the critical role of accurate detection in tasks such as face recognition and video surveillance. The survey traces the evolution of face detection techniques and highlights the value of MTCNN for precise detection in challenging scenarios involving low image resolution, severe occlusion, and variations in lighting, focus, and subject orientation.
Furthermore, research evaluating face detection performance and estimation assesses face detectors based on processing time and accuracy. Chaves et al. [
33] explicitly identify MTCNN as one of the methods being evaluated, showcasing its capabilities in face detection tasks. The study provides insights into the strengths and weaknesses of deep learning models for face recognition, particularly against image degradation, further affirming the effectiveness of MTCNN in face detection applications [
33].
The MTCNN architecture is based on pyramidal processing of the original image: the image is rescaled at multiple scales so that faces of different sizes can be detected. The algorithm comprises three cascaded stages: P-Net, R-Net, and O-Net. P-Net, the first stage, takes a 12 × 12 × 3 input window and applies a first convolutional layer of 10 filters of size 3 × 3 to propose candidate face regions; R-Net and O-Net then progressively filter these candidates and refine the bounding boxes around them. In the final stage, facial landmarks, namely the eyes, nose, and mouth corners, are localized to aid in face identification. Ultimately, the MTCNN algorithm produces bounding boxes outlining detected faces and landmark points for each face.
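The pyramidal rescaling can be illustrated by computing the list of scales at which the image is presented to P-Net. This is a minimal sketch of the scheme commonly used in MTCNN implementations, not necessarily the exact code used in this study; the scaling factor of 0.709 is a widely used default and is an assumption here, as is the function name.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, cell=12):
    """Scales at which a 12 x 12 P-Net window covers faces >= min_face_size."""
    scale = cell / min_face_size          # maps the smallest face onto 12 px
    min_side = min(height, width) * scale
    scales = []
    while min_side >= cell:               # stop once the image is below 12 px
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


# Frame size used in this study: 640 x 360 pixels.
scales = pyramid_scales(height=360, width=640)
for s in scales:
    print(f"scale={s:.3f} -> {round(640 * s)} x {round(360 * s)}")
```

With a minimum face size of 20 pixels, a 640 × 360 frame yields nine pyramid levels under these assumptions. Raising the minimum face size shortens the pyramid and speeds up detection, at the cost of missing the small faces typical of grid layouts.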
In 2023, Wei et al. [28] introduced a face detection framework, called YuNet, that builds upon deep convolutional neural network architectures to improve precision and robustness under complex visual conditions. The approach optimizes feature representation and detection accuracy by integrating multi-scale contextual information with an anchor-free detection strategy. Specifically, it introduces mechanisms to refine feature learning, balance the distributions of positive and negative samples, and strengthen detection consistency across varying image resolutions and illumination conditions. By combining these components, the proposed framework achieves more accurate and stable face localization compared to prior state-of-the-art methods, demonstrating superior performance in challenging scenarios such as occlusion, pose variation, and small-scale face detection.
Table 1 summarizes the four face detection models selected for this benchmark and clarifies why they constitute a representative cross-section of architectural paradigms. Haar Cascade exemplifies early handcrafted-feature methods with minimal computational requirements, providing a historical and computational baseline. DSFD represents high-capacity two-stage detectors designed for accuracy under ideal imaging conditions, allowing us to examine how such architectures degrade in the presence of online classroom distortions. MTCNN, a cascaded multi-stage model, is particularly well-suited for small faces and thus relevant for grid layouts where students occupy only a few pixels. Finally, YuNet illustrates modern lightweight, anchor-free detectors optimized for CPU environments, making it a strong candidate for practical deployment by instructors. Together, these models span classical ML, multi-stage deep pipelines, high-capacity anchor-based CNNs, and lightweight real-time architectures, enabling a systematic comparison across diverse detection strategies under compressed, low-resolution, and dynamically changing videoconferencing conditions. In our implementation, MTCNN was imported directly from the standard Python/OpenCV-based library (OpenCV v4.10) and used with its reference configuration. No retraining or architecture modification was performed. The detector was instantiated with thresholds [0.6, 0.7, 0.7] for the three-stage cascade, a minimum detectable face size of 20 pixels, and an input normalization size of 160 × 160 pixels. These values define the valid detection range between 20 and 160 pixels, which was selected empirically through trial and error and reflects common practice in face detection under low-resolution and compressed video conditions.
2.4. Image Segmentation from Face Detectors
Before computing the visual presence, each image is converted to bitmap format, which stores pixel data without lossy compression and therefore preserves image quality for subsequent face identification and for tracking each attendee’s session duration.
The proposed strategy uses deep neural network-based algorithms as the primary face detection method, combining several network models that sequentially refine the generated detections. The detector runs on PyTorch 2.10.0+cpu, which builds dynamic rather than static computation graphs.
For every candidate region, the algorithm outputs the probability that the region contains a face. The number of detections retained varies with the detection limits of each face detection algorithm. Subsequently, the faces are extracted as shown in Figure 4. The generated images are stored in bitmap format to preserve image quality.
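The extraction step can be sketched as follows. This is one plausible way to turn a detector's bounding box into a square crop, assuming boxes are given as integer pixel coordinates (x1, y1, x2, y2), the image is a row-major list of rows, and the box is smaller than the frame; the function names are illustrative and the study's actual cropping routine may differ.

```python
def square_box(x1, y1, x2, y2, img_w, img_h):
    """Expand a box to a square around its centre, kept inside the image."""
    side = max(x2 - x1, y2 - y1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    left = int(round(cx - side / 2))
    top = int(round(cy - side / 2))
    # Shift the square back inside the frame if it spills over an edge.
    left = max(0, min(left, img_w - side))
    top = max(0, min(top, img_h - side))
    return left, top, left + side, top + side


def crop(image, box):
    """Crop a row-major image (list of rows) to box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


# A 30 x 40 detection inside a 640 x 360 frame becomes a 40 x 40 square.
box = square_box(100, 50, 130, 90, img_w=640, img_h=360)
print(box)
```

Squaring the box before resizing avoids distorting the face's aspect ratio when the crop is later scaled to the fixed input size expected by the embedding network.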
2.5. Embedding of Faces and Similarity Between Faces
Face embeddings are vectors that represent a face’s key features in a fixed-dimensional space. During training, the neural network learns to map facial features so that similar faces cluster together, which allows it to capture relevant and distinctive information about each face. The closer two embeddings are in this space, the greater the similarity between the corresponding faces. This similarity is often measured using the Euclidean distance.
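As a small illustration of distance-based similarity, the sketch below compares toy three-dimensional vectors standing in for face embeddings. The vectors and names are invented for illustration; real embeddings have far more dimensions and come from a trained network.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that distances are comparable."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.dist(a, b)


# Toy "embeddings": two views of one person and one view of another.
same_person_1 = l2_normalize([0.9, 0.1, 0.4])
same_person_2 = l2_normalize([0.8, 0.2, 0.5])
other_person = l2_normalize([-0.3, 0.9, 0.1])

d_same = euclidean_distance(same_person_1, same_person_2)
d_other = euclidean_distance(same_person_1, other_person)
print(f"same: {d_same:.3f}, other: {d_other:.3f}")
```

For unit-length vectors the squared Euclidean distance equals 2 − 2·cos(θ), so thresholding the distance is equivalent to thresholding the cosine similarity between embeddings.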
During this process, the algorithm-identified faces are separated and stored in a folder corresponding to the analyzed frame. This validates the number of faces detected.
2.5.1. Generation of Reference Dictionary
A reference dictionary was employed to support identity consistency during face recognition within a single online session. In this framework, the dictionary is constructed directly from the same Google Meet session used for data acquisition, without any separate or prior enrollment phase. For each participant, the system extracts the participant’s displayed name from the videoconferencing platform and associates it with the corresponding face regions detected in video frames where the camera is active. Specifically, between three and five representative face crops are automatically extracted from frames in which the face is clearly visible. These frames are selected opportunistically from the live stream, ensuring frontal or near-frontal views whenever available, while preserving the platform’s native imaging conditions (resolution, compression, illumination, and camera quality).
The role of the reference dictionary is not to perform cross-session biometric identification, but to provide a lightweight and session-specific identity mapping that enables consistent tracking of visual presence across frames. Instead of storing raw pixel data, each face crop is transformed into a compact embedding vector, yielding a normalized feature representation that supports efficient comparison using distance-based similarity measures. This design reduces memory requirements and avoids discrepancies between enrollment and recognition conditions, since both reference and query embeddings originate from the same session. As a result, identity matching remains constrained to intra-session analysis and can be performed without curated datasets or dedicated enrollment procedures, which aligns with the practical constraints of real-time online classroom environments.
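A session-specific dictionary of this kind can be sketched as a mapping from display name to a short list of embeddings. The code below is a minimal sketch under stated assumptions: `embed` stands for whatever embedding network is used, the toy `embed` in the usage example is purely illustrative, and skipping participants with fewer than three usable crops is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


def build_reference_dictionary(
    crops_by_name: Dict[str, Sequence[object]],
    embed: Callable[[object], Embedding],
    min_crops: int = 3,
    max_crops: int = 5,
) -> Dict[str, List[Embedding]]:
    """Map each display name to the embeddings of its representative crops."""
    reference: Dict[str, List[Embedding]] = {}
    for name, crops in crops_by_name.items():
        if len(crops) < min_crops:
            continue  # assumption: too few clear views to enrol this name
        # Store compact embeddings rather than raw pixel data.
        reference[name] = [embed(c) for c in list(crops)[:max_crops]]
    return reference


# Usage with a toy embedding function (hypothetical stand-in for a network).
toy_embed = lambda crop_value: [float(crop_value), 0.0]
reference = build_reference_dictionary(
    {"Student A": [1, 2, 3, 4, 5, 6], "Student B": [7, 8]},
    embed=toy_embed,
)
print({name: len(vectors) for name, vectors in reference.items()})
```

Keeping several embeddings per participant, rather than a single average, lets later matching tolerate moderate changes in pose or lighting within the same session.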
2.5.2. Identity Recognition
Matching against a reference dictionary is a well-established approach for recognizing specific individuals. A detection pipeline was implemented for face detection and identification, governed by the following key parameters: the minimum similarity threshold for matching identified faces, the minimum confidence required for a detected face to be included in the results, and the detection thresholds for each deep neural network in the face detector.
Figure 5 illustrates an example of the face identification process, which is performed using the optimal network parameters. Each detected face is then identified, and the number of frames in which each student appears is recorded. This frequency is used to determine whether a student is present and to compute the percentage of the session during which the student was connected. Students who appear in all frames are considered connected 100% of the time.
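The matching and counting steps can be sketched as follows. This is a simplified illustration, assuming the reference dictionary maps each name to a list of embeddings and that the nearest reference below the similarity threshold determines the identity; choosing the nearest match (rather than the first one below the threshold) is a design assumption, and all names and values are illustrative.

```python
import math
from collections import Counter


def identify(embedding, reference, threshold):
    """Return the closest identity name, or None if nothing is near enough."""
    best_name, best_dist = None, float("inf")
    for name, ref_embeddings in reference.items():
        for ref in ref_embeddings:
            d = math.dist(embedding, ref)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name if best_dist < threshold else None


def count_appearances(frames, reference, threshold):
    """Count, per identity, the number of frames in which it was recognized."""
    counts = Counter()
    for frame_embeddings in frames:
        seen = {identify(e, reference, threshold) for e in frame_embeddings}
        seen.discard(None)       # unmatched faces are not counted
        counts.update(seen)      # each identity counts once per frame
    return counts


# Toy 2-D embeddings for two participants across three frames.
reference = {"A": [[0.0, 0.0]], "B": [[1.0, 1.0]]}
frames = [
    [[0.1, 0.0], [0.9, 1.0]],   # both visible
    [[0.0, 0.1]],               # only A visible
    [[5.0, 5.0]],               # an unknown face
]
counts = count_appearances(frames, reference, threshold=0.5)
print(dict(counts))
```

Counting each identity at most once per frame keeps the frequency bounded by the total number of frames, which is what the connection percentage requires.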
2.6. Quantifying Visual Presence and Algorithm
This study introduces the visual presence (VP) metric to quantify each participant’s on-screen presence during live online sessions. VP is defined as the proportion of video frames in which a student’s face is successfully detected, expressed as a percentage:

$$\mathrm{VP} = \frac{i}{n} \times 100$$

where i is the number of frames in which the student appears, and n is the total number of frames in the session. To compute this metric, the system captures video streams using OBS and processes each frame with a deep neural network-based face detection model. The model identifies facial regions and landmarks, extracts and normalizes cropped face images, and generates embeddings that are compared against a reference dictionary using Euclidean distance. If a match is found, the participant is identified; otherwise, a new entry is added to the dictionary.
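The VP definition translates directly into a short function. The sketch below is illustrative: the function and argument names are assumptions, and the example counts are chosen only to show the arithmetic for a session of 16,200 frames.

```python
def visual_presence(frames_detected: int, total_frames: int) -> float:
    """Percentage of frames in which a participant's face was detected."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= frames_detected <= total_frames:
        raise ValueError("frames_detected must lie between 0 and total_frames")
    return 100.0 * frames_detected / total_frames


# A participant detected in half of a 16,200-frame session.
vp_half = visual_presence(8100, 16200)
# A participant detected in every frame.
vp_full = visual_presence(16200, 16200)
print(vp_half, vp_full)
```

Because the metric depends on successful detection, a missed detection lowers VP just as a switched-off camera does, so the value reflects both student behavior and detector performance.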
In brief, the flow of Algorithm 1 is as follows. Once the facial regions are localized, individual faces are extracted from the detected bounding boxes. These cropped face images are subsequently normalized by resizing to a consistent dimension across the dataset. Next, facial embeddings are computed for each face using a trained deep neural network, encoding distinctive facial characteristics into high-dimensional feature vectors. These embeddings are compared against a set of pre-existing reference embeddings using the Euclidean distance as a similarity metric. When the computed distance exceeds a predefined threshold, indicating the presence of a new, previously unregistered individual, the reference database is updated to include the new embedding. Recognition is then performed by matching embeddings from the current frame to those stored in the reference dictionary, thereby enabling the identification of known participants in real time.
Algorithm 1 Student visual presence detection algorithm
 1: Start streaming the class.
 2: Record the class with OBS.
 3: Get the image sequence.
 4: Segment the images using the deep neural network face detection.
 5: while Face detection is running do
 6:     Obtain detection matrix with the following information:
 7:     for all bounding boxes detected do
 8:         Coordinates of top-left corner and bottom-right corner
 9:         Probability that the bounding box contains a face
10:         Coordinates of eyes, nose, and corners of mouth
11:     end for
12: end while
13: Extract multiple faces:
14: for all bounding boxes do
15:     Cropped Face ← CropFace(image, bounding box)
16:     Square Face ← ResizeToSquare(Cropped Face)
17:     Embed the faces:
18:     Embedding ← PerformEmbedding(Square Face)
19:     Find similarities between faces:
20:     for all embeddings existing in the reference dictionary do
21:         Euclidean Distance ← CalculateEuclideanDistance(Embedding, Embedding in reference)
22:         if Euclidean Distance < Similarity threshold then
23:             Save to reference dictionary
24:         else
25:             The reference vector of an identity is not available
26:         end if
27:     end for
28: end for
29: Face Recognition:
30: for all faces clipped in the sequences do
31:     Perform face embedding:
32:     Embedding ← PerformEmbedding(ResizeToSquare(Cropped Face))
33:     Compare with embeddings in the reference dictionary:
34:     for all (identity name, embedding in reference) in the reference dictionary do
35:         Euclidean Distance ← CalculateEuclideanDistance(Embedding, Embedding in reference)
36:         if Euclidean Distance < Similarity threshold then
37:             Identify face with identity name
38:         end if
39:     end for
40: end for
41: Quantify the Visual Presence:
42: function CalculateVisualPresence(NumberOfPeople, TotalSequence)
43:     VisualPresence ← (NumberOfPeople / TotalSequence) × 100
44:     return VisualPresence
45: end function
In the subsequent computational stage, the visual presence metric is determined by the function CalculateVisualPresence, which receives two parameters: NumberOfPeople (the number of frames in which the participant was detected) and TotalSequence (the total number of frames in the sequence). The metric is computed by dividing the first value by the second and multiplying the result by 100. The resulting percentage quantifies the visual presence and serves as an indicator of participant engagement in the captured session.
The VP metric serves as a quantitative proxy for measuring participant presence during online sessions. It complements qualitative classifications of attention, ranging from “very noisy,” where many students are inattentive, to “generally listening” and “attention,” when most or all are focused. By integrating this value with facial detection and identification algorithms, the framework can quantify a student’s presence as a measurable variable, enabling precise evaluation of participants in virtual classroom environments.
3. Results
This section presents the experimental results and is divided into two parts. Section 3.1 (Experimental Design) describes the design and setup of the experiments, and Section 3.2 (Experimental Results) presents the results obtained.
The algorithm begins by identifying the optimal values, referred to as detection limits, for each of the three networks that comprise the face detector, thereby providing an optimization strategy to enhance the method’s efficacy. Subsequently, the algorithm is executed in a controlled setting, i.e., experiments conducted under meticulously structured conditions to ensure the reliability and reproducibility of the results.
3.1. Experimental Design
Twenty-seven students in the class participated in building the data dictionary and in identifying the optimal network configurations for the face identification and recognition algorithm. From this group, five students were selected to participate in the first experimental stage, during which the algorithm’s performance was evaluated through a series of controlled tests. Conducting these experiments in a controlled environment ensured greater precision and reliability in assessing the algorithm’s effectiveness.
It is important to note that the environment was controlled in the sense that cameras were activated and deactivated at specific times. The study nevertheless focuses on scenarios in which students participate in online learning under conditions that are not always optimal. Background configuration, lighting, camera quality, and internet connectivity, among other parameters, were not controlled during the experiments.
Table 2 describes a scenario in which cameras are turned on and off in a controlled manner by different individuals over a period of nine minutes. This setup facilitates the analysis of controlled visual presence patterns under predefined conditions.
The original recording yielded a total of 16,200 frames at approximately 30 frames per second over a nine-minute session. However, preliminary empirical observations across multiple class sessions indicated that changes in students’ visual presence typically occur on timescales longer than a few seconds. Based on this heuristic evidence, and to reduce unnecessary computational overhead, the processing pipeline was adjusted to perform temporal subsampling by retaining one frame every two seconds. This resulted in a reduced dataset of 270 frames for analysis, while preserving representative visual presence patterns across the full session. Additional tests at higher sampling rates confirmed that this reduction did not materially affect the final visual presence estimates while substantially reducing processing costs.
All experiments were executed using the standard Google Colab environment, equipped with 12 GB of RAM and approximately 100 GB of disk storage. Inference was performed on CPU under identical conditions for all evaluated detectors. The measured processing times were consistently lower than the 2 s temporal sampling interval, ensuring that detection, labeling, and visual presence counting were completed within each sampling window.
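The temporal subsampling step can be illustrated with a short sketch. The function name and signature are hypothetical, and the sketch assumes a constant frame rate. It only selects frame indices, leaving the actual video decoding to whichever library is used.

```python
def subsample_indices(total_frames: int, fps: float, interval_s: float) -> list:
    """Return the indices of frames retained when keeping one frame
    every interval_s seconds from a constant-frame-rate recording."""
    step = round(fps * interval_s)  # frames between retained samples
    if step < 1:
        raise ValueError("fps * interval_s must be at least one frame")
    return list(range(0, total_frames, step))


# Nine minutes at 30 fps, one frame every two seconds.
indices = subsample_indices(total_frames=16_200, fps=30, interval_s=2.0)
print(len(indices))  # 270 retained frames
```

With these parameters the step is 60 frames, which reproduces the reduction from 16,200 to 270 frames reported above.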
3.2. Experimental Results
The experimental results shown in Figure 6 compare the performance of several face detection algorithms in a virtual classroom environment. The first column contains example images captured during live online sessions, while the subsequent columns display detection results from Haar Cascade, Dual Shot Face Detector (DSFD), Multitask Cascaded Convolutional Network (MTCNN), and YuNet. Judging by the number of correctly detected faces, recently developed methods such as YuNet perform better than Haar Cascade and DSFD.
Subsequently, the percentage of visual presence was calculated using Equation (1). Once the faces were detected and each participant was identified through the face dictionary, the performance of each algorithm was evaluated. To this end, five selected subjects were manually analyzed across 270 frames. In each frame, visual presence was measured both manually and automatically using the various face detection and recognition methods described in Section 2.3.
Table 3 summarizes the performance of the proposed framework. The table includes the following attributes: Person (the assigned identifier for each participant) and % of Presence Identification (the percentage of visual presence obtained manually and with the Haar Cascade, DSFD, MTCNN, and YuNet algorithms). All percentages were computed based on a total of 270 frames, corresponding to 100%, which represents the time interval for the controlled experiment introduced in the previous section.
For each image sequence, the number of people present and the algorithm’s ability to recognize and identify them are evaluated. The algorithm performs identification using the values stored in the reference dictionary. Although everyone was connected, the measured presence percentage was not 100%, because factors such as lighting, camera quality, and network quality can significantly affect computer vision results. Nevertheless, the attendance percentage remains high. Understanding these metrics is crucial for evaluating the performance and effectiveness of a face detection model in this practical application.
The experimental results presented in Table 3 are summarized statistically in Table 4, which shows clear differences in the accuracy of the visual presence estimation methods. The manual measurements show the highest mean value (82.86%) with notable variability, reflecting the inherent variability in human annotation. HAAR and DSFD exhibit the lowest means (20.05% and 43.29%, respectively) and relatively small standard deviations, indicating consistent but significantly underestimated visual presence. In contrast, MTCNN and YuNet exhibit higher mean values (58.68% and 64.63%) and tighter confidence intervals, suggesting a stronger agreement with the manual reference data. Among them, YuNet achieved the broadest detection range but also exhibited slightly higher variance due to its sensitivity to varying lighting and resolution conditions. In general terms, the results confirm that recent neural architectures, particularly MTCNN and YuNet, provide more accurate and reliable detection of visual presence.
The performance of the proposed framework was evaluated using the percent error metric. In particular, the deviation between the manually computed visual presence (VP) values and those obtained using different deep neural network methods was calculated. Table 5 shows that the MTCNN and YuNet models achieved the lowest average percent errors of 27.63% and 22.20%, respectively, indicating superior robustness and precision compared to other approaches, such as HAAR and DSFD. These results confirm that deep learning-based detectors, particularly YuNet, provide more stable and reliable estimations of visual presence.
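The percent error computation can be sketched as follows, assuming the conventional definition in which the manual VP value serves as the reference. The exact formula used by the authors is not stated in this section, so both the definition and the function names are assumptions.

```python
def percent_error(vp_manual: float, vp_auto: float) -> float:
    """Percent error of an automatic VP estimate relative to the
    manual reference (assumed definition: |ref - est| / |ref| * 100)."""
    if vp_manual == 0:
        raise ValueError("the manual reference value must be non-zero")
    return abs(vp_manual - vp_auto) / abs(vp_manual) * 100.0


def mean_percent_error(manual_values, auto_values) -> float:
    """Average of the per-participant percent errors."""
    if not manual_values or len(manual_values) != len(auto_values):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [percent_error(m, a) for m, a in zip(manual_values, auto_values)]
    return sum(errors) / len(errors)


# Applied to the mean VP values reported above (manual vs. YuNet).
print(f"{percent_error(82.86, 64.63):.1f}%")
```

Applied to the reported means, the formula gives roughly 22%. This need not coincide exactly with the 22.20% in Table 5 if that figure averages per-participant errors, since the error of the means and the mean of the errors generally differ.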
The HAAR and DSFD algorithms exhibit relatively high mean errors with low variability, reflecting consistently lower visual presence estimates under the evaluated conditions. In contrast, MTCNN and YuNet yield lower mean errors across the same dataset. Among the tested methods, YuNet presents the narrowest confidence interval (18.02–26.40%), indicating reduced variability in its error distribution relative to the other detectors within this specific experimental setup, rather than implying general superiority across broader or untested conditions.
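For readers who wish to reproduce interval estimates of this kind, the sketch below computes a mean and a two-sided 95% confidence interval using Student's t distribution. The choice of a t-based interval is an assumption, as the method used for the reported intervals is not described here, and the input values are placeholders rather than data from this study.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 4 degrees of freedom
# (appropriate for n = 5 observations).
T_CRIT_95_DF4 = 2.776


def mean_confidence_interval(values, t_crit):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


# Placeholder per-participant percent errors (not from the study).
placeholder_errors = [10.0, 20.0, 30.0, 40.0, 50.0]
mean, low, high = mean_confidence_interval(placeholder_errors, T_CRIT_95_DF4)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only five participants the critical value is large, so intervals computed this way are wide and should be read as descriptive of this sample only.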
Table 6 reports the descriptive statistics of the accuracy-related metrics for each algorithm, computed under identical image frame conditions, providing a comparative summary of their performance behavior within the same acquisition context.
4. Discussion
The results indicate that the proposed framework provides a functional approximation of students’ on-screen visual presence during synchronous online classes based on deep neural network-driven face detection. The comparative analysis shows that MTCNN and YuNet produce lower VP errors than Haar Cascade and DSFD under identical videoconferencing conditions. This behavior is reflected in the percent-error measurements, where MTCNN and YuNet exhibit average deviations of 27.63% and 22.20%, respectively, relative to manual annotations. These values characterize the detection pipelines’ responses when faces are small, compressed, and subject to the frequent layout rearrangements inherent in online classroom grids. Within these constraints, the framework yields a low-level indicator of visual availability, without aiming to provide high-precision behavioral or cognitive measurement.
An important observation from the experimental sessions is that although all participants were continuously connected to the online meeting, the measured VP did not reach 100% for any detector. This discrepancy reflects the inherent limitations of face detection pipelines operating on low-resolution, heavily compressed video streams, where faces may intermittently disappear due to lighting variations, camera angles, bandwidth fluctuations, or layout reconfigurations. Rather than indicating a failure of the pipeline, these effects illustrate the difficulty of maintaining consistent detection under realistic videoconferencing conditions. The resulting VP values, therefore, reflect the observable output of the detection process rather than an exact measure of user availability.
It is important to clarify that visual presence, as defined in this work, does not imply attentiveness, engagement, or learning effectiveness. The VP metric is intentionally designed as a simple, interpretable quantity that captures whether a participant’s face appears on screen, and it should not be interpreted as a proxy for cognitive or pedagogical constructs. No claims are made regarding correlations between VP and learning outcomes, instructional effectiveness, or student engagement, as establishing such relationships would require dedicated educational validation beyond the scope of this benchmarking study.
From a methodological perspective, the primary contribution of this work lies in evaluating how different face detection architectures behave under authentic online classroom constraints. The observed performance differences between classical methods (Haar Cascade), high-capacity detectors (DSFD), cascaded architectures (MTCNN), and lightweight models (YuNet) highlight the sensitivity of detection pipelines to platform-induced distortions such as compression artifacts, resolution reduction, and dynamically changing grids.
Several directions for future technical refinement are suggested by these findings. These include improved calibration of detection and identification thresholds, more robust embedding strategies for low-quality inputs, and integration of additional visual cues to mitigate intermittent detection loss. However, any extension toward higher-level behavioral or educational interpretation would require independent validation with appropriate human-subject studies and outcome measures.
Overall, the experimental results support the feasibility of the proposed framework as a lightweight, technically grounded benchmarking tool for estimating visual availability under real videoconferencing conditions. The study demonstrates how face detection models respond to the constraints imposed by online platforms, and it provides a reproducible reference for comparing detection robustness without extending claims beyond what is empirically validated.
Beyond the models evaluated in this study, several lightweight detectors, such as RetinaFace [34,35] and YOLO-Face [36], are widely adopted due to their strong balance between accuracy and efficiency, leveraging multi-scale feature pyramids, single-stage architectures, and compact backbones for real-time facial localization and fast inference in face-centric datasets. However, these approaches tend to rely on higher-resolution inputs and GPU-oriented pipelines, which limits their robustness when faces appear extremely small and highly compressed, as is typical in dynamic online classroom grids. Furthermore, adapting these models would require domain-specific retraining and threshold tuning to avoid false positives in densely packed layouts. For these reasons, they were not included in our benchmark, though future work will consider integrating such lightweight architectures to broaden the scope of comparison.
Finally, an important contextual limitation of this study concerns the rate at which visual conditions change during typical synchronous online classes. Based on empirical observation, sessions with camera-enabled participants are generally characterized by relatively stable lighting conditions and background configurations over short and medium time intervals. In contrast, the dominant sources of visual variability arise from camera activation and deactivation events and from dynamic layout reconfigurations imposed by the videoconferencing platform (e.g., speaker switching, participant entry or exit). For this reason, the proposed benchmark focuses on low-resolution and compressed grid dynamics rather than attempting to model rapidly fluctuating illumination or background changes. More extreme or highly dynamic visual scenarios are acknowledged as relevant but were not systematically evaluated and are therefore left for future work.
5. Conclusions
This study presented a comparative evaluation of four face detection pipelines, Haar Cascade, DSFD, MTCNN, and YuNet, applied to estimating visual presence in synchronous online classes under realistically degraded videoconferencing conditions. Rather than proposing a high-accuracy behavioral measurement system, the primary objective was to benchmark how distinct detection architectures respond to low-resolution video tiles, strong compression artifacts, dynamic grid rearrangements, and intermittent face visibility.
The experimental results reveal clear trade-offs between detection accuracy, computational efficiency, and robustness. Classical approaches such as Haar Cascade exhibit consistently low detection rates in this setting, confirming their limited suitability for modern videoconferencing environments. High-capacity deep detectors such as DSFD achieve moderate gains in accuracy but incur substantial computational overhead, making them less practical for real-time use. In contrast, cascaded architectures (MTCNN) and lightweight models (YuNet) achieve lower visual presence errors and significantly reduced processing times, illustrating a more favorable balance between robustness and efficiency. However, even for the best-performing model (YuNet), average errors remain above 20%, indicating that face detection-based estimation of visual presence under these constraints is inherently noisy and far from precise.
These findings suggest that, with a simple detection-and-embedding pipeline, visual presence can only be approximated as a low-level indicator of on-screen availability rather than a reliable or exhaustive measure of student behavior. The observed error rates highlight the limitations imposed by platform-level resolution reduction, compression, and layout dynamics, and they caution against interpreting VP as an accurate proxy for engagement or attention without additional validation.
From a practical perspective, the benchmark demonstrates that deploying face detection models in online classroom grids involves unavoidable compromises: lightweight models offer speed and stability at the cost of detection completeness, while more complex architectures improve detection quality but may be unsuitable for real-time or large-scale deployment. Accordingly, the main contribution of this work lies in characterizing these trade-offs and providing a reproducible reference for evaluating face detection behavior in noisy and dynamic online classroom environments.
Future research should explore more robust detection strategies, improved identity-matching schemes, and hybrid approaches that may mitigate intermittent detection failures. Any extension toward educational or behavioral interpretation, however, will require dedicated validation studies that combine visual presence with pedagogically meaningful outcome measures, which are beyond the scope of the present benchmarking study.