Article

Enhancing the Effectiveness of Juvenile Protection: Deep Learning-Based Facial Age Estimation via JPSD Dataset Construction and YOLO-ResNet50

1 College of Information and Technology, Nanjing Police University, Nanjing 210023, China
2 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
3 College of International Relations, Nanjing University, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Appl. Syst. Innov. 2025, 8(6), 185; https://doi.org/10.3390/asi8060185
Submission received: 21 October 2025 / Revised: 21 November 2025 / Accepted: 27 November 2025 / Published: 29 November 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

An increasing number of juveniles are accessing adult-oriented venues, such as bars and nightclubs, where supervision is frequently inadequate, thereby elevating their risk of both offline harm and unmonitored exposure to harmful online content. Existing facial age estimation systems, which are primarily designed for adults, have significant limitations when it comes to protecting juveniles, hindering the efficiency of supervising them in key venues. To address these challenges, this study proposes a facial age estimation solution for juvenile protection. Firstly, we designed a three-stage ‘detection–cropping–classification’ framework, which first detects facial regions with a detection algorithm, then crops the detected faces, and finally feeds the cropped results into a classification model for age estimation. Secondly, we constructed the Juvenile Protection Surveillance and Detection (JPSD) Dataset by integrating five public datasets: UTKface, AgeDB, APPA-REAL, MegaAge and FG-NET. This dataset contains 14,260 images categorised into four age groups: 0–8 years, 8–14 years, 14–18 years and over 18 years. Thirdly, we conducted baseline model comparisons. In the object detection phase, three YOLO algorithms were selected for face detection. In the age estimation phase, traditional convolutional neural networks (CNNs), such as ResNet50 and VGG16, were contrasted with vision transformer (ViT)-based models, such as ViT and BiFormer. Gradient-weighted Class Activation Mapping (Grad-CAM) was used for visual analysis to highlight differences in the models’ decision-making processes. Experiments revealed that YOLOv11 is the optimal detector for accurate facial localisation and that ResNet50 is the best base classifier for enhancing age-sensitive feature extraction, outperforming BiFormer. The results show that the framework achieves Recall of 89.17% for the 0–8 age group and 95.17% for the over-18 age group. However, the current model has low Recall rates for the 8–14 and 14–18 age groups. Therefore, in the near term, we emphasise that this technology should only be used as a decision-support tool under strict human-in-the-loop supervision. This study provides an essential dataset and technical framework for juvenile facial age estimation, offering support for juvenile online protection, smart policing and venue supervision.

1. Introduction

The rapid advancement of internet technologies and the proliferation of networked devices have significantly increased global juvenile internet exposure. According to the China Internet Network Information Center’s (CNNIC) 54th statistical report on Internet development in China, as of June 2024, China’s internet user base had grown to almost 1.1 billion, with an additional 7.42 million new users joining. Notably, 49.0% of these new users were aged 10–19 [1]. While the internet provides convenience for young people’s daily lives, it also exposes minors to issues such as cyberbullying, online pornography, and internet addiction. For instance, in 2024 alone, Chinese police conducted over 200,000 cybersecurity inspections, identifying and removing more than 500,000 pieces of illegal content, including materials involving violence, pornography, and terrorism. These challenges highlight the urgent need for global establishment of child-safe digital environments.
To mitigate online risks faced by juveniles, jurisdictions around the world have introduced specialised regulatory frameworks that establish age verification and content governance as fundamental requirements. In accordance with the European Union’s General Data Protection Regulation (GDPR), internet companies must obtain parental or legal guardian consent before processing the personal data of children under the age of 16 [2]. In the United States, the Children’s Online Privacy Protection Rule (COPPA), a regulation established to protect the online privacy of children under 13, requires enterprises to obtain parental consent before collecting children’s personal information. Enforcement is conducted by the Federal Trade Commission (FTC) [3]. China’s Minor Protection Law and the Regulations on the Protection of Minors in Cyberspace cover all minors under the age of 18, requiring key platforms to implement a ‘Minor Mode’ and conduct annual protection impact assessments. Serious breaches are jointly enforced by the Cyberspace Administration of China and the Ministry of Public Security and can result in the revocation of licences [4,5]. However, conventional age verification systems based on identity documents have inherent limitations, including identity misuse, technical circumvention and cross-border enforcement disparities, which undermine their effectiveness in global juvenile protection frameworks. To address these limitations, the integration of reliable biometric technologies, such as facial age estimation, is necessary to strengthen regulatory mechanisms and establish a more robust technical safeguard for online juvenile protection.
Facial age estimation technology has broad applications in policing practice, including cross-age face recognition [6], intelligent security [7], criminal investigation [8], and video retrieval [9]. It enables automatic real-time identification and age estimation of juveniles in video streams, triggering predefined regulatory responses—significantly improving supervision efficiency and freeing up limited police resources compared to manual inspections. Integrating artificial intelligence and edge computing, this technology promotes the development of new policing models and offers a practical solution for juvenile online protection. However, there are still some issues that need to be addressed when using existing techniques for estimating the age of juveniles from their faces. Firstly, most models are designed primarily for adult scenarios, resulting in less accurate age estimation for young populations. Secondly, biases may exist towards the faces of minors of different races and genders. Thirdly, privacy risks associated with biometric data collection and processing remain unresolved. Fourthly, few systems are optimised for complex law enforcement scenarios (e.g. low-resolution surveillance images and complex backgrounds).
This study proposes a multi-stage facial analysis framework for juvenile protection. The aim is to improve the accuracy, fairness and interpretability of cross-age age estimation. It offers new technical ideas and data to support the online supervision of juveniles. The core innovations and research content of this paper are as follows:
  • Specialised dataset for law enforcement scenarios: We have constructed the Juvenile Protection Surveillance and Detection (JPSD) Dataset, which covers all growth stages from birth to 18 years of age. This dataset incorporates real-world data, including surveillance footage and location-specific images, to provide a standardised benchmark for training and evaluating algorithms in practical policing applications.
  • Cascaded ‘Detection–Cropping–Classification’ Framework: We propose a novel framework comprising three stages: first, accurate location of facial regions via object detection; second, elimination of background interference through adaptive cropping; and third, fine-grained age classification based on deep features. To optimise performance, we systematically compared three detection models (YOLOv8, YOLOv11 and YOLOv12) and four classification models (VGG16, ResNet50, ViT and BiFormer). Experimental results identified the optimal combination that significantly enhances the accuracy and robustness of cross-age estimation.
  • Grad-CAM-Enhanced Interpretability: We use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate visual heatmaps that highlight the key facial regions on which the model focuses. This approach reveals critical age-distinguishing features, verifies the rationality of model predictions and facilitates comparative analysis of feature extraction differences across various network architectures. Specifically, it validates the distinct decision-making mechanisms between CNN- and ViT-based models.
The rest of this paper is structured as follows: Section 2 provides a comprehensive review of related work, covering classification algorithms and techniques for estimating facial age. Section 3 presents the materials and methods, providing detailed descriptions of the JPSD Dataset construction, the proposed cascaded framework and the experimental configuration. Section 4 presents the relevant experimental results and discussions. Section 5 summarises the content of the entire paper and outlines future work.

2. Related Works

2.1. Research into Algorithms for Facial Age Estimation

Although existing research on facial age estimation and juvenile protection technologies provides a foundation for this study, critical gaps remain in terms of scenario adaptation, fairness and privacy protection. This section provides a systematic review of the relevant literature to clarify the boundaries of the research and highlight the novel aspects of this work.
Facial age estimation has evolved from traditional, hand-crafted, feature-based methods to deep-learning-driven approaches, with continuous improvements in accuracy. Early studies relied on facial landmarks (e.g., eye distance and nose shape) and texture features (e.g., wrinkle density) for age estimation [10], but these methods were not robust to changes in lighting, pose and age-related facial variation, resulting in low accuracy in cross-age scenarios. As technologies such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have developed rapidly, an increasing number of studies have applied them to age estimation.
Tan et al. [11] devised an Age Group Encoding (AGEn) and Local Age Decoding (LAD) algorithm, while Kuprashevich et al. [12] proposed MiVOLO, a multi-input Transformer integrating facial and full-body images to jointly estimate age and gender. MiVOLO demonstrates enhanced generalisation and state-of-the-art accuracy across multiple benchmarks. To achieve age-invariant recognition, Wang et al. [13] introduced Cross-Age Contrastive Learning (CACon), a semi-supervised approach that uses a novel triplet-based loss function for contrastive learning. In 2025, Yang et al. [14] proposed SimViT-Age, which integrates similarity, cross-entropy, and knowledge distillation. Using a ViT as its backbone, this method achieved reductions in mean absolute error (MAE) of 0.05 on the CACD dataset and 0.14 on the UTKface dataset. Furthermore, leveraging knowledge distillation compressed the model by over 90% in terms of size, parameters, and computational cost, while only increasing the MAE by less than 0.5. This significantly enhances the model’s feasibility for mobile deployment. In the same year, Qin et al. [15] introduced an attention-based convolutional LSTM model that dynamically focuses on fine-grained spatio-temporal facial features via channel weighting factors and normalisation. They reported an MAE of 3.60 on FG-NET and 2.45 on MORPH, outperforming both non-attention ConvLSTMs and traditional LSTM models.
In terms of integrated applications, Narayan et al. [16] introduced FaceXFormer in 2024. This end-to-end Transformer can handle nine facial analysis tasks, such as facial parsing, landmark detection and head pose estimation, within a unified encoder–decoder framework. It achieves high performance and real-time inference at 33.21 frames per second (FPS). Similarly, Chen et al. [17] developed an anti-addiction system that combines face recognition, age estimation and liveness detection. This system achieved accuracies of 97.3%, 75.02% and 78.8%, respectively, and has proven effective in preventing juvenile gaming addiction.
However, most existing algorithms are optimised for adult age estimation. Training data is predominantly adult-focused, resulting in a higher mean absolute error (MAE) for juveniles. Additionally, few models are adapted for use in law enforcement scenarios (e.g., low-resolution surveillance, multi-subject detection), which limits their practical applicability.

2.2. Fairness Research in Age Estimation

Fairness remains a critical, unresolved issue in age estimation research. Studies have confirmed that mainstream models exhibit significant bias across demographic groups.
For example, Cao et al. [18] evaluated 24 deep learning models and found that systems trained on Western-dominated datasets underestimated the age of Asian juveniles by an average of 2.3 years, and that the prediction error for female faces was 1.8 times higher than for male faces. A comparative study of Amazon AWS and Microsoft Azure age estimation services revealed a 4.92-fold fairness disparity between Caucasian and African American groups [19], with minority juveniles facing a higher risk of misclassification.
There has been limited research focusing on fairness specific to juveniles: existing bias mitigation methods are rarely tailored to the unique facial development characteristics of minors, resulting in ineffective fairness improvements in juvenile age estimation.

2.3. Privacy Protection Technologies in Biometric Applications

As sensitive biometric information, facial data raises serious privacy concerns in law enforcement and supervision applications. Relevant research mainly focuses on two areas.
Muhammed et al. [20] proposed a federated learning framework for facial recognition that enables distributed model training without centralising raw facial data, thereby reducing breach risks by 60%. However, this method increases communication costs and latency, rendering it unsuitable for real-time law enforcement scenarios. Zhang et al. [21] applied a differential privacy mechanism to age estimation models; however, this approach resulted in a significant loss of accuracy, which is unacceptable for juvenile protection tasks requiring high precision.
Current privacy-preserving technologies struggle to balance privacy, accuracy and real-time performance, creating a critical barrier to the deployment of facial age estimation for the supervision of juveniles online.

2.4. Technical Applications in Juvenile Protection

Existing juvenile protection technologies primarily focus on content filtering and time limits, with limited integration of facial age estimation.
Narayan et al. (2024) developed FaceXFormer, a Transformer-based system capable of handling nine facial analysis tasks (including age estimation) at a real-time performance rate of 33.21 FPS [22]. Although versatile, it lacks specialisation for juvenile facial features and law enforcement scenario constraints. Few studies have developed dedicated facial analysis systems for juvenile protection, and existing solutions fail to address the tripartite challenge of accuracy, fairness and privacy, creating an urgent need for targeted research.

2.5. Summary of Research Gaps

Progress has been made in existing research on facial age estimation, but critical gaps remain for juvenile protection applications: (1) insufficient adaptation to juvenile facial development characteristics and law enforcement scenarios, (2) neglected fairness issues across demographic groups, (3) an unresolved balance between privacy, accuracy and real-time performance, and (4) a lack of specialised datasets for juvenile law enforcement. This study addresses these gaps through a tailored framework, specialised dataset and comprehensive optimisation.

3. Materials and Methods

The experimental research process is as follows. First, two core datasets were constructed using publicly available and legally obtained images. The first dataset is the face detection dataset, containing 1506 images covering various scenarios, such as occlusion and low-light environments. The second dataset, consisting of 14,260 images, is for facial age estimation and is categorised into four age groups according to legal requirements: 0–8, 8–14, 14–18 and 18+. Next, the YOLO framework was used to train a high-precision face detection model. Then, the detected faces were precisely cropped and standardised through pre-processing. The processed results from this step were then input into classification models for age estimation and classification. The research workflow is illustrated in Figure 1.
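As an illustration of how the three stages could be wired together in code, the following Python sketch uses the Ultralytics YOLO API and a torchvision ResNet50; the weight file names, the four-class label order and the pre-processing parameters are assumptions rather than the exact configuration used in this study.

# Minimal sketch of the detection–cropping–classification pipeline (illustrative only).
# Assumes a trained YOLO face detector ("face_yolo11.pt"), a fine-tuned ResNet50
# age classifier ("age_resnet50.pt") with a 4-way output head, and 224x224 inputs.
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

AGE_GROUPS = ["0-8", "8-14", "14-18", "18+"]  # assumed label order

detector = YOLO("face_yolo11.pt")                         # stage 1: face detection
classifier = models.resnet50(weights=None)
classifier.fc = torch.nn.Linear(classifier.fc.in_features, len(AGE_GROUPS))
classifier.load_state_dict(torch.load("age_resnet50.pt", map_location="cpu"))
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def estimate_age_groups(image_path: str):
    image = Image.open(image_path).convert("RGB")
    results = detector(image_path)[0]                     # stage 1: detect all faces
    predictions = []
    for box in results.boxes.xyxy.tolist():
        x1, y1, x2, y2 = map(int, box)
        face = image.crop((x1, y1, x2, y2))               # stage 2: crop the face region
        with torch.no_grad():
            logits = classifier(preprocess(face).unsqueeze(0))   # stage 3: classify
        predictions.append(AGE_GROUPS[int(logits.argmax(dim=1))])
    return predictions

print(estimate_age_groups("scene.jpg"))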

3.1. Dataset Collection and Preprocessing

The images in the face detection dataset were mainly collected from open-source image repositories, such as Unsplash and Pexels. They were also voluntarily provided by participants and obtained from publicly available sources. In total, the dataset contains 1506 images. Additionally, the dataset contains images captured under challenging conditions, such as dim lighting and facial occlusion, to ensure accurate facial detection in non-ideal environments. The dataset was annotated using LabelImg v1.8.6 software to mark facial regions in YOLO format. During annotation, bounding boxes were drawn to fully cover the face while minimising background interference and truncation. For partially occluded faces, only the visible regions were annotated, and completely unrecognisable faces were skipped. Once all annotations had been completed, the final dataset was divided into training, validation and testing sets in a 7:2:1 ratio.
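As a minimal sketch of the split described above (assuming the common YOLO layout of parallel images/ and labels/ folders with one .txt label file per image), the 7:2:1 division could be produced as follows; the folder names and random seed are illustrative.

# Illustrative 7:2:1 split of a YOLO-format face detection dataset.
# Assumes images in "dataset/images" and matching "<name>.txt" files in "dataset/labels".
import random
import shutil
from pathlib import Path

random.seed(42)
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.7 * n)],
    "val": images[int(0.7 * n): int(0.9 * n)],
    "test": images[int(0.9 * n):],
}

for split, files in splits.items():
    for img in files:
        label = Path("dataset/labels") / (img.stem + ".txt")
        for src, sub in ((img, "images"), (label, "labels")):
            dst = Path(f"dataset/{split}/{sub}")
            dst.mkdir(parents=True, exist_ok=True)
            if src.exists():              # images with no annotated face may lack a label file
                shutil.copy(src, dst / src.name)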
To address the shortage of samples for age estimation of juveniles, this study also created a dedicated dataset for facial age estimation. Specifically, five publicly available datasets were merged, and all images were cropped to remove complex backgrounds, leaving only the facial components. These images were then standardised through pre-processing operations such as alignment, illumination correction and size normalisation. The final dataset was named JPSD Dataset. Table 1 provides detailed information on the name of each dataset, the total number of images it contains, and the number of images containing juveniles.
The following section provides details on each dataset, along with a breakdown of the number of images containing juveniles that were extracted from them.
The UTKface dataset [23] is a large-scale, publicly available dataset containing over 20,000 facial images. Subject ages range from 0 to 116 years, and images vary in terms of occlusion, resolution, facial expression, pose and illumination. Each image is annotated with gender information (male: 0; female: 1) and race information (five categories: 0—White; 1—Black; 2—Asian; 3—Indian; 4—Others, including Hispanic, Latino and Middle Eastern). The dataset provides both raw images and aligned and cropped versions. From this dataset, we collected 4352 images of subjects under the age of 18.
The AgeDB dataset [24] is a large dataset containing 16,488 images of children and adults in real-life settings. The images and age annotations were manually collected from Google Images, and the sources include public photos of actors, scientists, and others. The images were captured in uncontrolled environments and show diverse expressions, poses, occlusions and noise. This dataset includes 568 distinct individuals spanning ages from 1 to 101 years old, with an average of 29 images per person. A pair.txt file is also provided in the LFW (Labeled Faces in the Wild) format for training and testing face recognition models. Of these, 303 images depict individuals under the age of 18.
The APPA-REAL dataset [25] is an advanced apparent age dataset comprising 7591 images annotated with real and apparent age labels. It covers all age ranges, from children to the elderly. The dataset is primarily used for visual age estimation tasks, i.e., inferring perceived age from facial images, and includes 4113 training images, 1500 validation images, and 978 test images. The dataset is widely used in studies of facial recognition related to age estimation. Within this dataset, we identified 1417 images of people under 18 years old.
The MegaAge dataset [26] contains 41,941 facial images. Additionally, a subset called MegaAge-Asian includes 40,000 Asian facial images covering ages 0–70. The dataset primarily features East Asian faces but also includes individuals from other ethnic groups. Due to its large volume and detailed age annotations, the MegaAge dataset has been widely used in research on facial age estimation. From this dataset, we extracted 7009 images of subjects aged 0–17.
The FG-NET dataset [27] is a classic benchmark for cross-age face recognition and age estimation. It consists of 1002 images of 82 individuals, divided into 818 training images and 170 test images, with ages ranging from 0 to 69 years. The dataset exhibits a large age gap of up to 45 years and contains colour and greyscale images captured under controlled conditions. Each image provides 68 facial landmark points, supporting studies on cross-age face recognition, age estimation and age progression modelling. Of the images in this dataset, 621 depict individuals aged 18 years or younger, and 619 of these were selected.
To mitigate racial and gender bias, only the age attribute of each image was used during dataset integration. By aggregating images of children from the aforementioned datasets, a total of 13,700 images were collected, forming one of the largest datasets of children’s faces. To train the age model impartially, our dataset was supplemented with 560 images of adults from the same sources, ensuring an even age distribution from 18 to 60 years. All images were categorised into four groups based on specific criteria (detailed in Section 3.2) and cropped to remove complex backgrounds, thereby enhancing classification accuracy. Additionally, all images were cleaned and standardised into the JPG format using ImageMagick v7.1.27. The JPSD Dataset contains a total of 7216 images of subjects aged 0–8, 3260 aged 8–14, 3224 aged 14–18 and 560 aged 18 and over. These were divided into training and testing sets in an 8:2 ratio. Figure 2 shows the statistical distribution of the different age groups in the dataset. Figure 3 shows some example images.
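As one concrete example of the selection step, UTKface encodes the subject’s age at the start of each filename (age_gender_race_timestamp.jpg), so the under-18 subset can be extracted with a short script; the folder paths below are assumptions, and the other source datasets require their own per-dataset parsing.

# Illustrative extraction of under-18 images from UTKface, whose filenames start with the age.
import shutil
from pathlib import Path

SRC = Path("UTKFace")          # assumed folder of raw UTKface images
DST = Path("JPSD/under18")     # assumed output folder
DST.mkdir(parents=True, exist_ok=True)

for img in SRC.glob("*.jpg"):
    try:
        age = int(img.name.split("_")[0])   # filename format: age_gender_race_timestamp.jpg
    except ValueError:
        continue                            # skip files that do not follow the convention
    if age < 18:
        shutil.copy(img, DST / img.name)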

3.2. Age Classification Criteria

Although datasets such as UTKface contain facial images with various annotations including age, gender, and ethnicity, this study focuses on constructing classification models for specific venues in China, such as hotels, internet cafés, nightclubs, bars, and concert venues. Thus, gender and ethnicity are not considered in our approach. In accordance with relevant laws, regulations and policy guidelines, and in combination with practical supervision scenarios, the data was categorised into the following four age groups: 0–8, 8–14, 14–18 and 18+. This classification model has a solid legal foundation and aligns with the needs of real-world applications. The rationale behind the classification of each group is described below:
The first category is 0–8 years (exclusive of 8). Individuals in this group have no civil capacity, so their online activities and consumption behaviours must be conducted entirely by their guardians on their behalf (Article 23 of the Regulations on the Protection of Juveniles in Cyberspace). From a technical perspective, an absolute blocking strategy should be employed. For example, in internet café monitoring scenarios, the system should trigger an alarm and lock the device immediately once a face from this age group is detected, in order to comply with the special protection requirements for ‘sensitive personal information’ under the Provisions on the Protection of Children’s Personal Information Online.
The second category is 8–14 years old (exclusive of 14). Those in this group have limited civil capacity and may only engage in age- and intelligence-appropriate activities with guardian consent (Article 19 of the Civil Code). For example, in scenarios involving supervision of entertainment venues, the system should log their entry and notify guardians in real time in order to meet the requirements for early intervention specified in Article 28 of the Law on the Prevention of Juvenile Delinquency with regard to undesirable behaviours.
The third category is individuals aged 14–18 (exclusive of 18). According to Chinese law, juveniles aged 14 to under 16 bear criminal responsibility only for eight serious offences, such as intentional homicide. However, entering venues such as internet cafés or bars remains prohibited (Article 58 of the Law on the Protection of Juveniles). Those aged 16 and over still enjoy special protection. For example, in the supervision of KTV venues at night, the intensity of facial recognition should be increased between 22:00 and 06:00 (Article 23 of the Regulations on the Administration of Entertainment Venues), in order to comply with the principle of enforcement in Article 12 of the Law on Penalties for Administration of Public Security, which states that ‘education comes first and punishment is supplementary’. As management requirements for this age group are generally similar across venues, it can be discussed as one category.
The fourth category comprises individuals aged 18 and over. These individuals possess full civil capacity and form the primary group to be differentiated.
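To make the boundary conventions above explicit (each group includes its lower bound and excludes its upper bound), a simple mapping from a numeric age annotation to the four JPSD group labels might look like the sketch below; the label strings are illustrative.

# Map a numeric age label to one of the four JPSD age groups.
# Boundaries follow Section 3.2: 0-8 (exclusive of 8), 8-14 (exclusive of 14),
# 14-18 (exclusive of 18), and 18+.
def age_to_group(age: float) -> str:
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 8:
        return "0-8"
    if age < 14:
        return "8-14"
    if age < 18:
        return "14-18"
    return "18+"

assert age_to_group(7) == "0-8"
assert age_to_group(14) == "14-18"
assert age_to_group(18) == "18+"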

3.3. Related Technology Overview

The primary objective of this study is to detect faces in surveillance footage and to estimate the age of the individuals in order to determine whether they are juveniles. As only a rough age range is required, each detected face can be cropped and processed independently by a CNN, treating age estimation as a classification problem. When selecting the facial image detection models, we considered both model advancement and detection accuracy and adopted three YOLO series models that have recently attracted significant attention: YOLOv8, YOLOv11 and YOLOv12. For facial age estimation, we selected two widely used CNN models (ResNet50 and VGG16) and two Vision Transformer models (ViT and BiFormer). Finally, we determined the optimal model configuration by comparing the performance of various model combinations on the designated datasets.

3.3.1. Introduction to Face Image Detection Models

The YOLO model’s network architecture mainly consists of three core components: the feature extraction backbone network (Backbone), the feature fusion neck network (Neck) and the detection head network (Head). During the inference process, the input image is first processed by the Backbone network to extract multi-scale features and aggregate them hierarchically. The Neck network then performs feature cropping and cross-level concatenation to achieve multi-scale feature fusion. The optimised feature maps are subsequently passed to the prediction layer. Finally, the Head network performs object localisation and classification tasks based on the fused features, outputting bounding box coordinates and class probabilities. The following sections introduce three YOLO models in turn.
YOLOv8 [28] is a deep learning–based object detection algorithm that inherits the efficiency and accuracy of YOLO series models. Improvements to the anchor mechanism and feature fusion strategy further enhance face detection performance, and YOLOv8 excels in face detection tasks. Although it can quickly and accurately locate facial regions, its detection accuracy may decrease when dealing with severely occluded or blurred faces.
Compared with YOLOv8, YOLOv11 [29] boasts several enhancements to its model architecture and algorithm optimisation. It introduces more efficient feature extraction networks and improved loss functions, resulting in powerful feature extraction capabilities and the ability to adapt to complex scenes. However, the increased model complexity may result in longer training times and higher computational requirements. Figure 4 shows the network framework diagram of YOLOv11.
Based on YOLOv11, YOLOv12 [30] further optimises the model structure and training strategies. By introducing multi-scale feature fusion and an improved anchor allocation mechanism, YOLOv12 can better handle face detection tasks involving different scales and poses. It achieves new heights in detection accuracy and robustness, particularly excelling in images with occlusion, blur and extreme lighting conditions. However, YOLOv12’s high model complexity imposes greater demands on hardware resources, which may limit its deployment on devices with limited resources.

3.3.2. Introduction to Facial Age Estimation Models

VGG16 [31] is a classical convolutional neural network developed by the Visual Geometry Group (VGG) at the University of Oxford. The model uses small 3 × 3 convolution kernels as its basic filters; its 16 weight layers comprise 13 convolutional layers interleaved with five 2 × 2 max-pooling layers, followed by three fully connected layers, enabling multi-scale feature extraction. In age classification tasks, it can effectively extract facial texture and structural features for age estimation. VGG16’s advantages lie in its simple structure, ease of understanding and straightforward implementation. However, its disadvantages include a large number of parameters, a high computational cost for training and inference, and sensitivity to input image size.
ResNet50 [32] is a classical deep convolutional neural network composed of an initial convolutional stem, four stages of stacked residual bottleneck blocks and a final classification head. It is widely used in image classification tasks. By introducing a residual learning mechanism, it effectively addresses the vanishing gradient problem in deep networks and enables the training of deeper architectures. In age classification tasks, it can extract rich facial features and achieve high classification accuracy. The advantages of ResNet50 include simplicity, stable training and ease of implementation. However, its limitations include reduced generalisation capability on large-scale datasets and the need for further architectural optimisation in complex age classification tasks. Figure 5 shows the overall framework of the ResNet50 model.
ViT [33] is a vision model based on the Transformer architecture that has achieved remarkable success in image classification and visual tasks in recent years. It divides the input image into multiple patches, treating them as sequence input and performing feature extraction and modelling through the Transformer’s self-attention mechanism. In age classification tasks, ViT captures long-range dependencies and global features within facial images, achieving high classification accuracy and robustness. Advantages of ViT include strong adaptability to various image sizes and resolutions, and a superior ability to capture global features. However, disadvantages include the high computational cost of training and inference, and the large data requirements.
BiFormer [34] is a vision Transformer model based on a dynamic sparse attention mechanism. By introducing bi-level routing attention, it enhances computational efficiency and feature representation capability. BiFormer autonomously identifies the most relevant regions of an image while retaining the ability to model long-term dependencies. This demonstrates its excellent computational efficiency and powerful feature extraction capabilities, making it particularly suitable for processing high-resolution images or applications with limited resources. However, it has higher implementation complexity and requires careful hyperparameter tuning for specific tasks.

3.4. Evaluation Metrics

This study employed six commonly used evaluation metrics to evaluate the performance of three face detection models and four age classification models in the tasks of juvenile face detection and age estimation.
The calculation formulas for these six metrics are shown in Equations (1)–(6).
\mathrm{Precision} = \dfrac{TP}{TP + FP} \qquad (1)
\mathrm{Recall} = \dfrac{TP}{TP + FN} \qquad (2)
\mathrm{F1\ score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)
\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN} \qquad (4)
\mathrm{AP} = \displaystyle\int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\mathrm{Recall} \qquad (5)
\mathrm{mAP} = \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \mathrm{AP}_i \qquad (6)
where TP (true positive) represents the number of image regions that contain faces and are correctly predicted as such by the model. TP is a key indicator of the model’s detection capability, reflecting its accuracy in identifying faces.
FP (false positive) indicates the number of image regions that do not contain faces but are incorrectly predicted as faces by the model. A high FP value will significantly reduce the model’s usability in high-precision scenarios.
FN (false negative) represents the number of image regions that contain faces but are incorrectly predicted by the model as non-face regions. This metric indicates missed detections; a large number of FNs can lead to incomplete detection results and affect subsequent tasks.
TN (true negative) represents the number of image regions that do not contain faces and are correctly predicted as non-face regions by the model. TN is a key indicator of the model’s ability to identify non-face regions and reflects its stability under background noise.
AP (average precision) is the average precision for a single category. A higher AP value indicates that the model balances detection coverage and accuracy well, achieving a high Recall rate while maintaining good Precision.
mAP (mean average precision) is the mean of AP values across all categories and is a core metric for evaluating the performance of multi-class detection systems. N is the number of categories. A higher mAP value suggests stronger generalisation and robustness, as well as greater reliability in real-world applications.
For face detection tasks, we evaluate the detection model using Precision, Recall, the F1 score and mAP. Precision is defined as the proportion of correctly recognised faces among all predicted face bounding boxes, reflecting the model’s ability to avoid false detections. Recall represents the proportion of correctly detected face bounding boxes among all actual faces. The F1 score is the harmonic mean of Precision and Recall, providing a comprehensive reflection of the model’s balanced performance in reducing false positives and avoiding missed detections. AP represents the area under the precision–recall curve, and mAP is the average AP across multiple categories or intersection over union (IoU) thresholds. mAP@0.5 denotes the average precision when the IoU threshold is 0.5.
In age classification tasks, the performance of a model is evaluated using Precision, Recall, the F1 score and Accuracy. Precision is defined as the proportion of faces predicted to belong to a specific age group that actually belong to that group. Recall represents the proportion of faces in a specific age group that are correctly identified. The F1 score is the harmonic mean of Precision and Recall and provides a more comprehensive reflection of the model’s classification performance. Accuracy measures the proportion of correctly classified faces across all age groups. Higher accuracy indicates a stronger ability to distinguish juveniles from other age groups.
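For reference, the per-class metrics defined in Equations (1)–(4) can be computed directly from a confusion matrix such as those shown later in Figure 7; the short sketch below is illustrative and not tied to the exact evaluation code used in this study.

# Compute per-class Precision, Recall, F1 and overall Accuracy from a confusion matrix.
# conf[i][j] counts samples whose true class is i and whose predicted class is j.
import numpy as np

def classification_metrics(conf: np.ndarray):
    conf = conf.astype(float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp          # predicted as class i but actually another class
    fn = conf.sum(axis=1) - tp          # belong to class i but predicted as another class
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / conf.sum()
    return precision, recall, f1, accuracy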

4. Results and Discussions

4.1. Experimental Environment Setup

The experiment was conducted on a Matpool server using the PyTorch v2.5.1 framework. The operating system was Ubuntu 20.04, and the hardware consisted of an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.30 GHz and an NVIDIA GeForce RTX 4090 GPU. The code was compiled using Python 3.10 and CUDA version 11.8. The training parameters for face object detection and age classification are shown in Table 2 and Table 3.
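A hedged sketch of the detector training call under the Ultralytics framework is shown below; the hyperparameter values are placeholders standing in for those reported in Table 2, and the checkpoint and data file names are assumptions.

# Illustrative training call for the YOLO face detector (hyperparameters are placeholders;
# the actual values are those listed in Table 2).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")          # assumed pretrained checkpoint name
model.train(
    data="face_dataset.yaml",       # assumed dataset description file
    epochs=100,                     # placeholder
    imgsz=640,                      # placeholder
    batch=16,                       # placeholder
    device=0,                       # single RTX 4090 as described in Section 4.1
)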

4.2. Comparison and Analysis of Experimental Results

4.2.1. Comparative Performance of Object Detection Methods

The results for each model are presented in Table 4. YOLOv11 outperforms YOLOv8 by 2.0% in Precision, 1.9% in Recall and 3.5% in mAP@0.5. YOLOv12 also delivers solid performance but falls between the other two models, trailing YOLOv11 by 0.6% in Precision, 0.8% in Recall and 0.3% in mAP@0.5. Although the performance gaps are marginal, YOLOv11 shows consistent advantages across all key evaluation metrics and can better meet the requirements of detection tasks in challenging scenarios (e.g., crowded public venues or low-illumination environments). These results align with the core objective of this study: to achieve highly reliable facial detection in real-world scenarios, particularly those involving minors, where minimising missed detections is paramount. The high Recall rates and robust mAP@0.5 values suggest that all three models detect faces well across varied conditions, including differences in scale, occlusion and lighting. This capability is essential in crowded venues or low-light settings, where the model must maintain strong multi-scale feature representation, likely supported by mechanisms such as feature pyramid networks or attention modules integrated into modern YOLO architectures.
Figure 6 shows the training loss and evaluation metric curves of the YOLOv11 face detection model. The model demonstrates strong overall performance: its high, balanced Precision and Recall indicate that it reduces false positives while maintaining a low rate of missed detections, a balance that is also reflected in the F1 score. Furthermore, the model achieves a notable mAP of 87.5% at an IoU threshold of 0.5, signifying stable performance across confidence thresholds and high practical applicability.
In conclusion, YOLOv11 achieves an effective trade-off between Precision and Recall. It constitutes an accurate and reliable face detection solution, thereby establishing a solid foundation for subsequent tasks such as age classification.

4.2.2. Performance Comparison in the Age Classification

This study selected four models for comparative experiments on facial age estimation. The performance metrics of each model are summarised in Table 5.
The results show that ResNet50 achieved the best overall performance. In particular, ResNet50 demonstrated a clear advantage in handling the two most challenging age groups (8–14 and 14–18 years), achieving F1 scores of 43.54% and 45.78%, respectively, which are substantially higher than those of the other models. All four models also achieved good recognition performance in the 0–8 age group. However, ViT performed poorly in the intermediate age groups (8–14 and 14–18 years), achieving F1 scores of just 21.54% and 23.11%, respectively. This suggests that the Transformer architecture may be limited in its ability to perform fine-grained age classification tasks. It is also worth noting that all models showed relatively low F1 scores in the 8–14 age group: 43.54% for ResNet50, 41.64% for VGG16 and just 21.54% for ViT. This suggests that learning features for intermediate age groups is a core challenge for age recognition algorithms and that these algorithms may require more refined feature extraction or data augmentation strategies to improve performance. Overall, CNN-based architectures such as ResNet50 and VGG16 demonstrated more reliable performance in age classification tasks. ViT performed adequately in the older age groups but still has significant room for improvement in recognising the critical intermediate age ranges. The following section provides an analysis and interpretation of several phenomena observed in the experimental results.
Figure 7 shows the confusion matrices for the four models. As can be seen, ResNet50 demonstrates the most balanced performance of all the models. It correctly classifies 6435 images in the 0–8 age group, misclassifying only 781 as being in the 8–14 age group. For the 8–14 age group, ResNet50 achieves 1172 correct classifications but exhibits significant misclassification: 1156 images are erroneously assigned to the 0–8 age group and 932 to the 14–18 age group. In the 14–18 age group, ResNet50 correctly identifies 1199 images while misclassifying 1990 images as being in the 8–14 age group and 35 images as being in the 18+ age group. In the 18+ age group, ResNet50 accurately classifies 533 images, misclassifying only 27 as 14–18. This suggests that ResNet50 can handle the 0–8 and 18+ age groups effectively, but struggles with the intermediate age groups (8–14 and 14–18), where confusion between adjacent classes is common due to similar facial features.
While VGG16 performs reasonably well, it shows inferior results compared to ResNet50, particularly when it comes to distinguishing between younger age groups. This aligns with the trend observed in ResNet50, where intermediate ages pose challenges. ViT exhibits the weakest performance of all the models. Based on its confusion matrix, ViT struggles to accurately classify images across all age groups, particularly the 8–14 and 14–18 categories, where Recall values are notably low. This results in a high number of false negatives and positives, suggesting that the transformer-based model may not capture age-related features as effectively as convolutional networks for this task.
BiFormer achieves an overall accuracy of 59.12%, placing it between VGG16 and ViT. Its confusion matrix reveals relatively strong performance in the 0–8 and 18+ age groups, but it struggles with the 8–14 and 14–18 age groups. This pattern is consistent with the performance of the other models and highlights the inherent difficulty in classifying peri-adolescent ages due to rapid physiological changes.
Overall, ResNet50 is the most reliable model for age classification. VGG16 and BiFormer provide moderate performance, with VGG16 showing consistent issues in the intermediate age groups. Although ViT leverages advanced architecture, it underperforms due to its inability to handle age-related nuances. These findings are consistent with the performance metrics in Table 5, further validating the robustness of ResNet50 for this age estimation task.
Poor Performance in the 8–14 and 14–18 Age Groups
Although this study is the first to introduce these two age groups, classification performance for them during training was suboptimal. This may be due to the relatively small number of samples in the 8–14 and 14–18 age groups and the narrow age spans of these groups. Furthermore, the facial changes observed in juveniles within these age ranges are likely influenced by various factors, such as physiological development, environmental conditions and individual variability. This issue can be analysed from three perspectives:
From a data perspective, individuals aged 8–18 are in a critical stage of growth and development, which makes collecting samples from this group significantly more challenging than from adults. Compared with adult data, the facial data of juveniles must strictly comply with ethical review and privacy protection requirements, which leads to insufficient sample sizes in publicly available datasets. Experimental data show that, in the training set used in this study, the sample proportions of the 8–14 and 14–18 age groups are only 22.86% and 22.61%, respectively. This results in data sparsity, which directly affects the model’s ability to learn relevant features. Furthermore, the facial features of individuals within this age range exhibit nonlinear evolutionary patterns. Specifically, individuals aged 8–14 are in the early stages of puberty, during which their facial bone structure, fat distribution and skin texture undergo rapid changes. Between the ages of 14 and 18, secondary sexual characteristics emerge and the differences in facial contours between males and females become more pronounced. This developmental uncertainty increases variability in facial features within the group while blurring the boundaries between groups, forming a sharp contrast with the relative stability of adult facial characteristics. At the same time, external interference factors are particularly significant: frequent hairstyle changes (e.g., bangs covering the forehead) and weaker control of facial expressions often introduce irrelevant features into the model. Figure 8 illustrates several misclassified facial images, which serve to confirm the explanations provided above.
To address the issue of insufficient distinction between groups in the original four-class model, this study optimised the classification strategy by refining the age classification task into a binary classification problem (under 14 years old vs. over 14 years old). A new experiment was conducted using ResNet50 as the baseline model while keeping other parameters unchanged. The results are shown in Table 6 and indicate that this adjustment significantly enhanced the model’s classification performance, with accuracy improving from 64.37% to 85.95%. It can be seen that the binary classification strategy effectively reduced the model’s classification complexity, as well as improving its performance in estimating age, particularly when dealing with age groups that are difficult to distinguish due to pronounced developmental differences.
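The binary regrouping described above amounts to collapsing the four original labels into two; a minimal sketch (assuming the four-class labels are stored as the group strings used in this paper) is given below.

# Collapse the four JPSD age groups into the binary task "under 14" vs. "14 and over".
BINARY_MAP = {
    "0-8": "under-14",
    "8-14": "under-14",
    "14-18": "14-and-over",
    "18+": "14-and-over",
}

def to_binary_label(four_class_label: str) -> str:
    return BINARY_MAP[four_class_label]

assert to_binary_label("8-14") == "under-14"
assert to_binary_label("14-18") == "14-and-over"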
In Traditional CNN Models, the ResNet50 Model Outperforms the VGG16 Model
The performance differences between ResNet50 and VGG16 in the age classification task primarily stem from their distinct architectural designs and differing capabilities in extracting age-related features. Experimental results demonstrate that ResNet50 achieves exceptional performance in the 0–8 and 18+ age groups, with F1 scores of 86.91% and 94.48%, respectively. In contrast, VGG16 exhibits Recall rates below 35% for the intermediate age ranges (8–14 and 14–18), reflecting the fundamental differences in their feature learning mechanisms.
ResNet50 benefits from residual connections, giving it a stronger capability for cross-layer feature fusion. Through skip connections, the network can retain both shallow local texture features, such as skin smoothness, and deep global structural information, such as facial contour variations. This property is particularly advantageous for recognising extreme age groups: the ‘baby fat’ feature of children and the wrinkle patterns of adults both display strong visual distinctiveness. Additionally, the residual structure alleviates the vanishing gradient problem, enabling the model to train deeper networks (50 layers) and capture more intricate age-related patterns.
In contrast, while the stacked small-convolution architecture of VGG16 (16 layers) is sensitive to local details, it lacks a cross-layer feature reuse mechanism. In the intermediate age group (8–18 years), facial aging features evolve progressively (e.g., acne emergence and jawline development), so models must establish long-range dependencies. However, the max pooling operations in VGG16 discard some of the spatial information, which makes it difficult to capture subtle yet critical transitional features.
Both models achieve over 90% Recall in the 18+ adult group due to the relative stability of adult facial features. In the 0–8 child group, distinctive biological traits (such as baby fat) enable ResNet50 to reach a high Recall rate of 89.17%. These findings suggest that the expression of facial features depends on different hierarchical levels across age groups. ResNet50, with its adaptive, multi-level network structure, is therefore better suited to age estimation tasks that require multi-scale and cross-layer feature extraction.
The ViT Model Performs Worse than Traditional CNNs
In facial age estimation tasks, the ViT model performs less well than traditional CNN models such as ResNet50 and VGG16. This is mainly because its architectural characteristics do not align with the requirements of the task. Conceptually, age-related changes are primarily manifested in local facial features, such as skin texture, wrinkle distribution and variation in the jawline. CNN’s hierarchical convolutional structures are naturally suited to capturing such localised details. In the shallow layers of ResNet50 and VGG16, convolutional kernels can effectively extract basic texture features. These are progressively combined into higher-level semantic representations as the network deepens. This enables precise differentiation of age across groups. For instance, in the 0–8 age group, CNNs can capture the rounded contours of baby fat. In the 18+ age group, they can detect wrinkles and skin laxity as indicators of aging.
In contrast, although the pure attention mechanism of ViT is adept at modeling long-range dependencies, it clearly struggles with tasks requiring fine-grained local features. This is because ViT divides images into fixed-size patches and primarily computes inter-patch relationships through global self-attention, which can overlook micro-level facial details that are critical for accurate age estimation. Particularly in the intermediate age groups (8–14 and 14–18 years), ViT performs noticeably worse than CNNs since distinguishing between these groups relies heavily on subtle facial transitions, such as jawline development, which global attention mechanisms struggle to capture. Furthermore, ViT requires significantly larger datasets than CNNs and, in age estimation tasks that typically involve only tens of thousands of images, it cannot perform to its full potential. Experimental data confirms this. ViT’s F1 score in the 0–8 age group is 4.36% lower than ResNet50’s, and the difference increases to 22% in the 8–14 age group. These results suggest that traditional CNNs have an unbeatable advantage when it comes to tasks that require fine-grained local feature perception.
BiFormer’s Performance Is Intermediate Between ResNet50 and VGG16
The performance of BiFormer lies between that of ResNet50 and VGG16, primarily due to the interaction between its attention mechanism and the data distribution. For the data-rich 0–8 age group, bidirectional attention effectively captures global features, achieving a Recall rate of 93.63%, notably higher than the rates achieved by ResNet50 (89.17%) and VGG16 (88.02%) for this group. However, the model suffers from severe underfitting in the age groups with fewer samples (8–18 years), owing to its high feature flexibility, with Recall rates dropping to around 23–25%. This demonstrates that BiFormer’s attention mechanism is sensitive to data volume. In contrast, ResNet50 maintains balanced performance across age groups thanks to its residual connections, which stabilise gradient propagation, and its inherent local inductive bias. Specifically, in the 8–18 age range, ResNet50’s F1 score surpasses that of BiFormer by around 12%, highlighting the robustness of convolutional architectures in scenarios with limited data. Compared with VGG16, BiFormer’s advantage lies in its use of dynamic attention instead of fixed receptive fields. These observations confirm the general rule in visual tasks that global modelling benefits from large datasets, while small datasets require local priors [35]. Therefore, when the data distribution is uneven, hybrid architectures may be more applicable than pure attention-based ones.
Notably, in the 0–8 age group, BiFormer achieves an F1 score of 87.82%, which is higher than ResNet50’s score of 86.91%. This is primarily due to BiFormer’s bidirectional and dynamic sparse attention mechanism, which can capture local details and global contextual information simultaneously, enabling it to adapt more effectively to the varied facial characteristics of children, such as exaggerated expressions and diverse postures. Additionally, BiFormer’s token propagation mechanism enhances the fusion of shallow and deep features, which is essential for identifying the subtle facial features of children. By contrast, ResNet50’s residual connections may lose some fine-grained information during multi-layer feature propagation.

4.2.3. Model Visualization

To improve the model’s reliability, this study uses the Grad-CAM technique to generate attention heatmaps for the final convolutional layers of four models. This method produces visualisations by computing gradient information, providing an intuitive representation of the key regions to which the neural network pays attention during the classification process. The highlighted red regions in the heatmaps represent areas that significantly impact the prediction results, while the blue regions indicate areas to which the model pays less attention. The heatmap results are shown in Figure 9.
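A compact Grad-CAM sketch for the ResNet50 classifier is given below: it registers hooks on the final convolutional stage, weights the activation maps by the pooled gradients of the target class score and normalises the result into a heatmap. The layer choice, class head and input size are assumptions and independent of the exact visualisation code used in this study.

# Minimal Grad-CAM sketch for a ResNet50 age classifier (illustrative only).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)   # assumed 4-way age head
model.eval()

activations, gradients = {}, {}

def forward_hook(_, __, output):
    activations["value"] = output

def backward_hook(_, __, grad_output):
    gradients["value"] = grad_output[0]

# Hook the final convolutional stage (layer4), whose feature maps Grad-CAM reweights.
model.layer4.register_forward_hook(forward_hook)
model.layer4.register_full_backward_hook(backward_hook)

def grad_cam(image_tensor: torch.Tensor, target_class: int) -> torch.Tensor:
    logits = model(image_tensor.unsqueeze(0))
    model.zero_grad()
    logits[0, target_class].backward()
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1))     # weighted sum of maps
    cam = F.interpolate(cam.unsqueeze(0), size=image_tensor.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalise to [0, 1]

heatmap = grad_cam(torch.rand(3, 224, 224), target_class=0)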
Different network architectures exhibit distinct feature attention patterns when processing the same input image. In the age estimation task, ResNet50’s deep residual structure enables it to capture both local details and global semantic features. Heatmaps are primarily concentrated around key facial regions such as the eyes and mouth corners, indicating strong local feature extraction capability. Due to limitations in network depth and receptive field, VGG16 shows a relatively dispersed heatmap distribution. It tends to focus on the overall facial contour, making it less effective at precisely identifying subtle age-related features. The ViT model demonstrates a unique global attention mechanism. Heatmaps show a non-uniform distribution. However, it is slightly less effective at localising fine-grained features. As an emerging hybrid architecture, BiFormer exhibits a combination of local focus and global correlation in its heatmaps. This shows a more comprehensive and balanced understanding of features in the age estimation task.

4.2.4. Optimization of Classification Model

After an in-depth analysis of the ResNet50 baseline, this study incorporated the Convolutional Block Attention Module (CBAM) and the Coordinate Attention (CA) mechanism into its architecture. As can be seen in Table 7, the results highlight a stark contrast between the two enhanced models. The CA-ResNet50 model achieved superior performance across all metrics, with a notable F1 score of 90.74% and an accuracy of 65.66%, representing significant improvements over the baseline model. In contrast, the accuracy of the CBAM-ResNet50 model dropped to 82.23%, underperforming the baseline ResNet50. This suggests that sequentially applying complex attention mechanisms (channel and spatial in CBAM) may increase model capacity excessively, potentially leading to overfitting of the training data and thus impairing generalisation. In contrast, the CA module, which efficiently encodes channel relationships and positional information, appears to offer a more targeted and effective inductive bias for age classification, enhancing feature discriminability without introducing redundancy.
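For reference, a hedged sketch of a Coordinate Attention block of the kind inserted into ResNet50 is given below; the reduction ratio and the exact insertion points within the residual stages are assumptions rather than the configuration behind the CA-ResNet50 results in Table 7.

# Illustrative Coordinate Attention block, which factorises attention into two
# direction-aware 1D poolings so that positional information is preserved.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along the width axis
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along the height axis
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                                   # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)               # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w

# Example placement (one possibility): append CA after the last residual stage.
# backbone.layer4 = nn.Sequential(backbone.layer4, CoordinateAttention(2048))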

5. Conclusions and Future Work

5.1. Research Conclusions

This study explores the use of an automated facial age recognition solution to improve the protection and supervision of children in various situations. From a technical perspective, this research makes three major contributions.
Firstly, the study proposes a cascaded framework based on ‘detection–cropping–classification’, which effectively addresses modelling challenges caused by the dynamic variation in facial features across different age groups.
Secondly, a dedicated JPSD dataset was constructed by integrating multiple public datasets, specifically tailored for juvenile protection and supervisory scenarios, although its class distribution still needs to be balanced further.
Thirdly, preliminary comparative experiments were conducted on several YOLO-series object detection models and classification architectures, including VGG16, ResNet50, ViT and BiFormer. The results indicate that the combination of YOLOv11 and ResNet50 serves as the optimal baseline, achieving the best performance in both four-class and binary tasks.
These results collectively demonstrate the practical value and application potential of the proposed method in assisting law enforcement agencies with juvenile protection efforts.
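To make the cascaded framework concrete, the following is a minimal end-to-end sketch assuming an Ultralytics YOLOv11 face detector and a fine-tuned four-class ResNet50 classifier; the weight files `yolov11_face.pt` and `resnet50_age.pt` are hypothetical placeholders rather than the released artefacts.

```python
# 'Detection–cropping–classification' cascade: detect faces, crop them,
# then classify each crop into one of the four age groups.
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

AGE_GROUPS = ["0-8", "8-14", "14-18", "18+"]

detector = YOLO("yolov11_face.pt")                         # hypothetical fine-tuned face-detection weights
classifier = models.resnet50(weights=None)
classifier.fc = torch.nn.Linear(classifier.fc.in_features, len(AGE_GROUPS))
classifier.load_state_dict(torch.load("resnet50_age.pt", map_location="cpu"))
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def estimate_age_groups(image_path):
    image = Image.open(image_path).convert("RGB")
    results = detector(image)[0]                           # stage 1: detect faces
    predictions = []
    for box in results.boxes.xyxy.tolist():                # stage 2: crop each detected face
        x1, y1, x2, y2 = map(int, box)
        face = preprocess(image.crop((x1, y1, x2, y2))).unsqueeze(0)
        with torch.no_grad():                              # stage 3: classify the age group
            group = classifier(face).argmax(dim=1).item()
        predictions.append(((x1, y1, x2, y2), AGE_GROUPS[group]))
    return predictions
```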

5.2. Improvements and Future Work

Although this study produced meaningful results, there are still several limitations in juvenile facial detection and age estimation that require improvement. Most importantly, the current system’s low Recall rate and insufficient accuracy in critical age ranges mean that it is not yet suitable for fully automated deployment. In the short term, any application of this technology must operate as a decision support tool under strict human-in-the-loop supervision.
The primary challenge stems from the imbalanced data distribution. Insufficient sample sizes in the 8–14 and 14–18 age groups significantly reduce classification accuracy within these critical ranges, reflecting the general difficulty of sourcing juvenile facial datasets. Future research could involve establishing partnerships with additional youth education and training institutions to acquire more facial image samples. Technically, resampling and data augmentation techniques (e.g., SMOTE [36]) could be adopted to mitigate the imbalance, as illustrated in the sketch below.
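As one illustration, the sketch below shows class-balanced sampling with PyTorch's WeightedRandomSampler, which oversamples the under-represented 8–14 and 14–18 groups during training; the dataset path is hypothetical, and SMOTE itself would instead be applied to extracted feature vectors rather than raw images.

```python
# Class-balanced sampling for an ImageFolder-style JPSD training split.
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("JPSD/train", transform=transforms.ToTensor())   # hypothetical path
counts = Counter(dataset.targets)                           # number of images per age group
class_weights = {cls: 1.0 / n for cls, n in counts.items()}
sample_weights = [class_weights[t] for t in dataset.targets]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # minority groups are drawn more often per epoch
```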
The second area for future research is the model's generalisation. We found that detection and recognition performance drops markedly in complex crowd scenes, for example under low light or partial facial occlusion. Integrating multimodal data sources, such as infrared facial images and depth-sensing data, could improve the model's adaptability across scenarios by providing richer feature representations. At the algorithmic level, designing attention mechanisms tailored to juvenile faces could improve the capture of age-sensitive features and strengthen generalisation. In addition, more advanced models should be included in future comparisons: the latest face detection and age estimation models should be incorporated into the analysis, and the best-performing model further optimised.
Finally, privacy and ethics must always be key considerations in future research. Due to the sensitive nature of facial data relating to adolescents, research must strictly comply with all relevant regulations protecting minors. Robust ethical review procedures must be implemented to prevent potential harm to young people resulting from the misuse of technology. Only by addressing these issues can facial age recognition technology advance responsibly to support juvenile protection.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/asi8060185/s1, Detection model code; classification model code; heat-map code.

Author Contributions

Methodology, Y.W. and Z.Y.; formal analysis, Q.G.; investigation, Y.L.; data curation, Y.L.; writing, Y.W. and Q.G.; visualization, Q.G.; funding acquisition, Y.W.; supervision, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Central University Basic Scientific Research Business Fee Special Fund Project (LGZD202504) and the "Qing Lan Project" of Jiangsu Higher Education Institutions.

Institutional Review Board Statement

This study does not involve interventional experiments on human subjects and was therefore granted an ethical review exemption by the Research Ethics Committee of Nanjing Agricultural University (Approval Code: NJAU-IRB-Exemption-2025011; Approval Date: 15 May 2025) in compliance with the Declaration of Helsinki (2013 revision).

Informed Consent Statement

Not applicable.

Data Availability Statement

The object detection dataset is available from the corresponding author on reasonable request. The age estimation dataset used in this study is a newly integrated dataset derived from five publicly available datasets. The original datasets can be accessed via their respective source publications. The integration and redistribution of the dataset comply with the license agreements of all original datasets; for details on licensing, refer to the official documentation of each source dataset. The newly integrated dataset is available on request from the corresponding author. Some of the models and code used in the paper are provided in the Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cyberspace Administration of China. The Number of Netizens in China is Nearly 1.1 Billion, and the Internet Penetration Rate Reaches 78.0%. Available online: https://www.cac.gov.cn/2024-08/30/c_1726702676681749.htm (accessed on 18 November 2025).
2. GDPR.eu. Art. 8 GDPR: Conditions Applicable to Child's Consent in Relation to Information Society Services. Available online: https://gdpr.eu/article-8-childs-consent/ (accessed on 18 November 2025).
3. U.S. Federal Trade Commission. COPPA Cases Dataset [Data Set]. 2024. Available online: https://www.ftc.gov/enforcement/cases-proceedings/terms/1421 (accessed on 19 November 2025).
4. Ministry of Education of the People's Republic of China. Law on the Protection of Minors. 2020. Available online: http://www.moe.gov.cn/jyb_sjzl/sjzl_zcfg/zcfg_qtxgfl/202110/t20211025_574798.html (accessed on 19 November 2025).
5. The State Council of the People's Republic of China. Regulations on the Protection of Minors in Cyberspace. 2023. Available online: https://www.gov.cn/zhengce/zhengceku/202310/content_6911289.htm (accessed on 19 November 2025).
6. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition. In Computer Vision—ECCV 2016; Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9911.
7. Levi, G.; Hassner, T. Age and Gender Classification Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 34–42.
8. Luan, R.S.; Liu, G.F.; Wang, C.Q. Application of Dynamic Facial Recognition in Investigation Work. J. Crim. Investig. Police Univ. China 2019, 151, 122–128.
9. Jiang, C.; Liu, H.; Yu, X.; Wang, Q.; Cheng, Y.; Xu, J.; Liu, Z.; Guo, Q.; Chu, W.; Yang, M.; et al. Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning. In Proceedings of the ACM International Conference on Multimedia (ACM MM), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1–11.
10. Geng, X.; Zhou, Z.-H.; Smith-Miles, K. Correction to "Automatic age estimation based on facial aging patterns". IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 368.
11. Tan, Z.; Wan, J.; Lei, Z.; Zhi, R.; Guo, G.; Li, S.Z. Efficient Group-n Encoding and Decoding for Facial Age Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2610–2623.
12. Kuprashevich, M.; Tolstykh, I. MiVOLO: Multi-input Transformer for Age and Gender Estimation. arXiv 2023, arXiv:2307.04616v2.
13. Wang, H.; Sanchez, V.; Li, C.-T. Cross-Age Contrastive Learning for Age-Invariant Face Recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024.
14. Yang, W.-Y.; Huang, A.-Q.; Tan, Z.-L.; Liu, Z.L.; Zhong, Y. Facial age estimation method combining similar cross-entropy and knowledge distillation. Comput. Technol. Dev. 2025, 35, 113–120.
15. Qin, J.; Jiao, Y.; Li, Z.-P.; Mao, Z.Y. Facial image age estimation based on attention ConvLSTM model. Comput. Appl. Softw. 2025, 42, 383–390.
16. Narayan, K.; Vibashan, V.S.; Chellappa, R.; Patel, V.M. FaceXFormer: A Unified Transformer for Facial Analysis. arXiv 2025, arXiv:2403.12960.
17. Chen, X.; Fu, X. Application of face recognition and age estimation in online game anti-addiction systems. Technol. Mark. 2021, 28, 41–42. Available online: http://dianda.cqvip.com/Qikan/Article/Detail?id=7103757369 (accessed on 18 November 2025).
18. Cao, Y.; Berend, D.; Tolmach, P.; Amit, G.; Levy, M.; Liu, Y. Fair and Accurate Age Prediction Using Distribution Aware Data Curation and Augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3404–3413.
19. Buolamwini, J.; Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91. Available online: https://api.semanticscholar.org/CorpusID:3298854 (accessed on 23 August 2025).
20. Muhammed, A.; Marcos, J.; Gonçalves, N. Federated Learning for Secure and Privacy-Preserving Facial Recognition: Advances, Challenges, and Research Directions. In Pattern Recognition and Image Analysis; Gonçalves, N., Oliveira, H.P., Sánchez, J.A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15937, pp. 227–241.
21. Zhang, M.; Wei, E.; Berry, R.; Huang, J. Age-Dependent Differential Privacy. IEEE Trans. Inf. Theory 2024, 70, 1300–1319.
22. Narayan, A.; Kulkarni, P.; Gupta, S.; Singh, R. FaceXFormer: End-to-End Transformer for Multi-Task Facial Analysis. Comput. Vis. Image Underst. 2024, 241, 103987.
23. Zhang, Z.; Song, Y.; Qi, H. Age Progression/Regression by Conditional Adversarial Autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5810–5818.
24. Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; Zafeiriou, S. AgeDB: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1997–2005.
25. Agustsson, E.; Timofte, R.; Escalera, S.; Baro, X.; Guyon, I.; Rothe, R. Apparent and real age estimation in still images with deep residual regressors on the APPA-REAL database. In Proceedings of the 12th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Washington, DC, USA, 30 May–3 June 2017.
26. Zhang, Y.; Liu, L.; Li, C.; Loy, C.C. Quantifying facial age by posterior of age comparisons. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017.
27. Yin, Y.; Chang, D.; Song, G.; Sang, S.; Zhi, T.; Liu, J.; Luo, L.; Soleymani, M. FG-Net: Facial action unit detection with generalizable pyramidal features. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6099–6108.
28. Swathi, Y.; Challa, M. YOLOv8: Advancements and innovations in object detection. In Smart Trends in Computing and Communications: Proceedings of SmartCom 2024; Lecture Notes in Networks and Systems; Senjyu, T., So-In, C., Joshi, A., Eds.; Springer: Madison, WI, USA, 2024; Volume 946, pp. 1–13.
29. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
30. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
31. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
34. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W.H. BiFormer: Vision Transformer with Bi-Level Routing Attention. arXiv 2023, arXiv:2303.08810.
35. Li, K.; Ouyang, W.; Luo, P. Locality guidance for improving vision transformers on tiny data. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–29 October 2022.
36. Li, Y.; Yang, Y.; Song, P.; Wang, S. An improved SMOTE algorithm for enhanced imbalanced data classification by expanding sample generation space. Sci. Rep. 2025, 15, 23521.
Figure 1. Overview of the proposed method.
Figure 2. Dataset statistics.
Figure 3. Sample examples across different age groups. Note: Reprinted/adapted with permission from Refs. [23,24,25,26,27]. Copyright 2017, IEEE and British Machine Vision Association (BMVA); Copyright 2024, IEEE. The same copyright notice applies to the preceding and following figures.
Figure 4. Network architecture diagram of YOLOv11.
Figure 5. Network architecture diagram of ResNet50.
Figure 6. Training loss and evaluation metric curves of the YOLOv11 model.
Figure 7. Confusion matrices of the four models.
Figure 8. Examples of misclassified facial images.
Figure 9. Heatmaps from the four models.
Table 1. Number of images per dataset.
Dataset | Total Number of Images | Number of Juvenile Images
UTKface | 20,000+ | 4352
AgeDB | 16,488 | 303
APPA-REAL | 7591 | 1417
MegaAge | 41,941 | 7009
FG-NET | 1002 | 619
Table 2. Parameters of the object detection experiment.
Parameter | Value
Epochs | 100
Batch size | 32
Initial learning rate | 0.01
Input image size (imgsz) | 640 × 640
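For reference, the Table 2 settings map directly onto an Ultralytics training call as in the hedged sketch below; the dataset YAML `jpsd_faces.yaml` and the choice of the nano-sized `yolo11n.pt` checkpoint are assumptions, not details reported in the paper.

```python
# Hedged sketch of YOLOv11 face-detection training with the Table 2 hyperparameters.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")          # pretrained YOLOv11 weights (size variant assumed)
model.train(
    data="jpsd_faces.yaml",         # face-detection dataset definition (hypothetical)
    epochs=100,                     # Epochs = 100
    batch=32,                       # Batch size = 32
    lr0=0.01,                       # initial learning rate = 0.01
    imgsz=640,                      # 640 × 640 input size
)
```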
Table 3. Parameters of the age classification experiment.
Parameter | Value
Epochs | 100
Batch size | 64
Initial learning rate | 0.001
Input image size (imgsz) | 224 × 224
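Similarly, a hedged sketch of the classification training setup implied by Table 3 (100 epochs, batch 64, initial learning rate 0.001, 224 × 224 inputs) is given below; the optimiser choice and data paths are assumptions rather than details taken from the paper.

```python
# Fine-tuning ResNet50 on the four JPSD age groups with the Table 3 hyperparameters.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("JPSD/train", transform=transform)   # hypothetical path
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 4)                         # four age groups
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)            # optimiser choice assumed
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```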
Table 4. Comparison of results among the three models.
Model | Precision (%) | Recall (%) | F1 Score (%) | mAP@0.5 (%)
YOLOv8 | 82.7 | 79.1 | 80.9 | 84.0
YOLOv11 | 84.7 | 81.0 | 82.8 | 87.5
YOLOv12 | 84.1 | 80.2 | 82.1 | 87.2
Table 5. Performance comparison of the four models.
Model | Group | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%)
VGG16 | 0–8 | 83.04 | 88.02 | 85.45 | 62.61
 | 8–14 | 56.01 | 33.14 | 41.64 |
 | 14–18 | 57.28 | 34.69 | 43.21 |
 | 18+ | 90.34 | 94.59 | 92.42 |
ResNet50 | 0–8 | 84.77 | 89.17 | 86.91 | 64.37
 | 8–14 | 55.19 | 35.95 | 43.54 |
 | 14–18 | 59.51 | 37.19 | 45.78 |
 | 18+ | 93.80 | 95.17 | 94.48 |
ViT | 0–8 | 81.81 | 83.31 | 82.55 | 51.83
 | 8–14 | 34.31 | 15.70 | 21.54 |
 | 14–18 | 39.62 | 16.31 | 23.11 |
 | 18+ | 84.42 | 92.00 | 88.05 |
BiFormer | 0–8 | 82.69 | 93.63 | 87.82 | 59.12
 | 8–14 | 48.07 | 23.08 | 31.18 |
 | 14–18 | 48.47 | 24.96 | 32.95 |
 | 18+ | 86.55 | 94.82 | 90.50 |
Table 6. Performance comparison of the models.
Model | Group | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%)
ResNet50 (four-class) | 0–8 | 84.77 | 89.17 | 86.91 | 64.37
 | 8–14 | 55.19 | 35.95 | 43.54 |
 | 14–18 | 59.51 | 37.19 | 45.78 |
 | 18+ | 93.80 | 95.17 | 94.48 |
ResNet50 (binary) | 14− | 85.48 | 84.23 | 84.85 | 85.95
 | 14+ | 90.94 | 91.71 | 91.32 |
Table 7. Comparison of various metrics for the optimized models.
Model | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%)
ResNet50 | 89.72 | 88.11 | 88.90 | 64.37
CBAM-ResNet50 | 84.15 | 86.40 | 85.26 | 61.71
CA-ResNet50 | 92.64 | 88.92 | 90.74 | 65.66
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
