Assessment and Estimation of Face Detection Performance Based on Deep Learning for Forensic Applications †

Face recognition is a valuable forensic tool for criminal investigators since it helps in identifying individuals in scenarios of criminal activity, such as fugitive tracking or child sexual abuse. It is, however, a very challenging task, as it must handle low-quality images of real-world settings and fulfill real-time requirements. Deep learning approaches for face detection have proven very successful, but they require large computation power and processing time. In this work, we evaluate the speed–accuracy tradeoff of three popular deep-learning-based face detectors on the WIDER Face and UFDD data sets on several CPUs and GPUs. We also develop a regression model capable of estimating the performance, both in terms of processing time and accuracy. We expect this to become a very useful tool for end users in forensic laboratories to estimate the performance of different face detection options. Experimental results show that the best speed–accuracy tradeoff is achieved with images resized to 50% of the original size on GPUs and to 25% of the original size on CPUs. Moreover, performance can be estimated using multiple linear regression models with a Mean Absolute Error (MAE) of 0.113, which is very promising for the forensic field.


Introduction
Forensic laboratories very often examine digital evidence during a criminal investigation. In particular, the criminal investigation of Child Sexual Exploitation Material (CSEM) shows a growing interest internationally [1]. Advances in technology have increased the use of mobile devices, social media and P2P networks, making it easier for offenders to create and distribute CSEM, something that has become highly prevalent worldwide.
Given this scenario, a manual analysis to identify new CSEM in any seized electronic device (hard drive, desktop, smart phone, and memory stick, among others) becomes absolutely infeasible within the proposed time constraints of most investigations. Not only is it a very time-consuming and expensive task, but it also exposes image analysts to sensitive and disturbing data on a daily basis, which can affect their emotional state and consequently their performance. Hence, the development of fast, automatic and efficient tools for the automated discovery and analysis of images and videos to be implemented in criminal laboratories becomes crucial for the forensic field [2,3].
A previous work [34] proposed an image resizing strategy to speed up three deep-learning-based face detectors: the Multi-Task Cascaded Convolutional Network (MTCNN) [22], the Context-Assisted Single Shot Face Detector (often referred to as PyramidBox) [25] and the Dual Shot Face Detector (DSFD) [26]. The validation of this strategy was, however, limited to one GTX 1060 GPU, and the results showed that the image resizing strategy can speed up face detection with a small reduction in accuracy. A posterior work [35] also showed that it is possible to find a good balance between speed and accuracy with this resizing strategy.
There is a large variety of available Intel CPUs (https://www.intel.co.uk/content/www/uk/en/products/processors/core.html) and Nvidia GPUs in the market, like the Tesla (https://www.nvidia.com/en-us/data-center/v100), TITAN (https://www.nvidia.com/en-us/deep-learning-ai/products/titan-rtx), GTX (https://www.nvidia.com/en-us/geforce/10-series) and RTX (https://www.nvidia.com/en-us/geforce/20-series) series, with different specifications that might also speed up face detection. End users, like law enforcement analysts, often face the problem of choosing the most suitable hardware for the analysis of the forensic material at hand.
In this paper, we present a comprehensive comparison of the tradeoff between speed and accuracy of face detection methods through the image resizing strategy presented in [34] for a wide variety of hardware architectures. Specifically, we evaluate five Intel CPUs (i5-3450, i7-4790K, i7-8650U, i9-8950HK, and Xeon E5-2630) and seven Nvidia GPUs (Tesla K40, TITAN Xp, GTX 1050, GTX 1060, GTX 1070, RTX 2060, and RTX 2070). We evaluated three representative face detection methods, namely MTCNN, PyramidBox and DSFD, using a set of images chosen from the WIDER Face data set [36] and the Unconstrained Face Detection Data set (UFDD) [37]. The selected images contain fewer than five people per scene in order to replicate the number of individuals observed in CSEM images. Additionally, we train a model that is able to predict, for unseen images, the performance metrics (in terms of accuracy and speed) that the end user could expect given the face detection method, the specific hardware, the image size and the percentage of image resizing. This research work is part of the European project Forensic Against Sexual Exploitation of Children (4NSEEK) and the research lines defined by the Framework agreement between INCIBE (Spanish National Cybersecurity Institute) and the University of León. Conclusions drawn from this study can be used as a face detection benchmark for users of the 4NSEEK tools in order to guide them in the selection of hardware for the analysis and categorization of CSEM.
The rest of the paper is organized as follows. Work closely related to this paper is presented in Section 2. The evaluation methodology proposed in this work is described in Section 3. The experimental evaluation is described in Section 4 and results are shown in Section 5. Finally, we draw conclusions in Section 6.

Related Work
Both processing time and accuracy are important performance issues for face detectors, and research efforts have focused on improving both simultaneously. The mean Average Precision (mAP) is an appropriate and widely used metric to assess the accuracy of object detectors [38]: essentially, it computes the area under the precision-recall curve obtained by applying several decision thresholds.
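As an illustration, the interpolated area-under-the-curve computation can be sketched as follows; `eleven_point_ap` is a hypothetical helper (not the evaluation code used in this work) implementing the PASCAL-VOC-style eleven-point interpolation, the variant described later in the experimental setup:

```python
def eleven_point_ap(recalls, precisions):
    """Eleven-point interpolated Average Precision.

    `recalls` and `precisions` are parallel lists traced along the
    precision-recall curve. For each recall level r in {0.0, 0.1, ..., 1.0}
    we take the maximum precision achieved at any recall >= r
    (the interpolation that smooths the "wiggles"), then average.
    """
    ap = 0.0
    for i in range(11):
        r = i / 10.0
        # interpolated precision: best precision at recall >= r
        p_at_r = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(p_at_r) if p_at_r else 0.0
    return ap / 11.0
```

For a perfect detector (precision 1.0 at every recall level) the function returns 1.0; any precision drop at high recall lowers the average.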
Recent advances in deep learning methods have contributed to significant performance improvements in a wide range of computer vision applications. They have been particularly successful for face detection problems where modern deep CNN models show a significant accuracy improvement in comparison to traditional approaches based on hand-crafted features [22][23][24][25][26][27][28][29][39][40][41][42][43][44][45]. Consequently, these deep learning methods have become the state-of-the-art for face detection.
The MTCNN [22] method uses custom CNNs to simultaneously solve the problems of face detection and alignment in real time. It consists of three sub-networks that process faces from coarse to fine. Compared with traditional methods, it has better performance and faster detection speed, but it may perform poorly on low-quality images. Robust features obtained with standard CNNs such as VGG16 [46] are, therefore, employed to improve face detection in these conditions [23][24][25][26]. In particular, the Single Shot Scale-Invariant Face Detector (S3FD) [24] method increases the recall of small faces by predicting candidate locations of faces on multi-scale feature maps extracted with VGG16.
The Feature Agglomeration Networks for Single Stage Face Detection (FANet) [23] and PyramidBox [25] methods integrate multi-scale feature maps with multi-level semantic information to improve the detection of small faces. Similarly, DSFD [26] aggregates multi-scale and semantic information with enhanced features corresponding to context information to increase face detection accuracy. More recently, AInnoFace [27] uses the RetinaNet detector [47] together with several optimization strategies to improve the detection of tiny faces, outperforming most of the state-of-the-art methods on the WIDER Face data set [36]. Table 1 reports the mAP and speed values reported by the reviewed detectors on the WIDER Face data set, which contains images labeled into three detection difficulty categories ('Easy', 'Medium' and 'Hard') based on the detection rate of EdgeBox [48].
Table 1. Face detection performance on the WIDER Face data set [36].

New face detectors are commonly evaluated in the literature in terms of their accuracy, which is usually quantified by the mAP metric; their speed is somehow overlooked and rarely reported. Zhang et al. [49] addressed this issue and presented a CNN-based face detector with a good tradeoff between accuracy and speed, considering both CPU and GPU and emphasizing the worth of building effective models that are not computationally prohibitive. The detection speed is a relevant factor for end users, taking into account (i) the complexity of some of these models, (ii) the effect of the face detection step on the processing time of several applications, such as forensic ones, where large amounts of data must be processed, and (iii) the wide offer in the market of CPUs and GPUs that may help to speed up deep-learning-based detectors.
A comparison of the training time required by several deep learning frameworks for the object classification task on various CPUs and GPUs is presented in [50,51], but these works lack an analysis of the speed at the testing/deployment phase. The performance of common image processing algorithms, such as image segmentation, rotation and deblurring, was studied in [52]. That work, however, did not consider more complex tasks and was limited to a small number of CPUs and GPUs. To the best of our knowledge, there is no study that can be used as a benchmark for face detection performance across several hardware configurations. Thus, we aim to provide one with this work.

Methodology
We aim to provide a guide for end users to choose the most appropriate hardware for face-related applications, such as face recognition or child detection in CSEM. We address this objective in two ways. First, we present a comparison of the tradeoff between speed and accuracy of face detection through the image resizing strategy described in [34] for several CPUs and GPUs, using images with a small number of subjects in order to simulate CSEM. Second, using the collected information regarding the face detection performance, we train a model to predict the behavior of a face detector on an image, in terms of speed and accuracy, based on the specific hardware, image size and percentage of image resizing (Figure 1).

Image Data Sets
We evaluate and gather the performance of face detectors, namely computation time and accuracy metrics (mAP and F1 score), by analyzing images from two data sets: WIDER Face [36] and UFDD [37]. Images in the WIDER Face data set contain a large number of real-world scenes from 60 events and are labeled with three levels of face detection difficulty: easy, medium and hard (see Table 2). Images in the UFDD data set are labeled into seven categories: rain, snow, haze, blur, high/low illumination, lens distortion, and distractors. These two data sets were chosen because they cover a wide range of acquisition conditions, including a high degree of variability in illumination, scale, pose and occlusion. Moreover, they allow us to evaluate the generalization capability of face detectors, since images in the WIDER Face data set have the same acquisition conditions as the data commonly used to train detectors [22,25,26]. On the contrary, images in the UFDD data set comprise conditions that are not usually considered in facial image data sets, such as weather degradation and motion and focus blur. Furthermore, analyzing images with a wide range of conditions allows us to address different realistic CSEM situations.
To replicate the usual number of subjects involved in CSEM, only images with fewer than five people were analyzed. A total of 1994 images with 3358 faces, with resolutions between 218 × 1024 and 1027 × 1024 pixels, were manually chosen from the WIDER Face data set. Moreover, a total of 2222 images with 4214 faces, with resolutions between 301 × 1024 and 1029 × 1024 pixels, were selected from the UFDD data set.
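A selection step like this can be scripted directly from the ground-truth files. The sketch below is illustrative only; it assumes a WIDER-Face-style annotation layout (an image path line, a face-count line, then one line per bounding box), and `select_small_group_images` is a hypothetical helper, not the selection code used in this work:

```python
def select_small_group_images(annotation_text, max_people=4):
    """Keep only images annotated with fewer than five faces.

    Assumes a WIDER-Face-style ground-truth layout: an image path line,
    a line with the number of faces, then one line per bounding box.
    """
    lines = annotation_text.strip().splitlines()
    selected, i = [], 0
    while i < len(lines):
        path = lines[i].strip()
        n_faces = int(lines[i + 1])
        if 0 < n_faces <= max_people:
            selected.append(path)
        i += 2 + n_faces  # skip past the bounding-box lines
    return selected
```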

Face Detection Methods
We evaluate three representative deep-learning-based face detectors: MTCNN, PyramidBox and DSFD. MTCNN simultaneously applies face detection and face alignment to improve the detection of rotated faces. This method uses three CNNs: the first obtains candidate regions that may contain faces, the second refines the initial detections by rejecting false positive candidates and adjusting face locations, and the third detects facial landmarks. MTCNN overcomes the limitations of other CNN models by considering the diversity of weights and reducing the number of filters and their sizes. MTCNN is the face detector currently integrated in the Evidence Detector software, provided by INCIBE to the two Law Enforcement Agencies that operate in Spain (Policia y Guardia Civil Española).
PyramidBox, a context-assisted single shot face detector, combines high-level contextual semantic features with low-level facial features to predict faces at different scales in a single shot, which improves the detection of small faces. In addition, PyramidBox uses feature maps and anchors generated at different levels, with an extended VGG16 as backbone, and Data-anchor-sampling to increase the diversity of the training data. DSFD extends SSD [59] by integrating feature maps obtained from a VGG16 architecture with enhanced feature maps through a Feature Enhancement Module, which aggregates information from different levels. This module boosts the semantics of the features and improves the localization of faces in difficult detection conditions. Moreover, DSFD introduces a collaborative face sampling and anchor design during augmentation to improve regressor initialization.

Resizing Strategy
Figure 2 illustrates the data flow of the resizing strategy that we evaluate, which we used in [34] to substantially decrease the processing time of face detection. First, the largest dimension of the image (height or width) is used as reference to reduce the image resolution to a percentage of its original size using bilinear interpolation. This allows us to preserve the aspect ratio of the image content, including faces. Second, bounding boxes corresponding to the face locations are detected on the resized images using a deep-learning-based method. Finally, the detected bounding boxes are scaled back to the original image dimensions and returned as output. This last step is necessary because detected faces are typically passed to applications that expect face locations in original image coordinates, such as face recognition or age and gender estimation.
We compare the performance of the selected face detectors on several hardware architectures with four relative sizes-100%, 75%, 50% and 25% of the original dimensions-by following the image resizing strategy described above.
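The strategy above can be sketched in a few lines of Python. `resize_dims`, `detect_at_scale` and the stub detector are hypothetical names for illustration; the actual bilinear resampling and CNN inference are delegated to the image library and face detector in use, so only the coordinate bookkeeping is shown:

```python
def resize_dims(width, height, percent):
    """Target dimensions at `percent` of the original size; scaling both
    dimensions by the same factor preserves the aspect ratio (the actual
    bilinear resampling is done by the image library and omitted here)."""
    scale = percent / 100.0
    return round(width * scale), round(height * scale)

def detect_at_scale(image_w, image_h, percent, detector):
    """Run `detector` (any callable returning [x1, y1, x2, y2] boxes in
    resized-image coordinates) and map the boxes back to the original
    image coordinates expected by downstream applications."""
    new_w, new_h = resize_dims(image_w, image_h, percent)
    boxes = detector(new_w, new_h)
    sx, sy = image_w / new_w, image_h / new_h
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]
```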

Prediction of Face Detection Performance Using Regression Models
Face detection is a crucial step in several applications; however, the most accurate methods require high computational resources and processing time that may be limited in some domains, such as forensics, where (near) real-time performance is expected. Since the use of downsampled images speeds up the face detection stage but reduces the accuracy of the methods, it is desirable to predict the performance of several face detectors with various resized images on a specified hardware. This enables the end user to select the best parameters (method and image resolution) for face detection considering the available computational resources.
In this work, we built regression models to predict the face detection performance (computational time and F1 score) using five explanatory variables: the input image size (width and height), the image resized percentage (100%, 75%, 50% and 25% of the original image size), the face detector (MTCNN, PyramidBox, DSFD), and the hardware (specific CPU or GPU). We collected the face detection performance information considering the image data sets, detectors, resizing strategy, and hardware described above.

Experimental Setup
Experiments were run on a GNU/Linux machine with Ubuntu 18.04, CUDA 9, and cuDNN 7 to: (i) compare the tradeoff between the accuracy and the speed of publicly available implementations of the face detectors (MTCNN, PyramidBox and DSFD), using as input four relative sizes (100%, 75%, 50% and 25% of the original size) on the GPUs and CPUs described in Section 3, and (ii) evaluate the models built to predict the performance of face detectors on a given image with a specific hardware.
Face detectors were coded with Python 3 (https://www.python.org/) and TensorFlow (https://www.tensorflow.org/), both commonly used to design, build, and train deep learning models. During the assessment of the detectors on CPUs and GPUs, images containing fewer than five individuals, the usual number of subjects observed in CSEM, were processed sequentially. Moreover, on computers equipped with GPUs, the GPUs were disabled during the CPU tests in order to measure only the CPU computational capability. Finally, GPU memory usage was not limited during the GPU tests; it was set to grow as needed by the face detectors.
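A minimal sketch of this device configuration, assuming a CUDA/TensorFlow setup: hiding the GPUs is done through the standard `CUDA_VISIBLE_DEVICES` environment variable before TensorFlow is imported, and on-demand memory growth through the TF 1.x session options shown in the comments (`configure_device` is an illustrative helper, not the exact script used in this work):

```python
import os

def configure_device(use_gpu):
    """Hide GPUs from TensorFlow for the CPU runs; for GPU runs, let
    memory usage grow on demand instead of pre-allocating it all.
    Must be called before TensorFlow is imported."""
    if not use_gpu:
        # An empty CUDA_VISIBLE_DEVICES makes CUDA report no devices,
        # forcing the detector onto the CPU.
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
    else:
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        # With TF 1.x, sessions would then be created with growth enabled:
        #   config = tf.ConfigProto()
        #   config.gpu_options.allow_growth = True
        #   sess = tf.Session(config=config)

configure_device(use_gpu=False)
```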
In order to evaluate the tradeoff between the accuracy and the speed of the face detectors, we assessed the accuracy using the mean Average Precision (mAP) [38] and the F1 score [60] metrics. The mAP combines the precision and recall measures by summarizing the shape of the precision-recall curve. It is defined as the mean precision at a set of eleven equally spaced recall levels, computed for a threshold that varies from 0 to 1 in intervals of 0.1. An interpolation of the precision-recall curve is used to reduce the impact of the "wiggles" in the curve caused by small variations in the ranking of examples. The F1 score is the harmonic mean of the precision and recall measures, considering an overlap threshold of 0.5 against ground truth regions to determine true positive and false positive face detections. Furthermore, we computed an improvement (Impv) measure to compare the performance of the face detectors across the analyzed input image sizes and hardware. The improvement is defined in Equation (1) as the relative difference between the baseline configuration, A, and another one, B.
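Equation (1) itself did not survive in this copy of the text. A reconstruction consistent with the surrounding description (a relative difference between configurations A and B, with the sign arranged so that positive values always favor B) would be:

```latex
\mathrm{Impv}(A, B) =
\begin{cases}
\dfrac{B - A}{A} \times 100\% & \text{for mAP and F1 score (higher is better)}\\[1ex]
\dfrac{A - B}{A} \times 100\% & \text{for processing time (lower is better)}
\end{cases}
```

The piecewise sign convention is our reconstruction; the original equation may have expressed the same relative difference in a single form.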
Positive values of Impv indicate that B outperforms A in terms of the evaluated performance metrics (mAP, F1 score or speed).
Regression models to predict the face detector speed were built considering a set of 461,826 examples with the explanatory variables described in Section 3: the input image size, the image resize percentage, the face detector and the type of hardware. We randomly split the data set into 80% for training; the remaining 20% was used for the evaluation of the speed prediction model. Regression models to predict the F1 score of face detectors were trained considering only the input image size, the image resize percentage and the face detector, since the detectors achieve the same F1 score regardless of the hardware used for the analysis. In this case, a set of 38,616 examples was used to build the prediction models, split into a training set with 80% of the examples and a test set with the remaining 20%.
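The split-and-fit protocol can be sketched as follows. The data here are synthetic (a hypothetical linear relation between image area and detection time, standing in for the collected measurements), and `train_test_split`/`fit_line` are illustrative helpers, not the modeling code used in this work:

```python
import random

def train_test_split(rows, train_frac=0.8, seed=0):
    """Shuffle and split the examples: 80% for training and the
    remaining 20% held out for evaluation, as in the protocol above."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def fit_line(xs, ys):
    """Ordinary least squares fit for a single explanatory variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Illustrative data only: detection time growing linearly with image area.
data = [(area, 0.05 + 2e-7 * area) for area in range(10_000, 1_000_000, 7_919)]
train, test = train_test_split(data)
slope, intercept = fit_line([a for a, _ in train], [t for _, t in train])
mae = sum(abs(slope * a + intercept - t) for a, t in test) / len(test)
```

On the held-out 20%, the MAE of the fitted line quantifies the prediction error, mirroring the evaluation described next.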
In both cases, we evaluated the regression models using the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE) [61]. These measures are defined below for a set of n samples, where y is the ground truth value and ŷ is the predicted value.
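The equations themselves are missing from this copy of the text; in their standard form they are:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```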
The MAE is the average of the absolute errors of the prediction model and indicates how close the predicted values are to the ground truth. The MSE measures the average of the squared errors of the regression model. The RMSE corresponds to the standard deviation of the errors of the model. The closer MAE, MSE and RMSE are to zero, the better the regression model performs.

Speed-Accuracy Tradeoff Analysis
Evaluation metrics computed for the face detectors (processing time, F1 score, mAP, precision-recall curves, and Impv) were grouped by image data set (WIDER Face and UFDD) and are discussed in the next two sections. Table 5 shows the mAP, the F1 score and the face detection time (in seconds) computed for the evaluated hardware and image sizes. Table 6 presents the Impv of the mAP, the F1 score and the processing time, comparing the detectors' performance with three relative sizes (75%, 50% and 25%) against the results with original images. Figure 3 presents the precision-recall curves for the evaluated detectors, and Figure 4 exhibits the average computation time on CPUs and GPUs. In both cases, results are grouped by the evaluated image sizes.

Results on the WIDER Face Data Set
As can be seen in Table 5 and Figures 3 and 4, GPU detection times outperformed the CPU ones, with an improvement between 55.51% and 96.86%. This percentage of Impv is related to the complexity of the methods: in general, complex detectors such as PyramidBox and DSFD obtain a large speed-up from GPUs. Hence, using GPUs, MTCNN presented a speed Impv between 55.51% and 62.70%, while DSFD speed increased between 92.14% and 96.86%.
Figure panels: (a) original images, 100%; (b) images resized to 75%; (c) images resized to 50%; (d) images resized to 25%.
Table 6. Improvement (Impv) in terms of accuracy (mAP and F1 score) and speed obtained with different image resolutions (75%, 50% and 25%) with respect to the values computed for full size images (baseline), using MTCNN, PyramidBox and DSFD and different CPU/GPU configurations on the WIDER Face data set. The best Impv per image size and face detector is highlighted in bold. Higher Impv values indicate better performance.
Furthermore, the use of resized images improves detection speed in comparison with the analysis of full size images (see Table 6). The smaller the resized image, the greater the processing time decrease; therefore, the maximum speed-up, on both CPUs and GPUs, is observed with images resized to 25% of the original size. In this case, the fastest face detection is achieved using GPUs, with an average (Avg.) speed Impv of 81.81% for MTCNN, 85.45% for PyramidBox and 82.92% for DSFD. Although face detection is performed faster on GPUs, the use of resized images allows a larger speed-up during CPU analysis, with an Avg. Impv in detection times of 85.24% for MTCNN, 92.53% for PyramidBox and 93.46% for DSFD.

However, as expected, the resizing strategy reduces the accuracy metrics (the mAP and the F1 score) in proportion to the size reduction: larger size reductions led to greater decreases in mAP and F1 score in comparison to the performance obtained using full size images. Thus, the maximum mAP and F1 score reductions are observed with images resized to 25%, with a drop for MTCNN of −18.60% in mAP and −22.35% in F1 score, a reduction for PyramidBox of −21.70% in mAP and −35.72% in F1 score, and a decrease for DSFD of −8.35% in mAP and −18.91% in F1 score. The best accuracy is achieved with images resized to 75%, where the mAP slightly improved in comparison to the values obtained with full size images, from 0.11% for PyramidBox to 0.53% for MTCNN and 0.57% for DSFD. Moreover, DSFD performed better than MTCNN and PyramidBox with resized images.
Figure 5 illustrates the face detection results on images with subjects in a side-view position (a difficult pose). In these conditions, MTCNN does not detect any face in the original or resized images, while the most robust face detectors, DSFD and PyramidBox, detect faces in all the evaluated cases. The precise localization of the bounding boxes around the detected faces is affected by the dimensions of the analyzed image.
The best speed-accuracy tradeoff on the WIDER Face data set is obtained using DSFD on GPUs with images reduced to 50% of the original size, or on CPUs with images reduced to 25% of the full size (see Table 5 and Figure 4). In the case of the GPU analysis, the accuracy (Avg. mAP of 93.77% and Avg. F1 score of 0.883) is improved while the computation time (0.132 s) is unaffected in comparison to the results observed with MTCNN and full size images (Avg. mAP of 56.10%, Avg. F1 score of 0.505, and processing time of 0.132 s).
In general, during the processing of the WIDER Face data set, the best performance for MTCNN, PyramidBox and DSFD is achieved using the RTX 2060 and TITAN Xp GPUs and the i9-8950HK and Xeon E5 CPUs. This indicates that on CPUs the detection speed is determined by the base frequency, number of cores and bus speed. Therefore, the i9-8950HK CPU, with a high base frequency and bus speed, performs better than the Xeon E5 CPU, with a low base frequency and bus speed, despite the larger memory and CPU cache of the latter's test computer. In the case of GPUs, the architecture determines the detection speed: RTX GPUs with the Turing architecture perform better than GPUs with the Pascal and Kepler architectures, even when the latter have a larger number of cores, video memory or memory bandwidth. Among GPUs of the same architecture, those with more cores, more video memory and higher clock frequencies perform face detection in less time.

Results on the UFDD Data Set
Table 7 shows the mAP values, the F1 scores and the speed obtained for the evaluated face detectors using the considered image sizes, CPUs and GPUs. Figure 6 depicts the precision-recall curves of the detectors, and Figure 7 presents the average processing time on CPUs and GPUs. In both figures, results are grouped by the analysed image resolutions. Table 8 reports the Impv of the mAP, the F1 score and the speed of the face detectors with different image resolutions against the results using full size images.
Figure panels: (a) original images, 100%; (b) images resized to 75%; (c) images resized to 50%; (d) images resized to 25%.
Consistent with the results obtained on the WIDER Face data set, MTCNN is the fastest face detector, while DSFD is the most accurate (see Table 7, Figures 6 and 7). The mAP values range from 10.30% to 65.40%, whereas the F1 score values vary from 0.123 to 0.724.
Regarding the computation time (see Table 7), the best performance is achieved with the i9-8950HK CPU and the RTX 2060 GPU, followed by the TITAN Xp GPU. Similar to the WIDER Face data set, on CPUs the base frequency, number of cores and bus speed determine the computational time during face detection, whereas the most relevant characteristics of GPUs are the architecture, number of cores, video memory and clock frequency. Furthermore, the use of GPUs speeds up face detectors in comparison to CPUs, reducing the processing time of complex detectors significantly. In particular, MTCNN shows an improvement in detection speed using GPUs between 52.68% and 62.87%, while DSFD (more computationally demanding) achieved a reduction in processing times between 92.16% and 96.68% using GPUs.
Table 7. Speed and accuracy (mAP and F1 score) tradeoff results on the UFDD data set for the MTCNN, PyramidBox and DSFD face detection methods using four different image resolutions and CPU/GPU configurations. The best mAP, F1 score and speed values per image size and face detector are highlighted in bold. Higher mAP and F1 score values together with lower detection times indicate better performance.

Moreover, the use of resized images speeds up face detection at the cost of a decrease in mAP and F1 scores. The smaller the input image size, the greater the reduction in mAP and F1 score in comparison to the values achieved using the original images. Hence, the maximum improvement in detection speed and decrease in mAP are observed using CPUs and GPUs to process images resized to 25% of the original image size (see Table 8). In this case, the use of GPUs leads to the fastest face detection: MTCNN performed the detection in 0.023 s with an mAP of 10.30% and an F1 score of 0.123; PyramidBox in 0.060 s with an mAP of 23.60% and an F1 score of 0.248; and DSFD in 0.066 s with an mAP of 39.20% and an F1 score of 0.419. This corresponds to a speed Impv of 80.93% for MTCNN, 84.22% for PyramidBox and 84.31% for DSFD, and a reduction in mAP values of −48.24% for MTCNN, −55.81% for PyramidBox and −40.06% for DSFD. Similar reductions are observed for the F1 scores of the three face detectors.

The best speed-accuracy tradeoff on the UFDD data set is obtained using DSFD on GPUs with images reduced to 50% of the original size (average mAP of 57.30%, F1 score of 0.626 and processing time of 0.130 s) or on CPUs with images reduced to 25% of the original size (average mAP of 39.20%, F1 score of 0.419 and computation time of 0.698 s). In the case of the best GPU set-up, the mAP and F1 score are improved considerably, with a similar detection speed, in comparison to the results attained with MTCNN and full size images (average mAP of 19.90%, F1 score of 0.236 and processing time of 0.120 s).

Performance Estimation Model
Here we assess Generalized Linear Models (GLMs) to estimate the processing time and the F1 score of face detectors. A GLM [62] is a flexible generalization of the ordinary linear regression model that allows the response variables, such as processing time and F1 score, to follow an error distribution other than the Gaussian (normal) distribution. The GLM parameters were estimated using the generalized least squares method. In this work, we compared GLMs assuming a normal distribution of the variables against other data distributions, such as the Negative Binomial. Furthermore, we assessed the improvement of GLMs through a logarithmic transformation of the response variables or a concatenation of the explanatory variables. The logarithmic transformation is commonly employed in regression models to handle a non-linear relationship between the response and explanatory variables. The concatenation of explanatory variables is considered a way to reduce the complexity of regression models, which may improve the linear fit of the data.
Table 9 reports the MAE, the RMSE and the MSE values for the regression models built to predict the computational time of face detectors based on the areas of the images, the face detection methods, the image resize percentages, and the hardware used to process the images (see Section 3.5). Since the computational time may have an exponential tendency, we applied a logarithmic transform to the detection speed before building the GLMs (see rows 2 and 5). This data transformation improves the fit of the regression models to predict the processing time: the performance of the baseline (model 1), with an MAE of 1.438, RMSE of 2.164 and MSE of 4.682, improved to an MAE of 0.624, RMSE of 1.851 and MSE of 3.425 (model 2).
Furthermore, we compared models built with individual variables (see rows 1-3) against models trained using a concatenation of the categorical variables method, resized and machine (see rows 4 and 5). Results show that the concatenation of variables improved the fit of the regression models for estimating the computational time. In the best case (model 5), the combination of the concatenated variables with the logarithmic transform of the speed yields a significant improvement in performance, with an MAE of 0.113, RMSE of 0.455 and MSE of 0.207, in comparison to the baseline (model 1).
Table 10 shows the MAE, the RMSE and the MSE values for the regression models built to predict the F1 score of face detectors based on the areas of the images, the face detection methods, and the image resize percentages (see Section 3.5). Recall that the hardware is not considered as an explanatory variable, since detection yields the same F1 score regardless of the CPU or GPU employed to process the images. Besides, since the logarithmic function is only defined for values greater than zero and the F1 metric ranges between 0 and 1, the logarithmic transformation is not applicable in this case. Hence, we compared the performance of GLMs built with individual variables assuming a normal and a Negative Binomial distribution (models 1 and 2, respectively) against a model trained with a concatenation of the categorical variables method and resized (model 3). Results show that there is no significant difference between the assessed models for F1 score estimation, with a slightly better performance for the model built with a normal distribution and the concatenated variables, with an MAE of 0.370, RMSE of 0.417 and MSE of 0.174.
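The two ingredients of the best model, the log transform of the response and the concatenation of categorical variables, can be illustrated with a deliberately simple stand-in: a per-configuration mean of log(time), which predicts via exp() (i.e., a geometric-mean predictor per concatenated factor). This toy sketch is not the GLM implementation used in the paper, and all names are illustrative:

```python
import math
from collections import defaultdict

def concat_key(method, resized, machine):
    """Concatenated categorical variable (model-5 style): one factor
    per (method, resize percentage, hardware) configuration."""
    return f"{method}|{resized}|{machine}"

def fit_log_time_model(rows):
    """Simplest log-linear predictor: average log(time) per
    configuration. Fitting on the log scale mirrors the logarithmic
    transform of the response variable described above."""
    sums, counts = defaultdict(float), defaultdict(int)
    for method, resized, machine, time in rows:
        k = concat_key(method, resized, machine)
        sums[k] += math.log(time)
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

def predict(model, method, resized, machine):
    # exp() inverts the log transform, returning a time in seconds.
    return math.exp(model[concat_key(method, resized, machine)])
```

A real GLM would add the image area as a continuous regressor on top of these per-configuration factors; the sketch only shows how the categorical concatenation and log/exp round trip fit together.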

Conclusions and Future Work
Deep learning approaches based on CNNs have proven to be highly effective for automated face detection, achieving remarkable accuracy. Forensic face recognition, as required in CSEM detection systems, remains a more difficult task because it must handle images captured under non-ideal conditions and meet stringent time constraints. These deep learning models, however, tend to be very computationally demanding, and some of them may not be appropriate for CSEM-like applications.
In this work, we presented a comparison of the speed and accuracy of three popular deep-learning-based face detectors (MTCNN, PyramidBox, and DSFD) on five Intel CPUs (i5-3450, i7-4790K, i7-8650U, i9-8950HK, and Xeon E5) and seven Nvidia GPUs (Tesla K40, TITAN Xp, GTX 1050Ti, GTX 1060, GTX 1070, RTX 2060, and RTX 2070), analyzing original images and versions reduced to three different sizes from two data sets, WIDER Face and UFDD.
Results confirm that the use of resized images speeds up the face detection stage but reduces the accuracy. We found that the speed-up achieved by using resized images and GPUs depends on the complexity of the face detector: sophisticated detectors obtain a substantial improvement in processing times. The best speed-accuracy tradeoff is yielded by applying the DSFD detector to images resized to 50% of the original size on GPUs and to 25% of the original size on CPUs. Moreover, the best performances were obtained with the i9-8950HK CPU and the RTX 2060 GPU.
Considering this tradeoff between speed and accuracy, we trained a model capable of predicting, for new images, the performance of several face detectors with various resized image sizes on a specified hardware. In our experiments, multiple linear regression models were able to predict the face detection performance with an MAE of 0.113.
The proposed models are expected to help end users, such as forensic investigators, to select the most appropriate hardware for applications where face detection is required, such as face recognition or child detection in CSEM. Additionally, the prediction model will guide forensic practitioners in choosing the best parameters (detection method and image resolution) for face detection considering the available computational resources.
Building more complex prediction models is part of our future work. We also aim to analyze the features that have the most influence on model performance.

Funding: This research has been funded with support from the European Commission under the 4NSEEK project with Grant Agreement 821966. This publication reflects the views only of the author, and the European Commission cannot be held responsible for any use which may be made of the information contained therein.