Transfer detection of YOLO to focus CNN’s attention on nude regions for adult content detection

Abstract: Video pornography and nudity detection aim to detect and classify people in videos as nude or normal for censorship purposes. Recent literature has demonstrated pornography detection utilising a convolutional neural network (CNN) to extract features directly from whole frames and a support vector machine (SVM) to classify the extracted features into two categories. However, existing methods were not able to detect small-scale pornographic and nude content in frames with diverse backgrounds. This limitation has led to a high false-negative rate (FNR) and the misclassification of nude frames as normal ones. To address this matter, this paper tackles the limitation of existing CNN-only approaches by focusing the visual attention of the CNN on the expected nude regions inside the frames to reduce the FNR. The You Only Look Once (YOLO) object detector was transferred to the pornography and nudity detection application to detect persons as regions of interest (ROIs), which were applied to a CNN and an SVM for nude/normal classification. Several experiments were conducted to compare the performance of various CNNs and classifiers using our proposed dataset. It was found that ResNet101 with random forest outperformed the other models, with an F1-score of 90.03% and an accuracy of 87.75%. Furthermore, an ablation study was performed to demonstrate the impact of adding YOLO before the CNN. YOLO–CNN was shown to outperform CNN-only in terms of accuracy, which increased from 85.5% to 89.5%. Additionally, a new benchmark dataset with challenging content, including various human sizes and backgrounds, was proposed.


Introduction
Given the vast growth in the quantity of videos and images in all types of media nowadays, various content understanding methods have been developed and employed in real-world scenarios. Pornography and nudity detection in videos or series of images has a significant impact on visual censorship applications (e.g., TV broadcasting and video-sharing platforms such as YouTube and TikTok). Nudity content can include one or more persons with explicit exposure of the full, upper (for females) or lower body at various scales, in various positions, and against various backgrounds. Pornographic content refers to the exposure of specific sexual organs and sexual activity. Numerous films are broadcast to the public 24 h a day and possibly contain pornography and nudity content, which may be viewed by underage children and cause serious social problems. In addition, nudity is prohibited in TV broadcasting in some countries. Similarly, video-sharing platforms have rules to censor nudity and sexual content. In addressing some of these issues, manual intervention is typically employed for censorship. However, the introduction of network [10,24], two CNNs in parallel: one for static (picture) and another for dynamic (motion) information utilising optical flow and MPEG motion vectors [9], and 3D CNN [25] to detect pornography utilising VGG-C3D [26] with a linear SVM classifier and ResNet R(2+1)D CNN [27] with a Softmax classifier. These methods depend on stochastic gradient descent (SGD) to tune the model's parameters and consequently take a considerable time to train. To reduce training time, least-squares solutions such as the local receptive field-based extreme learning machine were proposed for adult content detection [13].
The NPDI pornography dataset [28,29] has been extensively used to train and evaluate various models to detect pornography in videos [7,9–13]. However, when the proposed models were used with real-life broadcast movies for censorship purposes, the shortcomings of the existing methods and dataset became evident. Although existing training strategies were able to classify pornographic videos into two classes, normal and nude, they were too simple to detect nudity at arbitrary scales and against complex backgrounds in video frames. In other words, the NPDI dataset focuses mainly on pornographic content (sexual actions and exposure of genital organs) that covers the camera's whole field of view. Therefore, when we applied the models trained on the NPDI dataset to real-world films, they were unable to detect nude frames, thus producing a high false-negative rate. The main reason was that the CNN loses its focus on the nudity content and instead attends to other objects in the background. Therefore, to overcome these limitations of the dataset, this paper proposes a new nudity dataset with challenging content covering various nudity scales and backgrounds.
Although CNN-only methods have shown superior performance for pornography detection when the pornographic content covers most of the frame, their performance is still limited in detecting nudity when the nude persons cover only small regions inside the frame and when the background is complex, such as nude people in a forest, in snow, on a beach, in a supermarket, indoors, or in the street. Nian et al. studied this problem and proposed a fast image scanning method based on the sliding window approach to detect the nude region [30]. However, the drawback of this approach is that the speed and the number of sliding steps depend on the width of the frame and the sliding stride. Citing [31], the authors claimed that applying human detection is difficult because the frames in pornographic videos may show only a small part of the human body.
Human detection has been explored extensively in the literature. The fastest pedestrian detector in the west (FPDW), a feature engineering method, utilises histograms of oriented gradients (HOG) at different scales [32]. It balances the trade-off between accuracy and speed to detect humans faster than the state-of-the-art methods of its time. Although FPDW requires a large number of pixels, only one frame is sufficient for detecting humans. Additionally, a combination of optical flow and AlexNet has been used to detect humans at various scales, viewpoints, positions, and orientations, and in various clothing, using a camera attached to a moving airborne platform at varied altitudes [33]. This combination requires two frames to find a set of candidate objects. Similarly, some works have demonstrated YOLO for human detection [34,35].
Some works have targeted nudity detection to remove the whole inappropriate frames from the video, while other works were proposed to filter out only sensitive regions utilising the approach of image-to-image translation that was based on adversarial training [36]. In addition, the detection of body parts with multi-labelled classification was also demonstrated [37].
Ensemble methods were also utilised in adult content recognition [38][39][40]. Here, a weighted sum of several deep neural networks (DNNs) was used to express the CNN's weights as a linear regression problem learned using ordinary least squares (OLS) [38]. Additionally, ensemble framework uncertain inference employed a Bayesian network. The prior global confidence of pornography for the candidate image was extracted using GoogleNet/ResNet-50 [39]. In addition, uncertain evidence was extracted using Single Shot MultiBox Detector to detect visual objects of the six sensitive semantic components [39]. Moreover, ensemble-based multi-instance learning has been proposed to utilise a group of extreme learning machine (ELM) classifiers with a different number of hidden nodes [40].
In this paper, we provide further insight into the nudity classification task by adding YOLO [41][42][43] to overcome the limitations of CNN-only methods. If the frame contains more than one person, each person image patch is passed to the CNN to produce one category, which is stored in a list. The list holds the labels of all person patches in one frame. If any element in the list has a nude label, the whole frame is considered nude, whereas if all elements in the list have normal labels, the whole frame is considered normal. Additionally, the proposed method detects nudity regions at different scales inside frames with complex backgrounds. In other words, it can edit and filter out only the detected regions and keep the other regions of the frame; thus, it reduces the interrupted period of the film and helps the viewer to understand the context. This paper demonstrates a novel automated system that speeds up censorship and reduces its cost. The key contributions of this paper include the following:
• YOLOv3 was utilised as a human detector. To the best of our knowledge, this is the first paper that uses YOLO for the nudity detection task.
• Pre-trained CNNs that had already been trained on the ImageNet dataset were used as feature extractors, with the parameters of the first layers kept fixed. The fully connected layers were removed. The objective was to find discriminative features for the normal/nude classification task.
• Various classifiers were used to replace the top layer of the pre-trained CNNs to fit the proposed nudity dataset, providing two distinct classes: normal and nude.
• An ablation study was undertaken to demonstrate the impact of adding YOLO before the CNN.
• An ablation study was performed to provide further insight into the added advantage of data augmentation in nudity detection applications.
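The frame-level decision rule described above (a frame is nude if any person patch is classified as nude) can be sketched in a few lines. This is only an illustrative sketch: the label strings and the function name `classify_frame` are assumptions, and the handling of a frame with no person patches is a choice made here, not one stated in the text.

```python
from typing import List

# Hypothetical label names; the paper only speaks of "nude" and "normal" classes.
NUDE = "nude"
NORMAL = "normal"


def classify_frame(patch_labels: List[str]) -> str:
    """Derive one frame-level label from the labels of its person patches."""
    # A single nude patch is enough to flag the whole frame.
    if any(label == NUDE for label in patch_labels):
        return NUDE
    # All patches are normal. An empty list also lands here, which is an
    # assumption of this sketch for frames without detected persons.
    return NORMAL
```

For example, `classify_frame(["normal", "nude", "normal"])` yields the nude label, while a list containing only normal labels yields the normal label.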
The organisation of this paper is structured into four sections. Section 1 has provided a relevant background and aim of the study. Section 2 demonstrates the dataset, and three main blocks in the proposed system, including YOLO for human detection, various pre-trained CNNs for feature extraction, and various classifiers. In Section 3, experimental setup, and results are presented and discussed. An ablation study to gain insight on the effect of adding YOLO before CNN is also described. Section 4 summarises the outcome and significance of this work.

Datasets Overview
In this section, an overview of the training, validation, and testing images and videos, in both versions: augmented and non-augmented, is explored. Additionally, we demonstrate the experimental protocol that describes how to split the dataset into train, validation, and test.

ImageNet Dataset
ImageNet is a large-scale dataset consisting of 12 subtrees with a total of 3.2 million annotated images spread over 5247 categories, with an average of over 600 images for each category [16]. ImageNet is considered a benchmark dataset to train and validate deep CNN models. Consequently, models trained on ImageNet can be transferred to other applications when the new dataset is not sufficiently large, in order to reduce the overfitting problem.
Symmetry 2021, 13, 26
In this paper, various CNNs such as AlexNet, VGG16, GoogleNet, Inception3, ResNet50, and ResNet101, already trained on the ImageNet dataset, were used to extract features from the frames before being classified into nude or normal.

COCO Dataset
The COCO dataset contains photos of 91 object types with a total of 2.5 million labelled instances in 328k images [17]. In our experiments, the YOLOv3 deep learning-based human detector, already trained on the COCO dataset, was utilised to focus the attention on humans and ignore other objects in the backgrounds of the frames.

NPDI Dataset
The NPDI dataset includes about 80 h of video across 800 videos [28,29]. These videos were divided into 400 normal and 400 pornographic videos. The normal videos contain two subcategories: easy, with random videos, and difficult, which includes body skin content such as beach, wrestling, and swimming scenes. An extended version of the NPDI dataset has 140 h of video across 2000 videos, including 1000 pornographic and 1000 normal videos [47]. In the first experiment, we utilised the NPDI dataset in the training stage with a transfer learning approach to extract features from the videos' frames using a pre-trained ResNet50. In addition, an SVM was trained to classify the extracted features into two categories, namely nude and normal. We utilised the same CNN and SVM used in state-of-the-art solutions to compare the proposed method with existing ones. Figure 1 illustrates a few samples of the NPDI dataset. The majority of samples in NPDI have explicit pornographic content that covers the whole frame and includes the exposure of specific sexual organs and sexual activity.

Our Challenging MMU Dataset
To overcome the limitations of well-known pornography datasets such as NPDI, this dataset was carefully collected from the internet to include various backgrounds such as forest, snow, beach, supermarket, indoor, and street scenes. It is also considered challenging due to variations in many factors, including human position (standing, sitting, lying, etc.), size of nude people relative to the frame's size, clothing, skin colour, and backgrounds. The dataset contains 800 images: 400 normal and 400 nude images. The images were divided into training and validation sets with 80% (640) and 20% (160), respectively, as shown in Table 1. The number of training samples with augmentation (AUG) and without augmentation (no AUG) is also given in Table 1. Figures 2 and 3 illustrate a few samples from this challenging dataset.
In addition, the size of this dataset was augmented eight times to become 6400 images, as shown in Figure 4, as follows: (1) Flip image horizontally.
This dataset was used to train and validate the models in the ablation experiment that studies the effect of adding YOLOv3 to the pipeline before the CNN. For CNN-only, the training samples were the whole images. On the other hand, YOLO-CNN used human-only patches that were extracted from the images by YOLOv3. A few samples of the human-only dataset are shown in Figure 5.
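As a small illustration of the first augmentation step named above, the following sketch mirrors an image horizontally and checks the eight-fold size arithmetic. It is only a sketch: the image is represented as nested lists of pixel values for brevity, the function names are hypothetical, and the remaining augmentation operations are not reproduced because they are not recoverable from the text.

```python
from typing import List, Sequence

Image = List[List[int]]  # rows of pixel values (single channel for brevity)


def flip_horizontal(image: Sequence[Sequence[int]]) -> Image:
    """Mirror an image left to right by reversing every row."""
    return [list(row[::-1]) for row in image]


def augmented_size(n_images: int, factor: int = 8) -> int:
    """Dataset size after enlarging it by the given factor (8 in the text)."""
    return n_images * factor
```

With the 800 collected images, `augmented_size(800)` gives the 6400 images reported above.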

Testing Film Dataset
This dataset is also challenging and was only used for testing purposes. It consists of five real-world videos available on the internet, as described in Table 2. The total length of the testing videos is about one hour, which can be divided into 360 clips of 10 s each. Figure 6 shows a few samples of the frames. The scales of humans in the frames vary, and the backgrounds are complex, including green grass, indoor, street, and beach scenes. The dataset includes a total of 9891 frames that were passed to YOLO. As a result, YOLO discarded 1312 non-human frames and kept 8579 frames (5081 nude and 3498 normal) that had at least one human. In total, 25,983 human images were detected in these 8579 frames selected from the five videos. The frames were selected at three frames per second (first, middle, and last frames).
In the first experiment, this dataset was used to test the model trained on NPDI. Additionally, these five videos were also used in the second experiment to test the model trained on human-only images. In the third experiment, we utilised these five videos to validate and compare various feature extractors and classifiers.
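The frame selection rule above (three frames per second: first, middle, and last) can be sketched as an index computation. This is a minimal sketch under stated assumptions: the function name `sample_frame_indices` is hypothetical, the frame rate is taken to be an integer, and "first, middle, and last" is interpreted as being relative to each one-second window.

```python
from typing import List


def sample_frame_indices(n_frames: int, fps: int) -> List[int]:
    """Pick the first, middle, and last frame index of every one-second window."""
    if fps <= 0:
        raise ValueError("fps must be a positive integer")
    indices: List[int] = []
    for start in range(0, n_frames, fps):
        # The last window may be shorter than one second.
        end = min(start + fps, n_frames) - 1
        middle = (start + end) // 2
        # A set removes duplicates when the window holds fewer than 3 frames.
        indices.extend(sorted({start, middle, end}))
    return indices
```

For a 25 fps clip of 50 frames this selects frames 0, 12, and 24 from the first second and 25, 37, and 49 from the second one.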
Figure 4. A few samples of images after data augmentation.

Figure 6. A few samples of frames in the testing film dataset.

Methodology
In this section, the methodology adopted in this study is explored in detail. The proposed approach of model fusion combines YOLO for human detection with one of several CNNs, including AlexNet [18], VGG16 [19], GoogleNet [20], Inception3 [21], and ResNet [22], for feature extraction. In addition, the last fully connected layers of the CNNs were replaced by one of various classifiers, such as KNN [44], RF [45], ELM [46], and SVM with different kernels: linear (LSVM), Gaussian (GSVM), and polynomial (PSVM) [23], for nude/normal classification. The proposed system diagram is shown in Figure 7. Each video frame was passed to YOLO. The outcome of this stage was a set of image patches of the humans present in the frame. These patches were passed to a pre-trained CNN to extract features, which the classifier used to assign one of two classes: nude or normal. The red and black rectangles on the naked regions in this paper were added manually for public consideration.
A brief review of each module used in the proposed detection system (YOLO, pre-trained CNN, and classifier) is given in the following subsections.

YOLO-Based Human Detection
YOLO is a simple unified model that is normally used when real-time detection is required without losing too much accuracy [41][42][43]. YOLO frames detection as a regression task. It is an end-to-end neural network that extracts features directly from full images using convolutional layers and predicts bounding boxes and class probabilities using fully connected layers. Although YOLO produces more localisation errors, it makes less than half the number of background errors compared to Fast R-CNN because it sees the full image during training [41]. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance, and the whole model is trained jointly [41].
The YOLO network architecture was inspired by the GoogleNet model for image classification. It comprises 24 convolutional layers followed by two fully connected layers and, instead of the inception modules used by GoogleNet, uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers, as shown in Figure 8. The convolutional layers were pre-trained on the ImageNet classification task with an input resolution of 224 × 224, and then the resolution was doubled for detection [41].
In the YOLO model, the input image is divided into an S × S grid. The number of bounding boxes predicted for each grid cell (B) is 2, and the number of class probabilities (C) is 20. As a result, with S = 7, the final output is an S × S × (B × 5 + C) = 7 × 7 × 30 tensor of predictions [41].
The box confidence is defined as $\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$. If no object exists in the cell, the confidence score should be 0. Otherwise, it should equal the intersection over union (IOU) between the ground truth and predicted boxes [41]. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The coordinates (x, y) represent the box centre relative to the bounds of the grid cell. The width (w) and height (h) are predicted relative to the full image. The conditional class probabilities $\Pr(\text{Class}_i \mid \text{Object})$ for each grid cell are conditioned on the grid cell containing an object. The class confidence scores for each box are given as follows [41]:
$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$
Model instability and training divergence can arise because grid cells that do not contain any object push the confidence scores of these cells towards zero, often overpowering the gradient from cells that do contain objects. Therefore, to address this problem, two parameters, $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$, were used to increase the loss from the bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that do not contain objects [41].
Although YOLO can identify objects in images, it fails to localise some objects, especially small ones, since a sum-squared error treats errors equally in small and large bounding boxes [41]. This has an impact on the IOU of small boxes. Therefore, the error metric should reflect that small deviations matter more in small boxes than in large boxes. As such, the square root of the bounding box width and height is predicted instead of the width and height directly.
The multi-part loss function is optimised during training as follows [41]:
$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$
where $\mathbb{1}_{i}^{\text{obj}}$ denotes whether an object exists in cell i and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the jth bounding box predictor in cell i is responsible for that prediction. In the loss, $C_i$ denotes the box confidence and $p_i(c)$ the class probability for class c.
The loss function penalises the classification error only if an object exists in the grid cell [41]. In addition, it penalises the box coordinate error only if the predictor is responsible for the ground truth box.
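To make the quantities in this subsection concrete, the sketch below computes the output tensor shape, the IOU of two boxes, the class confidence score, and the square-root size term of the loss. It is a minimal illustration rather than the detector's implementation: the function names are hypothetical, and boxes are assumed to be given as (x centre, y centre, width, height) in a common coordinate system.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_centre, y_centre, width, height)


def output_shape(S: int = 7, B: int = 2, C: int = 20) -> Tuple[int, int, int]:
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two centre-format boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def class_confidence(p_class_given_obj: float, p_obj: float, iou_value: float) -> float:
    """Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU."""
    return p_class_given_obj * p_obj * iou_value


def size_term(w: float, w_hat: float, h: float, h_hat: float) -> float:
    """Square-root width/height error used in the coordinate part of the loss."""
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2 + (math.sqrt(h) - math.sqrt(h_hat)) ** 2
```

With these definitions, the same absolute deviation of 0.05 in width and height yields a larger `size_term` for a small box (0.1 versus 0.15) than for a large one (0.8 versus 0.85), which is the effect the square root is intended to achieve.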
YOLOv2 was found to improve the performance of its first version by adding several new features, including the following [42]:
• Fine-tuning the classification network on ImageNet at a higher resolution of 448 × 448 instead of 224 × 224.
• Using anchor boxes to predict bounding boxes and removing the fully connected layers from YOLO, which plays an important role in enhancing recall.
• Applying k-means clustering to the dimensions of the training-set boxes to find good priors automatically instead of choosing them by hand.
• Predicting box locations directly to overcome the model instability resulting from anchor boxes, which makes the network more stable.
• Concatenating the higher and lower resolution features by stacking them into different channels instead of spatial locations.
• Training with a variety of input dimensions.
Most detection models are based on a VGG-16 network, which requires 30.69 billion floating-point operations for a single forward pass over one 224 × 224 image. To increase the speed, YOLO [42] uses a custom backbone, similar to GoogleNet [20], which requires only 8.52 billion operations, though its accuracy is slightly worse than that of VGG-16. YOLOv2, on the other hand, proposes Darknet-19, which resembles VGG and has 19 convolutional layers and five max-pooling layers [42].
YOLOv3 improves on YOLOv2 as an object detector [43], given that it is both fast and accurate. It performs very well on the detection metric at an IOU threshold of 0.5. In addition, it is as accurate as the single shot detector (SSD) but three times faster. However, YOLOv3 is slower than YOLOv2, since it incorporates residual blocks, skip connections, and upsampling to outperform YOLOv2 [43]. For detection, YOLOv3 predicts ten times the number of boxes predicted by YOLOv2 and predicts an objectness score for each bounding box using logistic regression. The network in YOLOv3 is a hybrid between the network used in YOLOv2 (Darknet-19) and residual networks.
It has 53 convolutional layers and is thus called Darknet-53. Its performance is similar to that of ResNet-152, while being twice as fast [43].
In this paper, we propose using YOLOv3 with the Darknet 19 network for human detection. The input image was resized to 416 × 416 before being passed to the detector. YOLO3 was selected to balance the trade-off between detection accuracy and speed.
In addition, the pilot study and experiments performed with YOLO validated that YOLO was a good candidate for detecting nude people at various scales and against various backgrounds. The model was tuned to keep only person detections and ignore other classes.
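The person-only filtering and patch extraction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the detection format (a dictionary with `label`, `confidence`, and `box` keys), the 0.5 confidence threshold, and the function names are all assumptions, and the frame is represented as a plain nested list of pixels to keep the example self-contained.

```python
def keep_persons(detections, min_confidence=0.5):
    """Keep only detections of the 'person' class above a confidence threshold."""
    return [
        d for d in detections
        if d["label"] == "person" and d["confidence"] >= min_confidence
    ]


def crop_patch(frame, box):
    """Crop a (x_min, y_min, x_max, y_max) box from a frame stored as rows of pixels."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    x_min, y_min, x_max, y_max = (int(v) for v in box)
    # Clamp the box to the frame so that partially visible persons are still cropped.
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(width, x_max), min(height, y_max)
    return [row[x_min:x_max] for row in frame[y_min:y_max]]


if __name__ == "__main__":
    frame = [[(x, y) for x in range(6)] for y in range(4)]  # toy 6 x 4 frame
    detections = [
        {"label": "person", "confidence": 0.9, "box": (1, 0, 4, 3)},
        {"label": "dog", "confidence": 0.8, "box": (0, 0, 2, 2)},
        {"label": "person", "confidence": 0.2, "box": (3, 1, 6, 4)},
    ]
    patches = [crop_patch(frame, d["box"]) for d in keep_persons(detections)]
    print(len(patches), "person patch(es) extracted")
```

Each extracted patch would then be resized to the input size of the chosen CNN before feature extraction.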

CNN-Based Feature Extraction
In this section, we demonstrate the transfer learning approach, which consists of training CNNs on a large-scale dataset and then utilising the trained networks with the proposed nudity dataset. In addition, various CNNs are described to provide further insight into the functionality of each network and its advantages and drawbacks. Two main types of CNN learning were performed sequentially as follows:

• This CNN is an end-to-end learning model. During training, the images and labels (classes) are available and used to tune the parameters of the whole network. The network consists of convolutional, pooling, batch normalisation, dropout, and fully connected (top) layers. The objective is to fit the large-scale ImageNet dataset. SGD was used to tune the parameters. This scenario is called feature learning. At the end of training, the optimal parameters, which best fit the ImageNet images and map them to 1000 categories, were obtained and were ready to be transferred to a new dataset.

• Pre-trained CNN Model
The previous CNNs, already trained on the ImageNet dataset, were utilised after removing the top layers. The objective is to use the parameters of the first layers to extract features from the proposed nudity dataset. This scenario is called feature extraction. Various classifiers replaced the removed top layers, and their parameters were tuned to fit the nude images and map them into two categories: nude and normal.
In this paper, CNN-based learning was transferred to the nudity detection task. We explored various CNN architectures: AlexNet, which has few layers [18]; GoogleNet [20] and VGG16 [19], which are deeper; and ResNet50 and ResNet101, which are very deep [22]. These architectures were then compared. In this work, the pre-trained CNNs were utilised to extract features from image patches that contained only persons. The input images were converted from RGB to BGR, and then each colour channel was zero-centred with respect to the ImageNet dataset, without scaling. After that, images were resized to 227 × 227 for AlexNet, 299 × 299 for Inception3, and 224 × 224 for VGG16, GoogleNet, ResNet50, and ResNet101. The dimension of the extracted features differs from one CNN to another: 4096 for AlexNet and VGG16, 1024 for GoogleNet, and 2048 for Inception3, ResNet50, and ResNet101.
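The preprocessing described above can be sketched as follows. The per-channel means are an assumption: the paper does not list them, and the values used here are the BGR ImageNet means commonly used in Caffe-style preprocessing. Resizing is omitted to keep the sketch free of image libraries; the input sizes and feature dimensions are copied from the text, and the names are illustrative.

```python
# Assumed BGR channel means of ImageNet (Caffe-style preprocessing), not given in the paper.
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)

# Input sizes and feature dimensions as stated in the text.
INPUT_SIZE = {"AlexNet": 227, "Inception3": 299, "VGG16": 224,
              "GoogleNet": 224, "ResNet50": 224, "ResNet101": 224}
FEATURE_DIM = {"AlexNet": 4096, "VGG16": 4096, "GoogleNet": 1024,
               "Inception3": 2048, "ResNet50": 2048, "ResNet101": 2048}


def preprocess_pixel(rgb):
    """Convert one RGB pixel to BGR and zero-centre each channel, without scaling."""
    r, g, b = rgb
    mean_b, mean_g, mean_r = IMAGENET_BGR_MEAN
    return (b - mean_b, g - mean_g, r - mean_r)


def preprocess_image(image):
    """Apply the pixel transform to an image stored as rows of (R, G, B) tuples."""
    return [[preprocess_pixel(pixel) for pixel in row] for row in image]


if __name__ == "__main__":
    toy_patch = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (128, 128, 128)]]
    for row in preprocess_image(toy_patch):
        print(row)
    print("ResNet101 expects", INPUT_SIZE["ResNet101"], "pixels per side and yields",
          FEATURE_DIM["ResNet101"], "features")
```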

Various CNN Architectures
• AlexNet This CNN has five convolutional layers, a few max-pooling layers, and three fully connected layers. The last classification layer has Softmax activation with 1000 categories. The network includes 60 million parameters [18]. To reduce overfitting, dropout was added to the top layers as a regularisation technique. In this paper, AlexNet was utilised as a feature extractor, without tuning the network's parameters, to extract 4096 features from each image patch. These patches contain only persons and were extracted from the whole frame using YOLO3. Each patch was resized to 227 × 227 pixels.
• VGG This network has 16 weight layers and can be scaled up to 19 layers. It demonstrates the advantage of representation depth for enhancing classification accuracy. With this architecture, increasing the depth to 16-19 weight layers while using very small (3 × 3) convolution filters showed a significant improvement over previous configurations [19]. VGGNet uses about three times more parameters than AlexNet. In this paper, VGG16 was utilised as a feature extractor, without tuning the network's parameters, to extract 4096 features from each image patch. Each patch was resized to 224 × 224 pixels.

• GoogleNet
This is a 22-layer-deep network with increased depth and width [20]. GoogleNet uses about 12 times fewer parameters than AlexNet. The inception module applies filters of various sizes in parallel to handle several scales, in a manner similar to a series of fixed Gabor filters of different sizes. Unlike fixed Gabor filters, however, all filters in the inception module are learned. Furthermore, inception layers are repeated many times to obtain the 22 layers of GoogleNet [20]. The network was designed to keep a computational budget of 1.5 billion multiply-adds at inference time. In this paper, we utilised GoogleNet as a feature extractor, without tuning the network's parameters, to extract 1024 features from each image patch. Each patch was resized to 224 × 224 pixels.

• Inception3
The network aims to balance the width and depth of the network. Inception includes fewer than 25 million parameters and has a computational budget of 5 billion multiply-adds per inference [21]. The computational cost of Inception is lower than that of VGGNet [19]. Therefore, Inception networks can be used in big-data scenarios where large-scale data need to be processed at a reasonable cost. It is also efficient in scenarios with limited memory or computational resources, such as mobile vision. Inception3 combines fewer parameters, batch-normalised regularisation, and label smoothing to train high-quality networks on modest-sized training sets [21]. Although Inception3 has 42 layers, its computation cost is only about 2.5 times higher than that of GoogleNet, and it is still more efficient than VGGNet [21]. In this paper, Inception3 was utilised as a feature extractor, without tuning the network's parameters, to extract 2048 features from each image patch. Each patch was resized to 299 × 299 pixels.

• ResNet
ResNet is a residual learning framework that eases the training of very deep networks [22]. The layers are reformulated as learning residual functions with reference to the layer inputs. With 152 layers, ResNet is eight times deeper than VGG nets while still having lower complexity. It has several versions, such as ResNet50, ResNet101, and ResNet152 [22]. ResNet is a supervised CNN model that has already been trained on large-scale datasets such as ImageNet.
In this paper, ResNet50 and ResNet101 were utilised as feature extractors, without tuning the networks' parameters, to extract 2048 features from each image patch. Each patch was resized to 224 × 224 pixels.

Classification
In this paper, CNN's top layers were replaced by various classifiers to determine the best classifier having the highest accuracy and F1 score. The classifiers were trained to fit the features extracted from the proposed nudity dataset and map them into two categories: nude and normal. The following classifiers replaced the last fully connected layers of pre-trained CNNs:

• Support Vector Machine (SVM)
SVM is a supervised learning model that is normally utilised for classification purposes. The SVM has many versions, the simplest of which uses a linear kernel and assumes linearly separable data. It is used when the relationship between data points is linear. On the other hand, various non-linear kernel functions, such as Gaussian and polynomial, are available [23].
SVM takes feature vectors as inputs and constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space. The optimal hyperplane, called the decision boundary, separates the feature vectors while maximising the margin to both classes. In other words, the model is trained to select the hyperplane whose distance to the nearest training vector is the largest. The following loss function is minimised [48]:
$$\min_{W, b} \; \frac{1}{2} W^{T} W + C \sum_{i=1}^{n} \max\left(0, \; 1 - y_i \left( W^{T} \varphi(x_i) + b \right)\right)$$
where W is a weight vector, b is a bias term, φ is the identity function, C is a regularisation constant, and $y_i \in \{-1, 1\}$ is the label of the training sample $x_i$. Usually, the kernel is linear and gives a linear classifier: $K(x, x') = x^{T} x'$. Non-linear kernels, however, produce non-linear classifiers without an explicit data transformation. The function φ maps the current space to a higher-dimensional space for non-linear data classification, and the dot product is applied in that space [48]. The kernel computes the inner product between two mapped samples as follows [48]:
$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$$
After the optimisation problem is solved, the sign of the decision function determines the predicted class. The decision function is calculated as a sum over the support vectors, i.e., the samples within the margin, where x is a given sample and α is the dual coefficient, which equals zero for the samples outside the margin [48]:
$$f(x) = \sum_{i \in SV} y_i \alpha_i K(x_i, x) + b$$
The radial basis function (RBF), or Gaussian kernel, was used as a non-linear kernel. Training an SVM with this kernel requires tuning C and gamma. C trades off the correct classification of training examples against the smoothness of the decision surface: a high C aims to classify all training examples correctly, whereas a low C makes the decision surface smooth. In addition, gamma determines the influence of a single training example. A small value of gamma makes the model too constrained to capture the complexity of the data. The kernel is calculated as follows [48]:
$$K(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}} \right)$$
where σ is the standard deviation, so that gamma corresponds to $1 / (2\sigma^{2})$. The polynomial kernel was used as another non-linear kernel as follows [48]:
$$K(x, x') = \left( \gamma \, x^{T} x' + r \right)^{d}$$
where d signifies the degree, γ is a scale factor, and r is a constant offset.
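The kernels and the decision function described above can be sketched in a few lines. This is a minimal illustration rather than a training procedure: the support vectors, dual coefficients (here the products of label and α), and bias are hypothetical values chosen by hand, and the polynomial kernel parameters `gamma` and `coef0` are assumptions that follow a common parameterisation.

```python
import math


def linear_kernel(x, y):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def decision_function(x, support_vectors, dual_coefs, bias, kernel):
    """Sum of dual_coef * K(support_vector, x) over the support vectors, plus the bias."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


def predict(x, support_vectors, dual_coefs, bias, kernel):
    """The sign of the decision function gives the predicted class (+1 or -1)."""
    return 1 if decision_function(x, support_vectors, dual_coefs, bias, kernel) >= 0 else -1


if __name__ == "__main__":
    # Hypothetical two-support-vector model: label -1 near the origin, +1 near (2, 2).
    support_vectors = [(0.0, 0.0), (2.0, 2.0)]
    dual_coefs = [-1.0, 1.0]
    for sample in [(0.1, 0.2), (1.9, 2.1)]:
        print(sample, "->", predict(sample, support_vectors, dual_coefs, 0.0, rbf_kernel))
```

In practice, the support vectors and coefficients would come from solving the optimisation problem with a library, not from manual selection.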

• Extreme Learning Machine (ELM)
ELM is a fast single-hidden-layer feedforward neural network with good generalisation. The number of hidden nodes is a hyperparameter to be selected manually [46]. In this architecture, the input weights and biases are generated randomly.
On the other hand, the output weights, which link the hidden layer to the output layer, are calculated analytically. The network output is given by [46]:
$$f_L(x) = \sum_{i=1}^{L} \beta_i F_i(w_i \cdot x + b_i)$$
where $F_i(\cdot)$ is the activation function of the ith hidden node, $w_i$ is an input weight, $b_i$ is a bias, L is the number of neurons in the hidden layer, and $\beta_i$ is the weight applied to the output, which is calculated as follows [46]:
$$\beta = U^{+} T$$
where U is the hidden layer output matrix, $U^{+}$ is the Moore-Penrose generalised inverse of U, T is the target matrix holding the one-hot encoded labels for the ELM binary classifier, and λ is a regularisation coefficient applied when computing the generalised inverse.
ELM reduces the overfitting problem and learns faster than gradient-based methods [46]. However, its drawback is the randomness of the input parameters.

• Random Forest (RF)
RF is a supervised ensemble learning method that combines multiple tree predictors for classification [45]. RF is used to reduce the overfitting problem. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalisation error of the forest converges to a limit as the number of trees becomes large [45]. The generalisation error also depends on the strength of the individual trees and the correlation between them. Figure 9 shows the architecture of the RF.
Several hyperparameters must be selected carefully to improve RF performance. The number of estimators, or trees, has an important impact: increasing the number of trees enhances performance and makes the predictions more stable. However, more trees increase computation time. Max features is another hyperparameter that refers to the maximum number of features considered when splitting a node [48].

• K-Nearest Neighbours (KNN)
KNN is a non-parametric technique and a simple classifier based on a similarity measure, such as a distance function between data points. KNN relies on the assumption that similar data points are close to each other, and it uses the distance between data points to capture this similarity. K, which refers to the number of neighbours, is a hyperparameter to be selected because it affects the prediction errors and accuracy. The advantages of KNN are its simplicity and ease of implementation.
In addition, apart from K and the distance function, there are no parameters to be tuned in advance [44,48]. On the other hand, KNN slows down significantly when the number of samples and independent variables increases.
The KNN algorithm proceeds as follows:
1. Initialise the number of neighbours K.
2. Calculate the distance between the unseen sample and each sample in the dataset.
3. Store the distance and the index of the sample in a buffer.
4. Sort the distances and indices inside the buffer in ascending order of distance.
5. Pick the first K samples from the sorted buffer.
6. Check the labels of the first K samples.
7. For the classification task, find the mode of the K labels and take it as the predicted category.
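The steps above can be sketched directly in code. This is a minimal illustration that assumes Euclidean distance and small in-memory lists; the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_samples, train_labels, query, k=3):
    """Classify `query` by the mode of the labels of its k nearest training samples."""
    # Steps 2-3: compute each distance and store it with the sample index in a buffer.
    buffer = [
        (math.dist(sample, query), index)
        for index, sample in enumerate(train_samples)
    ]
    # Step 4: sort the buffer in ascending order of distance.
    buffer.sort()
    # Steps 5-6: take the first k entries and look up their labels.
    nearest_labels = [train_labels[index] for _, index in buffer[:k]]
    # Step 7: the most frequent label is the predicted category.
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
    labels = ["normal", "normal", "normal", "nude", "nude", "nude"]
    print(knn_predict(samples, labels, (0.2, 0.1), k=3))
    print(knn_predict(samples, labels, (5.5, 5.5), k=3))
```

An odd K avoids ties between the two categories in the binary case.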

The Proposed YOLO-CNN-SVM Method
State-of-the-art methods for pornography detection have relied on traditional CNN-only pipelines [7,9]. In these methods, the whole frames, which include multiple objects, were resized and applied directly to one of the pre-trained CNNs to extract features, which were then classified by the attached classifier. The classifier outputs a normal or nude category. In contrast, the proposed method consists of three main blocks: a pre-trained YOLO3, a pre-trained CNN, and a classifier. The first block, YOLO3, was used for human detection to check whether persons are present in the frames. The frames may contain one, two, or a group of persons. The persons appear in various positions, sizes, and backgrounds. Some frames include persons wearing clothes of various colours, while other frames contain nude persons with various skin colours. Another challenge is the overlapping of persons in the frames. After the persons were detected and bounding boxes were drawn around them, the image patches enclosed by the boxes were extracted and resized to fit the input dimensions of the pre-trained CNN. The CNN extracted features from the person patches (regions of interest (ROIs)), which were then classified as normal or nude. The flow chart of the proposed method and the CNN-only method is shown in Figure 10.

If the frame comprises more than one person, each person image patch is passed to the CNN to produce one category, which is stored in a list. The list holds the labels of all person patches in one frame. If any element in the list has a nude label, the whole frame is considered nude. However, if all elements in the list have normal labels, the whole frame is considered normal. The advantage of this method is the ability to censor (edit or blur) specific regions in the frames instead of blurring or cutting the whole frames. This technique, in contrast to CNN-only, keeps the frames in the movie and reduces the interrupted period of the film, which helps viewers to follow the context.
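The frame-level decision rule can be sketched as follows. The function name `classify_frame` is illustrative, and treating a frame with no detected persons as normal is an assumption, since the text does not state how such frames are handled.

```python
def classify_frame(patch_labels):
    """Return the frame label and the indices of the patches that should be censored.

    `patch_labels` holds one "nude" or "normal" label per detected person patch.
    """
    nude_indices = [i for i, label in enumerate(patch_labels) if label == "nude"]
    # One nude patch is enough to mark the whole frame as nude.
    # Assumption: an empty list (no persons detected) yields "normal".
    frame_label = "nude" if nude_indices else "normal"
    return frame_label, nude_indices


if __name__ == "__main__":
    print(classify_frame(["normal", "nude", "normal"]))
    print(classify_frame(["normal", "normal"]))
    print(classify_frame([]))
```

Returning the indices of the nude patches makes it possible to blur only the corresponding bounding boxes instead of the whole frame.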

Experimental Setup and Results
In this section, an ablation study is presented to validate the impact of adding YOLO3 before the CNN. Furthermore, we evaluate and compare various pre-trained CNNs and classifiers for the nudity classification task. Several performance metrics, namely accuracy, recall, precision, false negative rate (FNR), false positive rate (FPR), F1 score, and area under the curve (AUC), were utilised to evaluate the models. The experiments were performed on a desktop computer running Windows 10 (64-bit OS) with 64.0 GB RAM and an Nvidia GeForce GTX 1080 Ti GPU (12 GB).

Performance Metrics
Multiple performance metrics were used to validate the performance of the machine learning model. Accuracy, F1 score, AUC, FPR, and FNR are important factors that should be considered when validating the nude/normal classification model.
The performance metrics are summarised as follows:
1. Accuracy is the number of samples predicted correctly divided by the total number of samples.
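The count-based metrics named in this section can be computed from the confusion matrix as in the minimal sketch below. It uses the standard definitions and assumes that nude is the positive class, so that a false negative is a nude sample classified as normal; the function name and the example counts are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g., no positive samples).
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fnr": ratio(fn, fn + tp),   # nude samples missed
        "fpr": ratio(fp, fp + tn),   # normal samples flagged as nude
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=90, fp=20, tn=80, fn=10).items():
        print(f"{name}: {value:.4f}")
```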

The First Experiment
We began by comparing the ResNet50-average pooling strategy [10] with ACORDE-50 [10] on the 2K extended version of NPDI, which was already divided into five folds for training and validation. The ResNet50-average pooling strategy gave an average accuracy of 95.04%, which was higher than the 92.43% average accuracy of ACORDE-50. Therefore, we selected ResNet50-average pooling to train the model that would be used to infer the frames in the testing film dataset.
To evaluate the existing CNN-only methods [7,10] with the testing film dataset, we conducted the experiment using the same model architecture (ResNet50+SVM) that was used in [10]. In the training stage, we used the ResNet50 CNN-only method, which was already trained on the ImageNet dataset [16], to extract features directly from the frames of the extended version of the NPDI Pornography dataset [47], and an SVM was trained to classify the extracted features as normal or nude. In the testing stage, the frames of the testing film dataset were applied to ResNet50 to extract features, which were categorised by the already trained SVM. The performance metrics were then calculated.
The existing CNN-only methods trained on NPDI were unable to detect nudity or pornography in video frames containing nude people in various positions, scales, and backgrounds. Table 3 shows the experimental results. The high FNR was caused by the misclassification of nude frames, since their nudity regions are small in scale and appear against complex backgrounds. In other words, most of the frames were classified as normal.

The Second Experiment
In this experiment, the existing ResNet50-only method [10], which was transferred to the NPDI dataset and used an SVM instead of the top layers, was compared with the proposed YOLO-ResNet50-SVM method.
The second experiment explored the limitations of the existing convolutional-only approaches [10], which apply the CNN directly to whole video frames, and the limitations of the NPDI dataset [47] in the task of small-scale nudity detection. Furthermore, it proposes utilising an object detector such as YOLO to focus the CNN's attention on the nudity region inside the frame. The outcome is a model fusion that combines YOLO and the ResNet50 CNN. In the training stage of the proposed method, a collected set of human-only images was used to train ResNet50+SVM [10]. In the testing stage, the testing film dataset was applied to YOLO, which was trained on the COCO dataset [17], to detect humans inside the frames. After that, the detected patches that contain humans were applied to the already trained ResNet50+SVM to decide whether the frame is nude. If the frame has at least one person labelled as nude, the whole frame is considered nude, whereas if all persons have normal labels, the whole frame is considered normal. Table 4 shows the performance metrics for YOLO+ResNet50. Finally, we compared the proposed YOLO+ResNet50 method with ResNet50-only, as shown in Figure 11. The proposed YOLO+ResNet50 was found to outperform ResNet50-only in accuracy and F1-score in all five videos. Moreover, the FNR of YOLO+ResNet50 was much lower than that of ResNet50-only, as shown in Tables 3 and 4.

Figure 11. Comparison between the proposed YOLO-ResNet50 and ResNet50-only [10] in terms of accuracy using the testing film dataset.
The large difference in performance between the proposed method and ResNet50-only [10] arises because the latter was trained on pornographic frames in which nudity appears at a large scale with little background variety. Real-world video content that is commonly broadcast on TV, on the other hand, often contains nudity at different scales and against various backgrounds. Therefore, our proposed method plays an important role for this type of visual content by focusing the attention on the expected nude regions before classifying them as nude or not.

Ablation Study
In this study, we gained insight into the effect of adding the YOLO model before the pre-trained CNN in the pipeline. The comparison was performed between ResNet50-only [10] and YOLO-ResNet50 using the MMU collected dataset, which includes variations in human position, size relative to the frame, clothes, skin colour, and backgrounds such as forest, snow, beach, supermarket, indoor, and street scenes. The data were divided into training and validation sets, as mentioned in Section 2.1.4. Five-fold (K = 5) cross-validation was performed to validate the improvement achieved by adding YOLO to the pipeline. The accuracy for each fold and the average accuracy were calculated for both methods, as shown in Table 5. YOLO-CNN was found to outperform CNN-only in accuracy, which increased from 85.5% to 89.5%. In addition, Table 5 shows that augmentation, as described in Section 2.1.4, improved accuracy by 2%. The confusion matrix of each fold for both methods is shown in Figure 12.
The accuracy metric alone is not adequate to validate the results. Therefore, the ROC curve was drawn to measure the performance of the two methods. The mean AUC of YOLO-CNN was found to be 97%, which is 4 percentage points higher than that of CNN-only. The ROC curves are illustrated in Figure 13. The blue curve represents YOLO-CNN, whereas the green curve represents CNN-only. In addition, performance metrics were calculated for both ResNet50-only and YOLO+ResNet50, as shown in Table 6.
The results of the ablation study validated the hypothesis of this paper: focusing the attention of the CNN on the expected nudity regions in frames helps to classify these regions better and improves the performance of nudity classification. The results also highlighted the limitation of other methods, in which the CNN takes in all the content and objects in the frame.

The Third Experiment
In the previous section, an ablation study was conducted to validate the efficiency of adding YOLO3 before the ResNet50 CNN, giving three blocks in the pipeline: YOLO3, ResNet50, and SVM. In this section, the objective is to combine YOLO3 with other CNN architectures and classifiers. Therefore, an experiment was carried out that combined each of six feature extractors with each of six classifiers, giving 6 × 6 = 36 models in total. To evaluate the performance of the 36 models, each feature extractor and classifier was tested on 25,983 human images detected from 8579 testing images, as mentioned in Section 2.1.5. An image is considered nude if any part of it contains nudity. Table 7 shows the performance metrics for all 36 models.
The settings of the classifiers were as follows. SVM was trained with three different kernels: linear, Gaussian, and polynomial. For RF, different numbers of trees were tested, and 100 trees produced the best performance with the testing dataset. To assess the impact of the distance function in KNN, different functions were evaluated, and Euclidean distance was selected for this experiment. In addition, different numbers of neighbours were tested, and K = 30 was selected for the experiments. Furthermore, the hyperparameters of ELM were investigated to optimise the results. ELM with 9000 hidden nodes and a regularisation factor of $2^{15}$ was the best candidate.
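The 6 × 6 model grid can be enumerated as in the sketch below. The list of six classifiers is an assumption based on our reading of the settings above (SVM with three kernels, RF, KNN, and ELM), and the naming scheme is purely illustrative.

```python
from itertools import product

# Feature extractors named in the paper.
FEATURE_EXTRACTORS = ["AlexNet", "VGG16", "GoogleNet", "Inception3", "ResNet50", "ResNet101"]

# Assumed reading of the six classifiers: three SVM kernels plus RF, KNN, and ELM.
CLASSIFIERS = ["SVM-linear", "SVM-Gaussian", "SVM-polynomial", "RF", "KNN", "ELM"]

# Every combination of one feature extractor and one classifier placed after YOLO3.
MODEL_GRID = [
    f"YOLO3-{extractor}-{classifier}"
    for extractor, classifier in product(FEATURE_EXTRACTORS, CLASSIFIERS)
]

if __name__ == "__main__":
    print(len(MODEL_GRID), "models")
    print(MODEL_GRID[:3])
```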

Class Activation Mapping
The global average pooling layer gives a classification-trained CNN the ability to localise objects well despite being trained only on image-level labels (without bounding boxes) [49,50]. Class activation maps (CAMs) help to visualise the scores of the predicted classes; in other words, they highlight the discriminative object parts in an image that the CNN relied on. CAM has been used in many applications, such as medical imaging tasks [50], to understand the predictions. In this paper, CAM was applied to a few images in the proposed dataset to highlight the discriminative regions, as shown in Figure 15. The patterns were extracted from the global average pooling layers to identify the complete extent of the objects.
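As a sketch of how a CAM is formed for a network that ends in global average pooling followed by a linear layer: the map for a class is the weighted sum of the last convolutional feature maps, using that class's linear-layer weights. The toy shapes, the values, and the names `class_activation_map` and `gap_class_score` are illustrative assumptions, not taken from the paper.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: cam[y][x] = sum_k w[k] * f[k][y][x].

    feature_maps: list of K maps, each an H x W list of lists.
    class_weights: K weights linking each pooled map to one class score.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def gap_class_score(feature_maps, class_weights):
    """Class score via global average pooling then a linear layer (no bias)."""
    score = 0.0
    for fmap, weight in zip(feature_maps, class_weights):
        pooled = sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        score += weight * pooled
    return score


# Two toy 2x2 feature maps and hypothetical class weights.
maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
weights = [0.5, 1.5]

cam = class_activation_map(maps, weights)
print(cam)  # [[0.5, 0.0], [0.0, 3.0]]
```

In this setting the spatial mean of the map equals the class score, which is why high-valued cells mark the regions that contribute most to the prediction.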

Conclusions and Future Work
In this paper, the problem associated with nudity detection at various scales and backgrounds was addressed. The proposed method utilised a COCO-trained YOLO3 detector, which was transferred to our dataset to determine the regions of interest that included human patches. In addition, ImageNet-trained CNNs such as AlexNet, GoogleNet, VGG16, Inception3, and ResNet were transferred to extract the features automatically from the human patches, which were then classified into two categories: normal and nude. The proposed YOLO-CNN method was compared to the CNN-only approach and was found to improve the accuracy by 4 percentage points, from 85.5% to 89.5%. In addition, the AUC increased from 93% to 97%.
Various CNN-based feature extractors and classifiers were utilised and compared. ResNet101-RF was found to improve the performance of the detection system and outperform state-of-the-art methods regarding the F1-score (90.03%) and accuracy (87.75%). It also balances the trade-off between the FNR and the FPR.
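For reference, the metrics named above follow from the confusion-matrix counts as in the sketch below, with nude taken as the positive class. The counts are hypothetical values chosen only to exercise the formulas; they are not results from the paper, and `binary_metrics` is an illustrative name.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, F1, FNR and FPR with 'nude' as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "fnr": fn / (fn + tp),  # share of nude items that were missed
        "fpr": fp / (fp + tn),  # share of normal items flagged as nude
    }


# Hypothetical counts, not taken from the paper.
m = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print(round(m["accuracy"], 3), round(m["f1"], 3), round(m["fnr"], 3), round(m["fpr"], 3))
# 0.85 0.857 0.182 0.111
```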
The results of this work are summarised as follows:
• YOLO is a good object detector that can be transferred to detect humans in the nudity dataset as a significant stage in the proposed nudity detection pipeline. This stage allows the CNN to be applied to the ROIs instead of the whole frame.
• Pre-trained CNNs can be transferred, after removing the final layers, to extract features from visual nudity content.
• Various classifiers such as SVM, RF, and ELM were used to replace the final layers of the CNN to fit the extracted features and classify them into normal and nude.
The advantages of the proposed system include the following:
1. The proposed system can automatically detect nude humans at various scales in complex backgrounds such as forests, snow, beaches, supermarkets, and streets.
2. The proposed system runs in real time, as a single frame is sufficient to detect human patches and classify them as normal or nude. This helps to censor live-captured videos.
3. The proposed nude/normal classifier is robust against various scales, positions, clothing, and skin colours.
4. It outperforms state-of-the-art methods, such as CNN-only, regarding the accuracy, F1-score, and AUC.
5. The proposed method can also edit and blur only the nude regions inside the frames. In other words, there is no need to blur or cut the whole detected frames from the video.
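The selective censoring described in the last advantage can be sketched as follows, assuming detections arrive as (x0, y0, x1, y1) boxes paired with a nude/normal label. Filling each nude box with its mean intensity is a crude stand-in for a real blur filter; the frame layout, the box format, and the name `censor_nude_regions` are hypothetical.

```python
def censor_nude_regions(frame, detections):
    """Return a copy of `frame` with only the nude boxes flattened.

    frame: H x W list of lists of grey levels.
    detections: list of ((x0, y0, x1, y1), label); x1 and y1 are exclusive.
    Each nude box is filled with its mean value (a stand-in for blurring).
    """
    out = [row[:] for row in frame]
    for (x0, y0, x1, y1), label in detections:
        if label != "nude":
            continue
        pixels = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        mean = sum(pixels) / len(pixels)
        for y in range(y0, y1):
            for x in range(x0, x1):
                out[y][x] = mean
    return out


frame = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]
detections = [((0, 0, 2, 2), "nude"), ((2, 1, 4, 3), "normal")]
censored = censor_nude_regions(frame, detections)
print(censored[0])  # [35.0, 35.0, 30, 40]
```

Pixels outside the nude box, including the patch labelled normal, are left unchanged, so the rest of the frame remains viewable.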
This work focuses on the utilisation of YOLO as a human detector. The limitation of the proposed method is therefore tied to the performance of the human detector: if the detector fails to detect a human in the frame, a nude frame will be incorrectly classified as normal. This limitation provides an opportunity for future research to enhance the detection accuracy using better detectors, such as EfficientDet [51], to increase the true positive rate of human detection.
In this work, the parameters of all layers except the top ones were frozen, and the top layers were replaced by an SVM. To improve the performance, fine-tuning the parameters of more layers on a pornography or nudity dataset could be investigated [52].
The proposed solution utilised CNNs pretrained on ImageNet-1K. In future work, Facebook's ResNeXt Weakly Supervised Learning (WSL) CNNs [53,54] could be adapted to further improve the current solution. WSL CNNs can act as fixed feature extractors to extract image-level features from the proposed dataset [54]. Furthermore, the advantage of fine-tuning these CNNs on ImageNet-1K could also be explored.
Finally, this paper proposed a new benchmark image dataset with more challenging content for nudity detection. In this research, a medium-scale dataset was used. Hence, in the future, we intend to further enhance this work with a larger number of samples to improve the performance of classification.