1. Introduction
With the recent increase in the online distribution of harmful pornographic content via new types of personal broadcasting services, systems for the automatic detection of online pornographic content are in high demand [1,2]. Since much of the recent pornographic content provided through online personal broadcasting services contains harmful scenes in either visual or auditory form, fast and accurate detection is vital. However, most previous studies on the automatic detection of pornographic content have focused on single-modal detection, which extracts and uses either visual or auditory features alone [3,4,5,6,7,8,9,10,11,12,13,14]. A key limitation of such single-modal methods is that harmful content lacking the particular detectable elements cannot be detected.
To reduce these detection errors, several multimodal detection methods that use both visual and auditory features have been studied [15,16]. These multimodal methods have shown better detection performance than earlier single-modal methods, even when portions of the harmful visual or auditory elements are absent from the content. However, since the methods proposed in [15,16] determine harmfulness only after the content has been played in its entirety, it is difficult to judge harmfulness during the early phase of playback when the content is provided as a stream on an online media platform. Therefore, a detection method is needed that can quickly and accurately detect harmful content as it is played or distributed online; in particular, a quick detection method that minimizes missed detections of harmful content is required.
Recently, to address the limitations of the existing studies [3,4,5,6,7,8,9,10,11,12,13,14,15,16] and to satisfy the requirements of online harmful content detection, a method that utilizes multiple features extracted from the visual, motion, and auditory elements was proposed in [17]. In that method, harmfulness is decided on unit segments of a fixed length into which the input content is divided, so that the harmfulness of online content can be classified as quickly as possible. To judge the harmfulness of a unit segment, four descriptors are utilized: an image descriptor of each video frame, a video segment descriptor of a continuous video frame sequence, a motion descriptor encoding the motion characteristics of the video segment, and an audio descriptor of the unit segment. These four types of descriptors are extracted from all content segments in the training dataset, and four independent component classifiers are then trained, one per descriptor type. The study in [17] combined these classifiers through a multimodal stacking ensemble, arranging the four component classifiers in descending order of their individual performance to improve the recognition performance, especially the false negative rate, as well as robustness against missing visual, motion, or auditory elements in the input content.
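For illustration, the serial decision logic of such a stacking ensemble can be sketched as follows. This is a minimal sketch, not the exact configuration of [17]: the classifier objects, the feature extractors, and the 0.5 decision threshold are illustrative assumptions.

```python
# Sketch of a serial stacking decision: component classifiers are
# consulted in descending order of their standalone accuracy, and a
# unit segment is flagged as harmful as soon as any classifier is
# sufficiently confident. Classifiers are assumed to expose a
# scikit-learn-style predict_proba interface.

def is_segment_harmful(segment, classifiers, extractors, threshold=0.5):
    """classifiers/extractors are aligned lists ordered by accuracy
    (e.g., video, image, motion, audio as in [17])."""
    for clf, extract in zip(classifiers, extractors):
        descriptor = extract(segment)              # modality-specific descriptor
        p_harmful = clf.predict_proba([descriptor])[0][1]
        if p_harmful >= threshold:
            return True                            # early exit: flagged harmful
    return False                                   # no classifier flagged it
```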
However, although the visual and auditory elements of content have characteristics that change over time, each descriptor used in [17] consists of simple static features obtained by averaging over each unit content segment, so these temporal changes are reflected only to a limited degree. In addition, since three different types of visual features (the image, video, and motion descriptors) are extracted from the visual elements of the input content, the method wastes computation time by performing three feature extractions independently. Because extracting the motion features from an input segment via optical flow is particularly time-consuming in [17], fast detection is problematic in online streaming service environments. Furthermore, owing to the characteristics of the model stacking method, when the performance of a specific component classifier is relatively poor, it dominates the final decision, producing many false positives and reducing overall accuracy. In particular, since the detection accuracy of the audio component classifier is lower than that of the other component classifiers, the resulting overall detection performance can be lower than expected [17].
To resolve these weaknesses, we propose an enhanced multimodal stacking scheme that can quickly and accurately detect harmful online content on new types of personal broadcasting services. In the proposed technique, instead of extracting three descriptors from the visual elements, VGG-16 [18] and a bi-directional RNN using LSTM (long short-term memory) units [19] are used to extract a single implicative visual descriptor that reflects changes over time [20]. This improves both the accuracy of the harmfulness decision and the computation time required for feature extraction and classification. In addition, to capture the bidirectional correlations between neighboring auditory signals, a multilayered dilated convolutional block [21,22,23] is used to extract an implicative auditory descriptor, improving the accuracy of the audio component classifier over that of [17]. As the first detection step, harmfulness is decided by a fusion component classifier trained on both the visual and auditory features, which detects hazardous content with high accuracy in a short period of time. When the input content is classified as non-harmful, it is then checked serially by the video component classifier and the audio component classifier, trained on the visual and auditory features, respectively. With the proposed multimodal stacking scheme, the harmfulness of the input content can be detected quickly by the first fusion classifier, and any hazardous content missed at this first filtering stage can be caught later by the video classifier or the audio classifier in serial order.
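A minimal PyTorch sketch of this kind of visual descriptor extractor is given below. The hidden size, the use of the final BiLSTM hidden states as the descriptor, and the pooling choices are illustrative assumptions, not the exact architecture of [20].

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ImplicativeVisualExtractor(nn.Module):
    """Sketch: per-frame VGG-16 features -> bi-directional LSTM ->
    a single descriptor summarizing the segment over time."""
    def __init__(self, hidden_size=128):
        super().__init__()
        backbone = vgg16(weights=None)      # pretrained ImageNet weights in practice
        self.cnn = backbone.features        # convolutional feature maps
        self.pool = nn.AdaptiveAvgPool2d(1) # one 512-dim vector per frame
        self.bilstm = nn.LSTM(512, hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, frames):              # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))  # (B*T, 512, h, w)
        x = self.pool(x).flatten(1).view(b, t, 512)
        _, (h_n, _) = self.bilstm(x)        # h_n: (2, B, hidden)
        # Concatenate the forward and backward final states, so the
        # descriptor reflects temporal change in both directions.
        return torch.cat([h_n[0], h_n[1]], dim=1)

# Example: a 10-frame segment of 224x224 frames -> one 256-dim descriptor
desc = ImplicativeVisualExtractor()(torch.randn(1, 10, 3, 224, 224))
```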
The remainder of this paper is organized as follows. Section 2 reviews previous related studies. Section 3 describes the proposed multimodal stacking scheme: its overall procedure, the extraction of the implicative visual and auditory features, the development of the component classifiers, and the stacking ensemble of the component classifiers. Section 4 presents the experiments and the analysis of their results. A short discussion is given in Section 5. Finally, Section 6 presents the conclusions of the study and directions for future work.
2. Literature Review
Existing multimodal pornographic detection schemes generally use two or more features among the visual features extracted from either a single video frame or a video segment (i.e., a continuous sequence of video frames), motion-based features, and acoustic features extracted from the content. Despite differences in the features used or in how they are combined, most multimodal pornography detection methods involve three common steps. The first step extracts the features that each model uses from the corresponding elements of the input content. In the past, low-level features such as skin color, specific female body areas, or the distribution of skin pixels were utilized. However, as described in [24], these low-level features do not have sufficient discriminative power to judge the harmfulness of content. Recently, low-level features extracted from the visual and auditory elements have been converted to high-level features by applying Bag of Words (BoW) or deep learning frameworks [15,16]. The second step trains an overall classification model, using an appropriate machine learning scheme on the extracted multimodal features, to recognize pornographic content. In general, in the mid-level fusion approach, all multimodal features are first combined into one representative integrated feature set used to train a single classifier, whereas in the late fusion approach, several component classifiers are built from the individual multimodal features. The late fusion method is the most common in recent studies [15,16] because of its superior performance, as described in [15]. In an earlier study [25], simple methods with a pre-determined threshold or simple machine learning models such as a decision tree and naïve Bayes were used. Recent studies [15,16], which mainly use high-level features, employ support vector machines (SVMs), neural networks, and deep learning architectures, which can better capture the non-linearity of the classification hyperspace. The final stage is the output engineering step, where, in the late fusion approach, the classification results from all component classifiers are integrated.
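The contrast between the two fusion approaches can be sketched schematically as follows. This is a minimal sketch using scikit-learn: the feature arrays, the linear SVM choice, and the probability-averaging rule are illustrative assumptions, not the configurations of [15,16].

```python
import numpy as np
from sklearn.svm import SVC

# Dummy per-segment features and harmfulness labels, for illustration only.
X_vis, X_aud = np.random.rand(100, 64), np.random.rand(100, 32)
y = np.random.randint(0, 2, 100)

# Mid-level fusion: concatenate the modalities first, train one classifier.
mid = SVC(kernel="linear", probability=True)
mid.fit(np.hstack([X_vis, X_aud]), y)

# Late fusion: one component classifier per modality, then combine outputs
# (here by averaging probabilities; stacking a meta-classifier on the
# component outputs is another common choice).
vis_clf = SVC(kernel="linear", probability=True).fit(X_vis, y)
aud_clf = SVC(kernel="linear", probability=True).fit(X_aud, y)
p_harmful = (vis_clf.predict_proba(X_vis)[:, 1]
             + aud_clf.predict_proba(X_aud)[:, 1]) / 2
```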
However, in the case of [15], since all of the features were extracted from the visual elements, detection remained difficult when only the auditory elements of the input content were harmful, just as in existing single-modal methods that use only visual elements. In the case of [16], although this disadvantage of [15] was compensated for by using features extracted from both the visual and auditory elements, the visual features consisted of static features extracted from a single still image, so the method still could not detect harmful content reliably. Moreover, since the methods in [15,16] require the entire piece of content to determine harmfulness, it is difficult for them to quickly detect harmful content that is played or distributed online, and certain types of harmful content may be missed in the detection process.
To resolve the disadvantages of these previous studies, the authors in [17] proposed a pornographic video detection method that offers robust detection performance even when some of the elements used for detection are missing. The detection process in [17] is also composed of three steps. In the feature extraction step, four types of descriptors are extracted: an image descriptor containing the static features of a video frame, a video segment descriptor containing the static features of a video segment, a motion descriptor representing the motion features in a video segment, and an audio descriptor containing the static features of an audio segment. These descriptors are extracted by dividing the input content into 10-s unit content segments, and each segment is judged for harmfulness individually, instead of using the entire piece of content, to enable early detection. The four descriptors are used to train four component classifiers via linear SVMs, each of which produces a probability that the input content is pornographic. Lastly, to combine the probability values into the final decision, a model ensemble technique, model stacking, is utilized to improve the final decision accuracy, especially the true positive rate. In [17], to ensure that pornographic videos are found as early as possible, the component classifiers are stacked in descending order of their individual accuracy, namely the video classifier, the image classifier, the motion classifier, and then the audio classifier. This method not only performs well in detecting typical pornographic scenes with abundant harmful audiovisual elements but also detects scenes lacking some of the elements needed for reliable pornographic detection.
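The segmentation step can be sketched trivially as below. How [17] handles a final partial segment is not specified here, so this sketch simply truncates it; the 10-s unit length follows the description above.

```python
def split_into_unit_segments(duration_s, unit_len_s=10.0):
    """Yield (start, end) times of fixed-length unit segments, so each
    segment can be judged for harmfulness as soon as it is available."""
    start = 0.0
    while start < duration_s:
        yield start, min(start + unit_len_s, duration_s)
        start += unit_len_s

# e.g., a 35-second clip -> [(0, 10), (10, 20), (20, 30), (30, 35)]
print(list(split_into_unit_segments(35)))
```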
However, the method in [17] has three limitations. First, the video, motion, and audio descriptors that express the characteristics of the visual or audio elements of a content segment are composed of static feature values averaged over the segment. Descriptors created this way cannot reflect the temporal relationships between the signals within the segment, so the performance of the corresponding classifiers may be degraded because these static properties do not sufficiently reflect changes over time. Second, although the model stacking technique can increase the true positive rate of the final decision, the overall classification accuracy decreases because the false positive rate rises when the performance of some component classifiers is poor. Third, processing time is wasted because the image, video, and motion descriptors are all extracted from the visual elements, and three component classifiers are then individually trained and used to decide the harmfulness of the content. Because a great amount of time is required to extract these features and detect harmfulness, the method may be inadequate for detecting harmful content in online service environments.
5. Discussion
The main objective of this study is to develop an enhanced multimodal stacking scheme that can be used in real-time streaming environments by reducing the computation time required for feature extraction and for judging the harmfulness of the input content. To detect harmful content accurately, the implicative visual and auditory features are extracted by a bi-directional RNN with VGG-16 and by a multilayered dilated convolutional network, respectively. Three component classifiers are then trained: a video classifier using only the implicative visual features, an audio classifier using only the implicative auditory features, and a fusion classifier using both features together. To reduce the detection time, we decreased the number of stacked component classifiers from four, as in our previous scheme [17], to three. These three component classifiers are stacked in the enhanced ensemble scheme, in the serial order of fusion classifier, video classifier, and audio classifier, to reduce false negative errors and enable quick online detection. According to the analysis of the experimental results, the proposed scheme achieves a true positive rate of 95.40%, an accuracy of 92.33%, and a false negative rate of 4.60%.
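For concreteness, a multilayered dilated convolutional block of the kind used for the auditory features can be sketched as follows. The channel sizes, the number of layers, the doubling dilation schedule, and the mean-pooled descriptor are illustrative assumptions, not the exact network of [21,22,23].

```python
import torch
import torch.nn as nn

class DilatedAudioBlock(nn.Module):
    """Sketch of a multilayered dilated 1-D convolutional block over an
    audio feature sequence (e.g., spectrogram frames). Doubling dilation
    rates widen the receptive field so each output position reflects
    correlations with both earlier and later neighboring signals."""
    def __init__(self, in_ch=40, ch=64, n_layers=4):
        super().__init__()
        layers = []
        for i in range(n_layers):
            d = 2 ** i                      # dilation 1, 2, 4, 8
            layers += [nn.Conv1d(in_ch if i == 0 else ch, ch,
                                 kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                   # x: (B, in_ch, T)
        h = self.net(x)                     # (B, ch, T), length preserved
        return h.mean(dim=2)                # one implicative descriptor

# Example: 40 mel bands over 500 frames -> one 64-dim auditory descriptor
desc = DilatedAudioBlock()(torch.randn(1, 40, 500))
```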
In recent years, many studies have reported high performance in harmful content detection using various deep learning approaches [6,7,8,9,10,11,13,14]. Among them, some use video frame images or video clips [7,8,9,10,11], motion analysis [6], or age prediction from facial images [14] as the visual element of the input content to determine harmfulness. Comparing the reported results, the approach of [6], with an accuracy of 95.1%, and the approach of [7], with a true positive rate of 97.52%, showed better performance than the enhanced multimodal stacking scheme suggested in this study. However, since the techniques in [6,7,8,9,10,11,13,14] cannot detect harmful content on the basis of its acoustic elements, the scheme proposed in this paper, which includes auditory detection, can be considered more advanced.
In addition, we investigated as many previous studies as possible that, like this study, utilize both the visual and auditory elements simultaneously for harmful content detection. The performance of the proposed method is superior to the true positive rate of 94.44% reported for the current state-of-the-art technique in this field [16]. However, because the nature of this research field makes it difficult to use the same data for performance comparisons, relative superiority cannot be determined by simply comparing the numerical values published in each paper.
To provide a meaningful performance comparison, the enhanced stacking scheme proposed in this study was compared with the multimodal stacking scheme of our previous study [17] using the same test data set. According to the analysis of the experimental results, the enhanced multimodal stacking scheme achieves an improved true positive rate of 95.40% and a false negative rate of 4.60%, compared with 94.33% and 5.67%, respectively, for the previous scheme. In addition, the proposed scheme can detect harmful content up to 74.58% faster, and on average 62.16% faster, than the previous scheme.
To the best of our knowledge, this is the first study to report the detection time required to determine harmfulness with a multimodal stacking ensemble technique and to propose an online pornographic content detection scheme for online streaming environments. The proposed method shows higher accuracy and a lower false negative rate with faster detection times, demonstrating greater harmful content filtering performance in online environments. However, because of the imperfect performance of the component classifiers, especially the audio classifier, the false negative rate of 4.6% still demands improvement. Moreover, since each component classifier must be trained separately, a great amount of time is still required to train all the classifiers. Additional efforts are needed in the future to develop an optimized integrated model capable of end-to-end learning.
6. Conclusions
In this paper, a multimodal stacking ensemble scheme for online pornographic content detection is proposed. In the stacking ensemble scheme, three component classifiers, trained using only the implicative visual features, only the implicative auditory features, and both feature types together, are arranged serially. To detect harmful content quickly, the input content is divided into unit content segments that serve as the units of harmfulness detection. We also propose an extraction process for the implicative visual and auditory features, which implicatively express the signal pattern changes over time within an input unit content segment, to detect harmful content more accurately. The two extracted feature types are used independently to train the video classifier and the audio classifier, and together to train the fusion classifier, and the trained classifiers serve as the component classifiers. In addition, we apply a stacking ensemble scheme that stacks the fusion classifier, video classifier, and audio classifier in order, for early detection and to avoid missing any harmful content. According to the analysis of the experimental results, the proposed scheme achieved a true positive rate of 95.40% and an accuracy of 92.33%. However, the false negative rate was about 4.60% because of the imperfect performance of the component classifiers, especially the audio classifier. Therefore, future studies should focus on improving the performance of the audio component classifier.