Article

Automatic Movie Tag Generation System for Improving the Recommendation System

Department of Computer Science and Engineering, Korea University of Technology and Education, Cheonan 31253, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(21), 10777; https://doi.org/10.3390/app122110777
Submission received: 31 August 2022 / Revised: 20 October 2022 / Accepted: 20 October 2022 / Published: 24 October 2022
(This article belongs to the Special Issue Future Information & Communication Engineering 2022)

Abstract

As the content industry develops, the demand for movie content is increasing. Accordingly, the content industry is actively developing hyper-personalized recommendation systems that match consumers’ tastes. In this paper, we study the automatic generation of movie tags to improve movie recommendation systems. We extracted background sounds from movie trailer videos, analyzed them using the STFT (Short-Time Fourier Transform) and major audio attribute features, and created a genre prediction model. The experimental results show that the tags extracted by the model closely match those in the pre-collected dataset. We propose a methodology for automatically predicting genre information for movies from their trailer videos, which helps reduce the time and effort required to generate metadata for a recommendation system.

1. Introduction

With the development of the Internet, a large amount of data and content is being produced worldwide. Today, data have become a key factor in everything, and the volume of data is growing exponentially. The penetration rate of e-commerce in web-based businesses varies across countries, such as the US (266 million, 84%) and France (54 million, 81%), but the number of new members increases steadily every month. According to IBM, Internet users generate 2500 trillion bytes of information every day, and 90 percent of the information on the planet today was created in the last three to four years [1].
Many studies are being conducted on how such data and content are consumed, and this research continues to develop. Among these technologies, the recommendation system has long been used to deliver data and content to users on the Internet.
A recommendation system is an information tool that helps users to determine the items they want from a large number of items available. The main goals of a recommendation system are to predict the ratings given to items by specific users and to help users find the best solution in a list of available items [2,3]. Many companies, such as Netflix, YouTube, and Amazon, use recommendation systems to serve users and make profits [4].
For example, if we want to buy books, listen to music, or watch movies, there is a recommendation system working in the background that makes suggestions to the user based on their previous actions [5]. Many platforms work on recommendation systems, such as Netflix, which suggests movies; Amazon, which suggests products; Spotify, which suggests music; LinkedIn, which recommends jobs; and social networking sites that suggest other users [6,7,8].
The benefits these companies gain from recommendation systems are shown in Table 1.
As such, with the continuous development of research and technology, the recommendation system method continues to develop, and research is being conducted more actively based on content-providing platform companies.
A movie recommendation system helps movie lovers avoid searching through vast online databases and reduces the time needed to find the right movie by suggesting relevant, top-tier titles.
These movie recommendation systems use various techniques, such as content-based filtering and collaborative filtering algorithms. A content-based algorithm generates and utilizes various keywords about the movie, such as genre, title, director, actor, and atmosphere, and makes recommendations using the movie’s metadata or tags together with user information. Currently, a movie’s metadata or tags are manually entered by an operator, which means that the recommendation system has not been fully automated.
In this paper, we conduct a study to automatically generate the tags (genre, mood, production country, title) of movie content required for recommendation systems through artificial intelligence and to verify the consistency between the tags derived by the proposed method and manually entered tags.
This paper is structured as follows. Section 2 introduces the theoretical background related to sound signal processing. Section 3 reviews related studies on automatic tag generation for movie content. Section 4 and Section 5 describe the proposed method and the results. Finally, Section 6 presents the overall conclusions of the proposed method.

2. Background

2.1. Sound Signal Processing

2.1.1. STFT (Short-Time Fourier Transform)

The Fourier transform decomposes a function into its frequency components over time or space. In this process, information about time or space disappears, so it is not possible to determine at what point the signal corresponding to each frequency occurred. The STFT addresses this by dividing the data into sections over time and applying the Fourier transform to each section [9,10]. In this way, the change in frequency over time can be observed. The STFT is defined as follows:
$$\mathrm{STFT}(t,\omega) = \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-j\omega\tau}\, d\tau$$
Here, x(τ) is the signal to be analyzed, and w(τ − t) is a window function. The larger the window, the finer the frequency resolution; however, the time resolution becomes coarser, making it difficult to localize events accurately in time. Therefore, it is important to choose an appropriate window size.
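As a minimal illustration, an STFT spectrogram of the kind used later in this paper can be computed with an audio library such as librosa; the library choice and the file name are assumptions for illustration, not the authors' implementation.

```python
# Minimal STFT sketch (librosa is an assumed tool; the paper does not name its library).
import numpy as np
import librosa

y, sr = librosa.load("trailer_audio.wav", sr=22050)    # hypothetical input file

# n_fft is the window size: a larger window gives finer frequency resolution
# but coarser time resolution, as discussed above.
D = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)  # spectrogram in dB
print(S_db.shape)                                       # (1 + n_fft/2, number of frames)
```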

2.1.2. ZCR (Zero-Crossing Rate)

The zero-crossing rate (ZCR) is the rate of sign changes in an audio signal during a given period. The ZCR is simple to compute and is widely used for speech/music discrimination, as well as in areas such as music genre classification, highlight detection, speech analysis, speech detection, and background sound recognition [11].
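For reference, the ZCR of a clip can be computed per frame and averaged, for example with librosa (an assumed tool):

```python
# ZCR sketch: fraction of sign changes per frame (librosa is an assumed tool).
import librosa

y, sr = librosa.load("trailer_audio.wav")
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)
print(zcr.mean())   # a higher mean ZCR typically indicates noisier or more percussive audio
```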

2.1.3. MFCCs (Mel-Frequency Cepstral Coefficients)

MFCCs are the coefficients that collectively make up the mel-frequency cepstrum (MFC). They are obtained by applying a DCT (Discrete Cosine Transform) to the Mel-spectrogram and are often used to distinguish music genres [12].
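A typical extraction, assuming librosa as the audio toolkit, looks like the following (the Mel filter bank and DCT are applied internally):

```python
# MFCC sketch: Mel filter bank + log + DCT are handled inside librosa.feature.mfcc.
import librosa

y, sr = librosa.load("trailer_audio.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # 20 coefficients per frame
print(mfcc.shape)                                     # (20, number of frames)
```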

2.1.4. Spectral Feature

  • Spectral Centroid
The spectral centroid represents the location of the center of mass of the spectrum and is a measurement that is highly related to the brightness of the sound [13] (a short computational sketch for both spectral features follows this list).
  • Spectral Roll-off
The spectral roll-off represents the frequency at which a certain percentage of total spectral energy (typically 80% to 90%) is concentrated in the spectrum [14].
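Both features can be computed per frame; the sketch below assumes librosa as the toolkit and an 85% roll-off threshold, which are illustrative choices rather than the authors' settings.

```python
# Spectral centroid ("center of mass" of the spectrum) and roll-off (frequency below
# which 85% of the spectral energy lies), computed per frame; librosa is an assumed tool.
import librosa

y, sr = librosa.load("trailer_audio.wav")
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)
print(centroid.mean(), rolloff.mean())
```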

2.2. ResNet34

ResNet34 is an architecture developed by Microsoft that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015 [15]. Its plain layer configuration is based on the structure of VGG-19. Figure 1 shows the shortcut structure used by the ResNet34 algorithm, in contrast to the plain mapping of a basic CNN [16].
The shortcut simply adds a connection from the input directly to the output of the existing mapping. Therefore, it does not change the number of parameters, and the only additional operation introduced by the shortcut connection is an element-wise addition. In the case of ResNet34, results are obtained through 16 shortcut structures. This structure reduces the difficulty of parameter optimization and avoids vanishing gradient problems when constructing deep networks.
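To make the shortcut concrete, the following is a minimal sketch of one residual (shortcut) block in PyTorch; it is illustrative only and not the authors' implementation.

```python
# Minimal residual block sketch: two 3x3 convolutions plus an identity shortcut.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut: the input itself
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # the only extra operation is this addition
        return self.relu(out)

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```

ResNet34 stacks 16 such blocks (with projection shortcuts where the feature map size changes between stages).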

3. Related Works

3.1. Automatic Metadata Generation System

Marko et al. (2009) [17] proposed an automated metadata generation system that uses the relationships between existing resources as a mediator for propagating metadata from metadata-rich resources to metadata-poor resources. Existing resource metadata are analyzed through an associative network, metadata are propagated to resources lacking them, and the propagated metadata are then verified by humans to produce automatically generated metadata. Because the approach is independent of content analysis, it can be applied to different resource media types and has a lower resource cost.
Wangsung C. et al. (2009) [18] proposed a method to reduce metadata construction costs by using the scripts of video content. They extracted keywords for each video scene from the script as follows: first, the entire video is divided into scenes; then, each scene is matched with the corresponding scene in the script; finally, keywords are extracted based on the script content. Such scene-level keyword extraction for video metadata is useful for interactive services that provide additional information about the current scene and relevant advertisements during video viewing.

3.2. Video Metadata Tagging

As a related previous study, there is a model called ‘ViS4mer’ that is effective for long-range video understanding using self-attention. Using image and natural language processing, it analyzed the relationships between the people appearing in a video and its genre [19]. In addition, there is a system that extracts keywords from automatically generated subtitles and tags the data together with images [20].

3.3. Relevance between Audio and Movie

Gorbman (1987) [21] showed that music “mainly serves as a flag of emotion” in film production. She noted that listeners can receive narrative signals, such as the movie’s background and character settings, through film music.
Jon G. et al. (2018) [22] experimentally showed that the audio features of a soundtrack have a significant impact on user ratings of the movie in which it appears. They proposed that certain musical aspects (acoustic, tempo) can generate audience responses very effectively in the context of film music.
Thus, it can be seen that auditory signals in movies play a very important role in effectively capturing the audience’s attention [23,24].

3.4. Emotional Classification of Music

Research continues on classifying emotions in music through feature extraction [25,26,27]. Categorical approaches classify music into emotions such as happiness, sadness, anger, fear, and disgust, while Russell’s circumplex model of emotion represents the mood on a two-dimensional plane of arousal and valence.
Existing research suggests methods of automatically generating metadata to provide customized content. These methods fall into two broad categories: the first uses existing metadata resources, and the second uses video scripts. Both have demonstrated the automatic generation of the metadata required for recommendations.
In this paper, we aim to automatically generate movie tags as a basis for recommending movie content to users. To this end, tags for the title, genre, mood, and production country are automatically created by analyzing the sound (background music, voice) and text properties of the movie content, in contrast to existing studies. The automatically generated tags were compared with the existing manual tags to verify the proposed method.

4. Research Method

The system for automatically extracting movie tags consists of five steps. In the first step, we collect video content files. In the second step, audio data are extracted from the video content. In the third step, the audio data are separated into voice data and sound data. The voice data are used to analyze the production country, while the sound data are converted into an STFT spectrogram from which the genre tag is extracted through data processing; the sound data are also used to extract the mood tag. In the last step, the tag information is generated. The details of each step are described below; together they form a single pipeline from video content input to tag extraction, as shown in Figure 2.

4.1. Extracting Video Data

To verify the tag information extracted from the video data, the results were compared using the MPST dataset [28] and the “MovieLens” dataset [29], which constructed about 70 types of tag information through natural language processing of movie plot synopses. Basic information, such as the film title, screening time, and screening year, was obtained through the unique identifier included in the IMDb dataset [30]. The trailer videos were obtained through the “MovieNet” Trailer 30K URL list [31].

4.2. Pre-Data Processing

The sound of a movie is an important component of the movie. From the background sound, information about the overall atmosphere of the movie can be extracted, and the movie genre can be distinguished by analyzing it [32]. In addition, the voice can be used to determine the country where the movie was made. In this paper, the movie trailer files were pre-processed in three stages.
First, the audio is extracted from each of the 933 trailer videos and split into speech data and background sound data using ‘spleeter’, a source separation module. The speech data are used for production country analysis, and the background sound data are used for STFT (Short-Time Fourier Transform) spectrum extraction. Second, the STFT spectrum (frequency versus time) is extracted from the sound data, visualized, and stored. Finally, the STFT spectrogram images are processed as shown in Figure 3 in order to extract genre tags. To use the stored STFT spectrograms as training data, all color bars and text other than the spectrogram itself are removed.
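A sketch of this pre-processing stage is shown below; the file paths, the 2-stem spleeter model, and the use of librosa/matplotlib for the spectrogram image are assumptions made for illustration, not the authors' exact implementation.

```python
# Pre-processing sketch: (1) separate voice and background sound with spleeter,
# (2) compute the STFT spectrogram of the background sound, (3) save it as an image
# with axes, ticks, and color bar removed so only the spectrogram remains.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")                  # vocals vs. accompaniment
separator.separate_to_file("trailer_audio.wav", "separated/")

y, sr = librosa.load("separated/trailer_audio/accompaniment.wav")
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots(figsize=(4, 4))
librosa.display.specshow(S_db, sr=sr, ax=ax)
ax.set_axis_off()                                         # strip all text and ticks
fig.savefig("spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```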

4.3. Genre Analysis

The background sound and sound effects of a movie are important factors that determine its atmosphere and genre [33]. Therefore, we train a deep learning model for genre tag extraction using the STFT spectrograms refined during data processing.
Analyzing the STFT spectrograms for each movie genre shows that spectrogram waveforms of movies in the same genre have similar patterns. This suggests that movies of the same genre tend to have similar spectral patterns, so the movie genre can be extracted by learning from the STFT spectrograms. The genre analysis system based on the STFT spectrogram is constructed as shown in Figure 4. For image classification with the STFT spectrograms, transfer learning is performed on the ResNet34 model, and SGD is used to adjust the weights of the neural network and increase the learning speed.
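A minimal transfer-learning sketch with torchvision's ResNet34 and SGD is shown below; the directory layout, image size, and hyperparameters are illustrative assumptions rather than the settings used in the experiments.

```python
# Transfer-learning sketch: fine-tune an ImageNet-pretrained ResNet34 on the
# spectrogram images, with one output per genre class and an SGD optimizer.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("spectrograms/train", transform=transform)  # 6 genre folders
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 6)   # Action, Comedy, Crime, Drama, Horror, Romance

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```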

4.4. Country of Manufacture Analysis

Since the actors in a movie speak the language of the country where it was produced, the production country can be analyzed by separately extracting the actors’ voice data from the movie trailer. The production country analysis was constructed using ‘Speech_recognition’, Google’s voice recognition AI, as shown in Figure 5.
From the separated voice data, sections containing a human voice are obtained through voice recognition and converted into text. The language of the converted text is then detected and stored as the production country tag.
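A sketch of this step is given below using the SpeechRecognition package (Google Web Speech API); the langdetect step and the file path are assumptions for illustration, since the paper does not name that part of the tooling.

```python
# Production-country sketch: transcribe a speech section, detect its language,
# and map languages shared by several countries to a grouped tag (see Section 5.2).
import speech_recognition as sr
from langdetect import detect   # assumed language-detection tool

recognizer = sr.Recognizer()
with sr.AudioFile("separated/trailer_audio/vocals.wav") as source:
    audio = recognizer.record(source, duration=30)        # a section containing speech

text = recognizer.recognize_google(audio)                  # Google speech-to-text
language = detect(text)                                    # e.g., "en"

country_tag = "Anglosphere" if language == "en" else language
print(country_tag)
```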

4.5. Background Sound Analysis

Research is continuously being conducted to predict the atmosphere, genre, and other attributes by analyzing audio and music [34]. Leveraging audio feature vectors, or feeding spectrograms extracted from audio into deep learning models such as CNNs, to improve audio understanding is an ongoing challenge in deep learning [35]. In this study, two main methods were used to analyze the background sound of the trailer videos. First, we extracted the main audio features (STFT, RMS, spectral centroid, spectral bandwidth, spectral roll-off, MFCC, and zero-crossing rate) using an audio analysis tool. Second, we extracted the top 10 tags for each background sound using musicnn, a CNN-based model that achieved the best performance in automatic music tagging [36]. Through these methods, we characterized the background sound audio and built a random forest model for genre prediction.
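The sketch below illustrates the first method and the random forest step, assuming librosa for feature extraction and scikit-learn for the classifier; the musicnn top-10 tags from the second method could be appended to each feature vector. It is a minimal sketch, not the authors' code.

```python
# Background-sound sketch: one feature vector per clip (frame-level features averaged),
# then a random forest genre classifier. Tooling (librosa, scikit-learn) is assumed.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(path):
    y, sr = librosa.load(path)
    feats = [
        librosa.feature.rms(y=y).mean(),
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
        librosa.feature.zero_crossing_rate(y).mean(),
    ]
    feats.extend(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1))
    return np.array(feats)

def train_genre_model(clip_paths, genre_labels):
    # clip_paths / genre_labels: separated background-sound clips and their genre tags.
    X = np.stack([extract_features(p) for p in clip_paths])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, genre_labels)
    return clf   # clf.feature_importances_ shows which features contributed most
```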

5. Results

5.1. Results of Genre Analysis

A total of 1013 samples were extracted through the previous steps. After deleting unnecessary data, 935 samples remained. The data were classified into six genre classes: Action, Comedy, Crime, Drama, Horror, and Romance. Of the data, 70% was used as training data and the remaining 30% as test data. Through transfer learning, the genre recognition results shown in Figure 6 were obtained, and the class keywords (Action, Comedy, Crime, Drama, Horror, Romance) of each recognized image were saved to a CSV file.
Table 2 shows the accuracy and learning time over 100 epochs for the ResNet34, VGG-19, and MobileNet models. MobileNet reached 91.4164% accuracy in the last epoch, with a maximum accuracy of 92.9900%. VGG-19 reached 94.4206% accuracy in the last epoch, with a maximum accuracy of 95.8512%, about 3% higher than MobileNet. ResNet34 reached 95.0086% accuracy in the last epoch, with a maximum accuracy of 96.2134%, about 0.5% higher than VGG-19. In addition, the largest advantage of ResNet34 is its training speed: completing 100 epochs took 625.6896 s for MobileNet, 708.3994 s for VGG-19, and 601.7433 s for ResNet34. ResNet34 showed the best performance in both training speed and accuracy in this study.

5.2. Results of Country of Production Analysis

Through the pre-processed audio files, the results shown in Table 3 were obtained; all of the analyzed movies were produced in English. This is because the dataset described in Section 4.1 includes only videos produced in English. Even when people spoke the same language, it was difficult to distinguish between countries because of differences in vocabulary and accent, so countries using the same language were grouped together.

5.3. Results of Genre Prediction with Musical Features and 10 Tags

Using the sound data extracted from the trailer videos and separated from the voice data, a random forest model was constructed to classify the genre tag based on the major feature values. The experiment was performed with 100 audio samples classified into six genres: action, comedy, crime, drama, horror, and romance. A prediction rate of about 43% was observed, and MFCC was confirmed to be the most important feature.

5.4. Results of Automatic Tag Generation System

The proposed automatic tag generation system takes movie data as input and analyzes the sound and voice. Tags are generated based on this analysis, and the tag list is stored as a CSV file. The tags were generated for movies in the IMDb dataset. As shown in Table 4, 15 randomly selected entries are summarized as results. Table 4 compares the actual tags provided in the dataset with the tags automatically generated by the proposed system. Although the numbers of tags differ, the extracted tags are generally included among the actual tags. The difference in the number of tags can be addressed by increasing the number of classification items or adding keywords to the system.
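The final collection step can be as simple as writing one row per movie to a CSV file; the sketch below is illustrative and uses two entries taken from Table 4 as example rows.

```python
# Sketch of storing the generated tag list as a CSV file (example rows taken from Table 4).
import csv

generated_tags = [
    {"title": "Captain Ron", "genre": "Comedy", "tags": "Comedy", "country": "Anglosphere"},
    {"title": "Die Hard", "genre": "Action", "tags": "action, suspenseful", "country": "Anglosphere"},
]

with open("generated_tags.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "genre", "tags", "country"])
    writer.writeheader()
    writer.writerows(generated_tags)
```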
Video platforms are already used by many consumers worldwide, and platform companies continuously study recommendation systems to provide customized services. Automating the generation of tags for movie content through the methods proposed in this paper will help platform companies that provide recommendation services reduce the time and cost of matching content to consumers.

6. Conclusions

Research on automatically predicting tags for movies is still ongoing. In addition, studies are being conducted to understand movies using the unique characteristics of trailer videos. In previous research, researchers mainly attempted to tag using resources such as images and synopses.
In this paper, we conducted a study on tagging that focuses on the audio (voice and background sound) among the various resources of a film. We investigated a method to automatically generate tags for the movie content used in recommendation systems and proposed such a system. Through the proposed system, we compared the consistency of the automatically generated tags with the existing manually generated tags; a high match rate between the compared tags supports the feasibility of automating tag generation for recommendation systems.
The system we built allows video platform operators to provide a recommendation system in which tags are generated automatically through artificial intelligence, without manually entering tags for movie content. Companies that provide movie content can use it to create tags at the same time as movie data are uploaded. In addition, a recommendation service can be provided in real time by matching the extracted tags with the user’s interest information.
However, our proposed method is still limited in the types of tags it generates, and the tag types need to be further expanded. Another direction is to utilize other modalities of the video, such as keyframe images or motion features. In future work, we will expand the movie tag types and optimize tag generation of movie content for real-time processing.

Author Contributions

Conceptualization, H.P. and S.Y.; methodology, Y.Y. and S.L.; software, Y.Y. and S.L.; data curation, Y.Y., S.L. and H.P.; investigation, S.L. and S.Y.; writing—original draft preparation, H.P. and S.Y.; writing—review and editing, S.L., H.P. and S.Y.; supervision, I.-Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Basic Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A3057800).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Movie tag information was obtained from https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags (accessed on 1 July 2022) and was made public by [28]. The data used in this study were obtained from https://movielens.org/ (accessed on 2 July 2022) and were made public by [29]. Basic movie information was obtained from https://www.imdb.com/interfaces/ (accessed on 3 July 2022) and was made public by [30]. Movie trailer information was obtained from https://movienet.github.io/ (accessed on 12 July 2022) and was made public by [31].

Acknowledgments

The authors would like to thank the editors and reviewers for constructive suggestions and comments that helped improve the quality of the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Debashis, D.; Laxman, S.; Sujoy, D. A Survey on Recommendation System. Int. J. Comput. Appl. 2017, 160, 6–10.
  2. Sanpechuda, T.; Kovavisaruch, L. Evaluations of Museum Recommender System Based on Different Visitor Trip Times. J. Inf. Commun. Converg. Eng. 2022, 20, 131–136.
  3. Bang, J.; Hwang, D.; Jung, H. Product Recommendation System based on User Purchase Priority. J. Inf. Commun. Converg. Eng. 2020, 18, 55–60.
  4. Mahesh, G.; Neha, C. A Review of Movie Recommendation System: Limitations, Survey and Challenges. Electron. Lett. Comput. Vis. Image Anal. 2020, 19, 18–37.
  5. Sunghwan, M.; Ingoo, H. Detection of the customer time-variant pattern for improving recommender systems. Expert Syst. Appl. 2005, 28, 188–199.
  6. Sunil, W.; Yili, H.; Munir, M.; Abhijit, J. Technology Diffusion in the Society: Analyzing Digital Divide in the Context of Social Class. In Proceedings of the 2011 44th Hawaii International Conference on System Sciences, Kauai, HI, USA, 4–7 January 2011.
  7. Mikael, G.; Gunnar, K. Measurements on the Spotify peer-assisted music-on-demand streaming system. In Proceedings of the 2011 IEEE International Conference on Peer-to-Peer Computing, Kyoto, Japan, 31 August–2 September 2011.
  8. Manoj, K.; Yadav, K.K.; Ankur, S.; Vijay, K.G. A Movie Recommender System: MOVREC. Int. J. Comput. Appl. 2015, 124, 7–11.
  9. Zhengshun, W.; Ping, S.; Qiang, T.; Yan, R. A Non-Stationary Signal Preprocessing Method based on STFT for CW Radio Doppler Signal. In Proceedings of the 2020 4th International Conference on Vision, ICVISP 2020, Bangkok, Thailand, 9–11 December 2020.
  10. Kunpeng, L.; Lihua, G.; Nuo, T.; Feixiang, G.; Qi, W. Feature Extraction Method of Power Grid Load Data Based on STFT-CRNN. In Proceedings of the 6th International Conference on Big Data and Computing, ICBDC’21, Shenzhen, China, 22–24 May 2021.
  11. Garima, S.; Kartikeyan, U.; Sridhar, K. Trends in Audio Signal Feature Extraction Methods. Appl. Acoust. 2020, 158, 1–21.
  12. Hossan, A.; Memon, S.; Gregory, M. A Novel Approach for MFCC Feature Extraction. In Proceedings of the 2010 4th International Conference on Signal Processing and Communication Systems, Gold Coast, QLD, Australia, 13–15 December 2010.
  13. Monir, R.; Kostrzewa, D.; Mrozek, D. Singing Voice Detection: A Survey. Entropy 2022, 24, 2–4.
  14. Kos, M.; Kacic, Z.; Vlaj, D. Acoustic classification and segmentation using modified spectral roll-off and variance-based features. Digit. Signal Process. 2013, 23, 659–675.
  15. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2014, arXiv:1409.0575.
  16. Kaiming, H.; Xiangyu, Z.; Shaoqing, R.; Jian, S. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  17. Marko, A.R.; Johan, B.; Herbert, V.D.S. Automatic metadata generation using associative networks. ACM Trans. Inf. Syst. 2009, 27, 1–20.
  18. Wangsung, C.; Youngmin, C.; Wonseock, C. Automatic generation of the keyword metadata in each scenes using the script of a video content. In Proceedings of the Journal of the Korea Communications Association’s Comprehensive Academic Presentation (Summer), Jeju, Korea, 26 June 2009; Available online: https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE02088587 (accessed on 23 August 2022).
  19. Islam, M.M.; Bertasius, G. Long Movie Clip Classification with State-Space Video Models. arXiv 2022, arXiv:2204.01692.
  20. Antoine, M.; Dimitri, Z.; Jean-Baptiste, A.; Makarand, T.; Ivan, L.; Josef, S. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. arXiv 2019, arXiv:1906.03327.
  21. Gorbman, C. Unheard Melodies: Narrative Film Music; Indiana University Press: Bloomington, IN, USA, 1987; Volume 7, p. 186.
  22. Jon, G.; David, B. Telling Stories with Soundtracks: An Empirical Analysis of Music in Film. In Proceedings of the First Workshop on Storytelling; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 33–42.
  23. Barbara, M.; Juan, C.; Soyeon, A. Soundtrack design: The impact of music on visual attention and affective responses. Appl. Ergon. 2021, 93, 103301.
  24. Görne, T. The Emotional Impact of Sound: A Short Theory of Film Sound Design. EPiC Ser. Technol. 2019, 1, 17–30.
  25. Trohidis, K.; Tsoumakas, G.; Kalliris, G.; Vlahavas, L. Multi-label classification of music by emotion. EURASIP J. Audio Speech Music Process. 2011, 1, 1–9.
  26. Deepti, C.; Niraj, P.S.; Sachin, S. Development of music emotion classification system using convolution neural network. Int. J. Speech Technol. 2021, 24, 571–580.
  27. Hizlisoy, S.; Yildirim, S.; Tufekci, Z. Music emotion recognition using convolutional long short term memory deep neural networks. Eng. Sci. Technol. Int. J. 2021, 24, 760–767.
  28. Sudipta, K.; Suraj, M.A.; Pastor, L.M.; Thamar, S. MPST: A Corpus of Movie Plot Synopses with Tags. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC) 2018, Miyazaki, Japan, 9–11 May 2018.
  29. Harper, F.M.; Joseph, A.K. The MovieLens Datasets: History and Context. ACM Trans. Intell. Syst. 2016, 5, 1–19.
  30. IMDb Datasets. Available online: https://www.imdb.com/interfaces/ (accessed on 16 August 2022).
  31. Qingqiu, H.; Yu, X.; Anyi, R.; Jiaze, W.; Dahua, L. MovieNet: A Holistic Dataset for Movie Understanding. In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
  32. Tomasz, C.; Szymon, R.; Dawid, W.; Adam, K.; Bozena, K. Classifying Emotions in Film Music-A Deep Learning Approach. Electronics 2021, 10, 2955.
  33. Jiyoung, J. The Correlation of Bach Music and the Scene as Seen in Films. Master’s Thesis, Music in Modern Media, the Graduate School of Ewha Womans University, Seoul, Korea, January 2007.
  34. Umair, A.K.; Miguel, A.M.-D.-A.; Saleh, M.A.; Adnan, A.; Atiq, U.R.; Najm, U.S.; Khalid, H.; Naveed, I. Movie Tags Prediction and Segmentation Using Deep Learning. IEEE Access 2020, 8, 6071–6086.
  35. Gaurav, A.; Hari, O. An efficient supervised framework for music mood recognition using autoencoder-based optimized support vector regression model. IET Signal Process. 2021, 15, 98–121.
  36. Jordi, P.; Xavier, S. musicnn: Pre-trained convolutional neural networks for music audio tagging. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 4–8 November 2019.
Figure 1. ResNet34 shortcut structure.
Figure 2. Full system configuration diagram.
Figure 3. STFT spectrogram data preprocessing.
Figure 4. Genre analysis system.
Figure 5. Country of manufacture analysis flow chart.
Figure 6. Spectrogram image learning result for genre recognition: (a) training data placement results by genre; (b) test data prediction and evaluation results.
Table 1. Companies’ benefits through recommendation systems.

Company | Benefit through Recommendation System
Netflix | Two-thirds of the movies users watch are recommended
Google News | Recommendations generate 38% more click-throughs
Amazon | 35% of sales come from recommendations
ChoiceStream | 28% of people would buy more music if they found what they liked
Table 2. Accuracy and learning time when running 100 epochs.

Model | Accuracy (100 Epochs) | Max Accuracy | Learning Time
ResNet34 | 95.0086% | 96.2134% | 601.7433 s
VGG-19 | 94.4206% | 95.8512% | 708.3994 s
MobileNet | 91.4163% | 92.9900% | 625.6896 s
Table 3. Results of production country tag generation via voice file.

No. | Title | Country
1 | Super Girl | Anglosphere
2 | Speed | Anglosphere
3 | Star Trek 3 | Anglosphere
4 | The Shining | Anglosphere
5 | Tomboy | Anglosphere
6 | Treasure Island | Anglosphere
7 | Urban Cowboy | Anglosphere
8 | Wolf | Anglosphere
Table 4. Comparison of real tags and tags generated from the proposed system (15 randomly selected in dataset).

No. | Title | Tag Type | Genre | Movie Tag | Country | Result
1 | Air America | Real Tag | Action, Comedy, War | Comedy | US | O
1 | Air America | Proposed Tag | Action | Action, Comedy | Anglosphere |
2 | Batman Forever | Real Tag | Action, Adventure | comedy, murder, violence, insanity, action, revenge | US | O
2 | Batman Forever | Proposed Tag | Horror | murder, revenge | Anglosphere |
3 | Black Rain | Real Tag | Horror | boring, neo noir, murder, violence, cult, romantic, suspenseful | US | O
3 | Black Rain | Proposed Tag | Horror | Murder | Anglosphere |
4 | Blood sport 2 | Real Tag | Action | violence | US | X
4 | Blood sport 2 | Proposed Tag | Horror | Violence | Anglosphere |
5 | Bonnie and Clyde | Real Tag | Action | comedy, depressing, murder, cult, violence, humor, romantic, revenge, storytelling | US | O
5 | Bonnie and Clyde | Proposed Tag | Action | violence, revenge | Anglosphere |
6 | Captain Ron | Real Tag | Comedy | cult, comedy | US | O
6 | Captain Ron | Proposed Tag | Comedy | Comedy | Anglosphere |
7 | Cobra | Real Tag | Action | comedy, mystery, neo noir, murder, violence, cult, humor, action | US | O
7 | Cobra | Proposed Tag | Action | murder, violence | Anglosphere |
8 | Crimson Tide | Real Tag | Action | cult, suspenseful, comedy | US | O
8 | Crimson Tide | Proposed Tag | Action | suspenseful, comedy | Anglosphere |
9 | Die Hard | Real Tag | Action, Thriller | comedy, mystery, murder, cult, revenge, violence, humor, action, claustrophobic, suspenseful | US | O
9 | Die Hard | Proposed Tag | Action | action, suspenseful | Anglosphere |
10 | Eraser | Real Tag | Action | violence, action, neo noir, murder | US | O
10 | Eraser | Proposed Tag | Action | action, murder | Anglosphere |
11 | Far and Away | Real Tag | Adventure, Drama, Romance | romantic | US | O
11 | Far and Away | Proposed Tag | Romance | romantic | Anglosphere |
12 | Firefox | Real Tag | Action | suspenseful, murder, violence | US | O
12 | Firefox | Proposed Tag | Action | suspenseful, murder | Anglosphere |
13 | Fly Away Home | Real Tag | Comedy | tragedy, inspiring | US | O
13 | Fly Away Home | Proposed Tag | Comedy | Inspiring | Anglosphere |
14 | Sidekicks | Real Tag | Comedy | revenge, violence | US | O
14 | Sidekicks | Proposed Tag | Comedy | revenge, violence | Anglosphere |
15 | Top Gun | Real Tag | Action, Drama | fantasy, cult, action, humor, inspiring, romantic | US | X
15 | Top Gun | Proposed Tag | Romance | Romantic | Anglosphere |