Article

Automatic Movie Tag Generation System for Improving the Recommendation System

Department of Computer Science and Engineering, Korea University of Technology and Education, Cheonan 31253, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(21), 10777; https://doi.org/10.3390/app122110777
Submission received: 31 August 2022 / Revised: 20 October 2022 / Accepted: 20 October 2022 / Published: 24 October 2022
(This article belongs to the Special Issue Future Information & Communication Engineering 2022)

Abstract

As the content industry develops, the demand for movie content is increasing. Accordingly, the content industry is actively developing hyper-personalized recommendation systems that match consumers’ tastes. In this paper, we study the automatic generation of movie tags to improve movie recommendation systems. We extracted background sounds from movie trailer videos, analyzed them using the STFT (Short-Time Fourier Transform) and major audio attribute features, and created a genre prediction model. The experimental results show that the tags extracted by the model closely match those in the pre-collected dataset. We propose a methodology for automatically predicting genre information for movies from their trailer videos, which helps reduce the time and effort required to generate metadata for a recommendation system.

1. Introduction

With the development of the Internet, a large amount of data and content is being produced worldwide. Today, data have become a key factor in everything, and the volume of data is growing exponentially. The penetration rate of e-commerce in web-based businesses varies across countries, such as the US (266 million, 84%) and France (54 million, 81%), but the number of new members increases steadily every month. According to IBM, Internet users generate 2500 trillion bytes of information every day, and 90 percent of the information on the planet today was created in the last three to four years [1].
Many studies are being conducted on how such data and content are consumed, and this research continues to develop. Among these technologies, the recommendation system has long been used to deliver data and content to users on the Internet.
A recommendation system is an information tool that helps users to determine the items they want from a large number of items available. The main goals of a recommendation system are to predict the ratings given to items by specific users and to help users find the best solution in a list of available items [2,3]. Many companies, such as Netflix, YouTube, and Amazon, use recommendation systems to serve users and make profits [4].
For example, if we want to buy books, listen to music, or watch movies, there is a recommendation system working in the background that makes suggestions to the user based on their previous actions [5]. Many platforms work on recommendation systems, such as Netflix, which suggests movies; Amazon, which suggests products; Spotify, which suggests music; LinkedIn, which recommends jobs; and social networking sites that suggest other users [6,7,8].
The benefits these companies gain from recommendation systems are shown in Table 1.
As such, with the continuous development of research and technology, the recommendation system method continues to develop, and research is being conducted more actively based on content-providing platform companies.
A movie recommendation system helps movie lovers avoid searching through vast online databases and reduces the time needed to find the right movie by suggesting relevant, top-tier titles.
These movie recommendation systems use various techniques, such as content-based filtering and collaborative filtering algorithms. A content-based algorithm generates and utilizes various keywords about the movie, such as genre, title, director, actor, and atmosphere, and makes recommendations using the movie’s metadata or tags together with user information. Currently, a movie’s metadata or tags are manually entered by an operator, which means that the recommendation system has not been fully automated.
In this paper, we conduct a study to automatically generate the tags (genre, mood, production country, title) of movie content required for recommendation systems through artificial intelligence and to verify the consistency between the tags derived by the proposed method and manually entered tags.
This paper is structured as follows. Section 2 introduces the theoretical background related to sound signal processing. Section 3 reviews related studies on automatic tag generation for movie content. Section 4 and Section 5 describe the proposed method and the results. Finally, Section 6 presents the overall conclusions of the proposed method.

2. Background

2.1. Sound Signal Processing

2.1.1. STFT (Short-Time Fourier Transform)

The Fourier transform decomposes a function into its frequency components over time or space. In this process, information about time or space disappears, so it is not possible to determine at what point the signal corresponding to each frequency occurred. The STFT addresses this by dividing the data into sections over time and applying the Fourier transform to each section [9,10]. In this way, the change in frequency over time can be observed. The STFT is defined as follows:
$$\mathrm{STFT}(t,\omega) = \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-j\omega\tau}\, d\tau$$
Here, x(τ) is the signal to be analyzed, and w(τ − t) is a window function. The larger the window, the finer the frequency resolution; however, the time resolution becomes coarser, making it difficult to localize events accurately in time. Therefore, it is important to choose an appropriate window size.
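As a minimal illustration, an STFT spectrogram of the kind used later in this paper can be computed with an audio library such as librosa; the library choice and the file name are assumptions for illustration, not the authors' implementation.

```python
# Minimal STFT sketch (librosa is an assumed tool; the paper does not name its library).
import numpy as np
import librosa

y, sr = librosa.load("trailer_audio.wav", sr=22050)    # hypothetical input file

# n_fft is the window size: a larger window gives finer frequency resolution
# but coarser time resolution, as discussed above.
D = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)  # spectrogram in dB
print(S_db.shape)                                       # (1 + n_fft/2, number of frames)
```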

2.1.2. ZCR (Zero-Crossing Rate)

The zero-crossing rate (ZCR) is the rate of sign changes in an audio signal during a given period. The ZCR is simple to compute and is widely used for speech/music discrimination, as well as in areas such as music genre classification, highlight detection, speech analysis, speech detection, and background sound recognition [11].
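For reference, the ZCR of a clip can be computed per frame and averaged, for example with librosa (an assumed tool):

```python
# ZCR sketch: fraction of sign changes per frame (librosa is an assumed tool).
import librosa

y, sr = librosa.load("trailer_audio.wav")
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)
print(zcr.mean())   # a higher mean ZCR typically indicates noisier or more percussive audio
```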

2.1.3. MFCCs (Mel-Frequency Cepstral Coefficients)

MFCCs are the coefficients that collectively make up the mel-frequency cepstrum (MFC). They are obtained by applying a DCT (Discrete Cosine Transform) to the Mel-spectrogram and are often used to distinguish music genres [12].
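A typical extraction, assuming librosa as the audio toolkit, looks like the following (the Mel filter bank and DCT are applied internally):

```python
# MFCC sketch: Mel filter bank + log + DCT are handled inside librosa.feature.mfcc.
import librosa

y, sr = librosa.load("trailer_audio.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # 20 coefficients per frame
print(mfcc.shape)                                     # (20, number of frames)
```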

2.1.4. Spectral Feature

  • Spectral Centroid
The spectral centroid represents the location of the center of mass of the spectrum and is a measurement that is highly related to the brightness of the sound [13] (a short computational sketch for both spectral features follows this list).
  • Spectral Roll-off
The spectral roll-off represents the frequency at which a certain percentage of total spectral energy (typically 80% to 90%) is concentrated in the spectrum [14].
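Both features can be computed per frame; the sketch below assumes librosa as the toolkit and an 85% roll-off threshold, which are illustrative choices rather than the authors' settings.

```python
# Spectral centroid ("center of mass" of the spectrum) and roll-off (frequency below
# which 85% of the spectral energy lies), computed per frame; librosa is an assumed tool.
import librosa

y, sr = librosa.load("trailer_audio.wav")
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)
print(centroid.mean(), rolloff.mean())
```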

2.2. ResNet34

ResNet34 is an architecture developed by Microsoft that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015 [15]. Its plain layer configuration is based on the structure of VGG-19. Figure 1 shows the shortcut structure used by the ResNet34 algorithm, in contrast to the plain mapping of a basic CNN [16].
The shortcut simply adds a connection from the input directly to the output of the existing mapping. Therefore, it does not change the number of parameters, and the only additional operation introduced by the shortcut connection is an element-wise addition. In the case of ResNet34, results are obtained through 16 shortcut structures. This structure reduces the difficulty of parameter optimization and avoids vanishing gradient problems when constructing deep networks.
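To make the shortcut concrete, the following is a minimal sketch of one residual (shortcut) block in PyTorch; it is illustrative only and not the authors' implementation.

```python
# Minimal residual block sketch: two 3x3 convolutions plus an identity shortcut.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut: the input itself
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # the only extra operation is this addition
        return self.relu(out)

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```

ResNet34 stacks 16 such blocks (with projection shortcuts where the feature map size changes between stages).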

3. Related Works

3.1. Automatic Metadata Generation System

Marko et al. (2009) [17] proposed an automated metadata generation system that uses the relationships between existing resources as a mediator for propagating metadata from metadata-rich resources to metadata-poor resources. Existing resource metadata are analyzed through an associative network, metadata are propagated to resources lacking them, and the propagated metadata are then verified by humans to produce automatically generated metadata. Because the approach is independent of content analysis, it can be applied to different resource media types and has a lower resource cost.
Wangsung C. et al. (2009) [18] proposed a method to reduce metadata construction costs by using the scripts of video content. They extracted keywords for each video scene from the script as follows: first, the entire video is divided into scenes; then, each scene is matched with the corresponding scene in the script; finally, keywords are extracted based on the script content. Such scene-level keyword extraction for video metadata is useful for interactive services that provide additional information about the current scene and relevant advertisements during video viewing.

3.2. Video Metadata Tagging

As a related previous study, there is a model called ‘ViS4mer’ that is effective for long-range video understanding using self-attention. Using image and natural language processing, it analyzed the relationships between the people appearing in a video and its genre [19]. In addition, there is a system that extracts keywords from automatically generated subtitles and tags the data together with images [20].

3.3. Relevance between Audio and Movie

Gorbman (1987) [21] showed that music “mainly serves as a flag of emotion” in film production. She noted that listeners can receive narrative signals, such as the movie’s background and character settings, through film music.
Jon G. et al. (2018) [22] experimentally showed that the audio features of a soundtrack have a significant impact on user ratings of the movie in which it appears. They proposed that certain musical aspects (acoustic, tempo) can generate audience responses very effectively in the context of film music.
Thus, it can be seen that auditory signals in movies play a very important role in effectively capturing the audience’s attention [23,24].

3.4. Emotional Classification of Music

Research continues on classifying emotions in music through feature extraction [25,26,27]. Categorical approaches classify music into emotions such as happiness, sadness, anger, fear, and disgust, while Russell’s circumplex model of emotion represents the mood on a two-dimensional plane of arousal and valence.
Existing research suggests methods of automatically generating metadata to provide customized content. These methods fall into two broad categories: the first uses existing metadata resources, and the second uses video scripts. Both have demonstrated the automatic generation of the metadata required for recommendations.
In this paper, we aim to automatically generate movie tags as a basis for recommending movie content to users. To this end, tags for the title, genre, mood, and production country are automatically created by analyzing the sound (background music, voice) and text properties of the movie content, in contrast to existing studies. The automatically generated tags were compared with the existing manual tags to verify the proposed method.

4. Research Method

The system for automatically extracting movie tags consists of five steps. In the first step, we collect video content files. In the second step, audio data are extracted from the video content. In the third step, the audio data are separated into voice data and sound data. The voice data are used to analyze the production country, while the sound data are converted into an STFT spectrogram from which the genre tag is extracted through data processing; the sound data are also used to extract the mood tag. In the last step, the tag information is generated. The details of each step are described below; together they form a single pipeline from video content input to tag extraction, as shown in Figure 2.

4.1. Extracting Video Data

To verify the tag information extracted from the video data, the results were compared using the MPST dataset [28] and the “MovieLens” dataset [29], which constructed about 70 types of tag information through natural language processing of movie plot synopses. Basic information, such as the film title, screening time, and screening year, was obtained through the unique identifier included in the IMDb dataset [30]. The trailer videos were obtained through the “MovieNet” Trailer 30K URL list [31].

4.2. Pre-Data Processing

The sound of a movie is an important component of the movie. From the background sound, information about the overall atmosphere of the movie can be extracted, and the movie genre can be distinguished by analyzing it [32]. In addition, the voice can be used to determine the country where the movie was made. In this paper, the movie trailer files were pre-processed in three stages.
First, the audio is extracted from each of the 933 trailer videos and split into speech data and background sound data using ‘spleeter’, a source separation module. The speech data are used for production country analysis, and the background sound data are used for STFT (Short-Time Fourier Transform) spectrum extraction. Second, the STFT spectrum (frequency versus time) is extracted from the sound data, visualized, and stored. Finally, the STFT spectrogram images are processed as shown in Figure 3 in order to extract genre tags. To use the stored STFT spectrograms as training data, all color bars and text other than the spectrogram itself are removed.
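A sketch of this pre-processing stage is shown below; the file paths, the 2-stem spleeter model, and the use of librosa/matplotlib for the spectrogram image are assumptions made for illustration, not the authors' exact implementation.

```python
# Pre-processing sketch: (1) separate voice and background sound with spleeter,
# (2) compute the STFT spectrogram of the background sound, (3) save it as an image
# with axes, ticks, and color bar removed so only the spectrogram remains.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")                  # vocals vs. accompaniment
separator.separate_to_file("trailer_audio.wav", "separated/")

y, sr = librosa.load("separated/trailer_audio/accompaniment.wav")
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots(figsize=(4, 4))
librosa.display.specshow(S_db, sr=sr, ax=ax)
ax.set_axis_off()                                         # strip all text and ticks
fig.savefig("spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```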

4.3. Genre Analysis

The background sound and sound effects of a movie are important factors that determine its atmosphere and genre [33]. Therefore, we train a deep learning model for genre tag extraction using the STFT spectrograms refined during data processing.
Analyzing the STFT spectrograms for each movie genre shows that spectrogram waveforms of movies in the same genre have similar patterns. This suggests that movies of the same genre tend to have similar spectral patterns, so the movie genre can be extracted by learning from the STFT spectrograms. The genre analysis system based on the STFT spectrogram is constructed as shown in Figure 4. For image classification with the STFT spectrograms, transfer learning is performed on the ResNet34 model, and SGD is used to adjust the weights of the neural network and increase the learning speed.
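A minimal transfer-learning sketch with torchvision's ResNet34 and SGD is shown below; the directory layout, image size, and hyperparameters are illustrative assumptions rather than the settings used in the experiments.

```python
# Transfer-learning sketch: fine-tune an ImageNet-pretrained ResNet34 on the
# spectrogram images, with one output per genre class and an SGD optimizer.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("spectrograms/train", transform=transform)  # 6 genre folders
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 6)   # Action, Comedy, Crime, Drama, Horror, Romance

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```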

4.4. Country of Manufacture Analysis

Since the actors in a movie speak the language of the country where it was produced, the production country can be analyzed by separately extracting the actors’ voice data from the movie trailer. The production country analysis was constructed using ‘Speech_recognition’, Google’s voice recognition AI, as shown in Figure 5.
From the separated voice data, sections containing a human voice are obtained through voice recognition and converted into text. The language of the converted text is then detected and stored as the production country tag.
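A sketch of this step is given below using the SpeechRecognition package (Google Web Speech API); the langdetect step and the file path are assumptions for illustration, since the paper does not name that part of the tooling.

```python
# Production-country sketch: transcribe a speech section, detect its language,
# and map languages shared by several countries to a grouped tag (see Section 5.2).
import speech_recognition as sr
from langdetect import detect   # assumed language-detection tool

recognizer = sr.Recognizer()
with sr.AudioFile("separated/trailer_audio/vocals.wav") as source:
    audio = recognizer.record(source, duration=30)        # a section containing speech

text = recognizer.recognize_google(audio)                  # Google speech-to-text
language = detect(text)                                    # e.g., "en"

country_tag = "Anglosphere" if language == "en" else language
print(country_tag)
```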

4.5. Background Sound Analysis

Research is continuously being conducted to predict the atmosphere, genre, and other attributes by analyzing audio and music [34]. Leveraging audio feature vectors, or feeding spectrograms extracted from audio into deep learning models such as CNNs, to improve audio understanding is an ongoing challenge in deep learning [35]. In this study, two main methods were used to analyze the background sound of the trailer videos. First, we extracted the main audio features (STFT, RMS, spectral centroid, spectral bandwidth, spectral roll-off, MFCC, and zero-crossing rate) using an audio analysis tool. Second, we extracted the top 10 tags for each background sound using musicnn, a CNN-based model that achieved the best performance in automatic music tagging [36]. Through these methods, we characterized the background sound audio and built a random forest model for genre prediction.
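The sketch below illustrates the first method and the random forest step, assuming librosa for feature extraction and scikit-learn for the classifier; the musicnn top-10 tags from the second method could be appended to each feature vector. It is a minimal sketch, not the authors' code.

```python
# Background-sound sketch: one feature vector per clip (frame-level features averaged),
# then a random forest genre classifier. Tooling (librosa, scikit-learn) is assumed.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(path):
    y, sr = librosa.load(path)
    feats = [
        librosa.feature.rms(y=y).mean(),
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
        librosa.feature.zero_crossing_rate(y).mean(),
    ]
    feats.extend(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1))
    return np.array(feats)

def train_genre_model(clip_paths, genre_labels):
    # clip_paths / genre_labels: separated background-sound clips and their genre tags.
    X = np.stack([extract_features(p) for p in clip_paths])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, genre_labels)
    return clf   # clf.feature_importances_ shows which features contributed most
```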

5. Results

5.1. Results of Genre Analysis

A total of 1013 samples were extracted through the previous steps. After deleting unnecessary data, 935 samples remained. The data were classified into six genre classes: Action, Comedy, Crime, Drama, Horror, and Romance. Of the data, 70% was used as training data and the remaining 30% as test data. Through transfer learning, the genre recognition results shown in Figure 6 were obtained, and the class keywords (Action, Comedy, Crime, Drama, Horror, Romance) of each recognized image were saved to a CSV file.
Table 2 shows the accuracy and learning time over 100 epochs for the ResNet34, VGG-19, and MobileNet models. MobileNet reached 91.4164% accuracy in the last epoch, with a maximum accuracy of 92.9900%. VGG-19 reached 94.4206% accuracy in the last epoch, with a maximum accuracy of 95.8512%, about 3% higher than MobileNet. ResNet34 reached 95.0086% accuracy in the last epoch, with a maximum accuracy of 96.2134%, about 0.5% higher than VGG-19. In addition, the largest advantage of ResNet34 is its training speed: completing 100 epochs took 625.6896 s for MobileNet, 708.3994 s for VGG-19, and 601.7433 s for ResNet34. ResNet34 showed the best performance in both training speed and accuracy in this study.

5.2. Results of Country of Production Analysis

Through the pre-processed audio files, the results shown in Table 3 were obtained; all of the analyzed movies were produced in English. This is because the dataset described in Section 4.1 includes only videos produced in English. Even when people spoke the same language, it was difficult to distinguish between countries because of differences in vocabulary and accent, so countries using the same language were grouped together.

5.3. Results of Genre Prediction with Musical Features and 10 Tags

Using the sound data extracted from the trailer videos and separated from the voice data, a random forest model was constructed to classify the genre tag based on the major feature values. The experiment was performed with 100 audio samples classified into six genres: action, comedy, crime, drama, horror, and romance. A prediction rate of about 43% was observed, and MFCC was confirmed to be the most important feature.

5.4. Results of Automatic Tag Generation System

The proposed automatic tag generation system takes movie data as input and analyzes the sound and voice. Tags are generated based on this analysis, and the tag list is stored as a CSV file. The tags were generated for movies in the IMDb dataset. As shown in Table 4, 15 randomly selected entries are summarized as results. Table 4 compares the actual tags provided in the dataset with the tags automatically generated by the proposed system. Although the numbers of tags differ, the extracted tags are generally included among the actual tags. The difference in the number of tags can be addressed by increasing the number of classification items or adding keywords to the system.
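The final collection step can be as simple as writing one row per movie to a CSV file; the sketch below is illustrative and uses two entries taken from Table 4 as example rows.

```python
# Sketch of storing the generated tag list as a CSV file (example rows taken from Table 4).
import csv

generated_tags = [
    {"title": "Captain Ron", "genre": "Comedy", "tags": "Comedy", "country": "Anglosphere"},
    {"title": "Die Hard", "genre": "Action", "tags": "action, suspenseful", "country": "Anglosphere"},
]

with open("generated_tags.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "genre", "tags", "country"])
    writer.writeheader()
    writer.writerows(generated_tags)
```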
Video platforms are already used by many consumers worldwide, and platform companies continuously study recommendation systems to provide customized services. Automating the generation of tags for movie content through the methods proposed in this paper will help platform companies that provide recommendation services reduce the time and cost of matching content to consumers.

6. Conclusions

Research on automatically predicting tags for movies is still ongoing. In addition, studies are being conducted to understand movies using the unique characteristics of trailer videos. In previous research, researchers mainly attempted to tag using resources such as images and synopses.
In this paper, we conducted a study on tagging that focuses on the audio (voice and background sound) among the various resources of a film. We investigated a method to automatically generate tags for the movie content used in recommendation systems and proposed such a system. Through the proposed system, we compared the consistency of the automatically generated tags with the existing manually generated tags; a high match rate between the compared tags supports the feasibility of automating tag generation for recommendation systems.
The system we built allows video platform operators to provide a recommendation system in which tags are generated automatically through artificial intelligence, without manually entering tags for movie content. Companies that provide movie content can use it to create tags at the same time as movie data are uploaded. In addition, a recommendation service can be provided in real time by matching the extracted tags with the user’s interest information.
However, our proposed method is still limited in the types of tags it generates, and the tag types need to be further expanded. Another direction is to utilize other modalities of the video, such as keyframe images or motion features. In future work, we will expand the movie tag types and optimize tag generation of movie content for real-time processing.

Author Contributions

Conceptualization, H.P. and S.Y.; methodology, Y.Y. and S.L.; software, Y.Y. and S.L.; data curation, Y.Y., S.L. and H.P.; investigation, S.L. and S.Y.; writing—original draft preparation, H.P. and S.Y.; writing—review and editing, S.L., H.P. and S.Y.; supervision, I.-Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Basic Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A3057800).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Movie tag information was obtained from https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags (accessed on 1 July 2022) and was made public by [28]. The data used in this study were obtained from https://movielens.org/ (accessed on 2 July 2022) and were made public by [29]. Basic movie information was obtained from https://www.imdb.com/interfaces/ (accessed on 3 July 2022) and was made public by [30]. Movie trailer information was obtained from https://movienet.github.io/ (accessed on 12 July 2022) and was made public by [31].

Acknowledgments

The authors would like to thank the editors and reviewers for constructive suggestions and comments that helped improve the quality of the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Debashis, D.; Laxman, S.; Sujoy, D. A Survey on Recommendation System. Int. J. Comput. Appl. 2017, 160, 6–10.
  2. Sanpechuda, T.; Kovavisaruch, L. Evaluations of Museum Recommender System Based on Different Visitor Trip Times. J. Inf. Commun. Converg. Eng. 2022, 20, 131–136.
  3. Bang, J.; Hwang, D.; Jung, H. Product Recommendation System based on User Purchase Priority. J. Inf. Commun. Converg. Eng. 2020, 18, 55–60.
  4. Mahesh, G.; Neha, C. A Review of Movie Recommendation System: Limitations, Survey and Challenges. Electron. Lett. Comput. Vis. Image Anal. 2020, 19, 18–37.
  5. Sunghwan, M.; Ingoo, H. Detection of the customer time-variant pattern for improving recommender systems. Expert Syst. Appl. 2005, 28, 188–199.
  6. Sunil, W.; Yili, H.; Munir, M.; Abhijit, J. Technology Diffusion in the Society: Analyzing Digital Divide in the Context of Social Class. In Proceedings of the 2011 44th Hawaii International Conference on System Sciences, Kauai, HI, USA, 4–7 January 2011.
  7. Mikael, G.; Gunnar, K. Measurements on the Spotify peer-assisted music-on-demand streaming system. In Proceedings of the 2011 IEEE International Conference on Peer-to-Peer Computing, Kyoto, Japan, 31 August–2 September 2011.
  8. Manoj, K.; Yadav, K.K.; Ankur, S.; Vijay, K.G. A Movie Recommender System: MOVREC. Int. J. Comput. Appl. 2015, 124, 7–11.
  9. Zhengshun, W.; Ping, S.; Qiang, T.; Yan, R. A Non-Stationary Signal Preprocessing Method based on STFT for CW Radio Doppler Signal. In Proceedings of the 2020 4th International Conference on Vision, ICVISP 2020, Bangkok, Thailand, 9–11 December 2020.
  10. Kunpeng, L.; Lihua, G.; Nuo, T.; Feixiang, G.; Qi, W. Feature Extraction Method of Power Grid Load Data Based on STFT-CRNN. In Proceedings of the 6th International Conference on Big Data and Computing, ICBDC’21, Shenzhen, China, 22–24 May 2021.
  11. Garima, S.; Kartikeyan, U.; Sridhar, K. Trends in Audio Signal Feature Extraction Methods. Appl. Acoust. 2020, 158, 1–21.
  12. Hossan, A.; Memon, S.; Gregory, M. A Novel Approach for MFCC Feature Extraction. In Proceedings of the 2010 4th International Conference on Signal Processing and Communication Systems, Gold Coast, QLD, Australia, 13–15 December 2010.
  13. Monir, R.; Kostrzewa, D.; Mrozek, D. Singing Voice Detection: A Survey. Entropy 2022, 24, 2–4.
  14. Kos, M.; Kacic, Z.; Vlaj, D. Acoustic classification and segmentation using modified spectral roll-off and variance-based features. Digit. Signal Process. 2013, 23, 659–675.
  15. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2014, arXiv:1409.0575.
  16. Kaiming, H.; Xiangyu, Z.; Shaoqing, R.; Jian, S. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  17. Marko, A.R.; Johan, B.; Herbert, V.D.S. Automatic metadata generation using associative networks. ACM Trans. Inf. Syst. 2009, 27, 1–20.
  18. Wangsung, C.; Youngmin, C.; Wonseock, C. Automatic generation of the keyword metadata in each scenes using the script of a video content. In Proceedings of the Journal of the Korea Communications Association’s Comprehensive Academic Presentation (Summer), Jeju, Korea, 26 June 2009; Available online: https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE02088587 (accessed on 23 August 2022).
  19. Islam, M.M.; Bertasius, G. Long Movie Clip Classification with State-Space Video Models. arXiv 2022, arXiv:2204.01692.
  20. Antoine, M.; Dimitri, Z.; Jean-Baptiste, A.; Makarand, T.; Ivan, L.; Josef, S. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. arXiv 2019, arXiv:1906.03327.
  21. Gorbman, C. Unheard Melodies: Narrative Film Music; Indiana University Press: Bloomington, IN, USA, 1987; Volume 7, p. 186.
  22. Jon, G.; David, B. Telling Stories with Soundtracks: An Empirical Analysis of Music in Film. In Proceedings of the First Workshop on Storytelling; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 33–42.
  23. Barbara, M.; Juan, C.; Soyeon, A. Soundtrack design: The impact of music on visual attention and affective responses. Appl. Ergon. 2021, 93, 103301.
  24. Görne, T. The Emotional Impact of Sound: A Short Theory of Film Sound Design. EPiC Ser. Technol. 2019, 1, 17–30.
  25. Trohidis, K.; Tsoumakas, G.; Kalliris, G.; Vlahavas, L. Multi-label classification of music by emotion. EURASIP J. Audio Speech Music Process. 2011, 1, 1–9.
  26. Deepti, C.; Niraj, P.S.; Sachin, S. Development of music emotion classification system using convolution neural network. Int. J. Speech Technol. 2021, 24, 571–580.
  27. Hizlisoy, S.; Yildirim, S.; Tufekci, Z. Music emotion recognition using convolutional long short term memory deep neural networks. Eng. Sci. Technol. Int. J. 2021, 24, 760–767.
  28. Sudipta, K.; Suraj, M.A.; Pastor, L.M.; Thamar, S. MPST: A Corpus of Movie Plot Synopses with Tags. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC) 2018, Miyazaki, Japan, 9–11 May 2018.
  29. Harper, F.M.; Joseph, A.K. The MovieLens Datasets: History and Context. ACM Trans. Intell. Syst. 2016, 5, 1–19.
  30. IMDb Datasets. Available online: https://www.imdb.com/interfaces/ (accessed on 16 August 2022).
  31. Qingqiu, H.; Yu, X.; Anyi, R.; Jiaze, W.; Dahua, L. MovieNet: A Holistic Dataset for Movie Understanding. In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
  32. Tomasz, C.; Szymon, R.; Dawid, W.; Adam, K.; Bozena, K. Classifying Emotions in Film Music-A Deep Learning Approach. Electronics 2021, 10, 2955.
  33. Jiyoung, J. The Correlation of Bach Music and the Scene as Seen in Films. Master’s Thesis, Music in Modern Media, the Graduate School of Ewha Womans University, Seoul, Korea, January 2007.
  34. Umair, A.K.; Miguel, A.M.-D.-A.; Saleh, M.A.; Adnan, A.; Atiq, U.R.; Najm, U.S.; Khalid, H.; Naveed, I. Movie Tags Prediction and Segmentation Using Deep Learning. IEEE Access 2020, 8, 6071–6086.
  35. Gaurav, A.; Hari, O. An efficient supervised framework for music mood recognition using autoencoder-based optimized support vector regression model. IET Signal Process. 2021, 15, 98–121.
  36. Jordi, P.; Xavier, S. musicnn: Pre-trained convolutional neural networks for music audio tagging. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 4–8 November 2019.
Figure 1. ResNet34 shortcut structure.
Figure 2. Full system configuration diagram.
Figure 3. STFT spectrogram data preprocessing.
Figure 4. Genre analysis system.
Figure 5. Country of manufacture analysis flow chart.
Figure 6. Spectrogram image learning result for genre recognition: (a) training data placement results by genre; (b) test data prediction and evaluation results.
Table 1. Companies’ benefits through recommendation systems.

Company | Benefit through Recommendation System
Netflix | Two-thirds of the movies users watch are recommended
Google News | Recommendations generate 38% more click-throughs
Amazon | 35% of sales come from recommendations
ChoiceStream | 28% of people would buy more music if they found what they liked
Table 2. Accuracy and learning time when running 100 epochs.

Model | Accuracy (100 Epochs) | Max Accuracy | Learning Time
ResNet34 | 95.0086% | 96.2134% | 601.7433 s
VGG-19 | 94.4206% | 95.8512% | 708.3994 s
MobileNet | 91.4163% | 92.9900% | 625.6896 s
Table 3. Results of production country tag generation via voice file.

No. | Title | Country
1 | Super Girl | Anglosphere
2 | Speed | Anglosphere
3 | Star Trek 3 | Anglosphere
4 | The Shining | Anglosphere
5 | Tomboy | Anglosphere
6 | Treasure Island | Anglosphere
7 | Urban Cowboy | Anglosphere
8 | Wolf | Anglosphere
Table 4. Comparison of real tags and tags generated from the proposed system (15 randomly selected in dataset).

No. | Title | Tag Type | Genre | Movie Tag | Country | Result
1 | Air America | Real Tag | Action, Comedy, War | Comedy | US | O
1 | Air America | Proposed Tag | Action | Action, Comedy | Anglosphere |
2 | Batman Forever | Real Tag | Action, Adventure | comedy, murder, violence, insanity, action, revenge | US | O
2 | Batman Forever | Proposed Tag | Horror | murder, revenge | Anglosphere |
3 | Black Rain | Real Tag | Horror | boring, neo noir, murder, violence, cult, romantic, suspenseful | US | O
3 | Black Rain | Proposed Tag | Horror | Murder | Anglosphere |
4 | Blood sport 2 | Real Tag | Action | violence | US | X
4 | Blood sport 2 | Proposed Tag | Horror | Violence | Anglosphere |
5 | Bonnie and Clyde | Real Tag | Action | comedy, depressing, murder, cult, violence, humor, romantic, revenge, storytelling | US | O
5 | Bonnie and Clyde | Proposed Tag | Action | violence, revenge | Anglosphere |
6 | Captain Ron | Real Tag | Comedy | cult, comedy | US | O
6 | Captain Ron | Proposed Tag | Comedy | Comedy | Anglosphere |
7 | Cobra | Real Tag | Action | comedy, mystery, neo noir, murder, violence, cult, humor, action | US | O
7 | Cobra | Proposed Tag | Action | murder, violence | Anglosphere |
8 | Crimson Tide | Real Tag | Action | cult, suspenseful, comedy | US | O
8 | Crimson Tide | Proposed Tag | Action | suspenseful, comedy | Anglosphere |
9 | Die Hard | Real Tag | Action, Thriller | comedy, mystery, murder, cult, revenge, violence, humor, action, claustrophobic, suspenseful | US | O
9 | Die Hard | Proposed Tag | Action | action, suspenseful | Anglosphere |
10 | Eraser | Real Tag | Action | violence, action, neo noir, murder | US | O
10 | Eraser | Proposed Tag | Action | action, murder | Anglosphere |
11 | Far and Away | Real Tag | Adventure, Drama, Romance | romantic | US | O
11 | Far and Away | Proposed Tag | Romance | romantic | Anglosphere |
12 | Firefox | Real Tag | Action | suspenseful, murder, violence | US | O
12 | Firefox | Proposed Tag | Action | suspenseful, murder | Anglosphere |
13 | Fly Away Home | Real Tag | Comedy | tragedy, inspiring | US | O
13 | Fly Away Home | Proposed Tag | Comedy | Inspiring | Anglosphere |
14 | Sidekicks | Real Tag | Comedy | revenge, violence | US | O
14 | Sidekicks | Proposed Tag | Comedy | revenge, violence | Anglosphere |
15 | Top Gun | Real Tag | Action, Drama | fantasy, cult, action, humor, inspiring, romantic | US | X
15 | Top Gun | Proposed Tag | Romance | Romantic | Anglosphere |