Movie Title Keywords: A Text Mining and Exploratory Factor Analysis of Popular Movies in the United States and China

Abstract: Unprecedented opportunities have been brought by advancements in machine learning in the prediction of the economic success of movies. The analysis of movie title keywords is one promising but rarely investigated direction of study. To address this gap, we performed a text mining and exploratory factor analysis (EFA) of the relationships between movie titles and their corresponding movies' levels of success. Specifically, intragroup and intergroup analyses of 217 top hit movies in the United States and 245 top hit movies in China showed that successful movies in these two major movie markets with outstanding total lifetime grosses featured titles with similar and different patterns of most frequently used words, revealing useful information about viewers' preferences in these countries. The findings of this study will serve to better inform the movie industry in giving more economically promising names to their products from a machine-learning perspective and inspire further studies.


Introduction
The movie industry is a risky one, as there is no guarantee of even a self-sustaining return on investment for the majority of its players. In the U.S., where the movie market is the most developed and leading the rest of the world, the Disney empire alone claimed 33 percent of the total box office revenue in 2019, without taking titles under its newly purchased 21st Century Fox into account (McClintock 2019). In a country with an exponentially expanding movie market such as China, hundreds of low-budget movies are struggling to earn their budgets back while a handful of top hits sweep the lion's share of the total box office revenue into their pockets (Zhang 2010).
What on Earth, then, made the big names so successful and the small potatoes so powerless in the movie industry? Is it possible at all for the vast majority of producers of low-budget movies around the globe, who are more vulnerable to risks (Pikowicz and Zhang 2006), to know, in advance, the likelihood of at least winning their costs back? A scant number of studies have been conducted to propose potential predictors. For instance, Chang and Ki (2005) found that budgets, among many other conjectured factors, played a significant role in the performance of successful movies in the U.S. However, their study failed to explain the exact mechanism and conflicted with Feng and Sharma's (2016) study, which showed, with quantile regressions, that high-budget films were not always successful in the Chinese movie market. Others have postulated that action movies and sequel movies are more likely to be high grossing, as evidenced by the fact that more than half of the highest-grossing movies belong to these genres, including Toy Story, The Dark Knight, Harry Potter and Transformers (Pangarker and Smit 2013). However, as mentioned before, these highest-grossing movies only account for the tip of the iceberg and hardly represent the long tail of the less-famous movies unknown to the public.
Fortunately, a new line of hope is brought by the mass analysis of text. Thanks to the advancement of machine learning, which enables the processing of extremely large datasets of texts that would otherwise be virtually impossible to interpret with human intelligence, more economically promising decisions can be made. For example, Legoux et al. (2016) showed, in a machine-learning-powered study of 165,000 weekly theatre-level exhibitor notes in the U.S., that the success of a movie could be largely predicted by the critical reviews it received from distribution intermediaries, or exhibitors. Furthermore, Du et al. (2014) built a machine-learning-based model to predict a movie's box office revenue by analyzing comments about it on Sina Weibo, the most popular microblog site in China.
As powerful as reviews from exhibitors and viewers are as predictors of a movie's success, they are, nevertheless, factors that cannot be controlled during its production. For movie producers who wish to do whatever they can to predict the success of their products before they are finished, the focus has to be turned to the investigation of texts at a more fundamental level. Many researchers have therefore chosen to analyze movie titles, an impactful element that largely determines viewers' first impressions of a movie and that movie producers can adjust relatively easily and at a low cost. For example, Sood and Drèze (2006) found that informative movie titles significantly boosted consumers' purchasing behaviors. Additionally, recent research conducted by Bae and Kim (2019) showed that the sequels that consumers responded to the most in South Korea were those that differed very slightly in their titles from their successful first episodes. In the winner-take-all movie market, where a tiny shift in movie titles can lead to enormous economic gain or loss (Elberse and Oberholzer-Gee 2007), it is therefore necessary to build upon these findings and better understand what it takes to create a successful movie title that is likely to bring decent box office revenue.
To this end, text mining was employed in this study to visualize and compare popular movie title keywords in the United States and China. Exploratory factor analyses (Henson and Roberts 2006) were conducted using the movie title keywords of popular movies in these countries from 2015 to 2019 to explore and identify potential factors that underlie popular movie title keywords in the United States and China. Similarities and differences between extracted factors of successful movie title keywords in the United States and China are discussed in terms of their implications for the movie industry and further research. More specifically, we were dedicated to answering the following research questions:

1. What are the factors underlying popular movie title keywords that contributed to the success of their corresponding movies from 2015 to 2019 in the United States and China?
2. What differences, if any, exist among the extracted factors of movie title keywords in the United States and their counterparts in China?
3. What implications about the contribution of movie title keywords to the success of movies, if any, can be drawn from the similarities and differences among the extracted factors of movie title keywords in the United States and their counterparts in China?

Text Mining and Exploratory Factor Analysis
In this study, we employed two major research methods to answer the research questions listed above, namely, text mining and exploratory factor analysis. Text mining is a method widely used in information-science-related research based on extracting significant and meaningful patterns from unstructured text documents. It was a perfect fit for this study, as we were interested in identifying patterns underlying a huge set of movie titles. Furthermore, to estimate the number of factors and indicators related to these underlying dimensions, we employed a multivariate statistical analysis called exploratory factor analysis (EFA).

Text Mining
Text is an extensive and highly diverse medium that transfers information across cultures. Due to the gigantic volume of text in various formats, the processing of text necessitates techniques that efficiently and effectively uncover underlying patterns. Text mining, a tool frequently applied in the fields of big data analyses and better known as statistical text mining when used in combination with other statistical methods, has proven to be one such technique. Statistical text mining has become more powerful today, as information science researchers cooperate increasingly closely with statisticians to combine more statistical methods with text mining. For example, Tan (1999) proposed that text mining could synergize exceptionally well with at least the following statistical disciplines: content analysis, clustering analysis, factor analysis and data mining techniques; information extraction; and machine learning.
Tan also summarized the framework of text mining into two phases: text refining and knowledge distillation, as shown in Figure 1. Text refining transforms free-form text documents into one specific intermediate form, and knowledge distillation continues to infer knowledge or patterns revealed by the intermediate form.
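The two phases described above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used in this study: the stop-word list, the function names, and the choice of a binary document-term matrix as the intermediate form are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "my"}


def refine(title):
    """Text refining, step 1: reduce a free-form title to keyword tokens."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def document_term_matrix(titles):
    """Text refining, step 2: build the intermediate form.

    Returns the sorted vocabulary and one 0/1 row per title, where a 1 means
    that the keyword occurs in that title.
    """
    docs = [refine(t) for t in titles]
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = [[1 if term in doc else 0 for term in vocab] for doc in docs]
    return vocab, rows


def keyword_frequencies(vocab, rows):
    """Knowledge distillation: count the titles that contain each keyword."""
    return Counter({term: sum(r[j] for r in rows) for j, term in enumerate(vocab)})


titles = [
    "My Little Pony: The Movie",
    "The Lego Batman Movie",
    "Captain Underpants: The First Epic Movie",
    "Toy Story 4",
]
vocab, rows = document_term_matrix(titles)
freq = keyword_frequencies(vocab, rows)
print(freq.most_common(3))
```

The columns of such a matrix can then serve as the observed variables for a factor analysis, which is why a fixed intermediate form is convenient.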

Exploratory Factor Analysis
As computational power rapidly grows, factor analysis is becoming more and more popular as a powerful method of "reducing variable complexity to greater simplicity" (Kerlinger 1979). Factor analysis is a member of the family of statistical methods used to describe the relationships among many observed variables in terms of a few underlying but unobservable constructed factors. Researchers mainly focus on two factor analysis models: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). Compared to CFA, EFA does not require a strong empirical or conceptual foundation to be established before the evaluation of models, as researchers do not have to specify the number of latent factors at the beginning (Rahmawati et al. 2017). An exploratory factor analysis model is shown in Figure 2.
• Error terms are also represented by rectangles that are associated with the measured variables. They are latent variables as well. Each variable that is predicted (has a directional arrow pointing to it) must be associated with a residual variable.
• Paths, specifically, directional paths of influence or prediction, are represented in the diagram by lines with arrows pointing in a given direction.
Stapleton (1997) proposed that EFA is a useful tool for estimating the minimum number of latent hypothetical factors within a larger number of variables. Stevens (1996) also outlined the tenets of the exploratory theory. Since the basic single factor model has very few real-world applications, we focus on the more commonly utilized k-factor analysis model, which can use any number of latent factors to describe relationships in a dataset.
If we observe a set of variables $(x_1, x_2, \ldots, x_q)$ and assume these are linked to a set of unobserved latent variables $(f_1, f_2, \ldots, f_k)$, where $k < q$ since the number of latent factors must be less than the total number of observed variables $x$, we can obtain a set of regression model equations
$$
\begin{aligned}
x_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1k} f_k + u_1, \\
x_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2k} f_k + u_2, \\
&\;\;\vdots \\
x_q &= \lambda_{q1} f_1 + \lambda_{q2} f_2 + \cdots + \lambda_{qk} f_k + u_q,
\end{aligned}
$$
where $x_1, x_2, \ldots, x_q$ are the observed variables linked to the unobserved latent variables (factors) $f_1, f_2, \ldots, f_k$, and $u_1, u_2, \ldots, u_q$ are the residual terms.
From a standard regression model perspective, we consider the $\lambda$ values of the equations to act as regression coefficients for our manifest $x$ variables. More specifically, however, from a factor analysis perspective, these $\lambda$s are called factor loadings and show the nature and magnitude of the relationships between each $x$ and each latent factor $f$ (Everitt and Hothorn 2011).
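To make the role of the factor loadings concrete, the following Python sketch computes what a given loading matrix implies about the observed variables. It relies on the usual textbook assumptions, which are stated here as assumptions rather than taken from this study: the observed variables are standardized, the factors are uncorrelated with unit variance, and the residual terms are uncorrelated with the factors and with one another. The loading values are hypothetical.

```python
def communality(row):
    """Share of a variable's variance explained by the factors: sum of squared loadings."""
    return sum(l * l for l in row)


def uniqueness(row):
    """Residual variance of a standardized variable: 1 minus its communality."""
    return 1.0 - communality(row)


def implied_correlation(row_i, row_j):
    """Correlation between two different variables implied by orthogonal factors."""
    return sum(a * b for a, b in zip(row_i, row_j))


# Hypothetical loadings of four keywords on k = 2 factors (illustrative only).
loadings = {
    "little": [0.8, 0.0],
    "lego": [0.7, 0.1],
    "star": [0.1, 0.6],
    "story": [0.0, 0.5],
}

for keyword, row in loadings.items():
    print(keyword, round(communality(row), 2), round(uniqueness(row), 2))
print(round(implied_correlation(loadings["little"], loadings["lego"]), 2))
```

Under these assumptions, two keywords that load on the same factor are implied to be correlated, while keywords that load on different factors are implied to be nearly uncorrelated.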

Data Analysis and Results
In this study, we performed EFA to extract factors from the titles of popular movies in the United States and China released between 2015 and 2019 based on ratings provided by IMDb to help us to better identify and make sense of meaningful patterns in them. In this section, we present a series of charts that resulted from this process.

Descriptive Statistics
Firstly, Table 1 shows a summary of movie title keywords, their frequencies, and their corresponding movies' total lifetime grosses for the 50 most popular movies in the United States and China. In both countries, "man" and "movie" are among the three most frequently appearing keywords. As the most frequently appearing movie title keyword in both countries, "man" is likely favored because of the immense popularity of superhero movies across the two countries. The second most frequently appearing keyword in the United States, "star", reflects the success of the Star Wars series in terms of both its total lifetime gross and its fame in general, as marked by its relative success in the Chinese movie market despite the cultural differences. The third most frequently appearing keyword, "movie", likely indicates the popularity of the common format of naming a movie as "xxx, the movie". The bar charts of the top 10 movie title keywords in the United States (Figure 3) and the top 10 movie title keywords in China (Figure 4) further confirm that the success of "man" as a movie title keyword stems from the success of movies about superheroes such as Ant-Man, Spider-Man, and Batman. In addition, comparing Figure 3 with Figure 4 shows that the U.S. and China differ in how frequently "man" appears in the most popular movie titles.
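A summary of the kind reported in Table 1 can be assembled as in the following Python sketch. The function name is hypothetical and the gross values are arbitrary placeholders, not the figures analyzed in this study; the sketch also splits hyphenated titles on the hyphen, which is an assumption about tokenization.

```python
import re
from collections import defaultdict


def keyword_table(movies):
    """Map each keyword to (number of titles containing it, summed gross)."""
    counts = defaultdict(int)
    grosses = defaultdict(float)
    for title, gross in movies:
        # A set is used so that a keyword is counted once per title.
        for keyword in set(re.findall(r"[a-z]+", title.lower())):
            counts[keyword] += 1
            grosses[keyword] += gross
    return {kw: (counts[kw], grosses[kw]) for kw in counts}


# Placeholder grosses in arbitrary units (not real box office data).
movies = [
    ("Ant-Man", 1.0),
    ("Spider-Man: Homecoming", 2.0),
    ("The Lego Batman Movie", 3.0),
]
table = keyword_table(movies)

# Rank by frequency, breaking ties alphabetically, and keep the top 10.
top10 = sorted(table.items(), key=lambda item: (-item[1][0], item[0]))[:10]
print(top10[0])
```

Note that under this tokenization "Batman" is a single token and does not contribute to the count of "man"; whether compound names should be split is a design choice that affects the resulting frequencies.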
Next, we compared the total lifetime grosses of the movies released between 2015 and 2019 in the United States and China. Based on the 217 movies we observed that were released in the United States and the 245 movies we observed that were released in China, the movies from the United States exceeded their Chinese counterparts on all five summary statistics we chose, namely the mean, standard deviation, 25th percentile, 75th percentile, and interquartile range, as shown in Table 2.
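The five summary statistics named above can be computed with Python's standard `statistics` module, as in the sketch below. The input values are placeholders rather than actual grosses, and the quartiles use the module's default "exclusive" method; other software may use a different quartile convention and return slightly different values.

```python
import statistics


def describe(grosses):
    """Mean, sample standard deviation, quartiles, and interquartile range."""
    q1, _median, q3 = statistics.quantiles(grosses, n=4)
    return {
        "mean": statistics.mean(grosses),
        "sd": statistics.stdev(grosses),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


# Placeholder values in arbitrary units (not real box office data).
summary = describe([1, 2, 3, 4, 5, 6, 7])
print(summary)
```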

Data Visualization
In order to offer a more visually direct representation of the movie title keywords, we present the word clouds formed by movie title keywords in the United States (Figure 5) and by movie title keywords in China (Figure 6). In both figures, the movie title keywords that appear more frequently in the dataset show up in bigger fonts. Our previous discussion still stands regarding the success of "man", "star", and "movie" as outstanding movie title keywords.

Exploratory Factor Analysis Result (U.S.)
An exploratory factor analysis of the top 50 popular movie title keywords (MTK) from 2015 to 2019 in the United States was performed on 217 movies. Prior to running the analysis with RStudio, the data were screened by examining, for each MTK, descriptive statistics, interitem correlations, and possible univariate and multivariate assumption violations. From this initial assessment, all the variables were found to be interval-like, variable pairs appeared to follow bivariate normal distributions, and all the MTKs were independent of one another. Because our sample size of 217 movies in this data analysis was large, the MTK to movie ratio was deemed adequate. Prior to conducting the EFA, the adequacy of the input data was confirmed by Bartlett's sphericity test and the matrix determinant. The Bartlett test of sphericity result was significant (χ² = 1859.17, p < 0.001), suggesting that the correlation matrix was significantly different from the identity matrix and that the variables were correlated, which supported the data reduction. The correlation matrix determinant of 0.00003 was greater than the necessary value of 0.00001. Hence, the data were adequate for the EFA, and multicollinearity was not a problem for these data.
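The two adequacy checks can be sketched as follows. The sketch uses the standard form of Bartlett's statistic, χ² = −(n − 1 − (2p + 5)/6) ln|R| with p(p − 1)/2 degrees of freedom, where n is the number of observations and p the number of variables. The function names and the small correlation matrix are illustrative assumptions, and the p-value is omitted because it requires a chi-square distribution function that is not in the standard library.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom, determinant)."""
    p = len(corr)
    det = determinant(corr)
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(det)
    df = p * (p - 1) // 2
    return chi2, df, det


# Hypothetical 2 x 2 correlation matrix and sample size (illustrative only).
corr = [[1.0, 0.5], [0.5, 1.0]]
chi2, df, det = bartlett_sphericity(corr, n_obs=100)
print(round(chi2, 2), df, det > 0.00001)
```

A determinant close to zero signals near-linear dependence among the variables, which is why a lower bound such as 0.00001 is checked alongside the test.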
Furthermore, EFA with principal axis factoring was used in extracting the factors to be retained. As indicated by the scree plot (Figure 7), three factors had eigenvalues greater than one. Therefore, these three factors were extracted in the first set of analyses.
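The eigenvalue-greater-than-one rule can be illustrated with the following Python sketch, which computes the eigenvalues of a correlation matrix with the cyclic Jacobi method and counts those above one. This is a simplified stand-in, not the procedure run in RStudio: it works on the unreduced correlation matrix rather than performing principal axis factoring, and the matrix shown is hypothetical.

```python
import math


def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) entry after rotation.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_count(corr):
    """Number of eigenvalues strictly greater than one."""
    return sum(1 for value in jacobi_eigenvalues(corr) if value > 1.0)


# Hypothetical correlation matrix with all off-diagonal entries equal to 0.5.
corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
eigenvalues = jacobi_eigenvalues(corr)
print([round(v, 3) for v in eigenvalues], kaiser_count(corr))
```

Plotting the sorted eigenvalues against their rank gives a scree plot of the kind shown in Figure 7.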
Next, in order to improve the interpretability of the extracted factors, both Varimax and Promax Rotations were performed. The results were compared and indicated no significant differences. Therefore, to simplify the interpretation of the extracted factors (IBM Knowledge Center 2020), Varimax Rotation with Kaiser Normalization was adopted, and the structure coefficients from the Varimax Rotation are presented in Table 3. As shown in the table, the structure coefficients are reasonable but not notably large in magnitude, owing to the relatively small amount of variance explained by this structure.
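The idea behind Varimax Rotation can be shown on a two-factor example. The sketch below searches a grid of rotation angles for the one that maximizes the raw varimax criterion, that is, the summed variance of the squared loadings within each factor. It is a simplified illustration under stated assumptions: it handles exactly two factors, omits Kaiser Normalization, uses a grid search instead of the iterative algorithm found in statistical software, and operates on hypothetical loadings.

```python
import math


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    n_vars = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / n_vars
        total += sum((v - mean) ** 2 for v in squared) / n_vars
    return total


def rotate(loadings, angle):
    """Orthogonal rotation of a two-column loading matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * a + s * b, -s * a + c * b] for a, b in loadings]


def varimax_two_factors(loadings, steps=3600):
    """Grid search over [0, pi/2) for the angle with the largest criterion."""
    angles = (i * (math.pi / 2) / steps for i in range(steps))
    best = max(angles, key=lambda t: varimax_criterion(rotate(loadings, t)))
    return rotate(loadings, best), best


# Hypothetical unrotated loadings: every variable loads on both factors.
unrotated = [[0.6, 0.6], [0.5, 0.5], [0.6, -0.6], [0.5, -0.5]]
rotated, angle = varimax_two_factors(unrotated)
for row in rotated:
    print([round(v, 3) for v in row])
```

After rotation each variable loads on a single factor, which is the "simple structure" that makes the factors easier to name. An orthogonal rotation leaves each variable's communality unchanged, so only the interpretation changes, not the amount of variance explained.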

Figure 7. Scree plot indicating the number of retained factors for movies in the United States.
As shown in Table 4, the first factor extracted from the analysis (FA2) was "Family Movie", which included the following MTKs: little, first, captain, lego, and movie. The corresponding movies can also be found in the table, such as My Little Pony: The Movie; Captain Underpants: The First Epic Movie; and The Lego Batman Movie. The second factor extracted from the analysis (FA1) was "Sequels", which included the following MTKs: movie, star, story, and last. A few examples of their corresponding movies are Star Wars: Episode VII-The Force Awakens; Toy Story 4; and Star Wars: Episode VIII-The Last Jedi.
The third factor extracted from the analysis (FA3) was "Horror and Thriller Movie", which included the following MTKs: angel, death, black, day, fallen, and men. The corresponding movies are The Woman in Black 2: Angel of Death; Happy Death Day; and X-Men: Apocalypse.
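Grouping keywords under factors, as done above, amounts to reading off the salient loadings from the rotated structure. The following Python sketch shows one common convention, listing every keyword whose absolute loading on a factor reaches a cutoff. The cutoff of 0.3, the function name, and the loading values are assumptions for illustration and are not taken from Table 3.

```python
def salient_keywords(loadings, cutoff=0.3):
    """Map each factor index to the keywords whose |loading| meets the cutoff."""
    groups = {}
    for keyword, row in loadings.items():
        for factor, value in enumerate(row):
            if abs(value) >= cutoff:
                groups.setdefault(factor, []).append(keyword)
    return groups


# Hypothetical rotated loadings on three factors (illustrative values only).
loadings = {
    "little": [0.05, 0.62, 0.01],
    "movie": [0.35, 0.48, -0.02],
    "star": [0.71, 0.03, -0.04],
    "death": [-0.03, 0.00, 0.66],
}
groups = salient_keywords(loadings)
print(groups)
```

Because the rule is applied per factor, a keyword can appear under more than one factor, as "movie" does in this example.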
Moreover, Table 5 shows the correlations between the extracted factors obtained by using Promax Rotation. Two extracted factors, Family Movie and Sequels, were positively correlated with each other (corr(FA2, FA1) = −0.30). On the contrary, Horror and Thriller Movie was found to be negatively correlated with Family Movie and Sequels (corr(FA3, FA2) = −0.08 and corr(FA3, FA1) = −0.21). The three-factor Varimax-rotated structure mirrored the analysis of the top 50 popular MTKs in the United States. As shown in Table 3, the top 50 popular MTKs were distributed among the three factors in patterns that indicated distinct dimensions that could be used to further analyze the potential factors that lead to the success of movies in the United States. This will be discussed further in Section 4.

Exploratory Factor Analysis Result (China)
Similarly, an EFA was carried out on 245 movies to determine the number of factors extracted from the MTKs of the 50 most popular movies released from 2015 to 2019 in China. Before conducting the EFA, the adequacy of the input data was confirmed by Bartlett's sphericity test and the matrix determinant, which indicated that the data were suitable for EFA and that there was a sufficient correlation between the variables to proceed with the analysis. A total of five factors with eigenvalues greater than 1.00 were extracted based on the scree plot (see Figure 8). To better interpret the meaning behind the extracted factors, both Varimax and Promax Rotations were performed and indicated no significant difference between the results. All the extraction methods yielded the same structure, and the results of the principal factor solution with Varimax Rotation are reported in Table 6.
The five-factor Varimax rotated structure mirrored the underlying factors of the MTKs in the U.S. The first factor extracted from the data (FA1) was "Family Movie", which included the following MTKs: new, dad, and son. Clearly, a family movie theme can be seen in these three MTKs. Moreover, by inspecting their corresponding movies, namely, New Happy Dad and Son 2: The Instant Genius; Crazy New Year's Eve; and Dad, Where Are We Going 2, the underlying factor becomes even more visible.
The second factor extracted (FA2) was "Comedy", including these MTKs: Boonie Bears, secret, and adventure. The corresponding movies were as follows: Fantastica: A Boonie Bears Adventure; The Secret Life of Pets 2; New Happy Dad and Son 3: Adventure in Russia. Clearly, these movies belong to the genre of comedy.
The third factor extracted (FA3) was "Detective", including the following MTKs: detective, blue, and Conan. Their corresponding movies were Detective Chinatown; Detective Conan: The Fist of Blue Sapphire; and Detective Conan: Zero the Enforcer.
The fourth factor extracted (FA4) was "Action/Crime". The manifested MTKs were part, three, and Chinese. Our extraction is supported by the content of their corresponding movies, including The Hunger Games: Mockingjay-Part 1; Three Billboards Outside Ebbing, Missouri; and A Chinese Odyssey: Part Three.
The fifth factor extracted (FA5) was "Sequels". The observed MTKs were Boonie Bears, wars, star, and dad. Movie sequels such as Star Wars: Episode VII-The Force Awakens; Star Trek Beyond; Star Wars: Episode VIII-The Last Jedi; Rogue One: A Star Wars Story; and Solo: A Star Wars Story are good examples of this factor.
The correlations between the extracted five factors based on Promax Rotation are summarized in Table 7. The implications underlying the extracted five factors are discussed in the next section.

Discussion
Our results indicate many interesting patterns of successful MTKs and their corresponding popular movies that we have not elaborated in detail so far. This section is dedicated to describing these patterns and discussing their implications. We begin by presenting our intragroup analyses of popular movies in the U.S. followed by intragroup analyses of popular movies in China and proceed to intergroup analyses of popular movies in both countries.

Intragroup Analysis of Movie Title Keywords in the United States
"Family Movie" is, unsurprisingly, an important factor of popular movies in the U.S. As a genre that can be enjoyed by viewers of all ages, family movies, or animated family movies to be specific, are innately attractive to a bigger population of consumers than other types of movie. Given the presence of the Motion Picture Association movie rating system in the U.S., family movies are accessible to a larger population of consumers than most other rated movies, which carry age restrictions. As long as parents across the U.S. continue to take their children to the movies as a treat, movies made for the whole family will most likely continue to have a uniquely important role to play in the movie market. As such, for those who are planning to make a widely welcomed movie, it is never a bad idea to make something that can make a 6-year-old laugh alongside a 60-year-old.
More specifically, it is an interesting pattern that many of them share one naming structure: a concise combination of the names of the characters featured in the movie followed by a colon and "the Movie", such as My Little Pony: The Movie. This structure is advantageous in many ways, as it allows the titles of family movies to efficiently convey to potential viewers who they can expect to see. In cases where the characters featured are adaptations from well-established iconic characters, such titles can also greatly attract the attention of audiences who love these characters by revealing, to a large extent, what a viewer can expect to see. Similarly, the success of The Lego Batman Movie indicates the importance of emphasizing the major character featured in the shortest length possible. In just four words, this title can distinguish itself from other, more-serious Batman movies for older viewers only, such as The Dark Knight, sending a clear and strong message to parents anxiously choosing from an array of movies that this one is a safe choice for the whole family.
"Sequels" is also not an unexpected factor extracted from the MTKs of popular movies in the U.S., largely because of the success of the Star Wars series. An all-time classic that sets a milestone of success challenged by few competitors in the entire history of the movie industry, Star Wars has been favored by Americans for decades. Additionally, while some viewers might prefer some episodes over others depending on when they were born, how accepting they are of less-advanced visual effects, and the plots of the stories in general, Star Wars carries a culture that stands the test of time and guarantees a sound return on investment. Meanwhile, the Star Wars sequels were noticeably not the only sequel movies among the 217 movies we analyzed. Although not as phenomenally well received as Star Wars, sequels such as Transformers, The Woman in Black, Underworld, and Happy Death Day are still great examples of how impactful the titles of sequel movies can be in terms of establishing and preserving brand effects.
The third extracted factor, "Horror and Thriller Movie", covers a more diverse range of movie titles. Although these movies cover a variety of topics, their titles are constantly made using religious elements. Frequently appearing keywords such as "angels" and "death" indicate that the movie producers know exactly how to create thrillingly popular movie titles in the U.S. by taking advantage of a stunning similarity shared by the majority of Americans. Given that nearly 8 in 10 Americans believe in the existence of angels despite differences in their religious beliefs (The Associated Press 2011), using the image of angels and death in the titles of horror and thriller movies proves to be the best way of entertaining the highly diverse viewers in the U.S. with a common factor that almost everyone can find some resonance with, albeit in slightly different ways.

Intragroup Analysis of Movie Title Keywords in China
As in the U.S. market, "Family Movies" and "Sequels" are two extracted factors of popular movies in China, although the exact keywords involved differ, with the exception of "star" and "wars". Once again, almost all the movies under the "Sequels" category are Star Wars episodes, demonstrating the extraordinary success of the Star Wars series at an international level. Two original Chinese series, New Happy Dad and Son and Dad, Where Are We Going, also stood out, given the popularity of these titles as TV series. As its name suggests, New Happy Dad and Son depicts a dad and his son living happily together. Considering that the first Happy Dad and Son episodes were aired about 25 years ago, the New Happy Dad and Son movies are, indeed, dedicated to a new generation of dads and sons. Dad, Where Are We Going, on the other hand, is based on a reality show about how a group of fathers and sons from big cities adapt to life in the countryside. Overall, the success of these movies indicates that Chinese audiences truly appreciate and value harmonious father-son relationships.
As a new factor extracted from popular movies in China, "Comedy" includes many animated sequel movies that could equally be categorized as "Family Movies" or "Sequels". A salient example is New Happy Dad and Son: Adventure in Russia, the third installment of the New Happy Dad and Son series. Similarly, Boonie Bears is an original Chinese series favored by viewers of all ages that tells hilarious stories of two carefree bears casually defending the woods from a lone lumberman who always fails. Compared to the Happy Dad and Son series, Boonie Bears has a shorter history but has earned its popularity over the past eight years on TV screens and in cinemas. In addition to these original Chinese series, audiences in China also enjoy The Secret Life of Pets, an animated series produced by Universal Pictures.
The third extracted factor, "Detective", turns out to be contributed almost solely by Detective Conan, a Japanese detective series that portrays the adventures of a teenage detective solving one mysterious case after another in the hope of finding an antidote to the poison that turned his body into a child's. Originally distributed as manga and animation in the 1990s, the series has gained immense popularity in China among generations of viewers, for whom the stories of the forever elementary schooler never get old. While the earlier episodes of the series were quite thrilling, the elements of horror have been fading away in more recent chapters of Detective Conan, making them good fits for family viewing. Meanwhile, Detective Chinatown is a Chinese comedic detective series featuring some of the most popular comedy actors in China, such as Baoqiang Wang. The first movie of the series was released in 2015, and more installments are in production. Similar to Detective Conan, Detective Chinatown contains so little blood and gore that it can be enjoyed by family members of almost all ages.
The fourth extracted factor, "Action/Crime", describes a diverse set of movies, ranging from the dystopian fiction The Hunger Games to the black comedy crime drama Three Billboards Outside Ebbing, Missouri and the slapstick comedy fantasy A Chinese Odyssey. As different as these movies are, Chinese audiences demonstrate a consistent preference for movies with comedic elements, though the nature of such elements can vary from dark to light.

Intergroup Analysis of Movie Title Keywords in the United States and China
A comparison of the extracted factors of popular movies in the U.S. with those of popular movies in China reveals many interesting similarities. Movie producers may find some of these similarities inspiring as they create movie titles. First, audiences in both countries have a strong preference for family movies. In both countries, the majority of such movies tend to be animated movies with titles that primarily consist of the names of iconic characters, such that consumers bringing their families to the box office can quickly identify their perfect choices. One unique example in the case of China is Dad, Where Are We Going, as it is adapted from a popular reality show of the same name. In this case, the key element in its title that signals to potential viewers is, therefore, not the names of particular characters but the fame of the reality show itself. Either way, the titles of popular family movies never hesitate to take advantage of brand effects by using well-established content distributed in other formats to draw attention from viewers of all ages. Second, the popularity of the Star Wars series proves to be international, given that in both countries, "Sequels" is an extracted factor contributed almost entirely by various episodes of the series. It is, therefore, never a bad idea for a producer to work with George Lucas and create movies that depict the Star Wars universe to a greater extent.
The extracted factors also indicate a salient difference between U.S. and Chinese viewers that movie producers need to keep in mind as they consider who their target audiences are. Specifically, Chinese viewers tend to enjoy comedies more than their American counterparts, to the extent that many popular movies in China contain comedic elements even though they fit into more serious categories such as detective movies and action and crime movies. Patterns in the titles of such movies are difficult to detect, but the movies themselves almost always belong to popular series that Chinese viewers are familiar with, as exemplified by the Detective Conan series, the Detective Chinatown series, and the Chinese Odyssey series. American viewers, on the other hand, tend to appreciate elements of horror and thrill more. The titles of such movies are highly diverse as well, but most of them make use of keywords such as "angels" and "death" to cater to the religious nature of American culture.

Implications
In sum, inspired by Bae and Kim (2019) and Sood and Drèze (2006), who showed that informative movie title keywords play a critical role in predicting the economic success of their corresponding movies, this study used statistical text mining and exploratory factor analysis to uncover interesting patterns that underlie the titles of popular movies in the United States and China. These patterns have implications for a variety of stakeholders.
For movie producers who have established their brands in the market, the message is straightforward. With the success of sequel movies in both the U.S. and China, the strategy of building up brand effects proves to be quite effective in attracting loyal audiences who are willing to pay for the sake of the character or the series name that they see in the titles. There is no need for them to use any complex skills in the creation of successful movie titles, as simply adding "the movie" next to the name of the main character would work. This not only confirms the finding of Bae and Kim (2019) that sequel movies with titles similar to their successful first episodes are more likely to be successful as well, but also helps explain the massive success achieved by the Disney empire as it promotes one episode after another under the name of the phenomenally successful Star Wars, as exemplified by its decision to produce Rogue One, a fairly independent story on its own. Had Disney forgotten to add "a Star Wars Story" to the title, the revenue achieved by Rogue One would probably have been an entirely different story.
For producers with limited budgets, on the other hand, the strategy suggested by our findings is more complicated. Clearly, as they do not have a successful first episode whose brand effects they can benefit from, they are unlikely to achieve success via action or sequel movies. Even though these genres describe the majority of the most successful movies, as Pangarker and Smit (2013) suggested, low-budget movie producers will inevitably find the budgets needed for visually impressive CGI and action stars unaffordable. Therefore, as discussed in the previous section, horror and thriller movies should be considered by low-budget movie producers in the U.S., while comedic movies should be considered by their counterparts in China. In both the U.S. and China, low-cost animated family movies are a decent direction to consider, too. In the case of China, where the advantage of high-budget films is known to be less salient (Feng and Sharma 2016), low-budget film producers stand an especially good chance.
One major limitation of this study is that, despite identifying important factors that largely describe the features of successful movie title keywords, we did not analyze the relationships among these extracted factors. Another major limitation is that our findings are highly time-sensitive. In years when the Star Wars franchise is not producing new episodes, our methods would not be capable of detecting its immense popularity, as words such as "star", "wars", and "rogue" (as in Rogue One) would not appear on our radar. Meanwhile, as COVID-19 continues to have devastating effects on the movie industry across the world, the representativeness of the movies we focused on was undoubtedly compromised, as they were all released between 2015 and 2019. Owing to space limits, we leave these limitations to a future study, in which we will quantify the relationships among the extracted factors with structural equation modeling techniques and qualitatively investigate the implications of the pandemic for the box office.
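The time sensitivity described above can be illustrated with a minimal sketch. It assumes a simple keyword-frequency count over titles within a release-year window; the title list, the stopword set, and the function name keyword_counts are hypothetical and chosen only for illustration, and they are not the dataset or the pipeline used in this study.

```python
import re
from collections import Counter

# Toy sample of (title, release year) pairs; NOT the study's dataset.
TITLES = [
    ("The Lego Movie", 2014),
    ("Star Wars: The Force Awakens", 2015),
    ("Rogue One: A Star Wars Story", 2016),
    ("Star Wars: The Last Jedi", 2017),
    ("The Lego Batman Movie", 2017),
    ("My Little Pony: The Movie", 2017),
    ("Happy Death Day", 2017),
]

# Illustrative stopword list (an assumption, not the study's list).
STOPWORDS = {"a", "an", "the", "of", "in", "and"}


def keyword_counts(titles, start_year, end_year):
    """Count title keywords for movies released in [start_year, end_year]."""
    counts = Counter()
    for title, year in titles:
        if start_year <= year <= end_year:
            # Lowercase the title and keep alphanumeric tokens only.
            tokens = re.findall(r"[a-z0-9]+", title.lower())
            counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# A window that contains Star Wars releases versus one that does not.
with_franchise = keyword_counts(TITLES, 2015, 2017)
without_franchise = keyword_counts(TITLES, 2014, 2014)

print(with_franchise.most_common(3))
print(without_franchise["star"], without_franchise["wars"])
```

In this toy sample, "star" and "wars" dominate the 2015 to 2017 window but are absent from the 2014 window, which is the sense in which frequency-based keyword extraction depends on the release period sampled.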