Significant Labels in Sentiment Analysis of Online Customer Reviews of Airlines

Abstract: Sentiment analysis is becoming an essential tool for analyzing the contents of online customer reviews. This analysis involves identifying the labels needed to determine whether a comment is positive, negative, or neutral, and the intensity with which the customer's sentiment is expressed. Based on this information, service companies such as airlines can design and implement a communication strategy to improve their customers' image of the company and the service received. This study proposes a methodology to identify the significant labels that represent the customers' sentiments, based on a quantitative variable, that is, the overall rating. The key labels were identified in the comments' titles, which usually include the words that best define the customer experience. This label database was then applied to the longer full-text online customer reviews in order to validate that the identified tags are meaningful for assessing the sentiments expressed in them. The results show that the labels drawn from the titles are valid for analyzing the feelings in the comments, thus simplifying the set of labels to be taken into account when carrying out a sentiment analysis of customers' online comments.


Introduction
Communication between companies and clients increasingly takes place through user-generated content (UGC) on social media and specialized websites [1]. The online opinions expressed by customers on TripAdvisor, Expedia, Facebook, Instagram, or Twitter influence the reputation and brand image of service companies. Customers share their experiences related to the service they have received with others. In this context, the analysis of the online content shared by customers is essential in order to implement an effective communication strategy. Sentiment analysis includes different methodologies to evaluate the meaning of online comments [2][3][4], so that steps can be taken to increase customer loyalty. Sentiment analysis involves designing automated learning models that make it possible to assess whether the sentiments communicated by clients are positive, negative, or neutral, and their degree of intensity [5][6][7][8]. The aim is to create and implement machine learning and artificial intelligence methodologies that help to manage the large amount of data generated on the Internet between clients and service companies [9]. As service companies, airlines are exposed to constant information transmitted by their customers, which directly influences potential customers. Therefore, they need methods that speed up effective online communication with their customers [10].
Online reputation is the basis for the different research lines being developed to improve knowledge and provide useful tools for airlines to better understand their clientele's preferences and the competitiveness of their service offerings [11,12]. The objective is for the image transmitted by companies on the Internet through social media and specialized websites to correspond to the service perceived by customers [13]. The online reputation evaluated by customers on quantitative scales or in
Table 1. Related research on overall ratings from online reviews in tourism.

| Study | Variables | Research Context | Key Findings |
|---|---|---|---|
| Park et al. (2020) [71] | Number of user reviews; average user ratings | TripAdvisor; 20 US airlines with 157,035 reviews and overall ratings | The quality of specific service attributes, such as cleanliness, food and beverages, and in-flight entertainment, affects the variations in positive ratings as a satisfier. Other airline service attributes, such as customer service and check-in and boarding, influence deviations in negative ratings as dissatisfiers. |
| Tsai et al. (2020) [74] | Online hotel reviews; the overall ratings | TripAdvisor; 1009 US hotels with 23,430 reviews | A novel approach is proposed to generate high-quality summaries of online hotel reviews. Both review helpfulness and hotel features were considered before review summarization. Online hotel reviews were collected in an experimental evaluation. |
| Sharma et al. (2020) [8] | Number of user sentiment reviews and the overall rating of the specific flight | TripAdvisor; 20 US airlines with 157,036 reviews | Prospect theory explains the relationship between ratings and review sentiment. Loss aversion and diminishing sensitivity are confirmed. Negative deviations in ratings lead to a higher impact on review sentiment than positive deviations. Variations in ratings closer to (away from) the reference point result in higher (lower) marginal impacts on sentiment. |
| [75] | Online sentiment reviews and the ratings of airlines | SKYTRAX; 24,165 online reviews | Text mining technology is used to automatically access the information in text comments. Sentiment analysis based on a sentiment dictionary is used to classify user reviews. Co-occurrence analysis is used to identify passengers' concerns about different aspects of service in the aviation industry. |
| [80] | User online reviews, text readability, and historical rating distribution | Online attraction reviews from TripAdvisor; two-level empirical analysis; Tobit regression model | Both text readability and reviewer characteristics affect the perceived value of reviews. |
| Amblee (2015) [81] | The density of negative reviews | SquareMouth.com; pooled ordinary least squares (OLS) regression; over 21,000 reviews of travel insurances | When the density of negative reviews is high, sales are lower and vice versa. |
| Park and Nicolau (2015) [82] | Online reviews (star ratings) on usefulness and enjoyment | Yelp.com; data collected from restaurant reviews from New York and London; 35 restaurants in London with 2500 reviews and 10 in New York with 2590 reviews | The valence of online reviews has a U-shaped effect on usefulness and enjoyment. Negative ratings of reviews are more useful than positive reviews. Positive ratings are associated with higher enjoyment than negative reviews. |
| Zhu and Zhang (2010) [83] | Coefficient of variation in ratings; the total number of reviews posted | Gamespot.com; VideoGames.com; psychological choice model | Online reviews are more influential for less popular games and games whose players have more Internet talent. |
| Mudambi and Schuff (2010) [84] | Star rating of the reviewer; the total number of votes about each review's helpfulness; word count of the review | Amazon.com; 6 products with 1587 reviews | Review depth is correlated with helpfulness, but review extremity is less helpful for experience goods. |
| | | | Word of mouth (WOM) is not an unbiased indicator of quality and will affect sales. |
| | User-generated content (UGC); product ratings; product reviews | Amazon, DPReview, and Epinions; logistic regression; 148 digital cameras with 31,522 reviews | Online WOM on external review websites is a more significant indicator of sales for high involvement products. |
| Moe and Trusov (2011) [88] | Average of all ratings | Bath, fragrance, and beauty products; 500 products with 3801 ratings | Online WOM affects sales and is subject to social dynamics in that ratings will affect future rating behavior. |
| | The average number of reviews, recommendations, and sales rank | Book data from Amazon.com; multiple regression; 610 observations with 58,566 total reviews | Consumer ratings are not found to be related to sales, but recommendations are highly significant. |

Research Methodology
Sentiment analysis is usually performed through machine learning. In this case, a machine learning methodology based on multiple regression analysis is proposed. It relies on quantitative information offered by TripAdvisor, namely the rating, which makes it possible to measure the relationship with the identified labels. The proposed methodology therefore uses one quantitative variable, the general rating, and multiple qualitative variables, which are the labels customers use to communicate their sentiments. To carry out this analysis, the qualitative label variables must be converted into dichotomous variables (0, 1), so that each comment becomes a vector. With these data, a multiple regression analysis is carried out in which the dependent variable is the general rating and the independent variables are the identified labels.
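The conversion of comments into dichotomous vectors can be illustrated with a minimal Python sketch. The label list, the sample comments, and the function name `encode_comment` are hypothetical and serve only as illustration; the original software is not described in detail, so the tokenization used here (lower-casing and splitting on word characters) is an assumption.

```python
import re

def encode_comment(text, labels):
    """Return a 0/1 vector: 1 if the label occurs as a word in the text."""
    # Assumption: a simple word tokenizer; \w is Unicode-aware, so accented
    # Spanish letters are kept inside words.
    words = set(re.findall(r"\w+", text.lower()))
    return [1 if label in words else 0 for label in labels]

labels = ["excellent", "delay", "rude"]  # hypothetical labels
comments = [
    ("Excellent crew and service", 5),   # (comment text, overall rating)
    ("Long delay and rude staff", 1),
]

# Each row: overall rating followed by one dichotomous variable per label
rows = [[rating] + encode_comment(text, labels) for text, rating in comments]
print(rows)  # [[5, 1, 0, 0], [1, 0, 1, 1]]
```

Each row then holds the dependent variable (the rating) followed by the independent variables (the labels), which is the layout a regression routine expects.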
To achieve this goal, it is necessary to follow a sequence of steps (see Figure 1). The first step is to create an initial database of labels with all the words found in the titles of the customers' online comments. In the second step, the tags are cleaned by eliminating those that do not offer direct information about the feelings expressed, such as the articles "the" or "a" or commonly used verbs such as "to be" or "to have". In the third step, the possibility of simplifying this database through lemmatization, which merges labels that share the same root, is evaluated. Thus, plurals are eliminated, as well as verb tenses in regular verbs. The next step is to create a numerical database with the rating variable and the dichotomous variables (0, 1) of all the defined labels in comments or titles of online customer reviews. This transformation leads to the next step, which is the regression analysis. Then, the model's robustness is evaluated, and the essential labels are extracted depending on whether they have a significant relationship with the rating. Finally, a database of the significant labels is generated, where the sign and intensity of their statistical relationship with the general rating variable are determined. Thus, a specific lexicon of the airline's customers is generated with the tags that they usually use and that predict a positive or negative evaluation of their sentiments about the service received. This proposed process can be updated continuously as the airline receives a relevant number of new comments.
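The first two steps of this sequence (collecting every word from the titles and discarding uninformative ones) might look as follows in Python. This is only a sketch: the stop-word list is a tiny hypothetical English one chosen for readability, whereas the study worked with Spanish titles, and `build_labels` is an illustrative name rather than the authors' program.

```python
import re

# Hypothetical, deliberately small stop-word list (articles and common verbs)
STOP_WORDS = {"the", "a", "an", "is", "was", "be", "have", "had", "and", "to"}

def build_labels(titles, stop_words=STOP_WORDS):
    """Collect the distinct words used in the titles, minus stop words."""
    vocab = set()
    for title in titles:
        vocab.update(re.findall(r"\w+", title.lower()))
    return sorted(vocab - set(stop_words))

titles = ["The flight was excellent", "A terrible delay", "Excellent crew"]
print(build_labels(titles))
# ['crew', 'delay', 'excellent', 'flight', 'terrible']
```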
In this research, these steps were followed, starting with the collection of 5278 online reviews of the airline Iberia on TripAdvisor, all of which were written in Spanish. However, the proposed data processing methodology can easily be applied to other languages such as English or French. Three pieces of information were collected for each review. The first was the overall rating, a variable with five alternatives ranging from 1 for low service to 5 for excellent service. The second was the title of the comment, where customers specify their feelings in a short sentence. The third was the comment itself, where customers relate their experiences and emotions about the airline's service in greater detail.
The next step was to build a database of all the words used in the online comment titles, for which a program was developed. In this study, the titles were used to obtain the labels because they are short and customers have to express their sentiments about the service received from the airline briefly. If the labels were created from the comments, their number would increase considerably, making the statistical analyses more difficult. In this context, if this study demonstrates that title tags can be used to perform sentiment analysis of comments, it will represent a step forward in the research in this field by significantly reducing the number of tags to be evaluated. The words in the titles were refined to eliminate terms that do not influence the customers' sentiment, such as articles, certain verbs, or pronouns. This work created the basic tags that were used in the research. Once the initial database was cleaned up, 2567 labels were obtained. To reduce this number, the labels were lemmatized. Several programs perform this function in English, but because all the texts are in Spanish, we decided to develop specific software for this purpose. A minimum length of six letters was specified for root matching, because very short strings can be shared by labels with different meanings. Thus, if a tag had fewer than six letters, the complete tag was searched for, whereas if it had six or more letters, its root was searched for. This process reduced the pool to 1523 labels that were later used in the regression analysis.
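The six-letter rule can be sketched in Python as follows. The text does not state how the root of a longer label was derived, so this sketch assumes, purely for illustration, that the root is the first six characters; the function names and the Spanish sample words are likewise hypothetical. Strictly speaking, this prefix matching is closer to stemming than to dictionary-based lemmatization.

```python
MIN_ROOT_LEN = 6  # minimum length stated in the text

def root_of(label, min_len=MIN_ROOT_LEN):
    """Keep short labels whole; cut longer ones to a prefix (assumed root)."""
    return label if len(label) < min_len else label[:min_len]

def reduce_labels(labels):
    """Merge labels that share a root and return the distinct roots."""
    return sorted({root_of(label) for label in labels})

def root_matches(root, text_words):
    """Whole-word match for short labels, prefix match for six-letter roots."""
    if len(root) < MIN_ROOT_LEN:
        return root in text_words
    return any(word.startswith(root) for word in text_words)

labels = ["retraso", "retrasos", "retrasado", "mal", "malo"]
print(reduce_labels(labels))  # ['mal', 'malo', 'retras']
```

Matching short labels as whole words avoids, for example, counting "mal" inside "malo", which is the ambiguity the six-letter threshold is meant to prevent.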
The next step was to build the numerical database to be used for the multiple regression analysis. The first variable was the overall rating, which, as indicated above, is quantitative and takes values from 1 to 5, depending on the degree of customer satisfaction with the service received. The remaining variables are the 1523 dichotomous labels, where 0 means that the label is not in the customer's comment and 1 means that the customer uses it to express the rating. To prepare this database, it was necessary to develop software whose output was in Excel format. These data were processed with the statistical program SPSS in order to carry out the multiple regression. The dependent variable is the general rating, whereas the independent variables are the defined labels. The program's output includes the adjusted R square, which measures the goodness of fit of the model, together with the coefficients and their levels of significance. With these outputs, the labels that are significantly related to the general rating are determined and can be considered key labels for measuring customers' feelings. Likewise, the sign and intensity of the sentiments are given by the coefficients, which can be positive or negative, with a value that determines the degree of relationship with the general rating. If the model obtains a high adjusted R square, it will demonstrate that the labels extracted from the titles of the comments can be used to establish the sentiments reflected in the online comments.
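To make the regression step concrete, the following self-contained Python sketch fits an ordinary least squares model on a toy data set and computes the adjusted R square. It is not the SPSS procedure used in the study: the normal-equations solver, the function names, and the eight toy observations are all assumptions made for illustration, and significance levels are left out. With 1523 labels a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear labels?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Fit y = b0 + sum_j b_j x_j; return (coefficients, adjusted R square)."""
    n, k = len(X), len(X[0])
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    ztz = [[sum(Z[i][p] * Z[i][q] for i in range(n)) for q in range(k + 1)]
           for p in range(k + 1)]
    zty = [sum(Z[i][p] * y[i] for i in range(n)) for p in range(k + 1)]
    beta = solve(ztz, zty)
    fitted = [sum(b * z for b, z in zip(beta, row)) for row in Z]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return beta, adj_r2

# Toy data: two dichotomous labels per comment and the overall rating
X = [[0, 0], [0, 0], [1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
y = [3, 4, 5, 5, 1, 2, 3, 3]
beta, adj_r2 = ols(X, y)
print([round(b, 3) for b in beta], round(adj_r2, 4))
# [3.5, 1.5, -2.0] 0.8963
```

In this toy fit the first label raises the predicted rating by 1.5 points and the second lowers it by 2, relative to a constant of 3.5.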

Regression Analysis
Quantitative and qualitative data from online customer feedback, organized into vectors, were entered into a multiple regression analysis. The dependent variable is the overall rating, whereas the independent variables are the defined labels. Table 2 compares the results obtained in the regression with all the tags and the regression with the tags' roots, and shows how the number of labels is reduced to reach the significant labels for measuring sentiment based on the overall rating. When the regression model was run with all the labels after cleaning, the adjusted R square was 0.614, which is high for this type of study. When the labels were lemmatized, the adjusted R square fell to 0.579, which is still high. Therefore, performing the regression with the tags' roots was justified in order to simplify the study of the key labels. The final result of this regression was 295 labels that best define the customers' sentiments because they were significantly related to the overall rating at the 5% or 10% level.
The multiple regression analysis results are presented in Table 3, where only the labels significant at 5% or 10% are displayed. These 295 tags allow us to assess customers' sentiments about the quality of service received from the airline. The sign of the coefficient indicates whether the relationship is direct or inverse, and the value of the coefficient indicates the intensity with which each label is related to the rating. The model constant of 3.679, which is significant (p < 0.05), is particularly noteworthy: the coefficients of the labels found in a comment are added to or subtracted from this constant to predict the overall rating.
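Written as an equation, the fitted model described here takes the usual multiple regression form. The symbols are not used in the original text and are introduced only for illustration: \hat{y}_i is the predicted overall rating of comment i, x_{ij} indicates whether label j appears in that comment, and k is the number of labels entered in the regression.

```latex
\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j \, x_{ij},
\qquad x_{ij} \in \{0, 1\},
\qquad \hat{\beta}_0 = 3.679
```

A comment containing none of the labels is therefore predicted at the constant, and each label present shifts the prediction by its own coefficient.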
Among the labels with a positive coefficient, coordination stands out, with a coefficient of 2.245 (p < 0.05), as well as clause (1.824), unbeatable (0.990), perfection (0.633), or affordable (0.832), to give some examples. It should be taken into account that, in the comments, the number of words used is much higher than in the titles, and so some labels can be found that express neutral or negative sentiments with positive coefficients. This is because they appear very rarely in the comments, and there are other tags in the comments that have a different meaning. For example, the doubt label appears with a positive coefficient of 1.489 (p < 0.05). By itself, it expresses a negative sentiment, but if it is inserted in the expression "do not doubt it", for example, its meaning changes to a positive sentiment of the client who has had this experience.
In contrast, some labels show negative sentiments, including roadkill with a coefficient of −2.124, indignation (−1.513), tablets (−1.058), irresponsible (−1.038), rude (−0.966), or minuscule (−0.914). There is also a label related to an attribute of the service, as in the case of check-in, which obtains a coefficient of −0.810, showing that it is an aspect of the airline that customers value negatively. Some tags express positive sentiments, but they appear with negative coefficients due to the context of the sentence in which they are found. Some examples are the labels exclusive (−0.861) and sensitivity (−0.965), which must be accompanied by negative words that change their meaning. A label that expresses a feeling is sardines (−0.758), which is usually used when passengers are overcrowded. Other terms that are used to communicate negative feelings are swindle (−0.746), badly (−0.346), uncomfortable (−0.314), bad (−0.305), scarce (−0.281), or disappointment (−0.263). When these adjectives appear in the comment, they are indicating a negative sign in the feelings communicated by the clients.
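The additive reading of the constant and the coefficients can be illustrated with a short Python sketch that uses a handful of the values quoted above. The dictionary covers only a few of the 295 significant labels, and the function name and the clipping of the result to the 1 to 5 scale are assumptions added for illustration, not part of the reported model.

```python
CONSTANT = 3.679  # regression constant reported in the text

# A few coefficients quoted in the text; the full Table 3 has 295 labels
COEFFICIENTS = {
    "unbeatable": 0.990,
    "affordable": 0.832,
    "rude": -0.966,
    "uncomfortable": -0.314,
    "bad": -0.305,
}

def predict_rating(labels_in_comment, clip=True):
    """Constant plus the coefficients of the distinct known labels present."""
    score = CONSTANT + sum(
        COEFFICIENTS.get(label, 0.0) for label in set(labels_in_comment)
    )
    if clip:  # assumption: keep the prediction on the 1-5 rating scale
        score = min(5.0, max(1.0, score))
    return round(score, 3)

print(predict_rating(["unbeatable", "uncomfortable"]))  # 4.355
print(predict_rating(["rude", "bad"]))                  # 2.408
```

Labels that are not in the lexicon contribute nothing, so a comment with no significant labels is predicted at the constant.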
Labels with coefficients close to zero can be assumed to describe neutral sentiments. In other words, when such a label appears, its value hardly modifies the regression constant, so it is plausible to say that these labels express a neutral sentiment. This is the case of the entertainment label, which has a coefficient of 0.086. Other similar cases are the labels without (−0.077), normal (−0.085), passenger (−0.086), or hours (−0.096). Table 3 shows other labels with absolute coefficients below 0.2, which would also be considered neutral. However, the positive or negative sign marks a trend in customer sentiment.
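This three-way reading of the coefficients can be expressed as a small Python function. The 0.2 band follows the threshold mentioned in the text; treating it as a strict cutoff, as well as the function name, is an assumption made for illustration.

```python
NEUTRAL_BAND = 0.2  # coefficients below this magnitude are read as neutral

def sentiment_of(coefficient, band=NEUTRAL_BAND):
    """Map a regression coefficient to a coarse sentiment class."""
    if abs(coefficient) < band:
        return "neutral"
    return "positive" if coefficient > 0 else "negative"

# Coefficients quoted in the text
for label, coef in [("entertainment", 0.086), ("perfection", 0.633), ("swindle", -0.746)]:
    print(label, sentiment_of(coef))
```

The magnitude of the coefficient still carries the intensity, so a classification like this would normally be stored alongside the coefficient rather than in place of it.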

Discussion of Results
The present study has validated the use of the labels extracted from the titles of comments to determine customers' sentiments in their full comments. The titles are short phrases that synthesize the customer's experience with the airline, whereas the comments are composed of longer texts where mixed feelings can be collected. Therefore, there may be positive impressions next to words that show negative sensations in a comment. The problem is to determine how these sentiments with different signs influence the final evaluation of the overall rating. In this context, the rating is a quantitative variable with five alternatives that evaluate a synthesis of the experience of the passengers of an airline or service company. Hence, this dimension has great importance in measuring the sign and intensity of customers' labels to express their feelings.
Not only is it necessary to determine whether the sentiments are positive, negative, or neutral, but also to assess their intensity. Generally, studies on sentiment analysis use lexicons developed generically to evaluate the sign, and sometimes the intensity, of clients' emotions and experiences to provide their perceptions in a specific way through open structured models [75,77,79]. However, language is a living reality that can vary according to geographical areas, time, and cultural backgrounds. English-speaking customers tend to post higher ratings than non-English speaking customers [95]. Moreover, terms to evaluate specific services may become more specific over time, creating a flexible, adaptable, and specific lexicon for each service or company. This research is carried out from this perspective, in order to propose a methodology for each airline to develop, assess, and test the key labels in knowing the sentiments of the customers. At present, it is an essential tool in companies' communication strategy because the majority of the communication is being carried out through the Internet and spontaneously through social media.
The methodology proposed to develop machine learning involves obtaining the information and creating the label databases. This study shows that the labels can be simplified by considering only their roots because the adjusted R square, although somewhat lower than that of all the labels, is significantly high. Moreover, the difference is minimal when reducing about one thousand tags to be used in the statistical analyses. The study also demonstrates that the labels obtained from the titles are valid to determine the relationship between the contents of the comments and the general rating. This is especially useful because the number of tags obtained from the comments would be much higher, making the regression analysis more complicated. Therefore, this study reveals that the labels can be simplified to establish the customers' sentiments in their online ratings, with 295 key labels identified as having a significant relationship with the overall rating.
The multiple regression identifies the labels that express positive sentiments, which have a positive sign and a high coefficient, whereas labels with a negative sign and a high absolute coefficient identify negative sentiments. Labels with coefficients close to zero, either positive or negative, can be taken to indicate a neutral feeling. In this context, the regression constant obtained a value of 3.679, which indicates that ratings around this value reflect a neutral customer assessment. Given that the midpoint of TripAdvisor's scale is three and the regression constant is more than 20% higher, only ratings above the constant mean that customers assess the airline's service positively. Consequently, the airline should consider any rating below 4 to be a non-positive rating. Likewise, any label with a coefficient close to 0 signifies that the rating of a customer using that label is close to the constant, which is a neutral value.
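The comparison between the constant and the midpoint of the scale can be checked with a one-line calculation, shown here only as a worked illustration of the figures in the text:

```latex
\frac{3.679 - 3}{3} \approx 0.226
```

That is, the constant lies about 22.6% above the midpoint of the 1 to 5 scale, which is consistent with the statement that it is more than 20% higher.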
The results obtained show that multiple regression analysis is a valid basis for machine learning in customer sentiment analysis. It has an advantage over models based on neural networks, which only yield a prediction success rate and do not provide information about the key labels to follow in order to detect customers' sentiments, or about the level of intensity with which these sentiments are expressed. In this context, tags that reflect the same feeling, whether positive or negative, may vary in intensity because a word is not the same as its synonym. Regression analysis referring to quantitative feedback, such as the overall rating, facilitates this task and helps to decipher the more emotional communication an airline has with customers through written texts. Therefore, one of the fundamental contributions of this study is that it demonstrates the need for a quantitative reference in order to identify a lexicon of labels with their corresponding sentiments and develop effective communication with airline customers. From this perspective, predictions of the general rating can be made based on the vocabulary customers use in their comments. This is a strategic aspect in developing and applying artificial intelligence systems to communications between airlines or other service companies and their customers.

Conclusions
The main conclusion of this study is that regression analysis based on the overall rating can be used as the basis for machine learning of sentiment analysis of online customer reviews. A quantitative variable that serves as a reference to measure the sentiments of customers transmitted through written comments helps to determine their sign and intensity. Moreover, the results show that the labels extracted from the titles are valid for evaluating the feelings collected in the comments.
One of the main problems when assessing feelings is that there are no dynamic elements to guide feelings' value. In many cases, pre-developed lexicons are used to determine a positive or negative sign or even an intensity level. However, all of this has been done on a general basis without focusing on evaluating the services offered by a company such as an airline. Moreover, companies need to detect the keywords used by their clients to evaluate their services because company-customer communication is increasingly carried out through the Internet, either on social media or by e-mail. Likewise, companies that want to go further and have immediate feedback when the client is receiving the service need to know the type of vocabulary their clients use to express their sentiments.
Along these lines, this study has validated the process of simplifying the number of labels to be used in sentiment analyses, showing that the roots of the labels are useful. With the multiple regression analysis, the labels significantly related to the general rating are determined, and their coefficients display the sign of the relationship and its intensity, according to the value obtained. This is a customer-centered method for developing the lexicon of customer sentiments that includes their sense and intensity based on the dynamics of customers' online dialogues. This study makes an exciting contribution to current and future research. It is a proposal for each company to draw up its customer communication codes using the tags automatically extracted from online dialogues or comments.
In the context of the airline industry, managers can use this dynamic method to identify the different labels to achieve the maximum customer satisfaction across time and position their service offerings. Our findings are in line with the literature on the important role of Big Data in providing airlines with a sustainable competitive advantage [12,96,97]. The practical implications of this study have strategic relevance for airlines. First, the words used are related to the rating given by customers, so that a statistical analysis of relationships can be carried out. Second, it is essential for companies to determine which key labels best define their service, whether in a positive or negative sense. This study shows that it is a useful and practical method for applying a continuous learning procedure through technological means. Finally, airlines need to understand the qualitative assessments of customers more in-depth, going beyond the market classification of customer feelings as positive or negative. It is also necessary to know the intensity at the moment they receive the clients' comments, in order to be able to give them an effective answer that will make them loyal.
This study has some limitations that should be investigated in the future. The first is that the number of times the labels appear is not evaluated, which is an essential factor in determining whether the results are meaningful. A label used in only a few comments may obtain a coefficient that does not coincide with its actual meaning because it might be biased in those comments by other words from its context. Another aspect for future evaluation would be to analyze tags that appear in the same comment or sentence. This is another dimension of content analysis because tags that appear in the same comment can have a reinforcing or neutralizing effect on one another. This line of development would lead to validating the analysis of label structures according to their relationship with the general rating.
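One possible way to address the first limitation is to count how many comments each label appears in and to set aside labels with very little support before interpreting their coefficients. The sketch below is only an illustration of that idea, not something done in the study; the function names, the sample comments, and the cutoff of two comments are hypothetical.

```python
import re
from collections import Counter

def label_support(comments, labels):
    """Count in how many comments each label occurs at least once."""
    counts = Counter()
    wanted = set(labels)
    for text in comments:
        counts.update(wanted & set(re.findall(r"\w+", text.lower())))
    return counts

def frequent_labels(comments, labels, min_comments=2):
    """Keep labels found in at least `min_comments` comments (assumed cutoff)."""
    support = label_support(comments, labels)
    return sorted(label for label in labels if support[label] >= min_comments)

comments = ["Rude staff", "Rude and late", "Great crew"]
print(frequent_labels(comments, ["rude", "late", "great"]))  # ['rude']
```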
Future studies should also be carried out to evaluate the degree of accuracy in the predictions made with the results of the regressions and compare them with other models already used in sentiment analysis, such as in airports [98], hotels [99], and in different online services [100]. Another aspect to take into account in future research is to introduce hypotheses to validate the methodology and important aspects such as the determination of the key labels and their capacity to predict the general rating. In this context, this research can be used as a theoretical support for further progress in this field. It would also be interesting to find out whether the significant labels are verified in other competing airlines in order to establish whether the lexicon is similar in the customers of different airlines. In this regard, the sentiments expressed in terms of sense and intensity may differ between companies that offer a high level of quality to their customers and those that do not.
In conclusion, the results obtained confirm three points. First, multiple regression analysis is adequate to evaluate clients' sentiments. Second, the availability of a quantitative reference variable is essential to evaluate the sign and intensity of the sentiments. Finally, the number of labels used to evaluate the clients' sentiments can be simplified based on their roots and their levels of significance with regard to the quantitative reference variable. The study also provides a method for developing, assessing, and testing a lexicon of labels that represent customers' sentiments towards a service offered, such as that of airlines.