Article

Research on a Hotel Collaborative Filtering Recommendation Algorithm Based on the Probabilistic Language Term Set

1 Zhuhai Campus, Beijing Institute of Technology, Zhuhai 519088, China
2 Faculty of Business Administration, University of Macau, Macau, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(19), 4106; https://doi.org/10.3390/math11194106
Submission received: 4 August 2023 / Revised: 20 September 2023 / Accepted: 26 September 2023 / Published: 28 September 2023
(This article belongs to the Special Issue Applications of Big Data Analysis and Modeling)

Abstract

In the face of problems such as information overload and the information cocoon resulting from big data, it is a key point of current research to solve the problem of semantic fuzziness of online reviews and improve the accuracy of personalized recommendation algorithms by using online reviews. Based on the advantage of the probabilistic language term set to deal with fuzzy information and the historical data of online hotel reviews, this paper proposes a collaborative filtering recommendation algorithm for hotels. Firstly, the text data of hotel online reviews are crawled by a crawler and processed by jieba and TF-IDF tools. Secondly, the hotel evaluation attribute set is constructed, and the sentiment analysis of the review statements is carried out with the help of the HowNet sentiment dictionary and manual annotation method. The probabilistic language term set is used to classify the data and derive statistics, and the maximum deviation method is used to determine the weight of each attribute. Then, the cosine similarity formula is fused with the modified cosine similarity formula to calculate the similarity and construct the decision matrix. Finally, combined with the historical data of the user’s hotel selection, the hotel recommendation results are generated. This paper collected review data from 10 hotels in Macau from the official “Ctrip” website. The proposed recommendation algorithm model was then applied to process and analyze the data, resulting in the generation of a ranked list of hotel recommendations. To validate the accuracy and effectiveness of this research, the recommendation results were compared with those produced by other algorithms.

1. Introduction

With the continuous development of Internet technology and the widespread application of e-commerce, people’s consumption patterns have gradually shifted from offline to online. As the pandemic subsides, there is a growing demand for travel, and users can book hotels online through various platforms [1]. However, the explosive growth of data has led to the increasingly serious problem of “information overload,” making the extraction of useful information from massive data increasingly important [2]. Online reviews can assist users in hotel selection, but when faced with a large amount of hotel information and online reviews, users often find themselves in a dilemma. Personalized recommendation systems can help users choose suitable hotels by mining user preferences and hotel features. Research has shown that collaborative filtering recommendation algorithms are more suitable and widely used in hotel recommendations [3].
Online reviews contain information such as user preferences and hotel features and have strong interpretability. They have been widely used in hotel recommendation algorithms. Determining how to extract user preferences and hotel features from online reviews is crucial to improving the accuracy of hotel recommendation algorithms [4,5,6]. Tao et al. [7] conducted an instance analysis of hotel online reviews using the λ-fuzzy measure and the Choquet integral operator, considering the correlation between different features, and generated hotel recommendation rankings for travelers. Wang et al. [8] used statistical methods to construct a recommendation index based on online reviews to generate recommendation results. Cui et al. [9] built a scenario-based user profile, calculated similarity using the improved P-J correlation coefficient, and proposed a collaborative filtering recommendation algorithm. Qian et al. [10] applied uncertain variables and uncertain sets to design a sentiment analysis model, analyzed user preferences, and designed a new nearest-neighbor search method based on the sentiment analysis model to generate recommendation results. Li et al. [11] proposed a hybrid approach that used the LSTM (long short-term memory) algorithm to calculate user preference information from online reviews and combined it with matrix decomposition recommendation algorithms to improve recommendation accuracy. Cao et al. [12] established a CNN sentiment classifier to classify emotions in online review text, then merged the classification results into user ratings, improving collaborative filtering algorithms.
However, due to the strong semantic ambiguity of online review information and the presence of malicious reviews, the above methods have their limitations. In the field of natural language processing, scholars such as Pang [13] have proposed the concept of the probabilistic linguistic term set (PLTS) as an effective tool for handling fuzzy information. PLTS can describe the importance of different evaluation terms and has been used to solve problems such as sentiment analysis, text classification, and information retrieval, and has been widely applied in hotel reviews [14], engineering evaluation, logistics management, medical services [15], and other fields. Some researchers have combined PLTS with recommendation algorithms. Zhou et al. [16] used PLTS to extract user emotions from review information, proposed a VIKOR-based probabilistic linguistic multi-criteria decision-making method, and conducted experimental analysis using movie information recommendations. Chen et al. [17] proposed a user-friendly function to transform personalized semantics of online reviews into probability distributions corresponding to user ratings. They constructed a probability language rating matrix and used integrated probabilities to represent the fuzziness of trust, proposing a user collaborative filtering recommendation algorithm based on probability language and dual trust networks. Cui et al. [18] used PLTS to aggregate user preferences from online reviews while eliminating ambiguity in online reviews. They calculated similarity using the cosine similarity formula and generated recommendation results. In the above studies, there is room for improvement in the richness of the user sentiment hierarchy extracted from online reviews (using five-granularity linguistic terms), and the similarity calculation formulas used did not consider the influence of the evaluation scale, resulting in limited differentiation of evaluation results.
To address the above issues, this paper proposes a collaborative filtering recommendation algorithm based on a probabilistic linguistic term set (PLTS). When extracting user sentiment from online reviews, a seven-granularity linguistic term is used to more accurately reflect user preferences in online reviews and improve the accuracy of recommendation results. In attribute weight calculation, the maximum deviation method is used to determine attribute weights. In similarity calculation, given that a single cosine similarity calculation tends to be optimistic [19], while corrected cosine similarity calculation tends to be pessimistic, a fusion similarity calculation formula is used, taking into account the influence of the evaluation scale, to improve the quality of the recommendation algorithm.

2. Preliminaries

2.1. Probabilistic Language Term Set

In 2016, Pang et al. [13] proposed the probabilistic linguistic term set (PLTS), which uses probability distributions to represent different levels of importance of linguistic evaluation. Lin and Xu [20] proposed the standardization of PLTS, the distance formula, and other related theories.
Definition 1.
Assuming that $S = \{ s_\alpha \mid \alpha = -\tau, \ldots, -1, 0, 1, \ldots, \tau \}$ is a linguistic term set (LTS), where $\tau$ is a positive integer, the probabilistic linguistic term set $L(p)$ can be defined as:
$$L(p) = \left\{ L^{(k)}\left(p^{(k)}\right) \;\middle|\; L^{(k)} \in S,\ p^{(k)} \ge 0,\ k = 1, 2, \ldots, \#L(p),\ \sum_{k=1}^{\#L(p)} p^{(k)} \le 1 \right\} \tag{1}$$
where $L^{(k)}$ represents the $k$th term, $p^{(k)}$ is the probability of the $k$th term, and $L^{(k)}(p^{(k)})$ represents the language term $L^{(k)}$ with probability value $p^{(k)}$. $\#L(p)$ represents the number of terms in the probabilistic language term set $L(p)$.
With reference to [18], it is assumed that there is a linguistic term set:
$$S = \{ s_{-3}: \text{Extremely poor},\ s_{-2}: \text{Very poor},\ s_{-1}: \text{Poor},\ s_{0}: \text{Generally},\ s_{1}: \text{Good},\ s_{2}: \text{Very good},\ s_{3}: \text{Extremely good} \}$$
If 40% of the users give the hotel a “Very good” rating, 45% give it a “Generally” rating, and 15% give it a “Poor” rating, then $L(p) = \{ s_{-1}(0.15), s_{0}(0.45), s_{2}(0.4) \}$.
The sum of the term probabilities in a PLTS is $\sum_{k=1}^{\#L(p)} p^{(k)}$. When $\sum_{k=1}^{\#L(p)} p^{(k)} = 1$, the PLTS is called a complete probabilistic language term set; when $\sum_{k=1}^{\#L(p)} p^{(k)} < 1$, it is called an incomplete probabilistic language term set. In PLTS operations or information aggregation, an incomplete PLTS leads to loss of information and inaccurate results, so an incomplete PLTS must be normalized before any operation or aggregation. Pang’s normalization formula is as follows:
$$\tilde{L}(p) = \left\{ L^{(k)}\left(\tilde{p}^{(k)}\right) \;\middle|\; k = 1, 2, \ldots, \#L(p) \right\} \tag{2}$$
where $\tilde{p}^{(k)} = p^{(k)} \Big/ \sum_{k=1}^{\#L(p)} p^{(k)}$.
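As a minimal sketch of this normalization, the snippet below rescales an incomplete PLTS so that its probabilities sum to 1. The dictionary representation (term subscript mapped to probability), the function name `normalize_plts`, and the sample values are assumptions made for illustration only.

```python
from typing import Dict


def normalize_plts(plts: Dict[int, float]) -> Dict[int, float]:
    """Rescale term probabilities so that they sum to 1, as in Formula (2)."""
    total = sum(plts.values())
    if total <= 0:
        raise ValueError("a PLTS needs a positive total probability")
    return {subscript: p / total for subscript, p in plts.items()}


# Hypothetical incomplete PLTS: the probabilities only sum to 0.90.
incomplete = {-1: 0.15, 0: 0.45, 2: 0.30}
complete = normalize_plts(incomplete)
```

After the call, the relative proportions of the three terms are unchanged, but the set is complete and can be aggregated without losing probability mass.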
Definition 2 [13]. When two probabilistic language term sets contain different numbers of terms, calculation and comparison become more difficult. The number of terms in the two sets must therefore be equalized by filling in the missing terms. The supplementary process is as follows:
Suppose there are two arbitrary probabilistic language term sets $L_n(p) = \{ L_n^{(k)}(p_n^{(k)}) \mid k = 1, 2, \ldots, \#L_n(p) \}$ and $L_m(p) = \{ L_m^{(k)}(p_m^{(k)}) \mid k = 1, 2, \ldots, \#L_m(p) \}$, where $\#L_n(p)$ and $\#L_m(p)$ are the numbers of terms in $L_n(p)$ and $L_m(p)$, respectively. When the two numbers are not equal, the shorter set needs to be completed. If $\#L_n(p) > \#L_m(p)$, then $\#L_n(p) - \#L_m(p)$ language terms are added to $L_m(p)$; each added element is the smallest term in $L_m(p)$, and its probability is 0. The case $\#L_n(p) < \#L_m(p)$ is handled in the same way with the roles reversed.
Definition 3 [13]. Assume two arbitrary probabilistic language term sets $L_n(p) = \{ L_n^{(k)}(p_n^{(k)}) \mid k = 1, 2, \ldots, \#L_n(p) \}$ and $L_m(p) = \{ L_m^{(k)}(p_m^{(k)}) \mid k = 1, 2, \ldots, \#L_m(p) \}$ with $\#L_n(p) = \#L_m(p)$. The distance between $L_n(p)$ and $L_m(p)$ is defined as follows:
$$d(L_n(p), L_m(p)) = \sqrt{ \sum_{k=1}^{\#L_n(p)} \left( p_n^{(k)} r_n^{(k)} - p_m^{(k)} r_m^{(k)} \right)^2 \Big/ \#L_n(p) } \tag{3}$$
where $r_n^{(k)}$ and $r_m^{(k)}$ represent the subscripts of the language terms $L_n^{(k)}$ and $L_m^{(k)}$, respectively. The distance satisfies the following properties:
(1) $d(L_n(p), L_n(p)) = 0$;
(2) $d(L_n(p), L_m(p)) = d(L_m(p), L_n(p))$.
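The padding rule of Definition 2 and the distance of Definition 3 can be sketched together as follows. This is an illustrative reading, not the authors' implementation: a PLTS is represented as a list of `(subscript, probability)` pairs, terms are paired after sorting by $r \cdot p$ in descending order (an assumption about how the $k$th terms are matched), and the names `pad`, `distance`, `x` and `y` are hypothetical.

```python
import math
from typing import List, Tuple

Plts = List[Tuple[int, float]]  # each element is (subscript r, probability p)


def pad(a: Plts, b: Plts) -> Tuple[Plts, Plts]:
    """Definition 2: extend the shorter set with its smallest term at probability 0."""
    a, b = list(a), list(b)
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    if len(short) < len(long_):
        smallest = min(r for r, _ in short)
        short.extend([(smallest, 0.0)] * (len(long_) - len(short)))
    return a, b


def distance(a: Plts, b: Plts) -> float:
    """Definition 3: root of the mean squared difference of the r * p products."""
    a, b = pad(a, b)
    # Assumption: the k-th terms are paired after sorting by r * p, descending.
    by_product = lambda term: term[0] * term[1]
    a = sorted(a, key=by_product, reverse=True)
    b = sorted(b, key=by_product, reverse=True)
    squared = sum((pa * ra - pb * rb) ** 2 for (ra, pa), (rb, pb) in zip(a, b))
    return math.sqrt(squared / len(a))


x = [(-1, 0.15), (0, 0.45), (2, 0.40)]  # the example PLTS from Definition 1
y = [(0, 0.50), (1, 0.50)]              # hypothetical second PLTS with two terms
```

Here `y` is padded with one extra `(0, 0.0)` element before the comparison, so both sets contribute three terms to the sum.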

2.2. Cosine Similarity

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them [21].
Definition 4 [22]. Assume that there exists an arbitrary set of probabilistic language terms $L_i(p) = \{ L_i^{(k)}(p_i^{(k)}) \mid k = 1, 2, \ldots, \#L(p) \}$, $i = 1, 2, \ldots, n$, where $L_n^{(k)}, L_m^{(k)} \in S$. Let the set $C = \{ c_j \mid j = 1, 2, \ldots, n \}$ represent the attributes of the hotel. Then, according to the cosine similarity formula, under the same attribute $c_j$, the similarity $\mathrm{Sim}_{cs}(L_n(p), L_m(p))$ between the two probabilistic language term sets $L_n(p)$ and $L_m(p)$ is as follows:
$$\mathrm{Sim}_{cs}(L_n(p), L_m(p)) = \frac{\sum_{k=1}^{\#L(p)} \left[ \tau(L_n^{(k)})\, p_n^{(k)} / \tau \right] \left[ \tau(L_m^{(k)})\, p_m^{(k)} / \tau \right]}{\sqrt{\sum_{k=1}^{\#L(p)} \left[ \tau(L_n^{(k)})\, p_n^{(k)} / \tau \right]^2}\ \sqrt{\sum_{k=1}^{\#L(p)} \left[ \tau(L_m^{(k)})\, p_m^{(k)} / \tau \right]^2}} \tag{4}$$
where $\tau(L_n^{(k)})\, p_n^{(k)} / \tau$ and $\tau(L_m^{(k)})\, p_m^{(k)} / \tau$ represent the scores of hotel $n$ and hotel $m$, respectively, under the attribute $c_j$. The cosine similarity satisfies the following properties:
(1) $\mathrm{Sim}_{cs}(L_n(p), L_n(p)) = 1$;
(2) $\mathrm{Sim}_{cs}(L_n(p), L_m(p)) = \mathrm{Sim}_{cs}(L_m(p), L_n(p))$.
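A small sketch of the cosine similarity of Formula (4) is given below. It assumes that the score of a term is its subscript times its probability divided by $\tau$, that scores are aligned term by term over the whole seven-granularity scale, and that terms absent from a PLTS have probability 0; the names `scores`, `cosine_similarity`, `good` and `mixed` are illustrative.

```python
import math
from typing import Dict, List

TAU = 3  # seven-granularity scale: subscripts -3 .. 3


def scores(plts: Dict[int, float], tau: int = TAU) -> List[float]:
    """Score vector r * p / tau over every term of S (missing terms count as p = 0)."""
    return [r * plts.get(r, 0.0) / tau for r in range(-tau, tau + 1)]


def cosine_similarity(a: Dict[int, float], b: Dict[int, float], tau: int = TAU) -> float:
    """Cosine of the angle between the two score vectors."""
    x, y = scores(a, tau), scores(b, tau)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0  # all-zero score vectors are treated as dissimilar


good = {1: 0.3, 2: 0.5, 3: 0.2}     # hypothetical, mostly positive reviews
mixed = {-2: 0.4, 0: 0.2, 2: 0.4}   # hypothetical, polarized reviews
```

Because the neutral term $s_0$ always scores 0 under this reading, only the positive and negative terms influence the result.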

2.3. Modified Cosine Similarity

Modified cosine similarity is a variant of cosine similarity. Although cosine similarity can be used to measure the difference between two vectors [21], it depends only on the angle between the vectors and is insensitive to absolute magnitudes, such as the overall level of the scores. Therefore, in 2001, Sarwar et al. [23] proposed the concept of modified cosine similarity, which subtracts the average value of the corresponding features in the calculation to reduce the problems caused by this deficiency. Because it accounts for the influence of scoring bias on the score [24], the resulting similarity is more reasonable.
Definition 5 [23]. Assume that there exists an arbitrary set of probabilistic language terms $L_i(p) = \{ L_i^{(k)}(p_i^{(k)}) \mid k = 1, 2, \ldots, \#L(p) \}$, $i = 1, 2, \ldots, n$, where $L_n^{(k)}, L_m^{(k)} \in S$. Let the set $C = \{ c_j \mid j = 1, 2, \ldots, n \}$ represent the attributes of the hotel. Under the same attribute $c_j$, according to the modified cosine similarity formula, the similarity $\mathrm{Sim}_{acs}(L_n(p), L_m(p))$ between the two probabilistic language term sets $L_n(p)$ and $L_m(p)$ is as follows:
$$\mathrm{Sim}_{acs}(L_n(p), L_m(p)) = \frac{\sum_{k=1}^{\#L(p)} \left[ \tau(L_n^{(k)})\, p_n^{(k)} / \tau - \bar{\tau}_n \right] \left[ \tau(L_m^{(k)})\, p_m^{(k)} / \tau - \bar{\tau}_m \right]}{\sqrt{\sum_{k=1}^{\#L(p)} \left[ \tau(L_n^{(k)})\, p_n^{(k)} / \tau - \bar{\tau}_n \right]^2}\ \sqrt{\sum_{k=1}^{\#L(p)} \left[ \tau(L_m^{(k)})\, p_m^{(k)} / \tau - \bar{\tau}_m \right]^2}} \tag{5}$$
where $\bar{\tau}_n$ and $\bar{\tau}_m$ represent the average scores of hotel $n$ and hotel $m$, respectively. The modified cosine similarity satisfies the following properties:
(1) $\mathrm{Sim}_{acs}(L_n(p), L_n(p)) = 1$;
(2) $\mathrm{Sim}_{acs}(L_n(p), L_m(p)) = \mathrm{Sim}_{acs}(L_m(p), L_n(p))$.
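The modified cosine similarity of Formula (5) can be sketched in the same style. The sketch assumes the same score vector as for the plain cosine similarity ($r \cdot p / \tau$ over the full scale) and takes the average score of a hotel to be the mean of its own score vector; both choices, and all names, are assumptions for illustration.

```python
import math
from typing import Dict, List

TAU = 3  # seven-granularity scale: subscripts -3 .. 3


def scores(plts: Dict[int, float], tau: int = TAU) -> List[float]:
    """Score vector r * p / tau over every term of S (missing terms count as p = 0)."""
    return [r * plts.get(r, 0.0) / tau for r in range(-tau, tau + 1)]


def modified_cosine_similarity(a: Dict[int, float], b: Dict[int, float], tau: int = TAU) -> float:
    """Cosine similarity after subtracting each hotel's average score."""
    x, y = scores(a, tau), scores(b, tau)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    dot = sum(u * v for u, v in zip(dx, dy))
    norm = math.sqrt(sum(u * u for u in dx)) * math.sqrt(sum(v * v for v in dy))
    return dot / norm if norm else 0.0  # constant score vectors carry no signal


good = {1: 0.3, 2: 0.5, 3: 0.2}     # hypothetical, mostly positive reviews
mixed = {-2: 0.4, 0: 0.2, 2: 0.4}   # hypothetical, polarized reviews
```

Unlike the plain cosine similarity, the centred version can be negative, which is one way to read its more pessimistic behaviour.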

3. Materials and Methods

3.1. Model Proposal

With the widespread development and adoption of various travel apps and websites, tourists often refer to evaluations of hotels by other users. They combine these evaluations with their own preferences for hotel attributes such as service, price, and location when assessing and choosing hotels. However, due to time constraints and the abundance of hotel review information, tourists can only access a portion of the information, leading to a lack of comprehensive and objective assessments of hotels. Addressing the challenge of providing valuable information to tourists from vast amounts of data is essential, and recommendation algorithms are a commonly used approach to solve this problem.
Considering the aforementioned issues, current research on hotel recommendation algorithms rarely integrates users’ historical data, such as textual reviews. Furthermore, probabilistic linguistic term sets (PLTSs) exhibit favorable characteristics. They can effectively capture users’ perceptions and intentions from textual data, enabling precise recommendations. Hence, this paper proposes a model that combines PLTS with online reviews to enhance the accuracy of hotel recommendation algorithms. This model consists of three main modules.
Module 1: Sentiment Analysis.
Using the jieba tool for tasks such as word segmentation and part-of-speech tagging, the original review text is processed. Sentiment words and intensity adverbs are analyzed to construct the PLTS for the review text.
Module 2: Attribute Weight Calculation and Similarity Computation.
Based on the PLTS of the review text, attribute weights and similarities are computed to obtain a hotel similarity evaluation matrix.
Module 3: Prediction and Recommendation.
Utilizing the hotel similarity evaluation matrix based on the probabilistic linguistic dataset and incorporating user historical data, the model generates a recommendation list.

3.2. Material and Methods

With the large-scale development and popularization of tourism software and websites, tourists tend to read other users’ evaluations of hotels and then evaluate and choose hotels according to their own preferences for hotel attributes, such as service, price, and transportation. However, because time is limited and the volume of hotel review information is large, tourists can obtain only partial information and lack a comprehensive and objective evaluation of each hotel. Determining how to provide tourists with valuable information from a large amount of data is an urgent problem, and recommendation algorithms are commonly used to solve it.
In this paper, the set $H = \{ h_i \mid i = 1, 2, \ldots, m \}$ represents the set of hotels, while the set $C = \{ c_j \mid j = 1, 2, \ldots, n \}$ represents the hotel attributes that users consider when referring to other users’ reviews during the hotel selection process. A large number of online review texts are processed based on the probabilistic language term set, and after the useful information is obtained, the system recommends multiple hotels according to the tourist’s historical hotel selection data. The specific idea is shown in Figure 1.

3.3. Data Acquisition and Preprocessing

After determining the research content, the data will be crawled and processed to prepare for the subsequent recommendation algorithm. The specific steps are shown in Figure 2.

3.3.1. Data Crawling

In this paper, Python 3.6 is used to crawl the hotel review data of a tourism website over a single period of time. The set $H = \{ h_i \mid i = 1, 2, \ldots, m \}$ represents the hotel set, where $m$ is the number of hotels. For each hotel, $l$ reviews are crawled and exported to a CSV file.

3.3.2. Online Comment Text Processing

After obtaining the review data, the jieba tool performs word segmentation and part-of-speech tagging on the original review text. In addition, the UTF-8 stop word table provided by IKAnalyzer is used to delete words with little or no value, such as modal words and function words, so as to reduce the interference of stop words with sentence analysis; sentences with ambiguous semantics are processed manually. After word segmentation and stop word removal, the words describing hotel attributes and evaluations in each review are obtained. In this paper, $Z_{ik} = \{ z_{ik}^{1}, z_{ik}^{2}, \ldots, z_{ik}^{m}, \ldots, z_{ik}^{p} \}$ represents the word set of the $k$th review of the $i$th hotel, and $z_{ik}^{m}$ represents the $m$th word in the word set $Z_{ik}$.

3.3.3. Word Analysis of Hotel Evaluation

(1)
Construct word knowledge base
As there is currently no sentiment dictionary specifically for hotel review analysis, this paper developed its own vocabulary for this model, which was constructed by a professional familiar with the web terminology of online hotel reviews, by manually examining the review content. Tourists’ online evaluations of hotels generally involve multiple attributes of the hotel and represent different attitudes, such as the attitude towards hotel facilities and services, as shown in Table 1. By looking at online reviews of various hotels, we can identify the attribute objects evaluated in the reviews and judge their tendency fields. After consulting experts, we finally built a word knowledge base for analyzing hotel reviews [25].
(2)
Establish standard attribute words
In the model, the attribute words in the reviews reflect the aspects of the hotel that tourists pay attention to when choosing a hotel, but the same attribute is expressed in slightly different ways. Therefore, before the attribute words are analyzed and processed, the word frequency of the reviews is analyzed, the standard attribute words are selected, and the common attribute words are classified. The attributes of the hotel are represented by the set $C = \{ c_j \mid j = 1, 2, \ldots, n \}$, and $Z_{ij}^{k} = \{ z_{i1}^{k}, z_{i2}^{k}, \ldots, z_{in}^{k} \}$ represents the set of attribute words involved in the $k$th review of the $i$th hotel. Assuming that $z_j^{c}$ represents a common attribute word in a review and $\bar{z}_j^{c}$ represents the selected standard attribute word, the similarity between $z_j^{c}$ and $\bar{z}_j^{c}$ is calculated according to the semantic similarity calculation method of the Synonymous Word Forest [19], and the common attribute words are transformed into standard attribute words.
(3)
Determine sentiment analysis rules
Since the expression of comments is highly flexible, the HowNet sentiment dictionary and manual annotation methods are used for sentiment analysis of comment statements [22]. In the analysis of hotel emotion words, in addition to the positive and negative emotion analysis, 7-granularity is also adopted to distinguish the degree of emotion. As shown in Table 2 below, the positive and negative of emotion words express the tourists’ positive, neutral, or even negative attitudes towards the hotel, while the degree adverbs describing emotion are mainly used to describe the intensity of the attitude, such as “extremely”, “very” and other degree adverbs.
In this paper, we construct a more detailed 7-granularity classification of emotion words, transform emotion adverbs into 7-granularity language terms, and explore the intensity of emotion words. Suppose $S$ is a 7-granularity scale set, $S = \{ s_{-3}, s_{-2}, s_{-1}, s_{0}, s_{1}, s_{2}, s_{3} \}$, where the sign of the subscript indicates positive or negative emotion, $s_{0}$ represents a neutral attitude, and the absolute value of the subscript represents the degree of the attitude: the larger the value, the stronger the attitude. $s_{-3}$ represents “extremely poor”, $s_{-2}$ represents “very poor”, $s_{-1}$ represents “poor”, $s_{0}$ represents “general”, $s_{1}$ represents “good”, $s_{2}$ represents “very good”, and $s_{3}$ represents “extremely good”.
In order to make the degree of positive and negative emotions more accurate, the division method follows the following rules:
(a) Positive or negative words without degree adverbs are classified as $s_{1}$ or $s_{-1}$;
(b) Emotion words modified by other degree adverbs are uniformly classified as $s_{2}$ or $s_{-2}$ according to the polarity of the emotion word;
(c) Emotion words modified by intensity words containing “extreme”, “invincible”, “only”, and “very” are classified as $s_{3}$ or $s_{-3}$ according to the positive or negative direction of the emotion word.
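Rules (a) to (c) amount to a small lookup from the polarity of an emotion word and its degree adverb to a term subscript. The sketch below assumes that polarity has already been determined (for example from a sentiment dictionary) and is passed as +1, 0 or -1; the function name `term_subscript`, the English keyword set, and the exact-match test on the adverb are simplifications made for illustration.

```python
from typing import Optional

# Intensity words that trigger rule (c); English stand-ins for the actual vocabulary.
STRONGEST = {"extreme", "invincible", "only", "very"}


def term_subscript(polarity: int, adverb: Optional[str] = None) -> int:
    """Map an emotion word (polarity +1 / 0 / -1) and its degree adverb to a subscript in -3..3."""
    if polarity == 0:
        return 0                 # neutral attitude, s_0
    sign = 1 if polarity > 0 else -1
    if adverb is None:
        return sign * 1          # rule (a): no degree adverb
    if adverb in STRONGEST:
        return sign * 3          # rule (c): strongest intensity words
    return sign * 2              # rule (b): any other degree adverb
```

For instance, a positive word with no adverb maps to $s_1$, while a negative word preceded by an ordinary degree adverb maps to $s_{-2}$.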

3.3.4. Conversion of Online Comments

After the hotel reviews are preprocessed, the data are described and counted using probabilistic language term sets (PLTSs). The attributes mentioned in the $q$ reviews of each hotel are expressed with language terms, and the frequencies of all attributes and language terms are counted. The result is represented by a probabilistic language term set, which is obtained in the following three steps, as shown in Figure 3:
Step 1: For hotel $h_i$, analyze the review word set $Z_{ik} = \{ z_{ik}^{1}, z_{ik}^{2}, \ldots, z_{ik}^{m}, \ldots, z_{ik}^{p} \}$ and the attribute word set $Z_{ij}^{k} = \{ z_{i1}^{k}, z_{i2}^{k}, \ldots, z_{in}^{k} \}$, and then transform the emotion words in the evaluation text into 7-granularity language terms according to the rules.
Step 2: Count the frequency $\mathrm{Sum}(s_\alpha^{j})$ of each evaluation language term under each attribute word, and count the total frequency $\mathrm{Sum}(c_j)$ of each attribute $c_j$.
Step 3: Calculate the evaluation term probability $p_\alpha$ of the attribute words, as shown in Formula (6). After calculating $p_\alpha$ for all attribute words, use the probabilistic language term set $L_{ij}(p)$ to describe the evaluation set of the $j$th attribute of the $i$th hotel.
$$p_\alpha = \frac{\mathrm{Sum}(s_\alpha^{j})}{\mathrm{Sum}(c_j)} \tag{6}$$
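Steps 2 and 3 are a frequency count followed by a division. As a minimal sketch, the snippet below builds the PLTS of one attribute from a list of term subscripts, one per review mention; the function name `plts_from_terms` and the 20 sample mentions are hypothetical, chosen so that the result matches the example given after Definition 1.

```python
from collections import Counter
from typing import Dict, Iterable


def plts_from_terms(subscripts: Iterable[int]) -> Dict[int, float]:
    """Formula (6): probability of each term = its frequency / total mentions of the attribute."""
    counts = Counter(subscripts)      # Sum(s_alpha^j) for every observed term
    total = sum(counts.values())      # Sum(c_j)
    if total == 0:
        return {}
    return {term: counts[term] / total for term in sorted(counts)}


# Hypothetical attribute mentioned in 20 reviews: 8 "very good", 9 "generally", 3 "poor".
mentions = [2] * 8 + [0] * 9 + [-1] * 3
service = plts_from_terms(mentions)
```

The resulting dictionary pairs each subscript with its share of the mentions, which is the information carried by $L_{ij}(p)$.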

3.4. Steps for Recommendation Algorithms

After the comments are converted into a probabilistic language term set, the recommendation algorithm will be constructed. The specific steps are shown in Figure 4.
  • Step 1: Construct evaluation matrix
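The evaluation matrix $P$ of the hotels is constructed to describe the obtained probabilistic language term sets. $L_{ij}(p)$ is used to represent the evaluation information of hotel $h_i$ under attribute $c_j$, where $i \in [1, m]$, $j \in [1, n]$, and $i$ and $j$ are integers.
$$P = \begin{bmatrix} L_{11}(p) & L_{12}(p) & \cdots & L_{1n}(p) \\ L_{21}(p) & L_{22}(p) & \cdots & L_{2n}(p) \\ \vdots & \vdots & \ddots & \vdots \\ L_{m1}(p) & L_{m2}(p) & \cdots & L_{mn}(p) \end{bmatrix}$$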
  • Step 2: Determine the attribute weights
Each tourist has different needs for hotels, so they pay different degrees of attention to hotel attributes [18]. In this paper, the maximum deviation method is adopted to determine the attribute weights. In general, the greater the deviation between hotels under an attribute, the stronger the differentiating power of that attribute and the greater the weight it receives [20]. For a hotel $h_i$ and an attribute $c_j$, the deviation degree between $h_i$ and each other hotel is first calculated, and these values are summed to give the total deviation degree between $h_i$ and the other hotels. Secondly, the total deviation over all attributes in the evaluation matrix is calculated, and the maximum deviation optimization model is constructed. Finally, the Lagrange function is used to solve the model, and the normalized attribute weights are obtained. The specific calculations are as follows:
Suppose that there is an attribute weight set $w = \{ w_j \mid j = 1, 2, \ldots, n \}$. According to Formula (3), under attribute $c_j$, the deviation degree between hotel $h_i$ and the other hotels is:
$$d_{ij}(w) = \sum_{l=1,\, l \ne i}^{m} d(L_{ij}(p), L_{lj}(p)) \tag{7}$$
where $d(L_{ij}(p), L_{lj}(p)) = \sqrt{ \sum_{k=1}^{\#L_{ij}(p)} \left( p_{ij}^{(k)} r_{ij}^{(k)} - p_{lj}^{(k)} r_{lj}^{(k)} \right)^2 \Big/ \#L_{ij}(p) }$.
Under attribute $c_j$, the total deviation degree between each hotel $h_i$ and the other hotels is:
$$d_{j}(w) = \sum_{i=1}^{m} d_{ij}(w) \tag{8}$$
In the evaluation matrix $P$, the total deviation degree over all attributes is:
$$d_{P}(w) = \sum_{j=1}^{n} w_j\, d_{j}(w) \tag{9}$$
The maximum deviation optimization model is constructed as follows:
$$\begin{cases} \max\ d_{P}(w) = \displaystyle\sum_{j=1}^{n} w_j \sum_{i=1}^{m} \sum_{l=1,\, l \ne i}^{m} d(L_{ij}(p), L_{lj}(p)) \\ \displaystyle\sum_{j=1}^{n} (w_j)^2 = 1,\quad w_j \ge 0 \end{cases} \tag{10}$$
Then, the Lagrange function is constructed to solve the model:
$$L(w, \lambda) = \sum_{j=1}^{n} w_j \sum_{i=1}^{m} \sum_{l=1,\, l \ne i}^{m} d(L_{ij}(p), L_{lj}(p)) + \frac{\lambda}{2} \left( \sum_{j=1}^{n} (w_j)^2 - 1 \right) \tag{11}$$
Finally, the normalized attribute weights are obtained:
$$w_j = \frac{\sum_{i=1}^{m} \sum_{l=1,\, l \ne i}^{m} d(L_{ij}(p), L_{lj}(p))}{\sum_{j=1}^{n} \sum_{i=1}^{m} \sum_{l=1,\, l \ne i}^{m} d(L_{ij}(p), L_{lj}(p))} \tag{12}$$
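Setting the partial derivatives of the Lagrange function to zero makes each weight proportional to the total deviation of its attribute, so after normalization the weight is simply that attribute's share of the overall deviation. The sketch below computes the weights in this way. It is illustrative only: each PLTS is stored as a probability vector aligned with the subscripts -3 to 3 (a simplification of the term pairing in Formula (3)), and the names `distance`, `attribute_weights` and the two-hotel `matrix` are hypothetical.

```python
import math
from typing import List, Sequence

SUBSCRIPTS = range(-3, 4)  # seven-granularity scale


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Formula (3), simplified: p and q hold probabilities aligned with SUBSCRIPTS."""
    squared = sum((r * (pi - qi)) ** 2 for r, pi, qi in zip(SUBSCRIPTS, p, q))
    return math.sqrt(squared / len(p))


def attribute_weights(matrix: List[List[Sequence[float]]]) -> List[float]:
    """matrix[i][j] is the probability vector of hotel i under attribute j."""
    m, n = len(matrix), len(matrix[0])
    deviations = [
        sum(distance(matrix[i][j], matrix[l][j])
            for i in range(m) for l in range(m) if l != i)
        for j in range(n)
    ]
    total = sum(deviations)
    if total == 0:
        return [1.0 / n] * n  # identical hotels: fall back to equal weights
    return [d / total for d in deviations]


# Two hypothetical hotels: nearly identical on attribute 0, opposite on attribute 1.
matrix = [
    [[0, 0, 0, 0.5, 0.5, 0, 0], [0, 0, 0, 0, 0, 0, 1]],
    [[0, 0, 0, 0.4, 0.6, 0, 0], [1, 0, 0, 0, 0, 0, 0]],
]
weights = attribute_weights(matrix)
```

In this toy input the second attribute separates the hotels far more strongly than the first, so it receives almost all of the weight.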
  • Step 3: Construct the similarity matrix
The weighted similarity method is used to calculate the similarity between hotels. The similarity of two hotels under each attribute is calculated first, the attribute weights are then applied to obtain the weighted similarity, and finally the similarity matrix between hotels is constructed, from which the similarity between any two hotels can be read directly.
In view of the fact that the cosine similarity calculation results tend to be optimistic [19], while the modified cosine similarity calculation results tend to be pessimistic, this paper adopts a compromise, improved similarity calculation formula, as shown in Formula (13), where $\alpha$ is the adjustment coefficient:
$$\mathrm{Sim} = \alpha\, \mathrm{Sim}_{cs} + (1 - \alpha)\, \mathrm{Sim}_{acs} \tag{13}$$
(1) According to Formula (13) with $\alpha = 1/2$, calculate the similarity of two hotels $h_i$ and $h_l$ under the same attribute $c_j$, where $\#L_{ij}(p) = \#L_{lj}(p)$:
$$\mathrm{Sim}(L_{ij}(p), L_{lj}(p)) = \frac{1}{2}\, \mathrm{Sim}_{cs}(L_{ij}(p), L_{lj}(p)) + \frac{1}{2}\, \mathrm{Sim}_{acs}(L_{ij}(p), L_{lj}(p))$$
(2) The weighted similarity between the two hotels is obtained according to the attribute weights:
$$\mathrm{Sim}(h_i, h_l) = \sum_{j=1}^{n} w_j \times \mathrm{Sim}(L_{ij}(p), L_{lj}(p))$$
(3) Construct the hotel similarity matrix $M$:
$$M = \begin{bmatrix} \mathrm{Sim}(h_1, h_1) & \mathrm{Sim}(h_1, h_2) & \cdots & \mathrm{Sim}(h_1, h_m) \\ \mathrm{Sim}(h_2, h_1) & \mathrm{Sim}(h_2, h_2) & \cdots & \mathrm{Sim}(h_2, h_m) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Sim}(h_m, h_1) & \mathrm{Sim}(h_m, h_2) & \cdots & \mathrm{Sim}(h_m, h_m) \end{bmatrix}$$
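The three sub-steps can be sketched end to end as follows. The code fuses cosine and modified cosine similarity with the adjustment coefficient $\alpha$, weights the per-attribute results, and fills the hotel similarity matrix. It is a sketch under stated assumptions: each hotel attribute is already given as a score vector, the modified cosine subtracts the mean of each vector, and the three hotels, two attributes, weights and all function names are invented for illustration.

```python
import math
from typing import List, Sequence


def _cosine(x: Sequence[float], y: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def fused_similarity(x: Sequence[float], y: Sequence[float], alpha: float = 0.5) -> float:
    """Formula (13): alpha * cosine + (1 - alpha) * modified cosine."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    centred = _cosine([a - mean_x for a in x], [b - mean_y for b in y])
    return alpha * _cosine(x, y) + (1 - alpha) * centred


def similarity_matrix(hotels: List[List[Sequence[float]]],
                      weights: Sequence[float],
                      alpha: float = 0.5) -> List[List[float]]:
    """hotels[i][j] is the score vector of hotel i under attribute j."""
    m = len(hotels)
    return [
        [sum(w * fused_similarity(hotels[i][j], hotels[l][j], alpha)
             for j, w in enumerate(weights))
         for l in range(m)]
        for i in range(m)
    ]


# Hypothetical score vectors: three hotels, two attributes.
hotels = [
    [[0.1, 0.3, 0.2], [0.0, 0.2, 0.4]],
    [[0.2, 0.3, 0.1], [0.1, 0.2, 0.3]],
    [[0.3, 0.1, 0.0], [0.4, 0.1, 0.0]],
]
M = similarity_matrix(hotels, weights=[0.6, 0.4])
```

With weights that sum to 1, every diagonal entry equals 1 and the matrix is symmetric, which matches the two properties stated for the individual similarity measures.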
  • Step 4: Hotel recommendation ranking
According to the similarity matrix obtained in Step 3 and the tourist’s historical hotel stays, the hotels that reach the threshold are sorted by similarity and recommended to the tourist. Assume that $h_i$ is a hotel where the tourist has previously stayed, $i$ is the hotel subscript, and the recommendation threshold is $\Delta$. When $\mathrm{Sim}(h_i, h_j) \ge \Delta$, the system recommends the hotel $h_j$ to the tourist; otherwise, the hotel $h_j$ is not recommended.
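The thresholding and ranking of Step 4 can be sketched as below. The input row reuses the first row of the similarity matrix $M$ reported in the case application (similarities of hotel $h_1$ to all ten hotels); the threshold value 0.9, the zero-based indexing, and the name `recommend` are assumptions for illustration.

```python
from typing import List, Tuple


def recommend(sim_row: List[float], stayed: int, threshold: float) -> List[Tuple[int, float]]:
    """Return (hotel index, similarity) pairs that reach the threshold, most similar first."""
    candidates = [
        (j, s) for j, s in enumerate(sim_row)
        if j != stayed and s >= threshold   # never recommend the hotel already visited
    ]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


# Similarities between h1 (index 0) and h1..h10, taken from the first row of M.
row_h1 = [1, 0.865, 0.874, 0.858, 0.970, 0.850, 0.846, 0.810, 0.965, 0.874]
picks = recommend(row_h1, stayed=0, threshold=0.9)
```

With this hypothetical threshold, the tourist who stayed at $h_1$ would be offered $h_5$ first and $h_9$ second; a lower threshold lengthens the list without changing the order.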

4. Case Application

4.1. Data Crawling

In the case analysis, this paper randomly selected 10 hotels in Macau according to their popularity, including five-star, four-star, and three-star hotels. In accordance with the principle of giving priority to recent reviews, Python was used to crawl 200 recent online reviews, and “hotel name” and “review content” were selected as the valid fields for subsequent analysis. Table 3 shows the 10 hotels selected, namely two three-star hotels, three four-star hotels, and five five-star hotels. The hotel collection is denoted as $H = \{ h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8, h_9, h_{10} \}$.

4.2. Definition of Hotel Standard Attributes

Based on the word frequency analysis of reviews and the analysis of past guests’ concerns about hotels, hotel attribute words can be divided into the following five attribute sets $C = \{ c_1, c_2, c_3, c_4, c_5 \}$, as shown in Table 4, from which it can be seen that there are five standard attribute words, such as $\bar{z}_1^{c}$ for “service” and $\bar{z}_2^{c}$ for “location”.

4.3. Data Preprocessing

Firstly, the crawled online review data are segmented into words with jieba, and the stop words are then deleted to obtain the attribute words, emotion words, and degree adverbs in each review, forming the word set $Z_{ik} = \{ z_{ik}^{1}, z_{ik}^{2}, \ldots, z_{ik}^{m}, \ldots, z_{ik}^{p} \}$. Then, according to the method of synonym merging, attribute words are converted into standard attribute words, and emotion words and degree adverbs are analyzed and converted into 7-granularity language terms. The number of occurrences $\mathrm{Sum}(s_\alpha^{j})$ of each evaluation term under each standard attribute word is counted, and the total number of occurrences $\mathrm{Sum}(c_j)$ of each attribute $c_j$ is counted. For example, Table 5 shows the analysis results for the hotel $h_1$.
Thus, its probabilistic language term sets $L_{1j}$ are represented as follows:
$$L_{11} = \{ s_{3}(0.125), s_{2}(0.470), s_{1}(0.110), s_{0}(0.200), s_{-1}(0.035), s_{-2}(0.035), s_{-3}(0.025) \}$$
$$L_{12} = \{ s_{3}(0.030), s_{2}(0.295), s_{1}(0.190), s_{0}(0.485) \}$$
$$L_{13} = \{ s_{2}(0.070), s_{1}(0.055), s_{0}(0.755), s_{-1}(0.060), s_{-2}(0.055), s_{-3}(0.005) \}$$
$$L_{14} = \{ s_{3}(0.065), s_{2}(0.095), s_{1}(0.245), s_{0}(0.550), s_{-1}(0.010), s_{-2}(0.005), s_{-3}(0.030) \}$$
$$L_{15} = \{ s_{3}(0.030), s_{2}(0.185), s_{1}(0.495), s_{0}(0.200), s_{-1}(0.040), s_{-2}(0.035), s_{-3}(0.015) \}$$

4.4. Hotel Recommendation Result Generation

Step 1: After the preliminary data processing, the evaluation matrix of the 10 hotels can be constructed according to the probabilistic language term sets.
$$P = \begin{bmatrix} \{ s_{3}(0.125), s_{2}(0.470), s_{1}(0.110), s_{0}(0.200), s_{-1}(0.035), s_{-2}(0.035), s_{-3}(0.025) \} & \{ s_{3}(0.030), s_{2}(0.295), s_{1}(0.190), s_{0}(0.485) \} & \cdots & \{ s_{3}(0.030), s_{2}(0.185), s_{1}(0.495), s_{0}(0.200), s_{-1}(0.040), s_{-2}(0.035), s_{-3}(0.015) \} \\ \{ s_{3}(0.005), s_{2}(0.120), s_{1}(0.185), s_{0}(0.660), s_{-1}(0.025), s_{-2}(0.005) \} & \{ s_{0}(0.554), s_{-1}(0.446) \} & \cdots & \{ s_{3}(0.005), s_{2}(0.051), s_{1}(0.198), s_{0}(0.633), s_{-1}(0.103), s_{-2}(0.009) \} \\ \vdots & \vdots & \ddots & \vdots \\ \{ s_{3}(0.070), s_{2}(0.500), s_{1}(0.085), s_{0}(0.280), s_{-1}(0.030), s_{-2}(0.010), s_{-3}(0.025) \} & \{ s_{3}(0.030), s_{2}(0.130), s_{1}(0.400), s_{0}(0.435), s_{-2}(0.005) \} & \cdots & \{ s_{3}(0.029), s_{2}(0.166), s_{1}(0.275), s_{0}(0.483), s_{-1}(0.026), s_{-2}(0.015), s_{-3}(0.006) \} \end{bmatrix}_{10 \times 5}$$
Step 2: First, the probabilistic language term set $L_{ij}$ is standardized, and then the maximum deviation method is used to calculate the weight of each attribute.
For example, for hotels $h_1$ and $h_2$ under the same standard attribute, the standardized sets $\bar{L}_{11}$ and $\bar{L}_{21}$ are as follows:
$$\bar{L}_{11} = \{ s_{2}(0.470), s_{3}(0.125), s_{1}(0.110), s_{0}(0.200), s_{-1}(0.035), s_{-2}(0.035), s_{-3}(0.025) \}$$
$$\bar{L}_{21} = \{ s_{2}(0.120), s_{1}(0.185), s_{3}(0.005), s_{0}(0.660), s_{-2}(0.000), s_{-2}(0.005), s_{-1}(0.025) \}$$
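The two standardized sets above can be reproduced with the following sketch. It assumes the standardization follows the usual PLTS normalization of Pang et al. [13]: the shorter set is padded with its smallest term at probability zero, and the elements are then ordered by the product of term index and probability in descending order. The helper name `normalize_pair` is hypothetical.

```python
def normalize_pair(a, b):
    """Pad the shorter PLTS and order both sets by (term index * probability), descending.

    Each PLTS is a list of (term_index, probability) pairs.
    """
    def pad(plts, length):
        # Assumed padding rule: repeat the smallest term with probability 0.
        smallest = min(idx for idx, _ in plts)
        return plts + [(smallest, 0.0)] * (length - len(plts))

    def order(plts):
        # sorted() is stable, so elements with equal products keep their input order.
        return sorted(plts, key=lambda t: t[0] * t[1], reverse=True)

    length = max(len(a), len(b))
    return order(pad(a, length)), order(pad(b, length))


# L_11 and L_21 for the attribute "Service", written as (term index, probability).
L11 = [(3, 0.125), (2, 0.470), (1, 0.110), (0, 0.200), (-1, 0.035), (-2, 0.035), (-3, 0.025)]
L21 = [(3, 0.005), (2, 0.120), (1, 0.185), (0, 0.660), (-1, 0.025), (-2, 0.005)]

L11_bar, L21_bar = normalize_pair(L11, L21)
```

Under these assumptions, `L21_bar` contains the zero-probability element $s_{-2}(0.000)$ that appears in $\bar{L}_{21}$, and both sets have seven elements, so they can be compared element by element.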
According to Formula (7), under the standard attribute $\bar{z}_1^c$, the difference between hotels $h_1$ and $h_2$ is $d(L_{11}(p), L_{21}(p)) = 0.3185$. In the same way, the pairwise differences between hotels can be calculated for every attribute, and finally the deviation degree of each attribute is obtained: $d_1(w) = 13.3258$, $d_2(w) = 14.1691$, $d_3(w) = 3.9918$, $d_4(w) = 6.0544$, $d_5(w) = 7.3635$. Then, according to Formula (8), the total deviation degree is obtained as follows:
$$d_P(w) = 13.3258 w_1 + 14.1691 w_2 + 3.9918 w_3 + 6.0544 w_4 + 7.3635 w_5$$
The maximum deviation degree optimization model is constructed and solved using the Lagrange function $L(w, \lambda)$. The normalized attribute weights are as follows:
$$w_1 = 0.3035, \quad w_2 = 0.3227, \quad w_3 = 0.0909, \quad w_4 = 0.1151, \quad w_5 = 0.1677$$
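The weight computation can be sketched as follows, assuming the standard form of the maximum deviation model: maximize $\sum_j d_j w_j$ subject to $\sum_j w_j^2 = 1$, whose Lagrange solution is $w_j = d_j / \sqrt{\sum_j d_j^2}$ and which, after normalization to unit sum, reduces to $w_j = d_j / \sum_j d_j$. The constraint and the function name are assumptions, since the model itself is defined in an earlier section. Evaluated on the deviation degrees as printed, the sketch gives roughly (0.2968, 0.3155, 0.0889, 0.1348, 0.1640); the published weights are reproduced to four decimals when $d_4$ is taken as 5.0544, so one of the two printed figures may contain a typo.

```python
import math


def max_deviation_weights(deviations):
    """Normalized attribute weights under the assumed maximum deviation model."""
    norm = math.sqrt(sum(d * d for d in deviations))
    unit = [d / norm for d in deviations]  # Lagrange solution with sum(w_j^2) = 1
    total = sum(unit)
    return [u / total for u in unit]       # rescaled so the weights sum to 1


# Deviation degrees exactly as printed in the text.
printed = [13.3258, 14.1691, 3.9918, 6.0544, 7.3635]
weights_printed = max_deviation_weights(printed)

# Hypothetical variant with d_4 = 5.0544, which matches the published weights.
variant = [13.3258, 14.1691, 3.9918, 5.0544, 7.3635]
weights_variant = max_deviation_weights(variant)
```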
Step 3: Calculate the similarity between each pair of hotels and construct the similarity matrix.
Taking hotels $h_1$ and $h_2$ as an example, the cosine similarity and modified cosine similarity of the two hotels can be calculated according to Formulas (4) and (5), and then the fused similarity for each attribute can be calculated according to Formula (13) (Table 6).
Based on this formula and the attribute weights, the final similarity between the two hotels is 0.8650. The similarity between every other pair of hotels is calculated in the same way, and the similarity matrix $M$ is constructed.
$$M = \begin{bmatrix}
1 & 0.865 & 0.874 & 0.858 & 0.970 & 0.850 & 0.846 & 0.810 & 0.965 & 0.874 \\
0.865 & 1 & 0.995 & 0.959 & 0.848 & 0.977 & 0.951 & 0.940 & 0.838 & 0.977 \\
0.874 & 0.995 & 1 & 0.965 & 0.859 & 0.978 & 0.958 & 0.926 & 0.842 & 0.980 \\
0.858 & 0.959 & 0.965 & 1 & 0.837 & 0.973 & 0.979 & 0.906 & 0.849 & 0.971 \\
0.970 & 0.848 & 0.859 & 0.837 & 1 & 0.857 & 0.853 & 0.806 & 0.956 & 0.855 \\
0.850 & 0.977 & 0.978 & 0.973 & 0.857 & 1 & 0.987 & 0.960 & 0.860 & 0.981 \\
0.846 & 0.951 & 0.958 & 0.979 & 0.853 & 0.987 & 1 & 0.940 & 0.861 & 0.981 \\
0.810 & 0.940 & 0.926 & 0.906 & 0.806 & 0.960 & 0.940 & 1 & 0.854 & 0.953 \\
0.965 & 0.838 & 0.842 & 0.849 & 0.956 & 0.860 & 0.869 & 0.854 & 1 & 0.877 \\
0.874 & 0.977 & 0.980 & 0.971 & 0.855 & 0.981 & 0.981 & 0.953 & 0.877 & 1
\end{bmatrix}$$
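The $h_1$ and $h_2$ entry of $M$ can be retraced from the per-attribute values in Table 6. The sketch below makes two assumptions that are inferred from the reported numbers rather than stated here: the fusion coefficient in $Sim = \alpha Sim_{cs} + (1 - \alpha) Sim_{acs}$ is $\alpha = 0.5$, and the final similarity is the weighted sum of the per-attribute fused similarities using the attribute weights from Step 2.

```python
ALPHA = 0.5  # assumption: inferred from Table 6, where Sim is the mean of Sim_cs and Sim_acs


def fused_similarity(sim_cs, sim_acs, alpha=ALPHA):
    """Blend cosine similarity and modified cosine similarity for one attribute."""
    return alpha * sim_cs + (1 - alpha) * sim_acs


# Attribute weights from Step 2 and per-attribute similarities of h_1 and h_2 from Table 6.
weights = [0.3035, 0.3227, 0.0909, 0.1151, 0.1677]
sim_cs = [0.9988, 0.9960, 0.9999, 0.9995, 0.9999]
sim_acs = [0.9996, 0.2460, 0.8983, 0.8731, 0.9944]

per_attribute = [fused_similarity(c, a) for c, a in zip(sim_cs, sim_acs)]

# Assumed aggregation: weighted sum over the five attributes.
overall = sum(w * s for w, s in zip(weights, per_attribute))
```

With these inputs, `overall` comes out close to the reported 0.8650; the small remaining difference is attributable to the rounding of the published inputs.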
Step 4: Set a threshold $\Delta$ and generate the hotel recommendation order based on the guest's historical check-in data. If two hotels have the same similarity, they share the same position in the recommendation order.
Assuming that the preset recommendation threshold is $\Delta = 0.805$, if the guest has previously stayed at hotel $h_1$, the hotel recommendation sequence is $h_5 \succ h_9 \succ h_3 \succ h_{10} \succ h_2 \succ h_4 \succ h_6 \succ h_7 \succ h_8$; the sequences for other hotels are obtained in the same way.
With the same threshold, if the guest has previously stayed at hotel $h_8$, the hotel recommendation sequence is $h_6 \succ h_{10} \succ h_2, h_7 \succ h_3 \succ h_4 \succ h_9 \succ h_1 \succ h_5$. For hotel $h_8$, since the similarities of $h_2$ and $h_7$ are the same, hotels $h_2$ and $h_7$ share the same position in the recommendation order. A tie of the same kind occurs for hotel $h_{10}$.
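Step 4 can be illustrated with a small ranking sketch applied to the $h_8$ row of $M$. It assumes that a hotel is retained when its similarity is greater than or equal to $\Delta$ and that hotels with identical similarity values are grouped into one rank; the function name `recommend` and the dictionary layout are illustrative only.

```python
from itertools import groupby


def recommend(similarity_row, visited, threshold):
    """Return hotel indices grouped by descending similarity; tied hotels share a group."""
    candidates = [
        (sim, hotel)
        for hotel, sim in similarity_row.items()
        if hotel != visited and sim >= threshold  # assumed rule: keep sim >= threshold
    ]
    # Highest similarity first; hotel index breaks ties for a deterministic order.
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return [
        [hotel for _, hotel in group]
        for _, group in groupby(candidates, key=lambda t: t[0])
    ]


# Row of M for hotel h_8 (keys are hotel indices 1..10).
row_h8 = {1: 0.810, 2: 0.940, 3: 0.926, 4: 0.906, 5: 0.806,
          6: 0.960, 7: 0.940, 8: 1.0, 9: 0.854, 10: 0.953}

ranking = recommend(row_h8, visited=8, threshold=0.805)
```

Here `ranking` groups $h_2$ and $h_7$ into a single rank, matching the sequence given for $h_8$. Raising the threshold shortens the list from the tail, since the least similar hotels are dropped first.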

5. Comparative Analysis

The hotel recommendation algorithm proposed in this paper, based on the probabilistic language term set, makes use of users' online review information to recommend hotels that match their preferences and to rank multiple recommended hotels. To assess its performance, we compared it with the recommendation algorithms proposed by Cui et al. [19] and Zhou et al. [16], as shown in Table 7. To check the feasibility of the algorithm, we repeated the calculations with both methods. Following the method of Cui et al. [19], the recommended hotels for hotel $h_8$ are ranked $h_6 \succ h_{10} \succ h_2, h_7 \succ h_3 \succ h_4 \succ h_9 \succ h_1 \succ h_5$, which is consistent with the result obtained using the algorithm proposed in this paper. Using the attribute weight formula $w_j = N_j / \sum_{j=1}^{n} N_j$ and the VIKOR method from Zhou et al. [16], the recommended hotel ranking is $h_6 \succ h_{10} \succ h_2, h_7 \succ h_3 \succ h_4 \succ h_9 \succ h_1 \succ h_5$, which also aligns with the result of the algorithm presented in this paper. Except for hotel $h_2$, the optimal hotel recommendations are the same as those in this paper, which supports the feasibility of the proposed method.
As Table 7 shows, this paper determines the attribute weights with the maximum deviation method, as does the algorithm of Cui et al. [19], whereas the study of Zhou et al. [16] relies primarily on the frequency of occurrence of keywords, which is easily influenced by the number of keywords. Regarding attribute similarity, this paper combines cosine similarity with modified cosine similarity, while Cui et al. rely solely on cosine similarity. Concerning the sorting method, this paper ranks the hotels by their similarity to generate a recommendation list. In terms of the recommendation approach, this paper combines the user's historical hotel selection data with the generated hotel similarity list. In summary, the proposed method has advantages over the two reference algorithms in weight determination, similarity measurement, and the use of historical data.

6. Conclusions and Future Research Directions

To address the challenges of applying large volumes of hotel online review data to recommendation, this paper uses the probabilistic language term set (PLTS) to express the hotel attributes and user preferences found in online reviews. It uses language terms on a seven-level granularity scale to capture the fuzzy emotional semantics of online reviews more precisely, and proposes a hotel collaborative filtering recommendation algorithm based on PLTS. We use jieba for text segmentation, conduct sentiment analysis with the HowNet sentiment dictionary, and transform the review terms into PLTS with the help of the TF-IDF statistical method. We compile the probabilistic language information into an evaluation matrix, determine attribute weights using the maximum deviation method, calculate the weighted similarity between hotels using an improved fused similarity formula, and present the recommendation results. A case analysis with examples from Ctrip validates the effectiveness of the proposed recommendation algorithm, and a comparative analysis with other methods demonstrates its relative advantages.
Limitations and Future Work: The data set and the number of attributes in the case study are relatively small, and the selection of hotel attributes relies on subjective experience. In future work, we plan to address these shortcomings by extracting hotel attributes from online reviews with a model-based approach and by using larger-scale data to optimize the algorithm.

Author Contributions

Conceptualization, E.W.; Methodology, E.W. and Y.C.; Software, Y.L.; Investigation, Y.C. and Y.L.; Data curation, Y.C.; Writing—original draft, Y.C.; Writing—review & editing, E.W.; Project administration, E.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Research on Tourism Product Recommendation Algorithm in the Key Fields of Guangdong General Universities (Digital Economy; No. 2021ZDZX3010).

Data Availability Statement

Both data and algorithms are listed in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Government of the Macao Special Administrative Region. Bureau of Statistics and Census [DB/OL]. Available online: https://www.dsec.gov.mo/zh-CN/Statistic?id=401 (accessed on 10 January 2023).
  2. Shokeen, J.; Rana, C. A study on features of social recommender systems. Artif. Intell. Rev. 2020, 53, 965–988.
  3. Liu, S.; Chen, Z.; Li, X. Time-semantic-aware Poisson tensor factorization approach for scalable hotel recommendation. Inf. Sci. 2019, 504, 422–434.
  4. Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7, 76–80.
  5. Shi, Y.; Larson, M.; Hanjalic, A. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Comput. Surv. (CSUR) 2014, 47, 1–45.
  6. Zhu, P.; Cheng, L.; Gao, C.; Wang, Z.; Li, X. Locating multi-sources in social networks with a low infection rate. IEEE Trans. Netw. Sci. Eng. 2022, 9, 1853–1865.
  7. Tao, L.; You, T.; Yuan, Y. Hotel Selection Based on Hotel Feature Information and Online Review Information. J. Northeast Univ. Nat. Sci. 2019, 40, 1667–1672.
  8. Wang, S.; Wu, S. Personalized Attraction Recommendation Using Online Reviews. J. Huaqiao Univ. Nat. Sci. 2018, 39, 467–472.
  9. Cui, C.; Wang, X.; Li, W. Research on Contextual Environment-Based User Profile Tourism Product Recommendation Algorithm. Pract. Underst. Math. 2019, 49, 122–131.
  10. Lihua, S. Advanced collaborative filtering recommendation model based on sentiment analysis of online review. J. Shandong Univ. Eng. Sci. 2019, 49, 47–54.
  11. Li, X.J.; Deng, G.S.; Wang, X.Z.; Wu, X.L.; Zeng, Q.W. A hybrid recommendation algorithm based on user comment sentiment and matrix decomposition. Inf. Syst. 2023, 117, 102244.
  12. Cao, H.; Kang, J. Study on improvement of recommendation algorithm based on emotional polarity classification. In Proceedings of the 2020 5th International Conference on Computer and Communication Systems (ICCCS), Shanghai, China, 15–18 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 182–186.
  13. Pang, Q.; Wang, H.; Xu, Z. Probabilistic linguistic term sets in multi-attribute group decision making. Inf. Sci. 2016, 369, 128–143.
  14. Zhang, Y.; Liang, D.; Xu, Z. Cross-platform hotel evaluation by aggregating multi-website consumer reviews with probabilistic linguistic term set and Choquet integral. Ann. Oper. Res. 2022, 1–35.
  15. Celotto, A.; Loia, V.; Senatore, S. Fuzzy linguistic approach to quality assessment model for electricity network infrastructure. Inf. Sci. 2015, 304, 1–15.
  16. Zhou, H.; Ma, H.; Liu, J. Research on Film and Television Recommendation Algorithm Integrating Sentiment Analysis and Probabilistic Language. Inf. Stud. Theory Pract. 2020, 43, 180–186.
  17. Chen, S.; Zhang, C.; Zeng, S.; Wang, Y.; Su, W. A probabilistic linguistic and dual trust network-based user collaborative filtering model. Artif. Intell. Rev. 2023, 56, 429–455.
  18. Cui, C.; Wang, X.; Li, W. Research on Tourism Attraction Recommendation Algorithm Based on User Online Reviews. Syst. Sci. Math. 2020, 40, 1103.
  19. Cui, C.; Wei, M.; Che, L.; Wu, S.; Wang, E. Hotel recommendation algorithms based on online reviews and probabilistic linguistic term sets. Expert Syst. Appl. 2022, 210, 118503.
  20. Lin, M.; Xu, Z. Probabilistic linguistic distance measures and their applications in multi-criteria group decision making. Soft Comput. Appl. Group Decis.-Mak. Consens. Model. 2018, 357, 411–440.
  21. Chen, Z.; Huang, C.; Zhou, R.; Wang, Z.; Cao, F. Crowdsourcing Task Recommendation Algorithm Based on Collaborative Filtering. Inf. Technol. Informatiz. 2021, 8, 119–121.
  22. Luo, S.Z.; Zhang, H.Y.; Wang, J.Q.; Li, L. Group decision-making approach for evaluating the sustainability of constructed wetlands with probabilistic linguistic preference relations. J. Oper. Res. Soc. 2019, 70, 2039–2055.
  23. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 April 2001; pp. 285–295.
  24. Chu, H.; Liu, Q.; Mu, C. A Collaborative Filtering Recommendation Algorithm Improved for Modified Cosine Similarity. J. Yantai Univ. Nat. Sci. Eng. Ed. 2021, 34, 330–336.
  25. Jiang, L.; Zhang, Q. Research on Personalized Recommendation Strategy Based on Sentiment Analysis of Reviews—A Case Study of Douban Movie Reviews. Inf. Stud. Theory Pract. 2017, 40, 99–104.
Figure 1. Research idea.
Figure 2. Idea of data acquisition and preprocessing.
Figure 3. Steps for converting comments into a probabilistic language term set.
Figure 4. Idea of recommendation algorithm.
Table 1. Categories of words for hotel evaluation.

| Category | Definition | Example |
| --- | --- | --- |
| Attribute words | Words that describe some aspect of a hotel | “service”, “facilities”, “price”, etc. |
| Emotional words | Words that express emotional attitudes towards hotel attributes | “comfortable”, “convenient”, “affordable”, etc. |
| Adverbs of degree | Words that modify emotional words and express emotional levels | “very”, “quite”, etc. |
Table 2. Reference for emotion analysis and intensity ranking of emotion word classification scale.

| Emotion Category | Scale Expression and Intensity Ranking | Example |
| --- | --- | --- |
| Positive emotion | $s_{1} < s_{2} < s_{3}$ | “Clean”, “comfortable”, “cost-effective”, etc. |
| Neutral emotion | $s_{0}$ | “Ok”, “average”, etc. |
| Negative emotion | $s_{-3} > s_{-2} > s_{-1}$ | “Dirty”, “noisy”, “disorderly”, “remote”, etc. |
Table 3. Selected hotels in Macau.

| Hotel Code ($h_i$) | Hotel Name | Hotel Class |
| --- | --- | --- |
| $h_1$ | Emperor Hotel | Three-star |
| $h_2$ | Fu Hua Hotel | Three-star |
| $h_3$ | Hotel Beverly Plaza | Four-star |
| $h_4$ | Casa Real Hotel | Four-star |
| $h_5$ | Hotel Golden Dragon | Four-star |
| $h_6$ | The Parisian Macao | Five-star |
| $h_7$ | Riviera Hotel Macau | Five-star |
| $h_8$ | Hotel Lisboa | Five-star |
| $h_9$ | The Venetian Macao | Five-star |
| $h_{10}$ | Sheraton Grand Macao | Five-star |
Table 4. Standard hotel attribute categories.

| Standard Attribute Word Symbol ($\bar{z}_j^c$) | Standard Attribute Word Name | Definition | Example |
| --- | --- | --- | --- |
| $\bar{z}_1^c$ | Service | Refers to the necessary labor provided to meet the material or spiritual needs of the guests, including the hotel’s service content, mode, attitude, speed, and efficiency. | “Check-in”, “Room type upgrade”, “late check-out”, “Reception”, “drop-off service”, etc. |
| $\bar{z}_2^c$ | Position | Refers to the geographical location and traffic conditions of the hotel, considering the distance from the airport, station, business center, tourist attractions, and other environmental factors, as well as whether it is easy to find. | “Bus station”, “walking”, “attractions”, “airport”, “shopping”, “transportation”, etc. |
| $\bar{z}_3^c$ | Facility | Refers to the hardware and software facilities provided by the hotel, mainly including the necessary accommodation facilities, hot water supply facilities, communication facilities, fire-fighting facilities, etc. | “WIFI”, “projection”, “bed”, “TV”, “bathtub”, “Restaurant”, “gym”, etc. |
| $\bar{z}_4^c$ | Price | Refers to the price of staying at the hotel and the cost performance. | “Cost performance”, “cheap price”, etc. |
| $\bar{z}_5^c$ | Environment | Refers to the hotel’s environmental comfort and cleanliness, mainly including the two aspects of the internal and external environment. | “Quiet”, “clean”, “comfortable”, “bright”, etc. |
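Table 5. Statistics of the occurrence of evaluation terms of hotel $h_1$.

| Hotel ($h_i$) | Standard Attribute ($\bar{z}_j^c$) | $s_{-3}$ | $s_{-2}$ | $s_{-1}$ | $s_{0}$ | $s_{1}$ | $s_{2}$ | $s_{3}$ | $Sum_{c_j}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $h_1$ | Service ($\bar{z}_1^c$) | 5 | 7 | 7 | 40 | 22 | 94 | 25 | 200 |
| | Position ($\bar{z}_2^c$) | 0 | 0 | 0 | 97 | 38 | 59 | 6 | 200 |
| | Facility ($\bar{z}_3^c$) | 1 | 11 | 12 | 151 | 11 | 14 | 0 | 200 |
| | Price ($\bar{z}_4^c$) | 6 | 1 | 2 | 110 | 49 | 19 | 13 | 200 |
| | Environment ($\bar{z}_5^c$) | 3 | 7 | 8 | 40 | 99 | 37 | 6 | 200 |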
Table 6. Similarity calculation process of hotels $h_1$ and $h_2$ for the same attribute.

| | $L_{11}(p), L_{21}(p)$ | $L_{12}(p), L_{22}(p)$ | $L_{13}(p), L_{23}(p)$ | $L_{14}(p), L_{24}(p)$ | $L_{15}(p), L_{25}(p)$ |
| --- | --- | --- | --- | --- | --- |
| $Sim_{cs}$ | 0.9988 | 0.9960 | 0.9999 | 0.9995 | 0.9999 |
| $Sim_{acs}$ | 0.9996 | 0.2460 | 0.8983 | 0.8731 | 0.9944 |
| $Sim$ | 0.9992 | 0.6210 | 0.9491 | 0.9363 | 0.9972 |
Table 7. Algorithm comparison.

| Recommendation Algorithm | Attribute Weight | Attribute Similarity | Ranking Method | Recommendation Method |
| --- | --- | --- | --- | --- |
| Algorithm in this paper | Maximum deviation degree method | $Sim = \alpha Sim_{cs} + (1 - \alpha) Sim_{acs}$ | The similarity between hotels | Combined with historical data, based on hotel similarity |
| Cui et al. [19] | Maximum deviation degree method | $Sim_{cs}$ | The similarity between products | Based on similarities between products |
| Zhou et al. [16] | Number of occurrences of subject terms | - | VIKOR method | Based on user product evaluation score |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, E.; Chen, Y.; Li, Y. Research on a Hotel Collaborative Filtering Recommendation Algorithm Based on the Probabilistic Language Term Set. Mathematics 2023, 11, 4106. https://doi.org/10.3390/math11194106
