Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis

Botunac, Ive; Brkić Bakarić, Marija; Matetić, Maja

doi:10.3390/app14146254

Open AccessArticle

Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis

by

Ive Botunac

^*

,

Marija Brkić Bakarić

and

Maja Matetić

Faculty of Informatics and Digital Technologies, University of Rijeka, Radmile Matejčić 2, 51000 Rijeka, Croatia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(14), 6254; https://doi.org/10.3390/app14146254

Submission received: 16 June 2024 / Revised: 16 July 2024 / Accepted: 17 July 2024 / Published: 18 July 2024

(This article belongs to the Special Issue Neural Network Technologies in Natural Language Processing and Data Mining)

Download

Browse Figures

Versions Notes

Abstract

This study compares the effectiveness of fine-tuning Transformer models, specifically BERT, RoBERTa, DeBERTa, and GPT-2, against using prompt engineering in LLMs like ChatGPT and GPT-4 for multi-class classification of hotel reviews. As the hospitality industry increasingly relies on online customer feedback to improve services and strategize marketing, accurately analyzing this feedback is crucial. Our research employs a multi-task learning framework to simultaneously conduct sentiment analysis and categorize reviews into aspects such as service quality, ambiance, and food. We assess the capabilities of fine-tuned Transformer models and LLMs with prompt engineering in processing and understanding the complex user-generated content prevalent in the hospitality industry. The results show that fine-tuned models, particularly RoBERTa, are more adept at classification tasks due to their deep contextual processing abilities and faster execution times. In contrast, while ChatGPT and GPT-4 excel in sentiment analysis by better capturing the nuances of human emotions, they require more computational power and longer processing times. Our findings support the hypothesis that fine-tuning models can achieve better results and faster execution than using prompt engineering in LLMs for multi-class classification in hospitality reviews. This study suggests that selecting the appropriate NLP model depends on the task’s specific needs, balancing computational efficiency and the depth of sentiment analysis required for actionable insights in hospitality management.

Keywords:

transformer; sentiment analysis; BERT; GPT; machine learning; text classification; multi-task learning

1. Introduction

In the era of advancements in information technologies and the digitalization of hospitality, user reviews have become vital for tourism business entities and users in selecting offerings. With increasing numbers of users inclined to share and exchange their personal experiences on social networks, forums, and websites, the tourism industry has experienced significant changes in attracting and engaging them. In this new age of e-tourism, businesses are developing innovative marketing strategies focused on user needs and satisfaction, using user-generated content to gain insights into user behavior and sentiment [1]. Sentiment analysis has emerged as a promising approach to understanding guest experiences and preferences in the hospitality industry [2]. However, the complexity of the language and the various aspects of reviews, such as price, rating, and location, make this analysis challenging. Researchers increasingly combine machine learning methods and sentiment analysis to overcome these challenges, especially using advanced deep learning techniques like neural networks [3,4].

In recent years, models based on Transformer architecture [5], such as Bidirectional Encoder Representations from Transformers (BERT) [6], have become a standard in natural language processing (NLP) due to their ability to efficiently process large amounts of text and extract deeper semantic features. Following the success of BERT, large language models (LLMs) have emerged, offering advanced language understanding and generation capabilities. These LLMs, coupled with the technique of prompt engineering, have opened new possibilities for tackling various NLP tasks, including sentiment analysis and multi-class classification. This paper compares the effectiveness of two approaches for multi-class classification in hospitality review analysis: fine-tuning selected models specifically for the task and using prompt engineering in LLMs. The main hypothesis is that fine-tuning models can achieve better results and faster execution than prompt engineering in LLMs due to the latter’s complex architecture and computational requirements.

The paper first reviews relevant research in the Section 2. The Section 3 provides insight into the basics of machine learning in NLP, focusing on Transformer architecture, various models, and multi-task learning. The Section 4 describes the techniques, datasets, and evaluation criteria. The Section 5 provide details of the experiments conducted and an analysis of the results, while the Section 6 summarizes the key findings of this work.

2. Related Work

In the context of the rapid development of natural language processing technologies and the growing importance of review analysis, many researchers have contributed to this field through various methods and approaches. This section provides an overview of fundamental research and works that have shaped the current understanding of sentiment analysis and categorization in the context of hotel reviews. This review will offer insight into the evolution of methods, their advantages and disadvantages, and how recent techniques, such as BERT and GPT, compare to more traditional approaches.

The analysis of accommodation guest reviews using a simpler Multinomial Naive Bayes model can be found in [7]. Within the study in [3], the BERT model was used to understand better emotions expressed in reviews, proving that this approach can help hotels better understand the needs of their clients. The research highlighted ERNIE as an enhancement of BERT, with better performance in sentiment analysis. Another study in [2] focused on the classification of reviews using Long Short-Term Memory (LSTM) recurrent neural networks, achieving high accuracy for two sentiment categories but lower accuracy for three categories. In the research presented in [8], a recommendation system is proposed, which uses BERT for sentiment classification and a Random Forest classifier for additional processing, achieving an accuracy close to the previously mentioned results.

Regarding sentiment analysis using LLMs, we notice an uneven representation in research compared to BERT and classical neural networks. The findings of [9] with GPT-J and GPT-3 models indicate that, while these models are impressive, they still need to provide competitive results compared to specialized models like BERTweet [10] or RoBERTa-Retrained [11]. On the other hand, ref. [12] thoroughly examines different methods and approaches to using GPT models for sentiment analysis on the SemEval 2017 dataset. The study’s results showed that GPT models, particularly GPT 3.5 Turbo, outperform existing machine learning methods, achieving an exceptional accuracy of 97.3%. Additionally, the paper highlights GPT models’ advantages in understanding context, detecting sarcasm, interpreting emojis, slang, and hashtags, and detecting negation.

The research in [13] evaluates ChatGPT in the context of sentiment analysis. The authors analyzed GPT model’s ability to understand and interpret opinions, sentiments, and emotions. ChatGPT was tested on 18 reference datasets and five representative sentiment analysis tasks. Key results showed that ChatGPT, in “zero-shot” settings, provides a performance comparable to fine-tuned models like BERT. However, when using the “few-shot” prompt technique, ChatGPT often surpasses the performance of fine-tuned BERT. The paper highlighted ChatGPT’s impressive abilities in understanding emotions and inferring their causes. Despite ChatGPT’s strong capabilities, the authors identified potential challenges and needed further adjustments and optimizations for specific domains and tasks.

Reviewing related works shows that sentiment analysis techniques are evolving rapidly, with a particular emphasis on applying state-of-the-art models like ChatGPT. While traditional approaches still have their role and importance, new models like ChatGPT offer promising results, often surpassing the performance of existing methods. However, research also points to the need for further adjustments and optimizations to achieve the best performance in specific domains and tasks.

3. Theoretical Background

This chapter provides an overview of the key concepts underlying modern methods of NLP, focusing on the role of machine learning, the Transformer architecture, and models like BERT and GPT. NLP is an interdisciplinary field that combines linguistics, computer science, and artificial intelligence to develop systems that can understand, interpret, and respond to human language [14]. Deep learning has significantly advanced NLP, particularly with the introduction of pre-trained models such as GPT-3 and ChatGPT [13,15], which allow for tailoring models to specific tasks without lengthy training [14].

Text classification, a classic problem in NLP, involves categorizing textual strings into predefined categories by understanding and interpreting the content [16]. Various models, including convolutional (CNN), recurrent neural networks (RNN), and advanced models like GPT-3, are used for accurate classification [12]. With the growing amount of textual data, efficient and accurate text classification is crucial for making informed decisions based on text analysis. The Transformer architecture has revolutionized NLP, with models like BERT and GPT setting new standards in sentiment analysis.

3.1. Overview of Transformer Architecture

The development of today’s popular large language models like ChatGPT [15,17] was preceded by the Transformer architecture, which has become state-of-the-art for NLP tasks. Vaswani et al. introduced this revolutionary architecture in their paper “Attention Is All You Need” [5] while working at Google Research and Google Brain [4]. The Transformer architecture replaced RNN and LSTM networks, which faced challenges in understanding context, and became the foundation for advanced NLP models such as BERT and GPT. The key innovation lies in its attention mechanism, especially the “self-attention” method, which allows the model to efficiently connect information from different parts of the text [18].

The Transformer model utilizes an encoder–decoder structure, as shown in Figure 1. The encoder transforms the input sequence into continuous vector representations using a series of identical layers containing a “multi-head” and “self-attention” mechanism and a positional feed-forward neural network. The decoder generates the output using these representations and an additional “attention” mechanism that refers to the encoder’s outputs [5,18]. The central part of this architecture is the “self-attention” mechanism, which allows each word in the sequence to dynamically attend to other words regardless of their relative position, enabling the model to understand contextual relationships between words in a way that was impossible with previous models. Another key feature is “multi-head attention”. The Transformer uses multiple parallel heads to analyze the input from different perspectives, allowing the model to gain deeper insights into the data [18]. Positional encoding is used to provide the model with information about the position of each word within the sequence, compensating for the Transformer’s lack of inherent word order information.

Different models in this research study utilize specific parts of the Transformer architecture to solve particular NLP tasks. BERT [6] uses only the encoder part to analyze input sequences in both directions, gaining rich contextual insights. GPT models [19] use only the decoder part to generate text by predicting the next word based on previous words. Models like BART [20] combine encoder and decoder to achieve bidirectional understanding and generative capabilities, enabling tasks such as text correction and summarization. The Transformer architecture has become the foundation for modern NLP models due to its ability to efficiently process language, understand context, and adapt to various tasks, making it revolutionary in machine learning. The following sections will delve deeper into the most famous representatives of this architecture.

3.2. Transformers-Based Models

Transformer-based models have revolutionized NLP in recent years. Three prominent models are BERT and GPT, with all variations, each with unique characteristics and applications.

BERT, developed by Google researchers [6], utilizes the encoder part of the Transformer architecture. BERT’s key feature is its ability to interpret words in the context of the entire sentence using a bidirectional attention mechanism and a complex “multi-head attention” mechanism [4]. BERT’s training process involves pre-training on a large unlabeled text corpus and fine-tuning for specific NLP tasks [6]. Numerous variations of BERT, such as RoBERTa [11] and DistilBERT [21], have been developed to meet specific needs and challenges [18].

GPT [19] is a series of language models developed by OpenAI that exclusively utilize the decoder part of the Transformer architecture. GPT models have set new standards by combining innovative technical features with massive datasets [19]. The initial GPT model had 110 million parameters and used a two-stage training method [19]. GPT-2 increased the model’s capacity to 1.5 billion parameters and was trained on an expanded dataset [22]. GPT-3, the latest publicly available model, has 175 billion parameters and introduced “few-shot” learning, allowing it to adapt to various tasks with minimal examples [23].

ChatGPT, developed by OpenAI, is based on the InstructGPT model [24] and utilizes the advanced Transformer architecture. ChatGPT can adapt to a wide range of NLP tasks without specific training, demonstrating “zero-shot” learning ability [25]. Its adaptability is attributed to the instruction-based fine-tuning framework using reinforcement learning with human feedback (RLHF) [17]. The training process of ChatGPT involves three steps: supervised fine-tuning (SFT), reward model training (RM), and reinforcement learning (RL) [26].

These Transformer-based models have provided deeper language understanding and versatility in various applications, setting new standards in NLP.

4. Research Methodology

In the methodology of this research study, the focus is placed on applying multi-task learning (MTL) as an advanced approach to machine learning [27]. In this approach, a single model is trained to solve multiple tasks simultaneously, which allows for better generalization and reduces the need for large datasets for each task. Specifically, this paper presents a model capable of parallel sentiment classification and categorization. For example, when analyzing hotel reviews, the model is trained to identify sentiment (positive, negative, neutral) [28] and to classify some of the categories (e.g., food, price, service) [29,30]. By integrating these tasks, the model can use information from one task to improve performance on another, achieving greater efficiency and accuracy in review analysis.

The challenge of text classification is posed where traditional models such as BERT and RoBERTa are used to map input text to corresponding categories directly through encoder architecture. However, with the development of generative models like GPT, a new horizon of possibilities opens up in the text classification domain. While the GPT-2 model can be used in the traditional sense of classification through fine-tuning downstream tasks, ChatGPT offers an alternative approach through generative use with the help of prompt engineering [17]. This approach involves posing specific queries to the model as prompts, after which the model generates a response. The response is then further processed to achieve the desired classification. Using GPT models in a generative manner has the potential to achieve better performance in classification compared to traditional encoder models. This possibility arises from the characteristics of GPT models, which can deeply understand the context and nuances of language, leading to more precise and intuitive classifications.

4.1. Selection and Description of the Dataset

For our study, we selected the “restaurants” subset of the “SemEval-2014 Task 4: Aspect Based Sentiment Analysis” dataset [31]. This dataset is widely used in sentiment analysis and aspect-based classification research within the hospitality domain, making it an excellent benchmark for evaluating different models and approaches.

The dataset comprises customer reviews tagged with the sentiment (positive, negative, neutral) and specific categories (ambience, anecdotes/miscellaneous, food, price, service). This structure allows for a nuanced approach to classifying and understanding customer feedback in the hospitality industry, aligning with our research objectives.

Overall, the dataset contains 2322 reviews, with a distribution of approximately 58% positive, 23% negative, and 18% neutral sentiments. The sentiment distribution varies across different categories, providing a diverse range of examples for our multi-task classification approach. Figure 2 illustrates this distribution of sentiments across the various categories, visually representing the dataset’s composition.

4.2. Data Preprocessing and Preparation

The preprocessing and preparation of data are crucial steps in any machine learning project, as the quality of input data directly affects model performance. After defining the data sources for our research, we approached the detailed process of cleaning and structuring the data to make it suitable for model training.

The data retrieved from the SemEval-2014 set come in XML format, requiring additional processing to convert the data into a tabular format suitable for analysis. Through this conversion, special attention was paid to basic filtering reviews to ensure that each review was marked with one of the categories “ambience”, “anecdotes/miscellaneous”, “food”, “price”, or “service” and one sentiment polarity (“positive”, “negative”, or “neutral”). This check ensured that our model addressed a classification problem where each review is comprehensively labeled with a sentiment and a category.

After basic filtering, we obtained a set of 2322 reviews. The data were divided into training and test sets to ensure the model’s robustness and generalization. An 80/20 split was applied, resulting in 1858 reviews allocated for training and 464 reviews reserved for testing. This division allows the model to be trained on a comprehensive dataset while providing a separate set for evaluating model performance on unseen data.

4.3. System Architecture Design

This research study explores two distinct approaches for the multi-class classification of hotel reviews using advanced natural language processing models: fine-tuning and prompt engineering. Each approach utilizes different models and techniques to tackle the task at hand.

In the fine-tuning approach, we select pre-trained models and further train them on a specific downstream task, in this case, multi-class classification of hotel reviews. The models chosen for this approach are BERT, RoBERTa, and DeBERTa. BERT uses the Transformer architecture with bidirectional encoders, simultaneously analyzing the context from both sides of a word or phrase, making it powerful in understanding context [6]. RoBERTa, while similar to BERT, uses optimized hyperparameters and eliminates some of BERT’s components, like next-sentence prediction, resulting in better performance [11]. DeBERTa, another variant of BERT, employs disentangled attention and enhanced training techniques to improve performance.

These models’ implementation, training, and evaluation are performed using the Grouphug library [32], which extends the Hugging Face Transformers platform. Grouphug allows for the flexible integration of various machine learning models and attaches a classification head for each task, enabling simultaneous fine-tuning of the model on multiple tasks. The proposed system architecture for the fine-tuning approach is illustrated in Figure 3.

On the other hand, the prompt engineering approach focuses on utilizing the generative capabilities of large language models, specifically GPT-3.5 (ChatGPT) and GPT-4. These models use the Transformer architecture emphasizing the decoder side, allowing for sequential text generation [22]. ChatGPT, optimized for conversational text, uses a similar approach to GPT-2 but is further tailored to generate conversational responses through prompts [17].

Due to the inherent generative nature of GPT models, we employ prompt engineering techniques to direct the model towards the desired classification task [33]. For this purpose, we use the LangChain library, which allows for the dynamic generation of prompts tailored to the specific task. This library, in combination with the OpenAI API, gives us access to the “chatgpt-3.5-turbo” and “gpt-4” models, key components of the ChatGPT and GPT-4 platforms. The proposed system architecture for the prompt engineering approach is depicted in Figure 4.

By comparing these two approaches, we aim to determine the most effective method for multi-class classification of hotel reviews, considering factors such as accuracy, efficiency, and computational requirements. These models are selected based on their unique architectures and proven performance in various NLP tasks, and our goal is to adapt, fine-tune, and optimize each model to achieve optimal performance in the multi-task classification of sentiment and categories of hotel reviews.

5. Experimental Procedure and Results

In this chapter, we delve into the experimental procedure and evaluation of models for multi-task classification of sentiment and categories in hotel reviews. After establishing the appropriate infrastructure, we optimize the models through fine-tuning and prompt engineering. Throughout this process, we will use various metrics to assess the performance of each model, considering accuracy, F1 measure, and execution time. This experimental procedure aims to provide an in-depth analysis and comparison of different approaches in the context of classification tasks and to identify the most effective methods and techniques.

5.1. Parameters and Fine-Tuning Strategy

In fine-tuning models for multi-task classification of hotel reviews, ensuring an optimal environment and strategy to achieve the best performance is crucial. For this research, we used a powerful computing infrastructure, including a computer with an i9-10900X CPU @ 3.70 GHz processor, 64 GB of RAM, and an NVIDIA RTX 3090 graphics card with 24 GB of memory. This configuration allows for the efficient training and evaluation of models, significantly speeding up the experimental process.

A key component in achieving high-performance models is selecting the right hyperparameters. We employed a “random search” method to find the optimal parameters for each model individually, focusing on three key hyperparameters: number of epochs, learning rate, and train batch size. This method randomly samples different combinations of these hyperparameters within predefined ranges to identify the configuration that yields the best results on the validation dataset [16]. For our study, we defined the range for epochs from 5 to 30, the learning rate from 0.00001 to 0.0001, and the train batch size from 16 to 128.

For each model, we conducted approximately 10 iterations of random search. During the random search process, we evaluated each configuration based on accuracy metrics. The combination with the highest accuracy scores was selected as the optimal configuration for each model. The results of our hyperparameter optimization process are presented in Table 1, which shows the selected hyperparameters for each fine-tuning model. As we can see, the optimal configurations vary across models, highlighting the importance of individualized tuning.

Additionally, we implemented a comprehensive fine-tuning strategy for each transformer model. This process involved adapting the pre-trained models to our specific multi-task classification problem. We used the Grouphug library [32], which extends the Hugging Face Transformers library, to load pre-trained models and add custom classification heads for sentiment and category tasks. The fine-tuning process involved training all model layers, allowing it to adapt its pre-trained knowledge to our domain. We used a combined loss function that included both sentiment and category classification tasks and a language modeling task. The weighting of these tasks was controlled through the head configs parameter, with the language modeling task given a lower weight (0.1) than the classification tasks. We used the AdamW optimizer with learning rates, number of epochs, and batch sizes individually optimized for each model through a random search, as detailed in Table 1. To prevent overfitting, we implemented early stopping based on the evaluation loss. This approach allowed us to optimize each model’s performance while maintaining comparability across models, providing a fair basis for evaluating their capabilities in the multi-task classification of hotel reviews.

5.2. Prompt Engineering for LLMs

Prompt engineering, or creating specific queries for generative language models, has become a key component in working with advanced language models like GPT-3.5 (ChatGPT) and GPT-4. This technique allows us to precisely direct the model towards the desired response or outcome, especially in tasks where traditional training methods may not be sufficiently effective [12,13,25].

In the context of a “zero-shot” approach, the goal is to predict the most relevant categories for a given review without needing labeled data for training [33]. This approach relies on semantic alignment between the given review text, category description, and word embeddings on which the model was initially trained rather than on the availability of labeled training data. However, as category descriptions can often be brief and not provide enough context, the prompt approach extends the query to enable easier understanding and connection within the model.

Considering the “few-shot” approach, models are provided with several examples to direct them towards the desired response [12]. This is particularly useful when the model needs to understand a specific context or format of the response sought. For example, in sentiment analysis, queries can be designed to steer the model towards a sentiment-oriented response rather than a purely factual one.

To thoroughly evaluate the ChatGPT and GPT-4 models for sentiment analysis and categorization, we applied both “zero-shot” and “few-shot” approaches. These approaches ensure the model adequately interprets the task context and generates responses with the set expectations. In the “few-shot” approach, we present the model with a specific prompt:

“Please perform multi-task and multi-class text classification on the following review to one of the categories: ambience, anecdotes/miscellaneous, food, price, service, and one of the sentiments: negative, neutral, positive. A few examples are provided below where this classification is done correctly.”

We used the mentioned prompt and a few examples from the training dataset to facilitate in-context learning for the ChatGPT and GPT-4 models. This approach allows the models to adapt to sentiment analysis and categorization by providing relevant examples. The model output was configured to return only the categories and sentiments, ensuring a focused and consistent classification result. The actual evaluation of the models’ performance was conducted on the test dataset, which consists of unseen reviews not used during the training or few-shot learning process. Using the test dataset for evaluation, we ensure that the models can accurately classify sentiment and categories for previously unseen reviews. This allows for a fair and comprehensive comparison of their performance with other models studied in this research.

5.3. Comparison and Evaluation of Models

In this chapter, we analyze the performance of six prominent NLP models: BERT, RoBERTa, DeBERTa, GPT-2, ChatGPT, and GPT-4. Each model was tested in the multi-task classification of sentiment and categories in hotel reviews. This multi-task approach allows us to evaluate the models’ capabilities simultaneously handling two distinct but related classification tasks. The relationship between sentiment and category classification in this context is important and complex, reflecting the nuanced nature of customer feedback in the hospitality industry. Sentiment classification in our study refers to identifying the overall emotional tone of a review, categorized as positive, negative, or neutral. This task requires the model to interpret subjective language, context, and sometimes subtle cues that indicate the reviewer’s satisfaction level. On the other hand, category classification involves identifying specific aspects of the hotel experience mentioned in the review, such as service, food, ambiance, price, or miscellaneous observations. This task demands the model’s ability to recognize and categorize concrete topics within the text.

The interplay between sentiment and category adds a layer of complexity to our analysis. A single review may express different sentiments for different categories, requiring models to perform fine-grained analysis. For instance, a review might praise the food quality while expressing disappointment about the price, necessitating an accurate classification of the categories (food and price) and their associated sentiments (positive and negative, respectively). The results of this study provide valuable insights into the effectiveness of different NLP techniques. By comparing fine-tuned Transformer models with prompt engineering approaches in LLMs, we better understand how these methods perform in the simultaneous classification of sentiment and categories in hotel reviews. The performance of each model, considering factors such as accuracy, efficiency, and computational requirements, sheds light on the strengths and limitations of fine-tuning versus prompt engineering in practical applications.

These findings offer important guidance for implementing NLP solutions in real-world hospitality scenarios, particularly for analyzing complex customer feedback. The evaluation is based on three key metrics on the test dataset: accuracy, F1-Score, and execution time. These metrics are reported separately for sentiment and category classification to allow for a nuanced understanding of each model’s performance across both tasks. All performance information of the models is summarized and presented in Table 2, providing a comprehensive overview of their capabilities and effectiveness in the assigned tasks. This comparative analysis not only highlights the strengths and weaknesses of each model but also provides insights into the broader applicability of these NLP techniques in real-world hospitality contexts.

Figure 5 graphically showcases the accuracy outcomes for each model. Here, we can observe the superior performance of the DeBERTa model over its counterparts, highlighting its advantages in both sentiment and category classification tasks within the tested dataset.

BERT exhibited a robust performance, achieving an 86.0% accuracy rate in category classification and 84.9% in sentiment classification, demonstrating its capability for deep contextual understanding. The model was also relatively swift, with a processing time of 7.34 s, marking it as an efficient choice.

RoBERTa advanced the benchmarks set by BERT, achieving an 88.3% accuracy rate in category classification and an 88.1% rate in sentiment classification, indicating its proficiency in handling nuanced language tasks. It completed its runs slightly quicker, in 7.32 s, positioning it as the fastest and most accurate model among the transformer models tested.

DeBERTa, a variant of BERT, achieved an accuracy rate of 88.6% in category classification and 90.3% in sentiment classification, surpassing both BERT and RoBERTa in sentiment analysis. Its processing time was 7.28 s, making it the fastest among the transformer models tested.

GPT-2’s results lagged behind the aforementioned models, with a category classification accuracy of only 55.2% and sentiment classification accuracy of 60.8%, signaling its limitations in this specific domain. Regarding speed, it registered an 8.07-second completion time, making it the slowest among the transformer models tested.

ChatGPT in the zero-shot setting reached a 58.2% accuracy for category classification and a significantly higher accuracy of 84.5% for sentiment classification, showcasing its strength in grasping the subtleties of human emotion in text. Nevertheless, its processing time was considerably longer, at 335.45 s, which may pose constraints for applications requiring prompt responses. When utilizing a few-shot learning approach, ChatGPT’s accuracy in category classification improved to 70.3%, and sentiment classification maintained a strong 78.0%. However, this strategy resulted in an even longer processing time of 557.66 s, further highlighting the trade-offs between the depth of language understanding and operational practicality.

GPT-4, in a zero-shot setting, achieved an impressive 81.0% accuracy in category classification and 89.4% in sentiment classification, outperforming ChatGPT. However, its processing time was even longer, at 545.57 s. With few-shot learning, GPT-4’s performance improved, reaching an accuracy of 84.0% in category classification and 87.9% in sentiment classification, while the processing time slightly decreased to 507.02 s.

In light of these findings, DeBERTa stood out for its high accuracy in sentiment classification and speed, while RoBERTa also demonstrated strong performance across both tasks. BERT showed robust performance with a good balance of accuracy and speed. ChatGPT and GPT-4 showed enhanced performance in sentiment analysis but at the cost of significantly longer processing times. These results are instrumental in guiding the choice of an appropriate model for specific use cases, considering the trade-offs between accuracy and processing efficiency.

5.4. Discussion of Results

We aim to understand and interpret the results of different models, emphasizing differences in their performance and each model’s limitations. GPT-2, although a revolutionary language model, is primarily designed for text generation and may not be as sophisticated in understanding text semantics for classification tasks compared to BERT, RoBERTa, and DeBERTa. These models are trained on predicting words (MLM) and predicting the next sentence (NSP), enabling them to better understand the context of words in a sentence. Moreover, BERT, RoBERTa, and DeBERTa use a bidirectional architecture, allowing the model to understand the context of words in relation to all other words in the sentence, whereas GPT-2 uses a unidirectional architecture. This fundamental architectural difference can explain why GPT-2 did not achieve results similar to those of BERT, RoBERTa, and DeBERTa in classification tasks.

ChatGPT and GPT-4, models trained on large datasets and various instructions, can understand human emotions and sentiments, which explains their impressive results in sentiment classification. These models may be trained on more diverse datasets encompassing a wide range of human emotions and tones. At the same time, BERT, RoBERTa, and DeBERTa may focus more on the semantic understanding of the text than emotional tone. However, the execution time of ChatGPT and GPT-4 is notably longer than that of other models, which can be attributed to their inherent operational complexity. As conversational models, they require the construction of prompts for each query they process, adding to the cognitive load the model must handle.

On the other hand, categorizing hotel reviews may require specific domain understanding and terminology often used in hotel reviews. While ChatGPT and GPT-4 can understand the general context of a review, they may need more domain knowledge for accurate categorization. Fine-tuning these models could lead to better results in the category classification task, but this is left for future research. Additionally, the difference in performance between zero-shot and few-shot approaches indicates that ChatGPT and GPT-4 may benefit from additional examples during training, but not to the extent one might expect.

DeBERTa, a variant of BERT, demonstrates the best performance in category and sentiment classification tasks, with results comparable to BERT and RoBERTa. This highlights the effectiveness of the enhancements to the BERT architecture in DeBERTa, such as disentangled attention and enhanced training techniques.

In contrast, BERT, RoBERTa, and DeBERTa, being models optimized for quick classification tasks, benefit from a more direct approach to processing input data. Their architectures allow them to infer meaning without the intermediate step of prompt generation, streamlining their operation and minimizing the time required to deliver results. This difference reflects the trade-offs between the depth of interaction offered by conversational models and the speed and directness of classification-optimized models.

While each model has strengths and weaknesses, it is essential to understand their fundamental architectures and training processes to grasp the differences in their performance better. This analysis provides deeper insight into how and why each model operates in a certain way, enabling the end-user to decide which model best suits their specific needs.

6. Conclusions

After thorough research and analysis of various models for processing and analyzing hotel reviews, it has become evident that each model offers unique advantages and challenges. The intricacies of each model’s architecture cater to different aspects of natural language understanding, which are crucial in the multifaceted domain of sentiment analysis. The findings of this study suggest that fine-tuned models, such as BERT, RoBERTa, and DeBERTa, are more effective in classification tasks than LLMs like ChatGPT and GPT-4 when using prompt engineering. These fine-tuned models offer high accuracy and execution speed, with their bidirectional architecture allowing for a better understanding of the context of words within a sentence, which is crucial for accurate sentiment analysis. This characteristic is advantageous in parsing the nuanced language typical of customer feedback.

On the other hand, while GPT-2 did not achieve similar results to BERT, RoBERTa and DeBERTa, its primary purpose as a generative model may not have been ideal for this type of task. The model’s architecture, which excels in language generation, underscores the evolving landscape of NLP models where specialization in either generative or discriminative tasks marks a significant distinction in model applicability. However, ChatGPT and GPT-4, despite their longer execution times, showed impressive accuracy in sentiment classification, indicating their potential to understand human emotions and tones. This aspect of these models underscores the increasing importance of empathic AI in customer experience analysis.

The results of this study support the hypothesis that fine-tuned models can achieve better results and faster execution compared to using prompt engineering on large language models for the task of multi-class classification in the domain of hospitality reviews. The complex architecture and computational requirements of large language models like ChatGPT and GPT-4 lead to longer execution times, making them less suitable for real-time applications or scenarios with strict latency requirements.

One of the key insights of this research study is the importance of adapting models to the specific application domain. While some models may be better at general classification tasks, others may be more suited to sentiment analysis or understanding specific aspects of hotel reviews. The adaptability of these models to specialized domains can be further leveraged by incorporating domain-specific fine-tuning, which remains an area ripe for exploration. The impact of these advanced NLP models extends beyond the hospitality industry, potentially influencing broader social development. These models can improve service quality across various sectors by enabling more accurate customer feedback analysis. As these technologies evolve, they have the potential to enhance cross-cultural communication and foster a more responsive and empathetic service industry, thus shaping a more connected and customer-centric social landscape.

In the future, there is a need for further research to optimize models for specific applications within the hospitality industry. The ongoing development of NLP models hints at a future where they could become even more context-aware and sensitive to the subtle cues within natural language. Fine-tuning models on specific datasets, integrating with other machine learning techniques, and developing hybrid models could provide even better results. Future research directions could include exploring the application of these models for document-level classification, evaluating their effectiveness on larger datasets, and developing models capable of efficiently processing reviews in multiple languages. This multi-faceted approach would provide a more comprehensive and global perspective on user sentiment. Moreover, the potential for models to understand and generate multilingual content could drastically change the landscape of global customer service. Ultimately, the goal is to develop tools that will provide tourism businesses with deeper insights into the needs and desires of their customers, enabling them to offer better services and experiences. This would signify a considerable stride towards personalized customer care powered by advanced AI tools at scale.

Author Contributions

Conceptualization, I.B.; methodology, I.B.; software, I.B.; validation, I.B., M.B.B. and M.M.; formal analysis, I.B., M.B.B. and M.M.; investigation, I.B.; resources, I.B.; data curation, I.B.; writing—original draft preparation, I.B., M.B.B. and M.M.; writing—review and editing, M.B.B. and M.M.; visualization, I.B.; supervision, M.B.B. and M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bompotas, A.I.; Ilias, A.; Kanavos, C.; Makris, G.; Rompolas, P.; Savvopoulos, A. A Sentiment-Based Hotel Review Summarization Using Machine Learning Techniques. In Proceedings of the 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, 5–7 June 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 155–164. [Google Scholar] [CrossRef]
Ishaq, A.; Umer, M.; Mushtaq, M.F.; Medaglia, C.; Siddiqui, H.U.R.; Mehmood, A.; Choi, G.S. Extensive hotel reviews classification using long short-term memory. J. Ambient Intell. Humaniz. Comput. 2021, 12, 9375–9385. [Google Scholar] [CrossRef]
Wen, Y.; Liang, Y.; Zhu, X. Sentiment analysis of hotel online reviews using the BERT model and ERNIE model—Data from China. PLoS ONE 2023, 18, e0275382. [Google Scholar] [CrossRef] [PubMed]
Rothman, D. Transformers for Natural Language Processing Build Innovative Deep Neural Network Architectures for NLP with Python, Pytorch, TensorFlow, BERT, RoBERTa, and More; Packt Publishing, Limited: Birmingham, UK, 2021. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Čumlievski, N.; Brkić Bakarić, M.; Matetić, M. A Smart Tourism Case Study: Classification of Accommodation Using Machine Learning Models Based on Accommodation Characteristics and Online Guest Reviews. Electronics 2022, 11, 913. [Google Scholar] [CrossRef]
Ray, A.; Garain, A.; Sarkar, R. An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews. Appl. Soft. Comput. 2021, 98, 106935. [Google Scholar] [CrossRef]
Rodríguez-Ibánez, M.; Casánez-Ventura, A.; Castejón-Mateos, F.; Cuenca-Jiménez, P.M. A review on sentiment analysis from social media platforms. Expert Syst. Appl. 2023, 223, 119862. [Google Scholar] [CrossRef]
Nguyen, D.Q.; Vu, T.; Nguyen, A.T. BERTweet: A pre-trained language model for English Tweets. arXiv 2020, arXiv:2005.10200. [Google Scholar]
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Kheiri, K.; Karimi, H. SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning. arXiv 2023, arXiv:2307.10234. [Google Scholar]
Wang, Z.; Xie, Q.; Ding, Z.; Feng, Y.; Xia, R. Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study. arXiv 2023, arXiv:2304.04339. [Google Scholar]
Kublik, S.; Saboo, S. Building Innovative NLP Products Using Large Language Models; O’Reilly Media: Sebastopol, CA, USA, 2023. [Google Scholar]
Zhou, C.; Li, Q.; Li, C.; Yu, J.; Liu, Y.; Wang, G.; Zhang, K.; Ji, C.; Yan, Q.; He, L.; et al. A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT. arXiv 2023, arXiv:2302.09419. [Google Scholar]
Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to Fine-Tune BERT for Text Classification? arXiv 2019, arXiv:1905.05583. [Google Scholar]
Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. Meta-Radiol. 2023, 1, 100017. [Google Scholar] [CrossRef]
Ravichandiran, S. Getting Started with Google BERT: Build and Train State-of-the-Art Natural Language Processing Models Using BERT; Packt Publishing, Limited: Birmingham, UK, 2021. [Google Scholar]
OpenAI. Improving Language Understanding by Generative Pre-Training. Available online: https://gluebenchmark.com/leaderboard (accessed on 13 May 2024).
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. 2018. Available online: https://github.com/codelucas/newspaper (accessed on 13 May 2024).
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar]
Møller, G.; Dalsgaard, J.A.; Pera, A.; Aiello, L.M. Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks. arXiv 2023, arXiv:2304.13861. [Google Scholar]
Rodriguez, J. Inside Open Assistant: The Open Source Platform for Light, High-Performance LLMs. Towards AI. Available online: https://pub.towardsai.net/inside-open-assistant-the-open-source-platform-for-light-high-performance-llms-fed9e1ebc7c6 (accessed on 13 May 2024).
Pujari, S.C.; Friedrich, A.; Strötgen, J. A Multi-Task Approach to Neural Multi-Label Hierarchical Patent Classification using Transformers. In Advances in Information Retrieval; Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; Volume 12656, pp. 513–528. [Google Scholar] [CrossRef]
Tran, T.; Ba, H.; Huynh, V.N. Measuring hotel review sentiment: An aspect-based sentiment analysis approach. In Lecture Notes in Computer Science (Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer Verlag: Berlin/Heidelberg, Germany, 2019; pp. 393–405. [Google Scholar] [CrossRef]
Godnov, U.; Redek, T. Good food, clean rooms and friendly staff: Implications of user-generated content for Slovenian skiing, sea and spa hotels’ management. Management 2018, 23, 29–57. [Google Scholar] [CrossRef]
Zhuang, Y.; Kim, J. A bert-based multi-criteria recommender system for hotel promotion management. Sustainability 2021, 13, 8039. [Google Scholar] [CrossRef]
Pontiki, M.; Galanis, D.; Pavlopoulos, J.; Papageorgiou, H.; Androutsopoulos, I.; Manandhar, S. SemEval-2014 Task 4: Aspect Based Sentiment Analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland; 2014; pp. 27–35. [Google Scholar]
Chatdesk. Grouphug. GitHub. Available online: https://github.com/chatdesk/grouphug (accessed on 13 May 2024).
Zhang, R.; Wang, Y.-S.; Yang, Y. Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-tuned GPT. arXiv 2023, arXiv:2304.11872. [Google Scholar]

Figure 1. Transformer architecture.

Figure 2. Distribution of sentiments across categories.

Figure 3. Fine-tuning approach architecture for multi-class classification of hotel reviews.

Figure 4. Prompt engineering approach architecture for multi-class classification of hotel reviews.

Figure 5. Accuracy of models.

Table 1. Selected hyperparameters.

Hyperparameter	BERT	RoBERTa	DeBERTa	GPT-2
Epochs	18	15	12	15
Learning rate	0.00004035	0.00005921	0.00004325	0.00004012
Train batch size	50	42	60	24

Table 2. Comparison of model performance.

Model	Accuracy		F1		Time
Model	Category	Sentiment	Category	Sentiment	Time
BERT	0.860	0.849	0.861	0.835	7.34 s
RoBERTa	0.883	0.881	0.884	0.880	7.32 s
DeBERTa	0.886	0.903	0.886	0.902	7.28 s
GPT-2	0.552	0.608	0.507	0.537	8.07 s
ChatGPT (zero-shot)	0.582	0.845	0.511	0.848	335.45 s
ChatGPT (few-shot)	0.703	0.780	0.678	0.787	557.66 s
GPT-4 (zero-shot)	0.810	0.894	0.818	0.892	545.57 s
GPT-4 (few-shot)	0.840	0.879	0.842	0.883	507.02 s

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Botunac, I.; Brkić Bakarić, M.; Matetić, M. Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis. Appl. Sci. 2024, 14, 6254. https://doi.org/10.3390/app14146254

AMA Style

Botunac I, Brkić Bakarić M, Matetić M. Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis. Applied Sciences. 2024; 14(14):6254. https://doi.org/10.3390/app14146254

Chicago/Turabian Style

Botunac, Ive, Marija Brkić Bakarić, and Maja Matetić. 2024. "Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis" Applied Sciences 14, no. 14: 6254. https://doi.org/10.3390/app14146254

APA Style

Botunac, I., Brkić Bakarić, M., & Matetić, M. (2024). Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis. Applied Sciences, 14(14), 6254. https://doi.org/10.3390/app14146254

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Comparing Fine-Tuning and Prompt Engineering for Multi-Class Classification in Hospitality Review Analysis

Abstract

1. Introduction

2. Related Work

3. Theoretical Background

3.1. Overview of Transformer Architecture

3.2. Transformers-Based Models

4. Research Methodology

4.1. Selection and Description of the Dataset

4.2. Data Preprocessing and Preparation

4.3. System Architecture Design

5. Experimental Procedure and Results

5.1. Parameters and Fine-Tuning Strategy

5.2. Prompt Engineering for LLMs

5.3. Comparison and Evaluation of Models

5.4. Discussion of Results

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI