2. Background
This section of the paper will explore different approaches to performing content analysis on customer feedback and prior approaches that predict and identify the sentiments of online review text data. Conventional content analysis is performed in a study to interpret the meaning of text data [
8]. In this analysis, coding categories are derived directly from the text data. Human experts read the data word by word to derive the codes by first capturing the exact words from the text that highlight the key thoughts or concepts. Then, these codes are categorized into providing the meaning of the text [
9]. Another successful analysis is performed on the raw data collected from transcribed interviews. This qualitative content analysis works on four principles (condensation, code, category, and theme). The text data are shortened while preserving the core meaning of the data. Most exact condensed text is labeled into codes. The codes that are related to each other are gathered to form a category. The theme is given for each category by doing further manual analysis to express the underlying meaning of the category [
10]. In a variety of inductive approaches to analyzing qualitative data, thematic content analysis is performed in this research by analyzing the data transcripts, figuring out the identical themes within the data, and combining them. Data were generated from an interview with a child in a public health study to understand children’s knowledge of healthy foods. Researchers extracted all the words and phrases from the interviews and then, by performing thematic content analysis on these data, generated a list of categories that children like about the food [
11].
However, qualitative data derived from surveys, interviews, and written open questions and pictures are expressed in words. Only statistical analysis generated meaning for the given data, which is not sufficient to predict customer opinions; other methods were required to analyze the data with great accuracy [
7]. In conventional market research, customer feedback and its attributes are obtained through extended interviews and surveys, leading to an increase in the research time and the cost of the research project [
12].
Manual data collection models are beneficial for indicating the significance of different product features concerning customer needs. However, these models need prominent skilled employees; only customer information is extracted from small categories. Hence, this type of data collection causes economic hardship and consumes much time in the product development cycle [
13]. Due to the shifting of shopping habits, such as purchasing from in-store to online, there is a requirement for advanced methods to collect online data. When a customer purchases a product from an online store, they will provide feedback on that product in the form of a customer review. Online reviews became more prominent as people consider these comments before purchasing. Gathering the text review information manually will take considerable time, and sometimes manual data extraction will mislead the correct information. As online shopping is gaining popularity with each passing day and driving a more significant number of customers to purchase through online shopping, many researchers have attempted to leverage the customer requirement data available in the form of online product reviews. If one can extract useful information on customer needs from these online reviews, manual methods such as resource-intensive customer surveys can be avoided in the product design cycle [
14].
In recent years, deep learning concepts such as convolutional neural networks (CNN) and NLP have made breakthroughs in many fields, such as object recognition, data mining, and image processing. The researchers have used these data mining and natural language processing concepts to develop algorithms to collect online data. Enhancements in the applications of deep learning and the contribution of engineers to the field of AI have made the product designers’ jobs more simple by enabling them to evaluate the sentiments of customers on a product, observe the various product trends in the market, examine the design of new successful products, and to recognize where the contemporary design opportunities evolve [
13].
By using core deep learning concepts such as sentiment analysis to predict the sentiment of the data, it is straightforward to extract customer opinions in a fraction of a second. In the literature, there are several studies that used NLP methods to generate data-driven decisions. For example, Kim et al. [
15] worked on research that conducted an experiment on how to deploy life cycle assessment combined with data mining and the interpretation of online reviews to enhance the design of more sustainable products. In this approach, customer reviews are extracted using the NLP pipeline to divide these reviews into two clusters. Then, ABSA is performed on those clusters that potentially contain relevant sustainability-related data by identifying the product features connected with sustainability-related comments. These classified customer opinions will be helpful to obtain meaningful sustainable design insights into eco-friendly products in complement to the Life Cycle Assessment (LCA) results.
A research study by Saidani et al. [
16] aimed to improve industrial ecology and the circular economy by enhancing eco-design practices. Text mining analysis is conducted on the definitions of circular economy, industrial economy, and eco-design. Textalyser and Wordle are the online tools used to perform text mining to identify further and discuss common themes or disparities. In another study [
12], online smartphone reviews and manual product documents distributed by manufacturers were collected as data. These data are preprocessed and lemmatized using an NLP toolkit that provides the semantic characteristics of each word. Lemmatized form and parts of speech (POS) tagging are implemented in this study. Using word2vec, all the words from online review data and product manuals are embedded into vectors. These embedded phrases were clustered, and each cluster was labeled with the most frequent word. Product designers select important product features after this cluster labeling. This sub-feature analysis will improve the design implication to the embodiment design. However, this sub-feature analysis focused on smartphone data; it can be utilized to improve the design of eco-friendly products. A customer network is developed based on customer attributes. Customer attributes were extracted from the customer reviews, and product features were extracted from the online data of 58 mobile phones. In this method, customer attributes were divided into positive and negative features. For example, a consumer satisfied with the product’s feature (F) and another customer complaining about it were classified. The similarity between the customer opinions is measured by two indices: 1. cosine similarity and 2. topic similarity. This method develops a customer network if the customer meets the criterion threshold. Each cluster represents a segment, and by segmentation and analysis, the segment’s characterization features and sentiments are predicted. This approach suggests that product designers focus on customer-oriented products [
16]. Mokadam et al. [
14], a study on eco-friendly product reviews, applied ABSA to determine customer needs. Twelve aspects were generated using key terms and keywords in the raw text data, each with one sentiment category. They trained an aspect detection model that learns sentence similarities between manually generated aspects and the review text to predict the aspect’s presence in the review. The intensity of the sentiment expressed in a sentence is generated by VADER, a rule-based model for sentiment analysis, which outputs results in terms of positive and negative comments.
Previously, methodologies in natural language processing leaned heavily on word embeddings, such as word2vec and GloVe, for training models on vast datasets. Yet, these embeddings have limitations tied to their reliance on various language models and occasional struggles in extracting nuanced word contexts. To overcome these drawbacks, the NLP research landscape witnessed the emergence of expansive language models such as BERT, GPT, LLAMA, BART, and T5.
These large-scale language models have redefined the approach to tackling specific tasks within NLP, ranging from named entity recognition to question answering and sentiment analysis to text classification. Their advent represents a shift toward more comprehensive and versatile models capable of grasping intricate linguistic nuances, thereby advancing the efficacy of natural language understanding and analysis.
The Generative Pre-Trained Transformer (GPT) signifies a series of neural language models developed by OpenAI, functioning as generative models adept at sequential text generation, akin to human-like progression, word by word after receiving a initial text prompt. Its autoregressive text generation approach uses the transformer architecture, employing decoder blocks.
The GPT series, comprising models pre-trained on a combination of five datasets encompassing Common Crawl, WebText2, Books1, Books2, and Wikipedia, is distinguished by an impressive parameter count of 175 billion. Like BERT, GPT models exhibit the capability for fine-tuning specific datasets, enabling adaptation of their learned representations. This adaptability empowers GPT to excel in diverse downstream tasks, including text completion, summarization, question answering, and dialogue generation. This flexibility positions GPT as a versatile tool in natural language processing applications [
17].
LLAMA (Language Model for Language Model Meta AI) is a comprehensive framework developed explicitly for research purposes, aiming to evaluate and compare existing language models across diverse natural language processing (NLP) tasks. Its creation assisted AI and NLP researchers in conducting standardized assessments of various large language models, comprehensively analyzing their capabilities, strengths, and weaknesses.
Trained across a spectrum of parameters spanning from 7 billion to 65 billion, LLAMA incorporates an array of evaluation metrics and tasks. These metrics encompass facets such as generalization, robustness, reasoning abilities, language understanding, and generation capabilities of language models. By encompassing such a wide range of assessments, LLAMA offers a holistic approach to evaluating and comparing the performance and functionalities of language models within the NLP domain [
18].
Bayesian Adaptive Representations for Dialogue (BARD) is a sizable language model crafted by Google, primarily targeting computational efficiency within transformer-based NLP models. Designed to minimize computational demands, BARD underwent training on an estimated data volume exceeding a terabyte. Through tokenization, it dissects prompts and adeptly grasps semantic meanings and inter-word relationships via embeddings.
BARD’s notable attributes are its cost effectiveness and adaptability, representing a model well suited for various tasks. Moreover, its unique capabilities, such as accessing the internet and leveraging current online information, position BARD ahead compared to ChatGPT chatbot. This amalgamation of computational efficiency and access to real-time information endows BARD with distinct advantages in dialogue-based language models.
The Text-to-Text Transfer Transformer (T5), an encoder–decoder model introduced by Google, explicitly targets text-to-text tasks. Trained on a blend of supervised and unsupervised datasets, T5 operates via the encoder–decoder transformer architecture, akin to other transformer-based models. T5 has two sides, just like a coin. The encoder analyzes your text, understanding its meaning and relationships between words. The decoder then crafts a response, weaving words together to form something new—such as a summary, a translation, or even dataset labeling. Its spectrum of tasks encompasses text summarization, text classification, and language translation, processing textual inputs to generate textual outputs.
Diversifying its capabilities, T5 spans various iterations, including the small, base, and large models, and T5-3b and T5-11b, each trained with parameter counts ranging from 3 billion to 11 billion. This expansive pre-training contributes to T5’s adaptability and ease of fine tuning for domain-specific NLP tasks, marking it as a versatile and customizable model for various text-based applications [
18].
In natural language processing (NLP), Google’s groundbreaking Bidirectional Encoder Representations from Transformers (BERT) model has emerged as a pivotal framework. BERT’s advancement lies in its pre-training phase, where it ingests an extensive corpus comprising 2500 million unlabeled words from Wikipedia and an additional 800 million from the Book Corpus. This pre-training methodology involves unsupervised learning tasks that entail predicting masked words within sentences and discerning logical coherence between pairs of sentences. The essence of BERT’s innovation is its bidirectional context understanding; it comprehends context from both the left and right sides of words within sentences. Leveraging its pre-training on colossal text datasets, BERT exhibits adaptability by fine-tuning domain-specific datasets for distinct tasks. This fine-tuning process involves initializing the BERT model with pre-learned parameters tailored explicitly to labeled downstream tasks, such as sentiment analysis.
BERT’s significance in the NLP landscape is underscored by its robust pre-training on extensive textual resources, enabling it to grasp nuanced linguistic contexts. Moreover, its adaptability via fine tuning on domain-specific datasets is a testament to its versatility across various NLP applications, positioning BERT as a cornerstone framework in language understanding and analysis.
Several recent studies have showcased the efficacy of BERT in diverse applications within natural language processing (NLP). Zhang et al. [
19] developed the BERT-IAN model, specifically enhancing aspect-based sentiment analysis on datasets related to restaurants and laptops. Their model, utilizing a BERT-pretrained framework, separately encodes aspects and contexts. Through a transformer encoder, it adeptly learns the attention mechanisms of these aspects and contexts, achieving sentiment polarity accuracies of 0.80 for laptops and 0.74 for restaurants. Tiwari et al. [
20] proposed a multiclass classifier leveraging the BERT model on SemEval 2016 benchmark datasets. Their approach effectively predicted sentiment and aspects within reviews, employing multiple classifiers combining sentiment and aspect classifiers using BERT. In a study by Shi et al. [
21], the utilization of BERT in market segmentation analysis based on customer expectations of product features was explored. By deploying a BERT classification model on apparel product data, the analysis focused on understanding user attention toward specific product features such as pattern, price, cases, brand, fabric, and size. This analysis provided valuable insights for companies adapting to the apparel market. Tong et al. [
22], applied a BERT-based approach to comprehending contextual elements (Task, User, Environment) from online reviews. Context of Use (COU) is pivotal in successful User Experience (UX) analysis. Their study extracted COU elements using the BERT model on a scientific calculator dataset, elucidating the tasks associated with the product and the diverse user types and environments in which these calculators are utilized.
The literature found in [
12,
14,
16] revealed that the ABSA and NLP computational mechanisms can still be exploited for text mining and automation of content analysis. Collectively [
20,
21,
22], these studies underscore the versatility and effectiveness of BERT in various NLP domains, from aspect-based sentiment analysis to understanding customer reviews. Based on the advantages of BERT, we propose an aspect-based sentiment analysis approach to automate the extraction of customer expectations from online reviews to provide valuable data-driven insight to product designers.
The objective of this paper is to construct an NLP pipeline that automates the extraction of customer expectations from Amazon online reviews across various product categories. This will involve implementing aspect-based sentiment analysis to predict and categorize the aspects discussed within the reviews, while simultaneously classifying their sentiments.
4. Results and Discussion
The BERT-based classification model is pre-trained comparatively on a low amount of data. Hence, to train the model on three datasets, the training and validation experiments run on the Google collab using the T4 GPU. Once the tokenizer and sequence classification models are trained, these models catch 386 MB, and both are saved in the drive to recall and deploy them in the NLP prediction pipeline to predict the aspects of the input review data. T5-small and T5-base both are deployed and fine tuned on the Google collab using T4 GPU.
Accuracy: the proportion of correctly classified reviews among the total reviews.
Precision: the accuracy of positive predictions, considering both true positives and false positives.
Recall: the proportion of correctly predicted positive reviews out of all actual positives.
F1 Score: the means of precision and recall, providing a balanced measure.
To evaluate the performance of a large language model, we required calculate the accuracy, precision, recall, and F-score. The evaluation metrics for the BERT model on COMB dataset are resulted in
Table 4 and the evaluation metric results for T5-Base and T5-small can be seen in
Table 5 and
Table 6. In the below tables, the accuracy for the three models is compared and can be seen as 92% for BERT, 91% for T5-base, and 88%- T5-small. Three models show almost equal results in terms of accuracy.
In
Figure 3, evaluation metrics of the three aspect detection models are compared. BERT model comparatively efficient and performs better than T5-Base and small.
In this paper in the prediction pipeline, we utilized the fine-tuned BERT model as the final NLP model to detect the aspect for sentiment analysis. According to the results, BERT outperformed the T5 model. But this is not the only reason to choose BERT over T5. In the training stage, T5 models, excluding the 12 aspect labels they generated as new labels in
Table 7, we can see duplicate aspects such as attraction, color, charge. To fix this, we used cosine similarity to map the duplicate labels to original labels. Sentence transformer, a library, is used to map these labels to original labels. This affects the computational efficiency of the T5 model, and the BERT model obtained an advantage over these models.
The aspects are predicted, and the distribution of the aspects in the training dataset can be seen in
Figure 4.
The distribution of the customer ratings in the input data can be seen in
Figure 5. Most portions are occupied with five- and 4-star ratings, and the other rating distribution is varied.
VADER is performed on the preprocessed data, and the prediction results are obtained, as shown in
Table 3. Once the sentences are classified into positive and negative sentiments, the frequency of the classified reviews can be seen in
Figure 6. High ratings lead to the creation of positive feedback on a product. In this study, we found that, in some cases, even if the rating is high, due to the negative context of the review expressed by the consumer, the product would obtain a negative sentiment score.
Figure 7 and
Figure 8 shows the common words extracted from each review and classified as positive or negative. Using the word cloud, these classified results are visualized to provide design insights and ease of understanding of the keywords used to determine the positive and negative feedback from the customers.