A Cloud-based Machine Learning Pipeline for the Efficient Extraction of Insights from Customer Reviews

— The efficiency of natural language processing has improved dramatically with the advent of machine learning models, particularly neural network-based solutions. However, some tasks are still challenging, especially when considering specific domains. In this paper, we present a cloud-based system that can extract insights from customer reviews using machine learning methods integrated into a pipeline. For topic modeling, our composite model uses transformer-based neural networks designed for natural language processing, vector embedding-based keyword extraction, and clustering. The elements of our model have been integrated and further developed to better meet the requirements of efficient information extraction, topic modeling of the extracted information, and user needs. Furthermore, our system achieves better results on this task than existing topic modeling and keyword extraction solutions. Our approach is validated and compared with other state-of-the-art methods using publicly available datasets for benchmarking.


I. INTRODUCTION
Users of social platforms, forums, and online stores generate a significant amount of textual data. One of the most useful applications of machine learning-based text processing is to find words and phrases that describe the content of these texts. In e-commerce, the knowledge contained in data such as customer reviews can be of great value and provide a tangible and measurable financial return. However, it is impossible to efficiently extract information from such large amounts of data using human labor alone.
The difficulty in solving this problem effectively with automated methods is that human-generated texts often contain a lot of noise in addition to substantive details. Filtering the relevant information is further complicated by the fact that different texts can have different characteristics. For example, the document to be analyzed may contain words too common to be distinctive or information irrelevant to the analysis objective. In fact, different parts of the text may be considered noise, depending on how we view the data and what we think is relevant. This in turn makes the task difficult: it is not enough to find some specific information in texts; we also have to decide, based on the texts, what information we need.
Our aim is to extract information from textual data in the field of e-commerce. Our application is an end-to-end system that runs on a cloud-based infrastructure and can be used as a service by small and medium-sized businesses. Our system uses machine learning tools developed for natural language processing and can identify those sets of words and phrases in customer reviews that characterize their opinion. We have built a system based on machine learning solutions that effectively handles such text-processing tasks and, in some aspects, outperforms currently available approaches.
To build an application that can be used in an e-commerce environment, we needed a model that could identify topics in texts and provide a way to determine which topics are relevant, given our analysis goals. Therefore, before developing our system, we investigated the N-gram model [1], dependency parsing [2], embedded vector space-based keyword extraction solutions, and various distance- or density-based and hierarchical clustering [3], [4] techniques. In addition, we tested the LDA [5], [6], Top2Vec [7], and BERTopic [8] complex topic modeling methods. We focused on these tools because we found them to be the most appropriate for our goals based on the available literature. It was also important to us that these methods and complex models have a stable implementation, are verifiably executable in our cloud-based environment, and can be adequately integrated.
Extracting information from customer feedback and reviews will achieve the desired result if we can identify the words and phrases that describe the customers' opinions and thus help to further improve a product or service, both from a sales and a technical point of view. In other words, it can help increase revenue or avoid potential loss. However, it is not enough to know the frequency of certain words and phrases; it is also necessary to provide the possibility of grouping the phrases according to different criteria. For example, different sentiment features can affect the information value of a frequent phrase. Furthermore, highly negative or positive reviews should be excluded from the analysis due to their bias.
To produce good-quality results, it is important to identify the text passages that we consider noise. Certain parts of texts are generally considered noise. These include stop words [9], special characters, and punctuation. Removing these words is often one of the first steps in text preprocessing. However, noise is often more difficult to identify. In the case of stop words, for example, a complicated problem is the removal of elements that express negation. While this may seem trivial, it may result in a loss of information, depending on the model's behavior. In addition, removing certain punctuation marks can distort the semantics and lead to information loss. Therefore, it is not possible to address these high-level issues with general word-level methods. As a result, we needed an adaptive approach to deal with these issues. We run into further difficulties if we take a higher perspective and look at this problem at the sentence level. Namely, customer reviews can consist of several separate parts describing different problems. In more complex situations, these topics are easier to capture at the sentence or sub-sentence level. Therefore, solving the aforementioned problems requires a specialized approach.
Suppose we identify separate text parts, phrases, sentences, or sub-sentences with different meanings. In this case, they can be grouped into useful and useless texts according to their meaning. This is another level of abstraction with its own difficulties in information extraction. One of the critical problems with organizing text into topics is that the number of topics is unknown in advance, and this number changes as the amount of text increases. Finding these semantically distinct sets of text belongs to the topic modeling and text clustering subfields of natural language processing. This is an unsupervised machine learning problem, which is particularly difficult because it requires finding the optimal clusters. Optimal results are obtained when we can create sufficiently dissimilar clusters in the useful customer reviews about a given product. The corpus should be clustered according to the product's substantive information, not word frequency or other properties common to textual data.
In our experience, among the topic modelers, distance- or density-based and hierarchical clustering methods, and keyword extraction solutions we have investigated, LDA, Top2Vec, and BERTopic came closest to meeting our requirements. However, none of them offered a comprehensive solution for all problems. The clustering and keyword extraction solutions are not comprehensive enough on their own. The topic modeling tools did not provide us with adequate results without significant modifications to their implementation to adjust their functionality. Therefore, we decided to build our own model. Of course, our solution draws on the experience gained with the models and tools listed above. However, we have taken the building blocks of these approaches and reworked them to better address the problems described.
In the end-to-end system we designed, we used a semantic embedding approach for keyword extraction and applied recursive hierarchical clustering to find relevant topics. This enabled our system to perform parameterizable, content-dependent clustering using cosine distance as the semantic similarity measure. With this architecture, our system can adapt to the specific structure of the text.
As a result, we created a model integrated into a pipeline to group the extracted sets based on their semantic meaning. Furthermore, we could influence the density of the resulting sets and remove outliers. Our model addresses all these issues while ensuring that the extracted words and phrases retain as much information content as possible. To validate this claim, the words and phrases extracted by our model were compared with those extracted by the topic modeling methods LDA, Top2Vec, and BERTopic. We tested their loss of information during the comparison process using a text classification task. As a result, we were able to build a solution better suited to our needs and, based on our measurements, more sophisticated and usable in terms of the resulting topic/keyword groups. Moreover, after removing the irrelevant text passages according to the topic modeling, the extracted text yielded better classification results using a regression model than the text extracted by the topic modeling methods in the literature.
The rest of the paper is organized as follows. In section II, we overview the recent related work. Our methodology, including the dataset used during development and our topic modeling pipeline details, is presented in section III. Then, in section IV, we provide the results of our experiments regarding the performance of the keyword and phrase extraction methods as well as the performance of our model compared to the state-of-the-art. Finally, some conclusions are drawn in section V.

II. RELATED WORK

A. Keyphrase extraction
There are several approaches to extracting keywords from natural language texts. On the one hand, this problem can be approached by deciding which words and phrases are relevant to us depending on the frequency of words. Approaches based on word frequency [10] can be an effective solution for a comprehensive analysis of large texts. However, this approach is less efficient for shorter texts, unless we have some general information about the frequency of words in that language. Furthermore, such approaches can be sensitive to the quality of the text cleaning, depending on the nature of the text.
Dependency parsing [2] is another approach that can be used to extract information from text. This technique attempts to identify specific elements of a text based on the grammatical rules of the language. It can be particularly useful if we have prior knowledge about the type of words that carry the information we are looking for. In our experience, dependency parsing-based solutions tend to work better for smaller texts and sentence fragments compared to frequency-based approaches. When dealing with large amounts of text, it is often helpful to break it down into smaller parts, e.g., into sentences. This approach can improve the accuracy of information extraction. One potential drawback of using dependency parsing is that it can be sensitive to the preprocessing of the text.
Semantic approaches based on text embedding [11]–[13] can also be used to identify keywords. Such an approach involves identifying the relationship between words and parts of the text. This can be done by vectorizing the elements of the text and their larger units, such as sentences, using embedding techniques, and measuring the similarity between them using some metric. The advantage of methods based on this approach is that they are less sensitive to the lack of text cleaning. Their disadvantage is that the quality of the vector space required for similarity measurement largely determines the model's functionality. Furthermore, unlike the previous two approaches, it currently imposes a higher computational burden and works better on smaller texts than on larger ones. However, if the text splitting rules, similarity metrics, and vector space are well chosen, the semantic approach can obtain better results than approaches based on frequency or dependency parsing.
In the case of frequency-based techniques, dependency parsing, or semantic embedding, it can generally be said that, although they offer the possibility of finding the essential elements of a text, none of them provides a clear answer to the question of the relationship between the words found and the topics of the text. If we need to find the main terms of the text but also group them according to their content to answer higher-level questions, we need to use clustering or topic modeling.

B. Clustering
The effectiveness of text clustering is largely determined by how well the text can be transformed into an embedded vector space that represents the documents with respect to the target task.
There are several ways to vectorize a text. There are frequency-based techniques, such as one-hot encoding or count vectorization [14]. However, there are also solutions using more complex statistical methods, such as term frequency [15] and inverse document frequency-based [16] (TF-IDF) techniques or the transformation mechanism of LDA [17]. In addition, we can use semantic embeddings, of which GloVe [11] was an early pioneer, generating vectors for each word on a statistical basis. However, statistical models were soon replaced by neural network-based solutions with the rise of word2vec [12] and fastText [13]. With the advent of transformers [18], neural networks with special architectures have emerged that can create semantically superior embedded vector spaces.
Several clustering techniques are available to group the entities in the embedded vector space. In our work, we investigated k-means [19], agglomerative [4], DBSCAN [20], and HDBSCAN [21] clustering. Our main difficulty was that embedded textual entities do not usually form dense regions in the vector space, even for terms with the same meaning. For this reason, centroid, hierarchical, and density-based methods alike, in our experience, have difficulty handling the vector spaces typical of text and word embeddings.
Another problem of working with text-based embedded vector spaces is that, in order to achieve good semantic quality, textual data is currently converted into high-dimensional vectors, which are resource-intensive to process. Although the resource demand can be reduced by various dimension reduction techniques such as PCA [22]–[24], UMAP [25], [26], or T-SNE [27], this results in a loss of information, which in turn can lead to distortions due to the particular structure of the embedded vector space generated from the text.

C. LDA
Latent Dirichlet Allocation (LDA) is a popular model for extracting text topics. It has the advantage of having efficient and reliable implementations, making it easy to integrate into machine learning systems. To adapt it to a specific context, Batch [28] or Online Variational Bayes [29] methods have become popular. To measure the quality of the topics generated by LDA and thus find the right number of topics, the perplexity [17] values of each topic are computed. For dictionary construction, the bag-of-words [30] technique is a common choice, with lemmatization and stop word removal.
Considering our system, a drawback of this model is that it is a probabilistic statistical method that ignores word order and semantics and is less sensitive to finding smaller topics. This means that it finds fewer topics and cannot isolate finer details. In practice, when using LDA, we tend to focus on, e.g., the top 10 words, which narrows down the list of key terms, especially when there are few topics. Although these properties of the model are parameterizable, our experience shows that the more words we select from a given topic, the more the quality of the words or phrases associated with that topic deteriorates. This leads to greater overlap between the words of the topics, which requires further corrections.

D. Top2Vec
Unlike LDA, Top2Vec uses semantic embedding and does not require stop word removal or lemmatization. Such preprocessing steps can even be detrimental to the model's performance, as they can distort the semantic meaning of the text in the documents.
The quality of semantic embedding-based methods is determined by the embedded vector space they use. This makes Top2Vec sensitive to the vector space of the model it uses on the target corpus. It has the advantage that the model offers a compact topic-number determination method and is, therefore, able to automatically search for topics that it considers related. In our tests, its main disadvantage was that, like LDA, it proved to be less capable of finding smaller topics.
Although, in our experience, there have been cases where Top2Vec has performed better than LDA, its performance is not consistently better than LDA's. In addition, the implementation available for it was not always stable in our development environment.

E. BERTopic
BERTopic is also a topic modeling technique that uses semantic embedding. It forms dense clusters using c-TF-IDF [8] for easier clustering. The model is based on the BERT transformer-based neural network [31] and uses the internal representation of BERT to generate the embedded vector space. Like Top2Vec, it does not require stop word removal or lemmatization. As with other techniques based on semantic embedding, the model's effectiveness is highly dependent on the quality of the embedded vector space it uses. However, the use of BERT can certainly be considered an effective tool to achieve this goal.
Based on our research, the main drawback is the limitations encoded in the model's implementation by its developers. This was an unnecessary inconvenience despite the open-source implementation. Unfortunately, the number of words that can be extracted from the topics found was limited by the developers to a total of 30 words, which made the ability to extract targeted information impaired and inflexible for us. In fact, the low number of retrievable words causes a phenomenon similar to that of LDA, where we only focus on the top 10 words per topic.

III. METHODOLOGY
We now present our methodology, including the dataset used during development and the details of our topic modeling pipeline.

A. Dataset
We used data from the Amazon Review (2018) [32] dataset to evaluate our system. The Amazon Review dataset contains 233.1 million Amazon product reviews (review text, ratings, and auxiliary votes), along with corresponding product metadata and links. For our purposes, we selected the Electronics subset of this dataset, which contains 7,824,482 reviews, and created a dataset (hereafter Electronics_50K) as follows. Firstly, we tokenized the review text using the vocabulary and tokenization method of the BERT neural network architecture. The vocabulary used for BERT tokenization is built using the WordPiece [33] subword tokenization algorithm. After tokenization, we kept only those reviews with lengths between 16 and 128 tokens, since BERT was used for sentence and word embedding. The maximum input size of BERT is 512 tokens, but we fixed the input size to 128 to fit our own development environment for efficiency reasons. We then performed uniform sampling to obtain 10,000 reviews for each rating (1 to 5, without fractions). This resulted in a dataset of 50,000 reviews covering 8281, 8267, 8388, 8219, and 8063 different products for ratings from 1 to 5. For a quick impression of what the dataset looks like, see Figures 1 and 2 for its 2D visualizations obtained by PCA and T-SNE, respectively.
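As a rough sketch of how such a balanced subset can be constructed, the length filtering and uniform per-rating sampling might look as follows; the whitespace tokenizer is only a stand-in for the WordPiece tokenization actually used, and all names are illustrative:

```python
import random
from collections import defaultdict

def build_balanced_subset(reviews, min_tokens=16, max_tokens=128,
                          per_rating=10_000, seed=0):
    """Filter reviews by token length, then uniformly sample per rating.

    `reviews` is an iterable of (text, rating) pairs; len(text.split())
    stands in for the WordPiece tokenizer used in the real pipeline.
    """
    by_rating = defaultdict(list)
    for text, rating in reviews:
        n_tokens = len(text.split())  # stand-in for BERT tokenization
        if min_tokens <= n_tokens <= max_tokens:
            by_rating[rating].append(text)
    rng = random.Random(seed)
    # take at most per_rating reviews from each rating bucket
    return {rating: rng.sample(texts, min(per_rating, len(texts)))
            for rating, texts in by_rating.items()}
```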

B. Keyphrase extraction

1) N-gram-based keyphrase extraction:
An established method for extracting the semantic meaning of texts based on the N-gram approach is to examine keywords and the text fragments containing them. In the case of N-grams, the "N" denotes the length of the text fragments (e.g., 2-grams, 3-grams); see also Figure 3. This method can generate the narrow sentence context of our keywords, potentially allowing us to examine even reviews of just a few words. The method can be further improved by stop word removal, where the most commonly used words in a given language are omitted from our analysis, and we focus on words with stronger meanings.
The following steps summarise the extraction of keywords using classical language processing tools. Firstly, the text is pre-processed and cleaned, and stop words are removed using an appropriate dictionary. This is followed by extracting noun phrases, from which a dictionary can be defined. The last step is the extraction of keywords and keyphrases, taking into account the noun phrases. This is the contextual extraction of 1, 2, or 3 words, depending on the position of the noun phrases in the sentence.
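A minimal sketch of the N-gram generation step with stop word removal; the stop word list below is an illustrative fragment, not the dictionary used in our pipeline:

```python
import re

# Tiny illustrative stop word list; a real pipeline would use a full dictionary.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "of", "to"}

def extract_ngrams(sentence, max_n=3):
    """Return all 1- to max_n-grams over the content words of a sentence."""
    tokens = [t for t in re.findall(r"[a-z']+", sentence.lower())
              if t not in STOP_WORDS]
    grams = []
    for n in range(1, max_n + 1):          # 1-grams, 2-grams, ..., max_n-grams
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams
```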
2) Dependency parsing based keyphrase extraction: Once we had broken down the reviews into sentences and identified their respective keywords, we began looking for the contexts of these keywords. We achieved this by applying dependency parsing, an example of which is shown in Figure 4. During this process, we parsed the whole sentence to obtain its grammatical structure and identify "head" words and words that are grammatically connected to them. After this, we identified the keyword kw and its "head" word h, then looked for words that were connected to h while keeping the original order of the words in the sentence. We looked for words connected to h instead of kw because most of the time kw was either an adjective, adverb, or some other word that expressed emotions, sentiments, or qualities and had no particular meaning by itself. So instead of looking for words connected to this term, we focused on the word that is grammatically connected to kw (e.g., a noun). During the search procedure, we excluded common words, such as prepositional modifiers and words representing coordinating conjunctions. This way, the resulting phrase contained the most important parts (nouns, adjectives, etc.) while still being relatively short.
However, this procedure sometimes resulted in very long phrases that were not much shorter than the original sentence. To decrease their lengths, we integrated a thresholding mechanism into the search procedure to decide which keywords to keep and which to discard. This thresholding was based solely on the sentiment of the keyword: positive keywords were kept if their sentiment score [34] was above 0.89, while negative and neutral ones were kept if their score was above 0.79. The exact thresholding levels were calculated on our training set and were optimized based on the average length of the resulting contexts and their interpretability. Optimization was a step-by-step method with expert reinforcement: starting from a still-meaningful lower value of 0.49, with a step size of 0.1, we produced all possible outputs on a randomly sampled 100-item validation set, based on which the optimal parameters were defined, relying on evaluation by human experts. This step-by-step optimization was performed for both positive and negative keywords.
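The sentiment-based thresholding described above reduces to a simple rule; the function below sketches it with the thresholds from the text (0.89 for positive keywords, 0.79 for negative and neutral ones), with illustrative names:

```python
def keep_keyword(sentiment_label, sentiment_score,
                 pos_threshold=0.89, neg_threshold=0.79):
    """Keep a keyword only if its sentiment score clears the
    label-specific threshold found by the expert-driven optimization."""
    if sentiment_label == "positive":
        return sentiment_score > pos_threshold
    # negative and neutral keywords share the lower threshold
    return sentiment_score > neg_threshold
```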
3) Cosine similarity based keyphrase extraction: It is a complex problem to extract the words and phrases from a text that contain the most relevant information with respect to a given analysis goal. This problem has several approaches; each may perform better for different text types. Next, we describe the methods tested on the review dataset detailed above.
One commonly used metric for measuring the similarity of two words or even phrases is the cosine similarity measure. Mathematically, it measures the cosine of the angle between two vectors x = (x_1, ..., x_n), y = (y_1, ..., y_n) ∈ R^n:

cos(x, y) = (x · y) / (‖x‖ ‖y‖) = Σ_{i=1}^{n} x_i y_i / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²)).

The smaller the angle, the higher the cosine similarity. The similarity value is between -1 and 1; full similarity (identity) is described by 1, full dissimilarity by -1, and neutral behavior (orthogonality) by 0. The main advantage of this metric is that it measures the similarity of documents irrespective of their size. In contrast, the method of counting the maximum number of common words between documents can give a false result. For example, if a document grows in size, the number of common words tends to increase, even if the topic is not the same in the two documents.
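For reference, cosine similarity can be computed directly from this definition; a minimal implementation:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||), in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```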
KeyBERT [35] is a simple-to-use method for keyword extraction based on BERT embeddings. An example of a KeyBERT output is shown in Figure 5.

C. Pre-trained models

1) BERT+R architecture: Dataset labels are formed by user ratings. This means that users rated its elements on a scale of 1 to 5. However, we wanted to build an unsupervised machine learning system that could separate multiple details of the dataset, even if it had a simpler (e.g., binary) scale.
We have designed our system so that, in theory, any other method can be incorporated into the architecture to improve the detection capability. In practice, however, we have built on our own needs and created a sentiment analysis system at this level. We believe a more detailed sentiment-based rating scale would allow for a more efficient sentiment-based evaluation of texts.
We have therefore created a model that can predict the sentiment of a sentence containing a statement on a scale of 1 to 5 with high accuracy. For this, we used a pre-trained BERT network and modified its architecture by adding a regression layer to the output instead of a classification layer. We then fine-tuned the resulting BERT+R model on our dataset. The resulting model, depicted in Figure 6, allows us to predict sentiment values with finer granularity. Furthermore, our system has become more sensitive to the differences between negation and assertion sentences.

2) Sentence BERT: Sentence BERT [36] is a special transformer-based architecture designed for sentence embedding. The network is designed to assign vectors to each sentence such that their similarity is measured by cosine similarity. Basically, Sentence-BERT (SBERT) is a modification of the pre-trained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings.

D. The modules of the pipeline
For later integration into a commercial application, the information extraction process is implemented as a single pipeline. This pipeline was designed to ensure that the model applied can evolve with the growth and changes of the dataset. In addition, it needs to be stable and easy to maintain so that it can be made available as a service to other applications.

1) Text cleaning:
The purpose of text cleaning is to remove characters and text elements from the raw data that do not contain relevant information and may distort the results obtained by the model. This is the first level of our pipeline, where these text transformations are performed to ensure that the text meets the tokenization requirements of the model.
As described in more detail in section III-A, our tokenization requirement is that the text work properly with the specialized BERT+R neural network. Text cleaning removes punctuation and unnecessary spaces that we think might cause noise. In addition, the entire text is lower-cased, and language-specific abbreviations are removed to meet the tokenization requirements of the BERT+R model used.
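A simplified sketch of this cleaning step; the abbreviation table is an illustrative fragment, and the exact character set to keep depends on the tokenizer's requirements:

```python
import re

# Illustrative subset of language-specific abbreviations to expand.
ABBREVIATIONS = {"won't": "will not", "can't": "cannot"}

def clean_text(text):
    """Lower-case, expand abbreviations, and strip noise characters/spaces."""
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = re.sub(r"[^a-z0-9.,!?' ]+", " ", text)  # drop special characters
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text
```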
2) Splitting the text into sentences: A single review may contain multiple relevant topics that express the opinion of the customer. We therefore needed to find a way to separate the different topics. One sentence or sub-sentence usually contains one or more statements about the same topic. However, less frequently, users make several statements within a sentence without using punctuation.
Our implementation at this level divided the text along the sentence- and sub-sentence-ending punctuation used in English. Thus, the information carried by each statement is extracted sentence by sentence.
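This splitting rule can be sketched as a single regular expression over English sentence- and sub-sentence-ending punctuation (the exact punctuation set is illustrative):

```python
import re

def split_statements(review):
    """Split a review along sentence- and sub-sentence-ending punctuation,
    dropping empty fragments."""
    parts = re.split(r"[.!?;]+", review)
    return [p.strip() for p in parts if p.strip()]
```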

3) Generating regression-based sentiment values:
We predicted sentiment scores after decomposing user reviews into sentences. For this purpose, we used our BERT+R model, which we applied to each sentence. As mentioned earlier, this pipeline uses a modular approach, which means that the module implementing a step can be replaced by any other solution if it improves the performance of the pipeline, i.e., in this case, leads to better sentiment scores.

4) Sentiment-based classification:
Based on the sentiment scores predicted by the regression model, each sentence was assigned one of 3 labels (negative, neutral, positive). Sentences with sentiment scores less than 2 were classified as negative, and sentences with sentiment scores greater than 4 were classified as positive. Consequently, sentences with a sentiment score between 2 and 4 were classified as neutral.
Our research focused mainly on negative statements, so we considered these 3 groups sufficient for further investigation. Of course, the regression values we generated also allow for a finer resolution. This can be seen as a parameter of the pipeline that can be chosen depending on the application.
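The 3-way classification above reduces to two thresholds on the regression output; a minimal sketch:

```python
def classify_sentiment(score):
    """Map a 1-5 regression sentiment score to a coarse 3-way label."""
    if score < 2:
        return "negative"
    if score > 4:
        return "positive"
    return "neutral"
```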

E. Keyphrase extraction
We apply semantic embedding to extract the keywords. For this, the text was split into sentences, and the words belonging to each sentence, together with their bigram and trigram combinations, were embedded separately in a 768-dimensional vector space using SBERT. Then we measured the cosine distance of the given words and their bigrams and trigrams from the sentence that contained them. In the next step, we selected from each sentence the top 3 expressions whose semantic meaning is closest to the sentence containing them. This gave us a triplet of items for each sentence, which could contain words and phrases or their negation auxiliaries. In addition, combinations of bigrams and trigrams were used alongside elementary words, since these are the N-gram combinations that still make sense in natural languages. In fact, by limiting the value of N, we can eliminate redundant combinations.
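A simplified sketch of the top-3 selection, using toy 2-dimensional vectors in place of the 768-dimensional SBERT embeddings (all names are illustrative):

```python
import math

def _cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def top_k_phrases(sentence_vec, phrase_vecs, k=3):
    """Rank candidate phrases (uni/bi/trigrams) by cosine similarity of
    their embedding to the sentence embedding; keep the k closest.
    `phrase_vecs` maps phrase -> vector, produced by SBERT in the
    real pipeline."""
    ranked = sorted(phrase_vecs,
                    key=lambda p: _cosine(sentence_vec, phrase_vecs[p]),
                    reverse=True)
    return ranked[:k]
```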

F. Keyphrase embedding
After cleaning, classifying, and extracting key terms from the text, we need to vectorize the words and phrases in order to group them according to their meaning. In other words, we need to represent them so that words and phrases describing similar topics can be identified. To do this, we used the text embedding technique SBERT. It generated a 768-dimensional embedded vector from each extracted key expression. In the resulting embedded vector space, we could apply our own clustering approach for topic search.

G. Hierarchical and density-based recursive clustering
Embedded vectors were used to group terms with similar meanings. A special property of embedded vector spaces generated from text data is that they are difficult to cluster, as the distances between adjacent vectors are usually similar. However, slightly dense clusters appear in the vector space for similar contents. To extract these slightly dense clusters, we need to remove outliers. To achieve this, we used our own solution, which consists of hierarchical clustering with recursive execution, as shown in Figure 7. Based on our tests, the elements of slightly dense clusters describe the same topics. Our experience shows that the cosine similarity within these clusters is above 0.7. These slightly denser clusters could be extracted by recursively applying hierarchical clustering. This method re-clusters the clusters with densities below 0.7 in the next clustering cycle. This is repeated until the minimum number of elements (in our case, chosen as 5) or the density value of 0.7 is reached. Of course, these hyperparameters of the pipeline can be freely adjusted to improve the generalization ability of our system with respect to the nature of the text. The resulting clusters have sufficiently good descriptive properties.
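A self-contained toy sketch of the recursive idea: the naive bisection below stands in for the hierarchical clustering step of the real pipeline, density is taken as the mean pairwise cosine similarity, and the 0.7 density threshold and minimum cluster size correspond to the hyperparameters mentioned above.

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def _density(vectors):
    """Mean pairwise cosine similarity of a cluster."""
    if len(vectors) < 2:
        return 1.0
    sims = [_cos(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

def recursive_cluster(vectors, min_size=5, min_density=0.7):
    """Recursively split until each cluster is dense enough or too small
    to split further; returns a list of clusters (lists of vectors)."""
    if len(vectors) <= min_size or _density(vectors) >= min_density:
        return [vectors]
    # bisect: seed with the two least similar vectors, assign the rest
    pairs = [(_cos(vectors[i], vectors[j]), i, j)
             for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    _, i, j = min(pairs)
    left, right = [vectors[i]], [vectors[j]]
    for k, v in enumerate(vectors):
        if k in (i, j):
            continue
        (left if _cos(v, vectors[i]) >= _cos(v, vectors[j]) else right).append(v)
    return (recursive_cluster(left, min_size, min_density) +
            recursive_cluster(right, min_size, min_density))
```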

H. Computational and development environment
During development and research, we worked in a cloud environment. For data storage, text data was stored in JSON and CSV files in a Hadoop-based storage system [37]. For cost efficiency, we also used two different clusters for computations.
We worked with a CPU-optimized virtual environment for operations requiring more CPU computations.This environment was configured with a 16-core AMD EPYC 7452 processor, 128 GB RAM and 400 GB of storage.
For operations with transformer-based neural networks, we used a GPU-optimized virtual environment that was configured with a 12-core Intel Xeon E5-2690 processor, 224 GB RAM, 1474 GB (SSD) cache, and two NVIDIA Tesla P100 (32GB) GPUs.

IV. EXPERIMENTAL RESULTS
In this section, we present the methods and results of our experiments. First, we present the experimental results that informed our choice of keyword and phrase extraction methods for our pipeline. Then, in a supervised classification problem, we compare the efficiency of our complex model with that of the LDA, Top2Vec, and BERTopic topic modeling solutions.

A. Evaluation of keyword extraction solutions
Three solutions were compared to find the best keyword extraction method, and their results were evaluated by 7 independent human experts. To enable this comparison, we needed to produce a dataset that human experts could process. For our experiments, we used the Cell Phones and Accessories (CPA) subset of the Amazon Review dataset, from which we randomly sampled 100 sentences (see Supplementary Material 1).
We randomly selected 100,000 products from this subset and chose the 4 products with the most reviews. This step was introduced to ensure a sufficient number of reviews for the same product, including a sufficient number of explicitly positive and negative samples. The 4 products with the most reviews already provided a sufficient sample for further narrowing. In the next step, we removed the neutral reviews with a rating of 3. This step was introduced because our experience has shown that extracting information from extreme emotional expressions is more difficult: words with a strong emotional charge and negative sentences can obscure the information, making it difficult to extract.
The remaining reviews were split into sentences, and grammatical abbreviations were corrected. We then removed sentences containing fewer than 6 or more than 14 words. In our experience, sentences shorter than 6 words generally do not contain meaningful information, while the 14-word upper limit fits well with the length of an average English sentence. From the remaining sentences, we randomly selected 100.
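The narrowing steps above can be sketched as follows. This is a hypothetical helper: the input data structure, the regex sentence splitter, and the function name are our assumptions, not the paper's code (the original also corrected grammatical abbreviations, which is omitted here).

```python
import random
import re

def build_evaluation_sample(reviews, k=100, seed=42):
    """Drop neutral reviews, split into sentences, keep 6-14-word
    sentences, then draw k of them at random.

    `reviews` is assumed to be a list of (text, rating) pairs.
    """
    sentences = []
    for text, rating in reviews:
        if rating == 3:  # remove neutral reviews
            continue
        # naive sentence split on terminal punctuation
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if 6 <= len(sentence.split()) <= 14:
                sentences.append(sentence)
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))
```

A production pipeline would use a proper sentence tokenizer, but the filtering logic is the same.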
Finally, the N-gram, dependency parsing, and semantic embedding-based keyword extraction methods were applied to each of the 100 sentences. The keywords generated by the tested models were evaluated by the independent experts, who voted, for each sentence, on which method best extracted its meaning. The experts could vote for multiple models or choose the "none of them" category.
The evaluation results (see Table I) showed that the embedding-based technique dominated, winning in 61% of the cases; therefore, we implemented this method in our framework. The correlation between the models on this task can also be inspected in Figure 8.

B. Topic modeling evaluation
1) Approach: To assess the information retention capabilities of our own and the referenced models, we evaluated unsupervised keyword retrieval on a supervised text classification task. Evaluating unsupervised learning models often involves relating them to a supervised classification task, as in the case of LDA, where binary classification is used to measure effectiveness. We hypothesized that if the models can extract meaningful keywords, the accuracy of an independent classification model using only those extracted keywords should correlate with the information value of the words. This assumption is supported by the observation that text cleaning, which removes noise, improves classification accuracy. However, this improvement only occurs when the removed elements are indeed noise: if words with information value for classification are eliminated, the accuracy of the classification models decreases.
While keyword extraction does not aim to enhance classification accuracy, it is plausible that the quality of the extracted words impacts classification outcomes. If classification accuracy drops significantly compared to using all text elements, it suggests that the topics and their associated words were not correctly extracted. Conversely, when relevant words are extracted, classification accuracy should ideally increase. It is important to note that a keyword extraction system will not improve classification accuracy in every case. Nevertheless, this evaluation method allows us to compare the relevance of the words extracted by the different models in terms of information retention.
2) Dataset: For the evaluation, we used the 20newsgroups and Amazon Reviews datasets. 20newsgroups is a widely used benchmarking dataset that provides training and test subsets; furthermore, software packages allow its quick and easy use without pre-processing steps. From the 20newsgroups dataset, binary classification tasks were created as follows. The dataset contains 20 classes in 6 categories, so a single class from each category was chosen to obtain sufficiently different elements for binary classification. Finally, from the 6 selected classes (comp.graphics, rec.autos, sci.med, misc.forsale, talk.politics.guns, soc.religion.christian), we created all possible pairwise binary classification tasks. For an impression of this dataset, see Figures 9 and 10 for its 2D visualizations obtained by PCA and t-SNE, respectively.
We also created several classification tasks from the Amazon Reviews dataset. Due to its relatively large size compared to the available resources, a reduced set was used in this case, as in the development phase. We again used the CPA subset, as in the evaluation of the keyword extraction models. From the dataset, we selected products with a uniform set of 100 reviews for each class (ratings from 1 to 5). Thus, we created a classification task with 6,000 training and 1,500 test records, which we used for 5-label multi-class classification.

3) Setup: Because of the different approaches of the different models, each model was set up to maximize its performance.
BERTopic has limited configuration options, so we maximized the keyword extraction capability of the model by setting the maximum number of words that can be defined for a topic to 30.
For Top2Vec, the min_count parameter of the model controls the number of words to extract; we therefore tried values from 10 to 500. 500 was the largest value for which the model remained stable in our development environment.
Automatic determination of the number of topics is not built into LDA by default, so a more complex optimization was performed for this model. In LDA, a separate parameter sets the number of topics to search for; we varied this parameter between 2 and 50. Within this range, we tried both online and batch optimization techniques to find the best number of topics. The optimal topic number for LDA was the one with the best average perplexity among the candidates.
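This perplexity-based search can be sketched with scikit-learn's LDA implementation; the function below is an illustrative reconstruction, not the paper's code, and the candidate range and scoring on the training counts are our assumptions.

```python
# Sketch: pick the LDA topic number with the lowest perplexity.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def best_lda_topic_number(texts, candidates=range(2, 51), method="batch"):
    """Fit one LDA per candidate topic number ("batch" or "online"
    optimization) and return the perplexity-minimizing count."""
    counts = CountVectorizer().fit_transform(texts)
    best_n, best_perplexity = None, float("inf")
    for n in candidates:
        lda = LatentDirichletAllocation(n_components=n,
                                        learning_method=method,
                                        random_state=0).fit(counts)
        perplexity = lda.perplexity(counts)
        if perplexity < best_perplexity:
            best_n, best_perplexity = n, perplexity
    return best_n
```

In practice, perplexity would be measured on held-out data, but the selection loop is the same.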
In the case of our own model, we left the parameters at the default values established during development. This means that each topic selected by our model had more than 5 elements, and the cosine similarity between the elements of each topic was greater than 0.7. Sets outside this range were treated as outliers and irrelevant topics.
The topic words extracted by each model were used as a dictionary, and for each evaluation on the classification tasks, only these words were retained from the datasets. The texts filtered by the model dictionaries were vectorized using one-hot, count vectorization, and TF-IDF techniques. We then used a simple regression model for each evaluation. The advantage of the regression model is its simplicity, and its operation can be clearly explained through its weights. In each case, the models were trained on a training dataset filtered by the dictionaries generated by the topic models, and their accuracy was measured on the test set.
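The evaluation loop can be sketched with scikit-learn. This is a hedged illustration: we assume logistic regression as the "simple regression model", show only the TF-IDF variant of the three vectorizations, and all names are placeholders.

```python
# Sketch of the dictionary-filtered evaluation step: restrict the
# vectorizer's vocabulary to the topic-model words, then score a
# simple linear classifier on the test set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_vocabulary(topic_words, train_texts, train_labels,
                        test_texts, test_labels):
    """Accuracy of a classifier that only sees the words in `topic_words`."""
    vocab = sorted({w.lower() for w in topic_words})
    vectorizer = TfidfVectorizer(vocabulary=vocab)  # ignore all other words
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return accuracy_score(test_labels, clf.predict(X_test))
```

Running this with each model's dictionary on the same train/test split yields directly comparable accuracy scores.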

4) Results:
Since we tested several parameter settings for the LDA and Top2Vec models, we report their average performance.
On the binary classification tasks, we measured an average accuracy of 89.82% for our own model, 82.75% for LDA with batch optimization, 82.38% for LDA with online optimization, 76.21% for Top2Vec, and 79.46% for BERTopic (see Table II). In terms of the number of topics found, our model found 58 topics on average, LDA with batch optimization 7, LDA with online optimization 7, Top2Vec 3, and BERTopic 7, as shown in Table III. The detailed results for each binary classification task and topic modeling approach are enclosed in Supplementary Material 1. For the multi-class classification task, our model provided an average accuracy of 45.24%, while we obtained 35.73% for LDA with batch optimization, 35.48% for LDA with online optimization, 39.41% for Top2Vec, and 42.8% for BERTopic (see Table IV). In terms of the number of topics found, our model found 64 topics on average, LDA with batch optimization 2, LDA with online optimization 2, Top2Vec 17, and BERTopic 94 (see Table V).

V. CONCLUSIONS
This paper has outlined the difficulties of information retrieval from texts when market considerations are taken into account. In doing so, we have pointed out that the popular methods from the literature that we studied and tested in natural language text processing cannot comprehensively address these considerations. Indeed, to solve a target task effectively, we needed to focus on aspects for which the solutions we tested did not work satisfactorily. Our experience therefore suggests that there is currently no general solution to the problems of information retrieval and topic modeling.
In our development work, we reviewed the various possible solutions and combined and refined them to meet our needs. Through the necessary research, we were able to develop solutions that bring these elements together in a uniform pipeline. As a result, we have created a complex pipeline model that performs the minimum necessary data cleansing steps, finds key terms, and organizes them into coherent topics. The process is consistent and produces results that prove more efficient than those of the models available in the literature.
The complex pipeline we propose is a computational cloud-based system that, based on our measurements, has good information retention and can use semantic embedding to find key terms and the topics around which they cluster. This provides more responsive and usable results than other solutions. In addition, the steps of the overall process have been designed and implemented in a well-separated way, so that additional elements needed for other target tasks can easily be inserted between the existing steps.

Supplementary Material 2: Average number of topics and vocabulary sizes found by the different models and their classification results

Here we summarize all the models and their outputs in terms of the number of topics and vocabulary sizes they found for the 20newsgroups dataset. We also present their results for a classification task using different embedding strategies. The results for each model are presented in separate tables below. The columns Topic number and Vocabulary size contain the averages of the different runs for each model. It can be seen that our approach outperforms the others with respect to the classification problem investigated.

Fig. 1. Embedded vector space from the Amazon dataset depicted by PCA.

Fig. 2. Embedded vector space of the Amazon dataset depicted by t-SNE.

Fig. 6. The BERT+R architecture modified for our system.

Fig. 8. Correlation between the models based on the votes.

TABLE I. EVALUATION OF KEYWORD EXTRACTION MODELS.

TABLE II. AVERAGE ACCURACY FOR BINARY CLASSIFICATION.

TABLE III. NUMBER OF TOPICS AND VOCABULARY SIZES FOR BINARY CLASSIFICATION.

TABLE VI. EXPERT EVALUATION OF KEYWORD EXTRACTION MODELS FOR REVIEWS 1 TO 20.

TABLE IX. EXPERT EVALUATION OF KEYWORD EXTRACTION MODELS FOR REVIEWS 61 TO 80.

TABLE XI. RESULTS OF THE BERTOPIC MODEL.

TABLE XII. RESULTS OF THE LDA MODEL WITH BATCH OPTIMIZATION.

TABLE XIII. RESULTS OF THE LDA MODEL WITH ONLINE OPTIMIZATION.

TABLE XIV. RESULTS OF THE TOP2VEC MODEL.

TABLE XV. RESULTS OF OUR MODEL.