Short Text Clustering Algorithms, Application and Challenges: A Survey



Introduction
Recently, the number of text documents on the Internet has increased significantly and rapidly. The rapid development of mobile devices and Internet technologies has encouraged users to search for information, communicate with friends and share their opinions and ideas on social media such as Twitter, Instagram and Facebook and on search engines such as Google. The texts generated every day in social media are vast and unstructured data [1].
Most of these generated texts come in the form of short texts and need special analysis compared with formally written ones [2,3]. Short texts can be found on the Internet, including on social media, in product descriptions, in advertisement text, on questions and answers (Q&A) websites [4] and in many other applications. Short texts are distinguished by a lack of context, so finding knowledge in them is difficult. This issue motivates researchers to develop novel, effective methods. Examples of short texts can be found in various contexts, like tweets, search queries, chat messages, online reviews and product descriptions. Short text also presents a challenge in clustering owing to its chaotic nature, which typically contains noise, slang, emojis, misspellings, abbreviations and grammatical errors. Tweets are a good example of these challenges. In addition, short text represents various facets of people's daily lives. As an illustration, Twitter generates 500 million tweets per day. These short texts can be used in several applications, such as trend detection [5], user profiling [6], event exploration [7], system recommendation [8], online user clustering [9] and cluster-based retrieval [2,10].
With the vast amount of short texts being added to the web every day, extracting valuable information from short text corpora by using data-mining techniques is essential [11,12]. Among the many different data-mining techniques, clustering stands out as a unique technique for short text that provides the exciting potential to automatically recognize valuable patterns from a massive, messy collection of short texts [13]. Clustering techniques focus on detecting similarity patterns in corpus data, automatically detecting groups of similar short texts, and organising documents into semantic and logical structures.
Clustering techniques help governments, organisations and companies monitor social events, interests and trends by identifying various subjects from user-generated content [14,15]. Many users can post short text messages, image captions, search queries and product reviews on social media platforms. Twitter sets a restriction of 280 characters on the length of each tweet [16], and Instagram sets a limit of 2200 characters for each post [17].
Clustering short texts (forum titles, result snippets, frequently asked questions, tweets, microblogs, image or video titles and tags) within groups assigned to topics is an important research subject. Short text clustering (STC) has undergone extensive research in recent years to solve the most critical challenges to the current clustering techniques for short text, which are data sparsity, limited length and high-dimensional representation [18–23].
Applying standard clustering techniques directly to a short text corpus creates issues. Traditional clustering techniques such as K-means [24] are less accurate when grouping short text data than when grouping regular-length documents [25]. One of the reasons is that standard clustering techniques such as K-means [24] and DBSCAN [26] depend on methods that measure the similarity/distance between data objects and on accurate text representations [25]. However, the use of standard text representation methods for STC, such as term frequency-inverse document frequency (TF-IDF) vectors or bag of words (BOW) [27], leads to sparse and high-dimensional feature vectors that are less distinctive for measuring distance [18,28,29]. Therefore, dimensionality reduction, although an optional step of the STC system, is often important. For example, if we use TF-IDF as the text representation for a dataset with 300k unique words, the feature vectors are high-dimensional and the computational time is extensive. We can reduce the number of dimensions by using feature reduction.
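The sparsity described above can be made concrete with a small sketch. The code below is an illustrative, from-scratch TF-IDF computation, assuming a plain `tf * log(N / df)` weighting and four invented short texts; the function names `tfidf_vectors` and `sparsity` are hypothetical and are not taken from any cited work.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, dense TF-IDF rows) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # One dimension per vocabulary term; absent terms get weight 0.
        row = [
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if counts[t] else 0.0
            for t in vocab
        ]
        rows.append(row)
    return vocab, rows

def sparsity(rows):
    """Fraction of vector entries that are exactly zero."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0.0)
    return zeros / total

# Invented sample data: four short, tweet-like texts.
docs = [
    "new phone battery drains fast".split(),
    "phone screen cracked again".split(),
    "great pizza place downtown".split(),
    "pizza delivery was late".split(),
]
vocab, rows = tfidf_vectors(docs)
print(len(vocab), round(sparsity(rows), 2))
```

Even in this tiny corpus, each text touches only a handful of the vocabulary dimensions, so most entries are zero; with a realistic vocabulary the effect is far stronger.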
In this paper, we present various concepts for STC and introduce several text clustering methodologies and some recent strategies of these models. In addition, we discuss techniques and algorithms used when representing a short text from a dataset. We advise readers to look up the original publications cited here for any methods or techniques to assist them in understanding STC fully and remain open to different approaches they may come across when reviewing published research.
The main contribution of this study is a comprehensive review of techniques and applications of STC, along with the components of STC and its main challenges. This paper overviews many STC types and options for various data scenarios and tries to answer the following research questions: RQ1: What are the applications of STC? RQ2: What are the main components of STC? RQ3: Which method is used for the representation of STC? RQ4: What are the main challenges of STC, and how can one overcome them?
The remaining sections of this study are structured as follows. In Section 2, we briefly mention the applications of STC. In Section 3, we describe the detailed components of STC. In Section 4, we describe the challenges of STC. In Section 5, we draw the conclusions.

Applications of Short Text Clustering
Many clustering methods have been used in several real-world applications. The following disciplines and fields use clustering:
i. Information retrieval (IR): Clustering methods have been used in various applications in information retrieval, including clustering big datasets. In search engines, text clustering plays a critical role in improving document retrieval performance by grouping and indexing related documents [30].
ii. Internet of Things (IoT): With the rapid advancement of technology, several domains have focused on IoT. Data collection in the IoT involves using a global positioning system, radio frequency identification technology, sensors and various other IoT devices. Clustering techniques are used for distributed clustering, which is essential for wireless sensor networks [31,32].
iii. Biology: When clustering genes and samples in gene expression, the gene expression data characteristics become meaningful. They can be classified into clusters based on their expression patterns [33].
iv. Industry: Businesses collect large volumes of information about current and prospective customers. For further analysis, customers can be divided into small groups [34].
v. Climate: Recognising global climate patterns necessitates detecting patterns in the oceans and atmosphere. Data clustering seeks to identify atmospheric pressure patterns that significantly impact the climate [35].
vi. Medicine: Cluster analysis is used to differentiate among disease subcategories. It can also detect disease patterns in the temporal or spatial distribution [36].

Components of Short Text Clustering
Clustering is a type of data analysis that has been widely studied; it aims to group a collection of data objects or items into subsets or clusters [37]. Specifically, the main goal of clustering is to generate cohesive groups of similar data elements by grouping related data points into unified clusters. All the documents or objects in the same cluster must be as similar as possible [38]. In other words, similar documents in a cluster have similar topics so that the cluster is coherent internally. Distinctions between each cluster are notable. Documents or objects in the same cluster must be as different from those in the other clusters as possible.
Text clustering is essential to many real-world applications, such as text mining, online text organisation and automatic information retrieval systems. Fast and high-quality document clustering greatly aids users in successfully navigating, summarizing and organizing large amounts of disorganized data. Furthermore, it may determine the structure and content of previously unknown text collections. Clustering attempts to automatically group similar documents or objects into clusters by using various similarity/distance measures [39].
Differentiating between clustering and classification of documents is crucial [40]. Still, the difference between the two may be unclear because a set of documents must be split into groups in both cases. In general, labelled training data are supplied during classification; however, the challenge arises when attempting to categorize test sets, which consist of unlabelled data, into a predetermined set of classes. In most cases, the classification problem may be handled using a supervised learning method [41,42].
As mentioned above, one of the significant challenges in clustering is grouping a set of unlabelled and non-predefined data into similar groups. Unsupervised learning methods are commonly used to solve the clustering problem. Furthermore, clustering is used in many data fields that do not rely on predefined knowledge of the data, unlike classification, which requires prior knowledge of the data [43].
Short texts are becoming increasingly common as online social media platforms such as Instagram, Twitter and Facebook increase in size. They have very minimal vocabulary; many words even appear only once. Therefore, STC significantly affects semantic analysis, demonstrating its importance in various applications, such as IR and summarisation [2].
However, the sparsity of short text representation makes the traditional clustering methods unsatisfying. This is due to the sparsity problems caused by each short text document only containing a few words.
Short text data contain unstructured sentences that lead to massive variance from regular texts' vocabulary when using clustering techniques. Therefore, self-corpus-based expansion is presented as a semantically aligned substitutional approach by defining and augmenting concepts in the corpus using clustering techniques [44] or topics based on the probability of frequency of the term [45]. However, dealing with the microblogging data is challenging for any of these methods because of their lack of structure and the small number of co-occurrences among words [46].
Several strategies have been proposed to alleviate the sparsity difficulties caused by lack of context, such as corpus-based metrics [47] and knowledge-based metrics [25,48]. One of these simple strategies concentrates on data-level enhancements. The main idea is to combine short text documents to create longer ones [49]. For aggregation, related models utilize metadata or external data [48]. Although these models can alleviate some of the sparsity issues, a drawback remains. That is, these models rely on external data to a large extent.
STC is more challenging than traditional text clustering. Representations of the text in the original lexical space are typically sparse, and this problem is exacerbated for short texts [50]. Therefore, learning an efficient short text representation scheme suitable for clustering is critical to the success of STC. In essence, the major drawback of the standard STC techniques is that they cannot adequately handle the sparseness of words in the documents. Compared with long texts containing rich contexts, distinguishing the clusters of short documents with few words occurring in the training set is more challenging.
Generally, most models primarily focus on learning representation from local co-occurrences of words [21]. Understanding how a model works is critical for using and developing text clustering methods. STC generally contains different steps that can be applied, as shown in Figure 1.
(I) Pre-processing: It is the first step to take in STC. The data must be cleaned by removing unnecessary characters, words, symbols, and digits. Then, text representation methods can be applied. Pre-processing plays an essential role in building an efficient clustering system because short text data (original text) are unsuitable to be used directly for clustering.
(II) Representation: Documents and texts are collections of unstructured data. These unstructured data need to be transformed into a structured feature space to use mathematical modelling during clustering. The standard techniques of text representation can be divided into corpus-based representation and external-knowledge-based representation methods.
(III) Dimensionality reduction: Texts or documents, often after being represented by traditional techniques, become high-dimensional. Data-clustering procedures may be slowed down by extensive processing time and storage complexity. Dimensionality reduction is a standard method for dealing with this kind of issue. Many academics employ dimensionality reduction to lessen their application time and memory complexity rather than risk a performance drop. Dimensionality reduction may be more effective than developing inexpensive representation.
(IV) Similarity measure: It is the fundamental entity in the clustering algorithm. It makes it easier to measure similar entities, group the entities and elements that are most similar and determine the shortest distance between related entities. Distance and similarity have an inverse relationship, so either can be used to express how close two items are. The vector representation of the data items is typically used to compute similarity/distance measures.
(V) Clustering techniques: The crucial part of any text clustering system is selecting the best algorithm. We cannot choose the best model for a text clustering system without a deep conceptual understanding of each approach. The goal of clustering algorithms is to generate internally coherent clusters that are obviously distinct from one another.
(VI) Evaluation: It is the final step of STC. Understanding how the model works is necessary before applying or creating text clustering techniques. Several models are available to evaluate STC.
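As a rough illustration of the inverse relationship between similarity and distance in step (IV), the sketch below computes cosine similarity on sparse term-count vectors and derives a distance as one minus the similarity. Cosine is chosen here only as a common example; the sample texts and the helper names `cosine_similarity` and `cosine_distance` are assumptions made for this sketch.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance derived from similarity: high similarity means low distance."""
    return 1.0 - cosine_similarity(a, b)

# Invented sample texts represented as term counts.
x = Counter("cheap flights to paris".split())
y = Counter("cheap hotels in paris".split())
z = Counter("python list sort".split())

print(cosine_similarity(x, y), cosine_distance(x, y))
print(cosine_similarity(x, z), cosine_distance(x, z))
```

Texts that share no terms at all receive the maximum distance, which is the typical situation for many pairs of short texts.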

Document Pre-Processing in Short Text Clustering
Document pre-processing plays an essential part in STC because the short text data (original text) are unsuitable to be used directly for clustering. The textual document likely contains every type of string, such as digits, symbols, words, and phrases. Noisy strings may negatively impact clustering performance, affecting information retrieval [51,52]. The pre-processing phase for STC enhances the overall processing [47]. In this context, the pre-processing step must be applied to the documents to be clustered if one wants to use machine learning approaches [53]. Pre-processing consists of four steps: tokenization, normalization, stop word removal and stemming. The main pre-processing steps are shown in Figure 2.
According to [54], short texts have many unwanted words which may harm the representation rather than assist it. This fact validates the benefits of pre-processing the document in STC. Utilizing the documents with all their words, including unnecessary ones, is a complicated task. Generally, words classified under particles, conjunctions and other grammar-based categories, which are commonly used, may be unsuitable for supporting studies on short text clustering. Furthermore, as suggested by Bruce et al. [55], even standard terms such as 'go', 'gone' and 'going' in the English language are created by the derivational and inflectional processes. Such words are handled better if their inflectional and derivational morphemes are removed so that only the original stems remain. This reduces the number of words in a document whilst preserving the semantic functions of these words. Therefore, a document free of unwanted words is an appropriate pre-processing goal [56,57].

Tokenization and Normalization
Tokenization is defined as a standard text-processing step that divides a flow of natural language text into distinct significant elements called tokens as part of the pre-processing [58]. Tokenization transforms the text from a document into data that can be analysed by machine learning methods. Generally, these algorithms segment the text into separate units by adding a space or some other kind of distinctive marker so that each unit may be mapped to a different word in the text [55].
The normalization step aims to clean the data by removing unnecessary and noisy data, such as numbers, symbols, code tags and special characters. Whilst clustering search results, noise filtering is an essential task of the tokenizer. Likewise, the retrieved results, such as the contextual snippets supplied as input, include file names; URLs; characters that demarcate portions of whole documents, such as ellipses; special characters (@, %, &, etc.) and other symbols whose meanings are not readily apparent. A reliable tokenizer needs to be able to recognize and get rid of this type of noise whilst it creates a token sequence. This step is necessary to carry out acceptable data representation and pre-processing.
As explained above, text tokenization and normalization transform the text into words or phrases by deleting extraneous strings of characters, such as punctuation marks, numerals and other strings of characters [59]. In essence, white space is used to separate the tokens from one another. A text tokenization sample is illustrated in Figure 3. The text is displayed as a collection of tokens, with all of the characters written in lowercase, and white space is utilized to differentiate between each token. The commas, periods and other punctuation marks, along with any other special characters, are deleted.
The tokenization and normalization steps are as follows:
1. Remove numbers (2, 1…).
5. Remove non-English words, such as ﺍﺳﻢ.
6. Remove words with fewer than three letters.
Figure 4 shows the tokenization and normalization steps.
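The following is a minimal sketch of tokenization and normalization, covering only the operations that are stated in the text: lowercasing, removing URLs, numbers, punctuation and special characters, dropping non-English words and dropping words with fewer than three letters. The regular expressions, the rule that an "English" token consists solely of the letters a to z, and the function name `tokenize_and_normalize` are simplifying assumptions made for illustration.

```python
import re

def tokenize_and_normalize(text):
    """Return a list of cleaned, lowercase tokens from raw short text."""
    text = text.lower()
    # Drop URLs, which the text names as a typical source of noise.
    text = re.sub(r"https?://\S+", " ", text)
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # Replace punctuation, symbols and underscores with white space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # White space separates the tokens.
    tokens = text.split()
    # Keep tokens made only of a-z (a crude stand-in for "English word")
    # that have at least three letters.
    return [t for t in tokens if re.fullmatch(r"[a-z]+", t) and len(t) >= 3]

# Invented sample; "\u0627\u0633\u0645" is an Arabic word used as the non-English token.
sample = "Check out 2 NEW phones @ https://example.com !!! \u0627\u0633\u0645 is it OK?"
print(tokenize_and_normalize(sample))
```

The a-to-z rule also discards English words written with accents or apostrophes, so a production tokenizer would likely need a more careful language check.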

Stop-Word Removal
Stop words serve a grammatical function in the language rather than specifying a semantic function or meaning. Stop words are considered less useful in text than other terms. Generally, they have little direct effect on the meaning of the text. In most cases, documents in English include many such unnecessary words. Stop words are typically utilized by writers to improve the structure of their writing linguistically. Examples of stop words include demonstratives such as 'this', 'that' and 'those' and articles such as 'the', 'a' and 'an'. Removing stop words from a document is typical and positively affects document clustering. This is because the size of the term space is significantly reduced once the stop words have been removed. Every language has its unique collection of stop words [56]. By removing these frequently used stop words from the text documents, the number of words each search term has to be matched against is reduced, significantly reducing the time it takes for queries to receive a result without affecting accuracy.
These words often communicate more grammatical functions than semantic functions, which may increase the conversational or informative aspects of the document's content. Considering this, removing unnecessary words results in an improved ability to transmit the meaning of the text or document content and leads to an easier time understanding it using machine learning approaches. Many search engines have implemented stop word removal to help users or writers with queries obtain improved results by searching for information or meaning instead of searching for functional words [55]. Figure 5 displays the word document sample after removing stop words.
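Stop-word removal can be sketched as a simple filter over the token list. The stop-word set below is a small hypothetical sample built around the demonstratives and articles mentioned in the text; real systems typically rely on much larger, language-specific lists.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset(
    {"this", "that", "those", "the", "a", "an", "is", "and", "of", "to", "in"}
)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "this is the best pizza in town".split()
print(remove_stop_words(tokens))
```

Passing a different `stop_words` set is enough to adapt the filter to another language or domain.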

Stemming
This is the third step in pre-processing. We utilize stemmed words to represent the texts in this step. Stemming is a traditional shallow natural language processing (NLP) technique. Word stemming removes all prefixes and suffixes to obtain stem words [60]. Stemming is crucial for indexing and keyword filtering because it makes clustering faster and more accurate by reducing the vocabulary size and the dependence on particular word forms [61]. Figure 6 illustrates how stemming transforms the words 'consultant', 'consultants', 'consulting' and 'consultative' into a single stem, 'consult', which is also a word in the dictionary. However, this is not always the case; a stem may not always be a valid word.
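To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm or any method from the cited works; the suffix list and the `toy_stem` name are hypothetical and were chosen only so that the 'consult' example can be reproduced.

```python
# Hypothetical suffix list, tried from longest to shortest.
SUFFIXES = sorted(["ative", "ants", "ant", "ing", "s"], key=len, reverse=True)

def toy_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # No suffix matched; the word is returned unchanged.

for w in ["consultant", "consultants", "consulting", "consultative", "running"]:
    print(w, "->", toy_stem(w))
```

The last word shows the caveat from the text: stripping 'ing' from 'running' leaves 'runn', which is not a dictionary word.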

Document Representation
Even after noise has been removed during pre-processing, the text is still not in a form that yields the best clustering results. The text representation step, which converts a word or a full text from its initial form into another, is therefore essential. Learning algorithms cannot be applied directly to text that has not been represented [62] because text information has a complex nature [63]; textual content must first be converted into a concise representation. Language-independent approaches are particularly successful because they do not depend on the meaning of the language and perform well on noisy text. This independence also makes them efficient [64].
Short text similarity has attracted more attention in recent years, and correctly understanding the semantics shared between documents is challenging because of lexical diversity and ambiguity [65]. Representing short text is critical in NLP yet challenging owing to its sparsity, high dimensionality, complexity, large volume and the amount of irrelevant, redundant and noisy information it contains [1,66]. As a result, the traditional methods of computing semantic similarity are a significant roadblock because they are ineffective in various circumstances. Many existing traditional systems fail to deal with terms not covered by synonyms and cannot handle abbreviations, acronyms, brand names and other terms [67]. Examples of these traditional systems are BOW and TF-IDF, which represent text as real-valued vectors to help with semantic similarity computation. However, these strategies cannot account for the fact that words have diverse meanings and that different words may be used to represent the same concept. For example, consider two sentences: 'Majid is taking insulin' and 'Majid has diabetes'. Although these two sentences convey closely related meanings, they share almost no words. These methods capture the lexical features of the text and are simple to implement; however, they ignore its semantic and syntactic features.
To address this issue, several studies have expanded and enriched the context of the data by using an ontology [68,69] or Wikipedia [70,71]. Generally, these methods use external knowledge to improve the contextual information of short texts. However, they require a great deal of NLP expertise, and they still use high-dimensional representations for short text, which may waste memory and computing time. Many short text similarity measurement approaches exist, such as representation-based measurement [72,73], which learns new representations for short text and then measures similarity based on this model [74]. A large number of similarity metrics have previously been proposed in the literature. We choose corpus-based and knowledge-based metrics because of their observed performance in NLP applications. This study explains several representation-based measurement methods, as shown in Figure 7.

Non-DL Measures
In this section, the literature is comprehensively reviewed to understand the research attempts and trends in measuring similarity for STC, including corpus-based and knowledge-based measures.

Bag of Words Model
According to [75,76], BOW is the most traditional text representation method; it simplifies the data for the processing stage by treating text as a group of words. It is widely used in IR and NLP because of its simplicity and efficiency, using single words or phrases as features to represent text. The difficulties with text processing stem from the fact that the syntactic and semantic content of text is hard to quantify, so creating a comprehensive model of all text data is challenging. Current text analysis approaches therefore usually represent a text document by reducing its structural complexity. BOW treats a written document as a bag of its words, ignoring word order and grammar. Figure 8 illustrates how BOW can be used to represent two texts. Stop words such as 'a' and 'is' are removed from processing to underline the relevance of the other words. In Figure 8, each short text is represented by a fixed-length vector, and the value assigned to each dimension is the term frequency (tf) of the corresponding term in the text document. BOW simplifies the text document, but it cannot distinguish between two documents that contain the same words in different orders.
The BOW model has several further drawbacks. Some corpora, such as social media corpora, include slang and misspelt words, which result in a high-dimensional feature space. Furthermore, these models cannot handle complex differences in word meaning, such as synonymy and polysemy. BOW has a weak sense of the semantics of the words, or more formally, of the distances between words [77].
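A minimal sketch of the BOW representation follows. The stop word list, the whitespace tokenizer and the two example sentences are assumptions made for illustration; they are not the texts of Figure 8.

```python
from collections import Counter

STOP_WORDS = {"a", "is", "the"}  # tiny illustrative list, not a standard one

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def bow_vectors(texts):
    """Return the sorted vocabulary and one term-frequency vector per text."""
    tokenized = [tokenize(t) for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

vocab, vectors = bow_vectors(["the dog bites the man", "the man bites the dog"])
print(vocab)    # ['bites', 'dog', 'man']
print(vectors)  # [[1, 1, 1], [1, 1, 1]]
```

The two sentences have opposite meanings but identical vectors, which is the word-order limitation noted above.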

Vector Space Model (VSM)
VSM is a traditional method for measuring the distance between text documents after the text has been simplified by BOW, with term weight vectors representing the original texts [27,78]. VSM uses document-level word occurrences as its representational basis. Using different term-weighting methods, a document can be mapped into the high-dimensional term feature space. Term-weighting algorithms assess the significance of terms within a given document or corpus, and appropriate term weighting can improve the performance of text analysis. Several different term-weighting methods are available. Short texts are limited in length, whereas the vocabulary of the corpus is often quite large. Therefore, when similarity or distance is calculated using cosine similarity or Euclidean distance [29], the VSM-based representation of short texts produces sparse, high-dimensional vectors, which are less discriminative.
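The sparsity problem can be made concrete with a small sketch of cosine similarity over sparse term-frequency vectors. The whitespace tokenizer and the third example sentence are assumptions for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Sparse term-frequency vector stored as a Counter."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = tf_vector("Majid is taking insulin")
b = tf_vector("Majid has diabetes")
print(round(cosine(a, b), 3))                       # 0.289, only 'majid' is shared
print(cosine(a, tf_vector("stock markets fell")))   # 0.0, no shared term
```

Because each short text activates only a handful of dimensions, most pairs share few or no terms and their similarity collapses towards zero regardless of meaning.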
Topic-model-related methods are used to learn high-level semantic text representations that alleviate the disadvantages of VSM on short text [1]. In VSM, the weights of the terms are calculated from the frequencies of the terms in the original text documents, so that term frequencies are transformed into a numerical vector. Although simple to set up, VSM suffers from high dimensionality and sparsity, and these weaknesses become even more apparent with short text data. In addition, the global term weighting must be calculated over all documents. Generally, the common term-weighting strategies can be divided into local and global term weighting schemes [27,79].

I. Local term weight
A local term weight is a term frequency value within the document and can be derived by several methods [78]. The most important and commonly used local weighting schemes, shown in Table 1, are term presence (tp), term frequency (tf), augmented term frequency (atf), the logarithm of term frequency (ltf) and BM25 term frequency (btf). The most notable and common representation is tf, which indicates the number of occurrences of the term in the document and thus emphasizes the words that appear more frequently. tp is a simple binary representation that ignores the number of appearances of the term in the document, which can be useful when the number of times a word appears is unimportant. The atf scheme combines tp and tf: it gives a base level of confidence to any term that appears in the document and additional confidence to frequently occurring terms. ltf applies a logarithmic function to the within-document frequency because a term that appears five times in a document is not necessarily five times as important as a term that occurs once in that document.
In Table 1, aver dl represents the average number of terms found in all texts.
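Since Table 1 is not reproduced here, the following sketch uses commonly cited forms of the five local weights. The exact formulas, the constants `k`, `k1` and `b`, and the argument names are assumptions and may differ from those in Table 1.

```python
import math

def tp(tf):
    """Term presence: binary indicator."""
    return 1.0 if tf > 0 else 0.0

def atf(tf, max_tf, k=0.5):
    """Augmented tf: base weight k for presence plus a share scaled by the largest tf."""
    return k + (1 - k) * tf / max_tf if tf > 0 else 0.0

def ltf(tf):
    """Logarithmic tf (assumed form 1 + log tf; log(1 + tf) is also used)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def btf(tf, dl, aver_dl, k1=1.2, b=0.75):
    """BM25 term-frequency component with document-length normalisation."""
    return (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / aver_dl) + tf)

for tf in (1, 5):
    print(tf, tp(tf), atf(tf, max_tf=5), round(ltf(tf), 3),
          round(btf(tf, dl=10, aver_dl=10), 3))
```

Under these assumed forms, a term occurring five times receives well under five times the ltf or btf weight of a term occurring once, which is the damping behaviour motivated in the text.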

II. Global term weight
A global term weight is calculated from the whole collection of training documents [78]. It tries to assign a discrimination value to each term and to emphasize discriminative terms. The most popular global term weighting schemes shown in Table 1, such as idf, bidf and pidf, are unsupervised because they do not use the category label information of the training documents. The main idea of inverse document frequency (idf) is to give high weights to rare terms and low weights to common terms. It is calculated as the logarithm of the ratio of the number of documents in a collection to the number of documents containing a specific term. bidf and pidf are two variants of idf. The premise behind idf, bidf and pidf is that a term that appears in fewer documents is more discriminative. This strategy works well in IR, but it is inappropriate for text categorization and text clustering because these tasks are designed to distinguish between categories, not documents [78].
In Table 1, a and b represent the numbers of training documents in group one, and c and d represent the numbers of training documents in group two, counted in relation to the term $t_i$. N represents the number of documents in the corpus, N = a + b + c + d.
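As a rough illustration of the global weights, the sketch below computes idf as described in the text, together with commonly cited forms of the probabilistic and BM25 variants. The formulas for `pidf` and `bidf`, and the toy corpus, are assumptions because Table 1 is not reproduced here.

```python
import math

def idf(N, df):
    """log of (documents in the collection / documents containing the term)."""
    return math.log(N / df)

def pidf(N, df):
    """Probabilistic idf (assumed form); undefined when df == N."""
    return math.log((N - df) / df)

def bidf(N, df):
    """BM25 idf (assumed form with 0.5 smoothing)."""
    return math.log((N - df + 0.5) / (df + 0.5))

corpus = [
    {"majid", "takes", "insulin"},
    {"majid", "has", "diabetes"},
    {"majid", "likes", "tea"},
    {"tea", "has", "caffeine"},
]
N = len(corpus)

def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in corpus)

for term in ("majid", "insulin"):
    df = doc_freq(term)
    print(term, df, round(idf(N, df), 3), round(pidf(N, df), 3), round(bidf(N, df), 3))
```

In all three variants the rare term 'insulin' outweighs the common term 'majid'; with these assumed forms, pidf and bidf even become negative for terms found in more than half of the documents.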
Finally, text data can be represented by considering the context or topic of text documents or segments. A document is characterized by its topics or by the keywords that stand out the most. Consequently, several topics can be associated with a single document, and documents are clustered according to the number of topics they share.

Latent Dirichlet Allocation (LDA)
LDA is a generative probabilistic model for text data. It is one of the most widely used methods in topic modelling and was developed in 2003 by [73]. Its fundamental concept is that texts are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The simplicity and effectiveness of LDA have led to its widespread use. LDA uses word probabilities to represent topics, and the words with the highest probabilities in each topic usually give a good idea of what the topic is about.
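LDA assumes that each text can be represented as a probability distribution over latent topics, with a Dirichlet prior shared across all texts. Each latent topic is in turn represented as a probability distribution over words, and the word distributions of the topics share a common Dirichlet prior. Given a corpus D consisting of M documents, with document d having $N_d$ words ($d \in \{1, \ldots, M\}$), LDA models D using the following generative process [73]:
1. For each topic, choose a word distribution $\phi \sim \mathrm{Dirichlet}(\beta)$.
2. For each document d, choose a topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
3. For each of the $N_d$ word positions n in document d, choose a topic $z_n \sim \mathrm{Multinomial}(\theta_d)$ and then choose a word $w_n$ from $\phi_{z_n}$.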
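The generative view of LDA can be sketched by sampling a small synthetic corpus. This is an illustration of the standard generative process only, not an inference procedure; the function names, the fixed document length and the symmetric prior values are assumptions.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(M, K, V, doc_len, alpha=0.5, beta=0.5, seed=0):
    """Sample M documents of doc_len word ids from K topics over V words."""
    rng = random.Random(seed)
    phi = [dirichlet([beta] * V, rng) for _ in range(K)]   # topic-word distributions
    corpus, thetas = [], []
    for _ in range(M):
        theta = dirichlet([alpha] * K, rng)                  # document-topic mixture
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(K), weights=theta)[0]      # topic of this position
            w = rng.choices(range(V), weights=phi[z])[0]     # word drawn from topic z
            words.append(w)
        corpus.append(words)
        thetas.append(theta)
    return corpus, thetas, phi

corpus, thetas, phi = generate_corpus(M=3, K=2, V=10, doc_len=8)
for words, theta in zip(corpus, thetas):
    print([round(p, 2) for p in theta], words)
```

Only the word ids would be observed in practice; the mixtures `thetas` and the topic-word distributions `phi` are the latent quantities that inference has to recover.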
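In the generative process described above, the words in the texts are the only observed variables, whereas $\phi$ and $\theta$ are latent variables and $\alpha$ and $\beta$ are hyperparameters. The probability of the observed data D, as shown in Figure 9, is calculated from the data corpus by using the following equation [80]:

$$p(D \mid \alpha, \beta) = \int \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{z_{dn}=1}^{K} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \phi_{z_{dn}})\, d\theta_d \, d\phi$$

where K is the number of topics, $\theta_d$ is the topic distribution of document d, $\phi_k$ is the word distribution of topic k, and $z_{dn}$ and $w_{dn}$ are the topic and the word at position n of document d.
Several methods have been proposed for parameter estimation, inference and training of LDA, such as Gibbs sampling [81].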

I. Gibbs sampling
Gibbs sampling is a simple yet powerful technique in statistical inference. It is a Markov chain Monte Carlo (MCMC) algorithm that produces samples from a joint distribution when only the conditional distribution of each variable can be computed efficiently. Many researchers have used this technique for LDA [82][83][84].
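The principle can be shown on a deliberately simple target rather than on LDA itself. The sketch below samples a standard bivariate normal with correlation `rho` using only its two conditional distributions; the target, the burn-in length and the function names are assumptions chosen for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)   # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)   # y | x ~ N(rho * x, 1 - rho^2)
        if i >= burn_in:
            samples.append((x, y))
    return samples

def correlation(samples):
    """Sample Pearson correlation of the (x, y) pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    cov = sum((x - mx) * (y - my) for x, y in samples)
    vx = sum((x - mx) ** 2 for x, _ in samples)
    vy = sum((y - my) ** 2 for _, y in samples)
    return cov / math.sqrt(vx * vy)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(round(correlation(samples), 2))  # close to the target correlation 0.8
```

For LDA the same idea is applied to the topic assignment of each word, which is resampled from its conditional distribution given all other assignments.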

Dirichlet Multinomial Mixture (DMM)
The other technique is based on model-level improvements, in which additional constraints are imposed on the assumptions of standard models to generate topics. Traditional models assume that each document is composed of several topics, whereas a short text document contains only a few words. The DMM model [66] therefore assumes that only one topic is covered in each document. Such constraints can be excessive: the number of relevant topics depends on the information in the individual texts, so simply imposing this restriction may introduce noise and make the technique less effective and less generic. A corpus is a collection of search results composed of D documents, where D denotes the number of documents in the corpus. Each document $\vec{d}$ includes a group of words $w_1, w_2, \ldots, w_{N_d}$ and is generated as follows:
1. A topic $z_n \in \{1, 2, \ldots, T\}$ is selected from multinomial($\alpha$), where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ represents the topic distribution in the corpus.
2. The word count $N_d$ is selected, and each word $w_d$ of d is independently selected from multinomial($\beta$), where $\beta = (\beta_{\alpha_1}, \beta_{\alpha_2}, \ldots, \beta_{\alpha_n})$ represents the word-topic distribution in the corpus.
A DMM-based method for STC was suggested by [66]. However, how to create an effective model remains unclear. Most of these methods are trained on BOW representations, which are shallow structures that cannot preserve semantic similarities [25]. In Equation (2), the probability of the observed data D is calculated and maximized to infer the latent variables and hyperparameters [66]. In this equation, $\alpha$ denotes the Dirichlet prior parameters of the topics, $\beta$ denotes the Dirichlet prior of the distribution of words over topics, and V is the vocabulary size. The frequency of word w in document d is $N_d^w$, and the number of words in d is $N_d$. For corpus-level topic distributions, the Dirichlet-multinomial pair is ($\alpha$, $\theta$).
A DMM output matrix has rows for documents and columns for topics. The value one is assigned to the cell with coordinates (i, j) if document $d_i$ belongs to topic $t_j$.
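The one-topic-per-document assumption and the output matrix can be sketched as follows. The toy parameters, the fixed document length and the function names are assumptions; the sketch only samples from the generative steps and does not perform the inference of [66].

```python
import random

def generate_dmm_corpus(alpha, beta, doc_len, n_docs, seed=0):
    """alpha: corpus-level topic probabilities; beta[k]: word probabilities of topic k."""
    rng = random.Random(seed)
    T, V = len(alpha), len(beta[0])
    docs, topics = [], []
    for _ in range(n_docs):
        z = rng.choices(range(T), weights=alpha)[0]            # a single topic per document
        words = rng.choices(range(V), weights=beta[z], k=doc_len)
        docs.append(words)
        topics.append(z)
    return docs, topics

def output_matrix(topics, T):
    """Rows are documents, columns are topics; cell (i, j) is 1 if d_i belongs to t_j."""
    return [[1 if z == j else 0 for j in range(T)] for z in topics]

alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0],    # topic 0 uses only words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]    # topic 1 uses only words 2 and 3
docs, topics = generate_dmm_corpus(alpha, beta, doc_len=5, n_docs=6)
matrix = output_matrix(topics, T=2)
for words, row in zip(docs, matrix):
    print(words, row)
```

Every row of the matrix contains exactly one 1, in contrast to LDA, where a document receives a full mixture over topics.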

Latent Semantic Analysis (LSA)
LSA is a text representation technique for modelling the conceptual relationships among documents based on the words they contain, which it captures as semantic information. Using semantic information is one of the methods for improving a text representation model [85]. The concept is founded on the premise that words that differ lexically but frequently appear in similar documents have similar meanings. LSA is a promising method for constructing a latent semantic structure in textual data and for identifying relevant documents that do not share common words. It also reduces the sizeable term matrix to a smaller one and provides a stable clustering space. LSA differs from standard NLP approaches because it uses no dictionaries, knowledge bases, grammars or syntactic parsers; it accepts as input only raw text that has been split into meaningful passages. LSA represents this text as a matrix in which each row corresponds to a word and each column to a passage. Each cell shows how often the word appears in the passage, as shown in Figure 11.
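The claim that LSA can relate documents with no common words can be checked on a toy example. LSA is normally computed with a truncated singular value decomposition; the dependency-free sketch below instead obtains the top right singular vectors by power iteration on the passage-by-passage Gram matrix. The five passages, the rank of 2 and all function names are assumptions for illustration.

```python
import math

docs = [
    {"car", "engine"},          # d0
    {"engine", "automobile"},   # d1
    {"automobile", "motor"},    # d2, shares no word with d0
    {"flower", "petal"},        # d3
    {"flower", "petal"},        # d4
]
n = len(docs)
# Gram matrix A^T A of the binary term-passage matrix A (shared-word counts).
gram = [[float(len(docs[i] & docs[j])) for j in range(n)] for i in range(n)]

def matvec(G, v):
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def top_eigenpairs(G, k, iters=500):
    """Power iteration with deflation on a symmetric positive semi-definite matrix."""
    G = [row[:] for row in G]
    size = len(G)
    pairs = []
    for _ in range(k):
        v = [1.0] * size
        for _ in range(iters):
            w = matvec(G, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(G, v)))
        pairs.append((lam, v))
        # Remove the component just found before searching for the next one.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(size)] for i in range(size)]
    return pairs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

pairs = top_eigenpairs(gram, k=2)
# Passage coordinates in the rank-2 latent space: singular value times singular vector entry.
latent = [[math.sqrt(lam) * v[j] for lam, v in pairs] for j in range(n)]
print(gram[0][2])                               # 0.0: d0 and d2 share no word
print(round(cosine(latent[0], latent[2]), 3))   # near 1 in the latent space
print(round(cosine(latent[0], latent[3]), 3))   # near 0: unrelated passage
```

Passages d0 and d2 become similar because both co-occur with the vocabulary of d1, which is the indirect association that the reduced space preserves.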

Word Embedding
Word embedding is a neural network representation learning approach that can capture syntactic and semantic similarities between words [86]. It aims to map the words in unlabelled text data to a continuous, low-dimensional space so as to capture the similarities between words [87], creating latent feature vectors that preserve their syntactic and semantic information. The quality of word representations relies on implicit relations between words in the corpus. The three common approaches to learning word embeddings are Word2Vec [86], Doc2Vec and GloVe [88]. Owing to vocabulary mismatch, the noisy nature of microblogging data and the low number of word co-occurrences in short text, the applicability of these pre-trained word embedding models to short text is limited [89]. Most word embedding strategies also learn only one vector for each word [90], although many words have multiple meanings. For example, the word 'apple' refers to a type of food in the statement 'I like eating apples' but to a technology company in the sentence 'We went to the Apple store yesterday'.

I. Word2Vec
Word2Vec was proposed by [86] as a collection of related models for generating word embeddings. It uses a 'shallow' neural network capable of quickly processing billions of word occurrences and producing syntactically and semantically meaningful word representations. The authors investigated two models for learning word embeddings: skip-gram and continuous bag of words (CBOW). The former takes a word as input and predicts its context words, whereas the latter predicts the target word from its context words [20].
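The difference between the two models is easiest to see in the training examples they consume. The sketch below only generates those examples and does not train a network; the window size and function names are assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """(centre word -> one context word) training pairs."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(list of context words -> centre word) training pairs."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, centre))
    return pairs

tokens = "we went to the apple store".split()
print(skipgram_pairs(tokens)[:3])  # [('we', 'went'), ('went', 'we'), ('went', 'to')]
print(cbow_pairs(tokens)[1])       # (['we', 'to'], 'went')
```

Skip-gram thus produces one example per (word, neighbour) pair, while CBOW produces one example per position with the whole context as input.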

II. Doc2Vec
Doc2Vec was proposed by [77] as a straightforward extension of Word2Vec [86] that extends the learning of embeddings from words to word sequences. It is agnostic to the granularity of the word sequence, which can be a word n-gram, sentence, paragraph or document. However, previous studies report that Doc2Vec produces sub-par performance compared with vector-averaging methods [46].

III. GloVe (Global Vectors for Word Representation)
GloVe is a log-bilinear regression model proposed by [88]. It attempts to resolve the disadvantages of global factorization approaches (e.g., latent semantic analysis [91]) and local context window approaches (e.g., the skip-gram model [73]) on word analogy and semantic relatedness tasks. GloVe's global vectors are trained via unsupervised learning on aggregated global word-word co-occurrence information from a corpus. Its goal is to factorize the log-count matrix and find word embeddings that meet this criterion [92].
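The input to GloVe is the word-word co-occurrence matrix, which can be sketched in a few lines. The window size, the toy sentences and the function names are assumptions; the 1/d distance weighting and the damping function with x_max = 100 and exponent 0.75 follow the settings commonly reported for GloVe, and no embedding is trained here.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Symmetric word-word counts; a pair d positions apart contributes 1/d."""
    X = defaultdict(float)
    for sent in sentences:
        toks = sent.lower().split()
        for i, w in enumerate(toks):
            for d in range(1, window + 1):
                if i + d < len(toks):
                    X[(w, toks[i + d])] += 1.0 / d
                    X[(toks[i + d], w)] += 1.0 / d
    return X

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x) that damps the influence of very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

X = cooccurrence(["ice is cold", "steam is hot"])
print(X[("ice", "is")], X[("ice", "cold")], X[("is", "hot")])  # 1.0 0.5 1.0
print(weight(X[("ice", "is")]), weight(250.0))
```

Training then fits word vectors whose dot products approximate the logarithm of these counts, with each pair weighted by `weight`.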

Pseudo-Relevance Feedback
One typical strategy to compensate for the sparsity of short texts is 'pseudo-relevance feedback', which enriches the original short text corpus with supplementary data from semantically related long texts. This can be accomplished by submitting the short texts to a search engine as queries, which returns a set of the most relevant results [48].
Although the pseudo-relevance feedback-based data augmentation strategy appears promising, its drawbacks should be noted. Such a process is inherently noisy, and some of the auxiliary material may be semantically unrelated to the original short texts. Similarly, unrelated or noisy auxiliary topics may have a negative impact. As a result, combining short texts with long texts or topics that are semantically unrelated to them may degrade performance on the short texts. The problem can become even more severe in unsupervised learning, where no label information is available to guide the selection of auxiliary data and auxiliary topics.
Another strategy is to combine short texts into large pseudo-documents and then use standard topic models to infer topics from these pseudo-documents [49,93]. This strategy is highly data dependent, so extending it to more generic forms, such as questions/answers and news headlines, is difficult. One of the main weaknesses of the current strategies is that a single short text may contain different topics and can therefore be related to more than one topic; the assumption that only one topic is addressed in each text is inappropriate for such short texts. Furthermore, most standard similarity measures depend heavily on the co-occurrence of words in two documents. As short texts often have no words in common, aggregating a large number of short texts into a small number of pseudo-documents is challenging [94].
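The aggregation strategy can be sketched when some grouping signal is available. The grouping key (a hashtag), the record layout and the function name are hypothetical; the surveyed methods may aggregate by other criteria, and the data dependence noted above arises precisely because such a key is not always present.

```python
from collections import defaultdict

def build_pseudo_documents(posts, key="hashtag"):
    """Concatenate all short texts that share the same value of `key`."""
    groups = defaultdict(list)
    for post in posts:
        groups[post[key]].append(post["text"])
    return {k: " ".join(texts) for k, texts in groups.items()}

posts = [
    {"hashtag": "#health", "text": "insulin prices rise"},
    {"hashtag": "#tech", "text": "new phone released"},
    {"hashtag": "#health", "text": "diabetes care tips"},
]
pseudo = build_pseudo_documents(posts)
print(pseudo["#health"])  # insulin prices rise diabetes care tips
```

A standard topic model would then be fitted to the values of `pseudo` instead of the individual posts.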

External Knowledge
One of these strategies is to use external knowledge as a source of enrichment. To address the data sparsity issue, [95] suggested using external knowledge to uncover hidden topics. The dual-LDA model proposed by [48] generates topics from short texts together with related long texts. Document expansion strategies usually expand feature vectors by adding relevant terms [48,70,96], commonly drawing on external knowledge sources such as Wikipedia [70], WordNet [96] and ontologies [48]. However, owing to semantic incoherence, short text from social media enriched with these static external sources provides insufficient information [45]. Given the dynamic nature of short text data on the web, comprehensive background information from an external knowledge source such as Wikipedia may not accurately capture the meaning of context-sensitive short texts. In addition, external knowledge such as Wikipedia may not always be available or may be too costly to use.
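Document expansion can be sketched with a tiny hand-made table of related terms. The table `RELATED_TERMS` is a hypothetical stand-in for a real resource such as WordNet, Wikipedia or an ontology, and Jaccard overlap is used only as a simple way to show the effect.

```python
# Hypothetical related-term table standing in for an external knowledge source.
RELATED_TERMS = {"insulin": {"diabetes"}, "diabetes": {"insulin"}}

def tokens(text):
    return set(text.lower().split())

def expand(token_set):
    """Add the related terms of every token to the token set."""
    expanded = set(token_set)
    for tok in token_set:
        expanded |= RELATED_TERMS.get(tok, set())
    return expanded

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1 = tokens("Majid is taking insulin")
s2 = tokens("Majid has diabetes")
print(round(jaccard(s1, s2), 3))          # 0.167, only 'majid' overlaps
print(jaccard(expand(s1), expand(s2)))    # 0.5 after expansion
```

The gain depends entirely on the coverage and quality of the external source, which is the limitation raised in the text for dynamic social media content.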

I. WordNet
WordNet is a large lexical database. Nouns, verbs, adverbs and adjectives are grouped into sets of cognitive synonyms (synsets), each of which expresses a distinct concept. Conceptual-semantic and linguistic relationships link synsets together [96]. WordNet is helpful for computational linguistics and NLP because of its structure.
WordNet resembles a thesaurus because it groups words according to their meanings. Nevertheless, there are some key distinctions. Firstly, WordNet connects not just word forms (strings of letters) but also precise meanings of words. Therefore, words that are close to one another in the network are semantically disambiguated. Secondly, WordNet labels the semantic relationships between words, whereas thesaurus groupings follow no defined pattern other than meaning similarity. One issue with using WordNet is that it does not cover the most recent topics.

II. Wikipedia
Wikipedia is a free online encyclopaedia in which experts and volunteers describe various concepts. It contains a substantial knowledge base covering history, art, society and science. It is an ideal knowledge base both for readers and scholars seeking information and for modern data-mining algorithms looking for supplementary data to increase performance. Each Wikipedia article contains a comprehensive explanation of its entity from multiple perspectives. Furthermore, the content of these articles is organized logically [70]. This benefit may make retrieving entity information easier for autonomous learning systems.
The many links in the corpus can indicate semantic relationships between connected entities, which helps automatic concept recognizers find related data. Wikipedia is used to improve short text quality: the short text is enriched with information from the Wikipedia database, and clustering is then performed on the enriched representation. Related concepts are extracted and computed using a combination of statistical laws and categories. Then, sets of semantically related concepts are built to extend the feature vector of a short text and supply its semantic features.
However, non-deep-learning measures have several disadvantages. Table 2 illustrates and summarizes the advantages and disadvantages of non-deep-learning measures.

| Measure | Advantages | Disadvantages |
|---|---|---|
| LSA | It can distinguish between synonyms and polysemy and takes semantic relationships among the concepts into account to find relevant documents. | It disregards the sequence of the words in a sentence. |
| Word2Vec | It can process semantic information quickly. | It ignores the order of words in a sentence. |
| Doc2Vec | It analyses word order and trains on texts of different lengths. | It ignores polysemy and synonyms of words. |
| GloVe | It preserves the regular linear pattern between words and is faster in training. | It cannot retain the memory relationship between words. |

Deep Learning Measures
Deep learning is currently the leading technology for supervised machine learning, particularly for numerical data classification and clustering. However, its use in unsupervised learning has been more limited and recent [97]. Recently, deep learning has been used for unsupervised tasks, including topic modelling and clustering [98]. In many cases, the training goals are still the same, and deep learning appears to be most helpful as a feature extractor, for example a convolutional neural network (CNN) [99]. The process of transforming input data into a collection of features is known as feature extraction [99]. Feature extraction improves the efficacy of learning algorithms by transforming the training data and augmenting them with extra features.
Deep learning is one of several strategies utilized for short text. Recently, short text has grown on social media platforms, where people can share information and assemble societal opinions through short text conversation. Because short text data comprise sparse word co-occurrences, it is challenging for unsupervised text mining to uncover categories, concepts or subjects within the data [89]. During the last few years, deep learning methods have shown much power to extract features automatically from raw data [100]. In general, a deep learning model is constructed of many layers of neural networks. Each layer comprises numerous basic signal-processing units known as neurons. The basic structure of neurons is depicted in Figure 12. A neuron takes an input signal and produces an output signal according to what it has learned. Raw information is gradually processed as it passes through multiple interconnected layers of neurons. The structure of a multi-neuron neural network is illustrated in Figure 13. The artificial neural network uses a massive number of basic units to handle input in a way similar to the human brain. Recently, deep text representations have been learned using supervised deep learning techniques [25], depending on shallow-to-deep auto-encoders utilizing recurrent neural networks (RNNs) [101], CNNs [101], long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and recursive tree LSTM [102]. Nevertheless, in many applications, a dense representation should be discovered in an unsupervised fashion to identify clusters, concepts or topics in the short text.
Two elementary procedures, convolution and pooling, form the basis of deep neural network models. In text data, the convolution operation multiplies a sentence vector by a weight matrix element by element and sums the results, so that each element contributes to the whole. Convolution operations are performed to extract features. Thanks to pooling operations, features with a negative impact can be ignored, and only feature values with a significant effect on the task at hand are taken into account. The most common pooling operation is called 'max pooling'. It involves picking the highest value in a particular filter space [103]. In this section, the literature on measuring the similarity of short text with deep learning measures is extensively reviewed. The most common models are listed below.
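The convolution and max-pooling operations described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the toy embeddings, the bigram filter weights and the function names `conv1d_over_words` and `max_pool` are assumptions made for the example and do not come from any of the reviewed models.

```python
def conv1d_over_words(sentence, filt):
    """Slide a filter spanning len(filt) consecutive words over a sentence.

    sentence: list of word vectors (each a list of floats)
    filt:     list of weight vectors, one per word position in the window
    Returns one feature value per window position.
    """
    h = len(filt)
    features = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        # Element-wise product of the window and the filter, summed to a scalar
        value = sum(x * w
                    for word_vec, filt_vec in zip(window, filt)
                    for x, w in zip(word_vec, filt_vec))
        features.append(value)
    return features


def max_pool(features):
    """Max pooling: keep only the strongest response of the filter."""
    return max(features)


# Toy 4-word sentence with 2-dimensional embeddings (made-up numbers)
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical bigram filter: one weight vector per word in the window
filt = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv1d_over_words(sentence, filt)  # one value per bigram window
pooled = max_pool(feature_map)                   # single feature for this filter
```

In a real model many such filters are learned, and the pooled values of all filters are concatenated into the feature vector of the text.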

Convolutional Neural Networks
The CNN is a popular deep learning approach, and several techniques use CNNs as feature extractors. Recently, the CNN has improved performance in many NLP applications, including relation classification [104], phrase modelling [105] and other traditional NLP tasks [106,107]. This is because the CNN is the most popular non-biased model and applies convolutional filters to capture local features. With the widespread use of word embeddings, which can represent words in a distributed space, a reliable feature function that extracts higher-level features from constituent words or n-grams became necessary. Short text representation learning has also been implemented using neural-network-based techniques. The self-taught CNN (STC²) was proposed by Xu et al. [25] to learn implicit features from short texts for short text representation. The model needs two different raw representations of short text: binary code representations obtained by dimensionality reduction on term-frequency vectors, and word embedding representations pre-trained on large external corpora. The word embedding representations are the input to the CNN, and the binary codes are used as labels to train it. After the CNN is trained, the deep representations of the short texts are taken from its last hidden layer. However, short texts are usually sparse, so the deep features learned by neural-network-based techniques may not represent them accurately.


Recurrent Neural Networks
Recently, neural networks such as recursive neural networks (RecNN) [108] and RNNs [109] have demonstrated superior performance in creating text representations via word embedding. RecNN has high time complexity in building the textual tree, whereas the RNN, which uses the hidden layer computed at the last word to represent the text, is a biased model in which later words are more prominent than earlier words [110]. By contrast, non-biased models extract the learnt representation of a text from all of its words using non-dominant learning weights [25]. In recent years, RNNs have seen widespread adoption in research on sequential data types, such as text, audio and video. However, when the gap between relevant inputs is wide, the RNN cannot learn important information from the input data. The LSTM handles this problem of long-term dependencies well by introducing gate functions into the cell structure [111].


Long Short-Term Memory
LSTM networks are a subclass of RNNs. RNNs can remember previous words and thus capture the context, which is crucial for processing text input. However, RNNs suffer from the long-term dependency problem because not all of the past content is relevant to the following word or phrase. LSTMs were developed to counteract this issue. Owing to the gates in LSTMs, the network can pick and choose which pieces of information to remember [111]. The LSTM framework is widely used for determining how semantically similar two sections of text are [112]. To predict the similarity of sentences, Tien et al. [113] utilized a network that combines an LSTM and a CNN to create sentence embeddings from pre-trained word embeddings. Tai et al. [114] suggested an LSTM design to measure the semantic similarity between two given sentences. A Tree-LSTM is trained over the parse tree of each sentence to provide sentence representations. A neural network is then trained on these sentence representations and uses the absolute distance and the angle between the vectors to determine similarity.

Bidirectional Long Short-Term Memory
Bidirectional RNNs are simply two independent RNNs combined. This structure enables the network to hold both backward and forward sequence information at each time step. Bi-LSTMs use two LSTMs that run in parallel in order to fully capture the context [102]. By running the inputs in two directions, one from the past into the future and the other from the future into the past, the network preserves information from both the past and the future simultaneously, which makes this method superior to the more common unidirectional one. In NLP, there are occasions when knowing what comes next is just as important as knowing what came before. To estimate semantic similarity, He and Lin [115] presented a hybrid architecture based on Bi-LSTM and CNN. The approach takes advantage of Bi-LSTM to perform context modelling. The hidden states of the two LSTMs are used to generate vectors that are then compared using a comparison unit, resulting in a model of pairwise word interactions.
Mueller and Thyagarajan presented MaLSTM [72], a Siamese deep neural network that uses LSTM networks with tied weights as sub-modules to learn representations for sentences. MaLSTM receives sentence pairs, initially expressed as word embedding vectors, as inputs. MaLSTM is trained with a loss function based on the Manhattan distance to learn new representations for sentences.
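To make the role of the Manhattan distance concrete, the following minimal sketch turns the L1 distance between two sentence vectors into a similarity score in (0, 1] by taking exp(-distance). The exponential form is stated here as an assumption about how such a Manhattan-based score is commonly defined, and the two vectors are made-up stand-ins for the final LSTM hidden states; no LSTM is actually run.

```python
import math


def manhattan_similarity(h_a, h_b):
    """Similarity exp(-||h_a - h_b||_1) between two sentence vectors.

    h_a, h_b: hypothetical final hidden states of the two (tied-weight) LSTMs.
    Returns 1.0 for identical vectors and approaches 0 as they move apart.
    """
    l1_distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    return math.exp(-l1_distance)


# Made-up hidden states for two sentences
h_first = [0.2, -0.1, 0.7]
h_second = [0.1, -0.1, 0.9]
score = manhattan_similarity(h_first, h_second)
```

During training, a score of this kind would be compared with a human-annotated similarity label, and the error would be propagated back through both copies of the LSTM.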

Bidirectional Encoder Representations from Transformers (BERT)
BERT stands for bidirectional encoder representations from transformers. It is a computational method that allows machine learning models to be trained on textual data. BERT learns contextual embeddings for words as a result of the training procedure [116]. Following the computationally expensive pretraining, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks. In contrast with earlier language representation models [117], the goal of BERT is to pretrain deep bidirectional representations from unlabelled text, and it does so by jointly conditioning on the left and right context in all layers [118]. For this reason, the pre-trained BERT model can be fine-tuned with a single extra output layer to provide state-of-the-art models for various tasks, including Q&A and language inference, without significant task-specific architecture alterations.
Many different types of NLP tasks have been improved using language model pretraining [119]. Paraphrasing [120] and natural language inference [121] are examples of sentence-level tasks that aim to predict relationships between sentences by analysing them holistically. Named entity recognition and Q&A are examples of token-level tasks that require models to produce fine-grained output at the token level [122], as shown in Figure 14. Table 3 illustrates and summarizes the literature on deep learning similarity measures.

Dimensionality Reduction
Dimensionality reduction is commonly used in machine learning and big data analytics because it aids in analysing large, high-dimensional datasets. It can benefit tasks such as data clustering and classification [126]. Recently, dimensionality reduction methods have emerged as a promising avenue for improving clustering accuracy [127]. Text sequences in term-based vector models have many features. As a result, the memory consumption and time complexity of these methods are prohibitively expensive. To address this issue, many researchers use dimensionality reduction to reduce the size of the feature space [101]. Existing dimensionality reduction algorithms are discussed in depth in this section.

Principal Component Analysis (PCA)
PCA is the most common technique in data analysis and dimensionality reduction, and almost all scientific disciplines use it. PCA seeks to find the most meaningful basis for re-expressing a given dataset [101]. This entails identifying new uncorrelated variables that successively maximize variance so as to retain as much variation as possible [128]. This new basis is expected to reveal hidden structures in the dataset and filter out noise [129]. PCA has numerous applications, including dimensionality reduction, feature extraction, data compression and visualization.
A dataset with observations on p numerical variables for each of n entities or individuals is the standard context for PCA as an exploratory data analysis tool. These data values define p n-dimensional vectors $x_1, \ldots, x_p$ or, equivalently, an $n \times p$ data matrix $X$ whose jth column is the vector $x_j$ of observations on the jth variable. We are looking for a linear combination of the columns of matrix $X$ with maximum variance. The linear combination can be written as Equation (3) [128]:

$$\sum_{j=1}^{p} a_j x_j = Xa \qquad (3)$$

where $a$ represents a vector of constants $a_1, a_2, \ldots, a_p$. The variance of this linear combination is given by Equation (4):

$$\mathrm{var}(Xa) = a^{T} S a \qquad (4)$$

where $S$ represents the sample covariance matrix. The goal is to find the linear combination with the largest variance. For a unit-norm vector ($a^{T} a = 1$), this is equivalent to maximizing Equation (5):

$$a^{T} S a - \lambda \left( a^{T} a - 1 \right) \qquad (5)$$

where $\lambda$ is a Lagrange multiplier.

Independent Component Analysis (ICA)
ICA is a statistical modelling method that expresses observed data as a linear transformation [130]. A statistical 'latent variables' model can be used to rigorously define ICA [131]. Assume we observe n linear mixtures $x_1, \ldots, x_n$ of n independent components $s_1, \ldots, s_n$. The $x_j$ for all j can be computed as in Equation (6):

$$x_j = a_{j1} s_1 + a_{j2} s_2 + \cdots + a_{jn} s_n \qquad (6)$$

In matrix notation this reads $x = As$, where $A$ is the matrix of mixing coefficients $a_{ji}$. Sometimes we require the columns of matrix $A$; denoting them by $a_j$, the model can also be written as Equation (7):

$$x = \sum_{j=1}^{n} a_j s_j \qquad (7)$$
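As a minimal sketch of how PCA finds its first new variable: the direction of maximum variance is the leading eigenvector of the sample covariance matrix, which the code below approximates by power iteration. The toy data, the function names and the choice of power iteration instead of a full eigendecomposition are assumptions made for illustration only.

```python
import math


def covariance_matrix(X):
    """Sample covariance matrix S (p x p) of an n x p data matrix X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in X]
    return [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(p)]
            for a in range(p)]


def first_principal_component(S, iterations=200):
    """Approximate the unit vector a that maximizes a^T S a by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(S)
    a = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-symmetric start
    for _ in range(iterations):
        Sa = [sum(S[i][j] * a[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in Sa))
        a = [v / norm for v in Sa]
    variance = sum(a[i] * S[i][j] * a[j] for i in range(p) for j in range(p))
    return a, variance


# Toy data: most of the spread lies along the first variable
X = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
S = covariance_matrix(X)
direction, explained_variance = first_principal_component(S)
```

Projecting each row of the centred data onto `direction` gives the first principal component scores; further components would be found in the same way after removing the variance already explained.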

Linear Discriminant Analysis (LDA)
LDA is a standard data-mining algorithm for supervised learning that is commonly used for dimensionality reduction [132]. It determines the projection hyperplane that yields the lowest within-class variance and the largest distance between the projected means of the classes [133]. LDA is beneficial when the within-class frequencies are unequal and their performances have been assessed using randomly generated test data.
Let $X_j \in \mathbb{R}^{d \times n_j}$ denote the d-dimensional samples of the j-th class, and let $y_i \in \{1, 2, \ldots, k\}$ denote the class label of the i-th sample $x_i$, where n is the number of documents, d is the data dimensionality, k is the number of classes and $n_j$ is the number of samples in the j-th class. Equations (8)-(10) calculate the number of samples in each class. In discriminant analysis [134], three scatter matrices are defined as the within-class, between-class and total scatter matrices, as shown in Equations (11)-(13):

$$S_w = \sum_{j=1}^{k} \sum_{i:\, y_i = j} \left( x_i - c^{(j)} \right) \left( x_i - c^{(j)} \right)^{T} \qquad (11)$$

$$S_b = \sum_{j=1}^{k} n_j \left( c^{(j)} - c \right) \left( c^{(j)} - c \right)^{T} \qquad (12)$$

$$S_t = \sum_{i=1}^{n} \left( x_i - c \right) \left( x_i - c \right)^{T} \qquad (13)$$

where $c^{(j)}$ is the centroid of the j-th class and $c$ is the global centroid. It follows from the definition that $S_t = S_b + S_w$. Furthermore, $\mathrm{trace}(S_w)$ measures the within-class cohesion, and $\mathrm{trace}(S_b)$ measures the between-class separation.
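The identity $S_t = S_b + S_w$ can be checked numerically. The sketch below computes the three scatter matrices for a toy labelled dataset, assuming the common unnormalized definitions (sums of outer products, with the between-class term weighted by class size); the data and all helper names are illustrative assumptions.

```python
def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def add(A, B, weight=1.0):
    """Return A + weight * B without modifying either matrix."""
    return [[a + weight * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scatter_matrices(X, y):
    """Within-class, between-class and total scatter matrices (unnormalized)."""
    d = len(X[0])
    zeros = [[0.0] * d for _ in range(d)]
    c = centroid(X)  # global centroid
    S_w, S_b, S_t = zeros, zeros, zeros
    for label in sorted(set(y)):
        members = [x for x, yi in zip(X, y) if yi == label]
        c_j = centroid(members)  # class centroid
        for x in members:
            diff = [xi - ci for xi, ci in zip(x, c_j)]
            S_w = add(S_w, outer(diff, diff))
        diff = [a - b for a, b in zip(c_j, c)]
        S_b = add(S_b, outer(diff, diff), weight=len(members))
    for x in X:
        diff = [xi - ci for xi, ci in zip(x, c)]
        S_t = add(S_t, outer(diff, diff))
    return S_w, S_b, S_t


# Toy data: two classes of two 2-dimensional samples each
X = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]]
y = [1, 1, 2, 2]
S_w, S_b, S_t = scatter_matrices(X, y)
```

For this toy data the between-class scatter dominates the within-class scatter, which is the situation LDA tries to produce in the projected space.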

T-Distributed Stochastic Neighbour Embedding (t-SNE)
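t-SNE is a method for reducing dimensionality. Dimensionality reduction is significant in extracting the essential features from a complex set of expression profiles from various samples. This method is commonly used for low-dimensional feature space visualization [135]. This entails mapping the high-dimensional state vectors onto a low-dimensional space (typically a plane) while maintaining critical information about the relatedness of the component samples. SNE converts high-dimensional Euclidean distances between data points into conditional probabilities representing similarities [136]. The conditional probability $p_{j|i}$ is computed as in Equation (14):

$$p_{j|i} = \frac{\exp\left( -\lVert x_i - x_j \rVert^{2} / 2\sigma_i^{2} \right)}{\sum_{k \neq i} \exp\left( -\lVert x_i - x_k \rVert^{2} / 2\sigma_i^{2} \right)} \qquad (14)$$

where $\sigma_i^{2}$ is the variance of the Gaussian that is centred on datapoint $x_i$. The similarity of map point $y_j$ to map point $y_i$ is calculated as in Equation (15):

$$q_{j|i} = \frac{\exp\left( -\lVert y_i - y_j \rVert^{2} \right)}{\sum_{k \neq i} \exp\left( -\lVert y_i - y_k \rVert^{2} \right)} \qquad (15)$$

SNE uses a gradient descent method to minimize the sum of Kullback-Leibler divergences over all data points. The cost function C is given by Equation (16):

$$C = \sum_{i} KL\left( P_i \,\Vert\, Q_i \right) = \sum_{i} \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \qquad (16)$$

SNE performs a binary search for the value of $\sigma_i$ that produces a $P_i$ with the user-specified fixed perplexity. The perplexity is defined as Equation (17):

$$\mathrm{Perp}(P_i) = 2^{H(P_i)} \qquad (17)$$

where $H(P_i)$ is the Shannon entropy of $P_i$ measured in bits, as shown in Equation (18):

$$H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i} \qquad (18)$$

The minimization of the cost function is performed using a gradient descent method, as illustrated in Equation (19):

$$\frac{\partial C}{\partial y_i} = 2 \sum_{j} \left( p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j} \right) \left( y_i - y_j \right) \qquad (19)$$

The gradient can be interpreted as the resultant force of springs between map point $y_i$ and all other map points $y_j$. The force exerted by the spring between $y_i$ and $y_j$ is proportional to its length and to its stiffness, which is the mismatch $p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}$ between the pairwise similarities of the data points and the map points.
In addition, to determine the changes in the coordinates of the map points at each iteration of the gradient search, the current gradient is added to an exponentially decaying sum of previous gradients. The gradient update with a momentum term is given by Equation (20):

$$Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t) \left( Y^{(t-1)} - Y^{(t-2)} \right) \qquad (20)$$

where $Y^{(t)}$ represents the solution at iteration t, $\eta$ represents the learning rate and $\alpha(t)$ represents the momentum at iteration t.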
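The first stage of SNE, computing the conditional probabilities for one data point and tuning the width of its Gaussian to hit a target perplexity, can be sketched as follows. The toy points, the search bounds and the function names are assumptions for illustration; the sketch relies on the fact that perplexity grows as the Gaussian gets wider, and it does not guard against numerical underflow for very small widths.

```python
import math


def conditional_probabilities(X, i, sigma):
    """p_{j|i} for all j under a Gaussian of width sigma centred on X[i]."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    weights = [0.0 if j == i
               else math.exp(-squared_distance(X[i], X[j]) / (2 * sigma ** 2))
               for j in range(len(X))]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(p):
    """2 to the power of the Shannon entropy (in bits) of distribution p."""
    entropy = -sum(pj * math.log2(pj) for pj in p if pj > 0)
    return 2 ** entropy


def sigma_for_perplexity(X, i, target, low=1e-3, high=1e3, steps=60):
    """Binary search for the sigma whose distribution has the target perplexity."""
    for _ in range(steps):
        mid = (low + high) / 2
        if perplexity(conditional_probabilities(X, i, mid)) > target:
            high = mid  # distribution too flat: shrink the Gaussian
        else:
            low = mid   # distribution too peaked: widen the Gaussian
    return (low + high) / 2


# Toy 2-dimensional points; point 0 has neighbours at increasing distances
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
sigma_0 = sigma_for_perplexity(X, 0, target=2.0)
p_0 = conditional_probabilities(X, 0, sigma_0)
```

A perplexity of 2 means that point 0 behaves as if it had roughly two effective neighbours, so most of the probability mass in `p_0` falls on its two closest points.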

Uniform Manifold Approximation and Projection (UMAP)
UMAP is an embedding method for dimensionality reduction and a recently proposed manifold learning method that represents the local structure adequately while better incorporating the global structure [137]. UMAP scales well with massive datasets. UMAP uses a high-dimensional graph representing the data points to generate the fuzzy topological structure. The created high-dimensional graph is a weighted graph, with edge weights indicating the probability that two points are related. UMAP computes the similarity between high-dimensional data points using an exponential probability distribution, as given in Equation (21) [126]:

$$p_{i|j} = \exp\left( -\frac{d(x_i, x_j) - \rho_i}{\sigma_i} \right) \qquad (21)$$

where $d(x_i, x_j)$ represents the distance between the i-th and j-th data points, $\rho_i$ is the distance between the i-th data point and its first nearest neighbour and $\sigma_i$ is a scale parameter for the i-th data point. Because the weight of the graph edge from node i to node j may differ from the weight from node j to node i, UMAP employs a symmetrization of the high-dimensional probabilities, as shown in Equation (22):

$$p_{ij} = p_{i|j} + p_{j|i} - p_{i|j}\, p_{j|i} \qquad (22)$$

UMAP requires the number of nearest neighbours k to be specified for the graph, where k is related to the probabilities by Equation (23):

$$k = 2^{\sum_{j} p_{i|j}} \qquad (23)$$

UMAP uses a probability measure for modelling distances in the low-dimensional space, as shown in Equation (24):

$$q_{ij} = \left( 1 + a \lVert y_i - y_j \rVert^{2b} \right)^{-1} \qquad (24)$$

For default UMAP, a ≈ 1.93 and b ≈ 0.79. UMAP employs binary cross-entropy (CE) as a cost function due to its ability to capture the global data structure, as illustrated in Equation (25):

$$CE(P, Q) = \sum_{i} \sum_{j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}} + \left( 1 - p_{ij} \right) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right] \qquad (25)$$

where P represents the probability similarities of the high-dimensional data points and Q represents those of the low-dimensional data points.
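The three building blocks of the UMAP similarities described above (the directed high-dimensional weight, its symmetrization and the low-dimensional similarity) can be written as small functions. This is a sketch under stated assumptions: `sigma_i` is treated as a given scale parameter, the clipping with `max` is an added safeguard that changes nothing when `rho_i` really is the nearest-neighbour distance, and the default values of `a` and `b` are the approximate ones quoted in the text.

```python
import math


def directed_weight(dist, rho_i, sigma_i):
    """High-dimensional weight exp(-(d - rho_i) / sigma_i) for one edge.

    dist:    distance between points i and j
    rho_i:   distance from point i to its nearest neighbour
    sigma_i: scale parameter of point i (assumed to be given)
    """
    return math.exp(-max(0.0, dist - rho_i) / sigma_i)


def symmetrize(p_i_given_j, p_j_given_i):
    """Combine the two directed weights into one undirected edge weight."""
    return p_i_given_j + p_j_given_i - p_i_given_j * p_j_given_i


def low_dim_similarity(y_i, y_j, a=1.93, b=0.79):
    """Low-dimensional similarity 1 / (1 + a * ||y_i - y_j||^(2b))."""
    squared_distance = sum((u - v) ** 2 for u, v in zip(y_i, y_j))
    return 1.0 / (1.0 + a * squared_distance ** b)


# Toy values: the nearest neighbour always receives a directed weight of 1
w_nearest = directed_weight(dist=1.0, rho_i=1.0, sigma_i=0.5)
w_farther = directed_weight(dist=2.0, rho_i=1.0, sigma_i=0.5)
edge = symmetrize(w_farther, 0.5)
q = low_dim_similarity([0.0, 0.0], [1.0, 0.0])
```

Subtracting `rho_i` guarantees that every point is connected to at least one neighbour with full weight, which is what keeps the graph locally connected.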

Similarity and Distance Measure
A similarity measure determines the similarity between diverse items, such as words, sentences, documents or concepts. The goal of computing a similarity measure between two items is to determine their degree of relevance by matching items that are conceptually similar but not necessarily lexicographically similar [138].
Generally, the similarity measure is an essential component of any clustering technique. This is because it makes it possible to compare two objects, group the most similar elements and entities together and determine the shortest distance between them [139,140]. Distance and similarity have an inverse relationship, so the two notions are often used interchangeably. In general, similarity/distance measures are computed using the vector representations of data items. Document similarity is vital in text processing [141]. It quantifies the degree to which two text objects are alike. Moreover, similarity and distance measures are used as a retrieval module in information retrieval. Similarity measures include cosine, Jaccard and inner products; distance measures include Euclidean distance and KL divergence [142]. An analysis of the literature shows that several similarity metrics have been developed. However, no single similarity metric appears to be the most effective in every setting [143].

Cosine Similarity
Cosine similarity is one of the primary measures utilized to compute the similarity between two items. It is applied to documents in several applications, including text mining, IR and text clustering [144]. Given two documents represented by the term vectors $\vec{t_a}$ and $\vec{t_b}$, their cosine similarity is defined in Equation (26):

$$SIM_C\left( \vec{t_a}, \vec{t_b} \right) = \frac{\vec{t_a} \cdot \vec{t_b}}{\lvert \vec{t_a} \rvert \times \lvert \vec{t_b} \rvert} \qquad (26)$$

where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$. Each dimension represents a term together with its weight in the document, which is non-negative. Therefore, the cosine similarity ranges between 0 and 1.
An important property of cosine similarity is its independence of document length. For instance, if two identical copies of a document d are combined to create a new pseudo-document d′, the cosine similarity between d and d′ is equal to 1, which means that the two documents are regarded as identical.
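The length-independence property can be illustrated directly. In this minimal sketch, documents are represented by made-up term-weight vectors (raw counts are assumed as weights), and the pseudo-document made of two copies of d simply has every weight doubled.

```python
import math


def cosine_similarity(t_a, t_b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(t_a, t_b))
    norm_a = math.sqrt(sum(x * x for x in t_a))
    norm_b = math.sqrt(sum(y * y for y in t_b))
    return dot / (norm_a * norm_b)


# Hypothetical term counts of document d over a 4-term vocabulary
d = [3.0, 0.0, 1.0, 2.0]
# Two concatenated copies of d: every term count doubles
d_prime = [2 * w for w in d]
# A document that shares no terms with d
unrelated = [0.0, 5.0, 0.0, 0.0]

same_direction = cosine_similarity(d, d_prime)    # approximately 1
no_overlap = cosine_similarity(d, unrelated)      # exactly 0
```

The same property also means that cosine similarity cannot distinguish a short text from a long text with the same term proportions, which may or may not be desirable depending on the application.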

Jaccard Coefficient
The Jaccard coefficient, also known as the Tanimoto coefficient, is a common statistical coefficient in NLP [144]. It measures how similar two objects are by dividing the size of their intersection by the size of their union. For text documents, the Jaccard coefficient compares the total weight of the shared terms with the total weight of the terms that are present in either of the two documents but are not shared. Equation (27) gives the formal definition:

$$SIM_J\left( \vec{t_a}, \vec{t_b} \right) = \frac{\vec{t_a} \cdot \vec{t_b}}{\lvert \vec{t_a} \rvert^{2} + \lvert \vec{t_b} \rvert^{2} - \vec{t_a} \cdot \vec{t_b}} \qquad (27)$$

The Jaccard coefficient is a similarity measure with a range of [0, 1]; when $\vec{t_a}$ and $\vec{t_b}$ are mutually exclusive, the coefficient is 0, and when they are equivalent, it is 1.
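A minimal sketch of the weighted (Tanimoto) form of the Jaccard coefficient, using made-up term-weight vectors, shows the two boundary cases mentioned above. The function name and the example vectors are illustrative assumptions.

```python
def jaccard_coefficient(t_a, t_b):
    """Extended Jaccard (Tanimoto) coefficient of two term-weight vectors."""
    dot = sum(x * y for x, y in zip(t_a, t_b))
    squared_norm_a = sum(x * x for x in t_a)
    squared_norm_b = sum(y * y for y in t_b)
    return dot / (squared_norm_a + squared_norm_b - dot)


# Hypothetical term weights over a 4-term vocabulary
doc = [3.0, 0.0, 1.0, 2.0]
disjoint = [0.0, 4.0, 0.0, 0.0]   # shares no terms with doc
partial = [3.0, 1.0, 0.0, 0.0]    # shares only the first term with doc

identical_score = jaccard_coefficient(doc, doc)       # equivalent vectors
disjoint_score = jaccard_coefficient(doc, disjoint)   # mutually exclusive
partial_score = jaccard_coefficient(doc, partial)     # somewhere in between
```

For binary vectors (weights 0 or 1) this formula reduces to the familiar set version, intersection size divided by union size.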

Euclidean Distance
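Euclidean distance, also known as the Euclidean metric, is a frequently used distance measure in clustering algorithms, including text clustering, and is the default distance measure in the K-means algorithm [144]. Given two documents $d_a$ and $d_b$ represented by their term vectors $\vec{t_a}$ and $\vec{t_b}$, the Euclidean distance between the two documents is defined as

$$D_E\left( \vec{t_a}, \vec{t_b} \right) = \left( \sum_{t=1}^{m} \left| w_{t,a} - w_{t,b} \right|^{2} \right)^{1/2}$$

where $T = \{t_1, \ldots, t_m\}$ is the term set and $w_{t,a}$ and $w_{t,b}$ are the weights of term t in $d_a$ and $d_b$, respectively.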
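As a small sketch, the Euclidean distance between two documents is the square root of the summed squared differences of their term weights. The vectors below are made-up examples, and the second comparison is included only to show that, unlike cosine similarity, this distance is sensitive to document length.

```python
import math


def euclidean_distance(t_a, t_b):
    """Euclidean distance between two term-weight vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(t_a, t_b)))


# Hypothetical term weights over a 2-term vocabulary
doc_a = [0.0, 0.0]
doc_b = [3.0, 4.0]
distance = euclidean_distance(doc_a, doc_b)

# Doubling every weight (two copies of the same text) moves the vector away
doubled = [2 * w for w in doc_b]
length_effect = euclidean_distance(doc_b, doubled)
```

Because of this length sensitivity, term vectors are often normalized before Euclidean distance is used for text.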

Clustering Algorithms
Clustering methods divide a collection of documents into groups or subsets. Clustering algorithms seek to generate clusters that are internally coherent yet distinct from one another. In other words, documents inside one cluster should be as similar as possible, whereas documents in different clusters should be as diverse as possible. The clustering method splits a large number of text messages into a number of meaningful clusters. Clustering has become a standard strategy in information retrieval and text mining [145]. At the same time, text clustering faces various challenges. On the one hand, a text vector is high-dimensional, typically with thousands or even tens of thousands of dimensions. On the other hand, the text vector is generally sparse, making it challenging to identify the cluster centre. Clustering has become an essential means of unsupervised machine learning, attracting many researchers [146,147].
In general, there are three types of clustering algorithms: hierarchical-based clustering, partition-based clustering and density-based clustering. Because clustering algorithms have been extensively studied in the literature [148,149], we briefly discuss only a few traditional techniques for each category.

Hierarchical Algorithms
Hierarchical algorithms create a hierarchy of clusters. Combined with suitable similarity measures such as cosine similarity, the Jaccard similarity coefficient and the Dice coefficient, hierarchical clustering algorithms have become the standard method for document clustering [150].
Hierarchical clustering is the most popular text clustering technique that produces nested groups in the form of a hierarchy. To use this strategy, the category must be hierarchical. Generally, the relevant objects will be updated if the category changes. The output of a hierarchical clustering method is a single category tree. A sample of hierarchical clustering is shown in Figure 15; each class node has several child nodes, and sibling nodes are divisions of their parent node. The similarity between two groups can be defined as the similarity of their two most similar objects (single link) or of their two most distinct objects (complete link) [150,151]. The single-link definition can form elongated clusters. For clusters of comparable sizes (in volume), the complete-link approach is preferable.
As a result, this method allows for data classification at various granularities. In general, hierarchical clustering is accurate. However, at each step the similarities between all pairs of classes must be compared to choose the two most similar classes, which is comparatively slow. Another problem of hierarchical clustering is that once a merge or split step is finished, it cannot be undone, making it impossible to correct a mistake [147]. The hierarchical clustering techniques may be classified into two groups based on how the category tree is formed: the top-down split technique and the bottom-up integration technique [151].
Bottom-up (merge-up) hierarchical clustering starts with single items. Each item begins as a solitary category, and the algorithm then repeatedly merges the two or more most similar categories. The loop ends once the stop criterion is fulfilled (generally a parameter K, where K is the number of clusters). The bottom-up hierarchical clustering method can be viewed as constructing a tree that holds the class hierarchy and the degree of similarity between all classes. Hierarchical clustering has the following advantages: it can be used with clusters of any shape and with any similarity or distance measure, and it has an inherently adaptable clustering granularity. One drawback of hierarchical clustering is the ambiguous termination condition: deciding when the clustering is complete usually relies on human experience. Often, the tree cannot be rebuilt to provide better results, and the faults produced cannot be corrected [147,151].
The top-down (split-down) hierarchical clustering technique begins with a single category that contains all items and splits it into multiple categories. The standard method is to construct a minimal spanning tree on the similarity graph and then, at each step, choose the edge of the spanning tree with the lowest similarity (or the largest distance) and eliminate it. Removing an edge creates a new category. The clustering may stop once the lowest similarity reaches a certain threshold. The top-down technique often involves more computation than the bottom-up method, so it is applied less often. In the top-down approach, a cluster is split into two categories at each step, and this process continues until the data are divided into k clusters.
Generally, both hierarchical clustering approaches are simple and adaptable enough to tackle multi-granularity clustering issues. They can handle a wide variety of attributes and can employ many kinds of distance or similarity measurements. The bottom-up and top-down hierarchical clustering approaches share these limitations: determining the algorithm's termination criteria and choosing the merge or split points are challenging. These choices are crucial because after a set of items has been combined or divided, the subsequent phase operates on the newly created clusters, and this procedure cannot be undone; the objects cannot be moved between the clusters. Furthermore, these clustering algorithms are difficult to scale. If poor decisions are made during the merge or split steps, they may reduce the quality of the clustering results [147,151].
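
The bottom-up procedure described above can be sketched in a few lines. The following is a minimal pure-Python illustration, not an implementation from the surveyed works: it assumes cosine distance between term-count vectors, stops when K clusters remain, and supports single and complete linkage. The function names (`cosine_distance`, `agglomerative`) and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative(vectors, k, linkage="complete"):
    """Merge the two closest clusters until only k clusters remain."""
    # Complete link scores a pair of clusters by their two most distinct
    # members, single link by their two most similar members.
    reduce_fn = max if linkage == "complete" else min
    clusters = [[i] for i in range(len(vectors))]  # every item starts alone
    while len(clusters) > max(k, 1):
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = reduce_fn(
                cosine_distance(vectors[i], vectors[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # the merge is never undone
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy term-count vectors (hypothetical): two documents per topic.
docs = [
    [1, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
]
print(agglomerative(docs, k=2))
```

Every pass compares all pairs of clusters, which reflects the slowness noted above, and a merge is final, so an early poor merge propagates to the result.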

Partitioned Algorithms
Partitioned clustering is a common technique that divides the data into K distinct sets of homogeneous points by selecting an appropriate scoring function and minimizing the distance between each point and the centroid of its cluster [152,153]. The evaluation function is the most critical aspect of partitioned clustering; the other elements of the method are much the same across algorithms. Partitioned clustering is suitable for identifying clusters in small-scale databases (each cluster class regarded as one cluster). The K-means algorithm is one of the most common flat clustering algorithms and one of the best-known partitional clustering methods. James MacQueen coined the term 'K-means' in 1967 [154,155], although Stuart Lloyd (1957) was the first to propose the standard method, as a pulse-code modulation approach. The K-means algorithm takes the number of clusters K as an input parameter and splits the dataset into K clusters. First, K objects are selected as initial cluster centres. Then, the distance between each object and each cluster centre is computed, each object is assigned to the nearest cluster and the cluster averages are updated. This process continues until the criterion function is satisfied [156].
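
The steps above can be sketched as follows. This is a minimal pure-Python illustration of the standard iteration, assuming squared Euclidean distance, randomly sampled initial centres and a stop when the assignments no longer change; the function names and the `max_iter` and `seed` parameters are illustrative choices, not taken from the surveyed works.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, centres) after iterating to a fixed point."""
    rng = random.Random(seed)
    # Step 1: select K objects as the initial cluster centres.
    centres = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centre (k * n distances).
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stopping criterion: no object changed cluster
        assignment = new_assignment
        # Step 3: update each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = mean(members)
    return assignment, centres


# Hypothetical toy data: two well-separated groups of three points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centres = kmeans(points, k=2)
print(assignment)
```

The cluster indices themselves are arbitrary; only the grouping they induce is meaningful.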
The K-means algorithm has a time complexity of O(knI), where k is the number of clusters, n is the number of objects and I is the number of iterations (which, depending on the stopping condition, can typically be bounded by a small number). In each iteration, the cluster centroids and the kn similarities between all objects and all clusters must be calculated [157].
The K-means algorithm requires specifying the number of clusters (K) as input, and therefore, determining the optimal number is critical. However, the algorithm can be run with varying numbers of clusters, and the clustering with the best result (for example, measured by the objective function) is retained. A conventional partitioning technique allows for cluster merging and splitting, and the conclusion should theoretically have the most significant number of clusters [24]. K-medoids [151] is a partition clustering algorithm with significant similarities to the K-means clustering algorithm. Nonetheless, K-medoids differs from K-means because the centre of a cluster is an actual data object. K-means requires calculating the mean vector of the data objects in a cluster; thus, it can only be applied to a Euclidean feature space. The K-means++ algorithm [158] improves on the original K-means algorithm by using a randomized seeding approach to attain higher accuracy and lower complexity.
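
As an illustration of randomized seeding, the sketch below follows the commonly described K-means++ rule: the first centre is drawn uniformly, and each further centre is drawn with probability proportional to its squared distance from the nearest centre already chosen. It is a minimal sketch under that assumption; the function name `kmeans_pp_seeds` and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, seed=0):
    """Choose k initial centres with distance-weighted random sampling."""
    rng = random.Random(seed)
    centres = [list(rng.choice(points))]  # first centre: uniform draw
    while len(centres) < k:
        # Weight of each point: squared distance to its nearest chosen centre.
        # Already chosen points get weight 0 and cannot be drawn again.
        # Assumption: at least k distinct points exist, otherwise all
        # weights become zero and random.choices raises ValueError.
        weights = [min(squared_distance(p, c) for c in centres) for p in points]
        centres.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centres


# Hypothetical toy data: two well-separated groups of three points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
seeds = kmeans_pp_seeds(points, k=2)
print(seeds)
```

Because distant points carry most of the weight, the second seed is very likely to fall in the other group, which is the intuition behind the improved starting positions.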

Density-Based Clustering Methods
The spatial density of the data objects is used to find clusters in density-based clustering algorithms [149]. The goal of density-based partitioning is to identify groups of dense data points that cluster together in Euclidean space. A cluster is defined as a densely connected component that grows in any direction in which the density leads. One advantage of density-based algorithms over partition-based clustering approaches is that they can detect clusters with denser and more natural shapes. Furthermore, these approaches can find outliers in a dataset in a natural way [159]. The difference between the two types of clustering algorithms is shown in Figure 16. The standard method for density-based clustering is DBSCAN [26]. It uses two parameters, MinPts and ε, to define the following notions:

• Core data object: a data object that has more than MinPts neighbours in its neighbourhood.

• Neighbourhood: the neighbourhood of a data object x is denoted by N(x) = {y ∈ X | d(x, y) < ε}.

• Density reachability: two data objects, x and y, are density-reachable if one can be reached from the other via a chain of core data objects.
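
The three notions above can be turned into a short procedure. The following pure-Python sketch is illustrative only: it assumes Euclidean distance, counts a point as a member of its own neighbourhood, and follows the wording above in treating a point as core when it has more than MinPts neighbours (other formulations use "at least"). The names `region_query` and `dbscan` are hypothetical.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points closer than eps to points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) < eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) <= min_pts:  # not a core object
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:  # grow the cluster through density-reachable objects
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) > min_pts:  # j is itself a core object
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

Unlike K-means, the number of clusters is not an input here; it follows from ε and MinPts, and the isolated point is reported as an outlier rather than forced into a cluster.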

Performance Evaluation Measure
This step provides an overview of the performance measures used to evaluate the proposed model. These performance measures involve comparing the clusters created by the proposed model with the proper clusters. The assessment of clustering results is often called cluster validation. Cluster validity can be employed to identify the number of clusters and determine the corresponding best partition. Many suggestions have been made for measuring the similarity between two clusterings [160,161]. These measures may be used to evaluate the effectiveness of various data clustering techniques applied to a given dataset. When assessing the quality of a clustering approach, these measurements are typically related to the different kinds of criteria being considered. The term 'internal assessment' refers to assessing the clustering outcome using only the data that were clustered [162].
These methods usually give the best score to the algorithm that produces clusters with a high degree of similarity inside a cluster and a low degree of similarity between clusters. In external assessment, clustering outcomes are evaluated based on data not utilized for clustering, such as known class labels and external benchmarks. Notably, these external benchmarks are composed of a group of items that have already been categorized, and typically, these sets are created by human specialists. These assessment techniques gauge how well the clustering complies with the established benchmark classes [163,164]. We review several performance evaluation measures used to assess clustering results as follows:

Homogeneity (H)
It calculates the ratio of data points in each predicted cluster that belong to the same ground-truth class, as shown in Equation (29).

Completeness (C)
It measures how well the predicted clusters align with the ground-truth classes, that is, the extent to which all data points of a given class are assigned to the same cluster, as illustrated in Equation (30).
where C is the ground-truth clustering, and H(C|K) is the conditional entropy of the class distribution given the clustering results obtained by the employed clustering method.

V-Measure (V)
It calculates the harmonic mean of completeness and homogeneity by using Equation (31), which reflects the balance between completeness and homogeneity [165].
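
As a hedged illustration of these three measures, the sketch below assumes the standard entropy-based definitions, $h = 1 - H(C|K)/H(C)$ for homogeneity, $c = 1 - H(K|C)/H(K)$ for completeness and $V = 2hc/(h + c)$, where C denotes the ground-truth classes and K the predicted clusters. The notation of the original equations may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g])
        for (_, g), c in joint.items()
    )


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    # Convention assumed here: a score is 1 when the normalizer is zero.
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Hypothetical example: every point placed in its own cluster.
truth = [0, 0, 1, 1]
over_split = [0, 1, 2, 3]
print(homogeneity_completeness_v(truth, over_split))
```

Placing every point in its own cluster yields perfectly homogeneous clusters but splits each class in two, so homogeneity is 1 while completeness drops to 0.5; the V-measure penalizes this imbalance.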

Adjusted Rand Index Score (ARI)
It is the corrected-for-chance version of the Rand index that views the clustering process as a sequence of decisions to quantify the similarity between the achieved clustering results and the ground truth, as shown in Equation (32).
where $n_{ij} = |X_i \cap Y_j|$, and X and Y refer to the two groupings $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$. Additionally, n refers to the total number of elements, $a_i = \sum_{j=1}^{s} n_{ij}$ and $b_j = \sum_{i=1}^{r} n_{ij}$.
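
Using the contingency counts defined above, the ARI can be computed as sketched below. The code assumes the standard pair-counting form, (index − expected index) / (maximum index − expected index), built from binomial coefficients of the counts; this is offered as an illustration rather than as the exact form of the original equation, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """ARI between two labelings of the same n elements."""
    n = len(labels_x)
    if n < 2:
        return 1.0  # assumption: trivially identical when no pairs exist
    n_ij = Counter(zip(labels_x, labels_y))  # contingency table entries
    a = Counter(labels_x)                    # row sums a_i
    b = Counter(labels_y)                    # column sums b_j
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)    # index expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_ij - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # grouping that cuts across the classes
```

The score depends only on which elements are grouped together, so renaming the labels does not change it; values near zero indicate chance-level agreement, and negative values indicate agreement below chance.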

Normalized Mutual Information (NMI)
It is a metric for validating clustering methods that quantifies the amount of statistical information shared between the ground truth and the predicted cluster assignments, irrespective of the absolute cluster label values. Clustering may be viewed as a sequence of pair-wise decisions in which two elements are placed in the same cluster if they have similarities [166]. It is calculated as shown in Equation (33), where MI(X, Y) is the mutual information between X and Y and H is the entropy.
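
A minimal sketch of NMI is given below. Several normalizations are in use; the code assumes the arithmetic-mean form, $\mathrm{NMI}(X, Y) = 2\,\mathrm{MI}(X, Y) / (H(X) + H(Y))$, which may differ from the normalizer in the original equation. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information MI(X, Y) in nats from two label sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def nmi(x, y):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 1.0  # both labelings consist of a single cluster
    return 2 * mutual_information(x, y) / (h_x + h_y)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent labelings
```

Because only the joint distribution of labels is used, permuting the cluster labels leaves the score unchanged, which is the label-independence property mentioned above.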

Adjusted Mutual Information (AMI)
AMI adjusts mutual information for chance, in the same spirit as the adjusted Rand index. Mutual information quantifies the amount of information shared by two partitions [167]. It is computed as illustrated in Equation (34), where MI, E and H denote the mutual information between clusters, the expected value and the entropy, respectively.
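
For orientation, one widely used form of the AMI is shown below. It is a sketch of the general structure rather than a reproduction of the original equation: the normalizer in the denominator is assumed here to be the arithmetic mean of the two entropies, whereas other variants use the maximum, the minimum or the geometric mean.

```latex
\mathrm{AMI}(X, Y) =
  \frac{\mathrm{MI}(X, Y) - E\left[\mathrm{MI}(X, Y)\right]}
       {\tfrac{1}{2}\left(H(X) + H(Y)\right) - E\left[\mathrm{MI}(X, Y)\right]}
```

Here $E[\mathrm{MI}(X, Y)]$ is the mutual information expected between two random partitions with the same cluster sizes as X and Y, so the score is 1 for identical partitions and close to 0 for chance-level agreement.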

Purity (P')
It measures the degree to which each cluster contains text data from a single class. The purity of a given cluster j of size $n_j$ is defined as shown in Equation (35), where $n_{ji}$ is the number of documents of class i assigned to cluster j. $p_j$ is defined as the proportion of the whole cluster size that is made up of the dominant class of documents allocated to that cluster. The weighted sum of the individual cluster purities yields the overall purity of the clustering solution, as illustrated in Equation (36).
N denotes the total number of documents in the document collection. The higher the purity value, the better the clustering solution.
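
The definition above can be sketched as follows, assuming $p_j = \frac{1}{n_j}\max_i n_{ji}$ for each cluster and an overall purity of $\sum_j \frac{n_j}{N}\, p_j$. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(classes, clusters):
    """Return (overall purity, per-cluster purity dictionary)."""
    members = defaultdict(list)  # cluster id -> class labels of its documents
    for cls, clu in zip(classes, clusters):
        members[clu].append(cls)
    n = len(classes)
    # p_j: share of the dominant class within cluster j.
    per_cluster = {
        clu: max(Counter(labels).values()) / len(labels)
        for clu, labels in members.items()
    }
    # Overall purity: cluster purities weighted by relative cluster size.
    overall = sum(len(members[clu]) / n * p for clu, p in per_cluster.items())
    return overall, per_cluster


# Hypothetical example: six documents, three classes, two clusters.
doc_classes = ["a", "a", "a", "b", "b", "c"]
doc_clusters = [0, 0, 1, 1, 1, 1]
print(purity(doc_classes, doc_clusters))
```

Purity rises as the number of clusters grows and reaches 1 when every document forms its own cluster, so it is best read alongside a measure such as completeness or NMI.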

F-Measure
It is another popular external validation metric, also known as 'clustering accuracy'. The calculation of this accuracy was influenced by the F-measure, an information retrieval statistic. To compare clusters, a clear and simple technique is to compute the precision (P), recall (R) and F-measure, which are commonly used in the IR literature to assess retrieval success.
P is calculated using our clustering notation as shown in Equation (37) [165], and R is calculated as in Equation (38). Then, the F-measure value of the cluster is the harmonic mean of P and R, as shown in Equation (39):
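
A minimal sketch of this computation follows. It assumes the usual clustering convention, in which precision and recall are computed for each pair of class i and cluster j as $P = n_{ij}/n_j$ and $R = n_{ij}/n_i$, their harmonic mean gives $F(i, j)$, and the overall score weights the best F value of each class by the class size. The aggregation step and the function name are assumptions, not taken from the original equations.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Overall F-measure: size-weighted best F(i, j) over all classes."""
    n = len(classes)
    n_ij = Counter(zip(classes, clusters))  # documents of class i in cluster j
    class_sizes = Counter(classes)          # n_i
    cluster_sizes = Counter(clusters)       # n_j
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            overlap = n_ij.get((cls, clu), 0)
            if overlap == 0:
                continue
            precision = overlap / n_j
            recall = overlap / n_i
            f = 2 * precision * recall / (precision + recall)  # harmonic mean
            best = max(best, f)
        total += n_i / n * best
    return total


# Hypothetical example: six documents, three classes, two clusters.
doc_classes = ["a", "a", "a", "b", "b", "c"]
doc_clusters = [0, 0, 1, 1, 1, 1]
print(clustering_f_measure(doc_classes, doc_clusters))
```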

Challenges of Short Text Clustering
Short texts raise several issues, including a lack of information because each document contains only a few words [1]. Short texts are used in various applications, including microblogs, Facebook, Twitter, Instagram, mobile messages and news comments. These texts are usually about 200 characters long, which is very short [168]. For instance, Twitter limits each tweet to 280 characters [16,169,170], and Instagram sets a maximum caption length of 2200 characters [17]. A short mobile message is limited to 70 characters. To be precise, short texts exhibit the following problems:

1. Lack of information: A short text has only a few words, leading to a lack of information and poor document representation. Each short text does not include sufficient information on word co-occurrence, and most texts are likely created for only one topic [171].

2. Sparsity: The length of a short text is limited. A short text can represent a wide range of topics, and each user has a unique word choice and writing style [172]. A given topic has a wide range of content, so determining its features is difficult.

3. High dimensionality: Representing short text using standard text representation methods, such as TF-IDF vectors or BOW [27], leads to high-dimensional features that are less distinctive for measuring distance. In addition, the computational time required is extensive [18,28,29].

4. Informal writing and misspelling: Short text is used in many applications, such as comments on microblogs, which contain noise, many misspellings and a particular language style [47]. In other words, users of social media platforms such as Twitter tend to use informal, straightforward and simple words to share their opinions and ideas. As an illustration, many people on Twitter may write '4you' rather than 'for you' when posting tweets. In addition, users may create new abbreviations and acronyms to simplify the language: 'Good9t' and 'how r u' are widespread on social networks. Furthermore, online questions and search queries do not use the grammar seen in official documents.
According to [173], the lack of information and sparsity considerably impact short text clustering performance. Typical clustering algorithms cannot be applied directly to short texts because word counts vary widely across short texts and each post contains only a limited number of words. For example, the accuracy of the traditional K-means algorithm [24] is lower when grouping short texts than when grouping longer texts [25]. This issue complicates feature space extraction from short text for text clustering.
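
The sparsity and lack of word co-occurrence described above can be made concrete with a small bag-of-words experiment. The sketch below is illustrative only: the three example messages are invented, tokenization is a plain whitespace split, and the helper names (`bow_vectors`, `cosine`) are hypothetical.

```python
import math


def bow_vectors(texts):
    """Build a vocabulary and one term-count vector per text."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in texts:
        v = [0] * len(vocab)
        for w in t.lower().split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented short messages: the first two share a topic but no words.
messages = [
    "great phone battery",
    "awesome mobile charge lasts",
    "traffic jam downtown again",
]
vocab, vectors = bow_vectors(messages)
nonzero = sum(1 for v in vectors for x in v if x)
density = nonzero / (len(vectors) * len(vocab))

print(len(vocab))                       # vocabulary size for only three texts
print(round(density, 3))                # share of non-zero matrix entries
print(cosine(vectors[0], vectors[1]))   # similarity of the two same-topic texts
```

Even in this tiny example, the vocabulary grows with almost every new word, two thirds of the matrix entries are zero, and the two messages about the same topic have zero cosine similarity because they share no terms. Distance-based algorithms such as K-means therefore receive little usable signal from raw BOW or TF-IDF vectors of short texts.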

Conclusions
STC is a complex problem, as web users and social media applications produce an increasing number of short texts containing only a few words. Sparsity, high dimensionality, lack of information and noise in data are common problems in STC. Finding and developing suitable clustering algorithms have become crucial issues. With a better understanding of the current text representation techniques and how to use them successfully, we can improve the efficiency of the existing STC algorithms.
Our study summarizes the published literature that focuses on STC. The summary presents the applications of STC. We provide an overview of STC and describe its various stages in detail. We present the approaches used in short text representation, their pros and cons and the impacts of applying different methods to short texts. In addition, we explain the essential deep learning methods used with text. Several methods perform well in some studies but poorly in others, such as TF-IDF vectors and BOW, which lead to sparse and high-dimensional feature vectors that are less distinctive for measuring distance. Further research can address related issues in short text representation and avoid poor clustering accuracy.
We see several promising research directions in the field of STC, focusing on the following aspects. Problems with low performance for text representation can be solved using multi-representation and feature ranking. These two strategies enhance the quality of text representation by extracting more information from the short text while retaining only the significant features. In addition, dimensionality reduction is an essential step in STC to deal with time and memory complexity. Of note, short text representation is a broad area, which makes short text problems a promising field of research.

Figure 1. Components for text-data clustering.

Figure 3. Example of output of tokenization.
Figure 5 displays the word document sample after removing stop words.

Figure 5. Sample text after stop word removal.

Figure 7. Main taxonomies for short text representation.

Figure 8. BOW with two text documents represented as binary vectors.
Figure 10 describes the DMM model D. The DMM models work according to the following process:

Figure 12. Basic structure of a neural network.

Figure 13. Structure of a multi-neural network.

Figure 16. The difference between density-based clustering and partition-based clustering methods.

Table 1. Local and global term weighting schemes.

Table 2. Advantages and disadvantages of non-deep-learning measures.

Table 3. Analysis of the studies on the deep learning similarity measures.