Hidden Variable Models in Text Classification and Sentiment Analysis



Introduction
In this fast-paced world of technological advances, one of the most significant contributing factors has been the emergence of various digital data forms, opening opportunities in different fields to gather helpful information. Every day, massive amounts of digital data are stored in digital data archives. The same observation holds for the enormous quantity of textual data available on the Internet. Therefore, it is critical to develop effective and scalable statistical models to extract hidden knowledge from such rich data [1].
One of the main challenges in the statistical analysis of textual data is capturing and representing its complexity. Different approaches have been applied to deal with this problem. Furthermore, due to the rapid development of information technology, vast quantities of scientific documents are now freely available to be mined. Thus, the analysis and mining of scientific documents have been very active research areas for many years.
Data projection and clustering are crucial for document analysis: projection aims at creating low-dimensional, meaningful data representations, while clustering groups similar data patterns [2,3]. Traditionally, these methods are studied separately, but they intersect in many applications [3]. K-means clustering, though widely used for creating compact cluster representations, does not fully capture document semantics. This gap has led to the adoption of machine learning and deep learning for text mining challenges, including text classification [4], summarization [5], segmentation [6], topic modeling [7], and sentiment analysis [8].
In this paper, we will focus on topic modeling aspects. Topic models are generally classified into two categories: those based on matrix decomposition, like singular value decomposition (SVD), and generative models [9]. The matrix decomposition approach, such as probabilistic latent semantic analysis (PLSA) [10,11], analyzes text via mining and requires a deep understanding of the corpus structure. PLSA, also known as probabilistic latent semantic indexing (pLSI) [11], represents documents as a mix of topics by performing matrix decomposition on the term-document matrix and is effective in identifying relevant words for each topic. In contrast, the generative approach to topic modeling focuses on the context of words across the entire document corpus. These models use latent variable models, and a document is treated as a combination of various topics, each represented by a random vector of words [3].
Meanwhile, the research by [12] indicates that while the probabilistic latent semantic indexing (pLSI) model offers some insights, it falls short in clustering and as a generative model due to its inability to generalize to new documents. To address these limitations, latent Dirichlet allocation (LDA) [12] was introduced, enhancing pLSI with a Dirichlet distribution for topic mixtures. LDA stands out as a more effective generative model, though it still lacks robust clustering capabilities [3]. The integration of clustering and projection into a single framework has been a recent focus in this field, recognizing the need to combine these two approaches [13,14].
The LDA model [12] was proposed to address these shortcomings and has proved to be an efficient and scalable data processing method [15,16].
The main issue with current text analysis models is their failure to clearly define a probability model encompassing hidden variables and assumptions [11,17-19]. To address this, variational Expectation Maximization (EM) has been utilized, notably in multinomial PCA (MPCA), which links topics to latent mixture proportions in a probabilistic matrix factorization framework [19,20]. Extensions of LDA, like its hierarchical [21] and online [22] versions, have been developed, although they lack the integration of Dirichlet priors in modeling. Researchers have explored alternative models using conjugate priors and methods like Gibbs sampling and Markov chain Monte Carlo (MCMC) [23], which, despite their effectiveness, require longer convergence times compared to the variational Bayes approach.
In this paper, we introduce two novel models, GDMPCA and BLMPCA, that significantly improve text classification and sentiment analysis by combining generalized Dirichlet (GD) and Beta-Liouville (BL) distributions for a more in-depth understanding of text data complexities [16,24,25]. Both models employ variational Bayesian inference and collapsed Gibbs sampling for efficient and scalable computational performance, which is critical for handling large datasets.
The generalized Dirichlet (GD) distribution, introduced in [26], exhibits a more flexible covariance structure than its Dirichlet counterpart. Similarly, the Beta-Liouville (BL) distribution, enriched with additional parameters, offers improved adjustments for data spread and modeling efficiency. Our contribution was validated through a rigorous empirical evaluation on real-world datasets, which demonstrated our models' superior accuracy and adaptability. This work represents a significant step forward in text analysis methodologies, bridging theoretical innovation with practical application, with experimental results demonstrating the relationships between these models.
The structure of the rest of this paper is as follows. In Section 2, we cover the related work. Section 3 introduces the extension of MPCA with generalized Dirichlet and Beta-Liouville distributions, with all the details about the parameter estimation. Section 4 is devoted to the discussion of the experimental results. Finally, we conclude our work in Section 6.

Related Work
In this section, we delve into the vast literature on topic modeling approaches. The foundation of this field is built upon traditional topic modeling techniques [10,11], with significant contributions from topic-class modeling [27-29] and the nuanced exploration of global and local document features [30,31].
Innovative strides have been made with the introduction of a two-stage topic extraction model for bibliometric data analysis, employing word embeddings and clustering for a more refined topic analysis [32]. This approach provides a nuanced lens with which to view the thematic undercurrents of scholarly communication.
The landscape of sentiment analysis is similarly evolving, with breakthroughs like a term-weighted neural language model paired with a stacked bidirectional LSTM (long short-term memory) framework, enhancing the detection of subtle sentiments like sarcasm in text [33]. Such advancements offer deeper insights into the complexities of language and its sentiments.
Cross-modal sentiment analysis also takes center stage with deep learning techniques, as seen in works that identify emotions from facial expressions [34]. These studies, which utilize convolutional neural networks and Inception-V3 transfer learning [35], pave the way for multimodal sentiment analysis, potentially influencing strategies for textual sentiment analysis.
A hybrid deep learning method has been introduced for analyzing sentiment polarities and knowledge graph representations, particularly focusing on health-related social media data, like tweets on monkeypox [36]. This underscores the importance of versatile and dynamic models in interpreting sentiment from real-time data streams.
Collectively, these contemporary works highlight the expansive applicability and dynamic nature of deep learning across various domains and data types. Their inclusion in our review underlines the potential for future cross-disciplinary research, expanding the scope of sentiment analysis to include both text and image data.
Alongside these emerging approaches, well-established techniques such as principal component analysis (PCA) and its text retrieval counterpart, latent semantic indexing [37], continue to be pivotal. Probabilistic latent semantic indexing (pLSI) [11] and latent Dirichlet allocation (LDA) [12] further enrich the discussion on discrete data and topic modeling. Non-negative matrix factorization (NMF) [17] has also demonstrated effectiveness, emphasizing the need for models that can simultaneously handle clustering and projection. In addressing a gap in the literature, a multinomial PCA model has been proposed to offer probabilistic interpretations of the relationships between documents, clusters, and factors [19].
Our focus on the MPCA model and its extensions aims to consolidate these disparate strands of research, presenting a comprehensive framework for topic modeling that accounts for both clustering and projection, reflecting the ongoing dialogue within the research community on these topics.

Multinomial PCA
Probabilistic approaches to dimensionality reduction generally hypothesize that each observation x_i corresponds to a hidden variable, referred to as a latent variable θ_i. This latent variable exists within a subspace of dimension K. Typically, the relationship involves a linear mapping (β) within the latent space coupled with a probabilistic mechanism.
In the probabilistic PCA (pPCA) framework, as detailed in [38], it is posited that each latent variable originates from a standard Gaussian distribution N_K(0_K, I_K).
A Gaussian assumption is also employed for the conditional distribution of the observations, where (β, µ) are the model parameters and σ^2 is the variance that is learned using maximum likelihood inference. The Gaussian assumption is suitable for real-valued data, yet it is less applicable to non-negative count data. Addressing this, [19] introduced a variant of pPCA in which the latent variables are modeled as a discrete probability distribution, specifically using a Dirichlet prior, m ∼ Dir(α), where α = (α_1, . . ., α_K) ≥ 0. The probabilistic observation function is then assumed to be multinomial. The variables m and w are assumed to be hidden parameters for each document. For the parameter estimation of MPCA, the variable Ω is first estimated using the Dirichlet prior on m with parameters α [19]; the likelihood model for MPCA is given in [20]. In the MPCA model, it is assumed that each observation x_i can be broken down into a probabilistic mixture of K topics that represent the whole corpus. Then, m indicates the observation's mixture weights in the latent space, and Ω is a global parameter that encapsulates all the information at the corpus level.
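The MPCA generative story described above can be sketched in a few lines. The function name, the toy parameters, and the NumPy sampling calls below are our own illustrative choices under that story, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, Omega, n_words):
    """Sample one document under the MPCA generative story.

    alpha   : (K,) Dirichlet hyperparameters for the topic mixture m
    Omega   : (K, V) topic-by-word probability matrix (rows sum to 1)
    n_words : document length
    Returns a (V,) vector of word counts.
    """
    m = rng.dirichlet(alpha)        # latent mixture weights, m ~ Dir(alpha)
    word_probs = m @ Omega          # document-level word distribution
    return rng.multinomial(n_words, word_probs)

# toy corpus: K = 2 topics over a V = 4 word vocabulary (hypothetical values)
alpha = np.array([1.0, 1.0])
Omega = np.array([[0.70, 0.20, 0.05, 0.05],
                  [0.05, 0.05, 0.20, 0.70]])
counts = generate_document(alpha, Omega, n_words=100)
print(counts.sum())   # 100: every token is assigned to some vocabulary word
```

The document-level character of MPCA is visible here: a single mixture vector m is drawn per document, and all of its words are sampled from the blended distribution m @ Omega.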
As a result, when the hidden variables have a Dirichlet prior, the following update formula is derived; it converges to a local maximum of log p(Ω, α, m|r), where the leading factor is the normalizing constant of the Dirichlet, and r is the total row-wise number of words in the document representation with the k-th component [19]. Equations (8) and (9) give the parameters for the multinomial and the Dirichlet, respectively.
According to the exponential family definition (Appendix A), Equation (9) rewrites α in terms of its dual representation. Minka's approach is used to derive α, where n_k is the number of times that the outcome was k [39].

Connection between MPCA and LDA

The multinomial PCA model is closely connected to LDA [12] and forms the foundation of several topic models.
In text analysis, an observation typically refers to a document represented by a sequence of tokens or words, denoted as w_i = (w_{in}), n = 1, . . ., L_i. Each word w_{in} within a document i is initially linked to a topic, which is specified by a vector z_{in} drawn from a Multinomial(1, β_k) distribution. At the word level, marginalizing over z_{in} yields a distribution similar to Equation (3). Furthermore, the distinction between LDA and MPCA is that LDA is a word-level model, whereas MPCA is a document-level model. Since GDMPCA and BLMPCA are new variations of MPCA, both new models are assumed to be document-level in the following proposed approaches.
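The word-level (LDA-style) story can be contrasted with the document-level sketch above: here every token receives its own topic indicator z_{in} before the word is drawn. The names and toy parameters are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_document_lda(alpha, beta, n_words):
    """Word-level (LDA-style) generation: each token gets its own topic z_in.

    alpha : (K,) Dirichlet hyperparameters
    beta  : (K, V) per-topic word distributions (rows sum to 1)
    Returns the per-token topic assignments and word ids.
    """
    m = rng.dirichlet(alpha)                              # document topic proportions
    topics = rng.choice(len(alpha), size=n_words, p=m)    # z_in ~ Multinomial(1, m)
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return topics, words

# hypothetical 2-topic, 3-word-vocabulary example
alpha = np.array([0.5, 0.5])
beta = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.1, 0.9]])
topics, words = generate_document_lda(alpha, beta, n_words=50)
print(len(words))   # 50 tokens, each carrying an explicit topic assignment
```

Marginalizing the per-token topics recovers the same word distribution as the document-level draw, which is the equivalence the text points to via Equation (3).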

Proposed Models
In this section, we present two models, generalized Dirichlet Multinomial Principal Component Analysis (GDMPCA) and Beta-Liouville Multinomial Principal Component Analysis (BLMPCA), designed to advance text classification and sentiment analysis. At the core of our approaches is the integration of the generalized Dirichlet and Beta-Liouville distributions, respectively, into the PCA framework. This integration is pivotal, as it allows for a more nuanced representation of text data, capturing the inherent sparsity and thematic structures more effectively than traditional methods.
The GDMPCA model leverages the flexibility of the generalized Dirichlet distribution to model the variability and co-occurrence of terms within documents, enhancing the model's ability to discern subtle thematic differences. On the other hand, the BLMPCA model utilizes the Beta-Liouville distribution to precisely capture the polytopic nature of texts, facilitating a deeper understanding of sentiment and thematic distributions. Both models employ variational Bayesian inference, offering a robust mathematical framework that significantly improves computational efficiency and scalability. This approach not only aids in handling large datasets with ease but also ensures that the models remain computationally viable without sacrificing accuracy.
To elucidate the architecture of our proposed models, we delve into their algorithmic underpinnings, detailing the iterative processes that underlie the variational Bayesian inference technique. This includes a comprehensive discussion of the optimization strategies employed to enhance convergence rates and ensure the stability of the models across varied datasets. Moreover, we provide a comparative analysis, drawing parallels and highlighting distinctions between our models and existing text analysis methodologies. This comparison underscores the superior performance of GDMPCA and BLMPCA in terms of accuracy, adaptability, and computational efficiency, as evidenced by an extensive empirical evaluation on diverse real-world datasets.
Our exposition on the practical implications of these models reveals their broad applicability across numerous domains, from automated content categorization to nuanced sentiment analysis in social media texts. The innovative aspects of the GDMPCA and BLMPCA models, coupled with their empirical validation, underscore their potential to set a new standard in text analysis, offering researchers and practitioners alike powerful tools for uncovering insights from textual data.
Table 1 summarizes the relevant variables for the proposed models. Bouguila [40] demonstrated that when mixture models are used, the generalized Dirichlet (GD) distribution is a reasonable alternative to the Dirichlet distribution for clustering count data.
As we mentioned previously, the GD distribution, like the Dirichlet distribution, is a conjugate prior to the multinomial distribution. Furthermore, the GD has a more general covariance matrix [40].
Therefore, the variational Bayes approach will be utilized to develop an extension of the MPCA model incorporating the generalized Dirichlet assumption. GDMPCA is anticipated to perform effectively because the Dirichlet distribution is a specific instance of the GD [41]. Like MPCA, GDMPCA is a fully generative model applied to a corpus. It considers a collection of M documents represented as the corpus, denoted by D = {w_1, w_2, . . ., w_M}. Each document w_m consists of a sequence of N_m words, expressed as w_m = (w_{m1}, . . ., w_{mN_m}). Words within a document are represented by binary vectors from a vocabulary of V words, where w_{nj} = 1 if the j-th word is selected and w_{nj} = 0 otherwise [42]. The GDMPCA model then describes the generation of each word in the document through a series of steps involving c, a (d + 1)-dimensional binary vector of topics: z_{ni} = 1 if the i-th topic is chosen, and z_{ni} = 0 otherwise. The GD distribution simplifies to a Dirichlet distribution when b_i = a_{i+1} + b_{i+1}.
The mean, the variance matrix, and the covariance between m_i and m_j of the GD distribution are given in [41]. The covariance matrix of the GD distribution offers greater flexibility compared to that of the Dirichlet distribution, due to its more general structure. This additional complexity allows for an extra set of parameters, providing d − 1 additional degrees of freedom, which enables the GD distribution to more accurately model real-world data. Indeed, the GD distribution fits count data better than the commonly used Dirichlet distribution [43]. The Dirichlet and GD distributions are both members of the exponential family (Appendix A). Furthermore, they are also conjugate priors to the multinomial distribution. As a result, we can use the following method to learn the model.
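Drawing from a GD distribution can be sketched via its stick-breaking construction with independent Beta variates, which is what gives the GD its extra degrees of freedom over the Dirichlet. The function name and the toy parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_generalized_dirichlet(a, b, size=1):
    """Draw from GD(a_1, b_1, ..., a_d, b_d) via its stick-breaking form:
    nu_i ~ Beta(a_i, b_i) and m_i = nu_i * prod_{j<i} (1 - nu_j)."""
    d = len(a)
    nu = rng.beta(a, b, size=(size, d))
    m = np.empty((size, d + 1))
    stick = np.ones(size)                 # remaining stick length
    for i in range(d):
        m[:, i] = nu[:, i] * stick
        stick = stick * (1.0 - nu[:, i])
    m[:, d] = stick                       # m_{d+1} = 1 - sum of the others
    return m

# hypothetical parameters for a 4-component proportion vector
a = np.array([2.0, 3.0, 1.5])
b = np.array([4.0, 2.0, 3.0])
samples = sample_generalized_dirichlet(a, b, size=1000)
print(np.allclose(samples.sum(axis=1), 1.0))   # every draw lies on the simplex
```

Setting b_i = a_{i+1} + b_{i+1} in this construction collapses it to an ordinary Dirichlet draw, matching the special case stated in the text.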
The likelihood for the GDMPCA is given as follows. When the hidden variables are assigned GD priors, and given a defined universe of words, we use an empirical prior derived from the observed proportions of words in the universe, denoted by f, where ∑_k f_k = 1 and a small constant factor reflects the small prior sample size. First, we calculate the parameters of the GD utilizing the Hessian matrix as described in Appendix B.1.2, following Equations (19) and (20). To find the optimal variational parameters, we minimize the Kullback-Leibler (KL) divergence between the variational distribution and the posterior distribution p(m, w|Ω, ξ); this is achieved through a repetitive fixed-point method. As an alternative to the posterior distribution p(m, c, w, ξ, Ω), we determine the variational parameters γ and Φ through a detailed optimization process outlined subsequently. To simplify, Jensen's inequality is applied to establish a lower bound on the log likelihood, which allows us to disregard the parameters γ and Φ [44]. Consequently, Jensen's inequality provides a lower bound on the log likelihood for any given variational distribution q(m, c|γ, Φ).
If the right-hand side of Equation (22) is denoted as L(γ, Φ; ξ, Ω), the discrepancy between the left and right sides of this equation represents the KL divergence between the variational distribution and the true posterior probabilities. This re-establishes the importance of the variational parameters. As demonstrated in Equation (23), maximizing the lower bound L(γ, Φ; ξ, Ω) with respect to γ and Φ is equivalent to minimizing the Kullback-Leibler (KL) divergence between the variational posterior and the true posterior. By factorizing the variational distributions, we can describe the lower bound accordingly; we can then extend Equation (A7) in terms of the model parameters (ξ, Ω) and the variational parameters (γ, Φ) (A13).
To find ϕ_{nl}, we maximize the bound with respect to ϕ_{nl} and set the resulting derivative to zero. Next, we maximize Equation (A13) with respect to γ_i: isolating the terms containing γ_i and setting the derivative to zero leads to the updated parameters. The challenge of deriving empirical Bayes estimates for the model parameters ξ and Ω is tackled by utilizing the variational lower bound as a substitute for the marginal log probability, using the variational parameters γ and Φ. The empirical Bayes estimates are then determined by maximizing this lower bound in relation to the model parameters. Until now, our discussion has centered on the log probability for a single document; the overall variational lower bound is computed as the sum of the individual lower bounds from each document. In the M-step, this bound is maximized with respect to the model parameters ξ and Ω. Consequently, the entire process is akin to performing coordinate ascent, as outlined in Equation (31). We formulate the update equation for estimating Ω by isolating terms and incorporating Lagrange multipliers to maximize the bound with respect to Ω. To derive the update equation for Ω_{(lj)}, we take the derivative of the variational lower bound with respect to Ω_{(lj)} and set this derivative to zero. This step ensures that we find the point where the lower bound is maximized with respect to the parameter Ω_{(lj)}.
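The coordinate-ascent pattern of these fixed-point updates can be sketched for the Dirichlet special case; the GD update in the text follows the same loop with its own expectation of log m. The function names, the `digamma` approximation, and the toy inputs are our own illustrative assumptions, not the paper's implementation:

```python
import math

import numpy as np

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic
    series; adequate for x > 0."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def variational_e_step(counts, beta, alpha, n_iter=50):
    """Per-document fixed-point updates (Dirichlet sketch of the E-step).

    counts : (V,) observed word counts
    beta   : (K, V) topic-word probabilities (rows sum to 1)
    alpha  : (K,) prior parameters
    Returns gamma (variational mixture posterior) and phi (word responsibilities).
    """
    K, V = beta.shape
    gamma = alpha + counts.sum() / K
    phi = np.full((V, K), 1.0 / K)
    for _ in range(n_iter):
        elog = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())
        phi = beta.T * np.exp(elog)              # phi_nk ∝ beta_kw exp(E[log m_k])
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + counts @ phi             # gamma_k = alpha_k + sum_n phi_nk
    return gamma, phi

# hypothetical 2-topic, 3-word example
alpha = np.array([1.0, 1.0])
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])
counts = np.array([5.0, 2.0, 3.0])
gamma, phi = variational_e_step(counts, beta, alpha)
print(gamma.round(3))   # posterior pseudo-counts over the two topics
```

Each sweep alternates the two closed-form updates until the lower bound stabilizes, which is exactly the coordinate ascent the text describes around Equation (31).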
The updates mentioned lead to convergence at a local maximum of the lower bound of log p(Ω, ξ|r), which is optimal for all product approximations of the form q(m)q(w) of the joint probability p(m, w|Ω, ξ, r). This approach ensures that the variational parameters are adjusted to optimally approximate the true posterior distributions within the constraints of the model.

Collapsed Gibbs Sampling Method

Utilizing the fundamental procedure of the GD distribution as delineated in the generative formula p(c, z, θ, φ, w|Ω, ξ, µ) within our methodology, we can express it as follows. Here, p(θ|Ω) signifies the GD document prior distribution, where Ω = (a_1, b_1, . . ., a_n, b_n) serves as a hyperparameter. Simultaneously, p(φ|ξ), with ξ = (α_1, β_1, . . ., α_d, β_d) as its hyperparameters, represents the GD corpus prior distribution. Bayesian inference seeks to approximate the posterior distribution of the hidden variables z by integrating out the parameters. Crucially, the joint distribution is expressed as a product of Gamma functions, as highlighted in prior research [12,45,46]; this expression facilitates the determination of the expectation value for the exact posterior distribution. Employing the GD prior yields the posterior calculation and its normalization, and the sequence from Equation (38) to Equation (40) delineates the complete collapsed Gibbs sampling procedure. The implementation of collapsed Gibbs sampling in our GD-centric model facilitates sampling directly from the actual posterior distribution p, as indicated in Equation (41). This sampling technique is deemed more accurate than those employed in variational inference models, which typically approximate the distribution from which samples are drawn [46,47]. Hence, our model's precision is ostensibly superior.
Upon the completion of the sampling phase, parameter estimation is conducted using the methodologies discussed.
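The bookkeeping of a collapsed sampler can be sketched with Dirichlet-style count ratios; the GD and BL samplers in the text keep the same remove-sample-restore loop but replace these ratios with the corresponding Gamma-function ratios. Function names, hyperparameter values, and the toy corpus are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def collapsed_gibbs(docs, V, K, alpha=0.1, eta=0.01, n_sweeps=20):
    """Collapsed Gibbs sweeps over topic assignments z (Dirichlet-prior sketch).

    docs : list of word-id lists. Returns the assignments plus the
    doc-topic and topic-word count tables the conditional uses.
    """
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    ndk = np.zeros((len(docs), K))        # doc-topic counts
    nkv = np.zeros((K, V))                # topic-word counts
    nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            ndk[d, z[d][n]] += 1; nkv[z[d][n], w] += 1; nk[z[d][n]] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]               # remove current token: the "-ij" counts
                ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkv[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k               # restore with the freshly sampled topic
                ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
    return z, ndk, nkv

docs = [[0, 0, 1, 2], [2, 3, 3, 3], [0, 1, 0, 1]]
z, ndk, nkv = collapsed_gibbs(docs, V=4, K=2)
print(int(ndk.sum()))   # 12: every token keeps exactly one topic assignment
```

Because θ and φ are integrated out, only these count tables are carried between sweeps, which is why the collapsed sampler is faster than a Gibbs sampler over the full joint space.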

Beta-Liouville Multinomial PCA
For the Beta-Liouville Multinomial PCA (BLMPCA) model, we define a corpus as a collection of documents under the same assumptions described in the GDMPCA section. The BLMPCA model generates every single word of a document through the following steps, where c is a (d + 1)-dimensional binary vector of topics. In this model, each topic is represented by a binary variable, where z_{ni} = 1 indicates that the i-th topic is chosen for the n-th word, and z_{ni} = 0 indicates it is not chosen. The vector z_n is a (D + 1)-dimensional binary vector representing the topic assignments across all D + 1 topics for a given word. The vector m is defined as m = (m_1, m_2, . . ., m_{D+1}), where m_{D+1} = 1 − ∑_{i=1}^{D} m_i captures the distribution of topic proportions across the document, ensuring that the sum of proportions across all topics equals 1.
A chosen topic is associated with a multinomial prior w over the vocabulary, where Ω^w_{ij} = p(w_j = 1|z_i = 1) describes the probability of the j-th word being selected given that the i-th topic is chosen. This formulation allows each word in the document to be drawn randomly from the vocabulary conditioned on the assigned topic.
The probability p(w_n|z_n, Ω^w) is a multinomial probability that conditions on the topic assignments z_n and the topic-word distributions Ω^w, effectively modeling the likelihood of each word in the document given the topic assignments.
Additionally, BL(Υ) represents a d-variate Beta-Liouville distribution with parameters Υ = (α_1, . . ., α_D, α, β). The probability density function of this Beta-Liouville distribution encapsulates the prior beliefs about the distribution of topics across documents, accommodating complex dependencies among topics and allowing for flexibility in modeling topic prevalence and co-occurrence within the corpus.
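A BL draw can be sketched through the Liouville construction: a Beta-distributed radial part u scales a Dirichlet draw over the first D components, and the last component absorbs 1 − u. The function name and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_beta_liouville(alphas, alpha, beta, size=1):
    """Draw theta from BL(alpha_1, ..., alpha_D, alpha, beta):
    u ~ Beta(alpha, beta) scales y ~ Dirichlet(alpha_1, ..., alpha_D),
    and theta_{D+1} = 1 - u completes the proportion vector."""
    u = rng.beta(alpha, beta, size=size)
    y = rng.dirichlet(alphas, size=size)
    return np.column_stack([y * u[:, None], 1.0 - u])

# hypothetical parameters: D = 2 topics plus the completing component
theta = sample_beta_liouville(np.array([2.0, 3.0]), alpha=4.0, beta=2.0, size=1000)
print(np.allclose(theta.sum(axis=1), 1.0))   # draws lie on the 3-simplex
```

The extra (α, β) pair controls the total mass of the first D components independently of their relative split, which is the additional flexibility over the Dirichlet that the text highlights.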
A Dirichlet distribution is a special case of the BL distribution under a specific setting of its parameters [42,45]. The mean, the variance, and the covariance between θ_l and θ_k of a BL distribution are given in [45]; these expressions show that the covariance matrix of the Beta-Liouville distribution offers a broader scope than that of the Dirichlet distribution. For the parameter estimation of BLMPCA, the parameter Ω is first estimated using the Beta-Liouville prior on m with parameter Υ [19], which, together with the Beta-Liouville priors, defines the likelihood model for the BLMPCA. In the following step, we estimate the parameters for Ω using the Beta-Liouville prior and the Hessian matrix (Appendix C). As explained in Section 3.1, we estimate the model parameters (Υ, Ω) and the variational parameters (γ, Φ) according to Equations (21), (22) and (A7). To find ϕ_{nl}, we proceed to maximize the bound with respect to ϕ_{nl} and set the resulting derivative to zero. The next step is to optimize Equation (49) to find the update equations for the variational parameters; we again separate the terms containing the variational Beta-Liouville parameters.

Selecting the terms containing the variational Beta-Liouville variables γ_i, α_γ, and β_γ, and setting Equations (52)-(54) to zero, we obtain the updated parameters. We address the challenge of deriving empirical Bayes estimates for the model parameters Υ and Ω by utilizing the variational lower bound as a substitute for the marginal log likelihood. This approach fixes the variational parameters γ and Φ at values determined through variational inference. We then optimize this lower bound to obtain the empirical Bayes estimates of the model parameters.
To estimate Ω^w, we formulate the necessary update equations. Maximizing Equation (52) with respect to Ω yields the corresponding equation; taking the derivative with respect to β_{w(lj)} and setting it to zero yields the result in Appendix C.1.

Beta-Liouville Parameter
The objective of this subsection is to determine the estimates of the model's parameters using variational inference techniques [48].
The derivative of Equation (52) with respect to each of the BL parameters is influenced not only by that parameter's own value but also by its interactions with the others. Consequently, we utilize the Newton-Raphson method to address this optimization problem. To implement the Newton-Raphson method effectively, it is essential to first calculate the Hessian matrix for the parameter space [49]. This Hessian matrix is very similar to the Hessian of the Dirichlet parameters in the MPCA model and of the generalized Dirichlet parameters in GDMPCA. In fact, the matrix can be divided into two completely separate blocks using the parameters α_d, α, and β, and the parameter derivation for each block is identical to the Newton-Raphson scheme used in MPCA and GDMPCA.
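A Newton-Raphson step of this kind can be sketched for the plain Dirichlet block: its Hessian has the form diag(q) + c·11^T, so each step can be inverted analytically instead of numerically, the same block structure the BL and GD parameter updates exploit. All names, the series approximations, and the synthetic data are illustrative assumptions:

```python
import math

import numpy as np

def digamma(x):
    """psi(x) via recurrence plus asymptotic series (adequate for x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def trigamma(x):
    """psi'(x) via recurrence plus asymptotic series."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1 / 6 - f * (1 / 30 - f / 42))

def fit_dirichlet_newton(P, n_iter=50):
    """Newton-Raphson MLE of Dirichlet parameters from proportion vectors P (N x K).

    Since the Hessian is diag(q) + c * 11^T, the Newton direction is computed
    in closed form (Sherman-Morrison) rather than by matrix inversion.
    """
    N, K = P.shape
    logp = np.log(P).mean(axis=0)          # sufficient statistics E[log p_k]
    alpha = np.ones(K)
    for _ in range(n_iter):
        s = alpha.sum()
        g = N * (digamma(s) - np.array([digamma(a) for a in alpha]) + logp)
        q = -N * np.array([trigamma(a) for a in alpha])   # Hessian diagonal
        c = N * trigamma(s)                                # rank-one component
        b = (g / q).sum() / (1.0 / c + (1.0 / q).sum())
        alpha = np.clip(alpha - (g - b) / q, 1e-3, None)   # guarded Newton step
    return alpha

rng = np.random.default_rng(5)
P = rng.dirichlet([2.0, 5.0, 3.0], size=2000)
alpha_hat = fit_dirichlet_newton(P)
print(alpha_hat.round(2))   # close to the generating parameters (2, 5, 3)
```

Splitting the BL Hessian into its two parameter blocks, as the text notes, reduces each block to exactly this kind of analytically invertible Newton step.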

Inference via Collapsed Gibbs Sampling
The collapsed Gibbs sampler (CGS) contributes to the inference by estimating posterior distributions through a Bayesian network of conditional probabilities, which are determined through a sampling process over the hidden variables. Compared to the traditional Gibbs sampler, which functions in the combined space of latent variables and model parameters, the CGS offers significantly faster estimation times. The CGS operates within the collapsed space of latent variables, where, in the joint distribution p(X, z, θ, ϕ, w|Ω, Υ, µ), the model parameters θ and ϕ are marginalized out. This marginalization leads to the marginal joint distribution p(X, z, w|Ω, Υ, µ). Using Equation (63), the method calculates the conditional probabilities of the latent variables z_{ij} by considering the current state of all other variables while excluding the specific variable z_{ij} itself [50]. Meanwhile, the CGS determines the topic assignments for the observed words by employing the conditional probability of the latent variables, where "−ij" indicates counts or variables with z_{ij} excluded [50,51]. The sampling mechanism of the collapsed Gibbs approach can be summarized as an expectation problem. The collapsed Gibbs sampling Beta-Liouville multinomial procedure consists of two phases for assigning documents to clusters. First, each document is assigned a random cluster for initialization. After that, each document is assigned a cluster based on the Beta-Liouville distribution after a specified number of iterations.
The goal is to use a network of conditional probabilities for individual classes to sample the latent variables from the joint distribution p(X, z|w, Ω, Υ). The assumption of conjugacy allows the integral in Equation (63) to be estimated.
The likelihood of the multinomial distribution, defined by the parameter Υ, and the probability density function of the Beta-Liouville distribution can be combined accordingly. By integrating the probability density function of the Beta-Liouville distribution over the parameter θ and incorporating the updated parameters derived from the remaining integral in Equation (69), we are able to express it as a fraction of Gamma functions.
The updated parameters follow, where N_{jk} represents the counts corresponding to the variables [45,51]; Equation (67) is then rewritten equivalently. The parameters α_1, . . ., α_k, α, and β correspond to the Beta-Liouville distribution, while m_k represents the number of documents in cluster k.
After the sampling process, parameter estimation is performed. Subsequently, the empirical likelihood method [47] is utilized to validate the results using a held-out dataset. Ultimately, this process leads to the estimation of the class conditional probability p(X|w, Ω, Υ) within the framework of collapsed Gibbs sampling, with the parameters computed as averages over the drawn samples, where S is the size of a sample.

Experimental Results
In this section, we validate our proposed algorithms' efficiency for two distinct and challenging applications, namely, topic modeling for medical text and sentiment analysis. Each model's evaluation is based on the success rate for each dataset and the perplexity [3,9,52,53], a common measure in language modeling defined as

perplexity(D_test) = exp( − ∑_d log p(w_d) / ∑_d |w_d| ),

where |w_d| is the length of document d. A lower perplexity score indicates better generalization performance. In addition to the perplexity metric, the success rate is employed as a key performance indicator, reflecting the proportion of correctly identified topics within a corpus in topic modeling. The success rate serves as a straightforward measure of a model's efficacy, capturing its ability to accurately classify documents into the correct topical categories, which is essential for effective information retrieval and knowledge discovery in the domain of text analysis. The main goal of both applications is to compare the GDMPCA, BLMPCA, and MPCA performances. The choice of these datasets is pivotal to our research, as they offer a broad spectrum of analytical scenarios, from topic modeling for medical text to sentiment analysis, enabling a thorough investigation into the models' adaptability and accuracy. By encompassing datasets with distinct characteristics, we are able to demonstrate the strengths of our proposed models in varied contexts, highlighting their potential as versatile tools in the field of text analysis.
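Computing this perplexity measure (the exponential of the negative average per-token log likelihood over held-out documents) is a one-liner; the helper name and the sanity-check values below are our own illustration:

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """perplexity = exp(- sum_d log p(w_d) / sum_d |w_d|); lower is better.

    log_likelihoods : per-document log p(w_d) under the fitted model
    doc_lengths     : per-document token counts |w_d|
    """
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))

# sanity check: a uniform model over a 50-word vocabulary should score 50,
# since every token then has probability 1/50
uniform_ll = 10 * math.log(1.0 / 50)      # one document of 10 tokens
print(perplexity([uniform_ll], [10]))     # 50.0 up to float rounding
```

The uniform-model check is a useful baseline when comparing GDMPCA, BLMPCA, and MPCA scores: any fitted model should score well below the vocabulary size.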

Topic Modeling
The goal of text classification is to assign documents to predefined subject categories, a problem extensively researched with various approaches [42,54,55]. Topic modeling, a common application in natural language processing, is used for analyzing texts from diverse sources and for document clustering [56]. It identifies key "topics" in a text corpus using unsupervised statistical methods, where each topic is a mixture of keywords with an associated probability distribution and each document is composed of a mixture of topics [12]. The "CMU Book Summary Dataset", containing plot summaries and metadata for 16,559 books [57], was used to validate model performance. The models' accuracy was tested by training on varying numbers of documents and observing the impact of the number of latent topics on classification accuracy. Under variational Bayes inference, the models showed similar performances overall, but BLMPCA excelled, particularly in distinguishing similar classes.
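The generative view described above, in which topics are keyword distributions and documents are topic mixtures, can be sketched as follows; the vocabulary, topic-word probabilities, and mixture weights are illustrative assumptions, not quantities estimated from the CMU data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Topics are probability distributions over the vocabulary;
# a document is generated from a mixture of topics.
vocab = ["war", "army", "girl", "novel", "story", "president"]
topics = np.array([[0.40, 0.35, 0.05, 0.05, 0.05, 0.10],   # a "war/politics" topic
                   [0.05, 0.05, 0.30, 0.30, 0.25, 0.05]])  # a "fiction" topic
theta = np.array([0.7, 0.3])          # this document's topic mixture

def sample_document(n_words):
    """Generate a document by repeatedly drawing a topic, then a word."""
    words = []
    for _ in range(n_words):
        k = rng.choice(len(theta), p=theta)           # pick a topic
        words.append(rng.choice(vocab, p=topics[k]))  # pick a word from that topic
    return words

print(sample_document(10))
```

Inference reverses this process: given only the documents, the model recovers the topic-word distributions and each document's mixture weights.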
In Tables 2-4, we present the first three topics, the perplexity measurements, and the time complexity for all models compared in this study. The success rates obtained using GDMPCA, BLMPCA, and MPCA are depicted in Figure 1. These examples demonstrate that our proposed models, which incorporate the Generalized Dirichlet and Beta-Liouville distributions, yield more accurate classifications in scenarios where distinct classes exhibit similarities, in contrast to the traditional Dirichlet-based MPCA. Additionally, Tables 5 and 6 show the results for collapsed Gibbs sampling.

Topic Number    Topics
Topic 1         girl, tells, war, novel, new, world, army, story, one, group, book, states, general, british, president, first, american

Topic Modeling for Medical Text
Topic modeling plays a crucial role in navigating the complexities of health and medical text mining, despite the inherent challenges of data volume and redundancy in this domain. The study by Onan et al. [58] marked a significant advancement, presenting an optimized topic modeling approach that utilizes ensemble pruning. This method significantly improves the categorization of biomedical texts by enhancing precision and managing the computational challenges posed by the extensive data typical of medical documents. Given the vast amounts of health-related data, specialists struggle to find pertinent information, as exemplified by the millions of papers on PubMed and the volume of hospital discharge records in the United States in 2015. This study utilized the TMVar corpus from PubMed and a dataset containing health-related Twitter news to evaluate the models [59-64].

TMVar Dataset
The TMVar corpus, comprising 500 PubMed papers with manual annotations of various mutation mentions, was used to evaluate our models. Tables 7 and 8 present the perplexity comparison and time complexity for the TMVar dataset, offering insight into the performances of our proposed methods. Moreover, Tables 9 and 10 present the outcomes of collapsed Gibbs sampling. As indicated in the tables, this method has a higher time complexity but a lower perplexity.
Furthermore, as shown in Table 11, the BLMPCA model successfully extracts pertinent topics, indicative of the model's nuanced analytical capabilities. Figure 2 further illustrates the success rates of our proposed models in comparison to the traditional MPCA, highlighting the enhanced classification accuracy achieved by our methods.

Sentiment Analysis
Sentiment analysis, crucial for interpreting emotions in texts from various sources, benefits from advanced methodologies that go beyond mere word analysis [65]. Recent studies, such as [66,67], have demonstrated the effectiveness of deep learning and text mining in capturing nuanced sentiment expressions. Additionally, the authors of [68] highlighted the potential of ensemble classifiers for improving sentiment classification accuracy. These innovations showcase the shift toward more complex analyses that consider semantics, context, and intensity for a more accurate understanding of sentiment.
The "Multi-Domain Sentiment Dataset", containing Amazon.com product reviews across various domains, was used for this analysis [69]. This dataset, with extensive reviews of books and DVDs, provides ample data. The applied model, using K = 8 topics, assumed that each topic comprises a bag of words with specific probabilities and that each document is a mixture of these topics. The model's goal was to learn the distributions of words and topics in the corpus.
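As a point of reference for the setup just described (a bag-of-words representation with K = 8 topics), the following sketch uses scikit-learn's Dirichlet-based LDA implementation, which is analogous to the MPCA baseline rather than the proposed Beta-Liouville model; the toy reviews stand in for the actual Amazon.com data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for product reviews (the real experiments use the
# Multi-Domain Sentiment Dataset of Amazon book and DVD reviews).
reviews = [
    "great book wonderful story loved it",
    "terrible dvd boring plot waste of money",
    "excellent read the story was moving",
    "awful film bad acting would not recommend",
] * 5

counts = CountVectorizer().fit_transform(reviews)   # bag-of-words count matrix
lda = LatentDirichletAllocation(n_components=8, random_state=0).fit(counts)

# Each document is represented as a mixture over the 8 topics.
doc_topics = lda.transform(counts)
print(doc_topics.shape)
```

Each row of `doc_topics` is a probability distribution over the eight topics, which is the document-level representation the sentiment analysis builds on.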
We demonstrated that the overall sentiment of the dataset tends to be positive, influenced by the presence of high-frequency words with positive connotations within the corpus. This observation is substantiated by the sentiment analysis framework we employed. Tables 12 and 13 provide a detailed breakdown of the perplexity measures and time complexity for sentiment analysis. Furthermore, the findings from the topic modeling of eight emotions and two sentiments are displayed in Tables 14 and 15. Figure 3 shows the success rates for MPCA, GDMPCA, and BLMPCA on sentiment analysis, with GDMPCA and BLMPCA outperforming MPCA as the number of emotions analyzed increases, indicating their better suitability for complex emotion detection tasks in practical applications. Furthermore, Tables 16 and 17 present the results for collapsed Gibbs sampling. Additionally, Tables 18 and 19 display the accuracy and recall of the various classifiers used for emotion detection. Table 20 shows the F1-scores, the balanced harmonic mean of precision and recall, for the SVM, Naive Bayes, and MLP classifiers when applied with the MPCA, GDMPCA, and BLMPCA models in sentiment analysis.
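The F1-scores reported in Table 20 are the harmonic mean of precision and recall; the following minimal computation uses hypothetical precision and recall values, not the values reported in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier scores for illustration.
print(round(f1_score(0.80, 0.60), 4))  # → 0.6857
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier must score well on both precision and recall to achieve a high F1.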

Discussion
We delved into the comparative advantages of the GDMPCA and BLMPCA models over existing methods in text classification and sentiment analysis. The superior performances of our proposed models can be attributed to several key factors. Firstly, the incorporation of the Generalized Dirichlet and Beta-Liouville distributions allows for a more nuanced modeling of text data, capturing the intricacies of word distributions more effectively than traditional methods. This results in a more accurate representation of the underlying thematic structures in the data. For instance, in the CMU Book Summary Dataset, the intricacies of literary themes were better represented, showcasing the models' aptitude for multifaceted textual analysis. This was attributed to the models' ability to account for the co-occurrence and complex interrelationships of terms within documents, a feature less emphasized in MPCA due to its assumption of component independence.
In the TMVar corpus from PubMed, the medical text presented a challenge due to its specialized lexicon and density of information. The BLMPCA model excelled by exploiting its additional parameters, optimizing the data representation in this high-dimensional space and underscoring the importance of aligning model selection with dataset characteristics. The sentiment analysis on the Multi-Domain Sentiment Dataset further emphasized the adaptability of our models. Here, BLMPCA demonstrated its finesse in discerning subtle sentiments in Amazon.com reviews, outperforming traditional approaches that may not capture the emotional granularity present in user-generated content.
However, the sophistication of GDMPCA and BLMPCA comes with greater computational demands, as reflected in longer convergence times. This trade-off between accuracy and computational efficiency underscores the necessity of careful model selection in practice, considering the scale of the data and the available computational resources. Although our proposed models signify a leap forward in text analysis methodologies, they are not without limitations. The reliance on variational inference and on model-specific assumptions may not be universally applicable to all types of textual data, suggesting room for future refinement and the exploration of alternative distributions or learning strategies.
The findings of this study illuminate the potential of integrating advanced probabilistic distributions into PCA to uncover deeper insights within text data. They are a testament to the evolution of statistical models for text analysis, pointing toward an exciting trajectory for future research in the field. The ongoing dialogue within the academic community on these topics reflects the dynamic nature of machine learning and its applications to natural language processing. As we continue to push the boundaries, it is imperative to balance innovation with practicality, ensuring that our models are not only theoretically robust but also computationally viable and accessible for varied applications.

Conclusions
In this paper, two novel models, generalized Dirichlet Multinomial PCA and Beta-Liouville Multinomial PCA, were proposed to improve the accuracy of the MPCA model for multi-topic modeling and text classification. We followed a Bayesian analysis based on the generalized Dirichlet and Beta-Liouville assumptions. We demonstrated that our two proposed models offer greater flexibility. The models were applied in two separate applications: text classification and sentiment analysis. The results show that the two proposed models achieved superior performances in all applications, as reflected in their higher prediction accuracy compared to that of the MPCA. It can therefore be claimed that the proposed models, with their different prior assumptions, yield better results than the standard methods. Specifically, the BLMPCA provides the largest improvement over the GDMPCA and MPCA on all the tested data. Crucially, the employment of collapsed Gibbs sampling for parameter inference proved efficient and effective, despite its time-consuming nature. This method substantially boosts our models' computational capabilities, allowing for the superior discovery of latent topics in text corpora and marking a noteworthy advancement over the MPCA model. Future research will concentrate on model modifications and improvements to achieve greater precision in topic modeling. In addition, future work could extend the proposed models to other applications and adapt them to a wider variety of data, including real-time streaming data.

Appendix

...within the exponential family provides an approximation for their dual aspects, as explored in [20]:

μ_t = ( ν_t + Σ_i t(x_i) ) / ( S_t + I )    (A3)

L(γ, Φ; ξ, Ω) = E_q[log p(m|ξ)] + E_q[log p(c|m)] + E_q[log p(w|c, Ω)] − E_q[log q(m)] − E_q[log q(c)]    (A7)

In the following, we derive each of the five factors of the above equation:

Figure 1 .
Figure 1. Success rate for the CMU Book dataset.

Table 1 .
Parameters of generalized Dirichlet and Beta-Liouville distributions.

Table 2 .
Common topics identified with the BLMPCA model on the CMU Book dataset, each defined by a set of keywords.

Table 3 .
A comparison of the perplexity of the MPCA, GDMPCA, and BLMPCA models, indicating the model fit quality across different topic numbers (K) on the CMU Book dataset.

Table 4 .
Time complexity comparison for MPCA, GDMPCA, and BLMPCA at varying topic levels (K) on the CMU Book dataset.

Table 5 .
Comparison of the perplexity scores of MPCA, GDMPCA, and BLMPCA, reflecting the model fit as the topic count (K) increases on the CMU Book dataset with CGS inference.

Table 6 .
Time complexity comparison for MPCA, GDMPCA, and BLMPCA with increasing topics (K) using CGS inference on the CMU Book dataset.

Table 7 .
Comparison of the perplexity of the MPCA, GDMPCA, and BLMPCA models, indicating the model fit quality across different topic numbers (K) on the TMVar dataset with variational EM inference.

Table 8 .
Time complexity comparison for MPCA, GDMPCA, and BLMPCA with increasing topics (K) using variational EM inference on the TMVar dataset.

Table 9 .
Comparison of the perplexity of the MPCA, GDMPCA, and BLMPCA models, indicating the model fit quality across different topic numbers (K) on the TMVar dataset with CGS inference.

Table 10 .
Time complexity comparison for MPCA, GDMPCA, and BLMPCA with increasing topics (K) using CGS inference on the TMVar dataset.

Table 12 .
Comparison of the perplexity of the MPCA, GDMPCA, and BLMPCA models, indicating the model fit quality across different topic numbers (K) on sentiment data with variational EM inference.

Table 13 .
Time complexity comparison for MPCA, GDMPCA, and BLMPCA with increasing topics (K) using variational EM inference on the sentiment analysis application.

Table 14 .
Frequency of emotions identified in text data via topic modeling.

Table 15 .
The counts of positive, negative, and unlabeled sentiments identified through sentiment analysis.

Table 16 .
Comparison of the perplexity of the MPCA, GDMPCA, and BLMPCA models, indicating the model fit quality across different topic numbers (K) on sentiment data with CGS inference.

Table 17 .
Time complexity comparison for MPCA, GDMPCA, and BLMPCA with increasing topics (K) using CGS inference on the sentiment analysis application.

Table 18 .
Accuracy comparisons for sentiment analysis classifiers.

Table A1 .
Exponential family characterizations for the Dirichlet, GD, and BL distributions.