# Revisiting Probabilistic Latent Semantic Analysis: Extensions, Challenges and Insights

## Abstract

## 1. Introduction

## 2. The Method: PLSA Formulas

Aspect 1 | Aspect 2 | Aspect 3 | Aspect 4 |
---|---|---|---|
imag | video | region | speaker |
SEGMENT | sequenc | contour | speech |
color | motion | boundari | recogni |
tissu | frame | descript | signal |
Aspect1 | scene | imag | train |
brain | SEGMENT | SEGMENT | hmm |
slice | shot | precis | sourc |
cluster | imag | estim | speakerindepend |
mri | cluster | pixel | SEGMENT |
algorithm | visual | paramet | sound |

## 3. Criticism: LDA and Reformulations

#### 3.1. Latent Dirichlet Allocation

#### 3.2. Other Formulations

#### 3.2.1. Probabilities for Unseen Documents

#### 3.2.2. Extension to Continuous Data

#### 3.2.3. Tensorial Approach

#### 3.2.4. Overfitting

#### 3.2.5. Discrete and Continuous Variables Case Equivalence

#### 3.2.6. Inference

#### 3.3. Extensions Significance

## 4. The Landscape of Applications

#### 4.1. Engineering

#### 4.2. Computer Science

#### 4.3. Semantic Image Analysis

#### 4.4. Life Sciences

#### 4.5. Fundamental Sciences

#### 4.6. Other Applications

## 5. NMF Point of View

## 6. Extensions

#### 6.1. Kernelization

#### 6.2. Principal Component Analysis

#### 6.3. Clustering

#### 6.4. Information Theory Interpretation

#### 6.5. Independent Component Analysis and Blind Source Separation

#### 6.6. Transfer Learning

#### 6.7. Neural Networks

#### 6.8. Open Questions

## 7. PLSA Processing Steps and State-of-the-Art Solutions

#### 7.1. Algorithm Initialization

#### 7.2. Algorithms Based on Expectation–Maximization Improvement

#### 7.2.1. Tempered EM

#### 7.2.2. Sparse PLSA

#### 7.2.3. Incremental PLSA

#### 7.3. Use of Computational Techniques

#### 7.4. Open Questions

## 8. Future Work

## 9. Discussion

## 10. Conclusions

## Funding

## Conflicts of Interest

**Figure 1.** Reproduced from [17]. PLSA generative models. (**Left**) The asymmetric formulation: (i) select a document $d_i$ with probability $P(d_i)$; (ii) pick a latent class $z_k$ with probability $P(z_k \mid d_i)$; (iii) generate a word with probability $P(w_j \mid z_k)$. (**Right**) The symmetric formulation: (i) select a latent class $z_k$; (ii) generate documents and words with probabilities $P(d_i \mid z_k)$ and $P(w_j \mid z_k)$, respectively.
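Written out, the two generative processes described in the Figure 1 caption correspond to the following joint distributions of a document and a word. This is a short restatement under the usual PLSA assumption that $d_i$ and $w_j$ are conditionally independent given $z_k$; the sum runs over all latent classes.

$$
P(d_i, w_j) = P(d_i) \sum_{k} P(w_j \mid z_k)\, P(z_k \mid d_i) \qquad \text{(asymmetric)}
$$

$$
P(d_i, w_j) = \sum_{k} P(z_k)\, P(d_i \mid z_k)\, P(w_j \mid z_k) \qquad \text{(symmetric)}
$$

The two expressions agree term by term because Bayes' rule gives $P(d_i)\, P(z_k \mid d_i) = P(z_k)\, P(d_i \mid z_k)$.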

**Figure 4.** Comparison of PCA and PLSA. The vectors t are the columns of **H**, which are the transformations of the canonical Cartesian basis vectors.

**Figure 5.** Independent Component Analysis. Reproduced from [155]. Several signal sources are mixed by a matrix, and the resulting projections are the observed signals. ICA separates the noise from the observations to recover the informative (source) signals.

Year | Contribution | Remarks |
---|---|---|
2000 | PLSA | PLSA formulation in conference proceedings [1,2,3]; comments on the connections among NMF, SVD, and information geometry. |
2001 | Kernelization | Fisher kernel derivation from PLSA [17]. |
2003 | LDA | Criticism of PLSA: LDA formulation [23]. |
2003 | Gaussian PLSA | Assumption of Gaussian mixtures [11]. |
2005 | NMF | PLSA solves the NMF problem [14]. Introduction to stochastic matrices [15]. |
2008 | k-means | Equivalence between k-means and NMF [24]. |
2009 | PCA | Comparison of NMF, PLSA, and PCA [19]. |
2012 | Information Geometry | Relationship between Fisher information matrix and variance from the PLSA context [20]. |
2013 | Transfer Learning | Use of latent variables weight for classifying most relevant variables [21]. |
2015 | Unified framework for PLSA and NMF | Algorithm for NMF and PLSI based on Poisson likelihood [25]. |
2019 | Neural Networks | Neural networks training with PLSA [22]. |
2020 | SVD | Establishment of conditions for equivalence of NMF, PLSA, and SVD [16]. |
2020 | Inference | Construction of hypothesis tests [13]. |
2021 | Number of topics | NMF and Silhouette index to determine the number of latent variables [26]. |
2023 | Discrete and continuous case equivalence | Relation between co-occurrences and continuous variables [12]. |

**Table 2.** PLSA solutions. The M-step formulas give the PLSA solutions. For a given formulation, select a value of $k$ and initialize the M-step quantities. Then estimate $P(z_k \mid d_i, w_j)$ and recompute $n(d_i, w_j)$. Expression (9) increases at each step. The iterative process stops when a previously fixed condition is met.

Step | Asymmetric Formulation | Symmetric Formulation |
---|---|---|
E-step, $P(z_k \mid d_i, w_j)$ | $\frac{P(w_j \mid z_k)\,P(z_k \mid d_i)}{\sum_{k'} P(w_j \mid z_{k'})\,P(z_{k'} \mid d_i)}$ | $\frac{P(w_j \mid z_k)\,P(d_i \mid z_k)\,P(z_k)}{\sum_{k'} P(w_j \mid z_{k'})\,P(d_i \mid z_{k'})\,P(z_{k'})}$ |
M-step | $P(d_i)=\frac{\sum_j \sum_k n(d_i,w_j)\,P(z_k \mid d_i,w_j)}{\sum_i \sum_j \sum_k n(d_i,w_j)\,P(z_k \mid d_i,w_j)}$ | $P(z_k)=\frac{\sum_i \sum_j n(d_i,w_j)\,P(z_k \mid d_i,w_j)}{\sum_i \sum_j \sum_k n(d_i,w_j)\,P(z_k \mid d_i,w_j)}$ |
M-step | $P(w_j \mid z_k)=\frac{\sum_i n(d_i,w_j)\,P(z_k \mid d_i,w_j)}{\sum_i \sum_j n(d_i,w_j)\,P(z_k \mid d_i,w_j)}$ | $P(w_j \mid z_k)=\frac{\sum_i n(d_i,w_j)\,P(z_k \mid d_i,w_j)}{\sum_i \sum_j n(d_i,w_j)\,P(z_k \mid d_i,w_j)}$ |
M-step | $P(z_k \mid d_i)=\frac{\sum_j n(d_i,w_j)\,P(z_k \mid d_i,w_j)}{\sum_j \sum_k n(d_i,w_j)\,P(z_k \mid d_i,w_j)}$ | $P(d_i \mid z_k)=\frac{\sum_j n(d_i,w_j)\,P(z_k \mid d_i,w_j)}{\sum_i \sum_j n(d_i,w_j)\,P(z_k \mid d_i,w_j)}$ |
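The EM iteration for the asymmetric formulation can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: it assumes the standard updates (the E-step posterior $P(z_k \mid d_i, w_j)$ followed by renormalized expected counts), a fixed number of iterations in place of a convergence test, and random initialization. The names `plsa_asymmetric`, `log_likelihood`, and the toy `counts` matrix are hypothetical.

```python
import math
import random


def _normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if the row is all zeros)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa_asymmetric(counts, n_topics, n_iter=50, seed=0):
    """EM for asymmetric PLSA; counts[i][j] is n(d_i, w_j). Assumes a non-empty corpus."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Strictly positive random start for P(w|z) (topic rows) and P(z|d) (document rows).
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for i in range(n_docs):
            for j in range(n_words):
                n_ij = counts[i][j]
                if n_ij == 0:
                    continue
                # E-step: P(z_k | d_i, w_j) is proportional to P(w_j|z_k) P(z_k|d_i).
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(n_topics)]
                norm = sum(joint)
                for k in range(n_topics):
                    expected = n_ij * joint[k] / norm
                    acc_w_z[k][j] += expected
                    acc_z_d[i][k] += expected
        # M-step: renormalize the expected counts.
        p_w_z = [_normalize(row) for row in acc_w_z]
        p_z_d = [_normalize(row) for row in acc_z_d]
    # P(d_i) does not depend on the latent classes: it is the share of counts in d_i.
    total = sum(sum(row) for row in counts)
    p_d = [sum(row) / total for row in counts]
    return p_d, p_z_d, p_w_z


def log_likelihood(counts, p_d, p_z_d, p_w_z):
    """Sum over observed pairs of n(d_i, w_j) * log P(d_i, w_j)."""
    ll = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij:
                mix = sum(p_w_z[k][j] * p_z_d[i][k] for k in range(len(p_w_z)))
                ll += n_ij * math.log(p_d[i] * mix)
    return ll


if __name__ == "__main__":
    # Toy corpus (hypothetical): 4 documents, 4 words, two visible blocks.
    counts = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
    p_d, p_z_d, p_w_z = plsa_asymmetric(counts, n_topics=2)
    print(log_likelihood(counts, p_d, p_z_d, p_w_z))
```

Because EM never decreases the log-likelihood, running the sketch with more iterations from the same seed should give a value at least as large as a shorter run. In practice a stopping rule based on the change in log-likelihood would replace the fixed `n_iter`.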

Discipline | Research Area | % |
---|---|---|
Engineering (43%) | Mechanics & Robotics | 35 |
Engineering (43%) | Acoustics | 4 |
Engineering (43%) | Telecommunications & Control Theory | 3 |
Engineering (43%) | Materials Science | 1 |
Computer Science (34%) | Clustering | 18 |
Computer Science (34%) | Information retrieval | 9 |
Computer Science (34%) | Networks | 4 |
Computer Science (34%) | Machine learning applications | 3 |
Semantic image analysis (10%) | Image annotation | 4 |
Semantic image analysis (10%) | Image retrieval | 3 |
Semantic image analysis (10%) | Image classification | 3 |
Life Sciences (5%) | Computational Biology | 2 |
Life Sciences (5%) | Biochemistry & Molecular Biology | 2 |
Life Sciences (5%) | Environmental Sciences & Ecology | 1 |
Methodological (4%) | Statistics & Computational Techniques | 4 |
Fundamental Sciences (2%) | Geochemistry & Geophysics | 1 |
Fundamental Sciences (2%) | Instrumentation | 1 |
Other Applications (2%) | Pain Detection | 1 |
Other Applications (2%) | Sentiment Analysis | 1 |

