Abstract
In this paper, we describe the supervised dynamic correlated topic model (sDCTM) for classifying categorical time series. This model extends the correlated topic model used for analyzing textual documents to a supervised framework that features dynamic modeling of latent topics. sDCTM treats each time series as a document and each categorical value in the time series as a word in the document. We assume that the observed time series is generated by an underlying latent stochastic process. We develop a state-space framework to model the dynamic evolution of the latent process, i.e., the hidden thematic structure of the time series. Our model provides a Bayesian supervised learning (classification) framework using a variational Kalman filter EM algorithm. The E-step and M-step, respectively, approximate the posterior distribution of the latent variables and estimate the model parameters. The fitted model is then used for the classification of new time series and for information retrieval that is useful for practitioners. We assess our method using simulated data. As an illustration on real data, we apply our method to promoter sequence identification data to classify E. coli DNA sub-sequences by uncovering hidden patterns or motifs that can serve as markers for promoter presence.
1. Introduction
Data on multiple time series or sequences are ubiquitous in various domains, and their classification finds applications in numerous areas, such as human motion classification [], earthquake prediction [], and heart attack detection []. The classification of multiple real-valued time series has been well studied in the statistical literature (see the detailed review in []). However, for categorical time series, most statistical methods focus primarily on a single series. A few such methods include the Markov chain model [], the link function approach [], and the likelihood-based method []. In the computer science literature, a number of sequence classification methods have been developed that are black-box in nature and may be difficult to interpret. These include the minimum edit distance classifier with sequence alignment [,] and Markov chain-based classifiers []. Overall, the classification of multiple categorical time series has not received much attention. Recent work has discussed a novel approach to the classification of categorical time series in the supervised learning framework, using a spectral envelope and optimal scalings [,]. We present an alternative approach for classifying multiple categorical time series in the topic modeling framework.
Topic modeling algorithms generally examine a set of documents, referred to as a corpus in the natural language processing (NLP) domain. Often, the sets of words observed in documents appear to represent a coherent theme or topic. Topic models analyze the words in the documents in order to uncover the themes that run through them. Many topic modeling algorithms have been developed over time, including non-negative matrix factorization [], latent Dirichlet allocation (LDA) [], and structural topic models []. Topic modeling in the dynamic framework has been extensively studied to model the dynamic evolution of document collections over time [,]. However, the dynamic evolution of words over time within each document has not been addressed. The topic modeling literature typically assumes that words are interchangeable. We relax this assumption to model word time series (documents). We build a family of probabilistic dynamic topic models by extending the correlated topic model (CTM) to a supervised framework, modeling the dynamic evolution of the underlying latent stochastic process that reflects the thematic structure of the time series collection, and classifying the documents. Topic models like latent Dirichlet allocation, correlated topic models, and dynamic topic models are unsupervised, as only the documents are used to identify topics by maximizing the likelihood (or the posterior probability) of the collection. In such frameworks, the hope is that the discovered topics will be useful for categorization; these models are appropriate when no response is available and the goal is to infer useful information from the observed documents. However, when the main goal is prediction, a supervised topic modeling framework is beneficial, as jointly modeling the documents and the responses can identify latent topics that predict the response variables for future unlabeled documents.
The sDCTM framework is attractive because it provides a supervised dynamic topic modeling framework to (i) estimate the evolution of the latent process that captures the dynamic evolution of words in the document; and (ii) classify the time series.
We apply the sDCTM method for promoter sequence identification in E. coli DNA sub-sequences. A promoter is a region of DNA where RNA polymerase begins to transcribe a gene, and it is usually located upstream of, or at the 5′ end of, the transcription initiation site in DNA. Alterations in DNA promoters have been implicated in numerous human diseases, including diabetes [] and Huntington’s disease []. Thus, the identification of DNA sequences containing promoters has gained significant attention from researchers in the field of bioinformatics. Several computational methods and tools have been developed to analyze DNA sequences and predict potential promoter regions. These include DeePromoter, a robust deep learning model for classifying promoter DNA sequences []; deep learning combined with continuous FastText N-grams []; and the position-correlation scoring matrix (PCSM) algorithm []. We use the sDCTM model to analyze the E. coli DNA sub-sequences by treating each DNA sub-sequence as a document and the presence/absence of promoter regions as the associated class label.
The format of this paper is as follows: In Section 2, we describe the framework of the supervised dynamic correlated topic model, along with details on inference and parameter estimation. Section 3 discusses the simulation study conducted to assess our method. In Section 4, we present the results of applying the sDCTM method for promoter sequence identification in E. coli DNA subsequences and compare our method with the various classification techniques.
3. Simulation Study
To evaluate the performance of our model, we conducted a simulation study. We generated a dataset consisting of documents, where each document is represented by a word time series of length . The vocabulary size was set to , and the documents were divided into classes. Each word time series was governed by an underlying latent topic structure with topics. Our model was evaluated using six-fold cross-validation with train and test documents. We conducted our analysis in Fortran 77. Given the observed word time series from M documents, we fitted the sDCTM model to estimate the model parameters, , , , and . As the model parameters are matrices, we assessed the estimation error using the squared Frobenius norm, which for two matrices \(A\) and \(B\) of the same dimension is defined as \(\|A-B\|_F^2 = \sum_{i,j} (a_{ij}-b_{ij})^2\).
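The error metric above is straightforward to compute directly. The following minimal sketch evaluates the squared Frobenius norm between two small matrices represented as nested lists; the matrices shown are illustrative placeholders, not the paper's estimated parameters.

```python
def squared_frobenius(A, B):
    """Return ||A - B||_F^2 = sum over i,j of (a_ij - b_ij)^2."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

# Hypothetical "true" and "estimated" parameter matrices.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.1, 1.9], [2.8, 4.2]]
print(squared_frobenius(A, B))  # ~0.10
```

A small value indicates that the estimated matrix is entrywise close to the truth, which is how Table 2 should be read.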
The average squared Frobenius norm (over the six-fold cross-validation) for each of the model parameters is shown in Table 2. The low values of the squared Frobenius norm on the model parameters suggest that we are able to estimate the matrices well.
Table 2.
Average squared Frobenius norm between the true and estimated model parameters using the sDCTM model on the simulated dataset.
We used the estimated model parameter and the variational parameter for and to predict the response associated with the training data containing documents as follows.
for and . Table 3 shows the confusion matrix obtained on the train data for one of the cross-validation folds. We were able to achieve a training accuracy of , and the average training accuracy across all cross-validation datasets was also .
Table 3.
Confusion matrix obtained from the sDCTM model on the train data of documents based on a simulation.
While classifying the test data, we estimated the variational parameters on the test documents given the estimated model parameters using variational inference. However, because we assume that the true labels of the test documents are unknown, we replaced the terms in the likelihood function associated with the response variable by a pseudo-label. The pseudo-label assigned to a test document is the class label associated with the nearest train document, where we used the Hamming distance to measure the distance between test and train sequences. Given the word time series, the associated pseudo-label, and the model parameters estimated using the train data, we estimated the variational parameters for each test document. We then obtained the test predictions using Equation (10). Table 4 shows the confusion matrix obtained on the test data for one of the cross-validation test datasets with a test accuracy of . The average test accuracy across all cross-validation datasets was .
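The pseudo-labeling step above amounts to a 1-nearest-neighbor lookup under the Hamming distance. A minimal sketch, with hypothetical toy sequences and labels (the paper's actual train/test documents are not reproduced here):

```python
def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def pseudo_label(test_seq, train_seqs, train_labels):
    """Class label of the training sequence nearest in Hamming distance."""
    distances = [hamming(test_seq, s) for s in train_seqs]
    return train_labels[distances.index(min(distances))]

train_seqs = ["aacgt", "ttgca", "aacct"]
train_labels = [1, 0, 1]
# "aacct" matches the third train sequence exactly, so it inherits label 1.
print(pseudo_label("aacct", train_seqs, train_labels))  # 1
```

Ties are broken here by taking the first minimum; any consistent tie-breaking rule would serve the same purpose in the variational inference step.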
Table 4.
Confusion matrix obtained from the sDCTM model on the test data in a simulation.
The model parameter, denoted as , provides an estimate of how words are distributed across a set of J latent topics. Figure 1 illustrates the proportions of words across each of the latent topics. We observed that Word 1 was highly probable within Topic 1, Words 3, 4, and 6 were highly probable within Topic 2, and Word 4 was highly probable within Topic 3.
Figure 1.
Estimated distribution of words over each of the latent topics on the simulated data.
4. Application for Promoter Sequence Identification in E. coli DNA Sub-Sequences
Proteins are one of the most important classes of biological molecules, being the carriers of the message contained in the DNA. An important process involved in the synthesis of proteins is called transcription. During transcription, a single-stranded RNA molecule, called messenger RNA, is synthesized (using the complementarity of the bases) from one of the strands of DNA corresponding to a gene (a gene is a segment of the DNA that codes for a type of protein). This process begins with the binding of an enzyme called RNA polymerase to a certain location on the DNA molecule. This exact site, which determines which of the two strands of DNA will be transcribed and in which direction, is recognized by the RNA polymerase due to the existence of certain regions of DNA placed near the transcription start site of a gene, called promoters. Because determining the promoter region in the DNA is an important step in the process of detecting genes, the problem of promoter identification is of major importance within the field of bioinformatics.
We considered a dataset of 106 E. coli DNA sub-sequences that is publicly available at https://archive.ics.uci.edu/dataset/67/molecular+biology+promoter+gene+sequences (accessed on 12 March 2023), among which 53 DNA sub-sequences contain promoters and the remaining 53 DNA sub-sequences do not contain promoters. Each sub-sequence consists of 57 nucleotides, represented by the four nucleotide bases: adenine (a), guanine (g), cytosine (c), and thymine (t). Given the two sets of DNA sub-sequences, with and without the presence of promoter regions, we used sDCTM to uncover the hidden patterns or motifs that could serve as markers for promoter presence to classify the sub-sequences. A detailed data description is provided in [].
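Treating each sub-sequence as a categorical word time series requires mapping the four nucleotide bases to categorical levels. A minimal sketch of this encoding; the particular integer coding below is an illustrative choice, not the paper's internal representation.

```python
# Map each nucleotide base to a categorical level (illustrative coding).
BASES = {"a": 1, "c": 2, "g": 3, "t": 4}

def encode(sequence):
    """Represent a nucleotide string as a categorical time series."""
    return [BASES[base] for base in sequence.lower()]

print(encode("tactag"))  # [4, 1, 2, 4, 1, 3]
```

Each of the 106 sub-sequences then becomes a length-57 categorical series over a vocabulary of four levels, paired with a binary promoter presence/absence label.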
We present the results of the sDCTM model applied to analyze the E. coli DNA sub-sequences. We treated each DNA sub-sequence as a document and the presence/absence of promoter regions as the associated class label. Each word in the document was represented by the nucleotide observed at that position in the sub-sequence. We divided the data into an 80–20 train–test split in order to assess the model performance. We trained the model on data with (number of train DNA sub-sequences) and set (length of each DNA sub-sequence), (number of unique nucleotide levels), and (indicating the levels of the response variable). To choose the number of topics J, we ran the model for different values of J and chose the value yielding the highest test accuracy. We set the number of latent topics as . The confusion matrices obtained on the train and test data with (number of test DNA sub-sequences) using the sDCTM model are shown in Table 5 and Table 6. We were able to achieve a training accuracy of and a test accuracy of .
Table 5.
Confusion matrix obtained from the sDCTM model on the promoter identification train data with DNA sub-sequences.
Table 6.
Confusion matrix obtained from the sDCTM model on the promoter identification test data with DNA sub-sequences.
The distribution of nucleotides across the three topics identified by the sDCTM model is shown in Figure 2. We observed that nucleotide levels a, c, and g were highly probable within Topic 1, nucleotide level t was highly probable within Topic 2, and nucleotide level a was highly probable within Topic 3. In addition, based on the model predictions, we also constructed a sequence logo plot [] to identify the diversity of sequences between the two predicted groups. The sequence logo plot based on the train model predictions is shown in Figure 3. We can see that the promoter presence plot demonstrated significantly higher conservation at various positions compared with the promoter absence plot. The higher bit values in the promoter presence plot highlight critical motifs essential for promoter activity. In contrast, the promoter absence plot showed much lower information content, suggesting fewer conserved elements.
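The "bits" in a sequence logo measure per-position information content: for a four-letter alphabet, \(R_i = \log_2 4 - H_i\), where \(H_i\) is the Shannon entropy of the base frequencies at position i. A minimal sketch of this computation (the small-sample correction used in full logo software is omitted here):

```python
import math

def position_information(columns):
    """Per-position information content in bits, as in a sequence logo:
    R_i = log2(4) - H_i, where H_i is the Shannon entropy of the base
    frequencies observed at position i across the aligned sequences."""
    bits = []
    for col in columns:
        total = len(col)
        counts = {base: col.count(base) for base in set(col)}
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        bits.append(math.log2(4) - entropy)
    return bits

# Two toy alignment columns: fully conserved vs. maximally variable.
print(position_information([["a", "a", "a", "a"],
                            ["a", "c", "g", "t"]]))  # ~[2.0, 0.0]
```

A fully conserved position carries the maximum 2 bits, while a position with all four bases equally represented carries 0 bits, which is why conserved motifs stand out as tall stacks in Figure 3.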
Figure 2.
Estimated distribution of nucleotides over each of the latent topics.
Figure 3.
Sequence logo plot based on model predictions on the train data.
Comparison with Other Methods
We compared our sDCTM model with existing classification techniques, such as SVM, k-NN, and classification trees []. We used the k-NN classification technique, which is a popular approach for sequence classification [], on the observed data. We compared our approach to popular machine learning techniques like SVM and classification trees on features extracted from the DNA sequence. We used the Haar wavelet transform, the simplest discrete wavelet transform (DWT), to extract features from the observed time series and build a classification model based on these features. We implemented the k-NN classification technique using the R package class [], which identifies nearest neighbors (in Euclidean distance) to classify the train and test data. We used the R package wavelets [] to extract the DWT (with Haar filter) coefficients. The classification tree was implemented using the R package party [], and the SVM was implemented using the R package e1071 []. We ran each of these methods using an 80–20 train–test split to assess model performance. The train and test accuracies are shown in Table 7. In comparison with the other classification techniques, sDCTM performed better on both the train and test data.
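For intuition on the feature extraction used by the competing methods, one level of the Haar DWT replaces each adjacent pair of values with a (scaled) sum and difference, yielding smooth (approximation) and fluctuation (detail) coefficients. A minimal pure-Python sketch, standing in for the R package wavelets used in the actual comparison:

```python
import math

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform.

    Returns (approximation, detail) coefficients; the input length
    is assumed to be even.
    """
    approx = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    return approx, detail

approx, detail = haar_dwt([4.0, 2.0, 5.0, 5.0])
print(approx)  # [6/sqrt(2), 10/sqrt(2)] ~ [4.243, 7.071]
print(detail)  # [2/sqrt(2), 0.0]      ~ [1.414, 0.0]
```

In the comparison, such coefficients computed from the (numerically coded) DNA series serve as input features for the SVM and classification tree.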
Table 7.
Comparing accuracy using the sDCTM, k-NN, classification tree, and SVM methods on the train and test data.
5. Discussion
In this paper, we introduced the sDCTM model framework, which aims to classify categorical time series by dynamically modeling the underlying latent topics that capture the hidden thematic structure of the time series collection. We demonstrated the applicability of our method to real data by applying it to the promoter identification data to classify E. coli DNA sub-sequences based on promoter presence. Using sDCTM, we estimated the dynamic latent topic structure that serves as a marker for promoter presence in a DNA sub-sequence and classified the sub-sequences. Among the latent topics obtained using sDCTM, we observed that nucleotide levels a, c, and g were highly probable within Topic 1, nucleotide level t was highly probable within Topic 2, and nucleotide level a was highly probable within Topic 3. In addition, the sequence logo plot based on the model predictions showed higher conservation at various positions in DNA sub-sequences with predicted promoter presence in comparison with DNA sub-sequences with predicted promoter absence. We compared our method with the k-NN, SVM, and classification tree methods in a train–test data setup. Our comparative study results indicated that the sDCTM model performed better than the other classification techniques on both the training and test datasets. This indicates that the underlying latent topic structure estimated by the sDCTM model identifies promoter presence/absence in a DNA sub-sequence better than the other classification approaches. An extension of the sDCTM model accommodating word time series of varying lengths can be derived as part of future work.
Author Contributions
Conceptualization, N.P., N.R. and S.R.; methodology, N.P., N.R. and S.R.; formal analysis, N.P., N.R. and S.R.; investigation, N.P., N.R. and S.R.; data curation, N.P.; writing—original draft preparation, N.P., N.R. and S.R.; writing—review and editing, N.P., N.R. and S.R.; visualization, N.P., N.R. and S.R.; supervision, N.R. and S.R.; project administration, N.R. and S.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data and analysis code are available on GitHub at https://github.com/NamithaVionaPais/sDTCM (accessed on 10 June 2024).
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Approximate Inference for sDCTM
We derive the evidence lower bound (ELBO) based on our choice of variational distribution defined in (4). Then, we derive the variational inference algorithm for the sDCTM model by maximizing the ELBO to estimate the variational parameters.
Appendix A.1. ELBO
Appendix A.2. Variational Multinomial
Appendix A.3. Variational Gaussian
Appendix A.4. Estimation for sDCTM
In this section, we provide update equations to estimate the model parameters based on the lower bound on the log likelihood of the data.
Then,
Appendix A.5. Conditional Gaussian
Appendix A.6. Conditional Multinomial
Appendix A.7. Softmax Regression
References
- Le Nguyen, T.; Gsponer, S.; Ilie, I.; O’reilly, M.; Ifrim, G. Interpretable Time Series Classification using Linear Models and Multi-resolution Multi-domain Symbolic Representations. Data Min. Knowl. Discov. 2019, 33, 1183–1222. [Google Scholar] [CrossRef]
- Neuhauser, D.S.; Allen, R.M.; Zuzlewski, S. Northern California Earthquake Data Center: Data Sets and Data Services. In AGU Fall Meeting Abstracts; American Geophysical Union: Washington, DC, USA, 2015; Volume 2015, p. S53A-2774. [Google Scholar]
- Olszewski, R.T. Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data; Carnegie Mellon University: Pittsburgh, PA, USA, 2001. [Google Scholar]
- Bagnall, A.; Lines, J.; Bostrom, A.; Large, J.; Keogh, E. The Great Time Series Classification Bake Off: A Review and Experimental Evaluation of Recent Algorithmic Advances. Data Min. Knowl. Discov. 2017, 31, 606–660. [Google Scholar] [CrossRef]
- Billingsley, P. Statistical Methods in Markov Chains. In The Annals of Mathematical Statistics; Institute of Mathematical Statistics: Hayward, CA, USA, 1961; pp. 12–40. [Google Scholar]
- Fahrmeir, L.; Kaufmann, H. Regression Models for Non-Stationary Categorical Time Series. J. Time Ser. Anal. 1987, 8, 147–160. [Google Scholar] [CrossRef]
- Fokianos, K.; Kedem, B. Prediction and Classification of Non-Stationary Categorical Time Series. J. Multivar. Anal. 1998, 67, 277–296. [Google Scholar] [CrossRef]
- Navarro, G. A Guided Tour to Approximate String Matching. ACM Comput. Surv. (CSUR) 2001, 33, 31–88. [Google Scholar] [CrossRef]
- Jurafsky, D.; Martin, J.H. Naïve Bayes Classifier Approach to Word Sense Disambiguation. In Computational Lexical Semantics; University of Groningen: Groningen, The Netherlands, 2009. [Google Scholar]
- Deshpande, M.; Karypis, G. Evaluation of Techniques for Classifying Biological Sequences. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan, 6–8 May 2002; Springer: Berlin/Heidelberg, Germany, 2002; pp. 417–431. [Google Scholar]
- Stoffer, D.S.; Tyler, D.E.; Wendt, D.A. The Spectral Envelope and its Applications. In Statistical Science; Institute of Mathematical Statistics: Hayward, CA, USA, 2000; pp. 224–253. [Google Scholar]
- Li, Z.; Bruce, S.A.; Cai, T. Classification of Categorical Time Series Using the Spectral Envelope and Optimal Scalings. arXiv 2021, arXiv:2102.02794. [Google Scholar]
- Yan, X.; Guo, J.; Liu, S.; Cheng, X.; Wang, Y. Learning Topics in short texts by Non-Negative Matrix Factorization on Term Correlation Matrix. In Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, Austin, TX, USA, 2–4 May 2013; pp. 749–757. [Google Scholar]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Roberts, M.E.; Stewart, B.M.; Tingley, D.; Airoldi, E.M. The Structural Topic Model and Applied Social Science. In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation. Harrahs and Harveys, Lake Tahoe; Harvard University: Cambridge, MA, USA, 2013; Volume 4, pp. 1–20. [Google Scholar]
- Blei, D.M.; Lafferty, J.D. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 113–120. [Google Scholar]
- Wang, C.; Blei, D.; Heckerman, D. Continuous Time Dynamic Topic Models. arXiv 2012, arXiv:1206.3298. [Google Scholar]
- Ionescu-Tîrgovişte, C.; Gagniuc, P.A.; Guja, C. Structural properties of gene promoters highlight more than two phenotypes of diabetes. PLoS ONE 2015, 10, e0137950. [Google Scholar] [CrossRef] [PubMed]
- Coles, R.; Caswell, R.; Rubinsztein, D.C. Functional analysis of the Huntington’s disease (HD) gene promoter. Hum. Mol. Genet. 1998, 7, 791–800. [Google Scholar] [CrossRef]
- Oubounyt, M.; Louadi, Z.; Tayara, H.; Chong, K.T. DeePromoter: Robust Promoter Predictor using Deep Learning. Front. Genet. 2019, 10, 286. [Google Scholar] [CrossRef] [PubMed]
- Le, N.Q.K.; Yapp, E.K.Y.; Nagasundaram, N.; Yeh, H.Y. Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams. Front. Bioeng. Biotechnol. 2019, 7, 305. [Google Scholar] [CrossRef] [PubMed]
- Li, Q.Z.; Lin, H. The Recognition and Prediction of σ70 Promoters in Escherichia Coli K-12. J. Theor. Biol. 2006, 242, 135–141. [Google Scholar] [CrossRef] [PubMed]
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
- Visual Numerics. IMSL® Fortran Numerical Math Library; Visual Numerics Inc.: Houston, TX, USA, 2007. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22. [Google Scholar] [CrossRef]
- Lütkepohl, H. New Introduction to Multiple Time Series Analysis; Springer Science & Business Media: New York, NY, USA, 2005. [Google Scholar]
- Goffe, W.L.; Ferrier, G.D.; Rogers, J. Global Optimization of Statistical Functions with Simulated Annealing. J. Econom. 1994, 60, 65–99. [Google Scholar] [CrossRef]
- Rajasekaran, S. On Simulated Annealing and Nested Annealing. J. Glob. Optim. 2000, 16, 43–56. [Google Scholar] [CrossRef]
- Brainerd, W. Fortran 77. Commun. ACM 1978, 21, 806–820. [Google Scholar] [CrossRef]
- Czibula, G.; Bocicor, M.I.; Czibula, I.G. Promoter sequences prediction using relational association rule mining. Evol. Bioinform. 2012, 8, EBO-S9376. [Google Scholar] [CrossRef] [PubMed]
- Schneider, T.D.; Stephens, R.M. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 1990, 18, 6097–6100. [Google Scholar] [CrossRef] [PubMed]
- Zhao, Y. R and Data Mining: Examples and Case Studies; Academic Press: New York, NY, USA, 2012. [Google Scholar]
- Venables, W.N.; Ripley, B.D. R Package Class Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002; ISBN 0-387-95457-0. [Google Scholar]
- Percival, D.B.; Walden, A.T. Wavelet Methods for Time Series Analysis; Cambridge University Press: Cambridge, UK, 2000; Volume 4. [Google Scholar]
- Strobl, C.; Malley, J.; Tutz, G. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol. Methods 2009, 14, 323. [Google Scholar] [CrossRef] [PubMed]
- Meyer, D.; Wien, F. Support Vector Machines. R News 2001, 1, 23–26. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).