1. Introduction
Data on multiple time series or sequences are ubiquitous in various domains, and their classification finds applications in numerous areas, such as human motion classification [
1], earthquake prediction [
2], and heart attack detection [
3]. The classification of multiple real-valued time series has been well studied in the statistical literature (see the detailed review in [
4]). However, for analyzing categorical time series, most statistical methods are primarily focused on examining a single time series. A few methods include the Markov chain model [
5], the link function approach [
6], and the likelihood-based method [
7]. In the computer science literature, a number of sequence classification methods have been developed that are black-box in nature and may be difficult to interpret. These include the minimum edit distance classifier with sequence alignment [
8,
9] and Markov chain-based classifiers [
10]. Overall, the classification of multiple categorical time series has not received much attention. Recent work has discussed a novel approach to the classification of categorical time series in the supervised learning framework, using a spectral envelope and optimal scalings [
11,
12]. We present an alternative approach for classifying multiple categorical time series in the topic modeling framework.
Topic modeling algorithms generally examine a set of documents referred to as a corpus in the natural language processing (NLP) domain. Often, the sets of words observed in documents appear to represent a coherent theme or topic. Topic models analyze the words in the documents in order to uncover themes that run through them. Many topic modeling algorithms have been developed over time, including non-negative matrix factorization [
13], latent Dirichlet allocation (LDA) [
14], and structural topic models [
15]. Topic modeling in the dynamic framework has been extensively studied to model the dynamic evolution of document collections over time [
16,
17]. However, the dynamic evolution of words over time within each document has not been addressed; the topic modeling literature typically assumes that words are interchangeable. We relax this assumption to model word time series (documents). We build a family of probabilistic dynamic topic models by extending the correlated topic model (CTM) in a supervised framework, modeling the dynamic evolution of the underlying latent stochastic process that reflects the thematic structure of the time series collection, and classifying the documents. Topic models like latent Dirichlet allocation, correlated topic models, and dynamic topic models are unsupervised, as only the documents are used to identify topics by maximizing the likelihood (or the posterior probability) of the collection. Such modeling frameworks rest on the hope that the inferred topics will be useful for categorization, and they are appropriate when no response is available and we want to infer useful information from the observed documents. However, when the main goal is prediction, a supervised topic modeling framework is beneficial, as jointly modeling the documents and the responses can identify latent topics that predict the response variables for future unlabeled documents. Our supervised dynamic correlated topic model (sDCTM) framework is attractive because it provides a supervised dynamic topic modeling framework to (i) estimate the evolution of the latent process that captures the dynamic evolution of words in the document; and (ii) classify the time series.
We apply the sDCTM method for promoter sequence identification in
E. coli DNA sub-sequences. A promoter is a region of DNA where RNA polymerase begins to transcribe a gene, and it is usually located upstream of (at the 5′ end of) the transcription initiation site in DNA. Defects in DNA promoters have been implicated as a primary cause of numerous human diseases, including diabetes [
18] and Huntington’s disease [
19]. Thus, the identification of DNA sequences containing promoters has gained significant attention from researchers in the field of bioinformatics. Several computational methods and tools have been developed to analyze DNA sequences and predict potential promoter regions. These include DeePromoter, a robust deep learning model for classifying promoter DNA sequences [20]; deep learning in combination with continuous FastText N-grams [21]; and the position-correlation scoring matrix (PCSM) algorithm [
22]. We use the sDCTM model to analyze the
E. coli DNA sub-sequences by treating each DNA sub-sequence as a document and the presence/absence of promoter regions as the associated class label.
The format of this paper is as follows: In
Section 2, we describe the framework of the supervised dynamic correlated topic model, along with details on inference and parameter estimation.
Section 3 discusses the simulation study conducted to assess our method. In
Section 4, we present the results of applying the sDCTM method for promoter sequence identification in
E. coli DNA sub-sequences and compare our method with various classification techniques.
2. Supervised Dynamic Correlated Topic Model
Topic models are traditionally developed by treating words as interchangeable to identify semantic themes within each document. Our method aims to develop a family of probabilistic time series models to analyze the evolution of words over time within document collections. We assume that each document, represented by a categorical time series of words, arises from a generative process that includes latent variables. sDCTM provides a supervised framework for time series classification, allowing it to model class labels associated with each word time series. Alternatively, by removing the response component of the generative process, we can derive an unsupervised dynamic correlated topic model (DCTM).
Suppose we have a corpus $\mathcal{D}$ consisting of $M$ documents. We represent the $d$th document as $\boldsymbol{w}_d = (w_{d,1}, \dots, w_{d,T})$, where $w_{d,t}$ corresponds to a word observed at time point $t$ in the $d$th document for times $t = 1, \dots, T$. Each word is represented by a $V$-dimensional unit (basis) vector, i.e., $w_{d,t} \in \{0,1\}^V$ such that $w_{d,t}^v = 1$ and $w_{d,t}^{v'} = 0$ for $v' \neq v$ when $w_{d,t}$ corresponds to the $v$th word from the vocabulary, for $d = 1, \dots, M$, where $V$ denotes the number of levels associated with the categorical word time series (document). Additionally, let $y_d$ be a multinomial response associated with the $d$th document for $d = 1, \dots, M$. Each $y_d$ is represented by a $C$-dimensional unit (basis) vector, i.e., $y_d \in \{0,1\}^C$ such that $y_d^c = 1$ and $y_d^{c'} = 0$ for $c' \neq c$ when $y_d$ corresponds to the $c$th class, for $c = 1, \dots, C$, where $C$ denotes the number of levels associated with the response variable. Suppose $J$ is the number of assumed latent topics for capturing the hidden thematic structure in the corpus; then, $\eta_{d,t}$ is a $J$-dimensional vector corresponding to the latent topic structure for the $d$th document at time $t$.
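To make the notation concrete, the following short Python sketch (illustrative only; the implementation in this paper is written in Fortran 77) builds the one-hot word representation described above for a hypothetical vocabulary.

```python
import numpy as np

# Hypothetical vocabulary of V = 4 categories (e.g., nucleotides).
vocab = ["a", "c", "g", "t"]

def one_hot_encode(sequence, vocab):
    """Represent a categorical time series as T unit (basis) vectors.

    Returns a (T, V) array whose t-th row w_t has w_t[v] = 1 exactly
    when the t-th word is the v-th vocabulary entry."""
    index = {word: v for v, word in enumerate(vocab)}
    W = np.zeros((len(sequence), len(vocab)))
    for t, word in enumerate(sequence):
        W[t, index[word]] = 1.0
    return W

# A document is a word time series w_{d,1}, ..., w_{d,T}.
doc = list("acggta")
W = one_hot_encode(doc, vocab)
print(W)                                # each row is a unit vector
assert np.all(W.sum(axis=1) == 1)
```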
In the promoter identification data, we treat each DNA sub-sequence as a word time series (document) with $V = 4$ levels (indicating the nucleotides a, c, g, and t) and $T = 57$ time points. The response variable $y_d$ for $d = 1, \dots, M$ indicates the presence/absence of a promoter region with $C = 2$ levels. The number of assumed topics $J$ in the sDCTM model captures the hidden patterns within the observed DNA sub-sequences. Under the sDCTM framework, the $d$th document (DNA sub-sequence) and its associated class label (presence/absence of a promoter) for $d = 1, \dots, M$ arise from the following generative process:

1. For $t = 1, \dots, T$:
   (a) Choose the latent topic structure $\eta_{d,t} \mid \eta_{d,t-1} \sim \mathcal{N}(A\,\eta_{d,t-1}, \Sigma)$, a VAR(1) process.
   (b) Choose a topic $z_{d,t} \sim \mathrm{Multinomial}(\theta_{d,t})$, where $\theta_{d,t}^j = \exp(\eta_{d,t}^j)/\sum_{j'=1}^{J}\exp(\eta_{d,t}^{j'})$ for $j = 1, \dots, J$.
   (c) Choose a word $w_{d,t} \sim \mathrm{Multinomial}(\beta_{z_{d,t}})$.
2. Draw the class label $y_d \sim \mathrm{Multinomial}(\mathrm{softmax}(B\,\bar{z}_d))$, where $\bar{z}_d = \frac{1}{T}\sum_{t=1}^{T} z_{d,t}$ is the vector of topic frequencies for the $d$th document.

The per-word topic indicator for each word $w_{d,t}$ is represented by $z_{d,t}$ for $t = 1, \dots, T$ and $d = 1, \dots, M$. The model parameters include the latent VAR(1) parameters, namely the $J \times J$ matrices $A$ and $\Sigma$; the $J \times V$ matrix of word probabilities $\beta$; and the $C \times J$ matrix of regression coefficients $B$. In Section 2.1, we discuss approximate inference techniques based on variational methods and construct the lower bound (Constructing the Lower Bound) to estimate the variational parameters, which can be used to approximate the posterior. In Section 2.2, we discuss the variational EM algorithm to estimate the model parameters. We also discuss simulated annealing (Section 2.2.1), a probabilistic technique employed in optimization problems, used to estimate certain model parameters, along with details on selecting initial values (Section 2.2.2) for this optimization technique.
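Before turning to inference, the following Python sketch simulates one document–label pair under the reconstruction of the generative process given above. The parameter values, dimensions, and the softmax link are illustrative assumptions, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
J, V, C, T = 3, 4, 2, 57           # topics, vocabulary, classes, length (illustrative)

# Hypothetical model parameters (in practice these are estimated).
A     = 0.5 * np.eye(J)                        # VAR(1) transition matrix (stationary)
Sigma = 0.1 * np.eye(J)                        # VAR(1) innovation covariance (pos. def.)
beta  = rng.dirichlet(np.ones(V), size=J)      # J x V topic-word probabilities
B     = rng.normal(size=(C, J))                # C x J regression coefficients

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_document():
    """Draw one (document, label) pair from the sketched generative process."""
    eta = np.zeros(J)                          # latent state before t = 1
    words, z_bar = [], np.zeros(J)
    for t in range(T):
        eta = rng.multivariate_normal(A @ eta, Sigma)   # latent VAR(1) step
        z = rng.choice(J, p=softmax(eta))               # per-word topic indicator
        words.append(rng.choice(V, p=beta[z]))          # word given topic
        z_bar[z] += 1.0 / T                             # empirical topic frequencies
    y = rng.choice(C, p=softmax(B @ z_bar))             # class label from z_bar
    return words, y

doc, label = generate_document()
print(doc[:10], label)
```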
2.1. Approximate Inference
Given the $d$th document $\boldsymbol{w}_d$ and the associated response $y_d$ for $d = 1, \dots, M$, the posterior distribution of the latent variables $(\boldsymbol{\eta}_d, \boldsymbol{z}_d)$ is

$$p(\boldsymbol{\eta}_d, \boldsymbol{z}_d \mid \boldsymbol{w}_d, y_d) = \frac{p(\boldsymbol{\eta}_d, \boldsymbol{z}_d, \boldsymbol{w}_d, y_d)}{p(\boldsymbol{w}_d, y_d)} \quad (3)$$

for $d = 1, \dots, M$. The numerator term in Equation (3), $p(\boldsymbol{\eta}_d, \boldsymbol{z}_d, \boldsymbol{w}_d, y_d)$, corresponds to the joint distribution of the latent variables $(\boldsymbol{\eta}_d, \boldsymbol{z}_d)$ and the observed document–response pair $(\boldsymbol{w}_d, y_d)$ and is given by

$$p(\boldsymbol{\eta}_d, \boldsymbol{z}_d, \boldsymbol{w}_d, y_d) = p(y_d \mid \boldsymbol{z}_d) \prod_{t=1}^{T} p(\eta_{d,t} \mid \eta_{d,t-1})\, p(z_{d,t} \mid \eta_{d,t})\, p(w_{d,t} \mid z_{d,t})$$

for $d = 1, \dots, M$. This joint distribution of the latent variables and the document–response pair can be factorized in this way due to the sDCTM framework. The denominator term in Equation (3), i.e., $p(\boldsymbol{w}_d, y_d)$, represents the joint distribution of the $d$th document–response pair $(\boldsymbol{w}_d, y_d)$ and is given by

$$p(\boldsymbol{w}_d, y_d) = \int \sum_{\boldsymbol{z}_d} p(\boldsymbol{\eta}_d, \boldsymbol{z}_d, \boldsymbol{w}_d, y_d)\, d\boldsymbol{\eta}_d$$

for $d = 1, \dots, M$.
Computing the posterior distribution in Equation (3) is not analytically tractable. Hence, we use variational methods that consider a simple family of distributions over the latent variables, indexed by free variational parameters. The variational parameters are estimated to minimize the Kullback–Leibler (KL) divergence [23] between the variational distribution and the true posterior distribution. The latent variables in the sDCTM framework include the per-word topic structure $\eta_{d,t}$ and the per-word topic assignment $z_{d,t}$ for $t = 1, \dots, T$ and $d = 1, \dots, M$. We define the approximate variational posterior for the $d$th document as

$$q(\boldsymbol{\eta}_d, \boldsymbol{z}_d) = q(\eta_{d,1}, \dots, \eta_{d,T} \mid \hat{\eta}_{d,1}, \dots, \hat{\eta}_{d,T}) \prod_{t=1}^{T} q(z_{d,t} \mid \phi_{d,t}) \quad (4)$$

for $d = 1, \dots, M$, where $q(z_{d,t} \mid \phi_{d,t})$ is multinomial with variational parameter $\phi_{d,t}$, and we obtain $q(\eta_{d,1}, \dots, \eta_{d,T})$ using a dynamic model with Gaussian “variational observations” $\hat{\eta}_{d,1}, \dots, \hat{\eta}_{d,T}$. This dynamic model is defined using a variational Kalman filter, where the variational parameters $\hat{\eta}_{d,t}$ are treated as observations. Using the variational distribution, we form a variational state space model as follows. For $t = 1, \dots, T$,

$$\hat{\eta}_{d,t} \mid \eta_{d,t} \sim \mathcal{N}(\eta_{d,t}, \hat{\nu}^2 I_J), \qquad \eta_{d,t} \mid \eta_{d,t-1} \sim \mathcal{N}(A\,\eta_{d,t-1}, \Sigma).$$

The Gaussian variational forward filtering distribution $p(\eta_{d,t} \mid \hat{\eta}_{d,1:t}) \equiv \mathcal{N}(m_t, V_t)$ using standard Kalman filter calculations is characterized as follows. For $t = 1, \dots, T$,

$$m_t = A\,m_{t-1} + K_t(\hat{\eta}_{d,t} - A\,m_{t-1}), \qquad V_t = (I_J - K_t)(A\,V_{t-1}A^\top + \Sigma),$$

where $K_t = (A\,V_{t-1}A^\top + \Sigma)(A\,V_{t-1}A^\top + \Sigma + \hat{\nu}^2 I_J)^{-1}$ is the Kalman gain matrix and the initial conditions are specified by $m_0$ and $V_0$.

Similarly, the Gaussian variational backward smoothing distribution $p(\eta_{d,t} \mid \hat{\eta}_{d,1:T}) \equiv \mathcal{N}(\widetilde{m}_t, \widetilde{V}_t)$ using standard Kalman smoothing calculations is characterized as follows. For $t = T-1, \dots, 1$,

$$\widetilde{m}_t = m_t + J_t(\widetilde{m}_{t+1} - A\,m_t), \qquad \widetilde{V}_t = V_t + J_t(\widetilde{V}_{t+1} - A\,V_tA^\top - \Sigma)J_t^\top,$$

where $J_t = V_tA^\top(A\,V_tA^\top + \Sigma)^{-1}$ and the initial conditions are specified by $\widetilde{m}_T = m_T$ and $\widetilde{V}_T = V_T$.

Using the Kalman filter equations, the approximate variational posterior marginals for the $d$th document are given by $q(\eta_{d,t}) = \mathcal{N}(\widetilde{m}_t, \widetilde{V}_t)$ and $q(z_{d,t} = j) = \phi_{d,t}^j$ for $t = 1, \dots, T$, $j = 1, \dots, J$, and $d = 1, \dots, M$.
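For concreteness, the recursions above can be written compactly in code. The following numpy sketch implements a generic forward Kalman filter and backward (RTS) smoother of this form on hypothetical variational observations; it is a schematic of the standard calculations, not the paper's Fortran implementation.

```python
import numpy as np

def variational_kalman(eta_hat, A, Sigma, nu2, m0, V0):
    """Forward filter and backward smoother for the variational state space
    model with observations eta_hat[t] = eta_t + N(0, nu2 * I)."""
    T, J = eta_hat.shape
    I = np.eye(J)
    m  = np.zeros((T, J)); Vf = np.zeros((T, J, J))    # filtered moments
    mp = np.zeros((T, J)); Vp = np.zeros((T, J, J))    # predicted moments
    m_prev, V_prev = m0, V0
    for t in range(T):
        # Predict: m_t^- = A m_{t-1},  V_t^- = A V_{t-1} A' + Sigma
        mp[t] = A @ m_prev
        Vp[t] = A @ V_prev @ A.T + Sigma
        # Update with Kalman gain K_t = V_t^- (V_t^- + nu2 I)^{-1}
        K = Vp[t] @ np.linalg.inv(Vp[t] + nu2 * I)
        m[t]  = mp[t] + K @ (eta_hat[t] - mp[t])
        Vf[t] = (I - K) @ Vp[t]
        m_prev, V_prev = m[t], Vf[t]
    # Backward (RTS) smoothing, initialized at the last filtered moments.
    ms, Vs = m.copy(), Vf.copy()
    for t in range(T - 2, -1, -1):
        Jt = Vf[t] @ A.T @ np.linalg.inv(Vp[t + 1])
        ms[t] = m[t] + Jt @ (ms[t + 1] - mp[t + 1])
        Vs[t] = Vf[t] + Jt @ (Vs[t + 1] - Vp[t + 1]) @ Jt.T
    return ms, Vs

# Toy usage with J = 3 latent topics and T = 10 time points.
rng = np.random.default_rng(1)
J = 3
ms, Vs = variational_kalman(rng.normal(size=(10, J)), 0.5 * np.eye(J),
                            0.1 * np.eye(J), nu2=0.2,
                            m0=np.zeros(J), V0=np.eye(J))
print(ms.shape, Vs.shape)
```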
Constructing the Lower Bound
Given the variational distribution defined in Equation (4), our goal is to estimate the variational parameters $\hat{\eta}_{d,t}$, $\hat{\nu}^2$, and $\phi_{d,t}$ for $t = 1, \dots, T$ and $d = 1, \dots, M$ in order to minimize the KL divergence between the variational distribution defined in Equation (4) and the true posterior defined in Equation (3). This optimization problem of minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO) [24] defined below:

$$\mathcal{L}_d = \mathbb{E}_q\left[\log p(\boldsymbol{\eta}_d, \boldsymbol{z}_d, \boldsymbol{w}_d, y_d)\right] - \mathbb{E}_q\left[\log q(\boldsymbol{\eta}_d, \boldsymbol{z}_d)\right].$$
We estimate the variational parameters $\hat{\eta}_{d,t}$, $\hat{\nu}^2$, and $\phi_{d,t}$ for $t = 1, \dots, T$ and $d = 1, \dots, M$ by maximizing the ELBO for approximate inference. In Appendix A, we derive the ELBO and provide the update equations used to estimate the variational parameters. We estimate the variational parameter $\phi_{d,t}$ using a fixed-point update; the remaining variational parameters are estimated using constrained optimization via the DBCPOL routine available in the IMSL library [25] for Fortran 77 and using simulated annealing, which we describe in Section 2.2.1.
2.2. Parameter Estimation
In this section, we present a method for parameter estimation. In particular, given the corpus $\mathcal{D}$ with $M$ documents and their associated responses denoted by $y_1, \dots, y_M$, we wish to estimate the model parameters $A$, $\Sigma$, $\beta$, and $B$ in order to maximize the log likelihood of the data, given by

$$\ell(A, \Sigma, \beta, B) = \sum_{d=1}^{M} \log p(\boldsymbol{w}_d, y_d \mid A, \Sigma, \beta, B). \quad (8)$$

Because the log likelihood is analytically intractable, we consider a tractable lower bound on the log likelihood obtained by summing the ELBO, defined in Section 2.1, over the $M$ documents. We estimate the model parameters via an alternating variational EM (VEM) procedure [26] described below:

1. (E-step) For each document, maximize the lower bound with respect to the variational parameters.
2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters $A$, $\Sigma$, $\beta$, and $B$.

These two steps are repeated until the lower bound on the log likelihood converges.
In Appendix A, we derive the lower bound on the log likelihood defined in Equation (8) and provide the update equations used to estimate the model parameters. We estimate the model parameter $\beta$ using a fixed-point update. The model parameters $A$ (with the stationarity condition), $\Sigma$ (with the positive definite condition), and $B$ are estimated using a randomized search optimization technique, simulated annealing, which we describe in the next section.
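Schematically, the alternation looks as follows. The sketch below is a generic VEM skeleton in Python; the e_step, m_step, and elbo callables are placeholders for the document-level updates and M-step formulas derived in Appendix A, not the paper's Fortran implementation.

```python
def variational_em(docs, labels, params, e_step, m_step, elbo,
                   tol=1e-4, max_iter=100):
    """Generic alternating VEM loop: E-step per document, then M-step.

    e_step(doc, label, params) -> variational parameters for one document
    m_step(docs, labels, var_params) -> updated model parameters
    elbo(docs, labels, var_params, params) -> lower bound on the log likelihood
    """
    bound_old = -float("inf")
    for _ in range(max_iter):
        # E-step: maximize the ELBO over variational parameters, per document.
        var_params = [e_step(w, y, params) for w, y in zip(docs, labels)]
        # M-step: maximize the summed ELBO over the model parameters.
        params = m_step(docs, labels, var_params)
        bound = elbo(docs, labels, var_params, params)
        if abs(bound - bound_old) < tol:   # stop when the bound stabilizes
            break
        bound_old = bound
    return params, var_params
```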
For parameter estimation, we use frequentist methods to estimate the unknown parameters of the sDCTM model. As an alternative, one can set up a Bayesian model for parameter estimation in the sDCTM model by employing a Minnesota prior on $A$ [27], an inverse Wishart or Lewandowski–Kurowicka–Joe (LKJ) prior on $\Sigma$, a Dirichlet prior on $\beta$, and a multivariate normal prior on $B$.
2.2.1. Simulated Annealing
Simulated annealing (SA) is a probabilistic technique employed in optimization problems to find a good approximation to the global optimum of a given function [28,29]. It is often used when the search space is discrete and is inspired by the annealing technique in metallurgy, which involves the heating and controlled cooling of a material to increase the size of its crystals and reduce their defects. The algorithm starts with a randomly generated solution and iteratively explores its neighboring states to find better solutions. The acceptance probability mechanism occasionally accepts solutions that are worse than the current one in order to avoid getting stuck in a local optimum. Algorithm 1 presents structured pseudocode for executing the simulated annealing technique, and
Table 1 describes the initial parameters.
Algorithm 1 Simulated Annealing for Optimization

procedure SimulatedAnnealing(initial solution x, initial temperature T, initial step length, cooling rate r_T)
    x_best ← x
    repeat
        for N_T iterations do
            for N_s iterations do
                propose a candidate x′ by a random move within the current step length
                if f(x′) is better than f(x) then accept the move: x ← x′
                else accept x ← x′ with probability exp{−|f(x′) − f(x)|/T}
                if f(x) is better than f(x_best) then x_best ← x
            end for
            adjust the step length so that about half of the proposed moves are accepted
        end for
        set T ← r_T · T
    until convergence or N iterations
    report x_best
end procedure
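As a concrete illustration of the core loop in Algorithm 1, the following minimal Python routine implements generic simulated annealing with a Metropolis acceptance rule, geometric cooling, and a step-length adjustment targeting a moderate acceptance rate. The schedule constants are illustrative, not the settings used in the paper.

```python
import math
import random

def simulated_annealing(f, x0, step=1.0, T=1.0, r_T=0.85,
                        n_temps=100, n_steps=50, seed=0):
    """Minimize f over R^n by simulated annealing with geometric cooling."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    best_x, best_f = list(x), fx
    for _ in range(n_temps):
        accepted = 0
        for _ in range(n_steps):
            # Propose a random move within the current step length.
            cand = [xi + rng.uniform(-step, step) for xi in x]
            fc = f(cand)
            # Accept improvements always; accept uphill moves with
            # probability exp(-(fc - fx) / T) (Metropolis criterion).
            if fc < fx or rng.random() < math.exp(-(fc - fx) / T):
                x, fx = cand, fc
                accepted += 1
                if fx < best_f:
                    best_x, best_f = list(x), fx
        # Adjust the step length so roughly half the moves are accepted.
        ratio = accepted / n_steps
        step *= 1.5 if ratio > 0.6 else (0.7 if ratio < 0.4 else 1.0)
        T *= r_T                      # cool the temperature
    return best_x, best_f

# Toy usage: minimize a simple quadratic.
x_opt, f_opt = simulated_annealing(lambda v: sum(t * t for t in v), [3.0, -2.0])
print(x_opt, f_opt)
```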
2.2.2. Selecting Initial Values of the Model Parameters
We use the SA technique, one of the leading methods for non-derivative-based random search optimization, to estimate the model parameters $A$, $\Sigma$, and $B$ in the sDCTM framework. To ensure that the optimization works effectively, it is important to select appropriate initial values. We describe the steps to select initial values for $A$, $\Sigma$, and $B$ below; a one-dimensional sketch follows after the steps.

Step 1. We select an initial search space within $\mathbb{R}^n$, where $n$ represents the number of variables involved in the estimation (for instance, $n = J \times J$ for $A$ and $\Sigma$, and $n = C \times J$ for $B$). The initial search space is chosen based on our belief of where the estimates are likely to be found. Then, we divide this search space into smaller subspaces.

Step 2. For each parameter, we evaluate the optimization function at the midpoint of each of the subspaces and choose the subspace with the highest (or lowest) value of the function, depending on the optimization problem, as our new search space. In cases where multiple subspaces optimize the function, we consider each of them in the next step for further exploration. In the sDCTM framework, for $B$, we consider the training accuracy as our optimization (maximization) function, as $B$ plays a crucial role in predicting the response variable. For the model parameters $A$ and $\Sigma$, we consider the lower bound on the log likelihood as the optimization (maximization) function.

Step 3. We repeat Steps 1 and 2 until no significant improvement in the optimization function value is seen.

Step 4. If a single subspace is chosen as the search space for selecting the initial values, we choose our starting values from this subspace and estimate the parameters using the SA technique. If multiple subspaces are selected as the initial search space, we choose initial values from each of these subspaces and run SA to obtain the parameter estimates. We choose the final parameter estimates from the search space yielding the highest (or lowest) value of the function, depending on the optimization problem. The sDCTM model is implemented using standalone code in Fortran 77 [
30]. The code for model estimation is posted on
https://github.com/NamithaVionaPais/sDTCM (accessed on 10 June 2024).
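The subspace-refinement idea in Steps 1–4 can be sketched as a coarse-to-fine search. The toy Python example below illustrates it in one dimension; the actual search is over matrix-valued parameters, and the objective would be the training accuracy or the lower bound, as described above.

```python
import math

def refine_search_space(f, lo, hi, n_sub=4, n_rounds=5):
    """Coarse-to-fine search: evaluate f at subspace midpoints and keep
    the best subinterval as the new search space (1-D illustration)."""
    for _ in range(n_rounds):
        width = (hi - lo) / n_sub
        mids = [lo + (i + 0.5) * width for i in range(n_sub)]
        best = max(mids, key=f)            # maximization problem
        lo, hi = best - width / 2, best + width / 2
    return (lo + hi) / 2                   # initial value handed to SA

# Toy usage: locate a good starting point for maximizing a bumpy function.
x0 = refine_search_space(lambda x: -(x - 2.0) ** 2 + math.sin(5 * x), -10.0, 10.0)
print(round(x0, 3))
```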
3. Simulation Study
To evaluate the performance of our model, we conducted a simulation study. We generated a dataset consisting of $M$ documents, where each document is represented by a word time series of length $T$. The vocabulary size was set to $V$, and the documents were divided into $C$ classes. Each word time series was governed by an underlying latent topic structure with $J = 3$ topics. Our model was evaluated using six-fold cross-validation, with the documents in each fold split into train and test sets. We conducted our analysis in Fortran 77. Given the observed word time series from the $M$ documents, we fitted the sDCTM model to estimate the model parameters $A$, $\Sigma$, $\beta$, and $B$. As the model parameters are matrices, we assessed the estimation error using the squared Frobenius norm. The squared Frobenius norm between two matrices, such as a parameter $X$ and its estimate $\hat{X}$, is defined as

$$\|X - \hat{X}\|_F^2 = \sum_{i}\sum_{j}\left(x_{ij} - \hat{x}_{ij}\right)^2.$$
The average squared Frobenius norm (over the six-fold cross-validation) for each of the model parameters is shown in
Table 2. The low values of the squared Frobenius norm on the model parameters suggest that we are able to estimate the matrices well.
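As a quick illustration of this error metric, the snippet below computes the squared Frobenius norm between a hypothetical matrix and a perturbed estimate of it using numpy.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 3))                   # true parameter matrix (stand-in)
X_hat = X + 0.01 * rng.normal(size=(3, 3))    # its estimate (stand-in)

# Squared Frobenius norm of the estimation error.
err = np.linalg.norm(X - X_hat, ord="fro") ** 2
assert np.isclose(err, ((X - X_hat) ** 2).sum())
print(err)
```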
We used the estimated model parameter $B$ and the variational parameters $\phi_{d,t}$ for $t = 1, \dots, T$ and $d = 1, \dots, M$ to predict the response associated with each of the training documents as follows:

$$\hat{y}_d = \arg\max_{c \in \{1, \dots, C\}} b_c^\top \bar{\phi}_d, \qquad \bar{\phi}_d = \frac{1}{T}\sum_{t=1}^{T} \phi_{d,t}, \quad (10)$$

where $b_c$ denotes the $c$th row of $B$, for $d = 1, \dots, M$.
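Under this reading of Equation (10), prediction reduces to averaging the per-word variational topic assignments and picking the class with the largest linear score. A small numpy sketch, with stand-in values for $B$ and $\phi$:

```python
import numpy as np

rng = np.random.default_rng(3)
T, J, C = 57, 3, 2                         # illustrative dimensions

phi = rng.dirichlet(np.ones(J), size=T)    # variational topic assignments, T x J
B = rng.normal(size=(C, J))                # estimated regression coefficients

phi_bar = phi.mean(axis=0)                 # empirical topic frequencies
y_hat = np.argmax(B @ phi_bar)             # predicted class label
print(phi_bar, y_hat)
```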
Table 3 shows the confusion matrix obtained on the train data for one of the cross-validation folds. We were able to achieve a training accuracy of , and the average training accuracy across all six cross-validation datasets was also .
While classifying the test data, we estimated the variational parameters on the test documents, given the estimated model parameters, using variational inference. However, because we assume that we do not know the true labels of the test documents, we replace the terms in the likelihood function associated with the response variable with a pseudo-label. The pseudo-label assigned to a test document is the class label associated with the nearest train document, where we use the Hamming distance to measure the distance between test and train sequences. Given the word time series, the associated pseudo-label, and the model parameters estimated from the train data, we estimated the variational parameters for each test document. We then obtained the test predictions using Equation (10).
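A minimal sketch of this pseudo-labeling step, using hypothetical sequences:

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s1, s2))

def pseudo_label(test_seq, train_seqs, train_labels):
    """Assign the label of the Hamming-nearest training sequence."""
    d = [hamming(test_seq, s) for s in train_seqs]
    return train_labels[d.index(min(d))]

# Toy usage with hypothetical DNA sub-sequences.
train = ["acgt", "ttgc", "aggt"]
labels = [1, 0, 1]
print(pseudo_label("acga", train, labels))   # -> 1 (nearest: "acgt")
```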
Table 4 shows the confusion matrix obtained on the test data for one of the cross-validation folds, with a test accuracy of . The average test accuracy across all six cross-validation datasets was .
The model parameter $\beta$ provides an estimate of how words are distributed across the set of $J$ latent topics. Figure 1 illustrates the proportions of words across each of the three latent topics. We observed that Word 1 was highly probable within Topic 1, Words 3, 4, and 6 were highly probable within Topic 2, and Word 4 was highly probable within Topic 3.
4. Application for Promoter Sequence Identification in E. coli DNA Sub-Sequences
Proteins are one of the most important classes of biological molecules, being the carriers of the message contained in the DNA. An important process involved in the synthesis of proteins is transcription. During transcription, a single-stranded RNA molecule, called messenger RNA, is synthesized (using the complementarity of the bases) from one of the strands of DNA corresponding to a gene (a gene is a segment of DNA that codes for a type of protein). This process begins with the binding of an enzyme called RNA polymerase to a certain location on the DNA molecule. This exact site, which determines which of the two strands of DNA will be transcribed and in which direction, is recognized by the RNA polymerase due to the existence of certain regions of DNA placed near the transcription start site of a gene, called promoters. Because determining the promoter region in the DNA is an important step in the process of detecting genes, the problem of promoter identification is of major importance within the field of bioinformatics.
We considered a dataset of 106
E. coli DNA sub-sequences that is publicly available at
https://archive.ics.uci.edu/dataset/67/molecular+biology+promoter+gene+sequences (accessed on 12 March 2023), among which 53 DNA sub-sequences contain promoters and the remaining 53 DNA sub-sequences do not contain promoters. Each sub-sequence consists of 57 nucleotides, represented by the four nucleotide bases: adenine (
a), guanine (
g), cytosine (
c), and thymine (
t). Given the two sets of DNA sub-sequences, with and without the presence of promoter regions, we used sDCTM to uncover the hidden patterns or motifs that could serve as markers for promoter presence to classify the sub-sequences. A detailed data description is provided in [
31].
We present the results of the sDCTM model applied to the E. coli DNA sub-sequences. We treated each DNA sub-sequence as a document and the presence/absence of promoter regions as the associated class label. Each word in a document is the nucleotide observed at the corresponding position in the sub-sequence. We divided the data into an 80–20 train–test split in order to assess the model performance. We trained the model on the train DNA sub-sequences and set $T = 57$ (the length of each DNA sub-sequence), $V = 4$ (the number of unique nucleotide levels), and $C = 2$ (the number of levels of the response variable). To choose the number of topics $J$, we ran the model for different values of $J$ and chose the value yielding the highest test accuracy, giving $J = 3$ latent topics. The confusion matrices obtained on the train and test data using the sDCTM model are shown in Table 5 and Table 6. We were able to achieve a training accuracy of and a test accuracy of .
The distribution of nucleotides across the three topics identified by the sDCTM model is shown in Figure 2. We observed that nucleotide levels a, c, and g were highly probable within Topic 1, nucleotide level t was highly probable within Topic 2, and nucleotide level a was highly probable within Topic 3. In addition, based on the model predictions, we also constructed a sequence logo plot [32] to identify the diversity of sequences between the two predicted groups. The sequence logo plots based on the train model predictions are shown in Figure 3. We can see that the promoter presence plot demonstrated significantly higher conservation at various positions compared with the promoter absence plot. The higher bit values in the promoter presence plot highlight critical motifs essential for promoter activity. In contrast, the promoter absence plot showed much smaller bit values, suggesting fewer conserved elements.
Comparison with Other Methods
We compared our sDCTM model with existing classification techniques, namely SVM, k-NN, and classification trees [33]. We applied the k-NN classification technique, a popular approach for sequence classification [10], directly to the observed data. We compared our approach with popular machine learning techniques like SVM and classification trees on features extracted from the DNA sequences. Specifically, we used the Haar wavelet transform, the simplest discrete wavelet transform (DWT), to extract features from the observed time series and built classification models based on these features. We implemented the k-NN classification technique using the R package class [34], which identifies the nearest neighbors (in Euclidean distance) to classify the train and test data. We used the R package wavelets [35] to extract the DWT (with Haar filter) coefficients. The classification tree was implemented using the R package party [36], and the SVM was implemented using the R package e1071 [37]. We ran each of these methods using an 80–20 train–test split to assess model performance. The train and test accuracies are shown in Table 7. In comparison with the other classification techniques, sDCTM performed better on both the train and test data.
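For readers working in Python rather than R, a rough analogue of this comparison pipeline can be put together with scikit-learn and PyWavelets. The sketch below uses random stand-in data and default hyperparameters purely for illustration; the paper's results use the R packages cited above.

```python
import numpy as np
import pywt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
# Stand-in data: integer-coded categorical time series and binary labels.
X_raw = rng.integers(0, 4, size=(106, 57)).astype(float)
y = rng.integers(0, 2, size=106)

# Haar DWT features for the SVM and the classification tree.
X_dwt = np.array([np.concatenate(pywt.wavedec(x, "haar")) for x in X_raw])

X_tr, X_te, Xd_tr, Xd_te, y_tr, y_te = train_test_split(
    X_raw, X_dwt, y, test_size=0.2, random_state=0)

for name, model, tr, te in [
    ("k-NN (raw series)", KNeighborsClassifier(), X_tr, X_te),
    ("SVM (DWT features)", SVC(), Xd_tr, Xd_te),
    ("Tree (DWT features)", DecisionTreeClassifier(random_state=0), Xd_tr, Xd_te),
]:
    model.fit(tr, y_tr)
    print(name, model.score(te, y_te))
```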
5. Discussion
In this paper, we introduced the sDCTM model framework, which aims to classify categorical time series by dynamically modeling the underlying latent topics that capture the hidden thematic structure of the time series collection. We demonstrated the applicability of our method to real data by applying it to the promoter identification data to classify E. coli DNA sub-sequences based on promoter presence. Using sDCTM, we estimated the dynamic latent topic structure that serves as a marker for promoter presence in a DNA sub-sequence and classified the sub-sequences. Among the latent topics obtained using sDCTM, we observed that nucleotide levels a, c, and g were highly probable within Topic 1, nucleotide level t was highly probable within Topic 2, and nucleotide level a was highly probable within Topic 3. In addition, the logo plot based on the model predictions showed higher conservation at various positions in DNA sub-sequences with predicted promoter presence in comparison with DNA sub-sequences with predicted promoter absence. We compared our method with the k-NN, SVM, and classification tree methods in a train–test data setup. Our comparative study results indicated that the sDCTM model performed better than the other classification techniques on both the training and test datasets. This indicates that the underlying latent topic structure estimated by the sDCTM model is able to identify promoter presence/absence in a DNA sub-sequence better than the other classification approaches. An extension of the sDCTM model accommodating word time series of varying lengths can be derived as part of future work.