Compression-Complexity Measures for Analysis and Classification of Coronaviruses

Munagala, Naga Venkata Trinath Sai; Amanchi, Prem Kumar; Balasubramanian, Karthi; Panicker, Athira; Nagaraj, Nithin

doi:10.3390/e25010081

by Naga Venkata Trinath Sai Munagala ¹,†, Prem Kumar Amanchi ¹,†, Karthi Balasubramanian ¹,*, Athira Panicker ¹ and Nithin Nagaraj ²

¹ Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Ettimadai 641112, Tamil Nadu, India

² Consciousness Studies Programme, National Institute of Advanced Studies, Bengaluru 560012, Karnataka, India

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Entropy 2023, 25(1), 81; https://doi.org/10.3390/e25010081

Submission received: 5 October 2022 / Revised: 10 December 2022 / Accepted: 18 December 2022 / Published: 31 December 2022

(This article belongs to the Special Issue Data Compression and Complexity)

Abstract:

Finding a vaccine or specific antiviral treatment for a global pandemic of virus diseases (such as the ongoing COVID-19) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable a quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of compression-complexity (Effort-to-Compress or ETC and Lempel-Ziv or LZ complexity) based distance measures for analyzing genomic sequences. The proposed distance measure is used to successfully reproduce the phylogenetic trees for a mammalian dataset consisting of eight species clusters, a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1 coronaviruses, and a set of coronaviruses causing COVID-19 (SARS-CoV-2), and those not causing COVID-19. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic) along with linear discriminant and fine K-Nearest Neighbors classifiers are used for classification. Using a data set comprising 1001 coronavirus sequences (causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved with a sensitivity of 95% and a specificity of 99.8%. This work could be extended further to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.

Keywords:

compression-complexity measures; Effort-to-Compress complexity; Lempel-Ziv complexity; distance measure; machine learning; COVID-19

1. Introduction

Pandemics such as the ongoing COVID-19 (caused by SARS-CoV-2) pandemic that leads to enormous loss of life globally can only be controlled by vaccines or a very effective antiviral treatment. Finding a vaccine or specific antiviral treatment for such a global pandemic of virus diseases requires rapid analysis, annotation, and evaluation of metagenomic libraries to enable quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable since they are computationally intensive and cannot be easily scaled up as the number of sequences increases. Thus, there is a need for fast alignment-free techniques for sequence analysis [1,2]. Information theory and data compression algorithms provide a rich set of mathematical and algorithmic/computational tools to capture essential patterns in data that could be used for matching nucleotide sequences.

Genome sequences are inherently described by character strings and are hence amenable to mathematical and computational techniques for extracting information. Exactly what information is being sought from such character strings depends on the string itself and the domain as well as the kind of application. Some targets of interest for analyzing genome sequences include:

Identifying various genes that constitute the genome.
Identifying the origin of the genome sequence.
Understanding the information content present in the coding and non-coding regions.
Reconstructing the phylogenetic tree to study evolutionary patterns.
Automatic classification and identification of unknown genome sequences.

An important objective is to automate the above tasks so that a large number of sequences can be quickly, robustly, and efficiently analyzed (as one of the steps in the endeavor for finding a vaccine).

A cursory glance at these character strings does not tell us much about how they can be used for these applications. However, a harmonious blending of complexity analysis with the field of information theory provides deep insight in this regard. Application of complexity measures on these information-bearing character strings may reveal many surprising features that generally cannot be discerned by intuition or visual inspection of the data alone.

In this study, we propose compression-complexity based distance measures for the analysis of genomic sequences. To validate the efficacy of this distance measure, we first apply it to the mitochondrial DNA sequences of mammals belonging to eight different species clusters and recreate a phylogenetic tree showing these clusters accurately. Subsequently, the distance measure is applied to a group of Severe Acute Respiratory Syndrome (SARS-CoV-1) coronaviruses and also to a group of SARS-CoV-2 viruses to successfully reconstruct their phylogenetic trees, grouping the viruses correctly.

Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of SARS-CoV-2 viruses using machine learning techniques. The compression-complexity measures extracted from the sequences are passed as features to machine learning algorithms (linear and quadratic SVM, linear discriminant, and fine KNN) for classification.

The paper is organized as follows. In Section 2, a brief overview of genetic sequences and methods of analysis are described. Section 3 deals with the materials used (genome primary sequence data with their details) and the methods proposed in this study. Machine learning methods for classification of SARS-CoV-2 sequences are also introduced. This is followed by results and discussion in Section 4 and the paper concludes in Section 5 with suggestions for future research.

2. Genomic Sequences and Comparison

2.1. Genome and Gene

The total DNA content (RNA for viruses) of an organism is known as the genome, thus representing the entire information coded in a cell, while a gene represents a section of the DNA that codes for RNA or protein. A genome consists of a sequence of multiple genes interspersed with non-coding sequences of nucleic bases [3].

2.2. Genome Sequence Comparison

Genome data classification comes under the broad field of bioinformatics, a multidisciplinary field established for over three decades that encompasses physical and life sciences, computer science, and engineering. Many fundamental problems in the fields of medicine and biology are being tackled using the tools of bioinformatics. The main requirement for accomplishing such tasks is the availability of sequenced genome data. This has been the focus of researchers for the past few decades and efforts have been made by the National Institutes of Health (NIH) to establish GenBank® (http://www.ncbi.nlm.nih.gov/genbank (accessed on 4 October 2022)), a genetic sequence database containing an annotated collection of all publicly available DNA sequences. Ever since its inception in 1982, there has been an exponential rise in the number of sequences in GenBank. This has provided researchers and industry practitioners alike with the resources required to delve into the field of bioinformatics.

Among the various aspects involved in bioinformatics, one key element is sequence comparison or analysis of sequence similarity [4]. This is used in database searching, sequence identification and classification, phylogenetic tree creation, gene annotation and evolutionary modeling. A phylogenetic tree, also called an evolutionary tree, is a tree diagram that shows the evolutionary relationships among different species according to the composition of their genes. Since it is impossible to recreate or simulate past evolutionary events, computational and statistical methods for comparison of nucleotide and protein sequences are used for these kinds of studies [1,5].

There are two kinds of sequence comparison methods:

Alignment-based methods: These involve either shifting or insertion of gaps in sequences for the optimal alignment of two or more sequences [6,7]. The alignment relies on selected scoring systems and gives high accuracy, but these methods are computationally intensive and consume a large amount of memory.
Alignment-free methods: These are computationally less intensive methods that consider the genome sequences as character strings and use distance-based methods involving frequency and distribution of bases [8,9,10,11,12]. Our focus in this paper is on alignment-free methodology, especially on using compression-complexity measures for sequence comparisons.
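To make the alignment-free idea concrete, the following is a minimal sketch of one frequency-based comparison: each sequence is summarized by its relative k-mer frequencies, and two sequences are compared by the Euclidean distance between these profiles. The function names (`kmer_profile`, `kmer_distance`) and the choice k = 3 are illustrative assumptions; this is not one of the measures used in this study.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int = 3) -> dict:
    """Return the relative frequency of every overlapping k-mer in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency profiles of a and b."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    keys = set(pa) | set(pb)
    return sqrt(sum((pa.get(x, 0.0) - pb.get(x, 0.0)) ** 2 for x in keys))


# Toy sequences (hypothetical, for illustration only).
print(kmer_distance("ACGTACGTACGT", "ACGTACGAACGT"))
print(kmer_distance("ACGTACGTACGT", "TTTTGGGGCCCC"))
```

No alignment is computed at any point; the cost is linear in the sequence lengths, which is the property that makes such methods attractive for large collections.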

Sequence comparison and genome data classification got a boost in the early 1990s with the use of data compression algorithms that can identify regularities in sequences [13]. They provided a means to define distances between two sequences that greatly aided in the comparison of sequences. The history behind the usage of data compression algorithms in this field has been elucidated by Otu and Sayood in [13]. We succinctly summarize that history here.

The first attempt at using data compression for phylogenetic tree construction was by Grumbach et al. in [14]. They explored the idea of compressing a sequence S using a sequence Q, where the degree of compression obtained by doing so would be an indicator of the distance between them. Although their definition was not mathematically valid, it set a platform for researchers to explore this area. Varré et al. [15] defined a transformation distance when sequence Q is transformed to sequence S by various mutations like segment-copy, segment-reverse copy, and segment-insertion. Li et al. [16] defined a relative distance measure by using a compression algorithm called GenCompress [17] that is based on approximate repeats in DNA sequences. Using the concept of Kolmogorov complexity, the compression algorithm has been used to propose a distance between sequences S and Q. However, Kolmogorov complexity [18], being an algorithmic measure of information and a theoretical limit, cannot be directly computed but only approximately estimated [19]. Hence it is not an optimum choice as a complexity measure. Even though the idea of relative distance is an efficient one, GenCompress is a complicated algorithm that is computationally intensive. To overcome the above-mentioned difficulties, Otu and Sayood [13] proposed similar but computationally simpler relative distance measures based on the Lempel-Ziv (LZ) [20] complexity measure. Given two sequences S and Q, sequences SQ and QS are formed by concatenation (Q is appended at the end of the sequence S to yield the new sequence SQ). These four sequences (S, Q, SQ and QS) are used to define four distance measures using the LZ complexity measure, as given below:

$$d(S, Q) = \max\{\mathrm{LZ}(SQ) - \mathrm{LZ}(S),\ \mathrm{LZ}(QS) - \mathrm{LZ}(Q)\}.$$

To account for the effect of the length of the sequence, a normalized measure is defined as follows:

$$d(S, Q) = \frac{\max\{\mathrm{LZ}(SQ) - \mathrm{LZ}(S),\ \mathrm{LZ}(QS) - \mathrm{LZ}(Q)\}}{\max\{\mathrm{LZ}(S),\ \mathrm{LZ}(Q)\}}.$$

A third distance metric based on sum distance is defined as follows:

$$d(S, Q) = \mathrm{LZ}(SQ) - \mathrm{LZ}(S) + \mathrm{LZ}(QS) - \mathrm{LZ}(Q).$$

Finally, the normalized version of the sum distance is defined as:

$$d(S, Q) = \frac{\mathrm{LZ}(SQ) - \mathrm{LZ}(S) + \mathrm{LZ}(QS) - \mathrm{LZ}(Q)}{\mathrm{LZ}(SQ)}.$$
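The four measures above can be sketched in a few lines of code. The sketch below assumes one common reading of the 1976 Lempel-Ziv definition (exhaustive-history parsing, where each new phrase is the shortest substring not yet seen in the preceding text); the function names `lz76` and `otu_sayood_distances` and the dictionary keys are illustrative, and the implementation favors clarity over speed.

```python
def lz76(s: str) -> int:
    """Number of phrases in a Lempel-Ziv (1976) style parsing of s."""
    n, i, phrases = len(s), 0, 0
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def otu_sayood_distances(s: str, q: str) -> dict:
    """The four LZ-based distances for two non-empty sequences s and q."""
    c_s, c_q = lz76(s), lz76(q)
    c_sq, c_qs = lz76(s + q), lz76(q + s)
    grow_sq = c_sq - c_s  # extra phrases needed when q is appended to s
    grow_qs = c_qs - c_q  # extra phrases needed when s is appended to q
    d_max = max(grow_sq, grow_qs)
    d_sum = grow_sq + grow_qs
    return {
        "max": d_max,
        "max_normalized": d_max / max(c_s, c_q),
        "sum": d_sum,
        "sum_normalized": d_sum / c_sq,
    }


print(lz76("0001101001000101"))  # 6 phrases: 0|001|10|100|1000|101
# Toy sequences (hypothetical, for illustration only).
print(otu_sayood_distances("ACGTACGTAC", "TTTTGGGGCC"))
```

The intuition is that if Q shares many patterns with S, appending Q to S adds few new phrases, so all four quantities stay small.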

Using these distance measures on mtDNA (mitochondrial DNA) samples of a wide range of eutherians (placental mammals), they have successfully re-created phylogenetic trees showing the evolutionary patterns. Other researchers have used these and slight variants of these measures to identify families of coronaviruses, mammals, vertebrates, and salmon. Interested readers are referred to [21,22,23,24,25,26,27,28] for further details on these. Apart from these complexity-based measures, distance measures using Markov chain models [29,30,31] and measures of probability [32,33,34] have also been proposed for the study of genome identification.

3. Materials and Methods

In this section, we provide details of the datasets as well as the methods used in this study.

3.1. Genome Sequences Used in This Study

Genome data from mammals and various types of coronaviruses, obtained from the GenBank database, are used for the analysis. These are described below.

3.1.1. Mammalian Sequences

The first dataset we considered was mitochondrial genomes (mtDNA) of 41 mammals grouped into 8 species clusters as shown in Table S1 (Supplementary Resource 1) [35].

3.1.2. Coronaviruses (SARS-CoV-1)

For our analysis, we use genome sequences of the following viruses:

15 SARS-CoV-1 coronaviruses
15 Non-SARS-CoV-1 coronaviruses belonging to Groups I, II and III coronaviruses

For complete details of the above sequences with accession numbers, please refer to Table S2 (Supplementary Resource 1).

3.1.3. SARS-CoV-2 (COVID-19-Causing Coronaviruses)

For the third set of analyses, we consider 30 COVID-19-causing coronavirus sequences and 30 coronavirus sequences not causing COVID-19 (consisting of alpha, beta, gamma, and deltacoronaviruses). For complete details of the above sequences with accession numbers, please refer to Table S3 (Supplementary Resource 1).

3.2. Mathematical and Computational Methods Used in This Study

3.2.1. Compression Complexity Measures: Lempel–Ziv (LZ) and Effort-to-Compress (ETC)

For measuring the complexity of the nucleotide sequences, we have used Lempel-Ziv (LZ [20]) and Effort-to-Compress (ETC [36]) complexity measures. Lempel–Ziv complexity (LZ), a popular and widely used complexity measure, estimates the degree of compressibility of an input sequence. Effort-to-Compress, a relatively recent complexity measure, determines the number of steps required by the Non-Sequential Recursive Pair Substitution Algorithm to compress the input sequence to a constant sequence (or a sequence of zero entropy) [37]. It should be noted that both LZ and ETC are complexity measures derived from lossless data compression algorithms (hence we term them compression-complexity measures). Further, ETC consistently performs better than LZ in several applications as shown in recently published literature [38,39,40]. For details on how to compute LZ and ETC on actual input sequences, we refer the readers to Section S1 of supplementary information (Supplementary Resource 2).
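As an illustration of the ETC definition above, the following is a minimal sketch of the pair-substitution procedure: at each step the most frequent pair of adjacent symbols is replaced by a new symbol, and the number of steps needed to reach a constant (or single-symbol) sequence is returned. The tie-breaking rule (first pair encountered) and the counting of non-overlapping occurrences are assumptions of this sketch; the supplementary information remains the reference for the exact procedure used in the study.

```python
from collections import Counter


def _most_frequent_pair(seq):
    """Most frequent adjacent pair, counting non-overlapping occurrences."""
    counts, last_start = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_start.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. "aaa"
        counts[pair] += 1
        last_start[pair] = i
    return max(counts, key=counts.get)  # ties: first pair encountered


def _substitute(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of pair, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def etc(sequence: str) -> int:
    """Effort-to-Compress: substitution steps until the sequence is constant."""
    alphabet = {s: k for k, s in enumerate(sorted(set(sequence)))}
    seq = [alphabet[s] for s in sequence]
    next_symbol, steps = len(alphabet), 0
    while len(seq) > 1 and len(set(seq)) > 1:
        seq = _substitute(seq, _most_frequent_pair(seq), next_symbol)
        next_symbol += 1
        steps += 1
    return steps


print(etc("11010010"))  # 5 steps: 11010010 -> 12202 -> 3202 -> 402 -> 52 -> 6
```

This direct version rescans the sequence at every step, so it is quadratic in the worst case; it is meant to show the mechanics rather than to process full genomes efficiently.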

3.2.2. Distance Measure

We use a distance measure which is computed using a compression-complexity measure (LZ or ETC). Let us say that we have genome sequences of two viruses V₁ and V₂. Firstly, we form new sequences V₁V₂ and V₂V₁ by concatenation (AB is the new sequence obtained by simply concatenating sequence B at the end of sequence A). We then compute the complexity measures ETC(V₁), ETC(V₂), ETC(V₁V₂) and ETC(V₂V₁) (similarly for LZ). In line with what has been used by Otu and Sayood [13], the distance measure is given by the average of the relative distances between the complexity values of the two concatenated sequences V₁V₂ and V₂V₁. Mathematically, they are described as:

$$d_{\mathrm{LZ}}(V_1, V_2) = \frac{(\mathrm{LZ}(V_1 V_2) - \mathrm{LZ}(V_1)) + (\mathrm{LZ}(V_2 V_1) - \mathrm{LZ}(V_2))}{2},$$

$$d_{\mathrm{ETC}}(V_1, V_2) = \frac{(\mathrm{ETC}(V_1 V_2) - \mathrm{ETC}(V_1)) + (\mathrm{ETC}(V_2 V_1) - \mathrm{ETC}(V_2))}{2}. \tag{1}$$

Note that the above distances will always be non-negative and symmetric (d(A, B) ≥ 0, d(A, A) = 0 and d(A, B) = d(B, A)). The triangle inequality is also likely to hold.
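A minimal sketch of how Equation (1) can be turned into a pairwise distance matrix is given below. The complexity function is passed in as a parameter so that either LZ or ETC can be used; for self-containment the demo uses a compact Lempel-Ziv phrase counter. The names `cc_distance`, `distance_matrix` and `lz_phrases` are illustrative, and filling the diagonal with 0 directly (rather than computing it) is an assumption of this sketch.

```python
from typing import Callable, List


def lz_phrases(s: str) -> int:
    """Compact Lempel-Ziv (1976) style phrase count, used as demo complexity."""
    n, i, phrases = len(s), 0, 0
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def cc_distance(v1: str, v2: str,
                complexity: Callable[[str], int] = lz_phrases) -> float:
    """Equation (1): average complexity growth over both concatenations."""
    grow_12 = complexity(v1 + v2) - complexity(v1)
    grow_21 = complexity(v2 + v1) - complexity(v2)
    return (grow_12 + grow_21) / 2


def distance_matrix(seqs: List[str],
                    complexity: Callable[[str], int] = lz_phrases):
    """Symmetric matrix of pairwise distances; diagonal set to 0."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = cc_distance(seqs[i], seqs[j], complexity)
    return matrix


# Toy sequences (hypothetical, for illustration only).
seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGCCCC"]
for row in distance_matrix(seqs):
    print(row)
```

Only the upper triangle is computed because the formula is symmetric in its two arguments; each entry needs four complexity evaluations, two of which (the single-sequence values) could be cached when many sequences are compared.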

3.2.3. Machine Learning Algorithms Used in the Study

Various AI techniques are currently being used for COVID-19 related data analytics [41,42,43,44,45,46]. In this work, machine learning (ML) is used to distinguish SARS-CoV-2 viruses from other coronaviruses. Since training of ML algorithms is data-intensive, we have used a total of 1001 sequences of coronaviruses (all sequences were obtained from GISAID (https://www.gisaid.org) (accessed on 4 October 2022)) belonging to alpha (131 sequences), beta (130 sequences), gamma (130 sequences) and delta (147 sequences) coronaviruses and SARS-CoV-2 (436 sequences).

The ML algorithms used for classification were support vector machines (SVM) with linear (LSVM) and quadratic (QSVM) kernels, a fine K-nearest neighbors classifier (FKNN) using Euclidean distance with five nearest neighbors, and a linear discriminant algorithm (LD). A brief description of these methods can be found in Section S2 of supplementary information (Supplementary Resource 2). The Statistics and Machine Learning Toolbox™ from MATLAB® ver. R2021a was used for the analysis. To evaluate the system, a 10-fold cross-validation scheme was used, so that in each fold 90% of the dataset was used for training and the remaining 10% for testing.
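The evaluation scheme can be illustrated with a small, dependency-free sketch of a five-nearest-neighbor classifier (Euclidean distance) under 10-fold cross-validation. This is not the MATLAB workflow used in the study: the two-dimensional feature vectors below are synthetic stand-ins for (ETC, LZC) values, the folds are shuffled but not stratified, and all function names are hypothetical.

```python
import random
from math import dist


def knn_predict(train_x, train_y, x, k=5):
    """Majority vote (labels 0/1) among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cross_validate(xs, ys, folds=10, k=5, seed=0):
    """Accuracy over `folds` folds; each fold trains on ~90% of the data."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]          # every `folds`-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            correct += knn_predict(train_x, train_y, xs[i], k) == ys[i]
    return correct / len(xs)


# Synthetic, well-separated two-class data (NOT real genome features).
rng = random.Random(1)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
xs += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(100)]
ys = [0] * 100 + [1] * 100

accuracy = cross_validate(xs, ys)
print(f"10-fold cross-validated accuracy: {accuracy:.3f}")
```

Every sample is used for testing exactly once, so the reported accuracy is an average over predictions made on data the classifier did not see during training.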

4. Results and Discussion

For the mammalian mitochondrial genomes, the phylogenetic tree that was obtained with the distance measure using the ETC measure (Equation (1)) is depicted in Figure 1.

The pairwise distance values were fed to MEGA [47] to obtain the phylogenetic tree. The phylogenetic tree corresponding to LZC can be found in Figure S1 (Supplementary Resource 1). Both these phylogenetic trees closely match the tree (in terms of groupings) obtained in [35].
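To show how a matrix of pairwise distances determines a tree, the sketch below applies UPGMA (average-linkage agglomerative clustering) and prints the resulting topology in Newick format. This is a stand-in for illustration only: the tree-building method and settings used inside MEGA are not stated here, branch lengths are omitted, and the labels and distance values are hypothetical.

```python
def upgma(names, dist):
    """Return a Newick string (topology only) built by UPGMA clustering."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (newick, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                     # closest pair of clusters
        newick_a, size_a = clusters.pop(a)
        newick_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[tuple(sorted((a, c)))]
            d_bc = d[tuple(sorted((b, c)))]
            # Size-weighted average distance to the merged cluster.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {key: v for key, v in d.items() if a not in key and b not in key}
        d.update(merged)
        clusters[next_id] = (f"({newick_a},{newick_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Hypothetical labels and distances, for illustration only.
names = ["A", "B", "C", "D"]
dist = [
    [0, 1, 8, 8],
    [1, 0, 8, 8],
    [8, 8, 0, 2],
    [8, 8, 2, 0],
]
tree = upgma(names, dist)
print(tree)  # ((A,B),(C,D));
```

Sequences with small mutual distances are merged first, which is why groups with similar compression-complexity behavior end up in the same subtree.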

Having validated the distance measure on the mammalian dataset, we applied it to SARS-CoV-1 and non-SARS-CoV-1 coronaviruses and SARS-CoV-2 and non-SARS-CoV-2 coronaviruses. The phylogenetic trees thus obtained (using ETC) are shown in Figure 2 and Figure 3.

It can be seen in Figure 2 that each of the Group I, Group II, Group III coronaviruses, and SARS-CoV-1 coronaviruses form distinct groups separate from each other. The same can be seen in the LZC phylogenetic tree as shown in Figure S2 (Supplementary Resource 1). Similarly in Figure 3, the COVID-19 and non-COVID-19 causing coronaviruses form separate distinct groups. Except for GammaCoV 6 coronavirus, within the sequences not causing COVID-19, we find that alphacoronaviruses, betacoronaviruses, gammacoronaviruses and deltacoronaviruses are all clustered in distinct groups. In the corresponding phylogenetic tree generated by the LZC distance measure as shown in Figure S3 (Supplementary Resource 1), COVID-19 and non-COVID-19-causing coronaviruses also fall into separate groups. However, among the sequences not causing COVID-19, there are three exceptions. Different versions of the phylogenetic tree for the COVID-19 and non-COVID-19-causing coronaviruses are shown in Figures S4 and S5 (Supplementary Resource 1).

Classification of SARS-CoV-2 Sequences

In this section, we analyze the performance of various machine learning algorithms for the classification of SARS-CoV-2 (COVID-19) sequences. Table 1, Table 2 and Table 3 summarize the results obtained for the classification of SARS-CoV-2 (COVID-19) vs. non-SARS-CoV-2 coronaviruses using LSVM, QSVM, LD and FKNN algorithms with ETC and LZC values as features.

Fine KNN outperforms the LSVM, QSVM and LD algorithms in terms of F1-score and accuracy. Using ETC or LZC alone as a feature gives detection accuracies in the range of 80–90%, while using a combination of both gives an accuracy of more than 90%. This shows that combining features from multiple compression-complexity measures can significantly boost the classification accuracy for SARS-CoV-2 sequences.
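For reference, the performance figures reported in the tables follow the standard definitions for a binary classifier, which can be sketched as below. The function name `binary_metrics` and the confusion-matrix counts in the usage example are hypothetical and are not taken from the study.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts (positive = SARS-CoV-2)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity measures how many COVID-19-causing sequences are detected, whereas specificity measures how many of the other coronaviruses are correctly rejected; the F1-score is the harmonic mean of precision and sensitivity.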

5. Conclusions

Compression-complexity measures such as LZ and ETC which are based on lossless compression algorithms are good candidates for developing fast alignment-free methods for genome sequence analysis, comparison, and identification. The main reason for this is their ability to characterize and analyze information in biological sequences with contiguous segments. The distance measure based on LZ and ETC showed excellent recreation of phylogenetic trees for mammalian mtDNA sequences, SARS-CoV-1 coronaviruses, and SARS-CoV-2 viruses. Compression-complexity measures also serve as excellent features for the automatic classification of COVID-19-causing coronaviruses using machine learning algorithms. Our results demonstrate that combining ETC and LZC yields very high classification accuracies: 98% for the fine K-Nearest Neighbors classifier. Thus, compression-complexity measures such as ETC and LZC provide an efficient alternative to standard methods for genome analysis and classification of coronaviruses.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/e25010081/s1.

Author Contributions

Conceptualization, K.B. and N.N.; Software, N.V.T.S.M. and P.K.A.; Validation, N.N.; Formal analysis, N.V.T.S.M. and P.K.A.; Investigation, N.V.T.S.M., P.K.A., K.B., A.P. and N.N.; Data curation, A.P.; Writing—original draft, N.V.T.S.M. and P.K.A.; Writing—review & editing, K.B., A.P. and N.N.; Project administration, K.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in the analysis is publicly available and can be downloaded from https://www.ncbi.nlm.nih.gov/ (accessed on 4 October 2022).

Acknowledgments

The authors would like to thank Gayathri R Prabhu (Indian Institute of Technology, Chennai) for helping with some of the simulations. NN would like to thank Pranay S Yadav (National Institute of Advanced Studies, Bengaluru) for the Python implementation of ETC and for useful discussions and suggestions.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

ETC	Effort-to-Compress complexity
LZC	Lempel–Ziv complexity
LSVM	Linear Support Vector Machine
QSVM	Quadratic Support Vector Machine
LD	Linear Discriminant
FKNN	Fine K-Nearest Neighbors

References

1. Lebatteux, D.; Remita, A.M.; Diallo, A.B. Toward an alignment-free method for feature extraction and accurate classification of viral sequences. J. Comput. Biol. 2019, 26, 519–535.
2. Zhao, Y.; Xue, X.; Xie, X. An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison. Comput. Biol. Chem. 2019, 80, 10–15.
3. Lesk, A. Introduction to Genomics; Oxford University Press: Oxford, UK, 2012.
4. Pearson, W.R. An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinform. 2013, 42, 3.1.1–3.1.8.
5. Gupta, M.K.; Niyogi, R.; Misra, M. A framework for alignment-free methods to perform similarity analysis of biological sequence. In Proceedings of the Sixth International Conference on Contemporary Computing (IC3), Noida, India, 8–10 August 2013; pp. 337–342.
6. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
7. Larkin, M.A.; Blackshields, G.; Brown, N.P.; Chenna, R.; McGettigan, P.A.; McWilliam, H.; Valentin, F.; Wallace, I.M.; Wilm, A.; Lopez, R.; et al. Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23, 2947–2948.
8. Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, W.M. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol. 2017, 18, 186.
9. Xia, X. Distance-Based Phylogenetic Methods. In Bioinformatics and the Cell; Springer: Berlin/Heidelberg, Germany, 2018; pp. 343–379.
10. Zielezinski, A.; Girgis, H.Z.; Bernard, G.; Leimeister, C.A.; Tang, K.; Dencker, T.; Lau, A.K.; Röhling, S.; Choi, J.J.; Waterman, M.S.; et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 2019, 20, 144.
11. Monge, R.E.; Crespo, J.L. Analysis of data complexity in human DNA for gene-containing zone prediction. Entropy 2015, 17, 1673–1689.
12. Dehghanzadeh, H.; Ghaderi-Zefrehei, M.; Mirhoseini, S.Z.; Esmaeilkhaniyan, S.; Haruna, I.L.; Najafabadi, H.A. A new DNA sequence entropy-based Kullback–Leibler algorithm for gene clustering. J. Appl. Genet. 2020, 61, 231–238.
13. Otu, H.H.; Sayood, K. A new sequence distance measure for phylogenetic tree construction. Bioinformatics 2003, 19, 2122–2130.
14. Grumbach, S.; Tahi, F. A new challenge for compression algorithms: Genetic sequences. Inf. Process. Manag. 1994, 30, 875–886.
15. Varré, J.; Delahaye, J.P.; Rivals, E. Transformation distances: A family of dissimilarity measures based on movements of segments. Bioinformatics 1999, 15, 194–202.
16. Li, M.; Badger, J.H.; Chen, X.; Kwong, S.; Kearney, P.; Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 2001, 17, 149–154.
17. Chen, X.; Kwong, S.; Li, M. A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, 8–11 April 2000; ACM: New York, NY, USA, 2000; p. 107.
18. Ming, L.; Vitányi, P.M. Kolmogorov complexity and its applications. Algorithms Complex. 2014, 1, 187.
19. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
20. Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
21. Liu, N.; Wang, T.m. A relative similarity measure for the similarity analysis of DNA sequences. Chem. Phys. Lett. 2005, 408, 307–311.
22. Zhang, Y.; Hao, J.; Zhou, C.; Chang, K. Normalized Lempel-Ziv complexity and its application in bio-sequence analysis. J. Math. Chem. 2009, 46, 1203–1212.
23. Li, B.; Li, Y.B.; He, H.B. LZ complexity distance of DNA sequences and its application in phylogenetic tree reconstruction. Genom. Proteomics Bioinform. 2005, 3, 206–212.
24. Liu, L.; Li, D.; Bai, F. A relative Lempel-Ziv complexity: Application to comparing biological sequences. Chem. Phys. Lett. 2012, 530, 107–112.
25. Yu, C.; He, R.L.; Yau, S.S.T. Viral genome phylogeny based on Lempel–Ziv complexity and Hausdorff distance. J. Theor. Biol. 2014, 348, 12–20.
26. Song, Y.J.; Cho, D.H. Classification of various genomic sequences based on distribution of repeated k-word. In Proceedings of the 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju, Republic of Korea, 11–15 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3894–3897.
27. Monge, R.E.; Crespo, J.L. Comparison of complexity measures for DNA sequence analysis. In Proceedings of the International Work Conference on Bio-inspired Intelligence (IWOBI), Liberia, Costa Rica, 16–18 July 2014; pp. 71–75.
28. Sayood, K.; Otu, H.H.; Hinrichs, S.H. System and Method for Sequence Distance Measure for Phylogenetic Tree Construction. US Patent 8,725,419, 13 May 2014.
29. Wu, T.J.; Hsieh, Y.C.; Li, L.A. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics 2001, 57, 441–448.
30. Blaisdell, B.E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl. Acad. Sci. USA 1986, 83, 5155–5159.
31. Bzhalava, Z.; Hultin, E.; Dillner, J. Extension of the viral ecology in humans using viral profile hidden Markov models. PLoS ONE 2018, 13, e0190938.
32. Pham, T.D.; Zuegg, J. A probabilistic measure for alignment-free sequence comparison. Bioinformatics 2004, 20, 3455–3461.
33. Yu, C.; Deng, M.; Yau, S.S.T. DNA sequence comparison by a novel probabilistic method. Inf. Sci. 2011, 181, 1484–1492.
34. Omari, M.; Barrus, T.W.; Sanders, M.; Negron, D. Rapid Genomic Sequence Classification Using Probabilistic Data Structures. US Patent App. 15/977,667, 15 November 2018.
35. Li, Y.; He, L.; He, R.L.; Yau, S.S.T. A novel fast vector method for genetic sequence comparison. Sci. Rep. 2017, 7, 1–11.
36. Nagaraj, N.; Balasubramanian, K.; Dey, S. A new complexity measure for time series analysis and classification. Eur. Phys. J. Spec. Top. 2013, 222, 847–860.
37. Balasubramanian, K.; Nagaraj, N. Aging and cardiovascular complexity: Effect of the length of RR tachograms. PeerJ 2016, 4, e2755.
38. Nagaraj, N.; Balasubramanian, K. Dynamical complexity of short and noisy time series. Eur. Phys. J. Spec. Top. 2017, 226, 2191–2204.
39. Thanaj, M.; Chipperfield, A.J.; Clough, G.F. Complexity-Based Analysis of Microvascular Blood Flow in Human Skin. In Physics of Biological Oscillators: New Insights into Non-Equilibrium and Non-Autonomous Systems; Springer: Cham, Switzerland, 2021; p. 291.
40. Thanaj, M.; Chipperfield, A.J.; Clough, G.F. Multiscale analysis of microvascular blood flow and oxygenation. In Proceedings of the World Congress on Medical Physics and Biomedical Engineering 2018, Prague, Czech Republic, 3–8 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 195–200.
41. Albahri, A.; Hamid, R.A.; Alwan, J.K.; Al-Qays, Z.; Zaidan, A.; Zaidan, B.; Albahri, A.; AlAmoodi, A.; Khlaf, J.M.; Almahdi, E.; et al. Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): A systematic review. J. Med. Syst. 2020, 44, 1–11.
42. Callejon-Leblic, M.A.; Moreno-Luna, R.; Del Cuvillo, A.; Reyes-Tejero, I.M.; Garcia-Villaran, M.A.; Santos-Peña, M.; Maza-Solano, J.M.; Martín-Jimenez, D.I.; Palacios-Garcia, J.M.; Fernandez-Velez, C.; et al. Loss of smell and taste can accurately predict COVID-19 infection: A machine-learning approach. J. Clin. Med. 2021, 10, 570.
43. Arun, S.S.; Iyer, G.N. On the Analysis of COVID19-Novel Corona Viral Disease Pandemic Spread Data Using Machine Learning Techniques. In Proceedings of the 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 13–15 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1222–1227.
44. Anand, R.; Sowmya, V.; Menon, V.; Gopalakrishnan, A.; Soman, K. Modified VGG deep-learning architecture for COVID-19 classification using chest radiography images. Biomed. Biotechnol. Res. J. (BBRJ) 2021, 5, 43.
45. Hari Prakash, S.; Adithya Narayan, K.; Nair, G.S.; Harikumar, S. Perceiving Machine Learning Algorithms to Analyze COVID-19 Radiographs. In Proceedings of International Conference on Recent Trends in Computing; Springer: Singapore, 2022; pp. 293–305.
46. Choudary, M.N.S.; Bommineni, V.B.; Tarun, G.; Reddy, G.P.; Gopakumar, G. Predicting COVID-19 Positive Cases and Analysis on the Relevance of Features using SHAP (SHapley Additive exPlanation). In Proceedings of the 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 4–6 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1892–1896.
47. Kumar, S.; Stecher, G.; Li, M.; Knyaz, C.; Tamura, K. MEGA X: Molecular Evolutionary Genetics Analysis across computing platforms. Mol. Biol. Evol. 2018, 35, 1547–1549.

Figure 1. Phylogenetic tree for mammals generated using the ETC-based distance measure.

Figure 2. Phylogenetic tree generated for coronaviruses (SARS-CoV-1 and non-SARS-CoV-1) using the ETC-based distance measure.

Figure 3. Phylogenetic tree generated for coronaviruses causing COVID-19 (SARS-CoV-2) and those not causing COVID-19 (non-SARS-CoV-2) using the ETC-based distance measure.

Table 1. Machine learning classification of SARS-CoV-2 sequences using LZC as the feature. Accuracy is reported as a percentage; all other metrics are fractions.

ML Methods	Accuracy	Precision	Sensitivity	Specificity	F1-Score
LSVM	89	0.98	0.82	0.98	0.89
QSVM	90	1	0.82	1	0.90
LD	86	1	0.80	1	0.88
FKNN	92	0.96	0.89	0.96	0.92

Table 2. Machine learning classification of SARS-CoV-2 sequences using ETC as the feature. Accuracy is reported as a percentage; all other metrics are fractions.

ML Methods	Accuracy	Precision	Sensitivity	Specificity	F1-Score
LSVM	80	0.74	0.90	0.70	0.81
QSVM	83	0.84	0.74	0.89	0.79
LD	84	0.81	0.85	0.83	0.83
FKNN	88	0.95	0.79	0.96	0.86

Table 3. Machine learning classification of SARS-CoV-2 sequences using both LZC and ETC as features. Accuracy is reported as a percentage; all other metrics are fractions.

ML Methods	Accuracy	Precision	Sensitivity	Specificity	F1-Score
LSVM	92	0.98	0.89	0.97	0.93
QSVM	95	1	0.90	1	0.95
LD	87	1	0.79	1	0.88
FKNN	98	1	0.96	1	0.98
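
As a quick consistency check on Table 3, the F1-scores can be recomputed from the reported precision and sensitivity, assuming the standard definition of the F1-score as the harmonic mean of precision and recall (sensitivity). The sketch below is illustrative only: the names `f1_from_precision_recall`, `recomputed_f1`, and `TABLE_3` are introduced here and are not taken from the authors' code, and the inputs are the rounded values shown in the table rather than the underlying confusion-matrix counts.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, sensitivity, reported F1) copied from Table 3 (rounded values).
TABLE_3 = {
    "LSVM": (0.98, 0.89, 0.93),
    "QSVM": (1.00, 0.90, 0.95),
    "LD": (1.00, 0.79, 0.88),
    "FKNN": (1.00, 0.96, 0.98),
}


def recomputed_f1(table):
    """Recompute F1 for each classifier, rounded to two decimals like the table."""
    return {
        name: round(f1_from_precision_recall(p, r), 2)
        for name, (p, r, _reported) in table.items()
    }


if __name__ == "__main__":
    for name, value in recomputed_f1(TABLE_3).items():
        reported = TABLE_3[name][2]
        print(f"{name}: recomputed F1 = {value:.2f}, reported F1 = {reported:.2f}")
```

With the rounded Table 3 inputs, the recomputed values agree with the reported F1-scores to two decimal places.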

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Munagala, N.V.T.S.; Amanchi, P.K.; Balasubramanian, K.; Panicker, A.; Nagaraj, N. Compression-Complexity Measures for Analysis and Classification of Coronaviruses. Entropy 2023, 25, 81. https://doi.org/10.3390/e25010081
