#### 3.1. Gene Expression and Transcriptomics

Identifying, modeling and characterizing gene–gene relationships and their co-expression patterns is an important challenge in molecular and systems biology, as these dynamic interactions determine cellular phenotypes and represent causality inherent in developmental processes and regulatory pathways. Such associations are typically modeled computationally and statistically ’reconstructed’ or ’reverse engineered’ using Gene-regulatory Networks (GRNs, also referred to as Transcriptional Regulatory Networks) utilizing gene expression and mRNA abundance data obtained using technologies such as microarrays and RNAseq. More recently, large expression data availability from single cell experiments has spurred developments of new computational methods to infer GRNs, model regulatory interactions and gain insights into cell fate decisions and associated transcriptional state changes [

26,

27,

28,

29,

30]. A GRN is modeled as an undirected graph or a network where the genes (and transcription factors) are represented as vertices (or nodes) and are connected by edges representing regulatory interactions between transcription factors and their targets. The gene expression data matrix (experimental or synthetic) typically provides the genes in the rows and samples and experimental conditions in the columns. The goal of the network inference method is to use the expression matrix to infer the set of regulatory interactions (direct or indirect regulation) between any two genes in the GRN, thereby predicting the edges in the network (

Figure 1). Because mutual information (MI) provides the ability to capture non-linear dependencies between two variables, several methods for reconstructing GRNs use MI or associated information theoretic scores to infer the regulatory relationships. As a first step, these methods require the computation of the MI Matrix (MIM), a square matrix in which the element at the

ith row and

jth column is given by the MI between genes

${G}_{i}$ and

${G}_{j}$. Often, this computational step has affordable complexity given that only pairs of MI computations based on bivariate probability distributions are needed to be estimated to complete the MIM. Several information theoretic methods have been developed to reconstruct GRNs over the past two decades [

31,

32,

33]. For the remainder of this section, we will briefly review some of the popular information theory based methods that were discussed in-depth in some prior reviews [

4,

8] and then focus our discussion on more recent advancements in this area in the context of analysis of single-cell gene expression data.

One of the earliest approaches using MIM can be found in Relevance Networks [

34] that simply links a pair of genes by an edge if the MI is larger than a given threshold. The approach has been introduced to infer relationships between RNA expressions and finding clusters of genes that affect cancer susceptibility to anticancer agents [

35]. The Context Likelihood of Relatedness (CLR) algorithm [

36] is an extension of the relevance network approach that calculates the statistical likelihood of a particular transcription factor/gene pair’s MI value within its network context and filters many of the false (noisy) edges in the constructed GRN where one gene can weakly co-vary with many transcription factors simply by chance due to inherent noise in biological systems. It compares the MI between the transcription factor/gene pair to the background distribution of MI values for all possible pairs that include either the transcription factor or its target gene, and retains only interactions that have MI scores significantly above the background distribution. Both Relevance Networks and CLR suffer from a fundamental limitation—genes in indirect relationships separated by one or more intermediate genes may be highly co-regulated, resulting in numerous false positives. To address this, ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [

37], a popular algorithm for reconstructing gene networks, was developed. ARACNE also starts with estimating the MIM between the genes, but it then filters spurious gene–gene associations using a two step process. First, each MI value between a gene pair is filtered using an appropriate threshold computed from the null-hypothesis of independence between the two genes, the null distribution being obtained by randomly shuffling the gene expression values for the given pair of genes. In the second step, ARACNE removes the vast majority of indirect associations using the data processing inequality (DPI) information theoretic assertion that states that if there is an indirect association between two genes

${G}_{i}$ and

${G}_{j}$ through gene

${G}_{k}$, then

$MI({G}_{i},{G}_{j})\le min\{MI({G}_{i},{G}_{k}),MI({G}_{k},{G}_{j})\}$. It means that gene

${G}_{i}$ cannot have more information about gene

${G}_{j}$ than gene

${G}_{k}$ has about gene

${G}_{i}$ and the post-processing cannot increase information. More recent extensions of ARACNE can be found in [

38,

39,

40]. Time-Delay-ARACNE is proposed by Zoppoli et al. for GRN inference from time-course expression data [

38], while higher order DPI is used to improve inference performance in hARACNE [

39]. Improved computational performance is reported by Lachmann et al. in ARACNE-AP [

40] by using an adaptive binning strategy to estimate MI.

In the past few years, with the advancements in single-cell technologies for biological analysis, it has now become possible to quantify single cell transcriptomes in very large numbers. While that has made it possible to provide a finer-grained picture of the complex cellular processes and decipher heterogeneity inherent in gene expression across multiple tissue types, analysis of gene expression data from single cells comes with different data distribution patterns and characteristics (e.g., has a higher rate of zero values) often distinct from their bulk sample counterparts [

8]. These present more complexities for statistical analysis (such as more technical noise and biological variability) [

27]. As a result, GRN reconstruction methods developed for bulk sample expression data may not be suitable for data generated from single cells and may not take full advantage of the cell-to-cell variability to infer statistical relationships that can be used by information theory. Taking into consideration some of the different characteristics of single cell expression data, Chan et al. [

27] developed a multivariate information theory based approach using partial information decomposition (PID) [

41] computed from every triplet of genes in single-cell gene expression datasets. For three genes

${G}_{i}$,

${G}_{j}$,

${G}_{k}$, PID provides a decomposition of the information provided by two of the genes as source variables (say

$S=\{{G}_{i},{G}_{j}\}$) about the third gene

${G}_{k}$ into three components—redundant, unique and synergystic information.

The inference algorithm developed in [

27] (named PIDC or “PID and Context”) uses the Proportional Unique Contribution (PUC) metric defined as the average ratio of unique information between two genes

${G}_{i}$ and

${G}_{j}$ given the presence of a third gene

${G}_{k}$ from the remaining genes

S:

Next, the confidence of an edge is calculated as the sum of the cumulative distribution functions of all PUC scores for each gene in the pair:

where

${F}_{{G}_{i}}(U)$ is the cumulative distribution function of all PUC scores involving gene

${G}_{i}$. Like CLR, estimating confidence of an edge by incorporating distributional properties of the PUC values for a pair of genes improves robustness to noise. It effectively identifies the most important interactions per gene, instead of relying on a simple ranking of the pairwise scores and choosing the best scoring edges across the whole network. Comparing PIDC with ARACNE and CLR, the authors reported marginal improvement in inferring the GRNs using simulated and publicly available data sets, attributing their success to larger sample sizes provided by single-cell data. However, it remains unclear as to how PIDC takes advantage of specific characteristics of single-cell data, such as its zero-rich nature.

Intrinsic noise in transcriptomic and gene expression data, owing to the stochastic nature of biochemical reactions driving the production of mRNAs and proteins within cells, makes GRN inference and reconstruction an extremely challenging problem. In addition to information theoretic methods, statistical methods such as Bayesian networks, tree based [

42] and ordinary differential equation based [

43] methods have been proposed to analyze gene expression data and reconstruct GRNs. An advantage of Bayesian networks and tree based methods is that they can learn directionality of the edges conjecturing if there is a directional influence from a regulatory gene to a target gene; information theoretic methods typically infer undirected edges. A head-to-head comparison of various methods to infer GRNs using single cell expression data has been presented recently in [

8]. The methods were evaluated in their abilities to recover the correct set of GRN edges using ROC curves and Precision-Recall curves against reference networks from experimental assays, as well as in-silico reference networks from simulated data sets with single-cell characteristics. They reported that most of the assessed methods, including information theoretic methods, are not able to predict network structures from single cell expression data accurately and have differences with each other in the sets of identified edges. Also, their performances are inconsistent across data from different cell types and experimental conditions, as previously reported in another comparative study [

44]. Both studies found no single best performing method across multiple types of data and simulation scenarios; rather, ensemble methods that average prediction using multiple approaches seem to work best [

8,

44]. The results from these studies emphasize the necessity to develop more accurate and optimized network modeling methods that are compatible with single cell expression data.

#### 3.2. Alignment-Free Sequence Comparison

In this section, we discuss prominent applications of information theory for finding similarity between two genomic sequences without doing an actual alignment of the two genomes. Comparing two or more genomic sequences to find regions of similarities and dissimilarities and infer a measure of relatedness is vital for the success of basic phylogenetic and metagenomics research. Sequence comparison of genetic material across organisms allows discovery of gene structures and inferring their functional relationships based on the idea that higher similarity between sequences is a driver for higher structural and functional similarity and possible evolutionary relationships. While traditional methods for comparative sequence analysis and phylogeny reconstruction rely on computational algorithms for pairwise and multiple sequence alignments such as dynamic programming, they are often relatively slow in aligning two sequences, taking time proportional to the product of their lengths and are difficult to scale to whole genome comparisons. Furthermore, an underlying assumption supporting sequence alignment algorithms is the principle of co-linearity (homologous sequences have multiple linearly arranged and more or less conserved sequence stretches), which is often violated in biological genomes due to rearrangement events (e.g., in viral genomes) [

45]. These and other challenges, such as a rapid drop in accuracy of sequence alignments when sequence identity falls below a certain critical point, have prompted development of alternative alignment-free methods for comparing large genomic sequences [

46]. Among many different approaches to alignment-free sequence comparison (such as word frequency based methods inspired by linguistic analysis [

47], graphical representation of DNA sequences [

48,

49], chaos game representation [

50], iterated maps [

51], spaced words [

52,

53]), applications of information theory have been key to developing methods utilizing the informational content between sequences to be compared. In the remainder of this section, we briefly review some of the key articles using information theory for alignment-free sequence comparison.

In the feature frequency profiles (FFP) method [

54] developed by Sims et al., the authors proposed to estimate the distances between nucleotide or amino acid sequences using a two step process. In the first step, fixed length words (or k-mers) and their frequencies are extracted from each genomic sequence by sliding a window along the genome. Normalizing each frequency with the sum of all k-mer frequencies generates a probability distribution for the genome. This process leads to the conversion of each genomic sequence into its FFP,

${F}_{k}$, represented by the distribution of each word in the genome. In the second step, pairwise distance between two sequences

i and

j can be calculated using the Jensen–Shannon divergence (JSD) between their respective FFPs,

${F}_{i}$ and

${F}_{j}$. The JSD metric symmetricizes the asymmetric KLD distance metric between two probability distributions. The JSD serves as a measure of distance between the two genomes and can be used to generate phylogenetic trees from multiple genomes. The authors also used information theory to address a key question in their approach—how can one choose an optimal value of

k (the word length). Choosing a large value of

k will run into higher computational complexity and possibly unreliable estimates of k-mer probability distributions. Choosing a smaller value will result in words that are too small, commonly occurring in all genomes and does not have the information to distinguish between the genomes. To address this, the authors computed lower and upper bounds of word size

k. The lower limit is given by the word length that is most frequent in a genome of interest. The upper limit is derived using the

$k-2$ Markov model, which says that one can predict the frequency of a word of length

k using the frequencies of its

$k-1$ and

$k-2$ subwords. So for a given word length

k, the expected FFP,

$\widehat{{F}_{k}}$, can be computed and KLD or relative entropy is then calculated between

${F}_{k}$ and

$\widehat{{F}_{k}}$ for all values of

k from one to infinity and summed to get CRE (cumulative relative frequency). The CRE represents the accuracy of predicting FFPs for all lengths

$\ge k$, given the prior distributions

${F}_{k-1}$ and

${F}_{k-2}$. The value of

k at which the CRE approaches 0 becomes the upper limit of word size for use in genome comparison.

More recently, probabilistic data structures for k-mer counting have been used to enable alignment-free genome comparison [

55,

56]. A critical advantage of these data structures is higher memory efficiency allowing them to be used for comparing many sample sequences at the same time using next generation sequence data. One of these methods, kWIP, hashes all k-mers from a genomic sequence

i into a probabilistic data structure called a Sketch [

56,

57],

${S}_{i}$, that are numeric vectors of fixed size for every sequence. Given a set of many sequences (e.g., from a metagenomics experiment), genetic relatedness between any two sequences

i and

j is then calculated by computing an inner product between the two sketches

${S}_{i}$ and

${S}_{j}$, weighted by their informational entropy across the population set. This procedure down-weights uninformative k-mers (highly abundant or present in very few samples) and produces a kernel matrix,

K, of pairwise inner products that is then normalized using the Euclidean norm

${{K}^{\prime}}_{ij}={K}_{ij}/({K}_{ii}{K}_{jj})$ and converted to distance matrix

${D}_{ij}=\sqrt{{{K}^{\prime}}_{ii}+{{K}^{\prime}}_{jj}-2{{K}^{\prime}}_{ij}}$ containing the genetic distances between every pair of sequences in the population. A schematic illustrating the FFP and kWIP approaches is presented in

Figure 2.

Over the past few years, many alignment-free algorithms have been proposed [

58,

59,

60] to solve previously intractable challenges in phylogenomics, many of which have been compared in [

9,

45] using data from three categories—gene tree inference, genome-based phylogeny and horizontal gene transfer. Assessing the degree of topological disagreement between the inferred and reference trees, the report found that no single method performed best across all the data sets. When compared using eight data sets from the whole-genome phylogeny and horizontal gene transfer categories, FFP was found to be among the top five best performing methods out of 55 variants of 11 alignment-free phylogeny tools tested. These suggest that improved information theoretic distance metrics to assess similarities between genomes need to be developed as they were often more challenged in deciphering finer complex organization levels in the sequences [

45]. With genomics continuing to lead in terms of data growth and availability through production of high-throughput sequencing data, although information theory based methods are observed to be relatively more memory efficient and computationally inexpensive than other methods for smaller data sets, bigger data sets pose more serious challenges, for example, substantial memory overhead when using longer k-mers. Analyses of very large data sets containing beyond tens of thousands of samples will benefit from more optimized implementations of methods such as kWIP that performs pairwise computations, including parallelization across distributed processing systems. Therefore, it remains a challenge in alignment-free phylogeny to develop both novel metrics and efficient implementations that can scale to handle larger data sets.

#### 3.3. Sequencing and Error Correction

Information theory based approaches have notable applications in genome sequencing. DNA sequencing technologies have continuously evolved over the past two decades, leading to a substantial increase in throughput and decrease in cost. Current technologies are mainly based on a two-step approach. First, fragments of the target sequence/genome are read via shotgun sequencing. Then, these fragments, so-called reads, are assembled to reconstruct the original sequence. This reconstruction task can be seen as a problem of assembling a string from its substrings, where three aspects play an important role in the completion of the task: length of the fragments; number of reads per base (coverage depth); and fragment error rates. In the case of DNA sequencing, these aspects are highly dependent on the sequencing platform. Average read length typically ranges from 100 to 10,000 base pairs (bp), although platforms such as PacBio and Oxford Nanopore Technologies MinION can reach maximum read lengths in the order of 60–100 kbp [

61,

62]. The number of reads typically ranges from 100 s to 1000 s of millions. And the error rate can range from 0.1% to 20% [

63,

64].

Given a set of reads, the most crucial challenge for the genome reconstruction task is to determine whether complete reconstruction of the target sequence is possible from the given set of reads. Additionally, other important challenges include finding the minimum fragment length and coverage depth needed for the complete reconstruction with a given reliability, evaluating the impact of the read error rates on the assembly performance and on the requirements in terms of read length and coverage depth, and designing ways to compare different sequencing technologies and assembly methods. These challenges have been recently tackled from an information-theoretic perspective by investigating the fundamental limits of genome reconstruction. From an information theory point of view, a sequencing task can be seen as a decoding task where the genetic information is first encoded in terms of nucleotide bases and then transmitted during sequencing through a noisy channel in form of reads (

Figure 3).

Motahari et al. [

66] investigated the fundamental limits for the read length and the number of reads needed to allow complete sequence reconstruction, assuming error-free reads and DNA sequence modeled as an i.i.d (independent and identically distributed) random process. They calculated the minimum read length required for complete sequence reconstruction based on the Rényi entropy [

67] of order two, which is defined as

${H}_{2}(p)=-{log}_{2}{\sum}_{i}{p}_{i}^{2}$, where

p is the probability distribution on the alphabet {A,C,G,T} and

i iterates over each of these values. More specifically, given read length

L and original sequence length

G, they calculated the normalized read length

$\overline{L}=L/{log}_{2}(G)$ and showed that if

$\overline{L}<2/{H}_{2}(p)$, complete sequence reconstruction is impossible, regardless of the number of reads. If read length is above that threshold, then complete reconstruction requires that enough reads exist to cover every base least once. These findings were later extended to consider scenarios that do not require the i.i.d DNA sequence assumption or reads to be error-free [

68,

69,

70].

Gabrys et al. [

71] studied these questions in the more general context of string reconstruction, motivated by applications in DNA-based data storage systems [

72], discussed later in this section. They investigated the fundamental limits of read overlap and their impact on the read length requirements, and established the minimum Hamming distance between reads to allow proper sequence reconstruction. Marcovich and Yaakobi [

73] followed the work in [

71], by exploring two models for string reconstruction: one in which not all substrings are received, and another in which all substrings are received, but with errors. They also proposed substring constraints based on the Hamming distance between reads.

In another proposed method [

74], Si et al. investigated the closely related haplotype assembly problem, where the goal is to reconstruct haplotype sequences (i.e., string of single nucleotide polymorphisms (SNPs) on a single chromosome in a homologous pair). Using an information-theoretic perspective, they formulated this goal to be the successful recovery of two sources of information being communicated through a channel: the haplotype information and the chromosome membership. They studied the required conditions to allow a reliable reconstruction with and without error-free reads and showed that the requirements in terms of number of reads for the erroneous case is of the same order as in the error-free case. These limits were defined using the Fano’s inequality [

2].

More recently, in a novel approach, Chen et al. [

65] combined sequencing-by-synthesis with an information theory-based error-correction algorithm, as outlined in

Figure 3. They proposed a communication channel with three rounds of fluorogenic DNA sequencing [

75] and use information theory principles to analyze information redundancy. In each round, the sequencer generates a dual-base array, which is combined in a later stage to infer the four-base DNA sequence. For example, one of the rounds generates the so-called

MK dual-base array to determine whether the DNA base is

A or

C (encoded as

M) or

G or

T (encoded as

K) for every position of the sequence; in this case,

K and

M are called degenerate bases. Considering a random DNA sequence, the entropy of a sequence of length

L is 2

L bits, while the entropy of a dual-base array is

L bits, assuming no sequencing error. Therefore, if two different dual-base arrays are used, the original sequence can be fully reconstructed. However, as the actual entropy of each dual-base array is lower than

L, because of sequencing errors, a communication channel based on only two rounds, one for each dual-base coding, would be prone to propagate sequencing errors. Based on the experimental error rate of fluorogenic DNA sequencing, the authors showed that using three dual-base array provides enough information for sequence reconstruction with error correction. Besides the

MK dual-base array, they used the

RY array, where

R means

A or

G and

Y is

C or

T, and the

WS array, where

W means

A or

T and

S is

C or

G. By combining the information encoded by the three degenerate arrays, the decoder can correct an error of one of the rounds. Experimental results in [

65] showed that this strategy reduces the error rate in a 250 bp sequence from 0.96% to 0.33%. In a recent study, Mitchell et al. [

76] benchmarked several computational error-correction methods for sequencing data and reported that Chen et al.’s method [

65] suffered from increased computational cost compared with other methods that limits its scalability to larger data sets.

The same concept of degenerate bases is used in [

77,

78], but with the focus of increasing information capacity for DNA based data storage when storing results for multiples reads from a target sequence. By using different encoding bases (original and degenerate ones) to encode the observed base frequencies for each position of the sequenced reads, the need to store all reads is eliminated. As a finite alphabet of encoding bases, including pure and degenerate bases, the observed base frequency will most likely not match any of the bases in the alphabet. Anavy et al. [

77] solved this problem by using KLD to make the best approximation, while Choi et al. [

78] used a clustering approach. They both used Reed–Solomon coding [

79] as a correction mechanism.

Improvements in error correction for long-read sequencing (LRS) technologies, such as PacBio and MinION, is one of the current challenges in DNA sequencing [

80,

81]. LRS-based approaches have shown advantages in sequencing complex regions, such as those with extreme GC-rich, long tandem repeats, and interspersed repeats, when compared to prevailing short-read approaches [

82,

83]. However, their high error rates in DNA sequencing are still considered a main drawback [

84], and information theoretic methods have the capability to reduce it, as shown in [

65]. Information theory offers the potential to further contribute in this area by defining fundamental limits for the coverage depth of DNA reads, to guarantee full reconstruction, error correction, and minimum redundant information. Using information theory for biology inspired and error-free information storage is another emerging area of research [

72,

85,

86,

87]. It has drawn more attention lately because of the novel use of DNA as an archival medium and holds a lot of promise for future research to satisfy the exponentially increasing demand for information storage.

#### 3.4. Genome-Wide Disease-Gene Association Mapping

Over the last two decades, fueled by studies and reference projects of human genetic diversity and developments in high-throughput sequencing technologies, genome-wide association studies (GWAS) have been the primary tool to investigate and establish the connection between the genetic data and the observed characteristics or phenotypes, such as occurrence of human diseases or disease biomarkers. For the human genome, the most common form of observed genetic variations are bi-allelic single nucleotide polymorphisms (SNPs) spread across the human genome. The data is usually represented as a numeric matrix where the rows are the genomes from multiple individuals and the columns represent the SNPs. Each bi-allelic SNP (say the $i$th SNP in the data) can be modeled as a discrete random variable (say, ${X}_{i}$) taking values from the set $\{0,1,2\}$, representing the number of copies of the minor allele in the genome at that position. Also for every individual sample, there is an observed disease phenotype (say, modeled as random variable Y), which can be discrete in the simplest case (e.g., presence or absence of a disease), or it can be a continuous valued phenotype (e.g., blood pressure measurements or blood cholesterol values).

Many information theory based approaches have been proposed for disease-gene association studies in the last twenty years. Following a similar characterization as the one proposed by Wu et al. [

88], these approaches can be based on: (1) single SNPs; (2) haplotypes; (3) genes; and (4) gene–gene (GXG) interactions. Single SNP based association methods (also known as single-locus tests) are the most common and simplest approaches to identify the disease associated SNPs. Two examples of association tests based on entropy for single SNP-based analysis can be found in [

89,

90]. These methods devise test statistics treating every SNP as an independent random variable and have low computational complexity proportional linearly to the number of SNPs across the genome considered for the study. However, single SNP based methods suffer from low power to identify causal associations, especially when the SNPs involved are spread across multiples genes [

88]. Furthermore, for many such multi-factorial diseases, pathogenic genetic variants usually have a low population of minor allele frequencies (MAFs), making it difficult to identify their effects for relatively small sample sizes [

91]. Haplotype-based approaches look at combinations of marker alleles that are closely linked on the same chromosome and tend to be inherited together as a unit, instead of alleles from individual SNPs. This characteristic makes these approaches more suitable to analyze complex multi-factorial disease phenotypes than the single SNP-based methods [

88]. Entropy-based tests using haplotypes can be found in [

92,

93,

94]. Their results showed that these tests tend to outperform non-information theoretic tests. However, as pointed out in [

88], their main drawback is the computational cost associated with inferring the haplotype phase and frequencies needed for the tests. Closely related to haplotype-based methods are gene-centric approaches that consider genic variants within one gene as testing units and are specially powerful when there is more than one disease variant in a gene. These approaches are specially advantageous when samples come from non-homogenous populations. An entropy-based association test for gene-centric analysis can be found in [

95]. In addition to the association test, they also proposed a penalized entropy measure that is used to cluster genotypes and decrease the degrees of freedom of the association test. Gene–gene interaction studies comprise another prominent direction of investigative strategies, those that involve devising test statistics specifically geared towards studying complex, multi-factorial diseases. Many of these proposed methods, discussed below, use information theoretic metrics designed based on Shannon’s entropy. The goal of these information theory based methods is to develop test statistics or scores that measure gene–gene (GxG) statistical interactions (also known as epistasis, [

96]) between a set of SNPs and the disease phenotype, and also identify statistical interactions between the genotype and environment (GxE) when environmental variables (factors external to the genome) are included in the study design (

Figure 4).

Some of the early information theory based approaches focused on detecting interactions by proposing test statistics based on MI involving two SNPs and the disease phenotype from case-control studies in which a group exhibiting the phenotype/disease is compared to a control group. For example, Fan et al. [

97] computed MI between two genetic variants

${X}_{i}$ and

${X}_{j}$ separately in cases and controls and used their difference

${\mathrm{MI}}_{cases}({X}_{i};{X}_{j})-{\mathrm{MI}}_{controls}({X}_{i};{X}_{j})$ as the test statistic to estimate information gain. The idea is to use the MI between the SNPs in the control group as an approximation of the MI in the general population and identify pairs of SNPs with effects substantially bigger than that among the cases. In another proposed method by Yee et al. [

98], the difference between the entropy of the phenotype,

$H(Y)$, and the conditional entropy of the phenotype given a pair of genetic variants,

$H(Y|{X}_{i},{X}_{j})$, is normalized with regard to the overall entropy of the phenotype to get the test statistic

$[H(Y)-H(Y|{X}_{i},{X}_{j})]/H(Y)$ that normalizes the information gain with regard to the overall observed entropy of the phenotype. This is also referred to as normalized MI that quantifies the proportion of information contained in the interacting SNPS,

${X}_{i}$ and

${X}_{j}$, influencing the phenotype

Y. In another method, Dong et al. [

99] have proposed the ESNP2 (entropy-based SNP–SNP interaction) method integrating two-locus genetic models. Their test statistic is defined as

and uses conditional entropy of the phenotype random variable

$H(Y|.)$ to measure interaction effects of SNPs

${X}_{i}$ and

${X}_{j}$ whenever marginal effects exist. All these methods rely on variations of mutual information (MI) and conditional-MI based two-SNP test metrics and heuristic algorithms of mining the genome looking for two-SNP combinations showing significantly high association scores. A detailed systematic review of these methods, distributional properties of these metrics and their mathematical derivations, and associated algorithms using these information-gain-type quantities for feature selection in the context of genetic analyses can be found in [

100].

While the methods summarized above investigate GXG interactions considering relatively simpler two-way SNP–SNP interactions, evidence suggests that higher-order multivariate GxG interactions can contribute to a number of complex traits [

101]. Addressing this, several methods have been proposed using more intricate information theory based models that consider the contribution of more than two SNPs to the onset of a phenotype [

102,

103,

104,

105,

106,

107,

108,

109,

110,

111,

112,

113]. For such models, information gain through higher order (with more than two random variables) GxG interactions in explaining the phenotype states as more SNPs are added to a model can be defined from a synergy/redundancy point of view [

24,

103,

104,

105,

106,

108] that quantifies the effects of two or more SNPs compared to their individual effects. The synergy can be positive, when the joint effect of multiple SNPs is larger than the sum of the individual single SNP effects, or negative, when the joint effect is smaller than the sum, indicating information redundancy among SNPs. Some of the prominent methods addressing higher order interactions through synergy and redundancy use multivariate generalizations of MI such as

K-way Interaction Information (KWII) to parsimoniously model both GXG and GxE interactions and help gain deeper insights into the underlying disease causation pathways by combining them in a single analytical framework [

114,

115,

116,

117,

118,

119,

120]. All these methods depend on detection of higher order interactions in terms of synergy/redundancy that relies on robust empirical estimations of metrics such as entropy, MI and KWII. A study by Sucheston et al. [

121] systematically evaluated information theoretic metrics such as KWII along with established non-parametric methods, such as Multifactor Dimensionality Reduction [

122] and Restricted Partitioning Method [

123], comparing power and Type I error. They found that information theoretic models have more flexibility and have excellent power to detect GxG interactions under a variety of conditions including genetic heterogeneity that characterize complex diseases.

Building on similar ideas, epistasis networks proposed by Moore et al. [

124] provided a way of using GxG combined with computational network theory. Using the SNPs and their relationships, a network can be created where each node is an SNP and each edge is a KWII between a pair of SNPs in the presence of an observable trait or phenotype, such as bladder cancer susceptibility. An edge is included in the network if the KWII strength is above a threshold that is determined using permutations of the SNP genotypes and phenotype. Once the epistasis network is created, the authors performed network analysis to reveal interesting information. For example, the degree distribution of the vertices in the network was found to closely follow the power law distribution, and hence the network was approximately scale free [

125].

In terms of target phenotypes, most of the work described above considers dichotomous phenotypes for case-control studies. Alternative strategies to identify GxE interactions are the case-only and the family-based designs. Case-only studies [

126] are used to identify interactions using data from only affected individuals as used in [

127,

128]. For example, Kang et al. [

127] used case-only design to devise a multi-SNP test statistic

$2N(1-H(Y))\mathrm{log}(W)$ that is shown to be asymptotically

${\chi}^{2}$ distributed under the null hypothesis of no association with

W – 1 degrees of freedom. For a

k-SNP loci,

$H(Y)$ refers to the entropy of the phenotype

Y obtained using the counts of each unique genotype observed across all the

k SNPs;

W is the total number of genotype combinations observed on these loci. These tests are shown to have higher power when compared to the standard

${\chi}^{2}$ statistical association test. Family-based studies take into account the family relationship of the genomic sequences when comparing them. It considers, for example, that an allele associated with a disease will be transmitted to the affected offspring more often than that expected by chance—the Transmission Disequilibrium Test (TDT) [

129]. Examples of such disease association test metrics for family-based study designs can be found in [

88,

102,

130]. Notably, Zhao et al. [

130] designed a novel TDT statistic using entropy to generalize the original TDT statistic and incorporated non-linearity in its definition. They demonstrated that the entropy-based TDT test is more powerful than the original TDT test. Besides analyzing binary traits, often GWAS needs to detect gene-disease associations for quantitative traits involving non-discrete real values. Information theoretic methods dealing with such quantitative phenotypes and environmental variables can be found in [

116,

117,

131] that use differential or cross-entropy based generalizations of multivariate test statistics. In a recent review, Galas et al. [

132] presented a detailed discussion on the information theoretic formalism for gene association with quantitative phenotypes.

Recently, Tahmasebi et al. used an information theoretic approach to investigate the fundamental limits of GWAS parameters [

133]. Their study proposed an abstract probabilistic model to detect the causal subsequence of length

L for a specific phenotype using a dataset of

N individuals with genomes of length

G and their observed characteristics, where the presence of external environmental factors make the relationship between the causal subsequence and the observable characteristics a stochastic function. With increasing value of the model parameters

N,

G and

L, the authors reported observing a threshold effect at

$(G/N)H(L/G)$ (

H denoting binary entropy) that was then used to formulate the capacity of recovering the causal subsequence using information theory. This idea was further extended in a subsequent article [

134] by comparing mixed and unmixed populations.

With increasing availability of high-density SNP data from next-generation sequencing techniques like genotyping-by-sequencing, methods for identifying GxG and GxE interactions have rapidly evolved to be an essential technology for association studies. While regression based methods and Bayesian statistics [

135,

136] have been the primary workhorses for GWAS, information theory has played a crucial role in advancing the field and has been often combined with statistical approaches in reducing dimensionality and devising novel solutions. Nevertheless, there are some open questions that need to be addressed in future research. Although many definitions of statistical interactions and associated test statistics are proposed using information theory, the field would benefit from a unified definition and interpretation of statistical interactions. Additionally, often the underlying distributions of the test statistics under the null and the alternative hypothesis are unknown and more studies are needed as in [

137], focusing on investigating the asymptotic behaviors of the estimators involved. Computing higher order test statistics like

$KWII$ is computationally expensive as it necessitates entropy computations of all possible subsets of SNP combinations—for example, computing KWII for a set of two SNPs and a disease phenotype entails

${2}^{3}-1$ entropy computations with three random variables. As a result, computing all possible

k-SNP combinations in a GWAS study can quickly become computationally infeasible for larger values of

k. This also increases the number of tests and can reduce power if multiple-testing corrections, such as Bonferroni correction [

138], are applied. As a possible remedy, often single SNP analysis is used as pruning step to reduce the number of SNPs to be analyzed prior to interaction analysis, and test significances are estimated through efficient permutation strategies [

139]. Another challenge is that the information theoretic test statistics and estimators are, in many instances, still to be adapted to address practical challenges in genetic studies such as accounting for missing genotypes, genotyping errors, phenocopies and genetic heterogeneity. Finally, although some studies exist comparing computational methods for epistatic interaction detection (e.g., [

140]), research is still lacking with respect to well-defined studies comparing the powers and performances of several key information theory based as well as other statistical approaches under a variety of experimental conditions. The field will immensely benefit from such systematic studies in the near future.

#### 3.5. Protein Sequence, Structure and Interaction Analysis

Proteins are key macromolecules, which mediate their function by interacting with other molecules, including other proteins, nucleic acids, metabolites, and lipids. Major biological processes, such as immunity, metabolism, signaling, gene expression, and molecular machines are controlled through protein interactions [

141,

142,

143]. Due to the critical nature of protein interactions, a key way of investigating an unknown protein is to determine what interactions occur between proteins. The experimental determination of protein–protein interactions (PPIs), using methods such as yeast-two-hybrid, tandem affinity purification, mass spectrometry, DNA and protein microarrays [

144,

145,

146], is time consuming and relatively expensive, and hence, there is a need for accurate computational methods for determining interactions among proteins. With the availability of genomics and proteomics data, several computational methods have been proposed for determination of protein–protein interaction partners. These examples cover different approaches, including Bayesian methods [

147], graph-based methods [

148], phylogenetics-based methods [

149,

150], and information theoretic approaches. Applying information theory to protein sequence analysis allows researchers to identify these interactions more easily and without the cost associated with experimental methods. For protein structural analysis, information theoretic methods also offer the advantage of being able to capture relationships at the level of protein families, in contrast to other computational techniques such as PPI structural modeling or docking, which typically operate at the level of individual protein complexes. This is because, information theoretic techniques for protein interaction modeling can capture dependency information across multiple sequences to calculate useful metrics, and are thus able to discern interactions at the protein family level.

An information theoretic approach to infer PPIs is based on the concept of coevolution. The key biological concept that underpins this approach is that interacting proteins adapt together due to evolutionary pressures, which gives rise to relationships that can be observed while observing a large set of data that spans this evolutionary history [

151] (such as observing similar gene sequences or gene expression profiles from different cell types or organisms [

152]). These relationships can be identified by applying information theoretic concepts such as mutual information (MI). As previously described, MI measures the degree to which random variables are co-dependent. From a multiple sequence alignment of several related proteins, the amino acid distribution at every site can be thought of as a random variable. Following this, MI can quantify the degree of reduction of uncertainty at one site, based on the knowledge of the distribution at a different site. In a notable initial application of these approaches, Giraud et al. [

153] applied the concept of coevolution to proteins in their study in 1998. Since that time, several attempts have been made to refine this measure for their applicability to proteins. Several studies [

154,

155,

156] have described approaches to model the effect of phylogenetic relationships between the proteins and separated this from the coevolution between pairs of sites within a multiple sequence alignment, and have demonstrated that these techniques are able to improve the applicability of such metrics to identifying PPIs.

Figure 5 shows the role of information theory in a typical coevolution approach.

These methods also have a significant impact in the field of structural biology. A key element to understanding and predicting the function of a protein is its structure. While a structure is largely determined by its sequence, over the years, structures for several proteins have been determined by X-ray crystallography, Nuclear Magnetic Resonance (NMR) methods, and more recently Cryo-Electron Microscopy (CryoEM). As the number of structures solved and deposited in the Protein Data Bank (PDB) keeps increasing, the vast explosion in data has allowed the application of information theoretic approaches to play impactful roles in structural biology [

157].

Since complementarity is a key feature of three-dimensional protein interactions, coevolution would suggest that when a substitution occurs, a compensatory substitution would occur in an interacting partner, that would be captured by the metric regardless of where in the sequence the mutation occurred. Validation of coevolution-based approaches discussed previously has been performed by studying the structures. Specific examples encompass a variety of applications, some of which have been described in a study by Little et al. [

10]. This cross-functional work includes a study of coevolution in the PDZ domain of human protein Erbin, also among related, previously known and annotated functional domains from the PFAM database, as well as across known catalytic sites. Other studies have also demonstrated the immense potential that coevolution offers to structural and systems biology. Specialized approaches for modeling protein coevolution are required since amino acids have several interacting partners, and MI based metrics capture interactions that may be due to indirect coupling [

157]. This additional complexity is resolved in methods such as DCA [

158] or GREMLIN [

159], which are able to highlight direct interactions and remove the noise from indirect interactions.

In a recent study, Cong et al. [

160] have applied these techniques to large-scale data from proteomics studies. This study investigated coevolution between 5.4 million pairs of proteins in

Escherichia coli and between 3.9 million pairs in

Mycobacterium tuberculosis, and reported that in conjunction with structural modeling, they were able to predict PPIs with an accuracy much higher than that found by traditional proteome-wide two-hybrid screens. This study only highlights the potential for information theory methods to go beyond traditional methods and help identify PPIs for unexplored organisms.

Not addressed here, but other equally challenging and interesting uses of information theory to analyze PPIs are addressed in a review article by Mousavian et al. [

5]. In that review, the authors discuss in detail applications of information theory in dealing with PPI networks, describing a variety of applications such as protein complex identification, network complexity analysis, finding subnetwork markers, and other questions addressed by previous studies applying information theory for PPIs. In contrast, for this review, we have highlighted some studies that build on the concept of sequence coevolution and discuss the future scope of these methods in solving unsolved problems. As these methods are further refined and integrated with structural biology, as in Cong et al. [

160], they can be further improved with the goal to map all cellular PPIs, enabling discovery of the “interactome”—the cellular protein interaction network. This represents a challenging task of evolving from the current paradigm of discovering interactions between specific proteins or protein families to mapping at the scale of millions of proteins. Other exciting and emerging applications in the structural domain include recent studies of specificity in these interactions and how they can help guide in-silico assemblies of protein complexes [

157].

#### 3.6. Metabolic Networks and Metabolomics

Metabolic networks are networks that describe the physiological and biochemical properties of a cell. The components are metabolites, chemical reactions that transform the metabolites to each other catalyzed by metabolic enzymes, metabolic pathways, as well as the regulatory interactions that guide these reactions. Unlike other types of systems biology networks, studies of metabolic networks have been limited due to difficulties in generating large amounts of metabolite concentration data and lack of knowledge about the kinetic properties of the involved reactions. The recent rapid evolution of metabolomics based on mass spectrometry and nuclear magnetic resonance has made possible faster and more in-depth studies of metabolic networks. In this section, we will introduce some studies that have applied the concepts of information theory in understanding metabolic networks. An overview illustrating some of the approaches discussed is presented in

Figure 6.

Reconstruction of metabolic networks remains a central topic similar to other types of biological networks. A top-down approach to metabolic network reconstruction is by reverse engineering of metabolome data. Popular methods for inference of regulatory networks, ARACNE [

37] and CLR [

36] have been applied to reconstruct cellular metabolic networks [

161]. Both these methods leverage mutual information (MI). A method based on conditional MI has also been used [

162]. These methods can detect non-linear correlations compared to simpler correlation-based relatedness scores. Saccenti et al. [

163] proposed a “wisdom of crowds” approach that considers the consensus obtained from four different approaches, ARACNE, CLR, PCLRC (Probabilistic Context Likelihood of Relatedness on Correlations) [

164] and Pearson correlation.

Molecular networks built from metabolite profiling can exhibit a large degree of diversity across individuals and this variability reflects the intrinsic diversity observed among the individual metabolic phenotypes. To characterize this diversity, the concept of entropy can be used [

163]. Importantly, both entropy profiles of single metabolites and entropies of one metabolite relative to others that characterize dependencies and correlations between metabolites in a network context need to be considered. In the recent review published by Everett et al. [

165], metabolic entropy is considered as one of the four fundamental approaches to the generation and utilization of metabotype data for metabolic phenotyping in diagnosis and prognosis, another being the metabolic network itself.

An important goal of metabolic network studies is understanding the dynamical behaviors of metabolic networks and the functions generated by them. Despite their topological complexity, metabolic networks avoid complex dynamics and maintain a steady-state behavior. A combination of the Shannon entropy and the word entropy [

166] capable of separating different dynamic regimes in metabolic networks have been used to reveal that this pronounced regularization of dynamics is encoded in the network topology [

167]. Nykter et al. [

168] also studied network structure-dynamics relationships, using Kolmogorov complexity as a measurement of information distance between pairs of network structures and between their associated dynamic state trajectories. The possible dynamics of metabolic networks was studied in Grimbs et al. [

169] by using a kinetic model. The enzyme kinetic parameter space was sampled and the metabolic dynamic states were evaluated statistically. MI was then used as one of the three distinct measures to assess the relative impact of kinetic parameters on the stability and robustness of metabolic networks. Cellular metabolic systems form self-assembled aggregates and the activities of cellular enzymes can also exhibit spontaneous spatial-temporal functional structures. Entropy is a useful concept in the study of these dynamical systems [

11]. In particular, Kolmogorov–Sinai entropy, which can be estimated from a finite number of observations using a family of statistics named Approximate Entropy (ApEn), provides a good measure of the complexity and information for the study of attractors in biochemical systems.

Metabolic networks are often modeled and simulated using Flux Balance Analysis (FBA) in order to study the physiology of the relevant microorganism or cell. FBA can predict metabolic reaction rates, also known as fluxes, without using kinetic parameters, by representing a metabolic network in the form of a set of mass balance equations based on the stoichiometry of each reaction, and computing reaction fluxes to match biomass production rate to a measured growth rate. However, substantial cell-to-cell growth rate fluctuations exist even in well-controlled steady-state conditions. De Martino et al. [

170] introduced a generalization of FBA to the single-cell level based on the maximum entropy principle [

171]. The idea is to look for an as-random-as-possible distribution over fluxes that is matching the experimentally measured average growth rate. This maximum entropy metabolic modeling has been shown to provide a better match to experimentally measured fluxes and it makes a wide range of predictions such as on flux variability, regulation, and correlations.

Finally, metabolic networks transform nutrients into biomass and it is of interest to understand how a cell acquires information on nutrient availability through nutrient sensing and how a metabolic network uses this information. Wagner at al. [

172] proposed a way to relate the nutrient information to the error in cell’s measurement of nutrient concentration in the environment. FBA was then used to show that nutrient sensing inaccuracy is translated logarithmically into reduced cell growth, and for microbes like yeast, cells would need to estimate nutrient concentrations to very high accuracy to ensure optimal growth.