Abstract
Enhancer RNAs (eRNAs) play an important role in transcriptional regulation and serve as key intermediates linking genomic enhancers to their target genes. Although ongoing efforts aim to annotate enhancers in the chicken genome, the current understanding of avian enhancers remains less developed than that of mammalian enhancers. We utilized CAGE-seq data from chicken tissues obtained through the “Genetic Technologies in Poultry” project to predict enhancers in the chicken genome. Preliminary predictions focused on non-coding regions exhibiting bidirectional transcription, which were subsequently validated using explicit Markov models and refined with hidden Markov models. To assess sequence family homogeneity, we developed a method based on Euclidean distances between explicit Markov model matrices. Our analysis revealed that the proportion of enhancer-associated DNA in chicken is similar to that observed in mammals: enhancers encompassed 9.29% of the chicken genome, close to the estimate made by the ChickenGTEx project. A relatively small number of them (12,242 enhancers) were significantly expressed in at least one tissue. Notably, more than half of the enhancer DNA overlapped intronic regions. Additionally, based on the bimodal distribution of enhancer lengths combined with the homogeneity of their Markov models, we identified a class of long enhancer elements which we hypothesize to be absent in mammals.
1. Introduction
Elucidating the functional landscape of farm animal genomes is paramount for deciphering the molecular underpinnings of economically important traits, such as growth performance and disease susceptibility. Substantial progress has been made in comprehensively mapping functional regulatory elements in livestock through initiatives like the Functional Annotation of Animal Genomes (FAANG) consortium [] or the FarmGTEx project [], and particularly in the annotation of enhancers in the chicken genome [,], including the ChickenGTEx project []. Nevertheless, the annotation of regulatory elements in the chicken (Gallus gallus) genome remains incomplete and requires continued effort.
Birds display a high degree of conservation of nuclear landscape elements, such as topologically associating domains (TADs), formed on the basis of interactions between CCCTC-binding factor regions []. These domains play a role in restricting interactions between enhancers and promoters and, thus, in maintaining the specificity of regulatory contacts []. However, a number of studies have shown that birds have shortened TADs compared to mammals, which may reflect both the peculiarities of genome organization and the accelerated evolution of regulatory elements [,].
Cap analysis of gene expression (CAGE) has provided highly accurate determination of transcription start sites (TSSs) and identification of clusters corresponding not only to promoters but also to active enhancers []. Hundreds of regions showing the characteristic signature of enhancer transcripts, namely bidirectional low-level transcription, have been described in the chicken genome. These elements turned out to be specific to certain tissues, including the liver, kidney, brain, and intestine [].
In the present study, we employed CAGE data to predict enhancer elements within the chicken (Gallus gallus) genome. Leveraging a substantial dataset derived from multiple tissues, we successfully identified a comprehensive repertoire of enhancers and quantitatively evaluated the distribution between intergenic and intragenic enhancer elements.
2. Results
2.1. Prediction of Enhancers in Chicken Genome from Bidirectional Promoters
We identified 447,451 bidirectional promoter regions, which were used as an estimate of enhancer loci in the chicken genome. The total length of enhancers was 223,610,494 b.p. or 21.22% of the entire chicken GRCg7b genome. The average enhancer length was about 500 b.p. with a standard deviation of 148 b.p., which gave a coefficient of variation of 30%.
The number of enhancers predicted by the GenTech project data significantly exceeded those predicted by third-party databases. However, almost half of the enhancers from eRNAdb overlapped with our predicted enhancers. A quarter of the predicted enhancer regions from the EnhancerAtlas database overlapped with our predictions (Figure 1, Table 1). The overlapping intervals are detailed in the supporting Table S1.
Figure 1.
Venn diagram of the overlaps between enhancers predicted by us from the GenTech project data (shown in white) and the EnhancerAtlas (shown in grey) and eRNAdb (shown in blue) databases.
Table 1.
Overlaps between enhancers predicted from the GenTech project data (CAGE), the EnhancerAtlas and eRNAdb.
Homogeneity analysis of enhancer sequences predicted on the basis of promoter bidirectionality demonstrated a unimodal distribution of the Euclidean distances of the explicit Markov models of sequences from their average model, characterized by high kurtosis (Figure 2), which suggested high statistical homogeneity. We took this observation as evidence that the preliminary set of enhancers identified using Andersson’s method represented a homogeneous sample suitable for further training of the hidden Markov model.
Figure 2.
(a) Distribution of Euclidean distances of explicit Markov models of enhancer sequences from their average model. (b) Heatmap of the average explicit Markov model of enhancer sequences. Low values are shown in shades of red and orange, average values in shades of yellow, and high values in shades of green.
The heatmap of the explicit Markov model (EMM) shows that guanines (G) were rarely followed by cytosines (C), which was reflected in a low conditional probability (red cell in Figure 2b). In an extended CpG island, one would expect nearly equal conditional probabilities of C followed by G and of G followed by C, due to repeated overlaps of CG and GC dinucleotides. Here, GpC dinucleotides were under-represented, which suggested that the CpG islands in the putative enhancers were frequently orphan, a characteristic feature of enhancer sequences [].
2.2. Refinement and Validation of Enhancer Prediction Using Hidden Markov Models
The hidden Markov model of enhancers built with eHMM demonstrated strong concordance with the predictions based on bidirectional promoters, excluding only 691 regions as non-enhancers and not identifying any novel enhancer regions that lacked overlap with Andersson’s bidirectional model. Notably, eHMM expanded the genomic coordinates of predicted enhancer loci: to the 223 megabases of enhancer DNA identified using bidirectional promoter predictions, eHMM added approximately 43 megabases, representing an increase of nearly 20% (Figure 3, Table 2). In total, eHMM reduced the number of initially predicted enhancers by 45,918 and thus identified 401,533 non-overlapping enhancer regions, with only 46 of them overlapping the eHMM-predicted promoter regions. Furthermore, the cumulative length of enhancer loci predicted by eHMM increased to 265,527,100 b.p., accounting for 25.5% of the genome, compared with the 223,610,494 b.p. predicted by Andersson’s method (see Supplementary Table S2).
Figure 3.
Overlaps between enhancer regions predicted with Andersson’s method and eHMM (in base pairs). Predictions made with Andersson’s method are shown in grey; predictions made with the eHMM tool are shown in white.
Table 2.
Overlaps between enhancer region predictions by Andersson’s method and eHMM (in base pairs), as shown in Figure 3.
As expected, the overlaps between the hidden Markov model predictions and the EnhancerAtlas annotations were significant according to Fisher’s exact test (Table 3).
Table 3.
Fisher’s exact test on overlaps between eHMM predictions and EnhancerAtlas enhancers.
2.3. Annotation of Enhancers and Structure of the Chicken Genome
Genome annotation revealed that gene content in the chicken genome accounts for nearly 65% of the total genomic sequence (671,074,699 b.p. of 1,041,139,641 b.p.), leaving over 35% as intergenic space, which is more than sufficient to accommodate all predicted enhancer regions (265,527,100 b.p. in total). Nevertheless, we observed that the majority of enhancer DNA (62.5%) overlapped with genic regions, including 56.7% that overlapped with intronic regions of all genes (Figure 4a). Notably, a substantial proportion of these overlaps occurred within non-coding regions of protein-coding genes (53.17% of enhancer DNA) and long non-coding RNAs (11.30% of enhancer DNA), while coding sequences accounted for a minor fraction (1.23% of enhancer DNA; Figure 4c, Table 4).
Figure 4.
Quantification of enhancer DNA, genic, and nongenic sequences (in base pairs). (a) Genic regions overlapping with predicted enhancers. Intronic content of the chicken genome sequences is shown in white. Exonic content of the genome is shown in blue. Enhancer content is shown in grey. (b) Nongenic regions overlapping with predicted enhancers. Intergenic genomic content is shown in white. Enhancer content is shown in grey. (c) Overlap of enhancers with non-coding gene regions, coding sequences, and long non-coding RNA genes. Non-coding genomic content is shown in white. Genomic content of lncRNAs is shown in blue. Genomic content of coding sequences (CDS) is shown in red. Enhancer content is shown in grey.
Table 4.
Overlap of enhancers with intronic, intergenic, non-coding exon and coding features, as shown in Figure 4.
Overall, the overlap between coding sequences and enhancers encompassed slightly more than 3 megabases (3,280,798 b.p.), representing only 1.23% of total enhancer DNA but constituting 10% of the approximately 32 megabases of coding sequence. Interestingly, coding sequences overlapped not only with enhancer fragments but also with the adjacent nucleosome landing sites, suggesting that these overlaps are unlikely to be due to erroneous enhancer predictions (Figure 5).
Figure 5.
Example of the genomic location of enhancer elements with flanking nucleosome binding sites, as predicted by the hidden Markov model approach (UCSC Genome Browser). Enhancer elements (E_A) are shown in shades of yellow and nucleosome flanking regions (E_N) are shown in shades of green. The background element is shown in black. The ‘*’ symbol in the name of the RefSeq track is a wildcard.
Furthermore, the distribution of Euclidean distances between the averaged Markov models for enhancer sequences and those for the overlapping coding sequences was comparable to the distribution observed for all coding sequences in the chicken genome. This similarity supports the potential functional relevance of enhancer–CDS overlaps (Figure 6).
Figure 6.
Distribution of Euclidean distances between the average explicit Markov model of enhancers and individual Markov models of enhancer sequences (blue), all coding sequences (orange), and coding sequences overlapping with enhancers (green).
The eHMM model for enhancers and promoters also predicts the presence of nucleosome-binding sites flanking both element types. Using this approach, the eHMM model successfully identified both flanking nucleosome-binding loci for nearly all predicted enhancer regions.
Notably, the lengths of enhancers predicted by the eHMM model exhibited a bimodal distribution (Figure 7). The majority of enhancers (n = 328,340) were shorter than 800 b.p., with a mean length of 553 b.p. A second subset, comprising 73,193 enhancers of 800 b.p. or greater, had an average length of 1143 b.p. Furthermore, both length-based subsets displayed comparable proportions of overlap with long non-coding RNAs, corresponding to 31.30% for the longer enhancers and 31.19% for the shorter enhancers.
Figure 7.
Histogram of lengths of enhancers predicted by eHMM.
2.4. Association of Predicted Enhancers with TADs
Fishman et al., in their Ontogen database, provided maps of chromosomal contacts in nuclei of embryonic chicken fibroblasts and immature erythrocytes []. We observed high similarity between the intervals of chromosomal contacts in fibroblasts and erythrocytes, and a strong association between our eHMM predictions of enhancer regions and the intervals between chromosomal contacts reported in Ontogen.
The vast majority (95.3%) of the genomic content of predicted enhancer sequences was localized within TADs of the chicken genome (Figure 8). These TADs accounted for approximately 93.4% of the genome based on the GRCg7b assembly coordinates. Consequently, the share of enhancer DNA lying outside TADs (4.7%) was about 1.4-fold lower than the share of the whole genome lying outside TADs (6.6%).
Figure 8.
Genomic content of overlaps between predicted enhancers and TADs of fibroblast and erythrocyte tissues, base pairs. Enhancer content is shown in white. Genomic content of the fibroblast TADs is shown in grey. Genomic content of the erythrocyte TADs is shown in blue.
We also noticed a large overlap between the TADs of fibroblast and erythrocyte nuclei. The Jaccard intersection-over-union ratio ($J = |A \cap B| / |A \cup B|$, where $A$ and $B$ are the sets of base pairs covered by fibroblast and erythrocyte TADs, respectively) was 0.780, which is close to 1. This allowed us to neglect the fibroblast- and erythrocyte-specific contacts and use the overlap between these cell types in further analysis (Table 5 and Table 6).
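The Jaccard ratio on genomic intervals can be illustrated with a minimal sketch. It assumes half-open (start, end) intervals on a single chromosome; the function names and the toy coordinates are hypothetical and are not taken from the Ontogen data or from the BEDtools implementation.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def total_length(intervals):
    """Number of base pairs covered by merged intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a, b):
    """Base pairs shared by two merged, sorted interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance the list whose current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def jaccard(a, b):
    """Intersection over union of two interval sets, in base pairs."""
    a, b = merge(a), merge(b)
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 0.0


# Toy TAD coordinates (hypothetical, for illustration only).
fibroblast_tads = [(0, 100), (150, 300)]
erythrocyte_tads = [(50, 120), (150, 280)]
print(jaccard(fibroblast_tads, erythrocyte_tads))
```

In this toy case the two sets share 180 b.p. out of a union of 270 b.p.; a value near 1 would indicate nearly identical TAD coverage.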
Table 5.
Jaccard index on overlaps between fibroblast and erythrocyte TADs.
Table 6.
Overlaps between erythrocyte TADs, fibroblast TADs and predicted enhancers, as shown in Figure 8, base pairs.
The enrichment of enhancer DNA within TADs that were detected in both fibroblast and erythrocyte datasets was statistically significant according to Fisher’s exact test (Table 7), with the p value approaching zero under the one-sided alternative hypothesis that the odds of encountering an enhancer region are greater inside TADs than outside. Notably, the density of enhancers within TADs also varied dramatically (Figure 9).
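The one-sided test described above can be sketched as a hypergeometric tail sum over a 2 × 2 contingency table. This is an illustration under stated assumptions and not the procedure used for Table 7: the function name is hypothetical, the counts are invented toy values (for example, genomic bins rather than base pairs), and exact binomial coefficients are only practical for small counts.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows: inside TADs / outside TADs.
    Columns: enhancer / non-enhancer.
    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the alternative is that the odds of an enhancer are greater
    inside TADs than outside.
    """
    row1 = a + b          # all units inside TADs
    col1 = a + c          # all enhancer units
    n = a + b + c + d     # all units
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Toy counts (hypothetical): 40 enhancer and 60 non-enhancer bins inside
# TADs, 2 enhancer and 18 non-enhancer bins outside TADs.
p_value = fisher_exact_greater(40, 60, 2, 18)
print(p_value)
```

For genome-scale base-pair counts, a library routine working in log space would be preferable to exact integer arithmetic.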
Table 7.
Association of enhancer genomic material with TADs (base pairs).
Figure 9.
Density of enhancers in chromosome 1. Only enhancers whose expression in CAGE experiments was significant are shown. Density of enhancer intervals is shown in grey. Intersections of the Ontogen TADs are shown as blue bars.
2.5. Functional Annotation of Intragenic Enhancers
Analysis of the Gaussian mixture of the TPM-normalized and log2-transformed expression values of the eHMM enhancers yielded an estimated mean background expression of −4.07 log2 TPM with a standard deviation of 1.46 log2 TPM. Thus, enhancers whose normalized and transformed expression exceeded the mean background level by more than two standard deviations ($\mu + 2\sigma = -4.07 + 2 \times 1.46 = -1.15$ log2 TPM) were considered significantly expressed. Filtering out the background expression reduced the total number of enhancers significantly expressed in any tissue to 12,242. Further consideration of only intragenic enhancers limited this number to 3662 intragenic enhancers located within 2566 genes. These genes, whose intragenic enhancers were significantly expressed in any of the six analysed tissues, demonstrated significant enrichment of several KEGG pathways, the most significant being MAPK signalling, calcium signalling, and cytoskeleton in muscle cells (Figure 10).
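The filtering step can be illustrated with a short sketch. It assumes that the threshold is the background mean plus two standard deviations, which is inferred from the reported values (−4.07 + 2 × 1.46 = −1.15) rather than stated explicitly; the function names, enhancer identifiers and read counts are hypothetical.

```python
import math


def log2_tpm(counts, lengths):
    """Length-normalise read counts to TPM and return log2 values.

    Enhancers with zero reads are skipped because log2(0) is undefined.
    """
    rates = {name: counts[name] / lengths[name] for name in counts}
    scale = sum(rates.values())
    return {name: math.log2(rate / scale * 1e6)
            for name, rate in rates.items() if rate > 0}


def expression_threshold(bg_mean=-4.07, bg_sd=1.46, k=2.0):
    """Background mean plus k standard deviations (k = 2 is an assumption)."""
    return bg_mean + k * bg_sd


def significantly_expressed(log_values, threshold):
    """Names of enhancers whose log2 TPM exceeds the threshold."""
    return sorted(name for name, value in log_values.items()
                  if value > threshold)


# Toy data (hypothetical): read counts and lengths in b.p.
counts = {"enh1": 500, "enh2": 0, "enh3": 1, "enh4": 2_000_000}
lengths = {"enh1": 500, "enh2": 600, "enh3": 1000, "enh4": 800}

threshold = expression_threshold()
print(significantly_expressed(log2_tpm(counts, lengths), threshold))
```

In this toy set, the unexpressed enhancer and the enhancer supported by a single read fall below the threshold, while the two well-supported enhancers pass.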
Figure 10.
KEGG pathways enriched with genes whose intragenic enhancers were significantly expressed.
The remaining enriched terms involved brain- and vascular-related pathways such as neuroactive signalling and vascular smooth muscle contraction (Table 8).
Table 8.
Significantly enriched functional terms of the KEGG pathway in genes in which intragenic enhancers were significantly expressed in any tissue.
2.6. Functional Annotation of Tissue Specific Intragenic Enhancers
The number of tissue-specific genes whose intragenic enhancers were expressed above the significance threshold of −1.15 log2 TPM varied between tissues (Figure 11, Table 9). Notably, heart- and liver-specific genes were the most abundant, whereas breast-specific genes were the least represented.
Figure 11.
Number of tissue-specific genes whose intragenic enhancers were significantly expressed.
Table 9.
Number of tissue-specific genes whose intragenic enhancers were significantly expressed, as shown in Figure 11.
Among the tissue-specific genes whose intragenic enhancers were significantly expressed, tissue-relevant functional terms were significantly enriched in the brain, breast, and liver gene sets (Table 10). Brain-specific genes were enriched with terms fundamental to neuronal function: neuroactive signalling, neuron projection, and channel activity. Brain-specific genes were also enriched with cardiomyocyte signatures, which is to be expected since heart and brain tissues share ion transport and cell signalling activities []. Breast-specific genes were enriched with features related to phosphatase activity, which is crucial for muscle development and energy metabolism [].
Table 10.
Significantly enriched functional terms of KEGG Pathway, GO Cellular Component (GO CC) and GO Biological Process (GO BP) databases.
2.7. Validation of Enhancer Predictions with IsoSeq Data
The total genomic content of the large number of eHMM-predicted enhancers amounted to a surprising share of more than one-fourth of the chicken genome. On the one hand, such an amount does not contradict the earlier estimate of approximately one million enhancers in mammals [], whose genomes are proportionally larger. On the other hand, the proportion of non-coding content in avian genomes is known to be roughly half of that in mammals [], which would dramatically reduce any a priori estimate of the number and content of chicken enhancers. Moreover, a comprehensive study within the ChickenGTEx project demonstrated that chicken enhancers make up only 8.86% of the chicken genome []. Such a discrepancy might suggest a high false positive rate of our predictions. To address this issue, we cross-validated the eHMM predictions with data from IsoSeq experiments on Piao chickens.
The eHMM predictions of enhancers were only partially confirmed by expression analysis of the IsoSeq experiment on Piao chickens. In the IsoSeq reads, we detected expression of 194,591 eHMM enhancers, which made up 48.46%, or slightly less than half, of all enhancers predicted by eHMM from the CAGE reads of the experiment on the F2 progeny of Cornish and Russian White parents. One source of the high discrepancy between the CAGE-seq and IsoSeq data could be the low specificity of the initial approach based on bidirectional expression. Its overprediction effect could be further amplified by the high coverage of our CAGE data. The total genomic content of IsoSeq-validated eHMM enhancers was 125,917,100 b.p., which made up 47.42% (also slightly less than half) of the genomic content of all eHMM enhancers and only 12.09% of the whole chicken GRCg7b reference genome. Thus, upon cross-validation with the independent dataset, our estimated number and genomic span of chicken enhancers roughly halved, getting closer to the proportion of 8.86% found in []. Still, our estimate remained quite permissive compared to the results of the ChickenGTEx project. Thus, we performed further filtering of our predictions.
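The proportions quoted in this paragraph follow directly from the reported counts. The short check below takes the eHMM totals from Section 2.2 and, as an assumption, uses the genome length of 1,041,139,641 b.p. given in Section 2.3 as the denominator of the last ratio.

```latex
\begin{align*}
\frac{194{,}591}{401{,}533} &\approx 0.4846
  && \text{(share of eHMM enhancers detected in IsoSeq reads)} \\
\frac{125{,}917{,}100}{265{,}527{,}100} &\approx 0.4742
  && \text{(share of eHMM enhancer DNA that was validated)} \\
\frac{125{,}917{,}100}{1{,}041{,}139{,}641} &\approx 0.1209
  && \text{(validated enhancer DNA as a share of the genome)}
\end{align*}
```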
2.8. Validation of the Distribution of the CpG Islands in the Predicted Enhancers
The analysis of the general CpG distribution revealed a bimodal distribution of the probability of guanine following cytosine, $P(G \mid C)$, in the enhancer sequences. Moreover, the distribution of the proportion of orphan CpG islands was also bimodal (Figure 12).
Figure 12.
(a) Distribution of CpG dinucleotides frequency in predicted enhancers as the conditional probability of C being followed by G. (b) Distribution of proportion of orphan CpG dinucleotides in the predicted enhancers. Cutoffs between fractions are shown in red.
We filtered out the enhancer intervals that belonged either to the fraction with the lower CpG probability or to the fraction with the lower orphan CpG content (the fractions below the cutoffs shown in red in Figure 12). After the CpG filtering, we were left with 147,061 enhancer intervals, which occupied 96,741,600 b.p. in total, or 9.29% of the chicken genome (see Supplementary Table S3). Thus, our estimate of the enhancer content slightly exceeded that of the ChickenGTEx project.
3. Discussion
Bird genomes are generally more compact than those of mammals, a pattern that is exemplified by the chicken genome, which spans approximately 1.2 billion b.p. This is nearly three times smaller than the human genome, which contains about 3.2 billion b.p. with an estimated proportion of enhancer material of 7.9% [], and more than two times smaller than the mouse genome (2.7 billion b.p.), in which the enhancer proportion is even larger than in humans (12.6%) []. A recent comprehensive study by the ChickenGTEx project [] found the proportion of chicken enhancers to be of similar scale (8.86%) []. Our estimate of 9.29% slightly exceeded that ratio. The presence of a subset of enhancers longer than 800 b.p. suggests the potential inclusion of a distinct class of long non-coding RNAs. However, the comparable degree of overlap of both the long and short enhancer fractions with annotated chicken long non-coding RNAs, along with the homogeneity observed in explicit Markov models for sequences from both groups in RefSeq, supports the interpretation that the longer predicted enhancers (>800 b.p.) represent a specific subclass of enhancer-like elements rather than a separate RNA category.
Explicit Markov models of biological sequences have long been established as a component of theoretical foundations [] and are also widely applied in practical research []. However, the likelihood metric for a genomic sequence, calculated as the product of the conditional probabilities of nucleotides, clearly depends on the sequence length, complicating direct comparisons between sequences of different lengths. In this study, we proposed a metric based on the Euclidean distance between matrices of conditional probabilities for each sequence. Although this approach does not provide a direct estimate of sequence likelihood, it facilitates straightforward comparison between sequences of different lengths.
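The length dependence of the likelihood can be made explicit with a short derivation. The notation below ($x_i$ for the nucleotide at position $i$, $n$ for the sequence length) is introduced here for illustration and may differ from the notation of the original formulas.

```latex
\begin{align*}
L(x_1 \dots x_n) &= P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1}), \\
\log L(x_1 \dots x_n) &= \log P(x_1) + \sum_{i=2}^{n} \log P(x_i \mid x_{i-1}).
\end{align*}
```

Every factor is at most 1, so the log-likelihood is a sum of $n - 1$ non-positive terms and decreases roughly linearly with sequence length. The vector of 16 conditional probabilities, in contrast, has the same dimension for any sequence, so distances between such vectors can be compared across sequences of different lengths.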
The eHMM enhancer–promoter model predicts the presence of flanking nucleosome-binding sites for both promoters and enhancers, and was able to successfully identify both flanking loci for the majority of enhancer regions. Although the majority of enhancer DNA identified in this study was located within intronic sequences, the proportion of intronic enhancers observed was consistent with the overall genomic proportion of intronic material in the chicken genome. Notably, intronic enhancers have been linked to tissue-specific gene expression, whereas genes exhibiting ubiquitous expression are thought to be predominantly regulated by intergenic enhancers []. In our previous study, we observed that a substantial proportion of actively expressed genes exhibited ubiquitous expression across all tissues [].
The annotation of certain enhancer RNA loci as long non-coding RNAs (lncRNAs) in RefSeq is not unexpected, as these RNA types are known to be functionally related []. Historically, the primary criterion for defining lncRNAs was simply transcript length, with non-coding RNAs exceeding 200 nucleotides classified as lncRNAs []. More recently, some enhancer RNAs have been explicitly categorized as a subset of lncRNAs [].
The observed overlap between enhancer regions—including their flanking nucleosome landing sites—and coding sequences is also consistent with previous findings. Specifically, nucleosome-binding sites within coding sequences have been reported in yeast [], and the presence of exonic enhancers has been demonstrated in mice and zebrafish through ChIP-seq experiments [].
We observed a strong association between our enhancer predictions and the intervals between chromosomal contacts imputed earlier from embryonic fibroblasts and immature erythrocytes and released in the Ontogen database []. In that work, those contacts were designated as TADs, and we confirmed their similarity between fibroblasts and erythrocytes.
The observation of contacts retained in erythrocytes, even immature ones, is interesting. On the one hand, even immature erythrocytes tend to progressively condense their chromatin, which in turn progressively loses its TADs in the course of erythropoiesis [,]. According to this notion, only very young erythrocytes should bear TADs similar to those of other cells whose chromatin does not undergo condensation, such as embryonic fibroblasts.
It should be noted that the immature erythrocytes in the Ontogen paper were taken from chickens with chemically induced anemia. That, in turn, suggests rapid erythropoiesis and an early age of the erythrocytes sampled for the study. It might be speculated that the erythrocytes did not have enough time to undergo sufficient chromatin condensation and the subsequent loss of TADs. On the other hand, the chromosomal contacts in condensed chromatin have been shown to follow the TADs in active chromosomes [].
In this study, we performed a purely computational prediction of enhancers in the chicken genome based solely on CAGE data. Studies that combine both transcriptomic and chromatin accessibility experiments on the same samples would be highly beneficial to the validity and precision of enhancer annotation.
Recent research on chicken genomics and transcriptomics has extensively covered the economic impact of novel findings in the field, involving increased productivity, improved market value, disease resistance and livestock development [,,,]. Improvement of annotation of genomic enhancers would add more clarity to understanding of regulation of target genes and phenotype diversity.
4. Materials and Methods
4.1. Public CAGE-Seq and IsoSeq Data
We used a publicly available dataset of CAGE-seq BGI-SEQ reads (http://chicken.biouml.org/downloads/ChickenResearch2023/CageSeq/raw_data, accessed on 9 November 2025) from the project “Genetic Technologies in Poultry Farming” (GenTech, https://chicken.biouml.org). The project involved CAGE-seq experiments conducted on 12 fast-growing and 12 slow-growing F2 chickens from a cross between the Russian White and Cornish breeds. Samples from six tissues were used in the project: brain, breast, heart, kidney, legs, and liver, taken at the age of 9 weeks. To validate the enhancer RNA predictions, we used the public dataset from an IsoSeq experiment on a Piao chicken from NCBI SRA (ID SRR24293230).
4.2. Reference Genome and Gene Annotation
We used the chicken GRCg7b RefSeq reference genome assembly (RefSeq ID GCF_016699485.2) and the corresponding version of the chicken genome annotation (NCBI Gallus gallus Annotation Release 106). For interval arithmetic, we used the BEDtools [] package, and for visualizing the overlaps as Venn diagrams, we used the eulerr [] package.
4.3. Prediction of Enhancers in Chicken Genome from CAGE Data
To make initial predictions of genomic enhancers from CAGE experiments and thus provide a learning set for further HMM predictions, we mapped the CAGE-seq reads, calculated, aggregated and normalized the expression of the CAGE tag start sites (CTSS) in the chicken genome from BGI-SEQ reads similarly to our previous work [].
We aligned the BGI-SEQ reads to the GRCg7b reference genome using the STAR package v.2.7.11b [] with default parameters, but counting signal only from the 5′ end of the first read (read1_5p option).
The start positions of mapped CAGE reads were aggregated into CAGE tag start sites (CTSS) following a procedure analogous to that used in the FANTOM5 project []. Initially, mapped reads were filtered using SAMtools, version 1.22.1, and subsequent conversion and aggregation were performed with BEDtools, version 2.31.1. The resulting CTSSs, initially in BED format, were converted into the native CAGEr format, which incorporates chromosome coordinates, tag start positions, strand information, and the corresponding read counts per tag. These CTSS datasets were then imported into the CAGEr package v 2.8.0 [].
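The aggregation of read starts into CTSSs can be sketched as follows. This is a simplified illustration and not the SAMtools/BEDtools pipeline used in the study: it assumes reads given as BED-like records (0-based, half-open coordinates) and omits read filtering; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def ctss_counts(reads):
    """Count CAGE tag start sites from (chrom, start, end, strand) records.

    Coordinates are assumed to be 0-based and half-open, as in BED.
    The 5' end of a plus-strand read is its start; the 5' end of a
    minus-strand read is its last base (end - 1).
    """
    counts = Counter()
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, five_prime, strand)] += 1
    return counts


# Toy reads (hypothetical coordinates).
reads = [
    ("chr1", 100, 150, "+"),
    ("chr1", 100, 140, "+"),
    ("chr1", 60, 101, "-"),
]
for (chrom, pos, strand), n in sorted(ctss_counts(reads).items()):
    print(chrom, pos, strand, n)
```

Each resulting record carries the chromosome, the tag start position, the strand and the read count, which are the fields of the CTSS table described above.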
CTSS expression data were normalized using a power law normalization approach as described by []. This method, implemented in the CAGEr package, leverages the power law distribution characteristic of CTSS expression values and relies on two primary parameters: the slope of the log-log regression line fitted to the CTSS expression value distribution and the X-axis intercept of this regression, which defines the referent number of CTSSs. Two distinct normalization parameter sets were applied in this study: robust and permissive. The robust parameters, adopted from the CAGEr vignette, employed a slope of −1.2 and a referent CTSS count of 50,000. For the permissive parameter set, values were empirically derived from the log-log distribution of our CTSS expression data, with the X-axis intercept of the regression line at 1.2 × 10⁷ serving as the referent CTSS number, while retaining the slope of −1.2. The robust normalization parameters were used for analyses involving ubiquitously active promoters and promoter shifts between tissues, whereas the permissive parameters were applied to the analysis of promoter shifts between slow- and fast-growing chickens.
We performed the initial enhancer prediction from CAGE data by extracting bidirectional promoters distant from known gene loci through clustering of CTSS expression [], as implemented in the CAGEr package as an interface to the CAGEfightR function quickEnhancers(). Thus, we clustered the CTSSs bidirectionally using a window length of 201 b.p.; the balance of expression from both strands was calculated using the Bhattacharyya coefficient, and its threshold was set to 0.95. The expression of each enhancer was quantified as the sum of the expression values of the CTSSs it contained.
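The balance criterion can be illustrated with a simplified sketch. It assumes that an ideally balanced bidirectional site has half of its signal on the plus strand downstream of the midpoint and half on the minus strand upstream of it, and that the Bhattacharyya coefficient is computed between this ideal and the observed distribution. The function names and counts are hypothetical, and this is not the CAGEfightR code.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete distributions."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def balance_score(plus_down, minus_up, plus_up=0.0, minus_down=0.0):
    """Similarity of observed strand signal to an ideal bidirectional site.

    plus_down / minus_up: signal in the expected divergent orientation.
    plus_up / minus_down: signal in the unexpected orientation.
    Returns a value in [0, 1]; 1 means perfectly balanced.
    """
    total = plus_down + minus_up + plus_up + minus_down
    if total == 0:
        return 0.0
    observed = [x / total for x in (plus_down, minus_up, plus_up, minus_down)]
    ideal = [0.5, 0.5, 0.0, 0.0]
    return bhattacharyya(observed, ideal)


THRESHOLD = 0.95  # balance threshold reported in the text

for plus, minus in [(10, 10), (12, 8), (18, 2)]:
    score = balance_score(plus, minus)
    print(plus, minus, round(score, 3), score >= THRESHOLD)
```

With these toy counts, a 12:8 split still passes the 0.95 threshold, whereas an 18:2 split does not, which shows how strict the criterion is for strand imbalance.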
We validated the predicted enhancers by overlapping their genomic intervals with enhancer intervals annotated for the chicken genome in the eRNAdb [] and EnhancerAtlas [] databases. The statistical significance of the overlaps was estimated with Fisher’s exact test implemented in BEDtools.
4.4. Topologically Associating Domains
We assigned the predicted enhancers to topologically associating domains (TADs) in the chicken genome. The reference set of TADs was taken from the Ontogen database []. The coordinates of the genomic TAD intervals were transformed from the original galGal5 assembly to the GRCg7b assembly using the liftOver package, version 469 []. To test the significance of overlaps between genomic intervals, we used the hypergeometric test implemented in the SciPy module for Python 3 [].
4.5. Homogeneity of Enhancer Sequences
To assess the sequence homogeneity of enhancers predicted from bidirectional promoters, we used the Euclidean distance metric between first-order explicit Markov models.
We calculated first-order Markov models for genomic sequences by analogy with []. For a given sequence, we calculate the occurrence counts of dinucleotides with a shift of 1 nucleotide. We divide each dinucleotide count by the count of the first nucleotide of the pair in this sequence. Thus, we obtain a vector of 16 conditional probabilities $P(x_i \mid x_{i-1})$, where $x_i$ is the nucleotide at the i-th position, and $x_{i-1}$ is the previous nucleotide. This vector, also representable as a 4 × 4 matrix, is an explicit first-order Markov model for this sequence. Then, for N sequences, one can calculate the averaged explicit first-order Markov model as a vector of arithmetic means of conditional probabilities of 16 dinucleotides:
$$\bar{P}(b \mid a) = \frac{1}{N} \sum_{k=1}^{N} P_k(b \mid a),$$
where $a, b \in \{A, C, G, T\}$ and $P_k$ is the model of the k-th sequence. As a result, for each of the sequences, one can calculate the deviation from the average model as the Euclidean distance along 16 coordinates representing dinucleotide frequencies:
$$d_k = \sqrt{\sum_{a, b} \left( P_k(b \mid a) - \bar{P}(b \mid a) \right)^2}.$$
The unimodality of the distribution of the statistic $d_k$ was taken as evidence of the homogeneity of the Markov models of predicted enhancers.
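The procedure above can be sketched in a few lines. The function names are hypothetical. As an assumption, the count of the first nucleotide is taken over dinucleotide start positions, so that the four probabilities conditioned on the same nucleotide sum to 1, and dinucleotides containing ambiguous bases are skipped.

```python
import math
from itertools import product

ALPHABET = "ACGT"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA, AC, ..., TT


def markov_model(seq):
    """Vector of 16 conditional probabilities P(b | a), in PAIRS order."""
    seq = seq.upper()
    pair_counts = dict.fromkeys(PAIRS, 0)
    first_counts = dict.fromkeys(ALPHABET, 0)
    for a, b in zip(seq, seq[1:]):
        if a in first_counts and b in first_counts:
            pair_counts[a + b] += 1
            first_counts[a] += 1
    return [pair_counts[p] / first_counts[p[0]] if first_counts[p[0]] else 0.0
            for p in PAIRS]


def average_model(models):
    """Element-wise arithmetic mean of several 16-element models."""
    return [sum(column) / len(models) for column in zip(*models)]


def euclidean(u, v):
    """Euclidean distance between two models."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Toy sequences (hypothetical) of different lengths.
sequences = ["ACGTACGTAC", "ACGCGCGT", "TTTTAAAACG"]
models = [markov_model(s) for s in sequences]
mean_model = average_model(models)
distances = [euclidean(m, mean_model) for m in models]
print([round(d, 3) for d in distances])
```

A histogram of such distances over all predicted enhancers is the quantity whose unimodality is examined; the sketch does not reproduce the published values.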
4.6. Hidden Markov Model of Enhancers
To refine our prediction of enhancers, we used the hidden Markov model approach implemented in the eHMM package (https://github.com/tobiaszehnder/ehmm, last accessed on 9 November 2025) []. To train the model, we applied its learnModel module to the genomic intervals of enhancers predicted by CAGEr. In addition to the enhancers, learnModel was provided with the promoter regions predicted by CAGEr in our previous work [], so that the resulting HMM could distinguish between promoters and enhancers. We also used the CAGE read mappings against the chicken reference genome, which were classified as accessible chromatin regions (ACCs) in eHMM. As a result, eHMM built intermediate models of promoters and enhancers, as well as matrices of read counts mapping to promoter and enhancer regions. The resulting intermediate data were then used to build the hidden Markov model.
To construct the Hidden Markov Model (HMM), we employed the constructModel module, utilizing intermediate enhancer and promoter models along with count matrices derived during the training phase. Additionally, we explicitly defined state vectors corresponding to accessible chromatin and nucleosome states, each comprising three distinct states. The background model for accessible chromatin was implemented using the default eHMM configuration.
The constructed model was applied to the chicken genome assembly GRCg7b using the applyModel module. The training data consisted of enhancer intervals predicted by Andersson’s bidirectional method, promoter intervals predicted previously [], the matrices of CAGE read counts mapped onto these intervals, and the intermediate promoter and enhancer models obtained from the learnModel stage of the eHMM pipeline. We also used three Markov states each for accessible chromatin and nucleosome flanking regions.
We estimated the discrimination between promoters and enhancers by overlapping the resulting promoter and enhancer regions using BEDtools and counting the number of predicted enhancers which overlapped the predicted promoters.
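The overlap count itself was obtained with BEDtools. Purely as an illustration of the logic, the following Python sketch counts enhancers that share at least one base pair with any promoter. The function name and the toy intervals are hypothetical, and BED-style half-open coordinates are assumed.

```python
from collections import defaultdict


def count_overlapping(enhancers, promoters):
    """Count enhancers that overlap at least one promoter by >= 1 bp.

    Intervals are (chrom, start, end) tuples in BED-style half-open
    coordinates, so intervals that merely touch do not overlap.
    """
    promoters_by_chrom = defaultdict(list)
    for chrom, start, end in promoters:
        promoters_by_chrom[chrom].append((start, end))

    n_overlapping = 0
    for chrom, start, end in enhancers:
        candidates = promoters_by_chrom.get(chrom, [])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(p_start < end and start < p_end for p_start, p_end in candidates):
            n_overlapping += 1
    return n_overlapping


# Hypothetical toy intervals, for illustration only.
enhancers = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
promoters = [("chr1", 150, 250), ("chr2", 200, 300)]
n = count_overlapping(enhancers, promoters)
```

The pairwise scan is quadratic per chromosome, which is acceptable for a sketch; a genome-scale run would rely on a sorted sweep or on BEDtools itself.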
4.7. Background and Signal Expression of Predicted Enhancers
We measured the expression of the eHMM enhancers by counting the reads mapped onto their intervals with the featureCounts tool of the Subread package, version 2.1.1 []. We estimated the averaged expression of enhancers across all samples of all six tissues by summing the enhancer counts over all samples and then normalizing the sums using the transcripts per million (TPM) approach. The resulting vector of averaged expression values was then used to estimate the TPM threshold above which enhancer expression was significantly higher than the level of background transcription. To estimate the average level and variability of the background transcription, we used a two-component Gaussian mixture model of the log-transformed TPM values of the averaged enhancer expression, implemented in the mixtools R package, version 2.0.0.1 []. According to this approach, the density of the distribution of the vector of log-transformed expression values, $x$, was represented as
$$f(x) = \lambda_B \, \varphi(x; \mu_B, \sigma_B) + \lambda_S \, \varphi(x; \mu_S, \sigma_S),$$
where B was the background transcription, S was the transcription signal, $\lambda_B$ and $\lambda_S$ were the mixing proportions of the two components ($\lambda_B + \lambda_S = 1$), and $\varphi(x; \mu, \sigma)$ was the density of a normal distribution of the random variable $x$ with mean $\mu$ and standard deviation $\sigma$. Then, $\hat{\mu}_B$, $\hat{\mu}_S$, $\hat{\sigma}_B$, $\hat{\sigma}_S$ were the estimates of the mean levels and the standard deviations of the background transcription and the transcription signal, respectively.
We estimated the means and the standard deviations of the two components of the distribution of transformed expression values using the expectation–maximization (EM) approach implemented in the normalmixEM function of the package. The component of the mixture with the smaller estimated mean was taken as the estimate of the distribution of the background transcription. To identify the expression values that were significantly higher than the background, we defined the cutoff between the upper level of background expression and the lower level of signal expression, $c$, as
$$c = \hat{\mu}_B + 2\hat{\sigma}_B,$$
which corresponded to a one-tailed p-value of ≈0.022.
Exponentiation of the resulting value gave the TPM threshold for non-background (signal) expression of enhancers ($T_S$):
$$T_S = b^{\,c},$$
where $b$ is the base of the logarithm used for the transformation.
Later, we selected enhancers whose averaged expression exceeded the threshold in all tissues (tissue agnostic enhancers), as well as in each of the six tissues under study (tissue characteristic enhancers).
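The thresholding procedure can be sketched in Python as follows. This is not the mixtools implementation used in the study but a plain EM loop written for illustration. The function names and the simulated data are hypothetical, the cutoff is assumed to be the background mean plus two standard deviations (the upper normal tail beyond two standard deviations is about 0.0228), and a natural-log transformation is assumed for the back-conversion.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_two_gaussians(x, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture with a basic EM loop."""
    xs = sorted(x)
    half = len(xs) // 2
    # Initialise the means from the lower and upper halves of the data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    overall_mean = sum(xs) / len(xs)
    overall_sd = math.sqrt(sum((v - overall_mean) ** 2 for v in xs) / len(xs))
    sigma = [overall_sd, overall_sd]
    lam = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for every value.
        resp = []
        for v in x:
            w = [lam[k] * normal_pdf(v, mu[k], sigma[k]) for k in range(2)]
            total = max(w[0] + w[1], 1e-300)  # guard against underflow
            resp.append((w[0] / total, w[1] / total))
        # M-step: update mixing proportions, means and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            lam[k] = nk / len(x)
            mu[k] = sum(r[k] * v for r, v in zip(resp, x)) / nk
            sigma[k] = math.sqrt(sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, x)) / nk)
    return lam, mu, sigma


# Simulated log-TPM values (hypothetical): 500 background and 200 signal enhancers.
random.seed(0)
log_tpm = [random.gauss(0.0, 1.0) for _ in range(500)]
log_tpm += [random.gauss(6.0, 1.0) for _ in range(200)]

lam, mu, sigma = fit_two_gaussians(log_tpm)
b = 0 if mu[0] <= mu[1] else 1        # background = component with the smaller mean
cutoff = mu[b] + 2 * sigma[b]         # assumed cutoff on the log scale
tpm_threshold = math.exp(cutoff)      # assumes natural-log transformed TPM values
```

Enhancers whose TPM exceeds `tpm_threshold` would then be treated as expressed above background; with a different logarithm base, the last line would use that base instead of `math.exp`.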
4.8. Expression of Enhancers in Tissues and Functional Analysis
To estimate the TPM threshold of signal expression in the eHMM enhancers, we summed the read counts of each enhancer across all samples and then TPM-normalized the resulting vector of counts.
The per-sample read counts of the eHMM enhancers were also aggregated by tissue and then TPM-normalized. We then applied the TPM threshold estimated earlier (Section 4.7) to select enhancers whose expression was significantly higher than the background level in each tissue, which resulted in six sets of tissue characteristic enhancers.
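A minimal Python sketch of the aggregation and normalization steps is given below. It follows the standard TPM definition (counts divided by feature length, rescaled to sum to one million); the sample names, counts, enhancer lengths, and threshold value are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def tpm(counts, lengths):
    """Standard TPM: length-normalised counts rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical read counts for three enhancers in three samples.
sample_counts = {
    "liver_1": [10, 0, 30],
    "liver_2": [20, 5, 10],
    "brain_1": [0, 40, 5],
}
sample_tissue = {"liver_1": "liver", "liver_2": "liver", "brain_1": "brain"}
lengths = [200, 400, 1000]  # hypothetical enhancer lengths in base pairs

# Sum the counts of all samples belonging to the same tissue.
tissue_counts = defaultdict(lambda: [0] * len(lengths))
for sample, counts in sample_counts.items():
    tissue = sample_tissue[sample]
    tissue_counts[tissue] = [a + b for a, b in zip(tissue_counts[tissue], counts)]

tissue_tpm = {tissue: tpm(counts, lengths) for tissue, counts in tissue_counts.items()}

# Select enhancers above a hypothetical threshold in each tissue.
threshold = 100000.0
expressed = {
    tissue: [i for i, value in enumerate(values) if value > threshold]
    for tissue, values in tissue_tpm.items()
}
```

Each entry of `expressed` lists the indices of the enhancers that pass the threshold in that tissue, which corresponds to one set of tissue characteristic enhancers.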
To validate the functionality of the eHMM enhancers, we performed the over-representation analysis (ORA) of the genes whose intronic enhancers were significantly expressed. We selected the intronic subsets of the resulting tissue characteristic enhancers, as well as of the tissue agnostic enhancers, using BEDtools, requiring 100% of an enhancer interval to lie within an intronic interval, and annotated the relevant genes with their gene symbols using the RefSeq annotation. We obtained tissue-specific gene sets by filtering out the gene symbols that were not unique to a single tissue.
The resulting tissue-specific gene lists were searched for significantly over-represented functional terms against the KEGG Pathway [] and Gene Ontology [] databases using the clusterProfiler tool [] with a Benjamini–Hochberg adjusted p-value cutoff of 0.05.
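The statistics behind an over-representation analysis can be sketched in Python as a hypergeometric upper-tail test followed by the Benjamini–Hochberg adjustment. This is an illustration of the general technique, not a reimplementation of clusterProfiler; the function names, gene symbols, and term annotations are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N genes in universe, K annotated, n drawn)."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical universe, gene list and term annotations.
universe = {f"G{i}" for i in range(1, 21)}
gene_list = {"G1", "G2", "G3", "G4", "G15"}
terms = {
    "term_A": {"G1", "G2", "G3", "G4"},
    "term_B": {"G10", "G11", "G12", "G15"},
    "term_C": {"G5", "G6", "G7", "G8", "G9"},
}

names = sorted(terms)
pvalues = [
    hypergeom_sf(len(gene_list & terms[t]), len(universe), len(terms[t]), len(gene_list))
    for t in names
]
padj = benjamini_hochberg(pvalues)
significant = [t for t, q in zip(names, padj) if q < 0.05]
```

Here `significant` contains the terms that pass the adjusted cutoff of 0.05 used in the analysis above.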
4.9. Validation with IsoSeq Data
We aligned the IsoSeq reads against the reference chicken genome, version GRCg7b, using the minimap2 tool [] with the preset for spliced alignment of long reads (-ax splice:hq option). The mapped reads (in BAM format) were overlapped with the predicted eHMM enhancers using the intersect command of the BEDtools package, reporting only the enhancer intervals that overlapped at least one mapping interval of the reads (-wo option). Thus, an enhancer from the eHMM set was considered detected if it overlapped at least one IsoSeq read. The output of BEDtools was aggregated using the uniq utility to count the eHMM enhancers that overlapped any of the IsoSeq reads.
Supplementary Materials
The supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms262210986/s1.
Author Contributions
Conceptualization, S.S.P. and V.A.G.; methodology, V.A.G., D.E.P. and V.S.G.; software, V.A.G., D.E.P., V.S.G. and S.S.P.; validation, V.A.G., D.E.P., V.S.G. and S.S.P.; formal analysis, V.A.G., D.E.P., V.S.G. and S.S.P.; investigation, V.A.G., D.E.P., V.S.G. and S.S.P.; resources, S.S.P.; data curation, V.A.G. and S.S.P.; writing—original draft preparation, V.A.G., D.E.P., V.S.G. and S.S.P.; writing—review and editing, S.S.P., O.A.G. and F.A.K.; visualization, V.A.G., D.E.P. and S.S.P.; supervision, S.S.P., O.A.G. and F.A.K.; project administration, S.S.P.; funding acquisition, S.S.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research and APC were funded by Russian Science Foundation (RSF), grant number 24-24-20106 (https://rscf.ru/project/24-24-20106/ (accessed on 9 November 2025)).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The genomic intervals of predicted enhancers presented in the study are openly available in the GTRD database at http://gtrd.biouml.org:8888/downloads/current/projects/ChickenEnhancers/2025/ (accessed on 9 November 2025) and in the GenTech database at https://chicken.biouml.org/downloads/ChickenResearch2025/Enhancers/ (accessed on 9 November 2025).
Conflicts of Interest
Author Oleg A. Gusev was employed by the company LIFT Center LLC. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| b.p. | base pairs |
| CAGE | Cap Analysis of Gene Expression |
| CDS | Coding Sequence |
| EMM | Explicit Markov Model |
| HMM | Hidden Markov Model |
| ORA | Over-Representation Analysis |
| TAD | Topologically Associating Domain |
| TPM | Transcripts Per Million |
References
- Foissac, S.; Djebali, S.; Munyard, K.; Vialaneix, N.; Rau, A.; Muret, K.; Esquerré, D.; Zytnicki, M.; Derrien, T.; Bardou, P.; et al. Multi-species annotation of transcriptome and chromatin structure in domesticated animals. BMC Biol. 2019, 17, 108.
- Fang, L.; Teng, J.; Lin, Q.; Bai, Z.; Liu, S.; Guan, D.; Li, B.; Gao, Y.; Hou, Y.; Gong, M.; et al. The Farm Animal Genotype-Tissue Expression (FarmGTEx) Project. Nat. Genet. 2025, 57, 786–796.
- Jin, W.; Jiang, G.; Yang, Y.; Yang, J.; Yang, W.; Wang, D.; Niu, X.; Zhong, R.; Zhang, Z.; Gong, J. Animal-eRNAdb: A comprehensive animal enhancer RNA database. Nucleic Acids Res. 2022, 50, D46–D53.
- Gao, T.; Qian, J. EnhancerAtlas 2.0: An updated resource with enhancer annotation in 586 tissue/cell types across nine species. Nucleic Acids Res. 2020, 48, D58–D64.
- Pan, Z.; Wang, Y.; Wang, M.; Wang, Y.; Zhu, X.; Gu, S.; Zhong, C.; An, L.; Shan, M.; Damas, J.; et al. An atlas of regulatory elements in chicken: A resource for chicken genetics and genomics. Sci. Adv. 2023, 9, eade1204.
- Fishman, V.; Battulin, N.; Nuriddinov, M.; Maslova, A.; Zlotina, A.; Strunov, A.; Chervyakova, D.; Korablev, A.; Serov, O.; Krasikova, A. 3D organization of chicken genome demonstrates evolutionary conservation of topologically associated domains and highlights unique architecture of erythrocytes’ chromatin. Nucleic Acids Res. 2019, 47, 648–665.
- Dixon, J.R.; Selvaraj, S.; Yue, F.; Kim, A.; Li, Y.; Shen, Y.; Hu, M.; Liu, J.S.; Ren, B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012, 485, 376–380.
- Zhang, G.; Li, C.; Li, Q.; Li, B.; Larkin, D.M.; Lee, C.; Storz, J.F.; Antunes, A.; Greenwold, M.J.; Meredith, R.W.; et al. Comparative genomics reveals insights into avian genome evolution and adaptation. Science 2014, 346, 1311–1320.
- Kawaji, H.; Kasukawa, T.; Forrest, A.; Carninci, P.; Hayashizaki, Y. The FANTOM5 collection, a data series underpinning mammalian transcriptome atlases in diverse cell types. Sci. Data 2017, 4, 170113.
- Lizio, M.; Deviatiiarov, R.; Nagai, H.; Galan, L.; Arner, E.; Itoh, M.; Lassmann, T.; Kasukawa, T.; Hasegawa, A.; Ros, M.A.; et al. Systematic analysis of transcription start sites in avian development. PLoS Biol. 2017, 15, e2002887.
- Bell, J.S.K.; Vertino, P.M. Orphan CpG islands define a novel class of highly active enhancers. Epigenetics 2017, 12, 449–464.
- Zhou, C.; Zhao, W.; Zhang, S.; Ma, J.; Sultan, Y.; Li, X. High-throughput transcriptome sequencing reveals the key stages of cardiovascular development in zebrafish embryos. BMC Genomics 2022, 23, 587.
- Niknafs, S.; Fortes, M.R.S.; Cho, S.; Black, J.L.; Roura, E. Alanine-specific appetite in slow growing chickens is associated with impaired glucose transport and TCA cycle. BMC Genomics 2022, 23, 393.
- Heintzman, N.D.; Hon, G.C.; Hawkins, R.D.; Kheradpour, P.; Stark, A.; Harp, L.F.; Ye, Z.; Lee, L.K.; Stuart, R.K.; Ching, C.W.; et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009, 459, 108–112.
- Wu, L.; Jiao, X.; Zhang, D.; Cheng, Y.; Song, G.; Qu, Y.; Lei, F. Comparative genomics and evolution of avian specialized traits. Curr. Genomics 2021, 22, 496–511.
- Abascal, F.; Acosta, R.; Addleman, N.J.; Adrian, J.; Afzal, V.; Ai, R.; Aken, B.; Akiyama, J.A.; Jammal, O.A.; Amrhein, H.; et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 2020, 583, 699–710.
- Yue, F.; Cheng, Y.; Breschi, A.; Vierstra, J.; Wu, W.; Ryba, T.; Sandstrom, R.; Ma, Z.; Davis, C.; Pope, B.D.; et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 2014, 515, 355–364.
- Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G. Biological Sequence Analysis; Cambridge University Press: Cambridge, UK, 2012.
- Irizarry, R.A.; Wu, H.; Feinberg, A.P. A species-generalized probabilistic model-based definition of CpG islands. Mamm. Genome 2009, 20, 674–680.
- Borsari, B.; Villegas-Mirón, P.; Pérez-Lluch, S.; Turpin, I.; Laayouni, H.; Segarra-Casas, A.; Bertranpetit, J.; Guigó, R.; Acosta, S. Enhancers with tissue-specific activity are enriched in intronic regions. Genome Res. 2021, 31, 1325–1336.
- Grushina, V.A.; Yevshin, I.S.; Gusev, O.A.; Kolpakov, F.A.; Stanishevskaya, O.I.; Fedorova, E.S.; Zinovieva, N.A.; Pintus, S.S. Prediction and annotation of alternative transcription starts and promoter shift in the chicken genome. J. Bioinform. Comput. Biol. 2025, 23, 2550004.
- Bizet, M.; Defrance, M.; Calonne, E.; Bontempi, G.; Sotiriou, C.; Fuks, F.; Jeschke, J. Improving Infinium MethylationEPIC data processing: Re-annotation of enhancers and long noncoding RNA genes and benchmarking of normalization methods. Epigenetics 2022, 17, 2434–2454.
- Mercer, T.R.; Dinger, M.E.; Mattick, J.S. Long non-coding RNAs: Insights into functions. Nat. Rev. Genet. 2009, 10, 155–159.
- Song, P.; Han, R.; Yang, F. Super enhancer lncRNAs: A novel hallmark in cancer. Cell Commun. Signal. 2024, 22, 207.
- Warnecke, T.; Batada, N.N.; Hurst, L.D. The impact of the nucleosome code on protein-coding sequence evolution in yeast. PLoS Genet. 2008, 4, e1000250.
- Birnbaum, R.Y.; Clowney, E.J.; Agamy, O.; Kim, M.J.; Zhao, J.; Yamanaka, T.; Pappalardo, Z.; Clarke, S.L.; Wenger, A.M.; Nguyen, L.; et al. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 2012, 22, 1059–1068.
- Beacon, T.H.; Davie, J.R. Transcriptionally active chromatin-lessons learned from the chicken erythrocyte chromatin fractionation. Cells 2021, 10, 1354.
- Beacon, T.H.; Davie, J.R. Chicken erythrocyte: Epigenomic regulation of gene activity. Int. J. Mol. Sci. 2023, 24, 8287.
- Ulianov, S.V.; Khrameeva, E.E.; Gavrilov, A.A.; Flyamer, I.M.; Kos, P.; Mikhaleva, E.A.; Penin, A.A.; Logacheva, M.D.; Imakaev, M.V.; Chertovich, A.; et al. Active chromatin and transcription play a key role in chromosome partitioning into topologically associating domains. Genome Res. 2016, 26, 70–84.
- Chu, J.; Ma, Y.; Song, H.; Zhao, Q.; Wei, X.; Yan, Y.; Fan, S.; Zhou, B.; Li, S.; Mou, C. The genomic characteristics affect phenotypic diversity from the perspective of genetic improvement of economic traits. iScience 2023, 26, 106426.
- Zhang, F.; Chen, H.; Chang, C.; Zhou, J.; Zhang, H. Comparative genomic analysis across multiple species to identify candidate genes associated with important traits in chickens. Genes 2025, 16, 627.
- Jiang, X.; Chu, Q.; Wei, G.; Gu, H.; Zhang, X.; Ren, X.; Chen, A.; Miao, X.; Yu, X.; Muhatai, G.; et al. A significant genomic region underlying growth traits in adult Beijing You chicken identified by genome-wide association analysis. Poult. Sci. 2025, 104, 105326.
- Mattioli, S.; Angelucci, E.; Castellini, C.; Cartoni Mancinelli, A.; Chenggang, W.; Di Federico, F.; Chiattelli, D.; Dal Bosco, A. Effect of genotype and outdoor enrichment on productive performance and meat quality of slow growing chickens. Poult. Sci. 2024, 103, 104131.
- Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
- Larsson, J. Eulerr: Area-Proportional Euler and Venn Diagrams with Ellipses, R package version 7.0.2.; R Foundation for Statistical Computing: Vienna, Austria, 2024.
- Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29, 15–21.
- Lizio, M.; Mukarram, A.K.; Ohno, M.; Watanabe, S.; Itoh, M.; Hasegawa, A.; Lassmann, T.; Severin, J.; Harshbarger, J.; Abugessaisa, I.; et al. Monitoring transcription initiation activities in rat and dog. Sci. Data 2017, 4, 170173.
- Haberle, V.; Forrest, A.R.R.; Hayashizaki, Y.; Carninci, P.; Lenhard, B. CAGEr: Precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015, 43, e51.
- Balwierz, P.J.; Carninci, P.; Daub, C.O.; Kawai, J.; Hayashizaki, Y.; Van Belle, W.; Beisel, C.; van Nimwegen, E. Methods for analyzing deep sequencing expression data: Constructing the human and mouse promoterome with deepCAGE data. Genome Biol. 2009, 10, R79.
- Andersson, R.; Gebhard, C.; Miguel-Escalada, I.; Hoof, I.; Bornholdt, J.; Boyd, M.; Chen, Y.; Zhao, X.; Schmidl, C.; Suzuki, T.; et al. An atlas of active enhancers across human cell types and tissues. Nature 2014, 507, 455–461.
- Hinrichs, A.S.; Karolchik, D.; Baertsch, R.; Barber, G.P.; Bejerano, G.; Clawson, H.; Diekhans, M.; Furey, T.S.; Harte, R.A.; Hsu, F.; et al. The UCSC Genome Browser Database: Update 2006. Nucleic Acids Res. 2006, 34, D590–D598.
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272, Erratum in Nat. Methods 2020, 17, 352.
- Zehnder, T.; Benner, P.; Vingron, M. Predicting enhancers in mammalian genomes using supervised hidden Markov models. BMC Bioinform. 2019, 20, 157.
- Liao, Y.; Smyth, G.K.; Shi, W. featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014, 30, 923–930.
- Benaglia, T.; Chauveau, D.; Hunter, D.R.; Young, D. mixtools: An R Package for Analyzing Finite Mixture Models. J. Stat. Softw. 2009, 32, 1–29.
- Kanehisa, M.; Furumichi, M.; Sato, Y.; Matsuura, Y.; Ishiguro-Watanabe, M. KEGG: Biological systems database as a model of the real world. Nucleic Acids Res. 2025, 53, D672–D677.
- Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T.; et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000, 25, 25–29.
- Yu, G. Thirteen years of clusterProfiler. Innovation 2024, 5, 100722.
- Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).