Article
Peer-Review Record

Prediction of Enhancer RNAs in Chicken Genome

Int. J. Mol. Sci. 2025, 26(22), 10986; https://doi.org/10.3390/ijms262210986
by Valentina A. Grushina 1, Valeria S. Gagarina 1, Danila E. Prasolov 1, Fedor A. Kolpakov 1, Oleg A. Gusev 2,3 and Sergey S. Pintus 1,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 28 July 2025 / Revised: 3 November 2025 / Accepted: 10 November 2025 / Published: 13 November 2025
(This article belongs to the Special Issue Bioinformatics of Gene Regulations and Structure–2025)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article "Prediction of Enhancer RNAs in Chicken Genome" is a high-quality and timely study aimed at the annotation of regulatory elements in the chicken genome. The work makes an important contribution to the FAANG and FarmGTEx initiatives by systematizing knowledge about enhancers in birds. The strengths of the study include the use of modern approaches (CAGE-seq, Markov models), the incorporation of data from multiple tissues, which increases biological relevance, and an attempt to quantitatively estimate the proportion of enhancer DNA.

Despite the overall high quality of the work, the manuscript could be significantly strengthened by addressing two key issues: stricter validation of the predicted regions to support the reported high genomic coverage, and a deeper functional analysis, which is currently lacking.

Main Comments and Suggestions for Improvement

1) Justification of the estimated enhancer fraction (~25% of the genome) and control of false positives.

The claimed figure of 25% enhancer DNA is a strong statement that requires additional validation beyond the homogeneity of Markov models. Sequence homogeneity confirms similarity but does not guarantee regulatory function. According to the description, the eHMM model mainly expanded the boundaries of predicted regions rather than filtered them. The manuscript should more clearly explain what filtering thresholds (e.g., CAGE signal expression levels in TPM) were used in the initial steps to reduce transcriptional noise. This is a standard practice that helps lower the number of false positives.
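
To illustrate the kind of expression threshold the reviewer refers to, here is a minimal Python sketch of a tags-per-million (TPM) filter. It is not taken from the manuscript: the cluster names, the toy counts and the `min_tpm` cutoff of 1.0 are all hypothetical, and TPM is computed here simply as the cluster count scaled to one million tags in the library.

```python
def tags_per_million(counts):
    """Convert raw CAGE tag counts per cluster to tags per million (TPM)."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n * 1_000_000 / total for name, n in counts.items()}


def filter_by_tpm(counts, min_tpm=1.0):
    """Keep clusters whose TPM reaches min_tpm (hypothetical cutoff)."""
    tpm = tags_per_million(counts)
    return {name: value for name, value in tpm.items() if value >= min_tpm}


# Toy library of 2,000,000 tags; cluster names are made up for illustration.
raw_counts = {"cluster_a": 12, "cluster_b": 1, "cluster_c": 1_999_987}
kept = filter_by_tpm(raw_counts, min_tpm=1.0)
print(sorted(kept))  # cluster_b (0.5 TPM) falls below the cutoff
```

In practice the cutoff would need to be justified against the background signal of each library rather than fixed in advance.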

2) Lack of functional annotation and biological interpretation

The major unrealized potential of the study lies in the absence of functional analysis of the predicted enhancers. The authors have unique CAGE-seq data across six tissues, which could allow the study to move from structural annotation to functional interpretation. For example, GO term enrichment and KEGG pathway analysis could be performed for gene targets associated with tissue-specific enhancers. One might expect that genes regulated by brain-active enhancers would be involved in neurogenesis, while those in breast muscle might be related to muscle development and contraction. Such analysis would serve as key functional validation of the findings.

3) The Venn diagrams

For reproducibility and clarity, it would also be helpful to provide exact numbers for the intersections shown in the Venn diagrams (Figures 1, 3, 4, and 7) as an additional table.
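
A table of this kind can be derived directly from the sets behind each diagram. The Python sketch below counts the elements in every exclusive region of a Venn diagram; it assumes each enhancer has already been reduced to a shared identifier (the names `e1` to `e6` and the sets `A`, `B`, `C` are hypothetical), which sidesteps the separate question of how overlapping genomic intervals are matched.

```python
from itertools import combinations


def venn_region_counts(named_sets):
    """Count elements in every exclusive region of a Venn diagram."""
    names = sorted(named_sets)
    rows = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Elements shared by all sets in this combination ...
            inside = set.intersection(*(named_sets[n] for n in combo))
            # ... minus anything that also belongs to a set outside it.
            others = [named_sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            rows[" & ".join(combo)] = len(inside - outside)
    return rows


# Toy identifiers standing in for enhancer regions.
sets = {
    "A": {"e1", "e2", "e3", "e4"},
    "B": {"e3", "e4", "e5"},
    "C": {"e4", "e6"},
}
counts = venn_region_counts(sets)
for region, n in counts.items():
    print(f"{region}\t{n}")
```

Because the regions are exclusive, the counts sum to the number of distinct elements, which gives a quick consistency check against the figure.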

Overall, this is a highly promising study with significant potential, of broad interest to biologists and professionals in the field of poultry genomics.

Author Response

Comment 1) Justification of the estimated enhancer fraction (~25% of the genome) and control of false positives

The claimed figure of 25% enhancer DNA is a strong statement that requires additional validation beyond the homogeneity of Markov models. Sequence homogeneity confirms similarity but does not guarantee regulatory function. According to the description, the eHMM model mainly expanded the boundaries of predicted regions rather than filtered them. The manuscript should more clearly explain what filtering thresholds (e.g., CAGE signal expression levels in TPM) were used in the initial steps to reduce transcriptional noise. This is a standard practice that helps lower the number of false positives.

Response:

We agree that the large number of predicted enhancers calls for further filtering. We have estimated the background level of expression of our predicted enhancers and selected those whose expression was significantly higher than the background. The method is described in the Materials and Methods section, subsections “4.7. Background and signal expression of predicted enhancers” and “4.8. Expression of enhancers in tissues and functional analysis” (lines 325-372), and the results are described in subsections “2.6. Functional annotation of intragenic enhancers” and “2.7. Functional annotation of tissue specific intragenic enhancers” (lines 132-159).

Comment 2) Lack of functional annotation and biological interpretation

The major unrealized potential of the study lies in the absence of functional analysis of the predicted enhancers. The authors have unique CAGE-seq data across six tissues, which could allow the study to move from structural annotation to functional interpretation. For example, GO term enrichment and KEGG pathway analysis could be performed for gene targets associated with tissue-specific enhancers. One might expect that genes regulated by brain-active enhancers would be involved in neurogenesis, while those in breast muscle might be related to muscle development and contraction. Such analysis would serve as key functional validation of the findings.

Response:

Based on the significantly expressed enhancers, we have estimated the enriched KEGG pathways and GO biological processes presumably controlled by the intronic enhancers of the relevant genes (lines 146-159).

Comment 3) The Venn diagrams

For reproducibility and clarity, it would also be helpful to provide exact numbers for the intersections shown in the Venn diagrams (Figures 1, 3, 4, and 7) as an additional table.

Response:

We have added the tables with the numbers from the Venn diagrams for clarity and information (Table 1 – lines 56-57, Table 2 – lines 73-74, Table 4 – lines 86-87, Table 6 – lines 126-127)

Reviewer 2 Report

Comments and Suggestions for Authors

 

The review report of the article entitled “Prediction of Enhancer RNAs in Chicken Genome”

A brief summary

This manuscript presents the results of a large-scale prediction of enhancer RNAs (eRNAs) in the chicken genome. This was achieved by using CAGE-seq across multiple tissues, followed by validation and refinement with explicit and hidden Markov models, as well as by comparing the results with public enhancer resources. The authors report approximately 4×10⁵ predicted enhancer regions, covering around 25% of the genome. They also report strong intronic enrichment, localisation within topologically associating domains (TADs), and a bimodal enhancer length distribution. Furthermore, they hypothesise the existence of a “long-enhancer” class in chickens. The strengths of the study are its use of CAGE (ideal for detecting bidirectional enhancer transcription), its multi-tissue scope and its clear, quantitative, genome-wide summaries.

 

General concept comments

The biological question addressed is timely: the systematic annotation of enhancers in a major livestock species with a distinct 3D genome architecture compared to mammals.

CAGE is conceptually well-matched to eRNA detection and promoter/enhancer TSS mapping, and coupling it with CAGEr and HMMs is methodologically coherent.

However, some of the conclusions drawn (e.g. 'chicken-specific long enhancers') are stronger than the current comparative evidence supports, and should be framed as hypotheses pending cross-species and experimental validation.

Reproducibility would be improved by fuller parameterisation and public scripts, and biological credibility would be substantially strengthened by targeted wet-lab validation on a subset of candidates.

 

Article

Testability & controls: The pipeline produces very large candidate sets (e.g. ~447k for Anderson and ~402k for non-overlapping eHMM), but there is no hold-out/test-set calibration (e.g. ROC/PR) nor any orthogonal wet-lab corroboration on a subset. Both of these are important controls against false positives at this scale.

Methodological detail: STAR/CAGEr/eHMM usage is named but under-specified (e.g., STAR options; CAGEr thresholds for bidirectional clusters; eHMM state/emission settings; statistical test types and multiple-testing handling). This limits reproducibility.

Interpretability: Results nicely quantify intronic vs intergenic overlaps and TAD enrichment; additional functional/contextual layers (GO/Pathways by enhancer class/length) would help connect predictions to biology/economics implied in the aims.

 

Data availability: The manuscript commendably provides files and BEDs; deposition in standard NCBI/EBI repositories would ensure permanence and discoverability.

 

Specific comments

  1. Choice of sequencing technology (CAGE-seq) for enhancer studies
    The authors used CAGE-seq, which precisely maps RNA 5′ ends and identifies clusters of transcription start sites (TSS), including the hallmark bidirectional, low-level transcription characteristic of active enhancers (eRNAs). This is an appropriate choice because CAGE (and derivatives such as RAMPAGE) is a reference method for promoter mapping and for detecting transcriptionally active enhancers, as highlighted in the Introduction.
    An additional strength is that CAGE scales well across multiple tissues (as done here for chicken), increasing the chance of capturing tissue-specific elements. Coupling CAGE with bidirectional cluster calling (CAGEr) and subsequent Markov-model–based processing constitutes a coherent analysis pipeline.

    I did not find any information in the manuscript as to whether the co-authors of the publication participated in the isolation/sequencing process.
  2. Is a report on the quality of isolation and sequencing available?

  3. Scientific value of addressing underexplored questions in diverse organisms
    Extending regulatory annotation beyond mammalian models is scientifically important: avian genomes differ in 3D architecture (shorter TADs, different compactness) and in the evolutionary dynamics of regulatory elements. Studies in chicken therefore add comparative and conceptual value (e.g., FAANG, FarmGTEx).

  4. Complexity and functional importance of the non-coding layer
    The non-coding genome comprises multiple classes—lncRNAs, miRNAs, eRNAs, enhancers, promoters, insulators—and higher-order organization (e.g., TADs) that together shape gene expression and tissue specificity. In this context, identification and functional classification (e.g., intronic vs intergenic enhancers; length strata; association with genes within TADs) are crucial to elucidate molecular mechanisms. The manuscript reports, for example, a predominance of enhancer overlap with genic regions (especially introns) and provides detailed base-pair tallies (Table 1), which is valuable for downstream functional interpretation.

  5. Comment on the claim of “a class of long enhancer elements specific to chickens”
    Regarding the sentence: “…we identified a class of long enhancer elements specific to chickens that appears absent in mammals”—current knowledge does not support such a categorical conclusion. To argue species/lineage specificity one would need: (i) richer, well-saturated cross-species resources covering many tissues; (ii) conservation/liftover analyses of sequences and TF motifs; and (iii) formal comparisons of length distributions and chromatin features between birds and mammals. At present, this should be framed as a hypothesis requiring further validation, rather than a firm conclusion. (The bimodal length distribution and homogeneous explicit Markov models are intriguing signals but not definitive proof of chicken specificity.)

  6. Scientific and economic rationale
    The research motivation is sound: better annotation of regulatory elements in a livestock species is relevant to traits of economic interest (growth, health). It would help to articulate a clearer pathway from the predicted enhancers to economic impact (e.g., specific traits/biological pathways likely regulated by the discovered elements and a plan for validating such links).

  7. Data sharing—commendation and suggestion
    It is commendable that analytical files (since 2023) and BED files with “Enhancer regions with flanking nucleosomal sites” (2025) were made publicly available. This aligns with open-science best practices. In addition, I recommend depositing data in widely used repositories (NCBI/EBI—e.g., SRA/ENA for raw reads; GEO/ArrayExpress for processed matrices; Zenodo/figshare for scripts) to ensure long-term accessibility and discoverability—especially since the referenced site is Russian-language and may be less accessible to the broader community.

  8. Materials and Methods—need for greater detail
    The toolchain is described accurately (STAR, CAGEr, eHMM, bedtools, SciPy, eulerr), but specific parameters and procedures are missing: STAR command-line options (mapping to GRCg7b), CAGEr thresholds and windows for “bidirectional promoters” (clustering/merging rules), candidate filtering criteria, eHMM training parameters (e.g., number of states, emission models, learnModel inputs), and statistical testing details in SciPy (test types, hypotheses, α, multiple-testing corrections). I recommend adding scripts/notebooks (public repo) to enable full reproducibility.

  9. Validation
    a. I appreciate the validation attempts (Markov-model homogeneity, eHMM vs bidirectional promoters), but at the scale of ~447k candidates stronger evidence is needed to constrain the potential false-positive rate. Unimodality of Euclidean-distance distributions and agreement with eHMM are useful yet still indirect.
    b. The eHMM expansion of coordinates (+~43 Mb) and the final count of about 450k regions are impressive, but the manuscript lacks information to assess accuracy/calibration (e.g., cross-validation/hold-out, ROC/PR curves, fraction confirmed by independent resources or chromatin marks). Please include metrics, a test set, and benchmarking versus eRNAdb/EnhancerAtlas/MCE.
    c. Additionally, I suggest validating bioinformatic analyses using additional biotechnological methods (wet lab), for example ChIP-seq with enhancer labelling or RT-qPCR for the 5' cap.

  10. Figures
    a. Overview schematic of chromatin architecture (enhancer–promoter within a TAD, flanking nucleosome “landing sites,” eRNA bidirectionality) to orient readers.
    b. Figure 4 / Table 1 — In addition to bar charts, consider pie plots (or normalized stacked bars to 100%) for proportions (intronic/intergenic/non-CDS exonic/CDS); for overlapping categories, a layered plot can be used.
    c. Genomic localization — Chromosome-level density maps (e.g., Circos or violin/bean plots per chromosome, based for instance on data from the supplementary table) and heatmaps with TAD annotation to visualize in-TAD enrichment (reported in the text).
    d. Length comparison — Visualize the bimodal length distribution and, where possible, compare with other species (from public mammalian atlases) to place the “long enhancers” in broader context.
    e. Technical note on Figure 1 - The figure appears truncated on the right; please re-export with margins ensuring all labels and the legend are visible.

  11. Functional analysis—Gene Ontology/pathway enrichment
    After identifying overlapping genes (especially intronic) and/or nearest putative targets, perform GO and pathway enrichment (GO BP/CC/MF; KEGG/Reactome), ideally stratified by short vs long enhancers and intragenic vs intergenic categories. This will link the observations to biological hypotheses (and potentially to the stated economic relevance).

  12. Number formatting and style consistency
    For large values, please use thousands separators and enforce a consistent style throughout the manuscript (text, plot axes, tables). This materially improves readability (e.g., 223,610,494 instead of 223610494). This applies to Table 1 totals and other counts (e.g., 401,533 regions; 25.5% of the genome).

  13. Overall character of the work and general recommendation
    The Discussion is clear and well-structured, but the current version is purely computational. With such a large number of novel regions (hundreds of thousands) and no experimental validation, it is difficult to realize the claimed practical/economic significance. I recommend: (i) adding at least part of the wet-lab validation in point 9c; (ii) expanding “Materials and Methods” with parameters and scripts enabling full reproducibility; and (iii) tempering statements on the putative “chicken-specific” long enhancer class. With these improvements, the manuscript will be methodologically stronger and more compelling for IJMS.

 

Author Response

Comments 1 & 2:
Choice of sequencing technology (CAGE-seq) for enhancer studies

The authors used CAGE-seq, which precisely maps RNA 5′ ends and identifies clusters of transcription start sites (TSS), including the hallmark bidirectional, low-level transcription characteristic of active enhancers (eRNAs). This is an appropriate choice because CAGE (and derivatives such as RAMPAGE) is a reference method for promoter mapping and for detecting transcriptionally active enhancers, as highlighted in the Introduction.
An additional strength is that CAGE scales well across multiple tissues (as done here for chicken), increasing the chance of capturing tissue-specific elements. Coupling CAGE with bidirectional cluster calling (CAGEr) and subsequent Markov-model–based processing constitutes a coherent analysis pipeline.
I did not find any information in the manuscript as to whether the co-authors of the publication participated in the isolation/sequencing process.
Is a report on the quality of isolation and sequencing available?
Response:
The work covered by the current manuscript is purely computational and does not cover the experiment by itself.
We used publicly available CAGE-seq data, as mentioned in Materials and Methods (lines 230-237). Unfortunately, wet-lab quality-control data for the libraries were not available.

Comment 5:
Comment on the claim of “a class of long enhancer elements specific to chickens”
Regarding the sentence: “…we identified a class of long enhancer elements specific to chickens that appears absent in mammals”—current knowledge does not support such a categorical conclusion. To argue species/lineage specificity one would need: (i) richer, well-saturated cross-species resources covering many tissues; (ii) conservation/liftover analyses of sequences and TF motifs; and (iii) formal comparisons of length distributions and chromatin features between birds and mammals. At present, this should be framed as a hypothesis requiring further validation, rather than a firm conclusion. (The bimodal length distribution and homogeneous explicit Markov models are intriguing signals but not definitive proof of chicken specificity.)
Response:
We agree that the claim made in the abstract was exaggerated, and we have now rephrased it in a less categorical form: “...we identified a class of long enhancer elements specific to chickens which we hypothesize to be absent in mammals” (line 18).

Comment 6:
Scientific and economic rationale
The research motivation is sound: better annotation of regulatory elements in a livestock species is relevant to traits of economic interest (growth, health). It would help to articulate a clearer pathway from the predicted enhancers to economic impact (e.g., specific traits/biological pathways likely regulated by the discovered elements and a plan for validating such links).
Response:
We have added a paragraph to the Discussion in which we speculate on the possible economic impact of our findings (lines 223-227).

Comment 7:
Data sharing—commendation and suggestion

It is commendable that analytical files (since 2023) and BED files with “Enhancer regions with flanking nucleosomal sites” (2025) were made publicly available. This aligns with open-science best practices. In addition, I recommend depositing data in widely used repositories (NCBI/EBI—e.g., SRA/ENA for raw reads; GEO/ArrayExpress for processed matrices; Zenodo/figshare for scripts) to ensure long-term accessibility and discoverability—especially since the referenced site is Russian-language and may be less accessible to the broader community.
Response:
We have included the coordinates and estimated expression of the predicted enhancers as Supplementary Table 2.
The raw data was made public on that specific website by the team of the GenTech project. They have translated the website (http://chicken.biouml.org) into English and the upload to SRA should happen soon.

Comment 8:
Materials and Methods—need for greater detail
The toolchain is described accurately (STAR, CAGEr, eHMM, bedtools, SciPy, eulerr), but specific parameters and procedures are missing: STAR command-line options (mapping to GRCg7b), CAGEr thresholds and windows for “bidirectional promoters” (clustering/merging rules), candidate filtering criteria, eHMM training parameters (e.g., number of states, emission models, learnModel inputs), and statistical testing details in SciPy (test types, hypotheses, α, multiple-testing corrections). I recommend adding scripts/notebooks (public repo) to enable full reproducibility.
Response:
We have rewritten the Materials and Methods section and added a description of the CAGE data processing and details on the eHMM pipeline (lines 245-246, 259-271, 289-290 and 316-324).

Comment 9:
Validation
a. I appreciate the validation attempts (Markov-model homogeneity, eHMM vs bidirectional promoters), but at the scale of ~447k candidates stronger evidence is needed to constrain the potential false-positive rate. Unimodality of Euclidean-distance distributions and agreement with eHMM are useful yet still indirect.
b. The eHMM expansion of coordinates (+~43 Mb) and the final count of about 450k regions are impressive, but the manuscript lacks information to assess accuracy/calibration (e.g., cross-validation/hold-out, ROC/PR curves, fraction confirmed by independent resources or chromatin marks). Please include metrics, a test set, and benchmarking versus eRNAdb/EnhancerAtlas/MCE.
c. Additionally, I suggest validating bioinformatic analyses using additional biotechnological methods (wet lab), for example ChIP-seq with enhancer labelling or RT-qPCR for the 5' cap.
Response:

  1. To address the problem of false positives, we estimated the total expression of the enhancers predicted with eHMM. Lowly expressed intervals were eliminated using a Gaussian mixture model with two categories (see Materials and Methods, subsections “4.7. Background and signal expression of predicted enhancers” and “4.8. Expression of enhancers in tissues and functional analysis”, lines 325-372). We were left with 12,242 intervals, which we analyzed further (see subsections “2.6. Functional annotation of intragenic enhancers” and “2.7. Functional annotation of tissue specific intragenic enhancers”, lines 132-159).
  2. Unfortunately, we do not have chromatin mark data for the Russian White genome, so we had to use the available data by Fishman et al. We have performed Fisher's exact test to demonstrate the significance of overlaps between eHMM predictions and Enhancer Atlas (Table 3, lines 74-76 and 289-290).
  3. We agree that additional experiments could be extremely beneficial to the results of the work but, as mentioned earlier, our computational project did not cover any experimental work and, unfortunately, we cannot perform any experiments within it.
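
For readers unfamiliar with the filtering idea in point 1, the following Python sketch fits a one-dimensional, two-component Gaussian mixture by expectation-maximisation and labels a value as "signal" when the higher-mean component is the more likely source. It is an illustration under stated assumptions, not the authors' implementation: the input is assumed to be log-scale expression, the toy values in `log_expr` are invented, and the function names `fit_two_gaussians` and `is_signal` are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=200):
    """Fit a 1-D two-component Gaussian mixture by EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                    # start one component at each extreme
    spread = (hi - lo) / 4 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp.append(p[1] / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            r = [rk if k == 1 else 1.0 - rk for rk in resp]
            nk = sum(r)
            if nk < 1e-12:
                continue
            means[k] = sum(ri * x for ri, x in zip(r, values)) / nk
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a zero-width component
            weights[k] = nk / len(values)
    return weights, means, sds


def is_signal(x, weights, means, sds):
    """True if x is more likely under the higher-mean component."""
    high = 0 if means[0] > means[1] else 1
    low = 1 - high
    return (weights[high] * normal_pdf(x, means[high], sds[high])
            > weights[low] * normal_pdf(x, means[low], sds[low]))


# Invented log-expression values: a background cluster and a signal cluster.
log_expr = [-0.4, -0.2, 0.0, 0.1, 0.3, 0.5, 4.6, 4.9, 5.0, 5.2, 5.5]
weights, means, sds = fit_two_gaussians(log_expr)
signal = [x for x in log_expr if is_signal(x, weights, means, sds)]
print(len(signal), "of", len(log_expr), "values classified as signal")
```

On real data the two components may overlap substantially, so the posterior probability threshold for calling "signal" becomes a design choice worth reporting.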

Comment 10:
Figures
a. Overview schematic of chromatin architecture (enhancer–promoter within a TAD, flanking nucleosome “landing sites,” eRNA bidirectionality) to orient readers.
b. Figure 4 / Table 1 — In addition to bar charts, consider pie plots (or normalized stacked bars to 100%) for proportions (intronic/intergenic/non-CDS exonic/CDS); for overlapping categories, a layered plot can be used.
c. Genomic localization — Chromosome-level density maps (e.g., Circos or violin/bean plots per chromosome, based for instance on data from the supplementary table) and heatmaps with TAD annotation to visualize in-TAD enrichment (reported in the text).
d. Length comparison — Visualize the bimodal length distribution and, where possible, compare with other species (from public mammalian atlases) to place the “long enhancers” in broader context.
e. Technical note on Figure 1 - The figure appears truncated on the right; please re-export with margins ensuring all labels and the legend are visible.
Response:

  1. We have added a visualization of a region with predicted enhancers along with flanking regions (Figure 5, lines 92-93)
  2. We found it more informative and technically feasible to add a table clarifying the numbers in the Venn diagrams in Figure 4 (Table 4, lines 86-87)
  3. We have added the enhancer density karyoplots (Figure 9, lines 131-132)
  4. We have added a plot for bimodal distribution of lengths of predicted enhancers (Figure 7, lines 107-108)
  5. We have re-exported Figure 1 (lines 56-57)

Comment 11:
Functional analysis—Gene Ontology/pathway enrichment
After identifying overlapping genes (especially intronic) and/or nearest putative targets, perform GO and pathway enrichment (GO BP/CC/MF; KEGG/Reactome), ideally stratified by short vs long enhancers and intragenic vs intergenic categories. This will link the observations to biological hypotheses (and potentially to the stated economic relevance).
Response:
We have performed an overrepresentation analysis of intragenic enhancers (ORA), which were significantly expressed, against the KEGG and GO BP databases (see subsections “2.6. Functional annotation of intragenic enhancers” and “2.7. Functional annotation of tissue specific intragenic enhancers”, lines 132-159, Figure 10, Tables 8 and 10).
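
As a reminder of what an overrepresentation analysis computes, the sketch below evaluates the upper tail of the hypergeometric distribution, which is the usual test behind ORA. It is a generic illustration, not the authors' pipeline: the function name `ora_p_value` and the toy numbers are hypothetical, and a real analysis would additionally correct for testing many pathways.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: all annotated genes considered
    n_pathway:  genes in the pathway or GO term
    n_selected: genes carrying a significantly expressed enhancer
    n_overlap:  selected genes that belong to the pathway
    """
    max_k = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total


# Toy numbers: 20 genes in total, 5 in the pathway, 4 selected, 3 shared.
p = ora_p_value(n_universe=20, n_pathway=5, n_selected=4, n_overlap=3)
print(f"p = {p:.4f}")
```

The choice of gene universe strongly affects the result, so it should be stated alongside any enrichment table.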

Comment 12:
Number formatting and style consistency
For large values, please use thousands separators and enforce a consistent style throughout the manuscript (text, plot axes, tables). This materially improves readability (e.g., 223,610,494 instead of 223610494). This applies to Table 1 totals and other counts (e.g., 401,533 regions; 25.5% of the genome)
Response:
All numbers larger than 999 now have comma thousands separators in the text and tables, except in the eulerr Venn diagrams, where the comma separator was technically impossible.
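
For table and text values generated programmatically, the separator can be applied at formatting time. A small Python illustration using the built-in format specification (the helper name `with_separators` is hypothetical):

```python
def with_separators(n):
    """Format an integer with comma thousands separators."""
    return f"{n:,}"


print(with_separators(223610494))  # 223,610,494
print(with_separators(401533))     # 401,533
print(with_separators(999))        # 999
```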

 

Reviewer 3 Report

Comments and Suggestions for Authors

This is an interesting study on functional genomics, but unfortunately, it has a number of major issues that I believe need to be addressed.

Here are my main comments:

In the Methods section, the authors state they used IsoSeq data to validate their eRNA predictions. However, there's no description of how this validation was done. Then, neither the Results nor the Discussion mention what this analysis actually showed. So it seems the eRNA predictions were not really verified by the authors.

Also, without any experimental validation (like luciferase reporter assays, STARR-seq, CRISPRi/a screening, etc.), the genomic regions they found are really just predictions of enhancers. I think it would be important to clearly state in the Discussion that this is a predictive study and that it needs further experimental follow-up.

The proposed "homogeneity assessment" method based on Markov models feels like a bit of a circular argument. It only shows that the algorithm picked out sequences with similar statistical properties, but it doesn't really say anything about their biological function. This approach can't be used as proof of a low false-positive rate.

There are some great enhancer predictions already out there using ChIP-seq, ATAC, and RNA data, like these papers:

Kern, C. et al., Nat Commun 12, 1821 (2021).

Zhangyuan Pan et al., Sci. Adv. 9, eade1204 (2023).

The authors should have used these datasets for cross-validation and to discuss the pros and cons of different approaches. These papers generated a huge amount of data (H3K27ac, H3K4me1 ChIP-seq, ATAC-seq, etc.) that the authors could have used to verify their predictions, maybe with some statistical tests like permutation tests.
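
A permutation test of the kind suggested here can be sketched as follows. The Python example is a simplified illustration, not a recommended production method: it assumes a single chromosome, half-open intervals, uniform random re-placement of each predicted region, and invented coordinates; real analyses usually also preserve chromosome assignment and exclude unmappable regions.

```python
import random


def count_overlapping(query, reference):
    """Number of query intervals overlapping at least one reference interval."""
    return sum(
        any(q_start < r_end and r_start < q_end for r_start, r_end in reference)
        for q_start, q_end in query
    )


def permutation_p_value(query, reference, chrom_length, n_perm=1000, seed=0):
    """Empirical p-value for observing at least as many overlaps by chance."""
    rng = random.Random(seed)
    observed = count_overlapping(query, reference)
    at_least = 0
    for _ in range(n_perm):
        shuffled = []
        for q_start, q_end in query:
            length = q_end - q_start
            # Re-place the interval uniformly, keeping its length.
            start = rng.randrange(0, chrom_length - length + 1)
            shuffled.append((start, start + length))
        if count_overlapping(shuffled, reference) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return observed, (at_least + 1) / (n_perm + 1)


# Invented coordinates on a toy 100 kb chromosome.
reference = [(1000, 2000), (50000, 51000), (90000, 90500)]   # e.g. ChIP-seq peaks
predicted = [(1500, 1800), (50500, 50700), (90100, 90200), (30000, 30100)]
observed, p_value = permutation_p_value(predicted, reference, chrom_length=100_000)
print(observed, p_value)
```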

It's also really important to discuss why in this study the enhancers cover ~25.5% of the chicken genome, while in the Pan et al. (2023) paper it’s only 8.86%. That's a huge difference and might suggest a high number of false positives here.

This issue might come from the bidirectional transcription method. The authors don't specify the parameters used in the CAGEr package. For example: What was the distance cutoff from known genes to distinguish an enhancer from a promoter? How was the balance of the bidirectional signal assessed to filter out asymmetric signals from gene promoters?
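
One common way to quantify the balance asked about here is a directionality score, D = (F - R) / (F + R), where F and R are the tag counts on the forward and reverse strands around a candidate site. The Python sketch below is illustrative only: the manuscript does not state which measure or cutoff was used, and the `max_abs_d` value of 0.8 is an assumed default rather than a reported parameter.

```python
def directionality(forward_tags, reverse_tags):
    """Directionality score D = (F - R) / (F + R), ranging from -1 to 1."""
    total = forward_tags + reverse_tags
    if total == 0:
        raise ValueError("no tags on either strand")
    return (forward_tags - reverse_tags) / total


def is_balanced(forward_tags, reverse_tags, max_abs_d=0.8):
    """True if transcription is roughly bidirectional (assumed cutoff)."""
    return abs(directionality(forward_tags, reverse_tags)) < max_abs_d


print(is_balanced(40, 60))  # D = -0.2, roughly balanced
print(is_balanced(98, 2))   # D = 0.96, strongly one-sided, promoter-like
```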

Another source of false positives could be the bidirectional promoters in the 5' UTR of LINEs or in the LTRs of ERVs. Yes, some of these can become enhancers, but not all bidirectional promoters from repeats will be enhancers.

The authors themselves admit that the chicken genome annotation isn't complete. This means their method could have picked up a bidirectional signal that is actually the promoter of an unannotated gene (like a new lncRNA) and mistakenly called it an enhancer.

The analysis of enhancer location relative to TADs has a major flaw. In Figure 7, the authors combine TAD data from fibroblasts and erythrocytes, citing Fishman et al. (2019). However, a key finding of the Fishman paper is that chicken erythrocytes do not have typical TADs. The structures they find there are local compartments in a super condensed, transcriptionally inactive nucleus. Comparing the location of active enhancers from various tissues with the chromatin architecture of terminally differentiated erythrocytes doesn't seem biologically meaningful.

Other Points

  • The claim about discovering a class of "long enhancers specific to chickens" is a bit speculative, since no comparative analysis with mammalian data was done using the same methodology.
  • Figure 2: the title on the graph (“Histogram of enhancer scores”) doesn't match the detailed description in the caption (“Distribution of Euclidean distances…”). And the X-axis is just labeled “Score”, which is a bit confusing.
  • The statement in the abstract that enhancers cover "roughly one-third of the entire genome" is an overstatement. The results show a more precise figure of 25.5%, which is closer to a quarter.

Author Response

Comment 1 - IsoSeq missed in Results:
In the Methods section, the authors state they used IsoSeq data to validate their eRNA predictions. However, there's no description of how this validation was done. Then, neither the Results nor the Discussion mention what this analysis actually showed. So it seems the eRNA predictions were not really verified by the authors.
Response:
We have added the missing section with the results on the IsoSeq validation of our enhancer predictions (subsection “2.4 Validation of enhancer predictions with IsoSeq data”, lines 108-112)

Comment 2 - Lack of experimental validation:
Also, without any experimental validation (like luciferase reporter assays, STARR-seq, CRISPRi/a screening, etc.), the genomic regions they found are really just predictions of enhancers. I think it would be important to clearly state in the Discussion that this is a predictive study and that it needs further experimental follow-up.
Response:
We agree that the hypothetical nature of the computational results must be clearly addressed. We have added the relevant text to the Discussion (lines 219-222).

Comment 3 - Homogeneity assessment:
The proposed "homogeneity assessment" method based on Markov models feels like a bit of a circular argument. It only shows that the algorithm picked out sequences with similar statistical properties, but it doesn't really say anything about their biological function. This approach can't be used as proof of a low false-positive rate.
Response:
We agree that, owing to a lack of clarity in the text, a reader might mistake the homogeneity assessment for a circular argument.
The hidden Markov models do indeed pick their predictions on the basis of sequence statistics. Had we estimated the homogeneity of the eHMM predictions and presented that as an argument for the validity of the predictions, it would have been a circular argument.
Nevertheless, we applied the homogeneity assessment not to the eHMM predictions but to the initial predictions made on the basis of the expression values of the CAGE tags (Andersson’s method; Andersson et al., 2014). Andersson’s method does not rely on sequence statistics, so validating those predictions with the homogeneity of explicit Markov models is independent of how the predictions were made.
To clarify how the homogeneity approach was used in our work, we added more details to the Results section (lines 59-67, 272-274).

Comment 4 - Validation with enhancers from Kern (2021) and Pan (2023):
There are some great enhancer predictions already out there using ChIP-seq, ATAC, and RNA data, like these papers:
Kern, C. et al., Nat Commun 12, 1821 (2021).
Zhangyuan Pan et al., Sci. Adv. 9, eade1204 (2023).
The authors should have used these datasets for cross-validation and to discuss the pros and cons of different approaches. These papers generated a huge amount of data (H3K27ac, H3K4me1 ChIP-seq, ATAC-seq, etc.) that the authors could have used to verify their predictions, maybe with some statistical tests like permutation tests.
Response:
It would be very interesting and important for our manuscript to compare our results with those of the FAANG project. Nevertheless, we could not obtain the enhancer intervals from Supplementary of both papers for some reason. We have asked the authors to provide the data.
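
For illustration only, the permutation test suggested in Comment 4 could look like the minimal Python sketch below. It assumes a single chromosome, half-open (start, end) intervals, and uniform random re-placement of the predicted intervals with their lengths preserved. The function names and parameters are hypothetical and are not taken from the manuscript or from the cited datasets; a real analysis would work per chromosome and would probably exclude assembly gaps.

```python
import bisect
import random


def count_overlapping(queries, reference):
    """Count query intervals that overlap at least one reference interval.

    Intervals are half-open (start, end) tuples on one chromosome.
    """
    # Merge the reference intervals so that they are sorted and disjoint.
    merged = []
    for start, end in sorted(reference):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [start for start, _ in merged]

    hits = 0
    for q_start, q_end in queries:
        # Last merged interval that begins before the query ends.
        i = bisect.bisect_left(starts, q_end) - 1
        if i >= 0 and merged[i][1] > q_start:
            hits += 1
    return hits


def permutation_test(predicted, reference, chrom_length, n_perm=1000, seed=0):
    """Empirical p-value for the observed overlap count (hypothetical helper)."""
    rng = random.Random(seed)
    observed = count_overlapping(predicted, reference)
    lengths = [end - start for start, end in predicted]

    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Re-place every predicted interval at a uniformly random position.
        shuffled = []
        for length in lengths:
            start = rng.randint(0, chrom_length - length)
            shuffled.append((start, start + length))
        if count_overlapping(shuffled, reference) >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value
```

The add-one correction in the last step means that the smallest reportable p-value is 1 / (n_perm + 1), so the number of permutations bounds the resolution of the test.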

Comment 5 - Large amount of predicted enhancers and false positives:
It's also really important to discuss why in this study the enhancers cover ~25.5% of the chicken genome, while in the Pan et al. (2023) paper it’s only 8.86%. That's a huge difference and might suggest a high number of false positives here.
This issue might come from the bidirectional transcription method. The authors don't specify the parameters used in the CAGEr package. For example: What was the distance cutoff from known genes to distinguish an enhancer from a promoter? How was the balance of the bidirectional signal assessed to filter out asymmetric signals from gene promoters?
Another source of false positives could be the bidirectional promoters in the 5' UTR of LINEs or in the LTRs of ERVs. Yes, some of these can become enhancers, but not all bidirectional promoters from repeats will be enhancers.
The authors themselves admit that the chicken genome annotation isn't complete. This means their method could have picked up a bidirectional signal that is actually the promoter of an unannotated gene (like a new lncRNA) and mistakenly called it an enhancer.
Response:
We agree that such an amount of predicted enhancers is surprising. Though there is evidence that the enhancer content exceeds gene content, we have performed the expression analysis of the predicted enhancers to address the issue of the false positives. We found that slightly more than 12,000 enhancers appeared to be highly expressed in chicken tissues (see Materials and Methods, subsections “4.7 Background and signal expression of predicted enhancers” and “4.8. Expression of enhancers in tissues and functional analysis”, lines 300-347). We further analysed those enhancers in more detail (see Results, subsections “2.6. Functional annotation of intragenic enhancers” and “2.7. Functional annotation of tissue specific intragenic enhancers”, lines 137-150)
We have also described the bidirectionality check performed with CAGEr in more detail in Materials and Methods (lines 273-279)

Comment 6 - TADs in chicken erythrocytes:
The analysis of enhancer location relative to TADs has a major flaw. In Figure 7, the authors combine TAD data from fibroblasts and erythrocytes, citing Fishman et al. (2019). However, a key finding of the Fishman paper is that chicken erythrocytes do not have typical TADs. The structures they find there are local compartments in a super condensed, transcriptionally inactive nucleus. Comparing the location of active enhancers from various tissues with the chromatin architecture of terminally differentiated erythrocytes doesn't seem biologically meaningful.
Response:
It was surprising for us to see similar contact maps between nuclei of embryonic fibroblasts and “adult” erythrocytes. Nevertheless, in their Materials and Methods, Fishman et al. describe how they enriched immature erythrocytes in chicken blood up to 95% by inducing an anemic state in the birds. In fact, they analysed three cell types: embryonic fibroblasts (CEF), immature erythrocytes (CIE) from anemic birds and mature erythrocytes (CME) from control birds. In mature erythrocytes they did indeed observe mostly condensed chromatin (for instance, in Figure 2B), exactly as expected (compare with Beacon & Davie, 2021, https://doi.org/10.3390/cells10061354 ). However, the erythrocyte dataset released in the Ontogen database refers to immature erythrocytes, whose chromatin state depends on the stage of erythropoiesis.
Judging from the high similarity between chromosomal contacts in immature erythrocytes and embryonic fibroblasts, we suggest that the erythrocytes from anemic birds were young enough that their chromatin had not yet condensed. In their paper, Fishman et al. observed and discussed a clear map of chromosomal contacts in erythrocytes and referred to these structures as “topologically associated domains” (Fishman et al., Figure 3A), which seems acceptable given the observation that even in condensed chromatin the chromosomal contacts tend to retain the structure of active TADs (https://doi.org/10.1101/gr.196006.115 ).
We have added the relevant discussion to the manuscript (lines 202-218)

Comment 7 - Other Points:

  1. The claim about discovering a class of "long enhancers specific to chickens" is a bit speculative, since no comparative analysis with mammalian data was done using the same methodology.
  2. Figure 2: the title on the graph (“Histogram of enhancer scores”) doesn't match the detailed description in the caption (“Distribution of Euclidean distances…”). And the X-axis is just labeled “Score”, which is a bit confusing.
  3. The statement in the abstract that enhancers cover "roughly one-third of the entire genome" is an overstatement. The results show a more precise figure of 25.5%, which is closer to a quarter.

Response:

We agree that the claim about "long enhancer elements specific to chickens" is exaggerated and have rephrased it in softer terms (lines 17-18).

The title and X axis labels in Figure 2 were rephrased

The statement of “roughly one-third of the entire genome" was a typo and was rephrased to “one-fourth” (line 13).

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Accept in present form

Author Response

Comments 1. Accept in present form
Response 1. We cordially thank the reviewer for their interesting previous comments

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have made an effort to address the comments from the previous review. Several minor issues have been corrected, and the discussion regarding the use of TAD data from erythrocytes has been improved. However, significant flaws identified in the initial review remain largely unresolved.

1) The authors added a new subsection to address the missing IsoSeq validation. It consists of two sentences stating that just under 50% of the predicted enhancers were "detected" in an IsoSeq experiment. This is insufficient. The authors provide no information on how this validation was performed. How was an enhancer considered "detected"? What were the expression thresholds, mapping criteria, and statistical methods used? Without this information, the result is uninterpretable. A validation rate of less than 50% is not strong evidence for the accuracy of the prediction method. This result raises more questions than it answers. The authors do not discuss the implications of more than half of their predictions not being supported by this orthogonal long-read sequencing data. This could indicate a very high false-positive rate, which is a central concern of this manuscript.
2) The authors' prediction that enhancers cover ~25.5% of the chicken genome is an extraordinary claim, especially when compared to the 8.86% reported in the comprehensive Pan et al. (2023) study. The authors have failed to address this discrepancy convincingly. The response that the authors "could not obtain the enhancer intervals from Supplementary of both papers for some reason" is not an acceptable reason to omit a comparison. The authors’ strategy of filtering their >400,000 predictions down to a set of ~12,000 highly expressed enhancers for functional analysis does not solve the problem. The core claim of the paper is the initial prediction of over 400,000 enhancers. The manuscript must provide a much stronger defense of this number and directly address the likely sources of false positives (e.g., bidirectional promoters of unannotated genes, repetitive elements). The parameters used to distinguish enhancers from promoters (e.g., distance from TSS) are still not fully clarified in the Methods section.
3) The authors have clarified that their homogeneity assessment was performed on the initial predictions, thus avoiding the circularity argument. However, they continue to make the unsubstantiated claim that this analysis "suggested a low proportion, if not a complete absence, of false positive predictions" (lines 60-61). Sequence statistics homogeneity is not a proxy for biological function. This method only shows that the algorithm selected sequences with similar nucleotide patterns; it cannot be used as proof of a low false-positive rate, and this strong claim should be removed.

 

Minor issues:

1) The legend for Figure 10 is "Histogram of lengths of enhancers predicted by eHMM." This is incorrect, as Figure 10 displays the results of a KEGG pathway enrichment analysis.
2) Table 1: The numbers don't sum correctly. Individual categories total 583,668, but the claimed total is 447,451. The transition from 447,451 to 401,533 enhancers between methods is poorly explained.

3) Multiple Fisher's exact tests performed without correction for multiple comparisons.

4) Table 3/7: All labeled "Fisher's exact test" but P-values shown as 0.00 or 0.000 - should specify actual P-values or use scientific notation.

Author Response

Comments 1) The authors added a new subsection to address the missing IsoSeq validation. It consists of two sentences stating that just under 50% of the predicted enhancers were "detected" in an IsoSeq experiment. This is insufficient. The authors provide no information on how this validation was performed. How was an enhancer considered "detected"? What were the expression thresholds, mapping criteria, and statistical methods used? Without this information, the result is uninterpretable. A validation rate of less than 50% is not strong evidence for the accuracy of the prediction method. This result raises more questions than it answers. The authors do not discuss the implications of more than half of their predictions not being supported by this orthogonal long-read sequencing data. This could indicate a very high false-positive rate, which is a central concern of this manuscript.

Response 1: We agree that the IsoSeq part was insufficiently detailed in its previous form. Using BEDtools, we intersected the eHMM enhancer intervals with the IsoSeq reads mapped to the chicken genome and counted every enhancer that overlapped at least one IsoSeq read. We added a description of the IsoSeq data analysis to Materials and Methods, subsection “4.9. Validation with IsoSeq data” (lines 423 – 431). We also agree that false positives are undoubtedly present in the initial uncorrected prediction of enhancers from bidirectional transcription and in the subsequent eHMM learning; we therefore took the cross-validation against the IsoSeq data as the basis for correction. The likely sources of the overestimated number of enhancers are the simplicity of the bidirectional approach itself and the very high coverage of the CAGE reads used in our analysis. We added a discussion of the possible sources of false positives to the Results section, subsection “2.7. Validation of enhancer predictions with IsoSeq data”, lines 180 – 183.

Comments 2) The authors' prediction that enhancers cover ~25.5% of the chicken genome is an extraordinary claim, especially when compared to the 8.86% reported in the comprehensive Pan et al. (2023) study. The authors have failed to address this discrepancy convincingly. The response that the authors "could not obtain the enhancer intervals from Supplementary of both papers for some reason" is not an acceptable reason to omit a comparison. The authors’ strategy of filtering their >400,000 predictions down to a set of ~12,000 highly expressed enhancers for functional analysis does not solve the problem. The core claim of the paper is the initial prediction of over 400,000 enhancers. The manuscript must provide a much stronger defense of this number and directly address the likely sources of false positives (e.g., bidirectional promoters of unannotated genes, repetitive elements). The parameters used to distinguish enhancers from promoters (e.g., distance from TSS) are still not fully clarified in the Methods section.

Response 2: We agree that the previous version of the manuscript retained a claim that was not consistent with our results and that the description of the IsoSeq cross-validation was sparse. We reworked the section on IsoSeq data (subsection “2.7. Validation of enhancer predictions with IsoSeq data”) to add more detail (lines 164 – 190). We suggest that the large number of false positive identifications resulted from the high coverage of our data and the low specificity of the bidirectional expression approach. We discuss these possible sources of overprediction in lines 184 – 187. We therefore used the IsoSeq data as the basis of our cross-validation. Additionally, we performed further filtering based on the analysis of the CpG distribution in the resulting enhancers, section “2.8. Validation of the distribution of the CpG islands in the predicted enhancers” (lines 191 – 199). This left a final set of 147,061 enhancers, which occupy 9.29% of known chicken chromosomes, slightly more than the estimate of Pan et al., 2023 (line 198). We have correspondingly rephrased the Discussion section (lines 201 – 208) to reflect the changes to the final number and content of predicted enhancers and to compare it with Pan et al., 2023. Accordingly, we have rephrased the Abstract (lines 12 – 15).

We relied on eHMM functionality to distinguish between enhancers and promoters. We described the process in detail in Materials and Methods, subsection “4.6. Hidden Markov model of enhancers” (lines 352 – 355 and 367 – 374). We verified the ability of the model to discriminate between promoters and enhancers by overlapping the eHMM promoter and enhancer predictions. The resulting overlaps were negligible: only 46 intervals predicted by eHMM as enhancers overlapped any of the eHMM promoters. We mentioned this in Results, subsection “2.2. Refinement and validation of enhancer prediction using hidden Markov models” (lines 78 – 80).
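
As an aside on how a coverage figure of this kind (enhancers as a percentage of chromosome length) is typically obtained, the minimal Python sketch below merges overlapping intervals per chromosome before summing their lengths, so that no base is counted twice. The input layout (BED-like tuples and a dictionary of chromosome sizes) and the function name are assumptions made for this example, not the procedure described in the manuscript.

```python
from collections import defaultdict


def covered_fraction(intervals, chrom_sizes):
    """Fraction of the genome covered by intervals, counting each base once.

    intervals: iterable of (chrom, start, end) tuples, half-open coordinates.
    chrom_sizes: dict mapping chromosome name to its length in bp.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    covered = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start

    return covered / sum(chrom_sizes.values())
```

The choice of denominator (all assembled sequence versus named chromosomes only) changes the result, which is one reason why percentages from different studies are not always directly comparable.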

Comments 3) The authors have clarified that their homogeneity assessment was performed on the initial predictions, thus avoiding the circularity argument. However, they continue to make the unsubstantiated claim that this analysis "suggested a low proportion, if not a complete absence, of false positive predictions" (lines 60-61). Sequence statistics homogeneity is not a proxy for biological function. This method only shows that the algorithm selected sequences with similar nucleotide patterns; it cannot be used as proof of a low false-positive rate, and this strong claim should be removed.

Response 3: We agree that the discussion of false positives is not relevant in the context of the homogeneity analysis, thus we rephrased the subsection “2.1. Prediction of enhancers in chicken genome from bidirectional promoters” (lines 58 – 63)

Minor comments 1) The legend for Figure 10 is "Histogram of lengths of enhancers predicted by eHMM." This is incorrect, as Figure 10 displays the results of a KEGG pathway enrichment analysis.

Response Minor 1: We have corrected the legend (lines 148 – 149)

Minor comments 2) Table 1: The numbers don't sum correctly. Individual categories total 583,668, but the claimed total is 447,451. The transition from 447,451 to 401,533 enhancers between methods is poorly explained.

Response Minor 2: We recalculated the overlaps in Figure 1/Table 1, so the numbers now add up (lines 57 – 58). To get the total number of enhancers, only the CAGE-related categories should be summed. We also discussed the change in the number of enhancers between the bidirectional expression method and the eHMM tool (lines 78 – 79).

Minor comments 3) Multiple Fisher's exact tests performed without correction for multiple comparisons.

Response Minor 3: In Tables 8 and 10 we used Benjamini-Hochberg correction implemented in the clusterProfiler tool (lines 421 – 422).
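
For readers unfamiliar with the correction mentioned in this response, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a generic illustration of the standard step-up procedure, not the clusterProfiler implementation; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by m / rank, and the running minimum guarantees that a test with a smaller raw p-value never receives a larger adjusted value than a test ranked above it.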

Minor comments 4) Table 3/7: All labeled "Fisher's exact test" but P-values shown as 0.00 or 0.000 - should specify actual P-values or use scientific notation.

Response Minor 4: We recalculated the P values with greater precision and for both tables they were 2.2 × 10⁻¹⁶ (lines 85 – 86, 136 – 137)

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have significantly improved their manuscript, and I can recommend the article for publication.
