Comprehensive Schistosoma mansoni Hierarchical Transcriptome Assembly Points to Novel lncRNAs Associated with Sexual Dimorphism

Caio Felipe Freire; Thalles Souza-Lopes; Murilo Sena Amaral; Ana Carolina Tahira; Sergio Verjovski-Almeida

doi:10.3390/ncrna12020009

Abstract

Background/Objectives: Schistosomiasis is a neglected tropical disease affecting >200 million people worldwide. Praziquantel is the sole recommended drug against Schistosoma mansoni; however, it lacks activity against juvenile forms and cannot prevent reinfection. Thus, there is an urgent need to identify novel therapeutic targets. Long noncoding RNAs (lncRNAs) are known to regulate various biological processes in S. mansoni, including parasite pairing and fertility; therefore, screening for novel lncRNAs could reveal new potential targets. Methods: We compiled all publicly available RNA-seq data from the Sequence Read Archive (SRA) and performed a hierarchical transcriptome assembly using the multi-sample assembler Ryūtō, combined with version 10 of the S. mansoni genome. We applied HOMER for peak-calling and identification of histone marks and used weighted gene co-expression network analysis (WGCNA) to infer putative functions of lncRNAs in sexual dimorphism. Results: Using a robust pipeline, we identified 10,170 novel lncRNA genes comprising 16,990 novel lncRNA transcripts, including 8783 intergenic, 7918 antisense, and 289 intronic lncRNA transcripts. Most (78.7%) have histone regulatory marks (H3K4me3, H3K27me3, H3K27ac, or H4K20me1) near their transcription start sites, indicating potential expression regulation. Comparing male and female samples, we identified 1991 differentially expressed genes (FDR < 5%, |log₂FC| ≥ 1.5), including 296 known lncRNAs and 339 novel lncRNAs. WGCNA identified hub lncRNAs within co-expression modules, and Gene Ontology enrichment analyses (FDR ≤ 5%) suggest that these lncRNAs are involved in cell differentiation and morphogenesis pathways. Conclusions: We provide a comprehensive catalog of S. mansoni lncRNAs. These findings offer opportunities to discover potential new therapeutic targets, advancing the future development of anti-schistosome therapies.

Keywords:

Schistosoma mansoni; long non-coding RNAs; sexual dimorphism; hierarchical transcriptome assembly; gene co-expression network; parasite stage-specific lncRNAs

1. Introduction

Schistosomiasis is a globally endemic neglected tropical disease caused by parasitic flatworms of the genus Schistosoma. It represents a major public health burden, with over 250 million people requiring preventive treatment in 2021 and approximately 700 million at risk worldwide [1,2]. In the Americas, Schistosoma mansoni is the causative species, with Brazil and Venezuela among the most affected countries [3,4].

Current treatment relies primarily on praziquantel, a drug that, despite its widespread use, has several limitations. It is ineffective against juvenile parasites and does not prevent reinfection, and there is growing concern over emerging drug resistance [5,6,7]. These challenges underscore the urgent need for identification of novel therapeutic target candidates.

Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that lack protein-coding potential [8]. They participate in key biological processes, including gene regulation, chromatin remodeling, cell differentiation, and development [9,10,11,12]. Due to their high tissue and cell-type specificity and lower sequence conservation across species compared with protein-coding genes, lncRNAs represent attractive candidates for therapeutic development [13]. Investigating lncRNA-mediated regulatory mechanisms in parasites such as S. mansoni could reveal novel approaches to disease control.

Previous work by our group and others has demonstrated the functional importance of lncRNAs in S. mansoni biology [14]. These molecules are critical for cellular proliferation, gonadal development, reproduction, and egg formation during worm pairing [14,15,16]. Notably, knockdown of specific lncRNAs such as SmLINC101519, SmLINC110998, and SmLINC175062 significantly reduces egg production and disrupts worm pairing [16]. In addition, S. mansoni lncRNA levels have been shown to be modulated by drugs, including praziquantel [17,18]. However, earlier characterizations of the S. mansoni lncRNA transcriptome [19,20,21] were based on limited RNA-seq datasets mapped to outdated genome versions (V5 and V7), which contained assembly gaps and incomplete sequences, potentially limiting comprehensive lncRNA identification and characterization.

This study addresses these limitations by leveraging the recently released S. mansoni genome version 10 (V10), which provides a complete chromosome-level assembly [22] but currently lacks any lncRNA annotation [23]. Using a robust bioinformatics pipeline and a comprehensive collection of publicly available bulk RNA-seq data, we identified and annotated both novel and previously reported [20] S. mansoni lncRNAs. We also identified lncRNAs associated with and potentially involved in sexual dimorphism by comparing transcriptome profiles of adult female and male worms.

By integrating hierarchical transcriptome assembly, coding potential prediction, differential expression analysis, co-expression network construction, and histone mark profiling, we provide the most comprehensive catalog of S. mansoni lncRNAs to date. Our findings illuminate their possible regulatory roles and identify promising candidates for future therapeutic development.

2. Results

2.1. Library Selection

A total of 1796 S. mansoni bulk RNA-seq libraries publicly available in the Sequence Read Archive (SRA) at NCBI were retrieved and subjected to quality analysis. Of these, 1107 libraries (61.6%) passed all quality control steps and were retained for transcriptome assembly (Table 1). Quality filtering steps 1–5 assessed genome mapping alignment statistics, strand specificity, gene body coverage, transcript integrity, and genomic contamination; the number of libraries remaining after each filtering step is shown in Table 1.

Table 1. RNA-seq library filtering steps used in the pipeline.

Most of the excluded libraries had either low-quality alignments (~21%), particularly multi-mapping reads filtered by STAR, or poor transcript integrity numbers (low TIN scores, ~12%) (Table 1). The latter filtering step predominantly affected isolated tissue-specific samples; only libraries from ovaries, testes and adult heads consistently passed all criteria, as shown below.

2.2. Ryūtō Demonstrates Superior Assembly Performance for Schistosoma mansoni

We employed a hierarchical transcriptome assembly approach using the Ryūtō [24] assembly tool in a multi-stage process based on multi-sample evidence from RNA-seq data of similar parasite life-cycle stages or tissues. This was aimed at improving the accuracy and completeness of assembled transcripts and then annotating lncRNAs (see Section 4.4). To validate Ryūtō’s performance in transcriptome assembly, we benchmarked it against Scallop [25] + TACO [26] and against StringTie [27] (Supplementary Methods S1 shows the parameters used in this empirical test) using 50 randomly selected adult female public libraries (BioSample IDs in Supplementary Methods S1). We used the manually curated V10 protein-coding transcriptome [23] as our gold standard reference. Transcript classes from GffCompare [28] were employed to build confusion matrices and calculate recall, precision and F1-score (Figure 1).

Figure 1. Bar plots comparing Ryūtō, Scallop + TACO, and StringTie assembler performance using protein-coding transcripts as reference. The y-axis represents the metric value (0–1 scale). The x-axis shows the different assemblers. (a) F1-score, (b) precision, and (c) recall values are shown, which were obtained when reconstructing manually curated V10 protein-coding transcripts (gold standard). Only exact matches (“=” class in GffCompare) were considered True Positives. Ryūtō outperformed both competitors across all metrics despite conservative evaluation (potential lncRNAs counted as False Positives; see main text above).

Specifically, only protein-coding transcripts classified by GffCompare as “=” were considered True Positives (TP). Other assembled transcripts, including potential valid lncRNAs that had partial matches with protein-coding transcripts, were counted as False Positives (FP). False Negatives (FN) were defined as protein-coding transcripts from the reference that lacked exact matches in the assembly.

Because our benchmarking was limited to the most trusted transcript set (i.e., the manually curated protein-coding transcripts), precision may have been underestimated. Many valid lncRNAs that had partial matches with protein-coding transcripts in this benchmarking analysis were considered FPs. Despite the constraint, Ryūtō outperformed the other methods across all metrics (Figure 1) and was therefore selected for our pipeline.
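The metric definitions used in this benchmark can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the toy class codes are hypothetical, and the sketch assumes that each reference transcript is matched exactly by at most one assembled transcript.

```python
def counts_from_class_codes(class_codes, n_reference):
    """Derive TP, FP and FN from one GffCompare class code per assembled transcript.

    Only the "=" class counts as a True Positive; every other assembled
    transcript is a False Positive. Assumes (for illustration) that no
    reference transcript is matched exactly more than once.
    """
    codes = list(class_codes)
    tp = codes.count("=")
    fp = len(codes) - tp
    fn = n_reference - tp  # reference transcripts without an exact match
    return tp, fp, fn


def assembly_metrics(tp, fp, fn):
    """Return (precision, recall, F1), guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy input, not data from the study: five assembled transcripts, four references.
toy_codes = ["=", "=", "j", "u", "x"]
tp, fp, fn = counts_from_class_codes(toy_codes, n_reference=4)
precision, recall, f1 = assembly_metrics(tp, fp, fn)
```

Because every non-exact match is counted as an FP, any valid lncRNA in the assembly lowers the computed precision, which is the source of the underestimation discussed above.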

2.3. Hierarchical Assembly Preserves Exclusive Transcripts of Underrepresented Stages and Tissues

Primary assemblies were obtained from RNA-seq libraries originating from specific life-cycle stages or isolated parasite tissues (Figure 2). We recovered between 655 and 8381 lncRNA transcripts in each primary assembly, with a median of 6394 lncRNA transcripts per primary transcriptome. Assemblies based on tissue-specific libraries (heads, testes, ovaries) generally contained fewer transcripts, while miracidia, sporocysts, and schistosomula exhibited the highest transcript numbers (Figure 2). This variability can be partially attributed to library size and the Ryūtō pool size parameter. Larger datasets (e.g., adult stage-specific libraries) required higher thresholds to meet the minimum 30% and 60% presence rule, which may have excluded rare transcripts but enhanced confidence in those retained. Conversely, stages with fewer samples (e.g., sporocysts) often retained more transcripts, possibly inflating the transcript count due to relaxed thresholds and partially complete transcripts resulting from insufficient coverage (Figure 2).

Figure 2. Distribution of lncRNA transcripts and size of source data used. (a) Bar plot of the number of lncRNA transcripts per primary transcriptome (x-axis: primary transcriptomes; y-axis: transcript count). (b) Total library sizes in Gbases per transcriptome (x-axis: primary transcriptomes; y-axis: Gbases). Colors reflect a gradient from light to dark, with darker colors indicating higher values. Tissue-specific assemblies contained fewer transcripts, while miracidia, sporocysts and schistosomula showed the highest counts, reflecting differences in library sizes, Ryūtō’s pooling parameters, and potentially incomplete transcripts.

We analyzed TPM-normalized expression differences among all genes, lncRNAs, and protein-coding genes, and found that S. mansoni lncRNAs have lower abundance than protein-coding genes, as reported in other organisms [29]. This pattern was consistent across all life-cycle stages, with lncRNA expression distributions consistently lower than those of protein-coding genes (Supplementary Figure S1). We then examined the presence of each lncRNA gene across different S. mansoni life-cycle stages and observed that most lncRNAs are expressed in only one life-cycle stage (Figure 3a, Supplementary Table S1). As the most highly expressed transcripts are generally those with greater biological importance, we identified the top 10% most highly expressed lncRNA and protein-coding genes in each stage and found that among them the highest fractions of lncRNA and protein-coding genes are expressed in only one of the eight life-cycle stages (Figure 3b, Supplementary Table S1), with a proportion of stage-specific lncRNAs (42%) almost twice as large as that of stage-specific protein-coding genes (25%). This highlights the importance of the hierarchical assembly approach used in this study to identify stage-specific lncRNAs.
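The top-10% stage-sharing analysis described above can be illustrated with a small sketch. It assumes one TPM table per stage for a single biotype (lncRNA or protein-coding); the function names, the input layout, and the toy values are hypothetical and do not reproduce the study's pipeline.

```python
import math
from collections import Counter


def top_fraction(tpm_by_gene, fraction=0.10):
    """Return the set of genes in the top `fraction` of one stage, ranked by TPM."""
    n_top = max(1, math.ceil(len(tpm_by_gene) * fraction))
    ranked = sorted(tpm_by_gene, key=tpm_by_gene.get, reverse=True)
    return set(ranked[:n_top])


def stage_sharing_histogram(tpm_by_stage, fraction=0.10):
    """Count how many top-expressed genes are found in exactly k stages.

    Returns a Counter mapping k (number of stages) to number of genes.
    """
    stages_per_gene = Counter()
    for tpm_by_gene in tpm_by_stage.values():
        for gene in top_fraction(tpm_by_gene, fraction):
            stages_per_gene[gene] += 1
    return Counter(stages_per_gene.values())


# Toy data, three hypothetical stages with ten genes each (not study data).
stage_a = {f"g{i}": float(10 - i) for i in range(10)}  # g0 is highest
stage_b = dict(stage_a)                                # g0 is highest again
stage_c = dict(stage_a)
stage_c["g5"] = 99.0                                   # g5 is highest here
hist = stage_sharing_histogram({"A": stage_a, "B": stage_b, "C": stage_c})
```

In this toy case, one gene is top-ranked in two stages and one gene in a single stage; the fraction of genes in the k = 1 bin corresponds to the stage-specific proportion reported in Figure 3b.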

Figure 3. Overlap of lncRNA gene expression across different life-cycle stages. (a) lncRNA genes in common across S. mansoni stages; number of shared lncRNA genes across primary transcriptomes represented in a circular UpSet plot, showing the top 50 gene intersections. The dots indicate the stages at which a given set of lncRNA genes was detected, and lines connecting dots represent the stages that share that given set of lncRNAs; single isolated dots represent stage-specific lncRNA gene sets. The 8 highest columns represent the number of lncRNAs in each stage-specific gene set, suggesting distinct lncRNA expression profiles for each life-cycle stage. The last two numbered columns represent the intramammalian life-cycle stages, with 244 lncRNA genes in common, and then all stages together, with 229 lncRNA genes in common. (b) Distribution of the top 10% most highly expressed genes across S. mansoni life-cycle stages; the histogram quantifies the proportion of lncRNA or protein-coding genes found in one stage versus those found at the intersection of stages, from 1/8 (exclusive, on the left) to 8/8 (present in all stages, on the right). The data reveal that 42% of lncRNA genes are stage-specific, highlighting a highly specialized expression pattern, whereas only 25% of protein-coding genes follow the same pattern.

2.4. Novel LncRNAs Identified in the Current S. mansoni Genome (V10) Predominantly Originate from Intergenic Regions

With the improvements introduced in the current genome version, such as the resolution of repetitive regions using long reads and manual curation [22], some lncRNA transcripts assembled in the previous V7 assembly version [20] were expected to have been altered. To assess this, we aligned all 10,127 genes with their respective 16,583 transcripts from the previous lncRNA V7 assembly [20] to the current V10 genome using BLAT and successfully recovered 10,126 genes, or 16,580 transcripts (99.9%) (Figure 4, left).

Figure 4. Fate of V7 LncRNA Transcripts in V10 Assembly. Sankey diagram illustrates the transition path of 16,583 lncRNA transcripts from the previous V7 assembly to the current V10 lncRNA consensus and provides the accounting of all transcripts through the alignment and filtering pipeline. Arrows at the top indicate the three major pipeline annotation steps. Teal paths represent the “Recovered Paths”, tracing transcripts that aligned to the V10 genome and were successfully recovered in the new V10 lncRNAs hierarchical assembly, passing through our pipeline filters. Red paths represent the “Lost/Retired Path,” identifying transcripts that were excluded due to protein coding (PC) potential, aberrant loci/size pattern (see Section 4), or being fragmented/absent in our assembly.

When comparing the V7 lncRNA transcriptome with the V10 protein-coding transcriptome (Figure 4, middle), we observed that 713 transcripts previously classified as lncRNAs are now annotated in the V10 genome as protein-coding (PC, 516 transcripts) or were classified as aberrant (197 transcripts) as per our filtering criteria (see Section 4.5) and were placed in the “retired V7 lncRNA transcripts” set. As a result, 15,867 transcripts from the V7 lncRNA set remain aligned to the V10 genome (Figure 4, middle).

The newly assembled V10 consensus lncRNA transcriptome successfully retrieved 12,118 (~73%) transcripts previously present in the V7 assembly (Figure 4, right). Notably, 3749 transcripts from the V7 assembly were not present in the new V10 transcriptome, and they belong to two different sets. One of them comprises 1757 transcripts, which were classified by GffCompare as protein-coding transcripts (some were classified as aberrant transcripts) (Figure 4, right); these were also placed in the “retired V7 lncRNA transcripts” set (representing 2470 in total) and are still available for visualization in a separate track in our public S. mansoni UCSC Genome Browser track hub, which can be accessed at http://schistosoma.usp.br (accessed on 15 November 2025).

The other set comprises 1992 transcripts from V7, which were not present in the V10 consensus (Figure 4, right) because they were classified by BEDTools intersect as fragmented, partially assembled transcripts (1255 small monoexonic transcripts with exon overlapping with V7), or because they were completely absent in V10 (737 transcripts); these likely represent very lowly expressed lncRNAs. They are listed in a separate genome browser track (“LncRNAs from V7 not present in V10 assembly”) for inspection.

Overall, 4465 lncRNA transcripts in the V7 assembly (~27%) are not present in the V10 consensus lncRNA assembly. Several potential reasons explain the failure to reassemble some transcripts: (A) low or inconsistent expression across our 12 sample groups or rare transcripts with a limited distribution, which may have led the Ryūtō algorithm to exclude them from the final assembly; (B) transcripts reclassified during the GffCompare step, potentially due to overlapping with newly annotated or repositioned protein-coding genes, meaning they are no longer assigned to classes “i”, “u” or “x”; (C) transcripts filtered during the monoexonic removal step.
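The transcript accounting reported above for the V7 to V10 transition can be checked with simple arithmetic. The sketch below uses only the counts stated in the text; the variable names are ours and are purely illustrative.

```python
# Counts taken from the text (Figure 4); variable names are illustrative.
v7_total = 16_583
aligned_to_v10 = 16_580

# Step 2: transcripts retired when compared with V10 protein-coding genes.
retired_middle = {"protein_coding": 516, "aberrant": 197}
still_aligned = aligned_to_v10 - sum(retired_middle.values())

# Step 3: comparison with the new V10 consensus lncRNA assembly.
recovered_in_v10 = 12_118
missing_from_v10 = still_aligned - recovered_in_v10
retired_right = 1_757          # reclassified as protein-coding or aberrant
fragmented, absent = 1_255, 737

retired_total = sum(retired_middle.values()) + retired_right
unaligned = v7_total - aligned_to_v10
not_recovered = v7_total - recovered_in_v10
pct_not_recovered = round(100 * not_recovered / v7_total, 1)
```

The three unaligned transcripts, the 2470 retired transcripts, and the 1992 fragmented or absent transcripts together account for all 4465 V7 transcripts that are not in the V10 consensus.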

To exclude redundancies, we merged all lncRNA primary transcriptomes using StringTie. The resulting V10 consensus transcriptome has 15,624 lncRNA genes comprising 35,251 lncRNA transcripts, an average of ~2.3 lncRNA transcripts per lncRNA gene. Among these, 10,170 novel lncRNA genes were detected, comprising 16,990 novel transcripts when compared to the V7 assembly (Figure 5).

Figure 5. Comparison of V7 and V10 lncRNA transcriptomes. (a) Fate of V7-assembled lncRNA transcripts in comparison to V10. (b) Composition of V10 transcripts in the consensus assembly. Numbers in parentheses in the bar plot represent the number of transcripts. In red, transcripts that were removed or lost in the V7–V10 transition, leaving 26.9% not recovered in the V10 assembly. In dark blue, V7 transcripts that are retained in V10. In dark green, novel V10 transcript isoforms of V7 transcripts. In light green, V10 transcripts in novel loci compared to V7; nearly 50% of all transcripts are novel in the new V10 consensus assembly.

The V10 lncRNA transcripts were classified as 17,069 intergenic lncRNAs (LINC), with 8783 (51%) in novel loci; 17,747 antisense lncRNAs (LNCA), with 7918 (45%) in novel loci; and 435 intronic sense lncRNAs (LNCS), with 289 (66%) in novel loci. An additional 6143 transcripts are novel V10 isoforms in loci of V7 transcripts (Figure 5).
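The composition of the V10 consensus can be cross-checked from the counts given above. This short sketch only restates numbers reported in the text; the dictionary layout and names are illustrative.

```python
# (total transcripts, transcripts in novel loci) per lncRNA class, from the text.
classes = {
    "LINC": (17_069, 8_783),
    "LNCA": (17_747, 7_918),
    "LNCS": (435, 289),
}
consensus_genes = 15_624
novel_isoforms_in_v7_loci = 6_143
retained_from_v7 = 12_118

total_transcripts = sum(total for total, _ in classes.values())
novel_locus_transcripts = sum(novel for _, novel in classes.values())

# Percentage of each class located in novel loci, rounded to whole percent.
pct_novel = {name: round(100 * novel / total) for name, (total, novel) in classes.items()}
pct_novel_overall = round(100 * novel_locus_transcripts / total_transcripts, 1)
transcripts_per_gene = round(total_transcripts / consensus_genes, 1)
```

The three categories (novel loci, novel isoforms in V7 loci, and retained V7 transcripts) sum to the full consensus of 35,251 transcripts, and transcripts in novel loci make up about 48% of it.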

All genes were named using the convention SmLINCXXXXXX.Y, SmLNCAXXXXXX.Y or SmLNCSXXXXXX.Y, where X represents a unique numerical identifier and Y represents the isoform number of this ID. Transcripts that had identical splice junctions to those in the V7 transcriptome retained their original numerical identifiers. Newly assembled transcripts have procedurally generated numerical identifiers. Monoexonic transcripts initially removed from V10 but later recovered because they were present in V7 were given a “-V7” suffix in their IDs. Transcripts originally assembled in the V10 dataset carry the prefixes “SmLINC”, “SmLNCA”, “SmLNCS”, along with the suffix “-IBu”, an acronym for Instituto Butantan, in accordance with the V7 naming convention. Transcripts in the stage- and tissue-specific primary transcriptomes were named after the consensus transcriptome as a reference.

We also integrated the consensus lncRNA transcriptome with the protein-coding V10 transcriptome (9921 genes and 10,960 transcripts), generating what is currently the most complete representation of S. mansoni lncRNA and mRNA transcripts available, comprising 25,545 genes and 46,211 transcripts as a gtf file, which can be accessed at https://verjo103.butantan.gov.br/users/noncoding/consensus_V10_transcripts.gtf (accessed on 15 November 2025).

2.5. Promoters of Novel LncRNAs Are Marked by Canonical Transcriptionally Activating or Repressive Histone Modifications

Histone modifications are key regulators of chromatin state and gene expression, as condensing and relaxing chromatin may impair or facilitate transcription in that region [30]. To investigate whether our newly assembled transcripts have the potential to be transcriptionally regulated, we analyzed 108 publicly available ChIP-seq datasets (see Section 4.7) for four key histone marks across three S. mansoni life-cycle stages. We performed genome-wide peak calling using HOMER [31] and identified a total of 149,621 peaks across all experiments. The largest number of peaks was associated with the active promoter mark H3K4me3 (48,541 peaks) [32], followed by the repressive mark H3K27me3 (43,219 peaks) [33], the transcription-associated mark H4K20me1 (29,668 peaks) [34], and the active enhancer/promoter mark H3K27ac (28,193 peaks) [35].

We then annotated these peaks to the V10 comprehensive transcriptome (with lncRNAs and protein-coding genes) to determine which genes were associated with these regulatory marks. We searched for evidence of histone modification marks within a ±5 kb window around all transcription start sites (TSSs). Within this window, 88.8% of all protein-coding genes (8809 of 9921) had at least one peak nearby. Among non-coding genes, 82.3% of previously known lncRNA genes (4490 of 5454) retained in our assembly were marked, and 78.7% of our novel lncRNA genes (8055 of 10,236) were also associated with these epigenetic marks, indicating a potentially regulated expression. These histone mark peaks can be seen in our public S. mansoni UCSC Genome Browser hub, which can be accessed at http://schistosoma.usp.br (accessed on 15 November 2025).
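The ±5 kb window search around each TSS can be sketched as below. This is an illustrative re-implementation, not the HOMER-based annotation used in the study: it assumes that each peak is represented by a single position (for example its centre), and the function name, input layout, and toy coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right


def genes_with_nearby_peak(tss_by_gene, peaks_by_chrom, window=5_000):
    """Return genes with at least one peak within +/- `window` bp of the TSS.

    tss_by_gene: gene -> (chromosome, TSS position)
    peaks_by_chrom: chromosome -> list of peak positions (assumed peak centres)
    """
    sorted_peaks = {chrom: sorted(pos) for chrom, pos in peaks_by_chrom.items()}
    marked = set()
    for gene, (chrom, tss) in tss_by_gene.items():
        positions = sorted_peaks.get(chrom, [])
        lo = bisect_left(positions, tss - window)   # first peak >= window start
        hi = bisect_right(positions, tss + window)  # first peak > window end
        if hi > lo:
            marked.add(gene)
    return marked


# Toy coordinates on hypothetical chromosomes (not study data).
toy_tss = {"geneA": ("chr1", 10_000), "geneB": ("chr1", 50_000), "geneC": ("chr2", 1_000)}
toy_peaks = {"chr1": [14_000, 60_001], "chr2": []}
marked = genes_with_nearby_peak(toy_tss, toy_peaks)

# Fractions reported in the text, recomputed from the stated counts.
reported = {"protein_coding": (8_809, 9_921), "known_lncRNA": (4_490, 5_454), "novel_lncRNA": (8_055, 10_236)}
pct_marked = {k: round(100 * hit / total, 1) for k, (hit, total) in reported.items()}
```

Sorting the peaks once per chromosome and using binary search keeps the lookup fast even with roughly 150,000 peaks and tens of thousands of TSSs.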

To further support that this association is biologically relevant, we analyzed the spatial distribution of peaks relative to the TSS. For all marks, we observed a higher density of peaks near the TSS compared to random regions (Figure 6).

Figure 6. Histone peak marks proximity to lncRNA genes or to random sequences in the S. mansoni genome. Peak profiles for four histone marks relative to the TSS of 10,236 novel lncRNA genes (light green lines) and a control set of 10,236 size-matched random genomic regions (dark lines). Dots represent the proportion of total peaks falling within 1 kb bins. There is a high density of marks near the TSSs for each of the four marks studied.

A two-sample Kolmogorov–Smirnov test confirmed that the distribution of peaks around novel lncRNA TSSs is significantly different from the random control for all four marks (H3K27ac D = 0.221, H3K27me3 D = 0.222, H3K4me3 D = 0.225, H4K20me1 D = 0.216; all p < 0.001). Thus, the majority of novel lncRNAs have regulatory marks near their promoters, following the same profile as previously known lncRNA and protein-coding genes [21].
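For reference, the D statistic of the two-sample Kolmogorov–Smirnov test is the largest absolute difference between the two empirical cumulative distribution functions. The sketch below computes it in plain Python; it is not the implementation used in the study, it does not compute the p-value, and the toy samples stand in for peak-to-TSS distances.

```python
from bisect import bisect_right


def ks_two_sample_d(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D statistic (maximum ECDF difference)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The supremum is reached at one of the observed values, so it is
    # enough to evaluate both ECDFs at every distinct data point.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy samples (not study data): partially overlapping distributions.
d_overlap = ks_two_sample_d([1, 2, 3, 4], [3, 4, 5, 6])
d_same = ks_two_sample_d([1, 2, 3], [1, 2, 3])
d_disjoint = ks_two_sample_d([1, 2], [3, 4])
```

D ranges from 0 (identical empirical distributions) to 1 (no overlap), so the reported values of about 0.22 indicate a moderate but consistent shift of peak positions towards the TSS relative to random regions.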

2.6. Sexual Dimorphism in S. mansoni Is Linked to Distinct LncRNA and Protein-Coding Expression Profiles

As adult samples represent over 70% of our dataset, we analyzed differentially expressed genes (DEG) in adults to evaluate whether the pattern observed in the literature, i.e., differential expression of protein-coding genes and lncRNAs between adult male and female parasites [16,36], could be replicated using the entire publicly available RNA-seq dataset of male/female samples and whether expression of the newly assembled lncRNAs may correlate with the parasite’s sexual dimorphism.

Using the DESeq2 R package, we performed a principal component analysis (PCA) on 31 adult male and female samples that were labelled in the public RNA-seq repository as control samples. The initial PCA revealed a stratification of female samples by BioProject along PC2 (16% variance) (Figure 7a), suggesting batch effects across datasets. To correct for this batch bias, we applied ComBat-Seq [37]. After batch correction, samples clustered separately according to the sex of the parasites, with PC1 representing 64% of the variability (Figure 7b).

Figure 7. Principal component analysis of RNA-seq samples from S. mansoni adult male and female parasites before (a) and after (b) batch correction to remove BioProjects bias. Symbol shapes indicate the sex of parasites, with circles representing female parasites and triangles representing males. Symbol colors indicate the different BioProjects, which are identified in the legend at right. After batch correction (b), PC1 accounts for 64% of the total variance, clearly separating samples by sex. PC2 accounts for 7% of the variance. Each point represents a single sample within a given BioProject.

Using a cutoff of adjusted p-value < 0.05 and |log₂FC| ≥ 1.5, DESeq2 identified 1991 differentially expressed genes (DEGs), including 635 lncRNAs, with 296 being known lncRNAs (123 LINCs, 152 LNCAs and 21 LNCSs) and 339 novel lncRNAs (150 LINCs, 155 LNCAs and 34 LNCSs). Heatmap visualization of DEGs between adult male and female parasites revealed that 329 lncRNA DEGs (Figure 8a) and 784 protein-coding DEGs (Figure 8b) were upregulated in males (1113 DEGs in total), while 306 lncRNA and 572 protein-coding DEGs were upregulated in females (878 DEGs in total).
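The DEG cutoff stated above can be expressed as a simple filter. The sketch below is illustrative only: the tuple layout and gene names are hypothetical, and it assumes that a positive log₂FC means higher expression in males, which depends on how the contrast was defined.

```python
import math


def is_deg(log2fc, padj, lfc_cutoff=1.5, alpha=0.05):
    """Apply the cutoff from the text: padj < 0.05 and |log2FC| >= 1.5."""
    if padj is None or math.isnan(padj):
        return False  # genes without an adjusted p-value are never called DEGs
    return padj < alpha and abs(log2fc) >= lfc_cutoff


def split_by_direction(results):
    """Split DEGs into male-up and female-up lists.

    results: iterable of (gene, log2fc, padj).
    Assumption: positive log2FC = higher in males.
    """
    male_up, female_up = [], []
    for gene, log2fc, padj in results:
        if is_deg(log2fc, padj):
            (male_up if log2fc > 0 else female_up).append(gene)
    return male_up, female_up


# Toy results table (not study data).
toy_results = [
    ("gene_a", 2.3, 0.001),         # passes both cutoffs, male-up
    ("gene_b", -1.5, 0.049),        # passes both cutoffs, female-up
    ("gene_c", 1.4, 1e-10),         # fold change too small
    ("gene_d", -3.0, 0.20),         # not significant
    ("gene_e", 4.0, float("nan")),  # no adjusted p-value
]
male_up, female_up = split_by_direction(toy_results)

# Totals reported in the text.
reported = {"male": {"lncRNA": 329, "protein_coding": 784},
            "female": {"lncRNA": 306, "protein_coding": 572}}
totals = {sex: sum(counts.values()) for sex, counts in reported.items()}
```

The reported male-up and female-up totals (1113 and 878) add up to the 1991 DEGs, and the lncRNA counts in both directions add up to 635.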

Figure 8. Sex-specific differential gene expression in adult S. mansoni. Heatmap displays normalized expression values of DEGs (y-axis) across samples (x-axis). Using unsupervised clustering, females (pink, top color code) and males (light blue, top color code) were grouped separately. Expression levels are represented by z-score according to the color scale on the right (blue: downregulation; red: upregulation). (a) LncRNAs show a more heterogeneous pattern among samples, while (b) protein-coding genes show a more homogeneous expression pattern.

2.7. Chromosomal Partitioning of Sex-Biased Transcripts Reveals Distinct Z-Linked and Autosomal Regulatory Patterns

To determine whether sexual dimorphism is linked to specific genomic architectures, we mapped the 1991 DEGs to the S. mansoni V10 reference genome. We first tested the hypothesis that DEGs are distributed randomly across chromosomes in direct proportion to transcript density with a chi-squared test, which rejected this hypothesis, revealing a highly non-random distribution for both protein-coding genes (χ² = 156.49, p < 10⁻²⁹) and lncRNAs (χ² = 45.49, p = 2.97 × 10⁻⁷). The Z chromosome and the W-specific region (WSR) exhibit a higher proportion of DEGs than would be expected by chance (Figure 9).

Figure 9. DEG distribution across the S. mansoni chromosomes. Comparison of the expected distribution (Green; percentage of total transcripts per chromosome) versus the observed distribution (Purple; percentage of total DEGs per chromosome) for (a) protein-coding and (b) lncRNA biotypes. The Z chromosome shows a significant enrichment of DEGs, particularly for lncRNAs (Observed 32.8% vs. Expected 23.7%).

For protein-coding genes, the Z chromosome accounts for 23.1% of the total transcriptome but contains 26.7% of all DEGs (1.16-fold enrichment). This behavior is more pronounced for lncRNAs, where the Z chromosome contains 23.7% of the total lncRNA but contributes 32.8% of the DEGs, representing a 1.38-fold enrichment of DEGs over the expected genomic distribution. The WSR showed the highest density of regulation relative to its size, with an observed-to-expected ratio of 5.72 for protein-coding DEGs and 2.83 for lncRNAs.
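The fold enrichments above are ratios of the observed DEG share to the expected share, and the chi-squared statistic compares observed DEG counts with counts expected from transcript density. The sketch below illustrates both; the function names are ours, and the per-chromosome counts in the toy example are hypothetical because they are not listed in the text.

```python
def fold_enrichment(observed_pct, expected_pct):
    """Observed-to-expected ratio of DEG share on one chromosome."""
    return observed_pct / expected_pct


def chi_square_gof(observed_counts, background_counts):
    """Chi-squared goodness-of-fit statistic.

    Expected DEG counts are proportional to the number of background
    transcripts on each chromosome.
    """
    total_obs = sum(observed_counts.values())
    total_bg = sum(background_counts.values())
    stat = 0.0
    for chrom, bg in background_counts.items():
        expected = total_obs * bg / total_bg
        observed = observed_counts.get(chrom, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Z-chromosome percentages reported in the text.
z_protein_coding = round(fold_enrichment(26.7, 23.1), 2)
z_lncrna = round(fold_enrichment(32.8, 23.7), 2)

# Toy counts on two hypothetical chromosomes (not study data).
toy_stat = chi_square_gof({"chrA": 30, "chrB": 70}, {"chrA": 500, "chrB": 500})
```

With equal background on both toy chromosomes, 50 DEGs are expected on each, so the statistic is (30 − 50)²/50 + (70 − 50)²/50 = 16.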

We also analyzed the direction of sex-biased DEGs by investigating chromosomal distribution to determine how male and female DEGs are allocated across the genome (Figure 10). By calculating the proportion of the total sex-specific DEGs assigned to each chromosome, we observed an asymmetry on the Z chromosome. Consistent with the ZZ/ZW sex-determination system, Z-linked DEGs were overwhelmingly male-biased (18.8% of protein-coding and 22.4% of lncRNAs).

Figure 10. Direction of Sex-biased DEGs. Partitioning of the total (a) protein-coding and (b) lncRNA DEGs across chromosomes. Bars represent the percentage of total male-biased (teal) and female-biased DEGs (pink) accounted for by each chromosome. The Z chromosome exhibits a male bias for both protein-coding and lncRNA DEGs, while autosomal lncRNA DEGs show a preferential female bias (57.5%).

Autosomal lncRNAs exhibited a female bias (57.5%), a pattern that contrasts with autosomal protein-coding genes, which showed a male bias (55.1%). While the Z chromosome serves as a concentrated hub for male-biased transcripts, female-biased lncRNAs are more uniformly distributed across the autosomes. This divergence suggests that autosomal lncRNAs are preferentially involved in female-specific biological processes (e.g., vitellogenesis), whereas Z-linked lncRNAs are likely integrated into male-specific regulatory networks or are enriched by the fact that males have a pair of Z chromosomes. The presence of male-biased transcripts in the WSR (W-specific region) is likely artifactual, resulting from multi-mapping in the highly repetitive S. mansoni genome, as males lack the W chromosome and this region contains only 1 male-biased DEG.

2.8. LncRNAs Are Integral Components of Co-Expression Modules Linked to Sexual Dimorphism

Gene co-expression modules in S. mansoni males and females were investigated to identify functionally related gene sets that may help infer potential roles of lncRNAs in adult metabolism, using the “guilt-by-association” principle. This analysis was performed using the WGCNA R package v1.73 [38], applied to an expression matrix comprising 16,298 transcripts and 31 adult samples (male and female), all pre-processed, normalized, and batch-corrected as detailed in Section 4.

We identified 17 distinct co-expression modules, each uniquely represented by a specific color (Figure 11a). The sizes of these modules varied considerably, ranging from 163 transcripts (lightcyan) to 3058 transcripts (turquoise). The ‘grey’ module, containing 1620 genes, clustered genes that were not assigned to any specific co-expression cluster.

Figure 11. Weighted Gene Co-expression Network Analysis in S. mansoni male and female RNA-seq data. (a) Cluster dendrogram and module assignment. The x-axis represents individual genes clustered based on their Topological Overlap Matrix (TOM) dissimilarity. The y-axis represents the height, indicating the dissimilarity (1-TOM) at which genes or branches are merged. Colors beneath the dendrogram branches denote the identified co-expression modules; genes of the same color are grouped into the same module. (b) Module–trait association heatmap. A heatmap displaying the association between Module Eigengenes (MEs) and the ‘Sex’ trait (male vs. female). Modules (y-axis) are ordered by the t-value of their association. Each cell shows the linear model t-value and the corresponding p-value; red modules indicate positive correlation (higher expression in males), while blue modules indicate a negative correlation (higher expression in females). Color intensity reflects the association strength (t-value) as indicated by the color scale on the right.

We then assessed the association of each module’s primary expression component, the Module Eigengene (ME), with the ‘Sex’ trait (male vs. female) using a linear model. Of the 17 modules found, 7 demonstrated a statistically significant association with the ‘Sex’ trait (adjusted p-value ≤ 0.05) (Figure 11b). A DEG enrichment analysis performed among these 7 associated modules revealed the blue, brown and turquoise modules as significantly enriched for DEGs. The blue module had a 2.23-fold enrichment, the brown module a 1.91-fold enrichment and the turquoise module a 1.37-fold enrichment (all p_adj < 0.001, chi-squared test). As determined by the absolute t-value, the brown module exhibited the strongest positive association with the ‘Sex’ trait (t = 25.31, p = 2.06 × 10⁻²², p_adj = 3.50 × 10⁻²¹) (Figure 11b), indicating that genes within this module are predominantly upregulated in males compared with females. The blue module showed the second strongest, but negative, association with the ‘Sex’ trait (t = −24.25, p = 7.36 × 10⁻²², p_adj = 6.25 × 10⁻²¹) (Figure 11b), suggesting that its genes are generally more highly expressed in females. As expected, the grey module showed no association with the ‘Sex’ trait (t = −0.49, p = 0.630, p_adj = 0.817).
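The adjusted p-values reported for the brown and blue modules are numerically consistent with a Benjamini–Hochberg correction over the 17 module tests, although the correction method is an assumption here and is not stated in this section. The sketch below implements that procedure; the 15 placeholder p-values of 1.0 are hypothetical stand-ins for the modules whose raw p-values are not given.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Small generic example.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])

# Brown and blue raw p-values from the text, plus 15 placeholder values
# (hypothetical) so that the correction runs over 17 tests.
module_p = [2.06e-22, 7.36e-22] + [1.0] * 15
brown_adj, blue_adj = benjamini_hochberg(module_p)[:2]
```

Under these assumptions the brown module gives 2.06 × 10⁻²² × 17 ≈ 3.50 × 10⁻²¹ and the blue module gives 7.36 × 10⁻²² × 17/2 ≈ 6.26 × 10⁻²¹, which agree with the reported values up to rounding of the raw p-values.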

2.9. Functional Enrichment Analysis Delineates Distinct Biological Pathways for Male and Female Parasites

To explore possible functions of these co-expression modules, we performed Gene Ontology (GO) enrichment analysis on modules significantly associated with the ‘Sex’ trait. Enrichment was conducted separately for Biological Process (BP), Molecular Function (MF), and Cellular Component (CC) ontologies (Figure 12).

Figure 12. Summary of GO enrichment across sex-associated co-expression modules. A summarized overview of Gene Ontology (GO) enrichment results for modules significantly associated with the ‘Sex’ trait (adjusted p-value ≤ 0.05). The plot is arranged by GO (rows: Biological Process [BP], Cellular Component [CC], Molecular Function [MF]) and by module–sex association (columns: ‘Female-Associated Modules’ and ‘Male-Associated Modules’, which indicate they are upregulated in females or males, respectively). Modules are categorized based on the t-value from their module-trait association (positive t-value for male-associated, negative for female-associated). The x-axis lists the module color names. The y-axis displays the top 3 significantly enriched GO terms (adjusted p-value ≤ 0.01) for each module and ontology. Each dot represents an enriched GO term, with its size proportional to the −log₁₀(adjusted p-value), where larger dots indicate higher statistical significance. Dot color corresponds to the module color.

Among the significant modules, the brown module, which exhibited the strongest positive association with the male sex, was enriched for terms related to developmental and morphological processes. Key BP terms included ‘actin filament-based process’, ‘muscle structure development’ and ‘actin cytoskeleton organization’ (Figure 12, Supplementary Table S2). These terms are highly relevant to the sexual dimorphism of S. mansoni, specifically the distinct body musculature required by males to maintain the gynecophoric channel and support the female during pairing. Enriched CC terms, such as ‘contractile fiber’, ‘myofibril’ and ‘cell–cell junction’, and MF terms like ‘actin binding’, ‘cytoskeletal protein binding’ and ‘calcium ion binding’, further reinforce the structural and contractile specialization of the male parasite. Of the 2431 transcripts in the brown module, 1002 (41.2%) are lncRNAs, suggesting these non-coding RNAs are integral to the regulatory network for male-specific development and tissue integrity.

In contrast, the blue module showed the strongest negative association with the ‘Sex’ trait, indicating a predominant upregulation in females. This module contains 2713 transcripts, of which 942 (34.7%) are lncRNAs. Functional enrichment points toward high metabolic and proliferative activity; top BP terms include ‘DNA metabolic process’, ‘double-strand break repair’ and ‘DNA replication’ (Figure 12, Supplementary Table S2).

The CC and MF ontologies were dominated by terms related to the replication machinery, such as ‘nuclear chromosome’, ‘DNA helicase activity’ and ‘catalytic activity, acting on DNA’ (Figure 12, Supplementary Table S2). These findings align with the biological demands of the adult female, whose physiology is almost entirely dedicated to high rates of oogenesis and embryogenesis. The significant presence of lncRNAs within this module suggests they may act as key regulators of the rapid cellular proliferation required for egg production.

2.10. Intra-Modular Network Analysis Connects Novel LncRNA Hubs to Functionally Relevant Protein-Coding Genes

The co-expression modules vary in size and composition, with turquoise being the largest module containing 3058 genes in total, of which 864 (28.3%) are lncRNA genes (Supplementary Figure S2). We identified key regulatory genes within these co-expression networks by selecting hub genes based on two metrics: Module Membership (kME) and Intramodular Connectivity (kWithin or kIM). We retrieved genes in the top 20% for kME and top 30% for kWithin within their respective modules. For each transcript, we compiled a detailed table including its module assignment, biotype (protein-coding or lncRNA), hub classification, novelty status (known or novel), differential expression (DEG) classification, kME and kWithin values, median expression in TPM for adult female and male controls, and nearby histone marks. This table is available as Supplementary Table S3.
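The two-metric hub selection can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the study (which relied on the WGCNA R package): the function names, the dictionary layout, and the reading of "top 20%" and "top 30%" as rank cutoffs computed within a single module are assumptions, and the toy values are made up.

```python
def top_cutoff(values, percent):
    """Return the smallest value that still falls in the top `percent` of `values`."""
    ranked = sorted(values, reverse=True)
    n_keep = max(1, len(ranked) * percent // 100)
    return ranked[n_keep - 1]

def select_hubs(module_genes, kme_percent=20, kwithin_percent=30):
    """module_genes: dict gene_id -> (kME, kWithin) for ONE module (hypothetical layout)."""
    kme_cut = top_cutoff([v[0] for v in module_genes.values()], kme_percent)
    kwithin_cut = top_cutoff([v[1] for v in module_genes.values()], kwithin_percent)
    # A gene is a hub only if it passes BOTH cutoffs.
    return sorted(gene for gene, (kme, kwithin) in module_genes.items()
                  if kme >= kme_cut and kwithin >= kwithin_cut)

# Toy module with ten genes; kME and kWithin both increase with the gene index.
toy_module = {f"g{i}": (0.50 + 0.05 * i, 10.0 * i) for i in range(10)}
hubs = select_hubs(toy_module)
```

In this toy module the kME cutoff keeps two genes and the kWithin cutoff keeps three, so requiring both criteria leaves the two genes that rank highest on both metrics.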

Among the hub genes, we identified 507 lncRNA hubs distributed across 14 modules. These candidates were further prioritized using orthogonal evidence: 143 (28.2%) were differentially expressed between sexes, and 393 (77.5%) were associated with nearby histone marks (H3K4me3, H3K27me3, H3K27ac, or H4K20me1), suggesting active epigenetic regulation; 262 of these hubs are novel lncRNAs discovered in this study. By intersecting these criteria, we identified a high-confidence set of 53 hubs that are simultaneously novel, sex-biased, and epigenetically supported, representing the highest-priority candidates for future functional validation.

To visualize the functional context of these regulators, we constructed intramodular sub-networks for modules containing lncRNA hubs. Seven of these modules were significantly associated with the ‘Sex’ trait (brown, black, blue, yellow, red, green, and turquoise). In the male-associated brown module (Figure 13a), the novel lncRNA hub SmLNCA007011.3-IBu was categorized as a central node with a high-confidence connection (adjacency = 0.942) to the protein-coding gene Smp_035210. This partner gene plays a critical role in muscle structure development and actin cytoskeleton organization.

Figure 13. Intra-Modular lncRNA–protein-coding transcript interaction networks for key sex-associated modules. Concentric ring representation of the most robust co-expression relationships within modules significantly associated with the ‘Sex’ trait. (a) Brown module (male-associated) and (b) Blue module (female-associated). The inner ring displays the top 8 lncRNA hubs ranked by intramodular connectivity (kWithin), while the outer ring comprises the top 20 transcripts (protein-coding or lncRNA) with the highest adjacency to the central hubs. Edges are restricted to the top 3 strongest connections per node (adjacency > 0.15) to emphasize the core network skeleton. Squares represent lncRNAs and circles represent protein-coding genes. Nodes are colored gold for lncRNA hubs and by module color for partners. Bold labels indicate hub status, and novel transcripts are marked with an asterisk (*). Detailed information on the networks can be seen in Supplementary Table S4.

Conversely, in the female-associated blue module (Figure 13b), the lncRNA hub SmLINC142881-V7-IBu showed a strong connection (adjacency = 0.930) to Smp_056350, a gene involved in chromosome segregation and mitotic nuclear division. This association suggests that lncRNAs in the blue module may act as regulators of the cellular proliferation required for female oogenesis. Detailed interaction data and functional annotations for all sub-networks are available in Supplementary Table S4.

3. Discussion

In humans, the most recent GENCODE v49 release [39] reports 35,899 lncRNA genes and 19,433 protein-coding genes, yielding a proportion of approximately 1.85:1.0, while the NONCODE V6 database [40] reports 96,411 lncRNA genes, indicating a much higher lncRNA:protein-coding gene proportion of approximately 5.0:1.0 in humans. In S. mansoni, we describe a total of 15,624 lncRNA genes, yielding an lncRNA:protein-coding gene proportion of approximately 1.6:1.0, considering the 9921 S. mansoni protein-coding genes. This represents a smaller proportion of lncRNAs than that found in humans, in line with the parasite’s 8-fold-smaller genome size compared to the human genome. We identified 15,624 lncRNA gene loci in the present assembly, with many of the 10,170 novel lncRNA gene loci likely resulting both from resolution of genomic repetitive regions in the complete chromosomes of the new genome [22] and from the stringency of our bioinformatics pipeline.

Our approach applied rigorous quality control to libraries and alignments, required evidence across multiple samples, and evaluated coding potential with multiple tools, which minimizes false positives and considerably reduces the chance of falsely calling as lncRNAs transcripts that are partial protein-coding genes or artifacts. However, while this approach increases confidence in the lncRNA annotation, it also produces a conservative estimate, potentially excluding low-abundance monoexonic S. mansoni lncRNA genes that are poorly represented across multiple samples (due to strain variations, for example), which means that the true proportion of lncRNAs to protein-coding genes may be higher than approximately 1.6:1.0.

Of note, we observed that certain lncRNAs were predominantly assembled in monoexonic form, such as the V7-obtained SmLNCA159451-IBu, particularly when seeking higher coverage and evidence across multiple samples. One possibility, requiring further investigation, is that some lncRNAs might be retained in an unprocessed state as well as in the spliced form; this characteristic could indicate alternative regulatory roles for these specific lncRNAs, where the unspliced form itself may be spliced at a later step after transcription [41]. This is a recognizable challenge in lncRNA annotation, as traditional transcript assembly often prioritizes spliced, multi-exonic transcripts, in order to avoid false monoexonic lncRNA transcripts that may be pseudogenes or result from genomic contamination [42,43].

The hierarchical assembly approach not only generated a robust consensus transcriptome but also preserved unique transcripts within individual primary assemblies. Genome browser visualization supports the evidence of distinct expression patterns of numerous lncRNAs detected exclusively at specific developmental stages or in distinct tissues, revealing a diverse profile in S. mansoni biology. Our local version of the UCSC Genome Browser with all primary assembly tracks from the present work is available for public access at http://schistosoma.usp.br (accessed on 15 November 2025) (Figure 14). While the consensus transcriptome enables robust annotation, the primary stage- and tissue-specific assemblies capture distinct expression profiles that can facilitate future analyses. The genome browser is freely searchable, and all tracks’ data are visible and downloadable.

Figure 14. Snapshot view of our public Genome Browser track hub at chromosome 1 in the genomic position of SmLINC101519-IBu. Primary transcriptomes are ordered from top to bottom by parasite developmental stage, each with a different track color. Arrows in the intronic region of the gene indicate the direction of transcription. At the top, the V10 protein-coding gene track (smps-version10, blue), followed by the consensus transcriptome V10 lncRNA track (petroleum green), by the previous V7 lncRNA track (red), and by the tracks of lncRNAs detected in the 12 primary assemblies from the different life-cycle stages and parasite tissues, as indicated by the track names. At the bottom, there are three track labels corresponding to lncRNAs that are present in V7 but were not assembled in V10, to lncRNA V7 transcripts retired in V10, and to monoexonic V10 lncRNAs filtered out in the pipeline. For the consensus transcriptome track color and for each lncRNAs track color, dark tone indicates a transcript at the top “plus” genomic strand and light tone a transcript at the bottom “minus” strand.

The improved chromosome resolution of the V10 assembly enabled more precise analysis of how sex-biased transcripts are distributed across the genome, particularly since the sex chromosomes are now completely assembled. We observed that lncRNAs show significant non-random distribution between females and males, revealing an interesting partitioning of regulatory elements. In the ZZ/ZW sex-determination system of S. mansoni, the high concentration of male-biased lncRNAs on the Z chromosome suggests their integration into male-specific regulatory networks. Conversely, female-biased lncRNAs are preferentially located on autosomes, indicating a distinct regulatory architecture for female-specific processes such as oogenesis and vitellogenesis. This autosomal distribution may reflect the need for a more distributed regulatory network to support the higher metabolic demands of egg production, rather than relying on sex-chromosome-linked control. This chromosomal asymmetry highlights the specific roles that lncRNAs may play in maintaining sexual dimorphism.

Our findings emphasize the critical importance of considering lncRNA expression in S. mansoni. The observed divergence in transcriptional regulation between male and female parasites, evidenced by sex-biased lncRNA expression alongside the inclusion of lncRNAs in sex-associated co-expression modules, suggests lncRNAs function as regulators in sexual dimorphism and reproductive processes. Additionally, the profile of protein-coding genes shows correlation with fundamental biological processes, from basic transcriptional machinery to cell differentiation itself, demonstrating that sexual dimorphism requires differences in the core regulatory genes. Furthermore, the GO BP enrichment analysis of female-associated modules (e.g., red, green) revealed metabolism-centered terms, which aligns with the biological knowledge of higher metabolic activity in females to support and maintain oogenesis and egg production. Future analyses should extend this investigation to encompass all life-cycle stages and tissues of the parasite, providing a comprehensive view of the lncRNA regulatory network in stage development and its interactions with protein-coding genes, enabling exploration of more effective strategies to control schistosomiasis.

Transition to the V10 genome assembly has significantly improved the accuracy of long non-coding RNA annotation in Schistosoma mansoni. This updated assembly revealed substantial inaccuracies in previous annotations, leading to the retirement of over 25% of previously reported lncRNAs due to better resolution of repetitive regions and more precise gene boundary definitions. To establish a reliable foundation for future research, we generated an epigenetically mapped transcriptional start site (TSS) dataset covering 15,482 lncRNA transcripts. This comprehensive catalog represents more than a simple annotation list, providing a high-confidence functional framework that enables precise candidate selection for mechanistic studies. The exact genomic coordinates and detailed isoform structures documented here are essential for experimental approaches such as CRISPR-Cas9 gene editing and antisense oligonucleotide (ASO) screening, where precision is critical for experimental success. This resource establishes a new standard for lncRNA annotation in Schistosoma research and should facilitate more reliable functional studies across the research community.

4. Materials and Methods

4.1. Datasets and Quality Control (QC)

Metadata, including bulk RNA-seq run IDs, were obtained up to 10 March 2025 from the NCBI SRA [44] by querying with the string “Schistosoma mansoni” [Organism] AND (“illumina” [Platform] OR “bgiseq” [Platform]) AND “rna seq” [Strategy] AND “transcriptomic” [Source]. This search retrieved 3039 bulk RNA-seq libraries. Libraries derived from microRNA and single-cell experiments were subsequently filtered out, retaining 1796 for downstream analysis (Supplementary Figure S3a). These represent a wide range of experimental conditions and treatments.

Each library was categorized according to its study of origin and annotated with parasite developmental stage, sex, tissue, and other relevant metadata (e.g., type of treatment) (Supplementary Figure S3b), based on the information available in the SRA descriptions. This systematic organization was essential for subsequent hierarchical assembly steps. The relevant SRA metadata are provided in Supplementary Table S5.

For read quality control (QC), we used FastQC v0.11.9 [45], MultiQC v1.8 [46], and fastp v0.20.0 [47] to remove adapters and low-quality reads, evaluating the Phred quality scores for each read in all bulk RNA-seq libraries.

4.2. Read Alignment

The S. mansoni V10 reference genome (SM_V10) [22] was retrieved from WormBase ParaSite (WBPS18) [48]. The annotated transcriptome and protein-coding sequences were downloaded from the same database (https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS18/species/schistosoma_mansoni/PRJEA36577/, accessed on 29 January 2024).

Trimmed RNA-seq libraries were aligned to the V10 genome using STAR v2.7.11b [49]. Libraries were excluded if their unique mapping rate was below 70%, if more than 30% of reads mapped to multiple loci (parameter “--outFilterMultimapNmax 20”), or if unmapped reads exceeded 25%. After filtering, 375 libraries were excluded. Technical replicates were merged based on BioSample ID, resulting in 951 BioSample-merged BAM files.
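The three exclusion rules above can be expressed as a simple predicate. The Python sketch below assumes the percentages have already been parsed from the STAR summary of each library; the function name, the run identifiers, and the example values are hypothetical.

```python
def passes_alignment_qc(unique_pct, multi_pct, unmapped_pct):
    """Apply the three exclusion rules stated in the text (percent of input reads)."""
    if unique_pct < 70.0:     # unique mapping rate below 70%
        return False
    if multi_pct > 30.0:      # more than 30% of reads mapped to multiple loci
        return False
    if unmapped_pct > 25.0:   # more than 25% unmapped reads
        return False
    return True

# Hypothetical run IDs and percentages, for illustration only.
libraries = {
    "run_A": (88.0, 7.0, 5.0),
    "run_B": (62.0, 20.0, 18.0),  # fails: unique mapping rate below 70%
    "run_C": (71.0, 3.0, 26.0),   # fails: unmapped reads above 25%
}
kept = sorted(run for run, stats in libraries.items() if passes_alignment_qc(*stats))
```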

An additional QC step was performed using RSeQC v3.0.1 [50] and Qualimap v2.3 [51,52]. With RSeQC, we assessed strand specificity with infer_experiment.py, evaluated uniformity of gene body coverage (to detect 3′ bias) with geneBody_coverage.py, and measured transcript integrity with tin.py. Qualimap verified whether reads predominantly mapped to exonic regions. This was a crucial step for quantifying potential genomic contamination, as lncRNAs often have low expression, and genomic contamination noise in RNA-seq can lead to false positive lncRNAs during assembly and quantification.

After this filtering step, 1107 filtered libraries remained for transcriptome assembly, corresponding to 527 BioSample-merged BAM files. A list of filtered libraries retained for each primary transcriptome is provided in Supplementary Table S6.

4.3. Assembly of Primary Transcriptomes with Ryūtō

Filtered alignment files were used with Ryūtō [24,53] to assemble transcriptomes based on multi-sample evidence. Transcript paths present in at least 30% of samples (argument “--group-vote-low 30”) were retained, and transcripts assembled in at least 60% of the groups (argument “--group-vote-high 60”) were kept.

Ryūtō creates sample bins based on the --pool parameter. To optimize this setting for lncRNA assembly while maintaining high recovery of protein-coding genes, we conducted an empirical test using 50 randomly selected adult male and female samples. We tested pool sizes of 3, 5, 8, 10, 12, 15, 20, 30 and 50, calculating precision, recall and F1-score for each (Supplementary Figure S4).

We selected a pool size of 20 as the optimal setting (Supplementary Figure S4, red vertical dashed line). Although this pool size caused a reduction in F1-score compared to smaller pool sizes, it yielded a marked increase in recall and thus helped us to capture a larger set of transcripts, which may include potentially novel lncRNAs from stage- or tissue-specific expression. Notably, pool sizes beyond 20 caused a substantial drop in precision (Supplementary Figure S4). Maintaining precision is important to avoid including transcript fragments in our assembly, and our pool size choice reflects a balance between sensitivity and specificity, minimizing noise from possible artifacts.
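For reference, the three metrics compared across pool sizes follow the standard definitions sketched below in Python. How true positives were counted in the benchmark is not stated here, so the counts are hypothetical and the example only shows how recall can rise while the F1-score stays moderate.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard definitions; the inputs are counts of transcripts."""
    precision = true_pos / (true_pos + false_pos)  # share of assembled transcripts that are correct
    recall = true_pos / (true_pos + false_neg)     # share of reference transcripts recovered
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for one pool size.
precision, recall, f1 = precision_recall_f1(true_pos=80, false_pos=20, false_neg=120)
```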

Thus, for stages/tissues with more than 20 samples, -w 20 was used. For smaller groups, all available samples were included in the pool.

4.4. LncRNA Annotation

Using GffCompare v0.11.6 from GffUtilities [28], the assembled primary transcriptomes were compared against the S. mansoni protein-coding reference transcriptome from WormBase ParaSite. Multi-exonic transcripts classified as “x” (antisense), “i” (intronic), or “u” (intergenic/unknown) were retained as putative lncRNA candidates to be analyzed for coding potential with the three tools described below. Novel assembled transcripts classified as “k” that overlapped an annotated protein-coding transcript over less than 30% of their sequence were also retained for coding-potential analysis.
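The candidate-selection rule can be summarized with a small Python predicate. This is a sketch under stated assumptions: the function and argument names are hypothetical, the overlap fraction is assumed to be precomputed, and no exon-count requirement is applied to class “k” because the text does not state one.

```python
def is_lncrna_candidate(class_code, n_exons, coding_overlap_fraction=0.0):
    """Candidate rule paraphrased from the text; argument names are hypothetical."""
    if class_code in {"x", "i", "u"}:
        # antisense, intronic, or intergenic/unknown: keep only multi-exonic transcripts
        return n_exons > 1
    if class_code == "k":
        # keep when less than 30% of the transcript overlaps a protein-coding transcript
        return coding_overlap_fraction < 0.30
    return False  # every other GffCompare class code is discarded
```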

We then applied three coding potential detector algorithms: FlExible Extraction of LncRNAs (FEELnc) v0.01 [54], which employs a Random Forest model trained on less stringent ORFs and multiple k-mer frequencies to predict a coding score for each lncRNA transcript candidate; Coding Potential Calculator 2 (CPC2) v0.1 [55], which uses a support vector machine incorporating features such as Fickett TESTCODE, ORF length, ORF integrity, and isoelectric point; and finally, Coding-Potential Assessment Tool (CPAT) v3.0.4 [56], which predicts RNA coding probability based on a logistic regression model that considers ORF size, ORF coverage, Fickett TESTCODE, and hexamer usage bias.

Transcripts predicted as non-coding by at least two of these tools were further analyzed with ORFfinder v0.4.3 [57] to predict putative peptide sequences. These sequences were then searched against the eggNOG database v5.0.2 [58] using eggNOG-mapper v2.1.12 [59] (--evalue 0.001 --score 60 --pident 60 --query_cover 60 --subject_cover 40 --itype ‘proteins’ --tax_scope ‘auto’ --target_orthologs ‘all’ --go_evidence ‘non-electronic’ --pfam_realign ‘none’). Transcripts without significant matches remained in the lncRNA sets.
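The two-out-of-three agreement rule can be sketched as a simple vote in Python. The dictionary layout and function name are assumptions for illustration; each value is taken to be the already-parsed verdict of one tool.

```python
def noncoding_by_majority(calls, min_votes=2):
    """calls: dict tool_name -> True if that tool labels the transcript non-coding."""
    votes = sum(1 for is_noncoding in calls.values() if is_noncoding)
    return votes >= min_votes

# Hypothetical verdicts for two transcripts.
passes = noncoding_by_majority({"FEELnc": True, "CPC2": False, "CPAT": True})
fails = noncoding_by_majority({"FEELnc": False, "CPC2": False, "CPAT": True})
```

Only transcripts that pass this vote would move on to the ORF prediction and eggNOG search described above.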

4.5. Filtering out Aberrant Transcripts

We applied a filtering step to remove ‘aberrant transcripts’, defined as those whose genomic locus size was disproportionately large compared to their processed mature size. We built a linear regression model trained on protein-coding genes from the WormBase reference annotation. The model established a baseline correlation between the log₂-transformed locus size and mature size.

A cutoff threshold was then defined by shifting the model’s regression line upward to define a permissive boundary for transcript inclusion. Transcripts were removed if their locus size exceeded this boundary, following the formula

log₂(LociSize) > (β₀ + k) + β₁(log₂(MatureSize))

where β₀ and β₁ are the intercept and slope derived from the reference model, and k is a manually set constant (in our case, k = 4) to define the filtering stringency (see Supplementary Figure S5). After this filter, we obtained robust primary assemblies for every stage and tissue of S. mansoni.
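A minimal Python sketch of this filter is given below. It fits the reference line by ordinary least squares on log₂-transformed sizes and then applies the shifted boundary with k = 4. The helper names and the toy reference values are hypothetical, and the study's own model was fitted on the WormBase protein-coding annotation rather than on these numbers.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def is_aberrant(locus_size, mature_size, b0, b1, k=4.0):
    """True when log2(locus size) lies above the regression line shifted up by k."""
    return math.log2(locus_size) > (b0 + k) + b1 * math.log2(mature_size)

# Toy protein-coding reference as (mature size, locus size) in bp; values are made up.
reference = [(512, 2048), (1024, 4096), (2048, 8192), (4096, 16384)]
b0, b1 = fit_line([math.log2(mature) for mature, _ in reference],
                  [math.log2(locus) for _, locus in reference])

flag_large = is_aberrant(locus_size=2 ** 20, mature_size=1024, b0=b0, b1=b1)
flag_normal = is_aberrant(locus_size=32768, mature_size=1024, b0=b0, b1=b1)
```

Because the comparison is made in log₂ space, k = 4 means a transcript is removed only when its locus is more than 2⁴ = 16 times larger than the locus size predicted for its mature length.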

4.6. Consensus Transcriptome Construction

Primary transcriptomes often contain exclusive lncRNAs expressed only at specific development stages [20,21]. To build a consensus transcriptome while preserving stage- and tissue-specific transcripts, we tested two merging strategies using TACO v0.6.2 [26] or StringTie merge v3.0.0 [27] and conducted a performance analysis. StringTie consistently produced better results in every transcriptome except for eggs, sporocysts, and somules (Supplementary Figure S6). Therefore, it was selected as the tool for generating our final consensus V10 transcriptome. This transcriptome was then concatenated with the V10 protein-coding transcriptome to create a comprehensive and up-to-date S. mansoni gene annotation.

4.7. Histone Marks Peak Calling

A total of 108 S. mansoni public ChIP-seq libraries were downloaded from the SRA from three studies under BioProject numbers PRJNA236156, PRJNA312093, and PRJNA602708. This dataset comprised four histone marks (H3K27ac, H3K27me3, H3K4me3, H4K20me1) across three life stages (cercariae, somules, adults) and three sex categories (female, male, mixed). Raw sequencing reads were subjected to QC using FastQC (v0.11.9) [45], and then adapters and low-quality reads were removed using fastp (v0.20.0) [47]. The trimmed libraries were aligned to the V10 genome (WBPS18) with bwa-mem (v0.7.19-r1273) [60]. Then, using samtools v1.19.2 [61], we performed coordinate sorting, repaired mate-pair information for paired-end data with fixmate, removed PCR duplicates with markdup and mitochondrial reads with samtools view, and finally selected uniquely mapped reads (MAPQ ≥ 20) [62]. For quality control of peak results, we used ChIPQC v1.38.0 [63] for reporting.

Peak calling was performed using HOMER v5.1 [31], creating HOMER-specific tag directories for each alignment with makeTagDirectory. Then, we used findPeaks in -style histone mode. For each ChIP-seq experiment, a corresponding Input sample, matched by experimental conditions, was used as a control for background normalization. Peaks were called using a 2-fold enrichment threshold over the local background and the Input control with a p-value cutoff of 0.01 for the metrics (-F 2 -L 2 -LP 0.01 -P 0.01).

Peak files from all experiments corresponding to a specific mark (e.g., all H3K27ac peaks) were concatenated and merged into a single set of consensus peaks using BEDTools merge v2.31.1 [64]. The genomic location of these consensus peaks was then annotated using the annotatePeaks.pl script from the HOMER suite against a comprehensive gene annotation file (protein-coding and lncRNAs) to assign each peak to the nearest TSS and calculate its distance.
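The consensus step amounts to collapsing overlapping intervals after pooling the peaks of one mark. The Python sketch below mimics that behavior for half-open (chromosome, start, end) intervals, merging overlapping and directly adjacent peaks; it illustrates the idea and is not a replacement for BEDTools, and the chromosome names and coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals given as (chrom, start, end)."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # extends the previous interval on the same chromosome
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]

# Pooled peaks for one histone mark from several experiments (made-up coordinates).
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 300, 350),
         ("chr1", 500, 600), ("chr2", 100, 200)]
consensus = merge_intervals(peaks)
```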

We then verified whether the consensus peaks were significantly enriched near our novel lncRNAs by analyzing their spatial distribution relative to the lncRNA TSSs. To establish a baseline for random distribution, the same consensus peaks were annotated against a control set of random genomic regions, generated with BEDTools shuffle, that preserved the length, chromosomal distribution, and GC content of the genomic loci of actual transcripts.

A two-sample Kolmogorov–Smirnov (KS) test was employed to formally test whether the shape of the peak distribution around lncRNA TSSs was significantly different from the random background distribution. To investigate the association of histone marks with different functional classes of genes, all transcripts in the annotation file were classified into three categories: ‘Novel lncRNA’, ‘Known lncRNA’, and ‘Protein-Coding’. We then calculated the percentage of genes in each category that had at least one consensus peak of a given histone mark located within a promoter-proximal window of ±5 kb from their TSS.
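The promoter-proximal percentage can be computed as sketched below in Python. The sketch assumes a single chromosome, represents each peak by one position (for example its center), and uses hypothetical coordinates; whether the original analysis measured distance from the peak center or from the peak edge is not stated here.

```python
import bisect

def fraction_with_nearby_peak(tss_positions, peak_positions, window=5000):
    """Share of TSSs with at least one peak position within +/- window bp."""
    peaks = sorted(peak_positions)
    hits = 0
    for tss in tss_positions:
        # index of the first peak at or beyond the left edge of the window
        i = bisect.bisect_left(peaks, tss - window)
        if i < len(peaks) and peaks[i] <= tss + window:
            hits += 1
    return hits / len(tss_positions)

# Made-up TSS and peak coordinates on one chromosome.
tss = [10000, 50000, 90000, 120000]
peak_centers = [12000, 56000, 85000]
share = fraction_with_nearby_peak(tss, peak_centers)
```

Running the same function on the TSSs of each transcript category (novel lncRNA, known lncRNA, protein-coding) and on the shuffled control regions gives directly comparable percentages.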

4.8. Differentially Expressed Genes Analysis

To identify differentially expressed genes (DEGs) between adult male and female S. mansoni worms, we performed differential expression analysis using the DESeq2 R package v1.46.0 [65]. We used raw transcript count matrices as input, obtained with Salmon (v0.7.2) [66]. Samples were filtered to include only adult male and female control samples, as labelled in our metadata (Stage = “Adult”, Sex %in% c(“Male”, “Female”), Treatment = “Control”). The 31 BioSample IDs used are in Supplementary Methods S2. The DESeqDataSetFromMatrix function was then employed, incorporating Sex as our independent variable for our statistical model design (design = ~ Sex).

Genes were retained for this analysis only if they exhibited a raw count of at least 10 in a minimum number of samples equal to the size of the smallest biological group (i.e., female parasites). Given that our dataset integrated samples from multiple studies (different BioProjects) carried out under differing laboratory conditions, potential technical batch effects were addressed using the ComBat_seq function from the sva R package (v3.54.0) [37], with batch = BioProjects and group = Sex. BioProjects not containing at least one sample from both male and female groups were removed. Principal Component Analysis (PCA) plots were generated from data subjected to variance-stabilized transformation (VST) (using vst from DESeq2 with blind = TRUE), using the 1000 genes with the highest variance, both before and after ComBat_seq correction to visually confirm effective removal of batch effects and enhanced separation of biological groups.
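The low-count filter described at the start of this paragraph can be sketched as follows in Python. The function name and the example counts are hypothetical, and the size of the smallest group is passed in as a parameter because its value is not given in this section.

```python
def keep_gene(counts, smallest_group_size, min_count=10):
    """Keep a gene when at least `smallest_group_size` samples reach `min_count` raw reads."""
    n_samples_passing = sum(1 for c in counts if c >= min_count)
    return n_samples_passing >= smallest_group_size

# Hypothetical raw counts for two genes across five samples, smallest group of three.
kept_gene = keep_gene([12, 0, 15, 30, 9], smallest_group_size=3)
dropped_gene = keep_gene([12, 0, 15, 3, 9], smallest_group_size=3)
```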

Differential expression analysis was performed on the batch-corrected count data. Log₂ fold changes (Log₂FCs) were then extracted for the comparison of male versus female expression using results(dds_final, contrast = c(“Sex”, “Male”, “Female”)). Genes were considered differentially expressed if they met two criteria: a false discovery rate (FDR)-adjusted p-value (p_adj) < 0.05, and an absolute log₂ fold change (|log₂FC|) ≥ 1.5.
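The two-criterion DEG call can be written as a small classifier. In this Python sketch the function name and labels are hypothetical; a missing adjusted p-value (reported as NA by DESeq2 for some genes) is represented by None and treated as not significant, and a positive log₂FC is read as higher expression in males because the contrast is Male versus Female.

```python
def classify_deg(padj, log2fc, alpha=0.05, lfc_cutoff=1.5):
    """Return 'male-biased', 'female-biased' or 'not DE' for a Male-vs-Female contrast."""
    if padj is None or padj >= alpha or abs(log2fc) < lfc_cutoff:
        return "not DE"
    return "male-biased" if log2fc > 0 else "female-biased"

# Hypothetical (padj, log2FC) pairs.
examples = [(0.001, 2.0), (0.001, -1.5), (0.2, 3.0), (0.01, 1.2), (None, 5.0)]
labels = [classify_deg(padj, lfc) for padj, lfc in examples]
```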

4.9. Weighted Gene Co-Expression Network Analysis

To identify modules of co-expressed genes and explore possible regulatory pathways associated with sex differentiation in S. mansoni, we performed a Weighted Gene Co-expression Network Analysis (WGCNA) utilizing the WGCNA R package v1.73 [38]. This analysis was performed using the same 31 adult male and female control samples that were used in the DEG analysis (Supplementary Methods S2).

Prior to constructing the co-expression network, we filtered the raw gene expression matrix. First, genes with a median expression of zero Counts Per Million (CPM) across all samples, calculated using cpm from the edgeR package v4.4.2 [67], were removed. A gene was then retained only if its CPM value exceeded the 10th percentile of median expression (calculated from the genes that passed the first filter) in at least 50% of the samples from the smallest biological group (female parasites). These filtered counts were subjected to batch effect correction using ComBat_seq to mitigate technical variations from multiple BioProjects, as performed in the DEG analysis. The batch-corrected counts were then transformed using the VST tool from DESeq2. Finally, genes with the lowest 10% coefficient of variation (CV = standard deviation/mean) across samples were excluded. These filtering steps resulted in a dataset of 13,756 expressed and variable genes.

The soft-thresholding power (β) for network construction was determined using the pickSoftThreshold function from the WGCNA package v1.73. We systematically tested the scale-free topology fit (R²) and median connectivity for powers from 1–20 to identify the lowest power value that resulted in a scale-free topology fit (R²) ≥ 0.80 for a “signed hybrid” network type. We identified a soft-thresholding power of β = 6 as optimal for our dataset, successfully approximating a scale-free network architecture (Supplementary Table S7).
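The selection rule for β can be sketched in Python as picking the lowest tested power whose scale-free fit reaches the threshold. The function name is hypothetical and the R² values below are made up for illustration; the actual fit indices are those reported in Supplementary Table S7.

```python
def lowest_power_meeting_fit(fit_by_power, min_r2=0.80):
    """Return the smallest power whose scale-free fit R^2 reaches min_r2, or None."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] >= min_r2:
            return power
    return None

# Made-up R^2 values per tested power (NOT the values from the study).
toy_fit = {1: 0.12, 2: 0.35, 3: 0.55, 4: 0.68, 5: 0.77, 6: 0.83, 7: 0.86, 8: 0.88}
chosen_power = lowest_power_meeting_fit(toy_fit)
```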

The gene co-expression network and its modules were constructed using the blockwiseModules function (WGCNA package) with power = 6; TOMType = “signed”; deepSplit = 4; networkType = “signed hybrid”, maintaining consistency with the pickSoftThreshold step; reassignThreshold = 0; and mergeCutHeight = 0.1. To relate these identified gene co-expression modules to the trait of parasite sex, Module Eigengenes (MEs) were calculated, representing the first principal component of expression for all genes within a module. We used a linear model, implemented using the lmFit and eBayes functions from the limma R package v3.62.2 [68], to statistically assess the association between each module’s ME and the Sex trait (male vs. female).

Module DEG enrichment was assessed by comparing the observed DEG ratio of each module (number of DEGs within the module divided by the number of transcripts in the module) with the expected ratio (overall number of DEGs across all modules divided by the total number of transcripts). For modules where observed DEGs exceeded expected DEGs, statistical significance was evaluated using a one-sided chi-squared goodness-of-fit test (Observed > Expected), with p-values adjusted for multiple testing using the Benjamini–Hochberg (BH) method.
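One possible reading of this test is sketched below in Python, using only the standard library. It assumes two categories per module (DEG and non-DEG), one degree of freedom, and a one-sided p-value obtained by halving the two-sided value when the observed count exceeds the expected count. The function name and the counts are hypothetical, and the exact construction used in the study may differ.

```python
import math

def deg_enrichment(module_degs, module_size, total_degs, total_size):
    """Fold enrichment, chi-squared statistic (1 d.f.) and one-sided p-value."""
    expected_rate = total_degs / total_size
    expected_degs = module_size * expected_rate
    fold = (module_degs / module_size) / expected_rate
    observed_non = module_size - module_degs
    expected_non = module_size - expected_degs
    chi2 = ((module_degs - expected_degs) ** 2 / expected_degs
            + (observed_non - expected_non) ** 2 / expected_non)
    # survival function of a chi-squared variable with 1 d.f.
    p_two_sided = math.erfc(math.sqrt(chi2 / 2.0))
    if module_degs > expected_degs:
        p_one_sided = p_two_sided / 2.0
    else:
        p_one_sided = 1.0 - p_two_sided / 2.0
    return fold, chi2, p_one_sided

# Hypothetical module: 60 DEGs among 200 transcripts; 150 DEGs among 1000 overall.
fold, chi2, p_value = deg_enrichment(60, 200, 150, 1000)
```

The resulting p-values from all tested modules would then be adjusted together with the Benjamini–Hochberg procedure, as described above.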

Functional enrichment analysis was performed for modules that demonstrated a statistically significant association (adjusted p-value ≤ 0.05 for module-trait association) with the Sex trait. This analysis was conducted using the clusterProfiler R package v4.14.6 [69]. S. mansoni Gene Ontology (GO) terms were retrieved from the GFF file available in WormBase ParaSite, and GO enrichment was performed separately for Biological Process (BP), Molecular Function (MF), and Cellular Component (CC) ontologies using the enricher function. The “universe” for these enrichment tests was defined as all genes included in the WGCNA. For summary visualizations of GO enrichment (Figure 12), the top 3 enriched terms per module and ontology were selected based on adjusted p-value, breaking ties by decreasing gene count and then by alphabetical order of the term description.

To discover hub genes within each module, i.e., central genes with high connectivity within their respective co-expression networks, two measures were calculated for each transcript: Module Membership (kME), i.e., the correlation of the transcript’s expression profile with its module’s eigengene, and Intramodular Connectivity (kIM), i.e., the sum of the transcript’s adjacencies to all other transcripts in the same module.
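A minimal Python sketch of the two hub measures follows, assuming that the module eigengene and the adjacency values are already available (in WGCNA they come from the first principal component of the module and from the soft-thresholded correlation matrix, respectively). The function names and all toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def module_membership(expr, eigengene):
    """kME: correlation of each transcript's profile with the module eigengene."""
    return {name: pearson(values, eigengene) for name, values in expr.items()}


def intramodular_connectivity(adj, members):
    """kIM: sum of adjacencies to all other transcripts in the same module."""
    return {i: sum(adj[i][j] for j in members if j != i) for i in members}


# Toy module with three transcripts measured in four samples (invented values).
toy_expr = {"t1": [1, 2, 3, 4], "t4": [1, 3, 2, 4], "t3": [4, 3, 2, 1]}
toy_eigengene = [-1.5, -0.5, 0.5, 1.5]  # assumed to be precomputed
toy_adj = {
    "t1": {"t1": 1.0, "t4": 0.26, "t3": 0.0},
    "t4": {"t1": 0.26, "t4": 1.0, "t3": 0.0},
    "t3": {"t1": 0.0, "t4": 0.0, "t3": 1.0},
}

toy_kme = module_membership(toy_expr, toy_eigengene)
toy_kim = intramodular_connectivity(toy_adj, ["t1", "t4", "t3"])
```

Transcripts with both a high kME and a high kIM are the natural hub candidates, since the first measure captures agreement with the module's overall profile and the second captures how strongly a transcript is connected to its module neighbors.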

For intra-modular lncRNA–protein-coding interaction network representation, protein-coding transcripts were retained only if they belonged to enriched GO terms (adjusted p-value ≤ 0.05) of their respective module, and the edges for co-expression links were maintained when adjacency values were greater than 0.1; finally, only the top 30 strongest interactions per module were shown in the subnetworks.
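The edge selection described above can be sketched in Python as follows. The thresholds (adjacency greater than 0.1, top 30 interactions per module) come from the text; the function name, the edge representation, and the restriction to pairs made of exactly one lncRNA and one GO-enriched protein-coding transcript are assumptions made for illustration.

```python
def select_edges(edges, lncrnas, go_protein_coding, min_adjacency=0.1, top_n=30):
    """Filter and rank co-expression edges within one module.

    edges: iterable of (transcript_a, transcript_b, adjacency) tuples
    lncrnas: set of lncRNA transcript names in the module
    go_protein_coding: set of protein-coding transcripts that belong to
        enriched GO terms (adjusted p-value <= 0.05) of the module
    """
    kept = []
    for a, b, weight in edges:
        if weight <= min_adjacency:
            continue  # drop weak co-expression links
        pair = {a, b}
        # Assumption: keep only lncRNA/protein-coding pairs.
        if len(pair & lncrnas) == 1 and len(pair & go_protein_coding) == 1:
            kept.append((a, b, weight))
    # Strongest interactions first, then truncate to the top_n.
    kept.sort(key=lambda edge: edge[2], reverse=True)
    return kept[:top_n]


# Toy module (invented names and adjacency values).
toy_edges = [
    ("lnc1", "pc1", 0.40),
    ("lnc1", "pc2", 0.05),   # below the adjacency threshold
    ("lnc2", "pc1", 0.25),
    ("lnc1", "pc3", 0.30),   # pc3 is not in an enriched GO term
    ("lnc1", "lnc2", 0.50),  # lncRNA-lncRNA pair, excluded by assumption
]
toy_selected = select_edges(toy_edges, {"lnc1", "lnc2"}, {"pc1", "pc2"})
```

Applying the adjacency threshold before ranking keeps the top-30 cut from being filled with weak links in small or loosely connected modules.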

We then annotated all transcripts with Gene Significance for the Sex trait (GS_sex), defined as the t-statistic from a linear model relating individual gene expression to the Sex trait, and with a label denoting whether the transcript was identified as significantly differentially expressed in our DEG analysis. Transcripts were also labelled according to their biotype (e.g., protein-coding, lncRNA). The “hub lncRNA” column of Supplementary Table S3 lists the hub lncRNAs, which are top candidates for further investigation.

Lastly, to explore potential lncRNA regulatory roles, intra-modular lncRNA/protein-coding transcript interaction networks were generated for sex-related modules that had at least one lncRNA hub connected to at least 3 transcripts by extracting the strongest connections between lncRNA hubs and protein-coding genes that were enriched for relevant GO terms within that specific module.

5. Conclusions

This study provides the most comprehensive catalog of S. mansoni lncRNAs to date, obtained using the V10 genome assembly and an up-to-date public RNA-seq dataset. Our robust pipeline identified a substantial number of lncRNAs, with 10,236 novel genes and 16,990 novel transcripts, revealing diverse expression profiles across male and female adult parasites. We demonstrated that many of these novel lncRNAs also exhibit sex-specific transcriptional profiles and histone mark proximity similar to known lncRNAs, highlighting their potential roles in parasite biology. These findings not only deepen our understanding of gene regulatory elements in S. mansoni but also enable future exploration of novel targets for alternative anti-schistosome therapies.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ncrna12020009/s1.

Methods S1: Comparison of transcriptome assemblers’ performance.
Methods S2: List of BioSample IDs used for DEG and WGCNA analyses.
Figure S1: Gene expression levels across S. mansoni life cycle stages. Distribution of expression levels, measured in log₂(TPM), of protein-coding (green) and lncRNA (pink) genes across eight life cycle stages. Each boxplot summarizes the expression values of all expressed genes (TPM > 0) from the bulk RNA-seq samples for that stage. It is apparent that protein-coding genes are, on average, more highly expressed than lncRNAs.
Figure S2: Distribution of lncRNAs and protein-coding genes across modules. Barplot showing the total number of genes within each of the 17 WGCNA modules. Each bar is segmented to show the proportion of protein-coding genes (solid color) and lncRNAs (hashed).
Figure S3: Number of publicly available S. mansoni RNA-seq libraries used in this work. (a): Distribution of the number of libraries across S. mansoni life-cycle stages. The x-axis shows the seven developmental stages; the y-axis indicates library counts. Adult worms predominate, comprising >70% of libraries. (b): Distribution of the number of tissue-specific libraries. The x-axis lists tissue categories; gonads show the highest representation after whole worms.
Figure S4: Impact of pool size on Ryūtō’s assembly performance. The x-axis shows tested pool sizes (3–50) using 50 randomly selected adult S. mansoni samples (pink: females; blue: males). The y-axis shows performance metrics (0–1 scale). The red vertical dashed line indicates the selected pool size (20). Left: Precision decreases with larger pools. Middle: Recall metrics show a peak at pool size 30, then decline. Right: F1-score (harmonic mean) shows optimal balance at 20 samples. This choice maximizes transcript recovery while maintaining precision, enabling novel lncRNA detection without artifacts from larger pools.
Figure S5: Filtering of aberrant transcripts based on locus-size-to-mature-transcript size ratio. Scatter plots of log₂(Locus Size) versus log₂(Mature Size) for transcripts from each primary transcriptome at each life cycle stage or tissue (indicated at the top of each panel). The linear regression model (red dashed line) was trained on reference protein-coding genes (gold points) to establish expected size correlation. Transcripts falling above the dashed cutoff line (model intercept + 4) were classified as aberrant and removed. This filter removes transcripts with disproportionately large genomic loci relative to their mature processed length. Data from Maciel et al. (2019, orange) are shown for comparison [20].
Figure S6: Performance comparison of transcriptome merging tools. The x-axis represents the primary transcriptomes; the y-axis shows the performance metrics (0–1 scale). Bar plots show on the left the precision metrics, in the middle the recall metrics, and on the right the F1-score metrics, allowing comparison of StringTie merge (blue bars) with TACO (green bars) across primary transcriptomes. StringTie outperformed TACO in most assemblies, except for the eggs, sporocysts, and somules stages. The superior performance of StringTie led to its selection for constructing the final consensus V10 transcriptome.
Table S1: LncRNAs expressed at each life cycle stage; the stages at which the lncRNAs were detected are indicated with (1), or with (0) when not detected.
Table S2: GO enrichment for each WGCNA co-expression module, for biological process (BP), cellular component (CC) and molecular function (MF) GO categories.
Table S3: WGCNA co-expression module membership for all genes expressed in females and males.
Table S4: Hub lncRNAs and their interactions obtained from the WGCNA co-expression analyses of female and male adult parasites.
Table S5: SRA metadata for all publicly available Schistosoma mansoni RNA-seq expression data used in this work.
Table S6: Filtering criteria used for selecting the RNA-seq libraries used in the final transcriptome assembly for each life cycle stage indicated at the bottom tabs.
Table S7: WGCNA analysis of scale-free topology for multiple soft_threshold powers in hybrid network type.

Author Contributions

Conceptualization, C.F.F., M.S.A., A.C.T. and S.V.-A.; methodology, C.F.F. and A.C.T.; software, C.F.F., T.S.-L. and A.C.T.; validation, C.F.F., T.S.-L. and A.C.T.; formal analysis, C.F.F. and A.C.T.; investigation, C.F.F.; resources, S.V.-A.; data curation, C.F.F. and T.S.-L.; writing—original draft preparation, C.F.F. and S.V.-A.; writing—review and editing, C.F.F., T.S.-L., M.S.A., A.C.T. and S.V.-A.; visualization, C.F.F. and A.C.T.; supervision, A.C.T. and S.V.-A.; project administration, S.V.-A.; funding acquisition, M.S.A. and S.V.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant number 2018/23693-5 to S.V.-A., by FAPESP fellowships 2023/14590-6 to C.F.F. and 2021/06005-0 to T.S.-L., and by Fundação Butantan. The work was also supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) (409278/2023-8 to M.S.A.). S.V.-A. is the recipient of an established investigator fellowship award from CNPq (313781/2025-7), Brazil. The APC was funded by Fundação Butantan.

Institutional Review Board Statement

Not applicable; study not involving humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author. The data presented in this study were derived from source data available in the public domain at the Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra, accessed on 10 March 2025). The list of relevant SRA metadata used here is provided in Supplementary Table S5.

Acknowledgments

This work used resources from the High-Performance Computing System of the Center for Bioinformatics and Computational Biology (NBBC) of the Butantan Institute.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

BP	Biological process
CC	Cellular component
CPM	Counts per million
DEG	Differentially expressed gene
FDR	False discovery rate
FN	False negatives
FP	False positives
GO	Gene ontology
GS	Gene significance
kME	Module membership
KS	Kolmogorov–Smirnov
LFC	Log₂ fold changes
LINC	Intergenic lncRNA
LNCA	Antisense lncRNA
LNCS	Intronic sense lncRNA
ME	Module eigengenes
MF	Molecular function
P_adj	Adjusted p-value
PCA	Principal component analysis
QC	Quality control
SRA	Sequence Read Archive
TIN	Transcript integrity number
TP	True positives
TSS	Transcription start site
V10	Schistosoma mansoni genome version 10
V5	Schistosoma mansoni genome version 5
V7	Schistosoma mansoni genome version 7
VST	Variance-stabilized transformation
WGCNA	Weighted gene co-expression network analysis

References

1. Colley, D.G.; Bustinduy, A.L.; Secor, W.E.; King, C.H. Human schistosomiasis. Lancet 2014, 383, 2253–2264.
2. WHO. Fact Sheets: Schistosomiasis. 2023. Available online: https://www.who.int/news-room/fact-sheets/detail/schistosomiasis (accessed on 15 November 2025).
3. Ministério da Saúde. Vigilância da Esquistossomose Mansoni: Diretrizes Técnicas; Ministério da Saúde: Brasília, Brazil, 2024.
4. Silva da Paz, W.; Duthie, M.S.; Ribeiro de Jesus, A.; Machado de Araújo, K.C.G.; Dantas dos Santos, A.; Bezerra-Santos, M. Population-based, spatiotemporal modeling of social risk factors and mortality from schistosomiasis in Brazil between 1999 and 2018. Acta Trop. 2021, 218, 105897.
5. Bergquist, R.; Utzinger, J.; Keiser, J. Controlling schistosomiasis with praziquantel: How much longer without a viable alternative? Infect. Dis. Poverty 2017, 6, 74.
6. Melman, S.D.; Steinauer, M.L.; Cunningham, C.; Kubatko, L.S.; Mwangi, I.N.; Wynn, N.B.; Mutuku, M.W.; Karanja, D.M.S.; Colley, D.G.; Black, C.L.; et al. Reduced Susceptibility to Praziquantel among Naturally Occurring Kenyan Isolates of Schistosoma mansoni. PLoS Negl. Trop. Dis. 2009, 3, e504.
7. Vale, N.; Gouveia, M.J.; Rinaldi, G.; Brindley, P.J.; Gartner, F.; Correia da Costa, J.M. Praziquantel for Schistosomiasis: Single-Drug Metabolism Revisited, Mode of Action, and Resistance. Antimicrob. Agents Chemother. 2017, 61, e02582-16.
8. Mattick, J.S.; Amaral, P.P.; Carninci, P.; Carpenter, S.; Chang, H.Y.; Chen, L.L.; Chen, R.; Dean, C.; Dinger, M.E.; Fitzgerald, K.A.; et al. Long non-coding RNAs: Definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. 2023, 24, 430–447.
9. Batista, P.J.; Chang, H.Y. Long noncoding RNAs: Cellular address codes in development and disease. Cell 2013, 152, 1298–1307.
10. Ransohoff, J.D.; Wei, Y.; Khavari, P.A. The functions and unique features of long intergenic non-coding RNA. Nat. Rev. Mol. Cell Biol. 2018, 19, 143–157.
11. Banani, S.F.; Lee, H.O.; Hyman, A.A.; Rosen, M.K. Biomolecular condensates: Organizers of cellular biochemistry. Nat. Rev. Mol. Cell Biol. 2017, 18, 285–298.
12. Statello, L.; Guo, C.-J.; Chen, L.-L.; Huarte, M. Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 2021, 22, 96–118.
13. Garbo, S.; Maione, R.; Tripodi, M.; Battistelli, C. Next RNA Therapeutics: The Mine of Non-Coding. Int. J. Mol. Sci. 2022, 23, 7471.
14. Silveira, G.O.; Coelho, H.S.; Amaral, M.S.; Verjovski-Almeida, S. Long non-coding RNAs as possible therapeutic targets in protozoa, and in Schistosoma and other helminths. Parasitol. Res. 2022, 121, 1091–1115.
15. Morales-Vicente, D.A.; Zhao, L.; Silveira, G.O.; Tahira, A.C.; Amaral, M.S.; Collins, J.J.; Verjovski-Almeida, S. Single-cell RNA-seq analyses show that long non-coding RNAs are conspicuously expressed in Schistosoma mansoni gamete and tegument progenitor cell populations. Front. Genet. 2022, 13, 924877.
16. Silveira, G.O.; Coelho, H.S.; Pereira, A.S.A.; Miyasato, P.A.; Santos, D.W.; Maciel, L.F.; Olberg, G.G.G.; Tahira, A.C.; Nakano, E.; Oliveira, M.L.S.; et al. Long non-coding RNAs are essential for Schistosoma mansoni pairing-dependent adult worm homeostasis and fertility. PLoS Pathog. 2023, 19, e1011369.
17. Amaral, M.S.; Maciel, L.F.; Silveira, G.O.; Olberg, G.G.O.; Leite, J.V.P.; Imamura, L.K.; Pereira, A.S.A.; Miyasato, P.A.; Nakano, E.; Verjovski-Almeida, S. Long non-coding RNA levels can be modulated by 5-azacytidine in Schistosoma mansoni. Sci. Rep. 2020, 10, 21565.
18. Jardim Poli, P.; Fischer-Carvalho, A.; Tahira, A.C.; Chan, J.D.; Verjovski-Almeida, S.; Sena Amaral, M. Long Non-Coding RNA Levels Are Modulated in Schistosoma mansoni following In Vivo Praziquantel Exposure. Noncoding RNA 2024, 10, 27.
19. Liao, Q.; Zhang, Y.; Zhu, Y.; Chen, J.; Dong, C.; Tao, Y.; He, A.; Liu, J.; Wu, Z. Identification of long noncoding RNAs in Schistosoma mansoni and Schistosoma japonicum. Exp. Parasitol. 2018, 191, 82–87.
20. Maciel, L.F.; Morales-Vicente, D.A.; Silveira, G.O.; Ribeiro, R.O.; Olberg, G.G.O.; Pires, D.S.; Amaral, M.S.; Verjovski-Almeida, S. Weighted Gene Co-Expression Analyses Point to Long Non-Coding RNA Hub Genes at Different Schistosoma mansoni Life-Cycle Stages. Front. Genet. 2019, 10, 464298.
21. Vasconcelos, E.J.R.; daSilva, L.F.; Pires, D.S.; Lavezzo, G.M.; Pereira, A.S.A.; Amaral, M.S.; Verjovski-Almeida, S. The Schistosoma mansoni genome encodes thousands of long non-coding RNAs predicted to be functional at different parasite life-cycle stages. Sci. Rep. 2017, 7, 10508.
22. Buddenborg, S.K.; Tracey, A.; Berger, D.J.; Lu, Z.; Doyle, S.R.; Fu, B.; Yang, F.; Reid, A.J.; Rodgers, F.H.; Rinaldi, G.; et al. Assembled chromosomes of the blood fluke Schistosoma mansoni provide insight into the evolution of its ZW sex-determination system. BioRxiv 2021.
23. Buddenborg, S.K.; Lu, Z.; Sankaranarayan, G.; Doyle, S.R.; Berriman, M. The stage- and sex-specific transcriptome of the human parasite Schistosoma mansoni. Sci. Data 2023, 10, 775.
24. Gatter, T.; Stadler, P.F. Ryūtō: Improved multi-sample transcript assembly for differential transcript expression analysis and more. Bioinformatics 2021, 37, 4307–4313.
25. Shao, M.; Kingsford, C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 2017, 35, 1167–1169.
26. Niknafs, Y.S.; Pandian, B.; Iyer, H.K.; Chinnaiyan, A.M.; Iyer, M.K. TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat. Methods 2016, 14, 68–70.
27. Pertea, M.; Pertea, G.M.; Antonescu, C.M.; Chang, T.C.; Mendell, J.T.; Salzberg, S.L. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 2015, 33, 290–295.
28. Pertea, G.; Pertea, M. GFF Utilities: GffRead and GffCompare. F1000Research 2020, 9, 304.
29. Nojima, T.; Proudfoot, N.J. Mechanisms of lncRNA biogenesis as revealed by nascent transcriptomics. Nat. Rev. Mol. Cell Biol. 2022, 23, 389–406.
30. Bannister, A.J.; Kouzarides, T. Regulation of chromatin by histone modifications. Cell Res. 2011, 21, 381–395.
31. Heinz, S.; Benner, C.; Spann, N.; Bertolino, E.; Lin, Y.C.; Laslo, P.; Cheng, J.X.; Murre, C.; Singh, H.; Glass, C.K. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol. Cell 2010, 38, 576–589.
32. Barski, A.; Cuddapah, S.; Cui, K.; Roh, T.Y.; Schones, D.E.; Wang, Z.; Wei, G.; Chepelev, I.; Zhao, K. High-Resolution Profiling of Histone Methylations in the Human Genome. Cell 2007, 129, 823–837.
33. Rougeulle, C.; Chaumeil, J.; Sarma, K.; Allis, C.D.; Reinberg, D.; Avner, P.; Heard, E. Differential Histone H3 Lys-9 and Lys-27 Methylation Profiles on the X Chromosome. Mol. Cell. Biol. 2004, 24, 5475–5484.
34. Wang, Z.; Zang, C.; Rosenfeld, J.A.; Schones, D.E.; Barski, A.; Cuddapah, S.; Cui, K.; Roh, T.-Y.; Peng, W.; Zhang, M.Q.; et al. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat. Genet. 2008, 40, 897–903.
35. Tie, F.; Banerjee, R.; Stratton, C.A.; Prasad-Sinha, J.; Stepanik, V.; Zlobin, A.; Diaz, M.O.; Scacheri, P.C.; Harte, P.J. CBP-mediated acetylation of histone H3 lysine 27 antagonizes Drosophila Polycomb silencing. Development 2009, 136, 3131–3141.
36. Vasconcelos, E.J.R.; Mesel, V.C.; daSilva, L.F.; Pires, D.S.; Lavezzo, G.M.; Pereira, A.S.A.; Amaral, M.S.; Verjovski-Almeida, S. Atlas of Schistosoma mansoni long non-coding RNAs and their expression correlation to protein-coding genes. Database 2018, 2018, bay068.
37. Leek, J.T.; Johnson, W.E.; Parker, H.S.; Jaffe, A.E.; Storey, J.D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012, 28, 882–883.
38. Langfelder, P.; Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 2008, 9, 559.
39. Mudge, J.M.; Carbonell-Sala, S.; Diekhans, M.; Martinez, J.G.; Hunt, T.; Jungreis, I.; Loveland, J.E.; Arnan, C.; Barnes, I.; Bennett, R.; et al. GENCODE 2025: Reference gene annotation for human and mouse. Nucleic Acids Res. 2025, 53, D966–D975.
40. Zhao, L.; Wang, J.; Li, Y.; Song, T.; Wu, Y.; Fang, S.; Bu, D.; Li, H.; Sun, L.; Pei, D.; et al. NONCODEV6: An updated database dedicated to long non-coding RNA annotation in both animals and plants. Nucleic Acids Res. 2021, 49, D165–D171.
41. Bohrer, C.; Varon, E.; Peretz, E.; Reinitz, G.; Kinor, N.; Halle, D.; Nissan, A.; Shav-Tal, Y. CCAT1 lncRNA is chromatin-retained and post-transcriptionally spliced. Histochem. Cell Biol. 2024, 162, 91–107.
42. Markou, A.N.; Smilkou, S.; Tsaroucha, E.; Lianidou, E. The Effect of Genomic DNA Contamination on the Detection of Circulating Long Non-Coding RNAs: The Paradigm of MALAT1. Diagnostics 2021, 11, 1160.
43. Pei, B.; Sisu, C.; Frankish, A.; Howald, C.; Habegger, L.; Mu, X.J.; Harte, R.; Balasubramanian, S.; Tanzer, A.; Diekhans, M.; et al. The GENCODE pseudogene resource. Genome Biol. 2012, 13, R51.
44. Sayers, E.W.; Bolton, E.E.; Brister, J.R.; Canese, K.; Chan, J.; Comeau, D.C.; Connor, R.; Funk, K.; Kelly, C.; Kim, S.; et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022, 50, D20–D26.
45. Andrews, S. FastQC: A Quality Control Tool for High Throughput Sequence Data. 2010. Available online: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed on 10 March 2025).
46. Ewels, P.; Magnusson, M.; Lundin, S.; Käller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016, 32, 3047–3048.
47. Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34, i884–i890.
48. Howe, K.L.; Bolt, B.J.; Shafie, M.; Kersey, P.; Berriman, M. WormBase ParaSite – A comprehensive resource for helminth genomics. Mol. Biochem. Parasitol. 2017, 215, 2–10.
49. Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29, 15–21.
50. Wang, L.; Wang, S.; Li, W. RSeQC: Quality control of RNA-seq experiments. Bioinformatics 2012, 28, 2184–2185.
51. Okonechnikov, K.; Conesa, A.; García-Alcalde, F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 2016, 32, 292–294.
52. García-Alcalde, F.; Okonechnikov, K.; Carbonell, J.; Cruz, L.M.; Götz, S.; Tarazona, S.; Dopazo, J.; Meyer, T.F.; Conesa, A. Qualimap: Evaluating next-generation sequencing alignment data. Bioinformatics 2012, 28, 2678–2679.
53. Gatter, T.; Stadler, P.F. Ryūtō: Network-flow based transcriptome reconstruction. BMC Bioinform. 2019, 20, 190.
54. Wucher, V.; Legeai, F.; Hédan, B.; Rizk, G.; Lagoutte, L.; Leeb, T.; Jagannathan, V.; Cadieu, E.; David, A.; Lohi, H.; et al. FEELnc: A tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res. 2017, 45, gkw1306.
55. Kang, Y.J.; Yang, D.C.; Kong, L.; Hou, M.; Meng, Y.Q.; Wei, L.; Gao, G. CPC2: A fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017, 45, W12–W16.
56. Wang, L.; Park, H.J.; Dasari, S.; Wang, S.; Kocher, J.-P.; Li, W. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013, 41, e74.
57. Rombel, I.T.; Sykes, K.F.; Rayner, S.; Johnston, S.A. ORF-FINDER: A vector for high-throughput gene identification. Gene 2002, 282, 33–41.
58. Huerta-Cepas, J.; Szklarczyk, D.; Heller, D.; Hernández-Plaza, A.; Forslund, S.K.; Cook, H.; Mende, D.R.; Letunic, I.; Rattei, T.; Jensen, L.J.; et al. eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019, 47, D309–D314.
59. Huerta-Cepas, J.; Forslund, K.; Coelho, L.P.; Szklarczyk, D.; Jensen, L.J.; von Mering, C.; Bork, P. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol. Biol. Evol. 2017, 34, 2115–2122.
60. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 2013, arXiv:1303.3997.
61. Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M. Twelve years of SAMtools and BCFtools. GigaScience 2021, 10, giab008.
62. Bailey, T.; Krajewski, P.; Ladunga, I.; Lefebvre, C.; Li, Q.; Liu, T.; Madrigal, P.; Taslim, C.; Zhang, J. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput. Biol. 2013, 9, e1003326.
63. Carroll, T.; Stark, R. Assessing ChIP-seq Sample Quality with ChIPQC. 2017. Available online: https://bioconductor.org/packages/release/bioc/html/ChIPQC.html (accessed on 10 March 2025).
64. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
65. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
66. Patro, R.; Duggal, G.; Love, M.I.; Irizarry, R.A.; Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 2017, 14, 417–419.
67. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
68. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47.
69. Yu, G.; Wang, L.-G.; Han, Y.; He, Q.-Y. clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters. OMICS J. Integr. Biol. 2012, 16, 284–287.

Figure 1. Bar plots comparing Ryūtō, Scallop + TACO, and StringTie assembler performance using protein-coding transcripts as reference. The y-axis represents the metric value (0–1 scale). The x-axis shows the different assemblers. (a) F1-score, (b) precision, and (c) recall values are shown, which were obtained when reconstructing manually curated V10 protein-coding transcripts (gold standard). Only exact matches (“=” class in GffCompare) were considered True Positives. Ryūtō outperformed both competitors across all metrics despite conservative evaluation (potential lncRNAs counted as False Positives; see main text above).

Figure 2. Distribution of lncRNA transcripts and size of source data used. (a) Bar plot of the number of lncRNA transcripts per primary transcriptome (x-axis: primary transcriptomes; y-axis: transcript count). (b) Total library sizes in Gbases per transcriptome (x-axis: primary transcriptomes; y-axis: Gbases). Colors reflect a gradient from light to dark, with darker colors indicating higher values. Tissue-specific assemblies contained fewer transcripts, while miracidia, sporocysts and schistosomula showed the highest counts, reflecting differences in library sizes, Ryūtō’s pooling parameters, and potentially incomplete transcripts.

Figure 3. Overlap of lncRNA gene expression across different life-cycle stages. (a) lncRNA genes in common across S. mansoni stages; number of shared lncRNA genes across primary transcriptomes represented in a circular UpSet plot, showing the top 50 gene intersections. The dots indicate the stages at which a given set of lncRNA genes was detected, and lines connecting dots represent the stages that share that given set of lncRNAs; single isolated dots represent stage-specific lncRNA gene sets. The eight tallest columns represent the number of lncRNAs in each stage-specific gene set, suggesting distinct lncRNA expression profiles for each life-cycle stage. The last two numbered columns represent the intramammalian life-cycle stages, with 244 lncRNA genes in common, and then all stages together, with 229 lncRNA genes in common. (b) Distribution of the top 10% most highly expressed genes across S. mansoni life-cycle stages; the histogram quantifies the proportion of lncRNA or protein-coding genes found in one stage versus those found at the intersection of stages, from 1/8 (exclusive, on the left) to 8/8 (present in all stages, on the right). The data reveal that 42% of lncRNA genes are stage-specific, highlighting a highly specialized expression pattern, whereas only 25% of protein-coding genes follow the same pattern.

Figure 4. Fate of V7 LncRNA Transcripts in V10 Assembly. Sankey diagram illustrates the transition path of 16,583 lncRNA transcripts from the previous V7 assembly to the current V10 lncRNA consensus and provides the accounting of all transcripts through the alignment and filtering pipeline. Arrows at the top indicate the three major pipeline annotation steps. Teal paths represent the “Recovered Paths”, tracing transcripts that aligned to the V10 genome and were successfully recovered in the new V10 lncRNAs hierarchical assembly, passing through our pipeline filters. Red paths represent the “Lost/Retired Path,” identifying transcripts that were excluded due to protein coding (PC) potential, aberrant loci/size pattern (see Section 4), or being fragmented/absent in our assembly.

Figure 5. Comparison of V7 and V10 lncRNA transcriptomes. (a) Fate of V7-assembled lncRNA transcripts in comparison to V10. (b) Composition of V10 transcripts in the consensus assembly. Numbers in parentheses in the bar plot represent the number of transcripts. In red, transcripts that were removed or lost in the V7–V10 transition, leaving 26.9% not recovered in the V10 assembly. In dark blue, V7 transcripts that are retained in V10. In dark green, novel V10 transcript isoforms of V7 transcripts. In light green, V10 transcripts in novel loci compared to V7; nearly 50% of all transcripts are novel in the new V10 consensus assembly.

Figure 6. Histone peak marks proximity to lncRNA genes or to random sequences in the S. mansoni genome. Peak profiles for four histone marks relative to the TSS of 10,236 novel lncRNA genes (light green lines) and a control set of 10,236 size-matched random genomic regions (dark lines). Dots represent the proportion of total peaks falling within 1 kb bins. There is a high density of marks near the TSSs for each of the four marks studied.

Figure 7. Principal component analysis of RNA-seq samples from S. mansoni adult male and female parasites before (a) and after (b) batch correction to remove BioProjects bias. Symbol shapes indicate the sex of parasites, with circles representing female parasites and triangles representing males. Symbol colors indicate the different BioProjects, which are identified in the legend at right. After batch correction (b), PC1 accounts for 64% of the total variance, clearly separating samples by sex. PC2 accounts for 7% of the variance. Each point represents a single sample within a given BioProject.

Figure 8. Sex-specific differential gene expression in adult S. mansoni. Heatmap displays normalized expression values of DEGs (y-axis) across samples (x-axis). Using unsupervised clustering, females (pink, top color code) and males (light blue, top color code) were grouped separately. Expression levels are represented by z-score according to the color scale on the right (blue: downregulation; red: upregulation). (a) LncRNAs show a more heterogeneous pattern among samples, while (b) protein-coding genes show a more homogeneous expression pattern.

Figure 9. DEG distribution across the S. mansoni chromosomes. Comparison of the expected distribution (Green; percentage of total transcripts per chromosome) versus the observed distribution (Purple; percentage of total DEGs per chromosome) for (a) protein-coding and (b) lncRNA biotypes. The Z chromosome shows a significant enrichment of DEGs, particularly for lncRNAs (Observed 32.8% vs. Expected 23.7%).

Figure 10. Direction of Sex-biased DEGs. Partitioning of the total (a) protein-coding and (b) lncRNA DEG across chromosomes. Bars represent the percentage of total male-biased (teal) and female-biased DEGs (pink) accounted for by each chromosome. The Z chromosome exhibits a male-bias for both protein-coding and lncRNA DEGs, while autosomal lncRNA DEGs show a preferential female-bias (57.5%).

Figure 11. Weighted Gene Co-expression Network Analysis in S. mansoni male and female RNA-seq data. (a) Cluster dendrogram and module assignment. The x-axis represents individual genes clustered based on their Topological Overlap Matrix (TOM) dissimilarity. The y-axis represents the height, indicating the dissimilarity (1-TOM) at which genes or branches are merged. Colors beneath the dendrogram branches denote the identified co-expression modules; genes of the same color are grouped into the same module. (b) Module–trait association heatmap. A heatmap displaying the association between Module Eigengenes (MEs) and the ‘Sex’ trait (male vs. female). Modules (y-axis) are ordered by the t-value of their association. Each cell shows the linear model t-value and the corresponding p-value; red modules indicate positive correlation (higher expression in males), while blue modules indicate a negative correlation (higher expression in females). Color intensity reflects the association strength (t-value) as indicated by the color scale on the right.

Figure 12. Summary of GO enrichment across sex-associated co-expression modules. A summarized overview of Gene Ontology (GO) enrichment results for modules significantly associated with the ‘Sex’ trait (adjusted p-value ≤ 0.05). The plot is arranged by GO category (rows: Biological Process [BP], Cellular Component [CC], Molecular Function [MF]) and by module–sex association (columns: ‘Female-Associated Modules’ and ‘Male-Associated Modules’, indicating modules upregulated in females or males, respectively). Modules are categorized based on the t-value from their module–trait association (positive t-value for male-associated, negative for female-associated). The x-axis lists the module color names. The y-axis displays the top 3 significantly enriched GO terms (adjusted p-value ≤ 0.01) for each module and ontology. Each dot represents an enriched GO term, with its size proportional to the −log₁₀(adjusted p-value), where larger dots indicate higher statistical significance. Dot color corresponds to the module color.

Figure 13. Intra-Modular lncRNA–protein-coding transcript interaction networks for key sex-associated modules. Concentric ring representation of the most robust co-expression relationships within modules significantly associated with the ‘Sex’ trait. (a) Brown module (male-associated) and (b) Blue module (female-associated). The inner ring displays the top 8 lncRNA hubs ranked by intramodular connectivity (kWithin), while the outer ring comprises the top 20 transcripts (protein-coding or lncRNA) with the highest adjacency to the central hubs. Edges are restricted to the top 3 strongest connections per node (adjacency > 0.15) to emphasize the core network skeleton. Squares represent lncRNAs and circles represent protein-coding genes. Nodes are colored gold for lncRNA hubs and by module color for partners. Bold labels indicate hub status, and novel transcripts are marked with an asterisk (*). Detailed information on the networks can be seen in Supplementary Table S4.
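The edge filter described in the Figure 13 caption (top 3 strongest connections per node, adjacency > 0.15) can be sketched as below. This is an illustrative sketch, not the plotting code used for the figure: the function name `prune_edges`, the toy node names, and the choice to keep an edge when it ranks in the top 3 of either endpoint are all assumptions.

```python
from collections import defaultdict


def prune_edges(adjacency, top_k=3, min_adj=0.15):
    """Keep edges above min_adj that rank in the top_k of at least one endpoint.

    adjacency maps an undirected pair (u, v) to its adjacency weight.
    """
    per_node = defaultdict(list)
    for (u, v), weight in adjacency.items():
        if weight > min_adj:
            per_node[u].append((weight, u, v))
            per_node[v].append((weight, u, v))

    kept = set()
    for edges in per_node.values():
        # Strongest connections first.
        edges.sort(reverse=True)
        for _, u, v in edges[:top_k]:
            kept.add(tuple(sorted((u, v))))
    return kept


# Toy network (hypothetical names and weights, for illustration only).
toy = {
    ("H", "A"): 0.9, ("H", "B"): 0.8, ("H", "C"): 0.7,
    ("H", "D"): 0.6, ("H", "E"): 0.1, ("A", "B"): 0.2,
    ("D", "X"): 0.95, ("D", "Y"): 0.94, ("D", "Z"): 0.93,
}
skeleton = prune_edges(toy)
print(sorted(skeleton))
```

In this toy case the H–E edge is dropped by the adjacency threshold, and the H–D edge is dropped because it is not among the three strongest connections of either H or D.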

Figure 14. Snapshot view of our public Genome Browser track hub at chromosome 1 in the genomic position of SmLINC101519-IBu. Primary transcriptomes are ordered from top to bottom by parasite developmental stage, each with a different track color. Arrows in the intronic region of the gene indicate the direction of transcription. At the top is the V10 protein-coding gene track (smps-version10, blue), followed by the consensus transcriptome V10 lncRNA track (petroleum green), the previous V7 lncRNA track (red), and the tracks of lncRNAs detected in the 12 primary assemblies from the different life-cycle stages and parasite tissues, as indicated by the track names. At the bottom, there are three track labels corresponding to lncRNAs that are present in V7 but were not assembled in V10, to V7 lncRNA transcripts retired in V10, and to monoexonic V10 lncRNAs filtered out in the pipeline. For the consensus transcriptome track and for each lncRNA track, a dark tone indicates a transcript on the top “plus” genomic strand and a light tone a transcript on the bottom “minus” strand.

Table 1. RNA-seq library filtering steps used in the pipeline.

Step	Libraries	Filtered Out	% Filtered (of total)	Remaining	% Retained (of total)
Total libraries	-	-	-	1796	100%
1. STAR mapping	1796	382	21.3%	1414	78.7%
2. RSeQC Strandedness (infer_experiment.py)	1414	44	2.4%	1370	76.3%
3. Gene Body Coverage (genebody_coverage.py)	1370	20	1.2%	1350	75.1%
4. Transcript integrity number (tin.py)	1350	207	11.5%	1143	63.6%
5. Qualimap	1143	36	2.0%	1107	61.6%
Total retained	-	-	-	1107	61.6%
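The running totals in Table 1 can be recomputed with a short sketch, assuming (as the table values suggest) that both percentage columns are expressed relative to the initial 1796 libraries. The function name `filtering_summary` is hypothetical; the library counts are taken from the table. Recomputed percentages may differ from the printed ones in the last digit because of rounding.

```python
TOTAL_LIBRARIES = 1796

# (step name, libraries filtered out), as listed in Table 1.
STEPS = [
    ("STAR mapping", 382),
    ("RSeQC strandedness", 44),
    ("Gene body coverage", 20),
    ("Transcript integrity number", 207),
    ("Qualimap", 36),
]


def filtering_summary(total, steps):
    """Return (name, removed, % removed, remaining, % retained) per step."""
    remaining = total
    rows = []
    for name, removed in steps:
        remaining -= removed
        rows.append(
            (name, removed, 100.0 * removed / total,
             remaining, 100.0 * remaining / total)
        )
    return rows


rows = filtering_summary(TOTAL_LIBRARIES, STEPS)
for name, removed, pct_removed, remaining, pct_retained in rows:
    print(f"{name}\t{removed}\t{pct_removed:.1f}%\t{remaining}\t{pct_retained:.1f}%")
```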

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
