The Conservation of Long Intergenic Non-Coding RNAs and Their Response to Verticillium dahliae Infection in Cotton

Long intergenic non-coding RNAs (lincRNAs) have been demonstrated to be vital regulators of diverse biological processes in both animals and plants. While many lincRNAs have been identified in cotton, we still know little about the repositories and conservativeness of lincRNAs in different cotton species or about their role in responding to biotic stresses. Here, by using publicly available RNA-seq datasets from diverse sources, including experiments of Verticillium dahliae (Vd) infection, we identified 24,425 and 17,713 lincRNAs, respectively, in Gossypium hirsutum (Ghr) and G. barbadense (Gba), the two cultivated allotetraploid cotton species, and 6933 and 5911 lincRNAs, respectively, in G. arboreum (Gar) and G. raimondii (Gra), the two extant diploid progenitors of the allotetraploid cotton. While closely related subgenomes, such as Ghr_At and Gba_At, tend to have more conserved lincRNAs, most lincRNAs are species-specific. The majority of the synthetic and transcribed lincRNAs (78.2%) have a one-to-one orthologous relationship between different (sub)genomes, although a few of them (0.7%) are retained in all (sub)genomes of the four species. The Vd responsiveness of lincRNAs seems to be positively associated with their conservation level. The major functionalities of the Vd-responsive lincRNAs seem to be largely conserved amongst Gra, Ghr, and Gba. Many Vd-responsive Ghr-lincRNAs overlap with Vd-responsive QTL, and several lincRNAs were predicted to be endogenous target mimicries of miR482/2118, with a pair being highly conserved between Ghr and Gba. On top of the confirmation of the feature characteristics of the lincRNAs previously reported in cotton and other species, our study provided new insights into the conservativeness and divergence of lincRNAs during cotton evolution and into the relationship between the conservativeness and Vd responsiveness of lincRNAs. The study also identified candidate lincRNAs with a potential role in disease response for functional characterization.


Introduction
Long non-coding RNAs (lncRNAs) are RNA molecules with a length longer than 200 base pairs (bp) and without protein-coding potential. Based on their genomic locality, lncRNAs are classified as long intergenic ncRNAs (lincRNAs), long intronic ncRNAs, and long anti-sense ncRNAs [1,2]. LncRNAs constitute an important component of the pervasiveness of genomic transcription [3] and function in wide ranges of biological processes in plants, including development and abiotic and biotic stress responses [4,5]. They achieve their functionality by modulating the transcription and/or translation of target genes in cis or in trans through diverse molecular mechanisms, such as chromatin remodeling, chromosome looping, regulation of mRNA splicing, and miRNA sponges [2,6].
Compared to protein-coding genes (PCGs) or mRNAs, generally, lncRNA transcripts are shorter, have fewer exons, are lowly expressed, show a more tissue-specific expression regulates salt stress in cotton remains elusive. GhlncRNA354, an endogenous target mimic of miR160b that targets auxin response factors, was found to regulate salt tolerance and root growth [32]. GhDAN1, a drought-responsive lincRNA expressed in Ghr but not in Gar, could be a negative regulator of drought stress as silencing GhDAN1 improves tolerance to drought stress. GhDAN1 might achieve its regulatory role by binding to the AAAG motifs of the genes of the auxin-response pathways [30]. A more recent study applied VIGS to 111 lncRNAs and phenotyped the treated plants under four stress conditions, including drought, salt, heat, and cold. Approximately half of the lncRNAs were found to affect plant height (20), drought response (34), heat response (1), or cold response (5) [25].
Verticillium wilt caused by the soil-borne fungus Verticillium dahliae (Vd) is a destructive cotton disease worldwide. To know the potential role of lncRNAs in response to Vd infection, a study compared the expression profiles of lncRNAs that are conserved or non-conserved between Gba (Vd resistant) and Ghr (Vd susceptible) [26]. It was found that the proportion of Vd-responsive lncRNAs is higher among the non-conserved ones than among the conserved ones, and the non-conserved lncRNAs tend to have a higher expression level than the conserved ones. Two long anti-sense ncRNAs, GhlncNAT-ANX2 and GhlncNAT-RLP7, generated from the ANX2 and RLP7 locus, respectively, were found to be negative regulators of Vd response as down-regulation of either of the two lincRNAs by VIGS enhances resistance to Vd. Both lncNATs probably achieve their functionality via the jasmonic acid (JA) pathway [26]. A more recent study identified 4277 DE lncRNAs based on comparison of transcriptomic data from Vd infected and mock root samples of a Vd-resistant Ghr cultivar [24]. For the DE lncRNAs, co-expressed trans-PCGs outnumber co-expressed cis-PCGs, implying that those DE lncRNAs could have a broad function by regulating the expression of PCGs not physically linked. The study also found that lncRNAs could be heavily involved in Vd response by regulating the JA pathway and demonstrated that the expression level of both GhlncLOX3 and its trans-target GhLOX3 is positively correlated with the Vd-resistance level of Ghr cultivars and that down-regulating GhlncLOX3 in resistant Ghr cultivar by VIGS compromises the Vd resistance of the Ghr cultivar [24]. In addition, lncRNA2 and lncRNA7 were found to be negative and positive regulator of Vd resistance, respectively, by modulating genes involved in cell-wall development [33].
Despite the studies on identification of lncRNAs in different cotton species, the evolutionary dynamics of cotton lncRNAs before and after the polyploidization event remains largely elusive, as little is known about the birth and death, conservation, and diversification of lncRNAs during the evolutionary history of the Gossypium lineage, oe whether the conservation of lncRNAs is related to their functional conservation. To address these questions, in this study, we identified long intergenic non-coding RNAs (lincRNAs) in the two cultivated allotetraploid cotton species, i.e., Ghr and Gba, and their two extant diploid progenitors, i.e., Gar and Gra, and systematically compared the conservation of the identified lincRNAs, with a focus on those responding to Vd infection. We showed that while most lincRNAs are species-specific, the closely related subgenomes of Ghr and Gba tend to retain a higher proportion of highly conserved lincRNAs, and the Vd responsiveness of lincRNAs is positively linked to their conservation. We also identified several lincRNAs with a potential function in response to Vd infection or in regulation of miRNAs involved in modulating disease-resistance genes.

Identification of lincRNAs in Diploid and Allotetraploid Cotton Species
To explore the repertoire and conservation of lincRNAs in the two cultivated allotetraploid cotton species [G. hirsutum (Ghr, AD 1 ) and G. barbadense (Gba, AD 2 )] and their diploid progenitors [G. arboreum (A 2 ) and G. raimondii (Gra, D 5 )], we collected a total of 610 published transcriptomic datasets generated from a diverse of tissues of the four species, including 84 datasets from the Vd-infection experiments ( Figure 1A, Supplemental The percentage of the expressed protein coding genes (exp genes), the expressed lincRNAs (exp lincRNAs), the differentially expressed protein coding genes (DE genes), and the differentially expressed lincRNAs (DE lincRNAs) in the two diploids (Gar and Gra) and the At and Dt subgenomes of the two allotetraploid cotton species (Ghr and Gba). DE genes and DE lincRNAs were defined based on the datasets from Vd-infection experiments with the criteria of q-value < 0.05 and |log2(FC)| ≥ 1, using relatively highly expressed protein coding genes and lincRNAs (those with TPM < 0.5 in at least one sample were not considered). (D) The length of lincRNAs and protein coding genes (PCGs) in the four cotton species. (E) The maximum expression level (log2 transformed) of lincRNAs and PCGs in the four cotton species. (F) Distribution of the tissue specificity index (TSI, ranging from 0 to 1) of lincRNAs and PCGs in the four cotton species. Zero represents broadly expressed and one represents specific expression. (G) Distribution of the distance between lincRNAs and their nearest neighboring PCGs in the four cotton species.

Conservation of lincRNAs in Diploid and Allotetraploid Cotton Species
Sequence conservation is associated with function conservation. We, thus, investigated sequence and genomic position conservation of the lincRNAs predicted in the four cotton species and grouped them into four types: syntenic and transcribed (ST), syntenic and allelically transcribed (SAT), positional conservation (PC), and genome-specific (GS; see Materials and Methods for the definition of each type) (Figure 2A,B; Supplemental Table S2). In all pairwise comparisons amongst the six (sub)genomes, generally the SAT and GS lincRNAs are the most abundant, suggesting frequent loss and birth of lincRNAs during the history of cotton evolution. In line with the genetic and evolutionary relationship amongst the cotton (sub)genomes, the (sub)genomes that are closely related have the highest proportion of ST lincRNAs and the least proportion of GS lincRNAs, such as the At or Dt subgenome of Ghr and Gba, while the (sub)genomes that are remotely related have the highest proportion of GS lincRNAs and the least proportion of ST lincRNAs, The number of lincRNAs identified in diploid (Gar and Gra) and the A t and D t subgenomes of the two allotetraploid cotton species (Ghr and Gba). (C) The percentage of the expressed protein coding genes (exp genes), the expressed lincRNAs (exp lincRNAs), the differentially expressed protein coding genes (DE genes), and the differentially expressed lincRNAs (DE lincRNAs) in the two diploids (Gar and Gra) and the A t and D t subgenomes of the two allotetraploid cotton species (Ghr and Gba). DE genes and DE lincRNAs were defined based on the datasets from Vd-infection experiments with the criteria of q-value < 0.05 and |log2(FC)| ≥ 1, using relatively highly expressed protein coding genes and lincRNAs (those with TPM < 0.5 in at least one sample were not considered). (D) The length of lincRNAs and protein coding genes (PCGs) in the four cotton species. (E) The maximum expression level (log2 transformed) of lincRNAs and PCGs in the four cotton species. (F) Distribution of the tissue specificity index (TSI, ranging from 0 to 1) of lincRNAs and PCGs in the four cotton species. Zero represents broadly expressed and one represents specific expression. (G) Distribution of the distance between lincRNAs and their nearest neighboring PCGs in the four cotton species.
Using the criteria specified in Materials and Methods (Supplemental Figure S1), 6936, 5911, 25,425, and 17,713 lincRNAs were identified in Gar, Gra, Ghr, and Gba, respectively ( Figure 1B). Between the two diploid species, while more lincRNAs were found in the A 2 genome than in the D 5 genome, the number of lincRNAs per dataset (112) was identical, suggesting that a larger genome size (~1.7 Gb of A 2 vs.~0.75 Gb of D 5 ) does not necessary translate into having more lincRNAs, although the number of lincRNAs identified could also be impacted by sequencing depth. Between the two allotetraploid species, more lincRNAs were uncovered in Ghr than in Gba, probably due to Ghr having more datasets from not only more diverse tissues but also samples subjected to various stresses (Supplemental Table S1). Between the two subgenomes, more lincRNAs were found in the A t subgenome than in the D t subgenome in both Ghr and Gba, although the number of lincRNAs identified seem to be only weakly related to the size of subgenome ( Figure 1B). Compared to protein-coding genes (PCGs), lincRNAs are shorter ( Figure 1D), are less likely to be expressed ( Figure 1C), tend to be lowly expressed ( Figure 1E), and are more tissue-specific ( Figure 1F). In each species, most lincRNAs are located at~1200 bp or~15,000 bp away from their nearest PCGs ( Figure 1G). The~1200 bp peak is the distal promoter region of PCGs. Given the median distance (6862 bp in Ghr) and mean distance (24,302 bp in Ghr) of two neighboring PCGs (Supplemental Figure S2) [34], enrichment of lincRNAs~15,000 bp away from their nearest PCGs suggest that the lincRNAs in that region might be important in gene regulation, potentially acting as enhancers of their neighboring PCGs. However, compared to the three cultivated cotton species, the wild cotton species Gra seems to have a significantly higher percentage of expressed lincRNAs and to have its lincRNAs found at the nearby regions of PCGs, suggesting that domestication and artificial selection pressure of breeding practices may have shaped the expression dynamics and landscape of lincRNAs in the cultivated cotton species.
The Ghr lincRNAs were blasted against the full-length cDNAs generated from the leaf and root of two Ghr accessions (MCU-5 and Siokra 1-4) by PacBio SMRT (to be published separately). Approximately 47.1% of them had a hit (E-value ≤ 10 −6 ). A total of 3278 lincRNAs (~27.4%) of those with a hit has ≥90% sequence similarity over ≥90% of the lincRNA length, meaning at least a quarter of the identified Ghr lincRNAs are highly confident ones.

Conservation of lincRNAs in Diploid and Allotetraploid Cotton Species
Sequence conservation is associated with function conservation. We, thus, investigated sequence and genomic position conservation of the lincRNAs predicted in the four cotton species and grouped them into four types: syntenic and transcribed (ST), syntenic and allelically transcribed (SAT), positional conservation (PC), and genome-specific (GS; see Materials and Methods for the definition of each type) (Figure 2A Table S2). In all pairwise comparisons amongst the six (sub)genomes, generally the SAT and GS lincRNAs are the most abundant, suggesting frequent loss and birth of lincRNAs during the history of cotton evolution. In line with the genetic and evolutionary relationship amongst the cotton (sub)genomes, the (sub)genomes that are closely related have the highest proportion of ST lincRNAs and the least proportion of GS lincRNAs, such as the A t or D t subgenome of Ghr and Gba, while the (sub)genomes that are remotely related have the highest proportion of GS lincRNAs and the least proportion of ST lincRNAs, such as Gar and Gra ( Figure 2C). However, to some extent, the proportion of ST and GS in the comparison between Ghr_A t and Gar is lower and higher, respectively, than that in other comparisons of closely related (sub)genomes ( Figure 2C), suggesting that, in terms of the lincRNAs in Gar and its two descendent A t subgenomes (Ghr_A t and Gba_A t ), Ghr_A t is more divergent than Gba_A t when compared to their ancestral A genome donor Gar (A 2 ).
ST lincRNAs, 62.3%, 24.0%, 7.3%, 3.9%, and 2.4% of the 8901 families (including 6962 one2one, 1202 one2many, and 737 many2many) are conserved in two, three, four, five, and six (sub)genomes, respectively ( Figure 2D). Of the 6962 ono2one families, 48 (0.7%), 118 (1.7%), 332 (4.8%), 1527 (21.9%), and 4937 (70.9%) are conserved in six, five, four, three, and two (sub)genomes, respectively. However, the difference of the (sub)genome conservation of the PC lincRNAs (which belongs to 5780 families) seems not to be as significant as that of the ST lincRNAs ( Figure 2E).  The ST lincRNAs of the 6962 one2one families are considered to originate from the common ancestor of Gar and Gra, as each of them has a pair of homologs in the A 2 and D 5 or in the A t and D t subgenomes. We classified them into four types based on their presence and absence in the two ancestral diploids (Gar and Gra), and whether the diploid homologous lincRNAs have been inherited to the corresponding (sub)genomes of the two allotetraploids (Ghr and Gba), to infer their evolution dynamics (Table 1). Type 1 (262 or 3.8%) contains the families with a homologous lincRNA identified in both Gar and Gra, and 18.3% of these lincRNAs are retained in both subgenomes of Ghr and Gba, representing the most conserved ones, whereas 81.7% of them lost in one or both subgenomes of Ghr and Gba after the divergence of Ghr and Gba. Type 2 (1349 or 19.4%) includes the families with the homologous lincRNA retained in Gar but lost in Gra after their divergence; 42.8% and 12.2% of these Gar lincRNAs are retained and lost in the A t subgenome of Ghr and Gba, respectively; the remaining (45.1%) are retained in the A t subgenome of Ghr or Gba. Type 3 (1397 or 20.1%) are the families with the homologous lincRNA retained only in Gra and lost in Gar after their divergence, with 36.1% and 16.7% of them being retained and lost in the D t subgenome of Ghr and Gba, respectively; close to half of them (47.2%) being lost in the D t subgenome of Ghr or Gba. Type 4 (3954 or 56.8%) includes the families with their both Gar and Gra homologous lincRNAs lost after the divergence of Ghr and Gba, as at least an A t and a D t homologous lincRNA were identified in Ghr and/or Gba (Table 1). These results indicate that only a tiny portion (0.7%) of the ST lincRNAs is very conserved and that the vast majority of the ST lincRNAs have lost their identity in at least one of the four species during the evolution trajectory of cotton. Table 1. Conservation of the syntenic and transcribed lincRNAs of the one2one family.

Type
Gar Presence in two to all four subgenomes 3954 (56.8) § √ and x represent presence and absence of homologous lincRNAs. respectively.
Of the 48 most-conserved lincRNAs, 17 have hits matching PacBio full-length cDNAs from both the A t and D t subgenomes of Ghr (see an example in Supplemental Figure S3), 9 and 3 have hits matching PacBio full-length cDNAs from the A t and the D t subgenome of Ghr, respectively, meaning that transcription of about half (47.9%, 46/96) of these conserved Ghr lincRNAs is supported by full-length cDNAs.

Relationship between Conservation of lincRNAs and Their Vd Responsiveness
We first identified lincRNAs responding to infection of V. dahliae, a fungal pathogen causing the Verticillium wilt disease in cotton, by comparing the changes of the expression level of all predicted lincRNAs in roots before and after Vd infection using transcriptomic data generated from Vd-inoculation experiments. Using the criteria presented in Materials and Methods, in Gar, between~200 and~350 lincRNAs were found to be differentially expressed (DE) in at least one of the three time points; overall, 96 were differentially expressed at both 24 and 48 hours post infection (hpi), and 32 lincRNAs were differentially expressed at all three time points ( Figure 3A). In Gra,~100 DE lincRNAs were identified at either 12 or 48 hpi, with 28 being common at both time points ( Figure 3B). In Ghr, there are more DE lincRNAs at the early time points (6-12 hpi) than at the late time points (24-72 hpi) with 11 being common at all five time points ( Figure 3C). More DE lincRNAs were identified in Gba than in the other three species, with the highest number observed at 24-48 hpi. Interestingly, about half (348) of the DE lincRNAs identified at 2 hpi were steadily differentially expressed at all other time points ( Figure 3D). While most lincRNAs that were found to be commonly differentially expressed at multiple time points in one species have homologous lincRNAs in at least one of the other three species, the homologous lincRNAs were usually found to be not differentially expressed. That could be a result of functional divergence of the homologous lincRNAs or because of the use of different Vd isolates in different Vd-infection experiments. However, overall, the Vd responsiveness of lincRNAs is positively associated with their conservation, as the percentage of Vdresponsive lincRNAs is significantly higher in the ST lincRNAs than in the PC ones in all four comparisons ( Figure 3E). This was further supported by the association between conservation of lincRNAs and their Vd responsiveness, i.e., the transcribed lincRNAs residing in the syntenic regions (ST and SAT) are more likely to be associated with Vd responsiveness than the GS lincRNAs ( Figure 3F).
We further used the following criteria to stringently select a set of DE lincRNAs: (1) the expression level of the Vd-infected samples is all higher (up-regulated) or all lower (down-regulated) than that of the uninfected control; (2) for the upregulated candidates, the expression level of at least one Vd-infected sample is ≥5 TPM (transcripts per million sequenced reads); for the down-regulated candidates, the expression level of the uninfected control is ≥5 TPM; (3) the expression fold change caused by Vd infection is ≥2 in at least one Vd-infected sample. As a result, 8, 47, 81, and 380 DE lincRNAs were shortlisted in Gar, Gra, Ghr, and Gba, respectively. Approximately 50% of these DE lincRNAs are genome-specific ones, the remaining were found in at least two (sub)genomes and belong to one of the three families ( Table 2). For the one2one family lincRNAs, as expected, more are conserved in two-three (sub)genomes than in four-six (sub)genomes. Despite one Vd-repressed Gra lincRNA and three Vd-induced Gba lincRNAs being conserved in all six (sub)genomes (Supplemental Table S3), their homologous lincRNAs in other (sub)genomes were not shortlisted by the stringent selection criteria, suggesting that the level of Vd responsiveness of the stringently selected lincRNAs tends to be species-dependent.

Subgenome Dominance of the Vd-Responsive lincRNAs
To know the effect of the polyploidization event on lincRNA characteristics, we compared the length, expression level, and specificity of lncRNAs from the two subgenomes in the two allotetraploid species, Ghr and Gba. The length difference of the A t subgenome lincRNAs seems to be bigger than that of the D t subgenome in both Ghr and Gba, while the median lincRNA length of the A t subgenome is slightly longer than that of the D t subgenome in Gba, but the median lincRNA length of the A t and D t subgenomes of Ghr seems to be similar ( Figure 4A). A significant difference was observed for the maximum expression level of lincRNAs from the two subgenomes in both Ghr and Gba, and the lowest maximum expression level was observed for the Gar lincRNAs ( Figure 4B). A significantly different tissue specificity was evident between the Ghr_A t and Ghr_D t subgenome lincRNAs but not between the Gba_A t and Gba_D t subgenome lincRNAs, despite lincRNAs of both Ghr and Gba seeming to have lost a certain level of tissue specificity compared to the lincRNAs in the two ancestral diploid cotton species, Gar and Gra ( Figure 4C). These results indicate that certain features of the allotetraploid cotton lincRNAs have diverged from that of their ancestral diploid cotton, particularly from the ancestral A genome cotton, and that the difference between the two subgenomes in both Ghr and Gba is relatively small, comparing to the difference between diploid and allotetraploid cottons.
Regarding the number of Vd-responsive PCGs and lincRNAs, a similar number of upor down-regulated PCGs were found in the two subgenomes of Ghr at each time point. For DE lincRNAs, while the number seems to be slightly higher in the A t subgenome at 6 hpi, both the number and change trend at other time points seem to be similar between the two subgenomes ( Figure 4D). Like Ghr, Gba has a very similar number of DE PCGs in the two subgenomes; however, more Vd-induced lincRNAs were found in the A t subgenome at most time points; in contrast, in the D t subgenome, the number of Vd-induced lincRNAs from 2-48 hpi are slightly less than that of the Vd-repressed lincRNAs ( Figure 4E). These observations indicate that the two allotetraploid cottons may have different subgenome dominance regarding Vd-responsive lincRNAs.

Cis-Regulatory Role of the Vd-Responsive lincRNAs in Cotton
The functionality of lincRNAs is achieved by interacting with PCGs or other noncoding RNAs via diverse molecular mechanisms [4]. LincRNAs and their cis targets tend to be co-or reciprocally expressed, we, therefore, using the RNA-seq data from the Vdinfection experiments, analyzed the expression changes of the neighboring PCGs (one on each side) of the Vd-responsive lincRNAs found in the four cotton species, to identify those that were significantly induced or repressed in at least one time point following Vd

Cis-Regulatory Role of the Vd-Responsive lincRNAs in Cotton
The functionality of lincRNAs is achieved by interacting with PCGs or other noncoding RNAs via diverse molecular mechanisms [4]. LincRNAs and their cis targets tend to be co-or reciprocally expressed, we, therefore, using the RNA-seq data from the Vdinfection experiments, analyzed the expression changes of the neighboring PCGs (one on each side) of the Vd-responsive lincRNAs found in the four cotton species, to identify those that were significantly induced or repressed in at least one time point following Vd infection. As a result, 208, 305, 228, and 903 such PCGs were found in Gar, Gra, Ghr, and Gba, respectively. These PCGs were subjected to GO-enrichment analysis. The PCGs from Gar were enriched with one biological process (BP) GO term (p-value < 0.001 applying to all) and one cellular component (CC) GO term. The Gra PCGs were enriched with one molecular-function (MF) GO term. The Ghr PCGs were enriched with two, one, and five biological process, cellular components, and molecular-function GO terms, respectively. The Gba PCGs were enriched with three and six biological process and molecular-function GO terms, respectively. Both GO terms enriched in Gar do not overlap with those of Gar, Ghr, and Gba; the MF term (iron ion binding) enriched in Gra is also enriched in Ghr and Gba; between Ghr and Gba, four terms overlap, including one BP and three MF terms ( Figure 5; Supplemental Table S4). Based on these results, more biological functions or pathways seem to be regulated by lincRNAs in the allotetraploid cottons than in the diploid cottons under the conditions of Vd infection. While the cis targets of the Vd-responsive lincRNAs in Gra tend to be maintained in Ghr and Gba, those in Gar do not. Even between the two allotetraploid cottons, many Vd-responsive lincRNAs might regulate PCGs with different functionality, despite some of them that might regulate a set of PCGs with similar functionality. infection. As a result, 208, 305, 228, and 903 such PCGs were found in Gar, Gra, Ghr, and Gba, respectively. These PCGs were subjected to GO-enrichment analysis. The PCGs from Gar were enriched with one biological process (BP) GO term (p-value < 0.001 applying to all) and one cellular component (CC) GO term. The Gra PCGs were enriched with one molecular-function (MF) GO term. The Ghr PCGs were enriched with two, one, and five biological process, cellular components, and molecular-function GO terms, respectively. The Gba PCGs were enriched with three and six biological process and molecular-function GO terms, respectively. Both GO terms enriched in Gar do not overlap with those of Gar, Ghr, and Gba; the MF term (iron ion binding) enriched in Gra is also enriched in Ghr and Gba; between Ghr and Gba, four terms overlap, including one BP and three MF terms ( Figure 5; Supplemental Table S4). Based on these results, more biological functions or pathways seem to be regulated by lincRNAs in the allotetraploid cottons than in the diploid cottons under the conditions of Vd infection. While the cis targets of the Vd-responsive lincRNAs in Gra tend to be maintained in Ghr and Gba, those in Gar do not. Even between the two allotetraploid cottons, many Vd-responsive lincRNAs might regulate PCGs with different functionality, despite some of them that might regulate a set of PCGs with similar functionality.  Table S4).
Together, the above results imply that, compared to the PCGs flanking overall lin-cRNAs, those flanking the Vd-responsive lincRNAs are quite unique, particularly in Gar, Gra, and Ghr, and that the regulatory functions of the Vd-responsive lincRNAs are largely conserved in cotton, particularly between Ghr and Gba, although some of their functions have diverged during the evolutionary history of cotton.  Table S4).
Together, the above results imply that, compared to the PCGs flanking overall lin-cRNAs, those flanking the Vd-responsive lincRNAs are quite unique, particularly in Gar, Gra, and Ghr, and that the regulatory functions of the Vd-responsive lincRNAs are largely conserved in cotton, particularly between Ghr and Gba, although some of their functions have diverged during the evolutionary history of cotton.

Overlapping between Vd-Responsive lincRNAs and QTL
To further explore the potential function of the Vd-responsive lincRNAs identified in Ghr, we analyzed their overlapping with the reported quantitative trait loci (QTL) associated with Vd responsiveness. In total, 198 A t -subgenome and 269 D t -subgenome Vdresponsive lincRNAs were found to overlap with 55 A t -subgenome and 37 D t -subgenome QTL, respectively. Depending on the QTL size, each QTL contains 1 to 36 Vd-responsive lincRNAs. Of the 23 stringently selected Vd-responsive Ghr lincRNAs, 9 were found to overlap with eight QTL, with 1 in the A t subgenome and 7 in the D t subgenome (Supplemental Table S5). For the overlapping QTL regions with potential disease-response gene(s), their response to Vd infection might be contributed to by the potential disease response gene(s) or their interaction with lincRNAs. For instance, Ghrlnc.47594, a Vd-responsive lincRNA selected by the stringent criteria, is located at QTL-59, where Ghrlnc.47594 was found to be flanked by one disease-resistance gene and one leucine-rich repeat-containing gene. Nevertheless, the majority of the overlapping QTL do not contain a gene with a predicted and/or demonstrated function in disease response.

LincRNAs as Potential Target Mimicry of miR482/2118
LncRNAs that interact with miRNAs but cannot be cleaved by miRNAs are negative regulators of the miRNAs, termed as endogenous target mimicry (eTM) [35]. miR482 and miR2118 are negative post-transcriptional regulators of genes encoding nucleotidebinding leucine-rich repeat (NLR) proteins by targeting their conserved P-loop motif for transcript degradation or translational repression [36][37][38]. It is, thus, in our interest to know whether some of the cotton lincRNAs identified here are potential eTMs involved in regulation of miR482/2118-mediated modulation of NLRs and, consequently, diseaseresponse outcomes.
Using the methods presented in Materials and Methods, we predicted one, two, four, and eight lincRNAs to be potential eTMs of different isoforms of miR482 or miR2118 in Gar, Gra, Ghr, and Gba, respectively (Supplemental Table S7). Most lincRNAs can potentially interact with a single miR482 or miR2118 isoform, but a couple of them were found to be able to interact with two different miR482 isoforms, such as Ghrlnc.53204, which contains binding sites for both miR482a and miR482g. In one case (Ghrlnc.71386), the lincRNA was found to contain two binding sites of miR2118k. Importantly, we found a pair of homologous lincRNAs in Ghr (Ghrlnc.36832) and Gba (Gbalnc.31516) to be potential eTMs of miR2118e (Supplemental Table S7). Except for the intron found in Gbalnc.31516 but not in Ghrlnc.36832, these two lincRNAs are almost identical (Supplemental Figure S4), implying their functional conservation. No matching sequence was found for these two lincRNAs in our PacBio full-length cDNA collections, likely due to the difference in the cotton accessions used and/or the low expression of the lincRNA, so it could not be detected by the sequencing depth used in generation of the full-length cDNAs. Nevertheless, three full-length cDNAs homologous to these two lincRNAs with 2-4 nucleotide polymorphisms at the miR2118e binding site were found (Supplemental Figure S5), and mutation in these polymorphic site(s) could change them into potential eTMs of miR2118e.

Discussion
Eukaryote genomes are pervasively transcribed to generate many different types of non-coding RNAs, with lincRNAs being one of the major types. Studies in both plants and animals indicated that lincRNAs are usually lineage-or species-specific, and positional conservation is more common than sequence conservation, meaning that the lincRNAs that are broadly conserved in different species that share only short patch sequences [9][10][11]15]. For instance, while~20% of rice lincRNAs have a detectable sequence similarity to the maize genomic sequences, only~1% of them have a sequence similarity to the maize lincRNAs, in contrast, approximately a quarter of the rice and maize lincRNAs were found in the synteny blocks [10]. In a study comparing conservation of lncRNAs in three lineages, Brassicaceae, Aethionemeae, and Cleomaceae, it was found that, of the 6480 Arabidopsis thaliana lincRNAs [39], only 11 are conserved in Aethionemeae with 9 of them seeming to be transcribed, whereas 12 of the 39 lineage-specific lncRNAs are positionally conserved in at least one of the other lineages [40]. Even within the same Brassicaceae family, while onlỹ 9% of Brassica napus lncRNAs showed sequence similarity with those from A. thaliana, 44% of B. napus and A. thaliana lncRNAs are conserved by position [41]. Our results observed in the four closely related cotton species are consistent with these findings. In each pairwise (sub)genome comparison, despite the proportion of lincRNAs conserved by both sequence and position (those of ST) decreases with the increase in genetic distance of the (sub)genomes, a significant proportion of lincRNAs (those of SAT and PC) are conserved by position and not by sequence (Figure 2).
The ST lincRNAs are the most-conserved. Their retention in genetically distant (sub)genomes suggests that they might have undergone purification selection owing to their functional importance. The PC lincRNAs have diverged significantly in their sequences but retained their transcription. For these lincRNAs, their function, if any, might not be related to their sequences but to the transcription of the genomic locus containing the lincRNA. Although no such functionality has been reported for plant lincRNA, it has been reported in animals, such as LncMyoD, a lincRNA identified in mouse myoblast and regulating skeletal muscle differentiation, which showed no sequence conservation between mouse and human but was conserved by gene structure and function [42].
Position but not sequence conservation of lincRNAs indicates rapid turnover of lin-cRNA loci. Comparative studies of lncRNAs from 16 vertebrate species and the echinoid sea urchin found >70% of lncRNAs might have appeared in the past 50 million years, although no homologous lincRNAs are traceable in the conserved positions [9]. Similarly, in plants, despite 83-98% of Citrus sinensis lincRNAs have homologous sequences in eight closely related citrus genomes, only 16-29% of them were observed to be transcribed in the eight species [43]. The rapid turnover of lincRNAs is also evident in cotton, based on the presence and absence of the ST lincRNAs of the 6962 one2one family in the six (sub)genomes (Table 1). These lincRNA families are inferred to be originated before the divergence of the A and D genomes~5 million years ago [17], rather than after their divergence, as at least a pair of A (A 2 and A t ) and D (D 5 and D t ) homologs were identified among the six (sub)genomes. Of these lincRNA families, only 0.7% have retained homologous lincRNAs in all six (sub)genomes, 19.4% and 20.1% lost the homolog in Gra (D 5 ) and Gar (A 2 ), respectively, and 56.8% lost the homologs in both Gar and Gra. While Gar is a domesticated and cultivated species and Gra is a wild species, the similar turnover rate observed in the two species suggests that domestication and artificial selection might have little impact on the evolution of these lincRNAs in the time period of~5 million years. However, for both A t (derived from A 2 ) and D t (derived from D 5 ) homologs, their retention rate is~2% higher in Ghr than in Gba (Table 1), implying a kind of species difference.
While it is still under debate whether the transcribed lincRNAs are functionally relevant, many lincRNAs have been demonstrated to be important regulators of diverse biological processes [44,45], including some preliminary results achieved in cotton [25,27,30,32,33]. Nevertheless, owing to the huge number of lincRNAs identified and their unique characteristics, such as low and tissue-specific expression, it is still a challenge to know which of them are functionally important, so they should be chosen for in-depth functional investigation to understand their underlying regulatory mechanism(s). Comparative genomics analysis of lincRNAs across related plant species, just like what we have done here, can provide practical clues for investigating function of lincRNAs, because like homologous PCGs, homologous lincRNAs are expected to have conserved function and, thus, are worthy of further study.
One of the major goals of this study was to use comparative analysis to understand the relationship between conservation and the Vd responsiveness of cotton lincRNAs and to identify candidate lincRNAs for further functional characterization. We found that the lincRNAs conserved by sequence and/or position (ST, SAT, and PC) are more likely to be associated with responding to Vd infection than the genome-specific (GS) ones ( Figure 3F). Compared to the PC lincRNAs, the ST lincRNAs have a much higher percentage of Vd-responsive ones ( Figure 3E). The regulatory roles of the Vd-responsive lincRNAs in Gra, Ghr, and Gba seem to be largely conserved ( Figure 5). In addition, a pair of lincRNAs highly conserved between Ghr and Gba were predicted to be potential eTMs of miR2118e, a negative regulator of several NLRs (Supplemental Figure S3) [46]. These observations suggest that the function of cotton lincRNAs, if any, is likely to be related to their conservation level, although the level of response might be species-dependent.
A few studies have investigated Vd-responsive cotton lncRNAs [24,26,33]. One of those studies found more Vd-responsive lncRNAs in the D t subgenome than in the A t subgenome in both Ghr and Gba, and, consistently, slightly more Vd-responsive QTL were found in D t than in A t [26]. In contrast, we found more Vd-responsive lincRNAs in A t than in D t in both Ghr and Gba ( Figure 4D,E). The discrepancy might be due to use of different cotton accessions and lncRNAs (all lncRNAs vs. lincRNAs) in the two studies. Despite the inconsistency, we also found more Vd-responsive QTL in D t than in A t for the stringently selected Vd-responsive lincRNAs. For the two lncRNAs (lncRNA2 and lncRNA7) that have been demonstrated to be regulators of Vd resistance [33], although lncRNA7 was not identified in this study, lncRNA2 was identified, despite its expression change upon Vd infection being insignificant. These results suggest that the repertoire of Vd-responsive lncRNAs is not yet saturated and that the same lncRNA might respond differently to Vd infection due to different genetic background and/or different Vd isolates. More investigations involving diverse cotton accessions and pathogens (also different strains of the same pathogen) are, thus, required to have an in-depth understanding of the landscape of the cotton lncRNAs responding to disease infection and the function of lncRNAs in the interaction between cotton and pathogens. Ideally, such study could combine multiple strategies, such as strand-specific RNA-seq, SMRT full-length RNA-seq, and cap analysis of gene expression, in the integration of lncRNAs, so it can simultaneously investigate the alternative splicing, transcription start, termination site, expression level, and change of lncRNAs [47].
We used a comprehensive pipeline to identify and characterize lincRNAs from different cotton species and their conservation during cotton evolution, particularly the lincRNAs responding to infection of Verticillium dahliae. The pipeline is applicable to other plant species. While we have achieved what we aimed for, we also realize the limitation of the study, mainly the relatively small number of RNA-seq datasets from Verticillium dahliae inoculation experiments, which we hope can be overcome in the future when more such datasets are available.

Identification of lincRNAs in Diploid and Allotetraploid Cotton Species
In order to study evolutionary dynamics of cotton lncRNAs before and after the polyploidization event and their functional relevance to Verticillium dahliae (Vd) infection, we collected publicly available RNA-seq datasets generated from the two cultivated allotetraploid cotton species, Gossypium hirsutum (Ghr, AD 1 ) and G. barbadense (Gba, AD 2 ), and their extant diploid progenitors, G. arboreum (Gar, A 2 ) and G. raimondii (Gra, D 5 ), for lncRNA identification ( Figure 1A; Supplemental Table S1). Given that most RNA-seq data are not strand-specific, we focused on lncRNAs in the intergenic regions, i.e., lincRNAs.
To identify lincRNAs, we firstly mapped RNA-seq reads from each species to its corresponding reference genome by HISTA2 [48] with the default parameters. The four cotton reference genomes, Ghr.TM-1.HAU_v1.1, Gba.AD2.HAU_v2_a1, Gar.CRI-updated_v1, and Gra.D5.JGI_v2_a2.1 [34,49,50], and their annotation files were downloaded from Cotton-Gen (https://www.cottongen.org; accessed on 29 July 2022) [51]. The mapped reads of each sample/replicate were then assembled by the genome guided software StringTie (McKusick-Nathans Institute of Genetic Medicine, Baltimore, MD, USA) [52], and the assembled transcriptomes of the same cotton species were merged using the merge module in StringtTie (StringTie -merge). The assembled transcripts of each cotton species were filtered by length and protein coding potential using the following criteria to have the final set of lincRNAs. We removed transcripts with a length less than 200 bp and with the predicted shortest open reading frame (ORF) longer than 100 amino acids. We also used blastx to query the non-redundant protein sequences (NR) of NCBI (National Center for Biotechnology Information) and filtered out the transcripts with a hit using the cut-off threshold of E < 10 −10 . The remaining transcripts were further assessed by the coding potential test software CPC2 [53] and compared to the annotated protein-coding transcripts of the corresponding genome to remove those with a match.

Identification of Homologous lincRNAs
The lincRNAs from each of the two cultivated allotetraploid cotton, Ghr and Gba, were separated into two groups based on their subgenome (A t and D t ) origin, and homologous lincRNA analysis was done amongst the six (sub)genomes, i.e., Gar (A 2 ), Gra (D 5 ), Ghr_A t , Ghr_D t , Gba_A t , and Gba_D t .
We defined homologous lincRNAs based on pairwise comparison amongst the six (sub)genomes with three complementary approaches: sequence similarity by blastn, synteny and positional conservation based on flanking PCGs by MCScanX (https://github. com/wyp1125/MCScanx; accessed on 29 July 2022), and whole-genome alignment by Lastz (https://lastz.github.io/lastz/; accessed on 29 July 2022). Firstly, the repeat masked lincRNA sequences from each (sub)genome were reciprocally compared with each other by BLAST 2.4.0+ (-evalue 1 × 10 −5 -num_threads 10 -max_target_seqs 1 -word_size 8 -strand plus -outfmt 6). LincRNA sequences from two (sub)genomes with an alignment E-value < 10 −5 were considered to be the best hits and homologs [54]. Secondly, MC-ScanX [55] was used to identify syntenic lincRNAs in two (sub)genomes based on their flanking syntenic PCGs by pairwise comparison. We considered three PCGs at each side of a given lincRNA. A lincRNA that was found in two (sub)genomes, flanked by a minimum of one syntenic PCG on each side, and has a total of at least three syntenic PCGs, was defined as syntenic lincRNA [56]. Thirdly, lincRNAs of the query (sub)genome were lifted to the target one by UCSC LiftOver (https://genome.ucsc.edu/goldenPath/help/hgTracksHelp. html#Liftover; accessed on 29 July 2022) with the assistance of chain files, which were generated by whole-genome alignment using Lastz, were used to translate the syntenic regions from one (sub)genome to another, to find homologous lincRNA pairs between the two (sub)genomes. Lastly, for each comparison, the homologous lincRNAs identified by the three approaches were merged to have a final list of homologous lincRNAs.
The homologous lincRNAs were assigned to four groups, syntenic and transcribed (ST), syntenic and allelically transcribed (SAT), positional conservation (PC), and genomespecific (GS), based on their conservation level. Syntenic and transcribed represents the situation where a pair of expressed homologous lincRNAs were identified in the syntenic position of two (sub)genomes. Syntenic and allelically transcribed represents the scenario where an expressed lincRNA was found in the syntenic region but only in one of the two (sub)genomes. Positional conservation means that a pair of expressed lincRNAs were identified in the syntenic region of two (sub)genomes, but their sequence homology has eroded to a point of insignificance. Genome-specific refers to those lincRNAs that were identified only in one (sub)genome and do not share position and sequence similarity with any lincRNA from another (sub)genome.
The ST and PC lincRNAs were further assigned into families. To that end, a lincRNA sequence similarity network was built to connect homologous lincRNAs from individual (sub)genomes. An unsupervised graph-cluster algorithm (MCL, https://micans.org/ mcl/; accessed on 29 July 2022) was then used to identify lincRNA cluster within the constructed network with the parameter -abc -I 2.0. Each cluster of homologous lincRNAs was designated a lincRNA family that was then assigned to one of the three families: one-to-one (one2one), one-to-many (one2many), and many-to-many (many2many), based on the number of homologous lincRNA(s) in each of the individual (sub)genomes from which the lincRNA(s) were identified. If a lincRNA has only a single homologous copy in all (sub)genomes with the homologous lincRNA identified, the cluster contains these homologous lincRNAs was defined as a one2one family; if a lincRNA has a single copy in one (sub)genome and multiple homologs (≥2) in at least one of the other (sub)genomes, the cluster contains such homologous lincRNAs was defined as a one2many family; and if a lincRNA has multiple homologs (≥2) in all (sub)genomes from which the homologous lincRNAs were identified, the cluster containing the homologous lincRNAs was defined as a many2many family.

Quantification of lincRNA Expression and Identification of Vd-Responsive lincRNAs
The sequences of lincRNAs identified in each cotton species were merged with the annotated protein coding sequences of the same species to create a reference transcript dataset of the cotton species, which was indexed by Kallisto software (https://pachterlab. github.io/kallisto/about; accessed on 29 July 2022) with default parameters [57]. The expression level (TPM) of lincRNAs and PCGs in each sample was determined by Kallisto software based on the indexed transcripts. The average value of the replicated samples was used to represent the final expression level of lincRNAs and PCGs if replicates were available. For the RNA-seq datasets generated from Vd-infection experiments, the raw read count of lincRNAs and PCGs were used in identification of differentially expressed (DE) lincRNAs and PCGs (DEGs) by DESeq2 [58] with the criteria of q-value < 0.05 and |log2(FC)| ≥ 1. PCGs and lincRNAs with a TPM < 0.5 in at least one sample were eliminated in the DE analysis. The DE lincRNAs and DEGs identified were considered as Vd-responsive. Tissue specificity index (TSI) was used to quantify the expression specificity of lincRNAs and PCGs, and generated by using the methodology previously described [59]. The value of TSI is between 0 and 1, with 1 and 0 meaning tissue-specific and broadly expressed, respectively.

Analysis of Cis Targets of Vd-Responsive lincRNAs
LincRNAs regulate their target genes in cis and/or in trans. Here, our focus was on the potential cis targets of Vd-responsive lincRNAs. To identify such cis targets, the expression changes of the nearest left and right neighboring PCGs of Vd-responsive lincRNAs in response to Vd infection were calculated, and the PCGs co-differentially expressed with their neighboring Vd-responsive lincRNAs were considered as cis targets of the lincRNAs and were subjected to GO-enrichment analysis using GOseq [60]. GO analysis was also done for the flanking genes of the ST and PC lincRNAs of the one2one family.

Analysis of Association between Conservation of lincRNAs and Their Vd-Responsiveness
To investigate expressional and evolutionary dynamics of Vd-responsive lincRNAs, we first overlapped homologous lincRNAs, which were defined based on the pipeline described in Section 2.2 with differentially expressed lincRNAs in response to Vd infection, then used Fisher's exact test (p-value) to judge whether homologous lincRNAs are overrepresented in Vd-responsive lincRNAs. Fisher's exact test was also used to test the association between the Vd-responsive lincRNAs and their conservation, i.e., enrichment of the Vd-responsive lincRNAs in the ST, SAT, PC, and GS four types of lincRNAs.

Identification of Endogenous Target Mimicry of miR482/2118
We used in-house script, coded based on the criteria previously described [61], and psMimic [62] to identify endogenous target mimicry (eTM) of miR482 and miR2118 [46], the two miRNAs targeting genes encoding nucleotide-binding leucine-rich repeat receptors (NLRs).

Conclusions
By using RNA-seq datasets from diverse sources, including Vd-infection experiments, we identified a number of lincRNAs in the two allotetraploid cotton species (Ghr and Gba) and their ancestral diploids (Gar and Gra). Most lincRNAs are species-specific, despite many more conserved lincRNAs being found between the closely related subgenomes of Ghr and Gba than between the remotely related (sub)genomes. Vd responsiveness of lincRNAs is positively correlated with their conservation level, so many Vd-responsive Ghr-lincRNAs overlap with Vd-responsive QTL, and several lincRNAs were predicted to be eTMs of miR482/2118, including a pair highly conserved between Ghr and Gba. The results presented here verified the characteristics of plant lincRNAs as previously reported, expanded the repositories of cotton lincRNAs, shed new insights on the relationship between the conservation of lincRNAs and their Vd-responsiveness, and provided candidate lincRNAs for future functional investigation.