Discovery of Four Novel ORFs Responsible for Cytoplasmic Male Sterility (CMS) in Cotton ( Gossypium Hirsutum L.) through Comparative Analysis of the Mitochondrial Genomes of Four Isoplasmic Lines

: Cytoplasmic male sterility (CMS) is an important feature for achieving heterosis in the development of hybrid crops. Mitochondria contribute to CMS, especially via mitochondrial DNA (mtDNA) rearrangements and chimeric genes. However, the mechanisms of CMS have not been fully elucidated, and the isonuclear alloplasmic lines used in previous studies have limited utility in cotton CMS research. In this study, three CMS lines (J4A ‐ 1, J4A ‐ 2 and J4A ‐ 3) and their isoplasmic maintainer line (J4B) were analyzed for mtDNA structural differences via high ‐ throughput sequencing. The results showed that mtDNA was conserved (with similarities higher than 99%) among the three CMS lines and their isoplasmic maintainer line. All lines harbored 36 known protein ‐ coding genes, 3 rRNAs, and 15 tRNAs. The protein ‐ coding genes with non ‐ synonymous mutations mainly encoded two types of proteins: ATPase and ribosomal proteins. Four new open reading frames (ORFs) ( orf116b , orf186a ‐ 1 , orf186a ‐ 2 and orf305a ) were identified as candidate ORFs responsible for CMS. Two of the ORFs ( orf186a ‐ 1 and orf186a ‐ 2 ) were identified as orf4 and orf4 ‐ 2 of the upland cotton CMS line 2074A (a line with Gossypium harknessii Brandegee CMS ‐ D2 ‐ 2 cytoplasm), respectively. These findings provide a reference for CMS research in cotton or other crops. orf174


Introduction
Cytoplasmic male sterility (CMS) refers to the loss of stamen function and pollination ability and is very important in the production of crop hybrids, such as rice, corn, and wheat [1][2][3]. Mitochondria are involved in CMS. Previous studies have revealed that CMS results from interactions between mitochondrial genes and nuclear genes [4], and the abnormal development of pollen is usually associated with defects in mitochondria. Mitochondria are essential, semi-autonomously replicating organelles that produce energy and encode for limited genetic information. As with any other organellar genome, plant mitochondrial genomes (mt genomes) depend on highly integrated functional coordination with the nucleus. During mitochondrial evolution, many essential genes are transferred to the host's nuclear genome or are lost, although the mt genome remains complete [5]. The mt genomes of animals are small, ranging from 15-18 kb, whereas the mt genomes of plants have evolved to vary greatly in size since the initial symbiosis between eukaryotic cells and αproteobacteria [6]. The size of angiosperm mitochondrial DNA (mtDNA) ranges from 208 kb (Brassica hirta) to 11.3 Mb (Silene conica) [5,7,8]. The large mt genomes in plants contain numerous genes and putative open reading frames (ORFs), many of which are of unknown function [9]. Additionally, higher plants have larger mt genomes sizes, with larger and more complex non-coding and repeat regions, than other organisms [10,11], and the repeated sequences can result in mtDNA rearrangement [12][13][14][15].
Chimeric ORFs result from rearranged mt genomes. In flowering plants, chimeric ORFs are thought to block mitochondrial function during critical periods of anther development, leading to the occurrence of CMS [16]. A chimeric gene is a combination of an ORF and a mitochondrial proteincoding gene that usually results in CMS. Previous studies have shown that these coding genes mainly code for ribosomal, cytochrome c oxidase or ATPase [17]. These ORF-encoded proteins usually have the same structure as proteins encoded by mitochondrial genes. In this way, they reduce ATP synthesis or interfere with energy metabolism by acting on complexes or producing toxic proteins, thus affecting pollen fertility. The first identified gene related to CMS was T-urf13 in maize, with a CMS-T-type cytoplasm; it was formed after 7 recombinations and is composed of a fragment of partial rrn26 and atp6 [18]. The mtDNA expression region associated with CMS in the rape nap-CMS line contains a pol CMS-related orf224 gene. The orf79 gene in the HL-CMS line of rice, which is located downstream of atp6, is composed of a partial fragment of coxl and a fragment of unknown origin; it was formed by the recombination of multiple genes in mtDNA [19]. In a sugar beet sterile line and maintainer line [20], several chimeric ORFs were found to be unique to one of the lines. Such differences may be due to normal differentiation between lines, yet some unique ORFs are thought to cause CMS, and studies have shown that one of these ORFs may be the CMS-related gene of sugar beet [21].
Cotton is one of the world's leading natural fiber crops and an important oil-producing plant. CMS in cotton was first described in CMS-D2, which contains Gossypium harknessii (G. harknessii) cytoplasm and was bred through distant hybridization and backcrossing. Both upland cotton and island cotton can maintain its sterility [22]. Another CMS line, CMS-D8, was developed by crossing Gossypium trilobum (G. trilobum) with Gossypium hirsutum (G. hirsutum) [23]. A new CMS line, 104-7A, which possesses the cytoplasm of the commercial G. hirsutum variety 'Shiduan 5′ was bred by interspecific hybridization between G. hirsutum and Gossypium barbadense (G. barbadense) [24]. However, the use of heterosis in cotton breeding is hindered by the limited availability of cotton germplasms with CMS [25,26]. Similar to the cytoplasm of the CMS line 104-7A, the cytoplasm of the materials (J4A-1, J4A-2 and J4A-3) used in the present study is from another upland cotton maintainer line, J4B (a major commercial cultivar developed by Hubei Gushen Science and Technology Co., Ltd., Wuhan, China, in 2010). Although cotton mt genomes were recently sequenced, the genomes of the materials used were alloplastic. A previous analysis of the mt genomes of four lines (the CMS line 2074A (G. harknessii Brandegee, CMS-D2-2 cytoplasm), 2074S (G. hirsutum L, CMS-AD1 cytoplasm), their maintainer line 2074B (a cultivar of upland cotton, "Sumian No. 20" cytoplasm) and the restorer line E5903 (normal male-fertile G. harknessii Brandegee cytoplasm) revealed 4 ORFs related to CMS [26]. Here, we used the four isoplasmic lines as materials to compare the mtDNA among three CMS lines (J4A-1, J4A-2 and J4A-3) and the maintainer line J4B to determine candidate CMS factors. The results provide insight into the intricate relationships between mtDNA and CMS generation and may stimulate further research.

Plant Materials
The CMS lines J4A-1, J4A-2 and J4A-3 and their maintainer line J4B were provided by the Key Laboratory of Plant Genetics and Breeding, College of Agriculture, Guangxi University. The three CMS lines are mutants of their maintainer line J4B, so the CMS lines and their maintainer line J4B possess almost identical nuclear and cytoplasmic genomes. Plants were cultivated at the Agricultural College of Guangxi University, Nanning city (longitude: 108°22′ N, latitude: 22°48′ N), Guangxi Province, China, in July 2016.

Sequencing and Assembly of Mitochondrial (mt) Genomes
Approximately 5 g fresh cotton leaf was collected and mtDNA was extracted using an improved method [27]. Then, 1 μg purified mtDNA was fragmented, and a short (350 bp) insertion library was constructed following Illumina's standard procedure. The cotton mt genomes were sequenced using a high-throughput sequencing platform (Illumina HiSeq 4000, Shanghai Biozeron Co., Ltd., Shanghai, China). Before assembly, SPAdes 3.10.1 [28] was used to estimate genome size based on K-mer statistical analysis. SOAP aligner software (with the default parameters -m 200 -x 600 -l 32 -v 8 -p 8) was used to compare the reads of the samples to the mt reference genome of upland cotton [29] (GenBank ID: JX065074.1) and to summarize the coverage of the reads on the mt reference genome, the coverage statistics and all additional sequencing data. Then, SOAPdenovo [30] (version 2.04) software with the default parameters (-K 121 -F -p 8) was used to assemble the sequences. Comparing the reads back to the contig obtained by the assembly and according to the relationship of the pairedend and overlap, the results were assembled and optimized locally, and then the order and direction of the contig were determined. Finally, GapCloser (version 1.12) software was used to repair the gaps in the assembly and remove the redundant segment sequences. Employing OGDRAW v1.2 [31] software, mt genome maps (as circular molecules) of the four lines were constructed.

Analysis and Annotations of the mt Genomes
The mitochondrial genes were annotated using Mitofy and AUGUSTUS software [32], and Evidence Modeler v1.1.1 was used to integrate the gene set [33]. Ribosomal RNA (rRNA) and transfer RNA (tRNA) genes were analyzed using tRNAscan-SE and Glimmer 1.2, respectively [34][35][36][37]. Repeats were assessed by RepeatMasker, and the screening criteria for long repeats were as follows: length ≥100 bp and identity greater than or equal to 97%. Using the chloroplast genome and nuclear genome of G. hirsutum as the reference, Nucleotide BLAST (BLASTN) was used to search the chloroplast source sequence and nucleus-derived insertions (identity ≥ 97% and length ≥ 100 bp). ORFs were predicted using ORFfinder and EMBOSS (6.3.1: getorf) [38,39]. Transmembrane domains of ORFs were identified by TMHMM Server version 2.0 [40].

Detection and Annotation of Single Nucleotide Polymorphisms (SNPs)
Single nucleotide polymorphisms (SNPs) were evaluated by SAMtools software [41] using two screening criteria；mapping quality >20 and variant position depth >4. SNPs in repeat regions were not included. The locations of SNPs were based on the annotation of gene models from the reference genome database [42]. SNPs were classified as exonic SNPs and intronic SNPs depending on their location. SNPs located in coding sequence regions were further divided into synonymous and nonsynonymous mutations by GeneWise [43].

Sequencing Data Statistics and Assembly
The original amount of data (Supplementary Table S1) obtained for the CMS lines (J4A-1, J4A-2 and J4A-3) and their maintainer line J4B was 1116 Mb, 1383 Mb, 1427 Mb and 1417 Mb, respectively. The GC content of the sequences of J4A-1, J4A-2, J4A-3 and J4B after filtering was 40%-42%, and the Q20 values were >95%. The K-mer number of the four lines was 19,316 kb-21,549 kb. These values indicate that the constructed database and sequences were suitable for subsequent mt genome assembly and bioinformatics analysis.

Structures and Contents of the mt Genomes
The mt genomes of J4A-1 National Center for Biotechnology Information (NCBI) accession: MN149527), J4A-2 (NCBI accession: MN149528), J4A-3 (NCBI accession: MN149526) and J4B (NCBI accession: MN149525) (the accession numbers in GenBank are not yet available) were each assembled into a single circular molecule with a total length of 623,067 bp, 623,343 bp, 622,848 bp and 621,841 bp, respectively ( Figure 1 and Supplementary Figures S1-S3), and the mt genome sizes of the CMS lines were larger than the genome size of their maintainer line ( Table 1). The number of predicted genes differed among J4A-1, J4A-2, J4A-3 and J4B, being 191, 189, 191 and 186, respectively. Additionally, the total length of genes in each CMS line was greater than that in the maintainer line J4B, and intergenic sequences accounted for approximately 83% of the genome sequence in all lines.  A total of 36 protein-coding genes, 3 rRNAs, and 15 tRNAs were predicted in the four lines (Table 2). Among the 36 protein-coding genes, multiple copies were detected for 3 (nad1 with 4 copies, nad2 with 2 copies, and nad5 with 3 copies), and 20 genes were identified as being involved in the oxidative phosphorylation system and electron transport. The genes involved in the electron transport system comprised 9 NADH-ubiquinone oxidoreductase (complex I) genes (nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7 and nad9), two succinate dehydrogenase (complex II) genes (sdh3 and sdh4), one cytochrome bc1 complex (complex III) gene (cob), three complex IV cytochrome c oxidase genes (cox1, cox2 and cox3), and five ATP synthase genes (atp1, atp4, atp6, atp8, and atp9). Moreover, 10 genes were identified that encode ribosomal proteins, including 6 ribosomal small subunit protein genes (rps3, rps4, rps7, rps10, rps12 and rps14) and four ribosomal large subunit protein genes (rpl2, rpl5, rpl10 and rpl16). Among the ribosomal proteins, only rps3 and rps10 had a single intron. In addition, four cytochrome c biological origin-related genes (ccmB, ccmC, ccmFC and ccmFN) were identified. One maturase gene (matR) and one mttB gene coding for a transporter were identified.

Repeat Sequences in the mt Genome
Repeats are DNA sequences with multiple copies in the genome. Previous studies showed that some repeat sequences were involved in rearrangements of the mt genome, which can also result in CMS [2]. The lengths of the repeats in J4A-1, J4A-2, J4A-3 and J4B (length ≥100 bp, identity ≥97%) varied from 100 bp to 27,501 bp (Table 3 and Supplementary Tables S2-S5), and no significant differences in the number of repeat sequences among the four lines were observed. The total length of the repeat sequences in J4A-1, J4A-2, J4A-3 and J4B accounted for 10.60%, 10.77%, 12.60% and 10.90%, respectively, of the total genome length. Among the large (>500 bp) repeats, a 9284 bp repeat sequence (A3R5) was found only in J4A-3. In addition, A3R5 is a copy of A3R4 with a 464 bp deletion at the 5′ end. The mtDNA of J4A-3 contains the 9748 bp repeat sequence A3R4, whereas in the other three lines, this sequence is separated into two sequences (8898 bp and 906 bp, 8926 bp and 906 bp, and 8915 bp and 906 bp) by an unknown sequence. The total repeat sequence lengths of J4A-1, J4A-2, J4A-3 and J4B were 66 kb, 67 kb, 75 kb, and 67 kb, respectively, with copy numbers varying from 2 to 4. There were three large repeats that exceeded 10 kb in all lines. Other small repeats were distributed throughout the genome and variable in copy number (Figure 2). The distributions of repeat sequence sizes of J4A-1, J4A-2, J4A-3 and J4B were approximately the same, and fragments 100-200 bp in length were the most common, followed by those with a length of 200-300 bp.

Similarity of mtDNA with the Cotton Chloroplast and Nuclear Genomes
Insertions of DNA from other genomes, i.e., the chloroplast and nuclear genomes, into mtDNA have been reported for almost all completely sequenced plant genomes. In the present study, no difference in the chloroplast-like sequences was observed among J4A-1, J4A-2, J4A-3 and J4B (Supplementary Table S6). The total length of three sequence fragments (cp1, cp2, and cp3) was 7116 bp, accounting for 1.14% of the total genome size. Two chloroplast-like sequences, namely, cp1 and cp2, were similar to a hypothetical protein of G. hirsutum. In contrast, cp3 had no similar sequences in the NCBI database.
In addition to containing the plastid genome, plant mtDNA includes large amounts of fragments that potentially derive from the nuclear genome. The total length of nucleus-derived sequences in J4A-1, J4A-2, J4A-3 and J4B was 495,108 bp, 493,822 bp, 495,002 bp and 498,095 bp, accounting for 79.46%, 79.22%, 79.47% and 80.10% of the total genome size, respectively (Supplementary Table S7). Among the four cotton lines, the distribution of nuclear-like sequences among the 26 chromosomes was uneven, and the chromosome to which longest sequence aligned was chromosome 15, with a sequence length of 271.4-274.0 kb, accounting for approximately 43.57%-44.06% of the total chromosome length. In contrast, the shortest sequence aligned to chromosome 14, with a length of 4.53-4.63 kb, accounting for 0.73%-0.74% of the total chromosome length (Figure 3).

Statistical Analysis and Annotation of SNPs
The SNP variant sites of J4A-1, J4A-2, J4A-3 and J4B mtDNA were obtained by comparison with the reference sequence of the mt genome of upland cotton. In this study, 15 SNPs in 13 protein-coding genes were identified in the three CMS lines (with the SNPs being identical among the three CMS lines) relative to their maintainer line J4B . Five genes (apt4, cox1, nad2, nad7 and rpl16) contained synonymous SNPs, and 8 genes (atp8, cox3, matR, rpl2, rpl5, rpl10, rps4 and sdh3) contained nonsynonymous SNPs ( Table 5). The SNPs of rpl2 were the most variable and included two nonsynonymous mutations. However, the other genes had only one SNP. Moreover, all of the SNPs were transversions and most were A-C conversions, accounting for 80% of the total SNPs. Interestingly, compared with their maintainer line, the CMS lines contained more variations in ribosomal proteincoding genes, such as rpl2, rpl5, rpl10, rpl16 and rps4. In addition, evolutionary rate analysis revealed that the non-synonymous mutation (dN) rate for 6 protein-coding genes (atp4, cox1, cox2, ccmFC, nad4 and nad7) was lower than the synonymous mutation (dS) rate, making dS/dN greater than 1 and indicating positive selection on these genes [44]. Additionally, the dS/dN ratios of four genes (cox3, rpl2, matR and nad3) were less than 1, indicating purifying selection (Supplementary Table S10).

Discussion
Previous studies of CMS in cotton utilized genetic backgrounds with different cytoplasms at the mt genome level, such as 2074A and 2074S [25], for comparative analyses. Although these studies identified some meaningful differences, the materials used were alloplasmic lines. In the present study, the CMS lines and their maintainer line were isogenic, with almost the same mtDNA and nuclear genome. Therefore, the four lines used herein are ideal materials for investigating candidate CMS mitochondrial genes in cotton.

Characteristics of Plant Mitochondrial Genes
Unlike the chloroplast genome, which is very stable, the mt genome varies greatly, even at the subspecies level [20,[45][46][47]. However, the mt genomes of J4A-1, J4A-2, J4A-3 and J4B were almost identical, with similarities higher than 99%, confirming the highly consistent genetic background of these four lines. The difference in mt genome size among the four lines was only approximately 1 kb, and all four mt genomes contained 36 genes.
Repeated sequences are identical to symmetrical fragments that occur at different locations in the genome and include forward repeat sequences and palindromic repeat sequences. Repeats are ubiquitous in plant mitochondria and show strong polymorphism. Moreover, mt genome sizes in plants are large because of the presence of long repeats [37][38][39]. Generally, larger genomes consist of larger numbers of repeat sequences. However, this is not always the case. For example, the total mt genome size Cucurbita is 371 kb, but 38% of the genome comprises repeat sequences [48], while Vitis has a relatively large mt genome (approx. 773 kb), only 7% of which comprises repeat sequences [49]. Here, the total size of repeats in J4A-1, J4A-2, J4A-3 and J4B was 66,058 bp, 67,156 bp, 75,134 bp and 67,514 bp, representing 10.60%, 10.77%, 12.60% and 10.90%, respectively, of the genome. Among the four lines, J4A-3 had the largest proportion of repetitive sequences. In addition, the repeat sequences of the four lines were highly homologous (having similarities greater than 95%), with higher homology than that of non-homologous cytoplasmic materials (2074A, 2074S and E5903) [26].
Genes of the chloroplast genome and nuclear genome, even from other plants, can be transferred horizontally to the mt genome, and horizontal gene migration is the major route for obtaining foreign genes [49][50][51]. However, most genes that migrate to the mt genome are non-functional. In the present study, three chloroplast-like sequences (cp1, cp2 and cp3) in all four lines were found to be nonfunctional.

ORFs Associated with Cytoplasmic Male Sterility (CMS)
In many plants, CMS is associated with abnormal ORFs in the mitochondria, and the occurrence of CMS in cotton plays an important role in cotton breeding. In some plants, a common feature of ORFs associated with CMS is a location upstream or downstream of a protein-coding gene and cotranscription with the gene, which leads to improper functional transcription and translation [3,[52][53][54][55][56]. The location, origin, conservation, and function of predicted ORFs were analyzed in this study because new ORFs may be related to CMS. Our analysis revealed four novel chimeric ORFs (orf116b, orf186a-1, orf186a-2 and orf305a), that contain transmembrane domains and are located near known genes. These ORFs are thus candidate ORFs involved in CMS and are similar to the CMS-related ORFs in other CMS lines, such as S-orf355/orf77 [5], orf224 of rape [57][58][59], orf256 of wheat [60], orf125 of radish [52], and orf4 and orf4-2 of cotton [26]. In this study, two of four CMS-associated ORFs (orf186a-1 and orf186a-2) showed 99% identity with orf4 and orf4-2 of cotton CMS line 2074A (Supplementary Figure S4), which further supports a potential CMS-related role of orf186a-1 and orf186a-2. Orf116b is common and specific to the three sterile lines; therefore, it may also play a role in CMS. Although chimeric ORFs are common in plant mt genomes, their potential functions at specific locations in specific plants have not been explored. Therefore, to determine whether the products of these genes cause CMS, further characterization is necessary. The current results are based only on genomic data, and further analysis, such as overexpression and knockout of these genes, should be performed.

SNPs and Plant Mitochondrial DNA Evolve
Although plant mt genomes change rapidly in structure, the mutation rate is very low [61]. SNPs are mutations involving single nucleotides in the genome and are highly polymorphic. Compared with that of nuclear genes, the mutation rate of coding sequences (CDSs) in plant mitochondrial genes is relatively low [62]. Essential genes in the plant mt genome are highly conserved and exhibit a very low evolutionary rate. The present study identified 15 non-synonymous SNPs in 13 protein-coding genes of the CMS lines, which is generally consistent with the findings of previous research [26] and implies that these non-synonymous substitutions play important roles in CMS.

Conclusions
In summary, the materials used in this study are new CMS lines. Through the analysis of mt genomes, we found 36 protein-coding genes, 3 rRNAs, and 15 tRNAs in the four lines, which were largely identical to those in the reference mt genome of G. hirsutum (JX065074.1). SNP analysis revealed 15 SNPs in 13 protein-coding genes, including 9 non-synonymous mutations and 6 synonymous mutations. Most importantly, there were 18, 15 and 18 ORFs in the J4A-1, J4A-2 and J4A-3 CMS lines, respectively, which were absent in J4B. Among these ORFs, 4 showed characteristics related to CMS, and 2 were consistent with the results of previous studies. These results can inform future research on CMS in cotton and other species. However, whether these four ORFs are truly related to CMS and their mechanisms of action require further study.