Gene Loss and Evolution of the Plastome

Chloroplasts are unique organelles within plant cells and sustain life on Earth through their ability to conduct photosynthesis. Multiple functional genes within the chloroplast are responsible for a variety of metabolic processes that occur in the organelle. Considering this fundamental role, it is important to identify the level of diversity present in the chloroplast genome, which genes and genomic content have been lost, which genes have been transferred to the nuclear genome, duplication events, and the overall origin and evolution of the chloroplast genome. Our analysis of 2511 chloroplast genomes indicated that the genome size and number of coding DNA sequences (CDS) in the chloroplast genomes of algae are higher relative to other lineages. Approximately 10.31% of the examined species, spanning all lineages, have lost the inverted repeats (IR) in the chloroplast genome. Genome-wide analyses revealed that the loss of the Rbcl gene in parasitic and heterotrophic plants occurred approximately 56 Ma ago. PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 were found to be characteristic signature genes of the chloroplast genomes of algae, bryophytes, pteridophytes, and gymnosperms; none of these genes were found in the angiosperm or magnoliid lineage, which appears to have lost them approximately 203–156 Ma ago. A variety of chloroplast-encoded genes were lost across different species lineages throughout the evolutionary process. The Rpl20 gene, however, was found to be the most stable and intact gene in the chloroplast genome and was not lost in any of the analyzed species, suggesting that it is a signature gene of the plastome. Our evolutionary analysis indicated that chloroplast genomes evolved from multiple common ancestors ~1293 Ma ago and have undergone extensive recombination events across different taxonomic lineages.


Introduction
Photosynthesis is a process by which autotrophic plants utilize chlorophyll to transform solar energy into chemical energy [1]. Almost all life forms depend directly or indirectly on this chemical energy as a source of energy to sustain growth, development, and reproduction of their species [2,3].
genome. The annotated coding DNA sequences (CDS) in each chloroplast genome were downloaded, and the presence or absence of each CDS was searched in each individual genome using Linux programming. Species identified as lacking a gene in their chloroplast genome were noted and rechecked manually in the NCBI database. Each chloroplast genome was also newly annotated using the GeSeq pipeline for the annotation of organellar genomes to further extend the study of gene loss in chloroplast genomes [40]. The combined results of the NCBI and GeSeq annotations were considered in determining the absence of a particular gene in a chloroplast genome.
The CDS of the nuclear genome of 145 plant species were downloaded from the NCBI database. The presence of chloroplast-encoded genes in the nuclear genome was determined using Linux-based commands and collected in a separate file. The chloroplast-encoded genes present in the nuclear genomes were further processed in a Microsoft Excel spreadsheet.
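The presence/absence screen described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Linux commands used in the study: the function names `presence_absence` and `genomes_lacking`, the toy annotations, and the case-insensitive comparison of gene symbols are all assumptions made for the example.

```python
from typing import Dict, List, Set


def presence_absence(annotations: Dict[str, Set[str]],
                     genes: List[str]) -> Dict[str, Dict[str, bool]]:
    """Return, for every genome, whether each gene of interest is annotated."""
    table = {}
    for genome, annotated in annotations.items():
        # Assumption: gene symbols are compared case-insensitively (psaM == PsaM).
        normalized = {name.lower() for name in annotated}
        table[genome] = {gene: gene.lower() in normalized for gene in genes}
    return table


def genomes_lacking(table: Dict[str, Dict[str, bool]], gene: str) -> List[str]:
    """List the genomes in which the given gene was not found, sorted by name."""
    return sorted(g for g, row in table.items() if not row[gene])


# Hypothetical toy annotations; real input would be CDS lists from NCBI and GeSeq.
toy = {
    "genome_A": {"rbcL", "rpl20", "psaM"},
    "genome_B": {"rpl20", "chlB"},
    "genome_C": {"rbcL", "rpl20"},
}
table = presence_absence(toy, ["rbcL", "rpl20", "psaM"])
print(genomes_lacking(table, "rbcL"))
print(genomes_lacking(table, "rpl20"))
```

Genomes flagged by such a screen would still need the manual recheck against the database that the text describes, since a missing annotation does not by itself prove that a gene is absent.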

Multiple Sequence Alignment and Creation of Phylogenetic Trees
Prior to the multiple sequence alignment, the CDS of PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 were converted to amino acid sequences using the Sequence Manipulation Suite (http://www.bioinformatics.org/sms2/translate.html) [41]. The resulting protein sequences were subjected to multiple sequence alignment using the Multalin server to identify conserved amino acid motifs [42]. The CDS of the PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 genes were also subjected to multiple sequence alignment using Clustal Omega. The resultant aligned file was downloaded in Clustal format and converted to the MEGA file format using MEGA6 software [43]. The converted MEGA files of PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 were subsequently used for the construction of phylogenetic trees. Prior to tree construction, model selection was carried out in MEGA6 using the following parameters: analysis, model selection; tree to use, automatic (neighbor-joining tree); statistical method, maximum likelihood; substitution type, nucleotide; gaps/missing data treatment, partial deletion; site coverage cut-off (%), 95; branch swap filter, very strong; and codons included, 1st, 2nd, and 3rd. Based on the lowest BIC (Bayesian information criterion) score, the following statistical parameters were used to construct the phylogenetic trees: statistical method, maximum likelihood; test of phylogeny, bootstrap method; number of bootstrap replications, 1000; model/method, general time-reversible model; rates among sites, gamma-distributed with invariant sites (G+I); number of discrete gamma categories, 5; gaps/missing data treatment, partial deletion; site coverage cut-off (%), 95; ML heuristic method, nearest-neighbor-interchange (NNI); branch swap filter, very strong; and codons included, 1st, 2nd, and 3rd. The resulting phylogenetic trees were saved as gene trees.
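The BIC-based model choice mentioned above follows the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of sites, and L the maximized likelihood. The short Python sketch below illustrates how the lowest-BIC model is picked; the log-likelihoods, parameter counts, and site number are hypothetical values chosen for the example and are not results from this study.

```python
import math
from typing import Dict, Tuple


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def best_model(candidates: Dict[str, Tuple[float, int]], n_sites: int) -> str:
    """Return the name of the candidate model with the lowest BIC."""
    return min(candidates, key=lambda name: bic(*candidates[name], n_sites))


# Hypothetical (log-likelihood, number of free parameters) pairs per model.
candidates = {
    "JC": (-5200.0, 10),
    "HKY+G": (-5050.0, 15),
    "GTR+G+I": (-5000.0, 20),
}
for name, (lnl, k) in candidates.items():
    print(name, round(bic(lnl, k, n_sites=1000), 2))
print("selected:", best_model(candidates, n_sites=1000))
```

The penalty term k ln(n) is what keeps a parameter-rich model such as GTR+G+I from being chosen unless its likelihood gain is large enough to justify the extra parameters.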
Whole-genome sequences of chloroplast genomes were also collectively used to construct a phylogenetic tree to gain insight into the evolution of chloroplast genomes. The ClustalW program was used on a Linux-based platform to construct the phylogenetic tree of chloroplast genomes using the neighbor-joining method and 500 bootstrap replicates. The resultant Newick file was uploaded to Archaeopteryx (https://sites.google.com/site/cmzmasek/home/software/archaeopteryx) to view the phylogenetic tree. A separate phylogenetic tree of species with IR-deleted regions was also constructed from the whole sequences of the IR-deleted chloroplast genomes using parameters similar to those described above. The evolutionary time of the plant species used in this study was estimated using TimeTree [44]. Cyanobacterial species were used as an outgroup to calibrate the time tree for the other species.

Analysis of the Deletion and Duplication of Chloroplast-Encoded Genes
A species tree was constructed using the NCBI taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi) prior to the study of deletion and duplication of the PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 genes. The gene tree of each individual gene family was uploaded to Notung software v.2.9, followed by the species tree, and the gene tree was then reconciled with the species tree [45–47]. Once reconciled, deletion and duplication events for the genes were visualized and noted.
Genes 2020, 11, 1133

Recombination Events and Time Tree Construction of the Chloroplast Genome
The constructed phylogenetic tree of chloroplast genomes was uploaded to IcyTree [48] to analyze the recombination events that occurred in chloroplast genomes. The recombination events in IR-deleted and non-IR-deleted species were studied separately. The time tree of the studied species was constructed using the TimeTree program [44].

Substitution Rate in Chloroplast Genomes
Chloroplast genomes were divided into groups to determine lineage-specific nucleotide substitution rates. The groups were algae, bryophytes, gymnosperms, eudicots, monocots, magnoliids, Nymphaeales, protists, and IR-deleted species. At least 10 chloroplast genomes were included for each lineage when analyzing the rate of nucleotide substitutions. The full-length sequences of chloroplast genomes were subjected to multiple sequence alignment to generate a Clustal file. The MAFFT multiple alignment pipeline was used to align the sequences of the different chloroplast genomes. The aligned sequences of individual lineages were downloaded and converted to the MEGA file format using MEGA6 software [43]. The converted files were subsequently uploaded to MEGA6 to analyze the rate of nucleotide substitution. The following statistical parameters were used to analyze the rate of substitution in chloroplast genomes: analysis, estimate transition/transversion bias (MCL); scope, all selected taxa; statistical method, maximum composite likelihood; substitution type, nucleotides; model/method, Tamura-Nei model; and gaps/missing data treatment, complete deletion.
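To make the transition/transversion distinction concrete, the sketch below simply counts observed transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse) between two aligned sequences. It is a deliberately simplified illustration and is not the maximum composite likelihood estimator under the Tamura-Nei model that MEGA6 applies; the function name `count_ts_tv` and the two short sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1: str, seq2: str):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip identical sites and any site holding a gap or ambiguous base.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


# Hypothetical aligned fragments ("-" marks an alignment gap).
ts, tv = count_ts_tv("ACGTAC-TA", "GCATATGTT")
print("transitions:", ts, "transversions:", tv)
if tv:
    print("observed ts/tv ratio:", ts / tv)
```

Raw counts of this kind underestimate the true ratio when sequences are divergent, because multiple substitutions at the same site go unobserved; model-based estimators are used to correct for this.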

Statistical Analysis
Principal component analysis and the probability distribution of chloroplast genomes were conducted using Unscrambler software version 7.0 and Venn diagrams were constructed using InteractiVenn (http://www.interactivenn.net/) [49].

The Genomic Features of Chloroplast Genomes Are Diverse and Dynamic
A study of 2511 chloroplast genomes was conducted to gain insight into the genomic structure and evolution of the chloroplast genome. The analysis included the complete genome sequences of algae, Austrobaileyales, bryophytes, Chloranthales, corals, eudicots, Flacourtiaceae, gymnosperms, magnoliids, monocots, Nymphaeales, Opisthokonta, protists, pteridophytes, and an unclassified chloroplast genome (Supplementary File S1). A comparison of the analyzed genomes indicated that Haematococcus lacustris encoded the largest chloroplast genome, comprising 1.352 Mb, whereas Pilostyles aethiopica encoded the smallest, comprising only 0.01134 Mb (Figure 1), followed by Pilostyles hamiltoni (0.01516 Mb) and Asarum minus (0.0155 Mb). The overall average size of the chloroplast genome was 0.152 Mb. The average size (Mb) of the chloroplast genome in the different plant groups, in descending order, was 0.164 (algae), 0.160 (Nymphaeales), 0.154 (eudicots), 0.154 (magnoliids), 0.149 (pteridophytes), 0.144 (monocots), 0.134 (bryophytes), 0.131 (gymnosperms), and 0.108 (protists). The average chloroplast genome size in algae (0.164 Mb) and the Nymphaeales (0.160 Mb) was larger than that in eudicots (0.154 Mb), monocots (0.144 Mb), and gymnosperms (0.131 Mb). The average size of the protist chloroplast genome (0.108 Mb) was the smallest. Principal component analysis (PCA) of the chloroplast genome size of algae, bryophytes, eudicots, gymnosperms, magnoliids, monocots, Nymphaeales, protists, and pteridophytes revealed a clear distinction between the different plant groups (Figure 2). The chloroplast genome sizes of gymnosperms and bryophytes grouped together, as did those of eudicots, magnoliids, and pteridophytes. In contrast, the algae and protists grouped independently (Figure 2). This suggests that the chloroplast genomes of algae and protists might have evolved from their respective common ancestors.
chloroplast genome and they fall distantly in the PCA plot. This suggests that the evolution of the chloroplast genome and CDS number of algae and protists share a slightly similar trend compared to other plant species. However, they might have evolved from their respective ancestors.
PCA revealed that the percentage GC content of eudicots, gymnosperms, magnoliids, monocots, and Nymphaeales grouped together, and the percentage GC content of algae and protists grouped together (Figure 5). The percentage GC content of bryophytes and pteridophytes did not group with that of the algae and protists or that of the eudicots, gymnosperms, magnoliids, monocots, and Nymphaeales (Figure 5). The GC content of algae and protists showed that they share a common evolutionary trend with regard to genome size, CDS number, and GC content. Algae and protists are evolutionarily closer to each other than to the other lineages.
present at the right side of the figure represents the GC content of Trebouxiophyceae sp. MX-AZ01, which contains 57.66% GC nucleotides; however, the green dot present at the upper left part of the figure represents the lower GC content (23.25%) of Bulboplastis apyrenoidosa.
Figure 5. Principal component analysis of GC content of the chloroplast genomes. The GC content of algae and protists and that of gymnosperms, magnoliids, monocots, eudicots, and Nymphaeales grouped together; however, the GC content of the bryophytes and pteridophytes fall distantly.

PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 Are Chloroplast Genes Characteristic of Algae, Bryophytes, Pteridophytes, and Gymnosperms
The PsaM protein is subunit XII of photosystem I. Among the 2511 studied species, 84 were found to possess the PsaM gene. All of the species found to possess the PsaM gene belonged to algae, bryophytes, pteridophytes, and gymnosperms (Supplementary File S2). Notably, none of the species in the angiosperm lineage possessed the PsaM gene, clearly indicating that the PsaM gene was lost in the angiosperm lineage.
The PsaM protein was found to contain the characteristic conserved amino acid motif Q-[…] (Supplementary Figure S1). A few species, including Cephalotaxus, Podocarpus totara, Retrophyllum piresii, Dacrycarpus imbricatus, Glyptostrobus pensilis, T. distichum, Cryptomeria japonica, Pinus contorta, Pinus taeda, and Ptilidium pulcherrimum, however, did not contain this conserved amino acid motif. Instead, they possessed the conserved motif F-x-S-x3-C-F-x4-F-S-x2-I (Supplementary Figure S1). Phylogenetic analysis revealed that PsaM genes grouped into five independent clusters, suggesting that they have evolved independently from multiple common ancestral nodes (Supplementary Figure S2A). Duplication and deletion analysis of PsaM genes revealed that deletion events were more prominent than the duplication or codivergence events (Table 1). Among the 84 analyzed PsaM genes, 12 underwent duplication and 34 underwent deletions, while 34 underwent codivergence (Table 1, Supplementary Figure S2B). A total of 164 species were found to possess the Psb30 gene, and all of these species belonged to algae, bryophytes, pteridophytes, or gymnosperms (Supplementary File S2). Psb30 was absent in the chloroplast genome of angiosperms. Multiple sequence alignment revealed the presence of a conserved consensus amino acid sequence, N-x-E-x3-Q-L-x2-L-x6-G-P-L-V-I (Supplementary Figure S3). Phylogenetic analysis of Psb30 genes resulted in the designation of two major clusters and six minor clusters, suggesting that the gene evolved from multiple common ancestral nodes (Supplementary Figure S4A). Deletion/duplication analysis indicated that 39 Psb30 genes underwent a duplication event and 120 underwent a deletion event, while 49 were found to be codiverged (Table 1, Supplementary Figure S4B).
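Motifs written in this notation, where x stands for any residue and xN for N consecutive arbitrary residues, map directly onto regular expressions, which makes it straightforward to screen translated sequences for them. The following Python sketch shows one way to do the conversion; the helper name `motif_to_regex` and the test peptide are hypothetical and serve only to illustrate the notation.

```python
import re


def motif_to_regex(motif: str) -> str:
    """Convert a motif such as 'F-x-S-x3-C' into a regular expression."""
    parts = []
    for token in motif.split("-"):
        if token == "x":
            parts.append(".")                    # any single residue
        elif token.startswith("x") and token[1:].isdigit():
            parts.append(".{%s}" % token[1:])    # N arbitrary residues
        else:
            parts.append(re.escape(token))       # a fixed residue
    return "".join(parts)


pattern = motif_to_regex("F-x-S-x3-C-F-x4-F-S-x2-I")
print(pattern)

# Hypothetical peptide built to contain the motif; not a real PsaM sequence.
peptide = "MMFASLLLCFAAAAFSGGIKK"
match = re.search(pattern, peptide)
print("motif found" if match else "motif absent")
```

The same helper can be applied to the Psb30 consensus N-x-E-x3-Q-L-x2-L-x6-G-P-L-V-I without modification.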
ChlB encodes a light-independent protochlorophyllide reductase. A total of 288 of the examined chloroplast genome sequences, from protists, algae, bryophytes, pteridophytes, and gymnosperms, were found to possess a ChlB gene (Supplementary File S2). The ChlB gene was absent in species in the Chloranthales, coral, and angiosperm lineages. Multiple sequence alignment revealed the presence of several highly conserved amino acid motifs (Supplementary Figure S8). At least seven conserved motifs were identified, including A-[…] (Supplementary Figure S5). Phylogenetic analysis indicated that ChlB genes grouped into two major clusters and 13 minor clusters, reflecting multiple evolutionary nodes (Supplementary Figure S6A). ChlB genes were composed of a few groups. Specifically, deletion and duplication analysis revealed that 35 ChlB genes underwent duplications and 126 underwent deletions, while 116 exhibited codivergence in their evolutionary history (Table 1, Supplementary Figure S6B).
Analysis of the chloroplast genome sequences identified 303 species that possess ChlL genes (Supplementary File S2). All of the identified species possessing the ChlL gene belonged to algae, bryophytes, gymnosperms, protists, and pteridophytes. None of the taxa in the angiosperm or magnoliid lineage were found to possess a ChlL gene. Within the protist lineage, only species in the genera Nannochloropsis, Vaucheria, Triparma, and Alveolata encode a ChlL gene. Multiple sequence alignment revealed the presence of several highly conserved amino acid motifs, […] (Supplementary Figure S7). The phylogenetic analysis indicated that ChlL genes grouped into one major independent cluster and 11 minor clusters, suggesting that they also evolved independently from different common ancestors (Supplementary Figure S8A). Deletion and duplication analysis indicated that 49 ChlL genes underwent duplication events and 184 underwent deletions, while 100 ChlL genes exhibited codivergence (Table 1 […] Figure S9). Phylogenetic analysis revealed that ChlN genes grouped into two independent clusters (Supplementary Figure S10A). No lineage-specific grouping, however, was identified in the phylogenetic tree. Deletion and duplication analysis indicated that eight ChlN genes underwent duplication events, 46 underwent deletion events, and 34 genes exhibited codivergence (Table 1, Supplementary Figure S10B).
The chloroplast genomes of at least 137 of the examined species, belonging to algae, bryophytes, pteridophytes, and gymnosperms, were found to possess an Rpl21 gene (Supplementary File S2). In the majority of cases, a full-length CDS was not found. Instead, the CDS of the Rpl21 genes were found to be truncated. Therefore, only 22 full-length CDS were used to identify deletion and duplication events. Rpl21 proteins were found to contain the conserved amino acid […] (Supplementary Figure S11). Phylogenetic analysis showed the presence of three clusters, reflecting their origin from multiple common ancestral nodes (Supplementary Figure S12A). Deletion/duplication analysis indicated that three Rpl21 genes underwent duplication events, eight underwent deletion events, and nine exhibited codivergence (Table 1, Supplementary Figure S12B).

Deletion of Inverted Repeats (IRs) Has Occurred across All Plastid Lineages
Inverted repeats (IR) are one of the major characteristic features of chloroplast genomes. The analysis conducted in the present study revealed the deletion of inverted repeats in the chloroplast genomes of 259 (10.31%) of the 2511 species examined (Supplementary File S3). IR deletion events were identified in protists (14), protozoans (one), algae (126), bryophytes (one), gymnosperms (64), magnoliids (one), monocots (nine), and eudicots (43). The average size of the IR-deleted chloroplast genomes in algae was 0.177 Mb, which is larger than the overall average size of the chloroplast genome in that group. The average size of the IR-deleted chloroplast genomes in eudicots, monocots, and gymnosperms was 0.124, 0.131, and 0.127 Mb, respectively, which is smaller than the overall average size of the chloroplast genome in the respective lineages.
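The lineage counts reported above can be checked against the stated total and percentage with a few lines of arithmetic. The Python snippet below uses only the numbers given in the text; the variable names are illustrative.

```python
# IR-deleted species per lineage, as reported in the text.
ir_deleted = {
    "protists": 14,
    "protozoans": 1,
    "algae": 126,
    "bryophytes": 1,
    "gymnosperms": 64,
    "magnoliids": 1,
    "monocots": 9,
    "eudicots": 43,
}
total_genomes = 2511

total_deleted = sum(ir_deleted.values())
percent_deleted = round(100 * total_deleted / total_genomes, 2)
print(total_deleted, percent_deleted)  # 259 10.31

# Share of each lineage among the IR-deleted genomes, largest first.
for lineage, n in sorted(ir_deleted.items(), key=lambda kv: -kv[1]):
    print(f"{lineage:12s} {n:4d} {100 * n / total_deleted:6.2f}%")
```

The per-lineage counts sum to the reported 259 species, and algae alone account for nearly half of all IR-deleted genomes.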
Phylogenetic analysis of chloroplast genomes containing deleted IR regions produced three major clusters (Supplementary Figure S16). Gymnosperms were in the upper cluster (cyan), while the lower cluster (red) comprised the algae, bryophytes, eudicots, gymnosperms, and pteridophytes. No chloroplast genomes from monocot plants were present in the lower cluster (Supplementary Figure S16). The middle cluster contained at least four major phylogenetic groups (Supplementary Figure S16). Monocot plants were present in two groups (pink) in the middle cluster. Gymnosperm (cyan) and eudicot (green) chloroplast genomes were also present in two of the groups in the middle cluster. Although there was some sporadic distribution of algae in different groups of the phylogenetic tree, the majority of the algal species were present in a single group (yellow; Supplementary Figure S16). A phylogenetic tree combining taxa with deleted IRs and taxa whose chloroplast genomes did not lose the IR region (Floydiella terrestris, Carteria cerasiformis, B. apyrenoidosa, Eucalyptus grandis, Oryza sativa, and others) did not reveal any specific difference in their clades. Instead, the latter also grouped with the genomes in which the IR region was deleted. Inverted repeats stabilize the chloroplast genome [50,51], and the loss of a region of inverted repeats most likely leads to a genetic rearrangement in the chloroplast genome. The lower cluster (red) contained the oldest group. Genomic recombination analysis revealed that the chloroplast genomes across different lineages also underwent extensive recombination (Supplementary Figure S14A,B). In addition, the IR-deleted chloroplast genomes also underwent extensive recombination (Supplementary Figure S15).

Chloroplast-Derived Genes Are Present in the Nuclear Genome
It has been speculated that genes lost from chloroplast genomes may have moved to the nuclear genome and are regulated as nuclear-encoded genes [52,53]. Therefore, a genome-wide analysis of the fully sequenced and annotated genomes of 145 plant species was conducted to explore this question. The results indicated a widespread presence of chloroplast-encoded genes in the nuclear genome. We found 189,381 putative nuclear-encoded chloroplast genes across the 145 plant species (Supplementary File S5). Some of the chloroplast-derived genes that were found in the nuclear genome were: Rubisco accumulation factor, 30S ribosomal proteins (1, 2, 3, S1, S2, S3, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, and S31), 50S ribosomal proteins (

The Ratio of Nucleotide Substitution Is Highest in Pteridophytes and Lowest in Nymphaeales
Determining the rate of nucleotide substitution in the chloroplast genome is important for further elucidating the evolution of the chloroplast genome. Single-base substitutions and insertion and deletion (indel) events play an important role in shaping the genome. Therefore, an analysis was conducted to determine the rate of substitution in chloroplast genomes grouped according to their respective lineages. The results indicated that the transition/transversion substitution ratio was highest in pteridophytes (k1 = 4.798 and k2 = 4.043) and lowest in Nymphaeales (k1 = 2.799 and k2 = 2.713; Supplementary Table S2). The ratio of nucleotide substitution in species with deleted IR regions was 2.951 (k1) and 3.42 (k2; Supplementary Table S2). The rate of A > G transition was highest in pteridophytes (15.08) and lowest in protists (8.51), and the rate of G > A substitution was highest in protists (22.15) and lowest in species with deleted IR regions (16.8). The rate of T > C substitution was highest in pteridophytes (14.01) and lowest in protists (8.95; Supplementary Table S2). The rate of C > T substitution was highest in protists (22.34) and lowest in Nymphaeales. Transversions were approximately half as frequent as transitions. The rate of A > T transversion was highest in protists (6.80) and lowest in pteridophytes (4.64), while the rate of T > A transversion was highest in algae (6.98) and lowest in pteridophytes (Supplementary Table S2). The rate of G > C substitution was highest in Nymphaeales (4.31) and lowest in protists (2.46), while the rate of C > G substitution was highest in Nymphaeales (4.14) and lowest in protists (2.64; Supplementary Table S2).
Based on these results, it is concluded that the highest rates of transition and transversion occurred in lower eukaryotic species, including algae, protists, Nymphaeales, and pteridophytes; high rates of transition/transversion were not observed in bryophytes, gymnosperms, monocots, and dicots (Supplementary Table S2). Notably, G > A transitions were more prominent in chloroplast genomes with deleted IR regions (Supplementary Table S2).

Chloroplast Genomes Have Evolved from Multiple Common Ancestral Nodes
A phylogenetic tree was constructed to obtain an evolutionary perspective of chloroplast genomes (Figure 6). All of the 2511 studied species were used to construct the tree (Figure 6). The phylogenetic analysis produced four distinct clusters, indicating that chloroplast genomes evolved independently from multiple common ancestral nodes. Lineage-specific groupings of chloroplast genomes were not present in the phylogenetic tree. The genomes of algae, bryophytes, gymnosperms, eudicots, magnoliids, monocots, and protists grouped dynamically in different clusters. Although the chloroplast genomes of protists were far smaller than those of other lineages, they were still distributed sporadically throughout the phylogenetic tree. Time tree analysis indicated that the origin of the cyanobacterial species (used as the outgroup) dates back to ~2180 Ma, that the endosymbiosis of the cyanobacterial genome occurred ~1768 Ma ago, and that it was incorporated into the algal lineage ~1293-686 Ma ago (Supplementary Figure S20); this lineage then further evolved into the Viridiplantae ~1160 Ma, Streptophyta ~1150 Ma, Embryophyta ~532 Ma, Tracheophyta ~431 Ma, Euphyllophyta ~402 Ma, and Spermatophyta ~313 Ma (Supplementary Figure S20). The molecular signature genes PsaM, ChlB, ChlL, ChlN, Psb30, and Rpl21 in algae, bryophytes, pteridophytes, and gymnosperms were lost between ~203 (Cycadales) and ~156 (Gnetidae) Ma ago and, as a result, are not found in the subsequently evolved angiosperm lineage (Supplementary Figure S20).

Discussion
Chloroplasts are an indispensable part of plant cells and function as semiautonomous organelles due to the presence of their own genetic material, their potential to self-replicate, and their capability to modulate cell metabolism [4,54–56]. The size of the chloroplast genome is highly variable and does not correlate with the size of the corresponding nuclear genome of the species. The average size of the chloroplast genome is 0.152 Mb, and it encodes an average of 91.67 CDS per genome. The deletion of IR regions is expected to drastically reduce the genetic content of the chloroplast genome and also the number of CDS. However, the current analysis does not fully support this premise. The average number of CDS in algae (140.93) was higher than in protists (98.97), pteridophytes (86.54), eudicots (83.55), bryophytes (83.38), gymnosperms (82.54), and monocots (82.53). The larger size (0.177 Mb) of the chloroplast genome in algae with deleted IR regions, and the higher number of CDS (172.16 per genome) in IR-deleted algal taxa, indicate that the loss of IR regions in algae led to a genetic rearrangement and an enlargement of the chloroplast genome. However, the average CDS number in the IR-deleted genomes of other lineages was lower than the overall average CDS count of those lineages (86.28 for protists, 63 for monocots, 81.42 for gymnosperms, and 71.88 for eudicots). The average size of IR-deleted chloroplast genomes in eudicots, monocots, protists, and gymnosperms was smaller than the average size of chloroplast genomes of taxa in which IR regions have not been deleted. Thus, the lower number of CDS in these taxa may be related to the deletion of IR regions. This suggests that the deletion of IR regions in the chloroplast genome of algae is associated with an increase in genome size and a concomitant increase in CDS number, whereas in the other plant lineages it is associated with a decrease in both.
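The contrast between algae and the other lineages can be made explicit by subtracting each lineage's overall average CDS count from the average in its IR-deleted genomes. The Python snippet below does this using only the averages quoted in the paragraph above; the variable names are illustrative, and lineages for which no IR-deleted average is reported are omitted.

```python
# Average CDS per chloroplast genome, as quoted in the text.
overall = {
    "algae": 140.93,
    "protists": 98.97,
    "monocots": 82.53,
    "gymnosperms": 82.54,
    "eudicots": 83.55,
}
ir_deleted = {
    "algae": 172.16,
    "protists": 86.28,
    "monocots": 63.0,
    "gymnosperms": 81.42,
    "eudicots": 71.88,
}

# Positive values mean IR-deleted genomes carry more CDS than the lineage average.
difference = {k: round(ir_deleted[k] - overall[k], 2) for k in overall}
for lineage, delta in sorted(difference.items(), key=lambda kv: kv[1]):
    print(f"{lineage:12s} {delta:+7.2f}")

gained = [k for k, delta in difference.items() if delta > 0]
print("lineages with more CDS after IR loss:", gained)
```

Only algae show a gain (about 31 CDS per genome on average); the remaining lineages show losses ranging from roughly 1 CDS in gymnosperms to almost 20 in monocots.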
The deletion of IR regions has been previously reported in a few species of algae, magnoliids, and other genomes [57–61]. The present study, however, provides clear evidence of the loss of IR regions across all plant and protist lineages. The deletion of IRs and the increase in genome size in algae have largely been attributed to the duplication of the chloroplast genome. The evolutionary age of IR-deleted species of algae dates back to ~965-850 Ma. This provides strong evidence that the deletion of IRs and duplications of the chloroplast genome have been a continuous process since the initial evolution of the chloroplast genome in algae. Zhu et al. also suggested a role for duplication in the evolution of IR-deleted chloroplast genomes [60]. Characterizing the pattern and frequency of neutral mutations (substitutions, insertions, and deletions) is important for deciphering the molecular basis of the evolution of genes and genomes. Turmel et al. reported that a differential loss of genes from the chloroplast genome resulted in the loss of IR regions in the chloroplast genome for all the lineages, except algae and protists [57]. The transition/transversion ratio of purine substitutions in all IR-deleted species (k1 = 2.951) was much lower than in non-IR-deleted species, except for species in the Nymphaeales, and the ratio for pyrimidine substitutions in all IR-deleted species was higher (k2 = 3.42), except in comparison with pteridophytes (Supplementary Table S2). These data suggest that, in addition to a duplication event, a lower rate of purine substitution and a higher rate of pyrimidine substitution are closely associated with the deletion of IR regions.
In addition to the loss of IR regions, the loss of genes from chloroplast genomes was also analyzed. The loss of important genes from the chloroplast genome has been previously reported in some species of green algae, bryophytes, and magnoliids (Supplementary Data S1) [62][63][64][65]. The results of the present study indicate the loss of the Rbcl gene in at least 19 parasitic, mycoparasitic, and saprophytic species across different lineages, including algae, eudicots, magnoliids, monocots, and protists. The parasitic genus Conopholis (Orobanchaceae) has lost the photosynthetic gene Rbcl, although the gene is present in other parasitic members of the Orobanchaceae [66,67]. The loss of Rbcl, however, was not observed in any species of bryophytes, pteridophytes, or gymnosperms. The number of CDS in Rbcl-deleted chloroplast genomes was much lower (27 per genome) than the average number of CDS found in chloroplast genomes, except in Alveolata sp. CCMP3155, which possessed 81 CDS. The loss of the Rbcl gene is thus associated with a drastic reduction in the number of other protein-coding genes, and the reduction in genome size is associated with the massive loss of ancestral protein-coding genes [68]. Interestingly, the parasitic genus Cuscuta possesses an Rbcl gene, which suggests that a parasitic lifestyle is not always accompanied by the deletion of Rbcl and, conversely, that the loss of Rbcl is not a prerequisite for becoming a parasitic plant. However, it is quite clear that parasitic species are more prone to the loss of chloroplast-encoded genes. Although a few parasitic species retain the Rbcl gene, they cannot sustain themselves through their own photosynthesis. The loss of these molecular features provides an important platform for understanding plant-parasite interactions and the evolution of parasitic plants.
The loss of genes is most likely associated with a high level of contraction of the nuclear genome as well. Autotrophic plants most likely evolved parasitic characters through neofunctionalization and transcriptional reprogramming of their older lineage. A previous study reported that the transition from autotrophic to parasitic plants relaxes the functional constraints on plastid genes in a stepwise manner [69].
The deletion of one or more important genes of the chloroplast genome was observed in numerous species (Supplementary Data S1). It is difficult to decipher the exact reason for the loss of these individual genes in different chloroplast genomes. NdhA, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, NdhK, and Rps16 were the genes most commonly lost across the analyzed chloroplast genomes. The NdhB gene, however, was found to be intact in all species of bryophytes, suggesting that it could serve as a signature gene for the bryophyte chloroplast genome. Ndh genes encode components of the thylakoid Ndh complex involved in photosynthetic electron transport. The loss of specific Ndh genes in different species suggests that not all Ndh genes are involved in or needed for functional photosynthetic electron transport. The loss of one Ndh gene may be compensated for by other Ndh genes or by nuclear-encoded genes. The functional role of the Ndh genes was previously reported to be closely related to the adaptation of land plants and photosynthesis [70]. The loss of Ndh genes in species across all the plant lineages, including algae, suggests that Ndh genes are not associated with the adaptation of photosynthesis to terrestrial ecosystems. Previous studies have reported the loss of Ndh genes in the Orchidaceae, where the deletion was reported to occur independently after the orchid family split into different subfamilies [71]. These data suggest that the loss of Ndh genes in the parental lineage of orchids led to the loss of Ndh genes in the subfamilies in the downstream lineages of orchids.
A comparison of gene loss in monocots and eudicots revealed that eudicot species are more prone to gene loss than monocot species. Monocot and eudicot chloroplast genomes shared a common loss of 59 genes, while eudicots have lost 10 more genes (ClpP, Rpl14, Rpl2, Rpl36, RpoA, Rps2, Rps8, Rps11, Rps14, and Rps18) than monocots, suggesting that these genes represent a molecular signature of the chloroplast genomes of monocot species. Ycf genes (Ycf1, Ycf2, Ycf3, and Ycf4) were found to be intact in all species of bryophytes, gymnosperms, and pteridophytes, suggesting that they represent a common molecular signature for these lineages. Various genes, including MatK, Rbcl, Ndh, and Ycf, are commonly used as universal molecular markers in DNA barcoding studies for determining the genus and species of plants. The loss of these genes from the chloroplast genome in various lineages makes their use as universal markers in future DNA barcoding studies questionable [72][73][74][75][76].
The loss of RpoA from the chloroplast genome of mosses was previously reported, and it was suggested that RpoA had relocated to the nuclear genome [63,77]. The loss of Psa and Psb genes was quite prominent in algae, eudicot, magnoliid, monocot, and protist lineages. Psa and Psb genes were always found in species of bryophytes, pteridophytes, and gymnosperms, suggesting that these genes could serve as a common molecular signature for these lineages. PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 are characteristic molecular signature genes for lower eukaryotic plants, including algae, bryophytes, pteridophytes, and gymnosperms. Additionally, these genes are completely absent in the eudicots, magnoliids, monocots, and protists. The absence of these genes in angiosperm and magnoliid lineages reflects their potential role in the origin of flowering plants. Duplication events for PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 genes were much less frequent than deletion and codivergence events (Table 1). In fact, codivergence was the dominant event for all of these genes (Table 1). The recombination events that occurred in the chloroplast genome directly reflect the possibility of codivergent and divergent evolution of these genes. The presence of PsaM, Psb30, ChlB, ChlL, and ChlN genes in their respective lineages supports the premise that these genes are orthologous and resulted from a speciation event [78][79][80][81]. Chl genes are involved in photosynthesis in cyanobacteria, algae, pteridophytes, and conifers [82][83][84][85][86][87], indicating that the Chl genes originated at least ~2180 Ma ago and remained intact up to the divergence of the angiosperms at ~156 Ma. The loss of Psa and Psb genes in different species also suggests that they are not all essential for a complete and functional photosynthetic process. The loss of a Psa or Psb gene in a species might be compensated for by other Psa or Psb genes or by a nuclear-encoded gene.
The loss of Psa and Psb genes in species across all plant lineages has not been previously reported; this study is therefore the first to report their loss from the chloroplast genome across all plant lineages, as well as protists. The loss of Rpl22, Rpl32, and Rpl33 genes was more prominent than the loss of Rpl2, Rpl14, Rpl16, Rpl20, Rpl23, and Rpl36, suggesting that the latter genes are conserved and have been passed to downstream lineages as intact genes. Rpl20 was found to be intact in all 2511 of the studied species, suggesting that Rpl20 is the most evolutionarily conserved gene in the chloroplast genome of plants and protists. Therefore, Rpl20 can be considered the molecular signature gene of the chloroplast genome. Similarly, the loss of Rps15 and Rps16 was more frequent than the loss of Rps2, Rps3, Rps4, Rps7, Rps8, Rps11, Rps12, Rps14, Rps18, and Rps19.
There are several reports regarding the transfer of genes from the chloroplast to the nucleus [4,31,88,89,90]. In the present study, almost all of the genes encoded by the chloroplast genomes were also found in the nuclear genome. The presence of chloroplast-encoded genes in the nuclear genome, however, was quite dynamic: a specific chloroplast-encoded gene found in the nuclear genome of one species was not necessarily present in the nuclear genome of another species. One report also indicated that genes transferred to the nuclear genome may not provide a one-to-one functional correspondence [90]. The question also arises as to how almost all of the chloroplast-encoded genes can be found in the nuclear genome and how they were transferred. If the transfers and correspondence are real, it is plausible that almost all chloroplast-encoded genes have been transferred to the nuclear genome in one or more species, that the transfer of chloroplast genes to the nuclear genome is a common process in the plant kingdom, and that the exchange of chloroplast genes with the nuclear genome has already been completed.

Conclusions
The exact mechanism underlying the deletion of IR regions from the chloroplast genome is still unknown, and the loss of specific chloroplast-encoded genes and IR regions in diverse lineages makes it more difficult to decipher the mechanism or selective advantage behind these losses. It is likely that nucleotide substitutions and the dynamic recombination of chloroplast genomes are the factors most responsible for the loss of genes and IR regions. Although the evolution of parasitic plants can, to some extent, be attributed to the loss of important chloroplast genes (including Rbcl), it is still not possible to draw any definitive conclusions regarding the loss of genes and IR regions. The presence of all chloroplast-encoded genes in the nuclear genome of one or another species is quite intriguing. A question arises, however: do the chloroplast genomes complete the transfer of different chloroplast-encoded genes in different species based on some adaptive requirement? The presence of a completely intact Rpl20 gene, without any deletions, in the chloroplast genome of all the species indicates that the Rpl20 gene can be considered a molecular signature gene of the chloroplast genome.

Supplementary Table S1: Loss of chloroplast-encoded genes in different species of the respective lineages.

Supplementary Table S2: Maximum composite likelihood substitution of nucleotides. Each entry reflects the probability of substitution (r) from one base (row) to another base (column). The rates of transitions are highlighted in bold and the rates of transversions are highlighted in italics. The nucleotide frequencies (%) of A, T/U, G, and C for the respective study are given in the rows. The transition/transversion ratios are given as k1 (purines) and k2 (pyrimidines). The transition/transversion bias is R = [A*G*k1 + T*C*k2]/[(A+G) * (T+C)]. The codon positions included were 1st + 2nd + 3rd + noncoding. All positions with less than 95% site coverage were eliminated; that is, fewer than 5% alignment gaps, missing data, and ambiguous bases were allowed at any position. The C > T substitution is more frequent than the T > C substitution, and the G > A substitution is more frequent than A > G. The major mechanism of mutation is the deamination of 5-methylcytosine to thymine, producing C > T or, on the complementary strand, G > A.

Supplementary Figure S16: Phylogenetic tree of inverted repeat (IR)-deleted chloroplast genomes. The phylogenetic tree of chloroplast genomes was constructed with ClustalW using a neighbor-joining approach with 1000 bootstrap replicates, and three major clusters were identified. The tree was constructed in combination with species containing the inverted repeats (Floydiella terrestris, Carteria cerasiformis, Bulboplastis apyrenoidosa, Eucalyptus grandis, Oryza sativa, and others) to decipher the differences. Deletion of inverted repeats did not have a considerable impact on the phylogeny.

Supplementary Figure