A Tale of Two Families: Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus angustifolius L.).

Narrow-leafed lupin (Lupinus angustifolius L.) has recently been supplied with advanced genomic resources and, as such, has become a well-known model for molecular evolutionary studies within the legume family—a group of plants able to fix nitrogen from the atmosphere. The phylogenetic position of lupins in Papilionoideae and their evolutionary distance to other higher plants facilitates the use of this model species to improve our knowledge on genes involved in nitrogen assimilation and primary metabolism, providing novel contributions to our understanding of the evolutionary history of legumes. In this study, we present a complex characterization of two narrow-leafed lupin gene families—glutamine synthetase (GS) and phosphoenolpyruvate carboxylase (PEPC). We combine a comparative analysis of gene structures and a synteny-based approach with phylogenetic reconstruction and reconciliation of the gene family and species history in order to examine events underlying the extant diversity of both families. Employing the available evidence, we show the impact of duplications on the initial complement of the analyzed gene families within the genistoid clade and posit that the function of duplicates has been largely retained. In terms of a broader perspective, our results concerning GS and PEPC gene families corroborate earlier findings pointing to key whole genome duplication/triplication event(s) affecting the genistoid lineage.


Introduction
The last decade has seen gradual progress in evolutionary studies on plants, mainly due to simultaneous, rapid advancement in theory, computing, and molecular technology. Legumes, which are the third largest plant family, have attracted the focus of active and collaborative international groups of researchers in the area of systematics and evolution [1][2][3]. Fabaceae, consisting of three major clades-Papilionoideae, Caesalpinioideae, and Mimosoideae-includes important grain, pasture, and agroforestry species that are characterized by an unusual flower structure, podded fruit, and the encoded by a small multigene family with an insufficiently elucidated evolutionary history. However, it is assumed that gene duplication from pre-existing genes, followed by a few amino acid changes and the acquisition of a new gene transcription control, have led to the appearance of new isoforms such as C4 PEPC [51].
Here, we provide characterization of the L. angustifolius glutamine synthetase (GS1 and GS2) and phosphoenolpyruvate carboxylase (PEPC) gene families, including gene structure determination; genetic localization within narrow-leafed lupin linkage groups (NLLs) and estimations of the GS1, GS2, and PEPC copy number in the narrow-leafed lupin genome. As sequences of narrow-leafed lupin [11,52] were only available in draft form prior to the start of this study, we decided to combine the screening of the BAC library with available data from genome sequencing. We also address several fundamental questions regarding the evolution of GS and PEPC gene families in legume plants and 40 other dicots and monocots. We support our evolutionary conclusions with a cross-genera microsynteny analysis of selected genome regions carrying particular GS and PEPC gene variants in the genomes of narrow-leafed lupin and several legume and non-legume species. Moreover, Fabaceae GS and PEPC representatives were sampled for selection pressure parameters by both pairwise and branch-site assays.

Narrow-Leafed Lupin GS and PEPC are Encoded by Multigene Families
To tag/select cytosolic GS and PEPC genes, two sequence-specific probes targeting GS and PEPC genes, respectively, were amplified and used for narrow-leafed lupin genome BAC library screening. As a result, two BAC clone sub-libraries were created, with BACs representing L. angustifolius genome regions carrying GS and PEPC genes. The presence of analyzed genes within selected BACs was positively verified by PCR amplification and Sanger sequencing with gene-specific primers. The similarity level between particular GS and PEPC homologs identified in the selected clones was determined. Fragments of analyzed genes (300-400 bp) with a similarity level above 97% were classified as one gene variant and assigned to one contig. Two such contigs and two singletons were constructed for the GS sub-library and two contigs with one singleton were constructed for PEPC. The composition of the GS sub-library is as follows: contig1, clones 015C08 and 087N22; contig2, clones 038E09, 047P22, 088E07, 094A04, and 131H20; and singletons, 036L23 and 059J08. The PEPC sub-library contains contig1, clones 067C07 and 083F23; contig2, clones 064J15 and 077K22; and a singleton, 131K15. Taking into consideration these results, the accuracy of BAC library screening with the use of the Southern blot method was calculated to be 50% for both sub-libraries and was considered as being relatively low. It was expected that post-hybridization signals would represent the coverage of the L. angustifolius genome in the BAC library [28,53]. The observed phenomenon may reflect the general characteristics of the lupin BAC library and incorporated cloning system used, with the noted instability depending on the carried sequence [28,54,55].
Gene copy number estimation with ddPCR revealed that BAC sub-libraries were lacking some gene duplicates. When the study was initially conceived and the experimental part was conducted, the lupin draft genome had not been officially released. Moreover, the availability of both the scaffold-level [52] genome draft and the latter LupinExpress pseudochromosome-level [11] assemblies has, in some of our other studies, failed to entirely resolve certain areas of the genome, including, for example, the placement of RAP2-7 transcription factor, crucial to alkaloid biosynthesis, reported by Kroc et al. (2019). Therefore, our recent BAC-based study aimed at molecular control of the vernalization response Ku locus in the narrow-leafed lupin highlighted a candidate gene (a homolog of FLOWERING LOCUS T) and provided the sequence of the domesticated allele carrying a functional mutation (large indel in the promoter) before the release of the lupin pseudochromosome sequence [25,30]. This finding was later confirmed by genome assembly-based studies. Furthermore, BAC clones may be used as chromosome-specific cytogenetic landmarks for chromosome-scale analysis, as well as for inter-species tracking of conserved chromosome regions and profiling of their structural variation. Both approaches have been used in lupin molecular cytogenetic studies [30][31][32][33]. Indeed, BAC clones from this study (047P22, 036L23, 059J08, 067C07, and 131K15) were recently exploited in parallel research addressing lupin karyotype evolution, providing single-locus anchors for the visualization of chromosomal rearrangements across the panel of ten European and African lupin species. Therefore, even after updating the bioinformatic results to include the newly available genomic data, we decided to retain BAC-derived sequences in the final analysis, both as a record of the train of thought and as valuable supporting evidence directly linking recently developed cytomolecular resources for comparative fluorescent in situ hybridization mapping.
To obtain data on GS and PEPC genes, sequences of interest were blasted against the narrow-leafed lupin annotated gene set cds v1.0. The search resulted in the identification of nine narrow-leafed lupin GS genes in total: seven GS1 genes (named GS1a1, GS1a2, GS1a3, GS1b1, GS1b2, GS1c1, and GS1c2) and two GS2 genes (named GS2a1 and GS2a2). Nine PEPC homologs were identified: PEPC1a, PEPC1b, PEPC1c, PEPC2a, PEPC2b, PEPC3a, PEPC3b, PEPC4, and PEPC5 ( Table 1). The observed trend in the L. angustifolius GS and PEPC gene copy number is consistent with the data gathered for other legumes. The P. vulgaris GS1 gene family contains three active GS1 genes and one pseudogene [56]. In pea, three active GS1 genes have been characterized: GS1, GS3A, and GS3B [57]. In M. truncatula, two active GS1 genes (MtGS1a and MtGS1b), two GS2 genes (MtGS2a and MtGS2b), and one pseudogene (MtGSc) were revealed [58]. Two major classes of GS1 genes have been investigated in M. sativa [59]. In the G. max genome, there are three GS1 classes, each represented by at least two functional members [60]. Only one copy of the GS1 gene was identified in the A. ipaensis and A. duranensis species.
According to the proposed evolutionary history of narrow-leafed lupin, it was stated that this species has undergone duplication and/or triplication with several chromosome rearrangements [11,21,36]. Based on a cytogenetic analysis of several species from the Lupinus genus, it was also hypothesized that the lupin karyotype has evolved through polyploidy and subsequent aneuploidy [61]. Global analysis of the narrow-leafed lupin transcriptome and legume genome sequence comparative mapping enabled whole genome duplication (WGD) events to be dated. Hane et al. estimated the Papilionoideae radiation at 58 mya with genistoid lineage separation from the other Papilionoideae legumes at 54.6 mya, followed by whole-genome triplication in the genistoid lineage at 24.6 mya [11]. Additionally, the ancient polyploidy event has been confirmed based on an analysis of several genes, such as chalcone isomerases (CHI) [62], phosphatidylethanolamine binding proteins (PEBP) [30], isoflavone synthetases (IFS) [63], and cytosolic and plastid acetyl-coenzyme A carboxylases (ACCase) [64]. All listed genes are present in the narrow-leafed lupin genome in multiple variants and evolved by WGDs, evidenced by shared synteny and Bayesian phylogenetic inference. Our results concerning GS and PEPC gene families support the whole genome duplication/triplication(s) hypothesis. Table 1. Characterization of Lupinus angustifiolius bacterial artificial chromosomes (BACs)/scaffolds carrying glutamine synthetase (GS) and phosphoenolpyruvate carboxylase (PEPC) sequences, including anchoring genes to the scaffolds and narrow-leafed lupin linkage groups (NLLs), cytogenetic marker representation, and repetitive content analysis within selected scaffolds. NLL-narrow-leafed lupin linkage group, RE-repetitive element, and CDS-coding sequence.
In order to resolve the organization of multiple genome regions carrying distinct sequence variants of GS and PEPC, narrow-leafed lupin genome regions carrying these genes were extracted from the assembly and, together with seven sequenced BAC clone inserts (three with the PEPC genes 064J15, 067C07, and 131K15, and four with the GS sequences 036L23, 047P22, 059J08, and 087N22), were annotated with putative functions. BAC sequences were mapped onto narrow-leafed lupin scaffolds and selected regions were truncated into a uniform length of 100 Mbp. Four scaffolds remained with the original lengths: scaffold192, 88,054 bp; scaffold11_68, 28,507 bp; scaffold9_1, 31,494 bp; and scaffold59_19, 103,921 bp. Analysis revealed the average GC content of 32.95% and 33.23% for GS and PEPC regions, respectively. The observed occurrence of repetitive elements in genome fragments flanking GS and PEPC gene variants varied from 0% (GS1b1, scaffold73) to 15.63% (GS1a2, scaffold106), and from 1.17% (PEPC1c, scaffold274) to 15.64% (PEPC3b, scaffold296), with retrotransposons (Ty1/Copia and Gypsy/DIRS1) and DNA transposons (DNA/Mule-MuDR type) being the most abundant.
It has been confirmed that the narrow-leafed lupin genome is highly repetitive (57%) [11], with well-organized gene-rich regions. In addition to satellites sensu lato, long terminal repeat (LTR) retrotransposons and DNA transposons were revealed as the most common, with only a small proportion of non-coding RNA [11,19,31,65]. Due to the "copy and paste" mechanism underlying the amplification of LTR retrotransposons, they have been shown to make up the largest classes of transposable element (TE) content in the genomes of most flowering plants, greatly contributing to increases in size of their host genome [66]. As reported in studies concerning Arabidopsis, soybean, and flax genomes, Copia elements are largely located within and/or close to gene-coding regions, which suggests that these elements may have the dominant influence on the evolution of some gene families [67][68][69]. Gene prediction revealed features characteristic of gene-rich regions, with an average of 13 coding sequences per 100 Mbp for both GS and PEPC gene regions (Table 1, Supplementary file 1). The obtained data for the frequency of coding sequences within analyzed regions of the narrow-leafed lupin genome showed a lower coding sequence (CDS) abundance than in our previous studies [19,31]. This low number of genes in GS1a2, GS2a1, and PEPC3b neighborhoods is primarily due to the high content of repetitive elements in the surrounding regions.

GS and PEPC Gene Variants Present a Conserved Sequence Structure among All L. angustifolius Homologs and Other Legume Species
To investigate the structural changes of the GS and PEPC genes, sequence data from 46 species originating from 26 plant families were gathered (Supplementary file 2). In total, 244 sequences of GS homologs were subjected to exon/intron determination. The average CDS length for GS1 (178 sequences analyzed) was established as 3259 bp, with 12 exons as the dominant structure, whereas for GS2 (46 sequences analyzed), the value was 3866 bp, with 13 exons. Legume GS homologs (36 sequences of GS1 and GS2) presented a conserved gene structure consistent with the pattern described above. Indeed, only the structures of four GS1 genes were different: Lj0g3v0335159 from L. japonicus-nine exons; TR_5g077950 from M. truncatula-nine exons; gene13764 (LOC107631250) from A. ipaensis-10 exons; and GLYMA02G41106 from G. max-10 exons. In the case of GS2 homologs, only gene3699 (LOC107637831) from A. ipaensis with 14 exons and Tp57577_TGAC_v2_gene28916 from Trifolium pratense with 20 exons showed an atypical gene structure (Supplementary file 3).
To establish the structure of PEPC gene family representatives among higher plants, 223 sequences were analyzed. Based on the exon/intron organization, two groups were formed. The first group contained 167 sequences with an average length of 5645 bp (min. 3102 bp, max. 17,375 bp) structured into 10 exons. Nevertheless, some variation in exon composition was found, particularly in the sequences GSMUA_Achr9G06420_001 from Musa acuminata and MDP0000258440 from Malus domestica, The structures of all identified L. angustifolius GS and PEPC genes were established. The GS sequence lengths varied from 3550 to 8730 bp for GS1 homologs and from 4002 to 4890 bp for GS2. Coding sequence organization was highly conserved within GS1 (12 exons) and GS2 (13 exons) groups, despite the observed dissimilarities in lengths. CDS lengths were as follows: GS1a, 1071 bp (356 aa); GS1b1, 1071 bp (356 aa); GS1b2, 1062 bp (353 aa); GS1c, 1074 bp (357 aa); and GS2a1 and GS2a2, 1299 bp (432 aa). A major structural difference in GS genes was observed for GS1b2, where exon number 12 was significantly shorter than in other homologs (144 vs. 153 bp, respectively). Moreover, 5 and 3 GS regulatory regions revealed high variation between all analyzed sequences, both in length and composition. PEPC genes were organized into 10 exons, and the coding sequence length varied from 2901 to 2907 bp (from 966 to 968 aa), excluding PEPC5, which had a 3135 bp (1044 aa) length structured into 20 exons and thus significantly deviated from the other PEPC sequence variants. The observed level of sequence similarities within the PEPC clade is considered as being high. However, major differences in the length and composition of 5 and 3 UTR regions were noted (Supplementary file 4).

The Initial GS and PEPC Complement was Subsequently Duplicated in a Lineage-Specific Manner and Can be Traced to the Common Ancestor of Legumes
The reconstructed phylogeny of plant GS genes yielded several insights with regards to legume enzymes. Firstly, the initial representation of this family in Fabaceae is shown to have consisted of three ancestral clades ( Figure 1, Figure 2, and Supplementary file 5) for a simplified phylogenetic tree of relationships. The first monophyletic clade (denoted as GS2- Table 2) encompasses the known types of GS2 loci, which are annotated as chloroplastic proteins encoded in the nuclear genome. Duplicates are present in multiple, rather than singular, cases of divergent legumes and were previously found to be expressed in seeds, at least in the case of M. truncatula [70]. The other two clades (GS1cs1 and GS1cs2) carry genes encoding cytosolic proteins corresponding to cytosolic isoforms preferentially expressed in different organs/at different developmental stages (i.e., GS1cs2-α, and GS1cs1-β and γ subunits described in early comparative analyses [71]). The placement of Vitis vinifera and multiple malvid sequences between the two clades points to the GS1cs1/GS1cs2 ancestral split either coinciding or shortly following the γ triplication common to both rosids and asterids [72]. Additionally, the GS1cs1 ancestral split, which resulted in the separation of β and γ subclades (constitutively expressed vs. nodule enhanced, respectively), is shown to have occurred early in the evolution of legumes (possibly prior to the separation of genistoid lineage, with GS1cs1-β encoding loci seemingly not having been retained in the NLL reference genome).  Both one PEPC2 (PTPC, plant-type PEPC [50]) clade and two PEPC1 (BTPC, bacterial-type PEPC) clades can be clearly characterized as monophyletic in legumes. Therefore, three ancestral Both one PEPC2 (PTPC, plant-type PEPC [50]) clade and two PEPC1 (BTPC, bacterial-type PEPC) clades can be clearly characterized as monophyletic in legumes. Therefore, three ancestral genes inherited from an early rosid are indicated, each of which was duplicated prior to the divergence of genistoid/dalbergioid lines and can be traced to the common ancestor of legumes (PEPC1a, PEPC1b, PEPC2-see Table 3 for a full summary and Figure 3, Figure 4, and Supplementary file 6 for relevant fragments of phylogenetic reconstruction). The ancestral duplication giving rise to PEPC1a and PEPC1b legume plant-type PEPC subgroups likely dates back to core eudicots (coincident with γ triplication or closely following the event). An additional legume-specific duplication event is implied in PEPC1b, although incomplete lineage sorting artefacts cannot be ruled out. Indeed, as with available reconstructions of legume phylogeny based on housekeeping genes, the ordering of early diverging dalbergioid and genistoid lineages is seen to alternate between two possibilities.

of 32
genes inherited from an early rosid are indicated, each of which was duplicated prior to the divergence of genistoid/dalbergioid lines and can be traced to the common ancestor of legumes (PEPC1a, PEPC1b, PEPC2-see Table 3 for a full summary and Figure 3, Figure 4, and Supplementary file 6 for relevant fragments of phylogenetic reconstruction). The ancestral duplication giving rise to PEPC1a and PEPC1b legume plant-type PEPC subgroups likely dates back to core eudicots (coincident with γ triplication or closely following the event). An additional legume-specific duplication event is implied in PEPC1b, although incomplete lineage sorting artefacts cannot be ruled out. Indeed, as with available reconstructions of legume phylogeny based on housekeeping genes,  The initial GS1 complement was subsequently duplicated in a lineage-specific manner and available evidence (including intact intron-exon structure, which is prior published evidence in the case of alfalfa and common bean) indicates that the functionality of these duplicates has been largely retained in extant crop species. In regard to lupin, the narrow-leafed lupin enzymes are shown to be the result of such duplications and are thus paralogous to the closest counterparts from non-genistoid groups. As a closing side note, the overall resolution of events on the basis of the phylogeny (evolution of cytosolic GSI-encoding genes) suggests that monocot family members might be more ancient than dicot ones, stemming from the selective culling of duplicates predating the separation of both lineages (in line with the split between cytosolic and plastid eukaryotic GS, likely predating monocot/dicot divergence) [48]. However, it is worth noting that the resolution of these basal events lacks the support necessary to make strong inferences (less than 50% bootstrap probabilities for consensual clades).
Analogous to the GS case, most of the retained PEPC duplications are late and species-specific (as seen in the soybean, lotus, and lupin genomes). In this case, the reconciled PEPC phylogeny supports most lupin gene family members being late paralogs (PEPC1a.2 and PEPC1b.1-single duplication, and PEPC1b.2-either two rounds of duplication and loss or triplication in the lineage). The inference of possible subsequent duplications/triplication (both here and in the GS1cs1 γ clade) corroborates the earlier findings, pointing to events affecting the genistoid lineage [36].     1 Where a locus tag is not available (gene designated as the NCBI reannotation only), the NCBI Gene database ID is given in the parentheses, prefixed with LOC.

Compared to GS Genes, the History of Coding Sequences of PEPC Genes More Closely Recapitulates the History of Species
A maximum likelihood codon-based phylogenetic species tree of 46 reference plant genomes, based on 29 putative single-copy orthologs with the best coverage and uniqueness, was generated in order to track species evolution. The obtained species phylogeny ( Figure 5) is highly supported, with only two major differences from the accepted consensus (e.g., The Angiosperm Phylogeny Group 2016). One of these is the alliance of lycopod Selaginella and moss Physcomitrella. The grouping of these lineages is likely an artefact of rapid diversification in early land plant lineages and could be observed in PEPC/GS phylogenies. Additionally, a significant observed difference is the grouping of Citrus sinensis (malvid, order Sapindales) with representatives of the rosid order Malpighiales (Ricinus communis, Populus trichocarpa, and Salix purpurea). Notably, the phylogeny of the latter order has still not been entirely resolved, with the whole COM (Celastrales, Oxalidales, and Malpighiales) clade placement in rosids being challenged by different datasets [73]. Otherwise, the support for consensus topology is strong and the relationships, in particular the topology of the legume clade, support the earlier consensus [74,75].  Primary metabolism genes were frequently good candidates for molecular taxonomic markers, provided that paralogy was taken into account and suitable low/single copy orthologs were chosen for inference [76]. In this context, the members of GS and PEPC subfamilies were considered as good candidates in the past. Our results do not fully corroborate these findings.
Contrary to early inquiries [4,77], chloroplastic glutamate synthetases are not particularly good taxonomic markers for legumes. The GS phylogeny clearly confirms the existence of multiple, functional copies and the reconstructed ancestry contains both late duplications (L. angustifolius, M. truncatula, L. japonicus, and G. max) and traces of earlier events (e.g., positioning L. angustifolius sequences, which implies early duplications). From the point of view of future studies, PEPC clades provide better candidates for supplementary markers (bacterial-type PEPC sequences from clades α and β), as there are less duplications and the phylogenetic signal is strong (as exemplified by the Primary metabolism genes were frequently good candidates for molecular taxonomic markers, provided that paralogy was taken into account and suitable low/single copy orthologs were chosen for inference [76]. In this context, the members of GS and PEPC subfamilies were considered as good candidates in the past. Our results do not fully corroborate these findings. Contrary to early inquiries [4,77], chloroplastic glutamate synthetases are not particularly good taxonomic markers for legumes. The GS phylogeny clearly confirms the existence of multiple, functional copies and the reconstructed ancestry contains both late duplications (L. angustifolius, M. truncatula, L. japonicus, and G. max) and traces of earlier events (e.g., positioning L. angustifolius sequences, which implies early duplications). From the point of view of future studies, PEPC clades provide better candidates for supplementary markers (bacterial-type PEPC sequences from clades α and β), as there are less duplications and the phylogenetic signal is strong (as exemplified by the bootstrap support of inner bipartitions). This is supported by past findings demonstrating that WGD may have played a lesser role in the evolution of the PEPC family in land plants [78]. However, in all (recent) cases, paralogy should be taken into account (e.g., by targeting UTR regions in order to distinguish paralogs).
More interestingly, the general patterns of lineage-specific duplications suggest that sub-functionalization and/or regulatory rewiring played a large role in shaping the extant carbon and nitrogen primary metabolic pathways in some lineages (L. angustifolius, L. japonicus, and G. max). This is also corroborated by the conserved gene structure and further analyses of selection pressure, which show a lack of changes in core ligand-interacting residues of the encoded proteins. Taken together, the evidence points to regulatory rather than mechanistic changes driving the diversification of both GS and PEPC family members. Whether this is a result of the differential retention of functional duplicates or different frequency of events, the outcome remains pertinent for future translational/comparative studies of legumes and merits more investigation.

L. angustifolius Genome Regions Carrying GS and PEPC Genes Arise from Duplication/Triplication with Additional Complex Chromosome Rearrangements
Lupinus angustifolius genome regions carrying all identified variants of GS and PEPC genes were subjected to comparative mapping to nine well-defined legume genome assemblies. Several patterns of sequence collinearity in these loci were identified. In particular, a high level of microsynteny was observed for the region carrying To summarize, all legume regions carrying at least one copy of the GS gene revealed shared synteny ( Figure 6) to at least one narrow-leafed lupin region carrying a corresponding homologous copy. Some of them matched duplicated regions in the narrow-leafed lupin genome located on different chromosomes and carrying different homologous gene copies, providing clear evidence of ancient duplications of chromosome segments that did not result in the further elimination of additional gene copies.  The set of legume regions carrying PEPC genes had more complex patterns of collinearity links. Two types of syntenic relationship were observed, related to regions carrying a PEPC gene and to regions lacking such a gene. Moreover, numerous local duplications in the analyzed data set were revealed. Highly conserved microsynteny, expressed by high values of the total score of sequence alignments, was observed for PEPC1a, PEPC1b, PEPC1c, and A. duranensis This may suggest that PEPC4 gene copies were removed from these regions during evolution. In the PEPC5 region, no microsynteny was found between lupin and other legumes. Nevertheless, several orthologs of PEPC5 were described. In general, PEPC genes revealed complex patterns of microsynteny, indicating both lineage-specific and ancestral duplications, as well as possible deletions of excessive gene copies (Figure 7, Supplementary file 7). The distribution of collinearity links provided a clear line of evidence that both GS and PEPC gene families have expanded in legumes through segmental duplications, which may be considered as landmarks of two ancient WGD events.

The Major Events Promoting the Evolution of GS and PEPC Genes in Legumes were Whole-Genome Duplications
It is a well-accepted hypothesis that the evolution of legumes has been driven by an ancient WGD event which putatively occurred in the progenitor line of Papilionoideae about 50-65 mya, providing the tetraploid ancestor and launching the divergence of ancient lineages of Papilionoideae [3,8,75,79,80]. Traces of that event have been identified in numerous clades spanning the legume tree

The Major Events Promoting the Evolution of GS and PEPC Genes in Legumes were Whole-Genome Duplications
It is a well-accepted hypothesis that the evolution of legumes has been driven by an ancient WGD event which putatively occurred in the progenitor line of Papilionoideae about 50-65 mya, providing the tetraploid ancestor and launching the divergence of ancient lineages of Papilionoideae [3,8,75,79,80]. Traces of that event have been identified in numerous clades spanning the legume tree of life, from Xanthocercis and Cladrastis through dalbergioids (Arachis spp.) and genistoids (e.g., L. angustifolius), to more recent lineages of millettioids (P. vulgaris, G. max, C. cajan, and V. radiata) and galegoids (M. truncatula, L. japonicus, and C. arietinum) [1,3,74,80,81]. Some species have retained relatively large numbers of ancient tetraploid regions (i.e., 309 regions in M. truncatula carrying 4198 genes or 343 regions in G. max with 9486 genes). Taking into consideration the topology of the legume GS1c1 tree, this ancestral duplication might have contributed to the origin of β and γ subclades. A similar explanation might be proposed for the emergence of α and β groups of PEPC1a, PEPC1b, and PEPC2, supported by both phylogenetic inference and the synteny-based approach. However, the lack of genome sequencing data for early diverging legumes hampers such a comprehensive comparative analysis and precludes drawing firm conclusions.
During the early divergence of some downstream lineages, dated to roughly~30-55 mya, additional independent WGD events probably occurred, affecting Mimosoideae-Cassiinae-Caesalpinieae, Detarieae, Cercideae, and Lupinus clades [75]. Large-scale duplication and/or triplication in the L. angustifolius genome has been well-evidenced by recent studies involving linkage and comparative mapping [17,36] and microsynteny analysis of selected gene families [30,31,34,62,63]. These WGD events apparently contributed to multiplication of the gene copy number of L. angustifolius GS and PEPC genes because hypothetical duplicates were found in sister branches of the phylogenetic tree and the genome regions harboring these genes shared common collinearity links. Some lineages experienced WGD events relatively recently, including soybean (~13 mya), carrying numerous genes in the duplicated state [3,9]. All GS and PEPC subclades, except for PEPC1a-α, were shown to carry hypothetical survivors of such an event. Hypothetical legume tandem duplicates were only identified in the GS family: in P. vulgaris, V. radiata, and G. max for GS1cs1 and L. japonicus and M. truncatula for GS2. This is an expected outcome, as tandem duplication has been suggested to be a typical mechanism for the expansion of genes, representing flexible steps in the biochemical pathways or located at the end of pathways, where they do not affect many downstream genes [82]. GS and PEPC are genes encoding key enzymes involved in crucial metabolic pathways. Therefore, the appearance of additional copies without duplication of the whole pathway might have been selected against by evolutionary processes. On the contrary, the WGD event copies the entire molecular machinery, enabling the further evolution and divergence of redundant networks [83]. Moreover, the type of duplication contributes to the further evolutionary fate, demonstrated by different gene expression patterns and the methylation status of duplicates [84]. A recent expression quantitative trait loci mapping study of an L. angustifolius recombinant inbred line population (83A:476 x P27255) provided leaf transcriptomic profiles for 30,595 genes, including all GS and PEPC homologs present in the genome, except GS2a2 unannotated hitherto [85]. Gene expression values corresponding to GS and PEPC homologs were extracted from the Supplementary Materials, Table 6, of Plewiński et al. study [85] and are presented here in Table 4 for direct reference. Indeed, that survey highlighted significant differences in leaf expression levels between particular gene duplicates, namely between GS1a1 and GS1a2 or GS1a3 (43.1 ± 16.4 vs. 13.8 ± 5.2 and 11.5 ± 5.0, respectively); GS1b1 and GS1b2 (0.3 ± 0.3 vs. 2.6 ± 1.2, respectively); GS1c1 and GS1c2 (0.1 ± 0.2 vs. 187.5 ± 52.4, respectively); PEPC1a, PEPC1b, and PEPC1c (17.0 ± 4.0 vs. 0.5 ± 0.5 vs. 65.0 ± 8.7, respectively); and PEPC3a and PEPC3b (10.5 ± 2.5 vs. 51.0 ± 11.2, respectively) [85]. The observed differences in the gene expression of L. angustifolius GS and PEPC paralogs support the previously mentioned hypothesis on the expected sub-functionalization of WGD-derived duplicates. Table 4. Normalized leaf expression level of GS and PEPC genes in a L. angustifolius recombinant inbred line (RIL) mapping population (83A:476 x P27255) [85].

The Majority of Positively Selected GS and PEPC Genes are Duplicates
According to the topology of the majority of consensus trees, 85 pairs of duplicated legume GS and PEPC sequences were selected, including those located in sister branches and those originating from different subclades (if applicable). The analysis of the nonsynonymous to synonymous substitution rate (Ka/Ks) ratio revealed that all pairs except for Lj6g3v1887800/Lj6g3v1953860 and Lj6g3v1887790/Lj6g3v1953860 were under strong purifying selection, with Ka/Ks values ranging from 0.00 to 0.32 (Supplementary file 8). The two gene pairs mentioned above had a neutral (Ka/Ks) ratio (0.87). The average Ka/Ks ratio was similar in all species except L. japonicus: namely 0.09 in A. ipaensis and V. radiata; 0.10 in P. vulgaris; 0.11 in C. arietinum and G. max; 0.12 in T. pratense; and 0.13 in C. cajan, M. truncatula, and L. angustifolius. The outlier value calculated for L. japonicus (0.29) resulted from the two sequence pairs with neutral ratios mentioned above. The average Ka/Ks ratio differed between gene clades, from 0.07 to 0.08 in PEPC1a and PEPC1b, through 0.12 to 0.15 in GS1_cs2, PEPC2, and GS1_cs1, to 0.32 in GS2 (0.10 in GS2 without two L. japonicus sequence pairs under neutral selection). To address the selection pressure in a wider phylogenetic context, a branch-site test of episodic positive selection was performed for monophyletic clades, as well as all branches, for particular legume species (Supplementary file 9). Of the 163 combinations studied, statistically significant signals of positive selection were revealed for 16 foreground branches; namely, five for GS1_cs1, four for GS2, three for PEPC2, two for PEPC1a, and single branches for GS1_cs2 and PEPC1b. L. japonicus and A. ipaensis revealed the highest number of branches putatively affected by positive selection: four and three, respectively. C. arietinum and T. pratense revealed two branches with positive selection markers, whereas C. cajan, G. max, L. angustifolius, M. truncatula, and V. radiata showed only single branches with such residues. Different amino acid positions were altered and no common pattern for any gene clade was observed.
The majority of positively selected genes were duplicates (13 vs. 3). Duplicates revealed common selection patterns for A. ipaensis (GS2 and PEPC2) and partially similar patterns for L. japonicus GS2. This may indicate that episodic positive selection occurred in these lineages before duplication events. No correlation between the inferred type of duplication (local vs. WGD) and selection pressure parameters was found; remnants of positive selection were found in both types of duplicates.
Amino acid positions altered by relaxed selection constraints did not include known ligand interacting sites (ATP, glutamate, ammonia, and metal coordination sites were evaluated according to [86]). However, few sequences were considerably truncated and lacked several ligand binding sites, namely: GS1_cs1, Lj0g3v0335159 and GLYMA02G41106; GS1cs2, Lj2g3v0658180; and GS2, Lj6g3v1953860.
Calculated Ka/Ks values highlighted the high selection pressure acting on GS and PEPC paralogs. In general, selection constraints are related to the position of the enzyme in metabolic pathways, as well as the contribution of performed enzymatic activity for basic cell metabolic networks. Usually, genes encoding enzymes located at the top of the metabolic pathway are under stronger purifying selection than downstream ones [87]. An association between the selective pressure acting on a gene and the position of an encoded enzyme in the pathway was revealed in a wide metabolic context [88,89], including L. angustifolius genes encoding isoflavone synthase and acetyl-coenzyme A carboxylase [63,64]. A higher selection pressure acts on central and highly connected enzymes, enzymes with high metabolic flux, and enzymes catalyzing reactions that are difficult to bypass through alternative pathways [88]. Moreover, enzymes participating in primary metabolism are usually under a constant strong selective pressure, whereas enzymes performing specified metabolism are under weaker negative selection [89]. One of the postulated explanations for the above pattern is that these specified metabolism genes initially experienced positive selection (higher rate than primary metabolism genes) [90].

Research Material
This study was carried out with the use of L. angustifolius cv. Sonet germplasm obtained from the Polish Lupin GenBank in the Breeding Station Wiatrowo (Poznań Plant Breeders Ltd., Wiatrowo, Poland) and the narrow-leafed lupin genome BAC library [28]. Finally, probes were hybridized with the narrow-leafed lupin nuclear genome BAC library, as previously described by Książkiewicz et al. (2013). Verification of positive hybridization signals was performed by PCR and Sanger sequencing with gene-specific primers (Table 5).

Estimating GS and PEPC Sequence Variant Numbers
To estimate the number of GS and PEPC sequence variants in the L. angustifolius genome, droplet digital PCR (ddPCR) was performed with the use of the Bio-Rad QX200 Droplet Digital PCR System (Bio-Rad, Hercules, CA, USA). The set of GS and PEPC specific primers was anchored in the most conserved gene regions among legume plants with well-established sequence data. A gene described as a single copy in the narrow-leafed lupin genome, namely aspartate aminotransferase (AAT) [31,91], was used as the reference in the ddPCR experiment. A series of L. angustifolius genomic DNA dilutions, ranging from 0.125 to 2.0 ng/µL, were used as templates in ddPCR reactions containing 2× QX200 ddPCR EvaGreen Supermix The narrow-leafed lupin genome scaffold assembly v1.0 (GCA_000338175.1) and genome pseudochromosome assembly v1.0 (GCA_001865875.1) were used to obtain GS and PEPC gene variant sequences, not represented in BAC clones, and to establish their positions in the genome. The BLAST algorithm was optimized for highly similar sequences: e-value cut-off, 1 × 10 −20 ; word size, 28; match/mismatch scores, 1/-2; and gap costs, linear.
The obtained BAC clone insert sequences and narrow-leafed lupin scaffold fragments corresponding to the narrow-leafed lupin genome regions carrying GS1, GS2, and PEPC genes (average length of 100 kb) were subjected to computational characterization of repetitive content and gene coding sequences. Repetitive elements were annotated and masked using RepeatMasker Web Server version 4.0.3 (search engine, cross_match; speed/sensitivity, slow; DNA source, Arabidopsis thaliana) and supplemented with the CENSOR tool accessed via the Genetic Information Research Institute (sequence source, Viridiplantae; force translated search; mask pseudogenes).
Gene prediction was performed using FGENESH [92] with G. max as a reference species. Functional annotation of predicted coding sequences was performed with the use of the BLAST algorithm (e-value cut-off, 1 × 10 −10 word size, 28; match/mismatch scores, 1/-2; and gap costs, linear). The obtained GS1, GS2, and PEPC gene structures were visualized and compared in Geneious software v 10.1 (http://www.geneious.com). The results of functional annotation were subsequently used for gene density (genes/kbp) calculation.

Phylogenetic Reconstruction of the Plant Species Tree
The reference genome sequences were gathered from Phytozome [97], NCBI/RefSeq [98], and Ensembl/Plants [99] databases. A full list of genomes and respective sources is available in Supplementary file 2.
For species tree reconstruction, a set of conserved homologs were selected with conditional reciprocal BLAST (CRB-BLAST) [100] against the Ensembl/Plants version of the A. thaliana representative proteome (longest encoded protein at each coding locus) with default settings. Singular loci with over 95% representation as single-copy orthologs over all the analyzed species were selected for species tree reconstruction, yielding a total of 29 loci. The alignment of representative protein sequences for each orthologous locus was obtained with MAFFT-LINSi v 7.310 [101], and a 70% occupancy threshold was used to filter the alignments with trimal, while simultaneously back translating to underlying codons with the -backtrans option provided in trimal [102]. All alignments were concatenated and partitioned analysis was conducted on the basis of this joint supermatrix. The list of all loci (by A. thaliana reference locus) and the respective evolutionary models used can be found in Supplementary file 10.
An approximate species tree was reconstructed with IQTREE v 1.5.5 [103]. Optimal model selections [104] were carried out using IQTREE's built-in capabilities (MFP option). Ultrafast bootstrap approximation [105] was used to assess the topology based on a 3000 iteration threshold (convergence was reached in 104 iterations).

Determining GS1, GS2, and PEPC Gene Families Evolutionary Patterns
Sequences were gathered with independent BLASTP (2.6.0) searches of each included plant genome (including non-legume reference genomes; full list included as Supplementary file 2) and the July 2017 version of the UniProt/SwissProt (The UniProt Consortium 2017) golden standard database. The resulting hits were filtered based on the maximum 1 × 10 −20 expectation value threshold and the minimum 40% coverage of at least one of the lupin homologs sequenced during the experimental phase of the project (sequences obtained from sequenced BAC clones: 047P22, 087N22, 036L23, 059J08, 064J15, 067C07, and 131K15 used as queries). Supervised clustering was then conducted in a procedure analogous to that described in our earlier work [46] and the sequences were compared against each other with USEARCH (UBLAST v8.1.1831 search with e-value threshold 1 × 10 −10 ) [106]. Finally, the pairwise relationships (e-values post log-transformation) were used to cluster the sequences with MCL [107] at multiple inflation threshold values. The optimal value of the inflation threshold was selected as 1.4, based on the averaged values of the silhouette width [108], which is a cluster quality measure independent of predefined class labels. The largest clusters, which contained all of the GS/PEPC hits found in SwissProt, were processed further. SwissProt sequences were initially kept for purposes of alignment/filtering, but were discarded for final phylogenetic tree reconstruction/reconciliation.
In order to filter out assembly errors, heavily truncated partial genes, and/or pseudogenes, additional criteria were used. All accepted sequences were aligned with MAFFT v7.310 and preprocessed with OD-seq [109]. OD-seq uses a gap-based distance metric to filter out outliers with significantly different gap patterns compared to the rest of alignment. Prior to assessment, a round of trimming was carried out with trimal, based on a very permissive 1% gap threshold (parameter choice resulting in retaining sequences longer than average). All discarded sequences can be found in Supplementary file 11. The PEPC sequence from Archaeoglobus fulgidus and GS sequence from Rhizobium meliloti were initially used to guide rooting (pruned prior to reconciliation), and both coding sequences were selected on the basis of respective SwissProt records.
During GS analysis, a singular, a previously established [110] sequence for L. japonicus was introduced in lieu of seemingly duplicated loci on the sixth pseudochromosome of the draft genome (Lj6g3v0410480/Lj6g3v0410490; both corresponding to C-terminal part of the full coding sequence). A comparison of the L. japonicus pseudochromosome and reference sequence of the previously cloned region, has shown that likely misassembly or recombination has affected the region, so the reference UniProt sequence was used in downstream analyses.
During PEPC analyses, sequences from the Volvox carteri NCBI/RefSeq genome were used in lieu of Phytozome version due to the higher gene model quality. Additionally, available sequences from Chlamydomonas reinhardtii were obtained through UniProt/SwissProt records (and corresponding GenBank entries), as the current reference genome does not contain full-length gene models corresponding to either PEPC1 or PEPC2.
Phylogenetic inference was conducted analogous to the species tree reconstruction described above (IQ-TREE, optimal model selection, ultrafast bootstrap approximation). Codon-based models and coding sequences were used in order to obtain a better resolution of recent bipartitions. The SCHN05 model [111] with a free-rate model of site heterogeneity [112] was selected in both cases (GS:SCHN05+R6, PEPC:SCHN05+R8). Based on the rule of parsimony, reconstructions with the least amount of inferred duplications/losses (minimum cost of optimal reconciliation based on DTL-RANGER [113] reconciliations of species/gene trees, with disabled horizontal transfer events) were chosen. Notably, this resulted in the selection of codon-based nucleotide alignments over protein sequences and the abandonment of alignment trimming for gene tree reconstruction. The visualization of optimal reconciliation was carried out with custom scripts in the Python/ETE2 environment based on the built-in ETE2 reconciliation procedure and DTL-RANGER results [114].

Selection Pressure Analysis
Pairwise selection pressure parameters, including Ka (the number of nonsynonymous substitutions per nonsynonymous site), Ks (the number of synonymous substitutions per synonymous site), and Ka/Ks ratios, were calculated in DnaSP 5 [115]. To follow the topologies of the trees, the branch-site test of positive selection was performed in PAML4 [116]. Two models were considered: a null model, in which the foreground branch might have different proportions of sites under neutral selection to the background (i.e., relaxed purifying selection), and an alternative model, in which the foreground branch might have a proportion of sites under positive selection. The hypothesis of positive selection was verified by the likelihood ratio test (alternative vs. null model) and p-value under a Chi-square distribution and one degree of freedom (maximum p-value threshold of 0.05 was used). Sites under positive selection for foreground lineages were predicted by naive empirical Bayes and Bayes empirical Bayes [117] (a minimum posterior probability threshold of 0.95 was used). Both analyses were based on the same alignments as those used for phylogenetic inference; however, codons present in less than 30% of sequences from a particular clade were removed (Supplementary file 12).

1.
GS and PEPC genes were shown to have had a complex history, with bacterial-type PEPCs emerging as those best suited for future phylogenetic inquiries into relationships between divergent legumes. 2.
Legume GS and PEPC genes evolved by both ancestral legume-wide and more recent lineage-specific WGDs. Descendants of these duplications have been retained in the majority of lineages and have sustained typical gene structures, implying differences in carbon/nitrogen metabolism due to regulatory rather than mechanistic changes. 3.
Legume PEPC and GS gene sequences were highly conserved by significant purifying selection. Tentative traces of positive selection can only be inferred in several branches and point to single residues, outside of the core set involved in ligand binding.

4.
Monocot family members of the GS gene family might be more ancient than dicot ones, stemming from the selective culling of duplicates predating the separation of both lineages. 5.
The general patterns of lineage-specific duplications suggest that sub-functionalization and/or regulatory rewiring played a large role in shaping the extant carbon and nitrogen primary metabolic pathways in some lineages (L. angustifolius, L. japonicus, and G. max).

Acknowledgments:
The authors would like to thank Matthew Nelson (CSIRO Perth) for providing expertise in narrow-leafed lupin genetic mapping and phylogeny studies.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.