Genome-Wide Identification and Analysis of the MADS-Box Gene Family in American Beautyberry (Callicarpa americana)

The MADS-box gene family encodes a number of transcription factors that play key roles in various plant growth and development processes from response to environmental cues to cell differentiation and organ identity, especially the floral organogenesis, as in the prominent ABCDE model of flower development. Recently, the genome of American beautyberry (Callicarpa americana) has been sequenced. It is a shrub native to the southern region of United States with edible purple-colored berries; it is a member of the Lamiaceae family, a family of medical and agricultural importance. Seventy-eight MADS-box genes were identified from 17 chromosomes of the C. americana assembled genome. Peptide sequences blast and analysis of phylogenetic relationships with MADS-box genes of Sesame indicum, Solanum lycopersicum, Arabidopsis thaliana, and Amborella trichopoda were performed. Genes were separated into 32 type I and 46 type II MADS-box genes. C. americana MADS-box genes were clustered into four groups: MIKCC, MIKC*, Mα-type, and Mγ-type, while the Mβ-type group was absent. Analysis of the gene structure revealed that from 1 to 15 exons exist in C. americana MADS-box genes. The number of exons in type II MADS-box genes (5–15) greatly exceeded the number in type I genes (1–9). The motif distribution analysis of the two types of MADS-box genes showed that type II MADS-box genes contained more motifs than type I genes. These results suggested that C. americana MADS-box genes type II had more complex structures and might have more diverse functions. The role of MIKC-type MADS-box genes in flower and fruit development was highlighted when the expression profile was analyzed in different organs transcriptomes. This study is the first genome-wide analysis of the C. americana MADS-box gene family, and the results will further support any functional and evolutionary studies of C. americana MADS-box genes and serve as a reference for related studies of other plants in the medically important Lamiaceae family.


Introduction
Mints (Lamiaceae) are the sixth largest family of flowering plants and include many ornamental, medical, and edible species, such as basil, rosemary, thyme, peppermint, and spearmint. Full genome and transcriptome sequencing data that are available at the Mints Genome Project database (http://mints.plantbiology.msu.edu/index.html; accessed on 1 July 2021) and separate other projects are enhancing our understanding of this important medical plant family. American beautyberry (Callicarpa americana) is known for its prominent purple fruit, and it has been reported that native Americans have used it as an insect repellent and medicinal plant [1]. Studies have revealed a number of terpenoids, such as spathulenol, intermedeol, and callicarpenal that have been isolated from the plant, and proved to be effective as a mosquito repellent in laboratory experiments [2,3]. Callicarpa is a representative from the early-diverging mint lineage, and thus, it has an important phylogenetic position to study the evolution of key gene families, such as the MADS-box genes. Recently, the full genome sequence of C. americana has been published [4], providing tissues were analyzed. In addition, cis-regulatory elements were analyzed and identified in the 2 kb upstream promoter regions. Results indicated their broad range of functions in several C. americana tissues, with major roles in flower and fruit development and abiotic stress response. This study will help in improving our understanding of the evolution and function of this essential transcription factor family, in the medically important Lamiaceae family [24].

Identification of MADS-Box Genes and Their Distribution in C. americana Genome
Seventy-eight non-redundant MADS-box genes were obtained using the HMMER toolkit [25] to search the hidden Markov model of the MADS-box DNA-binding domain in C. americana proteome sequence, using both SRF (type I) and MEF2 (type II) MADS-box domain sequences (Table 1). Putative MADS-box genes were submitted to the SMART [26] and PROSITE [27] websites for further verification of the presence of the MADS domain. The following C. americana MADS-box genes (CamMADS 9,11,27,35,38,47,60,61,68,77) were identified as type II (MIKC) but they lacked the K-domain. Thus, a further inspection of the genomic data was necessary, and functional sites identification and genome annotation have been carried out using the FGENESH suite [28], using the CamMADS genomic DNA in reference to the S. indicum genes. The correct exons have been predicted, and the new CamMADS annotations contained both the M-and K-domains, as expected. * Gene annotation has been corrected. Chr = chromosome. TM8 gene is present in S. lycopersicum but not in A. thaliana.
Each MADS-box protein was then verified by BLASTP function at the Plant Transcription Factor Database [29,30] separately against A. thaliana, S. lycopersicum and S. indicum and finally against all species. CamMADS proteins were then initially categorized into type I (M-type) and type II (MIKC) based on their homology with their identified orthologues; Table 1 includes A. thanliana type II MADS-box genes orthologs, while type I CamMADS have low identity alignment scores to A. thaliana to be confidently assigned a specific ortholog. As reported in previous studies of other species, the number of type II MADS-box genes was higher than that of type I MADS-box genes. Five MADS-box proteins were neutral, with pI values between 6.5 and 7.5, 18 were acidic, with pI values less than 6.5, 55 were alkaline, with pI values greater than 7.5. It is worth noting that the average pI for M-type proteins was 7.5, and for the MIKC C group, it was 8.5 (basic), while all MIKC* proteins had acidic pI values, with a 5.6 average. The average predicted molecular weight for M-type proteins was 30,261.9 Da, and for MIKC C , 27,866.9 Da, while the MIKC* proteins had the highest molecular weight average of 40226.7 Da. The number of exons in CamMADS genes ranged from 1 to 15 exons. The number of exons in type II MADS-box genes (7-15) greatly exceeded the number in type I genes (1-7). The number of exons within the same subgroup did not vary much, with few exceptions.
After verification of the presence of the MADS-box domain and initial homology alignments, genes were mapped onto the 17 chromosomes of C. americana genome, and no MADS-box genes were located on the unanchored scaffolds ( Figure 1). The distribution of C. americana MADS-box genes (CamMADS) was uneven. The maximum number of genes (17; 21.79%) was localized on Chromosome 4, whereas Chromosomes 2, 7, and 16 have only one MADS-box gene each. Several CamMADS genes resided in a 2-5 genes cluster. To investigate possible gene duplication events, the OrthoFinder algorithm was utilized [31], and each two or more adjacent homologous genes located on a single chromosome were considered as co-linear duplicates. A total of 15 paralogous gene pairs were observed, of which four genes were MIKC-type (

Phylogenetic Analysis of MADS-Box Genes in C. americana
To properly classify the CamMADS proteins, three phylogenetic trees (maximum likelihood (ML) tree) for (1) all identified CamMADS proteins ( CamMADS proteins were classified into functional groups according to both A. thaliana and S. lycopersicum MADS-box genes that have been investigated extensively [32,33]. Based on the phylogenetic tree and structural features of the MADS-box proteins, genes were separated into 46 type II and 32 type I MADS-box genes. CamMADS genes were clustered into four groups: type II (MIKC C , MIKC*) and type I (Mα-type, Mγ-type), while the type I Mβ-type group was absent ( Figure 3). The Mβ-type group was also absent in S. indicum (sesame) [34]. While type II ( Figure 4) phylogenetic tree included all expected groups and subgroups, in accordance with A. thaliana and S. lycopersicum trees.   (Figure 4) MADS-box proteins were constructed using type I and type II MADS-box full length proteins from A. thaliana, S. lycopersicum (species with well characterized MADS-box genes), S. indicum (a species in lamiaceae family), A. trichopoda (a basal outer group), and the CamMADS proteins identified in the present study. CamMADS proteins were classified into functional groups according to both A. thaliana and S. lycopersicum MADS-box genes that have been investigated extensively [32,33]. Based on the phylogenetic tree and structural features of the MADS-box proteins, genes were separated into 46 type II and 32 type I MADS-box genes. CamMADS genes were clustered into four groups: type II (MIKC C , MIKC*) and type I (Mα-type, Mγ-type), while the type I Mβ-type group was absent ( Figure 3). The Mβ-type group was also absent in S. indicum (sesame) [34]. While type II ( Figure 4) phylogenetic tree included all expected groups and subgroups, in accordance with A. thaliana and S. lycopersicum trees.

Conservative Motif Distribution and Gene Structure Analysis of C. americana MADS-Box Genes
To better analyze the sequence characteristics and structural differences among the conserved motifs of all CamMADS proteins (list of CamMADS peptide sequences is available in the Supplementary Data), motifs were predicted by the MEME program ( Figure  5B). Motifs 1 and 16 represent the DNA binding MADS domain, and Motif 1 was the most typical MADS domain, 50 amino acids in length. Motifs 2 and 4 combined were the highly conserved K-domain (spanning K1, K2, and K3 subdomains). These motifs were present in all MIKC-type CamMADS genes. It is worth noting that even when the MEME suite did not recognize some K-domains in few CamMADS proteins, a second check by SMART and MotifFinder suites was enough to confirm that the K-domain was present. The length of the conserved K-domain (K1 + K2 + K3) was 67 (38 + 29) amino acids. In general, CamMADS proteins of the same subgroup had similar motifs, and it is probable that they might have conserved functions. While, the difference in motifs structure and distribution support the expected variety of function of CamMADS genes in different organs of C. americana.

Conservative Motif Distribution and Gene Structure Analysis of C. americana MADS-Box Genes
To better analyze the sequence characteristics and structural differences among the conserved motifs of all CamMADS proteins (list of CamMADS peptide sequences is available in the Supplementary Data), motifs were predicted by the MEME program ( Figure 5B). Motifs 1 and 16 represent the DNA binding MADS domain, and Motif 1 was the most typical MADS domain, 50 amino acids in length. Motifs 2 and 4 combined were the highly conserved K-domain (spanning K1, K2, and K3 subdomains). These motifs were present in all MIKC-type CamMADS genes. It is worth noting that even when the MEME suite did not recognize some K-domains in few CamMADS proteins, a second check by SMART and MotifFinder suites was enough to confirm that the K-domain was present. The length of the conserved K-domain (K1 + K2 + K3) was 67 (38 + 29) amino acids. In general, CamMADS proteins of the same subgroup had similar motifs, and it is probable that they might have conserved functions. While, the difference in motifs structure and distribu- To gain insights into the structural diversity of C. americana MADS-box genes, we analyzed the exon-intron organization of the coding sequences of each CamMADS gene ( Figure 5C). The number of exons followed a clear bimodal pattern. The type II (MIKC) CamMADS all had at least five introns (CamMADS11 and 27 in TM3/SOC1 group), up to 15 exons (CamMADS3 in SQUA group). While all type I (M-type) had only one exon-no introns-except, in the Mα group where CamMADS48 and 52 had three exons, and CamMADS46 and 2 had 8 and 9 exons, respectively, and in the Mγ group with CamMADS58 having seven exons. Few genes in the MIKC group have relatively long introns (>10 kb), compared to the rest of CamMADS genes.

Expression of C. americana MADS-Box Genes
The expression profile heat map of the 78 CamMADS genes was generated using the transcript per million (TPM) data [2]. CamMADS genes expression was analyzed in the following tissues: mature leaf, young leaf, stem, petiole, root, open flower, closed flower, and whole fruit, as shown in Figure 6. Overall, the CamMADS genes were active in all plant tissues under study, indicating their versatile role in many key physiological activities. The type II (MIKC) CamMADS genes had higher expression in the floral organ and later fruit, some of which were strictly expressed in the floral organ, which is expected as they are key regulators of the florogenesis process. The expression pattern suggests that the ABCDE model of flower development is also conserved in C. americana.
The To further assess the functions of CamMADS genes, the upstream 2 kb promoter region was analyzed for cis-acting regulatory elements, as shown in Figure 7. The following elements of key roles have been identified: W-boxes and TC-reach repeats are defense and stress-inducible promoters. AE-box, AT1-motif, chs, Box4, TCT-motif, G-box, and GT1-motif are involved in light responsiveness. MBS is the drought resistance-induced MYB binding site. ABRE is an abscisic acid response element. MeJA is the CGTCA-motif methyl jasmonate response element. AuxRR-core and TGA-element are regulatory auxin responsiveness elements. P-box, GARE-motif, and TATC-box are gibberellin response elements. STRE and WUN-motif are wound response elements. ARE is an anti-oxidant response element. O2-site is involved in zein metabolism regulation. GCN4_motif is involved in endosperm expression. CAT-box is related to meristem expression. TCA-motif is a salicylic acid response element. Circadian are cis-acting regulatory element involved in circadian control. LTR is involved in low-temperature responsiveness. MBSI is involved in flavonoid biosynthetic gene regulation.
All CamMADS genes have at least one regulatory element involved in light responsiveness. All have elements involved in wound response, except CamMADS46. Fifty-three genes have an abscisic acid response element. Forty-one genes have a methyl jasmonate response element. Twenty-six genes have either one or both of auxin responsiveness elements. Thirty-four genes have gibberellin response elements. Thirteen genes have an element involved in circadian control. Twenty-nine genes have elements involved in low-temperature responsiveness. A number of genes have different elements involved in metabolism and cell differentiation.
The upstream promoter regions were scanned for elements of GAGA (C-box), mainly GAGAGA hexamers and TGACGT-containing elements. Considering the other possible variation in the GA rich regions [35,36], all promoters had at least one of the aforementioned elements. In addition, all promoters have TATA-box and CAAT-box elements.

Discussion
MADS-box genes have been identified in several species, both the numbers and the types of MADS-box genes differed greatly among these species. Some species had very few type I (M-type) genes or lacked them totally, as in: Saccharum officinarum (grass), Marchantia polymorpha (Marchantiophyta), Klebsormidium flaccidum, Dunaliella salina, and Chlorella variabilis (Algaea). While, the Angiosperms species Amaranthus hypochondriacus and Jatropha curcas have ten genes. Several algae species had very few or lacked the type II (MIKC) genes, as in: Bathycoccus prasinos, Chlamydomonas reinhardtii, and Volvox carteri. Marchantia polymorpha (Marchantiophyta) has two type II (MIKC) genes, and Picea abies (Pinophyta) has three. While, the Angiosperms specie Daucus carota has five genes. Angiosperms also have the largest number of type I genes (Camelina sativa: 271 genes) and the largest number of type II genes (Glycine max, Soybean: 209 genes) [29,30].
The number of type I MADS-box genes in C. americana (32) was similar to S. indicum (31), but lower than Ocimum tenuiflorum (42), all members of Lamiaceae family. While, the number of type II genes in C. americana (46) was higher than that in O. tenuiflorum (43) but lower than S. indicum (62). The genome size of C. americana was 506.1 Mb [4], compared to the genome size of S. indicum 337 Mb [34] and 612 Mb estimated genome size for O. tenuiflorum [37]. When compared to the large soybean genome (1115 M) [34,38], which also has 269 MADS box genes, and Camelina sativa estimated the genome size of 785 Mb [39], which has 384 MADS box genes. The reduced number of genes in some Lamiaceae members might be justified by the smaller genome size and/or more active genome size reduction after duplication events, since the whole genome duplication event is a main contributor for the genes' number increment and diversification of species [40][41][42][43]. The clustering of genes is observed in other transcription factor families, such as Hox genes [44]. This clusters might have risen through tandem gene duplication events [18,45]. The high exon number in type II (MIKC) genes (5-15) compared to type I (1-9) is consistent with studies in other species, such as sesame, Arabidopsis, rice, and soybeans [32,34,38]. This also matches the more complex and versatile functions found in type II (MIKC) compared to type I (M-type) [7,12,18,23].
The Mβ-type of type I MADS-box genes was absent in C. americana; also, it was absent in S. indicum and U. gibba [34]. The absence of Mβ-type genes in these species, which are all members of the Lamiales order, is an indication of a close relationship between the Lamiacaea family (C. americana and S. indicum) and Lentibulariaceae family (U. gibba) within the Lamiales order. The function of most Mβ-type genes in Arabidopsis is not fully understood, but some play important roles in the differentiation of female gametophyte [32,46]. Either there is a different mechanism in C. americana due to the lack of Mβ-type genes, or there was a redundancy in their function and other CamMADS protein can still fill their role in the protein network. Mβ genes were reported to be absent in rice and other monocots as well [32], and the subgroup might have evolved as a lineagespecific clade.
CamMADS75 is an ortholog of TM8 gene present in S. lycopersicum, S. indicum, and A. trichopoda, but absent in A. thaliana. TM8-like genes were identified in gymnosperms and angiosperms. The pattern of genes expression in several different tissues and the lack of a clear associated phenotype related to TM8 deletion or overexpression render it difficult to pinpoint an exact function, and it could indicate that TM8-like genes are a clade of fast evolving genes [31,47]. Its promoter region has elements involved in stress and drought response, jasmonate and gibberellin response elements, and the GCN4_motif, which is involved in endosperm expression. Further molecular and systematic analysis of C. americana CamMADS75 TM8 ortholog could provide useful information on the function of this elusive gene.
In general, in each studied tissue, there was at least one CamMADS active gene being expressed. This hints to the importance and diversity in functions of this gene family in the C. americana plant. Type II CamMADS genes have an overall higher expressivity across all tissues compared to type I CamMADS. This is expected and can be justified, as the MIKC type genes are more complex and diverse than the M-type genes [7,12,18,23]. CamMADS51, an ortholog of Arabidopsis PISTILLATA (PI) gene, has the highest expression level in closed flower sample, along with CamMADS4, an ortholog of the Arabidopsis APETALA3 (AP3) gene. This is reasonable for the key roles that PI and AP3 plays during the florogenesis [19].
CamMADS64 an ortholog of Arabidopsis AG gene was highly expressed in whole fruit sample [48][49][50]. CamMADS68, an ortholog of Arabidopsis FLC, was suppressed during flower development, since it is a suppressor of flowering, implying that it has a conserved function in C. americana [14,21,23]. CamMADS47 and CamMADS60, members of the MIKC*-S subgroup, were expressed in flower tissues, hinting to a possible conserved function during male gametophyte development [15,18,22]. Some of the MIKC group genes were expressed in root, stem, and leaves tissues in addition to their key role in florogenesis. This is consistent with the patterns of MADS-box gene expression in A. thaliana where several genes are involved in biological processes other than florogenesis. A. thaliana FLM and FLC are involved in vernalization. FLC, SVP, and SOC1 are involved in drought response; the presence of cis-acting regulatory elements in the promoter regions involved in drought response in ortholog CamMADS implies a possible conservation of functions. ANR1 and AGL21 are involved in lateral root formation; both respective ortholog CamMADS38 and CamMADS44 are expressed in the C. americana root. SOC1, AGL21, and FLC are involved in abscisic acid (ABA) and gibberellin (GA) metabolism [8]; their orthologs in C. americana have the cis-acting regulatory elements involved in ABA and gibberellin GA metabolism. These functions might be conserved in C. americana as well, for the orthologs expression profile can justify the presence of these subgroups' members in the plants' respective tissues.
All promoters had at least one of the GAGA (C-box) elements, which is required for the normal expression of a wide range of different genes; it can facilitate activation by a remote enhancer. Cytokinin response elements was shown to interact with the C-box in A. thaliana [51,52]; a similar mechanism could be at play here in C. americana.
In addition to the upstream promoter region, the first intron of each CamMADS genes-when available-was scanned for cis-regulatory elements, all introns contained TATA-box and/or CAAT-box elements, in addition to few other elements found in the upstream promoter region. This might point to a possible role of the intronic region in gene regulation in CamMADS genes [53].
In A. thaliana, most type I MADS-box genes are expressed weakly, and their function is not as clear as type II MADS-box genes. The expression of CamMADS17 and CamMADS20 genes in the flower bud tissues suggested that they might have a role in flower development. This is in line with what some studies suggest that type I genes are involved in A. thaliana reproduction and development [32,46]. It is worth noting that some genes appear to have no expression in any C. americana tissue. This might be due to the fact that some of the MADS-box genes are activated in response to certain environmental cues and abiotic stress responses, such as: temperature, salinity, drought, and wound response [8,9]. Another possibility is that these gens might be pseudogenes being transcribed to RNA at a very low level, with no function, or might be redundant genes going through neofunctionalization process. The presence of two or more orthologs of A. thaliana MADS-box genes either reflect a functional redundancy, or some of these genes might have acquired new functions, or they might differ in response to different environmental cues to fine tune gene expression level in C. americana. The C. americana genome analyses have revealed three putative whole-genome duplication events [2]. Gene duplication events were also recently reported in mints [43]. Whole genome duplication events might have contributed to MADS-box gene family expansion.

Conclusions
Based on the latest C. americana genome sequence and RNA-Seq data, 78 CamMADS genes were identified using bioinformatics tools and were classified as M-type (Mα and Mγ) and MIKC-type (MIKC* and MIKC C ) according to their evolutionary relationships and protein structure characteristics. The Mβ-type of type I MADS-box genes was absent in C. americana, as it was absent in S. indicum and U. gibba. The absence of Mβ-type genes in these species, which are all members of the Lamiales order, might hint to a close relationship between Lamiacaea family and Lentibulariaceae family within the Lamiales order. Gene structure analysis revealed that type II genes contained a greater number of exons than did type I genes. The expression pattern of CamMADS genes in eight tissues, and the cis-regulatory element analysis of their promoter regions suggest an overall conservation of some of the abiotic stress responses and the ABCDE model of flower development functions to some extent in C. americana. The absence of certain elements and the change in expression patterns could point to some MADS-box genes being diversified in functions, or simply to a redundancy in function. This study will help guide future molecular proteinprotein interaction analysis studies to confirm the interactions and functions of each of the CamMADS genes presented.

Identification and Sequence Analysis of MADS-Box Genes
The C. americana (beautyberry) genome and proteome were downloaded from NCBI (PRJNA529675). The hidden Markov model (HMM) profiles of the SFR (type I) domain (PF00319) and Myocyte Enhancer Factor-2 (MEF2) type II domain (PF09047) were retrieved from Pfam [54]. MADS-box genes were identified in the C. americana proteome using the hidden Markov model (HMM) profile corresponding to the Pfam MADS-box family PF00319 and PF09047 domains, using HMMER v. 3.0 [25], and redundant sequences were removed manually. A total of 78 MADS-box proteins were obtained as candidate MADS-box genes. The amino acid sequences were then searched, based on the conserved domains, using ScanProsite and the simple modular architecture research tool (SMART) to confirm that all genes contained the MADS-box domain [26,27]. The annotations of the type II (MIKC) genes that were missing the K-domain were corrected by the FGENESH suite [28,29], using the CamMADS genomic DNA in reference to the S. indicum genes. The online tool ProtParam [55] was employed to analyze theoretical molecular weights and isoelectric points (PI).

Assigning the Location of MADS-Box Genes to the C. americana Genome
The physical positions of MADS-box genes were mapped to the 17 chromosomes of C. americana using the coding DNA sequence files. The TBtool suite [56] was used to visualize the genes on chromosomes. The OrthoFinder algorithm [57] was used to identify possible duplicated genes.

Alignment and Phylogenetic Analysis of MADS-Box Genes
Type I and type II MADS-box proteins of A. thaliana, S. lycopersicum, S. indicum, and A. trichopoda were downloaded from PlantTFDB 5.0 database [30], then C. americana type I and type II MADS-box full length proteins were aligned to them using UGENE MUS-CLE [58] with the following settings (Gap Open: −2.9, Gap Extended: 0.0, Hydrophobicity Multiplier: 1.2, Cluster Method: UPGMA). Two unrooted maximum likelihood (ML) trees of C. americana, A. thaliana, S. lycopersicum, and S. indicum type I and type II MADS-box proteins were constructed using the MEGA-X software [59,60] with the following settings (Bootstrap: 1000, Model: Jones-Taylor-Thornton, Uniform rates, Gaps: Use all sites). Another circular and linear phylogenetic tree of all C. americana MADS-box proteins was constructed using the same method.

Gene Structure and Conserved Motif Analysis
Gene structures were constructed using a GFF3 file downloaded from the GIGA database [61], BioProject (PRJNA529675). The structures were displayed using Gene Structure Display Server (GSDS 2.0) [62]. The MEME server [63] was used to predict conserved motifs with the following parameters: number of repetitions = any, maximum number of motifs = 20, optimum motif width set to ≥6 and ≤200, based on our knowledge of MADS-box protein domains.

Expression Profiling of MADS-Box Genes and Cis-Acting Regulatory Element Analysis
The RNA-Seq data were obtained from GIGA database [61] SRA study (SRP192973). These RNA-Seq data contained the transcriptomes of young leaf, mature leaf, stem, petiole, root, close flower, open flower, and whole fruit. Transcript abundance was calculated by normalized transcripts per million (TPM) values, and data were represented as a heatmap, using MS Excel sheets [64]. The cis-acting regulatory element analysis was performed on 2kb upstream promoter sequences of CamMADS genes, using PlantCARE server [65], and the number of elements was presented as a heatmap.