Comparative Analysis of Impatiens Leaf Transcriptomes Reveal Candidate Genes for Resistance to Downy Mildew Caused by Plasmopara obducens

Impatiens downy mildew (IDM) is a devastating disease to garden impatiens. A good understanding of IDM resistance in New Guinea impatiens is essential for improving garden impatiens resistance to this disease. The present study was conducted to sequence, assemble, annotate and compare the leaf transcriptomes of two impatiens cultivars differing in resistance to IDM, reveal sequence polymorphisms and identify candidate genes for IDM resistance. RNA-Seq was performed on cultivars Super Elfin® XP Pink (SEP) and SunPatiens® Compact Royal Magenta (SPR). De novo assembly of obtained sequence reads resulted in 121,497 unigenes with an average length of 1156 nucleotides and N50 length of 1778 nucleotides. Searching the non-redundant protein and non-redundant nucleotide, Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes and Clusters of Orthologous Groups and Gene Ontology databases, resulted in annotation of 57.7% to 73.6% of the unigenes. Fifteen unigenes were highly similar to disease resistance genes and more abundant in the IDM-resistant cultivar than in the susceptible cultivar. A total of 22,484 simple sequence repeats (SSRs) and 245,936 and 120,073 single nucleotide polymorphisms (SNPs) were identified from SPR and SEP respectively. The assembled transcripts and unigenes, identified disease resistance genes and SSRs and SNPs sites will be a valuable resource for improving impatiens and its IDM resistance.


Introduction
The Genus Impatiens belongs to the family Balsaminaceae, which consists of about 850 species of mostly succulent annual or perennial herbs [1]. Wild Impatiens plants are primarily found in the mountainous regions of South-East Asia, south China, India and Africa while some species are also found in Japan, Europe, Russia and North America [2]. Garden impatiens (Impatiens walleriana Hook.f.) is one of the most popular flowers in the world, widely grown in garden beds, borders and woodland gardens as bedding plants and in containers, window boxes and hanging baskets as house plants [3,4]. Impatiens flowers are available virtually in all colors [3,4].
Impatiens downy mildew (IDM), caused by Plasmopara obducens (J. Impatiens noli-tangere), is a devastating disease to impatiens and can cause rapid defoliation and plant death. This disease was first reported in Germany in 1877 on Impatiens noli-tangere, a wild species of Impatiens native to many temperate countries in the northern hemisphere [5]. During the 1880s, P. obducens was identified in North America in native Impatiens species including I. capensis (synonym I. biflora), I. fulva and I. pallida [5]. The first recent occurrence of IDM on I. walleriana was reported in the UK in 2003 [6]. Subsequently it spread to other countries in Europe (including Norway (2010), Serbia (2011) and Hungary (2011)) and other continents [7][8][9]. IDM was reported in Taiwan and Japan in 2003 and in Australia in 2008 [10][11][12]. Since then, IDM has been problematic in these countries. In the USA, IDM appeared on I. walleriana plants in California in 2004 [13]. In 2011, IDM was reported in eleven states in the USA including Massachusetts, New York and Minnesota [14]. By 2012, this disease spread to 33 states [13]. In 2013, IDM was reported in Hawaii [15]. Outbreak of this disease in the USA in this short period of time has caused the loss of hundreds of millions of dollars, reducing the national annual wholesale value of impatiens from approximately $150 million in 2005 to $65 million in 2015 [16]. While I. walleriana is highly susceptible to IDM, New Guinea Impatiens (NGI) (Impatiens hawkeri) and its interspecific hybrids have shown resistance to IDM [17]. NGI has been grown as a substitute of garden impatiens currently and occupies a $55-million wholesale market [16]. Although the market of NGI is increasing, it ranked third among the alternatives of garden impatiens [18].
There has been a strong interest in introducing IDM resistance into I. walleriana cultivars due to continued strong consumer demand. Impatiens walleriana cultivars are available in a wide array of vibrant colors, possess excellent shade tolerance and are well adapted to a wide range of growing conditions in containers and garden beds [3]. Moreover, the ease of culturing, wide and greater availability of the seeds and cutting materials has increased the demand of garden impatiens [3]. Use of traditional breeding approaches to transferring the resistance from NGI to garden impatiens has been impeded by the differences between NGI and garden impatiens in chromosome number (2n = 2× = 16 in I. walleriana and 2n = 2× = 32 in NGI), size and morphology [19]. Hence, it is necessary to identify and isolate the gene(s) conferring IDM resistance in NGI and transfer the identified gene(s) into garden impatiens.
Disease resistance (R) genes are the most important component of the plant defense mechanism to confer resistance to pathogens carrying matching avirulence genes [20]. A great majority of isolated plant R genes encode nucleotide-binding site leucine-rich repeats (NB-LRR) proteins [21]. Based on structure, NB-LRR proteins are separated into Toll/interleukin-1 receptor (TIR)-domain-containing (TNL) and coiled coil (CC)-domain-containing (CNL) subfamilies. Plant NB-LRR proteins function in a network through signaling pathways and induce a series of defense responses such as initiation of oxidative burst, calcium and ion fluxes, mitogen-associated protein kinase cascade, induction of pathogenesis-related genes and hypersensitive responses [22][23][24][25]. Recent studies have shown that the TNL subfamily R proteins transduce signals through the enhanced disease susceptibility 1 (EDS1) protein and the CNL subfamily R proteins do so through non-race specific disease resistance 1 (NDR1) protein with some interchanges occurring in different plants [26]. There also exists a separate independent signal transduction pathway activated by Arabidopsis CNL proteins RPP8 and RPP13.
Few genomic resources are available in impatiens. Genomic and transcriptomic data are needed in impatiens to facilitate the identification of gene(s) conferring IDM resistance in NGI and the development of simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers for impatiens breeding. With the rapid development of next generation sequencing and analytical tools, whole-genome and transcriptome sequencing has become possible in non-model organisms. Transcriptome sequencing and characterization can provide a descriptive and biological insight of the functionality of an organism. Recently, transcriptome sequencing has been used commonly to identify candidate R genes and to develop molecular markers for disease resistance breeding [27,28]. There has been very few or no study on genes involved in disease resistance using transcriptome sequencing in impatiens. Here, we report the sequencing, assembly, annotation and characterization of the leaf transcriptomes of two impatiens cultivars (Super Elfin ® XP Pink (SEP) and SunPatiens ® Compact Royal Magenta (SPR)) that differ in IDM resistance and the use of the transcriptome data to identify differentially expressed transcripts and candidate R genes and SSR and SNP sites for development of molecular markers. Results of this study, including the assembled transcriptomes and the identified candidate genes and polymorphic sites, will greatly facilitate the development of molecular markers for dissecting the resistance mechanism for downy mildew resistance in NGI, tagging the responsible gene loci and expediting the improvement of downy mildew resistance in impatiens.

HiSeq Sequencing and De Novo Assembly
A total of over 126 million raw reads were generated from the leaf transcriptomes of the two impatiens cultivars, resulting in approximately 117 million clean reads for the two cultivars and approximately 58 million clean reads per cultivar. More than 97% of the reads had a Phred score of 20. The GC content of impatiens sequence reads was approximately 45% (Table 1). De novo assembly of the impatiens leaf transcriptomes in Trinity resulted in 121,497 unigenes containing approximately 140 million bases of nucleotides. The average length of these unigenes was 1156 nucleotides and the N50 was 1778 nucleotides. The total number of contigs for SPR and SEP was 122,166 and 104,752, respectively. These contigs were functionally annotated, which produced 87,415 and 69,369 annotated unigenes for SPR and SEP, respectively ( Table 2).  There were 42,789 and 41,853 distinct singletons identified in SPR and SEP, respectively ( Table 2). There were 12,835 (14.68%) and 9431 (13.60%) unigenes that were of 2000 or more nucleotides in length in SPR and SEP cultivars (Table S1). Similarly, 22,656 (18.55%) and 22,175 (21.17%) of contigs with more than 500 nucleotides length were assembled in SPR and SEP cultivars respectively (Table S2).

COG Classification and KEGG Pathway Mapping
More than 54,521 impatiens unigenes (44%) were annotated in the KEGG database (Table 3). These unigenes were classified into 128 pathways, of which the primary five classes are: cellular processes (3.13%), environmental information processing (2.34%), genetic information processing (16.41%), metabolism (75%) and organismal systems (4%) ( Table 4). The top 25 mapped pathways annotated by the KEGG database are shown in Figure 2, among which metabolic pathways (24.70%) was the predominant one, followed by biosynthesis of secondary metabolites (12.01%), plant-pathogen interaction (7.50%), plant hormone signal transduction (6.93%) and spliceosome (4.30%) ( Table 5).  COG classification distributed the annotated impatiens unigenes into 25 categories of functional class. The General Function Prediction Only class contained 13,290 impatiens unigenes, followed by the Transcription class with 6959 impatiens unigenes. The Nuclear Structure class category had only one impatiens unigene ( Figure 3). All the categories are shown in Figure 3.

Functional Annotation Based on GO Classification
GO classification differentiated 70,190 impatiens unigenes into at least one of the three categories: biological process, cellular component and molecular function and 55 sub-categories. There were 22 sub-categories in the biological process, among which "cellular process," "metabolic process" and "single-organism process" had the three highest number of unigenes, involving 44,175,

Functional Annotation Based on GO Classification
GO classification differentiated 70,190 impatiens unigenes into at least one of the three categories: biological process, cellular component and molecular function and 55 sub-categories. There were 22 sub-categories in the biological process, among which "cellular process," "metabolic process" and "single-organism process" had the three highest number of unigenes, involving 44,175, 42,260 and 31,456 unigenes, respectively ( Figure 4). Similarly, in the "cellular component" category, the top three components involving highest number of unigenes were "cell part", "cell" and "organelle" sub-categories which involved 53,480; 53,480 and 42,491 unigenes, respectively ( Figure 4). In the third category "molecular function" the three sub-categories involving the highest number of unigenes were "catalytic activity", "binding" and "transporter activity" with 34,479; 31,890 and 4764 unigenes, respectively ( Figure 4).

Disease Resistance/Defense Genes
In this study, we were particularly interested in genes that are potentially involved in disease resistance and defense-related mechanisms in impatiens. Hence, the impatiens unigenes that were functionally annotated to disease resistance with different databases were selected for further characterization. A total of 164 impatiens unigenes were identified through functional annotation to be potentially involved in disease resistance (Table S3). Fifteen of these unigenes were more than 2000 bp in length each and were found in a greater abundance in the IDM-resistant cultivar SPR than in the IDM-susceptible cultivar SEP ( Table 6).
Two of these unigenes were similar to the NB-LRR class R genes and another two unigenes were similar to the LRR class R genes. In addition, two unigenes showed similarity to the recognition of Peronospora parasitica 13 (RPP13)-like protein 1 family and four unigenes hit putative disease resistance proteins in Arabidopsis thaliana. A number of unigenes were similar to the LETM1-like protein, cation efflux family protein, receptor like protein 35, or DNA-binding storekeeper protein-related transcriptional regulator protein classes. There was an uncharacterized unigene, uncharacterized mitochondrial protein AtMg00860 that was found to be involved in the disease resistance mechanism ( Table 6).
The abundance of these unigenes varied between the IDM-resistant cultivar SPR and susceptible cultivar SEP. CL1855.Contig4, similar to RPP13-like protein 1 family, reached a FPKM value of 96. 38 Table 6). All these four unigenes were similar to putative Impatiens noli-tangere, in A. thaliana. Table 6. List of impatiens unigenes selected based on more than 2-fold higher abundancy in IDM-resistant impatiens (SPR) compared to IDM-susceptible impatiens (SEP) that were annotated to previously known genes for disease resistance or defense using NR (non-redundant protein), NT (non-redundant nucleotide), SWISS-PROT, KEGG (Kyoto Encyclopedia of Genes and Genomes), COG (Clusters of Orthologous Groups of proteins) and GO (Gene Ontology) databases.

Discovery of SSRs and SNPs
A total of 22,484 SSRs were identified in 19,017 out of the 121,497 impatiens unigene sequences. There were 2986 unigene sequences that each contained more than one SSR and 757 SSRs contained compound repeats. The distribution of SSRs consisted of 2053 mononucleotides, 8195 dinucleotides, 10,501 trinucleotides, 599 tetranucleotides, 488 pentanucleotides and 648 hexanucleotides, among which the trinucleotide type AAG/GTT was the most abundant (3466), followed by the ATC/ATG type (1966) and dinucleotide type AT/AT (1165). The distribution of SSR repeat types among the impatiens unigenes are shown in Figure 5 Primers were developed using the SSRs identified from the unigenes and before and after application of the filtration, there were 44890 and 10389 primers developed (Table S4).

Discovery of SSRs and SNPs
A total of 22,484 SSRs were identified in 19,017 out of the 121,497 impatiens unigene sequences. There were 2986 unigene sequences that each contained more than one SSR and 757 SSRs contained compound repeats. The distribution of SSRs consisted of 2053 mononucleotides, 8195 dinucleotides, 10,501 trinucleotides, 599 tetranucleotides, 488 pentanucleotides and 648 hexanucleotides, among which the trinucleotide type AAG/GTT was the most abundant (3466), followed by the ATC/ATG type (1966) and dinucleotide type AT/AT (1165). The distribution of SSR repeat types among the impatiens unigenes are shown in Figure 5 Primers were developed using the SSRs identified from the unigenes and before and after application of the filtration, there were 44890 and 10389 primers developed (Table S4). The SOAPsnp software was used to identify SNPs in the impatiens transcriptome sequences. The combined assembled impatiens sequence was used as reference and the sequence reads from each cultivar were aligned to the reference individually to call SNPs. SNPs with the number of reads less than seven and the quality score below 20 were discarded. As a result, there were 245,936 and 120,073 SNPs discovered in SPR and SEP cultivars, respectively. The most dominant SNPs type was the transition type, consisting 139,794 (56.84%) and 71,181 (59.28%) for SPR and SEP, respectively, followed by the transversion type consisting of 106,142 (43.16%) and 48,892 (40.72%) for SPR and SEP, respectively. The number of SNPs identified in SPR was higher in all categories and in total in comparison to that of SEP ( Figure 6). The list of SNPs in SPR and SEP are supplemented in Tables S5   and S6, respectively. Supplementary Table S7 contain the SNPs, their positions in specific contigs and presence in specific cultivars and supplementary Table S8 contains the SNPs that have been identified in the contigs that are involved in disease resistance and explained in Table 6. The SOAPsnp software was used to identify SNPs in the impatiens transcriptome sequences. The combined assembled impatiens sequence was used as reference and the sequence reads from each cultivar were aligned to the reference individually to call SNPs. SNPs with the number of reads less than seven and the quality score below 20 were discarded. As a result, there were 245,936 and 120,073 SNPs discovered in SPR and SEP cultivars, respectively. The most dominant SNPs type was the transition type, consisting 139,794 (56.84%) and 71,181 (59.28%) for SPR and SEP, respectively, followed by the transversion type consisting of 106,142 (43.16%) and 48,892 (40.72%) for SPR and SEP, respectively. The number of SNPs identified in SPR was higher in all categories and in total in comparison to that of SEP ( Figure 6). The list of SNPs in SPR and SEP are supplemented in Tables S5  and S6, respectively. Supplementary Table S7 contain the SNPs, their positions in specific contigs and presence in specific cultivars and supplementary Table S8 contains the SNPs that have been identified in the contigs that are involved in disease resistance and explained in Table 6.

Discussion
RNA-Seq has become a very powerful technique for analysis of gene expression, identification of candidate genes involved in the expression of important traits and rapid discovery of large numbers of SSRs and SNPs in non-model plant species [29][30][31]. In this study, the leaf transcriptomes of impatiens were sequenced, assembled, annotated and compared to identify candidate genes potentially involved in resistance against IDM. We found fifteen unigenes that were more than 2000 nucleotides long and had at least 2-fold more transcripts in the IDM-resistant cultivar SPR than in the IDM-susceptible cultivar SEP. Most of the identified candidate genes showed high levels of similarity to the NB-LRR or LRR gene families. Members of the NB-LRR gene family are known to confer plants resistance to diseases caused by bacteria, fungi and viruses and are localized in the cytoplasm [32][33][34][35]. LRR genes have been shown to be involved in fungal resistance. The LRR domain in the NB-LRR and LRR R-proteins plays a critical role in defense response by perceiving signals from effectors released by pathogens or pathogens themselves and transducing the signals to initiate the defense responses [36][37][38][39]. The NB-LRR proteins are divided into two groups based on the presence of Toll/interleukin-1 receptor (TIR) or coiled-coil (CC) domain in the amino terminal end. Despite the common function of pathogen recognition, the sequence and signaling mechanism of TNL and CNL proteins are different [40][41][42]. In this study, there were two genes identified as TNLs or CNL encoding proteins.
RPP13 is a part of a NB-ARC domain known for the presence in APAF-1 (apoptopic proteaseactivating factor-1), R proteins and CED-4 (Caenorhabditis elegans death-4 protein) and belongs to the CC-NBS-LRR family that encodes NBS-LRR type R protein with a putative amino-terminal leucine zipper [43]. RPP13 has been reported to confer resistance to five isolates of Peronospora parasitica that caused downy mildew in A. thaliana [44]. More than 20 loci for RPP genes has been already identified in A. thaliana conferring recognition of P. parasitica [45]; genes from three loci, RPP1, RPP5 and RPP8 have been cloned [46][47][48]. RPP13 has been placed in the subclass of NB-LRR type R proteins that have putative leucine zipper (LZ) domain located near the N-terminus [49]. Before it was known that these loci confer resistance to downy mildew in A. thaliana, RPP13 was only known to possess resistant activity to Pseudomonas syringae pathovars [36][37][38][39]. After the understanding of TIR type NB-LRR R genes found at the complex RPP1 and RPP5 loci [47][48][49]. In Arabidopsis, RPP13 has been known to be reminiscent of R genes RPM1 [42] and RPS2 [36,38]. A recent study shows functionally diverged alleles as a distinction in RPP13 at a simple R gene locus [44]. In this study, there were two putative disease resistance RPP13-proteins identified in the IDM-resistant cultivar with higher amount than

Discussion
RNA-Seq has become a very powerful technique for analysis of gene expression, identification of candidate genes involved in the expression of important traits and rapid discovery of large numbers of SSRs and SNPs in non-model plant species [29][30][31]. In this study, the leaf transcriptomes of impatiens were sequenced, assembled, annotated and compared to identify candidate genes potentially involved in resistance against IDM. We found fifteen unigenes that were more than 2000 nucleotides long and had at least 2-fold more transcripts in the IDM-resistant cultivar SPR than in the IDM-susceptible cultivar SEP. Most of the identified candidate genes showed high levels of similarity to the NB-LRR or LRR gene families. Members of the NB-LRR gene family are known to confer plants resistance to diseases caused by bacteria, fungi and viruses and are localized in the cytoplasm [32][33][34][35]. LRR genes have been shown to be involved in fungal resistance. The LRR domain in the NB-LRR and LRR R-proteins plays a critical role in defense response by perceiving signals from effectors released by pathogens or pathogens themselves and transducing the signals to initiate the defense responses [36][37][38][39]. The NB-LRR proteins are divided into two groups based on the presence of Toll/interleukin-1 receptor (TIR) or coiled-coil (CC) domain in the amino terminal end. Despite the common function of pathogen recognition, the sequence and signaling mechanism of TNL and CNL proteins are different [40][41][42]. In this study, there were two genes identified as TNLs or CNL encoding proteins.
RPP13 is a part of a NB-ARC domain known for the presence in APAF-1 (apoptopic protease-activating factor-1), R proteins and CED-4 (Caenorhabditis elegans death-4 protein) and belongs to the CC-NBS-LRR family that encodes NBS-LRR type R protein with a putative amino-terminal leucine zipper [43]. RPP13 has been reported to confer resistance to five isolates of Peronospora parasitica that caused downy mildew in A. thaliana [44]. More than 20 loci for RPP genes has been already identified in A. thaliana conferring recognition of P. parasitica [45]; genes from three loci, RPP1, RPP5 and RPP8 have been cloned [46][47][48]. RPP13 has been placed in the subclass of NB-LRR type R proteins that have putative leucine zipper (LZ) domain located near the N-terminus [49]. Before it was known that these loci confer resistance to downy mildew in A. thaliana, RPP13 was only known to possess resistant activity to Pseudomonas syringae pathovars [36][37][38][39]. After the understanding of TIR type NB-LRR R genes found at the complex RPP1 and RPP5 loci [47][48][49]. In Arabidopsis, RPP13 has been known to be reminiscent of R genes RPM1 [42] and RPS2 [36,38]. A recent study shows functionally diverged alleles as a distinction in RPP13 at a simple R gene locus [44]. In this study, there were two putative disease resistance RPP13-proteins identified in the IDM-resistant cultivar with higher amount than in the susceptible cultivar. Presence of genes similar to RPP13 indicate the possibility of RPP-like genes conferring the resistance to IDM. Resistance to P. parasitica by RPP13 in A. thaliana Niederzenz accession is known to be independent to either Salicylic acid, NRD1 or EDS1 [50] triggered by the avirulence gene ATR13 [51]. RPP13 encodes protein that is predicted to be located in cytoplasm [43]. In addition to RPP-like genes, increased expression of other genes similar to LETM1-like protein, Cation efflux family protein, Receptor like protein 35 and DNA-binding storekeeper protein-related transcriptional regulator indicates the complexity of defense mechanism in the resistant line to IDM. Further research on the mode of resistance in New Guinea Impatiens for IDM could reveal the involvement of these proteins in conjunction with RPP13 to confer resistance or exhibit different mode of action.
Functional annotation of these genes also grouped them into different categories on the basis of similarity. The common method of recognition and signaling mechanisms for resistance exhibited by different classes of gene families differently in unrelated group of pathogens indicates the possibility of functional diversity of R genes to defend the host from different kinds of pathogens in different species of host. It has been previously reported that R genes exhibit polymorphism regarding the role in defense mechanism. There were other unigenes that were identified to be functionally similar to RPS2-like proteins family in the transcriptome but not all of them were significantly upregulated in the resistant cultivar in this experiment.
Being abundant and randomly distributed in plant genomes, SNPs and SSRs are very useful loci for development of molecular markers for high-density, high-resolution gene or genome mapping and marker-assisted breeding. RNA-Seq and bioinformatics analytical tools have made the discovery of SSR and SNP sites easy and fast. So far, molecular markers have not been reported for downy mildew resistance in impatiens. This study resulted in the discovery of 22,484 SSRs and 245,936 and 120,073 SNPs in SPR and SEP, respectively. These SSR and SNP sites offer a very useful resource for developing molecular markers that can be used by geneticists and breeders for genetic studies and improvement of impatiens.

RNA Isolation, cDNA Synthesis and Sequencing
Frozen leaf tissue samples of SEP and SPR were shipped on dry ice to Beijing Genomics Institute (BGI) in Shenzhen, China. Total RNA was extracted from the impatiens leaf tissues using pBiozol Total RNA Extraction Reagent (BioFlux, Hangzhou, China) following manufacturer's instructions. The RNA concentration, RNA integrity number (RIN) and the 28S/18S ratio were determined on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). An equal amount of RNA from each of the three biological replicates was combined to form a pool of resistant and susceptible RNA samples for mRNA isolation and cDNA synthesis. The combined RNA samples were treated with DNase I to remove residual genomic DNA and mRNA was isolated using magnetic beads with Oligo (dT). The isolated mRNA molecules were fragmented using the Elute Prime Fragment Mix from Illumina TruSeq TM RNA sample prep kit v2 (Illumina, San Diego, CA, USA) at 94 • C for 8 min. The fragmented mRNAs were used as templates for cDNA synthesis. First strand cDNA was synthesized by using the First Strand Master Mix and SuperScript II from Invitrogen (Carlsbad, CA, USA) and amplified at 25 • C for 10 min, 42 • C for 50 min and 70 • C for 15 min. Second cDNA strand was synthesized using the Second Strand Master Mix from Invitrogen at 16 • C for 1 h. The resulted cDNA fragments were purified, their ends were repaired using the End Repair Master Mix from Invitrogen at 30 • C for 30 min and the Ampure XP beads (Beckman Coulter, Brea, CA, USA) and a single adenine (A) base were added to each end of the cDNA fragments. End-repaired and A-added short fragments were joined with adapters. The adapter-ligated cDNA was PCR-amplified to generate sequencing libraries. The libraries were analyzed with an Agilent 2100 Bioanalyzer and the ABI StepOnePlus Real-Time PCR System for quantifying and qualifying the sample library. Sequencing of cDNA was performed on an Illumina HiSeq™ 2000, generating 100 bp paired-end reads.

Sequence Filtering, De Novo Assembly and Transcriptome Analysis
Raw sequence reads were cleaned by removing the adapters attached at the ends and reads with unknown nucleotides more than 5% were removed using BGI's internal software (filter_fq). Low quality reads with more than 20% of quality value < 10 were removed, Q20 percentage, proportion of unknown nucleotides in clean reads (N) percentage and GC percentage among total nucleotides were calculated and the remaining clean reads were used in further analysis. De novo transcriptome assembly was done using the Trinity software (http://trinityrnaseq.sourceforge.net/) [52] with the following parameters: minimum contig length of 100, minimum glue 3, group pair distance 250, path reinforcement distance 85 and minimum kmer coverage 3 using the cleaned short reads. The longest, non-redundant, unique transcripts were defined as unigenes. Further processing of the sequences for redundancy removal and splicing was done using the sequence clustering software, TIGR Gene Indices Clustering Tools (TGICL) v2.1 (http://sourceforge.net/projects/tgicl/files/tgicl%20v2.1/) with minimum overlap length 40, base quality cutoff of clipping 10 and maximum length of unmatched overhangs 20. Homologous transcript clustering was done using Phrap release 23.0 (http://www. phrap.org/) with repeat stringency 0.95, minimum match 35 and minimum score 35.
Functional annotation of the unigenes was done by comparing them using the NR (non-redundant protein) and NT (non-redundant nucleotide) databases of National Center for Biotechnology Information (http://blast.ncbi.nlm.nih.gov/Blast.cgi), SWISS-PROT database (European Bioinformatics Institute, ftp://ftp.ebi.ac.uk/pub/databases/swissprot/), KEGG (Kyoto Encyclopedia of Genes and Genomes database), COG (Clusters of Orthologous Groups of proteins database) and GO (Gene Ontology) with BLASTx using E-value cutoff of 1 × 10 −5 . The best aligning results were used to determine the sequence direction of unigenes. When the results of databases conflicted each other, a priority order of NR, SWISS-PROT, KEGG and COG was followed to determine the sequence direction. When unigenes were found to be unaligned with any of the databases, ESTScan was used to determine the sequence direction [53]. Blast2Go (http://www.blast2go.com/b2ghome) was used to annotate unigenes generated by NR annotation to get GO annotation [54]. After GO annotation, WEGO software was used for functional classification of the unigenes and to understand the distribution of gene functions from the macro level [55]. KEGG database was used for metabolic pathway analysis of gene products in cell and function of gene products. Further study of unigene pathways was done using Path_finder (http://www.genome.jp/) with default parameters [56].

Unigenes Sequence Comparison and Phylogenetic Analysis
The sequences of selected unigenes from impatiens were compared to the Arabidopsis Information Resource (TAIR) Database (https://www.arabidopsis.org/) [57] to show the functional similarity and the best aligning results were used. BLAST was done using CDS sequences of fifteen unigenes (Table 6) involved in disease resistance to the nr/nt database for sequence similarity.

Conclusions
This study represents the first application of RNA-Seq and sequence analysis to the leaf transcriptomes of garden impatiens, a very important floriculture crop and a very popular garden plant in the world and a representative species of the family Balsaminaceae. The leaf transcriptome sequences will become an invaluable resource for understanding the genetic makeup of this important plant. The transcripts and unigenes assembled can provide valuable template sequences for gene mining and gene expression analysis in impatiens. The unigenes that were abundantly expressed in IDM-resistant impatiens while absent or expressed at very low levels in the IDM-susceptible impatiens and show strong similarity to plant R genes can serve as candidate genes for understanding the genetic basis of the IDM resistance in I. hawkeri and over-expressing in IDM-susceptible I. walleriana for improved IDM resistance. The SSR and SNP sites identified in the impatiens transcriptomes will be very useful for molecular marker development, gene and genome mapping and identification of specific markers for important plant, foliar and flower traits in impatiens.
Author Contributions: K.B. analyzed the data and drafted and revised the manuscript; W.W. prepared plant materials and performed the greenhouse experiments; Z.C. performed initial mining of the transcriptome data; Z.D. conceived the study, reviewed data and results and revised the manuscript. All authors read and consented to the final version of the manuscript.
Funding: This study was supported in part by the American Floral Endowment, the Frederic C. Gloeckner Foundation, Inc. and USDA-NIFA hatch projects FLA-GCR-005065 and FLA-GCC-005507.