Transcriptomic Diversity of Solanum tuberosum Varieties: A Drive towards Future Analysis of Its Polyploidy Genome †

: Despite the signiﬁcance of potatoes in combating hunger and ensuring global food security, their potential for fostering a sustainable society has not been fully exploited due to potatoes’ complex biological system as a polyploid. Therefore, there is a need for a more gene-informed potato breeding program to improve yields, nutrient content and other market characteristics. This study aimed at analysing the RNA-seq data from leaf samples of four potato varieties annotated as HJ, HL, LS and V7, to understand the transcriptomic diversity among the varieties. A pipeline was developed and used for the analyses of the fragments reads from each potato variety. A signiﬁcant amount (>85%) of fragment reads in all samples were mapped to the reference genome. Out of 27,356 gene features obtained from this study, 65.93% were expressed in all samples, and 4.5% were unique to individual potato species. Although all potato varieties’ top 10 expressed genes were associated with chloroplastic proteins/enzymes, other highly unique genes are yet to be fully annotated. Furthermore, the result from fold-change analysis, hierarchical-cluster plot and heatmap showed potato varieties HJ as the most distant species, whereas potato varieties HL and V7 are most similar. More so, the heatmap showed that genes expressed in HJ had the most similar cluster among themselves. Although limited by the unavailability of phenotype information and sample replicates, this study has shown that potato varieties, even with the same polyploid number, express a signiﬁcant level of diversity in their transcriptome under the same condition.


Introduction
Cultivated potato (Solanum tuberosum L.) is a tuber crop grown globally for food because of its high nutrient content. Ever since it was first discovered in South America and domesticated over about 8000 years ago, its global significance for combating hunger and ensuring food security has grown tremendously [1]. Presently, cultivated potato is the third most important food crop in the world after rice and wheat. It is cultivated in more than 125 countries with over a billion people consuming it daily [2]. Over 376 million tonnes of potatoes are globally produced annually, with China being the number one producer [3]. Potatoes are relatively easy to cultivate. Although some species especially diploid potatoes can be grown from botanical seeds, most potatoes cultivated for food are autotetraploid and propagated vegetatively by planting pieces of tubers containing axillary dominant bud (eyes)-from which new shoots sprout out. The underground stem of potatoes morphologically changes into a stolon which then develops into a tuber as the cultivated potato matures [3].
Although cultivated potatoes are broadly grouped into landraces, native varieties and improved varieties, taxonomically classifying over a thousand species belonging to the genus solanum has not been straightforward [4]. Several authors have recently presented different taxonomically classification for cultivated potatoes [4]. The challenges of potato taxonomy are as a result of the complexity of the potatoes' genome which arises due to gene introgression, interspecific hybridization, auto/allopolyploidy, sexual compatibility, toggling between sexual and asexual reproduction, recent species divergence, phenotypic plasticity and consequently high morphological similarity among species [4].
Polyploidy, which is a major cause of potato diversity, accounts for why potato varieties with a base chromosome number of n = 12 do exist as either diploids (2n = 2x = 24), triploids (3n = 3x = 36), tetraploids (4n = 4x = 48) and pentaploids (5n = 5x = 60). Nowadays, most of the cultivated potatoes grown for food are autotetraploids. There are well established theories and methods for quantitative genetic analysis of many important diploid species which have become invaluable tools for understanding the evolutionary, agronomy and medical significance of genes responsible for desired quantitative trait [5]. However, the genetic analysis of autotetraploid species (potato) has not been as straightforward as their diploid counterpart and the development of its methods has lagged for many years for several reasons [6,7].
RNA-seq is revolutionising the field of Agri-genomics and has continued to drive the sustainable productivity of plants and animals to ensure global food security [8]. The advent of RNA-seq has also made the sequencing of many polyploid plants a possibility and has revived the genetic analysis of many polyploids species. The overwhelming time and cost reduction which the technology proffers allow researchers not only to perform sequencing experiments at varying conditions of a cell but also perform replications for a more precise and accurate inference [9]. This study aimed at performing comparative analysis of RNA-seq data from leaf samples of four potato varieties annotated as HJ, HL, LS and V7 (all autotetraploid), to understand the transcriptomic diversity among the varieties.

Potato Cultivars
Four varieties of autotetraploid potato, Solanum tuberosum (HJ, HL, LS and V7) were grown under field conditions at the Qinghai Academy of Agriculture and Forestry Sciences, China (QAAFS). Leaf samples from each cultivar were harvested at the tuber filling stage (45-55 days after cultivation) for RNA extraction and Illumina sequencing analysis.

RNA Extraction and Illumina Sequencing Analysis
The RNA of the four potato cultivars were extracted from their harvested leaves using the Qiagen RNeasy Kit. The extracted RNA samples were sent to the Gene Energy Company (http://www.genenergy.cn/ (accessed on 5 December 2021) for Illumina sequencing. There were no replicates for each sample that were sequenced (2 × 150 bp paired end reads) for this pilot study.

Research Data Storage
All RNA-seq datasets of the four potato cultivars, their reference indexes and gene annotations used for this study were securely stored in a 3-terabyte allocated space of the Birmingham Environment for Academic Research (BEAR) central Research Data Store (RDS)-a facility of the University of Birmingham.

Computation Services and Analysis Software
All analysis performed on the RNA-seq datasets were conducted on the BEAR Linux high performance computing (HPC) environment (bash shell), also known as the blueBEAR. An SSH client software (PuTTY) was used to gain access to all blueBEAR services and software packages from an 8GB RAM personal computer. All software packages used for this study are highlighted in Table 1.

Methods
The RNA_Seq data were analysed using the summarized pipeline from Table 1, according to methods detailed by Pertea et al. (2016).

Quality Check on RNA-Seq Data
The first stage of the analysis pipeline was the trimming of the adapter sequence with Cutadapt and a general quality check of the sequence reads with FASTQC before and after trimming. All the RNA-seq datasets produced similar graphical output showing an excellent per base sequence quality and per sequence quality score even before the trimming-adapter operation.
As shown in Figure A1a,b, Phred quality scores of above 32 were assigned to each of the base-calling positions with an average quality per read of 40. Since the Phred quality score (Q) given by the equation Q = −10 log P is the logarithmic transformation of the probability (P) for having base-calling error, it therefore means that the chance for an erroneous base call in each sequence was 1 in 10,000 or simply put each base call has an average of 99.99% accuracy. Furthermore, the high quality of the dataset was also shown by the GC distribution plot ( Figure A1c), as the normal distribution of the GC count per reads over all sequences followed the theoretical distribution. Based on these results, it implies that the RNA-seq experiment was free of contamination and sequence library bias.
In the same vein, the efficiency of the adapter trimming process was assessed by performing quality check with FASTQC before and after adapter trimming. As shown in Figure A2a, the Illumina universal adapter sequence was the only adapter sequence discovered in all the datasets beginning from a base calling position of about 72-73 bp. The process for trimming adapter sequence was thorough and very accurate, as no adapter sequence was discovered in all datasets after trimming ( Figure A2b). It was therefore concluded by the quality check data that the Illumina sequencing experiment was performed with minimal systematic errors that could potentially affect further analysis down the pipeline.

Statistics for Mapping Sequences to Reference Genome
Subsequent to performing quality checks on the RNA-seq datasets, the sequence reads were mapped to the reference potato genome with HISAT2. Using SAMtools, the resulting sequence alignment/maps were then converted to binary alignment/maps which were more compressed in binary formats and compatible with other computational packages down the pipeline. Table 2 summarizes the mapping statistics generated with the SAMtoolsflagstat functions. As shown in Table 2, the sequence reads from the four potato cultivars were significantly mapped (>85%) to the reference genome. It was then concluded by the mapping statistic that the assembly of transcript with StringTie, which was next in the pipeline will yield a true representative of the actual transcriptome and was essential for an accurate comparison of how much expression variation exists among the potato varieties.

Top 10 Most Expressed Genes among Potato Varieties
As shown in Table A1, seven genes were expressed across all varieties, with the first three genes having the same position in a list of top 10 most expressed genes. Furthermore, most of the genes highly expressed (Table A1) were associated with chloroplastic enzymesinvolved in photosynthesis.

Number of Similar and Unique Genes among Samples
The Venn diagram package on R was used to analyse the number of genes that were unique to individual potato variety and similar across pairs, triads and all samples. As shown in Figure 1, 18,036 gene features were expressed in all potato varieties. Also, variety HJ had the highest number of 343 unique gene features, whereas V7 only had the lowest. The names and functions of the top 10 genes that were unique to the potato varieties HJ, HL, LS and V7 were summarised in Tables A2-A5, respectively.

Fold Change Analysis for Differential Variation in Gene Expression
Fold change analysis was performed among samples in pairs by using the fold change package in R. Differential genes were grouped under different range of fold change using the logic operators and length functions also in R. As shown in the Figure 2, potato HL and V7 varieties were the most similar pair as they had the least fold change, whereas the HJ and LS pair are most distant in expression variation. It was also deduced from the analysis (Figure 2) that the HJ variety is the most distant species which also was consistent with the result from the Hierarchical dendrogram (Figure 3a).

Estimating Gene and Sample Distance
The pairwise correlation was estimated between the samples (n = 4) and between significantly expressed genes (n = 1000). The dist function on R was used to estimate the distance between sample pairs or gene pairs. As shown in the Figure 3a, sample HJ was the most distant potato variety, whereas HL, LS and V7 were closely related. The distance in the relationship between genes was also shown in Figure 3b.

Heat Maps
The heatmap2 function on R-packages was used to generate a heatmap plot from both hierarchical clustering of samples and genes ( Figure 4). The heatmap is an excellent pictorial tool for depicting variation between several variables. So, the heatmap was very useful for estimating the variation in gene expression among the four potato varieties (HJ, HL, LS and V7). Similar to the other results previously presented, the heatmap (Figure 4) showed that the variety HJ had the most variation in gene expression when compared with other varieties HL, LS and V7. Furthermore, it was also deduced from the heatmap that more than 50% of the genes expressed in sample HJ are closely related, as depicted by the yellow cluster of genes in the heatmap (Figure 4), whereas the genes expressed in HL, LS and V7 are more dispersed.
useful for estimating the variation in gene expression among the four potato varieties (HJ, HL, LS and V7). Similar to the other results previously presented, the heatmap (Figure 4) showed that the variety HJ had the most variation in gene expression when compared with other varieties HL, LS and V7. Furthermore, it was also deduced from the heatmap that more than 50% of the genes expressed in sample HJ are closely related, as depicted by the yellow cluster of genes in the heatmap (Figure 4), whereas the genes expressed in HL, LS and V7 are more dispersed.

Discussion
Cultivated potatoes are the third most important food crops in the world and offer better carbohydrate to protein ratio, vitamin and antioxidant content when compared with rice and wheat. Hence, it was named as the future food crop by the FAO in 2008. Despite the vast significance of potatoes in combating hunger and ensuring global food security, its potential for a sustainable society has not been fully exploited. Several potato breeding programs aimed at improving yields, nutrient content and other important market characteristics of potatoes have been limited by its complex biological system. The diversity in potato varieties can be accounted for from the complex nature of its genome owing to autotetraploidy, extreme heterozygosity and the vegetative means of propagation. Recent advances in next generation sequencing technology have fostered advanced

Discussion
Cultivated potatoes are the third most important food crops in the world and offer better carbohydrate to protein ratio, vitamin and antioxidant content when compared with rice and wheat. Hence, it was named as the future food crop by the FAO in 2008. Despite the vast significance of potatoes in combating hunger and ensuring global food security, its potential for a sustainable society has not been fully exploited. Several potato breeding programs aimed at improving yields, nutrient content and other important market characteristics of potatoes have been limited by its complex biological system. The diversity in potato varieties can be accounted for from the complex nature of its genome owing to autotetraploidy, extreme heterozygosity and the vegetative means of propagation. Recent advances in next generation sequencing technology have fostered advanced studies of quantitative traits and the discovery of novel genes that control desirable phenotypes for an informed breeding program.
In this study, RNA sequencing by Illumina technology was adopted to capture the transcriptome expressed in leaf samples of four potato varieties, HJ, HL, LS and V7, all grown in field conditions. A pipeline consisting of several software and bash scripts was developed and used for the analysis of the fragments reads from each potato variety. An exploratory snapshot of the transcriptome was obtained from computational analysis with the pipeline, which served as the basis for measuring the extent of diversity (distance) among the varieties.
Despite the usefulness of NGS technologies in terms of high speed and cost effectiveness, the genomic datasets generated normally contain some kinds of sequencing artefacts such as low-quality reads, contaminating reads and non-uniform coverage which compromises downstream analysis. Therefore, it is necessary to process the fragment reads by trimming off adapter sequences and low-quality reads, and then perform a quality check on the resulting sequence. For all the RNA-seq datasets analysed in this study, an average Phred quality score of about 40 was assigned to all fragment reads by the FASTQC algorithm, meaning that the probability of obtaining a base-call error is 1 in 10,000 calls ( Figure A1). Furthermore, the excellent GC normal distribution curve ( Figure A1c) and the adapter-trimming operation by Trim Galore algorithm ( Figure A2), assured that the fragment sequences used for subsequent mapping step and downstream operation were accurate, for a reliable conclusion on varieties distance. Although different software for quality control have been reported in the past, several recent studies adopted the FASTQC for pre-processing/quality control of RNA-seq data due to its efficient runtime and memory utilization [10][11][12].
Going forward, the mapping statistics (Table 2 portrayed an excellent mapping operation performed by the "new tuxedo" package-HISAT2 [13]. Conesa et al. (2016) opined that an excellent mapping operation should have about 80% fragment reads mapped to the reference genome. Similarly, Sims et al. (2014) estimated that 80% of transcripts that are expressed in human cells can be accurately quantified with about 36 million 100-bp paired-end sequenced reads mapped to the human reference genome. In this study, more than 50 million 150-bp paired end reads were mapped unto the potato reference genome ( Table 2). Since the human reference genome (3234.83 Mb) is larger than the potato reference genome (840 Mb), obtaining a higher amount of pair-end reads mapped to the potato reference genome in this study, than the number reported by Sims et al. [14], was an indication of an excellent mapping operation.
A document containing gene features count for each potato variety was generated after running all the scripts for different software packages in the analysis pipeline that was developed in this study (method section). On further analysis of the count file with R-packages exploratory results were generated as tables and figures. Results from Table A1 showed the top 10 expressed genes in the four potato varieties. Since the RNA samples used for this study were extracted from the plant leaves, our results showed that 90% of the top 10 expressed genes were associated with chloroplastic proteins/enzymes, involved in plant photosynthesis and the Calvin cycle (Table A1). Furthermore, Calmodulin binding protein (CBP) sequence was also highly expressed in all samples (Table A1). Previous studies have shown that CBP is involved in calcium signalling pathways and interacts non-covalently with DNA in plants such as tomatoes-Solanum lycopersicum [15]. The CBP nucleotide sequence was not well annotated in the reference genome of the potato. Hence, there is a need for further studies on CBP's gene sequence in potatoes, to elucidate more of its function.
Out of a total of 27,356 gene features obtained from this study, 65.93% (18,036 genes) were expressed in all samples. About 4.5% of the total genes were unique to individual potato species (Figure 1). Of all the varieties, the top 10 uniquely expressed genes, as shown in Tables A2-A5, are yet to be properly annotated or assign a function. Although these unique gene sequences were not significantly expressed in the potato leaves, they may be involved in expressing distinguishing phenotypic characteristics among the varieties. Further studies on these unique gene sequences will benefit an informative potato breeding program.
The fold change analysis ( Figure 2) and hierarchical cluster plot ( Figure 3) showed potato varieties HJ as the most distant species, whereas potato varieties HL and V7 are the closest in similarity. Furthermore, the heat map ( Figure 4) clearly depicted the transcriptome diversity among the potato varieties HJ, HL, LS and V7. Although genes expressed in HJ were most diverse when compared with other varieties, they had the most similar cluster among themselves (yellow colour in the heatmap). A recent study conducted by Hardigan et al. [16] on the genomic diversity among different potato varieties was used to uncover evolutionary history and gene targets for potato domestication/breed improvement. Similarly, this study, although limited by the unavailability of phenotype information and sample replicates, has shown that potato varieties, even with the same polyploid number (autotetraploid), express a significant level of diversity in their transcriptome under the same condition.
In conclusion, a workable pipeline for analysis of autotetraploids RNA-seq datasets was developed in this study, which can be employed for future analysis. Only from the conducted exploratory transcriptome analysis, we discovered that potato variety HJ was the most distant among the four varieties studied. There is a need for future differential expression studies to correctly identify differentially expressed genes (DEGs) in the four potato varieties, under specific conditions. Consequently, obtaining a clear analysis of DEGs will foster a deeper understanding on the phenotypic variation among the potato varieties, and will aid the development of workable models for mapping complex traits to genome loci in polyploids.   Representative plots for quality control of all RNA-seq datasets. (a) Shows a Phred quality score across all base-calling positions in the sequence reads. The plot was segmented into green, brown and red partitions. Phred scores that fall within the green partition are excellent, whereas scores in the brown partition are good and scores in the red segments are poor. (b) Shows the distribution of the mean quality score across all sequence reads in the datasets. (c) Comparison of the actual distribution of the GC content across sequence reads (red plot) to the theoretical distribution (blue plot). Figure A2. Distribution of adapter content in sequence reads. (a) Shows the adapter distribution before the trimming of the adapter. (b) Shows the adaptor distribution after trimming. A red plot represents Illumina universal adapter, blue represents small RNA 3′ adapter, green represents small RNA 5′ adapter, black represents nextera transposase sequence and purple represents SOLID small RNA ad.  Superscript from "a to j" indicates the ranking of the genes based on the feature counts from highest to lowest in each of the potato varieties (HJ, HL, LS and V7), respectively. Superscript a indicate the gene with the highest number of features, whereas superscript j indicates the gene with the least (10th position) number of features.