Genomic Considerations for the Modification of Saccharomyces cerevisiae for Biofuel and Metabolite Biosynthesis

The growing global population and the developing world have put a strain on non-renewable natural resources, such as fuels. The shift to renewable sources will thus help meet demand, often through the modification of existing biosynthetic pathways or the introduction of novel pathways into non-native species. Several useful biosynthetic pathways are endogenous to organisms that are not conducive to the scale-up necessary for industrial use. The use of genetic and synthetic biological approaches to engineer these pathways in non-native organisms can help ameliorate these challenges. The budding yeast Saccharomyces cerevisiae offers several advantages for genetic engineering for this purpose due to its widespread use as a model system studied by many researchers. The focus of this review is to present a primer on the genomic considerations that precede genetic modification and manipulation of S. cerevisiae. The choice of a site for genetic manipulation can have broad implications for transcription throughout a region, and this review presents the current understanding of position effects on transcription.


Background and Introduction
As the global population grows and countries continue to develop, the demand for fuel continues to rise. Traditional fuel sources are finite and contribute to climate change, underscoring the importance of renewable sources of energy. This shift offers many benefits, including reductions in greenhouse gases and improved energy security. Many countries use field crops such as corn, wheat, barley, sugar cane, and cassava to produce ethanol; however, their use as a biomass source competes with their use as food crops, increasing demands for land and water resources [1,2]. Although this strain is of lesser concern on a global level, it can have substantial regional impacts, motivating the exploration of other viable options [1].
The native metabolic pathways in diverse microbes, including wild yeasts and bacteria, are particularly attractive for use in the production of biofuels via fermentation, which can generate bioethanol from a variety of non-food crop substrates, such as glycerol [3,4]. The metabolic flexibility of fungi allows for the generation of ethanol from materials (including olive mill wastewater) and the production of microbial enzymes and lipids from milling and confectioners' wastes [3,5,6]. Screening of diverse fungal isolates has proven effective in expanding the available repertoire of biofuels and metabolites produced from myriad starting materials [7]. The ability to reconstitute these pathways into an easier-to-culture, non-native host can increase the yield and efficiency of metabolite production. Advances in genetic engineering enhance the production of biofuels (and other metabolites), allowing synthesis in non-native species and manipulations to maximize the desired output.

Heterochromatic Genomic Regions Silence Transcriptional Activity
Heterochromatin forms in transcriptionally repressed, typically gene-poor regions of the genome. S. cerevisiae maintains heterochromatic regions at the telomeric and sub-telomeric regions, the silent mating loci (HML and HMR) on chromosome III, and within the rDNA repeats found on chromosome XII [36][37][38].

Telomeric Heterochromatin
The very distal tips of eukaryotic chromosomes are telomeres, specialized genomic regions maintained in unique structures to prevent inappropriate homologous recombination or triggering of the DNA damage response pathways [39]. Telomeres consist of simple, non-coding repeats (the sequence is C1-3A/TG1-3) that stretch for approximately 300 base-pairs +/− 75 base-pairs per chromosome [40]. Adjacent to these telomeric repeats are telomere associated sequences (TAS), which fall into two classes: X and Y'. X TAS sequences are found on all telomeres and Y' TAS sequences are found on approximately half of the telomeres, which can vary by strain. There is significant diversity within each TAS class in size, composition, and insertions and deletions, all of which can vary significantly from telomere to telomere [40].
Initial characterization has focused on individual telomeres, with an emerging consensus: telomeres exclude nucleosomes and are maintained in a transcriptionally silent, heterochromatic state. The integration of a transgene into telomeric regions in S. cerevisiae results in transcriptional repression of the transgene and is commonly known as the 'telomere position effect' (TPE) [38]. Evidence for this silencing focuses on the repression of URA3 integrated into telomere TEL07L, which is mediated by the spread of heterochromatin nucleated at the telomere across the transgene by the Sirtuin family proteins and Rap1p [41,42]. The ubiquitous nature of the TPE has recently come into question, as global analysis utilizing the more sensitive RNA-sequencing methods identified active transcription at many native telomeres, albeit at levels that are lower than at integration sites further from the telomere. Furthermore, this study shows that while the Sirtuin proteins are localized throughout the telomeres, their activity is not the primary mechanism of transcriptional repression, as only 6% of subtelomeric genes are silenced by Sir protein activity [43]. The current understanding of transcription within these genomic regions is more nuanced: under certain growth conditions (e.g., the induction of the environmental stress response, or ESR), there is transcriptional activation of telomeric genes [44]. This last point is germane, as the culturing conditions required for biofuel and metabolite production can trigger the budding yeast's ESR, ultimately activating transcription within these regions. Indeed, our observations found that members of the stress-induced toxin-response gene family cluster in telomeric and sub-telomeric regions, playing a role in coordinating the transcription of this regulon [45,46].

Ribosomal DNA and the Silent Mating Loci
In addition to the heterochromatin that forms at telomeric regions, there are two additional sites where constitutive heterochromatin forms: the silent mating loci, called HML and HMR, found on chromosome III, and the rDNA repeats on chromosome XII. While there are similarities between the heterochromatin found in each region, there are also important differences compared to telomeric chromatin. Many of the chromatin modification proteins required for heterochromatin formation and maintenance are shared at all three heterochromatin loci, and each exhibits a lack of post-translational modifications to the histone tails [47,48]. One major difference is that transcriptional silencing at both HML and HMR is stronger but extends over a shorter distance than that at the telomeres. Silencing at HML and HMR is nucleated by cis regulatory elements that flank each locus (HML-E and HML-I and HMR-E and HMR-I, respectively); however, the presence of insulators limits the genomic distance of silencing [49]. The rDNA repeats sequester to the nucleolus, and the number of active versus repressed regions varies, matched to the cell cycle and the demand for ribosomes. The transcriptional activation during stress does not affect the HML mating locus or the rDNA repeats [44].
The formation of heterochromatin facilitates higher order chromosome structures and the maintenance of subnuclear organization. Telomeres terminate in loops that fold back upon themselves, clustering together on the nuclear periphery into distinct puncta, as seen in the Rabl nucleus [40,48,50]. The silencers flanking HMR form a local loop that facilitates transcriptional repression, while HMR and HML form an extended loop whereby Chromosome III folds across itself, forming a long range interaction. Altogether, these interactions facilitate the characteristic subnuclear arrangement necessary for proper gene expression and genomic integrity [48,51].

Global Position Effects Result in Large Differences in Reporter and Transgene Expression
Saccharomyces cerevisiae was the first eukaryotic organism with a fully sequenced genome [52]. While estimates vary, there is consensus that there are just under 6000 protein coding genes, encoded by approximately 12 megabases of DNA divided into 16 chromosomes [52,53]. The completion of the genome sequence allowed researchers to construct the 'Yeast Deletion Library' (also known as the 'Yeast Knock Out Collection', or YKO collection), a series of isogenic yeast strains, each with a single non-essential gene deleted [54,55]. The library consists of 5,916 isogenic yeast strains with the kanamycin resistance (KAN R ) gene replacing an individual open reading frame [55]. This resource allowed for rapid advances in the understanding of gene function at an unprecedented level, and led to the development of additional library collections, including the green-fluorescent protein (GFP) and tandem-affinity purification (TAP) tagged resources [56,57].

The Neighboring Gene Effect
The YKO collection expanded the understanding of gene function, significantly increasing the number of annotated gene functions. The design of the YKO collection integrated a KAN R gene under the regulation of a relatively high strength promoter. Systematic characterization of genetic screens that used the deletion library led to a surprising result: the KAN R integration frequently disrupted the expression of an adjacent gene [56][57][58][59]. The disruption caused by KAN R to the expression of the adjacent gene was coined the 'neighboring gene effect' (NGE), which is estimated to affect transcription at 7%-15% of the targeted loci [58]. Follow-up analysis revealed that this effect results in misannotation of genetic interactions in up to 10% of screen results, rather than the identification of a bona fide genetic link (Table 1) [59].

Global Position Effect Variance
The completion of the sequencing of the yeast genome allowed for further characterization of the transcriptional differences that arise when otherwise identical constructs are integrated into different genomic regions. These studies characterized differences in the expression of the inserted construct itself, as opposed to the disruption of the transcription of neighboring genes. Two studies investigated the integration of a green fluorescent protein (GFP) reporter and a red fluorescent protein (RFP) reporter throughout the genome and monitored position effects on their expression. When a GFP reporter was integrated at 482 genomic sites, the noise in the transcription of this reporter varied by 20-fold, resulting in a 15-fold difference at the level of protein production [60]. A global survey characterized the position effects observed when the RFP gene was integrated at 1044 different genomic positions. Based on location, there was a 13-fold difference in fluorescence between different loci [61]. Additional analysis of KAN R expression across the yeast deletion library revealed significant differences in the level of KAN R expression from locus to locus, with up to 35% of the differences in expression attributed to position effects (Table 1) [62]. The authors note that insertion of the KAN R cassette does not disrupt the local chromatin environment [62].

Table 1 (excerpt). pTEF1-RFP and pURA3-RFP: 13-fold difference in expression across 1044 integration sites. Different promoters resulted in different overall levels of expression; however, position affected both constructs to the same degree. [61]

Adjacent Gene Co-Regulation
The production of mature, translationally competent ribosomes requires the coordinated gene expression of two distinct families of protein coding genes: the 129 ribosomal proteins (RPs) and the more than 200 rRNA and ribosome biosynthesis genes (RRB or Ribi). This process consumes significant amounts of intracellular energy stores and is under tight regulation, predominantly at the level of transcription [65,66]. Each gene family is distinct: the RPs assemble and are incorporated into the ribosome during synthesis and remain there throughout the ribosome's life, while the RRBs facilitate the synthesis of the ribosome but do not remain associated after the ribosome is mature and translationally competent. The different roles result in different levels of expression for each gene family. Each gene family is tightly co-regulated, enriched for distinct promoter sequences, and bound by separate, specific transcription factors [67][68][69][70]. Characterization of the genomic distribution of both families revealed that each family is found in a non-random distribution, predominantly clustered in pairs with other members of the same family [67,71].
The best-characterized locus is the RRB gene pair, MPP10-MRX12, found on Chromosome X. MPP10, a component of the small subunit processome and 90S preribosome complex, and MRX12, a protein that associates with the mitochondrial ribosome, are clustered together under the transcriptional co-regulation of shared promoter elements found only upstream of MPP10 (Figure 1A). These cis-regulatory promoter elements are called the ribosomal RNA processing element (RRPE) and the polymerase A and C (PAC) element, both of which are enriched within the RRB family as a whole [67]. The promoter region of MRX12 contains no readily identifiable transcription factor-binding site; rather, the genomic region immediately upstream is a nucleosome-devoid autonomously replicating sequence (Figure 1) [72].
Functional dissection of the RRPE and PAC promoter elements in MPP10 by mutational analysis and gene expression profiling found that these sites are necessary for the transcriptional co-regulation of this RRB gene pair with the rest of the regulon during cellular growth and the induction of the stress response [71]. There is a physical requirement of adjacency of MRX12 to MPP10 for their co-regulation, as insertion of an actively transcribed LEU2 gene effectively uncouples the pair from each other; this phenomenon is termed 'adjacent gene co-regulation' [73]. This surprising relationship demonstrates the distance over which endogenous yeast promoters can potentially act, as transcription of MRX12 initiates across a chromosomal distance of almost four kilobases. The advantage of this arrangement is that it can potentially help buffer the expression of components needed at roughly stoichiometric levels. The inability to do this is wasteful, potentially resulting in alterations to pools of ribosomal subunits, and ultimately in abnormal proteostasis (Figure 1B).

Functional Clustering
A systematic survey in budding yeast revealed that many functionally related gene families are found in a non-random genomic distribution; this phenomenon is not limited to genes involved in the synthesis of the ribosome. Approximately one-quarter of functionally related gene families in S. cerevisiae exhibit clustering throughout the genome. This arrangement results in tighter transcriptional co-regulation throughout the cell cycle, compared to their singleton (unpaired) counterparts [74]. It is likely that many different drivers lead to functional clustering in such widely diverse gene families; however, the functionally clustered genes aggregate in areas of the genome where genes influence the transcription of their neighbors [45].
Proper gene regulation via genomic clustering can also involve transcriptional interference, which results in an anti-correlated expression of neighboring genes due to mutually exclusive expression. The canonical example of this type of transcriptional regulation is seen at the SRG1-SER3 genomic locus found on chromosome V. SRG1 is a non-coding RNA that prevents the transcription of SER3, which can only be actively transcribed when the SRG1 transcript is repressed, as the SRG1 transcript extends over the transcription start site for SER3, altering the underlying chromatin [75,76].
The co-regulation of clustered genes can help to prevent the accumulation of molecules that may be potentially toxic [74,77]. The GAL genes cluster on chromosome II, where three genes necessary for galactose catabolism are found (Figure 2A). This locus consists of GAL1, a galactokinase that catalyzes the conversion of alpha-D-galactose to galactose-1-phosphate; GAL10, a UDP-glucose-4-epimerase; and GAL7, a galactose-1-phosphate uridyl transferase. The genomic arrangement of these genes buffers the cell from transcriptional noise, maintaining comparable levels of expression, which would not be possible if the genes were unlinked. The balance in the expression of GAL1 to that of GAL7 avoids the accumulation of the cytotoxic galactose-1-phosphate metabolite (Figure 2B) [78].

Lessons and Conclusions
Saccharomyces cerevisiae offers significant advantages for the introduction of non-native biological pathways, compared to some of its less-complex, prokaryotic peers. It also presents additional levels of complexity and challenges that require consideration and planning to ameliorate. The genomic site for modification must be chosen with care, as the modification can introduce unintended consequences, the second order effects. The targeted integration site can profoundly affect the levels of expression of any gene, and vice versa, potentially altering delicate cellular processes and leading to a lower metabolic output.
Global analyses of position effects reveal that integration into centromeric and telomeric regions (defined as within +/− 20 kb of the repeat sequence) results in reduced gene expression compared to integration sites positioned further away [61]. These regions, however, are not completely silent in all cases. The researcher looking for a site to target for lower levels of gene expression may very well consider these regions. Genomic regions further away represent target sites that are more conducive to higher levels of gene expression, although position effects only result in high levels of expression in 25% of the genome. Hot spots that support high levels of expression include the YDR448W, YGR240C, YHR142W, YML059C, YPL014W, and YPR028W loci, and represent excellent starting points for researchers who want to synthesize highly expressed pathways [61]. One caveat is the effect of any non-native metabolite on the budding yeast. S. cerevisiae has a robust stress response, which triggers easily [46,79]. Such a response could increase transcription in regions that are typically expressed at low levels and compound matters.
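The +/− 20 kb rule of thumb described above can be expressed as a simple pre-screen for candidate integration sites. The following is a minimal sketch, not a published tool: the Chromosome record, the classify_site function, and the toy coordinates are all hypothetical, and measuring the distance from the chromosome end and from a single centromere coordinate (rather than from the boundary of the repeat sequence) is a simplifying assumption.

```python
from dataclasses import dataclass

# Flank size taken from the +/- 20 kb definition in the text.
FLANK_BP = 20_000


@dataclass(frozen=True)
class Chromosome:
    """Hypothetical minimal description of a chromosome (1-based coordinates)."""
    name: str
    length_bp: int
    centromere_bp: int  # assumed single coordinate for the centromere


def classify_site(chrom: Chromosome, position_bp: int,
                  flank_bp: int = FLANK_BP) -> str:
    """Label a candidate integration site by its proximity to telomeres or the centromere."""
    if not 1 <= position_bp <= chrom.length_bp:
        raise ValueError(f"position {position_bp} is outside {chrom.name}")
    near_left_end = position_bp <= flank_bp
    near_right_end = position_bp > chrom.length_bp - flank_bp
    if near_left_end or near_right_end:
        return "telomere-proximal"
    if abs(position_bp - chrom.centromere_bp) <= flank_bp:
        return "centromere-proximal"
    return "distal"


# Toy chromosome with made-up coordinates, for illustration only.
toy_chromosome = Chromosome(name="chrToy", length_bp=500_000, centromere_bp=150_000)

if __name__ == "__main__":
    for pos in (5_000, 160_000, 300_000, 495_000):
        print(pos, classify_site(toy_chromosome, pos))
```

A "distal" label only means that the site falls outside the regions associated with reduced expression; it does not by itself predict high expression, since only a fraction of the genome supports high levels of expression.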
Our research group previously performed an analysis of the position effects on endogenous gene expression that arose due to the budding yeast's ESR [45]. Analysis focused on the induction of the environmental stress response, including a heat-shock response, DNA damage response, oxidative stress, nitrogen depletion, and the switch in carbon source from glucose to glycerol. As a resource for those concerned about induction of the ESR, the data from 30 representative loci have been parsed from that dataset and are presented in Table 2.

Table 2 notes:
* The average SCC for every pairwise combination of genes throughout a kB window centered on the specified target region.
** The average SCC for every pairwise combination of genes throughout a 10-gene window centered on the specified target region (25 kb-30 kb region).

Due to its versatility and potential, S. cerevisiae will continue to be an attractive choice for genetic engineering for a variety of purposes, both inside and outside the biofuel sector. Yeasts are important sources of metabolites within the pharmaceutical and biotechnology industries, with S. cerevisiae widely used for these purposes, including for the production of metabolites such as dihydroartemisinic acid (an important precursor molecule in the production of the anti-malarial agent artemisinin), and the production of beta-carotene [80][81][82][83]. Years of research and discovery on the microorganism have opened up incredible avenues and opportunities to harness this biological system as a mini factory. Furthermore, these observations are not limited to S. cerevisiae; they likely extend throughout the fungal kingdom and to related eukaryotes [74,84].
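To illustrate the window-averaged statistic described in the Table 2 notes, the sketch below computes the mean correlation over every pairwise combination of genes in a window. It assumes that SCC denotes Spearman's rank correlation coefficient, which the text does not state explicitly; the gene names and expression values are invented for illustration, and this is not the code used to generate Table 2.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def average_pairwise_scc(profiles):
    """Mean Spearman correlation over every pair of genes in a window.

    `profiles` maps a gene name to its expression values across the same
    ordered set of conditions (for example, stress time points).
    """
    pairs = list(combinations(profiles.values(), 2))
    if not pairs:
        raise ValueError("need at least two genes in the window")
    return mean(spearman(a, b) for a, b in pairs)


# Invented expression profiles for three hypothetical neighboring genes.
window = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 9.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

if __name__ == "__main__":
    print(round(average_pairwise_scc(window), 3))
```

In this toy window, geneA and geneB rise together while geneC falls, so the positively correlated pair is outweighed by the two anti-correlated pairs and the window average is negative. A window of tightly co-regulated neighbors would instead give a value close to 1.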
Funding: This research received no external funding.