Genome Analysis of a Newly Discovered Yeast Species, Hanseniaspora menglaensis

Annual surveys of Irish soil samples identified three isolates, CBS 16921 (UCD88), CBS 18246 (UCD443), and CBS 18247 (UCD483), of an apiculate yeast species within the Hanseniaspora genus. The internal transcribed spacer (ITS) and D1/D2 region of the large subunit (LSU) rRNA sequences showed that these are isolates of the recently described species Hanseniaspora menglaensis, first isolated from Southwest China. No genome sequence for H. menglaensis is currently available. The genome sequences of the three Irish isolates were determined using short-read (Illumina) sequencing, and the sequence of one isolate (CBS 16921) was assembled to chromosome level using long-read sequencing (Oxford Nanopore Technologies). Phylogenomic analysis shows that H. menglaensis belongs to the fast-evolving lineage (FEL) of Hanseniaspora. Only one MAT idiomorph (encoding MATα1) was identified in all three sequenced H. menglaensis isolates, consistent with one mating type of a heterothallic species. Genome comparisons showed that there has been a rearrangement near MATα of FEL species compared to isolates from the slowly evolving lineage (SEL).


Introduction
Hanseniaspora species are apiculate yeasts found abundantly on a variety of ripening fruits, flowers and barks [1]. They are particularly prevalent and diverse within grape musts [2,3]. Hanseniaspora species have long been associated with wine fermentation, commonly as pests, and more recently as potential bio-flavouring agents [3,4]. Most species are not prolific fermenters: they have a low ethanol tolerance of between 3 and 5% and are quickly outcompeted by Saccharomyces species in early fermentation [5,6]. Those that do exceed this threshold, such as Hanseniaspora osmophila, often produce "off flavour" compounds such as acetic acid, acetaldehyde, and ethyl acetates, which are considered detrimental to the flavour profile of the wine [7,8]. Hanseniaspora species may function as bio-flavouring agents and potential co-fermenters with Saccharomyces cerevisiae because they can metabolise cellobiose [9,10].
Hanseniaspora species fall into two subclades commonly known as the fast (FEL) and slow (SEL) evolving lineages within the Saccharomycodaceae [11]. The clades are distinguished by the loss of genes involved in DNA repair, cell cycle repair, and mitotic checkpoints [11], with more extensive loss observed within the FEL. The loss of repair genes enabled the rapid accumulation of mutations, resulting in significant protein divergence.
213,571 raw reads, which were reduced to 182,436 reads by filtering with NanoFilt v. 2.8.0 (ONT) to remove reads with quality scores < 7 or lengths of < 1 kb. The filtered reads were assembled into 13 contigs using Canu v. 2.2 [20]. The raw assembly was polished with the trimmed Illumina reads using five rounds of correction with NextPolish v. 1.4.1 [21]. Two contigs containing partial arrays of the rDNA locus at one end each were manually joined. Four short, overlapping contigs derived from the mitochondrial genome were removed. The mitochondrial genome was annotated using MITOS2 [22] and trimmed to one copy using bedtools [23]. The VAR1 gene was identified by BLAST analysis of an unannotated open reading frame. Contig-level assemblies for samples CBS 18246 and CBS 18247 were generated using SPAdes v. 3.14.0 [24]. Only contigs larger than 500 bases with an average coverage greater than 10 were retained.
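The contig retention rule for the SPAdes assemblies (longer than 500 bases, coverage above 10) can be sketched in a few lines of Python. This is only an illustration: it assumes that length and coverage are read from SPAdes-style contig names of the form NODE_1_length_25000_cov_35.2, whereas the text does not say how coverage was obtained. The function name `keep_contig` and the example headers are hypothetical.

```python
import re

# Assumption: contigs carry SPAdes-style names such as
# "NODE_1_length_25000_cov_35.2"; the coverage field is used as a stand-in
# for the "average coverage" mentioned in the text.
HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")


def keep_contig(header, min_length=500, min_coverage=10.0):
    """Return True if the contig exceeds both the length and coverage cutoffs."""
    match = HEADER.search(header)
    if match is None:
        return False  # header is not in the assumed format
    length = int(match.group(1))
    coverage = float(match.group(2))
    return length > min_length and coverage > min_coverage


# Hypothetical headers, for illustration only.
headers = [
    "NODE_1_length_25000_cov_35.2",  # passes both cutoffs
    "NODE_2_length_450_cov_80.0",    # too short
    "NODE_3_length_1200_cov_4.7",    # coverage too low
]
kept = [h for h in headers if keep_contig(h)]
print(kept)
```

Both comparisons are strict, matching the wording "larger than" and "greater than" in the text.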
Trimmed reads for CBS 16921, CBS 18246 and CBS 18247 were mapped to the CBS 16921 reference genome using BWA v. 0.7.17-r1188 [32]. Alignments were sorted and indexed using SAMtools v. 1.10 [33]. Duplicate reads were marked using Picard tools [34]. Variants were called and filtered using BCFtools v. 1.10.2 [35] and VCFtools v. 0.1.16 [36], respectively. Sites missing in any sample, sites with quality < 30, and sites with depth < 15 or > 200 were removed. Only single-nucleotide polymorphisms (SNPs) were analysed. This identified 122 variant sites, all of which were manually verified in IGV [37] using BAM alignment to ensure that the sequencing depth matched the surrounding regions. SNP, protein-coding, tRNA, and rRNA annotations were visualised in R using the Circlize v. 0.4.15 [38] package.
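The site filters listed above can be expressed as a single predicate. The sketch below is a minimal illustration in plain Python, not the BCFtools/VCFtools commands that were actually run. The record layout (a dict with `qual`, `depth`, `ref`, `alt` and `genotypes` keys) and the function name `keep_site` are hypothetical, and the text does not state whether the depth limits apply per sample or per site, so a single depth value per site is assumed.

```python
def keep_site(site, min_qual=30, min_depth=15, max_depth=200):
    """Return True if a variant site passes the filters described in the text.

    `site` is a hypothetical record: a dict with keys 'qual', 'depth',
    'ref', 'alt' and 'genotypes' (one entry per sample, None if missing).
    """
    # Remove sites with a missing genotype in any sample.
    if any(gt is None for gt in site["genotypes"]):
        return False
    # Remove low-quality sites.
    if site["qual"] < min_qual:
        return False
    # Remove sites with unusually low or high depth (assumed per-site value).
    if site["depth"] < min_depth or site["depth"] > max_depth:
        return False
    # Keep single-nucleotide polymorphisms only.
    return len(site["ref"]) == 1 and len(site["alt"]) == 1


# Hypothetical records, for illustration only.
good = {"qual": 60, "depth": 45, "ref": "A", "alt": "G",
        "genotypes": ["0/0", "1/1", "0/1"]}
indel = {"qual": 60, "depth": 45, "ref": "A", "alt": "AT",
         "genotypes": ["0/0", "1/1", "0/1"]}
missing = {"qual": 60, "depth": 45, "ref": "A", "alt": "G",
           "genotypes": ["0/0", None, "0/1"]}
print(keep_site(good), keep_site(indel), keep_site(missing))
```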

Mating-type locus annotation:
The mating-type locus and neighboring genes in H. menglaensis isolates CBS 16921, CBS 18246 and CBS 18247 were identified using BLASTN and TBLASTN [41] against a dataset of Saccharomyces cerevisiae reference proteins. Similar sequences were identified in other Hanseniaspora species using BLAST [11,41]. Pairwise identity was calculated using the H. menglaensis sequences of SLA2, SUI1, MATα1, VPS75, YNL247W, GNEAS1 and CWC25 and MATa2 from H. valbyensis.
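Pairwise identity between two aligned sequences can be computed as the share of alignment columns in which both sequences carry the same residue. The sketch below is one possible definition, given as an assumption: the text does not state which tool was used or whether gapped columns count towards the denominator. Here they do, and columns that are gaps in both sequences are skipped. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the columns of a pairwise alignment.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Assumption: a column with a gap in one sequence counts as a mismatch;
    a column with gaps in both sequences is ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # uninformative column
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments: 4 of 6 columns are identical.
print(round(percent_identity("ATG-CA", "ATGACT"), 1))
```

Excluding gapped columns from the denominator instead would give higher values for the same alignment, so the convention should be stated whenever identities are compared between studies.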
Physiological characterization: Morphology, nutritional growth and additional phenotypic profiles were characterised using standard protocols as described in Kurtzman et al. [46]. Most growth tests were performed at 25 °C, except for fermentation, which was assayed at 20 °C. Growth at 30 °C was assessed on GYPA (2% glucose, 1% peptone, 0.5% yeast extract, 1.5% agar, pH 6.8). Ascus and ascospore formation were investigated by growing CBS 16921, CBS 18246 and CBS 18247 separately and as mixed cultures on 2% Difco malt extract agar (MEA) (pH 5.5) at 25 °C. Cells were examined daily for up to 7 days.
We sequenced the genome of one isolate (H. menglaensis CBS 16921) using a combination of long-read (Oxford Nanopore, Oxford, UK) and short-read (Illumina, Cambridge, UK) technologies, and we used short-read sequencing to survey the genomes of the other two isolates. The final assembly of H. menglaensis CBS 16921 consists of 8 contigs (7 chromosome-level contigs, named from 1 to 7 in order of size, and 1 mitochondrial contig) (Figure 1). This assembly is 9,558,052 bp with an N50 of 1,490,982 bp and G+C content of 30.34%. This assembly is likely chromosome-level, but no telomere repeats were identified. The mitochondrial genome consists of a circular contig of 19.63 kb. Its G+C content (23.61%) is lower than that of the nuclear genome. All core mitochondrial components (rrnL, rrnS, cob, cox1, cox2, cox3, atp6, atp8 and atp9) are present, as well as 27 tRNA genes. The NADH ubiquinone oxidoreductase genes (nad1, nad2, nad3, nad4, nad4L, nad5, nad6) are not present, similar to species in the Saccharomycetaceae [50]. The ribosomal protein gene VAR1 is missing from Hanseniaspora uvarum [50,51]. However, it is present in the H. menglaensis mitochondrial genome. Approximately 43% of the mitochondrial genome consists of intergenic regions.
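For readers unfamiliar with the N50 statistic quoted above, it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. The following minimal sketch shows the calculation. The individual contig lengths of the CBS 16921 assembly are not given in the text, so the example uses hypothetical lengths, and the function name `n50` is illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are taken from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half the assembly.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (not those of the CBS 16921 assembly).
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))  # 100 + 80 = 180 covers half of 300, so N50 is 80
```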
Mapping the individual reads from H. menglaensis CBS 16921, CBS 18246 and CBS 18247 to the haploid genome assembly identified 122 variant sites (Figure 1). In each isolate, a small number of sites were called as heterozygous with high confidence: there are 24 such sites in CBS 16921, 29 in CBS 18246, and 40 in CBS 18247. In addition, 53 sites in CBS 18246 and 55 sites in CBS 18247 were called as homozygous for an allele different from the reference (the CBS 16921 haploid assembly). The genomes of the three Irish isolates are therefore very similar (~99.9987%), but they are not identical (Figure 1). No high-confidence SNPs were identified between the mitochondrial genomes.
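The scale of this variation can be checked with simple arithmetic. The sketch below divides the 122 variant sites by the assembly size. Using the full 9,558,052 bp assembly as the denominator is an assumption, since the text does not state which denominator was used.

```python
# Values taken from the text; the choice of denominator is an assumption.
variant_sites = 122
assembly_size_bp = 9_558_052

# Fraction of assembly positions that vary among the three isolates.
divergence_percent = 100 * variant_sites / assembly_size_bp
bp_per_variant = assembly_size_bp / variant_sites

print(f"{divergence_percent:.4f}%")          # 0.0013%
print(f"{bp_per_variant / 1000:.0f} kb")     # 78 kb
```

Under this assumption the result matches the divergence of ~0.0013% given in the Discussion, and corresponds to roughly one variant site per 78 kb.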
Phylogenomic analysis (Figure 2) shows that H. menglaensis CBS 16921, CBS 18246 and CBS 18247 belong to the fast-evolving lineage (FEL) of Hanseniaspora and form a subclade with Hanseniaspora lindneri, Hanseniaspora valbyensis, Hanseniaspora smithiae, Hanseniaspora mollemarum, Hanseniaspora singularis and Hanseniaspora jakobsenii. They are most closely related to H. lindneri CBS 285 but are separated with bootstrap support of 100%. This tree provides strong support for placing H. menglaensis within the FEL as a close relative of H. lindneri, similar to a previous analysis which used only the ITS region, D1/D2 domain of the LSU and ACT1 [13] (bootstrap support of 78%). In addition, comparing the ANI over the entire genome sequences showed that H. menglaensis CBS 16921 and H. lindneri have an ANI of 75.5% (determined using OrthoAni [39]) (Table A2), supporting the designation of H. menglaensis as a separate species.
We used BLASTN and TBLASTN [41] to extend the analysis of the MAT locus across 13 Hanseniaspora species, including species from both the FEL and SEL (Figure 3). Only one MAT idiomorph (encoding MATα1) was identified in all three Irish H. menglaensis isolates (Figure 3). The structure of the region resembles the MATα locus in other FEL isolates, with MATα1 lying between SLA2 and CWC25 (Figure 3). We find that the inversion of SUI1-SLA2-MATα1 previously described [56] is restricted to FEL isolates (Figure 3). The MATa locus has the same structure in both SEL and FEL isolates, with a single MAT gene (MATa2) between SLA2 and GNEAS1 (Figure 3).
Some isolates of the Hanseniaspora species are diploid and contain both MATa and MATα loci (e.g., H. vinae, H. nectarophila and H. hatyaiensis) (Figure 3). These isolates are likely to be heterothallic, with one MAT locus originating from one parent and the other from a second parent. In other isolates, only one MAT locus has been identified: only MATα in H. osmophila, H. occidentalis, H. menglaensis, H. jakobsenii, and H. lindneri, and only MATa in H. gamundiae, H. mollemarum and H. valbyensis (Figure 3). These are likely to be haploid and heterothallic species, and the missing MAT locus may be present in other isolates of the same species. For example, MATa and MATα have been identified in different isolates of H. pseudoguilliermondii (Figure 3; [56]). However, it is also possible that the isolates are diploid, and the second MAT locus has not been identified in the genome assemblies. For example, the genome of H. vinae TO2/19AF was assembled twice (from the same data), and in one iteration, the MATa locus was assembled, and in the second, the MATα locus was assembled (Figure 3). In addition, although Chen et al. [13] did not observe ascospores in the Chinese isolate of H. menglaensis, ascospores are formed by the Irish isolates (Figure 4). The species is likely to be homothallic as asci with warty ascospores were observed routinely after 7 days for each of the studied strains, CBS 16921, CBS 18246 and CBS 18247, when grown as separate cultures on sporulation medium. The sexual cycle of H. menglaensis therefore requires further exploration. It is notable that the MAT locus in H. opuntiae appears to have arisen from a recombination between MATa and MATα, and contains both MATa2 and MATα1 (Figure 3) (previously described in Saubin et al. [56]). This may be a homothallic species. However, some H. opuntiae isolates appear to encode only MATa2, and hybrids between H. opuntiae and H. pseudoguilliermondii have been identified [56].
Some MAT loci are assembled into short contigs (e.g., H. nectarophila and H. hatyaiensis). The MATa loci of H. pseudoguilliermondii CBS8772 and H. opuntiae AWRI 357 are described in Saubin et al. [56]. There is a rearrangement around SUI1/MATα1 in FEL isolates, as previously described by Saubin et al. [56]. The three H. menglaensis isolates have identical MATα loci. Pairwise similarity with the reference H. menglaensis sequence varies for each gene: CWC25 (34-61.5%), GNEAS1 (35.4-67.4%), MATα1 (34.6-52.9%), SLA2 (10.8-76%), SUI1 (10.1-89.7%), VPS75 (41.1-77.3%) and YNL247W (63.5-80.7%). MATa2 was compared to H. valbyensis and ranges from 37.6 to 65.7% identity.

Discussion
As of September 2023, 60 Hanseniaspora assemblies are publicly available from NCBI GenBank [58]. These include 1 complete, chromosome-level assembly (H. meyeri, GCA_030370665.1), 9 contig-level assemblies, and 50 scaffold-level assemblies. We have added another complete chromosome-level assembly for a newly discovered species (H. menglaensis), which will facilitate future comparative analysis.
Yeast mitochondrial genomes vary greatly in size, ranging from 18 to more than 105 kb [50]. H. menglaensis contains a small mitochondrial genome, similar in size to that of its close relative H. uvarum (~19.6 kb and ~18.5 kb, respectively). The H. uvarum mitochondrial genome is linear and has identical repeat regions of 3543 bp at each end [51], similar to the mitochondrial genome of H. meyeri [59], whereas the H. menglaensis mitochondrial genome is circular. H. meyeri and H. uvarum belong to a different branch than H. menglaensis within the FEL (Figure 2), suggesting that there may be a difference in mitochondrial organization between sub-lineages of the FEL clade. However, mitochondrial genome assemblies of other FEL species are needed to confirm this. The mitochondrial genomes from H. uvarum and H. menglaensis differ in their G+C content (~30% and 24%, respectively). The gene content of Hanseniaspora mitochondrial genomes is similar to that of other Saccharomycetaceae species, containing all core components except for NADH ubiquinone oxidoreductase genes [50]. The mitochondrial genes are also short, similar to those in H. uvarum [51]. The RNaseP subunit (rpm1) is absent from the H. menglaensis assembly; however, this element is consistently poorly annotated among yeast species [50]. The ribosomal protein VAR1 gene is present in the SEL Hanseniaspora clade and in the FEL subclade that includes H. menglaensis, H. singularis, H. mollemarum, H. smithiae, H. valbyensis, and H. lindneri (Figure 2). However, VAR1 is missing from the FEL subclade containing H. uvarum [50,51] (Figure 2). VAR1 is also missing from species in the CTG-Ser1 clade but is present in most other Saccharomycetaceae [50,60]. The functional consequence of this gene loss is not clear.
H. menglaensis was identified from rotting wood in China [13] and from soil in Ireland, suggesting that it may be a soil saprobic yeast. The genomes of the Irish isolates are highly similar, with a sequence divergence of ~0.0013%. In wild and domestic S. cerevisiae populations, sequence divergence between 0.001 and 1.1% has been observed, with an average of 0.5% [61]. Isolates of the human pathogen Candida albicans have a divergence of ~0.5% (between isolates of the same clade) and 1.1% (between isolates of different clades) [62]. The sequence divergence among the Irish H. menglaensis isolates is surprisingly low, considering that they originated from locations up to 180 km apart and that they belong to a fast-evolving lineage (Figure 2). It is possible that there was a recent genetic bottleneck or a founder effect in the evolutionary history of the Irish population. Comparisons of the whole-genome sequences of the Irish and Chinese isolates (which have not yet been sequenced) may help to address this in the future.
The synteny of the MAT locus is generally well conserved in yeast, and it is often adjacent to the SLA2 gene [52-56]. This pattern is also observed in Hanseniaspora species (Figure 3) [56]. Saubin et al. [56] previously identified an inversion around SUI1-SLA2 in MATα idiomorphs in some Hanseniaspora isolates. Our analyses show that the rearrangement occurs exclusively in members of the FEL and likely occurred in an ancestor of this lineage (Figure 3). The locus in H. opuntiae AWRI 3578, which is probably homothallic, likely arose from a recombination between a MATa and a rearranged MATα locus (Figure 3).
All three sequenced H. menglaensis isolates contain only a MATα locus, consistent with a haploid and heterothallic structure. Chen et al. [13] did not observe ascospore formation in the Chinese isolate (CICC 33364/NYNU 181083), which also suggests that it is haploid. However, ascospores are formed by the Irish isolates (Figure 4). In addition, 24-40 heterozygous sites were identified in the three isolates. It therefore remains possible that the genomes are highly homozygous diploids and that MATa is present but was not assembled. As the mechanisms of mating and sporulation in Hanseniaspora are poorly understood, further investigation is required to understand the processes at work.
The physiology of all five H. menglaensis isolates is similar (Table 1) [13], but there are some differences. For example, all isolates assimilate nitrogen from lysine, but only the Chinese isolate uses tryptophan and only the Irish isolates use cadaverine (Table 1). All are signatures of association with plant material, where these nitrogen sources are commonly found. Other differences include the ability of the Chinese isolate to grow at temperatures up to 30 °C, which may indicate an adaptation to different locations. Chen et al. [13] suggest that the ability to assimilate D-gluconate is a distinguishing factor between H. menglaensis and H. lindneri. However, the Irish isolates cannot assimilate D-gluconate (Table 1). We do note that an inability to metabolise ethylamine distinguishes all five H. menglaensis isolates from H. lindneri [13,57].

Figure 1. Chromosome circle diagram of H. menglaensis CBS 16921 genome assembly. The central circle (blue) shows each chromosome, labelled by number 1 through 7 on the outermost ring. Chromosome sizes are shown in 300 kb intervals. The pink-to-red heatmap rings show the genotypes of well-supported SNPs in comparison to the reference genome, in the order CBS 16921, CBS 18246 and CBS 18247 (from inner to outer). "0/0" represents sites called as homozygous for the reference allele, "1/1" represents sites called as homozygous for an alternative allele, and "0/1" represents sites called as heterozygous. The black ring shows protein-coding sequences, the green ring shows tRNA genes, and the gold ring shows the rRNA array on chromosome 3.

Figure 2. Phylogenomic tree generated from 522 single-copy orthologs from 23 Hanseniaspora isolates and 4 outgroup species, S. cerevisiae, K. marxianus, W. anomalus and C. jadinii. Bootstraps lower than 100% are shown. The accession of the reference assembly or protein set used is shown in parentheses. The slow (SEL) and fast (FEL) evolving lineages are shown with blue and yellow boxes, respectively. The new species H. menglaensis is marked in bold text.

Table A1. Comparison of ITS and D1/D2 regions to CBS 16921. * Sequences derived from whole-genome assembly.

Table A2. Comparison of average nucleotide identity (ANI) of CBS 16921 to available Hanseniaspora genomes.