Article

A High Quality Asian Genome Assembly Identifies Features of Common Missing Regions

1
Interdisciplinary Program of Bioinformatics, College of Natural Science, Seoul National University, Seoul 08826, Korea
2
Genome & Health Big Data Laboratory, Department of Health Science, Seoul National University, Seoul 08826, Korea
3
Institute of Health & Environment, Seoul National University, Seoul 08826, Korea
4
DKU-Theragen institute for NGS analysis (DTiNa), Cheonan 31116, Korea
5
Center for Bio-Medical Engineering Core Facility, Dankook University, Cheonan 31116, Korea
6
Department of Microbiology, Dankook University, Cheonan 31116, Korea
7
Department of Nanobiomedical Science, Dankook University, Cheonan 31116, Korea
8
Center for Bio-Analysis, Korea Research Institute of Standards and Science, Daejeon 34113, Korea
9
Bioinformatics Institute, Macrogen Inc., Seoul 08511, Korea
10
Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul 03080, Korea
11
Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul 03080, Korea
12
Precision Medicine Center, Seoul National University Bundang Hospital, Seongnam 13605, Korea
13
Gong-Wu Genomic Medicine Institute, Seoul National University Bundang Hospital, Seongnam 13605, Korea
*
Authors to whom correspondence should be addressed.
These authors equally contributed to this manuscript.
Genes 2020, 11(11), 1350; https://doi.org/10.3390/genes11111350
Submission received: 9 October 2020 / Revised: 6 November 2020 / Accepted: 9 November 2020 / Published: 13 November 2020
(This article belongs to the Section Human Genomics and Genetic Diseases)

Abstract

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent ethnic genomes, specifically those of Asians, though exactly what is missing remains elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean (AK1) to show that a part of the AK1 genome is missing in GRCh38 and that the missing regions harbored ~1390 putative coding elements. Furthermore, when we analyzed the “unmapped” (to GRCh38) reads of fourteen individuals (five East-Asians, four Europeans, and five Africans), we found that multiple populations shared certain parts of the missing genome, amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the “unmapped reads”, which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common (shared by ≥7 individuals) missing regions exist in human and chimpanzee DNA. Moreover, we identified the occurrence mechanisms and ethnic heterogeneity, as well as the presence, of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and highlights the need for further investigation of the various features of regions globally missed in GRCh38.

1. Introduction

DNA sequencing of the human genome is the basis of precision medicine. The overwhelming majority of sequencing studies resequence massive numbers of short reads, using the GRCh37/38 (a.k.a. hg19/38) human genome assembly as a reference. GRCh38 is the successor to the assembly of the Human Genome Project and has been further enriched (~30%) by the addition of genomes from >50 individuals, including contributions from those of African ancestry [1]. The general belief has been that a single global reference genome is sufficient because resequencing requires a reference to determine the genetic variants of individuals, rather than a database encompassing a list of variants. However, recent findings point to substantial diversity of structural variation among ethnic groups [2,3] and show that human migration took complex, indirect routes. This pattern of human migration resulted in local admixture [4] and has led to questions about whether some portions of the DNA sequences are missed by the current re-sequencing methods [5,6]. In attempts to find the missing regions, some previous studies have used the “unmapped” reads, that is, sequence reads from RNA sequencing data [7] and DNA sequencing data [8,9,10] that fail to align to the reference, to identify regions with suggestive evidence of protein coding [8] or disease association [9] on the previous reference version and GRCh38. Researchers used raw fragmented genome data and performed de novo assembly of the unmapped reads of different individuals for comparison with the reference [8]. However, contigs assembled from short reads can carry missing or limited information because they are less contiguous than contigs from long reads, which makes it more difficult to place them in the reference genome.
In addition to de novo assembly of short unmapped reads, other studies used long-read sequences to find regions missing from GRCh38 and thoroughly examined the possibility of using these sequences as a reference patch to discover structural variants [11] and alternate alleles [12]. Although the common missing regions explored in several studies represented structural variations and alternate alleles that are not in the GRCh38 reference genome, and those studies discussed the potential of a “pan reference”, the mechanisms by which common missing regions arise remain largely unstudied.
In this study, we first compared two human genome assemblies with high contiguity, GRCh38 and AK1 (one Korean genome assembly) [13], and outlined the differences between the two genomes. Second, we re-aligned the “unmapped” reads of general samples to the new assembly and further specified the estimated missing parts by tracing the re-aligned “unmapped” reads. Finally, we searched for the putative functions of the sequences missing from the reference and investigated the mechanisms of these events by experimentally verifying the presence and characteristics of the missing regions.

2. Materials and Methods

2.1. Comparison between the Reference Genome (GRCh38) and the AK1

Our study was approved by the Institutional Review Board of Seoul National University (SNU 19-11-064). We used the LASTZ program [14] to generate a chain file between AK1 (GCA_001750385.2) and GRCh38 (GCA_000001405.27), both downloaded from the NCBI website (Available online: https://www.ncbi.nlm.nih.gov/assembly/ (accessed on 21 July 2018)), with the following parameters (--gapped --gap=600,150 --hspthresh=4500 --seed=12of19 --notransition --ydrop=15000) [10].
We used UCSC Kent utilities (Available online: https://github.com/ENCODE-DCC/kentUtils (accessed on 14 November 2018)) for the chaining and netting process. By generating bidirectional “chain files” indicating both homology and gaps at base-pair resolution, we categorized a total of 2832 scaffolds of AK1 into three groups according to the alignment patterns (Figure 1).
Group 1: The first scaffold group (n = 945, ~2.70 Gbp in total) consisted of scaffolds of which ≥99% matched the chromosomes of GRCh38.
Group 2: The second group (n = 467, ~165 Mb in total) presented partial (0% < X < 99%) matches.
Group 3: The third scaffold group (n = 1420, ~41 Mb) lacked synteny with GRCh38.
Based on the synteny of and gaps in the chain file, we calculated the alignments between the AK1 scaffolds and GRCh38 chromosomes. To strengthen the reliability of our LASTZ results, we designated the parameters above as set 1, randomly selected a scaffold in each group, and repeated the comparison with different parameter sets (Table S1). The chain files of the selected scaffolds generated with the other parameter sets were not significantly different.
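The three-way grouping described above can be illustrated with a minimal Python sketch. It assumes that the matched fraction of each scaffold has already been derived from the chain file; the function names and the example scaffold identifiers are hypothetical and are not part of the original pipeline.

```python
from typing import Dict


def classify_scaffold(match_fraction: float) -> int:
    """Assign a scaffold to Group 1, 2 or 3 from its matched fraction (0..1).

    Thresholds follow the text: >=99% match is Group 1, a partial match
    (0% < X < 99%) is Group 2, and no match is Group 3.
    """
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must lie between 0 and 1")
    if match_fraction >= 0.99:
        return 1
    if match_fraction > 0.0:
        return 2
    return 3


def group_scaffolds(matched_bp: Dict[str, int],
                    scaffold_len: Dict[str, int]) -> Dict[str, int]:
    """Map each scaffold name to its group, given matched and total lengths."""
    groups = {}
    for name, length in scaffold_len.items():
        fraction = matched_bp.get(name, 0) / length
        groups[name] = classify_scaffold(fraction)
    return groups


if __name__ == "__main__":
    # Hypothetical scaffolds, for illustration only.
    lengths = {"scafA": 1000, "scafB": 1000, "scafC": 1000}
    matched = {"scafA": 995, "scafB": 400}
    print(group_scaffolds(matched, lengths))
```

A scaffold absent from the matched table is treated as having no synteny, which corresponds to Group 3 here.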

2.2. Study Samples and Materials for the Profiling of Sequencing Reads to the Reference Genome (GRCh38)

To extract unmapped reads from BAM files aligned to GRCh38, the BAM files of all samples, which were already aligned to the GRCh38 full analysis set with HLA sequences, were downloaded from the 1000 Genomes browser [2]. All the WGS data were generated on HiSeq platforms (Illumina, San Diego) with PCR-free procedures. We selected only 14 samples from 3 ethnic groups, which were deeply sequenced (depth > 50×), mapped to GRCh38 with BWA-MEM (version bwakit-0.7.12) [15], and subjected to the documented QC processes (Available online: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project (accessed on 12 December 2018)), including sorting, marking duplicates, and indel realignment by SAMtools (version 1.2), BioBamBam (version 0.0.191) [16], GATK-3.3-0 [17] and CRAMtools 3.0.

2.3. Investigation of the Characteristics of Mapped/Unmapped Reads from BAM Files and Realignment of the Extracted “Unmapped Reads” against the AK1 Genome Assembly

For quality checks and repeat annotation of the mapped/unmapped reads, we used FastQC [18] and RepeatMasker [19]. After extracting the unmapped reads from the downloaded BAM files of the 14 multiethnic samples with SAMtools (samtools view -b -f 4 Inputfile), we realigned the unmapped reads to the AK1 genome assembly using BWA-MEM (version 0.7.17-r1188) (Figure 2). After realignment, sorting and removal of duplicates were performed with SAMtools (version 1.3) and Picard Tools (version 2.0.1).
We used only reads with primary alignments to exclude reads aligning reasonably well to more than one place. We calculated the depth/breadth [20] and excluded regions with low depth (<3×) from the realigned BAM files for each individual using BEDTools (version 2.25.0) [21] and SAMtools (version 1.3). We used output data from BEDTools to show coverage and count depth by genomic position with R (version 3.4.3). We also used GATK-PathSeq [22] to identify reads containing putative microbial sequences.
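The option -f 4 keeps only records whose SAM FLAG has the 0x4 bit ("segment unmapped") set. The short Python sketch below shows that bit test on a tab-separated SAM record; the helper names and the two example records are hypothetical and serve only to illustrate the flag logic, not to replace SAMtools.

```python
UNMAPPED_BIT = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(flag: int) -> bool:
    """Return True when the 0x4 bit of a SAM FLAG is set."""
    return bool(flag & UNMAPPED_BIT)


def flag_of(sam_line: str) -> int:
    """Read the FLAG (second tab-separated column) of a SAM record."""
    return int(sam_line.split("\t")[1])


if __name__ == "__main__":
    # Hypothetical, truncated SAM records for illustration only.
    records = [
        "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
        "read2\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    ]
    kept = [r.split("\t")[0] for r in records if is_unmapped(flag_of(r))]
    print(kept)
```

FLAG 77 (1 + 4 + 8 + 64) marks a paired read that is itself unmapped, whereas FLAG 99 (1 + 2 + 32 + 64) marks a properly paired, mapped read, so only the first record is kept.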
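To make the depth filter and the notion of breadth concrete, the following Python sketch works on a per-base depth vector. It is only an illustration under the assumption that such a vector is available (for example, exported from a depth-reporting tool); the function names are hypothetical, and the sketch does not reproduce the BEDTools commands used in the study.

```python
from typing import List, Tuple


def covered_intervals(depths: List[int], min_depth: int = 3) -> List[Tuple[int, int]]:
    """Merge consecutive positions with depth >= min_depth.

    Intervals are 0-based and half-open: (start, end).
    """
    intervals = []
    start = None
    for pos, depth in enumerate(depths):
        if depth >= min_depth:
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(depths)))
    return intervals


def breadth_of_coverage(depths: List[int], min_depth: int = 3) -> float:
    """Fraction of positions covered at depth >= min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


if __name__ == "__main__":
    example = [0, 1, 3, 5, 4, 2, 0, 3, 3]  # toy depth vector
    print(covered_intervals(example))
    print(round(breadth_of_coverage(example), 3))
```

Positions below the 3× threshold split the covered stretches, so low-depth regions never contribute to the recovered intervals.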

2.4. Functional Search for the Common Missing Regions

We searched for functional clues via BLASTx [23] for the sequences (>200 bp) that were shown to be unique to AK1 when compared with GRCh38. Additionally, we performed both BLASTn (with an e-value < 10⁻¹⁰, identity ≥ 70%, and coverage ≥ 70%) and BLASTx (with an e-value < 10⁻¹⁰, identity ≥ 70%, and alignment length ≥ 50 bp) searches against the nr database with default options to determine whether the estimated missing regions on AK1, to which more than ten “unmapped reads” from the BAM files of two or more individuals were re-aligned, exist across populations, and to infer the possible functional roles of the missing regions.
To further investigate the candidate regions located on Group 1 scaffolds that are missing globally, defined as common missing regions in seven or more individuals, we searched for the locations of the missing regions in the GRCh38 genome using a chain file (“lifting” AK1 over to GRCh38). To visualize these regions, we merged the 14 BAM files into one and viewed the merged BAM file with the UCSC genome browser [24] and Integrative Genomics Viewer (IGV) [25]. The suggested functional roles of the globally missing regions were also identified via BLASTx searches with an e-value < 10⁻¹⁰, identity ≥ 70%, and alignment length ≥ 50 bp.
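As an illustration of the BLASTx thresholds above, the Python sketch below filters hits written in the default 12-column BLAST+ tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). That the hits were exported in this layout is an assumption, and the function name and example hits are hypothetical.

```python
from typing import Iterable, List


def passes_thresholds(line: str,
                      max_evalue: float = 1e-10,
                      min_identity: float = 70.0,
                      min_length: int = 50) -> bool:
    """Apply e-value, identity and alignment-length cut-offs to one hit."""
    fields = line.rstrip("\n").split("\t")
    identity = float(fields[2])    # pident, in percent
    length = int(fields[3])        # alignment length
    evalue = float(fields[10])     # expectation value
    return evalue < max_evalue and identity >= min_identity and length >= min_length


def filter_hits(lines: Iterable[str]) -> List[str]:
    """Keep only the hits that satisfy all three thresholds."""
    return [ln for ln in lines if ln.strip() and passes_thresholds(ln)]


if __name__ == "__main__":
    # Hypothetical hits for illustration only.
    hits = [
        "region1\tprotA\t85.0\t120\t18\t0\t1\t360\t10\t129\t3e-25\t110",
        "region2\tprotB\t65.0\t200\t70\t0\t1\t600\t5\t204\t1e-30\t150",
        "region3\tprotC\t90.0\t40\t4\t0\t1\t120\t1\t40\t1e-12\t60",
    ]
    for hit in filter_hits(hits):
        print(hit.split("\t")[0])
```

Only the first hypothetical hit passes: the second fails the identity cut-off and the third is shorter than the minimum alignment length. The same pattern could be adapted to the BLASTn criteria, where query coverage would need to be computed from additional columns.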
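The sharing criteria (regions supported in two or more, or seven or more, individuals) amount to counting, for each region, the number of individuals in which it passes the read-support threshold. A minimal Python sketch follows; it assumes that the set of supported regions per individual has already been computed, and all names and identifiers in it are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def count_sharing(regions_by_individual: Dict[str, Set[str]]) -> Counter:
    """Count in how many individuals each region is supported."""
    counts = Counter()
    for regions in regions_by_individual.values():
        counts.update(regions)  # each set contributes at most 1 per region
    return counts


def select_shared(regions_by_individual: Dict[str, Set[str]],
                  min_individuals: int) -> List[str]:
    """Return the sorted regions supported in at least min_individuals."""
    counts = count_sharing(regions_by_individual)
    return sorted(r for r, n in counts.items() if n >= min_individuals)


if __name__ == "__main__":
    # Hypothetical individuals and region identifiers, for illustration only.
    supported = {
        "ind1": {"r1", "r2"},
        "ind2": {"r1"},
        "ind3": {"r1", "r3"},
    }
    print(select_shared(supported, 2))
    print(select_shared(supported, 3))
```

With the study's thresholds, the same call with min_individuals set to 2 or 7 would yield the broader and the globally shared sets, respectively.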

2.5. Verification of the Missing Regions by PCR

After the functional comparison, we selected ±2 kb of the flanking sequences on the AK1 genome from the candidate regions that are missing globally in seven or more individuals, both to verify the existence of the regions and to submit them to the BLAST-Like Alignment Tool (BLAT) (Available online: http://genome.ucsc.edu/cgi-bin/hgBlat (accessed on 4 May 2019)) against the human (GRCh38; December 2013) and chimpanzee (panTro6; February 2018) genomes [26], in order to investigate the characteristics of the globally missing common regions on Group 1 scaffolds. For the experimental confirmation of the non-overlapping AK1 sequences, we performed PCR amplification using four European DNA samples and a chimpanzee DNA sample, which was distributed by the Coriell Institute (Coriell Cell Repository, Camden, NJ, USA) and provided by Dr. Takenaka (Primate Research Institute, Kyoto University, Japan). The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: NA17001, NA17002, NA17003, and NA17004. The oligonucleotide primers used for the PCR amplification of each locus were designed using the software Primer3 [27]. The PCR amplification of each locus was conducted in a 25 μL reaction using 100 ng of DNA, 10 μL of 2X Lamp Pfu DNA polymerase (BioFact, Daejeon, Korea), and 10 pmol/μL of each oligonucleotide primer. The PCR conditions were as follows: 95 °C for 5 min, followed by 35 cycles of 30 s of denaturation at 95 °C, 40 s at the annealing temperature, and 1 to 7 min of extension at 72 °C (depending on the expected size of the PCR product), followed by a final 5 min extension at 72 °C. Specific primers could not be designed for 11 loci out of 31 putative insertions due to the absence of their sequence counterparts in the UCSC reference genome and the abundance of simple repeats and transposable elements.

3. Results

3.1. Systematic Comparison between GRCh38.p12 and AK1

We first compared the whole AK1 sequence against GRCh38 (“liftover”) and its alternative sequences to search for synteny. A total of 53.4 Mb (~1.8%) of the AK1 genome lacks homology with GRCh38, calculated as the difference between “Total scaffold size” and “Size matched with GRCh38.p12” in Table 1. Dividing the GRCh38 genome sequences by sequence type (chromosome; fix, error corrections or assembly improvements applied to the GRCh38 genome; random, the unlocalized contigs; unknown chromosome), we also investigated the matching sizes between the AK1 scaffolds and the GRCh38 genome by sequence type. The Group 1 and 2 scaffolds of AK1 matched with multiple chromosomes of GRCh38, among which the contributions of ectopic chromosomes amounted to ~22.2 Mb (~0.76%). The third group of scaffolds, which are unique to AK1, presented different genome sequences and repeat components according to the analysis using RepeatMasker [19]. Satellites, which are multiple copies of repeated patterns that can vary in length from a single base to several thousand bases, were predominant in the repetitive components of Group 3 scaffolds (Table S2). In addition, the majority of the small scaffolds of the AK1 genome fell into the third group, and the N50 of the third group was 34.6 kb, although the N50 of the AK1 genome data from NCBI was 44.85 Mb. To identify genome sequences that were unique to AK1 compared to GRCh38, we selected 3333 regions larger than 200 bp and searched for putative protein-coding functions via a translated BLAST [23] search within mammals. A total of 1390 regions (e-value < 10⁻¹⁰, identity ≥ 70%, and alignment length ≥ 50 bp) were predicted to harbor putative protein-coding elements (Table S3).
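For readers unfamiliar with the N50 statistic quoted above, it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal Python sketch with toy scaffold lengths (not taken from the AK1 assembly) is:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]  # toy values, total length 20
    print(n50(toy_lengths))
```

Because the statistic is dominated by the longest scaffolds, a group made up mostly of short scaffolds, such as Group 3, has a much smaller N50 than the assembly as a whole.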

3.2. Profile of the “Unmapped Reads”

We selected high-depth (>50×) WGS data of 14 individuals from the 1kG database comprising Caucasians (four individuals), Asians (five individuals), and Africans (five individuals). The data represented populations from different areas and were initially aligned against GRCh38 according to the documented quality control procedures (all from the Illumina HiSeq platform). On average, ~4.7% of the WGS data (~2.6 M out of 54.6 M total reads per individual) failed to align with GRCh38 and its alternative sequences. The data of Africans had the lowest mapping rate to GRCh38, and those of Caucasians had the highest (Table 2). The quality score of the reads re-aligned to AK1 was 7.2, which was higher than that of the overall unmapped reads, and the reads re-aligned to AK1 had comparable base quality and mapping quality, with slightly lower coverage, compared with the reads that were initially mapped to GRCh38 (Figure S1, Table S4).
Meanwhile, the “unpaired reads” accounted for the largest share of the unmapped reads (~59%) (Figure S2) due to the differences in sequencing quality between read 1 and read 2. In addition to their generally lower sequencing quality, the unmapped reads contained approximately ten times more low-complexity sequences and more than twice as many simple repeats and satellites, with much lower proportions of SINEs, LINEs, and LTRs, compared to the reference genome (Table S5). The use of massive, fragmented reads will inevitably generate equivocal data for alignment, particularly in satellite or low-complexity regions. Given the quality and components involved, technical characteristics intrinsic to the analytic platform rather than the incompleteness of the reference genome are likely to have given rise to the majority of the unmapped reads.

3.3. Genomic Regions Recovered by “Realignment” to AK1

On average, 71 K of the ~2.6 M reads per individual (mapping quality >10) were newly mapped to AK1, with a very small proportion of reads of microbial origin. The recovery rates from realignment to AK1 were relatively low (0.92% or 0.49% overall for high-fidelity mapping quality) and did not show substantial differences between populations (Table 2). The regions with recovered reads accounted for ~0.2% (5.3 Mb) of the AK1 genome. We classified the recovered reads according to the three scaffold groups shown in Figure 1. The Group 1 scaffolds harbored the largest number (n = 58,340) of realigned reads; proportionally, however, the Group 3 scaffolds were populated more broadly with unmapped reads (Table S6). Most of the realignments occurred within putative insertions and regions absent from the GRCh38 genome, as the depth of these regions was similar to that on GRCh38 (Table S4). Our findings suggest that the addition of an ethnic reference allows some missing genome regions to be salvaged, although only a small portion of the “unmapped reads” was responsible for these results.

3.4. Characterization and Heterogeneity of Common Missing Parts

By realigning “unmapped reads” to AK1, we selected 110 regions (shared by ≥2 individuals with read depth ≥10× for each) and 38 regions (shared by ≥7 individuals with read depth ≥10× for each) as the estimated missing regions, which were not on GRCh38. We examined the characteristics of these AK1 regions by annotating their repetitive sequences. The proportion of SINEs and LINEs was slightly higher in these regions, and simple repeats and low-complexity sequences accounted for about 11% of the 38 regions (Table 3). In addition, we compared the recovered regions with short unmapped reads reported in public databases. Sixty-four of the 110 regions were previously reported or exhibited homology in the BLASTn searches of the mammalian genome database [9,28,29] (Table S7). Notably, 25 regions showed putative mammalian protein-coding functions in the translated BLAST search on NCBI’s nr database (e-value < 10⁻¹⁰, identity ≥ 70%, and alignment length ≥ 50 bp). The list of the regions showing putative protein-coding functions is presented in Table S8. Among the 38 regions selected as both globally shared (≥7 individuals with read depth ≥10× for each) and commonly missing, one was suggested to be highly homologous to zinc finger protein 454 isoform 2 (Table S9).
The Group 1 scaffolds harbored 31 out of 38 common missing regions; the 31 regions could be visualized in comparison with GRCh38 and the flanking sequences were annotated. Typically, these regions were flanked by several repeat elements, such as Alu or LINE elements (Figure S3).
After the functional comparison, we selected ±2 kb of the flanking sequences of the 38 regions to verify the existence of the regions. We experimentally verified the presence of the above 31 regions on Group 1 scaffolds, which were located within known locations of the reference genome. We conducted PCR amplification using the DNA of AK1, four Europeans and a chimpanzee. For AK1, 20 out of 31 putative insertions were verified, and 9 regions were also verified for the chimpanzee. Further examination of the breakpoints using BioEdit [30] suggested that nonhomologous end-joining with microhomology (NHEJ, n = 26) was the dominant occurrence mechanism followed by nonallelic homologous recombination (NAHR, n = 3). Interestingly, 26 of the 31 putative insertions presented exact matches with chimpanzee, and similar findings were obtained for the gorilla reference genome. For some regions, the European subjects were either homozygous or heterozygous for insertions/deletions (Table 4). For example, the region (LPVO02000186.1: 2,132,760–2,132,810) on a scaffold of Group 1 was verified as an insertion relative to GRCh38 (chr3: 95,822,539–95,830,080). Although the missing region exists in AK1 (Korean), European, and chimpanzee DNA, the inserted region was identified as polymorphic in the European samples (Figure 3).
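Microhomology at a breakpoint is commonly measured as the number of bases at one end of the inserted (or deleted) segment that are identical to the adjacent flanking sequence, so that the junction could be placed at more than one position. The Python sketch below illustrates that measurement on toy sequences. It is an assumption-laden illustration and not the BioEdit-based inspection performed in this study; the function names and sequences are hypothetical.

```python
from typing import Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def microhomology(left_flank: str, segment: str, right_flank: str) -> Tuple[int, int]:
    """Return (left, right) microhomology lengths around a segment.

    left  : bases at the end of the segment that repeat the end of the left flank
    right : bases at the start of the segment that repeat the start of the right flank
    """
    left_flank, segment, right_flank = (
        left_flank.upper(), segment.upper(), right_flank.upper())
    right = common_prefix_len(segment, right_flank)
    left = common_prefix_len(segment[::-1], left_flank[::-1])
    return left, right


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(microhomology("ACGTTAG", "GATTACATAG", "GATCCC"))
```

Short shared stretches of a few bases are the kind of signal usually associated with end-joining with microhomology, whereas long, highly similar flanking repeats would point toward homology-driven mechanisms; the thresholds used to separate the two are not specified here.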

4. Discussion

A comparison between the reference genome and the precise ethnic genome suggested that the genomic differences between individuals exceed the previous consensus of “99.9% sharing” (which was primarily derived from human genome variation projects) but fall far below the 10% difference derived from the assembly of unmapped reads of individuals of African ancestry [8]. This result may be explained by the fact that our results are derived from a two-genome comparison. Thus, the estimated difference of ~1.8% might be either conservative or inflated: it may be conservative considering that GRCh38 is a composite genome built from the contributions of >50 individuals and that structural variation was not considered, while it may be inflated considering that some of the satellite sequences on the Group 3 scaffolds, which are small in size and show a high proportion of repetitive sequences, might not have been fully identified in the two assemblies.
However, it is unlikely that either possibility has substantially affected the estimate of a ~1.8% difference, considering the quality of the two genome assemblies. In contrast to the estimated difference, only a portion of the “missing information” was recovered from the unmapped reads (<0.2% of AK1 sequences). It is likely that this gap is attributable to the high proportion of repetitive sequences in unique AK1 regions and the intrinsic limitations of the sequencing platform (e.g., the extremely large number of low-complexity sequences that characterize the unmapped reads).
In addition, the analytic platforms for the de novo assembly between GRCh38 and AK1 differed mainly due to the time gap and rapid technological transition between the two assemblies. It is not likely that our findings reflect the methodological differences between the two genome assemblies, because we mainly focused on the regions that were confirmed through multiple approaches, which included laboratory testing.
Meanwhile, our study revealed that some parts of the missing regions might be common globally and harbor functional regions. According to our characterization of the common missing parts and their heterogeneity, the majority of the “globally missing” candidate regions found with unmapped reads of various populations might be deletions in the reference, rather than insertions in other populations. This result is consistent with a previous observation [11]. The functional search for the globally missing regions conducted in this study was preliminary and was limited to the coding sequences, so the suggested functional candidates require further validation. Beyond their existence and function, the missing parts showed heterogeneity in genomic structure by ethnicity and arose through different mechanisms. This implies that the common missing regions differ in their occurrence mechanisms and structures, although they were found in the genomes of several populations. Thus, we see the necessity to further investigate the ethnically specific heterogeneous structures and different occurrence mechanisms in the “commonly” missing regions.
In this study, we only used short reads for mapping to GRCh38 and also to AK1. Because some of the reads that remained unmapped to both assemblies might reflect the limitations of short reads, the use of long-read sequencing data could help reduce the number of unmapped reads that stem from alignment ambiguities [31,32]. In addition, we did not compare AK1 with the recent pangenome, so some of the “newly identified regions” in AK1 might overlap the pangenome of the Human Pangenome Reference Consortium (HPRC).
In conclusion, our study corroborates the usefulness of precise ethnic genomes for acquiring missing genomic information. Precise ethnic genomes in particular will become easier to obtain in the future and bear greater importance for understanding complete genome functions in addition to providing a precise evolutionary history of humans. Precise ethnic genomes will also play an important role in finding other missing information and in redressing the research gap between populations.

Supplementary Materials

The following are available online at https://www.mdpi.com/2073-4425/11/11/1350/s1, Figure S1. The distribution of the read count proportions by read quality of the mapped reads and unmapped reads on each of the two genome assemblies; Figure S2. The distribution of the read counts by read quality; Figure S3. The 31 globally missed regions (shared by ≥7 individuals) by visual inspection with UCSC genome browser and IGV; Table S1. The match % between scaffolds and GRCh38 applied with different parameter sets; Table S2. The distribution of non-repetitive and repetitive sequences between GRCh38 genomes and AK1 Group 3 scaffolds by RepeatMasker; Table S3. A total of 1390 regions on non-overlapping AK1 genome (>200 bp) of Group 1 and 2 were predicted to be putative coding regions; Table S4. Average mapping quality and average depth of mapped reads on GRCh38 and AK1; Table S5. The distribution of repetitive sequences on reference genome (GRCh38) and sequencing reads from 14 samples by RepeatMasker; Table S6. The breadth of coverage and read counts by groups of AK1; Table S7. The 110 regions not on GRCh38 reference of Group 1, 2, and 3, including the regions with more than ten reads of more than two samples and the 64 similar sequences of 110 on BLASTn search; Table S8. The putative proteins of translated BLAST search on the 25 of 110 regions on Group 1, 2, and 3, where more than ten reads are mapped in more than two samples; Table S9. The list and translated BLAST search of 38 globally missing regions where more than ten reads are mapped in more than seven samples.

Author Contributions

J.K. and J.S. designed the study and wrote the manuscript. J.K., J.S., and K.H. supervised analyses. J.K., W.L., S.M., J.L., and K.B. analysed data. I.Y., Y.-K.B., C.K., J.-I.K., and J.-S.S. provided ideas and resources. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Post-Genome Technology Development Program (10050164, Developing Korean Reference Genome) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea). This research was also supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2017R1A2B2002136) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NO. 2019R1A6A3A13093761).

Acknowledgments

We gratefully acknowledge the Center for Bio-Medical Engineering Core Facility at Dankook University for providing equipment, including the PCR device.

Conflicts of Interest

The authors declare no conflict of interest.

Data and Materials Availability

The chain file used to compare GRCh38.p12 and the AK1 genome supports the findings of this study and can be found at http://dx.doi.org/10.5281/zenodo.3921981.

References

  1. Schneider, V.A.; Graves-Lindsay, T.; Howe, K.; Bouk, N.; Chen, H.C.; Kitts, P.A.; Murphy, T.D.; Pruitt, K.D.; Thibaud-Nissen, F.; Albracht, D.; et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017, 27, 849–864. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. 1000 Genomes Project Consortium; Auton, A.; Brooks, L.D.; Durbin, R.M.; Garisson, E.P.; Kang, H.M.; Korbel, J.O.; Marchini, J.L.; Mccarthy, S.; McVean, G.; et al. A global reference for human genetic variation. Nature 2015, 526, 68–74. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Sudmant, P.H.; Mallick, S.; Nelson, B.J.; Hormozdiari, F.; Krumm, N.; Huddleston, J.; Coe, B.P.; Baker, C.; Nordenfelt, S.; Bamshad, M.; et al. Global diversity, population stratification, and selection of human copy-number variation. Science 2015, 349, aab3761. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Mondal, M.; Casals, F.; Xu, T.; Dall’Olio, G.M.; Pybus, M.; Netea, M.G.; Comas, D.; Laayouni, H.; Li, Q.; Majumder, P.P.; et al. Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation. Nat. Genet. 2016, 48, 1066–1070. [Google Scholar] [CrossRef]
  5. Maretty, L.; Jensen, J.M.; Petersen, B.; Sibbesen, J.A.; Liu, S.; Villesen, P.; Skov, L.; Belling, K.; Theil Have, C.; Izarzugaza, J.M.G.; et al. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature 2017, 548, 87–91. [Google Scholar] [CrossRef] [Green Version]
  6. Genovese, G.; Handsaker, R.E.; Li, H.; Kenny, E.E.; McCarroll, S.A. Mapping the human reference genome’s missing sequence by three-way admixture in Latino genomes. Am. J. Hum. Genet. 2013, 93, 411–421.
  7. Chen, G.; Li, R.; Shi, L.; Qi, J.; Hu, P.; Luo, J.; Liu, M.; Shi, T. Revealing the missing expressed genes beyond the human reference genome by RNA-Seq. BMC Genom. 2011, 12, 590.
  8. Sherman, R.M.; Forman, J.; Antonescu, V.; Puiu, D.; Daya, M.; Rafaels, N.; Boorgula, M.P.; Chavan, S.; Vergara, C.; Ortega, V.E.; et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat. Genet. 2018, 51, 30–35.
  9. Kehr, B.; Helgadottir, A.; Melsted, P.; Jonsson, H.; Helgason, H.; Jonasdottir, A.; Jonasdottir, A.; Sigurdsson, A.; Gylfason, A.; Halldorsson, G.H.; et al. Diversity in non-repetitive human sequences not found in the reference genome. Nat. Genet. 2017, 49, 588–593.
  10. Duan, Z.; Qiao, Y.; Lu, J.; Lu, H.; Zhang, W.; Yan, F.; Sun, C.; Hu, Z.; Zhang, Z.; Li, G.; et al. HUPAN: A pan-genome analysis pipeline for human genomes. Genome Biol. 2019, 20, 149.
  11. Audano, P.A.; Sulovari, A.; Graves-Lindsay, T.A.; Cantsilieris, S.; Sorensen, M.; Welch, A.E.; Dougherty, M.L.; Nelson, B.J.; Shah, A.; Dutcher, S.K.; et al. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell 2019, 176, 663–675 e619.
  12. Li, R.; Tian, X.; Yang, P.; Fan, Y.; Li, M.; Zheng, H.; Wang, X.; Jiang, Y. Recovery of non-reference sequences missing from the human reference genome. BMC Genom. 2019, 20, 746.
  13. Seo, J.S.; Rhie, A.; Kim, J.; Lee, S.; Sohn, M.H.; Kim, C.U.; Hastie, A.; Cao, H.; Yun, J.Y.; Kim, J.; et al. De novo assembly and phasing of a Korean human genome. Nature 2016, 538, 243–247.
  14. Harris, R.S. Improved Pairwise Alignment of Genomic DNA. Ph.D. Thesis, Pennsylvania State University, State College, PA, USA, 2007.
  15. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25, 1754–1760.
  16. Tischler, G.; Leonard, S. Biobambam: Tools for read pair collation based algorithms on BAM files. Source Code Biol. Med. 2014, 9, 13.
  17. McKenna, A.; Hanna, M.; Banks, E.; Sivachenko, A.; Cibulskis, K.; Kernytsky, A.; Garimella, K.; Altshuler, D.; Gabriel, S.; Daly, M.; et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20, 1297–1303.
  18. Andrews, S. FastQC: A Quality Control Tool for High Throughput Sequence Data. 2010. Available online: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed on 20 June 2019).
  19. Smit, A.; Hubley, R.; Green, P. RepeatMasker Open-4.0. 2015. Available online: http://www.repeatmasker.org/ (accessed on 11 July 2019).
  20. Sims, D.; Sudbery, I.; Ilott, N.E.; Heger, A.; Ponting, C.P. Sequencing depth and coverage: Key considerations in genomic analyses. Nat. Rev. Genet. 2014, 15, 121–132.
  21. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
  22. Walker, M.A.; Pedamallu, C.S.; Ojesina, A.I.; Bullman, S.; Sharpe, T.; Whelan, C.W.; Meyerson, M. GATK PathSeq: A customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts. Bioinformatics 2018, 34, 4287–4289.
  23. Camacho, C.; Coulouris, G.; Avagyan, V.; Ma, N.; Papadopoulos, J.; Bealer, K.; Madden, T.L. BLAST+: Architecture and applications. BMC Bioinform. 2009, 10, 421.
  24. Karolchik, D.; Baertsch, R.; Diekhans, M.; Furey, T.S.; Hinrichs, A.; Lu, Y.; Roskin, K.M.; Schwartz, M.; Sugnet, C.W.; Thomas, D.; et al. The UCSC genome browser database. Nucleic Acids Res. 2003, 31, 51–54.
  25. Robinson, J.T.; Thorvaldsdottir, H.; Winckler, W.; Guttman, M.; Lander, E.S.; Getz, G.; Mesirov, J.P. Integrative genomics viewer. Nat. Biotechnol. 2011, 29, 24–26.
  26. Kent, W.J. BLAT—the BLAST-like alignment tool. Genome Res. 2002, 12, 656–664.
  27. Untergasser, A.; Cutcutache, I.; Koressaar, T.; Ye, J.; Faircloth, B.C.; Remm, M.; Rozen, S.G. Primer3--new capabilities and interfaces. Nucleic Acids Res. 2012, 40, e115.
  28. Wong, K.H.Y.; Levy-Sakin, M.; Kwok, P.Y. De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations. Nat. Commun. 2018, 9, 3040.
  29. Fan, X.; Chaisson, M.; Nakhleh, L.; Chen, K. HySA: A Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies. Genome Res. 2017, 27, 793–800.
  30. Hall, T.A. BioEdit: A user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl. Acids Symp. Ser. 1999, 41, 95–98.
  31. Derrien, T.; Estelle, J.; Marco Sola, S.; Knowles, D.G.; Raineri, E.; Guigo, R.; Ribeca, P. Fast computation and applications of genome mappability. PLoS ONE 2012, 7, e30377.
  32. Pockrandt, C.; Alzamel, M.; Iliopoulos, C.S.; Reinert, K. GenMap: Ultra-fast computation of genome mappability. Bioinformatics 2020, 36, 3687–3692.
Figure 1. A systematic comparison between AK1 scaffolds (n = 2832) and GRCh38.p12. Based on the degree of match computed with LASTZ [14], the AK1 scaffolds fell into three distinct patterns of synteny. The x axis (and the vertical pop-up axis for Group 1) represents the percentage of matches between each AK1 scaffold and the GRCh38.p12 chromosomes, and the y axis represents the number of scaffolds. GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12.
Figure 2. The process of realigning reads that did not map to GRCh38 onto AK1.
Figure 3. An example of a region globally missing from GRCh38, examined with the UCSC Genome Browser, and the experimental verification of its existence. The region (Group 1), covered at high depth in 7 or more samples, was found within the inserted sequence (yellow block). The G1-26 region (insertion at chr3:95,825,553–95,825,555) lies near an L1M2 element. The yellow block is the insertion relative to GRCh38 estimated from the chain file. The grey blocks are repetitive sequences. The pink block is sequence present only in the GRCh38 genome. Chimp, chimpanzee; Eur, European.
Table 1. Statistics of the three groups of AK1 scaffolds according to a systematic comparison between AK1 scaffolds (n = 2832) and GRCh38.p12. Fix, patches representing changes (error corrections or assembly improvements) to the GRCh38 genome; Random, unlocalized contigs of GRCh38; GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12; * Summed size of the minor contributing chromosomes.
| | All | Group 1 | Group 2 | Group 3 |
|---|---|---|---|---|
| Number of Scaffolds | 2832 | 945 | 467 | 1420 |
| Total scaffold size (Scaffold N50) | 2904 Mb (44.85 Mb) | 2697 Mb (45.09 Mb) | 165 Mb (13.74 Mb) | 41 Mb (34.60 kb) |
| Size matched with GRCh38.p12 (%) | 2851 Mb (98.2) | 2691 Mb (99.8) | 160 Mb (96.2) | 0 |
| by Sequence types: Chromosomes (or alternative) | 2839 Mb | 2681 Mb | 158 Mb | 0 |
| by Sequence types: Fix | 8047 kb | 7831 kb | 216 kb | 0 |
| by Sequence types: Random | 2783 kb | 1906 kb | 878 kb | 0 |
| by Sequence types: Unknown chromosomes | 1005 kb | 648 kb | 358 kb | 0 |
| Scaffolds matched multiple chromosomes of GRCh38.p12 | 487 | 343 | 144 | 0 |
| Total size of scaffolds contributed from multiple chromosomes * | 22.2 Mb | 21.1 Mb | 1.1 Mb | 0 |
Table 2. Counts of unmapped reads by sample.
| Sample ID | Ancestry | Population | Total Number of Unmapped Reads (K) | Unpaired Reads, Counts (K) (%) | Mapped on AK1, Overall: Read Counts (K) (Mapping Rate %) * | Mapped on AK1, Mapping Quality > 10: Read Counts (K) (Mapping Rate %) * | Suggestive Microbial Origin, Read Count |
|---|---|---|---|---|---|---|---|
| HG02922 | AFR | Esan | 59,751 | 36,871 (61.7) | 205 (0.9) | 110 (0.5) | 318 |
| HG03052 | AFR | Mende | 34,958 | 21,174 (60.6) | 127 (0.9) | 67 (0.5) | 401 |
| NA19625 | AFR | African-American SW | 48,718 | 34,396 (70.6) | 121 (0.8) | 63 (0.4) | 353 |
| HG01879 | AFR | African-Caribbean | 35,674 | 198,064 (55.5) | 165 (1.0) | 78 (0.5) | 1191 |
| NA19017 | AFR | Luhya | 33,965 | 20,442 (60.2) | 96 (0.7) | 56 (0.4) | 2188 |
| AFR average | | | 42,613 | | Mean % 0.90 | Mean % 0.46 | |
| HG00419 | EAS | Southern Han Chinese | 34,935 | 22,398 (64.1) | 131 (1.0) | 66 (0.5) | 527 |
| NA18525 | EAS | Han Chinese | 15,620 | 8,759 (56.1) | 51 (0.7) | 34 (0.5) | 517 |
| HG01595 | EAS | Kinh Vietnamese | 59,355 | 31,507 (53.1) | 265 (1.0) | 140 (0.5) | 3405 |
| NA18939 | EAS | Japanese | 27,950 | 15,520 (55.5) | 127 (1.0) | 66 (0.5) | 522 |
| HG00759 | EAS | Dai Chinese | 44,510 | 21,418 (48.1) | 234 (1.0) | 117 (0.5) | 512 |
| EAS average | | | 36,474 | | Mean % 0.95 | Mean % 0.51 | |
| NA20502 | EUR | Tuscan | 26,343 | 19,640 (74.6) | 57 (0.9) | 33 (0.5) | 1557 |
| HG00096 | EUR | British | 29,915 | 16,773 (56.1) | 108 (0.8) | 64 (0.5) | 1878 |
| HG01500 | EUR | Spanish | 31,331 | 15,726 (50.2) | 164 (1.1) | 76 (0.5) | 2423 |
| HG00268 | EUR | Finnish | 19,255 | 12,139 (63.0) | 58 (0.8) | 36 (0.5) | 289 |
| EUR average | | | 26,711 | | Mean % 0.88 | Mean % 0.49 | |
| Total Average (Mean ± sd) | | | 35,877 ± 13,193 | 21,184 ± 8091 (59.0%) | 137 ± 65 (0.92%) | 71 ± 31 (0.49%) | 1149 ± 988 |
Suggestive microbial origin was analyzed by GATK-PathSeq. African-American SW, African-American Southwest; * $\text{Mapping rate} = \dfrac{\text{No. of reads realigned to AK1}}{\text{total unmapped reads} - \text{unpaired reads}}$.
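The mapping rate defined in the footnote can be checked directly against the columns of Table 2. The following is a minimal sketch, assuming read counts are given in thousands (K) as in the table and that the rate is reported as a percentage; the function name `mapping_rate` is hypothetical and is not part of the study's pipeline.

```python
def mapping_rate(realigned_k: float, total_unmapped_k: float, unpaired_k: float) -> float:
    """Return the AK1 mapping rate in percent.

    Follows the Table 2 footnote: reads realigned to AK1 divided by
    (total unmapped reads - unpaired reads). All counts are in thousands.
    """
    paired_k = total_unmapped_k - unpaired_k
    if paired_k <= 0:
        raise ValueError("unpaired reads must be fewer than total unmapped reads")
    return 100.0 * realigned_k / paired_k


# HG02922 (Esan) row of Table 2: 205 K realigned, 59,751 K unmapped, 36,871 K unpaired.
rate = mapping_rate(205, 59_751, 36_871)
print(f"{rate:.1f}%")  # prints 0.9%
```

The same call with the mapping-quality-filtered count (110 K) gives roughly 0.5%, in line with the value listed for that sample.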
Table 3. The distribution of repetitive sequences in the putative missing regions on AK1 scaffolds. The missing regions estimated from unmapped reads, 110 regions (≥10×, ≥2 individuals) and 38 regions (≥10×, ≥7 individuals), were examined for repetitive sequences with RepeatMasker. Values are mean % (SD).
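| Family | Subfamily | 110 Regions (More than Ten Reads Are Mapped in More than Two Samples), Mean % (SD) | 38 Regions (More than Ten Reads Are Mapped in More than Seven Samples), Mean % (SD) |
|---|---|---|---|
| SINE | All | 8.01 (9.85) | 2.54 (5.41) |
| SINE | ALUs | 6.41 (12.25) | 0.27 (1.63) |
| SINE | MIRs | 1.6 (6.65) | 2.27 (7.37) |
| LINE | All | 7.34 (13.35) | 3.64 (13.80) |
| LINE | LINE1 | 5.13 (15.50) | 3.64 (13.80) |
| LINE | LINE2 | 2.21 (10.77) | 0 |
| LINE | L3/CR1 | 0 | 0 |
| LTR | All | 2.47 (4.79) | 0.56 (2.50) |
| LTR | ERVL | 0.88 (5.86) | 0 |
| LTR | ERVL-MaLRs | 0.98 (4.35) | 0.56 (2.50) |
| LTR | ERV-class I | 0.60 (3.93) | 0 |
| LTR | ERV-class II | 0 | 0 |
| DNA | All | 0.14 (0.70) | 0 |
| DNA | hAT-Charlie | 0.14 (0.70) | 0 |
| DNA | TcMar-Tigger | 0 | 0 |
| Unclassified | | 0.48 (5.01) | 0 |
| Small RNA | | 0.05 (0.51) | 0 |
| Satellite | | 8.94 (26.92) | 7.85 (26.82) |
| Simple repeats | | 17.62 (33.73) | 10.82 (29.95) |
| Low complexity | | 11.80 (31.59) | 0.52 (2.00) |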
SINE = Short interspersed nuclear elements; MIR = Mammalian-wide interspersed repeats; LINE = Long interspersed nuclear elements; LTR = Long terminal repeat; ERVL = Endogenous retrovirus-L; ERVL-MaLRs = Endogenous retrovirus-L-Mammalian apparent LTR Retrotransposons; ERV = Endogenous retroviruses.
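The sharing thresholds in the Table 3 caption (regions supported by at least ten reads in at least two, or at least seven, individuals) amount to a per-region count of supporting samples. The following is a minimal sketch of that counting rule, not the authors' implementation: the function names, the tier labels, and the read counts are all hypothetical, and "≥X10" is assumed to mean ten or more realigned reads.

```python
from typing import Mapping

MIN_READS = 10   # assumed reading of ">=X10" in the Table 3 caption
SHARED_MIN = 2   # sample threshold for the 110-region set
COMMON_MIN = 7   # sample threshold for the 38-region set


def supporting_samples(reads_per_sample: Mapping[str, int],
                       min_reads: int = MIN_READS) -> int:
    """Count samples whose realigned read count reaches min_reads."""
    return sum(1 for n in reads_per_sample.values() if n >= min_reads)


def sharing_tier(reads_per_sample: Mapping[str, int]) -> str:
    """Label a region by how many samples support it (labels are hypothetical)."""
    n = supporting_samples(reads_per_sample)
    if n >= COMMON_MIN:
        return "common"
    if n >= SHARED_MIN:
        return "shared"
    return "below threshold"


# Made-up read counts for one region across 14 hypothetical samples.
counts = [25, 0, 14, 11, 3, 40, 12, 9, 18, 0, 10, 7, 22, 1]
example = {f"sample_{i}": reads for i, reads in enumerate(counts, start=1)}
print(supporting_samples(example), sharing_tier(example))  # prints: 8 common
```

Because every region that passes the seven-sample threshold also passes the two-sample threshold, the "common" tier here is a subset of the "shared" tier.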
Table 4. Characteristics and verification of the presence of the estimated globally missing regions on Group 1 scaffolds. Candidate globally missing regions, together with ±2 kb of flanking sequence, were searched, and 20 of the 31 globally missing regions (shared by ≥7 individuals) were verified by PCR.
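Column groups: AK1 Genome Information (ID, Scaffold of AK1, Start, End); Sequence Comparison Using UCSC BLAT (Human, Chimp, Gorilla); Validated by PCR (Eur 1 to Eur 4). Start and End give the Estimated Location of Globally Missing Region (≥7 indiv).

| ID | Scaffold of AK1 | Start | End | Human (GRCh38) | Chimp (panTro6) | Gorilla (gorGor4) | Eur 1 | Eur 2 | Eur 3 | Eur 4 | Hg38 Position | Verified Actual Indel Size (bp) | Breakpoint Structure | Mechanism | Microhomology (bp) | Microhomology Sequence or Homologous Sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1-1 | KV784719.1 | 30,209,977 | 30,210,924 | X | O | O | O | O | O | O | chr13:48910547-48914294 | 1198 | Unique-Unique | NHEJ | 0 | |
| G1-2 | KV784719.1 | 79,001,655 | 79,002,640 | X | N | O | X | X | X | X | chr13:97337324-97340428 | 1333 | SR-SR | NHEJ | 4 | TGTG |
| G1-3 | KV784719.1 | 93,452,303 | 93,455,222 | - | O | O | X | X | X | X | N/A | N/A | N/A | N/A | N/A | N/A |
| G1-4 | KV784719.1 | 93,470,705 | 93,471,918 | - | O | O | X | X | X | X | N/A | N/A | N/A | N/A | N/A | N/A |
| G1-5 | KV784720.1 | 27,885,647 | 27,886,104 | X | O | O | O | O | O | O | chr4:79781761-79785451 | 767 | Alu-Unique | NHEJ | 2 | CT |
| G1-6 | KV784723.1 | 8,349,171 | 8,349,628 | X | O | O | O | O | O | O | chr4:181366776-181370046 | 1192 | Unique-Unique | NHEJ | 0 | |
| G1-7 | KV784723.1 | 10,288,012 | 10,288,493 | X | O | O | Del | Del | Pol | Del | chr4:179430209-179433860 | 827 | Unique-Unique | NHEJ | 4 | ATTT |
| G1-8 | KV784723.1 | 34,400,763 | 34,401,227 | X | O | O | X | X | X | X | chr4:155347518-155351075 | 901 | Unique-Unique | NAHR | 38 | TTTCTTGTCTCCTGCCTTCTGCCAAGCCTTAGTCACAA |
| G1-9 | KV784731.1 | 15,610,509 | 15,611,959 | X | O | N | O | O | O | O | chr5:6446724-6450554 | 1636 | SR-Unique | NHEJ | 4 | CTGC |
| G1-10 | KV784736.1 | 6,179,476 | 6,184,176 | X | O | O | O | O | O | O | chr6:67607329-67611067 | 4961 | Alu-L1 | NHEJ | 4 | AAAA |
| G1-11 | KV784736.1 | 18,433,040 | 18,435,697 | X | O | O | O | O | O | O | chr6:79899617-79903449 | 2892 | Unique-Unique | NHEJ | 5 | GGACT |
| G1-12 | KV784738.1 | 33,432,222 | 33,432,240 | X | O | N | X | X | X | X | chr10:2389608-2395439 | 4163 | Unique-Unique | NHEJ | 5 | CCCTC |
| G1-13 | KV784747.1 | 1,225,842 | 1,227,344 | X | O | O | Del | Del | Pol | O | chr6:28174388-28177850 | 2035 | Unique-Unique | NHEJ | 2 | AG |
| G1-14 | KV784754.1 | 50,234,036 | 50,235,663 | X | O | O | O | O | O | O | chr8:136025060-136028726 | 1957 | Unique-Alu | NHEJ | 5 | ATCTC |
| G1-15 | KV784761.1 | 2,374,855 | 2,374,857 | X | - | - | O | O | O | O | chr18:13980325-13983782 | 543 | Unique-Unique | NHEJ | 4 | TCCT |
| G1-16 | KV784762.1 | 646,396 | 646,455 | X | N | N | X | X | X | X | chr19:869056-876703 | 2372 | G-rich-G-rich | NHEJ | 4 | GGGG |
| G1-17 | KV784762.1 | 942,159 | 943,260 | X | O | O | O | O | O | O | chr19:1160489-1162472 | 3127 | Alu-Alu | NAHR | 25 | CCTGTAATCCCAGCACTTTGGGAGG |
| G1-18 | KV784774.1 | 387,226 | 387,651 | X | O | O | X | X | X | X | chrX:47084676-47092500 | 2920 | SR-Alu | NHEJ | 3 | ATG |
| G1-19 | KV784797.1 | 27,753,978 | 27,754,392 | X | O | O | O | O | O | O | chr1:93874952-93876859 | 2521 | Unique-Alu | NHEJ | 0 | |
| G1-20 | KV784800.1 | 13,617,523 | 13,617,941 | X | O | O | Pol | Pol | O | Pol | chr10:63781277-63784929 | 763 | Alu-Unique | NHEJ | 4 | AGAA |
| G1-21 | KV784803.1 | 15,594,978 | 15,595,455 | X | O | O | O | Pol | Del | Del | chr14:88710100-88713185 | 1390 | LTR-LTR | NHEJ | 6 | GAACTG |
| G1-22 | KV784803.1 | 21,188,206 | 21,188,829 | X | O | O | Del | Del | O | O | chr14:83119034-83122153 | 1504 | L1-Unique | NHEJ | 3 | AGA |
| G1-23 | KV784804.1 | 4,078,861 | 4,078,900 | X | O | O | O | O | O | O | chr17:40521389-40524617 | 820 | Unique-Alu | NHEJ | 1 | G |
| G1-24 | KV784806.1 | 65,330,325 | 65,332,270 | X | O | O | O | O | O | O | chr2:21821760-21825542 | 2160 | L1-Unique | NHEJ | 1 | T |
| G1-25 | KV784811.1 | 3,734,091 | 3,735,143 | X | O | O | O | O | O | O | chr7:68760760-68763395 | 2414 | Alu-Unique | NHEJ | 3 | AAG |
| G1-26 | LPVO02000186.1 | 2,132,760 | 2,132,810 | X | O | O | Pol | Pol | O | Pol | chr3:95822539-95830080 | 2497 | L1-Unique | NHEJ | 0 | |
| G1-27 | LPVO02000191.1 | 8,716,140 | 8,716,258 | X | O | N | X | X | X | X | chr3:194273873-194277269 | 720 | G-rich-G-rich | NHEJ | 2 | GG |
| G1-28 | LPVO02000230.1 | 3,020,537 | 3,020,573 | X | X | N | X | X | X | X | chr5:181099166-181102877 | 615 | SR-SR | NHEJ | 3 | CCT |
| G1-29 | LPVO02000423.1 | 11,658,530 | 11,658,908 | X | O | O | X | X | X | X | chr11:101923894-101927461 | 806 | Alu-Alu | NHEJ | 8 | GTGCAGTG |
| G1-30 | LPVO02000423.1 | 13,811,264 | 13,811,292 | X | O | O | Del | Pol | Del | Pol | chr11:104076897-104080443 | 579 | LTR-Unique | NHEJ | 2 | TT |
| G1-31 | LPVO02000621.1 | 1,217,413 | 1,217,481 | X | N | N | X | X | X | X | chrX:2318537-2323680 | 4923 | Alu-Alu | NAHR | 24 | GTGGAGGTTGCAGTGAGCCGAGAT |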
The Estimated Location of Globally Missing Region, Start/End (≥7 indiv) = start/end position of the sequence covered by unmapped reads from 7 or more samples; Eur, European; X, does not exist; O, same as AK1; "-", not matched to the primate reference genome; N, matched but includes ambiguous bases (Ns); Del, deletion; Pol, polymorphic; SR, simple repeat; NHEJ, non-homologous end-joining; NAHR, non-allelic homologous recombination; N/A, not available.
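When working with the Hg38 Position column of Table 4, it can help to turn the `chrom:start-end` strings into numeric intervals, for example to confirm that a reported insertion point falls inside the listed window. The following is a minimal sketch; the `Region` type and the helper names are hypothetical, and the coordinates are treated as a closed interval, which is an assumption because the table does not state its coordinate convention.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Accepts strings such as "chr3:95822539-95830080"; commas in numbers are allowed.
_REGION_RE = re.compile(r"^(chr[0-9A-Za-z_]+):([\d,]+)-([\d,]+)$")


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string; raises ValueError for entries such as 'N/A'."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end string: {text!r}")
    chrom, start, end = match.groups()
    region = Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))
    if region.start > region.end:
        raise ValueError(f"start is greater than end: {text!r}")
    return region


def contains(region: Region, chrom: str, pos: int) -> bool:
    """True if pos lies within region (closed interval, assumed convention)."""
    return region.chrom == chrom and region.start <= pos <= region.end


# G1-26 row of Table 4 and the insertion coordinate quoted in the Figure 3 caption.
g1_26 = parse_region("chr3:95822539-95830080")
print(g1_26.end - g1_26.start)            # prints 7541
print(contains(g1_26, "chr3", 95825553))  # prints True
```

Rows G1-3 and G1-4 list N/A in this column, so a caller would need to catch the `ValueError` or skip those rows before parsing.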
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Kim, J.; Sung, J.; Han, K.; Lee, W.; Mun, S.; Lee, J.; Bahk, K.; Yang, I.; Bae, Y.-K.; Kim, C.; et al. A High Quality Asian Genome Assembly Identifies Features of Common Missing Regions. Genes 2020, 11, 1350. https://doi.org/10.3390/genes11111350
