A High Quality Asian Genome Assembly Identifies Features of Common Missing Regions

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent the ethnic genome, specifically for Asians, though exactly what we are missing is still elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean (AK1) to show that a part of AK1 genome is missing in GRCh38 and that the missing regions harbored ~1390 putative coding elements. Furthermore, we found that multiple populations shared some certain parts in the missing genome when we analyzed the “unmapped” (to GRCh38) reads of fourteen individuals (five East-Asians, four Europeans, and five Africans), amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The recovered AK1 regions from the “unmapped reads”, which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common (shared by ≥7 individuals) missing regions exist in human and chimpanzee DNA. Moreover, we further identified the occurrence mechanism and ethnic heterogeneity as well as the presence of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and brings up the need for further investigations on the various features of regions globally missed in GRCh38.


Introduction
DNA sequencing of the human genome is the basis of precision medicine. An overwhelming majority of sequencing performs resequencing of massive short reads, using the GRCh37/38 (a.k.a., hg19/38) human genome assembly as a reference. GRCh38 is the successor of the Human Genome Project. The GRCh38 has been further enriched (~30%) by the addition of genomes from >50 individuals, including contributions from those of African ancestry [1]. The general belief has been that a single global reference genome was sufficient because resequencing requires a reference to determine the genetic variants of individuals, rather than a database encompassing a list of variants. However, researchers find it substantial that recent findings point out the diversity of structural variation among ethnic groups [2,3] and that the human migration has evolved into complex detours. This pattern of human migration resulted in local admixture [4] and has led to questions about whether some portions of the DNA sequences are missed by the current re-sequencing methods [5,6]. In attempts to find the missing regions, some previous studies have used the "unmapped" reads, which are sequence reads of the RNA sequencing data [7] and DNA sequencing data [8][9][10] that fail to align to the reference, to identify regions with suggestive evidence of protein coding [8] or disease association [9] on the previous reference version and GRCh38. Researchers used raw fragmented genome data, and performed de novo assembly of the unmapped reads of different individuals for comparison with the reference [8]. However, when the short reads were assembled into the contigs, the contigs created from the short reads could have had missing or limited information due to the lack of continuity compared to the contigs from long reads, which raised the difficulty of placing the contigs in the reference genome.
In addition to using the de novo assembly of short unmapped reads, other studies found missing regions with long read sequences relative to GRCh38 and thoroughly examined the possibility of using the sequences as a reference patch to discover structural variants [11] and alternate alleles [12]. Although the common missing regions explored in several studies represented structural variations and alternate alleles that are not on the GRCh38 reference genome and discussed the potential of "pan reference", the gap in literature that studies the occurrence mechanism of common missing regions should be highlighted.
In this study, we first performed a comparison of the two human genome assemblies, GRCh38 and AK1 (one Korean genome assembly) [13], with high contiguity, and outlined the differences between the two genomes. Second, we re-aligned the "unmapped" reads of general samples to new assembly and further specified the estimated missing parts by tracing re-aligned "unmapped" reads. Finally, we also searched for the putative functions of the missing reference sequence and investigated the mechanisms of these events by experimentally verifying the presence and characteristics of the missing regions.
We used UCSC Kent utilities (Available online: https://github.com/ENCODE-DCC/kentUtils (accessed on 14 November 2018)) for the chaining and netting process. By generating bidirectional "chain files" indicating both homology and gaps at base-pair resolution, we categorized a total of 2832 scaffolds of AK1 into three groups according to the alignment patterns ( Figure 1). DNA polymerase (BioFact, Daejeon, Korea), and 10 pmol/μL of each oligonucleotide primer. The PCR conditions were as follows: 95 °C for 5 min, followed by 35 cycles of 30 sec of denaturation at 95 °C, 40 s at the annealing temperature, and 1 to 7 min of extension at 72 °C (depending on the expected size of the PCR product), followed by a final 5 min extension at 72 °C. Specific primer designs for 11 loci out of 31 putative insertions could not be done due to the absence of their sequence counterparts in the UCSC reference genome and the abundance of simple repeats and tranposable elements. A systematic comparison between AK1 scaffolds (n = 2382) and GRCh38.p12. The degree of match divided AK1 scaffolds into three distinct patterns of synteny by LASTZ [14]. The x axis (and vertical pop-up axis for Group 1) represents the percent of matches between AK1 scaffold and GRCh38.p12 chromosomes, and the y axis represents the count of scaffolds. GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12.
Group 1: The first scaffold group (n = 945,~2.70 Gbp in total) consisted of ≥99% of the chromosomes of GRCh38.
Group 2: The second group (n = 467,~165 Mb in total) presented partial (0% < X < 99%) matches. Group 3: The third scaffold group (n = 1420,~41 Mb) lacked synteny with GRCh38. Based on the synteny of and gaps in the chain file, we calculated the alignments between the AK1 scaffolds and GRCh38 chromosomes. To strengthen the reliability of our LASTZ results, we randomly selected a scaffold in each group, and appointed the parameters as set 1. We performed a comparison analysis using different parameters and randomly selected scaffolds (Table S1). As a result, the chain files of the scaffolds using other parameters were not significantly different.

Investigation of the Characteristics of Mapped/Unmapped Reads from BAM Files and Realignment of the Extracted "Unmapped Reads" against the AK1 Genome Assembly
For quality checks and repetitive annotation of the mapped/unmapped reads, we used FastQC [18] and RepeatMasker [19]. After samtools (samtools view -b -f 4 Inputfile) was used to extract the unmapped reads from the downloaded BAM files of the 14 multiethnic samples, we realigned the unmapped reads to the AK1 genome assembly using BWA-MEM (version 0.7.17-r1188) (Figure 2). After realignment, the sorting and the removal of duplicates were performed by SAMtools (version 1.3) and Picard Tools (version 2.0.1).
Genes 2020, 11, x FOR PEER REVIEW 5 of 16 Figure 1. A systematic comparison between AK1 scaffolds (n = 2382) and GRCh38.p12. The degree of match divided AK1 scaffolds into three distinct patterns of synteny by LASTZ [14]. The x axis (and vertical pop-up axis for Group 1) represents the percent of matches between AK1 scaffold and GRCh38.p12 chromosomes, and the y axis represents the count of scaffolds. GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12.

Systematic Comparison between GRCh38 p.12 and AK1
We first compared the whole AK1 sequence against GRCh38 ("liftover") and its alternative sequences to search for synteny. A total of 53.4 Mb (~1.8%) of the AK1 genome lacks homology with GRCh38, as we calculated the difference between "Total scaffold size" and "Size matched with GRCh38.p12" in Table 1. Dividing GRCh38 genome sequences by sequence types (chromosome; fix, error corrections or assembly improvements applied to the GRCh38 genome; random, the unlocalized contigs; unknown chromosome), we also investigated the matching sizes between AK1 scaffolds and GRCh38 genome by sequence type. The Group 1 and 2 scaffolds of AK1 matched with multiple chromosomes of GRCh38, among which the contributions of ectopic chromosomes amounted to ~22.2 Mb (~0.76%). The third group of scaffolds, which are unique to AK1, presented different genome sequences and repeat components according to the analysis using the RepeatMasker [19]. Satellites, which are multiple copies of repeated patterns that can vary in length from a single base to several thousand bases, were predominant in the repetitive components of Group 3 scaffolds (Table S2). In addition, the majority of the small size scaffolds on the AK1 genome were grouped to the third group and the N50 of the third group was 34.6 kb, although the N50 of AK1 genome data from NCBI was 44.85 Mb. To identify genome sequences that were unique to AK1 compared to GRCh38, we selected 3333 regions larger than 200 bp and searched for putative proteincoding functions via a translated BLAST [23] search within mammals. A total of 1390 regions (e-value < 10 −10 , identity ≥ 70%, and alignment length ≥ 50 bp) were predicted to harbor putative protein-coding elements (Table S3).

Profile of the "Unmapped Reads"
We selected high-depth (>50×) WGS data of 14 individuals from the 1kG database comprising Caucasians (four individuals), Asians (five individuals), and Africans (five individuals). The data represented populations from different areas and were initially aligned against GRCh38 according to the specifically written quality control information (all from the Illumina HiSeq platform). On average, ~4.7% of the WGS data (~2.6 M out of 54.6M total reads per individual) failed to align with GRCh38 and its alternative sequences. The data of Africans had the lowest alignment, and that of Caucasians had the highest mapping rate to GRCh38 ( Table 2). The quality score of the reads realigned to AK1 was 7.2, which was higher than that of the overall unmapped reads, and the reads to AK1 have compatible quality in terms of base quality and mapping quality, with slightly lower coverage compared with the reads that were initially mapped to GRCh38 ( Figure S1, Table S4). We only used reads of primary alignments to exclude reads aligning reasonably well to more than one place. We calculated the depth/breadth [20] and excluded regions with low depths(<3×) from realigned bam files for each individual using BEDTools (version 2.25.0) [21] and Samtools (version 1.3). We used output data from BEDTools to show coverage and count depth by genomic positions with R (version 3.4.3). We also used GATK-pathSeq [22] to identify those reads including the putative microbial sequences.

Functional Search for the Common Missing Regions
We searched for functional clues via BLASTx [23] for the sequences (>200 bp) that were shown to be unique to AK1 when compared with GRCh38. Additionally, we performed both BLASTn (with an e-value < 10 −10 , identity ≥70%, and coverage ≥70%) and BLASTx (with an e-value < 10 −10 , identity ≥ 70%, and alignment length ≥50 bp) searches against the nr database with default options to find whether the estimated missing regions on AK1, which were re-aligned with more than ten "unmapped reads" from bam files of two or more individuals, exist across populations, and to speculate the functional roles of the missing regions.
For further investigation on the candidate regions located on Group 1 scaffolds that are missing globally, which is defined as common missing regions in seven or more individuals, we searched the locations of the missing regions in the GRCh38 genome using a chain file ("lifting" AK1 over GRCh38). To visualize these regions, we merged 14 BAM files into one and used the UCSC genome browser [24] and Integrative Genomics Viewer (IGV) [25] to visualize the merged BAM file. The suggested functional roles of the globally missing regions were also identified via BLASTx searches with an e-value < 10 −10 , identity ≥ 70%, and alignment length ≥ 50 bp.

Verification of the Missing Regions by PCR
After the functional comparison, we selected ±2 kb of the flanking sequences on the AK1 genome from the candidate regions that are missing globally in seven or more individuals to verify the existence of the regions and also to put them into the BLAST-Like Alignment Tool (BLAT) (Available online: http://genome.ucsc.edu/cgi-bin/hgBlat (accessed on 4 May 2019)) for human (GRCh38; December 2013) and chimpanzee (panTro6; February 2018) genomes [26] to investigate the characteristics of the globally missing common regions on Group 1 scaffolds. For the experimental confirmation of the non-overlapping AK1 sequences, we performed PCR amplification using four European DNA samples and a chimpanzee DNA sample, which was distributed by the Coriell Institute (Coriell Cell Repository, Camden, NJ, USA) and provided by Dr. Takenaka (Primate Research Institute, Kyoto University, Japan). The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: NA17001, NA17002, NA17003, and NA17004. The oligonucleotide primers used for the PCR amplification of each locus were designed by using the software Primer3 [27]. The PCR amplification of each locus was conducted with 25 uL of the reaction using 100 ng of DNA, 10 µL of 2× Lamp Pfu DNA polymerase (BioFact, Daejeon, Korea), and 10 pmol/µL of each oligonucleotide primer. The PCR conditions were as follows: 95 • C for 5 min, followed by 35 cycles of 30 sec of denaturation at 95 • C, 40 s at the annealing temperature, and 1 to 7 min of extension at 72 • C (depending on the expected size of the PCR product), followed by a final 5 min extension at 72 • C. Specific primer designs for 11 loci out of 31 putative insertions could not be done due to the absence of their sequence counterparts in the UCSC reference genome and the abundance of simple repeats and tranposable elements.

Systematic Comparison between GRCh38 p.12 and AK1
We first compared the whole AK1 sequence against GRCh38 ("liftover") and its alternative sequences to search for synteny. A total of 53.4 Mb (~1.8%) of the AK1 genome lacks homology with GRCh38, as we calculated the difference between "Total scaffold size" and "Size matched with GRCh38.p12" in Table 1. Dividing GRCh38 genome sequences by sequence types (chromosome; fix, error corrections or assembly improvements applied to the GRCh38 genome; random, the unlocalized contigs; unknown chromosome), we also investigated the matching sizes between AK1 scaffolds and GRCh38 genome by sequence type. The Group 1 and 2 scaffolds of AK1 matched with multiple chromosomes of GRCh38, among which the contributions of ectopic chromosomes amounted to~22.2 Mb (~0.76%). The third group of scaffolds, which are unique to AK1, presented different genome sequences and repeat components according to the analysis using the RepeatMasker [19]. Satellites, which are multiple copies of repeated patterns that can vary in length from a single base to several thousand bases, were predominant in the repetitive components of Group 3 scaffolds (Table S2). In addition, the majority of the small size scaffolds on the AK1 genome were grouped to the third group and the N50 of the third group was 34.6 kb, although the N50 of AK1 genome data from NCBI was 44.85 Mb. To identify genome sequences that were unique to AK1 compared to GRCh38, we selected 3333 regions larger than 200 bp and searched for putative protein-coding functions via a translated BLAST [23] search within mammals. A total of 1390 regions (e-value < 10 −10 , identity ≥ 70%, and alignment length ≥ 50 bp) were predicted to harbor putative protein-coding elements (Table S3). Table 1. Statistics of the three groups of AK1 scaffolds according to a systematic comparison between AK1 scaffolds (n = 2382) and GRCh38.p12. Fix, the patches represent changes (error corrections or assembly improvements) to GRCh38 genome; Random, the unlocalized contigs of GRCh38; GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12; * Size of sum of minor contributing chromosomes.

Profile of the "Unmapped Reads"
We selected high-depth (>50×) WGS data of 14 individuals from the 1kG database comprising Caucasians (four individuals), Asians (five individuals), and Africans (five individuals). The data represented populations from different areas and were initially aligned against GRCh38 according to the specifically written quality control information (all from the Illumina HiSeq platform). On average, 4.7% of the WGS data (~2.6 M out of 54.6M total reads per individual) failed to align with GRCh38 and its alternative sequences. The data of Africans had the lowest alignment, and that of Caucasians had the highest mapping rate to GRCh38 ( Table 2). The quality score of the reads re-aligned to AK1 was 7.2, which was higher than that of the overall unmapped reads, and the reads to AK1 have compatible quality in terms of base quality and mapping quality, with slightly lower coverage compared with the reads that were initially mapped to GRCh38 ( Figure S1, Table S4).
Meanwhile, the "unpaired reads" explained the most substantial part of the unmapped reads (~59%) ( Figure S2) due to the differences in sequencing quality between read 1 and read 2. In addition to the generally lower sequencing quality, the proportion of repetitive sequences among unmapped reads showed approximately ten times more low-complexity and >2 times more simple repeats and satellites, presenting much lower proportions of SINEs, LINEs, and LTRs compared to the reference genome (Table S5). The use of massive, fragmented reads will inevitably generate equivocal data for alignment, particularly in satellite or low-complexity regions. Given the quality and components involved, technical characteristics intrinsic to the analytic platform rather than the incompleteness of the reference genome are likely to have given rise to the majority of the unmapped reads.

Genomic Regions Recovered by "Realignment" to AK1
On average, 71 K of the~2.6 M reads per individual (mapping quality >10) were newly mapped to AK1, with a very small proportion of reads of microbial origins. The recovery rates from realignment to AK1 were relatively low (0.92% or 0.49% overall for high-fidelity mapping quality) and did not show substantial differences between populations ( Table 2). The regions with recovered reads accounted for 0.2% (5.3 Mb) of the AK1 genome. We classified the recovered reads by mapping the scaffolds into three groups as shown in Figure 1. The Group 1 scaffolds harbored the largest number (n = 58,340) of realigned reads; proportionally, however, the Group 3 scaffolds were populated more broadly with unmapped reads (Table S6). Most of the realignments occurred within putative insertions and absent regions on the GRCh38 genome as the depth of the regions have similarity with that of GRCh38 (Table S4). Our findings suggest that the addition of an ethnic reference allows some missing genome regions to be salvaged, although only a small portion of the "unmapped reads" was responsible for these results.

Characterization and Heterogeneity of Common Missing Parts
By realigning "unmapped reads" to AK1, we selected 110 regions (shared by ≥ 2 individuals with read depth ≥ X10 for each) and 38 regions (shared by ≥ 7 individuals with read depth ≥ X10 for each) as the estimated missing regions, which were not on GRCh38. We took a look into the characteristics of the AK1 regions by finding repetitive sequences. The proportion of SINE and LINE was a little higher in these regions, and the value of simple repeats and low complexity on 38 regions is about 11 percent (Table 3). In addition, we scrutinized the recovered regions with short unmapped reads in the public database. Sixty-four of the 110 regions were previously reported or exhibited homology in the BLASTn searches of the mammalian genome database [9,28,29] (Table S7). Notably, 25 regions showed putative mammalian protein-coding functions in the translated BLAST search on NCBI's nr database (e-value < 10 −10 , identity ≥70%, and alignment length ≥50 bp). The list of the regions showing putative protein-coding functions is presented in Table S8. When we observed 38 regions selected as both globally sharing (≥7 individuals with read depth ≥ X10 for each) and commonly missing, one of the 38 regions was suggested to be highly homologous to zinc finger protein 454 isoform 2 (Table S9). (total unmapped reads − unpaired read) . The Group 1 scaffolds harbored 31 out of 38 common missing regions; the 31 regions could be visualized in comparison with GRCh38 and the flanking sequences were annotated. Typically, these regions were flanked by several repeat elements, such as Alu or LINE elements ( Figure S3).
After the functional comparison, we selected ±2 kb of the flanking sequences of the 38 regions to verify the existence of the regions. We experimentally verified the presence of the above 31 regions on Group 1 scaffolds, which were located within known locations of the reference genome. We conducted PCR amplification using the DNA of AK1, four Europeans and a chimpanzee. For AK1, 20 out of 31 putative insertions were verified, and 9 regions were also verified for the chimpanzee. Further examination of the breakpoints using BioEdit [30] suggested that nonhomologous end-joining with microhomology (NHEJ, n = 26) was the dominant occurrence mechanism followed by nonallelic homologous recombination (NAHR, n = 3). Interestingly, 26 of the 31 putative insertions presented exact matches with chimpanzee, and similar findings were obtained for the gorilla reference genome. For some regions, the Europeans subjects were either homozygous or heterozygous for insertions/deletions (Table 4). For example, the region (LPVO02000186.1: 2,132,760-2,132,810) on a scaffold of Group 1 was verified as insertion on GRCh38 (chr3: 95,822,539-95,830,080). In spite of the exsitence of missing region among AK1 (Korean), European and chimp, the inserted region was identified as a polymorphic region in European samples (Figure 3).

Discussion
A comparison between the reference genome and the precise ethnic genome suggested that the genomic differences between individuals exceed the previous consensus of "99.9% sharing" (which was primarily derived from human genome variation projects) and are far below in similarity with a 10% difference, derived from the assembly of unmapped reads of individuals of African ancestry [8]. This result may be explained by the fact that our results are derived from a two-genome comparison. Thus, the magnitude of the difference of ~1.8% might be either conservative or inflated: it may be conservative considering that GRCh38 is a composite genome from the contribution of >50 individuals and that structural variation was not considered, while it may be inflated considering that some of the satellite sequences showing a high proportion of repetitive sequences on Group 3 scaffolds, which are small in size, might not have been fully identified in the two assemblies.
However, it is unlikely that both possibilities have substantially affected the estimation of a ~1.8% difference, considering the quality of the two genome assemblies. In contrast to the estimated

Discussion
A comparison between the reference genome and the precise ethnic genome suggested that the genomic differences between individuals exceed the previous consensus of "99.9% sharing" (which was primarily derived from human genome variation projects) and are far below in similarity with a 10% difference, derived from the assembly of unmapped reads of individuals of African ancestry [8]. This result may be explained by the fact that our results are derived from a two-genome comparison. Thus, the magnitude of the difference of~1.8% might be either conservative or inflated: it may be conservative considering that GRCh38 is a composite genome from the contribution of >50 individuals and that structural variation was not considered, while it may be inflated considering that some of the satellite sequences showing a high proportion of repetitive sequences on Group 3 scaffolds, which are small in size, might not have been fully identified in the two assemblies.
However, it is unlikely that both possibilities have substantially affected the estimation of ã 1.8% difference, considering the quality of the two genome assemblies. In contrast to the estimated difference, only a portion of the "missing information" was recovered from the unmapped reads (<0.2% of AK1 sequences). It is likely that the differences are attributable to the high proportion of repetitive sequences in unique AK1 regions and the intrinsic limitations of the sequencing platform (e.g., extremely large numbers of low complexity characterized by the unmapped reads).
In addition, the analytic platforms for the de novo assembly between GRCh38 and AK1 differed mainly due to the time gap and rapid technological transition between the two assemblies. It is not likely that our findings reflect the methodological differences between the two genome assemblies, because we mainly focused on the regions that were confirmed through multiple approaches, which included laboratory testing.
Meanwhile, our study revealed that some parts of the missing regions might be common globally and harbor functional regions. According to our research concerning the characterization and heterogeneity of the common missing parts, the majority of the "globally missing" candidate regions found with unmapped reads of various populations might be deletions in the reference, rather than insertions in other populations. This result is consistent with a previous observation [11]. The functional search for the globally missing regions conducted in this study was preliminary and was limited to the coding sequences, so the suggested functional candidates require further validation. In addition to the existence and function of the missing parts, each of the missing parts have heterogeneity of the genomic structure by ethnicity and occurred by different mechanisms. This implies that there are differences in the occurrence mechanisms and structures in those common missing regions, although they were found in several population genomes. Thus, we see the necessity to further investigate the ethnically specific heterogenous structures and different occurrence mechanisms in the "commonly" missing regions.
In this study, we only used short reads for mapping to GRCh38 and also to AK1. Because some of the unmapped reads to both assembly genomes might be influenced by using short reads, the use of long read sequencing data could also help reduce the number of unmapped reads that stem from alignment ambiguities [31,32]. In addition, we did not compare recent pangenome and AK1, so that some of the "newly identified regions" in AK1 might overlap the pangenome of the Human Pangenome Reference Consortium (HPRC).
In conclusion, our study corroborates the usefulness of precise ethnic genomes for acquiring missing genomic information. Precise ethnic genomes in particular will become easier to obtain in the future and bear greater importance for understanding complete genome functions in addition to having a precise evolutionary history of humans. Precise ethnic genomes will also play an important role in finding other missing information and redress the research gap between populations.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/11/1350/s1, Figure S1. The distribution of the read count proportions by read quality of the mapped reads and unmapped reads on each of the two genome assemblies; Figure S2. The distribution of the read counts by read quality; Figure  S3. The 31 globally missed regions (shared by ≥7 individuals) by visual inspection with UCSC genome browser and IGV; Table S1. The match % between scaffolds and GRCh38 applied with different parameter sets; Table S2. The distribution of non-repetitive and repetitive sequences between GRCh38 genoms and AK1 Group 3 scaffolds by Repeat Masker.; Table S3. A total of 1390 regions on non-overlapping AK1 genome (>200 bp) of Group 1 and 2 were predicted to be putative coding regions; Table S4. Average mapping quality and average depth of mapped reads on GRCh38 and AK1; Table S5. The distribution of repetitive sequences on reference genome (GRCh38) and sequencing reads from 14 samples by Repeat masker; Table S6. The breadth of coverage and read counts by groups of AK1; Table S7. The 110 regions not on GRCh38 reference of Group 1, 2, and 3, including the regions with more than ten reads of more than two samples and the 64 similar sequences of 110 on BLASTn search; Table S8. The putative proteins of translated BLAST search on the 25 of 110 regions on Group 1, 2, and 3, where more than ten reads are mapped in more than two samples; Table S9. The list and translated BLAST search of 38 globally missing regions where more than ten reads are mapped in more than seven samples.