Evaluation of Liftover Tools for the Conversion of Genome Reference Consortium Human Build 37 to Build 38 Using ClinVar Variants

Although Genome Reference Consortium Human Build 38 (GRCh38) was released with improvements over GRCh37, it has not been widely adopted. Several liftover tools have been developed as a convenient approach for GRCh38 implementation. This study aimed to investigate the accuracy of liftover tools for genome conversion. Two Variant Call Format (VCF) files aligned to GRCh37 and GRCh38 were downloaded from ClinVar (clinvar_20221217.vcf.gz). Three liftover tools, CrossMap, NCBI Remap, and UCSC liftOver, were used to convert genome coordinates from GRCh37 to GRCh38. The accuracies of CrossMap, NCBI Remap, and UCSC liftOver were 99.81% (1,567,838/1,570,748), 99.69% (1,565,953/1,570,748), and 99.99% (1,570,550/1,570,748), respectively. The variants that failed conversion with all three liftover tools were all indels/duplications: a pathogenic/likely pathogenic variant (n = 1) and benign/likely benign variants (n = 7). These eight variants were identified in the ALMS, TTN, CFTR, SLCO, LDLR, PCNT, MID1, and GRIA3 genes, and none of them was present in the VCF file aligned to GRCh37. This study demonstrated that the three liftover tools could successfully convert coordinates from GRCh37 to GRCh38 for more than 99% of ClinVar variants. This study takes a first step toward the clinical implementation of GRCh38 using liftover tools. Further clinical studies are warranted to compare the performance of liftover tools and to validate re-alignment approaches in routine clinical settings.


Introduction
The accuracy and completeness of reference genomes have an important influence on the accuracy of clinical next-generation sequencing (NGS) data analyses. Since the first human reference genome was published in 2001 by the Genome Reference Consortium (GRC), several updated versions of the human reference genome have been released with subsequent incremental improvement (https://www.ncbi.nlm.nih.gov/grc, accessed on 10 August 2023). The most recent build is the Genome Reference Consortium Human Build 38 (GRCh38, released in 2013), which is also referred to as Human Genome 38 (hg38). GRCh38 was released with improvements over GRCh37 (also known as hg19, released in 2009) in terms of accuracy and completeness. Previous studies reported that, compared with GRCh37, GRCh38 updated 8000 nucleotides; filled in numerous gaps; added sequences of centromeres, telomeres, and the mitochondrial genome; and encompassed population-specific genomic contents [1,2]. The different builds of the reference genome not only result in different genome assemblies but also impact genomic analyses and variant classification. Also, the reference allele might not represent the major allele, depending on the build, because the reference genome is determined by a very small group of individuals. For instance, the Genome Aggregation Database reports that the allele frequencies of the factor V Leiden (FVL) variant [c.1601G > A (p.Arg534Gln) in the F5 gene] in GRCh37 (Chr1: 169519049) and GRCh38 (Chr1: 169549811) were 98.1% and 1.8%, respectively (https://gnomad.broadinstitute.org/variant/1-169549811-C-T?dataset=gnomad_r3, accessed on 10 August 2023) [3]. The FVL variant may be erroneously defined as the reference allele based on GRCh37. This means that the pathogenic risk alleles based on GRCh38 could be defined as non-pathogenic common alleles based on GRCh37. Furthermore, it has been reported that the use of GRCh37 could result in the false detection of variants in clinically significant genes such as KCNE1 (Jervell-Lange-Nielsen syndrome 2, long-QT syndrome 5), NOTCH2 (Alagille syndrome 2; Hajdu-Cheney syndrome), and SIK1 (developmental and epileptic encephalopathy 30) [2,4]. In addition, sequence differences between GRCh37 and GRCh38 have been reported in disease-related genes such as NCF1 (chronic granulomatous disease 1, autosomal recessive), ADAMTSL2 (geleophysic dysplasia 1), and RPS17 (Diamond-Blackfan anemia 4) [2]. Therefore, if GRCh37 continues to be used, there are risks of missing or inaccurately interpreting clinically significant variations. Several studies regarding GRCh38 alignment have reported that the implementation of GRCh38 could produce more accurate and consistent genomic data through the analysis of single nucleotide variations (SNVs), insertions/deletions/duplications (indels/dup), structural variants, and copy number variations, compared with those based on GRCh37 [4,5].
To date, GRCh38 has not been widely adopted, and GRCh37 is still extensively used in sequencing data analyses. There are clear benefits and disadvantages of using GRCh38. When switching reference genomes from GRCh37 to GRCh38, the accuracy of NGS data analyses is expected to improve along with the accuracy and completeness of the reference genome. However, the bioinformatic pipeline must be revalidated before GRCh38 can be applied. This requires additional investment of time, cost, human resources, and computational resources to facilitate the migration. In addition, the raw sequence data for re-alignment can be large, and they are not always accessible. A recent survey demonstrated that clinical laboratories hesitate to migrate to GRCh38; only 7% (2/28) of laboratories migrated to GRCh38 [6]. Furthermore, more than half of the laboratories aligning with GRCh37 (58%, 15/28) responded that they had no plan to change to GRCh38 because their routine bioinformatics pipelines are based on GRCh37, and migration to GRCh38 requires revalidation of the entire bioinformatics pipeline [6]. However, if clinical laboratories continue to use GRCh37, there is a risk of missing clinically significant variants as well as of falsely detecting pathogenic variants [3,4,7,8]. Currently, there is a need for a fast and convenient approach to implement GRCh38.

Statistics of ClinVar Variants
Two VCF files aligned to the GRCh37 (No. of clinical variants = 1,573,534) and GRCh38 (No. of clinical variants = 1,573,638) genome assemblies were downloaded from ClinVar (clinvar_20221217.vcf.gz). Mitochondrial variants were excluded from the two VCF files [No. of clinical variants = 1,570,644 (from GRCh37) and 1,570,748 (from GRCh38)]. Five variants from the VCF file aligned to GRCh37 were absent in the VCF file aligned to GRCh38, whereas 109 variants from the VCF file aligned to GRCh38 were absent in the VCF file aligned to GRCh37. A total of 1,570,639 variants were present in both VCF files. A BED file was generated using position information extracted from the VCF data aligned to GRCh37.
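The BED generation step can be illustrated with a minimal sketch. It assumes plain tab-separated VCF data lines and relies on the standard conventions that VCF positions are 1-based while BED intervals are 0-based and half-open; the function names, the use of the ID column as the BED name, and the set of mitochondrial contig labels are illustrative assumptions, not the authors' script.

```python
MITO_CONTIGS = {"MT", "M", "chrM", "chrMT"}  # assumed labels for the mitochondrial contig


def vcf_record_to_bed(chrom: str, pos: int, ref: str, name: str = ".") -> str:
    """Convert one VCF record (1-based POS) to a BED line (0-based, half-open)."""
    start = pos - 1            # BED start is 0-based
    end = start + len(ref)     # interval spans the reference allele
    return f"{chrom}\t{start}\t{end}\t{name}"


def vcf_lines_to_bed(lines):
    """Skip header lines and mitochondrial records; return BED lines."""
    bed = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, var_id, ref = line.rstrip("\n").split("\t")[:4]
        if chrom in MITO_CONTIGS:
            continue
        bed.append(vcf_record_to_bed(chrom, int(pos), ref, var_id))
    return bed


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    demo = [
        "#CHROM\tPOS\tID\tREF\tALT",
        "1\t1000\tvar1\tC\tT",
        "2\t5000\tvar2\tACG\tA",
        "MT\t100\tvar3\tA\tG",
    ]
    for bed_line in vcf_lines_to_bed(demo):
        print(bed_line)
```

Keeping the variant ID in the fourth BED column makes it possible to match each converted interval back to its source record after liftover.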

Conversion from GRCh37 to GRCh38
The conversion rate was defined as the proportion of GRCh37-aligned variants that were converted from GRCh37 to GRCh38 by each liftover tool (Figure 1).

Variant Annotation and Figure Presentation
Variant annotation based on RefSeq transcripts was completed using Ensembl Variant Effect Predictor (VEP release-106). A single identical RefSeq Select transcript was chosen to compare nomenclature between converted variants and aligned variants. Segmental duplications and pseudogenes defined by GENCODE v44 (https://www.gencodegenes.org/human/, accessed on 10 August 2023) were downloaded from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables, accessed on 10 August 2023) [16,17]. To investigate whether there are highly homologous sequences in the genomic regions where the variants are located, we annotated the information of segmental duplications and pseudogenes using ANNOVAR (24 October 2019) and bedtools (v2.25.0) [18,19].
Clinical significance of the genes and/or variants was assessed based on the information provided in ClinVar INFO fields and/or Online Mendelian Inheritance in Man (OMIM, https://www.omim.org/downloads, accessed on 10 August 2023).
Figures were presented using
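The overlap annotation against segmental duplications and pseudogenes described in this section was performed with ANNOVAR and bedtools; the underlying interval logic can be sketched in a few lines. The following is a simplified stand-in, assuming 0-based half-open intervals as in BED, with hypothetical region labels; it does not reproduce either tool.

```python
def build_index(intervals):
    """Group (chrom, start, end, label) intervals by chromosome, sorted by start."""
    by_chrom = {}
    for chrom, start, end, label in intervals:
        by_chrom.setdefault(chrom, []).append((start, end, label))
    for regions in by_chrom.values():
        regions.sort()
    return by_chrom


def overlapping_labels(index, chrom, start, end):
    """Return labels of all regions overlapping the half-open query [start, end)."""
    hits = []
    for region_start, region_end, label in index.get(chrom, []):
        if region_start >= end:
            break  # regions are sorted by start, so no later region can overlap
        if region_end > start:
            hits.append(label)
    return hits


if __name__ == "__main__":
    # Hypothetical regions for illustration only.
    index = build_index([
        ("1", 100, 200, "segdup_A"),
        ("1", 150, 300, "pseudogene_B"),
        ("2", 0, 50, "segdup_C"),
    ])
    print(overlapping_labels(index, "1", 160, 161))
```

A linear scan per query is adequate for a sketch; for ClinVar-scale input, a sorted sweep or an interval tree would be the more usual choice.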

Comparison between GRCh37-Aligned Variants and GRCh38-Aligned Variants
All 5 variants removed from the VCF file aligned to GRCh38 were large indels/dup: NC_000022.10

The non-converted P/LP variants by CrossMap (n = 68) were identified in the GPR179 (n = 5), RBP3 (n = 3), ZNHIT3 (n = 1), HNF1B (n = 57), LDLR (n = 1), and PCGF2 (n = 1) genes (Supplementary Table S1). The non-converted P/LP variant by UCSC liftOver (n = 1) was in the LDLR gene (Supplementary Table S3). The P/LP variant that failed conversion by all three liftover tools was NM_000527.5(LDLR):c.314-446_1187-386dup (Supplementary Tables S1-S3).

Discussion
We evaluated three liftover tools for genome conversion using clinical variants from ClinVar. ClinVar is one of the most commonly used clinical databases for variant curation and interpretation. To date, genome conversion of clinical variants using liftover tools has not been rigorously investigated. Pan et al. reported the conversion rate (average 99%) of SNVs from NA12878 from GRCh37 to GRCh38 using LiftoverVcf from the Picard package and CrossMap [14]. Ormond et al. reported a conversion failure rate of 0.14% from GRCh37 to GRCh38 [15]. These two studies were performed using reference materials such as NA12878, which do not represent clinical variants [14,15]. In addition, previous studies focused on the conversion of SNVs and did not consider indels/dup [14,15]. To date, there has been only one study regarding the genome conversion of a limited set of clinical variants (n = 158) using UCSC liftOver [4]. We investigated the conversion rate and accuracy of liftover tools using a large set of clinical variants from ClinVar (GRCh37-aligned ClinVar variants and GRCh38-aligned ClinVar variants), including indels/dup as well as SNVs.
To the best of our knowledge, no previous study has compared three liftover tools using clinical variants. One recent study reported a high degree of correlation between liftover tools including CrossMap, NCBI Remap, and UCSC liftOver using epigenetic data such as DNA methylation and Chromatin ImmunoPrecipitation Sequencing data [13]. In this study, we showed similar accuracy of the liftover tools using genetic data; the three liftover tools could successfully convert coordinates from GRCh37 to GRCh38 for more than 99% of ClinVar variants [CrossMap (99.82%), NCBI Remap (99.70%), and UCSC liftOver (99.99%)].
Notably, all but one of the P/LP variants were converted successfully by the combined use of multiple liftover tools. The one P/LP variant that failed conversion by all three liftover tools, NM_000527.5(LDLR):c.314-446_1187-386dup, was an 8.1 kb duplication. Considering the size of the variant, direct re-alignment is more appropriate than liftover for this variant; the larger the genomic segments and intervals of a variant, the lower the accuracy of liftover tools for its conversion.
Here, we demonstrated that most of the variants not converted by a single liftover tool were successfully converted by the other tools. For example, the variant NM_002900.3(RBP3):c.1682_1686dup (p.Thr563fs) was not converted by CrossMap, whereas it was converted successfully by NCBI Remap and UCSC liftOver. As another example, NM_006331.8(EMG1):c.126dup (p.Leu43fs) was not converted by UCSC liftOver, whereas it was converted successfully by NCBI Remap and CrossMap. This suggests that the combined use of multiple liftover tools might increase the conversion rate and accuracy.
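The complementary use of several tools can be sketched as a simple fallback scheme: take the first successful conversion in a chosen priority order, and flag variants that fail in every tool or on which the tools disagree. This is a hypothetical illustration, not the procedure used in this study; the data layout (one dictionary of variant ID to (chromosome, position) or None per tool), the priority order, and the coordinates in the example are all assumptions.

```python
def combine_liftover(results_by_tool, priority):
    """Merge per-tool liftover results.

    results_by_tool: {tool_name: {variant_id: (chrom, pos) or None}}
    priority: tool names in the order in which their results are preferred;
              every name must be a key of results_by_tool.
    Returns (combined, failed_all, conflicting).
    """
    combined, failed_all, conflicting = {}, [], []
    all_ids = set().union(*(r.keys() for r in results_by_tool.values()))
    for variant_id in sorted(all_ids):
        calls = [(tool, results_by_tool[tool].get(variant_id)) for tool in priority]
        converted = [(tool, coord) for tool, coord in calls if coord is not None]
        if not converted:
            failed_all.append(variant_id)        # no tool produced a coordinate
            continue
        if len({coord for _, coord in converted}) > 1:
            conflicting.append(variant_id)       # tools disagree; needs manual review
        combined[variant_id] = converted[0]      # (tool used, coordinate)
    return combined, failed_all, conflicting


if __name__ == "__main__":
    # Placeholder coordinates; only the pattern of success/failure mirrors the text.
    results = {
        "CrossMap": {"RBP3_dup": None, "EMG1_dup": ("12", 200), "LDLR_dup": None},
        "NCBI Remap": {"RBP3_dup": ("10", 100), "EMG1_dup": ("12", 200), "LDLR_dup": None},
        "UCSC liftOver": {"RBP3_dup": ("10", 100), "EMG1_dup": None, "LDLR_dup": None},
    }
    order = ["UCSC liftOver", "CrossMap", "NCBI Remap"]
    combined, failed_all, conflicting = combine_liftover(results, order)
    print(combined)
    print(failed_all)
```

Variants reported in `failed_all` correspond to cases, such as the large LDLR duplication, for which direct re-alignment would be the more appropriate route.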
There are pros and cons of using multiple liftover tools simultaneously. As shown in this study, the simultaneous use of multiple liftover tools could result in more successful genome conversion, and the tools could be used complementarily. Fortunately, UCSC liftOver, NCBI Remap, and CrossMap are freely available on the public web. Therefore, these tools can be easily accessed, even in laboratories without bioinformatics resources. However, most liftover tools require chain files or alignment formats, and other liftover tools are relatively time-consuming and computationally intensive, although liftover remains a more convenient and cost-effective approach than re-alignment [12,13]. Currently, the accuracy and limitations of these liftover tools have not been extensively studied. Therefore, further clinical studies should be performed to investigate the clinical utility as well as the analytical utility of several liftover tools in routine clinical settings prior to their simultaneous use.
We investigated the types of non-converted variants. We found that the proportion of indels/dup among the total studied variants was 8.93% (23,421/262,156), while the proportion of indels/dup among variants not converted by CrossMap, NCBI Remap, and UCSC liftOver was 13.56% (48/354), 13.95% (72/516), and 71.43% (10/14), respectively. Indels/dup are more complicated in variant size and sequence context and are likely to be more error-prone than SNVs. In addition, indels/dup frequently occur in repetitive sequences, which can make analyses difficult. Therefore, when using the liftover tools, especially UCSC liftOver, it may be helpful to consider the type of variants.
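The proportions quoted in this paragraph can be recomputed directly from the reported counts. The sketch below also includes a deliberately simplified variant-type rule based only on REF and ALT lengths; that rule is an assumption for illustration and is not necessarily how variant types were assigned in this study.

```python
def variant_type(ref: str, alt: str) -> str:
    """Simplified typing by allele length (an assumption, not the study's rule)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "indel/dup"
    return "other"  # e.g., multi-nucleotide substitutions of equal length


def percent(numerator: int, denominator: int) -> float:
    """Proportion as a percentage, rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


if __name__ == "__main__":
    # Counts as reported in the text: (indels/dup, total).
    counts = {
        "All studied variants": (23_421, 262_156),
        "Non-converted by CrossMap": (48, 354),
        "Non-converted by NCBI Remap": (72, 516),
        "Non-converted by UCSC liftOver": (10, 14),
    }
    for label, (n_indel, n_total) in counts.items():
        print(f"{label}: {percent(n_indel, n_total)}%")
```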
Next, we performed variant annotation and investigated whether there are highly homologous sequences such as pseudogenes and segmental duplications in the genomic regions where the variants are located. Considering that liftover tools "lift" a genome position in one reference genome build "over" to another build, highly homologous sequences might also result in genome conversion failure. However, in this study, we found no pseudogene-related factors in any non-converted variants among authentic ClinVar variants.
Previous studies have investigated the problems associated with genome conversion using liftover tools [13][14][15]. Pan et al. showed that discordant SNVs had lower read depth and higher GC content [14]. Ormond et al. reported that conversion-unstable positions were associated with gaps in the builds, contig differences between builds, and segmental duplications [15]. Therefore, it is important to pre-exclude problematic regions such as gapped regions before the implementation of liftover tools. Here, we pre-excluded the variants on alternative contigs to exclude conversion failure due to contig differences between builds. In the present study, we did not comprehensively investigate the genomic regions associated with conversion failure by liftover tools because we analyzed only VCF files downloaded from ClinVar. Neither FASTQ nor BAM files were available in this study. We demonstrated that gaps in the reference builds and variant types such as indels/dup were associated with conversion failure. Further clinical studies to investigate the genomic characteristics associated with conversion failure are recommended prior to the clinical use of liftover tools.
Previous studies reported that variants converted using liftover tools are not always concordant with the variants obtained from the aligned data [13][14][15]. Such variants might be false positive results of liftover tools because they are not in the VCF file obtained from alignment to GRCh38, which is considered the ground truth. In this study, the five variants that were included in the GRCh37-aligned VCF file and removed from the GRCh38-aligned VCF file [for example, NM_005235.2(ERBB4):c.83-200864_83-199104del] were also successfully converted by all three liftover tools. Despite successful conversion, these variants should not be reported if GRCh38 is chosen as the reference genome. Here, there were no discordant variants among the authentic ClinVar variants. Previous studies reported that discordant variants were located in regions with segmental duplications and repetitive sequences [13][14][15]. Furthermore, it has been reported that the chromosome and genomic position of converted variants can change with the use of liftover tools [10,14,15]. Inconsistent chromosome assignments between reference genomes by different liftover tools can negatively impact downstream analyses in terms of nomenclature and classification of variants. Another explanation for discordance is that a position in GRCh37 is not present in GRCh38 or vice versa [13][14][15]. This is because not all positions are comparable between the reference genomes.

Conclusions
In conclusion, three liftover tools could successfully convert coordinates from GRCh37 to GRCh38 for more than 99% of ClinVar variants. We showed that gaps in the reference builds and variant types such as indels/dup were associated with conversion failure using liftover tools. In addition, we provided the list of P/LP variants that were not converted from GRCh37 to GRCh38 by liftover tools. This list can be used to pre-exclude the variants and/or genes prior to the implementation of liftover tools. Liftover tools might be a practical alternative for genome conversion when re-alignment is not possible, even though they do not guarantee completely accurate conversion. The use of multiple liftover tools and the pre-exclusion of known variants in conversion failure regions before the implementation of liftover tools could result in more successful genome conversion. To our knowledge, this is the first study of the accuracy of three liftover tools using the largest set of clinical variants. This study takes a first step toward the clinical implementation of GRCh38. Further clinical studies are warranted to validate the performance of liftover tools, to characterize the non-converted variants in routine clinical settings, and, eventually, to improve the accuracy of liftover tools.
The conversion rate of each tool was calculated against the GRCh37-aligned variants (Figure 1): No. of converted variants by CrossMap/No. of GRCh37-aligned variants, No. of converted variants by NCBI Remap/No. of GRCh37-aligned variants, and No. of converted variants by UCSC liftOver/No. of GRCh37-aligned variants. Non-converted variants were defined as the variants that failed conversion, mapped to a different chromosome, or mapped to a different position. Variants converted using liftover tools were compared to those from the ClinVar VCF aligned to GRCh38. The accuracy of the liftover tools was assessed based on ClinVar variants aligned to GRCh38 (Figure 1): No. of converted variants by CrossMap/No. of GRCh38-aligned variants, No. of converted variants by NCBI Remap/No. of GRCh38-aligned variants, and No. of converted variants by UCSC liftOver/No. of GRCh38-aligned variants.
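These definitions can be expressed as a short sketch. The outcome labels and function names below are illustrative assumptions; the counts in the example are the accuracy figures reported in the abstract.

```python
def classify_conversion(converted, truth):
    """Classify one liftover result against the GRCh38-aligned record.

    converted: (chrom, pos) from the liftover tool, or None if conversion failed.
    truth: (chrom, pos) of the same variant in the GRCh38-aligned VCF.
    """
    if converted is None:
        return "failed"
    if converted[0] != truth[0]:
        return "different_chromosome"
    if converted[1] != truth[1]:
        return "different_position"
    return "converted"


def rate(n_converted: int, n_total: int) -> float:
    """Conversion rate or accuracy as a percentage, rounded to two decimals."""
    return round(100 * n_converted / n_total, 2)


if __name__ == "__main__":
    n_grch38 = 1_570_748  # GRCh38-aligned variants after excluding mitochondrial variants
    reported = {
        "CrossMap": 1_567_838,
        "NCBI Remap": 1_565_953,
        "UCSC liftOver": 1_570_550,
    }
    for tool, n_converted in reported.items():
        print(f"{tool}: {rate(n_converted, n_grch38)}%")
```

Any outcome other than "converted" counts toward the non-converted variants under the definition given above.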

Figure 1. Analysis workflow and variant statistics. The liftover tools CrossMap, NCBI Remap, and UCSC liftOver were used to convert the reference genome from GRCh37 to GRCh38. The conversion rate was calculated based on the comparison between converted variants and GRCh37-aligned variants. The accuracy was assessed based on the comparison between converted variants and GRCh38-aligned variants. Statistics of total ClinVar variants (A) and authentic ClinVar variants (B).

Figure 2. Comparison between aligned data and converted data by liftover tool. Aligned variants were downloaded from the ClinVar database, which consisted of variants aligned to GRCh37 and GRCh38. Converted variants from GRCh37 to GRCh38 were obtained by use of CrossMap, NCBI Remap, and UCSC liftOver. Eight variants that failed conversion by all three liftover tools were not in the VCF file aligned to GRCh37; this means that the variants were in gaps in the reference genome build. Results of total ClinVar variants (A) and authentic ClinVar variants (B).

Figure 3. Type and classification of non-converted variants by liftover tool. Type (A) and classification (B) of the variants.