Comprehensive analysis of the genetic variation in the LPA gene from short-read sequencing

Lipoprotein (a) [LP(a)] is a risk factor for cardiovascular diseases and mainly regulated by the complex LPA gene. We investigated the types of variation in the LPA gene and their predictive performance on LP(a) concentration. We determined the Kringle IV-type 2 (KIV-2) copy number (CN) using the DRAGEN LPA Caller (DLC) and a read-depth based CN estimator in 8351 whole genome sequencing samples from the GENESIS-HD study. The pentanucleotide repeat in the promoter region was genotyped with GangSTR and ExpansionHunter. LP(a) concentration was available in 4861 population-based subjects. Predictive performance on LP(a) concentration was investigated using random forests. The agreement of the KIV-2 CN between the two specialized callers was high (r=0.9966; 95% confidence interval [CI] 0.9965–0.9968). Allele-specific KIV-2 CN could be determined in 47.0% of the subjects using the DLC. Lp(a) concentration can be better predicted from allele-specific KIV-2 CN than total KIV-2 CN. Two single nucleotide variants 4925G>A and rs41272114 further improved prediction. The genetically complex LPA gene can be analyzed with excellent agreement between different callers. The allele-specific KIV-2 CN is more important for predicting LP(a) concentration than the total KIV-2 CN. It would be important that the allele-specific KIV-2 CN is determinable in all subjects.


Introduction
Lipoprotein (a) or Lp(a) molecules are a unique group of lipoproteins with unclearly deciphered function.Lp(a) is a lipoprotein consisting of a large apolipoprotein (a) molecule and a low-density lipoprotein (LDL) particle [1,2].The apolipoprotein (a) is covalently bound to apolipoprotein B-100 through a disulfide bond [1,2].Lp(a) shows a broad range of concentrations, ranging from 0.1 mg/dL to more than 200 mg/dL [3,4].Lp(a) levels differ between populations of different geographic origin, and African populations show 2-3-fold higher Lp(a) levels than European and Asian populations [4].
High Lp(a) levels are associated with an increased cardiovascular disease (CVD) risk [5].Several hypotheses exist as to why high Lp(a) levels may increase CVD risk.
BioMed 2024, 4 157 One hypothesis is that Lp(a) inhibits the breakdown of blood clots by interfering with plasminogen.Interestingly, the PLG gene, which encodes plasminogen, and the LPA gene are thought to have evolved from the recent duplication of a common precursor gene [6][7][8].It has been shown that Lp(a) acts as a modulator between blood clotting and fibrinolysis, but the physiological function remains unknown [3].Furthermore, it was proposed that Lp(a) delivers cholesterol to sites of injury and wound healing when bound to fibrin [9].
Lp(a) levels are highly determined by underlying genetics, and the LPA gene explains up to 90% of the Lp(a) variation [3].The main driver for the variability of Lp(a) levels is the Kringle IV (KIV) structure, which consists of 10 subtypes (Figure 1).The subtype KIV-2 is 5.5 kilobases long, highly variable, and the number of KIV-2 repeats ranges from 1 to >40 [4].A high number of KIV-2 repeats is associated with low Lp(a) protein levels, and more than 95% of subjects are heterozygous for the number of KIV-2 repeats [3].It is important to note that Lp(a) protein levels are primarily determined by the number KIV-2 repeats on the shorter allele, because a small number of KIV-2 repeats might facilitate secretion by the liver, and therefore leads to a higher Lp(a) protein levels [8,10,11].
BioMed 2024, 4, FOR PEER REVIEW 2 levels differ between populations of different geographic origin, and African populations show 2-3-fold higher Lp(a) levels than European and Asian populations [4].
High Lp(a) levels are associated with an increased cardiovascular disease (CVD) risk [5].Several hypotheses exist as to why high Lp(a) levels may increase CVD risk.One hypothesis is that Lp(a) inhibits the breakdown of blood clots by interfering with plasminogen.Interestingly, the PLG gene, which encodes plasminogen, and the LPA gene are thought to have evolved from the recent duplication of a common precursor gene [6][7][8].It has been shown that Lp(a) acts as a modulator between blood clotting and fibrinolysis, but the physiological function remains unknown [3].Furthermore, it was proposed that Lp(a) delivers cholesterol to sites of injury and wound healing when bound to fibrin [9].
Lp(a) levels are highly determined by underlying genetics, and the LPA gene explains up to 90% of the Lp(a) variation [3].The main driver for the variability of Lp(a) levels is the Kringle IV (KIV) structure, which consists of 10 subtypes (Figure 1).The subtype KIV-2 is 5.5 kilobases long, highly variable, and the number of KIV-2 repeats ranges from 1 to > 40 [4].A high number of KIV-2 repeats is associated with low Lp(a) protein levels, and more than 95% of subjects are heterozygous for the number of KIV-2 repeats [3].It is important to note that Lp(a) protein levels are primarily determined by the number KIV-2 repeats on the shorter allele, because a small number of KIV-2 repeats might facilitate secretion by the liver, and therefore leads to a higher Lp(a) protein levels [8,10,11].The number of KIV-2 repeats has been traditionally measured by laborious wet-lab experiments, such as immunoblotting and pulsed-field gel electrophoresis (PFGE) [12,13].Because of the strong linkage disequilibrium (LD) between the KIV-2 repeat and several single nucleotide variations (SNVs) in the LPA gene, SNVs have been used as simple-to-use proxies for the KIV-2 repeat [4,14,15].While this approach is accurate for European-like populations, it is diminishing for other populations, especially East Asian-like populations [15].
The use of long-read sequencing showed promising results to resolve complex genes such as LPA because of their high read length [16].The high costs associated with this technology prevent its widespread application in both clinical applications and large-scale population-based studies [17].Furthermore, even long-read sequencing technologies show non-unique alignments in the complex KIV-2 region [18]; the longer the reads, the lower the fraction of nun-unique alignments.
A recent technological advancement is the development of specialized callers for estimating the number of KIV-2 repeats from short-read whole genome sequencing (WGS) data.Specifically, the DRAGEN LPA Caller allows determining the number of KIV-2 repeats [18].Depending on the genotypes at two SNVs at position 296 and 1264 within the KIV-2 repeat unit, this caller may be able to determine the allele-specific KIV-2 repeat number [18].A more general caller, termed read depth-based copy number estimator (CNE), to determine The number of KIV-2 repeats has been traditionally measured by laborious wet-lab experiments, such as immunoblotting and pulsed-field gel electrophoresis (PFGE) [12,13].Because of the strong linkage disequilibrium (LD) between the KIV-2 repeat and several single nucleotide variations (SNVs) in the LPA gene, SNVs have been used as simple-to-use proxies for the KIV-2 repeat [4,14,15].While this approach is accurate for European-like populations, it is diminishing for other populations, especially East Asian-like populations [15].
The use of long-read sequencing showed promising results to resolve complex genes such as LPA because of their high read length [16].The high costs associated with this technology prevent its widespread application in both clinical applications and large-scale population-based studies [17].Furthermore, even long-read sequencing technologies show non-unique alignments in the complex KIV-2 region [18]; the longer the reads, the lower the fraction of nun-unique alignments.
A recent technological advancement is the development of specialized callers for estimating the number of KIV-2 repeats from short-read whole genome sequencing (WGS) data.Specifically, the DRAGEN LPA Caller allows determining the number of KIV-2 repeats [18].Depending on the genotypes at two SNVs at position 296 and 1264 within the KIV-2 repeat unit, this caller may be able to determine the allele-specific KIV-2 repeat number [18].A more general caller, termed read depth-based copy number estimator (CNE), to determine all multicopy genes was publicly released in late 2023 [19].CNE uses the read depth from short-read WGS to estimate the total copy number for the genes of interest.
A number of genome-wide association studies (GWAS) have identified several SNVs which are associated with Lp(a) levels [12,20].These SNVs are primarily located in the LPA gene, with some of them in the KIV-2 repeat.Because of its complex genetic structure, a specialized caller is needed for the KIV-2 repeat.This caller can determine if a subject carries a variant or not [13].
It was suggested that a pentanucleotide repeat (PNR) in the promotor region upstream of the LPA gene alters Lp(a) levels [4,21,22].Specifically, the PNR alleles with 10 or 11 repeats were associated with a small number of KIV-2 repeats, but counterintuitively expressed very low Lp(a) concentrations [21].However, it has been demonstrated that the association between a high number of KIV-2 repeats and low Lp(a) levels is mediated by an SNV located in the KIV-2 repeat [23].Therefore, adding this SNV to a model predicting Lp(a) concentrations should abolish the association with the PNR.
The aim of this study was to investigate all types of genetic variation in the LPA gene and their association with Lp(a) levels.To this end, we used the 8351 short-read WGS from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD) study for a comprehensive analysis of the LPA gene.Lp(a) protein level measurements were available for 4861 of these subjects.We compared both callers, DRAGEN LPA Caller and CNE, to determine the number of KIV-2 repeats against each other on their agreement in determining the total number of KIV-2 repeats.Furthermore, we compared both PNR callers ExpansionHunter and GangSTR and their agreement in determining the number of PNRs.We developed a random forest model to predict Lp(a) levels from all types of genetic variation.We hypothesized that the model incorporating all genetic variation would show the highest predictive performance because it has more predictive features than a sparse model.Furthermore, because each allele is separately transcribed and translated into an Lp(a) protein, we further hypothesized that models incorporating allele-specific KIV-2 repeat numbers would have higher predictive performance compared to models based solely on the total number of KIV-2 repeats.We also investigated whether the PNR is predictive for Lp(a) measurement if SNV 4925G>A located in the KIV-2 repeat is included in the model.

Cohort
The GENESIS-HD study is a collaborative effort.It was planned to sequence 9000 individuals in total, of which approximately 8000 were to come from the German populationbased Hamburg City Health Study (HCHS); for a description of the HCHS design, see [24].An additional set of approximately 1000 subjects were to be selected from patient-based clinical cohorts with distinct cardiovascular characteristics.The clinical cohorts included subjects with myocardial infarction at a young age among others; for details, see Betschart et al. (2024).In this study, only Lp(a) levels measured in 4861 individuals from the HCHS were used.
The local ethics committee of the Landesärztekammer Hamburg (State of Hamburg Chamber of Medical Practitioners, PV5131) had no objections against the conduct of the study.The Data Protection Commissioner of the University Medical Center of the University Hamburg-Eppendorf and the Data Protection Commissioner of the Free and Hanseatic City of Hamburg approved the study.The study is registered at ClinicalTrial.gov (NCT03934957).

Measurement of Lp(a)
Lp(a) was centrally measured in thawed serum samples previously stored at −80 • degrees Celsius with the molar-based Tina-quant ® Gen 2 Lipoprotein (a) assay on a COBAS INTEGRA ® 400 plus analyzer (Roche Diagnostics, Rotkreuz, Switzerland) at the University Medical Center Hamburg-Eppendorf.Concentrations were reported in nmol/L with a limit of detection (LOD) of 7 nmol/L.The assays inter-coefficient of variation (CV) was 5.86%, the inter-CV was 3.76%.

Sequencing
Core elements of the sequencing protocol for the GENESIS-HD study are as follows: After arrival of the DNA samples from the University Medical Center Eppendorf (UKE) in Hamburg at the University Medical Center Zurich (USZ), DNA concentration was measured with PicoGreen, followed by an automated DNA normalization with Hamilton Robotics.The library was constructed according to Illumina TruSeq DNA PCR Free Library Prep protocol HT (Illumina Inc., San Diego, CA, USA) for whole genome sequencing.Briefly, the protocol steps were: (1) fragmentation of 1 µg genomic DNA to 350 bp inserts by Covaris LE220-plus, (2) cleanup of fragmented DNA, (3) repair ends, (4) removal of large and small DNA fragments, (5) 3 ′ -end adenylation, and (6) adapter ligation.The resulting library was quantified and quality assessed with the iSeq100 (Illumina).Samples were normalized according to the quantification values, and 54 samples were pooled for sequencing on an Illumina NovaSeq 6000 sequencer (Illumina Inc., San Diego, CA, USA).Samples were sequenced twice on S4 flow cells with 300 cycles (2 × 150 reads) with an estimated coverage of 15 × each, following Illumina protocols.The aimed sequencing depth was that ≥95% of all samples had a coverage of ≥30×.

Pre-Processing, Quality Control, and Multi-Sample Calling of WGS Data
Pre-processing and quality control (QC) of WGS data were described elsewhere by Betschart et al. (2024).In brief, QC was continuously performed during the conduct of the study, and approximately 600 samples were monthly processed on the used single NovaSeq 6000 sequencer.During normal operation, data were transferred in batches of approximately 250 subjects, and QC reports were generated for these batches.For preprocessing, we used DRAGEN version 3.8.4 on all samples for mapping and alignment and for single sample variant calling, using the hg38 human reference genome.Preprocessing was done without adapter trimming, and read length was 151 bp.During QC, PC coefficients were estimated using PCs provided by the 1000Genomes project phase 3 data [25].The first two estimated PCs were used to define a genetically similar Europeanlike population; for details, see Betschart et al. (2024).For the multi-sample calling (also termed joint-calling), the iterative gVCF genotyper was used.

Measurement of Number of KIV-2 Repeats
The total number of KIV-2 repeats was determined with two approaches, the CNE [19] and the DRAGEN LPA Caller [18].

Read Depth-Based Copy Number Estimator (CNE)
The CNE software divides the genome into non-overlapping bins with a size of 100 bp to estimate GC content and read depth for every sample.Read depth was estimated with mosdepth version 0.3.6 [26].Because CNE produces relative copy numbers, the output was multiplied by six to account for the number of KIV-2 copies within the hg38.

DRAGEN LPA Caller
The Illumina DRAGEN LPA Caller is available with DRAGEN version 4.2 [18].Specifically, we used the DRAGEN version 4.2.0-673-g9e903543.This caller counts reads which fall within the KIV-2 repeat.A total of 3000 additional regions, each with a length of 2000 bases are used for normalization.The read counts for all regions are normalized by their length and by GC-content.The number of KIV-2 repeats is derived from the normalized KIV-2 coverage multiplied by six to account for the number of KIV-2 copies within the reference genome.
To determine the allele-specific number of KIV-2 repeats, the DRAGEN LPA Caller measures the proportion of reads at two SNVs in linkage disequilibrium (LD) at positions 296 and 1264, which occur in every KIV-2 repeat.The DRAGEN LPA Caller reports the number of KIV-2 repeats for both the reference and the alternative allele.The shorter allele is termed KIV-2 short, and the longer allele is named KIV-2 long.We only used the number BioMed 2024, 4 160 of KIV-2 repeats determined by the DRAGEN LPA Caller in association analysis because of its ability to determine the number of allele-specific KIV-2 repeats.

Analysis of the Pentanucleotide Repeat (PNR)
The number of repeats of the PNR were determined with GangSTR version 2.5.0 [27] and ExpansionHunter version 5.0.0 [28], both with default parameters.The genomic location of the PNR is chr6:160665585-160665629 in hg38.Oketch et al. [29] demonstrated good agreement between these two short tandem repeat callers and showed fewer Mendelian inheritance errors of GangSTR in the analysis of Genome in a Bottle trios [30].The resulting variant calling files (VCFs) were parsed for the REPCN field with bcftools query version 1.18 [31].If a subject was heterozygous for the number of PNRs, the lower number (shorter allele) was assigned to the variable Allele 1, and the higher number was assigned to Allele 2 [23].The polyserial correlation between the number of PNRs and the carrier status of KIV-2 SNV 4925G>A was estimated using the polyserial function from the polycor R package version 0.8-1.

Analysis of the KIV-2 Single Nucleotide Variations (SNVs)
SNVs within the KIV-2 repeat were genotyped with the pipeline termed exome-vntrnf [13].The pipeline takes aligned reads as input (BAM file) and extracts the reads that fall within the LPA genomic region, i.e., chr6:160530484-160665259 in hg38.These reads are then converted to a FASTQ file and realigned to a single KIV-2 repeat.In the last step, the variants are called with mutserve, a specialized caller originally developed for mitochondrial variant calling [32].This caller only provides information on carrier status, which is defined as either carrier or non-carrier.

Statistical Analysis 2.8.1. Descriptive Statistics
Descriptive characteristics of important variables are provided with median and quartiles.For dichotomous variables, absolute and relative frequencies are given.Agreement between the two KIV-2 callers was determined using Pearson correlation, a scatterplot, and a Bland-Altman plot.Agreement between the two PNR callers was determined using Pearson correlation and dotplot.

Genome-Wide Association Study (GWAS)
GWAS for the Lp(a) concentration was performed with REGENIE version 3.2.5.3 [33].A minor allele frequency (MAF) threshold of 0.005 and a Hardy-Weinberg equilibrium (HWE) threshold of 10 −9 were used for filtering diallelic SNVs.Subjects showing a kinship coefficient of 3rd cousins or closer (≥0.044; [34]) were partitioned to create an unrelated subset with the highest number of unrelated subjects using the pcairPartition function from the GENESIS R package version 2.28.0 [35].
For association analyses, Lp(a) measurements were log-transformed.An additive genetic model was used, and adjustments were conducted for age, sex, and the first five principal components (PCs).The genome-wide significance threshold was set to 5 × 10 −8 .Genotypes of all statistically significant SNVs without missing values were extracted with the R package SeqVarTools version 1.40.0 [36].

Predictive Model for Lp(a) Levels Using Random Forests
Because of the high LD between the different genetic markers in the LPA gene, i.e., high multicollinearity, we used random forests by employing the ranger R package [37] to investigate the importance of multiple genetic markers on Lp(a) levels.
In the first step, we estimated two random forests.Random forest 1 (RF1) included all genetic variation plus the allele-specific number of KIV-2 repeats.Random forest 2 (RF2) included all genetic variation plus the total number of KIV-2 repeats.The genetic variation consisted in all SNVs from the GWAS, all SNVs from the specialized SNVs BioMed 2024, 4 161 caller in the KIV-2 repeat, and the number of PNRs estimated by GangSTR.To estimate the variables with a large contribution to the predictive performance, we estimated the conditional predictive impact (CPI) using the CPI R package.We used default values for the CPI function [38].The CPI tests for conditional independence and measures the variable importance.Specifically, the CPI of a variable provides information on how much the predictive performance deteriorates if the variable is replaced by a non-informative variable.CPIs, their Cis, and test statistics were estimated from 10-fold cross-validation (CV).Variables with a CPI p-value < 0.05 were kept.
The direct comparison of the predictive performance between the number of allelespecific KIV-2 repeats and the total number of KIV-2 repeats was conducted with the following model: For all random forests, we tuned the following hyperparameters using 10-fold nested CV: minimal node size (min.node.size),percentage of included variables in each splitting step (mtry.ratio),and tree depth (max.depth).Default values were used for other parameters.Hyperparameter tuning was conducted with the mlr3 R package (version 0.17.2) [39].The hyperparameters of the best performing model, i.e., the one with the lowest root mean square error (RMSE), was then used in the CPI calculations.
Since the random forests for the total number of KIV-2 repeats and the allele-specific number of KIV-2 repeats are estimated from the same CV data, they are directly comparable.Our hypotheses were:

•
Hypothesis 1: The full model shows the highest predictive performance.

•
Hypothesis 5: Full models a and b show similar performance, because proxy SNVs from the LPA gene should compensate for the additional information for the allelespecific number of KIV-2 repeats.
The global significance level was set to 0.05.To adjust for multiple testing, hypotheses were tested hierarchically, starting with Hypothesis 1.For Hypothesis 2, no formal statistical test was performed.For all models, we forced sex and age to be available for splitting in all trees and all splitting steps used the ranger option always.split.variable [37].If the final model does not contain age or sex, the coefficient of determination (R 2 ) obtained from this model can be interpreted as locus-specific heritability.R 2 was estimated for all models.

Software and Hardware
The statistical analysis was performed with R version 4.3.2[40] on our on-premises HPC equipped with 4 computing nodes.Each node is equipped with 2 AMD EPYC 7742 CPUs and a total of 2TB of RAM.The entire R workflow was run with targets version 1.3.2[41].Plots were generated with ggplot2 version 3.4.4[42].The code for the analyses is provided as a supplement.

Study Characteristics
Characteristics of GENESIS-HD subjects with available Lp(a) measurements are displayed in Table 1. Figure 2 shows the Lp(a) concentration per binned total number of KIV-2 repeats of GENESIS-HD subjects with available Lp(a) measurements.Lp(a) concentration per binned number of KIV-2 repeats on the short and long allele are provided in Figures S1 and S2, respectively.The statistical analysis was performed with R version 4.3.2[40] on our on-premises HPC equipped with 4 computing nodes.Each node is equipped with 2 AMD EPYC 7742 CPUs and a total of 2TB of RAM.The entire R workflow was run with targets version 1.3.2[41].Plots were generated with ggplot2 version 3.4.4[42].The code for the analyses is provided as a supplement.

Study Characteristics
Characteristics of GENESIS-HD subjects with available Lp(a) measurements are displayed in Table 1. Figure 2 shows the Lp(a) concentration per binned total number of KIV-2 repeats of GENESIS-HD subjects with available Lp(a) measurements.Lp(a) concentration per binned number of KIV-2 repeats on the short and long allele are provided in

Agreement between Specialized Variant Callers
Figure 3 displays the scatterplot for the total number of KIV-2 repeats estimated by the DRAGEN LPA Caller and the CNE.The correlation was 0.9966 (CI 0.9965-0.9968).Figure S3 shows the corresponding Bland-Altman plot.Both GangSTR and ExpansionHunter successfully estimated the genotypes of the PNR in the promoter region of the LPA gene on all 8351 subjects.The Pearson correlation between the callers was 0.9938 (CI 0.9936-0.9941)for allele 1 and slightly higher for allele 2 (r = 0.9961, CI 0.9960-0.9963).Figure S4 shows the dotplot for the number of subjects per genotype.In agreement with [23], we observed a high LD between the number of PNR and 4925G>A with a polyserial correlation of 0.991.

Agreement of Allele-Specific Number of KIV-2 Repeats with Results from 1000 Genomes Project
We compared the number of KIV-2 repeats of the GENESIS-HD dataset with the 1000 Genomes dataset (1KGP) analyzed by Behera et al. [18].In 3925 out of the 8351 subjects (47.0%), the DRAGEN LPA Caller determined the allele-specific number of KIV-2 repeats.Excellent agreement between the GENESIS-HD dataset and the European ancestral population (633 samples) was observed (Figure 4).Both GangSTR and ExpansionHunter successfully estimated the genotypes of the PNR in the promoter region of the LPA gene on all 8351 subjects.The Pearson correlation between the callers was 0.9938 (CI 0.9936-0.9941)for allele 1 and slightly higher for allele 2 (r = 0.9961, CI 0.9960-0.9963).Figure S4 shows the dotplot for the number of subjects per genotype.In agreement with [23], we observed a high LD between the number of PNR and 4925G>A with a polyserial correlation of 0.991.

Agreement of Allele-Specific Number of KIV-2 Repeats with Results from 1000 Genomes Project
We compared the number of KIV-2 repeats of the GENESIS-HD dataset with the 1000 Genomes dataset (1KGP) analyzed by Behera et al. [18].In 3925 out of the 8351 subjects (47.0%), the DRAGEN LPA Caller determined the allele-specific number of KIV-2 repeats.Excellent agreement between the GENESIS-HD dataset and the European ancestral population (633 samples) was observed (Figure 4).[25] (n = 633).For the total number of KIV-2 repeats, 8351 subjects were included.For the number of KIV-2 repeats on the short and long allele, 3925 subjects were included.

Genome-Wide Association Study for Lp(a) Concentration
Filtering and quality control (QC) left us with a total of 10.3 million (10,334,912) diallelic SNVs in 4803 unrelated subjects.The only genome-wide significant region for Lp(a) concentration was the LPA gene (Figure S5).A total of 855 SNVs were found to be statistically significant in the LPA gene (hg38: chr6:160531482-160664275) with a padding of 500 kb (Table S1).The lead SNV was rs56393506 (MAF: 0.16; per-allele change in Lp(a) con-Figure 4. Violin and boxplots of number of KIV-2 repeats for GENESIS-HD samples and EUR-like samples (EUR) from 1000 Genomes project phase 3 (1KGP) [25] (n = 633).For the total number of KIV-2 repeats, 8351 subjects were included.For the number of KIV-2 repeats on the short and long allele, 3925 subjects were included.

Genetic Variants Selected by Conditional Predictive Impact (CPI)
Four and six variables were selected by conditional predictive impact for the allelespecific KIV-2 random forest model RF1 and the total KIV-2 random forest model RF2, respectively (Table 2).In both models, the number of KIV-2 repeats had the highest conditional predictive impact.In the allele-specific KIV-2 model, the shorter allele had a stronger effect on Lp(a) levels than the longer allele.Two SNVs, 4925G>A and rs41272114, had a significant conditional predictive impact in both models, and the contingency table is provided in Table 3.  Non-carriers of both SNVs (4925GG and rs41272114CC; n = 3488) had a median Lp(a) concentration of 20.3 nmol/L (7.3, 83.0).Subjects carrying at least one A allele at SNV 4925 and genotype CC at rs41272114 (n = 1088) showed a lower Lp(a) concentration of 15.0 nmol/L (9.7-29.3).The lowest median Lp(a) concentration of 4.8 nmol/L (2.9-15.8)was observed in subjects carrying variant alleles at rs41272114, but not carrying an A allele at SNV 4925 (n = 247).Individuals carrying variations at both SNVs (n = 36) had an Lp(a) concentration of 8.4 nmol/L (6.3-12.7).It should be noted that the number of double carriers is low.The Lp(a) concentration of the genotype combination and short and total number of KIV-2 repeats is displayed in Figure 5.
nmol/L (9.7-29.3).The lowest median Lp(a) concentration of 4.8 nmol/L (2.9-15.8)was observed in subjects carrying variant alleles at rs41272114, but not carrying an A allele at SNV 4925 (n = 247).Individuals carrying variations at both SNVs (n = 36) had an Lp(a) concentration of 8.4 nmol/L (6.3-12.7).It should be noted that the number of double carriers is low.The Lp(a) concentration of the genotype combination and short and total number of KIV-2 repeats is displayed in Figure 5.Only three subjects were homozygous TT at rs41272114; these three subjects did not harbor an A allele at SNV 4925.These subjects had Lp(a) concentrations of 2.2, 2.3, and 3.0 nmol/L.
The three SNVs from model RF2 rs9347465, rs2489959, and rs2457567 were neither found in the GWAS catalogue nor by a Medline search.As expected, the PNR did not have a significantly conditional predictive impact.

Comparison of Predictive Performances for Lp(a) Concentrations
Mean and standard deviation (SD) of the difference in R 2 between allele-specific number and total number of KIV-2 repeats are displayed in Table 4. Specifically, we estimated Only three subjects were homozygous TT at rs41272114; these three subjects did not harbor an A allele at SNV 4925.These subjects had Lp(a) concentrations of 2.2, 2.3, and 3.0 nmol/L.
The three SNVs from model RF2 rs9347465, rs2489959, and rs2457567 were neither found in the GWAS catalogue nor by a Medline search.As expected, the PNR did not have a significantly conditional predictive impact.

Comparison of Predictive Performances for Lp(a) Concentrations
Mean and standard deviation (SD) of the difference in R 2 between allele-specific number and total number of KIV-2 repeats are displayed in Table 4. Specifically, we estimated the difference in R 2 between the allele-specific number and total number of KIV-2 repeats.The full model showed the highest predictive performance for both the total number of KIV-2 repeats (R 2 = 0.6855, CI 0.6109-0.7601)and the allele-specific number of KIV-2 repeats for the total number of KIV-2 repeats (R 2 = 0.6953, CI 0.6414-0.7492).The performance of the full model was significantly better than that of RF1 and RF2 (both p < 0.001; Table S2).This confirms Hypothesis 1.For model RF1, four variables were selected, compared to six variables for model RF2.RF1 was thus sparser than RF2, which confirms Hypothesis 2. Model RF1 b showed a significantly better predictive performance than model RF1 a (p = 0.0126), which confirms Hypothesis 3. Similarly, model RF2 b outperformed model RF2 a (p = 0.0496), which confirms Hypothesis 4. Finally, no statistically significant difference was observed between the total and allele-specific KIV-2 repeat models in the full model (p = 0.4620), which is in agreement with Hypothesis 5.

Discussion
In this study, we performed a comprehensive analysis of the genetic variation within the LPA gene by using data from whole genome short-read sequencing.Key to these analyses was the availability of specialized callers to estimate the number of KIV-2 repeats.There was excellent agreement between the DRAGEN LPA Caller and the CNE, which were both publicly released in 2023.The accuracy of the DRAGEN LPA Caller was previously confirmed by long-read sequencing [18], and the present analysis indirectly confirms the validity of the CNE caller for the LPA gene.
A weakness of both callers is the estimation of the number of allele-specific KIV-2 repeats.This information is important because the number of allele-specific KIV-2 repeats leads to better predictive performance for Lp(a) concentration.Of note, the CNE is able to estimate the total number of KIV-2 repeats only.In contrast, the DRAGEN LPA Caller is able to estimate the number of allele-specific KIV-2 repeats, depending on the genotypes of two SNVs within the KIV-2 region.In this study, the DRAGEN LPA Caller provided the number of allele-specific KIV-2 repeats in 47% of all subjects, which were of European ancestry.This percentage varies between subjects of different ancestry groups.For example, in AMR-like subjects, the percentage was the lowest at 38%, and highest in AFR-like individuals at 52% [18].
There was a high agreement between the two callers, ExpansionHunter and GangSTR, for estimating the number of PNRs.However, the genetic information of the PNR is not important for predicting Lp(a) concentrations because almost all the genetic information of the multiallelic PNR is captured by the dichotomous SNV 4925G>A, located within the KIV-2 repeat (correlation = 0.997).This is in agreement with previously published results [23].
The results of our GWAS confirmed previous reports that variants associated with Lp(a) concentrations are primarily located in the LPA gene region [12].We identified 855 significant SNVs surrounding the LPA gene.The lead SNV rs56393506 has been previously reported as being associated with increased Lp(a) concentration [43].
Predicting Lp(a) concentrations based on genetic variation is challenging because of high LD in the LPA gene.This high multicollinearity between genetic variation invalidates classical regression models, such as linear regression or quantile regression.In contrast, some of the machine learning methods, such as random forests, are able to adequately deal with a high LD.The random forest models revealed that only a small number of genetic variations had a significant impact on the predictions of the Lp(a) concentrations.This is not surprising, because the LPA gene shows large LD blocks [44].The full model incorporating all genetic variation from the LPA gene had a substantially higher predictive performance of Lp(a) concentrations when compared to models that covered the most important genetic markers (RF1 and RF2).Indeed, the locus-specific heritability (R) of the full model was 0.83 (CI 0.82-0.84),while it was only 0.68 (CI 0.65-0.70)and 0.71 (CI 0.69-0.72)for RF1 and RF2, respectively.The locus-specific heritability of the full model was thus in the same magnitude as that of other studies [14,15].
In line with the literature [14,15], the number of KIV-2 repeats had the highest conditional predictive impact, with the number of allele-specific KIV-2 repeats being more predictive than the total number of KIV-2 repeats.The two most important SNVs for predicting Lp(a) concentrations were 4925G>A and rs41272114, which confirms other work [14].SNV 4925G>A is located on a splice site within the KIV-2 repeat and leads to a decrease in Lp(a) concentration [13,15,45].rs41272114 is also located on a splice site, and T alleles at this SNV result in null alleles [12][13][14].Such null alleles lead to extremely low Lp(a) concentrations, which we also observed in this study.Three subjects were homozygous TT at rs41272114.The highest Lp(a) concentration in these individuals was 3.0 nmol/L, which is in the bottom five percent of the Lp(a) concentration levels among all subjects.
A limitation of our study is that analyses were based on a specific assay to measure the Lp(a) concentration.Specifically, we used the Roche assay, which reports the molar Lp(a) concentration (nmol/L).Various studies have reported differences between Lp(a) assays [46,47].Furthermore, we restricted our study to EUR-like subjects only.Future research should investigate the predictive performance of all genetic variation within the LPA gene in different ancestry groups.Furthermore, determining not only the carrier status of SNVs within the KIV-2 repeats but also the genotype could lead to a better understanding of the complex LPA gene.

Conclusions
Technological advancements allow the accurate determination of the genetic variation within the complex LPA gene from short-read WGS data.The number of allele-specific KIV-2 repeats was more important for predicting Lp(a) protein levels than the total number of KIV-2 repeats.However, with current callers, the allele-specific number of KIV-2 repeats can only be determined in approximately 50% of European-like samples.The LPA genespecific heritability of Lp(a) protein levels was approximately 70% in this population-based study.Due to high linkage disequilibrium between the PNR and the SNV 4925G>A, the PNR was not important for predicting Lp(a) protein levels.In contrast, SNVs 4925G>A and rs41272114C>T were of high importance in addition to the number of KIV-2 repeats.

Supplementary Materials:
The following supporting information can be downloaded at https: //www.mdpi.com/article/10.3390/biomed4020013/s1, Figure S1: Scatterplot for LP(a) and number of KIV-2 repeats on the short allele (n = 4861).Number of KIV-2 repeats were binned so that there are at least 30 samples in each bin.Horizontal jittering was added to prevent overlapping between points; Figure S2: Scatterplot for LP(a) and number of KIV-2 repeats on the long allele (n = 4861).Number of KIV-2 repeats were binned so that there are at least 30 samples in each bin.Horizontal jittering was added to prevent overlapping between points; Figure S3: Bland-Altman plot with mean and 95% confidence interval for total number of KIV-2 repeats estimated by the DRAGEN LPA Caller and read depth based CN estimator (CNE) (n = 8351); Figure S4: Dotplot per number of subjects per genotype estimated by ExpansionHunter and GangSTR; Figure S5: Manhattan plot from genome-wide association analysis (GWAS) for LP(a) concentration; Figure S6: Locus-zoom plot for the LPA gene with 500kb padding; Table S1: Statistically significant SNVs in the LPA gene; Table S2: Performance statistics of the full random forest model and random forest models RF1 and RF2; R zip code S1: R code for analyses conducted in this work; instructions are provided README.md.Institutional Review Board Statement: This study did not require consultation by an ethics committee.The ethics committee of the Landesärztekammer Hamburg (State of Hamburg Chamber of Medical Practitioners, PV5131 did not have objections against the conduct of the Hamburg City Health Study, see reference [24].
Informed Consent Statement: Written informed consent was obtained from all study participants.

Data Availability Statement:
The data used in this study may not be shared due to privacy issues.The targets file used for the analysis is available in the Supplementary Materials.

Figure 1 .
Figure 1.LPA gene structure, genetic variation, and different callers for short-read whole genome sequencing data.Figure not drawn to scale.

Figure 1 .
Figure 1.LPA gene structure, genetic variation, and different callers for short-read whole genome sequencing data.Figure not drawn to scale.

Figure 2 .
Figure 2. Scatterplot for Lp(a) and total number of KIV-2 repeats (n = 4861).KIV-2 CNs were binned so that there were at least 30 samples in each bin.Horizontal jittering was added to prevent overlapping between points.

Figure 2 .
Figure 2. Scatterplot for Lp(a) and total number of KIV-2 repeats (n = 4861).KIV-2 CNs were binned so that there were at least 30 samples in each bin.Horizontal jittering was added to prevent overlapping between points.

Figure 5 .
Figure 5. Lp(a) concentration by number of KIV-2 repeats for all single nucleotide variation combinations of 4925G>A and rs41272114 carriers of the T allele.(Upper panels): number of KIV-2 repeats of the short allele.(Lower panels): total number of KIV-2 repeats.

Figure 5 .
Figure 5. Lp(a) concentration by number of KIV-2 repeats for all single nucleotide variation combinations of 4925G>A and rs41272114 carriers of the T allele.(Upper panels): number of KIV-2 repeats of the short allele.(Lower panels): total number of KIV-2 repeats.

Table 1 .
Characteristics of GENESIS-HD subjects with available Lp(a) measurements.Continuous characteristics are provided as median and interquartile range (IQR), dichotomous variables as number of subjects and percentage.

Table 1 .
Characteristics of GENESIS-HD subjects with available Lp(a) measurements.Continuous characteristics are provided as median and interquartile range (IQR), dichotomous variables as number of subjects and percentage.

Table 2 .
Variables selected by conditional predictive impact (CPI) for both random forest models RF1 and RF2.CPIs, their 95% confidence intervals (95% CI) and corresponding p-values are displayed.CPIs and 95% CIs were multiplied by 10 3 .

Table 3 .
Contingency table for SNV 4925G>A carrier status and rs41272114.

Table 4 .
Coefficient of determination (R 2 ) and 95% confidence interval (CI) of different random forest (RF) models for Lp(a) concentrations.