Genome-Wide Association Study of Age at First Calving in U.S. Holstein Cows

A genome-wide association study (GWAS) of age at first calving (AFC) using 813,114 first lactation Holstein cows and 75,524 SNPs identified 2063 additive effects and 29 dominance effects with p-values < 10−8. Three chromosomes had highly significant additive effects in the regions of 7.86–8.12 Mb of Chr15, 27.07–27.48 Mb and 31.25–32.11 Mb of Chr19, and 26.92–32.60 Mb of Chr23. Two of the genes in those regions were reproductive hormone genes with known biological functions that should be relevant to AFC, the sex hormone binding globulin (SHBG) gene, and the progesterone receptor (PGR) gene. The most significant dominance effects were near or in EIF4B and AAAS of Chr05 and AFF1 and KLHL8 of Chr06. All dominance effects were positive overdominance effects where the heterozygous genotype had an advantage, and the homozygous recessive genotype of each SNP had a very negative dominance value. Results from this study provided new evidence and understanding about the genetic variants and genome regions affecting AFC in U.S. Holstein cows.


Introduction
Age at first calving (AFC) is measured in negative days, such that a larger AFC value represents a younger first-calving age and a smaller AFC value represents an older firstcalving age. This is a new reproduction trait for U.S. Holstein genomic evaluation [1,2]. A large sample of Holstein cows with AFC phenotypic values and genotypic data of single nucleotide polymorphism (SNP) markers has become available, providing an opportunity to identify genetic variants and genome regions that affect AFC and involve puberty and successful pregnancy with high statistical confidence. Prior to the inclusion of AFC for genomic evaluation, the U.S. Holstein genomic evaluation included three reproductive traits, daughter pregnancy rate (DPR), which is the percentage of cows that become pregnant during each 21-d period, and cow conception rate (CCR) and heifer conception rate (HCR), each as percentage pregnancy at each service [3]. These reproductive traits should involve different and possibly some overlapping physiological processes affecting reproduction, and a genome-wide association study (GWAS) using large samples is a powerful approach to identify and understand the genetic factors underlying these reproductive traits. We previously reported results from a large-scale GWAS for DPR, CCR, and HCR [4], but a similar large-scale GWAS was unavailable for the AFC of U.S. Holstein cows. The sample size for AFC is now much larger than those for the large-scale GWAS for DPR, CCR, and HCR. The purpose of this study was to identify genetic variants and chromosome regions affecting AFC in U.S. Holstein cows using a large sample from the U.S. Holstein genomic evaluation data.

Results and Discussion
The results presented below focus on the three chromosome regions with the most significant additive effects and the two chromosome regions with the most significant dominance effects. The top 100 additive effects are provided in Table S2, and all significant dominance effects are provided in Table S3.

Additive Effects
The GWAS analysis identified 2063 additive effects with log 10 (1/p) > 8. The observed log 10 Table 1). Two of the genes in those regions were sex hormone genes with known biological functions that should be relevant to AFC, the sex hormone binding globulin (SHBG) gene, and the progesterone receptor (PGR) gene. The known biological functions and the statistical significance of the SNPs in or near these two genes implicated the involvement of the two reproductive hormone genes in the AFC of Holstein cows.   The 7.86-8.12 Mb region of Chr15 had three genes, TRPC6, PGR, and ARHGAP42. Of these three genes, PGR, as the progesterone receptor gene, has known relevant biological functions affecting AFC. This gene encodes a member of the steroid receptor superfamily, and the encoded protein mediates the physiological effects of progesterone, which plays a central role in reproductive events associated with the establishment and maintenance of pregnancy [5]. PGR had four SNPs, and one of these SNPs (rs136764006) had log 10 (1/p) = 27.63 (Table 1). TRPC6 had four SNPs, and none of those four SNPs reached the statistical significance of log 10 (1/p) > 8, but an SNP about 3151 bp downstream of TRPC6 or 349,081 bp upstream of PGR was highly significant with log 10 (1/p) = 29.38 ( Figure 1b, Table 1). ARHGAP42 had 14 SNPs, and four of the 14 SNPs about 200 Kb downstream of PGR were significant with log 10 (1/p) values of 21.74-21.89 ( Figure 1b). The TRPC6 and ARHGAP42 genes did not have known biological functions directly affecting AFC.
The 27.07-27.48 Mb region of Chr19 had three genes, MPDU1, SHBG, and DNAH2. In this region, the most significant SNP was rs111004845, which was 9664 bp upstream of SHBG, noting that SHBG did not have any SNP in our dataset. SHBG is the sex hormonebinding globulin gene and the only gene in this region known to have a biological function related to reproduction. This gene encodes a steroid-binding protein, and the encoded protein transports androgens and estrogens in the blood, binding each steroid molecule as a dimer formed from identical or nearly identical monomers [6]; the sex hormone binding globulin was likely associated with early puberty [7,8]. The known biological function of SHBG affecting reproduction and the highly significant SNP in the proximity of SHBG should implicate SHBG as an interesting candidate gene affecting AFC. MPDU1 had one SNP, and DNAH2 had four SNPs in our dataset.
The 31.25-32.11 Mb regions of Chr19 had the most significant additive effect in ARHGAP44 with log 10 (1/p) = 37.30 (Table 1), but ARHGAP44 was not known to affect reproduction. The HS3ST3A1 gene is widely expressed, with the most abundant expression in the liver and placenta [9], and the gene expression in the placenta could affect AFC. This gene had an additive effect with log 10 (1/p) = 27.34 (Table S1). In this region, TTC19, NCOR1, and COX10 had highly significant additive effects.
Among the top 20 SNPs, the sizes of positive allelic effects were in the range of 0.354-0.818, the sizes of the negative allelic effects were −0.363 to −0.270, and the absolute values of the additive effects (α) were in the range of 0.98-1.04 days (Table 1). Such effect sizes were considerably smaller than some of the dominance effects described below.

Dominance Effects
The GWAS analysis identified 29 dominance effects with log 10 (1/p) > 8. The observed log 10 Table 2), noting that EIF4B did not have any SNP in our dataset, and this dominance effect with log 10 (1/p) = 45.08 was the most significant effect among all additive and dominance effects. The second-most significant dominance effect was that in AAAS on Chr05 (Table 2), and this effect also was the second-most significant effect among all additive and dominance effects. We previously showed the SNPs in this 26.38-26.96 Mb region of Chr05 had significant dominance effects for milk, fat, and protein yields [4].
The 101.86-102.17 Mb region of Chr06 had three highly significant dominance effects (log 10 (1/p) > 30) in AFF1 and KLHL8 ( Figure 2c, Table 2). AFF1 had eight SNPs in our dataset, and two of these SNPs had log 10 (1/p) > 30. One of the two significant SNPs (rs43480825) in AFF1 present in a previous Holstein GWAS was the most significant dominance effect for heifer conception rate (HCR) and the second-most significant dominance effect for daughter pregnancy rate (DPR) and cow conception rate (CCR) [4]. KLHL8 had seven SNPs, and one of these SNPs had log 10 (1/p) > 30. The KLHL8 gene was proposed as a candidate gene for nonreturn rate in Holstein heifers [14]. The dominance effects in AFF1 and KLHL8 were positive overdominance effects and had the same pattern as the positive overdominance effects near EIF4B and in AAAS of Chr05. These positive overdominance effects of AFC had five features. First, each SNP had a 'recessive allele' that, in homozygous status, had a very negative dominance value. Second, each SNP had a 'dominant allele' that, in heterozygous status, neutralized the negative effect of the recessive allele. Third, the dominant allele in homozygous status behaved   (12)). 'DR' is the heterozygous genotype with one dominant allele and one recessive allele. 'd_DR' is the dominance value of the heterozygous genotype with one dominant allele (D) and one recessive allele (R) (Equation (13)). 'DD' is the homozygous genotype with two dominant alleles. 'd_DD' is the dominance value of the homozygous genotype with two dominant alleles (DD) (Equation (13)). 'RR' is the homozygous genotype with two recessive alleles. 'd_RR' is the dominance value of the homozygous genotype with two recessive alleles (RR) (Equation (13)). 'f_DR' is the frequency of the heterozygous genotype. 'f_DD' is the frequency of the homozygous genotype of the dominant allele. 'f_RR' is the frequency of the homozygous genotype of the recessive allele. 'f_R' is the frequency of the recessive allele.
These positive overdominance effects of AFC had five features. First, each SNP had a 'recessive allele' that, in homozygous status, had a very negative dominance value. Second, each SNP had a 'dominant allele' that, in heterozygous status, neutralized the negative effect of the recessive allele. Third, the dominant allele in homozygous status behaved like a neutral allele with a small absolute dominance value or dominance deviation, noting that each dominance value was a deviation of the genotypic value from the mean and additive value. Fourth, the dominance value of the heterozygous genotype was more positive than that of either homozygous genotype, but the heterozygous dominance value was not much above zero. Fifth, the recessive allele had an allele frequency of mostly <0.10, so the homozygous recessive genotype was rare and had a genotypic frequency of mostly <0.01 (Table 2). For the example of rs109438971, which had the most significant dominance effect, the dominance value of the homozygous genotype of the recessive allele (allele 1) was −9.76, compared to the slightly negative dominance value −0.06 of the homozygous genotype of the dominant allele (allele 2) and the positive dominance value 0.62 of the heterozygous genotype with alleles 1 and 2. This positive value was less than 1/15 of the negative recessive homozygous genotypic value, but the heterozygous genotypic frequency was about 30 times that of the homozygous recessive genotype (0.152 vs. 0.005). Consequently, at the population level, the heterozygous advantage in the form of a positive overdominance effect more than offset the very negative effect of the homozygous recessive genotype. The contribution to the population mean of the rs109438971 dominance values was (d_DR)(f_DR) = (0.622)(0.152) = 0.095 for the heterozygous genotypes and (d_RR)(f_RR) = (−9.76)(0.005) = −0.049 for the homozygous recessive genotypes, based on the dominance values and genotypic frequencies in Table 2. Therefore, the positive contribution of the heterozygous genotypes to the population mean of the rs109438971 dominance values was about twice the negative contribution of the homozygous recessive genotypes. This heterozygous advantage in the form of a positive overdominance effect likely was the reason the very negative recessive allele still had a substantial allele frequency of 0.081 (Table 1) and was not eliminated over the years.
Compared to the additive effects in Table 1, the recessive genotypes were considerably more detrimental than negative additive effects. For the top dominance effects, the dominance values of the recessive genotypes were in the range of −9.76 to −5.18 (Table 2), whereas the allelic effects of the negative alleles of additive effects were in the range of −0.363 to −0.270 for the top 20 additive effects ( Table 1). Given that the negative dominance values of the recessive genotypes were more than ten times as large as the negative allelic effects, the first step of the application of the GWAS results would be the use of the recessive SNP genotypes for heifer culling.

Elimination of Rare Negative Recessive Genotypes for Heifer Culling
The results of the dominance effects of AFC identified seven SNPs with very negative dominance values for the recessive homozygous genotypes ( Table 2). We recommend using the recessive SNP genotypes of these seven SNPs for culling heifers that carry such genotypes. Detailed results supporting this recommendation are provided in Table S4. Among the 813,114 cows in this study, 3541-5274 cows carried the negative recessive genotypes for at least one of the seven SNPs for heifer culling (Table S4). For dominance values that removed additive values, the heterozygous genotypes had the highest dominance values ( Table 2). To evaluate the impact of culling heifers with the recessive genotypes, we defined the negative impact of a recessive genotype as the difference between the average of the phenotypic values of cows carrying the recessive genotype and the average of the phenotypic values of cows carrying the heterozygous genotype and the homozygous dominant genotype of each SNP. The results of negative impact showed that cows with the recessive genotypes required 7.69-12.83 days longer than the heterozygous genotypes and homozygous dominant genotypes for first calving and had sharply lower yields, 201.23-646.33 kg lower for milk yield, 9.05-26.03 kg lower for fat yield, and 6.74-19.27 kg lower for protein yield (Table 3). Therefore, evidence from this study showed that the recessive genotypes had severely negative effects on AFC and the yield traits and that heifers with the recessive genotypes should be culled. We are not ready to recommend the elimination of bulls carrying the recessive alleles because such a recommendation requires a separate study.

Comparison with Previous Studies
Several GWASs on AFC were available prior to our study, but results from the previous studies, including a study in beef cattle [15] and a study in Chinese Holsteins [16], did not overlap the results from our study. The beef study using 185,356 Nellore heifers identified significant SNPs on chromosomes 2 and 14, and none of those significant SNPs was highly significant in our Holstein study. Results from the Chinese Holsteins using 19,111 heifers also lacked overlap with the results of our study. Although the exact reasons for the differences among those studies were unknown, results from our study add new understanding about the genetic variants and chromosome regions underlying AFC from a large sample of U.S. Holstein cows. In comparison with our previous GWAS results for three other reproductive traits (DPR, CCR, and HCR) in U.S. Holstein cows [4], AFC did not share significant additive effects and only shared a significant dominance effect of rs43480825 in AFF1 with DPR, CCR, and HCR. This limited sharing of common significant effects indicated that AFC mostly involved different genetic mechanisms from those for DPR, CCR, and HCR.

Gene Ontology of Candidate Genes
To understand the potential biological functions of the candidate genes, we searched Gene Ontology Resources [17], KEGG [18] and DAVID [19] for the biological processes involved by the 14 candidate genes for additive effects in Table 1 and nine candidate genes for dominance effects in Table 2. However, Gene Ontology Resources had more details than available from KEGG and DAVID. Therefore, we only included the biological processes involved by the candidate genes from Gene Ontology Resources in Table S5 for candidate genes of additive effects with 560 entries and in Table S6 for candidate genes of dominance effects with 486 entries. Other than SHBG, for which no descriptions of its biological functions were available other than the hormone binding process indicated by the gene name, every candidate gene was involved in multiple biological processes. Although any of those processes could have affected AFC, the exact genetic mechanisms of the significant SNP effects remained unknown. Among all the biological processes, only PGR and AAAS were involved in known reproductive processes. The PGR gene was already known for its role in the pregnancy process, which should be highly relevant to AFC, and was one of the multiple reproductive processes described for PGR in Table S5. The AAAS gene was involved in fertilization and the reproductive process (Table S6), which should also be highly relevant to AFC.

Holstein Population and SNP Data
The Holstein population in this study had 813,114 first lactation cows with AFC phenotypic observations and 78,964 original and imputed SNPs. With the requirement of 0.05 minor allele frequency, 75,524 SNPs were used in the GWAS analysis. The SNP positions were those from the ARS-UCD1.2 cattle genome assembly. Genes containing or in the proximity of highly significant additive and dominance effects were identified as candidate genes affecting AFC. The AFC phenotypic values are reported in negative days, such that higher AFC values represent younger first-calving ages and are considered more desirable than lower AFC values, representing older first-calving ages [1,2]. The AFC phenotypic values used in the GWAS analysis were the phenotypic residuals after removing fixed nongenetic effects available from the December 2021 U.S. Holstein genomic evaluation data. The 813,114 phenotypic residuals values had an approximate bell-shaped distribution ( Figure S1; Supplementary Materials), and the basic statistics of these phenotypic values are described in Table S1.

GWAS Analysis
The GWAS analysis used an approximate generalized least-squares (AGLS) method. The AGLS method combines the least-squares (LS) tests implemented by EPISNP1mpi [20,21] with the estimated breeding values from a routine genetic evaluation using the entire U.S. Holstein population. The statistical model was: where y is the column vector of phenotypic deviation after removing fixed nongenetic effects, such as heard-year-season (termed as 'yield deviation' for any trait) using a standard procedure for the CDCB/USDA genetic and genomic evaluation; µ is the common mean; I is the identity matrix; g is the column vector of SNP genotypic values; X g is the model matrix of g; b = (µ, g ) , X = (I, X g ); a is the column vector of additive polygenic values; Z is the model matrix of a; and e is the column vector of random residuals. The first and second moments of Equation (1) are E(y) = Xb and var(y) = V = ZGZ + R = σ 2 a ZAZ + σ 2 e I, respectively, where σ 2 a = additive variance, A = additive relationship matrix, and σ 2 e = residual variance. The problem of estimating the b vector that includes SNP genotypic values in Equation (1) is the requirement of inverting V if the generalized least-squares (GLS) method is used or inverting the A matrix if the mixed model equations (MME) [22] are used. However, both V and A could not be inverted for our sample size. To avoid inverting these large matrices, the GWAS used the method of approximate GLS (AGLS), which replaces the polygenic additive values (a) with the best linear unbiased prediction based on pedigree relationships [4]. The AGLS method is based on the following results: where y * = y − Zâ andâ is the best linear unbiased prediction (BLUP) of a. Equation (2) is the GLS solution, and Equation (3) is the MME solution of b. These two equations yield identical results, andb from either equation is termed the best linear unbiased estimator (BLUE) [22]. Ifâ is known, the LS version of BLUE given by Equation (3) is computationally efficient relative to the GLS of Equation (2), requiring the V inverse, or the joint MME solutions ofb andâ, requiring the A inverse. The AGLS method uses two approximations. The first approximation is to useã from routine genetic evaluation as an approximation of a in Equation (3) where y * = y − Zã, andã is the column vector of 2(PTA) with PTA being the predicted transmission ability from the routine genetic evaluation. Equation (4) achieves the benefit of sample stratification correction from mixed models using pedigree relationships without the computing difficulty of inverting V or A. The second approximation of the AGLS approach is the t-test using the LS rather than the GLS formula of the t-statistic to avoid using the V inverse in the GLS formula. The significance tests for additive and dominance SNP effects used the t-tests of the additive and dominance contrasts of the estimated SNP genotypic values [20,23]. The t-statistic of the AGLS was calculated as: where L j is the additive or dominance contrast; var(L j is the standard deviation of the additive or dominance contrast; s a represents the additive contrast coefficients (P 11 /p 1 , 0.5P 12 (p 2 − p 1 )/(p 1 p 2 ), −P 22 /p 2 ); s d represents the dominance contrast coefficients (−0.5, 1, −0.5); v 2 = (y − Xb) (y − Xb)/(n − k) is the estimated residual variance;ĝ is the column vector of the AGLS estimates of the three SNP genotypic effects of g 11 , g 12 , and g 22 from Equation (4); (X X) − gg is the submatrix of (X X) − corresponding toĝ; p 1 is the frequency of A 1 allele; p 2 is the frequency of A 2 allele of the SNP; P 11 is the frequency of A 1 A 1 genotype; P 12 is the frequency of A 1 A 2 genotype; P 22 is the frequency of A 2 A 2 genotype, n is the number of observations, and k is the rank of X. The formula of s a defined above allows the Hardy-Weinberg disequilibrium [23] and simplifies to (p 1 , p 2 − p 1 , −p 2 ) under the Hardy-Weinberg equilibrium. Additive effects of each SNP were estimated using three measures, the average effect of gene substitution, allelic mean, and allelic effect of each allele based on quantitative genetics definitions [23,24]. The allelic mean (µ i ), the population mean of all genotypic values of the SNP (µ), the allelic effect (a i ), and the average effect of gene substitution of the SNP (α) are: µ 1 = P 11.1 g 11 +0.5P 12.1 g 12 (6) µ 2 = 0.5P 12.2 g 12 +P 22.2 g 22 (7) where P 11.1 = P 11 /p 1 , P 12.1 = P 12 /p 1 , P 12.2 = P 12 /p 2 , and P 22.2 = P 22 /p 2 . The additive effect measured by the average effect of gene substitution of Equation (10) is the difference between the two allelic means or effects of the same SNP, and it is the fundamental measure for detecting SNP additive effects, as shown by the t-statistic of Equation (5). The allelic effect defined by Equation (9) provide an understanding of the effect size and direction of each allele. However, the allelic effect of Equation (9) is not comparable across SNPs because the allelic effect is affected by the genotypic mean of the SNP defined by Equation (8). To compare allelic effects across SNPs, we replace the SNP genotypic mean (µ) in Equation (9) with the average of all SNP genotypic means (µ all ): The dominance effect of each SNP was estimated as the dominance contrastĝ from Equation (4), i.e., δ = L d = d 12 − (d 11 +d 22 )/2 = g 12 − (g 11 + g 22 )/2 (12) where g ij represents the AGLS estimates of SNP genotypic values from Equation (4) (i, j = 1, 2) and d ij is the dominance value (dominance deviation) of the A i A j SNP genotype In this study, overdominance refers to the fact that the dominance value of the heterozygous genotype is more extreme than that of either homozygous genotype, i.e., d 12 > d 11 and d 12 > d 22 for positive overdominance effects, or d 12 < d 11 and d 12 < d 22 for negative overdominance effects. The dominance effects to be reported were all positive overdominance effects. For 75,524 SNPs with additive and dominance effects, the threshold p-value for declaring significant t-tests for the Bonferroni correction with 0.05 genome-wide false positives was 10 −8 , or log 10 (1/p) = 8. All figures for the GWAS results were produced using SNPEVG2 in the SNPEVG package [25].

Conclusions
This large sample GWAS identified significant additive effects in three chromosome regions and implicated two reproductive hormone genes affecting AFC. A small number of significant positive overdominance effects were also identified. The results provided new evidence and understanding of the genetic variants and chromosome regions affecting AFC in U.S. Holstein cows.