Article

A Bayesian Model for Paired Data in Genome-Wide Association Studies with Application to Breast Cancer

1
Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
2
Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
3
Department of Mathematics, University of Texas at Arlington, Arlington, TX 76019, USA
*
Author to whom correspondence should be addressed.
Entropy 2025, 27(10), 1077; https://doi.org/10.3390/e27101077
Submission received: 31 July 2025 / Revised: 13 October 2025 / Accepted: 15 October 2025 / Published: 18 October 2025

Abstract

Complex human diseases, including cancer, are linked to genetic factors. Genome-wide association studies (GWASs) are powerful for identifying genetic variants associated with cancer but are limited by their reliance on case–control data. We propose approaches to expanding GWAS by using tumor and paired normal tissues to investigate somatic mutations. We apply penalized maximum likelihood estimation for single-marker analysis and develop a Bayesian hierarchical model to integrate multiple markers, identifying SNP sets grouped by genes or pathways, improving detection of moderate-effect SNPs. Applied to breast cancer data from The Cancer Genome Atlas (TCGA), both single- and multiple-marker analyses identify associated genes, with multiple-marker analysis providing more consistent results with external resources. The Bayesian model significantly increases the chance of new discoveries.

1. Introduction

Many complex diseases, including diabetes, heart disease, hypertension, and various cancers, are associated with genetic factors. Identifying the genetic factors in complex diseases is critical to understanding disease heritability, which may lead to better strategies for diagnosis and treatment. Single nucleotide polymorphisms (SNPs), single-base substitutions in the genome, account for nearly 90% of genetic variation [1]. Over the past decade, genome-wide association studies (GWASs) have emerged as a powerful approach for linking common variants to complex diseases such as breast cancer, Crohn’s disease, bipolar disorder, and hypertension [2,3,4]. Standard GWASs typically adopt a case–control design, sampling controls from the healthy population and cases from the affected population, to identify disease-associated SNPs across the genome. However, these studies explain only a small proportion of heritability [5].
Paired tumor–normal data—where tumor and normal tissues from the same patient are genotyped—focus on somatic mutations driving cancer development. Resources such as The Cancer Genome Atlas (TCGA) provide extensive paired data for numerous cancer types, including breast cancer. While many mutations have no phenotypic effect, some drive cellular dysfunction and tumor development [6]. It has been reported that most cancer genes harbor somatic mutations [7,8]. Often the development of cancer is a progressive process in which multiple mutations are accumulated in a normal cell that eventually evolves to a cancerous cell, which can evade the immune mechanisms and start to proliferate. Understanding the role of somatic mutations in carcinogenesis can improve risk prediction, monitoring, early detection, and personalized therapy.
The TCGA has been collecting tumor tissues along with their matched normal samples for different types of cancer since 2005. More than 30 cancer types are involved in genomic characterization and sequence analysis in the TCGA project. Unlike traditional case–control GWASs, paired tumor–normal data inherently control for genetic background and shared environmental factors, reducing confounding and potentially improving power. Yet conventional GWAS methods assume independence between cases and controls and are not directly applicable to paired data. Additionally, single-locus GWASs often miss moderate-effect variants due to stringent multiple testing thresholds and fail to capture joint SNP effects.
In this paper, we propose a novel framework for analyzing somatic mutations using paired tumor–normal data. We first introduce a penalized maximum likelihood estimation (MLE) for single markers and then develop hypothesis tests based on the model. To improve the model performance, a hierarchical Bayesian model is developed for single-marker analysis. To further increase the power to detect associated variants, we extend the hierarchical Bayesian method to multi-marker analysis to detect joint effects. Our approach enhances the detection of somatic mutation associations and joint effects to better understand the genetic risks. Finally, the single- and multi-marker analyses based on hierarchical Bayesian models are applied to breast cancer data.

2. Materials and Methods

2.1. Single-Marker Analysis

In a matched-pair design, each patient provides both tumor (case) and adjacent normal (control) tissues. This structure enhances the detection of somatic mutations and controls for inter-individual variability. However, common GWAS methods require that cases and controls be independent and are thus unsuited to matched-pair data. We present novel approaches to single-marker analysis using frequentist and Bayesian methods.
At a locus, let $A$ be the risk allele frequency and $M$ be the somatic mutation rate. Under genetic equilibrium, the probabilities of carrying genotype 0, 1, 2 are $(1-A)^2$, $2A(1-A)$, $A^2$, respectively. The mutation rate $M$ characterizes the probability of an allele alteration at the locus. Given that the normal tissue genotype is 0, the mutant genotype can be 0, 1, and 2 with probabilities of $(1-M)^2$, $2M(1-M)$, $M^2$, respectively. If the normal genotype is 1, the mutant genotype can be 0, 1, and 2 with probabilities of $M(1-M)$, $M^2+(1-M)^2$, $M(1-M)$, respectively. If the normal genotype is 2, the mutant genotype can be 0, 1, and 2 with probabilities of $M^2$, $2M(1-M)$, $(1-M)^2$, respectively.
Next, consider the disease risk. The penetrance is defined as the probability of having cancer given a specific genotype:
$$\pi_t = P(\text{Disease} \mid \text{Mutated genotype} = t), \quad t = 0, 1, 2.$$
If the mutant is unassociated with disease, $\pi_2 = \pi_1 = \pi_0$. If the mutant is a risk allele, $\pi_2 \ge \pi_1 \ge \pi_0$. Let $R_t$ be the allelic relative risk (RR) given a specific genotype: $R_t = \pi_t / \pi_0$, $t = 1, 2$. If the somatic mutation is not associated with cancer, the corresponding relative risk should be one. Here we consider the additive genetic risk model $R_1 = (R+1)/2$ and $R_2 = R$, where $R$ is the relative risk parameter. Other genetic risk models, such as dominant, recessive, and multiplicative, can be considered as well. The expected probability of paired genotypes in the patient population can then be derived as follows. Let $P_{i,j}$ be the probability of the paired normal–tumor genotype, where $i = 0, 1, 2$ is the normal tissue genotype and $j = 0, 1, 2$ is the tumor tissue genotype. Each $P_{i,j}$ is a function of $(R, A, M)$, i.e., $P_{i,j} = f_{i,j}(R, A, M)$, as defined below:
$$
\begin{aligned}
P_{00} &= \frac{(1-A)^2 (1-M)^2}{q_{sum}}, &
P_{01} &= \frac{2(1-A)^2 M(1-M)}{q_{sum}} \cdot \frac{R+1}{2}, &
P_{02} &= \frac{(1-A)^2 M^2 R}{q_{sum}}, \\
P_{10} &= \frac{2A(1-A) M(1-M)}{q_{sum}}, &
P_{11} &= \frac{2A(1-A)\,[M^2 + (1-M)^2]}{q_{sum}} \cdot \frac{R+1}{2}, &
P_{12} &= \frac{2A(1-A) M(1-M) R}{q_{sum}}, \\
P_{20} &= \frac{A^2 M^2}{q_{sum}}, &
P_{21} &= \frac{2A^2 M(1-M)}{q_{sum}} \cdot \frac{R+1}{2}, &
P_{22} &= \frac{A^2 (1-M)^2 R}{q_{sum}},
\end{aligned}
\tag{1}
$$
where $q_{sum}$ is the normalization constant such that $\sum_{i} \sum_{j} P_{i,j} = 1$. Note that $q_{sum}$ is also a function of $R$, $A$ and $M$.
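To make the construction of the nine paired-genotype probabilities concrete, the following minimal Python sketch assembles them from the three ingredients described above: the equilibrium genotype frequencies, the mutation transition probabilities, and the additive relative risks. The function name `paired_genotype_probs` and the example parameter values are illustrative assumptions, not part of the original analysis.

```python
def paired_genotype_probs(R, A, M):
    """Return (probs, q_sum): probs[i][j] is P_{i,j} for normal genotype i, tumor genotype j."""
    # Normal-tissue genotype frequencies under genetic equilibrium.
    normal = [(1 - A) ** 2, 2 * A * (1 - A), A ** 2]
    # mutate[i][j]: probability of tumor genotype j given normal genotype i.
    mutate = [
        [(1 - M) ** 2, 2 * M * (1 - M), M ** 2],
        [M * (1 - M), M ** 2 + (1 - M) ** 2, M * (1 - M)],
        [M ** 2, 2 * M * (1 - M), (1 - M) ** 2],
    ]
    # Additive risk model: relative risk attached to tumor genotypes 0, 1, 2.
    risk = [1.0, (R + 1) / 2, R]
    # Unnormalized weights; their total is the normalization constant q_sum.
    weights = [[normal[i] * mutate[i][j] * risk[j] for j in range(3)]
               for i in range(3)]
    q_sum = sum(sum(row) for row in weights)
    probs = [[w / q_sum for w in row] for row in weights]
    return probs, q_sum


# Illustrative values only: R = 2, A = 0.1, M = 0.005.
probs, q_sum = paired_genotype_probs(2.0, 0.1, 0.005)
```

Under the null value $R = 1$ the normalization constant reduces to one, so the table is simply the product of the genotype frequencies and the mutation transition probabilities.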
Suppose a simple random sample of size $n$ is drawn from all patients. The normal–tumor genotype pair of each sampled patient at a SNP then follows a Multinoulli distribution, also known as the generalized Bernoulli (categorical) distribution, and the nine genotype-pair counts jointly follow the corresponding multinomial distribution.

2.1.1. Maximum Likelihood Estimation

We first consider a penalized maximum likelihood estimation for $R$, $A$ and $M$. For each genetic marker, the sample data can be summarized as 9 counts, i.e., $\mathbf{n} = (n_{00}, n_{01}, n_{02}, n_{10}, n_{11}, n_{12}, n_{20}, n_{21}, n_{22})$, where $n_{i,j}$ is the number of patients whose normal tissue genotype is $i$ and tumor tissue genotype is $j$. Assuming a simple random sample from all patients, $\mathbf{n}$ follows the multinomial distribution with $n$ trials and cell probabilities $(P_{00}, P_{01}, P_{02}, P_{10}, P_{11}, P_{12}, P_{20}, P_{21}, P_{22})$. The likelihood function is written as follows:
$$L(R, A, M \mid \mathbf{n}) = \prod_{i=0,1,2} \; \prod_{j=0,1,2} \left( f_{i,j}(R, A, M) \right)^{n_{i,j}}$$
To keep the parameter estimates away from the boundary, we add a penalty term to the log-likelihood function. The boundaries are set so that $R \in (0, \infty)$, $A \in (0, 1)$, and $M \in (0, 1)$. The penalized log-likelihood function is
$$l_p(R, A, M \mid \mathbf{n}) = \sum_{i=0,1,2} \; \sum_{j=0,1,2} n_{i,j} \log\left( f_{i,j}(R, A, M) \right) - \lambda \log^2(A) - \lambda \log^2(1-A) - \lambda \log^2(M) - \lambda \log^2(1-M) - \alpha \log^2(R),$$
where $\alpha$ and $\lambda$ are tuning parameters. When the estimates of the parameters $A$ and $M$ are close to their boundaries, at least one of the terms $A$, $1-A$, $M$, $1-M$ will be close to 0. Thus, at least one of the terms $\log^2(A)$, $\log^2(1-A)$, $\log^2(M)$, $\log^2(1-M)$ will be large enough to prevent the parameter estimates from reaching the boundaries.
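The penalized objective can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the squared-logarithm penalty is read as $(\log x)^2$, and the default values of 0.05 for both tuning parameters are placeholders rather than the settings used for every analysis. Maximizing this function over $(R, A, M)$ with any numerical optimizer would give the penalized MLE.

```python
import math


def paired_probs(R, A, M):
    """Flat list of the nine cell probabilities, ordered 00, 01, 02, 10, ..., 22."""
    normal = [(1 - A) ** 2, 2 * A * (1 - A), A ** 2]
    mutate = [
        [(1 - M) ** 2, 2 * M * (1 - M), M ** 2],
        [M * (1 - M), M ** 2 + (1 - M) ** 2, M * (1 - M)],
        [M ** 2, 2 * M * (1 - M), (1 - M) ** 2],
    ]
    risk = [1.0, (R + 1) / 2, R]  # additive risk model
    weights = [normal[i] * mutate[i][j] * risk[j]
               for i in range(3) for j in range(3)]
    total = sum(weights)
    return [w / total for w in weights]


def penalized_loglik(R, A, M, counts, lam=0.05, alpha=0.05):
    """Multinomial log-likelihood (up to a constant) minus the squared-log penalties."""
    probs = paired_probs(R, A, M)
    # Cells with zero count contribute nothing to the log-likelihood.
    loglik = sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)
    penalty = lam * (math.log(A) ** 2 + math.log(1 - A) ** 2
                     + math.log(M) ** 2 + math.log(1 - M) ** 2)
    penalty += alpha * math.log(R) ** 2
    return loglik - penalty


# Hypothetical counts for one SNP, ordered n00, n01, n02, n10, ..., n22.
example_counts = [800, 5, 0, 170, 15, 2, 0, 0, 8]
value = penalized_loglik(2.0, 0.1, 0.005, example_counts)
```

The penalty on $R$ vanishes at $R = 1$ and grows as $R$ moves toward 0 or infinity, while the penalties on $A$ and $M$ grow without bound as either parameter approaches 0 or 1.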

2.1.2. Hypothesis Testing

For each SNP, consider testing the hypothesis $R = R_0$. Specifically, we take $R_0 = 1$ and test the null hypothesis $H_0: R = 1$ versus the alternative hypothesis $H_1: R \neq 1$. The Wald test measures the squared difference $\hat{R}_{MLE} - R_0$ weighted by the curvature of the log-likelihood function. The Wald statistic is calculated as follows:
$$W = \frac{(\hat{R}_{MLE} - R_0)^2}{var(\hat{R}_{MLE})},$$
where $\hat{R}_{MLE}$ is the penalized MLE of $R$, and $var(\hat{R}_{MLE})$ is its variance estimated by the inverse of the expected information matrix evaluated at the maximum likelihood estimate. Under the null hypothesis, the Wald test statistic $W$ follows an asymptotic $\chi^2$ distribution with one degree of freedom.
The score test assesses the statistical significance of the parameter based on the gradient of the likelihood function. The score function is evaluated at $R_0$ and equals 0 when $R_0 = \hat{R}_{MLE}$. When the score function at $R_0$ deviates far from 0, the alternative hypothesis is more plausible than $H_0$. The score test statistic is calculated as follows:
$$S = \frac{(u(R_0))^2}{I(R_0)},$$
where $u(R_0)$ is the score function evaluated at $R_0$ with the other parameters replaced by their MLEs, and $I(R_0)$ is the Fisher information evaluated at $R_0$ with the other parameters fixed at their MLEs. The test statistic asymptotically follows a $\chi^2$ distribution with one degree of freedom.
The likelihood ratio test is another method to assess statistical significance, based on comparing the log-likelihood function evaluated at the MLE and at $R_0$. The likelihood ratio test statistic is calculated as follows:
$$LR = 2\, l(\hat{R}_{MLE}) - 2\, l(R_0),$$
where $l(\hat{R}_{MLE})$ and $l(R_0)$ are the log-likelihood function evaluated at $\hat{R}_{MLE}$ and $R_0$, respectively. In large samples, the likelihood ratio test statistic approximately follows a $\chi^2$ distribution with one degree of freedom.
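Since all three statistics are referred to a $\chi^2$ distribution with one degree of freedom, their p-values can be computed with the complementary error function alone. The sketch below is illustrative: the function names are hypothetical, and the numerical inputs (an estimate of 1.5 with variance 0.0625) are made-up values used only to show the calculation.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square(1), so P(Z**2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def wald_statistic(r_hat, var_r_hat, r0=1.0):
    """Squared distance between the estimate and the null value, scaled by the variance."""
    return (r_hat - r0) ** 2 / var_r_hat


def likelihood_ratio_statistic(loglik_mle, loglik_null):
    """Twice the log-likelihood gain of the MLE over the null value."""
    return 2.0 * loglik_mle - 2.0 * loglik_null


# Hypothetical inputs: estimated relative risk 1.5 with estimated variance 0.0625.
w = wald_statistic(1.5, 0.0625)
p_value = chi2_sf_1df(w)
```

The score statistic can be passed to `chi2_sf_1df` in the same way once the score function and Fisher information have been evaluated at $R_0$.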

2.2. Bayesian Hierarchical Modeling

Genome-wide association studies have effectively detected many SNPs that are associated with various diseases. However, the identified variants explain a limited portion of the disease heritability [5]. Many associated variants remain undiscovered partly because the single-locus GWAS has limited power to reveal the associations. For example, the p-values from the hypothesis tests outlined in Section 2.1 require adjustment for multiple testing using methods such as Bonferroni correction or False Discovery Rate (FDR) to control for false positives [9]. Only a limited number of markers with very large effect sizes can pass the stringent criteria after adjustments. In addition, SNP interactions, which play an important role in complex disease susceptibility [10], are ignored in single-marker analysis.
We propose a Bayesian hierarchical model to jointly examine a SNP set in order to improve the detection power. The literature contains numerous models for SNP-set tests, including the burden test [11], SKAT [12], aSPU test [13], VEGAS/VEGAS2 [14,15], MAGMA [16], GATES [17], GHC [18], GBJ [19], and ACAT tests [20]. However, existing SNP-set tests are designed for traditional GWAS where cases and controls are sampled independently. We consider paired tumor–normal specimens where tumor and normal tissues were collected from the same patient. Therefore, traditional GWAS-based tests are not directly applicable to the matched pair data.
The multiple marker model can aggregate a group of SNPs in a biologically meaningful way, such as genes, pathways, or topological 3D structures [21]. For simplicity, we use genes as an example of grouping SNPs, with a note that the model applies to any other biologically meaningful way of grouping SNPs. The multiple marker model is applied to determine the association status of a gene, after integrating data from all genetic markers located in the gene region.

2.2.1. Prior Distribution

Suppose there are $J$ SNPs on a gene. Let $G$ denote the gene association status: $G = 1$ if associated and 0 otherwise. Gene status $G$ is assigned a Bernoulli prior with probability $b$, where $b$ is further assigned a Beta hyperprior. Assuming only a small proportion of genes are associated, we set the prior mean of $b$ at a small value, e.g., 0.2. The distributions of $G$ and probability $b$ are as follows:
$$f(G \mid b) = b^{G} \cdot (1-b)^{(1-G)}, \qquad f(b) \sim \mathrm{Beta}(\alpha_b, \beta_b).$$
Let $H_j$, $j = 1, \ldots, J$, represent the association status of the $j$th SNP on the gene, where $H_j = 1$ if the SNP is a risk marker, and $H_j = 0$ if it is unrelated or neutral. We assume the probability of $H_j = 1$ depends on the gene status $G$. If $G = 1$, there is a high probability that the SNPs on the gene are associated with the disease. Otherwise, the probability is low. Let $H_j \mid (G = i) \sim \mathrm{Bernoulli}(p_i)$, $i = 0, 1$. For example, we can set $p_0 = 0.1$ and $p_1 = 0.9$, but they can take other values to reflect the prior knowledge. Thus, the prior distribution of $H_j$ given $G$ is
$$f(H_j \mid G) = p_1^{G} \cdot p_0^{(1-G)}.$$
Let $\mathbf{H} = \{H_1, \ldots, H_J\}$. Assuming that the $H_j$ are conditionally independent given $G$, we can write the prior distribution as follows:
$$f(\mathbf{H} \mid G) = f(H_1, \ldots, H_J \mid G) = \prod_{j=1}^{J} f(H_j \mid G).$$
Let $R_j$ denote the relative risk (RR) of the $j$th SNP on the gene. The prior distribution of $R_j$ depends on the association status of the $j$th SNP. When it is a risk mutant, the effect size should be greater than one. We assign $R_j$ a $\mathrm{Gamma}(k_1, \theta_1)$ distribution with the mean greater than one. In practice, the prior mean can be the average of all MLEs of the SNPs. On the other hand, when the $j$th SNP is neutral, the effect size should be one, and thus $R_j$ is assigned a prior $\mathrm{Gamma}(k_0, \theta_0)$ with mean at one. Let $\mathbf{R} = \{R_1, \ldots, R_J\}$ be the set of risk parameters on the same gene. Assume the $R_j$ are conditionally independent of each other given $\mathbf{H}$, and $R_j$ is independent of $H_{j'}$ when $j' \neq j$. Then the prior distribution of $\mathbf{R}$ is as follows:
$$f(R_j \mid H_j) = \gamma(k_1, \theta_1)^{H_j} \cdot \gamma(k_0, \theta_0)^{(1-H_j)}, \qquad f(\mathbf{R} \mid \mathbf{H}) = \prod_{j=1}^{J} f(R_j \mid H_j),$$
where $\gamma(k_1, \theta_1)$ and $\gamma(k_0, \theta_0)$ are the density functions of $\mathrm{Gamma}(k_1, \theta_1)$ and $\mathrm{Gamma}(k_0, \theta_0)$, respectively.
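The top-down structure of these priors (gene status, then SNP status, then relative risk) can be illustrated by drawing one sample from them. The sketch below is only an illustration: the function name is hypothetical, and all hyperparameter values are assumptions chosen to match the verbal description, namely a Beta(2, 8) hyperprior with mean 0.2, and Gamma (shape, scale) pairs with mean one for neutral SNPs and mean above one for risk SNPs.

```python
import random


def sample_gene_prior(J, rng, alpha_b=2.0, beta_b=8.0, p0=0.1, p1=0.9,
                      gamma_risk=(3.0, 0.8), gamma_neutral=(3.0, 1.0 / 3.0)):
    """Draw (b, G, H, R) for one gene with J SNPs from the hierarchical prior."""
    # Hyperprior: probability that a gene is associated (mean alpha_b / (alpha_b + beta_b)).
    b = rng.betavariate(alpha_b, beta_b)
    # Gene-level status.
    G = 1 if rng.random() < b else 0
    # SNP-level statuses, conditionally independent given G.
    p = p1 if G == 1 else p0
    H = [1 if rng.random() < p else 0 for _ in range(J)]
    # Relative risks: random.gammavariate takes (shape, scale), so the mean is shape * scale.
    R = [rng.gammavariate(*(gamma_risk if h else gamma_neutral)) for h in H]
    return b, G, H, R


rng = random.Random(2025)  # fixed seed so the draw is reproducible
b, G, H, R = sample_gene_prior(4, rng)
```

Repeating the draw many times shows how the gene-level status couples the SNP-level statuses: SNPs on the same gene tend to be labeled risk or neutral together.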
Let $A_j$ and $M_j$ denote the risk allele frequency (AF) and the mutation rate (MR), respectively, of the $j$th SNP on the gene. The prior distribution of $A_j$ depends on the association status of the $j$th SNP. The observed and expected AF are
$$\hat{A}_j = \frac{n_{10} + n_{11} + n_{12} + 2n_{20} + 2n_{21} + 2n_{22}}{2n}, \qquad \mu_{\hat{A}_j} = E(\hat{A}_j) = E\left( \frac{n_{10} + n_{11} + n_{12} + 2n_{20} + 2n_{21} + 2n_{22}}{2n} \right) = \frac{A_j \left( R_j - A_j + R_j A_j + 2A_j M_j - 2R_j A_j M_j + 1 \right)}{2 \left( R_j A_j - M_j - A_j + R_j M_j + 2A_j M_j - 2R_j A_j M_j + 1 \right)}. \tag{2}$$
When the SNP is neutral, i.e., $R_j = 1$, $E(\hat{A}_j) = A_j$. Thus $A_j \mid (H_j = 0)$ is assigned a $\mathrm{Beta}(\alpha_{A0,j}, \beta_{A0,j})$ prior distribution with the mean set at $\hat{A}_j$. On the other hand, for a risk SNP, the allele frequency is enriched in the patient population. We can estimate the parameter $A_j$ by solving the equation $\hat{A}_j = \mu_{\hat{A}_j}$:
$$A_j = \hat{A}_j + \frac{R_j + 1 - \sqrt{\left( R_j + 2\hat{A}_j (2M_j - 1)(R_j - 1) + 1 \right)^2 - 8\hat{A}_j (2M_j - 1)(R_j - 1)(R_j M_j - M_j + 1)}}{2 (2M_j - 1)(R_j - 1)}. \tag{3}$$
In Equation (3) the parameters $R_j$ and $M_j$ are unknown, and they can be set to the MLE or other estimates. Then, $A_j \mid (H_j = 1)$ is assigned a $\mathrm{Beta}(\alpha_{A1,j}, \beta_{A1,j})$ distribution, with the prior mean equal to (3). Let $\mathbf{A} = \{A_1, \ldots, A_J\}$ be the set of AF variables on the same gene. Assume that the $A_j$ are conditionally independent of each other given $\mathbf{H}$, and $A_j$ is independent of $H_{j'}$ when $j' \neq j$. Therefore, the prior distribution of $\mathbf{A}$ is as follows:
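A quick numerical round trip helps confirm how Equation (3) is used: starting from a population allele frequency, compute the allele frequency expected among patients, then recover the original value by inverting. The following sketch is a check under assumptions: the function names are hypothetical, the root with the minus sign in front of the square root is the one taken, and the inputs $R = 2$, $A = 0.1$, $M = 0.005$ are illustrative.

```python
import math


def expected_allele_freq(R, A, M):
    """Expected observed allele frequency among patients, given (R, A, M)."""
    num = A * (R - A + R * A + 2 * A * M - 2 * R * A * M + 1)
    den = 2 * (R * A - M - A + R * M + 2 * A * M - 2 * R * A * M + 1)
    return num / den


def solve_allele_freq(a_hat, R, M):
    """Invert the expectation: population allele frequency implied by a_hat, R and M."""
    e = (2 * M - 1) * (R - 1)
    if e == 0:
        # R = 1 (or M = 0.5): the quadratic degenerates and A equals the observed frequency.
        return a_hat
    disc = (R + 2 * a_hat * e + 1) ** 2 - 8 * a_hat * e * (R * M - M + 1)
    return a_hat + (R + 1 - math.sqrt(disc)) / (2 * e)


a_hat = expected_allele_freq(2.0, 0.1, 0.005)    # enriched frequency among patients
a_back = solve_allele_freq(a_hat, 2.0, 0.005)    # recovers the population value 0.1
```

With these inputs the expected frequency among patients is about 0.14, higher than the population value of 0.1, which illustrates the enrichment of a risk allele in the patient population.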
$$f(A_j \mid H_j) = \beta(\alpha_{A1,j}, \beta_{A1,j})^{H_j} \cdot \beta(\alpha_{A0,j}, \beta_{A0,j})^{(1-H_j)}, \qquad f(\mathbf{A} \mid \mathbf{H}) = \prod_{j=1}^{J} f(A_j \mid H_j),$$
where $\beta(\alpha_{A1,j}, \beta_{A1,j})$ and $\beta(\alpha_{A0,j}, \beta_{A0,j})$ are the density functions of $\mathrm{Beta}(\alpha_{A1,j}, \beta_{A1,j})$ and $\mathrm{Beta}(\alpha_{A0,j}, \beta_{A0,j})$, respectively.
The prior distribution of the MR parameter $M_j$ depends on the association status of the $j$th SNP. The observed and expected mutation rates from the sample are
$$\hat{M}_j = \frac{n_{01} + 2n_{02} + n_{10} + n_{12} + 2n_{20} + n_{21}}{2n}, \qquad \mu_{\hat{M}_j} = E(\hat{M}_j) = E\left( \frac{n_{01} + 2n_{02} + n_{10} + n_{12} + 2n_{20} + n_{21}}{2n} \right) = \frac{M_j \left( R_j - M_j + R_j M_j + 2A_j M_j - 2R_j A_j M_j + 1 \right)}{2 \left( R_j A_j - M_j - A_j + R_j M_j + 2A_j M_j - 2R_j A_j M_j + 1 \right)}. \tag{4}$$
When the SNP is non-associated, $E(\hat{M}_j \mid H_j = 0) = M_j$. The MR is thus assigned a prior $\mathrm{Beta}(\alpha_{M0,j}, \beta_{M0,j})$. For a risk SNP, an estimate of $M_j$ can be obtained by letting $\hat{M}_j = \mu_{\hat{M}_j}$ and then solving the equation:
$$M_j = \hat{M}_j + \frac{-(R_j + 1) + \sqrt{\left( R_j + 1 - 2\hat{M}_j (R_j - 1)(1 - 2A_j) \right)^2 + 8\hat{M}_j (1 - 2A_j)(R_j - 1)(R_j A_j - A_j + 1)}}{2 (R_j - 1)(1 - 2A_j)}. \tag{5}$$
We can assign a $\mathrm{Beta}(\alpha_{M1,j}, \beta_{M1,j})$ prior for $M_j$ conditional on $H_j = 1$, where the prior mean is equal to (5). Let $\mathbf{M} = \{M_1, \ldots, M_J\}$. Assume that the $M_j \mid H_j$ are independent of each other, and $M_j$ is independent of $H_{j'}$ when $j' \neq j$. Therefore, the distribution of $\mathbf{M}$ is as follows:
$$f(M_j \mid H_j) = \beta(\alpha_{M1,j}, \beta_{M1,j})^{H_j} \cdot \beta(\alpha_{M0,j}, \beta_{M0,j})^{(1-H_j)}, \qquad f(\mathbf{M} \mid \mathbf{H}) = \prod_{j=1}^{J} f(M_j \mid H_j).$$

2.2.2. Joint Posterior Distribution

Let $\Theta_j = \{R_j, A_j, M_j\}$ be the set of RR, AF and MR parameters on the $j$th SNP, where $\Theta_j$ and $\Theta_{j'}$ are independent when $j' \neq j$. Let $\Theta = \{R_1, \ldots, R_J, A_1, \ldots, A_J, M_1, \ldots, M_J\}$ be the set of RR, AF, and MR parameters on all SNPs of a given gene. Let $\mathbf{H} = \{H_1, \ldots, H_J\}$ be the set of all SNP association statuses on the gene. Under the assumption that $R_j \mid H_j$, $A_j \mid H_j$, $M_j \mid H_j$ are independent of each other, the joint distribution of $\Theta$, $\mathbf{H}$, $G$, $b$ can be derived as follows:
$$f(\Theta, \mathbf{H}, G, b) = \left[ \prod_{j=1}^{J} f(R_j \mid H_j)\, f(A_j \mid H_j)\, f(M_j \mid H_j)\, f(H_j \mid G) \right] f(G \mid b)\, f(b).$$
The observed counts of the normal–tumor paired genotypes at the $j$th SNP, defined as $\mathbf{n}_j = \{n_{00}^{(j)}, n_{01}^{(j)}, n_{02}^{(j)}, n_{10}^{(j)}, n_{11}^{(j)}, n_{12}^{(j)}, n_{20}^{(j)}, n_{21}^{(j)}, n_{22}^{(j)}\}$, follow a multinomial distribution over 9 categories, where the expected probabilities are shown in (1). Let $S = \{\mathbf{n}_1, \ldots, \mathbf{n}_J\}$ be the set of counts on a given gene. Assume conditional independence of $\mathbf{n}_j$ and $\mathbf{n}_{j'}$ when $j' \neq j$. The likelihood function is as follows:
$$f(S \mid \Theta) = \prod_{j=1}^{J} \; \prod_{k, k' \in \{0,1,2\}} \left( P_{k,k'}^{(j)} \right)^{n_{k,k'}^{(j)}}.$$
Therefore, the joint posterior distribution can be derived as follows:
$$f(\Theta, \mathbf{H}, G, b \mid S) \propto f(S \mid \Theta) \left[ \prod_{j=1}^{J} f(R_j \mid H_j)\, f(A_j \mid H_j)\, f(M_j \mid H_j)\, f(H_j \mid G) \right] f(G \mid b)\, f(b).$$
The posterior distributions lack closed-form expressions, and thus it is difficult to sample from the posterior distributions directly. We estimate them using Markov Chain Monte Carlo (MCMC) simulation. For parameters having a closed-form conditional posterior distribution, such as ( G , b , H ) , a Gibbs sampler is used to approximate the target distribution. Other parameters are sampled with the Metropolis–Hastings algorithm in each Gibbs iteration. Details can be found in the Supplementary Materials.
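As an illustration of the closed-form part of the sampler, the sketch below updates the gene status $G$ and the probability $b$ within one Gibbs iteration. It is a sketch derived here from the stated priors, not the algorithm given in the Supplementary Materials: it assumes the full Bernoulli form $H_j \mid G \sim \mathrm{Bernoulli}(p_G)$, uses the conjugacy of the Beta hyperprior with the single Bernoulli draw $G$, and takes hypothetical hyperparameter values.

```python
import random


def prob_gene_associated(H, b, p0=0.1, p1=0.9):
    """Conditional probability P(G = 1 | H, b) with H_j | G ~ Bernoulli(p_G)."""
    weight1 = b          # prior weight of G = 1
    weight0 = 1.0 - b    # prior weight of G = 0
    for h in H:
        weight1 *= p1 if h else 1.0 - p1
        weight0 *= p0 if h else 1.0 - p0
    return weight1 / (weight1 + weight0)


def gibbs_update_G_b(H, b, rng, alpha_b=2.0, beta_b=8.0, p0=0.1, p1=0.9):
    """One Gibbs step: draw G given (H, b), then b given G."""
    G = 1 if rng.random() < prob_gene_associated(H, b, p0, p1) else 0
    # Beta(alpha_b, beta_b) is conjugate to one Bernoulli observation G.
    b_new = rng.betavariate(alpha_b + G, beta_b + 1 - G)
    return G, b_new


rng = random.Random(7)
G, b = gibbs_update_G_b([1, 1, 0, 1], 0.2, rng)
```

In a full sampler this step would alternate with updates of each $H_j$ and with Metropolis–Hastings updates of $R_j$, $A_j$ and $M_j$, whose conditional distributions are not available in closed form.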

3. Results

3.1. Simulation Studies

We evaluate the penalized MLE, single-marker Bayesian ($J = 1$), and multi-marker Bayesian models using simulated paired data with varying (1) sample sizes ($n$ = 1000, 3000); (2) allele frequencies ($A$ = 0.05, 0.1, 0.2); (3) mutation rates ($M$ = 0.001, 0.005); and (4) relative risks ($R$ = 1, 2, 3). Each setting is repeated 100 times. We focus on the estimation accuracy of the relative risk $R$, which is the major parameter of interest. For hypothesis testing, we compare the type I error rates under the null hypothesis $H_0$: $R = 1$, and the power to identify associated variants under the alternative hypotheses $H_a$: $R = 2$ and $H_a$: $R = 3$. The allelic risk model is assumed to be additive, where the risk of the heterozygous genotype is the mean of the risks of the two homozygous genotypes [22].
For multi-marker analysis, we consider a gene that has 4 SNPs with the same relative risk. To compare the performance in estimating $R$, we use the mean square error (MSE) for the penalized MLE, the single-marker Bayesian model and the multi-marker Bayesian model. For the Bayesian models we use the sample median of the posterior draws. In the penalized MLE method, we set the ridge coefficient to 0.05 to regularize the estimation.
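One replicate of such a simulation can be sketched as follows: compute the nine cell probabilities for a chosen $(R, A, M)$, draw $n$ patients, and tabulate the paired-genotype counts. The function names are hypothetical and the sketch covers only data generation and the MSE summary, not the estimation step itself.

```python
import random


def paired_probs(R, A, M):
    """Flat list of the nine cell probabilities, ordered 00, 01, 02, 10, ..., 22."""
    normal = [(1 - A) ** 2, 2 * A * (1 - A), A ** 2]
    mutate = [
        [(1 - M) ** 2, 2 * M * (1 - M), M ** 2],
        [M * (1 - M), M ** 2 + (1 - M) ** 2, M * (1 - M)],
        [M ** 2, 2 * M * (1 - M), (1 - M) ** 2],
    ]
    risk = [1.0, (R + 1) / 2, R]  # additive risk model
    weights = [normal[i] * mutate[i][j] * risk[j]
               for i in range(3) for j in range(3)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_counts(n, R, A, M, rng):
    """Draw n patients and return the nine paired-genotype counts for one SNP."""
    draws = rng.choices(range(9), weights=paired_probs(R, A, M), k=n)
    counts = [0] * 9
    for cell in draws:
        counts[cell] += 1
    return counts


def mean_squared_error(estimates, truth):
    """Average squared deviation of replicate estimates from the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


rng = random.Random(1)
counts = simulate_counts(1000, 2.0, 0.1, 0.005, rng)  # one replicate, one SNP
```

For the multi-marker setting, calling `simulate_counts` four times with the same $R$ gives the four SNPs of one simulated gene.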
In Bayesian modeling, the prior distributions for an unassociated SNP are set as follows: $R \mid (H = 0) \sim \mathrm{Gamma}(3, 3)$; $A \mid (H = 0)$ follows a Beta distribution with mean at the observed allele frequency; $M \mid (H = 0)$ follows a Beta distribution with mean at the observed mutation rate. The prior distributions conditional on $H = 1$ are: $R \mid (H = 1) \sim \mathrm{Gamma}(3, 8)$; $A \mid (H = 1)$ follows a Beta distribution with the mean chosen according to (3); $M \mid (H = 1)$ follows a Beta distribution whose mean is equal to (5). MCMC simulations with three restarts are used to draw from the posterior distribution of the model parameters. The Bayesian estimate is taken as the median of the MCMC samples after burn-in and thinning.
Figure 1 shows the MSE of the estimators over 100 replicates when $M = 0.005$. The results show that the multiple-marker Bayesian model has the lowest MSE in most settings, especially under small sample sizes or low mutation rates. Similar results are observed for $M = 0.001$ (Supplementary Materials).
Next we consider hypothesis testing. To estimate the false positive rate (type I error), we simulate data from the null model in which no SNPs are associated with the disease phenotype, i.e., all SNP-level association statuses are $H = 0$. In this situation, all the SNP relative risks are set to $R = 1$. To estimate the power (true positive rate), we simulate data under the alternative hypothesis that all SNP-level association statuses are $H = 1$. To vary the association level, we consider two scenarios, $R = 2$ and $R = 3$. We fix the nominal type I error rate at $\alpha = 0.05$ for the likelihood-based tests. All three tests (Wald, score, likelihood ratio) have similar performance, and we choose the Wald test as a representative under the penalized MLE method. In the single- and multi-marker Bayesian models, we use the posterior median of $H$ to decide whether the marker is associated. Table 1 shows the estimated type I error and power of the three methods based on 100 replicates when $M = 0.005$. When the sample size is 1000 and the allele frequencies are low, all methods fail to control the type I error rate at 0.05. However, the type I error of the multi-marker Bayesian model is the lowest among all three methods in most settings. In the moderate-risk case ($R = 2$), the power of the multi-marker Bayesian model either exceeds or is similar to that of the single-marker Bayesian model in all settings. When $R$ increases from 2 to 3, the multi-marker Bayesian model shows a substantial power improvement for $n = 1000$. All methods perform similarly in settings with a large sample size ($n = 3000$) and a high risk allele frequency. Similar results are observed for $M = 0.001$ (Supplementary Materials).
Overall, the multi-marker Bayesian model outperforms the other two methods in most settings. The Bayesian model provides a more stable estimation than penalized MLE. In scenarios where data are limited due to insufficient sample size, low allele frequencies or low mutation rates, the multi-marker Bayesian model has a clear advantage over others.

3.2. Real Data Application

3.2.1. Application to Matched-Pair Breast Cancer Data

We analyzed the tumor–normal matched-pair data for breast cancer from The Cancer Genome Atlas (TCGA). The total number of SNPs is 905,461 and the sample size is 1070. We applied quality control methods to remove invalid SNP genotypes. First, the Hardy–Weinberg equilibrium test [23] was applied, with a p-value of 0.05 as the cut-off threshold. SNPs with missing genotypes and SNPs with an allele frequency of less than 0.05 were removed. After quality control, we retained 614,883 SNPs.
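The Hardy–Weinberg filter can be illustrated with the classical one-degree-of-freedom chi-square version of the test. This is a sketch under assumptions: the variant of the test in [23] is not restated here, so the chi-square form is a stand-in, the function name is hypothetical, and the genotype counts are made-up numbers.

```python
import math


def hwe_chi2_pvalue(n0, n1, n2):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium from genotype counts."""
    n = n0 + n1 + n2
    p = (n1 + 2 * n2) / (2 * n)  # frequency of the allele counted by the genotype code
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    if min(expected) == 0:
        return 1.0  # monomorphic marker: no departure can be tested
    observed = [n0, n1, n2]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi-square(1) through the complementary error function.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical genotype counts (genotypes 0, 1, 2) with an excess of homozygotes.
p_value = hwe_chi2_pvalue(700, 200, 100)
keep_snp = p_value >= 0.05  # markers below the cut-off would be removed
```

For markers with small expected genotype counts an exact test is usually preferable to the chi-square approximation.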
Among the 1070 samples, 725 are self-reported as White, 176 are Black, 60 are Asians, 95 are Others, and 109 are Unknown. We applied principal component analysis (PCA) using all input SNPs to determine the patient population structure. Ethnicity groups were distinguishable using the top two principal components, as shown in Figure 2. We imputed the missing racial ancestry for patients with unknown ancestry. To avoid potential problems caused by population stratification, we used the 807 samples classified as White in the subsequent analysis.
We used the refFlat gene annotation (UCSC hg19) as the human genome reference. The gene region is determined using the transcription start and end positions. A SNP is assigned to a gene region if it lies within 1000 base pairs upstream or downstream of the gene. A total of 58,545 genes are available, among which 20,353 contain at least one SNP, with a total of 220,268 SNPs being mapped to these genes. The summary statistics of the SNP counts and gene lengths (in base pairs) of all genes are provided in Table 2.
The SNPs tend to be widely separated across a long gene region, where linkage disequilibrium can hardly be observed. To capture the joint effects of adjacent markers, we split long genes evenly into small segments that contain 32,000 base pairs or less. After segmentation, there were 58,161 gene segments. Summary statistics of the SNP counts on each gene segment are given in Table 3. The multi-marker analysis method was then applied to all 58,161 gene segments to analyze the joint effects.
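The even split of a long gene can be sketched as follows. The rule used here, taking the smallest number of equal-width segments such that none exceeds 32,000 base pairs, is an assumption about how "evenly" is implemented; the function name and the half-open coordinate convention are likewise illustrative.

```python
import math


def split_gene(start, end, max_len=32000):
    """Split [start, end) into the fewest equal-width segments of at most max_len bp."""
    length = end - start
    n_seg = max(1, math.ceil(length / max_len))
    # Segment boundaries, rounded to whole base-pair positions.
    bounds = [start + round(k * length / n_seg) for k in range(n_seg + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# A hypothetical 100,000 bp gene is split into four 25,000 bp segments.
segments = split_gene(0, 100000)
```

Genes shorter than the limit are returned as a single segment, so the same function can be applied to every gene before SNPs are assigned to segments.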

3.2.2. Single-Marker Analysis

We applied the Bayesian single-marker analysis to all 220,268 SNPs. The individual SNP status is estimated by the mean of the posterior distribution. To obtain aggregated scores for genes, we first computed the average over all SNPs on each gene segment and then took the largest segment average to represent the gene.
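The two-step aggregation (average within a segment, maximum across segments) is simple enough to state in code. The function name and the example scores below are hypothetical.

```python
def gene_score(segment_snp_scores):
    """Average SNP scores within each segment, then return the largest segment average."""
    # Segments without SNPs are skipped so that no empty average is taken.
    segment_means = [sum(scores) / len(scores)
                     for scores in segment_snp_scores if scores]
    return max(segment_means)


# Hypothetical posterior-mean SNP statuses for a gene with three segments.
score = gene_score([[0.1, 0.3], [0.8, 0.6], [0.2]])
```

Taking the maximum means that a gene is ranked by its most strongly associated segment rather than by an average diluted across a long gene.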
Table 4 shows the top-ranked genes identified by the single-marker Bayesian model. Among them, many have been reported to be cancer related. Multiple recent studies have indicated a positive correlation between boosted TIAM1 expression level and higher grade of human breast cancer [24,25]. The TIAM1 gene and the encoded protein have been implicated in cell proliferation, migration, invasion, and tumor progression in a variety of human cancers [26,27,28,29]. Genetic loss of NDST4 is significantly associated with tumor progression, and the NDST4 gene is identified as a novel candidate tumor suppressor in human colorectal cancer [30]. A number of studies have suggested that the deactivation of EIF2AK2 can suppress tumor growth [31,32,33], while elevated expression of EIF2AK2 increases carcinoma progression in a variety of human cancers, including breast cancer [34,35,36]. The TMEM117 gene belongs to the TMEM family. There is evidence that down- or up-regulated TMEM expression has been identified in tumor tissues compared to adjacent healthy tissues, and some suggest TMEMs as prognostic biomarkers [37].

3.2.3. Multi-Marker Analysis

The multi-marker Bayesian model was also applied to the TCGA breast cancer data. In these results, the gene segment status is estimated by the median of the posterior distribution. Similarly, the gene status is represented by the maximum value over all of its gene segments.
Table 5 shows the top genes identified in the multi-marker Bayesian analysis. Recent studies have reported a highly significant association between the KIRREL3 region and breast cancer [38]. The STX3 gene may contribute to carcinogenesis via up- or down-regulation in various cancers, promoting breast cancer cell growth [39,40]. Elevated AGPAT4 expression in cancer tissues is associated with poorer survival rates in colorectal cancer patients [41]. Deletions in the PKNOX2 gene region are linked to breast cancer and ovarian cancer malignancies [42,43]. The RGS3 protein may function as a tumor suppressor [44]. Low expression of CSMD1, a tumor suppressor gene, is significantly associated with higher breast tumor grades [45,46].
We used an external oncogenic database, the Catalogue Of Somatic Mutations In Cancer (COSMIC), which provides comprehensive somatic mutations and genes that are associated with all types of breast cancer tissues. The gene list contains gene symbols, mutated samples, and total samples. The COSMIC breast cancer genes are sorted by mutation rate. Genes with higher mutation rates tend to have greater risk in breast cancer. The top 500 genes from COSMIC were used as a benchmark to compare with the top-ranked genes of the multi- and single-marker analyses. We plotted the receiver operating characteristic (ROC) curves in Figure 3 and calculated the area under the curve (AUC) of both methods. The AUC of the multi-marker analysis is 0.86 and the AUC of the single-marker analysis is 0.83.
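The AUC of a gene ranking against a benchmark list can be computed without plotting the ROC curve, using the rank-sum identity: the AUC is the probability that a randomly chosen benchmark gene scores higher than a randomly chosen non-benchmark gene. The sketch below assumes this is an acceptable way to summarize the comparison; the function name, gene labels, and scores are hypothetical.

```python
def auc_from_scores(scores, benchmark):
    """AUC of gene scores against a benchmark set; tied pairs count one half."""
    positives = [s for gene, s in scores.items() if gene in benchmark]
    negatives = [s for gene, s in scores.items() if gene not in benchmark]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical gene-level scores and a hypothetical two-gene benchmark list.
scores = {"gene_a": 0.9, "gene_b": 0.8, "gene_c": 0.3, "gene_d": 0.1}
benchmark = {"gene_a", "gene_c"}
auc = auc_from_scores(scores, benchmark)
```

In this toy example three of the four benchmark versus non-benchmark pairs are ordered correctly, which gives an AUC of 0.75.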
We also explored external resources from the Genomic Data Commons (GDC) Data Portal to compare with the gene lists generated by our Bayesian models. The GDC data include mutations reported to be associated with breast cancer. The impact of these mutations is classified based on the severity of the variant consequences using three tools: Ensembl Variant Effect Predictor (VEP), Polymorphism Phenotyping (PolyPhen) and Sorting Intolerant From Tolerant (SIFT). The Ensembl VEP tool evaluates the effect of a genomic variant in coding and non-coding regions. The effect levels include high, moderate, low, and modifier, which range from high impact in protein to no evidence of impact. The SIFT tool predicts whether an amino acid substitution will affect protein function and phenotype based on sequence homology. The impact levels are “deleterious”, “deleterious low confidence”, “tolerated low confidence” and “tolerated”, which ranges from very likely to not likely to have a phenotypic effect. The PolyPhen tool predicts the potential impact of an amino acid substitution on human proteins. The impact levels are “probably damaging”, “possibly damaging”, “benign”, and “unknown”, ranging from high confidence of affecting protein function or structure to an indeterminate prediction. In Figure 4, we summarized the counts of each impact level for variants within the top 100 associated genes identified by single-marker and multi-marker Bayesian models. The results indicate that variants identified by the multi-marker model are more frequently classified as high impact compared to those identified by the single-marker model.

4. Discussion

Studies have indicated that the progression of cancer is associated with the accumulation of somatic mutations [8,47]. Investigating the impact of somatic mutations in carcinoma is critical for risk prediction, continuous monitoring, and early detection of cancer, and can contribute to individualized prevention and therapeutic strategies. We proposed a novel modeling framework to analyze somatic mutations using tumor and matched normal tissue data in GWAS. Penalized maximum likelihood estimation (MLE) provides a computationally efficient method for individual SNP analysis. Compared with penalized MLE, the single-marker hierarchical Bayesian model has lower MSE in relative risk estimation and higher power to identify associated SNPs when the sample size is limited or the allele frequency or mutation rate is low. However, in settings with sufficient sample size, high allele frequency, and high mutation rate, the penalized MLE and the single-marker Bayesian method perform similarly. The multi-marker hierarchical Bayesian model groups SNPs into biologically meaningful sets, allowing joint analysis of their effects. In the breast cancer data analysis, large genes were divided into smaller segments (~32,000 base pairs or less), and the multi-marker Bayesian model was applied to each segment, which comprises a localized set of 2–40 SNPs (Section 3.2.1). Simulations consistently demonstrated that this model maintains a low type I error rate and enhances power (TPR) compared to the single-marker Bayesian and penalized MLE methods, particularly for moderate effect sizes. Additionally, the algorithm substantially reduces the multiple testing burden, decreasing the number of tests from 220,268 SNPs to 20,353 genes. It is worth mentioning that the computational cost is similar for the multiple- and single-marker Bayesian models. Future work will focus on extending the model to incorporate interaction networks and applying it to other cancer types.
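
The segmentation of large genes can be illustrated with a short sketch. It assumes a simple greedy rule, which may differ from the procedure described in Section 3.2.1: SNP positions within a gene are sorted, and a new segment starts whenever the next SNP would stretch the segment beyond 32,000 bp or beyond 40 SNPs. Dropping segments with fewer than 2 SNPs is also an assumption of this sketch, and the function name, parameter names, and positions are hypothetical.

```python
def segment_gene(positions, max_span=32_000, max_snps=40, min_snps=2):
    """Greedily split SNP positions (in bp) into contiguous segments.

    A segment is closed when the next SNP would push its span beyond
    max_span or its size beyond max_snps. Segments with fewer than
    min_snps SNPs are discarded (an assumption of this sketch).
    """
    segments, current = [], []
    for pos in sorted(positions):
        too_wide = bool(current) and pos - current[0] > max_span
        too_many = len(current) >= max_snps
        if current and (too_wide or too_many):
            segments.append(current)
            current = []
        current.append(pos)
    if current:
        segments.append(current)
    return [seg for seg in segments if len(seg) >= min_snps]


# Hypothetical SNP positions on one gene.
toy_positions = [100, 5_000, 20_000, 40_000, 41_000, 90_000]
toy_segments = segment_gene(toy_positions)
print(toy_segments)
```

Each returned segment would then be passed to the multi-marker model as one SNP set, so the number of sets per gene grows with gene length and SNP density.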

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/e27101077/s1: Figure S1: MSE of penalized MLE, single-marker and multiple-marker Bayesian models (M = 0.001); Figure S2: Estimated type I error rates (R = 1) and power (R = 2, 3) of penalized MLE, single-marker and multiple-marker Bayesian models (M = 0.001); Figure S3: Estimated relative risk, segment status, and SNP counts of 10 segments on the TACC2 gene; Figure S4: Estimated relative risk, segment status, and SNP counts of 71 segments on the CSMD1 gene; Figure S5: Estimated relative risk, segment status, and SNP counts of 41 segments on the CDH13 gene. References [48,49,50] are cited in the Supplementary Materials.

Author Contributions

Conceptualization, M.C.; methodology, M.C., X.W., and Y.B.; software, Y.B.; resources, Z.X. and X.W.; data curation, Z.X.; writing—original draft preparation, Y.B.; writing—review and editing, M.C. and X.W.; visualization, Y.B.; supervision, M.C. and Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by NIH R01GM157608 and R01GM160515.

Data Availability Statement

R code available at https://github.com/bysKate/matchedGWAS (accessed on 10 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Smith, J.E.; Clark, A.R.; Staggemeier, A.T. A Genetic Approach to Statistical Disclosure Control. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, Montreal, QC, Canada, 8–12 July 2009; Association for Computing Machinery: New York, NY, USA, 2009. GECCO’09. pp. 1625–1632.
  2. Stadler, Z.K.; Thom, P.; Robson, M.E.; Weitzel, J.N.; Kauff, N.D.; Hurley, K.E.; Devlin, V.; Gold, B.; Klein, R.J.; Offit, K. Genome-wide association studies of cancer. J. Clin. Oncol. 2010, 28, 4255.
  3. Marees, A.T.; de Kluiver, H.; Stringer, S.; Vorspan, F.; Curis, E.; Marie-Claire, C.; Derks, E.M. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int. J. Methods Psychiatr. Res. 2018, 27, e1608.
  4. Kim, H.S.; Minna, J.D.; White, M.A. GWAS Meets TCGA to Illuminate Mechanisms of Cancer Predisposition. Cell 2013, 152, 387–389.
  5. Manolio, T.A.; Collins, F.S.; Cox, N.J.; Goldstein, D.B.; Hindorff, L.A.; Hunter, D.J.; McCarthy, M.I.; Ramos, E.M.; Cardon, L.R.; Chakravarti, A.; et al. Finding the missing heritability of complex diseases. Nature 2009, 461, 747–753.
  6. Martincorena, I.; Campbell, P.J. Somatic mutation in cancer and normal cells. Science 2015, 349, 1483–1489.
  7. Futreal, P.A.; Coin, L.; Marshall, M.; Down, T.; Hubbard, T.; Wooster, R.; Rahman, N.; Stratton, M.R. A census of human cancer genes. Nat. Rev. Cancer 2004, 4, 177–183.
  8. Alexandrov, L.B.; Nik-Zainal, S.; Wedge, D.C.; Aparicio, S.A.J.R.; Behjati, S.; Biankin, A.V.; Bignell, G.R.; Bolli, N.; Borg, A.; Børresen-Dale, A.L.; et al. Signatures of mutational processes in human cancer. Nature 2013, 500, 415–421.
  9. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300.
  10. Li, P.; Guo, M.; Wang, C.; Liu, X.; Zou, Q. An overview of SNP interactions in genome-wide association studies. Brief. Funct. Genom. 2014, 14, 143–155.
  11. Lee, S.; Abecasis, G.R.; Boehnke, M.; Lin, X. Rare-variant association analysis: Study designs and statistical tests. Am. J. Hum. Genet. 2014, 95, 5–23.
  12. Wu, M.C.; Lee, S.; Cai, T.; Li, Y.; Boehnke, M.; Lin, X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011, 89, 82–93.
  13. Pan, W.; Kim, J.; Zhang, Y.; Shen, X.; Wei, P. A powerful and adaptive association test for rare variants. Genetics 2014, 197, 1081–1095.
  14. Liu, J.Z.; Mcrae, A.F.; Nyholt, D.R.; Medland, S.E.; Wray, N.R.; Brown, K.M.; Hayward, N.K.; Montgomery, G.W.; Visscher, P.M.; Martin, N.G.; et al. A Versatile Gene-Based Test for Genome-wide Association Studies. Am. J. Hum. Genet. 2010, 87, 139–145.
  15. Mishra, A.; Macgregor, S. VEGAS2: Software for more flexible gene-based testing. Twin Res. Hum. Genet. 2015, 18, 86–91.
  16. De Leeuw, C.A.; Mooij, J.M.; Heskes, T.; Posthuma, D. MAGMA: Generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 2015, 11, e1004219.
  17. Li, M.X.; Gui, H.S.; Kwan, J.S.; Sham, P.C. GATES: A rapid and powerful gene-based association test using extended Simes procedure. Am. J. Hum. Genet. 2011, 88, 283–293.
  18. Barnett, I.; Mukherjee, R.; Lin, X. The generalized higher criticism for testing SNP-set effects in genetic association studies. J. Am. Stat. Assoc. 2017, 112, 64–76.
  19. Sun, R.; Lin, X. Genetic variant set-based tests using the generalized Berk–Jones statistic with application to a genome-wide association study of breast cancer. J. Am. Stat. Assoc. 2020, 115, 1079–1091.
  20. Liu, Y.; Chen, S.; Li, Z.; Morrison, A.C.; Boerwinkle, E.; Lin, X. ACAT: A fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 2019, 104, 410–421.
  21. Chen, M.; Cho, J.; Zhao, H. Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet. 2011, 7, e1001353.
  22. Ziegler, A.; Konig, I.R. A Statistical Approach to Genetic Epidemiology; Wiley-VCH Verlag GmbH & Co. KGaA: Berlin, Germany, 2010.
  23. Emigh, T.H. A Comparison of Tests for Hardy-Weinberg Equilibrium. Biometrics 1980, 36, 627.
  24. Minard, M.E.; Kim, L.S.; Price, J.E.; Gallick, G.E. The Role of the Guanine Nucleotide Exchange Factor Tiam1 in Cellular Migration, Invasion, Adhesion and Tumor Progression. Breast Cancer Res. Treat. 2004, 84, 21–32.
  25. Adam, L.; Vadlamudi, R.K.; McCrea, P.; Kumar, R. Tiam1 Overexpression Potentiates Heregulin-induced Lymphoid Enhancer Factor-1/β-Catenin Nuclear Signaling in Breast Cancer Cells by Modulating the Intercellular Stability. J. Biol. Chem. 2001, 276, 28443–28450.
  26. Walch, A.; Seidl, S.; Hermannstadter, C.; Rauser, S.; Deplazes, J.; Langer, R.; von Weyhern, C.H.; Sarbia, M.; Busch, R.; Feith, M.; et al. Combined analysis of Rac1, IQGAP1, Tiam1 and E-cadherin expression in gastric cancer. Mod. Pathol. 2008, 21, 544–552.
  27. Engers, R.; Mueller, M.; Walter, A.; Collard, J.G.; Willers, R.; Gabbert, H.E. Prognostic relevance of Tiam1 protein expression in prostate carcinomas. Br. J. Cancer 2006, 95, 1081–1086.
  28. Minard, M.E.; Ellis, L.M.; Gallick, G.E. Tiam1 regulates cell adhesion, migration and apoptosis in colon tumor cells. Clin. Exp. Metastasis 2006, 23, 301–313.
  29. Ding, Y.; Chen, B.; Wang, S.; Zhao, L.; Chen, J.; Ding, Y.; Chen, L.; Luo, R. Overexpression of Tiam1 in hepatocellular carcinomas predicts poor prognosis of HCC patients. Int. J. Cancer 2009, 124, 653–658.
  30. Tzeng, S.T.; Tsai, M.H.; Chen, C.L.; Lee, J.X.; Jao, T.M.; Yu, S.L.; Yen, S.J.; Yang, Y.C. NDST4 Is a Novel Candidate Tumor Suppressor Gene at Chromosome 4q26 and Its Genetic Loss Predicts Adverse Prognosis in Colorectal Cancer. PLoS ONE 2013, 8, e67040.
  31. Meurs, E.F.; Galabru, J.; Barber, G.N.; Katze, M.G.; Hovanessian, A.G. Tumor suppressor function of the interferon-induced double-stranded RNA-activated protein kinase. Proc. Natl. Acad. Sci. USA 1993, 90, 232–236.
  32. Shir, A.; Levitzki, A. Inhibition of glioma growth by tumor-specific activation of double-stranded RNA–dependent protein kinase PKR. Nat. Biotechnol. 2002, 20, 895–900.
  33. Kim, T.H.; Cho, S.G. Kisspeptin inhibits cancer growth and metastasis via activation of EIF2AK2. Mol. Med. Rep. 2017, 16, 7585–7590.
  34. Kim, S.H.; Forman, A.P.; Mathews, M.B.; Gunnery, S. Human breast cancer cells contain elevated levels and activity of the protein kinase, PKR. Oncogene 2000, 19, 3086–3094.
  35. Lee, Y.S.; Kunkeaw, N.; Lee, Y.S. Protein kinase R and its cellular regulators in cancer: An active player or a surveillant? WIREs RNA 2019, 11, e1558.
  36. Garcia, M.A.; Gil, J.; Ventoso, I.; Guerra, S.; Domingo, E.; Rivas, C.; Esteban, M. Impact of Protein Kinase PKR in Cell Biology: From Antiviral to Antiproliferative Action. Microbiol. Mol. Biol. Rev. 2006, 70, 1032–1060.
  37. Schmit, K.; Michiels, C. TMEM Proteins in Cancer: A Review. Front. Pharmacol. 2018, 9, 1345.
  38. Wang, X.; Pankratz, V.S.; Fredericksen, Z.; Tarrell, R.; Karaus, M.; McGuffog, L.; Pharaoh, P.D.; Ponder, B.A.; Dunning, A.M.; Peock, S.; et al. Common variants associated with breast cancer in genome-wide association studies are modifiers of breast cancer risk in BRCA1 and BRCA2 mutation carriers. Hum. Mol. Genet. 2010, 19, 2886–2897.
  39. Giovannone, A.J.; Winterstein, C.; Bhattaram, P.; Reales, E.; Low, S.H.; Baggs, J.E.; Xu, M.; Lalli, M.A.; Hogenesch, J.B.; Weimbs, T. Soluble syntaxin 3 functions as a transcriptional regulator. J. Biol. Chem. 2018, 293, 5478–5491.
  40. Nan, H.; Han, L.; Ma, J.; Yang, C.; Su, R.; He, J. STX3 represses the stability of the tumor suppressor PTEN to activate the PI3K-Akt-mTOR signaling and promotes the growth of breast cancer cells. Biochim. Biophys. Acta (BBA)—Mol. Basis Dis. 2018, 1864, 1684–1692.
  41. Zhang, D.; Shi, R.; Xiang, W.; Kang, X.; Tang, B.; Li, C.; Gao, L.; Zhang, X.; Zhang, L.; Dai, R.; et al. The Agpat4/LPA axis in colorectal cancer cells regulates antitumor responses via p38/p65 signaling in macrophages. Signal Transduct. Target. Ther. 2020, 5, 24.
  42. Launonen, V.; Stenback, F.; Puistola, U.; Bloigu, R.; Huusko, P.; Kytola, S.; Kauppila, A.; Winqvist, R. Chromosome 11q22.3-q25 LOH in Ovarian Cancer: Association with a More Aggressive Disease Course and Involved Subregions. Gynecol. Oncol. 1998, 71, 299–304.
  43. Gentile, M.; Wiman, A.; Thorstenson, S.; Loman, N.; Borg, A.; Wingren, S. Deletion mapping of chromosome segment 11q24-q25, exhibiting extensive allelic loss in early onset breast cancer. Int. J. Cancer 2001, 92, 208–213.
  44. Chen, Z.; Wu, Y.; Meng, Q.; Xia, Z. Elevated microRNA-25 inhibits cell apoptosis in lung cancer by targeting RGS3. Vitr. Cell. Dev. Biol.-Anim. 2015, 52, 62–67.
  45. Escudero-Esparza, A.; Bartoschek, M.; Gialeli, C.; Okroj, M.; Owen, S.; Jirstrom, K.; Orimo, A.; Jiang, W.G.; Pietras, K.; Blom, A.M. Complement inhibitor CSMD1 acts as tumor suppressor in human breast cancer. Oncotarget 2016, 7, 76920–76933.
  46. Kamal, M.; Shaaban, A.M.; Zhang, L.; Walker, C.; Gray, S.; Thakker, N.; Toomes, C.; Speirs, V.; Bell, S.M. Loss of CSMD1 expression is associated with high tumour grade and poor survival in invasive ductal breast carcinoma. Breast Cancer Res. Treat. 2009, 121, 555–563.
  47. Greenman, C.; Stephens, P.; Smith, R.; Dalgliesh, G.L.; Hunter, C.; Bignell, G.; Davies, H.; Teague, J.; Butler, A.; Stevens, C.; et al. Patterns of somatic mutation in human cancer genomes. Nature 2007, 446, 153–158.
  48. Conte, N.; Delaval, B.; Ginestier, C.; Ferrand, A.; Isnardon, D.; Larroque, C.; Prigent, C.; Séraphin, B.; Jacquemier, J.; Birnbaum, D. TACC1-chTOG-Aurora A protein complex in breast cancer. Oncogene 2003, 22, 8102–8116.
  49. Ma, C.; Quesnelle, K.M.; Sparano, A.; Rao, S.; Park, M.S.; Cohen, M.A.; Wang, Y.; Samanta, M.; Kumar, M.S.; Aziz, M.U.; et al. Characterization CSMD1 in a large set of primary lung, head and neck, breast and skin cancer tissues. Cancer Biol. Ther. 2009, 8, 907–916.
  50. Toyooka, K.O.; Toyooka, S.; Virmani, A.K.; Sathyanarayana, U.G.; Euhus, D.M.; Gilcrease, M.; Minna, J.D.; Gazdar, A.F. Loss of expression and aberrant methylation of the CDH13 (H-cadherin) gene in breast and lung carcinomas. Cancer Res. 2001, 61, 4556–4560.
Figure 1. MSE of penalized MLE, single-marker Bayesian and multi-marker Bayesian model (M = 0.005).
Figure 2. Principal Component Analysis of all input SNPs in the breast cancer data.
Figure 3. ROC curves of multiple- and single-marker models. Dotted line: random classifier (AUC = 0.5).
Figure 4. The counts of impact levels predicted by three tools: Ensembl VEP, SIFT and PolyPhen using the variants from the top 100 associated genes identified by multiple-marker Bayesian model and single-marker Bayesian model.
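Table 1. Estimated type I error rates (R = 1) and power (R = 2, 3) of penalized MLE, single-marker Bayesian and multi-marker Bayesian model (M = 0.005).

| n | SNP | Allele Freq | Multi Bayes R = 1 | Multi Bayes R = 2 | Multi Bayes R = 3 | Single Bayes R = 1 | Single Bayes R = 2 | Single Bayes R = 3 | Penalized MLE R = 1 | Penalized MLE R = 2 | Penalized MLE R = 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | SNP1 | 0.05 | 0.14 | 0.31 | 0.64 | 0.24 | 0.32 | 0.45 | 0.26 | 0.13 | 0.09 |
| 1000 | SNP1 | 0.10 | 0.14 | 0.38 | 0.82 | 0.23 | 0.36 | 0.59 | 0.18 | 0.10 | 0.21 |
| 1000 | SNP1 | 0.20 | 0.07 | 0.39 | 0.91 | 0.18 | 0.34 | 0.65 | 0.18 | 0.09 | 0.55 |
| 1000 | SNP2 | 0.05 | 0.14 | 0.31 | 0.62 | 0.23 | 0.31 | 0.44 | 0.30 | 0.16 | 0.12 |
| 1000 | SNP2 | 0.10 | 0.15 | 0.39 | 0.81 | 0.25 | 0.38 | 0.57 | 0.22 | 0.13 | 0.18 |
| 1000 | SNP2 | 0.20 | 0.07 | 0.40 | 0.91 | 0.15 | 0.35 | 0.67 | 0.18 | 0.10 | 0.55 |
| 1000 | SNP3 | 0.05 | 0.14 | 0.32 | 0.64 | 0.23 | 0.34 | 0.47 | 0.26 | 0.12 | 0.14 |
| 1000 | SNP3 | 0.10 | 0.14 | 0.36 | 0.82 | 0.21 | 0.32 | 0.59 | 0.20 | 0.11 | 0.20 |
| 1000 | SNP3 | 0.20 | 0.07 | 0.41 | 0.92 | 0.18 | 0.37 | 0.67 | 0.15 | 0.09 | 0.53 |
| 1000 | SNP4 | 0.05 | 0.14 | 0.31 | 0.64 | 0.24 | 0.31 | 0.47 | 0.19 | 0.12 | 0.11 |
| 1000 | SNP4 | 0.10 | 0.14 | 0.38 | 0.83 | 0.23 | 0.35 | 0.59 | 0.21 | 0.13 | 0.15 |
| 1000 | SNP4 | 0.20 | 0.07 | 0.44 | 0.92 | 0.15 | 0.42 | 0.70 | 0.19 | 0.15 | 0.58 |
| 3000 | SNP1 | 0.05 | 0.12 | 0.45 | 0.91 | 0.22 | 0.38 | 0.68 | 0.24 | 0.04 | 0.17 |
| 3000 | SNP1 | 0.10 | 0.08 | 0.55 | 0.98 | 0.19 | 0.44 | 0.84 | 0.15 | 0.17 | 0.80 |
| 3000 | SNP1 | 0.20 | 0.05 | 0.62 | 0.99 | 0.12 | 0.47 | 0.91 | 0.11 | 0.49 | 0.99 |
| 3000 | SNP2 | 0.05 | 0.11 | 0.44 | 0.91 | 0.20 | 0.38 | 0.67 | 0.18 | 0.03 | 0.14 |
| 3000 | SNP2 | 0.10 | 0.06 | 0.55 | 0.98 | 0.14 | 0.42 | 0.84 | 0.16 | 0.13 | 0.82 |
| 3000 | SNP2 | 0.20 | 0.05 | 0.62 | 0.99 | 0.13 | 0.46 | 0.90 | 0.18 | 0.42 | 1.00 |
| 3000 | SNP3 | 0.05 | 0.12 | 0.44 | 0.91 | 0.23 | 0.40 | 0.68 | 0.12 | 0.05 | 0.16 |
| 3000 | SNP3 | 0.10 | 0.06 | 0.55 | 0.97 | 0.12 | 0.44 | 0.82 | 0.16 | 0.12 | 0.83 |
| 3000 | SNP3 | 0.20 | 0.04 | 0.64 | 0.99 | 0.10 | 0.51 | 0.90 | 0.09 | 0.50 | 0.99 |
| 3000 | SNP4 | 0.05 | 0.11 | 0.46 | 0.91 | 0.19 | 0.42 | 0.69 | 0.18 | 0.02 | 0.12 |
| 3000 | SNP4 | 0.10 | 0.06 | 0.55 | 0.98 | 0.14 | 0.44 | 0.82 | 0.15 | 0.15 | 0.77 |
| 3000 | SNP4 | 0.20 | 0.03 | 0.62 | 0.99 | 0.09 | 0.45 | 0.90 | 0.18 | 0.48 | 1.00 |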
Table 2. Summary of SNP counts, gene length on all genes.

| Summary | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| SNP Counts | 1 | 3 | 8 | 42 | 39 | 1242 |
| Gene Length (bp) | 21 | 16,215 | 47,282 | 159,292 | 166,206 | 2.32 × 10^6 |
Table 3. Summary of SNP counts on gene segments.

| Summary | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| SNP counts | 2 | 3 | 5 | 6 | 8 | 40 |
Table 4. Genes with highest gene status estimated by posterior median in single-marker analysis.

| Gene Name | Chr | Gene Status |
|---|---|---|
| IL7 | chr8 | 0.975 |
| TIAM1 | chr21 | 0.972 |
| CKAP2L | chr2 | 0.961 |
| TTC28 | chr22 | 0.956 |
| NDST4 | chr4 | 0.952 |
| EIF2AK2 | chr2 | 0.952 |
| CACNB4 | chr2 | 0.95 |
| PARD3B | chr2 | 0.946 |
| TMEM117 | chr12 | 0.944 |
| ATP6V0D1 | chr16 | 0.943 |
Table 5. Genes with highest gene status estimated by posterior median in multiple-marker analysis.

| Gene Name | Chr | Gene Status |
|---|---|---|
| LINC00383 | chr13 | 0.999 |
| KIRREL3 | chr11 | 0.999 |
| STX3 | chr11 | 0.999 |
| AGPAT4 | chr6 | 0.999 |
| SYCE1 | chr10 | 0.997 |
| RCBTB1 | chr13 | 0.997 |
| PKNOX2 | chr11 | 0.997 |
| RGS3 | chr9 | 0.997 |
| GCSH | chr16 | 0.997 |
| CSMD1 | chr8 | 0.996 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation: Bu, Y.; Chen, M.; Xuan, Z.; Wang, X. A Bayesian Model for Paired Data in Genome-Wide Association Studies with Application to Breast Cancer. Entropy 2025, 27, 1077. https://doi.org/10.3390/e27101077
