A Unified Comparative Evaluation of Genomic Prediction Models Across Four Aquaculture Species

Zhang, Jinxin; Yang, Xiaofei; Wang, Wei; Hu, Hongxia; Xu, Shaogang; Song, Hailiang

doi:10.3390/fishes11020115

by Jinxin Zhang, Xiaofei Yang, Wei Wang, Hongxia Hu, Shaogang Xu * and Hailiang Song *

Fisheries Science Institute, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100068, China

* Authors to whom correspondence should be addressed.

Fishes 2026, 11(2), 115; https://doi.org/10.3390/fishes11020115

Submission received: 25 December 2025 / Revised: 6 February 2026 / Accepted: 10 February 2026 / Published: 12 February 2026

(This article belongs to the Special Issue Functional Gene Analysis and Genomic Technologies in Aquatic Animals)

Abstract

Genomic prediction has been increasingly applied in aquaculture selective breeding; however, systematic evaluations of prediction accuracy across multiple aquaculture species and analytical methods under a unified and comparable framework remain limited. In this study, we conducted a comparative assessment of genomic prediction performance across four representative aquaculture species, namely Atlantic salmon (Salmo salar), gilthead sea bream (Sparus aurata), common carp (Cyprinus carpio), and rainbow trout (Oncorhynchus mykiss), using ten genomic prediction models spanning GBLUP, Bayesian, and machine learning methods. Prediction accuracy varied widely among species and models, ranging from 0.49 to 0.85, and was strongly associated with trait heritability. High-heritability traits consistently achieved higher prediction accuracies, with rainbow trout and common carp exhibiting the best overall performance (0.75–0.83 and 0.73–0.85, respectively), whereas Atlantic salmon and gilthead sea bream showed lower and more variable accuracies (0.49–0.61 and 0.49–0.66). No single model performed optimally across all species. Machine learning-based approaches achieved the highest prediction accuracy in specific cases but exhibited pronounced species-dependent variability, while GBLUP provided stable and well-calibrated predictions with consistently low bias. Incremental SNP feature selection further improved prediction accuracy by 2.8–4.2% in three species using only 0.54–9.64% of the available markers, whereas no improvement was observed for a low-heritability trait. These results show that genomic prediction performance is highly context-dependent and underscore the importance of jointly considering trait genetic architecture, population characteristics, model choice, and marker selection when optimizing genomic selection strategies in aquaculture breeding programs.

Keywords: genomic selection; genomic prediction models; model comparison; feature selection; aquaculture species

Key Contribution: This study establishes a unified and comparable evaluation framework to systematically assess GBLUP, Bayesian, and machine learning genomic prediction models across multiple aquaculture species. By jointly analyzing prediction accuracy, bias, and incremental SNP feature selection, it clarifies the strengths and limitations of different modeling approaches under heterogeneous genetic architectures.

1. Introduction

Global aquaculture is facing the dual pressures of increasing food demand and growing constraints on natural resources and the environment. As a major source of high-quality animal protein, aquaculture plays a critical strategic role in ensuring food security, improving human nutrition, and promoting rural economic development [1,2,3]. However, despite the substantial success achieved by traditional breeding approaches—such as pedigree-based selection—their long breeding cycles and limited genetic progress for complex traits, particularly disease resistance and stress tolerance, have become major bottlenecks restricting further industrial development [4,5]. To address these limitations, the development and application of more efficient and precise modern breeding technologies have become a key driving force for the future advancement of the aquaculture industry.

Genomic selection (GS) has emerged as a revolutionary breeding technology that provides a powerful solution to these challenges [6,7,8]. The core principle of GS lies in the use of genome-wide molecular markers, such as single nucleotide polymorphisms (SNPs), to compute genomic estimated breeding values (GEBVs), thereby enabling early and accurate selection of breeding candidates [9]. A major advantage of GS is its ability to improve prediction accuracy compared with traditional pedigree-based methods [10,11,12,13]. This approach is particularly well suited to aquaculture species, which often exhibit high fecundity and allow large-scale genotyping and early-life selection, thereby offering great potential to accelerate genetic gain [7,8,9,14].

In recent years, a wide range of genomic prediction models have been proposed and applied in livestock and plant breeding programs. These models mainly include linear mixed-model-based approaches such as genomic best linear unbiased prediction (GBLUP) [14,15], a series of Bayesian variable selection models (e.g., BayesA, BayesB, BayesCπ, and BayesLASSO) [16,17,18], and more recently developed machine learning algorithms such as support vector regression, random forest, and extreme gradient boosting [19,20,21]. These models differ substantially in their assumptions regarding quantitative trait genetic architecture, computational complexity, and ability to capture complex genetic mechanisms. For example, GBLUP assumes that all SNP effects follow the same normal distribution with a common variance and has been widely adopted due to its robustness and computational efficiency [22,23]. In contrast, Bayesian models allow heterogeneous variances or sparse SNP effect distributions, making them more suitable for traits controlled by a limited number of loci with moderate to large effects [24]. Meanwhile, machine learning methods, including support vector regression (SVR), random forest (RF), kernel ridge regression (KRR), and extreme gradient boosting (XGB), have attracted increasing attention because of their capacity to model nonlinear relationships and higher-order interactions, potentially improving prediction performance for traits with complex genetic architectures [25,26].

Despite the growing application of GS in aquaculture [14,19,21], several critical knowledge gaps remain. Most existing studies focus on a single species or compare only a limited number of models. Moreover, the lack of consistency in sample size, marker density, trait types, and analytical workflows across studies makes it difficult to directly compare results or draw general conclusions. As a result, there is still a lack of systematic evaluations of genomic prediction accuracy across multiple aquaculture species conducted under a unified and comparable analytical framework.

To address these gaps, we designed and implemented a comprehensive comparative evaluation framework. Four economically important aquaculture species—Atlantic salmon (Salmo salar), gilthead sea bream (Sparus aurata), common carp (Cyprinus carpio), and rainbow trout (Oncorhynchus mykiss)—were carefully selected. For each species, one representative economically relevant trait was analyzed to conduct an integrated assessment of genomic prediction models, comparing the performance of GBLUP, Bayesian, and machine learning approaches in terms of prediction accuracy and bias.

Rather than proposing new genomic prediction algorithms, the primary contribution of this study lies in establishing a unified and comparable benchmarking framework that enables systematic cross-species evaluation of model performance under heterogeneous genetic architectures. Within this framework, we further evaluated the impact of incremental SNP feature selection on prediction performance. The results provide practical guidance for model selection and application in aquaculture breeding programs and offer insights into the generalizability and limitations of different genomic prediction strategies across diverse species–trait combinations.

2. Materials and Methods

2.1. Population and Phenotypes

Phenotypic and genotypic data were obtained from previously published studies on four aquaculture species. Briefly: (1) Atlantic salmon (Salmo salar) were challenged with amoebic gill disease and phenotyped for mean gill score, defined as the average score of the left and right gills. Gill lesions were subjectively scored on a continuous scale ranging from 0 (no lesion) to 5 (severe lesion). The challenged population comprised 1481 individuals from 84 full-sib families, with 1–39 fish per family [27]. (2) Common carp (Cyprinus carpio) originated from four factorial crosses involving five females and ten males per cross (20 females and 40 males in total). Body weight was recorded as the target trait. A total of 1214 individuals from 195 full-sib families were included, with family sizes ranging from 1 to 21 individuals [28]. (3) Gilthead sea bream (Sparus aurata) were produced from a factorial mating design involving 67 broodfish (32 males and 35 females). Fish were challenged by a 30 min immersion in Photobacterium damselae (1 × 10⁵ CFU), the causative agent of pasteurellosis, and the number of days to death was recorded. The dataset comprised 777 individuals from 73 full-sib families, with 2–144 individuals per family [29]. (4) Rainbow trout (Oncorhynchus mykiss) belonged to 58 full-sib families generated from 58 females and 20 males of the 2014-year class, with 10–18 individuals per family. Fish were challenged with infectious pancreatic necrosis virus, and survival time (days to death) was recorded [30]. A summary of population structure, phenotypes, and sample sizes is provided in Table 1. Phenotypic summary statistics are reported as mean ± SD to describe population-level variation, whereas heritability estimates are presented with their standard errors (SE) to indicate estimation uncertainty.

2.2. SNP Detection, Quality Control and Principal Component Analysis

Genotyping was conducted using different platforms across species: (1) Atlantic salmon (n = 1481) were genotyped using an Illumina 17K combined Atlantic salmon–rainbow trout SNP array (17,156 SNPs), derived from a higher-density platform [31]; (2) common carp (n = 1214) were genotyped using RAD sequencing (RAD-seq), yielding approximately 12,000 SNPs; (3) gilthead sea bream (n = 777) were genotyped using RAD-seq, also producing approximately 12,000 SNPs; and (4) rainbow trout (n = 749) were genotyped using the 57K Affymetrix Axiom SNP array [32]. Missing genotypes for SNPs with known chromosomal positions were imputed using Beagle v5.4 [33]. Quality control was applied to the imputed datasets using PLINK v1.90 [34]. SNPs were excluded if they had a minor allele frequency (MAF) < 0.05, a call rate < 0.90, or a Hardy–Weinberg equilibrium (HWE) test p < 1 × 10⁻⁷, and individuals were excluded if their call rate was < 0.90. After quality control, all individuals were retained. The final numbers of SNPs used for genomic prediction were 10,383 (Atlantic salmon), 8531 (common carp), 8545 (gilthead sea bream), and 37,958 (rainbow trout). Furthermore, principal component analysis (PCA) was performed using PLINK v1.90 to assess the population genetic background.
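The marker-level and individual-level filters can be illustrated with a minimal NumPy sketch. This is not the PLINK implementation used in the study: the function name `qc_filter`, the 0/1/2 coding with `np.nan` for missing calls, and the choice to compute both masks on the unfiltered matrix are assumptions made for illustration, and the HWE test is omitted.

```python
import numpy as np

def qc_filter(M, maf_min=0.05, snp_call_min=0.90, ind_call_min=0.90):
    """Illustrative QC masks for a genotype matrix.

    M: (n_individuals, n_snps) array coded 0/1/2, with np.nan for missing calls.
    Returns boolean masks (keep_individuals, keep_snps).
    The HWE filter described in the text is not implemented here.
    """
    called = ~np.isnan(M)
    # Individual call rate: fraction of SNPs with a genotype call.
    keep_ind = called.mean(axis=1) >= ind_call_min
    # SNP call rate: fraction of individuals with a genotype call.
    snp_call = called.mean(axis=0)
    # Frequency of the counted allele, ignoring missing calls.
    p = np.nanmean(M, axis=0) / 2.0
    maf = np.minimum(p, 1.0 - p)
    keep_snp = (snp_call >= snp_call_min) & (maf >= maf_min)
    return keep_ind, keep_snp

# Hypothetical toy data: 4 individuals, 3 SNPs.
M_toy = np.array([[0.0, 0.0, np.nan],
                  [0.0, 1.0, np.nan],
                  [0.0, 2.0, 1.0],
                  [0.0, 1.0, 1.0]])
keep_ind, keep_snp = qc_filter(M_toy)
```

In this toy matrix the first SNP is monomorphic and the third has a call rate of 0.5, so only the second SNP passes; the first two individuals fall below the individual call-rate threshold.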

2.3. Genomic Prediction Models

2.3.1. Genomic Best Linear Unbiased Prediction (GBLUP)

GBLUP is one of the most widely used linear mixed models for genomic selection [10] and was implemented in this study using the standard mixed-model formulation:

y = Xβ + Zg + e

where y is the vector of phenotypic observations, Xβ represents fixed effects and available design covariates, Z is the incidence matrix linking genomic effects to observations, g is the vector of genomic breeding values, and e is the residual error vector.

The genomic effects were assumed to follow a multivariate normal distribution:

g ~ N(0, Gσ_g²)

where G is the genomic relationship matrix and σ_g² is the additive genetic variance. Residual errors were assumed to follow:

e ~ N(0, Iσ_e²)

where I is the identity matrix and σ_e² is the residual variance. Here, N denotes a multivariate normal distribution with the specified mean vector and covariance matrix. Fixed effects and available design covariates were fitted explicitly through the fixed-effect design matrix X, following standard genomic evaluation practice in aquaculture species. GBLUP was implemented using the DMU software v6 (https://dmu.ghpc.au.dk/dmu/, accessed on 5 February 2026) [35].
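The mixed-model formulation can be made concrete with a small NumPy sketch. Several assumptions are made here that the text does not state: G is built with a VanRaden-type centering and scaling, every individual has a phenotype (Z = I), and the variance ratio σ_e²/σ_g² is taken from an assumed heritability rather than estimated by REML as a package such as DMU would do. The function names are illustrative only.

```python
import numpy as np

def vanraden_g(M):
    """Genomic relationship matrix from (n, m) genotypes coded 0/1/2.

    Assumes a VanRaden-type construction: center by twice the allele
    frequency and scale by 2 * sum(p * (1 - p)).
    """
    p = M.mean(axis=0) / 2.0
    W = M - 2.0 * p
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_solve(y, X, G, h2):
    """Solve y = X beta + g + e for beta-hat and g-hat (Z = I).

    h2 is an assumed heritability; lam = sigma_e^2 / sigma_g^2.
    """
    n = y.shape[0]
    lam = (1.0 - h2) / h2
    V = G + lam * np.eye(n)  # phenotypic covariance divided by sigma_g^2
    Vinv_y = np.linalg.solve(V, y)
    Vinv_X = np.linalg.solve(V, X)
    # Generalized least squares estimate of the fixed effects.
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
    # BLUP of genomic values: G V^{-1} (y - X beta).
    g_hat = G @ np.linalg.solve(V, y - X @ beta)
    return beta, g_hat

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
M_sim = rng.binomial(2, 0.3, size=(20, 50)).astype(float)
y_sim = rng.normal(size=20)
X_sim = np.ones((20, 1))  # intercept only
G_sim = vanraden_g(M_sim)
beta_hat, g_hat = gblup_solve(y_sim, X_sim, G_sim, h2=0.25)
```

In a cross-validation setting, validation phenotypes would be masked and their genomic values predicted through the off-diagonal blocks of G; that step is left out to keep the sketch short.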

2.3.2. Bayesian Models

Bayesian genomic prediction models were fitted using a marker-effect formulation within a linear mixed-model framework:

y = X β + \sum_{i = 1}^{m} Z_{i} g_{i} + e

where y is the vector of phenotypic observations, Xβ represents fixed effects and available design covariates, Zᵢ is the genotype vector of the i-th SNP (coded as 0, 1, and 2), gᵢ denotes the effect of the i-th SNP (marker effect), m is the total number of markers, and e is the residual error term. Fixed effects and covariates were fitted explicitly through the fixed-effect design matrix X, consistent with standard genomic prediction practice.

Genomic estimated breeding values (GEBVs) were calculated at the individual level as the linear combination of SNP effects weighted by marker genotypes across all loci:

{GEBV}_{j} = \sum_{i = 1}^{m} Z_{j i} g_{i}

where Z_ji denotes the genotype of individual j at SNP i. Thus, SNP effects (g_i) are model parameters, whereas GEBVs are derived individual-level predictions.
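As a small numerical illustration of this sum, the sketch below computes GEBVs for three individuals and three SNPs. The genotype matrix and marker effects are hypothetical values chosen for the example, not estimates from the study.

```python
import numpy as np

# Hypothetical genotypes (rows: individuals j, columns: SNPs i), coded 0/1/2.
Z = np.array([[0, 1, 2],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)

# Hypothetical marker effects g_i, e.g. posterior means from a Bayesian model.
g = np.array([0.5, -0.2, 0.1])

# GEBV_j = sum_i Z_ji * g_i, i.e. a matrix-vector product.
gebv = Z @ g
```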

The core distinction among Bayesian models lies in the assumed prior distributions of SNP effects (gᵢ) and their variances, reflecting different assumptions about genetic architecture. In general, SNP effects follow a mixture prior:

g_{i} \mid (\pi, \sigma_{g}^{2}) \sim \left\{ \begin{array}{lll} 0, & \sigma_{g}^{2} = 0, & \pi \sim dist0 \\ g_{i} \mid \sigma_{g}^{2} \sim dist1, & \sigma_{g}^{2} \sim dist2, & (1 - \pi) \end{array} \right.

where dist0 represents the distribution for null-effect markers, dist1 denotes the distribution of non-zero marker effects, and dist2 specifies the prior distribution of marker-effect variances.

The model-specific prior assumptions are defined as follows.

BayesA: all SNP effects are non-zero with g_{i} \sim N(0, \sigma_{g}^{2}), and marker-specific variances follow a scaled inverse chi-square prior \chi^{-2}(\nu, S).

BayesB: a large proportion of SNPs have zero effect, while non-zero effects follow N(0, \sigma_{g}^{2}) with scaled inverse chi-square priors on variances.

BayesCπ: the proportion of SNPs with zero effect (π) is estimated from the data, with non-zero effects following a normal prior with a common variance.

BayesLASSO: SNP effects follow a double-exponential (Laplace) prior, which is equivalent to normal effects whose marker-specific variances follow an exponential prior.

The degrees of freedom (ν) and scale parameter (S) control the shrinkage intensity and are related to the assumed genetic architecture of the trait. Markov chain Monte Carlo (MCMC) sampling was run for 18,000 iterations, with the first 3000 iterations discarded as burn-in. Bayesian models were implemented using the BGLR R package (version 1.1.4) [36].

2.3.3. Machine Learning and Regularized Regression Models

In addition to linear mixed and Bayesian models, we evaluated several machine learning and regularized regression approaches to capture potential nonlinear relationships and complex interactions between SNP markers and phenotypes.

For these models, fixed environmental and design effects were not modeled as separate explicit terms within the prediction algorithms. Instead, phenotypes were pre-adjusted prior to model training by fitting a linear model including fixed environmental and design effects within each training fold of the cross-validation procedure, and the resulting residuals were used as response variables. This preprocessing step reduces environmental confounding while maintaining a consistent genotype-based input structure across machine learning models for fair comparative benchmarking.
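The pre-adjustment step can be sketched as follows. The text states that the linear model is fitted within each training fold; applying the training-fold coefficients to the validation phenotypes, as done below, is an assumption of this sketch, as are the function name and the design-matrix layout.

```python
import numpy as np

def adjust_phenotypes(y, F, train_idx, test_idx):
    """Residualize phenotypes on fixed effects using the training fold only.

    y: (n,) phenotypes; F: (n, k) fixed-effect design matrix including an
    intercept column. Coefficients are estimated on training rows and then
    applied unchanged to validation rows, so no validation phenotype
    influences the adjustment.
    """
    coef, *_ = np.linalg.lstsq(F[train_idx], y[train_idx], rcond=None)
    resid_train = y[train_idx] - F[train_idx] @ coef
    resid_test = y[test_idx] - F[test_idx] @ coef
    return resid_train, resid_test

# Hypothetical example: one covariate plus intercept, noise-free phenotypes.
x_cov = np.arange(8, dtype=float)
F_demo = np.column_stack([np.ones(8), x_cov])
y_demo = 2.0 + 3.0 * x_cov
res_tr, res_te = adjust_phenotypes(y_demo, F_demo, np.arange(6), np.arange(6, 8))
```

Fitting the adjustment on all individuals at once would let validation phenotypes shape the response used for training, which is the leakage the fold-wise procedure is meant to avoid.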

Elastic Net (EN)

Elastic Net (EN) is a regularized linear regression method that combines the L₁ (LASSO) and L₂ (ridge) penalties, allowing simultaneous feature selection and coefficient shrinkage. By balancing sparsity and stability, EN is particularly effective in high-dimensional settings with correlated predictors, which are common in genomic datasets [37].

The EN estimator solves:

\min_{\beta} \left( \frac{1}{2N} \sum_{i=1}^{N} (y_{i} - X_{i}^{T} \beta)^{2} + \alpha \left[ (1 - \lambda) \|\beta\|_{1} + \lambda \|\beta\|_{2}^{2} \right] \right)

where α > 0 controls the overall regularization strength and λ ∈ [0, 1] determines the balance between the L₁ and L₂ penalties (as the objective is written, λ = 0 corresponds to LASSO and λ = 1 to ridge regression). This formulation enables EN to select groups of correlated SNPs while maintaining stable coefficient estimates, making it well suited for genomic prediction.
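To make the roles of α and λ explicit, the sketch below evaluates the objective exactly as it is displayed, so the factor (1 − λ) multiplies the L₁ term and λ multiplies the squared L₂ term. The function name and the toy inputs are illustrative; note that scikit-learn's `ElasticNet` uses a different parametrization (`alpha` and `l1_ratio`, where `l1_ratio` weights the L₁ term), so values do not map one-to-one.

```python
import numpy as np

def elastic_net_objective(beta, X, y, alpha, lam):
    """Value of the displayed Elastic Net objective.

    loss    = (1 / (2N)) * sum_i (y_i - X_i^T beta)^2
    penalty = alpha * [(1 - lam) * ||beta||_1 + lam * ||beta||_2^2]
    """
    n = X.shape[0]
    resid = y - X @ beta
    loss = resid @ resid / (2.0 * n)
    l1 = np.abs(beta).sum()
    l2_sq = beta @ beta
    return loss + alpha * ((1.0 - lam) * l1 + lam * l2_sq)

# Hypothetical toy inputs.
X_en = np.eye(2)
y_en = np.array([1.0, 2.0])
beta_en = np.array([1.0, -2.0])
obj_l1_only = elastic_net_objective(beta_en, X_en, y_en, alpha=0.5, lam=0.0)
obj_l2_only = elastic_net_objective(beta_en, X_en, y_en, alpha=0.5, lam=1.0)
obj_mixed = elastic_net_objective(beta_en, X_en, y_en, alpha=0.5, lam=0.5)
```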

Nonlinear Machine Learning Models

To further explore potential nonlinear effects and higher-order interactions between SNP markers and phenotypes, we evaluated several nonlinear machine learning models, including support vector regression (SVR) [38], random forest (RF) [39], kernel ridge regression (KRR) [40], and extreme gradient boosting (XGB) [41]. Kernel-based methods (SVR and KRR) and ensemble-based methods (RF and XGB) were used to model complex patterns in high-dimensional genomic data.

Hyperparameters for all machine learning and regularized regression models were optimized using grid search combined with repeated five-fold cross-validation strictly within each training fold. For each model, a predefined grid of candidate hyperparameters was evaluated, and the optimal parameter combination was selected based on average prediction accuracy across inner cross-validation runs. To prevent information leakage and ensure fair comparison, hyperparameter tuning, model training, and performance evaluation were strictly separated between training and validation folds.
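The separation between tuning and evaluation can be sketched with scikit-learn as a nested loop: an outer split provides the validation fold, and the grid search runs only on the corresponding training fold. The choice of kernel ridge regression, the small grid, the number of inner repeats, and the simulated data are all assumptions for illustration and do not reproduce the search ranges in Table 2.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold

def pearson_r(y_true, y_pred):
    """Pearson correlation, used here as the tuning and evaluation score."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def nested_cv_scores(X, y, param_grid, seed=0):
    """Outer 5-fold CV with an inner repeated 5-fold grid search."""
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    inner = RepeatedKFold(n_splits=5, n_repeats=2, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X):
        search = GridSearchCV(
            KernelRidge(kernel="rbf"),
            param_grid,
            scoring=make_scorer(pearson_r),
            cv=inner,
        )
        # Tuning sees only the training fold of the outer split.
        search.fit(X[train_idx], y[train_idx])
        # The refitted best model is scored once on the held-out fold.
        scores.append(pearson_r(y[test_idx], search.predict(X[test_idx])))
    return scores

# Hypothetical simulated genotypes and phenotypes.
rng = np.random.default_rng(0)
X_ml = rng.binomial(2, 0.3, size=(60, 20)).astype(float)
y_ml = X_ml @ rng.normal(size=20) + rng.normal(size=60)
grid = {"alpha": [0.1, 1.0], "gamma": [0.01, 0.1]}
outer_scores = nested_cv_scores(X_ml, y_ml, grid)
```

Because the held-out fold never enters `search.fit`, the outer scores are not inflated by the hyperparameter choice.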

All machine learning models were implemented using the scikit-learn package (version 1.8.0) for Python 3.8 [42]. The full list of hyperparameters and their corresponding search ranges for each model is provided in Table 2. The analysis code pipeline and model training scripts can be accessed via the following link: https://sandbox.zenodo.org/records/435718, accessed on 5 February 2026.

2.4. Incremental Feature Selection Based on GWAS

GWAS-assisted SNP prioritization and subset construction were performed within a nested cross-validation framework to avoid information leakage. For each cross-validation split, the reference (training) population was used to perform GWAS, while validation individuals were excluded prior to association analysis. All SNP ranking and subset construction steps were confined strictly to the training folds.

SNPs were ranked according to GWAS association statistics (p values) derived from the training set and were used for feature prioritization rather than formal locus discovery. Therefore, SNP inclusion was not restricted to only genome-wide significant markers. The genome-wide significance threshold used for GWAS reporting is provided for reference, but SNP subset construction followed a ranking-based strategy rather than a significance-threshold cutoff.
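A training-only ranking step might look like the sketch below. It is a deliberate simplification and not the GWAS model used in the study: each SNP is scored by the absolute Pearson correlation between genotype and phenotype in the training set, which yields the same ordering as the p values of a simple single-marker regression without covariates or relationship correction. The function name is hypothetical.

```python
import numpy as np

def rank_snps_training_only(M, y, train_idx):
    """Rank SNPs by marginal association computed on training rows only.

    M: (n, m) genotypes coded 0/1/2; y: (n,) phenotypes.
    Returns SNP indices ordered from strongest to weakest association.
    Validation rows are never read, so they cannot affect the ranking.
    """
    Mt = M[train_idx].astype(float)
    yt = y[train_idx].astype(float)
    Mc = Mt - Mt.mean(axis=0)
    yc = yt - yt.mean()
    num = Mc.T @ yc
    denom = np.sqrt((Mc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Monomorphic SNPs (zero variance) receive a correlation of zero.
    r = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    return np.argsort(-np.abs(r))

# Hypothetical data in which SNP 3 fully determines the phenotype.
rng = np.random.default_rng(1)
M_gw = rng.binomial(2, 0.4, size=(40, 10)).astype(float)
y_gw = M_gw[:, 3].copy()
train_rows = np.arange(30)
snp_order = rank_snps_training_only(M_gw, y_gw, train_rows)
```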

To identify SNP subset sizes that optimize predictive performance, an incremental feature selection pipeline was applied within each training fold. Starting from the top-ranked SNPs, feature sets were expanded progressively as follows: +1 SNP increments up to 100 markers, +5 up to 500, +10 up to 1000, +50 up to 5000, and +100 SNPs thereafter until all markers were included.
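The expansion schedule can be written down as a short generator of subset sizes. The exact band boundaries are an interpretation: this sketch starts each band one marker past the previous limit (1 to 100 in steps of 1, then 101, 106, ..., then 501, 511, ..., and so on) and always appends the full marker set as the final size. The grid actually used in the study may differ in detail.

```python
def subset_sizes(n_markers):
    """Candidate SNP subset sizes for incremental feature selection.

    Bands (start, upper limit, step) follow one reading of the schedule:
    +1 up to 100, +5 up to 500, +10 up to 1000, +50 up to 5000, +100 after.
    Requires n_markers >= 1.
    """
    bands = [(1, 100, 1), (101, 500, 5), (501, 1000, 10), (1001, 5000, 50)]
    sizes = []
    for start, limit, step in bands:
        sizes.extend(range(start, min(limit, n_markers) + 1, step))
    sizes.extend(range(5001, n_markers + 1, 100))
    # Make sure the complete marker set is evaluated as the last subset.
    if not sizes or sizes[-1] != n_markers:
        sizes.append(n_markers)
    return sizes

# Marker counts after quality control, as reported in Section 2.2.
sizes_salmon = subset_sizes(10383)
sizes_carp = subset_sizes(8531)
sizes_trout = subset_sizes(37958)
```

Under this reading each species requires a few hundred model fits per training fold rather than one fit per possible marker count, which keeps the search tractable.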

For very small SNP subsets where the number of markers was substantially lower than the number of individuals, tree-based machine learning models (Random Forest) were used for subset evaluation because they are more stable under low-feature conditions than relationship-matrix-based linear models [43,44,45,46]. Larger SNP subsets were evaluated using the full set of genomic prediction models described above.

2.5. Genomic Prediction Performance Evaluation

To assess prediction efficiency, genomic prediction was evaluated using a 5-fold cross-validation (CV) scheme. Genotyped individuals were randomly partitioned into five approximately equal-sized subpopulations. In each CV round, one subpopulation was used as the validation set, while the remaining four formed the reference population. This process ensured that each subpopulation served once as the validation set. The entire 5-fold CV procedure was repeated 20 times for all scenarios. Prediction accuracy was calculated as r (y, GEBV)/√(h²), where r (y, GEBV) is the correlation between observed phenotypes and predicted genomic breeding values, and h² is the trait heritability estimated independently (Table 1). Thus, prediction accuracy is distinct from heritability itself.

Prediction bias was evaluated using the regression coefficient (b) of observed phenotypes on predicted GEBVs, summarized as |1 − b|. This coefficient was used as a calibration diagnostic measure of prediction dispersion rather than for formal hypothesis testing, which is not routinely applied to cross-validation–derived regression coefficients in genomic prediction studies. Additionally, mean squared error (MSE) and mean absolute error (MAE) were computed to further evaluate model performance. MSE reflects the average squared deviation between y and GEBVs, whereas MAE represents the average absolute deviation [47].
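The evaluation metrics defined in this section can be collected in one helper, sketched below for a single validation fold. The function name and dictionary keys are illustrative; h² is passed in as an externally estimated value, as described above.

```python
import numpy as np

def evaluate_predictions(y, gebv, h2):
    """Accuracy, bias, MSE and MAE for one validation fold.

    accuracy = r(y, GEBV) / sqrt(h2)
    bias     = |1 - b|, with b the slope of the regression of y on GEBV
    """
    y = np.asarray(y, dtype=float)
    gebv = np.asarray(gebv, dtype=float)
    r = np.corrcoef(y, gebv)[0, 1]
    # Slope of y on GEBV: cov(y, GEBV) / var(GEBV).
    b = np.cov(y, gebv, ddof=1)[0, 1] / np.var(gebv, ddof=1)
    return {
        "accuracy": r / np.sqrt(h2),
        "b": b,
        "bias": abs(1.0 - b),
        "mse": np.mean((y - gebv) ** 2),
        "mae": np.mean(np.abs(y - gebv)),
    }
```

In the repeated cross-validation design, these quantities would be computed per fold and then averaged over the 5 folds and 20 repeats.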

3. Results

3.1. Population Genetic Structure

Principal component analysis (PCA) revealed distinct population genetic structures across the four aquaculture species (Figure 1). Atlantic salmon and common carp individuals formed relatively compact clusters with limited dispersion, indicating comparatively homogeneous genetic backgrounds within their respective breeding populations. In contrast, gilthead sea bream showed broader dispersion along the first two principal components, suggesting a higher level of genetic heterogeneity, potentially reflecting historical admixture or variable breeding origins.

Rainbow trout, which was characterized by the largest SNP dataset (37,958 SNPs), exhibited the most pronounced population structure, with individuals forming multiple subclusters that may correspond to distinct breeding groups or genetic lineages. Across all four species, the first two principal components explained between 12% and 25% of the total genetic variance, indicating that sufficient genetic diversity was present for downstream genomic prediction analyses. These differences in population structure imply variation in effective population size and breeding history among species, factors that may contribute to differences in genomic prediction performance.

3.2. Substantial Variation in Genomic Prediction Accuracy Across Species and Models

Genomic prediction accuracy varied substantially among the four species and the ten evaluated prediction models (Figure 2). Across species, traits with higher heritability tended to achieve higher prediction accuracies. The highest average accuracies were observed in rainbow trout (0.75–0.83) and common carp (0.73–0.85), corresponding to high (h² = 0.50) and moderate-to-high (h² = 0.26) heritability, respectively. Atlantic salmon (h² = 0.25) and gilthead sea bream (h² = 0.12) showed lower average accuracies, with ranges of 0.49–0.61 and 0.49–0.66, respectively, indicating that trait heritability was a primary determinant of prediction accuracy.

Model performance varied markedly within each species. In Atlantic salmon, most models achieved similar prediction accuracies clustered around 0.600–0.615. Elastic Net showed notably lower accuracy (0.487), while BayesB also performed slightly below the group average (0.588). In gilthead sea bream, kernel ridge regression (KRR) achieved the highest accuracy (0.664), followed by BayesCπ and XGB (both approximately 0.610), whereas random forest (RF) and Elastic Net yielded the lowest accuracies (approximately 0.522 and 0.495, respectively). In common carp, support vector regression (SVR) delivered the highest prediction accuracy (0.853), substantially outperforming all other models. KRR also performed well (0.822). GBLUP and Bayesian models formed an intermediate performance group (approximately 0.750–0.761), while RF and XGB showed slightly lower accuracies (approximately 0.730). In rainbow trout, most models achieved high prediction accuracies exceeding 0.81. RF (0.833), Elastic Net (0.832), and SVR (0.830) were the top-performing models, whereas KRR was a clear outlier with a lower accuracy of 0.747.

In summary, no single model consistently outperformed all others across the four species. Elastic Net showed highly variable performance, ranking among the weakest models in Atlantic salmon and gilthead sea bream but among the strongest in rainbow trout. Similarly, KRR achieved the highest accuracy in gilthead sea bream but the lowest accuracy in rainbow trout. Overall, these results suggest that machine learning-based models have considerable potential to enhance genomic prediction accuracy, although their effectiveness appears to be species dependent.

3.3. Assessment of Prediction Bias Across Species and Models

Prediction bias, quantified as the absolute deviation of the regression coefficient (b) from unity, provided complementary information on model performance beyond prediction accuracy (Figure 3). Across all species, the GBLUP model consistently exhibited the lowest bias, with regression coefficients closest to 1.00, consistent with its theoretical properties as a best linear unbiased predictor. In contrast, other models displayed varying degrees of bias depending on species and trait. For example, in gilthead sea bream, the high-accuracy KRR model showed b < 1, indicating over-dispersed (inflated) predictions, whereas Bayesian models in Atlantic salmon showed b slightly above 1, indicating mildly under-dispersed (deflated) predictions. Among machine learning approaches, SVR and RF generally produced less biased predictions than XGB and Elastic Net in several species. Overall, these results indicate a trade-off between prediction accuracy and calibration: models achieving the highest predictive correlations (e.g., KRR in gilthead sea bream and SVR in common carp) sometimes did so at the cost of increased prediction bias, which may have implications for long-term selection response if not properly accounted for. In addition to prediction accuracy and bias, model performance was further evaluated using mean squared error (MSE) and mean absolute error (MAE). The overall patterns observed for MSE and MAE were largely consistent with those based on prediction accuracy, providing complementary support for the robustness of the model comparisons (Figures S1 and S2).

3.4. Optimization of Genomic Prediction Through Incremental Feature Selection

To evaluate how marker density influences predictive performance, we assessed changes in prediction accuracy across progressively expanded SNP subsets constructed using GWAS-based ranking.

Incremental feature selection based on GWAS results showed that, in most species, genomic prediction accuracy could be improved by using a carefully selected subset of highly ranked SNPs rather than the complete marker set (Figure 4). In three of the four species examined, the maximum prediction accuracy achieved with an optimal SNP subset exceeded that obtained using all available markers, although the magnitude of improvement and the size of the optimal SNP subset varied substantially among species.

In Atlantic salmon, peak prediction accuracy was achieved using the top 9.64% of SNPs (1001 out of 10,383), resulting in a 2.8% relative improvement compared with the full SNP set. In common carp, the optimal model used only the top 4.58% of SNPs (391 out of 8531), yielding a 3.1% relative improvement. In rainbow trout, the highest accuracy was obtained using an exceptionally sparse subset comprising just 0.54% of SNPs (206 out of 37,958), corresponding to a 4.2% relative improvement. In contrast, no improvement in prediction accuracy was observed in gilthead sea bream when using selected SNP subsets, and the full marker set remained the most effective.

Overall, these results demonstrate that prediction accuracy does not increase monotonically with marker number and highlight the potential advantage of incremental feature selection in enhancing genomic prediction by reducing noise from low-information SNPs. Importantly, the optimal feature set size is strongly dependent on species and trait architecture.

4. Discussion

In this study, rather than introducing new genomic prediction algorithms, we focused on establishing a unified and comparable evaluation framework to systematically assess the performance of GBLUP, Bayesian, and machine learning genomic prediction models across four representative aquaculture species. Our results demonstrate that no single model was universally optimal across all species–trait combinations, and that relative model performance was strongly influenced by trait heritability, population genetic structure, and SNP marker characteristics. In addition, incremental feature selection improved prediction accuracy in three of the four species, although its benefits were highly dependent on the underlying genetic architecture of the species and traits examined. Together, these findings illustrate the applicability and limitations of different genomic prediction models in multi-species contexts and provide empirical guidance for optimizing genomic selection strategies in aquaculture breeding programs.

It is important to clarify the methodological role of GWAS in the present study. GWAS analyses were not conducted for formal locus discovery or biological interpretation but were used exclusively as a statistical tool for SNP ranking and feature prioritization within a strictly nested cross-validation framework. Accordingly, SNP inclusion was based on relative association strength estimated from the training population only, rather than on a fixed genome-wide significance threshold. This ranking-based strategy was designed to optimize predictive performance by reducing noise from low-information markers while explicitly avoiding information leakage between training and validation sets. Therefore, the GWAS results presented here should not be interpreted as evidence of causal loci underlying the studied traits. We acknowledge that ranking-based SNP prioritization does not substitute for biologically driven locus discovery and that alternative ranking metrics or thresholding strategies may yield different optimal SNP subsets. Integrating functional annotation or multi-trait association information may further enhance biological interpretability in future studies.

Our results indicate that trait heritability is one of the primary drivers of genomic prediction accuracy, consistent with classical quantitative genetic theory and numerous empirical studies across plant and animal species [48,49]. In general, traits with higher heritability—such as survival time in rainbow trout and body weight in common carp—achieved higher prediction accuracies, whereas traits with low to moderate heritability—such as survival in gilthead sea bream and gill health in Atlantic salmon—showed more limited predictive performance. For low-heritability traits, increasing model complexity alone is often insufficient to substantially improve prediction accuracy. Instead, future breeding strategies should prioritize improvements in phenotyping precision, reduction in environmental noise and expansion of the reference population size [20,50]. Population genetic structure analyses further revealed pronounced differences in genetic diversity and structure among species, which in turn influenced prediction performance. Rainbow trout and common carp populations exhibited more dispersed genetic structures and higher levels of genetic variation and correspondingly achieved higher prediction accuracies across most models. This result suggests that larger effective population sizes and richer genetic diversity facilitate the construction of stable genomic relationships and enhance prediction accuracy. In contrast, gilthead sea bream populations, despite showing relatively dispersed structure, displayed lower prediction accuracy overall, likely due to the low heritability of the target trait. These results collectively indicate that population genetic background and genetic diversity constitute fundamental factors shaping genomic prediction outcomes and should be carefully evaluated prior to model selection [47,51].

Our findings clearly demonstrate that no genomic prediction model performs optimally across all species and traits. GBLUP exhibited stable performance and relatively low prediction bias across all four species, confirming its robustness as a baseline model for genomic selection [52,53]. However, for traits with low heritability or more complex genetic architectures, such as disease resistance in gilthead sea bream, GBLUP was outperformed by certain Bayesian and machine learning models [54]. Notably, the performance of machine learning models in this study showed pronounced polarization [21,55]. Support vector regression achieved the highest prediction accuracy in common carp, while random forest and elastic net performed well in rainbow trout, suggesting that these models may have captured underlying non-linear relationships or higher-order interactions in specific traits. However, machine learning models generally exhibited lower accuracy and greater instability in gilthead sea bream and Atlantic salmon. This pattern indicates that when sample sizes are limited or genetic signals are weak, complex models are more prone to learning noise rather than true biological signal [56]. These results highlight that model choice should be guided by trait genetic architecture and data characteristics rather than by algorithmic complexity alone. They also indicate that machine learning approaches retain substantial potential to further improve genomic prediction accuracy.

Incremental SNP feature selection analyses revealed that prediction accuracy does not increase linearly with marker number but instead exhibits clear saturation effects [57,58]. In rainbow trout, common carp, and Atlantic salmon, using only a small proportion of SNPs (0.54%, 4.58%, and 9.64% of the full marker set, respectively) resulted in higher prediction accuracy than using the complete genome-wide marker panel, with relative improvements of 2.8–4.2%. This suggests that a substantial proportion of the genetic variance underlying these traits can be explained by a limited subset of markers with higher predictive relevance. Removing large numbers of low-information or noisy markers effectively increased the signal-to-noise ratio of the prediction models. These findings are consistent with recent reports in other aquaculture species. For example, a study in Russian sturgeon (Acipenser gueldenstaedtii) showed that an incremental feature selection strategy based on GWAS p-value ranking significantly improved prediction accuracy for caviar yield, color, and body weight, with approximately 3.12% of top SNPs outperforming the full marker set [46]. However, incremental feature selection was not universally beneficial. In the present study, no accuracy gains were observed for the disease resistance trait in gilthead sea bream, further supporting the conclusion that neither prediction models nor feature selection strategies are universally optimal. Future studies should evaluate the performance of incremental feature selection across a broader range of species and traits, and should leverage parallel computing approaches to improve computational efficiency, particularly for complex traits with low heritability.
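As a worked illustration of the percentages above, the short sketch below converts the reported optimal fractions into approximate SNP counts using the QC SNP totals in Table 1 (roughly 205, 391, and 1001 markers for rainbow trout, common carp, and Atlantic salmon, respectively). It also shows how a relative improvement is computed. The helper names and the accuracy values passed to `relative_gain_pct` and `best_subset` are hypothetical and are not results from this study.

```python
# QC SNP totals from Table 1 and optimal subset fractions (%) from the text.
qc_snps = {"Rainbow trout": 37958, "Common carp": 8531, "Atlantic salmon": 10383}
optimal_pct = {"Rainbow trout": 0.54, "Common carp": 4.58, "Atlantic salmon": 9.64}

def approx_subset_size(total_snps, pct):
    """Approximate number of SNPs corresponding to a percentage of the panel."""
    return round(total_snps * pct / 100)

def relative_gain_pct(acc_subset, acc_full):
    """Relative improvement (%) of a subset model over the full-panel model."""
    return 100 * (acc_subset - acc_full) / acc_full

def best_subset(acc_by_size):
    """Subset size with the highest accuracy in an incremental sweep."""
    return max(acc_by_size, key=acc_by_size.get)

for species, total in qc_snps.items():
    print(species, approx_subset_size(total, optimal_pct[species]))

# Hypothetical sweep: accuracy rises, peaks, then declines toward the full panel.
sweep = {100: 0.78, 200: 0.82, 1000: 0.81, 37958: 0.80}
k = best_subset(sweep)
print(k, round(relative_gain_pct(sweep[k], sweep[37958]), 1))
```

The SNP counts are approximate because the reported percentages are rounded.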

5. Conclusions

This study provides a unified comparison of genomic prediction performance across four representative aquaculture species using GBLUP, Bayesian, and machine learning models. Prediction accuracy varied markedly among species and traits and was strongly influenced by trait heritability, population genetic characteristics, and model choice, with no single model performing optimally across all scenarios. Machine learning approaches achieved superior performance in specific cases but showed pronounced species-dependent variability, whereas GBLUP consistently delivered stable and well-calibrated predictions. Incremental SNP feature selection improved prediction accuracy in several species by reducing noise from low-information markers, although its effectiveness depended on trait genetic architecture. Overall, these results demonstrate that genomic prediction performance varies substantially across species and traits, underscoring the need to tailor model selection and marker strategies to specific breeding scenarios in aquaculture.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/fishes11020115/s1, Figure S1: Mean squared error (MSE) of different genomic prediction models across four aquaculture species; Figure S2: Mean absolute error (MAE) of different genomic prediction models across four aquaculture species.

Author Contributions

Conceptualization, H.S. and J.Z.; methodology and writing—review and editing, H.S.; writing—original draft preparation, J.Z.; data curation, X.Y. and W.W.; investigation, S.X.; supervision, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 32341059); the Excellent Young Scientist Program of the Beijing Academy of Agriculture and Forestry Sciences (Grant No. YKPY2025004) and the Innovation Capacity Building Project of the Beijing Academy of Agriculture and Forestry Sciences (Grant No. KJCX20240310).

Institutional Review Board Statement

Not applicable for studies not involving humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The genotype and phenotype data can be accessed at https://www.g3journal.org/content/8/4/1195.supplemental (Atlantic salmon), https://www.g3journal.org/content/6/11/3693.supplemental (gilthead sea bream), https://figshare.com/articles/dataset/Supplemental_Material_for_Palaiokostas_et_al_2018/6281561 (common carp), and https://figshare.com/articles/Untitled_Item/7725668 (rainbow trout) (accessed on 5 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

XGB	eXtreme Gradient Boosting
MAE	mean absolute error
MSE	mean squared error
GBLUP	Genomic Best Linear Unbiased Prediction
SVR	Support Vector Regression
KRR	Kernel Ridge Regression
RF	Random Forest

References

1. Ruben, M.O.; Akinsanola, A.B.; Okon, M.E.; Shitu, T.; Jagunna, I.I. Emerging challenges in aquaculture: Current perspectives and human health implications. Vet. World 2025, 18, 15–28.
2. Matias, A.C.; Andrade, C. New Challenges in Marine Aquaculture Research. J. Mar. Sci. Eng. 2025, 13, 324.
3. Gui, J.-F. Chinese wisdom and modern innovation of aquaculture. Water Biol. Secur. 2024, 3, 100271.
4. Henderson, C.R. Best linear unbiased estimation and prediction under a selection model. Biometrics 1975, 31, 423–447.
5. Soller, M.; Beckmann, J. Genetic polymorphism in varietal identification and genetic improvement. Theor. Appl. Genet. 1983, 67, 25–33.
6. Meuwissen, T.H.; Hayes, B.J.; Goddard, M. Prediction of total genetic value using genome-wide dense marker maps. Genetics 2001, 157, 1819–1829.
7. Song, H.; Dong, T.; Yan, X.; Wang, W.; Zhang, Q.; Hu, H. Advancing aquaculture breeding through genomic selection: Models, tools, and challenges. Water Biol. Secur. 2025, 100494, in press.
8. Kang, Z.; Kong, J.; Li, Q.; Sui, J.; Dai, P.; Luo, K.; Meng, X.; Chen, B.; Cao, J.; Tan, J. Genomic selection for hard-to-measure traits in aquaculture: Challenges in balancing genetic gain and diversity. Aquaculture 2025, 606, 742576.
9. Zenger, K.R.; Khatkar, M.S.; Jones, D.B.; Khalilisamani, N.; Jerry, D.R.; Raadsma, H.W. Genomic selection in aquaculture: Application, limitations and opportunities with special reference to marine shrimp and pearl oysters. Front. Genet. 2019, 9, 693.
10. García-Ruiz, A.; Cole, J.B.; VanRaden, P.M.; Wiggans, G.R.; Ruiz-López, F.J.; Van Tassell, C.P. Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proc. Natl. Acad. Sci. USA 2016, 113, E3995–E4004.
11. Goddard, M.; Hayes, B. Genomic selection. J. Anim. Breed. Genet. 2007, 124, 323–330.
12. Jones, H.E.; Wilson, P.B. Progress and opportunities through use of genomics in animal production. Trends Genet. 2022, 38, 1228–1252.
13. Meuwissen, T.; Hayes, B.; Goddard, M. Genomic selection: A paradigm shift in animal breeding. Anim. Front. 2016, 6, 6–14.
14. Song, H.; Hu, H. Strategies to improve the accuracy and reduce costs of genomic prediction in aquaculture species. Evol. Appl. 2022, 15, 578–590.
15. Pang, Z.; Wang, W.; Zhang, H.; Qiao, L.; Liu, J.; Pan, Y.; Yang, K.; Liu, W. Mutual information-based best linear unbiased prediction for enhanced genomic prediction accuracy. J. Anim. Sci. 2025, 103, skaf250.
16. Sahebalam, H.; Gholizadeh, M.; Hafezian, S.H. The effect of different approaches to determining the regularization parameter of bayesian LASSO on the accuracy of genomic prediction. Mamm. Genome 2025, 36, 331–345.
17. Wang, Z.; Hu, H.; Sun, T.; Li, X.; Lv, G.; Bai, Z.; Li, J. Genomic selection for improvement of growth traits in triangle sail mussel (Hyriopsis cumingii). Aquaculture 2022, 561, 738692.
18. Erbe, M.; Hayes, B.; Matukumalli, L.; Goswami, S.; Bowman, P.; Reich, C.; Mason, B.; Goddard, M. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 2012, 95, 4114–4129.
19. Song, H.; Wang, W.; Dong, T.; Yan, X.; Geng, C.; Bai, S.; Hu, H. Prioritized SNP Selection from Whole-Genome Sequencing Improves Genomic Prediction Accuracy in Sturgeons Using Linear and Machine Learning Models. Int. J. Mol. Sci. 2025, 26, 7007.
20. Crossa, J.; Montesinos-Lopez, O.A.; Costa-Neto, G.; Vitale, P.; Martini, J.W.; Runcie, D.; Fritsche-Neto, R.; Montesinos-Lopez, A.; Pérez-Rodríguez, P.; Gerard, G. Machine learning algorithms translate big data into predictive breeding accuracy. Trends Plant Sci. 2025, 30, 167–184.
21. Wang, Y.; Ni, P.; Sturrock, M.; Zeng, Q.; Wang, B.; Bao, Z.; Hu, J. Deep learning for genomic selection of aquatic animals. Mar. Life Sci. Technol. 2024, 6, 631–650.
22. VanRaden, P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008, 91, 4414–4423.
23. Su, G.; Christensen, O.F.; Janss, L.; Lund, M.S. Comparison of genomic predictions using genomic relationship matrices built with different weighting factors to account for locus-specific variances. J. Dairy Sci. 2014, 97, 6547–6559.
24. Meher, P.K.; Rustgi, S.; Kumar, A. Performance of Bayesian and BLUP alphabets for genomic prediction: Analysis, comparison and results. Heredity 2022, 128, 519–530.
25. DeVito, R.; Gymrek, M. Modeling nonlinear and interaction effects of spatiotemporal and other non-genetic factors improves phenotypic prediction for complex traits. medRxiv 2025.
26. Shokor, F.; Croiseau, P.; Gangloff, H.; Saintilan, R.; Tribout, T.; Mary-Huard, T.; Cuyabano, B. Deep learning and genomic best linear unbiased prediction integration: An approach to identify potential nonlinear genetic relationships between traits. J. Dairy Sci. 2025, 108, 6174–6189.
27. Robledo, D.; Matika, O.; Hamilton, A.; Houston, R.D. Genome-wide association and genomic selection for resistance to amoebic gill disease in Atlantic salmon. G3 Genes Genomes Genet. 2018, 8, 1195–1203.
28. Palaiokostas, C.; Robledo, D.; Vesely, T.; Prchal, M.; Pokorova, D.; Piackova, V.; Pojezdal, L.; Kocour, M.; Houston, R.D. Mapping and sequencing of a significant quantitative trait locus affecting resistance to koi herpesvirus in common carp. G3 Genes Genomes Genet. 2018, 8, 3507–3513.
29. Palaiokostas, C.; Ferraresso, S.; Franch, R.; Houston, R.D.; Bargelloni, L. Genomic prediction of resistance to pasteurellosis in gilthead sea bream (Sparus aurata) using 2b-RAD sequencing. G3 Genes Genomes Genet. 2016, 6, 3693–3700.
30. Rodríguez, F.H.; Flores-Mara, R.; Yoshida, G.M.; Barría, A.; Jedlicki, A.M.; Lhorente, J.P.; Reyes-López, F.; Yáñez, J.M. Genome-wide association analysis for resistance to infectious pancreatic necrosis virus identifies candidate genes involved in viral replication and immune response in rainbow trout (Oncorhynchus mykiss). G3 Genes Genomes Genet. 2019, 9, 2897–2904.
31. Houston, R.D.; Taggart, J.B.; Cézard, T.; Bekaert, M.; Lowe, N.R.; Downing, A.; Talbot, R.; Bishop, S.C.; Archibald, A.L.; Bron, J.E. Development and validation of a high density SNP genotyping array for Atlantic salmon (Salmo salar). BMC Genom. 2014, 15, 90.
32. Palti, Y.; Gao, G.; Liu, S.; Kent, M.; Lien, S.; Miller, M.; Rexroad, C., III; Moen, T. The development and characterization of a 57 K single nucleotide polymorphism array for rainbow trout. Mol. Ecol. Resour. 2015, 15, 662–672.
33. Browning, B.L.; Browning, S.R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 2009, 84, 210–223.
34. Chang, C.C.; Chow, C.C.; Tellier, L.C.; Vattikuti, S.; Purcell, S.M.; Lee, J.J. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience 2015, 4, 7.
35. Madsen, P.; Jensen, J.; Labouriau, R.; Christensen, O.F.; Sahana, G. DMU-a package for analyzing multivariate mixed models in quantitative genetics and genomics. In Proceedings of the 10th World Congress on Genetics Applied to Livestock Production (WCGALP), Vancouver, BC, Canada, 17–22 August 2014.
36. Pérez, P.; de Los Campos, G. Genome-wide regression and prediction with the BGLR statistical package. Genetics 2014, 198, 483–495.
37. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320.
38. Long, N.; Gianola, D.; Rosa, G.J.; Weigel, K.A. Application of support vector regression to genome-assisted prediction of quantitative traits. Theor. Appl. Genet. 2011, 123, 1065–1074.
39. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
40. Douak, F.; Melgani, F.; Benoudjit, N. Kernel ridge regression with active learning for wind speed prediction. Appl. Energy 2013, 103, 328–340.
41. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
42. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
43. Azodi, C.B.; Bolger, E.; McCarren, A.; Roantree, M.; de Los Campos, G.; Shiu, S.-H. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 Genes Genomes Genet. 2019, 9, 3691–3702.
44. González-Recio, O.; Forni, S. Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Genet. Sel. Evol. 2011, 43, 7.
45. Blondel, M.; Onogi, A.; Iwata, H.; Ueda, N. A ranking approach to genomic selection. PLoS ONE 2015, 10, e0128570.
46. Song, H.; Dong, T.; Wang, W.; Jiang, B.; Yan, X.; Geng, C.; Bai, S.; Xu, S.; Hu, H. Cost-effective genomic prediction of critical economic traits in sturgeons through low-coverage sequencing. Genomics 2024, 116, 110874.
47. Zhang, J.; Wei, Y.; Song, H.; Rong, Y.; Hu, W.; Chen, J.; Hu, H. Genomic selection and genome-wide association study for temperature-induced sex reversal trait in a combined common carp population. Aquaculture 2025, 612, 743274.
48. Kaler, A.S.; Purcell, L.C.; Beissinger, T.; Gillman, J.D. Genomic prediction models for traits differing in heritability for soybean, rice, and maize. BMC Plant Biol. 2022, 22, 87.
49. Zhao, W.; Zhang, Z.; Wang, Z.; Ma, P.; Pan, Y.; Wang, Q.; Zhang, Z. Factors affecting the accuracy of genomic prediction in joint pig populations. Animal 2023, 17, 100980.
50. Džermeikaitė, K.; Šidlauskaitė, M.; Antanaitis, R.; Anskienė, L. Enhancing Genomic Selection in Dairy Cattle Through Artificial Intelligence: Integrating Advanced Phenotyping and Predictive Models to Advance Health, Climate Resilience, and Sustainability. Dairy 2025, 6, 50.
51. Bian, Y.; Holland, J. Enhancing genomic prediction with genome-wide association studies in multiparental maize populations. Heredity 2017, 118, 585–593.
52. Zheng, W.; Zhang, Q.; He, J.; Han, B.; Zhang, Q.; Sun, D. Comparative evaluation of SNP-weighted, Bayesian, and machine learning models for genomic prediction in Holstein cattle. BMC Genom. 2025, 26, 1037.
53. Zhou, X.; Hong, Z.; Cui, W.; Zhang, Y.; Ikhwanuddin, M.; Ye, S.; Ma, H. Genetic parameters estimation and optimization of genomic selection in mud crab (Scylla paramamosain): A case study for growth-related traits. BMC Genom. 2025, 26, 1029.
54. Song, H.; Dong, T.; Yan, X.; Wang, W.; Tian, Z.; Hu, H. Using Bayesian threshold model and machine learning method to improve the accuracy of genomic prediction for ordered categorical traits in fish. Agric. Commun. 2023, 1, 100005.
55. Chafai, N.; Hayah, I.; Houaga, I.; Badaoui, B. A review of machine learning models applied to genomic prediction in animal breeding. Front. Genet. 2023, 14, 1150596.
56. Han, G.-R.; Goncharov, A.; Eryilmaz, M.; Ye, S.; Palanisamy, B.; Ghosh, R.; Lisi, F.; Rogers, E.; Guzman, D.; Yigci, D. Machine learning in point-of-care testing: Innovations, challenges, and opportunities. Nat. Commun. 2025, 16, 3165.
57. Heinrich, F.; Lange, T.M.; Kircher, M.; Ramzan, F.; Schmitt, A.O.; Gültas, M. Exploring the potential of incremental feature selection to improve genomic prediction accuracy. Genet. Sel. Evol. 2023, 55, 78.
58. Griot, R.; Allal, F.; Phocas, F.; Brard-Fudulea, S.; Morvezen, R.; Haffray, P.; François, Y.; Morin, T.; Bestin, A.; Bruant, J.-S. Optimization of genomic selection to improve disease resistance in two marine fishes, the European sea bass (Dicentrarchus labrax) and the gilthead sea bream (Sparus aurata). Front. Genet. 2021, 12, 665920.

Figure 1. Principal Component Analysis (PCA) plots of genetic variation for the four species. (A) Atlantic salmon (Salmo salar). (B) Gilthead sea bream (Sparus aurata). (C) Common carp (Cyprinus carpio). (D) Rainbow trout (Oncorhynchus mykiss). The axes represent the first two principal components (PC1 and PC2), with the percentage of explained variance indicated.

Figure 2. Prediction accuracy of different genomic prediction models across four species. Bars represent mean prediction accuracy and error bars indicate ± SE. All panels share the same y-axis scale to facilitate direct comparison among species. Notes: Prediction accuracy was calculated as r(y, GEBV)/√h², where h² denotes trait heritability. Model abbreviations: GBLUP (Genomic Best Linear Unbiased Prediction), BayesA, BayesB, BayesCπ, BayesLASSO, SVR (Support Vector Regression), KRR (Kernel Ridge Regression), RF (Random Forest), XGB (eXtreme Gradient Boosting), Elastic Net.

Figure 3. Prediction bias of genomic models across the four species. In the color bar, Regression corresponds to the regression coefficient b. Note: Each cell shows the regression coefficient (b) of observed phenotypes on predicted GEBVs obtained from cross-validation. Cell color corresponds to the absolute deviation from unity, |b − 1|, where lighter colors indicate lower bias and better calibration. The regression coefficient was used as a calibration diagnostic metric rather than for formal hypothesis testing. Model abbreviations: GBLUP, BayesA, BayesB, BayesCπ, BayesLASSO, SVR, KRR, RF, XGB, Elastic Net.
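The two evaluation metrics described in the notes to Figures 2 and 3 can be written compactly as follows. This is a minimal sketch of the stated definitions, accuracy as r(y, GEBV)/√h² and bias as the regression coefficient b of observed phenotypes on GEBVs with |b − 1| as the calibration deviation. The function names are hypothetical, and the toy inputs are illustrative rather than data from this study.

```python
import numpy as np

def prediction_accuracy(y, gebv, h2):
    """Pearson correlation between phenotypes and GEBVs, scaled by sqrt(h2)."""
    r = np.corrcoef(y, gebv)[0, 1]
    return r / np.sqrt(h2)

def regression_bias(y, gebv):
    """Slope b of y regressed on GEBV and its absolute deviation from 1."""
    b = np.cov(y, gebv)[0, 1] / np.var(gebv, ddof=1)
    return b, abs(b - 1.0)

# Toy validation set (illustrative values only).
y_obs = np.array([2.1, 3.4, 2.9, 4.0, 3.1, 2.5])
gebv = np.array([0.1, 0.6, 0.3, 0.9, 0.4, 0.2])
print(prediction_accuracy(y_obs, gebv, h2=0.25))
print(regression_bias(y_obs, gebv))
```

A slope below 1 indicates that the GEBVs are over-dispersed relative to the phenotypes, and a slope above 1 indicates that they are under-dispersed.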

Figure 4. Changes in prediction accuracy with increasing SNP subset size under a GWAS-based incremental feature selection framework. (A) Atlantic salmon; (B) Gilthead sea bream; (C) Common carp; (D) Rainbow trout. The horizontal red line shows the mean performance of the model trained on all SNPs, and the blue line marks the SNP subset size achieving the optimal prediction accuracy.

Table 1. Description of populations, target traits, and summary statistics used in genomic prediction analyses.

Species	Trait	N-obs	Mean ± SD	QC SNPs	Heritability ± SE
Atlantic salmon	Mean gill score	1481	2.79 ± 0.85	10,383	0.25 ± 0.06
Gilthead sea bream	Number of days to death	777	10.34 ± 4.09	8545	0.12 ± 0.06
Common carp	Body weight	1214	16.32 ± 4.58	8531	0.26 ± 0.06
Rainbow trout	Number of days to death	749	51.47 ± 13.98	37,958	0.50 ± 0.06

Note: Mean ± SD represents the mean and standard deviation of the analyzed trait listed in the Trait column for each species, describing the dispersion of observed phenotypic values within the population. Heritability is reported as estimate ± SE, where SE denotes the standard error of the heritability estimate and reflects parameter estimation uncertainty rather than population variability. Abbreviations: N-obs, number of observations; SD, standard deviation; SE, standard error; QC SNPs, number of SNPs retained after quality control.

Table 2. Optimal hyperparameter settings for machine learning methods.

Species	Methods ¹	Optimal Hyperparameters ²
Atlantic salmon	SVR	C: 1, gamma: auto, kernel: poly
	KRR	alpha: 10, gamma: 0.0001, kernel: poly
	RF	Max depth: 20, min samples leaf: 1, min samples split: 2, n estimators: 500
	XGB	Learning rate: 0.01, max depth: 5, n estimators: 500, subsample: 0.6
	Elastic Net	alpha: 0.1, l1_ratio: 0.5, max iter: 1000
Gilthead sea bream	SVR	C: 10, gamma: scale, kernel: rbf
	KRR	alpha: 100, gamma: 0.001, kernel: poly
	RF	Max depth: 30, min samples leaf: 4, min samples split: 2, n estimators: 100
	XGB	Learning rate: 0.01, max depth: 5, n estimators: 500, subsample: 0.6
	Elastic Net	alpha: 1, l1_ratio: 0.1, max iter: 1000
Common carp	SVR	C: 10, gamma: scale, kernel: rbf
	KRR	alpha: 1, gamma: 0.0001, kernel: rbf
	RF	Max depth: 30, min samples leaf: 4, min samples split: 2, n estimators: 200
	XGB	Learning rate: 0.01, max depth: 7, n estimators: 500, subsample: 0.6
	Elastic Net	alpha: 1, l1_ratio: 0.1, max iter: 1000
Rainbow trout	SVR	C: 100, gamma: scale, kernel: rbf
	KRR	alpha: 10, gamma: 0.0001, kernel: poly
	RF	Max depth: 20, min samples leaf: 4, min samples split: 5, n estimators: 100
	XGB	Learning rate: 0.01, max depth: 3, n estimators: 500, subsample: 0.8
	Elastic Net	alpha: 1, l1_ratio: 0.5, max iter: 1000

Note: ¹ SVR: support vector regression; KRR: kernel ridge regression; RF: random forest; XGB: extreme gradient boosting; Elastic Net: elastic net regression. ² Optimal hyperparameters: the best-performing settings for each machine learning method, obtained by grid search.
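A grid search of the kind summarized in Table 2 can be sketched with scikit-learn as shown below. The candidate grid, the number of folds, the scoring metric, and the simulated genotypes are all assumptions made for illustration. Only the parameter names (`C`, `gamma`, `kernel`) and some candidate values mirror Table 2, and the search space actually used in this study is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Simulated genotypes (0/1/2) and a phenotype driven by five SNPs (toy data).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(120, 40)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 1.0, 120)

# Hypothetical candidate grid; values overlap with those reported in Table 2.
param_grid = {
    "C": [1, 10, 100],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "poly"],
}

search = GridSearchCV(
    SVR(),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",  # assumed tuning criterion
)
search.fit(X, y)
best = search.best_params_
print(best)
```

To keep accuracy estimates free of optimistic bias, a search like this would normally be run on training folds only, with the outer validation fold held back for evaluation.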

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, J.; Yang, X.; Wang, W.; Hu, H.; Xu, S.; Song, H. A Unified Comparative Evaluation of Genomic Prediction Models Across Four Aquaculture Species. Fishes 2026, 11, 115. https://doi.org/10.3390/fishes11020115

