1. Introduction
Plecoglossus altivelis (ayu) belongs to the order Osmeriformes, family Plecoglossidae, and genus Plecoglossus [1] and is widely distributed in East Asia, especially in Japan, Korea, and China [2,3,4]. It exhibits a distinctive morphology characterized by an elongated, laterally compressed body, a hook-like downward-curving snout, a large mouth, and paired anterior protrusions on the lower jaw forming a concave structure [5]. Ayu is of extremely high economic value in Japan. Aquaculture production of ayu was approximately 5000 tons in 2017, the second-largest inland aquaculture production in Japan [6]. The average annual production of ayu in China stabilized at around 60,000 tons between 2019 and 2023. The Northeast region accounts for about 45% of the market share, followed by North China with about 30% and South China with about 18%. Together, these three areas form the heart of China’s ayu industry. In recent years, the market for ayu has grown significantly, driven by rising consumer demand for healthy food and for high-quality aquatic products [7]. Ayu aquaculture has therefore emerged as a pivotal growth driver in China’s fisheries industry. With the rapid development of genome technology, the genome of Plecoglossus altivelis has been sequenced. However, the assembly remains rough. At present, the Plecoglossus altivelis genome is assembled only to the scaffold level (i.e., contigs have been ordered and joined into scaffolds but not anchored to chromosomes), and no chromosome-level assembly or analysis of ayu chromosome structure has been reported. Further systematic work is required to refine the assembly to the chromosome level. The ayu genome is relatively small, comprising approximately 420 Mb distributed across 28 chromosomes (n = 28). Moreover, a Y-linked receptor gene involved in sex determination has been mapped in ayu [8]. Comparison of whole-genome resequencing mapping coverage between males and females identified male-specific regions in sex-linked scaffolds. A duplicate copy of the anti-Müllerian hormone type-II receptor gene (amhr2bY) was found within these male-specific regions [8], distinct from the autosomal copy of amhr2. These findings provide a basis for studying the sex determination mechanism of ayu.
A genome-wide association study (GWAS) [9] is a high-throughput genomic approach that identifies genetic variants associated with target traits by analyzing dense genotyping data from large cohorts. It enables genome-scale screening for genetic polymorphisms linked to diseases or complex traits within specific populations [10], leveraging the principle of linkage disequilibrium (LD), where adjacent alleles on chromosomes are co-inherited non-randomly. By detecting single-nucleotide polymorphisms (SNPs) [11], GWAS infers trait-associated loci through LD patterns [12].
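To make the LD principle concrete, the following minimal sketch computes the squared correlation (r²) between two SNPs from genotype allele counts. The 0/1/2 genotype coding, the function name `ld_r2`, and the example vectors are illustrative assumptions and are not taken from this study's data.

```python
from typing import Sequence


def ld_r2(geno_a: Sequence[int], geno_b: Sequence[int]) -> float:
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are assumed to be coded as alternate-allele counts (0, 1, 2),
    one value per individual, with no missing data.
    """
    n = len(geno_a)
    if n == 0 or n != len(geno_b):
        raise ValueError("genotype vectors must be non-empty and equally long")
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (cov * cov) / (var_a * var_b)


# Hypothetical genotypes for six individuals
snp_1 = [0, 1, 2, 1, 0, 2]
snp_2 = [0, 1, 2, 1, 0, 2]   # perfectly co-inherited with snp_1
snp_3 = [0, 2, 0, 2]         # used below with an unrelated 4-sample SNP
snp_4 = [0, 0, 2, 2]

print(ld_r2(snp_1, snp_2))
print(ld_r2(snp_3, snp_4))
```

A marker in high r² with an untyped causal variant can act as a proxy for it, which is why GWAS signals usually point to a region rather than to a single causal site.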
As a genome-level analytical framework, GWAS facilitates the discovery of causal genetic variants underlying phenotypic traits. Integrated with molecular marker-assisted breeding [13], GWAS holds transformative potential for aquaculture. Significant advancements have been achieved in fish species such as rainbow trout (Oncorhynchus mykiss) [14], yellow drum (Nibea albiflora) [15], and large yellow croaker (Larimichthys crocea) [16]. For instance, Tai et al. [17] identified key candidate loci and genes (e.g., igf1, gh) associated with growth traits (body weight, body length, total length, and body height) in rainbow trout via GWAS. Similarly, studies on yellow croaker [18] revealed critical SNPs and genes (e.g., mstn, gdf8) linked to growth regulation. Cui et al. [19] conducted a GWAS on yellowtail amberjack (Seriola lalandi), pinpointing growth-related SNPs and candidate genes. Ali et al. [20] reported analogous findings in rainbow trout, while Wang et al. [21] identified growth-associated genetic markers in tiger pufferfish (Takifugu rubripes), providing valuable insights for selective breeding. Beyond growth traits, GWAS has been widely applied to investigate disease resistance and stress tolerance traits in fish.
2. Materials and Methods
2.1. Experimental Population and Phenotypic Measurements
In this study, 426 Plecoglossus altivelis individuals were collected from a single, closed breeding population at the aquaculture farm of Liaoning Plecoglossus altivelis Fisheries Co., Ltd. in Dandong, Northeast China. To ensure genetic consistency, all samples originated from the same broodstock population with a shared breeding history and management protocol. To minimize the influence of close kinship, which could confound genetic association analyses, individuals were selected at random, using the available pedigree records to avoid sampling full-sib or half-sib family groups. Phenotypic characterization was performed on all individuals. The sampled population consisted of 5-month-old fish with an average body weight of 21.1 g and an average body length of 11.4 cm. Six growth-related traits and sex were recorded using standardized protocols:
Body Weight (BW): Measured using an electronic balance with a precision of 0.01 g after wiping the surface moisture of the fish body with absorbent paper.
Total Length (TL): The straight-line distance from the most anterior tip of the snout to the distal end of the caudal fin, measured with a digital caliper with a precision of 0.01 mm.
Body Length (BL): The straight-line distance from the most anterior tip of the snout to the posterior edge of the caudal peduncle, measured with a digital caliper with a precision of 0.01 mm.
Body Height (BH): The maximum vertical distance from the dorsal contour to the ventral contour of the fish body, measured at the position of the first dorsal fin ray using a digital caliper with a precision of 0.01 mm.
Eye Diameter (ED): The horizontal diameter of the eyeball, i.e., the straight-line distance between its left and right edges, measured with a digital caliper with a precision of 0.01 mm.
Gonad Weight (GW): The weight of the dissected gonad tissue, measured using an electronic balance with a precision of 0.01 g after rinsing with sterile phosphate-buffered saline (PBS) and blotting surface moisture.
Sex: Determined by visual inspection combined with histological observation of gonad tissue; individuals were categorized into male, female, and undifferentiated (if applicable).
Euthanasia and Tissue Sampling: Prior to tissue sampling, all fish were euthanized by immersion in a buffered tricaine methanesulfonate (MS-222, 150 mg/L) solution until unconsciousness and cessation of opercular movement, in accordance with established animal welfare guidelines. Following confirmation of death, and alongside phenotypic data recording, a portion of the caudal fin tissue from each fish was excised using sterile scissors and immediately preserved in a 2.0 mL sterile EP tube prefilled with absolute ethanol. All samples were stored at −20 °C for subsequent DNA extraction.
2.2. DNA Extraction, Sequencing, and Genotype Data Acquisition
DNA was extracted from the collected caudal fin tissues using the phenol–chloroform method. The quality of extracted DNA was assessed via 1% agarose gel electrophoresis, and concentrations were adjusted to 2.5 ng/μL prior to sequencing by BGI Wuhan (Wuhan, China).
2.3. Sequencing Methods
Sequencing was performed on the DNBSEQ platform of BGI Wuhan Co., Ltd., which included library construction and sequencing steps. The specific procedures are as follows:
- 1. DNA Sample Quality Check
The concentration of DNA samples was measured using a fluorometer, and the integrity of DNA samples was examined via 1% agarose gel electrophoresis. Only samples that passed these checks were used for library preparation.
- 2. DNA Sample Fragmentation
DNA samples were fragmented by ultrasonication, and short DNA fragments meeting the length requirements were obtained by adjusting the fragmentation parameters.
- 3. Fragment Size Selection
The fragmented samples were subjected to fragment selection using magnetic beads to concentrate the sample bands at approximately 300–400 bp. The amount of purified DNA samples was quantified using a fluorometer.
- 4. End Repair, A-Tailing, and Adapter Ligation
A reaction system was prepared and incubated at an appropriate temperature for a specific duration to repair the ends of double-stranded DNA and add an adenine (A) base to the 3′ ends. An adapter ligation reaction system was then prepared and incubated at an appropriate temperature for a specific duration to ligate adapters to the DNA fragments.
- 5. PCR Amplification and Product Recovery
A PCR reaction system was prepared, and the reaction program was set up to amplify the ligation products. The amplified products were subjected to fragment selection using magnetic beads, and the concentration and fragment size of the PCR products were detected.
- 6. PCR Product Circularization
The PCR products were denatured into single strands, after which a circularization reaction system was prepared, thoroughly mixed, and incubated at an appropriate temperature for a specific duration to obtain single-stranded circular products. After digesting the uncircularized linear DNA molecules, the final library was obtained.
- 7. Library Detection
The concentration of the library was determined.
- 8. Sequencing on the Instrument
Single-stranded circular DNA molecules were amplified via rolling circle replication to form DNA nanoballs (DNBs) containing more than 300 copies. The obtained DNBs were loaded into the mesh pores on the chip using high-density DNA nanochip technology, and sequencing was performed via combinatorial probe-anchor synthesis (cPAS) technology.
- 9. Data Generation and Quality Assessment
Sequencing was performed on the BGI DNBSEQ platform (Wuhan, China) using a short-fragment library construction protocol. Paired-end sequencing was conducted with a read length of 150 bp (PE150). The clean FASTQ data adhere to the Phred+33 quality scoring system, with Q20 scores exceeding 98% for all samples. Each sample contains more than 120,000,000 clean reads, equivalent to over 36 billion clean bases.
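As a rough consistency check on the reported yield, the sketch below converts a read count into total bases and an approximate mean sequencing depth. It assumes that the 120,000,000 figure counts read pairs (two 150 bp reads each), which is the interpretation under which the read count and the 36 billion bases agree, and it uses the approximate genome size of 420 Mb quoted in the Introduction. The function names are hypothetical.

```python
READ_LENGTH_BP = 150            # PE150, from the text
READ_PAIRS = 120_000_000        # assumption: the reported count refers to read pairs
GENOME_SIZE_BP = 420_000_000    # approximate ayu genome size quoted in the Introduction


def clean_bases(read_pairs: int, read_length_bp: int) -> int:
    """Total bases produced by paired-end reads (two reads per pair)."""
    return read_pairs * 2 * read_length_bp


def mean_depth(total_bases: int, genome_size_bp: int) -> float:
    """Approximate mean depth, ignoring unmapped and duplicate reads."""
    return total_bases / genome_size_bp


bases = clean_bases(READ_PAIRS, READ_LENGTH_BP)
print(f"{bases:,} clean bases")
print(f"approx. {mean_depth(bases, GENOME_SIZE_BP):.1f}x mean depth")
```

If the count instead referred to individual reads, the same arithmetic would give 18 billion bases, so the unit of the read count matters when comparing samples.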
2.4. Genotype Data Acquisition
Raw sequencing data were processed for genotyping using GATK (v4.1.8.0) software. The HaplotypeCaller tool was employed to generate single-sample gVCF files, followed by joint genotyping performed with the GenotypeGVCFs tool. Post-genotyping, stringent quality control filters were applied to exclude SNP loci failing analytical criteria, thereby minimizing false-positive outcomes.
Data refinement followed a multi-step quality control pipeline:
- (a) Raw Read Filtering: Raw sequencing reads were filtered using SOAPnuke with parameters: -n 0.01 -l 20 -q 0.5 --adaMR 0.25 --polyX 50 --minReadLen 150.
- (b) Variant Calling and Merging: Single-sample variant calling was performed using GATK (v4.1.8.0) HaplotypeCaller with basic quality filters applied (Genotype Quality ≥ 20, Mapping Quality ≥ 40). The gVCF files from all samples were subsequently merged using BCFtools (v1.22).
- (c) Depth- and Frequency-based Site Filtering: The merged variant set was filtered using BCFtools by (i) retaining only SNPs; (ii) removing sites with a read depth (DP) < 10 (--exclude 'INFO/DP < 10'); and (iii) removing sites with a minor allele frequency (MAF) < 0.01 (--min-af 0.01).
- (d) Genotype Imputation: The remaining missing genotypes in the filtered dataset were imputed using BEAGLE v4.1.
- (e) Comprehensive QC and Format Conversion: The imputed data were converted to PLINK format and subjected to stringent filtering at two levels. Individual-level: samples with a genotype missing rate > 0.05 were removed (--mind 0.05). Variant-level: the following filters were applied sequentially: (i) variants with a missing rate > 0.05 were removed (--geno 0.05); (ii) variants with MAF < 0.01 were removed (--maf 0.01); (iii) variants showing significant deviation from Hardy–Weinberg equilibrium were removed (--hwe 1e-6).
- (f) Linkage Disequilibrium Pruning: To obtain a set of independent variants, linkage disequilibrium pruning was performed using PLINK (parameters: --indep-pairwise 50 5 0.2). The resulting high-quality genotype dataset was used for downstream association analyses.
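The variant-level missingness and MAF thresholds listed above can be illustrated with a small stand-alone sketch. This is not the PLINK implementation; it only mirrors the two thresholds (missing rate ≤ 0.05 and MAF ≥ 0.01 are retained) on genotype vectors coded 0/1/2 with `None` for missing calls. The Hardy–Weinberg test is omitted, and all names and example data are hypothetical.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # 0, 1, 2 alternate-allele copies, or None if missing


def missing_rate(genos: Sequence[Genotype]) -> float:
    """Fraction of individuals without a genotype call at this variant."""
    if not genos:
        raise ValueError("no genotypes supplied")
    return sum(g is None for g in genos) / len(genos)


def minor_allele_frequency(genos: Sequence[Genotype]) -> float:
    """Frequency of the rarer allele among called genotypes."""
    called: List[int] = [g for g in genos if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_variant_qc(genos: Sequence[Genotype],
                      max_missing: float = 0.05,
                      min_maf: float = 0.01) -> bool:
    """Keep a variant only if it satisfies both thresholds."""
    return (missing_rate(genos) <= max_missing
            and minor_allele_frequency(genos) >= min_maf)


# Hypothetical variants
common = [0] * 10 + [1] * 8 + [2] * 2        # MAF 0.30, no missing calls
sparse = [None] * 2 + [0] * 8 + [1] * 10     # 10% missing calls
rare = [1] + [0] * 99                        # MAF 0.005

for name, variant in [("common", common), ("sparse", sparse), ("rare", rare)]:
    print(name, passes_variant_qc(variant))
```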
The initial dataset contained 426 individuals and 1,460,282 loci. After sequencing by the BGI Group, low-quality individuals (those with poor DNA quality or excessive missing genotype data) were excluded, and subsequent filtering with PLINK retained 555,242 high-quality single-nucleotide polymorphism (SNP) loci and 171 individuals, which were used for the genome-wide association study (GWAS).
2.5. Population Genetic Analysis
Genetic diversity and population structure were assessed using genome-wide SNP data. Principal component analysis (PCA) was performed using PLINK v1.9 with the --pca option after linkage disequilibrium pruning (--indep-pairwise 50 10 0.2). Genetic diversity indices, including observed heterozygosity (Ho) and the inbreeding coefficient (F), were calculated using PLINK’s --het function. Observed heterozygosity was calculated as Ho = (N.NM − O.HOM)/N.NM, where N.NM is the number of non-missing genotypes and O.HOM is the observed number of homozygous genotypes (the N(NM) and O(HOM) columns of the PLINK output).
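The two diversity indices can be written out as a short sketch. Ho follows the formula given above; F uses the method-of-moments definition reported by PLINK's --het, F = (O(HOM) − E(HOM)) / (N(NM) − E(HOM)). The counts in the example are hypothetical values chosen only to give numbers of a similar magnitude to those discussed later; they are not taken from the study's output files.

```python
def observed_heterozygosity(o_hom: float, n_nm: float) -> float:
    """Ho = (N(NM) - O(HOM)) / N(NM)."""
    if n_nm <= 0:
        raise ValueError("number of non-missing genotypes must be positive")
    return (n_nm - o_hom) / n_nm


def inbreeding_coefficient(o_hom: float, e_hom: float, n_nm: float) -> float:
    """Method-of-moments F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM))."""
    if n_nm == e_hom:
        raise ValueError("expected homozygotes equal non-missing genotypes")
    return (o_hom - e_hom) / (n_nm - e_hom)


# Hypothetical counts for one individual
n_nm = 100_000     # non-missing genotypes
o_hom = 60_500     # observed homozygous genotypes
e_hom = 64_318     # expected homozygous genotypes under Hardy-Weinberg

ho = observed_heterozygosity(o_hom, n_nm)
f = inbreeding_coefficient(o_hom, e_hom, n_nm)
print(round(ho, 3), round(f, 3))
```

A negative F arises whenever fewer homozygous genotypes are observed than expected, that is, when heterozygosity exceeds the Hardy–Weinberg expectation.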
2.6. Genome-Wide Association Analysis
First, a genomic relationship matrix (GRM) capturing SNP-derived genetic similarity between individuals was constructed [22]. This matrix enabled direct estimation of additive genetic variance for each trait from genome-wide SNP data, followed by calculation of SNP-based heritability for the seven traits. Single-trait genome-wide association study (GWAS) analysis was then performed using GCTA software.
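For readers unfamiliar with the GRM, the sketch below implements the commonly used standardized form, A_jk = (1/N) Σ_i (x_ij − 2p_i)(x_ik − 2p_i) / (2p_i(1 − p_i)), where x is the allele count and p_i the allele frequency of SNP i. This is an illustration only: the same expression is applied to diagonal entries for simplicity (GCTA uses a slightly different diagonal formula), monomorphic SNPs are skipped, and the toy genotype matrix and function name are assumptions.

```python
from typing import List, Sequence


def genomic_relationship_matrix(genotypes: Sequence[Sequence[int]]) -> List[List[float]]:
    """Build a GRM from genotypes[i][j] = allele count (0/1/2) of SNP i in individual j."""
    if not genotypes or not genotypes[0]:
        raise ValueError("genotype matrix is empty")
    n_ind = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    used_snps = 0
    for snp in genotypes:
        p = sum(snp) / (2 * n_ind)        # sample allele frequency
        scale = 2 * p * (1 - p)           # expected genotype variance
        if scale == 0:
            continue                      # monomorphic SNP: uninformative
        used_snps += 1
        centered = [x - 2 * p for x in snp]
        for j in range(n_ind):
            for k in range(j, n_ind):
                grm[j][k] += centered[j] * centered[k] / scale
    if used_snps == 0:
        raise ValueError("no polymorphic SNPs available")
    for j in range(n_ind):
        for k in range(j, n_ind):
            grm[j][k] /= used_snps
            grm[k][j] = grm[j][k]         # mirror the upper triangle
    return grm


# Toy data: 3 SNPs (rows) x 4 individuals (columns)
toy = [
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [0, 0, 1, 2],
]
for row in genomic_relationship_matrix(toy):
    print([round(value, 3) for value in row])
```

Because each SNP is centered on the sample allele frequency, every row of the resulting matrix sums to approximately zero, which is a quick sanity check on an implementation.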
Subsequently, single-trait GWAS was independently conducted using GEMMA software. Based on phenotypic correlation coefficients and inter-trait heritability estimates, six growth traits were grouped for multi-trait joint analysis via GEMMA. By integrating results from both software tools (GCTA and GEMMA), complementary results were obtained, enhancing the reliability of identifying candidate genes associated with these economically important traits through combined single and multi-trait approaches.
For GWAS result visualization, Manhattan plots [23] and Q-Q plots [24] were generated using R 4.4.0. Significant loci were identified using a threshold of p ≤ 5 × 10⁻⁸. Candidate genes were then screened and positionally mapped within 100 kb windows (50 kb upstream and 50 kb downstream) flanking each significant SNP [25].
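The thresholding and window-based mapping described above can be sketched as follows. The threshold (5 × 10⁻⁸) and the 50 kb flank come from the text; the 1-based coordinate convention, the `Gene` record, and the example scaffold and gene names are hypothetical and serve only to show the overlap logic.

```python
from typing import List, NamedTuple, Sequence, Tuple

GENOME_WIDE_THRESHOLD = 5e-8   # significance threshold from the text
FLANK_BP = 50_000              # 50 kb on each side of a significant SNP


class Gene(NamedTuple):
    name: str
    scaffold: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int


def is_significant(p_value: float) -> bool:
    return p_value <= GENOME_WIDE_THRESHOLD


def candidate_window(position: int, scaffold_length: int,
                     flank: int = FLANK_BP) -> Tuple[int, int]:
    """Window around a SNP, clipped to the scaffold boundaries."""
    return max(1, position - flank), min(scaffold_length, position + flank)


def genes_near_snp(scaffold: str, position: int, scaffold_length: int,
                   genes: Sequence[Gene]) -> List[str]:
    """Names of genes on the same scaffold that overlap the window."""
    start, end = candidate_window(position, scaffold_length)
    return [g.name for g in genes
            if g.scaffold == scaffold and g.start <= end and g.end >= start]


# Hypothetical annotation records
annotation = [
    Gene("geneA", "scaf1", 10_000, 20_000),
    Gene("geneB", "scaf1", 120_000, 130_000),
    Gene("geneC", "scaf2", 40_000, 45_000),
]
if is_significant(3.84e-29):
    print(genes_near_snp("scaf1", 60_000, 1_000_000, annotation))
```

Clipping matters on a scaffold-level assembly, where a significant SNP may lie closer than 50 kb to a scaffold end and the effective window is therefore shorter than 100 kb.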
2.7. Candidate Gene Identification and Functional Annotation
Based on the ayu (Plecoglossus altivelis) reference genome (Pal_1.0) provided by the Fish Aquaculture Laboratory, Department of Marine Biosciences, Tokyo University of Marine Science and Technology, and available on NCBI, 100 kb genomic regions (50 kb upstream and downstream of each significant locus) were extracted. These regions were subjected to BLAST sequence alignment on NCBI to identify potential candidate genes [26]. Functional annotation of the identified genes was then performed, supported by literature review, to further prioritize biologically relevant candidate genes.
4. Discussion
Phenotypic analysis of the Plecoglossus altivelis population revealed the correlations among traits. Four traits, namely body weight (BW), total length (TL), body length (BL), and body height (BH), exhibited a high degree of correlation, indicating that these indices collectively reflect individual size. This formed the basis for grouping these traits in our multi-trait genome-wide association study (GWAS) analysis. While the six growth traits could be combined into numerous groups, five representative combinations were selected based on correlation coefficients to ensure comprehensiveness while avoiding redundancy in candidate gene screening.
When comparing the single-trait GWAS results from GCTA and GEMMA with the multi-trait analysis results from GEMMA, significant complementarity was observed between the methods in terms of candidate gene detection power, effect estimation accuracy, and biological interpretability. Partial gene overlaps (e.g., LOC131530706, LOC117378376) were found in GCTA and GEMMA single-trait analyses, indicating that these loci exhibit robust association signals within the mixed linear model framework. For example, LOC131530706 showed highly significant p-values in both methods (GCTA: 3.84 × 10⁻²⁹; GEMMA: 3.84 × 10⁻²⁹) and is involved in GTP binding, suggesting it may play a central role in transmembrane transport or cell signaling. The discovery of such overlapping genes by two different statistical methods enhances result reliability, consistent with the theoretical advantage of mixed models in controlling population structure bias [27].
The population genetic analyses provide important context for interpreting our primary findings. The relatively low proportion of variance explained by individual principal components and the absence of clear population stratification reduce concerns about false-positive associations due to population structure in subsequent analyses. The moderate level of genetic diversity (Ho = 0.395) indicates sufficient variation for association studies, while the average inbreeding coefficient (F = −0.107) suggests either historical outcrossing or potential heterozygote advantage. These genetic characteristics should be considered when evaluating the robustness of association signals and selection signatures detected in this study.
More genes were detected by only one method: the GCTA-specific gene slc48a1a participates in heme transport, while the GEMMA-specific gene myo5aa regulates actin movement. These differences likely arise from algorithmic optimizations: GCTA’s heritability estimation based on the genomic relationship matrix (GRM) emphasizes global variance decomposition, whereas GEMMA’s sparse matrix accelerates local effect detection with higher sensitivity to low-frequency variants [28].
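The comparison of shared and method-specific candidates amounts to simple set operations, sketched below. The two sets contain only the example genes named in this Discussion and are not the complete result lists of either tool; the variable names are illustrative.

```python
# Example genes named in the text (subsets only, not the full result lists)
gcta_genes = {"LOC131530706", "LOC117378376", "slc48a1a"}
gemma_genes = {"LOC131530706", "LOC117378376", "myo5aa"}

shared = sorted(gcta_genes & gemma_genes)       # detected by both methods
gcta_only = sorted(gcta_genes - gemma_genes)    # GCTA-specific
gemma_only = sorted(gemma_genes - gcta_genes)   # GEMMA-specific

print("shared:", shared)
print("GCTA only:", gcta_only)
print("GEMMA only:", gemma_only)
```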
The multi-trait analysis by GEMMA further revealed shared genetic architectures across traits. For instance, slc25a12 was identified in both single- and multi-trait analyses, functioning in amino acid-ion coupled transport and potentially integrating multiple phenotypes through metabolic pathways. The multi-trait model also detected genes not covered by GCTA, such as maml3, which participates in the Notch signaling pathway, indicating that multi-trait analysis can capture hub genes in cross-phenotype regulatory networks [29]. Notably, LOC134022516 lacked functional annotation in both single- and multi-trait analyses, possibly representing an understudied novel regulatory element that requires validation with chromatin interaction data (e.g., Hi-C) [30]. These results highlight the necessity of multi-trait integration in GWAS: GCTA excels in heritability partitioning and candidate gene prioritization, while GEMMA expands functional association dimensions through multi-trait modeling and efficient computation.
In terms of statistical power, GEMMA showed higher resolution in estimating phenotypic variance explained (PVE). For example, LOC117378376 had highly significant p-values in both methods, but its PVE was significantly higher in GEMMA. This discrepancy may stem from GEMMA’s fine-grained modeling of random effects, where its Bayesian framework (e.g., BSLMM) more accurately decomposes additive and non-additive genetic effects [31]. Additionally, maml3 in GEMMA multi-trait analysis had a low PVE of 0.0016 but showed significant pathway enrichment, indicating that low-PVE genes can still amplify phenotypic impacts through regulatory networks. In contrast, GCTA’s PVE estimates are more conservative, potentially underestimating the contribution of pleiotropic genes, a phenomenon widely discussed in complex trait analysis [32].
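As background to the PVE comparison, one widely used approximation derives a per-SNP PVE from GWAS summary statistics: PVE ≈ 2β²·MAF·(1 − MAF) / [2β²·MAF·(1 − MAF) + se(β)²·2N·MAF·(1 − MAF)]. The sketch below implements this approximation purely for illustration. Whether either tool's PVE values in this study were obtained this way is not stated in the text, and the effect size, standard error, and MAF in the example are hypothetical; only the sample size of 171 comes from the Methods.

```python
def snp_pve(beta: float, se: float, maf: float, n: int) -> float:
    """Approximate proportion of phenotypic variance explained by one SNP.

    beta: estimated allelic effect; se: its standard error;
    maf: minor allele frequency; n: number of individuals.
    """
    if not 0 < maf <= 0.5:
        raise ValueError("maf must be in (0, 0.5]")
    genetic = 2 * beta ** 2 * maf * (1 - maf)
    residual = 2 * n * se ** 2 * maf * (1 - maf)
    return genetic / (genetic + residual)


# Hypothetical summary statistics for a single SNP, with n = 171 individuals
print(round(snp_pve(beta=0.5, se=0.1, maf=0.3, n=171), 4))
```

Under this approximation the MAF terms cancel, so the value depends only on the estimated effect, its standard error, and the sample size; differences in how each tool estimates β and its standard error therefore translate directly into different PVE values.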
Biologically, both methods converged on three functional modules: transmembrane transport, cytoskeletal dynamics, and metabolic regulation. GCTA-detected slc48a1a (heme transport) and GEMMA-identified slc25a12 (amino acid transport) both belong to the solute carrier (SLC) family, confirming that transmembrane material exchange is a core mechanism for the target traits. Furthermore, the GEMMA-specific gene dnah10, which drives microtubule movement via ATP hydrolysis, and GCTA-detected filip1L (myosin binding) jointly regulate cell morphology and migration, potentially influencing tissue development or pathogen responses [33]. Notably, cica in multi-trait analysis, as an RNA polymerase II-dependent transcription factor, may integrate multiple traits through epigenetic regulation, aligning with recent hypotheses about “super-enhancers” regulating multi-gene clusters [34]. Future studies should combine CRISPR screening or single-cell sequencing to validate upstream–downstream regulatory relationships of these candidate genes and explore their molecular mechanisms of interaction with the environment.
5. Limitations
This study acknowledges several limitations. Primarily, the Plecoglossus altivelis genome has not yet reached the chromosomal level and remains at the scaffold level. This limitation may lead to a series of challenges, such as less precise genomic positioning, scattered signals in Manhattan plots, and potential biases in statistical models, which could increase false positive rates and affect the reliability of GWAS results to some extent.
Furthermore, due to the lack of a comprehensive gene annotation file for P. altivelis in public databases, candidate gene identification relied on extracting sequences from significant loci for BLAST alignment. This traditional approach may introduce a degree of subjectivity. Nonetheless, the identification of overlapping candidate genes (e.g., five genes identified by both GEMMA single- and multi-trait analyses) through complementary GWAS methods enhances confidence in our screening results.
6. Conclusions
Analysis of correlation coefficients among six growth traits in Plecoglossus altivelis showed that body weight (BW) was significantly positively correlated with total length (TL), body length (BL), and body height (BH), indicating these indices collectively influence body size, with BW tightly linked to length and height. TL and BL showed high consistency in assessing body length, while the correlation between BL and BH highlighted their importance in evaluating individual size. Gonad weight (GW) was strongly associated with BW but weakly linked to TL, BL, BH, and eye diameter (ED), suggesting GW may be more closely related to reproductive traits or physiological status. ED showed weak associations with the other indices, indicating that it is largely independent of body shape characteristics and may instead be relevant to survival adaptation or predatory behavior.
GCTA genetic correlation analysis revealed strong positive genetic correlations (rg > 0.9) between BW and TL, BL, and BH, with high synergy among TL, BL, and BH, suggesting these morphological traits are regulated by shared polygenic networks, so that selecting for BW could synchronously improve other growth-related traits. GW was strongly correlated with BL and BW but weakly with ED, indicating gonadal development integrates growth metabolism and reproduction-specific regulatory mechanisms, requiring balanced breeding goals. ED had low genetic correlations with most traits (e.g., BW, GW) and only moderate correlation with TL, showing relatively independent genetic regulation. Treating ED as a secondary trait for separate optimization could improve breeding efficiency, consistent with phenotypic analysis results.
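For reference, a genetic correlation is defined as the genetic covariance between two traits divided by the geometric mean of their genetic variances, rg = cov_g(1,2) / sqrt(σ²g1 · σ²g2). The sketch below evaluates this definition; the variance components in the example are hypothetical and are not estimates from this study.

```python
import math


def genetic_correlation(cov_g: float, var_g1: float, var_g2: float) -> float:
    """rg = cov_g / sqrt(var_g1 * var_g2) for two traits."""
    if var_g1 <= 0 or var_g2 <= 0:
        raise ValueError("genetic variances must be positive")
    return cov_g / math.sqrt(var_g1 * var_g2)


# Hypothetical variance components for two growth traits
rg = genetic_correlation(cov_g=1.8, var_g1=4.0, var_g2=1.0)
print(round(rg, 2))
```

A value above 0.9, as reported for BW with TL, BL, and BH, means the additive genetic effects on the two traits are almost collinear, which is the basis for expecting correlated responses to selection.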