Integrative Transcriptomic Analysis and Co-Expression Network Characterization of Soybean Developmental Tissues

Knizia, Dounya; Meksem, Khalid; Kassem, My Abdelmajid

doi:10.3390/plants15071002

by Dounya Knizia ¹, Khalid Meksem ¹ and My Abdelmajid Kassem ²,*

¹ School of Agricultural Sciences, Southern Illinois University, Carbondale, IL 62901, USA

² Plant Genomics and Bioinformatics Lab, Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA

* Author to whom correspondence should be addressed.

Plants 2026, 15(7), 1002; https://doi.org/10.3390/plants15071002

Submission received: 2 March 2026 / Revised: 19 March 2026 / Accepted: 23 March 2026 / Published: 25 March 2026

(This article belongs to the Special Issue Bean Breeding)

Abstract

Soybean (Glycine max (L.) Merr.) is a globally important legume crop valued as a major source of plant-based protein and edible oil. Understanding the transcriptional programs underlying tissue-specific development is essential for improving seed quality and agronomic performance. Here, we present an integrative transcriptomic analysis of soybean based on 12 samples representing key seed developmental stages—including globular, heart, cotyledon, embryo, dry seed, mid-mature, and late-mature—and vegetative and reproductive tissues, including leaf, root, stem, flower bud, and seedling at 6 days after imbibition (6 DAI). Following data preprocessing and filtering, 54,880 genes were retained for downstream analysis. Principal component analysis revealed clear separation between seed and non-seed tissues, indicating that tissue identity is the dominant driver of transcriptomic variation. Analysis of the top 100 most variable genes further highlighted distinct expression modules associated with seed maturation and vegetative growth. Differential expression analysis identified 9785 genes exhibiting significant expression differences between seed and non-seed tissues, including 1139 upregulated and 8646 downregulated genes under relaxed statistical thresholds. Functional characterization of seed-upregulated genes revealed enrichment of biological processes related to storage metabolism, embryo development, and stress protection mechanisms associated with desiccation tolerance. In addition, co-expression network and correlation analyses demonstrated strong transcriptional coherence among seed tissues and distinct clustering of vegetative organs. Together, these results provide a comprehensive systems-level overview of transcriptional organization across soybean tissues and identify candidate gene sets relevant to seed biology, functional genomics, and crop improvement.

Keywords: co-expression network; differential gene expression; Glycine max; PCA; RNA-seq; seed development; transcriptomics

1. Introduction

Soybean (Glycine max (L.) Merr.) is one of the most important legume crops worldwide because of its high protein and oil content and its widespread use as a source of animal feed, human food, and industrial raw materials [1,2,3]. Meeting the increasing global demand for plant-based protein and vegetable oil represents a major challenge for modern agriculture and global food security. Transcriptome-wide studies have shown that thousands of genes participate in the temporal and spatial regulation of soybean seed development and maturation [4,5,6]. In parallel, advances in sequencing technologies have enabled the generation of multiple high-quality soybean genome assemblies, greatly expanding the genomic resources available for functional studies [3,7]. Additional genomic tools, including high-density genetic linkage maps and SNP arrays, further facilitate soybean functional genomics and trait dissection [6,8].

Studies of seed development in soybean have revealed that different phases—from globular embryo, heart stage, cotyledon formation, dry seed, to seedling establishment—are characterized by distinct gene expression patterns such as storage protein accumulation, desiccation tolerance, and embryo maturation [1,4,9,10,11]. For instance, early seed developmental stages show high expression of histones and proline-rich proteins, whereas mature dry seeds show elevated levels of late embryogenesis abundant (LEA) proteins and dehydrins [2,10].

These transcriptional programs reflect key physiological transitions underlying embryo development, nutrient accumulation, and maturation. During the globular to cotyledon stage, cellular proliferation dominates and genes related to cell cycle, chromatin remodeling, and transcription factors such as WUSCHEL-related homeobox (WOX) and LEAFY COTYLEDON (LEC) are upregulated [12,13]. As the embryo matures, a developmental shift toward nutrient storage occurs, marked by the activation of genes encoding glycinin, β-conglycinin, and oleosins—essential for protein and lipid deposition in storage vacuoles and oil bodies [14,15]. This storage phase is tightly regulated by hormonal cues such as abscisic acid (ABA), which not only drives seed filling but also induces desiccation tolerance via LEA proteins and antioxidant enzymes [16].

Dry seeds, in contrast, display a highly specialized transcriptome adapted for quiescence and stress protection. Numerous studies report elevated expression of dehydrins, small heat shock proteins, and seed-specific protective proteins like oleosin and caleosin during late maturation [2,17]. These molecules stabilize membranes and proteins during water loss, ensuring viability during desiccation and storage.

Beyond seeds, vegetative tissues (leaf, root, stem, flower bud) also participate in the developmental trajectory of the plant, and comparative transcriptomic atlases across multiple tissues are emerging [18,19]. For example, an integrated transcriptome atlas of soybean organs revealed large-scale gene expression coordination across tissues and developmental stages [18,20]. Vegetative development is characterized by distinct transcriptional signatures aligned with the physiological functions of each organ. For instance, the root transcriptome is enriched in genes involved in water and nutrient uptake, stress response pathways, and hormone biosynthesis such as auxin and cytokinin, which play crucial roles in root growth and lateral branching [1,21]. In contrast, the shoot apex and stem tissues activate meristem maintenance genes and regulators of vascular tissue differentiation, such as HD-ZIP III and NAC-domain transcription factors [18]. These regulatory pathways ensure proper formation of xylem and phloem, necessary for long-distance transport of assimilates and signaling molecules.

Leaves, as the primary photosynthetic organs, display a distinct transcriptomic landscape marked by genes for chloroplast development, light response, and carbon fixation (e.g., Rubisco subunits, chlorophyll a/b binding proteins) [2]. Comparative analyses between photosynthetically active leaves and non-photosynthetic tissues reveal tightly regulated metabolic compartmentalization and tissue-specific expression of plastid-related genes. Flower buds, meanwhile, initiate a reproductive gene program, including MADS-box transcription factors and floral organ identity genes, such as APETALA and AGAMOUS homologs [22]. This ensures proper floral organ specification and timing of reproductive transitions.

Despite these advances, our understanding of how gene expression dynamics differ across a broad tissue set—from seed developmental stages through vegetative tissues to early seedling—is still incomplete in soybean. Moreover, integrative approaches combining principal component analysis (PCA), differential expression (DE), and co-expression network modeling in the same dataset remain less common.

The objective of this study was to perform an integrative re-analysis of soybean transcriptomic data across developmental and vegetative tissues. Unlike previous studies that focused on individual aspects of soybean transcriptomics, this study integrates multiple analytical layers (variance-based gene selection, differential expression, correlation structure, and network analysis) to identify coordinated gene expression patterns, differentially expressed genes, and co-expression relationships underlying tissue specialization.

2. Results

The dataset analyzed (GSE29163) comprises RNA-seq expression profiles across 12 soybean tissues. The dataset does not include biological replicates for each tissue, and each sample represents a distinct developmental stage or organ. Therefore, downstream statistical analyses should be interpreted as exploratory rather than confirmatory.

2.1. Composition of the Soybean Transcriptomic Dataset

The transcriptomic dataset analyzed in this study comprised samples representing a broad range of soybean developmental stages and tissue types. In total, twelve tissues were included, spanning multiple seed developmental stages as well as vegetative and reproductive organs. Seed tissues included globular, heart, cotyledon, embryo, dry seed, mid-mature seed, and late-mature seed stages, whereas non-seed tissues consisted of leaf, root, stem, flower bud, and seedling 6 DAI.

A summary of the dataset composition is shown in Supplementary Figure S1, which illustrates the relative number of samples representing seed versus vegetative or reproductive tissues. Seed tissues accounted for seven of the twelve samples, reflecting the emphasis on multiple stages of embryo and seed maturation. Vegetative and reproductive tissues accounted for the remaining five samples. This dataset structure enabled comparative analysis of transcriptional programs associated with seed development relative to those operating in vegetative organs.

2.2. Principal Component Analysis Reveals Clear Separation of Tissue Types

PCA was performed on the full set of filtered genes, after removing genes with >50% missing values and applying a log₂ transformation. The analysis revealed a clear separation of tissue types along the first two principal components (Figure 1). Together, these components explained 66.6% of the total variance, with PC1 accounting for 56.2% and PC2 for 10.4%.

Seed-related tissues (Mid_Mature_Seed, Late_Mature_Seed, Globular_Seed, Heart_Seed, Cotyledon_Seed, Embryo_Seed, and Dry_Seed) clustered together and were clearly separated from the vegetative organs (Leaf, Root, Stem), the flower bud (Flower_Bud), and the young seedling (Seedling_6dai). The strong separation along PC1 indicates that the primary axis of transcriptomic variation corresponds to the developmental distinction between seed and non-seed tissues.

Within the seed cluster, earlier embryonic stages such as globular and heart stages formed closely related groups, whereas mature seed tissues occupied distinct positions along the PCA axes. These patterns suggest that progressive developmental transitions during seed maturation are accompanied by substantial transcriptional remodeling. Conversely, vegetative tissues formed a separate cluster reflecting shared physiological functions such as photosynthesis, nutrient transport, and structural growth.
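The preprocessing and PCA steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline used in this study: the pseudocount of 1, the replacement of remaining missing values with zero, and the function names `filter_and_log2` and `pca_samples` are all assumptions made for the example.

```python
import numpy as np

def filter_and_log2(expr, max_missing=0.5, pseudocount=1.0):
    """Drop genes with too many missing values, then log2-transform."""
    expr = np.asarray(expr, dtype=float)
    keep = np.isnan(expr).mean(axis=1) <= max_missing
    # Assumption: any remaining missing values are treated as zero expression.
    kept = np.nan_to_num(expr[keep], nan=0.0)
    return np.log2(kept + pseudocount), keep

def pca_samples(log_expr, n_components=2):
    """PCA of samples via SVD of the gene-centered (samples x genes) matrix."""
    x = log_expr.T
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]

# Synthetic genes x samples matrix: 7 "seed" columns followed by 5 "non-seed" columns.
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=50.0, size=(200, 12))
expr[:100, :7] *= 8.0   # genes 0-99 are higher in the seed columns
expr[0, :7] = np.nan    # gene 0 is missing in 7 of 12 samples (>50%)

log_expr, keep = filter_and_log2(expr)
scores, explained = pca_samples(log_expr)
print("genes kept:", int(keep.sum()))
print("fraction of variance on PC1, PC2:", np.round(explained, 3))
```

In this toy example the two groups fall on opposite sides of PC1 because the group difference was built into half of the genes; the explained-variance fractions are properties of the synthetic data and are unrelated to the values reported above.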

2.3. Highly Variable Genes Distinguish Developmental and Vegetative Programs

Highly variable genes were defined as the top 100 genes ranked by variance across all samples, and these genes were used for downstream clustering analysis (Figure 2).

The resulting heatmap revealed clear gene expression patterns distinguishing seed and non-seed tissues. Two major gene clusters were apparent. One cluster showed elevated expression across multiple seed developmental stages but reduced expression in vegetative tissues, suggesting roles in seed-specific processes such as embryo development, storage compound accumulation, and desiccation tolerance. The second cluster exhibited higher expression in vegetative organs, consistent with genes involved in photosynthesis, structural growth, and metabolic activity in actively growing tissues.

Clustering of the samples further supported the separation observed in the PCA. Seed developmental stages grouped together, whereas vegetative organs and reproductive tissues formed separate clusters. These patterns highlight the presence of distinct transcriptional programs associated with seed maturation versus vegetative growth.
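Variance-based gene selection of this kind can be sketched in a few lines. The example below is illustrative only: the gene identifiers are hypothetical, the matrix is a toy input, and the use of the sample variance (`ddof=1`) is an assumption, since the variance estimator is not specified here.

```python
import numpy as np

def top_variable_genes(log_expr, gene_ids, n_top=100):
    """Return the n_top gene IDs with the largest variance across samples."""
    variances = np.var(np.asarray(log_expr, dtype=float), axis=1, ddof=1)
    order = np.argsort(variances)[::-1][:n_top]
    return [gene_ids[i] for i in order], variances[order]

# Toy log2 expression matrix (genes x samples) with hypothetical gene names.
gene_ids = ["gene_A", "gene_B", "gene_C", "gene_D", "gene_E"]
log_expr = np.array([
    [1.0, 1.0, 1.0, 1.0],    # constant, variance 0
    [0.0, 10.0, 0.0, 10.0],  # most variable
    [2.0, 3.0, 2.0, 3.0],
    [5.0, 5.0, 5.0, 6.0],
    [0.0, 4.0, 0.0, 4.0],    # second most variable
])

top_ids, top_var = top_variable_genes(log_expr, gene_ids, n_top=2)
print(top_ids)
```

With `n_top=100` the same function would return the gene set passed on to clustering and heatmap display.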

2.4. Differential Gene Expression Between Seed and Non-Seed Tissues

Differential gene expression analysis was performed to identify genes exhibiting coordinated transcriptional differences between seed and non-seed tissues. Statistical testing identified 9785 genes that satisfied the thresholds of FDR < 0.05 and |log₂FC| > 1.

Among these, 1139 genes were significantly upregulated in seed tissues, whereas 8646 genes were downregulated relative to vegetative and reproductive organs. The results are visualized using a volcano plot (Figure 3), which illustrates the distribution of genes according to fold change and statistical significance.

The volcano plot reveals a strong asymmetry in differential expression patterns. A large proportion of genes exhibited lower expression in seed tissues relative to vegetative organs, reflecting the suppression of many growth-associated processes during seed maturation. In contrast, a smaller but distinct set of genes showed strong upregulation in seed tissues, likely representing genes involved in embryo development, storage reserve synthesis, and seed maturation pathways.
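The thresholding step (FDR < 0.05 and |log₂FC| > 1) can be illustrated with a short sketch. The statistical test that produced the p-values is not described in this section, so the example takes p-values and fold changes as given inputs. The Benjamini-Hochberg adjustment is an assumption about how the FDR was obtained, and the numbers are invented for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def classify(log2fc, fdr, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Boolean masks for seed-upregulated and seed-downregulated genes."""
    log2fc = np.asarray(log2fc, dtype=float)
    significant = (np.asarray(fdr) < fdr_cutoff) & (np.abs(log2fc) > lfc_cutoff)
    return significant & (log2fc > 0), significant & (log2fc < 0)

# Invented inputs: four genes with p-values and log2 fold changes (seed vs non-seed).
pvalues = [0.01, 0.04, 0.03, 0.005]
log2fc = [2.0, -1.5, 0.5, -3.0]

fdr = benjamini_hochberg(pvalues)
up, down = classify(log2fc, fdr)
print("upregulated:", int(up.sum()), "downregulated:", int(down.sum()))
```

The third gene passes the FDR cutoff but not the fold-change cutoff, so it is excluded from both sets, which is the same logic that separates the colored points from the background in a volcano plot.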

2.5. Expression Patterns of the Top Differentially Expressed Genes

To further explore transcriptional differences between tissues, the 50 genes with the largest absolute log₂ fold changes were selected and visualized using hierarchical clustering (Figure 4). The heatmap revealed pronounced differences in expression patterns between seed and vegetative tissues.

Genes highly expressed in seed tissues formed distinct clusters corresponding to developmental stages of embryo formation and seed maturation. Many of these genes are consistent with known seed-related functional categories, including proteins associated with lipid storage, seed storage proteins, and stress-related protective proteins such as late embryogenesis abundant (LEA) proteins.

Conversely, genes highly expressed in vegetative tissues showed elevated expression in leaf, root, and stem samples. These genes are likely associated with photosynthetic processes, cell wall biosynthesis, and general metabolic functions required for vegetative growth. Together, these patterns demonstrate strong tissue-specific transcriptional specialization across soybean developmental stages.
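Selecting genes by absolute fold change and preparing them for a clustered heatmap might look like the sketch below. The row-wise z-score scaling is an assumption (it is a common choice for heatmap display, but the scaling used for Figure 4 is not stated), and the gene names and values are hypothetical.

```python
import numpy as np

def top_by_abs_log2fc(log2fc, k=50):
    """Indices of the k genes with the largest absolute log2 fold change."""
    log2fc = np.asarray(log2fc, dtype=float)
    return np.argsort(-np.abs(log2fc), kind="stable")[:k]

def row_zscore(matrix):
    """Scale each gene (row) to mean 0 and unit sample standard deviation."""
    matrix = np.asarray(matrix, dtype=float)
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    sd = matrix.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # constant rows stay at zero instead of dividing by zero
    return centered / sd

# Toy data: two seed columns followed by two non-seed columns.
gene_ids = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
log_expr = np.array([
    [3.0, 3.0, 2.5, 2.5],
    [0.0, 0.0, 4.0, 4.0],
    [5.0, 5.0, 2.0, 2.0],
    [2.0, 2.0, 2.2, 2.2],
    [4.0, 4.0, 2.5, 2.5],
])
log2fc = log_expr[:, :2].mean(axis=1) - log_expr[:, 2:].mean(axis=1)

top = top_by_abs_log2fc(log2fc, k=3)
top_ids = [gene_ids[i] for i in top]
heatmap_values = row_zscore(log_expr[top])
print(top_ids)
```

The resulting `heatmap_values` matrix is what would be handed to a hierarchical clustering and plotting routine.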

2.6. Functional Characterization of Seed-Upregulated Genes

To gain functional insight into genes preferentially expressed in seed tissues, we examined the top upregulated genes identified in the differential expression analysis (Figure 5) and integrated available functional annotations from SoyBase. Among these, several genes are associated with stress response and cellular adaptation processes that are characteristic of seed maturation.

Notably, Glyma06g02500 is annotated as an HVA22-like protein G (IPR004345; TB2/DP1/HVA22-related protein), a protein family known to be involved in abscisic acid (ABA)-mediated stress responses and vesicle trafficking under dehydration conditions. HVA22-like proteins have been implicated in protecting cellular integrity during water deficit and are often associated with late stages of seed development, when desiccation tolerance is acquired.

In addition, Glyma04g082200 encodes a protein containing a stress-responsive A/B barrel domain (IPR011008), which is commonly associated with proteins involved in environmental stress adaptation and metabolic regulation. The presence of this domain further supports the activation of stress-responsive pathways in seed tissues. Another gene, Glyma13g364000, is currently annotated as a protein of unknown function; however, its localization to the plasma membrane and broad expression across multiple plant structures and developmental stages suggest a potential role in membrane-associated processes during seed maturation.

Although functional annotations are limited for several of the top upregulated genes, the available evidence points toward enrichment of pathways related to stress response, cellular protection, and membrane dynamics. These processes are well recognized as essential components of seed maturation, particularly in the acquisition of desiccation tolerance and long-term viability [23,24]. The observed expression patterns are also consistent with established Gene Ontology classifications describing seed development and stress adaptation processes [25].

Collectively, these findings indicate that seed-upregulated genes reflect coordinated transcriptional programs associated with stress resilience, cellular protection, and developmental specialization. Furthermore, the identification of both annotated and uncharacterized genes highlights opportunities for future functional studies aimed at elucidating novel regulators of soybean seed development.

2.7. Co-Expression Network Reveals Distinct Tissue Modules

To explore transcriptional relationships among tissues, a co-expression network was constructed using Pearson correlation coefficients greater than 0.9. The resulting network structure is shown in Figure 6. In this network, nodes represent tissues and edges represent strong correlations in gene expression profiles. Seed developmental stages formed a tightly interconnected module, reflecting their highly similar transcriptional programs. Early embryonic stages, including globular, heart, cotyledon, and embryo tissues, exhibited particularly strong connectivity within the network.

Vegetative and reproductive tissues formed a separate module consisting of leaf, root, stem, flower bud, and seedling samples. Connections among these tissues indicate shared transcriptional activity associated with vegetative growth and organ development. The network structure therefore reflects the biological organization of soybean tissues into functionally related transcriptional modules. Because the network retains only tissues connected by strong correlations (|r| > 0.9), only 9 of the 12 tissues are represented as nodes; the remaining samples did not meet this threshold.
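A tissue-level network of this type can be sketched by thresholding the sample correlation matrix. The example below uses synthetic profiles and hypothetical tissue labels. It keeps edges with r > 0.9 (positive correlations only, which is an assumption given that both r > 0.9 and |r| > 0.9 are mentioned) and shows why a sample with no strong correlation drops out of the node set.

```python
import numpy as np

def tissue_network(log_expr, tissue_names, threshold=0.9):
    """Edges between tissues whose expression profiles correlate above threshold."""
    r = np.corrcoef(np.asarray(log_expr, dtype=float).T)
    n = len(tissue_names)
    edges = [(tissue_names[i], tissue_names[j], float(r[i, j]))
             for i in range(n) for j in range(i + 1, n) if r[i, j] > threshold]
    nodes = sorted({t for a, b, _ in edges for t in (a, b)})
    return edges, nodes

def connected_components(edges, nodes):
    """Group nodes into modules using a depth-first search over the edges."""
    neighbors = {node: set() for node in nodes}
    for a, b, _ in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(neighbors[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Synthetic profiles: two seed-like, two vegetative-like, one unrelated sample.
rng = np.random.default_rng(2)
seed_base, veg_base = rng.normal(size=500), rng.normal(size=500)
names = ["seed_a", "seed_b", "veg_a", "veg_b", "other"]
log_expr = np.column_stack([
    seed_base + 0.1 * rng.normal(size=500),
    seed_base + 0.1 * rng.normal(size=500),
    veg_base + 0.1 * rng.normal(size=500),
    veg_base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

edges, nodes = tissue_network(log_expr, names)
modules = connected_components(edges, nodes)
print(nodes, modules)
```

Here the unrelated sample has no edge above the threshold, so it is absent from `nodes`, mirroring how fewer tissues than samples can appear in the network.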

2.8. Global Transcriptomic Similarity Among Soybean Tissues

Pairwise Pearson correlation coefficients were calculated to quantify the similarity of transcriptomes across all tissues. The resulting correlation matrix is shown as a heatmap in Figure 7. High correlation values were observed among seed developmental stages, indicating that these tissues share closely related transcriptional programs. In particular, globular, heart, cotyledon, and embryo tissues displayed strong pairwise correlations, consistent with their sequential developmental progression during embryogenesis.

Vegetative tissues also exhibited strong correlations among themselves, particularly between leaf, stem, and root samples. In contrast, correlations between seed and vegetative tissues were generally lower, reflecting substantial divergence between the transcriptional programs associated with embryo maturation and those associated with vegetative growth.
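One way to put numbers on this within-group versus between-group contrast is to average the pairwise Pearson coefficients by group membership. The sketch below is a hypothetical summary on synthetic data, not an analysis reported in this study. The function name and the simulated expression programs are assumptions.

```python
import numpy as np

def group_correlation_summary(log_expr, is_seed):
    """Mean pairwise Pearson r within seed, within non-seed, and between groups."""
    r = np.corrcoef(np.asarray(log_expr, dtype=float).T)
    is_seed = np.asarray(is_seed, dtype=bool)
    i, j = np.triu_indices(is_seed.size, k=1)  # each sample pair once
    values = r[i, j]
    both_seed = is_seed[i] & is_seed[j]
    both_non = ~is_seed[i] & ~is_seed[j]
    mixed = is_seed[i] ^ is_seed[j]
    return {
        "seed_seed": float(values[both_seed].mean()),
        "nonseed_nonseed": float(values[both_non].mean()),
        "seed_nonseed": float(values[mixed].mean()),
    }

# Synthetic data: a shared component plus a group-specific program and noise.
rng = np.random.default_rng(1)
n_genes = 300
is_seed = np.array([True] * 7 + [False] * 5)
shared = rng.normal(size=n_genes)
seed_program = rng.normal(size=n_genes)
veg_program = rng.normal(size=n_genes)
log_expr = np.column_stack([
    shared + (seed_program if s else veg_program) + 0.3 * rng.normal(size=n_genes)
    for s in is_seed
])

summary = group_correlation_summary(log_expr, is_seed)
print({k: round(v, 2) for k, v in summary.items()})
```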

2.9. Distribution of Differential Expression Magnitudes

Figure 8 shows the distribution of log₂ fold-change values across all genes, reflecting the global magnitude and direction of transcriptional differences. Most genes exhibited fold changes near zero, indicating similar expression levels between seed and non-seed tissues. However, the distribution displayed extended tails representing genes with large positive or negative fold changes.

Genes with large positive fold changes correspond to those strongly upregulated in seed tissues, whereas genes with large negative fold changes are preferentially expressed in vegetative organs. This distribution highlights the coexistence of seed-specific transcriptional activation and repression of vegetative gene programs.
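The shape of such a distribution can be summarized by the fraction of genes near zero and in each tail. The following sketch uses invented fold-change values. The cutoff of 1 matches the |log₂FC| > 1 threshold used above, while the bin width and function name are assumptions for the example.

```python
import numpy as np

def summarize_log2fc(log2fc, cutoff=1.0, bin_width=1.0):
    """Histogram counts plus the fraction of genes near zero and in each tail."""
    log2fc = np.asarray(log2fc, dtype=float)
    limit = np.ceil(np.abs(log2fc).max() / bin_width) * bin_width
    edges = np.arange(-limit, limit + bin_width, bin_width)
    counts, edges = np.histogram(log2fc, bins=edges)
    return {
        "near_zero": float(np.mean(np.abs(log2fc) <= cutoff)),
        "positive_tail": float(np.mean(log2fc > cutoff)),   # higher in seed
        "negative_tail": float(np.mean(log2fc < -cutoff)),  # higher in non-seed
        "counts": counts,
        "edges": edges,
    }

# Invented log2 fold changes (seed minus non-seed) for six genes.
log2fc = [4.0, 0.0, -4.0, 0.5, -0.2, 1.5]
lfc_summary = summarize_log2fc(log2fc)
print(lfc_summary["near_zero"], lfc_summary["positive_tail"], lfc_summary["negative_tail"])
```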

2.10. Global Summary of Differentially Expressed Genes

A summary of the number of genes significantly upregulated or downregulated in seed tissues is shown in Supplementary Figure S2. Differential expression analysis identified 1139 genes upregulated in seed tissues and 8646 genes downregulated relative to non-seed tissues.

The predominance of downregulated genes is consistent with the interpretation that many metabolic and physiological pathways active in vegetative tissues are suppressed during seed development. This pattern fits the transition from actively growing tissues to specialized reproductive structures focused on storage compound accumulation and embryo maturation.

2.11. MA Plot of Gene Expression Changes

To further evaluate differential expression patterns, an MA plot was constructed showing the relationship between mean gene expression levels and log₂ fold changes (Figure 9). Most genes clustered around the horizontal axis, indicating relatively stable expression between the two tissue groups. However, a substantial number of genes displayed large positive or negative fold changes across a range of expression levels. Highly significant genes were distributed across both high and moderate expression ranges, suggesting that transcriptional changes associated with seed development involve genes with diverse expression intensities.
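The two coordinates of an MA plot can be computed directly from log₂ expression values. In the sketch below, M is the difference between the seed and non-seed group means and A is the average of the two group means. This is a common convention, but the exact definition of mean expression used for Figure 9 is not stated, so it should be read as an assumption.

```python
import numpy as np

def ma_values(log_expr, is_seed):
    """Return (A, M): mean log2 expression and log2 fold change per gene."""
    log_expr = np.asarray(log_expr, dtype=float)
    is_seed = np.asarray(is_seed, dtype=bool)
    seed_mean = log_expr[:, is_seed].mean(axis=1)
    other_mean = log_expr[:, ~is_seed].mean(axis=1)
    m = seed_mean - other_mean          # vertical axis
    a = 0.5 * (seed_mean + other_mean)  # horizontal axis
    return a, m

# Toy log2 matrix: three genes, two seed and two non-seed samples.
log_expr = np.array([
    [5.0, 5.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.0, 2.0, 4.0, 6.0],
])
is_seed = [True, True, False, False]

a, m = ma_values(log_expr, is_seed)
print(a, m)
```

Plotting `m` against `a` as a scatter plot gives the MA view; genes near M = 0 correspond to the dense band around the horizontal axis.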

2.12. Hierarchical Clustering Confirms Tissue Relationships

Hierarchical clustering analysis was performed to further examine relationships among soybean tissues based on their global transcriptomic profiles. The resulting dendrogram (Figure 10) revealed clear grouping of tissues according to developmental and physiological characteristics.

Seed tissues clustered together, with closely related embryonic stages forming subclusters within the broader seed group. Vegetative tissues formed a separate branch of the dendrogram, reflecting their shared transcriptional programs associated with growth and metabolic activity.

Notably, the flower bud and early seedling samples showed intermediate relationships, linking the vegetative and seed developmental clusters. Overall, the clustering results reinforce the patterns observed in PCA, correlation analysis, and network analysis, confirming that soybean tissues exhibit strong transcriptional organization corresponding to their developmental roles. The dendrogram also resolved finer relationships among tissues, particularly within seed developmental stages, suggesting subtle transcriptional differences not fully captured by PCA.
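Agglomerative clustering of tissues can be sketched with a small average-linkage routine. The distance measure (1 minus Pearson correlation) and the linkage method are assumptions, since neither is specified here, and the tissue labels and data are synthetic. The implementation favors readability over speed.

```python
import numpy as np

def average_linkage(dist, labels):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Cluster distance = mean of all pairwise member distances.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Synthetic profiles, interleaved so the input order does not reveal the groups.
rng = np.random.default_rng(3)
seed_base, veg_base = rng.normal(size=400), rng.normal(size=400)
labels = ["seed_a", "veg_a", "seed_b", "veg_b"]
log_expr = np.column_stack([
    seed_base + 0.1 * rng.normal(size=400),
    veg_base + 0.1 * rng.normal(size=400),
    seed_base + 0.1 * rng.normal(size=400),
    veg_base + 0.1 * rng.normal(size=400),
])

dist = 1.0 - np.corrcoef(log_expr.T)
merges = average_linkage(dist, labels)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 3))
```

The merge history is the information a dendrogram draws: similar tissues join at small distances, and the final merge joins the two groups at a much larger distance.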

2.13. Identification of Candidate Genes Associated with Seed Development

The top differentially expressed genes identified in this study represent candidate genes associated with soybean seed development and tissue specialization. Notably, several genes with strong upregulation in seed tissues are associated with storage protein accumulation, lipid metabolism, and stress tolerance. These include genes encoding late embryogenesis abundant (LEA) proteins, oleosins, and other seed-specific proteins. Conversely, genes downregulated in seed tissues are primarily associated with photosynthesis, cell wall biosynthesis, and vegetative growth.

To further highlight candidate genes associated with seed development, the top upregulated genes ranked by log₂ fold change were visualized (Figure 10). These genes exhibited strong preferential expression in seed tissues compared to vegetative organs. Several of the highly upregulated genes are likely associated with seed-specific biological processes, including storage protein accumulation, lipid metabolism, and stress protection mechanisms. The identification of these candidate genes provides a focused set of targets for future functional validation and crop improvement studies.

3. Discussion

The integrative transcriptomic analysis presented here provides a comprehensive view of gene expression dynamics across soybean developmental and vegetative tissues. By combining multivariate statistical approaches, differential expression analysis, and network-based methods, this study reveals key transcriptional features associated with soybean organ specialization and seed development.

One of the most prominent observations from this study is that tissue identity represents the dominant axis of transcriptional variation in soybean. Principal component analysis revealed strongly coordinated gene expression programs, with the first two principal components explaining approximately two-thirds (66.6%) of the total variance in the dataset. This separation reflects large-scale transcriptional reprogramming associated with the transition from vegetative growth to reproductive development. Similar patterns have been reported in previous soybean transcriptome atlases and developmental profiling studies, where seed maturation stages cluster separately from leaf, root, and stem tissues [2,16,22,26]. The distinct clustering of embryonic stages is also consistent with the view that progressive developmental transitions during seed maturation are accompanied by large-scale shifts in transcriptional activity [27].

The analysis of the top 100 most variable genes further highlights the existence of distinct regulatory programs associated with seed and vegetative tissues. Hierarchical clustering of these genes revealed two major expression modules: one predominantly expressed in seed developmental stages and another enriched in vegetative organs. Such patterns are consistent with previous transcriptomic atlases showing that plant organs maintain specialized transcriptional networks adapted to their physiological roles [1,5,18]. In vegetative tissues, highly expressed genes are often associated with processes such as photosynthesis, carbon metabolism, and structural growth. In contrast, genes highly expressed in seeds are typically linked to embryo development, storage compound synthesis, and stress tolerance mechanisms required for seed maturation.

Differential gene expression analysis further demonstrated the magnitude of transcriptional reprogramming that occurs between seed and non-seed tissues. In total, nearly ten thousand genes were significantly differentially expressed, including more than eight thousand genes downregulated in seeds relative to vegetative tissues. The predominance of downregulated genes suggests that many metabolic pathways active in vegetative tissues become suppressed during seed development. Such repression is expected during the transition from actively growing organs toward specialized reproductive structures focused on storage reserve accumulation and embryo maturation. Similar large-scale transcriptional shifts have been observed in other studies of soybean seed development and legume embryogenesis [10,11,22,26,28].

The examination of the top differentially expressed genes revealed several functional categories characteristic of seed maturation processes. Many of the genes highly expressed in seeds encode proteins involved in storage compound accumulation, lipid body formation, and stress protection. These include genes encoding storage proteins, oleosin family proteins associated with lipid droplets, and late embryogenesis abundant (LEA) proteins. LEA proteins and related stress-response proteins are known to play critical roles in protecting cellular structures during dehydration and seed desiccation [2,29,30]. Their elevated expression during seed maturation reflects the physiological requirement to prepare the embryo for dormancy and long-term survival.

The analysis also revealed that dry seeds exhibit a highly specialized transcriptional state. Previous studies have shown that dry seeds accumulate transcripts encoding dehydrins, small heat shock proteins, and other stress-protective molecules during late maturation stages [2,17,31,32]. These proteins contribute to membrane stabilization, protein protection, and tolerance to cellular dehydration during seed desiccation. The transcriptomic patterns observed in this study are consistent with these findings, as dry and late-mature seeds occupy distinct positions in the PCA and clustering analyses relative to earlier embryonic stages and vegetative tissues.

The identification of top upregulated genes provides additional biological insight beyond global expression patterns. These genes represent strong candidate regulators or effectors of seed development and maturation. Many are likely involved in processes such as storage reserve accumulation, lipid body formation, and desiccation tolerance, which are essential for seed viability. Highlighting these candidate genes addresses an important gap in transcriptomic studies by linking global expression patterns to specific molecular targets that can be explored in functional genomics and breeding applications.

Network-based approaches provided additional insight into the organization of transcriptional programs across soybean tissues. The co-expression network constructed from highly correlated samples revealed strong connectivity among seed developmental stages, indicating that these tissues share highly coordinated gene expression patterns. Such tight clustering likely reflects the presence of common regulatory hubs controlling developmental pathways during embryo formation and maturation. Similar network-based analyses in soybean have demonstrated that gene expression during seed development is governed by highly interconnected regulatory modules [33,34]. In contrast, vegetative tissues exhibited slightly more modular network organization, reflecting the diversity of physiological processes operating across organs such as leaves, roots, and stems.

Correlation analysis and hierarchical clustering further supported these observations. Both approaches consistently grouped seed tissues together while separating them from vegetative organs. The concordance among multiple analytical methods—including PCA, clustering, correlation analysis, and network modeling—provides strong evidence that soybean tissues exhibit highly structured transcriptomic organization corresponding to their biological roles.

Beyond seed development, the transcriptional patterns observed in vegetative tissues reflect dynamic transcriptional responses to environmental and developmental signals [35]. Leaf tissues showed expression patterns associated with photosynthetic metabolism, whereas root and stem tissues exhibited transcriptional profiles linked to nutrient uptake, structural growth, and transport processes. Such divergence among organ-specific transcriptomes reflects the broader principle of transcriptional compartmentalization that underlies plant development. These findings are consistent with emerging pan-tissue transcriptomic maps demonstrating that plant organs maintain distinct gene expression signatures tailored to their physiological functions [18,36,37].

From an applied perspective, the gene sets identified in this study may provide valuable candidates for improving soybean crop traits. Genes strongly upregulated in seeds are particularly attractive targets for breeding programs aiming to enhance seed composition, including oil and protein content [15,38]. Likewise, genes associated with stress tolerance and desiccation protection could contribute to improved seed vigor and storage longevity. The high degree of transcriptional coordination observed among seed developmental stages also suggests that a limited number of representative stages may capture much of the transcriptomic variability associated with seed maturation, potentially enabling more efficient experimental designs in future studies.

Despite these insights, several limitations should be acknowledged. Because the analysis relies on publicly available datasets, variation in experimental conditions such as sequencing platforms, library preparation methods, and sample processing may introduce technical heterogeneity. In addition, the differential expression analysis employed simplified thresholds that do not account for all potential confounding variables, including genotype differences or batch effects. Future studies should address these limitations by incorporating biological replicates, multi-genotype datasets, and more robust statistical frameworks for RNA-seq analysis.

Further work will be required to validate candidate genes identified in this study, which may serve as targets for crop improvement and trait optimization [39]. Functional characterization using reverse genetics approaches, such as CRISPR/Cas-mediated gene editing or transgenic expression studies, will be essential to determine the precise biological roles of these genes in soybean development. Integration of transcriptomic data with other omics layers—including proteomics, metabolomics, and epigenomic profiling—may also provide a more comprehensive understanding of the regulatory mechanisms controlling seed maturation.

Another limitation of this study is the aggregation of multiple seed developmental stages into a single group for differential expression analysis, which may obscure stage-specific transcriptional dynamics. Future analyses using pairwise or time-series approaches could provide higher-resolution insights into developmental regulation, and incorporating gene family-level clustering or pathway-based analysis may further refine the interpretation of tissue-specific transcriptional programs. Finally, the absence of biological replicates limits statistical robustness.

In conclusion, this study integrates multivariate analysis, variance-based gene filtering, differential expression profiling, and network modeling to provide a comprehensive overview of transcriptional dynamics across soybean developmental tissues. The results highlight the strong transcriptional specialization that distinguishes seed developmental stages from vegetative organs and identify candidate genes associated with key biological processes such as embryo maturation, nutrient storage, and desiccation tolerance. These findings contribute to a growing body of transcriptomic resources that enhance our understanding of soybean development and provide a foundation for future functional genomics and crop improvement efforts.

Compared to previous soybean transcriptome atlases, this study emphasizes the coordinated structure of gene expression across tissues rather than focusing solely on individual gene differences. The integration of multiple analytical layers provides a broader systems-level understanding of soybean development.

From an applied perspective, the identified gene sets may serve as targets for improving seed composition, stress tolerance, and developmental efficiency through breeding or genetic engineering.

Beyond general functional categories, the enrichment of genes associated with nutrient reservoir activity and desiccation tolerance indicates coordinated regulation by seed-specific transcriptional networks. Previous studies have identified key regulators such as LEAFY COTYLEDON (LEC), ABSCISIC ACID INSENSITIVE (ABI), and WRINKLED1 (WRI1) as central drivers of seed maturation and storage compound accumulation. The enrichment patterns observed in this study are consistent with activation of these regulatory modules, suggesting that the identified gene sets may represent downstream targets of conserved seed developmental regulators.

4. Methods

4.1. Data Acquisition and Preprocessing

Transcriptomic datasets were obtained from the Gene Expression Omnibus (GEO) under accession GSE29163 [2], which contains RNA-seq expression profiles from multiple soybean (Glycine max) tissues representing both developmental and vegetative stages. Samples included seed developmental stages (globular, heart, cotyledon, embryo, dry seed, mid-mature seed, and late-mature seed), vegetative tissues (leaf, root, and stem), a reproductive organ (flower bud), and early post-germination tissue (seedling 6 DAI). No biological replicates are available.

Raw expression files were downloaded and merged into a unified gene expression matrix following commonly recommended RNA-seq data processing practices [40]. Gene identifiers were standardized across files, and sample names were simplified for readability during downstream analysis. Genes with missing values in more than 50% of samples were removed. Remaining missing values were replaced with zero as a pragmatic approach to maintain matrix completeness; however, this may underestimate the expression of lowly expressed genes, and results for such genes should be interpreted cautiously.

Expression values were transformed as log₂(x + 1) to stabilize variance and reduce the influence of highly expressed genes. After filtering, the final dataset consisted of 54,880 genes across 12 tissue samples [2]. Samples were grouped into two biological categories for comparative analysis: seed tissues (all developmental seed stages) and non-seed tissues (vegetative and reproductive organs).
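
The filtering and transformation steps described above can be sketched as follows. This is a minimal illustration written with the Python standard library only, not the code from the authors' notebook; the function name `preprocess`, the toy gene identifiers, and the use of `None` to mark a missing value are assumptions made for the example.

```python
import math

def preprocess(matrix, max_missing_fraction=0.5):
    """Filter genes by missingness, zero-fill, and apply log2(x + 1).

    matrix maps a gene identifier to a list of expression values,
    one per sample, with None marking a missing value (an assumption
    of this sketch).
    """
    processed = {}
    for gene, values in matrix.items():
        missing = sum(v is None for v in values)
        # Drop genes with missing values in more than 50% of samples.
        if missing / len(values) > max_missing_fraction:
            continue
        # Replace remaining missing values with zero, then transform.
        filled = [0.0 if v is None else float(v) for v in values]
        processed[gene] = [math.log2(v + 1.0) for v in filled]
    return processed

# Toy input with four samples per gene (hypothetical identifiers).
toy = {
    "gene_a": [3.0, 7.0, None, 15.0],   # 25% missing: kept
    "gene_b": [None, None, None, 1.0],  # 75% missing: removed
    "gene_c": [0.0, 1.0, 3.0, None],    # 25% missing: kept
}
result = preprocess(toy)
```

In this sketch a zero-filled value maps to log₂(0 + 1) = 0, so a missing measurement becomes indistinguishable from a true zero after transformation, which is the caveat noted above.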

4.2. Overview of Tissue Composition

To illustrate the distribution of biological samples included in the study, tissues were categorized into seed developmental stages and vegetative/reproductive tissues. The number of samples representing each category was summarized to provide an overview of the dataset composition.

4.3. Principal Component Analysis (PCA)

PCA was performed on the filtered and log₂-transformed gene expression data to evaluate global transcriptional relationships among samples. Prior to PCA, gene expression values were standardized using the StandardScaler implementation in the Scikit-learn library [41]. Dimensionality reduction was then performed using PCA, and the first two principal components were extracted to visualize the primary axes of transcriptional variation among soybean tissues.
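
To make the computation concrete, the sketch below standardizes each gene across samples and extracts the first principal component. It is an illustration under stated assumptions rather than the analysis code: it uses only the standard library and power iteration on the sample-by-sample Gram matrix instead of the Scikit-learn routines named above, and the function names and toy values are hypothetical.

```python
import math

def standardize_genes(samples):
    """Scale each gene (column) to zero mean and unit variance across samples."""
    n = len(samples)
    scaled_columns = []
    for col in zip(*samples):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        # Constant genes carry no information; map them to zeros.
        scaled_columns.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def first_principal_component(samples, n_iter=200):
    """Return PC1 scores per sample and the fraction of variance explained."""
    z = standardize_genes(samples)
    n = len(z)
    gram = [[sum(a * b for a, b in zip(z[i], z[j])) for j in range(n)]
            for i in range(n)]
    # The start vector must not be constant: for centered data the
    # all-ones vector is an eigenvector with eigenvalue zero.
    u = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]
    eigenvalue = sum(u[i] * sum(gram[i][j] * u[j] for j in range(n))
                     for i in range(n))
    total = sum(gram[i][i] for i in range(n))
    scores = [math.sqrt(eigenvalue) * v for v in u]
    return scores, eigenvalue / total

# Rows are samples, columns are genes (toy values, not study data).
toy = [
    [0.0, 0.0, 10.0],   # seed-like sample 1
    [1.0, 0.0, 9.0],    # seed-like sample 2
    [10.0, 10.0, 0.0],  # leaf-like sample 1
    [9.0, 10.0, 1.0],   # leaf-like sample 2
]
scores, explained = first_principal_component(toy)
```

The sign of a principal component is arbitrary, so only the relative positions of samples along the axis are meaningful. In the toy data the two seed-like samples fall on one side of PC1 and the two leaf-like samples on the other.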

4.4. Identification of Highly Variable Genes

To identify genes exhibiting the greatest transcriptional variability across tissues, gene-wise variance was calculated across the entire log₂-transformed dataset. The top 100 genes with the highest variance were selected for clustering analysis. Expression values for these genes were standardized using row-wise z-score normalization to emphasize relative expression patterns across tissues. Hierarchical clustering was performed using Euclidean distance and average linkage to identify groups of genes with similar expression patterns across developmental and vegetative samples.
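
The variance ranking and row-wise standardization described above can be illustrated with a short standard-library sketch. The function names and toy genes are hypothetical, and the use of the population variance and standard deviation is an assumption, since the text does not state which estimator was applied (the ranking is the same under either). The clustering step is not shown here.

```python
import statistics

def top_variable_genes(matrix, n_top=100):
    """Return the n_top gene identifiers with the highest variance."""
    variances = {g: statistics.pvariance(v) for g, v in matrix.items()}
    return sorted(variances, key=variances.get, reverse=True)[:n_top]

def row_zscore(values):
    """Standardize one gene's values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant gene has no relative pattern to display.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Toy log2-scale values for four samples (hypothetical identifiers).
toy = {
    "g_flat": [5.0, 5.0, 5.0, 5.0],
    "g_seed": [0.0, 0.0, 8.0, 8.0],
    "g_mild": [1.0, 2.0, 1.0, 2.0],
}
top = top_variable_genes(toy, n_top=2)
z = row_zscore(toy["g_seed"])
```

Row-wise z-scores remove differences in absolute expression level, so the heatmap colors reflect where each gene is relatively high or low rather than how abundant it is overall.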

4.5. Differential Gene Expression Analysis

Differential gene expression analysis was conducted to identify genes whose expression differed between seed and non-seed tissues, following commonly used statistical frameworks for RNA-seq differential expression analysis [42,43]. For each gene, the log₂ fold change (log₂FC) was calculated as the difference between the mean log₂-transformed expression of seed tissues and that of non-seed tissues. Statistical significance was evaluated using Welch’s two-sample t-test, which accounts for unequal variances between groups. p-values were adjusted for multiple hypothesis testing using the Benjamini–Hochberg false discovery rate (FDR) correction. The top differentially expressed genes were defined as those with the largest absolute log₂ fold change among genes meeting the significance threshold (FDR < 0.05). The distribution of gene expression changes was calculated using log₂ fold-change values for all genes between seed and non-seed tissues. Using these thresholds, a total of 9785 genes were identified as differentially expressed, including 1139 genes up-regulated in seed tissues and 8646 genes down-regulated relative to non-seed tissues.

This grouping strategy was designed to capture global transcriptional differences between seed and non-seed tissues. However, it does not account for stage-specific or tissue-specific variation, which represents an important direction for future analyses.

Due to the absence of biological replicates in the dataset, standard RNA-seq differential expression tools such as DESeq2 or edgeR could not be reliably applied. Therefore, Welch’s t-test was used as an exploratory statistical approach to identify broad transcriptional differences between groups. The results should be interpreted as indicative of global expression trends rather than definitive statistical inference.
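
The quantities used in this section can be sketched in a few lines of standard-library Python. The sketch is illustrative only and its function names are hypothetical. It computes the log₂ fold change, Welch's t statistic with the Welch–Satterthwaite degrees of freedom, and Benjamini–Hochberg adjusted p-values. Converting the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` would normally supply it.

```python
import math
import statistics

def log2_fold_change(seed, nonseed):
    """Difference of group means on the log2 scale (seed minus non-seed)."""
    return statistics.fmean(seed) - statistics.fmean(nonseed)

def welch_t(seed, nonseed):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(seed), len(nonseed)
    v1, v2 = statistics.variance(seed), statistics.variance(nonseed)
    se2 = v1 / n1 + v2 / n2
    if se2 == 0:
        # Both groups are constant, so the statistic is undefined.
        return float("nan"), float("nan")
    t = (statistics.fmean(seed) - statistics.fmean(nonseed)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy log2-scale values for one gene (not study data).
seed_values = [5.0, 6.0, 7.0]
nonseed_values = [1.0, 2.0, 3.0]
lfc = log2_fold_change(seed_values, nonseed_values)
t_stat, dof = welch_t(seed_values, nonseed_values)

# Toy p-values standing in for the per-gene test results.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In this design each tissue contributes a single observation to its group, so the within-group variance mixes biological differences among tissues with technical noise, which is one reason the results are treated as exploratory.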

4.6. Visualization of Differential Expression Patterns

Several visualization approaches were used to examine the distribution and magnitude of differential expression results. Volcano plots were generated to display the relationship between log₂ fold change and statistical significance. Histograms of log₂ fold-change values were constructed to visualize the overall distribution of differential expression magnitudes across genes. An MA plot was also produced to examine the relationship between mean expression levels and log₂ fold changes across genes. In addition, the numbers of up-regulated and down-regulated genes were summarized to provide an overview of the direction of expression changes.

4.7. Heatmap of Differentially Expressed Genes

To examine transcriptional patterns among the most strongly regulated genes, the top 50 genes with the largest absolute log₂ fold changes were selected. Expression values for these genes were visualized using a clustered heatmap based on log₂-transformed expression values. Hierarchical clustering was applied to both genes and samples using Euclidean distance and average linkage to reveal groups of genes with similar expression patterns across tissues.

4.8. Correlation Analysis of Tissue Transcriptomes

To evaluate global transcriptional similarity among tissues, a Pearson correlation matrix was calculated using the log₂-transformed expression dataset. The resulting matrix was visualized as a heatmap in which correlation coefficients reflected the overall similarity between pairs of tissues. This analysis allowed identification of closely related transcriptional programs across developmental stages and vegetative organs.
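
As an illustration of this step, the sketch below computes a Pearson correlation matrix between sample profiles using only the standard library. The function names, tissue labels, and values are hypothetical and chosen for clarity; they are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile would make the denominator zero; the toy data avoid it.
    return cov / (sx * sy)

def correlation_matrix(profiles):
    """Nested dict of pairwise correlations between named samples."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names}
            for a in names}

# Each profile holds log2-scale values for the same four genes (toy data).
profiles = {
    "cotyledon": [1.0, 2.0, 3.0, 4.0],
    "embryo": [2.0, 4.0, 6.0, 8.0],
    "leaf": [4.0, 3.0, 2.0, 1.0],
}
r = correlation_matrix(profiles)
```

Pearson correlation is insensitive to scale, which is why the "embryo" profile, a doubled copy of "cotyledon", correlates perfectly with it in the toy data.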

4.9. Hierarchical Clustering of Tissue Transcriptomes

Hierarchical clustering was performed to assess global similarity among soybean tissues based on transcriptomic profiles. Pairwise distances between samples were calculated using Euclidean distance applied to the log₂-transformed expression matrix. Agglomerative hierarchical clustering was then performed using the average linkage method, which iteratively groups samples based on their pairwise similarity. The resulting dendrogram illustrates relationships among tissues and highlights clusters corresponding to seed developmental stages and vegetative organs. Clustering was implemented using functions from the SciPy v1.13 hierarchical clustering module [44] and visualized using Matplotlib v3.10.
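
The agglomerative procedure can be sketched with a naive standard-library implementation of average linkage on Euclidean distances. This is a didactic illustration with hypothetical names and toy data, not the SciPy routine used in the study; in practice a call such as `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="euclidean"` performs the same kind of computation far more efficiently.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all member pairs.
                total = sum(euclidean(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy two-gene profiles (hypothetical labels and values).
profiles = {
    "seed_a": [0.0, 0.0],
    "seed_b": [0.0, 1.0],
    "leaf": [10.0, 10.0],
    "root": [10.0, 11.0],
}
merges = average_linkage(profiles)
```

The merge distances recorded at each step correspond to the heights at which branches join in a dendrogram, so the final merge in the toy example sits far above the two within-group merges.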

4.10. Co-Expression Network Analysis

To further explore transcriptional relationships among tissues, a co-expression network was constructed based on Pearson correlation coefficients, an approach commonly used in gene co-expression network analysis [45,46]. Pairwise correlations were calculated between samples, and an edge was drawn between two nodes, each representing a tissue, when their correlation coefficient exceeded 0.9. The resulting network was modeled as an undirected graph using the NetworkX Python library [45]. Node positions were determined using a spring-layout algorithm to optimize visualization of highly connected sample groups.
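
The thresholding step can be illustrated with a plain adjacency-dictionary sketch in standard-library Python. The correlation values below are invented for illustration and are not results from this study, and the function names are hypothetical. In the actual analysis the graph was handled with NetworkX, where an `nx.Graph` object and `nx.spring_layout` would take the place of the dictionary used here.

```python
def build_network(corr, threshold=0.9):
    """Undirected graph as an adjacency dict: edge when r > threshold."""
    names = list(corr)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if corr[a][b] > threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def connected_components(graph):
    """Groups of tissues linked directly or indirectly by edges."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

# Illustrative correlation values only (not taken from the dataset).
corr = {
    "cotyledon": {"cotyledon": 1.0, "embryo": 0.97, "leaf": 0.62, "stem": 0.60},
    "embryo": {"cotyledon": 0.97, "embryo": 1.0, "leaf": 0.58, "stem": 0.55},
    "leaf": {"cotyledon": 0.62, "embryo": 0.58, "leaf": 1.0, "stem": 0.93},
    "stem": {"cotyledon": 0.60, "embryo": 0.55, "leaf": 0.93, "stem": 1.0},
}
graph = build_network(corr)
components = connected_components(graph)
```

Because edges depend on a single cutoff, the resulting components can change noticeably with small shifts in the threshold, so it may be worth inspecting the network at more than one value.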

4.11. Software and Computational Environment

All analyses were performed using Python v3.10 [47] within a Jupyter Notebook v7.4 environment [48]. Numerical computations and matrix operations were conducted using NumPy v1.26 [49], while Pandas v2.2 [50] was used for data manipulation and table handling. Data visualization was performed using Matplotlib v3.10 [51] and Seaborn v0.13 [52], which enabled the generation of high-resolution publication-quality figures. Principal component analysis was implemented using Scikit-learn v1.7 [41], hierarchical clustering using SciPy v1.13 [44], and network analysis using NetworkX v3.4 [45].

5. Conclusions

In this study, we performed an integrative transcriptomic analysis of soybean developmental and vegetative tissues to characterize global patterns of gene expression. Multivariate analyses demonstrated that tissue identity—particularly the distinction between seed and non-seed tissues—represents the dominant source of transcriptional variation.

Analysis of highly variable and differentially expressed genes revealed distinct transcriptional programs associated with seed maturation and vegetative growth. Network and correlation analyses further demonstrated strong transcriptional coherence among seed tissues, indicating coordinated regulatory processes underlying embryo development and storage accumulation.

Although limited by the absence of biological replicates, this study provides a systems-level framework for understanding transcriptional organization across soybean tissues. The identified gene sets represent valuable candidates for future functional genomics studies and breeding strategies aimed at improving seed composition, yield, and stress resilience.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/plants15071002/s1, Figure S1. Composition of soybean transcriptomic dataset. Bar plot showing the number of samples representing seed developmental stages and vegetative or reproductive tissues included in the analysis. Figure S2. Summary of differentially expressed genes. Bar plot showing the number of genes significantly upregulated and downregulated in seed tissues compared to non-seed tissues based on FDR and fold-change thresholds.

Author Contributions

Conceptualization, K.M. and M.A.K.; methodology, D.K. and M.A.K.; validation, D.K. and K.M.; formal analysis, D.K. and M.A.K.; data curation, D.K. and M.A.K.; writing—original draft preparation, D.K.; writing—review and editing, D.K., K.M. and M.A.K.; supervision, K.M. and M.A.K.; project administration, K.M. and M.A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work received no external funding.

Data Availability Statement

The transcriptomic dataset analyzed in this study is publicly available from the NCBI Gene Expression Omnibus (GEO) under accession number GSE29163. The processed expression matrices, differential expression results, and the complete Jupyter Notebook used for data preprocessing, statistical analysis, and figure generation are publicly available at the corresponding author’s GitHub repository: https://github.com/abdelmajidk/Transcriptomic-Profiling-Soybean (accessed on 22 March 2026).

Acknowledgments

We gratefully acknowledge the open-source software community for providing the powerful tools that made this analysis possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Libault, M.; Farmer, A.; Joshi, T.; Takahashi, K.; Langley, R.J.; Franklin, L.D.; He, J.; Xu, D.; May, G.; Stacey, G. An integrated transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants. Plant J. 2010, 63, 86–99.
2. Severin, A.J.; Woody, J.L.; Bolon, Y.-T.; Joseph, B.; Diers, B.W.; Farmer, A.D.; Muehlbauer, G.J.; Nelson, R.T.; Grant, D.; Specht, J.E. RNA-Seq Atlas of Glycine max: A guide to the soybean transcriptome. BMC Plant Biol. 2010, 10, 160.
3. Schmutz, J.; Cannon, S.B.; Schlueter, J.; Ma, J.; Mitros, T.; Nelson, W.; Hyten, D.L.; Song, Q.; Thelen, J.J.; Cheng, J. Genome sequence of the palaeopolyploid soybean. Nature 2010, 463, 178–183.
4. Jones, S.I.; Gonzalez, D.O.; Vodkin, L.O. Flux of transcript patterns during soybean seed development. BMC Genom. 2010, 11, 136.
5. Zhang, X.; Luo, Z.; Marand, A.P.; Yan, H.; Jang, H.; Bang, S.; Mendieta, J.P.; Minow, M.A.; Schmitz, R.J. A spatially resolved multi-omic single-cell atlas of soybean development. Cell 2025, 188, 550–567.e519.
6. Song, Q.; Hyten, D.L.; Jia, G.; Quigley, C.V.; Fickus, E.W.; Nelson, R.L.; Cregan, P.B. Development and evaluation of SoySNP50K, a high-density genotyping array for soybean. PLoS ONE 2013, 8, e54985.
7. Valliyodan, B.; Shi, H.; Nguyen, H.T. A simple analytical method for high-throughput screening of major sugars from soybean by normal-phase HPLC with evaporative light scattering detection. Chromatogr. Res. Int. 2015, 2015, 757649.
8. Song, Q.; Jenkins, J.; Jia, G.; Hyten, D.L.; Pantalone, V.; Jackson, S.A.; Schmutz, J.; Cregan, P.B. Construction of high resolution genetic linkage maps to improve the soybean genome sequence assembly Glyma1.01. BMC Genom. 2016, 17, 33.
9. Gao, H.; Wang, Y.; Li, W.; Gu, Y.; Lai, Y.; Bi, Y.; He, C. Transcriptomic comparison reveals genetic variation potentially underlying seed developmental evolution of soybeans. J. Exp. Bot. 2018, 69, 5089–5104.
10. Peng, L.; Qian, L.; Wang, M.; Liu, W.; Song, X.; Cheng, H.; Yuan, F.; Zhao, M. Comparative transcriptome analysis during seeds development between two soybean cultivars. PeerJ 2021, 9, e10772.
11. Zhang, Y.; Bhat, J.A.; Zhang, Y.; Yang, S. Understanding the molecular regulatory networks of seed size in soybean. Int. J. Mol. Sci. 2024, 25, 1441.
12. Goldberg, R.B.; De Paiva, G.; Yadegari, R. Plant embryogenesis: Zygote to seed. Science 1994, 266, 605–614.
13. Le, B.H.; Cheng, C.; Bui, A.Q.; Wagmaister, J.A.; Henry, K.F.; Pelletier, J.; Kwong, L.; Belmonte, M.; Kirkbride, R.; Horvath, S. Global analysis of gene activity during Arabidopsis seed development and identification of seed-specific transcription factors. Proc. Natl. Acad. Sci. USA 2010, 107, 8063–8070.
14. Baud, S.; Mendoza, M.S.; To, A.; Harscoët, E.; Lepiniec, L.; Dubreucq, B. WRINKLED1 specifies the regulatory action of LEAFY COTYLEDON2 towards fatty acid metabolism during seed maturation in Arabidopsis. Plant J. 2007, 50, 825–838.
15. Kumar, V.; Vats, S.; Kumawat, S.; Bisht, A.; Bhatt, V.; Shivaraj, S.; Padalkar, G.; Goyal, V.; Zargar, S.; Gupta, S. Omics advances and integrative approaches for the simultaneous improvement of seed oil and protein content in soybean (Glycine max L.). Crit. Rev. Plant Sci. 2021, 40, 398–421.
16. Xu, W.; Wang, Q.; Zhang, W.; Zhang, H.; Liu, X.; Song, Q.; Zhu, Y.; Cui, X.; Chen, X.; Chen, H. Using transcriptomic and metabolomic data to investigate the molecular mechanisms that determine protein and oil contents during seed development in soybean. Front. Plant Sci. 2022, 13, 1012394.
17. Guo, B.; Zhang, J.; Yang, C.; Dong, L.; Ye, H.; Valliyodan, B.; Nguyen, H.T.; Song, L. The late embryogenesis abundant proteins in soybean: Identification, expression analysis, and the roles of GmLEA4_19 in drought stress. Int. J. Mol. Sci. 2023, 24, 14834.
18. Fan, J.; Shen, Y.; Chen, C.; Chen, X.; Yang, X.; Liu, H.; Chen, R.; Liu, S.; Zhang, B.; Zhang, M. A large-scale integrated transcriptomic atlas for soybean organ development. Mol. Plant 2025, 18, 669–689.
19. Klepikova, A.V.; Kasianov, A.S.; Gerasimov, E.S.; Logacheva, M.D.; Penin, A.A. A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling. Plant J. 2016, 88, 1058–1070.
20. Haun, W.J.; Hyten, D.L.; Xu, W.W.; Gerhardt, D.J.; Albert, T.J.; Richmond, T.; Jeddeloh, J.A.; Jia, G.; Springer, N.M.; Vance, C.P. The composition and origins of genomic variation among individuals of the soybean reference cultivar Williams 82. Plant Physiol. 2011, 155, 645–655.
21. Zhou, Z.S.; Song, J.B.; Yang, Z.M. Genome-wide identification of Brassica napus microRNAs and their targets in response to cadmium. J. Exp. Bot. 2012, 63, 4597–4613.
22. Zhang, H.; Hu, Z.; Yang, Y.; Liu, X.; Lv, H.; Song, B.-H.; An, Y.-q.C.; Li, Z.; Zhang, D. Transcriptome profiling reveals the spatial-temporal dynamics of gene expression essential for soybean seed development. BMC Genom. 2021, 22, 453.
23. Baud, S.; Lepiniec, L. Physiological and developmental regulation of seed oil production. Prog. Lipid Res. 2010, 49, 235–249.
24. Angelovici, R.; Galili, G.; Fernie, A.R.; Fait, A. Seed desiccation: A bridge between maturation and germination. Trends Plant Sci. 2010, 15, 211–218.
25. Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T. Gene ontology: Tool for the unification of biology. Nat. Genet. 2000, 25, 25–29.
26. Sun, S.; Yi, C.; Ma, J.; Wang, S.; Peirats-Llobet, M.; Lewsey, M.G.; Whelan, J.; Shou, H. Analysis of spatio-temporal transcriptome profiles of soybean (Glycine max) tissues during early seed development. Int. J. Mol. Sci. 2020, 21, 7603.
27. Chen, Z.; Zhong, W.; Zhou, Y.; Ji, P.; Wan, Y.; Shi, S.; Yang, Z.; Gong, Y.; Mu, F.; Chen, S. Integrative analysis of metabolome and transcriptome reveals the improvements of seed quality in vegetable soybean (Glycine max (L.) Merr.). Phytochemistry 2022, 200, 113216.
28. Du, J.; Wang, S.; He, C.; Zhou, B.; Ruan, Y.-L.; Shou, H. Identification of regulatory networks and hub genes controlling soybean seed set and size using RNA sequencing analysis. J. Exp. Bot. 2017, 68, 1955–1972.
29. Hu, Y.; Liu, Y.; Lu, L.; Tao, J.J.; Cheng, T.; Jin, M.; Wang, Z.Y.; Wei, J.J.; Jiang, Z.H.; Sun, W.C. Global analysis of seed transcriptomes reveals a novel PLATZ regulator for seed size and weight control in soybean. New Phytol. 2023, 240, 2436–2454.
30. Smolikova, G.; Leonova, T.; Vashurina, N.; Frolov, A.; Medvedev, S. Desiccation tolerance as the basis of long-term seed viability. Int. J. Mol. Sci. 2020, 22, 101.
31. Ramtekey, V.; Cherukuri, S.; Kumar, S.; Sripathy, V.S.; Sheoran, S.; Bhaskar, K.U.; Naik, K.B.; Kumar, S.; Singh, A.N.; Singh, H.V. Seed Longevity in Legumes: Deeper Insights into Mechanisms and Molecular Perspectives. Front. Plant Sci. 2022, 13, 918206.
32. El-Maarouf Bouteau, H. The Seed and the Metabolism Regulation. Biol. Reprod. 2022, 11, 168.
33. Zhu, M.; Dahmen, J.L.; Stacey, G.; Cheng, J. Predicting gene regulatory networks of soybean nodulation from RNA-Seq transcriptome data. BMC Bioinform. 2013, 14, 278.
34. Sainz, M.M.; Filippi, C.V.; Eastman, G.; Sotelo-Silveira, M.; Zardo, S.; Martínez-Moré, M.; Sotelo-Silveira, J.; Borsani, O. Water deficit response in nodulated soybean roots: A comprehensive transcriptome and translatome network analysis. BMC Plant Biol. 2024, 24, 585.
35. Wang, N.; Khan, W.; Smith, D.L. Changes in soybean global gene expression after application of lipo-chitooligosaccharide from Bradyrhizobium japonicum under sub-optimal temperature. PLoS ONE 2012, 7, e31571.
36. Yu, L.; Zhang, H.; Guan, R.; Li, Y.; Guo, Y.; Qiu, L. Genome-wide tissue-specific genes identification for novel tissue-specific promoters discovery in soybean. Genes 2023, 14, 1150.
37. Lv, Z.; Jiang, S.; Kong, S.; Zhang, X.; Yue, J.; Zhao, W.; Li, L.; Lin, S. Advances in single-cell transcriptome sequencing and spatial transcriptome sequencing in plants. Plants 2024, 13, 1679.
38. Lee, S.; Van, K.; Sung, M.; Nelson, R.; LaMantia, J.; McHale, L.K.; Mian, M.A.R. Genome-wide association study of seed protein, oil and amino acid contents in soybean from maturity groups I to IV. Theor. Appl. Genet. 2019, 132, 1639–1659.
39. Fang, C.; Ma, Y.; Wu, S.; Liu, Z.; Wang, Z.; Yang, R.; Hu, G.; Zhou, Z.; Yu, H.; Zhang, M. Genome-wide association studies dissect the genetic networks underlying agronomical traits in soybean. Genome Biol. 2017, 18, 161.
40. Conesa, A.; Madrigal, P.; Tarazona, S.; Gomez-Cabrero, D.; Cervera, A.; McPherson, A.; Szcześniak, M.W.; Gaffney, D.J.; Elo, L.L.; Zhang, X. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016, 17, 13.
41. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
42. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
43. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
44. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
45. Hagberg, A.; Swart, P.J.; Schult, D.A. Exploring Network Structure, Dynamics, and Function Using NetworkX; Los Alamos National Laboratory (LANL): Los Alamos, NM, USA, 2007.
46. Langfelder, P.; Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 2008, 9, 559.
47. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; Python Language Reference, Version 3.10; CreateSpace: Scotts Valley, CA, USA; Python Software Foundation: Beaverton, OR, USA, 2009; Available online: https://www.python.org (accessed on 1 February 2026).
48. Kluyver, T.; Ragan-Kelley, B.; Pérez, F.; Granger, B.E.; Bussonnier, M.; Frederic, J.; Kelley, K.; Hamrick, J.; Grout, J.; Corlay, S.; et al. Jupyter Notebooks-a publishing format for reproducible computational workflows. In Proceedings of the Positioning and Power in Academic Publishing: Players, Agents and Agendas: Proceedings of the 20th International Conference on Electronic Publishing, Göttingen, Germany, 7–9 June 2016; IOS Press: Amsterdam, The Netherlands, 2016; p. 87.
49. Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J. Array programming with NumPy. Nature 2020, 585, 357–362.
50. McKinney, W. Data structures for statistical computing in Python. Scipy 2010, 445, 51–56.
51. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95.
52. Waskom, M.L. Seaborn: Statistical data visualization. J. Open Source Softw. 2021, 6, 3021.

Figure 1. Principal component analysis of soybean tissues. Scatter plot of the first two principal components derived from the log₂-transformed expression matrix, showing global transcriptional relationships among seed developmental stages and vegetative tissues.

Figure 2. Heatmap of the top 100 most variable genes across soybean tissues. Hierarchically clustered heatmap showing row-wise standardized expression (z-score) of the 100 genes with the highest variance across samples, highlighting distinct transcriptional programs associated with seed and vegetative tissues.

Figure 3. Volcano plot of differential gene expression between seed and non-seed tissues. Genes are plotted based on log₂ fold change and −log₁₀(p-value), with significantly differentially expressed genes highlighted in red. The horizontal dashed blue line indicates the significance threshold (p = 0.05), while the vertical dashed green lines represent log₂ fold-change thresholds (±1), illustrating the magnitude and direction of transcriptional changes.

Figure 4. Heatmap of the top 50 differentially expressed genes. Clustered heatmap displaying log₂-transformed expression values of genes with the largest absolute fold changes between seed and non-seed tissues, revealing strong tissue-specific expression patterns.

Figure 5. Top seed-upregulated genes ranked by log₂ fold change. Bar plot showing the genes with the highest positive log₂ fold change in seed tissues relative to non-seed tissues. Genes are ordered by decreasing fold change, highlighting candidate genes associated with seed development, including stress-responsive and membrane-associated proteins.

Figure 6. Sample co-expression network of soybean tissues. Network representation of transcriptional similarity among tissues based on Pearson correlation coefficients (|r| > 0.9), with nodes representing tissues and edges indicating strong correlations.

Figure 7. Pearson correlation matrix of soybean tissue transcriptomes. Heatmap showing pairwise Pearson correlation coefficients among all samples using log₂-transformed gene expression values, illustrating global transcriptional similarity across tissues.

Figure 8. Distribution of differential gene expression magnitudes between seed and non-seed tissues. Histogram showing the distribution of log₂ fold-change values across all genes, with a kernel density curve (blue line) illustrating the overall distribution. The vertical dashed black line indicates no change in expression (log₂ fold change = 0), highlighting the extent and direction of transcriptional differences.

Figure 9. MA plot of soybean differential gene expression. Scatter plot showing the relationship between mean gene expression and log₂ fold change across all genes, illustrating expression-dependent transcriptional variation.

Figure 10. Hierarchical clustering of soybean developmental tissues based on gene expression profiles. Dendrogram showing relationships among tissues using hierarchical clustering of log₂-transformed expression data. The y-axis represents clustering distance. Seed tissues cluster separately from vegetative and reproductive tissues, reflecting distinct transcriptional programs. Branch colors are assigned automatically by the clustering algorithm and do not represent predefined biological groups.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Knizia, D.; Meksem, K.; Kassem, M.A. Integrative Transcriptomic Analysis and Co-Expression Network Characterization of Soybean Developmental Tissues. Plants 2026, 15, 1002. https://doi.org/10.3390/plants15071002

