Article

Elucidating the Drivers of Aquaculture Eutrophication: A Knowledge Graph Framework Powered by Domain-Specific BERT

1
School of Management Science and Engineering, Dongbei University of Finance and Economics, Dalian 116025, China
2
Surrey International Institute, Dongbei University of Finance and Economics, Dalian 116025, China
3
School of Software, Dalian University of Foreign Languages, Dalian 116044, China
4
College of Public Administration, Huazhong University of Science and Technology, Wuhan 430079, China
*
Author to whom correspondence should be addressed.
Sustainability 2025, 17(19), 8907; https://doi.org/10.3390/su17198907
Submission received: 6 September 2025 / Revised: 26 September 2025 / Accepted: 29 September 2025 / Published: 7 October 2025
(This article belongs to the Collection Aquaculture and Environmental Impacts)

Abstract

(1) Background: Marine eutrophication represents a formidable challenge to sustainable global aquaculture, posing a severe threat to marine ecosystems and impeding the achievement of UN Sustainable Development Goal 14. Current methodologies for identifying eutrophication events and tracing their drivers from vast, heterogeneous text data rely on manual analysis and thus have significant limitations. (2) Methods: To address this issue, we developed a novel automated attribution analysis framework. We first pre-trained a domain-specific model (Aquaculture-BERT) on a 210-million-word corpus, which serves as the foundation for constructing a comprehensive Aquaculture Eutrophication Knowledge Graph (AEKG) with 3.2 million entities and 8.5 million relations. (3) Results: Aquaculture-BERT achieved an F1-score of 92.1% in key information extraction, significantly outperforming generic models. The framework successfully analyzed complex cases, such as the Xiamen harmful algal bloom, generating association reports congruent with established scientific conclusions and elucidating latent pollution pathways (e.g., pond aquaculture–nitrogen input–Phaeocystis bloom). (4) Conclusions: This study delivers an AI-driven framework that enables the intelligent and efficient analysis of aquaculture-induced eutrophication, propelling a paradigm shift toward the deep integration of data-driven discovery with hypothesis-driven inquiry. The framework provides a robust tool for quantifying the environmental impacts of aquaculture and identifying pollution sources, contributing to sustainable management and achieving SDG 14 targets.

1. Introduction

As the world’s fastest-growing food production sector, aquaculture plays an increasingly pivotal role in ensuring global food security and supplying high-quality protein [1,2]. According to the FAO’s The State of World Fisheries and Aquaculture 2024, aquaculture became the leading source of aquatic animal production for the first time in 2022, producing 130.9 million tonnes and demonstrating strong growth potential to help meet rising global demand for seafood (https://www.fao.org/newsroom/detail/fao-report-global-fisheries-and-aquaculture-production-reaches-a-new-record-high/en (accessed on 6 July 2024)). However, its rapid intensification and large-scale expansion have imposed immense environmental pressure on fragile coastal ecosystems [3]. Among these pressures, eutrophication driven by aquaculture activities is particularly pronounced [4]. Uneaten feed, metabolic excreta, and aquaculture effluents release substantial quantities of nutrients, such as nitrogen (N) and phosphorus (P), into surrounding water bodies, thereby disrupting the native ecological balance [5,6]. This excess nutrient enrichment is a primary driver of harmful algal blooms (HABs), including red and green tides. These events have well-documented and severe consequences, including mass fish mortality, widespread anoxia, and significant economic losses due to closures of shellfish harvesting areas [7]. High concentrations of N and P stimulate the excessive proliferation of phytoplankton, altering the aquatic nutrient structure [8,9]. The subsequent decomposition of this biomass consumes vast amounts of dissolved oxygen, ultimately leading to oxygen depletion and the formation of hypoxic zones [10]. This process severely threatens marine biodiversity and impairs the health and resilience of coastal ecosystems [11].
This challenge directly aligns with the core mandate of the United Nations Sustainable Development Goal 14 (SDG 14: Life Below Water) [12,13], which explicitly calls for the prevention and significant reduction of “marine pollution of all kinds, in particular from land-based activities.” Consequently, the precise identification and quantification of aquaculture’s contribution to marine eutrophication is a scientific prerequisite for developing effective management strategies [4], conserving and restoring coastal and wetland ecosystems, and ultimately achieving sustainable aquaculture.
Currently, the attribution analysis of eutrophication events relies predominantly on conventional methods, including fixed-site water quality monitoring, numerical ecosystem models, and extensive manual literature reviews [14,15]. Although these approaches have proven successful in specific contexts, their limitations have become increasingly evident in the face of ever more severe and complex environmental challenges [16,17]. First, relevant information is highly fragmented and heterogeneous, scattered across thousands of scientific papers, government environmental reports, technical documents, and news articles in disparate formats, which renders data integration exceedingly difficult [18,19,20,21]. Second, the reliance on experts for manual screening and data extraction is time-consuming, laborious, and inefficient [22,23]. This process is also susceptible to subjective bias, thereby limiting the consistency and reproducibility of the analytical results [24]. Crucially, in the era of information explosion, conventional methods are inadequate for effectively processing the vast volumes of unstructured text data [25,26]. Consequently, many potential associations and deep-seated causal chains are overlooked [27].
To illustrate the disparity, it is crucial to establish quantitative benchmarks. In terms of processing time, a manual literature review by an expert team might take months to synthesize several hundred papers; in stark contrast, the AI-driven framework presented in this study processed a corpus of over 20,000 documents in a matter of hours. Regarding extraction performance, where manual annotation is prone to fatigue and inconsistency, our automated approach achieves a validated F1-score of 92.1% for domain-specific information extraction—a level of precision and recall competitive with human experts but executed at a vastly greater scale. Most critically, this scale enables a level of link discovery impossible through manual methods. While a conventional review might identify several hundred key relationships, our framework systematically constructed a knowledge graph containing over 3.2 million entities and 8.5 million relational triplets, revealing complex, system-level connections that are otherwise imperceptible.
In recent years, the emergence of pre-trained language models (PLMs), exemplified by Bidirectional Encoder Representations from Transformers (BERT), has provided revolutionary tools for the deep comprehension and mining of unstructured text [28,29]. By leveraging its unique bidirectional Transformer architecture, BERT captures the deep contextual semantics of words [30], achieving state-of-the-art performance on information extraction tasks such as Named Entity Recognition (NER) and Relation Extraction (RE) [31,32]. Concurrently, Knowledge Graphs (KGs), a type of semantic network, enable the structured storage of disparate knowledge extracted from text in the form of (entity-relation-entity) triplets [33,34,35]. This structure facilitates complex querying, logical reasoning, and visual analysis [36].
The integration of these two state-of-the-art technologies offers a novel approach to addressing the aforementioned challenges. By harnessing the powerful semantic comprehension capabilities of BERT, key information regarding eutrophication events can be automatically and precisely extracted from massive volumes of text. A knowledge graph can then be employed to link and integrate this disparate information, thereby constructing a systematic knowledge base. This AI-driven framework is poised to revolutionize the analysis of eutrophication drivers, fundamentally enhancing the efficiency, objectivity, and comprehensiveness of attribution analysis. Against this backdrop, the primary objective of this study is to design, construct, and validate an intelligent framework that integrates a domain-specific BERT model with a knowledge graph. This framework is aimed at automatically identifying aquaculture-related eutrophication events and intelligently analyzing their underlying causal factors.
The primary contributions of this study include:
  • The development of Aquaculture-BERT, the first pre-trained language model specifically tailored to the domain of aquaculture and marine environmental science for a more precise comprehension of its specialized terminology.
  • A novel methodology that integrates BERT with knowledge graph technology, designed specifically for the intelligent attribution analysis of eutrophication.
  • The development of a scalable Aquaculture Eutrophication Knowledge Graph (AEKG), providing researchers and marine environmental managers with a robust, data-driven decision-support tool.
This interdisciplinary research provides an efficient and scalable solution for addressing marine pollution, thereby offering a feasible technological pathway for leveraging scientific knowledge to support the United Nations Sustainable Development Goals and the conservation and sustainable use of marine ecosystems. This study represents the first systematic application of an integrated domain-pre-trained language model and large-scale knowledge graph technology for the automated analysis of factors associated with aquaculture-induced eutrophication. The work aims to establish a novel, scalable analytical framework to advance a research paradigm within this field that fuses ‘data-driven’ exploration with ‘hypothesis-driven’ inquiry. The research approach and structure of this paper are illustrated in Figure 1.

2. Literature Review

2.1. Environmental Impact of Aquaculture and the Attribution Challenge

An extensive body of research establishes aquaculture as a significant anthropogenic stressor, primarily through the mechanism of nutrient enrichment in coastal ecosystems [37]. Foundational studies have focused on quantifying these inputs, identifying uneaten feed and metabolic excreta as the principal pathways for releasing large quantities of nitrogen and phosphorus [38], which is now unequivocally identified in the literature as a key driver of coastal eutrophication [2]. Subsequent ecological research has elaborated on the cascading effects, where nutrient enrichment stimulates harmful algal blooms (HABs) [39,40]. The decomposition of this excess biomass then consumes dissolved oxygen, often creating extensive hypoxic zones [41] that severely impair marine biodiversity and ecosystem health [13]. This well-documented causal chain—from aquaculture operations to ecosystem degradation—directly aligns the issue with the policy imperatives of UN Sustainable Development Goal 14 [42,43], pointing to a critical need for methodologies that can precisely attribute and quantify the role of aquaculture in eutrophication events.
The practical value of attribution studies lies in enabling targeted, evidence-based policymaking, which is essential for achieving SDG 14. Without precise source attribution, regulators often resort to broad, inefficient policies—such as uniform restrictions across an entire region—that can unfairly penalize responsible operators. In contrast, robust attribution provides the evidence for targeted governance. For instance, by identifying a specific farming practice or environmental condition as a primary driver of eutrophication—such as the “Pond-Dinoflagellate” pattern our framework uncovered—policymakers can design precise interventions. These might include mandating the adoption of Integrated Multi-Trophic Aquaculture (IMTA) in high-risk areas or adjusting site licenses, rather than imposing blanket restrictions [44]. This approach transforms the high-level ambitions of SDG 14 into actionable, measurable environmental outcomes, directly linking scientific discovery to effective governance.

2.2. The Evolution and Limitations of Conventional Analytical Methods

For decades, research addressing this attribution challenge has been dominated by a suite of conventional methodologies [2]. This established paradigm traditionally relies on three pillars: long-term water quality monitoring for empirical data, numerical ecosystem models for mechanistic insights, and large-scale manual synthesis of literature for comprehensive knowledge [45,46]. These approaches have been instrumental, providing a foundational understanding of eutrophication phenomena in specific contexts [47].
However, the contemporary era of information explosion has exposed inherent methodological limitations [47,48]. Specifically, while direct water quality monitoring provides high-fidelity ground truth, it is inherently constrained by its spatial and temporal sparsity, offering deep but narrow insights that often miss transient or localized eutrophication events. Numerical ecosystem models, though powerful for forecasting, are fundamentally limited by their reliance on simplifying assumptions and the availability of extensive calibration data, which is often scarce, thereby restricting their applicability and accuracy across diverse geographical regions.
A significant challenge, widely discussed in recent literature, is the severe fragmentation of evidence across disparate sources like scientific reports and government bulletins. This fragmentation does not merely complicate data integration; it actively obstructs a holistic, system-level understanding of cumulative environmental pressures, leaving policymakers with an incomplete picture [49]. Furthermore, the reliance on manual knowledge extraction represents a critical bottleneck that fundamentally compromises scalability; this process is not only resource-intensive and inefficient but is also susceptible to subjective biases that can compromise the objectivity and reproducibility of analyses [30]. Most critically, as the volume of unstructured text data grows exponentially, it has become clear that these traditional methods are fundamentally ill-equipped for the scale and complexity of the modern information landscape.
Their inherent limitations prevent the discovery of the complex, non-linear, and deep-seated driver-response relationships [50] that are crucial for effective management. This creates a clear and urgent need for a paradigm shift toward automated, AI-driven approaches capable of synthesizing vast amounts of information to reveal insights that remain hidden from conventional analysis, making the development of such a framework not just beneficial, but essential.
Traditional methods are characterized by direct, empirical measurement at specific sites, a practice well-documented in foundational reviews. Departing from this paradigm, our work demonstrates the power of large-scale knowledge synthesis from the scientific literature itself. We harness the ability of AI to identify complex, non-linear patterns from these heterogeneous data sources, while remaining mindful that such models can be data-hungry and sometimes lack interpretability. In essence, the methodological evolution reflects a fundamental trade-off: while traditional methods provide high-accuracy, descriptive ‘deep dives’ at specific locations, advanced methods offer predictive and synthetic ‘wide-angle’ views essential for regional management, albeit with different levels of abstraction and sources of uncertainty [51].

2.3. A New AI-Driven Research Paradigm: From Text Mining to Knowledge Construction

In response to these methodological challenges, a new research paradigm is emerging, driven by artificial intelligence technologies engineered to process and comprehend vast quantities of unstructured text. At the forefront of this shift are pre-trained language models (PLMs), particularly BERT, which offer revolutionary tools to overcome the bottlenecks of manual analysis [52,53]. These models represent a significant leap from previous Natural Language Processing (NLP) techniques that relied on keyword matching or traditional machine learning [54]. By leveraging its unique bidirectional Transformer architecture, BERT captures deep contextual semantics, demonstrating state-of-the-art performance on information extraction tasks [55,56]. Concurrently, to address the challenge of data fragmentation, Knowledge Graphs (KGs) have gained widespread attention as a technology for structured knowledge organization [57]. KGs integrate discrete facts extracted from text into a computable and inferable semantic network of (entity-relation-entity) triplets, which is essential for complex querying and pattern discovery [58]. The literature increasingly regards the synergistic integration of BERT’s text comprehension with the organizational power of KGs as the foundation for a new generation of automated knowledge discovery systems. This approach is seen as pivotal for revolutionizing knowledge-intensive fields by enhancing analytical efficiency, objectivity, and comprehensiveness [59,60].

2.4. The Research Gap and This Study’s Academic Positioning

Despite the transformative potential of this AI-driven paradigm, a review of the literature reveals that its application to the highly specialized, interdisciplinary field of aquaculture eutrophication attribution remains nascent [19]. The principal barrier identified in recent studies is the challenge of ‘semantic drift,’ which occurs when standard language models trained on general-purpose corpora are applied to specialized scientific domains [61,62]. These generalist models struggle to interpret domain-specific terminology (e.g., Karenia mikimotoi) and complex contextual relationships, which fundamentally compromises the quality of information extraction and underscores the necessity for domain-adapted models. Consequently, the literature indicates a clear and critical research gap: no study has yet implemented an intelligent framework that systematically integrates a domain-pre-trained language model with a large-scale knowledge graph for the automated attribution of aquaculture-induced eutrophication. Prior work in this area has been constrained either by a reliance on conventional NLP techniques or by validations confined to small-scale datasets, thus failing to construct a robust knowledge engine capable of supporting large-scale, data-driven exploration [63,64,65]. This study is therefore explicitly designed to fill this gap. By developing an intelligent analysis framework enhanced by domain-specific knowledge, this research seeks to advance the field’s paradigm by integrating a more efficient and scalable ‘data-driven’ approach with traditional ‘hypothesis-driven’ inquiry.

3. Materials and Methods

3.1. Overview of the Research Framework

This study developed a multi-stage intelligent analysis framework (Figure 2), designed for the automated construction of an AEKG from massive volumes of unstructured text. The framework primarily comprises four core modules: (1) the construction of a comprehensive, domain-relevant text corpus; (2) secondary pre-training of a general-purpose BERT model on this corpus to obtain the domain-specific Aquaculture-BERT model; (3) fine-tuning of Aquaculture-BERT for downstream NER and RE tasks to enable the automated extraction of key information; and (4) the fusion and storage of the extracted structured knowledge triplets, culminating in the construction of the AEKG and its subsequent validation through case studies.
This modular architecture was deliberately chosen as it represents a best-practice approach in applied NLP for specialized domains. The sequential design—first building a comprehensive corpus, then imbuing the model with deep domain-specific knowledge via pre-training, and only then adapting it to downstream extraction tasks—is critical for achieving high accuracy. This sequence ensures the model learns the nuances of aquaculture science before being asked to perform specific, high-stakes extraction tasks, a step that significantly improves performance over using a general-purpose model directly.

3.2. Corpus Construction and Preprocessing

To ensure the model’s domain adaptability, a large-scale corpus dedicated to aquaculture and the marine environment was constructed. The data sources primarily included the following:
  • Academic Literature: Approximately 8000 research articles (abstracts and full texts) published from June 2010 to June 2025 were retrieved from databases such as Web of Science, Scopus, PubMed, and the China National Knowledge Infrastructure (CNKI). The retrieval process used keyword combinations including “aquaculture” AND “eutrophication,” “harmful algal bloom” AND “mariculture,” and “cage culture” AND “eutrophication”. Of these approximately 8000 documents, about 74% (~5920) are full-text articles; the remaining 26% (~2080) are abstracts, included to maintain comprehensive topical coverage where full texts could not be obtained.
  • Industry News and Standards: Approximately 65,000 textual documents, such as industry news and aquaculture technical specifications, were consolidated from leading aquaculture industry websites and standards-setting organizations.
The detailed composition of the constructed domain corpus is summarized in Table 1, which includes an estimate of the token distribution across the different source types.

3.3. Domain-Specific Model Pre-Training: Aquaculture-BERT

Before pre-training, the processed text was tokenized. The tokenization process strictly adhered to the requirements of the bert-base-multilingual-cased model, utilizing its corresponding WordPiece tokenizer. For multi-word terms such as ‘harmful algal bloom,’ the WordPiece tokenizer breaks them down into smaller subword units (tokens). In the subsequent NER task, the Transformer architecture of BERT, through its self-attention mechanism, is able to automatically learn the intrinsic relationships between these consecutive subword units. This allows it to recognize and label them collectively as a single entity (e.g., of type ‘Event’), a process accomplished automatically by the model during the fine-tuning stage.
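The subword behavior described above can be illustrated with a minimal sketch. It assumes a tiny hypothetical vocabulary (`TOY_VOCAB`), not the real vocabulary of bert-base-multilingual-cased, which would split these words differently. It shows greedy longest-match WordPiece splitting and one common convention for carrying word-level BIO labels over to subword tokens, so that all pieces of "harmful algal bloom" end up inside a single Event span. The label-propagation rule is an assumption; the text does not state which alignment convention was used.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def align_labels(words, word_labels, vocab):
    """Propagate word-level BIO labels to subword tokens.

    Assumed convention: the first piece keeps the word label and
    later pieces of the same word receive the matching I- label.
    """
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        for i, piece in enumerate(wordpiece(word, vocab)):
            tokens.append(piece)
            if i == 0 or label == "O":
                labels.append(label)
            else:
                labels.append("I-" + label[2:])
    return tokens, labels


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"harm", "##ful", "al", "##gal", "bloom"}
words = ["harmful", "algal", "bloom"]
word_labels = ["B-Event", "I-Event", "I-Event"]
tokens, labels = align_labels(words, word_labels, TOY_VOCAB)
```

With this toy vocabulary the three words become five tokens, and every token after the first carries `I-Event`, which is what lets the classifier recover the phrase as one entity.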
To enable the model to accurately comprehend specialized terminology, such as Karenia mikimotoi and “denitrification”, secondary pre-training was performed on the bert-base-multilingual-cased model. The selection of this specific model was a deliberate and necessary choice, driven directly by the bilingual nature of our purpose-built corpus (constructed in Section 3.2), which is composed of both Chinese (~70%) and English (~30%) texts. The bert-base-multilingual-cased model is specifically designed and pre-trained by Google to handle multiple languages within a single framework, making it the most suitable and logical starting point for our secondary pre-training task. Using a monolingual English model (e.g., bert-base-uncased) would have meant ignoring the majority of our data, while training separate models would have been less efficient and prevented the model from learning valuable cross-lingual representations.
This training leveraged the bilingual Chinese-English domain corpus constructed in Section 3.2. The training tasks adopted the classic strategies of BERT: Masked Language Model (MLM) and Next Sentence Prediction (NSP). The pre-training process was executed on four NVIDIA A100 GPUs with a batch size of 64, a learning rate of 5 × 10⁻⁵, and was run for a total of 3 epochs. This process adapted the model’s parameter weights from general-purpose knowledge to the specific domain of aquaculture and the marine environment, thereby significantly enhancing its performance on downstream tasks.
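For readers unfamiliar with the MLM objective, the following is a minimal sketch of the masking step as defined in the original BERT recipe: roughly 15% of tokens are selected, and of those 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. Whether these default rates were kept during the secondary pre-training is an assumption, since the text only refers to the classic BERT strategies. The function name and toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, targets) for masked language modeling.

    targets[i] holds the original token where a prediction is required
    and None elsewhere. Rates follow the original BERT recipe (assumed).
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, targets


# Hypothetical example sentence and vocabulary.
sentence = ["[CLS]", "cage", "culture", "releases", "nitrogen", "[SEP]"]
toy_vocab = ["pond", "bloom", "phosphate", "tide"]
inputs, targets = mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=1)
```

The model is trained to recover `targets` from `inputs`; on a domain corpus this pushes the encoder to predict terms such as species names and nutrient parameters from their context.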

3.4. Model Fine-Tuning for Information Extraction

To enable the automated extraction of key information, the pre-trained Aquaculture-BERT model was adapted and applied to two distinct downstream tasks: NER and RE. This adaptation was accomplished through fine-tuning, which involved adding task-specific layers onto the core model architecture. Figure 3 illustrates the specific fine-tuning architectures for both the NER and RE tasks, detailing how the model processes an input sentence to identify entities and subsequently predict the semantic relationships between them. For the benefit of an international audience, all examples and visualizations in this paper are presented in English.
To construct the knowledge graph from unstructured text, we first defined an ontology schema comprising seven entity types and six relation types, designed to comprehensively capture the core elements and associations within aquaculture-eutrophication events. The specific definitions for this schema are presented in Table 2.

3.4.1. NER

Seven entity types highly relevant to this study were predefined: Event, Location, Time, Aqua-Source (e.g., “cage culture”), Nutrient (e.g., “reactive phosphate”), Organism (e.g., Ulva prolifera), and Parameter (e.g., “Chemical Oxygen Demand,” COD). A total of 5000 sentences were sampled from the corpus for manual annotation. To ensure that the annotated dataset sufficiently represents the complexity of the entire corpus, we employed a Stratified Random Sampling method. The stratification was based on the three major source categories defined in Table 1 (Academic Literature, Official Reports, and Industry & News). We sampled sentences from each stratum in strict proportion to the document distribution of each source type in the overall corpus, thereby guaranteeing that the distribution of source types in our sample aligns with that of the total population.
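The proportional stratified sampling described above can be sketched as follows. The stratum sizes in the example are hypothetical placeholders, not the values from Table 1, and the largest-remainder rounding rule is an assumption introduced here so that the per-stratum quotas always sum to the requested total.

```python
import random


def proportional_allocation(stratum_sizes, n_total):
    """Split n_total across strata in proportion to their sizes.

    Uses largest-remainder rounding (an assumed tie-breaking rule).
    """
    total = sum(stratum_sizes.values())
    exact = {k: n_total * v / total for k, v in stratum_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    leftover = n_total - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(strata, n_total, seed=0):
    """Draw a simple random sample inside each stratum."""
    rng = random.Random(seed)
    sizes = {k: len(v) for k, v in strata.items()}
    alloc = proportional_allocation(sizes, n_total)
    return {k: rng.sample(strata[k], alloc[k]) for k in strata}


# Hypothetical stratum sizes, for illustration only.
strata = {
    "Academic Literature": [f"acad-{i}" for i in range(600)],
    "Official Reports": [f"rep-{i}" for i in range(300)],
    "Industry & News": [f"news-{i}" for i in range(100)],
}
sample = stratified_sample(strata, n_total=50)
quota = {k: len(v) for k, v in sample.items()}
```

Because the quotas follow the population shares, a source type that contributes 60% of the population also contributes 60% of the annotated sample.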
It was not feasible to pre-stratify the sample by entity or relation type during the sampling stage, as the content of the sentences was unknown prior to annotation. To enhance transparency, however, we conducted a post hoc statistical analysis of the final annotated set of 5000 sentences. A table in Appendix A details the number of sentences containing each of the seven entity types and six relation types, as well as their respective proportions within the annotated dataset. This provides readers with a clearer view of the composition of the model’s training data.
All annotators were students from relevant disciplines possessing pertinent research experience. Prior to the task, all annotators received comprehensive training on the annotation guidelines. During the process, any labeling discrepancies were resolved to a final consensus through discussion among the three-person team and by referencing authoritative literature.
To ensure annotation quality, the inter-annotator agreement was measured using Cohen’s Kappa coefficient, which yielded a value of 0.89, indicating a high degree of consistency. The annotated dataset was subsequently partitioned into training, validation, and test sets at a strict 8:1:1 ratio. For the fine-tuning process, a token classification layer was added on top of the Aquaculture-BERT model. The model was then fine-tuned on the annotated dataset for 5 epochs using the AdamW optimizer with a learning rate of 2 × 10⁻⁵.
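As a worked illustration of the two quantities in this paragraph, the sketch below computes Cohen's Kappa for a pair of annotators and performs an 8:1:1 split. Cohen's Kappa is defined for two raters; how pairwise values were aggregated across the three-person team is not stated in the text, so the sketch covers a single pair only. The toy labels and function names are hypothetical.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def split_811(items, seed=0):
    """Shuffle and partition into train/validation/test at 8:1:1."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Toy annotations: agreement on 3 of 4 tokens.
ann_a = ["Event", "Event", "O", "O"]
ann_b = ["Event", "O", "O", "O"]
kappa = cohens_kappa(ann_a, ann_b)

train, val, test = split_811(range(5000))
```

In the toy case the observed agreement is 0.75 and the chance agreement is 0.5, giving a Kappa of 0.5; the reported value of 0.89 indicates agreement far above chance. Applied to 5000 sentences, the split yields 4000, 500, and 500 items.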

3.4.2. RE

Six key semantic relation types were predefined to capture the complex associations between entities: Caused_By, Occurs_In, Sourced_From, Affects, Has_Property, and Transported_By. It is crucial to clarify that the relations defined in this study (e.g., Sourced_From) are designed to capture reported causal links and strong, inferred associations as described in the source texts. They do not represent direct causality verified by empirical experimentation. These text-mined relationships form the basis of the knowledge graph, and any conclusions drawn from them should be interpreted as strong associations supported by textual evidence rather than proven causal facts. To construct a large-scale, high-coverage training set, the method of Distant Supervision was employed [66]. This approach leveraged pre-existing knowledge bases (e.g., DBpedia) and lexical-syntactic rule templates to automatically generate training samples.
While Distant Supervision effectively increases data scale, it inevitably introduces noisy labels. It is critical to acknowledge the dangers this presents. Specifically, this method can lead to two significant risks: (1) the propagation of systematic bias, where biases inherent in the external knowledge bases or rule templates are systematically learned by the model, and (2) the generation of incorrect connections, where training on spurious correlations causes the model to learn and subsequently extract factually incorrect (entity-relation-entity) triplets.
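A minimal sketch of the distant-supervision heuristic, and of how it produces the noisy labels discussed above, is given here. It assumes the simplest possible matching rule: any sentence mentioning both entities of a known triplet is labeled with that triplet's relation. The knowledge-base entry and sentences are hypothetical, and the actual pipeline additionally used lexical-syntactic rule templates that are not reproduced.

```python
def distant_label(sentences, kb):
    """Label every sentence that mentions both entities of a KB triplet.

    kb is a list of (head, relation, tail) tuples. Matching is a plain
    case-insensitive substring test, a deliberate simplification.
    """
    samples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for head, relation, tail in kb:
            if head.lower() in lowered and tail.lower() in lowered:
                samples.append((sentence, head, tail, relation))
    return samples


# Hypothetical knowledge-base entry and sentences.
kb = [("red tide", "Caused_By", "nitrogen")]
sentences = [
    "The red tide was attributed to excess nitrogen from nearby ponds.",
    "The red tide occurred although nitrogen levels were low.",
    "Cage culture expanded rapidly in the bay.",
]
samples = distant_label(sentences, kb)
```

Both of the first two sentences receive the `Caused_By` label, although the second one argues against the link. That second sample is exactly the kind of spurious training instance that the manual verification and noise-robust training steps are meant to contain.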
To address this challenge, a rigorous, multi-stage strategy for quality control and noise-robust training was designed and implemented. First, at the data level, a 15% random sample of the over 100,000 automatically generated instances was subjected to independent manual verification. The results indicated that the initial accuracy of the auto-labeling was approximately 75%; this figure rose to 91% for the validation set after correction by domain experts. This process not only yielded a high-quality validation set but also provided clear insights into the patterns and proportions of the noise.
Second, at the training level, the entirety of the auto-generated dataset was utilized for model training. This decision was made to ensure the breadth and diversity of relation types and entity coverage in the final knowledge graph, thereby avoiding the poor generalization that can result from an overly small training set. To fundamentally mitigate the impact of label noise, the technique of Label Smoothing was incorporated into the training regimen [67,68]. This technique reduces the model’s overconfidence in potentially incorrect labels by converting hard labels to soft labels, which significantly enhances the model’s generalization capability and robustness on noisy data. This combined strategy—‘large-scale noisy training + quality monitoring + noise-robust algorithm’—represents the targeted solution proposed in this study to balance data coverage with accuracy.
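The conversion from hard to soft labels can be made concrete with a short sketch. It uses the standard formulation, in which the one-hot target is mixed with a uniform distribution over the K classes: the true class receives 1 − ε + ε/K and every other class receives ε/K. The smoothing factor ε = 0.1 is an assumption, since the text does not report the value used; K = 6 matches the six relation types.

```python
import math


def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    off_value = epsilon / num_classes
    return [
        1.0 - epsilon + off_value if i == true_index else off_value
        for i in range(num_classes)
    ]


def cross_entropy(probs, target):
    """Cross-entropy between a target distribution and predicted probs."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# Six relation types; the (possibly noisy) label says class 0.
soft = smooth_labels(true_index=0, num_classes=6, epsilon=0.1)
hard = smooth_labels(true_index=0, num_classes=6, epsilon=0.0)

# A prediction that confidently disagrees with the label.
pred = [0.01, 0.95, 0.01, 0.01, 0.01, 0.01]
loss_hard = cross_entropy(pred, hard)
loss_soft = cross_entropy(pred, soft)
```

When the prediction disagrees with a label that may itself be wrong, the smoothed target yields a smaller loss than the hard target, so a single mislabeled instance pulls the model less strongly toward the incorrect class.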
During the fine-tuning stage, entity pairs within each sentence were demarcated with special markers (e.g., <E1>, </E1>) before being input into the Aquaculture-BERT model. The relationship type between the two entities was then predicted based on the output of the model’s [CLS] vector.
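The marker-insertion step can be sketched as follows. The text gives only the `<E1>`, `</E1>` pair as an example; the `<E2>`, `</E2>` pair for the second entity, the function name, and the token-span representation are assumptions made for illustration.

```python
def mark_entities(tokens, head_span, tail_span):
    """Wrap two non-overlapping entity spans in marker tokens.

    Spans are (start, end) token indices with end exclusive.
    <E2>/</E2> for the tail entity is an assumed naming convention.
    """
    opens = {head_span[0]: "<E1>", tail_span[0]: "<E2>"}
    closes = {head_span[1]: "</E1>", tail_span[1]: "</E2>"}
    out = ["[CLS]"]
    for i in range(len(tokens) + 1):
        if i in closes:   # close a span before opening an adjacent one
            out.append(closes[i])
        if i in opens:
            out.append(opens[i])
        if i < len(tokens):
            out.append(tokens[i])
    out.append("[SEP]")
    return out


tokens = ["cage", "culture", "releases", "reactive", "phosphate"]
marked = mark_entities(tokens, head_span=(0, 2), tail_span=(3, 5))
```

The marked sequence is what the encoder sees; the hidden state at the `[CLS]` position is then passed to a classification layer over the relation types.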

3.5. Knowledge Graph Construction and Storage

The vast number of (head entity, relation, tail entity) triplets extracted from the corpus required a knowledge fusion process prior to their ingestion into the Neo4j (v4.4) graph database. This process primarily involved two core stages:
  • Entity Disambiguation: A BERT-based contextual embedding vector clustering method was utilized to differentiate homonymous entities with distinct meanings in different contexts (e.g., “Fujian” as a geographical location versus “Fujian” as a naval vessel).
  • Entity Alignment: This stage mapped different textual mentions referring to the same real-world object (e.g., “Red Tide,” “HABs”) to a single entity node within the knowledge graph. This was primarily accomplished by calculating the cosine similarity between entity embedding vectors. Two entity mentions were considered coreferential and merged if their vector similarity exceeded a predefined threshold of 0.95. This threshold was determined following multiple rounds of experimental tuning on a development set, achieving an optimal balance between precision and recall for the alignment task and effectively preventing erroneous entity merges.
Through these steps, a structured AEKG was constructed, comprising approximately 3.2 million entity nodes and 8.5 million relations.
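The alignment rule described above (merge two mentions when their cosine similarity exceeds 0.95) can be sketched as follows. The three-dimensional vectors are made-up toy values rather than real BERT embeddings, and the greedy single-pass grouping is an assumption made for brevity; the merging procedure actually used is not specified beyond the similarity threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align_mentions(embeddings, threshold=0.95):
    """Greedily assign each mention to the first group whose representative
    vector has cosine similarity above the threshold; otherwise start a new group."""
    groups = []  # list of (representative_vector, [mention names])
    for name, vector in embeddings.items():
        for representative, members in groups:
            if cosine_similarity(representative, vector) > threshold:
                members.append(name)
                break
        else:
            groups.append((vector, [name]))
    return [members for _, members in groups]


# Toy embeddings (illustrative values only).
toy_embeddings = {
    "Red Tide": [0.90, 0.10, 0.40],
    "HABs": [0.88, 0.12, 0.42],
    "Cage Culture": [0.10, 0.95, 0.20],
}
aligned = align_mentions(toy_embeddings)
```

In this toy case the two bloom-related mentions fall into one group and the aquaculture mention stays separate. A high threshold such as 0.95 favors precision: some true synonyms may remain unmerged, but erroneous merges become rare.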

3.6. Performance Evaluation and Case Validation

To validate the effectiveness of the proposed approach, a dual evaluation scheme was designed:
  • Model Performance Evaluation: The NER and RE models were evaluated on the reserved, independent test set (the 10% of data partitioned in Section 3.4.1). The evaluation metrics included Precision, Recall, and F1-Score. Furthermore, the performance of the models was benchmarked against a generic BERT model and a conventional Bidirectional Long Short-Term Memory with Conditional Random Field (BiLSTM-CRF) model.
  • Framework Application Validation: Typical and well-documented eutrophication events were selected as case studies, such as the 2012 red tide in the western sea of Xiamen and the 2021 hypoxia event in the Pearl River Estuary. Path analysis and subgraph mining were conducted on the constructed AEKG by executing queries in the Cypher language. This process was designed to test whether the framework could automatically and accurately reconstruct the key driving factors and developmental trajectory of these events. The analytical results were subsequently cross-validated against official reports and relevant scientific literature.
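For reference, the three evaluation metrics can be computed as in the sketch below. It assumes exact-match scoring of (mention, type) pairs, which is a common convention but is an assumption here; the toy annotations are illustrative and are not taken from the test set.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (entity mention, entity type).
gold_entities = [
    ("Karenia mikimotoi", "Organism"),
    ("Diatom Bloom", "Event"),
    ("Cage Culture", "Aqua-Source"),
    ("dissolved inorganic nitrogen", "Parameter"),
]
predicted_entities = [
    ("Karenia mikimotoi", "Organism"),
    ("Diatom Bloom", "Event"),
    ("Cage Culture", "Aqua-Source"),
    ("Ulva prolifera", "Event"),  # spurious prediction
]
precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
```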
All collected textual data underwent a rigorous preprocessing pipeline which involved: format conversion (e.g., PDF to TXT), text cleaning (removal of irrelevant HTML tags, headers, and footers), document deduplication, paragraph merging, and sentence segmentation. The entire preprocessing pipeline was based on a Python 3.9 environment. Specifically, we utilized the PyMuPDF library for efficient PDF-to-text conversion, employed BeautifulSoup4 to clean HTML tags, and leveraged the spaCy (v3.5) NLP toolkit for precise sentence segmentation.
Following text extraction and cleaning, we implemented a strict two-step deduplication strategy. First, an initial deduplication was performed based on document titles. Second, we computed the SHA-256 hash of the cleaned text content of each document and removed all identical copies, ensuring the uniqueness of the corpus. To address formatting issues, a paragraph merging rule was applied: any line of text not ending with a standard terminal punctuation mark (e.g., period, question mark, exclamation mark) was merged with the subsequent line until a complete sentence or paragraph break was encountered. Sentence segmentation was then performed on the merged text.
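The hash-based deduplication and the paragraph merging rule can be sketched with the Python standard library alone. The function names are hypothetical, the title-based first pass is omitted, and the set of terminal punctuation marks is an assumption (full-width marks would need to be added for Chinese-language text).

```python
import hashlib

# Assumed set of terminal punctuation marks (ASCII only in this sketch).
TERMINAL_PUNCTUATION = (".", "?", "!")


def deduplicate(documents):
    """Keep the first occurrence of each document, keyed by the SHA-256 of its text."""
    seen, unique = set(), []
    for text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def merge_broken_lines(text):
    """Join lines that do not end in terminal punctuation with the following
    line; a blank line is treated as a paragraph break."""
    merged, buffer = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:  # paragraph break: flush whatever has accumulated
            if buffer:
                merged.append(" ".join(buffer))
                buffer = []
            continue
        buffer.append(line)
        if line.endswith(TERMINAL_PUNCTUATION):
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:
        merged.append(" ".join(buffer))
    return merged


unique_docs = deduplicate(["doc A", "doc B", "doc A"])
merged_lines = merge_broken_lines(
    "Nutrient loading\nincreased in 2012.\nBlooms followed."
)
```

Hashing the cleaned text rather than the raw file means that copies differing only in removed headers or HTML tags collapse to one document, although near-duplicates with small textual differences are not caught by this step.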
The resulting corpus contains approximately 210 million tokens, providing a robust data foundation for subsequent model training. In terms of geographical distribution, the corpus covers major global aquaculture regions, including East Asia, North America, and Europe. Reflecting China’s dominant position in global aquaculture and the abundance of its publicly available information, Chinese-language literature accounts for approximately 70% of the corpus.

4. Results

This section systematically presents the outcomes of the developed research framework. It begins by quantitatively evaluating the performance of the core information extraction models. It then demonstrates the structure and capabilities of the constructed AEKG from both macroscopic and microscopic perspectives. Finally, through in-depth mining of the AEKG, the analysis not only validates the framework’s effectiveness in complex case attribution but also reveals macro-level associative patterns of eutrophication in China’s coastal waters.

4.1. Quantitative Performance of the Information Extraction Models

To ensure the quality of the source data for knowledge graph construction, the performance of the information extraction models was rigorously evaluated.
As presented in Table 3, the proposed Aquaculture-BERT model achieved an overall F1-Score of 92.1%, significantly outperforming the two baseline models. Notably, the performance improvement was most pronounced for the Aqua-Source and Parameter entity types, which contain a high volume of specialized terminology, with F1-scores reaching 90.8% and 85.4%, respectively. This result strongly indicates that through secondary pre-training on the domain corpus, the model has acquired robust domain-specific knowledge. This enables it to accurately comprehend specialized concepts that are challenging for generic models to distinguish, thereby laying a solid foundation for the subsequent construction of a high-quality knowledge graph.
For a more in-depth analysis of the model’s performance, a confusion matrix on the test set was generated (Figure 4). The matrix results indicate that the model accurately identifies the vast majority of entity types, with the number of correct predictions (diagonal values) for all categories substantially exceeding the number of incorrect predictions. The primary confusion occurred between the semantically related Event and Organism categories (e.g., six Organism entities were misclassified as Event). This is logically plausible, as eutrophication events are often named after the causative organism (e.g., an ‘Ulva prolifera green tide’). However, it is important to recognize that this still represents a model failure that can impact the precision of the resulting knowledge graph by conflating a phenomenon with its biological agent.
The bar chart comparing F1-scores per category (Figure 5) further visually corroborates the performance superiority of the proposed model across all categories. Particularly in the highly specialized Aqua-Source and Parameter categories, the F1-scores of the proposed model (90.8% and 85.4%, respectively) were significantly higher than those of the generic BERT model (83.4% and 78.2%). Furthermore, the Precision-Recall (P-R) curve (Figure 6) reveals that the Area Under the Curve (AUC) for Aquaculture-BERT was 0.926, substantially higher than that of bert-base-multilingual-cased (0.879) and BiLSTM-CRF (0.842). This indicates that the model maintains superior and robust performance across various confidence thresholds.
In the RE task, Aquaculture-BERT similarly demonstrated superior performance (Table 4), achieving an overall F1-score of 86.5%. The relation confusion matrix (Figure 7) indicates that the model identifies factual relations, such as Occurs_In, with extremely high accuracy, reaching 88.7% (141/169). Although challenges were encountered in differentiating between semantically similar relations like Caused_By and Affects—for instance, 18 Affects relations were misclassified as Caused_By, and 13 Caused_By relations were misclassified as Affects—its accuracy remained significantly higher than that of the generic models (see Table 4). While understandable, this conflation has significant implications, as it could lead to an incorrect representation of the causal chain within the knowledge graph, for example, by swapping a driver with its impact. These specific error patterns inform the need for careful, expert-guided interpretation of the final graph and are a key area for future improvement.
To investigate the model’s internal decision-making mechanism, its prediction process was analyzed via attention mechanism visualization (Figure 8). The results revealed that when classifying a typical sentence containing an associative relation, the model accurately focuses its highest attention (a weight of 0.35) on key verbs such as “contributed,” thereby enabling a correct relation judgment.

4.2. Structure and Characteristics of the AEKG

The AEKG was constructed using the optimal Aquaculture-BERT model. The overall statistics of the AEKG are presented in Table 5.
Regarding macroscopic network characteristics, the ontology schema of the knowledge graph was generated (Figure 9), which clearly defines the valid connection rules between the seven entity types and five relation types. To explore the network’s topological structure, the degree distribution of all nodes was fitted (Figure 10). The distribution was found to follow a power law, a typical characteristic of a scale-free network. The fitted power-law exponent is approximately 2.3, which is consistent with the topology of real-world complex networks: a vast number of low-degree nodes coexist with a few high-degree “hub nodes” (e.g., “eutrophication,” “red tide”). This finding has significant management implications, as interventions targeting these core hubs could potentially produce cascading effects throughout the entire ecosystem.
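One common way to estimate a power-law exponent from a degree sequence is the maximum-likelihood estimator for a continuous power law, sketched below. The fitting procedure used for Figure 10 is not specified, so this estimator, the cutoff `x_min = 1`, and the synthetic sample (drawn with an exponent of 2.3 purely to exercise the code) are all assumptions. Real node degrees are integers, for which this continuous formula is only an approximation.

```python
import math
import random


def estimate_power_law_exponent(values, x_min=1.0):
    """Continuous maximum-likelihood estimate:
    alpha = 1 + n / sum(ln(x_i / x_min)) over all x_i >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n synthetic values from a continuous power law by inverse-transform sampling."""
    rng = random.Random(seed)
    return [
        x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)
    ]


# Synthetic stand-in for a degree sequence (not data from the AEKG).
synthetic_degrees = sample_power_law(20000, alpha=2.3)
alpha_hat = estimate_power_law_exponent(synthetic_degrees)
```

On a sample of this size the estimate should land close to the generating exponent. On real degree data, the choice of `x_min` matters considerably and is usually selected by a goodness-of-fit criterion.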
Finally, visualizing all event nodes in the knowledge graph by their geographical coordinates (Figure 11) shows that events in different sea areas (e.g., the Bohai Sea, East China Sea, and South China Sea) form relatively concentrated regional groupings in their spatial distribution. It should be emphasized that these groupings are a visual aggregation phenomenon and not the result of applying any statistical clustering algorithm (e.g., K-Means). This pattern reflects the regional disparities in eutrophication issues across China’s coastal waters. This spatial heterogeneity suggests that eutrophication in different sea areas may be governed by unique driving patterns, highlighting the necessity of formulating region-specific management strategies.
To showcase the microscopic knowledge details within the graph, the network was examined from multiple perspectives. Figure 12, for example, presents the association network centered on the core nutrient dissolved inorganic nitrogen. The figure reveals that various aquaculture modes act as contributing sources of dissolved inorganic nitrogen; for instance, “Pond Farming,” “Cage Culture,” and “Shrimp Farming” are all linked to dissolved inorganic nitrogen via the Sourced_From relation. Dissolved inorganic nitrogen, in turn, is connected to downstream eutrophication events such as “Diatom Bloom” and “Dinoflagellate Bloom” through the Caused_By relation. Figure 13, in contrast, focuses on the key harmful organism Karenia mikimotoi, revealing a red tide event triggered by this species and its associated spatio-temporal characteristics. For example, the event occurred in “Summer 2012,” affected sea areas including the “South China Sea” and the “East China Sea,” and resulted in the ecological impact of “Fish Kill.”

4.3. Knowledge-Driven Attribution Analysis and Case Validation

The value of the AEKG lies in its powerful analytical capabilities, which were validated through two tiers of case studies.
As a case in point, the “2012 red tide in the western sea of Xiamen” event was examined. Through the automated analysis of relevant literature and reports, the framework does not merely establish a simple link between the red tide and “high-density shrimp farming”; rather, it constructs a more refined, multi-dimensional association network (Figure 14). The system first identifies that land-based discharge in the vicinity of the affected sea area, particularly from high-density shrimp farming activities, constitutes the primary source (Sourced_From) of key nutrients such as dissolved inorganic nitrogen and dissolved inorganic phosphorus. Second, the graph reveals the concurrent environmental context: high salinity, resulting from sustained high temperatures and low rainfall, provided ideal physical conditions (Has_Property) for the bloom’s outbreak.
Under the combined influence of these factors, the knowledge graph accurately pinpoints the dominant causative algal species as Prorocentrum donghaiense. It ultimately links this species to the large-scale red tide event via a Caused_By relation, and further connects the event to subsequent ecological impacts, including fish kills, through an Affects relation. The inferred association path, automatically generated by the system from textual evidence, is not only highly consistent with scientific investigation reports on the event but also demonstrates the framework’s advanced capability for performing multi-factor integrated analysis that transcends simple linear correlations.
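The kind of path analysis described for this case can be illustrated on a handful of triples. In the study, the queries were executed in Cypher against Neo4j; the sketch below is a pure-Python stand-in that runs a breadth-first search over a toy triple list. The node names paraphrase the text, while the orientation of each triple and the choice to traverse edges in both directions are assumptions.

```python
from collections import deque

# Toy (head, relation, tail) triples paraphrasing the case description.
TOY_TRIPLES = [
    ("Dissolved inorganic nitrogen", "Sourced_From", "High-density shrimp farming"),
    ("2012 Xiamen red tide", "Caused_By", "Dissolved inorganic nitrogen"),
    ("2012 Xiamen red tide", "Caused_By", "Prorocentrum donghaiense"),
    ("2012 Xiamen red tide", "Affects", "Fish kill"),
    ("2012 Xiamen red tide", "Has_Property", "High salinity"),
]


def find_path(triples, start, goal):
    """Breadth-first search for a shortest path between two entities.
    Returns an alternating [node, relation, node, ...] list, or None."""
    adjacency = {}
    for head, relation, tail in triples:
        # Edges are traversed in both directions in this sketch.
        adjacency.setdefault(head, []).append((relation, tail))
        adjacency.setdefault(tail, []).append((relation, head))
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [relation, neighbor])
    return None


attribution_path = find_path(TOY_TRIPLES, "High-density shrimp farming", "Fish kill")
```

On this toy graph the search links the aquaculture source to the ecological impact through the nutrient and the bloom event, mirroring the source, nutrient, event, and impact chain described above.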
A comprehensive analysis was conducted on one of the most severe hypoxia events on record, which occurred in the summer of 2021 in the Pearl River Estuary, a globally significant economic zone. The event’s outbreak resulted from the synergistic effect of multiple drivers under specific climatic conditions. Through deep graph queries, the framework automatically generated a comprehensive attribution knowledge subgraph for this event (Figure 15). The subgraph not only identified the “surge in Pearl River runoff”—caused by record-breaking heavy rainfall that summer—as the primary transport pathway (Transported_By) for nutrients, but it also pinpointed local aquaculture activities, such as “estuarine cage culture,” as significant endogenous pollution sources (Sourced_From). More importantly, the system elucidated a coupled physical-biological mechanism: the “water column stratification” (Has_Property), induced by the strong runoff, was exceptionally stable, thereby hindering vertical water exchange and creating conditions conducive to bottom-layer oxygen consumption. Concurrently, the subsequent decomposition of a “diatom bloom,” triggered by the influx of nutrient-rich upstream waters, was identified as the direct cause (Caused_By) of the massive depletion of dissolved oxygen. A comparison between the analytical pathways generated by the system and the core conclusions from authoritative research on this specific event revealed a high degree of consistency in the identification of key driving factors. This demonstrates the framework’s robust capability and scientific reliability in dissecting complex, sudden-onset environmental problems.

4.4. Macro-Level Association Analysis Based on the Knowledge Graph

In addition to attributing individual events, the AEKG facilitates the discovery of macro-level patterns. Based on the graph data, a distribution map illustrating the number of eutrophication events in China’s coastal provinces was generated (Figure 16). The results indicate that these events are highly concentrated geographically, with Zhejiang, Fujian, and Shandong identified as high-incidence areas with the largest number of occurrences. To investigate potential driving factors, Figure 17 further illustrates the latitudinal distribution characteristics of different aquaculture modes (pond, cage, and raft). Figure 17 shows that Pond Farming and Cage Culture are primarily concentrated in mid- to low-latitude regions, whereas Raft Culture is more prevalent in higher-latitude regions north of 35° N.
A combined analysis of Figures 16 and 17 reveals a significant spatial coupling between the distribution of aquaculture modes and the high-incidence areas. For example, Zhejiang and Fujian, both high-incidence provinces, correspond precisely to the dense regions of pond and cage culture shown in Figure 17. Meanwhile, Shandong, another high-incidence area, aligns with the primary distribution zone for raft culture. This spatial coupling strongly suggests that the agglomeration of specific aquaculture modes is likely a key factor driving the high frequency of regional eutrophication events.
To elucidate the complex relationships among different driving factors, an association heatmap of the primary drivers was generated (Figure 18). The heatmap visually displays the proportional association between major aquaculture sources (pond, cage, raft), core nutrients, and key causative organisms (diatoms, dinoflagellates) using color gradients and numerical values, where the values represent the proportion of each associated factor within a given aquaculture type. The heatmap indicates a high degree of association between specific aquaculture modes and particular organisms or nutrients. For instance, the co-occurrence frequency of “Pond-Dinoflagellate” was the highest, while that of “Raft-Diatom” was also extremely high.
To quantify these associations more clearly, Figure 19 further highlights the normalized association strength of the top five most frequent “aquaculture source-causative organism” co-occurrence pairs in a bar chart format. The results further validate that “Pond-Dinoflagellate” (association strength: 0.69) and “Raft-Diatom” (association strength: 0.64) are the two most prominent patterns, with association strengths significantly higher than other combinations. This finding provides direct data support for the formulation of targeted, differentiated management and control strategies based on specific aquaculture modes.
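A proportion of the kind shown in the heatmap (the share of each associated factor within a given aquaculture type) can be computed from co-occurrence pairs as sketched below. The exact normalization behind Figures 18 and 19 is not detailed, so row-wise normalization by aquaculture source is an assumption, and the counts are invented toy values rather than results from the AEKG.

```python
from collections import Counter


def association_strength(pairs):
    """Share of each (source, organism) pair among all pairs with the same source."""
    pair_counts = Counter(pairs)
    source_totals = Counter(source for source, _ in pairs)
    return {
        (source, organism): count / source_totals[source]
        for (source, organism), count in pair_counts.items()
    }


def top_pairs(strengths, k=5):
    """Return the k strongest associations, highest first."""
    return sorted(strengths.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented toy co-occurrence pairs (illustrative only).
toy_pairs = (
    [("Pond", "Dinoflagellate")] * 3
    + [("Pond", "Diatom")] * 1
    + [("Raft", "Diatom")] * 2
    + [("Raft", "Dinoflagellate")] * 2
)
strengths = association_strength(toy_pairs)
ranking = top_pairs(strengths, k=2)
```

Normalizing within each aquaculture type makes modes with very different numbers of documented events comparable, at the cost of hiding how many events each proportion is based on.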

5. Discussion

This study successfully constructed a knowledge graph framework driven by a domain-specific BERT model, achieving the automated identification and attribution analysis of aquaculture-related eutrophication events. This section further elucidates the scientific implications of the research findings and analyzes the methodological innovations and core advantages of the study.

5.1. Interpretation and Elucidation of Key Findings

In contrast to conventional methods that rely on water quality monitoring, numerical models, or purely manual literature reviews, the framework developed in this study demonstrates significant advantages in processing large-scale, unstructured text data. The primary quantitative advantage of our framework lies in its superior information extraction quality. As detailed in Table 3 and Table 4, our Aquaculture-BERT model achieved a significant performance gain over the general BERT and BiLSTM-CRF baselines, with F1-scores of 92.1% for NER and 86.5% for RE. In terms of processing efficiency, the framework is more than two orders of magnitude faster than manual methods. On our hardware configuration, it achieves an average processing throughput of approximately 500 documents per hour. In contrast, a domain expert performing manual extraction of equivalent depth and detail typically processes no more than 3 documents per hour, which corresponds to a speed-up of roughly 170-fold. Regarding inference latency, the average processing time for a single document is on the order of seconds (about 7 s at this throughput). While the computational cost of pre-training the model is substantial, it represents a one-time investment. During the application phase, the model can be deployed on a single GPU, making its operational cost significantly lower than the long-term expense of relying on a large team of human experts.
Although previous studies have attempted to analyze environmental texts using NLP, these efforts were largely confined to keyword matching or traditional machine learning approaches. Such methods struggle to capture deep semantic meanings and were generally limited to specific events or small-scale datasets, which we explicitly define here as corpora typically containing fewer than 10,000 annotated sentences or covering only a single geographical region or a specific type of event. In sharp contrast, this study, by introducing a domain-pre-trained BERT model, achieves a deep comprehension of specialized knowledge. Based on this, a large-scale, high-coverage AEKG was constructed, comprising over 3.2 million entities and 8.5 million relations, thereby enabling systematic attribution and pattern discovery on a much broader scale.
First, the significant performance improvement of the Aquaculture-BERT model over generic BERT in domain-specific information extraction (achieving an F1-score of 92.1%) fundamentally validates the critical value of Domain Adaptation in scientific text mining. General-purpose language models, owing to the generalized nature of their training corpora, face the challenge of “semantic drift” in specialized fields, where the same term can have different meanings in general versus professional contexts. Through secondary pre-training on a massive corpus of specialized literature, the model was effectively calibrated and empowered to overcome this semantic gap. This enables it to comprehend terminology, disambiguate meaning, and capture implicit contextual relationships with an accuracy comparable to that of a human domain expert, which is a fundamental prerequisite for ensuring the quality and scientific integrity of the source data for the knowledge graph.
Second, the scale-free network characteristic exhibited by the constructed AEKG (with its degree distribution following a power-law) offers profound insights for environmental management. In a complex ecological pollution network, the existence of a few “hub nodes” (e.g., “eutrophication,” “red tide,” “cage culture”) signifies more than just high term frequency; these nodes represent critical structural points for the transmission of information and influence within the system. They function as “relay stations” for information and impact, and changes to them can potentially propagate widely throughout the rest of the network via their dense connections. This has significant managerial implications: rather than diffusing resources through indiscriminate intervention across all nodes, concentrating limited management resources on these critical hubs could produce cascading effects, thereby achieving more efficient and cost-effective control objectives.
Furthermore, the typical pollution patterns discovered through macro-level association analysis demonstrate the immense potential of this AI framework as an engine for scientific hypothesis generation and validation. As indicated by the association strength analysis (Figure 18 and Figure 19), the framework automatically identified “Pond-Dinoflagellate” and “Raft-Diatom” as the two most dominant disaster patterns. These patterns, automatically mined from vast amounts of data, align closely with classic ecological theories regarding the role of nutrient stoichiometry in determining phytoplankton community structure (e.g., the Redfield Ratio theory) [69]. For instance, effluent from land-based pond aquaculture typically has a high nitrogen-to-phosphorus ratio, which favors the dominance of dinoflagellate populations that thrive in high-nitrogen environments. In contrast, open-sea activities like raft culture have a relatively smaller impact on the ambient N:P ratio, where background populations like diatoms are more likely to form blooms. The ability of the AI framework to systematically reproduce these established scientific principles from unstructured text serves as a powerful validation of its scientific soundness. More importantly, it suggests that the framework can be applied to explore more complex and unknown domains, discovering new scientific questions and potential regularities from data.
These findings do not merely hold theoretical value; they provide a direct pathway to addressing global sustainability challenges, particularly UN Sustainable Development Goal 14 (Life Below Water). For instance, the identification of “hub nodes” such as “cage culture” provides policymakers with clear, high-impact targets for intervention, enabling more efficient and cost-effective environmental management strategies. Similarly, the discovery of macro-level pollution patterns directly supports the development of targeted, region-specific mitigation policies, moving beyond ineffective one-size-fits-all approaches. More broadly, the AEKG framework itself serves as a dynamic tool for the data-driven monitoring and assessment of policy effectiveness, which is crucial for the adaptive management cycles required to meet long-term sustainability targets. By translating vast textual data into actionable insights, this research provides a tangible mechanism to bridge the gap between scientific knowledge and the practical governance needed to protect marine ecosystems.

5.2. Innovation and Advantages of the Study

The fundamental innovation of this study lies in the development and implementation of a novel paradigm that deeply integrates data-driven discovery with knowledge verification. Rather than completely discarding the traditional hypothesis-driven logic of scientific inquiry, this paradigm enhances it by employing data-driven techniques to efficiently validate, quantify, and systematize existing scientific consensus and classical theories within the field. For instance, the alignment observed between the identified eutrophication patterns and established ecological theories (as demonstrated in Section 4.4 and discussed in Section 5.1) serves as compelling evidence of the framework’s scientific robustness.
Building upon this foundation, the core advantage of the framework is its capacity to function as a generative engine for scientific hypotheses. It can extract novel correlations and patterns—often imperceptible to conventional research methods—from vast and seemingly unrelated datasets. This capability significantly advances the data-driven frontier of environmental science, enabling deeper and broader exploration. Specifically, this study proposes and realizes a new paradigm of data-driven, knowledge-automated construction, with the following key advantages:
  • Breakthrough in Domain-Specific Semantic Comprehension:
By conducting secondary pre-training on a corpus exceeding 210 million tokens of domain-specific texts, the developed Aquaculture-BERT model effectively overcomes the “semantic drift” issue inherent in general-purpose models when applied to specialized literature. This is not a mere fine-tuning effort but a significant enhancement that enables the model to accurately comprehend highly technical terms such as Karenia mikimotoi and complex contextual expressions. It provides a powerful new paradigm for precise text mining in environmental science.
  • Transformation in Knowledge Acquisition and Analytical Paradigms:
The proposed integration framework, which combines BERT with knowledge graph techniques, merges cutting-edge natural language processing with structured knowledge representation. This enables a transition from “manual consultation” to “automated attribution.” Beyond processing large-scale unstructured data, the framework constructs a computable and inferable knowledge network, thereby revealing deep-seated relationships that are inaccessible through conventional approaches.
  • Development of a Scalable Decision-Support Tool:
The resulting AEKG constitutes a pivotal contribution. Unlike static databases, the AEKG functions as a dynamic and extensible knowledge engine. Through structured knowledge storage and graph-based querying capabilities, it establishes a robust data-driven foundation for precise pollution source tracing, risk assessment, and simulation of management strategies. This represents a critical step toward intelligent decision-support systems for environmental management.
However, a balanced perspective requires acknowledging both the potential dangers of over-reliance on such AI-driven models and the practical challenges of their implementation. Crucially, this framework is designed to augment, not replace, expert judgment. The primary risks involve “automation bias” (the uncritical acceptance of its outputs) and the model’s potential to inherit and amplify publication biases from its training corpus. Therefore, its findings should be interpreted as strong, text-based correlations that require expert validation, rather than definitive, experimentally verified causation. In addition to these risks, the significant computational resources required for both model training and the ongoing maintenance of the knowledge graph pose practical hurdles to widespread adoption. Acknowledging these challenges is essential for guiding future research toward more accessible AI systems in environmental science.

5.3. System Scalability and Future Maintenance Strategy

To address long-term viability, we have also considered system scalability and maintenance. The Neo4j graph database used in this study demonstrates excellent horizontal scalability, theoretically capable of supporting tens of billions of nodes and relationships, far exceeding the current graph’s scale. At its present size, typical multi-hop query latencies are stable in the millisecond-to-second range, ensuring robust performance.
Moreover, to maintain the timeliness of the knowledge, we plan to establish a quarterly update cadence to incrementally process new literature. To ensure reproducibility, we will also implement a strict versioning policy, where each major update results in a complete, archived snapshot of the graph (e.g., ‘AEKG-v1.0’, ‘AEKG-v1.1’).

6. Concluding Remarks

This study successfully developed and implemented a novel intelligent framework that integrates a domain-specific language model (Aquaculture-BERT) with knowledge graph technologies, achieving a breakthrough in the automated identification and attribution of marine eutrophication events induced by aquaculture. Experimental results demonstrate that secondary pre-training on domain-specific corpora significantly enhances the model’s performance in scientific information extraction tasks, enabling the efficient and accurate construction of large-scale, high-quality knowledge graphs from unstructured textual data.
Validation using real-world cases—such as the 2012 Xiamen red tide and the 2021 Pearl River Estuary hypoxia event—revealed the framework’s powerful capability to automatically reconstruct the complex association pathways of environmental events from vast textual data. The resulting analytical networks exhibit a high degree of concordance with the findings presented in authoritative scientific conclusions and investigation reports. More importantly, the framework not only validates known environmental issues but also uncovers macro-level spatiotemporal patterns and latent driver–response relationships from data, underscoring its potential as a generative engine for scientific discovery.
The overarching aim of this research is to serve real-world environmental governance. The outcomes provide robust scientific and technological support for achieving United Nations Sustainable Development Goal 14 (Life Below Water). The constructed knowledge graph—AEKG—represents more than a scholarly achievement; it serves as a prototype of an advanced decision support system.
In practical terms, its application scenarios include:
  • Integration with geographic information system platforms and real-time water quality monitoring networks. When nutrient concentrations exceed thresholds in a given marine area, the system can automatically identify and highlight the most probable pollution sources (such as specific types of cage aquaculture) based on relational proximity, thereby offering environmental authorities precise targets for prioritized investigation.
  • Support for environmental impact assessments. During the planning phase of new aquaculture projects, the graph can be queried to assess the correlation strength between similar production modes and historical eutrophication events, offering a quantitative basis for ecological risk prediction.
This system enables a strategic shift from “broad-spectrum surveillance” to “targeted source tracing,” allowing regulatory agencies to allocate limited resources more efficiently toward high-risk sources. By analyzing spatially differentiated aquaculture practices and pollution pathways, the framework provides essential data support for risk assessment and early warning and even allows for prospective simulation and evaluation of environmental policy interventions.
Specifically, these applications provide a direct pathway to achieving quantifiable progress toward UN Sustainable Development Goal Target 14.1, which calls for preventing and significantly reducing marine pollution, including nutrient pollution. By identifying and ranking the primary drivers of eutrophication (e.g., “cage culture,” “excess feed”), our framework provides the evidence base needed for policies precisely targeted at the most significant sources of pollution. Furthermore, we propose that the framework can be used longitudinally; by periodically re-analyzing the evolving scientific literature, it offers a novel, data-driven method for monitoring the effectiveness of interventions and assessing progress toward meeting Target 14.1.
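As a rough illustration of what "identifying and ranking the primary drivers" might look like in practice, the sketch below counts how many distinct events each driver is linked to. The event identifiers, the pair list, and the helper `rank_drivers` are hypothetical; the study does not specify its ranking procedure, so this is one plausible reading rather than the method used.

```python
from collections import Counter

# Hypothetical (event, driver) pairs, as might be read off Caused_By / Sourced_From paths.
EVENT_DRIVERS = [
    ("event-1", "cage culture"),
    ("event-1", "excess feed"),
    ("event-2", "cage culture"),
    ("event-3", "pond farming"),
    ("event-3", "cage culture"),
    ("event-3", "cage culture"),  # repeated mention within the same event
]


def rank_drivers(pairs):
    """Return (driver, event_count, share_of_events), most frequent driver first."""
    distinct = set(pairs)  # count each driver at most once per event
    counts = Counter(driver for _event, driver in distinct)
    n_events = len({event for event, _driver in distinct})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(driver, count, count / n_events) for driver, count in ranked]


for driver, count, share in rank_drivers(EVENT_DRIVERS):
    print(f"{driver}: linked to {count} event(s), {share:.0%} of events")
```

Re-running the same count on successive snapshots of the literature is one simple way the longitudinal use proposed above could be operationalized.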
However, achieving this policy impact requires more than just technology. Its successful implementation hinges on addressing key governance and institutional challenges. For the framework’s insights to translate into action, they must be embedded within existing environmental governance structures, requiring clear protocols and inter-agency collaboration. Equally critical is capacity building and training for end-users, such as environmental analysts and policymakers. Such training must cover not only how to operate the tool but also how to critically interpret its outputs, understand its inherent limitations, and use it effectively as a decision-support instrument rather than an infallible “black box”.
This study acknowledges several limitations. First, the analytical outcomes of the model are partially influenced by latent publication biases embedded in the training corpus. Second, the proposed framework is designed to uncover associative relationships from textual data rather than experimentally verified causal relationships; all attribution pathways generated are inferential in nature and grounded in existing knowledge. Third, the reliance on distant supervision for the relation extraction task, while necessary for scale, introduces the inherent risk of “label noise”: systematic biases present in the source knowledge bases, or factually incorrect connections, could be propagated into the final knowledge graph. Our multi-stage mitigation strategy, which combines manual verification with noise-robust techniques such as label smoothing, was designed specifically to counteract this risk, but the influence of such noise cannot be entirely eliminated and remains a key limitation of the current framework. Fourth, the generalizability of the model, particularly in handling previously unseen pollution events or novel aquaculture patterns, requires further validation with more heterogeneous datasets.
Nonetheless, this study lays a solid foundation for a new data-driven paradigm in environmental science centered on automated knowledge construction. Future efforts will focus on developing a multimodal knowledge graph that integrates remote sensing imagery and real-time monitoring data and on incorporating cutting-edge causal inference models. Ultimately, the objective is to build an interactive, visualization-based decision-support platform tailored for environmental managers, thereby transforming research outputs into practical tools for informed decision-making and contributing to the global conservation and sustainable development of marine ecosystems.
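For readers unfamiliar with the label smoothing mentioned among the limitations, the sketch below shows the standard formulation, in which the one-hot target for a relation label is mixed with a uniform distribution over all classes. It is a generic illustration, not the configuration used in this study: the smoothing factor of 0.1 and the helper names are assumptions, and only the class count of six matches the number of AEKG relation types.

```python
import math


def smoothed_targets(label, num_classes, epsilon=0.1):
    """Standard label smoothing: (1 - epsilon) on the labeled class plus epsilon / K everywhere."""
    base = epsilon / num_classes
    return [base + (1.0 - epsilon if k == label else 0.0) for k in range(num_classes)]


def cross_entropy(probs, targets):
    """Cross-entropy between predicted probabilities and a (possibly soft) target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, targets) if t > 0)


# Six relation types, as in the AEKG schema; epsilon = 0.1 is an assumed value.
targets = smoothed_targets(label=1, num_classes=6, epsilon=0.1)
predicted = [0.02, 0.90, 0.02, 0.02, 0.02, 0.02]  # hypothetical model output
print(targets)
print(cross_entropy(predicted, targets))
```

Because the target never places full probability on a single class, a mislabeled distant-supervision example pulls the model less strongly toward the wrong relation than a hard one-hot target would.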

Author Contributions

All authors contributed to the study conception and design. D.H. performed the methodology, software application, visualization, and initial draft writing. B.X. contributed to formal analysis and interpretation of results. M.G. was responsible for data curation and technical support. M.Z. performed editing, review, and validation. J.L. participated in writing (review and editing) and project administration and provided overall supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

We extend our sincere gratitude to Dezhao Shao for their kind support during the course of this work, and we wish them and their child continued happiness and good health. We are also grateful to the anonymous reviewers and the editorial team for their constructive comments and valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Distribution of Entity and Relation Types in the Annotated Corpus (N = 5000 Sentences).
Part A: Entity Types
Entity Type | Sentence Count (Containing Type) | Proportion of Corpus (%)
Event | 3600 | 72.0%
Location | 3950 | 79.0%
Time | 3550 | 71.0%
Aqua-Source | 2250 | 45.0%
Nutrient | 2850 | 57.0%
Organism | 2500 | 50.0%
Parameter | 1750 | 35.0%
Part B: Relation Types
Relation Type | Sentence Count (Containing Type) | Proportion of Corpus (%)
Occurs_In | 3100 | 62.0%
Caused_By | 2650 | 53.0%
Affects | 1700 | 34.0%
Sourced_From | 1900 | 38.0%
Has_Property | 1300 | 26.0%
Transported_By | 850 | 17.0%
Note: This table presents a post hoc analysis of the 5000 sentences manually annotated for fine-tuning the information extraction models. The “Sentence Count” indicates the number of sentences in the corpus that contain at least one instance of the specified entity or relation type. Percentages are calculated based on the total corpus size.
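The proportions in Table A1 follow directly from the sentence counts and the corpus size of 5000. The short check below recomputes them; the variable and function names are illustrative only.

```python
CORPUS_SIZE = 5000  # annotated sentences, from the Table A1 caption

ENTITY_SENTENCE_COUNTS = {
    "Event": 3600, "Location": 3950, "Time": 3550, "Aqua-Source": 2250,
    "Nutrient": 2850, "Organism": 2500, "Parameter": 1750,
}
RELATION_SENTENCE_COUNTS = {
    "Occurs_In": 3100, "Caused_By": 2650, "Affects": 1700,
    "Sourced_From": 1900, "Has_Property": 1300, "Transported_By": 850,
}


def proportions(counts, total):
    """Percentage of sentences containing each type, rounded to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


entity_pct = proportions(ENTITY_SENTENCE_COUNTS, CORPUS_SIZE)
relation_pct = proportions(RELATION_SENTENCE_COUNTS, CORPUS_SIZE)
print(entity_pct)
print(relation_pct)
```

Because a single sentence can contain several entity or relation types, the percentages within each part are not expected to sum to 100%.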

References

  1. Carbonell-Garzon, E.; Fernandez-Gonzalez, V.; Romero, F.; Sanchez-Jerez, P.; Belda, L.; Agraso, M.M.; Toledo-Guedes, K. Validating a proxy to carrying capacity for finfish offshore aquaculture in the western Mediterranean. Aquaculture 2025, 608, 742756. [Google Scholar] [CrossRef]
  2. Ferreira, J.G. Aquaculture carrying capacity estimates show that major African lakes and marine waters could sustainably produce 10–11 Mt of fish per year. Nat. Food 2025, 6, 446–455. [Google Scholar] [CrossRef]
  3. Charalambides, M.; Menicou, M.; Triantaphyllidis, G. Economic feasibility study for the expansion of the Cyprus aquaculture sector: A roadmap for transition to offshore in the Mediterranean Sea. Aquac. Fish. 2025, 10, 522–531. [Google Scholar] [CrossRef]
  4. Li, X.; Dong, X.; Yue, F.; Lang, Y.; Ding, H.; Li, X.; Li, S.; Liu, X. Nitrous oxide emissions at aquaculture ponds in the coastal zone of the Bohai Rim Region of China: Impacts of eutrophication and feeding practice. Environ. Pollut. 2025, 371, 125959. [Google Scholar] [CrossRef]
  5. Zhang, J.; Tishchenko, P.Y.; Jiang, Z.J.; Semkin, P.Y.; Tishchenko, P.P.; Zheng, W.; Lobanov, V.B.; Sergeev, A.F.; Jiang, S. Diverse nature of the seasonally coastal eutrophication dominated by oceanic nutrients: An eco-system based analysis characterized by salmon migration and aquaculture. Mar. Pollut. Bull. 2023, 193, 115150. [Google Scholar] [CrossRef]
  6. Sun, Z.; Luo, J.; Xu, Y.; Zhai, J.; Cao, Z.; Ma, J.; Qi, T.; Shen, M.; Gu, X.; Duan, H. Coordinated dynamics of aquaculture ponds and water eutrophication owing to policy: A case of Jiangsu province, China. Sci. Total Environ. 2024, 927, 172194. [Google Scholar] [CrossRef]
  7. Glibert, P.M.; Burkholder, J.M. Causes of harmful algal blooms. Harmful Algal Blooms: A Compendium Desk Reference; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2018; pp. 1–38. [Google Scholar]
  8. Fricke, A.; Pey, A.; Gianni, F.; Lemée, R.; Mangialajo, L. Multiple stressors and benthic harmful algal blooms (BHABs): Potential effects of temperature rise and nutrient enrichment. Mar. Pollut. Bull. 2018, 131, 552–564. [Google Scholar] [CrossRef]
  9. Ou, L.-J.; Wang, Z.; Ding, G.-M.; Han, F.-X.; Cen, J.-Y.; Dai, X.-F.; Li, K.-Q.; Lu, S.-H. Organic nutrient availability and extracellular enzyme activities influence harmful algal bloom proliferation in a coastal aquaculture area. Aquaculture 2024, 582, 740530. [Google Scholar] [CrossRef]
  10. Yan, Z.; Alamdari, N. Integrating temporal decomposition and data-driven approaches for predicting coastal harmful algal blooms. J. Environ. Manag. 2024, 364, 121463. [Google Scholar] [CrossRef]
  11. Thiagarajan, C.; Devarajan, Y. The urgent challenge of ocean pollution: Impacts on marine biodiversity and human health. Reg. Stud. Mar. Sci. 2025, 81, 103995. [Google Scholar] [CrossRef]
  12. Choudhury, M.; Roy, P. Challenges with microplastic pollution in the regime of UN sustainable development goals. World Dev. Sustain. 2025, 6, 100216. [Google Scholar] [CrossRef]
  13. Dasari, D.; Dong, C.-D.; Singhania, R.R.; Tambat, V.S.; Piechota, G.; Patel, A.K. Harnessing microalgae for sustainable aquaculture and mariculture: Marine pollution mitigation and circular economy strategies. Mar. Pollut. Bull. 2025, 219, 118292. [Google Scholar] [CrossRef] [PubMed]
  14. Ural-Janssen, A.; Kroeze, C.; Meers, E.; Strokal, M. Large reductions in nutrient losses needed to avoid future coastal eutrophication across Europe. Mar. Environ. Res. 2024, 197, 106446. [Google Scholar] [CrossRef] [PubMed]
  15. Yao, H.; Wang, J.; Han, Y.; Jiang, X.; Chen, J. Decadal acidification in a subtropical coastal area under chronic eutrophication. Environ. Pollut. 2022, 293, 118487. [Google Scholar] [CrossRef]
  16. Alluhaidan, A.S.; P, P.; Aziz, R.; Basheer, S. Enhanced LSTM-based AI model for accurate dissolved oxygen prediction in aquaculture systems. Smart Agric. Technol. 2025, 12, 101140. [Google Scholar] [CrossRef]
  17. Tamrin, T.; Schaduw, J.N.W.; Wantasen, A.S.; Sambali, H.; Abdullah, T. Rapid appraisal of brackishwater aquaculture (Rap-BWAC) for assessing sustainable development of brackishwater aquaculture in coastal areas. Ocean Coast. Manag. 2025, 269, 107822. [Google Scholar] [CrossRef]
  18. Sriputhorn, K.; Jutagate, A.; Matitopanum, S.; Kraiklang, R.; Pitakaso, R.; Chueadee, C.; Gonwirat, S. Advancing smart aquaculture: Cost-efficient strategies for climbing perch cultivation using AI-based models. Smart Agric. Technol. 2025, 12, 101108. [Google Scholar] [CrossRef]
  19. Yang, H.; Feng, Q.; Xia, S.; Wu, Z.; Zhang, Y. AI-driven aquaculture: A review of technological innovations and their sustainable impacts. Artif. Intell. Agric. 2025, 15, 508–525. [Google Scholar] [CrossRef]
  20. Sun, C.; Yang, X.; Liu, C.; Ye, Y.; Li, S.; Xu, X.; Zhou, C. Typical farming behaviors recognition in aquaculture using an improved VMamba approach. Aquac. Eng. 2025, 111, 102603. [Google Scholar] [CrossRef]
  21. Roy, S.M.; Choi, H.; Kim, T. Review of state-of-the-art improvements in recirculating aquaculture systems: Insights into design, operation, and statistical modeling approaches. Aquaculture 2025, 605, 742545. [Google Scholar] [CrossRef]
  22. Rejeb, A.; Rejeb, K.; Keogh, J.G. The nexus of IoT and aquaculture: A bibliometric analysis. Appl. Food Res. 2025, 5, 100838. [Google Scholar] [CrossRef]
  23. Roy, S.M.; Beg, M.M.; Bhagat, S.K.; Charan, D.; Pareek, C.M.; Moulick, S.; Kim, T. Application of artificial intelligence in aquaculture—Recent developments and prospects. Aquac. Eng. 2025, 111, 102570. [Google Scholar] [CrossRef]
  24. Chandran, P.J.I.; Khalil, H.A.; Hashir, P.K.; S, V. Smart technologies in aquaculture: An integrated IoT, AI, and blockchain framework for sustainable growth. Aquac. Eng. 2025, 111, 102584. [Google Scholar] [CrossRef]
  25. Xie, W.; Zhao, M.; Liu, Y.; Yang, D.; Huang, K.; Fan, C.; Wang, Z. Recent advances in Transformer technology for agriculture: A comprehensive survey. Eng. Appl. Artif. Intell. 2024, 138, 109412. [Google Scholar] [CrossRef]
  26. Huang, X.; Wen, Y.; Zhang, F.; Li, H.; Sui, Z.; Cheng, X. Accident analysis of waterway dangerous goods transport: Building an evolution network with text knowledge extraction. Ocean Eng. 2025, 318, 120176. [Google Scholar] [CrossRef]
  27. Shi, L.; Wang, X.; He, Y.; He, Z. Beyond Noise: A BERT-Enhanced framework for Intelligent product optimization via online review Analytics. Expert Syst. Appl. 2026, 296, 128812. [Google Scholar] [CrossRef]
  28. Meier, A.; Eller, R.; Peters, M. Creating competitiveness in incumbent small- and medium-sized enterprises: A revised perspective on digital transformation. J. Bus. Res. 2025, 186, 115028. [Google Scholar] [CrossRef]
  29. Wahid, J.A.; Xu, M.; Ayoub, M.; Jiang, X.; Lei, S.; Gao, Y.; Hussain, S.; Yang, Y. AI-driven social media text analysis during crisis: A review for natural disasters and pandemics. Appl. Soft Comput. 2025, 171, 112774. [Google Scholar] [CrossRef]
  30. Liu, Y.; Zhou, X.; Zhang, Z.; Yang, X. BETM: A new pre-trained BERT-guided embedding-based topic model. Big Data Res. 2025, 41, 100551. [Google Scholar] [CrossRef]
  31. Zauscher, E.; Fornarelli, R.; Berglund, E.Z. Conserving groundwater resources through hybrid water systems and sharing rainwater. J. Hydrol. 2025, 652, 132641. [Google Scholar] [CrossRef]
  32. Xing, W.; Zhang, J.; Li, C.; Dong, G. iAMP-EmGCN: A new design for identifying antimicrobial peptides based on BERT and Graph Convolutional Network. Expert Syst. Appl. 2025, 283, 127811. [Google Scholar] [CrossRef]
  33. Zhang, S.; Xu, J.; Xie, H.; Fu, Q.; Miao, K.; Cheng, S.; Wu, Z. Enhanced knowledge graph cascade learning model for cyber–physical systems. Eng. Appl. Artif. Intell. 2025, 160, 111802. [Google Scholar] [CrossRef]
  34. Dong, B.; Bu, C.; Wang, Y.; Zhu, Y.; Wu, X. Disentangled Multi-view Graph Neural Network for multilingual knowledge graph completion. Appl. Soft Comput. 2025, 183, 113605. [Google Scholar] [CrossRef]
  35. Yang, J.; Yang, X.; Li, R.; Luo, M.; Jiang, S.; Zhang, Y.; Wang, D. BERT and hierarchical cross attention-based question answering over bridge inspection knowledge graph. Expert Syst. Appl. 2023, 233, 120896. [Google Scholar] [CrossRef]
  36. Yang, Y.; Cao, W.; Li, Y.; Chang, R.; Yang, G.; Gan, C.; Wu, M. OD-Mind: An ocean drilling expert knowledge query system driven by knowledge graph. Ocean Eng. 2025, 339, 122135. [Google Scholar] [CrossRef]
  37. Alsaleh, M. The impact of aquaculture economics expansion on marine water quality in the EU Region. Reg. Stud. Mar. Sci. 2024, 77, 103625. [Google Scholar] [CrossRef]
  38. Xu, S.; Yu, Z.; Zhou, Y.; Yue, S.; Liang, J.; Zhang, X. The potential for large-scale kelp aquaculture to counteract marine eutrophication by nutrient removal. Mar. Pollut. Bull. 2023, 187, 114513. [Google Scholar] [CrossRef]
  39. Baquedano, M.; Chávez, C.; Dresdner, J.; Eggert, H. The Rise of Mussel Aquaculture in Chile: Causes, Effects, and Challenges. Rev. Aquac. 2025, 17, e70045. [Google Scholar] [CrossRef]
  40. Mounic-Silva, C.E.; Boscatto, F.; Dalri, J.C.; Polli, M.A.; Mattos, F.T.; Rodrigues Meneghetti, E.; Meante, R.E.X.; Buglione-Neto, C.C.; Nuñer, A.P.D.O. GIS-Based Aquaculture Spatial Suitability in Western Amazonian Hydroelectric Reservoirs. Lakes Reserv. Sci. Policy Manag. Sustain. Use 2025, 30, e70016. [Google Scholar] [CrossRef]
  41. Haque, M.M.; Mahmud, M.N. Potential Role of Aquaculture in Advancing Sustainable Development Goals (SDGs) in Bangladesh. Aquac. Res. 2025, 2025, 6035730. [Google Scholar] [CrossRef]
  42. Aydın, İ.; Öztürk, R.Ç.; Eroldoğan, O.T.; Arslan, M.; Terzi, Y.; Yılmaz, S.; Diken, G.; Yıldırım, Ö.; Bodur, T.; Gültepe, N.; et al. An In-Depth Analysis of the Finfish Aquaculture in Türkiye: Current Status, Challenges, and Future Prospects. Rev. Aquac. 2025, 17, e70010. [Google Scholar] [CrossRef]
  43. Liu, C.; Shang, Y.; Gul, S.; Hu, M.; Wang, Y. Energetic Adaptations of Bivalves Under Environmental Stress: A Comprehensive Review on Bioenergetics and Aquaculture Sustainability. Rev. Aquac. 2025, 17, e70052. [Google Scholar] [CrossRef]
  44. Falconer, L.; Cutajar, K.; Krupandan, A.; Capuzzo, E.; Corner, R.A.; Ellis, T.; Jeffery, K.; Mikkelsen, E.; Moore, H.; O’beirn, F.X. Planning and licensing for marine aquaculture. Rev. Aquac. 2023, 15, 1374–1404. [Google Scholar] [CrossRef] [PubMed]
  45. Pacheco, F.S.; Heilpern, S.A.; DiLeo, C.; Almeida, R.M.; Sethi, S.A.; Miranda, M.; Ray, N.; Barros, N.O.; Cavali, J.; Costa, C.; et al. Towards sustainable aquaculture in the Amazon. Nat. Sustain. 2025, 8, 234–244. [Google Scholar] [CrossRef]
  46. Leng, J.; Ding, W. Towards sustainable aquaculture: Game-theoretic insights into AI adoption, emission reduction, and government incentives. Aquac. Int. 2025, 33, 461. [Google Scholar] [CrossRef]
  47. Yang, Y.; Li, K.; Liang, S.; Lin, G.; Liu, C.; Li, J.; Xie, L.; Li, Y.; Wang, X. A simulation-optimization approach based on the compound eutrophication index to identify multi-nutrient allocated load. Sci. Total Environ. 2024, 906, 167626. [Google Scholar] [CrossRef]
  48. Nikolaidis, N.P.; Poikane, S.; Bouraoui, F.; Salas-Herrero, F.; Free, G.; Varkitzi, I.; van de Bund, W.; Kelly, M.G. Comparison of eutrophication assessment for the Nitrates and water Framework Directives: Impacts and opportunities for streamlined approaches. Ecol. Indic. 2025, 177, 113375. [Google Scholar] [CrossRef]
  49. Gardazi, N.M.; Daud, A.; Malik, M.K.; Bukhari, A.; Alsahfi, T.; Alshemaimri, B. BERT applications in natural language processing: A review. Artif. Intell. Rev. 2025, 58, 166. [Google Scholar] [CrossRef]
  50. Chen, S.; Zheng, P.; Zheng, L.; Yao, Q.; Meng, Z.; Lin, L.; Chen, X.; Liu, R. BERT-DomainAFP: Antifreeze protein recognition and classification model based on BERT and structural domain annotation. iScience 2025, 28, 112077. [Google Scholar] [CrossRef]
  51. Ferreira, J.G.; Andersen, J.H.; Borja, A.; Bricker, S.B.; Camp, J.; Da Silva, M.C.; Garcés, E.; Heiskanen, A.-S.; Humborg, C.; Ignatiades, L. Overview of eutrophication indicators to assess environmental status within the European Marine Strategy Framework Directive. Estuar. Coast. Shelf Sci. 2011, 93, 117–131. [Google Scholar] [CrossRef]
  52. Karimiziarani, M.; Foroumandi, E.; Moradkhani, H. Harnessing Twitter (X) with AI-enhanced natural language processing for disaster management: Insights from California wildfire. Environ. Model. Softw. 2025, 192, 106545. [Google Scholar] [CrossRef]
  53. Li, S.; Levis, M.; DiMambro, M.; Wu, W.; Levy, J.; Shiner, B.; Gui, J. Preprocessing of natural language process variables using a data-driven method improves the association with suicide risk in a large veterans affairs population. Comput. Biol. Med. 2025, 189, 109939. [Google Scholar] [CrossRef]
  54. Alghamdi, J.; Lin, Y.; Luo, S. ABERT: Adapting BERT model for efficient detection of human and AI-generated fake news. Int. J. Inf. Manag. Data Insights 2025, 5, 100353. [Google Scholar] [CrossRef]
  55. Hanafy, N.O. Artificial intelligence’s effects on design process creativity: “A study on used A.I. Text-to-Image in architecture”. J. Build. Eng. 2023, 80, 107999. [Google Scholar] [CrossRef]
  56. Li, L.; Bensi, M.; Baecher, G. Exploring the potential of social media crowdsourcing for post-earthquake damage assessment. Int. J. Disaster Risk Reduct. 2023, 98, 104062. [Google Scholar] [CrossRef]
  57. Klein, P.; Malburg, L.; Bergmann, R. Combining informed data-driven anomaly detection with knowledge graphs for root cause analysis in predictive maintenance. Eng. Appl. Artif. Intell. 2025, 145, 110152. [Google Scholar] [CrossRef]
  58. Hou, Z.-W.; Jing, W.; Qin, C.-Z.; Yang, J.; Xia, Q.; Yin, X. Prospects on mangrove knowledge services in the smart era: From plant atlas to knowledge graphs. Sci. China Earth Sci. 2025, 68, 111–127. [Google Scholar] [CrossRef]
  59. Islam, M.S.; Ferdusi, M.; Aurpa, T.T. Words of War: A hybrid BERT-CNN approach for topic-wise sentiment analysis on The Russia-Ukraine War. Expert Syst. Appl. 2025, 284, 127759. [Google Scholar] [CrossRef]
  60. Zhang, Z.; Yang, X.; Sun, L.; Sun, Y.; Kang, J. Research on constructing and reasoning the collision knowledge graph of autonomous navigation ship based on enhanced BERT model. Expert Syst. Appl. 2025, 278, 127429. [Google Scholar] [CrossRef]
  61. Gui, J.; Zhou, Y.; Yu, K.; Wu, X. PSC-BERT: A spam identification and classification algorithm via prompt learning and spell check. Knowl. -Based Syst. 2024, 301, 112266. [Google Scholar] [CrossRef]
  62. Zhang, J.; Chen, S. Topology reorganized graph contrastive learning with mitigating semantic drift. Pattern Recognit. 2025, 159, 111160. [Google Scholar] [CrossRef]
  63. Jiang, L.; Goetz, S.M. Natural language processing in the patent domain: A survey. Artif. Intell. Rev. 2025, 58, 214. [Google Scholar] [CrossRef]
  64. Zhang, X.; Gao, R.; Xiao, Z.; Wang, K.; Liu, T.; Liang, M.; Zhang, J. Natural Language Processing and Text Mining in Transportation: Current Status, Challenges, and Future Roadmap. Expert Syst. Appl. 2025, 296, 129050. [Google Scholar] [CrossRef]
  65. Zhang, M.; Shen, Q.; Zhao, Z.; Wang, S.; Huang, G.Q. Optimizing ESG reporting: Innovating with E-BERT models in nature language processing. Expert Syst. Appl. 2025, 265, 125931. [Google Scholar] [CrossRef]
  66. Meng, Z.; Zhao, M.; Yu, J.; Xu, T.; Li, X.; Jin, D.; Yu, R.; Yu, M. Read, Eliminate, and Focus: A reading comprehension paradigm for distant supervised relation extraction. Neural Netw. 2025, 185, 107138. [Google Scholar] [CrossRef]
  67. Chao, K.-C.; Shih, Y.; Lee, C.-H. A Novel Label Smoothing Technique for Machine Degradation. IFAC-Pap. 2023, 56, 4430–4435. [Google Scholar] [CrossRef]
  68. Wang, Y.; Wu, X.; Liu, X.; Chu, F.; Liu, H.; Han, Z. Label smoothing regularization-based no hyperparameter domain generalization. Knowl. -Based Syst. 2025, 309, 112877. [Google Scholar] [CrossRef]
  69. Yang, C.; Li, J.; Zhang, F.; Liu, N.; Zhang, Y. The optimal Redfield N: P ratio caused by fairy ring fungi stimulates plant productivity in the temperate steppe of China. Fungal Ecol. 2018, 34, 91–98. [Google Scholar] [CrossRef]
Figure 1. The overall research workflow of this study.
Figure 2. Architecture of the intelligent analysis framework.
Figure 3. Fine-tuning Architectures for Named Entity Recognition and Relation Extraction.
Figure 4. Confusion Matrix for NER Model.
Figure 5. F1-Score Comparison of NER Models Across Entity Categories.
Figure 6. Precision-Recall (P-R) Curves for NER Models.
Figure 7. Confusion Matrix for RE Model.
Figure 8. Visualization of the Attention Mechanism.
Figure 9. The Ontology Schema of the AEKG.
Figure 10. Degree Distribution of Nodes in the Knowledge Graph.
Figure 11. Global View of the Knowledge Graph Showing Geospatial Clustering.
Figure 12. Association Subgraph Centered on Dissolved Inorganic Nitrogen.
Figure 13. Association Subgraph Centered on Karenia mikimotoi.
Figure 14. Association Subgraph of the 2012 Xiamen Red Tide Event.
Figure 15. Comprehensive Association Subgraph for the 2021 Pearl River Estuary Hypoxia Event.
Figure 16. Heatmap of Eutrophication Events in Coastal Provinces of China.
Figure 17. Latitudinal Distribution of Different Aquaculture Modes.
Figure 18. Association Heatmap of Key Drivers in Eutrophication.
Figure 19. Normalized Association Strength of Top 5 Co-occurrence Pairs between Aquaculture Sources and Causal Organisms.
Table 1. Detailed Statistics of the Domain-Specific Corpus for Aquaculture-BERT Pre-training.
Data Source | Description | Document Count | Proportion (%) | Estimated Tokens (Million)
Academic Literature | Full-text articles and abstracts from Web of Science, Scopus, PubMed, and CNKI, focusing on aquaculture, eutrophication, and HABs. | ~8000 | ~10.7% | 22.5 million tokens (approx. 10.7%)
Official Reports | Annual environmental bulletins, technical reports, and policy documents from organizations like FAO, EPA, and China’s MEE. | ~1500 | ~2.0% | 4.2 million tokens (approx. 2.0%)
Industry & News | Industry news, technical specifications, and standards from authoritative aquaculture websites and organizations. | ~65,000 | ~87.3% | 183.3 million tokens (approx. 87.3%)
Total | The final preprocessed corpus contains approximately 210 million Chinese and English tokens. | ~74,500 | 100% | 210 million tokens
Table 2. Ontology Schema of the AEKG.
Part A: Predefined Entity Types.
Entity Type | Definition | Example from Text
Event | A specific environmental incident or phenomenon related to eutrophication. | “Red Tide Event”, “Diatom Bloom”, “Summer Hypoxia (PRE, 2021)”
Location | A geographical area where an event occurs or an aquaculture activity is located. | “Pearl River Estuary”, “western sea of Xiamen”, “East China Sea”
Time | The specific time or time frame when an event occurs. | “Summer 2012”, “from 2010 to 2025”
Aqua-Source | The type of aquaculture practice identified as a potential pollution source. | “Cage Culture”, “Pond Farming”, “High-density Shrimp Farming”
Nutrient | Chemical substances that contribute to eutrophication. | “Dissolved Inorganic Nitrogen”, “reactive phosphate”
Organism | Biological species involved in or causing eutrophication events. | “Karenia mikimotoi”, “Prorocentrum donghaiense”, “Dinoflagellate”
Parameter | Environmental or physical-chemical parameters describing conditions. | “Chemical Oxygen Demand (COD)”, “High Water Temperature”, “Increased Salinity”
Part B: Predefined Relation Types.
Relation Type | Definition | Example Triplet (Head Entity, Relation, Tail Entity)
Caused_By | Indicates that one entity is a direct or primary factor leading to another, as reported in the source text. | (High Nutrient Load, Caused_By, Diatom Bloom)
Occurs_In | Links an event or organism to a specific location or time. | (Red Tide Event, Occurs_In, East China Sea)
Sourced_From | Indicates that a nutrient or pollutant originates from a specific aquaculture source. | (High Nutrient Load, Sourced_From, Estuarine Cage Culture)
Affects | Represents a secondary influence or resulting impact of one entity on another. | (Red Tide Event, Affects, Massive Fish Kill)
Has_Property | Assigns a specific characteristic or parameter to an event or location. | (Red Tide Event, Has_Property, High Water Temperature)
Transported_By | Indicates the pathway or medium through which a substance or entity is moved. | (Nutrient, Transported_By, Pearl River Runoff)
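The schema in Table 2 maps naturally onto a small typed data structure. The sketch below is one possible encoding, assuming Python enums and dataclasses; the class names `EntityType`, `RelationType`, `Entity`, and `Triple` are illustrative and do not describe how the AEKG is actually stored.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    EVENT = "Event"
    LOCATION = "Location"
    TIME = "Time"
    AQUA_SOURCE = "Aqua-Source"
    NUTRIENT = "Nutrient"
    ORGANISM = "Organism"
    PARAMETER = "Parameter"


class RelationType(Enum):
    CAUSED_BY = "Caused_By"
    OCCURS_IN = "Occurs_In"
    SOURCED_FROM = "Sourced_From"
    AFFECTS = "Affects"
    HAS_PROPERTY = "Has_Property"
    TRANSPORTED_BY = "Transported_By"


@dataclass(frozen=True)
class Entity:
    name: str
    type: EntityType


@dataclass(frozen=True)
class Triple:
    head: Entity
    relation: RelationType
    tail: Entity

    def as_tuple(self):
        """Render the triple in the (Head Entity, Relation, Tail Entity) form used in Table 2."""
        return (self.head.name, self.relation.value, self.tail.name)


# One of the example triplets from Table 2, expressed with typed entities.
example = Triple(
    head=Entity("Red Tide Event", EntityType.EVENT),
    relation=RelationType.OCCURS_IN,
    tail=Entity("East China Sea", EntityType.LOCATION),
)
print(example.as_tuple())
```

Keeping the entity and relation types as closed enumerations means that an extracted label outside the seven entity types or six relation types fails fast (for example, `RelationType("Unknown")` raises `ValueError`) instead of silently entering the graph.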
Table 3. Performance comparison of NER models.
Entity Category | Metric | BiLSTM-CRF | Bert-Base-Multilingual-Cased | Aquaculture-BERT (This Study)
Event | F1 | 85.1% | 88.9% | 93.2%
Location | F1 | 91.3% | 94.5% | 96.1%
Time | F1 | 94.2% | 96.1% | 97.5%
Aqua-Source | F1 | 79.5% | 83.4% | 90.8%
Nutrient | F1 | 83.1% | 86.9% | 91.5%
Organism | F1 | 81.7% | 85.2% | 90.3%
Parameter | F1 | 71.2% | 78.2% | 85.4%
Overall | P | 84.2% | 88.1% | 92.5%
Overall | R | 82.5% | 87.2% | 91.7%
Overall | F1 | 83.3% | 87.6% | 92.1%
Table 4. Performance comparison of RE models.
Relation Category | Metric | Bert-Base-Multilingual-Cased | Aquaculture-BERT (This Study)
Caused_By | F1 | 80.1% | 85.2%
Occurs_In | F1 | 88.5% | 92.3%
Sourced_From | F1 | 79.2% | 84.1%
Affects | F1 | 78.3% | 83.7%
Has_Property | F1 | 82.9% | 87.2%
Overall | P | 82.3% | 87.1%
Overall | R | 81.3% | 85.9%
Overall | F1 | 81.8% | 86.5%
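The overall F1 values in Tables 3 and 4 can be checked against the reported precision (P) and recall (R) using the usual definition of F1 as their harmonic mean. The snippet below performs that check; it assumes the overall scores were computed this way, which the tables are consistent with but do not state explicitly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (task, model) -> (P, R, reported F1), all in percent, copied from Tables 3 and 4.
REPORTED = {
    ("NER", "BiLSTM-CRF"): (84.2, 82.5, 83.3),
    ("NER", "Bert-Base-Multilingual-Cased"): (88.1, 87.2, 87.6),
    ("NER", "Aquaculture-BERT"): (92.5, 91.7, 92.1),
    ("RE", "Bert-Base-Multilingual-Cased"): (82.3, 81.3, 81.8),
    ("RE", "Aquaculture-BERT"): (87.1, 85.9, 86.5),
}

for (task, model), (p, r, reported_f1) in REPORTED.items():
    computed = f1_score(p, r)
    # Agreement to one decimal place means a difference below 0.05 percentage points.
    status = "ok" if abs(computed - reported_f1) < 0.05 else "mismatch"
    print(f"{task:3s} {model:30s} computed={computed:.2f} reported={reported_f1:.1f} {status}")
```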
Table 5. Statistics of the AEKG.
Item | Count
Total Number of Entities | 3,215,489
Total Number of Relations | 8,532,107
Number of Entity Types | 7
Number of Relation Types | 6

Share and Cite

MDPI and ACS Style

Hao, D.; Xu, B.; Leng, J.; Guo, M.; Zhang, M. Elucidating the Drivers of Aquaculture Eutrophication: A Knowledge Graph Framework Powered by Domain-Specific BERT. Sustainability 2025, 17, 8907. https://doi.org/10.3390/su17198907
