1. Introduction
China is currently constructing a nationwide eco-environment monitoring network, which is simultaneously undergoing technological modernization and intelligent upgrades. As of 2024, this network covered 33,000 monitoring sites, enabling the observation of environmental conditions and ecosystem changes at urban, regional, and national scales [
1]. However, this expansion has created challenges related to rapidly increasing data volumes and limited data application efficiency. Meanwhile, traditional analysis models that rely heavily on manual expertise face significant bottlenecks when dealing with large-scale data processing [
2,
3]. Moreover, the multi-source and heterogeneous nature of eco-environmental monitoring data leads to data fragmentation and difficulties in forming standardized, hierarchically classified data warehouses. This further constrains the agility of regulatory decision-making and cross-domain logical reasoning in eco-environmental monitoring. These practical contradictions highlight the urgent need for intelligent transformation in environmental monitoring and for a more integrated system covering perception, analysis, and decision-making.
The advent of multimodal LLMs (e.g., GPT-4o, Llama) [
4,
5,
6,
7] has profoundly transformed the operational paradigms within vertical industries. The integration of generative LLMs in eco-environmental monitoring and science is advancing rapidly, demonstrating emerging potential in domains such as climate change analysis, environmental quality assessment, and the design of novel treatment technologies [
8,
9,
10]. These models can process large and complex datasets, including scientific literature and policy documents, and support knowledge extraction and synthesis. The application of AI models in environmental domains spans environmental quality monitoring, modeling and prediction, treatment technology design, environmental policy assessment, and risk evaluation [
11,
12,
13,
14,
15,
16]. The recent proliferation of domain-specific LLMs has underscored the efficacy of fine-tuning strategies across diverse environmental disciplines. For instance, ClimateGPT is continually pre-trained on a massive 4.2-billion-token climate corpus and has shown competitive performance on specialized benchmarks like ClimateBench [
17]. Similarly, WaterGPT augments a 7B-parameter backbone with 1.1 billion hydrology-specific tokens and multimodal tools, achieving 83% accuracy on the EvalWater benchmark [
18]. OceanGPT, leveraging a corpus of 67,000 oceanography documents, provides expert-level assistance in marine science tasks [
19]. These specialized models effectively address the challenges of complex terminology and heterogeneous data within their respective domains. Beyond single-discipline applications, integrated frameworks such as EnvGPT mark a significant step towards interdisciplinary environmental intelligence. By fine-tuning an 8B-parameter model on a balanced, cross-domain instruction dataset (ChatEnv) covering climate, ecosystems, water, soil, and energy, EnvGPT achieves superior performance, even rivalling larger general-purpose models on domain-specific benchmarks like EnvBench and EnviroExam [
12]. This underscores the critical importance of high-quality, domain-specific instruction tuning for enhancing the factual accuracy and practical utility of LLMs in scientific applications.
Concurrently, to overcome the inherent limitations of LLMs, such as hallucination and outdated knowledge [
20,
21], retrieval-augmented generation (RAG) frameworks have been widely adopted [
22]. Systems such as ChatClimate ground conversational AI in authoritative, up-to-date scientific sources, including the IPCC AR6 reports, which improves the veracity, reliability, and timeliness of climate information provided to decision-makers and the public [
23]. By leveraging big data analytics, cloud computing, and these advanced AI paradigms, environmental monitoring is evolving from basic data processing towards in-depth mining and comprehensive analysis, which supports more accurate predictive modeling and better-informed environmental governance.
However, current eco-environmental monitoring workflows still face several unresolved operational limitations. Multi-source monitoring data are often distributed across heterogeneous platforms, while standards, historical reports, and expert experience remain fragmented across different organizational levels. As a result, manual interpretation is still required for many cross-domain tasks, including anomaly tracing, causal attribution, and report generation. General-purpose LLMs can assist with text understanding, but they lack direct access to real-time monitoring data, authoritative standards, and domain-specific rule constraints. These limitations motivate the development of a domain-oriented framework that integrates data governance, knowledge retrieval, and multi-agent reasoning.
During the formative stages of large language model development, our team worked to integrate this technology into eco-environmental monitoring practice. Initially, in collaboration with several top domestic research institutions and leading enterprises, we surveyed leading domestic models and established a core pathway of AI mid-platform integration and domain knowledge embedding, completing the first round of private deployment of base models. In 2024, efforts focused on building a data resource center whose distributed architecture brings real-time query latency down to the millisecond level. At the same time, we established a three-level knowledge base system spanning the center, department, and individual levels, which transformed over 100,000 technical standards and monitoring reports into structured knowledge units.
With breakthroughs in deep reasoning capabilities of models such as DeepSeek-R1, we introduce EnvSentry, a novel multi-engine collaborative large language model designed for environmental monitoring. By coordinating a reasoning engine (DeepSeek-R1), an instruction engine (DeepSeek-V3), and a multimodal engine (Alibaba Qwen-VL), EnvSentry successfully implements an operational closed loop for intent recognition, data retrieval, and quality control. The model has demonstrated transformative operational utility, evidenced by enhanced accuracy in natural language queries and significantly improved efficiency in identifying anomalies within key river basins. Furthermore, by preliminarily deploying a matrix of 50 intelligent agents across four domains including rapid inquiry, intelligent analysis, deep insight, and automated review, the system supports the transition from manual analysis to near real-time response. In practical monitoring workflows, these agents are designed to undertake operational roles such as natural language query interpretation, multi-source data retrieval, abnormal section screening, preliminary pollution attribution, forecast explanation, quality-control checking, and structured report generation. By encapsulating domain rules and expert experience into task-specific agents, the system also reduces the knowledge barrier for non-specialist users and improves the consistency of routine monitoring analysis.
However, the evolution of AI technology still faces bottlenecks such as insufficient confidence in deep reasoning, high dependency on datasets, constraints on computational resources, and security concerns [
24,
25]. Recent research further emphasizes the necessity of building authoritative knowledge bases and RAG frameworks to mitigate model hallucination and enhance the verifiability and reliability of outputs in specialized domains such as environmental monitoring [
26,
27]. This paper investigates the technological integration mechanisms and the reconstruction of application paradigms for large models in eco-environmental monitoring. Using the EnvSentry system as an empirical case study, we provide an in-depth analysis of key breakthroughs in multi-source data governance, dynamic knowledge embedding, and multi-engine collaborative scheduling, while further exploring optimization pathways and future trends for intelligent transformation.
2. Architecture and Methodology
2.1. Schematic Diagram of EnvSentry
The data infrastructure of EnvSentry integrates three distinct modalities: real-time streaming data, historical archives, and textual knowledge. Although these multi-source inputs inherently introduce pervasive processing obstacles, such as data heterogeneity, quality degradation, and semantic gaps, the framework is specifically engineered to resolve these bottlenecks through a combination of data governance, knowledge embedding, and multi-model collaboration. As delineated in
Figure 1, this co-design culminates in a cohesive closed-loop system characterized by unified data–knowledge–task integration.
To address the challenges posed by massive datasets and complex decision applications within the national intelligent eco-environmental monitoring network, we propose a framework that integrates AI capabilities into data governance and decision-making processes. The architecture of the national intelligent environmental monitoring network illustrates the systematic use of artificial intelligence (AI) in the data aggregation, analysis, and decision-making layers, signifying a shift toward an intelligent eco-environmental governance paradigm (
Figure 1). Currently, the framework is organized into two interrelated domains: AI-driven data governance and AI-driven decision-making.
The AI-driven data governance layer harmonizes multi-source data streams across diverse ambient domains encompassing remote sensing, meteorology, noise, ecology, and soil properties alongside dedicated modules for surface water, groundwater, and network maintenance. Through AI-enhanced correlation analysis and anomaly detection, this layer significantly refines the spatiotemporal granularity and reliability of environmental diagnostics. Complementing this foundation, the AI-driven decision-making layer integrates predictive modeling for surface water alongside emergency monitoring protocols and AI-based quality assurance. Underpinned by robust pollution source management and regulatory compliance, the decision-making layer operates within an integrated observation–analysis–prediction workflow to facilitate evidence-based policy formulation and proactive environmental intervention.
A critical future direction involves migrating AI capabilities to the network edge through AI-empowered front-end sensing devices. Embedding lightweight models directly within monitoring instruments enables in situ, real-time data preprocessing, adaptive sampling, and quality validation. This approach minimizes transmission latency and bandwidth constraints while fostering a responsive, decentralized, and resource-efficient monitoring ecosystem and supports the development of more adaptive environmental intelligence systems.
2.2. Development Progress of EnvSentry
The deployment of EnvSentry in environmental monitoring follows an evolutionary trajectory, advancing from rudimentary task-specific applications to sophisticated autonomous systems. This progression is conceptualized via a three-mode iterative framework (
Figure 2), which delineates the concomitant maturation of system architecture and intelligence.
The initial stage, direct integration, focuses on establishing foundational capabilities. This mode encompasses several parallel approaches: the straightforward chaining of scenario design, model preparation, prompt engineering, and business integration; the incorporation of agent construction to enhance model interactivity; and the early adoption of RAG alongside tool components within a process orchestration layer to ground model responses in external knowledge and functionality. This mode represents the baseline for leveraging LLMs, primarily treating them as powerful, configurable predictors within a predefined pipeline.
As system requirements become more complex, the framework transitions into the process orchestration mode. This phase prioritizes dynamic workflow management and places a pronounced emphasis on data-centric engineering. Methodologies within this mode integrate data engineering as a critical precursor to model preparation, ensuring the LLM utilizes high-quality, contextually relevant environmental data. The cornerstone of this architecture is the orchestration of intelligent agents, which are modular units capable of executing discrete tasks, into cohesive workflows. This design effectively addresses complex challenges, such as multi-step data analysis or automated reporting, and supports the transition from static integration to adaptive and data-driven analysis.
2.3. Knowledge Base Construction and Vector Indexing
General-purpose large language models often lack the specialized domain knowledge required for advanced eco-environmental science applications. To surmount this limitation, EnvSentry implements a hierarchical three-tiered knowledge hub comprising a national-level repository that synthesizes over 100,000 technical standards, scientific literature, and historical case studies into a dynamic knowledge graph via semantic slicing, a departmental-level core that aggregates tens of thousands of structured monitoring reports and research findings into specialized knowledge modules, and an individual-level interface linked to expert experience databases for personalized knowledge retrieval.
Before vector indexing, documents were first cleaned, deduplicated, and segmented according to document hierarchy, section headings, semantic boundaries, and domain-specific entities. Each semantic unit retained metadata such as document source, document type, monitoring domain, publication time, and applicable standard. Vector indexing was then used to support similarity-based retrieval, while metadata filtering constrained the retrieval scope and improved traceability. This procedure ensures that retrieved knowledge units remain semantically coherent, source-traceable, and suitable for downstream reasoning and verification.
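The retrieval step described above, in which metadata filtering narrows the scope before similarity ranking, can be illustrated with a minimal sketch. The `KnowledgeUnit` class, the metadata keys, the three-dimensional toy vectors, and the two placeholder sources are assumptions made for illustration; in practice the embeddings would come from an embedding model and the index from a vector database.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KnowledgeUnit:
    text: str
    embedding: list                      # toy vector; produced upstream in practice
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, units, filters, top_k=3):
    """Keep units whose metadata match every filter, then rank by similarity."""
    candidates = [
        u for u in units
        if all(u.metadata.get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(candidates, key=lambda u: cosine(query_vec, u.embedding),
                    reverse=True)
    return ranked[:top_k]


# Illustrative units; only HJ 694-2014 is a source named in the text.
units = [
    KnowledgeUnit("Antimony determination by atomic fluorescence",
                  [0.9, 0.1, 0.0],
                  {"domain": "surface water", "doc_type": "standard",
                   "source": "HJ 694-2014"}),
    KnowledgeUnit("Particulate sampling procedure",
                  [0.1, 0.9, 0.0],
                  {"domain": "air", "doc_type": "standard",
                   "source": "placeholder-A"}),
    KnowledgeUnit("Annual water quality summary",
                  [0.8, 0.2, 0.1],
                  {"domain": "surface water", "doc_type": "report",
                   "source": "placeholder-B"}),
]

hits = retrieve([1.0, 0.0, 0.0], units,
                {"domain": "surface water", "doc_type": "standard"})
for hit in hits:
    print(hit.metadata["source"], "|", hit.text)
```

Filtering before ranking keeps every returned unit traceable to a document type and domain that the caller explicitly asked for, which is the traceability property the paragraph above emphasizes.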
Building upon the comprehensive technical pipeline of EnvSentry, the knowledge base is systematically structured across eight primary eco-environmental domains, encompassing both document-level resources and granular knowledge units (
Figure 3). The literature category constitutes the most extensive segment of the knowledge base, with 780 documents and 125,000 knowledge chunks, providing contextual support for model understanding. This is followed by the technical guidelines category, which includes 610 documents and 88,000 knowledge chunks, offering important procedural references. The standards and specifications and policy and regulation categories also represent substantial components, containing 520 documents (72,000 chunks) and 480 documents (65,000 chunks) respectively, ensuring operational standardization and compliance.
Furthermore, the quality management system and case database categories contribute significant empirical knowledge, with the former containing 450 documents and 60,000 knowledge chunks, and the latter, despite having only 420 documents, comprising a highly granular 58,000 knowledge chunks essential for detailed analysis. The environmental reports category, with 312 documents and 42,000 knowledge chunks, serves as a foundational layer for general domain awareness. This distribution, with literature and technical guidelines as the main components, provides EnvSentry with a broad knowledge base for addressing environmental monitoring tasks in an evidence-based manner.
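As a quick check on the distribution described above, the following sketch tabulates the seven categories itemized in the text and derives the average number of knowledge chunks per document. The document and chunk counts are copied from the text (the chunk counts appear to be rounded); the derived ratios are simple arithmetic rather than reported results.

```python
# (documents, knowledge chunks) per category, as itemized in the text
corpus = {
    "literature": (780, 125_000),
    "technical guidelines": (610, 88_000),
    "standards and specifications": (520, 72_000),
    "policy and regulation": (480, 65_000),
    "quality management system": (450, 60_000),
    "case database": (420, 58_000),
    "environmental reports": (312, 42_000),
}

total_docs = sum(docs for docs, _ in corpus.values())
total_chunks = sum(chunks for _, chunks in corpus.values())

for name, (docs, chunks) in corpus.items():
    print(f"{name:30s} {chunks / docs:6.1f} chunks/doc "
          f"{chunks / total_chunks:6.1%} of all chunks")

print("total documents:", total_docs)
print("total chunks:", total_chunks)
```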
Based on optimized vector indexing technology, the accuracy of professional term recognition has been significantly enhanced. When a user queries a monitoring solution for the heavy metal antimony, the system automatically matches the technical clauses of the Water Quality, Determination of Mercury, Arsenic, Selenium, Bismuth, and Antimony by Atomic Fluorescence Spectrometry (HJ 694-2014) [
28] and generates a standardized monitoring process framework. The knowledge embedding mechanism of EnvSentry improves on the traditional RAG paradigm through the hierarchical integration of eco-environmental monitoring domain knowledge and dynamic reasoning capabilities. Its core innovation lies in constructing a tripartite technical architecture comprising a structured knowledge hub, active knowledge injection, and multi-engine collaborative verification, which effectively addresses the shortcomings of traditional RAG in the environmental monitoring field, such as knowledge fragmentation, passive retrieval modes, and the lack of domain-specific logical constraints.
2.4. Data Pipeline of EnvSentry
The EnvSentry data pipeline systematically transforms multi-source data and expert knowledge into intelligent decision-making outputs (
Figure 4). The pipeline begins with the integration of heterogeneous inputs, including standards, regulations, expert experience, rules, algorithms, and tabular data, which undergo granularity-based nested document chunking. These knowledge chunks are subsequently standardized and processed through four core analytical modules, i.e., visualization, algorithm, statistics, and analysis.
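The text does not specify how granularity-based nested chunking is implemented. One plausible reading is a two-level split in which section-level chunks hold paragraph-level child chunks, sketched below. The heading heuristic (a single line beginning with a section number) and the sample text are assumptions made for illustration.

```python
import re

# Assumption: a heading is a one-line block that starts with a section number.
HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def nested_chunks(doc_id, text):
    """Split text into section-level chunks with paragraph-level children."""
    sections = []
    current = {"doc_id": doc_id, "heading": "(preamble)", "paragraphs": []}
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if "\n" not in block and HEADING.match(block):
            # Close the running section unless it is an empty preamble.
            if current["paragraphs"] or current["heading"] != "(preamble)":
                sections.append(current)
            current = {"doc_id": doc_id, "heading": block, "paragraphs": []}
        else:
            current["paragraphs"].append(block)
    sections.append(current)
    return sections


# Made-up sample text, not quoted from any standard.
sample = """1 Scope

This document describes a determination method.

2 Reagents

First reagent description.

Second reagent description."""

sections = nested_chunks("sample-doc", sample)
for section in sections:
    print(section["heading"], "->", len(section["paragraphs"]), "paragraph(s)")
```

Keeping the parent heading alongside each paragraph lets a retriever return a fine-grained paragraph while still being able to expand to its enclosing section when more context is needed.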
The processed information is then converted into vector representations using the BGE-M3 embedding model, with retrieval precision enhanced by the BGE-Reranker-large model. This enables dynamic integration of supplementary knowledge through intelligent scheduling mechanisms. The pipeline utilizes a comprehensive suite of tools and agents to perform specialized functions, such as code interpretation and real-time data processing. For example, the python_repl gateway operationalizes remote text-to-code execution by routing input programming strings into sandboxed runtime sessions that optimize session state persistence via specialized virtual workspace directories. This programmatic asset unifies server-side code interpretation with automated visualization management, returning aggregated diagnostics through a single, bounded error-logging interface. The air_data_query tool coordinates multi-dimensional environmental data extraction by unifying localized spatial definitions with scalable temporal resolutions into a single programmatic endpoint. This functional framework evaluates incoming parameter constraints to automate backend query generation, supporting localized ranking constraints through limit boundaries and ensuring standardized output formats verified by domain expert knowledge sets.
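To make the description of the air_data_query tool more concrete, the sketch below shows one way a tool of this kind could validate its parameters and generate a backend query. The function name `build_air_query`, the table and column names, the allowed pollutant and resolution sets, and the qmark-style placeholders are all hypothetical; the actual tool interface is not given in the text.

```python
# Hypothetical whitelists; identifiers are interpolated only after validation.
ALLOWED_RESOLUTIONS = {"hour", "day", "month"}
ALLOWED_POLLUTANTS = {"pm25", "pm10", "o3", "no2", "so2", "co"}


def build_air_query(region, pollutant, start, end, resolution="day", limit=None):
    """Validate parameters and return (sql, params) for a parameterized query."""
    if pollutant not in ALLOWED_POLLUTANTS:
        raise ValueError(f"unsupported pollutant: {pollutant!r}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if limit is not None and (not isinstance(limit, int) or limit <= 0):
        raise ValueError("limit must be a positive integer")

    sql = (
        f"SELECT station, {resolution}_bucket, AVG({pollutant}) AS mean_value "
        "FROM air_observations "  # hypothetical table name
        "WHERE region = ? AND observed_at >= ? AND observed_at < ? "
        f"GROUP BY station, {resolution}_bucket "
        "ORDER BY mean_value DESC"
    )
    params = [region, start, end]
    if limit is not None:
        # The limit supports ranking-style requests such as "top 10 stations".
        sql += " LIMIT ?"
        params.append(limit)
    return sql, params


sql, params = build_air_query("example-region", "pm25",
                              "2024-01-01", "2024-02-01",
                              resolution="day", limit=10)
print(sql)
print(params)
```

User-supplied values travel only as bound parameters, while column and bucket names are drawn from fixed whitelists, so a query generated from model output cannot inject arbitrary SQL.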
The final stage employs a large language model to synthesize processed information and generate structured analytical reports. In the prediction analysis of Yangtze River water quality compliance, the system effectively identifies at-risk sections and evaluates key parameters such as dissolved oxygen levels against regulatory standards. This integrated workflow demonstrates a sophisticated fusion of knowledge retrieval, computational analysis, and natural language generation, establishing a comprehensive framework for environmental monitoring and decision support.
Through its layered architecture, the pipeline effectively bridges the gap between raw environmental data (i.e., primary ecological environment monitoring data) (
Table A2) and actionable analytics by deploying a tripartite framework that unifies a structured knowledge hub with active knowledge injection and multi-engine collaborative verification. This approach addresses critical limitations of conventional RAG systems in environmental monitoring, including knowledge fragmentation, passive retrieval modes, and insufficient domain logic constraints, while maintaining strong interpretability and regulatory compliance throughout the analytical process.
2.5. Multi-Engine Collaborative Mechanism
Given the current absence of a single large language model that is both universally capable and sufficiently specialized in the eco-environmental monitoring domain, we did not rely on a single, general-purpose model. Instead, EnvSentry adopts a multi-engine collaborative mechanism. To balance the demands of deep reasoning and real-time response, EnvSentry employs a task routing hub that dynamically allocates tasks to three specialized engines. This mechanism also encapsulates environmental mechanism models as callable functions, combining physical laws with AI inference. The multi-engine collaboration mechanism establishes an integrated architectural pipeline characterized by tiered processing, iterative feedback loops, and dynamic depth scaling. The operational architecture governing the multi-engine task routing is formulated below.
- (1) The primary processing layer leverages the DeepSeek-V3 engine to execute all initial user requests spanning conventional query resolution, intent recognition, and foundational logical reasoning.
- (2) The secondary layer deploys the DeepSeek-R1 engine for deeper diagnostic reasoning whenever a user explicitly requests deep reasoning, a user registers dissatisfaction with agent outcomes, or the confidence of the DeepSeek-V3 output falls below the pre-established threshold. Upon trigger activation, the original query paired with the initial V3 baseline response is channeled into the R1 model as a unified input. Leveraging its reinforcement learning capacity and deep reasoning pathways, the R1 engine subsequently generates a structurally augmented or rectified resolution accompanied by an updated confidence score.
- (3) The Qwen2.5-VL engine governs the multimodal tier to accommodate rich unstructured inputs such as images and videos or to automatically satisfy high-level tasks demanding deep cross-modal reasoning.
For a query such as "attribute the cause of excessive total phosphorus in Huanggai Lake", the router assigns the reasoning engine to deduce pollution pathways, the instruction engine to retrieve hydrological data from tributaries, and the multimodal engine to analyze land use patterns from satellite images. The final output quantifies the contribution of agricultural non-point sources by integrating these results with a mass balance equation.
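The mass balance equation itself is not given in the text. A minimal sketch of one common form is shown below, in which the inflow load is computed from concentration and discharge and the load not explained by known sources is treated as a residual. All numerical values and source names are hypothetical, and attributing the whole residual to agricultural non-point sources is a simplifying assumption that ignores terms such as internal loading and atmospheric deposition.

```python
def load_kg_per_day(conc_mg_per_l, flow_m3_per_s):
    """Pollutant load: mg/L equals g/m^3, times m^3/s, times 86400 s/day, over 1000 g/kg."""
    return conc_mg_per_l * flow_m3_per_s * 86.4


def residual_share(total_load, known_loads):
    """Share of the total load left over after subtracting known source loads."""
    residual = total_load - sum(known_loads.values())
    if residual < 0:
        raise ValueError("known loads exceed the total load")
    return residual / total_load


# Hypothetical inputs for illustration only.
total_tp = load_kg_per_day(conc_mg_per_l=0.20, flow_m3_per_s=10.0)
known = {"wastewater plant": 40.0, "aquaculture": 30.0}   # kg/day, hypothetical

share = residual_share(total_tp, known)
print(f"total TP load: {total_tp:.1f} kg/day")
print(f"residual (non-point) share: {share:.1%}")
```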
The framework adopts a user-intent-driven paradigm to preemptively resolve potential multi-engine conflicts. Under standard operations, the task routing topology isolates execution within a single selected engine, thereby precluding multi-engine operational friction. In scenarios where the user explicitly rejects the primary response, the architecture triggers a deep analysis sequence coupled with an automated response-evaluation routine designed to enforce ultimate output fidelity.
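The tiered routing described in this section can be sketched as a small dispatcher. The `Request` fields, the threshold value of 0.7, and the stub engine callables are assumptions for illustration; the text does not state the actual threshold or the engine interfaces.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7   # hypothetical value; not stated in the text


@dataclass
class Request:
    query: str
    has_media: bool = False              # images or video attached
    wants_deep_reasoning: bool = False   # user explicitly asked for it
    rejected_previous: bool = False      # user was dissatisfied with an answer


def handle(request, call_v3, call_r1, call_vl):
    """Return (engine_name, answer) following the tiered routing rules."""
    if request.has_media:
        return "Qwen2.5-VL", call_vl(request.query)

    # Every non-multimodal request is first answered by the instruction engine.
    draft, confidence = call_v3(request.query)

    escalate = (request.wants_deep_reasoning
                or request.rejected_previous
                or confidence < CONFIDENCE_THRESHOLD)
    if escalate:
        # The reasoning engine sees the original query together with the draft.
        return "DeepSeek-R1", call_r1(request.query, draft)
    return "DeepSeek-V3", draft


# Stub engines standing in for real model calls.
def fake_v3(query):
    return f"draft answer to: {query}", (0.9 if len(query) < 40 else 0.5)


def fake_r1(query, draft):
    return f"refined answer to: {query} (based on: {draft})"


def fake_vl(query):
    return f"multimodal answer to: {query}"


print(handle(Request("latest pH at one station"), fake_v3, fake_r1, fake_vl)[0])
print(handle(Request("why?", rejected_previous=True), fake_v3, fake_r1, fake_vl)[0])
print(handle(Request("describe image", has_media=True), fake_v3, fake_r1, fake_vl)[0])
```

Because each request ends in exactly one engine's answer, the dispatcher never has to reconcile conflicting outputs, which matches the single-engine isolation described above.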
2.6. Context Adaptation
Upon completion of the EnvSentry model, a context-adaptive mechanism was designed to enhance model performance by incorporating more explicit instructions, structured reasoning steps, and domain-specific input formats into the input framework. The flowchart illustrates a dynamic and iterative workflow for context management within EnvSentry (
Figure 5). It conceptualizes contexts not as static inputs, but as evolving guidance for task execution. The process begins with a Query, which initiates the cycle. The Inspirits phase represents the initial generation of ideas and responses. These are then organized into a coherent framework during the Structured Data phase. The core of the refinement loop involves Reflection & Structuring, where strategies are critically assessed and improved. Throughout this cycle, Rule Checking acts as a consistent governance mechanism, ensuring that all outputs adhere to predefined guidelines and quality standards. Ultimately, this modular process of generation, reflection, and curation allows strategies and knowledge to be systematically accumulated, refined, and organized, enabling the system to learn and adapt over time.
3. Application and Performance
3.1. RAG Performance in EnvSentry
The integration of Retrieval-Augmented Generation (RAG) represents an important improvement in the deployment of large language models (LLMs) for eco-environmental domains. A tripartite evaluation of the EnvSentry model, based on accuracy, attribution rate, and response completeness, compares model performance with and without the RAG framework and shows the contribution of RAG (
Figure 6). The consolidated results demonstrate that RAG does not merely incrementally improve performance but fundamentally enhances the model’s reliability, verifiability, and utility, while reducing important limitations of base LLMs.
The evaluation dataset of RAG is constructed by synthesizing a question bank designed by eco-environmental domain experts with authentic user queries derived from real-world operational scenarios, systematically spanning six benchmarking tasks consisting of basic information resolution, standard specifications, case analysis, intent recognition, NL2SQL translation, and programmatic coding ability. To ensure the objectivity and reliability of the evaluation metrics, a rigorous ground-truth dataset is established for all benchmarking queries by extracting definitive reference answers directly from authoritative repository documentation under strict environmental domain expert verification.
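The exact scoring rules for the three metrics are not spelled out in the text. The sketch below shows one plausible operationalization against a ground-truth set: accuracy as the fraction of answers judged correct, attribution rate as the fraction of answers citing at least one retrievable source, and completeness as the mean fraction of required key points covered. The record fields and the two sample records are hypothetical.

```python
def score(records):
    """Compute accuracy, attribution rate, and completeness over graded records."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    accuracy = sum(1 for r in records if r["correct"]) / n
    attribution = sum(1 for r in records if r["cited_sources"]) / n
    completeness = sum(
        len(set(r["covered"]) & set(r["required"])) / len(r["required"])
        for r in records
    ) / n
    return {"accuracy": accuracy,
            "attribution_rate": attribution,
            "completeness": completeness}


# Hypothetical graded answers for illustration.
records = [
    {"correct": True, "cited_sources": ["HJ 694-2014"],
     "required": ["method", "detection limit"],
     "covered": ["method", "detection limit"]},
    {"correct": False, "cited_sources": [],
     "required": ["cause", "evidence"],
     "covered": ["cause"]},
]

metrics = score(records)
print(metrics)
```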
The accuracy metrics reveal that RAG improves performance across operational dimensions, especially in knowledge-intensive tasks (
Figure 6a). The increase in basic information accuracy from 35% to 90% and standard specifications from 53% to 88% underscores RAG’s critical function in providing access to an external, authoritative knowledge base. This drastically reduces factual errors and hallucinations. Significant gains in functional capabilities such as NL2SQL (68% to 94%) and case analysis (65% to 85%) further indicate that retrieved context enables more precise reasoning and execution, positioning the model as an analytical tool rather than only a text generator.
The attribution rate data offers the most striking evidence of RAG’s role in ensuring output verifiability (
Figure 6b). The critically low attribution rates without RAG, particularly in standard specifications at 5%, case analysis at 9%, and basic information at 7%, indicate that the base model operates as a black box generating unsubstantiated and potentially unreliable content. The implementation of RAG elevates these rates to 90%, 87%, and 88%, respectively. This result shows that RAG links model outputs to retrievable context and provides more transparent and auditable evidence. This capability is indispensable for building trust and facilitating validation in a scientific or regulatory context.
The completeness evaluation demonstrates that RAG is useful for generating more complete and better-organized responses (
Figure 6c). The extreme deficit in completeness without RAG, again most acute in basic information (27%) and case analysis (37%), reveals a tendency for the base model to produce fragmented, incomplete, or evasive answers. By providing a comprehensive informational scaffold, RAG guides the model to cover required information, raising completeness to 98% and 83% in these domains. This ensures that the outputs are not only correct but also functionally useful for end-users who require a full picture for decision-making.
The combined interpretation of these three metrics shows that the RAG framework systematically addresses the fundamental weaknesses of generative LLMs. Specifically, this architecture simultaneously rectifies semantic hallucination, knowledge obsolescence, and operational unreliability by addressing the underlying empirical shortfalls traditionally manifested in low accuracy, deficient attribution, and incomplete output generation.
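For reference, the figures quoted in this section can be collected in one place and converted into percentage-point gains. The numbers below are copied from the text; only the tasks explicitly quoted for each metric are included, and the gains are plain differences rather than additional results.

```python
# (without RAG, with RAG) in percent, as quoted in the text
reported = {
    "accuracy": {
        "basic information": (35, 90),
        "standard specifications": (53, 88),
        "NL2SQL": (68, 94),
        "case analysis": (65, 85),
    },
    "attribution rate": {
        "standard specifications": (5, 90),
        "case analysis": (9, 87),
        "basic information": (7, 88),
    },
    "completeness": {
        "basic information": (27, 98),
        "case analysis": (37, 83),
    },
}

gains = {
    metric: {task: after - before for task, (before, after) in tasks.items()}
    for metric, tasks in reported.items()
}

for metric, tasks in gains.items():
    for task, gain in tasks.items():
        print(f"{metric:17s} {task:25s} +{gain} percentage points")
```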
Generally, the transition from a base LLM to a RAG-augmented system marks the evolution from a proficient but unreliable pattern-matching engine to a robust, knowledge-grounded, and verifiable assistant. For technical domains such as environmental monitoring, where precision, auditability, and thoroughness are essential, the integration of RAG is not an optional enhancement but a foundational requirement for dependable and effective deployment.
3.2. Workflow of EnvSentry for Applications
The diagram illustrates EnvSentry as a novel LLM-driven agentic architecture designed for complex environmental monitoring and assessment tasks (
Figure 7). A case study is conducted to assess and analyze the current anomalies in the mainstream of the Yangtze River and to forecast its compliance status for the current year.
The operational workflow commences with a user-initiated natural language query processed by the EnvSentry framework acting as the primary orchestration hub. To manage complex high-level objectives, this central planner executes automated task decomposition to translate abstract user inputs into a structured sequential schedule of executable sub-tasks.
To evaluate anomalies and forecast compliance status within the presented case study, the scheduling framework orchestrates three continuous procedural phases spanning the initial baseline analysis of the Yangtze River water quality, the subsequent diagnosis of potential risks alongside their underlying drivers, and the final prediction of current-year compliance trends. The execution of this sequential pipeline is distributed across a multi-agent collective where each sub-task maps directly to a specialized autonomous entity namely a data retrieval agent for source ingestion, an analytical agent for risk and causal diagnostics, and a predictive agent embedded with advanced forecasting models. To operationalize these workflows, the agents dynamically invoke a domain-specific Model Context Protocol toolkit that unifies diverse modular capabilities including statistical reporting, correlation mapping, anomaly detection, geospatial watershed analysis, and multi-scale environmental compliance evaluations.
EnvSentry synthesizes the processed information and outputs from the agentic execution pipeline to generate a coherent, comprehensive response that directly addresses the user’s initial query. This architecture modularizes complex problem-solving, combining the reasoning and planning capabilities of an LLM with the specialized, executable functions of autonomous agents and domain-specific tools.
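The decomposition into sequential sub-tasks, each handled by a specialized agent that invokes shared tools, can be sketched as follows. The plain dictionary tool registry stands in for the Model Context Protocol toolkit, the agents are ordinary functions, the forecast is a naive mean of recent values, and the concentration series and limit are hypothetical; none of these reflect the actual implementation.

```python
def statistics_tool(values):
    return {"mean": sum(values) / len(values), "max": max(values)}


def exceedance_tool(values, limit):
    """Indices of observations above the limit."""
    return [i for i, v in enumerate(values) if v > limit]


# Stand-in for a tool registry; a real system would expose these via a protocol.
TOOLS = {"statistics": statistics_tool, "exceedance": exceedance_tool}


def retrieval_agent(ctx):
    ctx["series"] = list(ctx["fetch"]())
    return ctx


def analysis_agent(ctx):
    ctx["stats"] = TOOLS["statistics"](ctx["series"])
    ctx["exceed_idx"] = TOOLS["exceedance"](ctx["series"], ctx["limit"])
    return ctx


def prediction_agent(ctx):
    # Placeholder forecast: mean of the three most recent observations.
    recent = ctx["series"][-3:]
    ctx["forecast"] = sum(recent) / len(recent)
    ctx["predicted_compliant"] = ctx["forecast"] <= ctx["limit"]
    return ctx


# Sub-tasks in the order given in the case study.
PLAN = [
    ("baseline analysis", retrieval_agent),
    ("risk diagnosis", analysis_agent),
    ("compliance prediction", prediction_agent),
]


def run(plan, ctx):
    """Execute the sub-tasks in order, passing a shared context along."""
    ctx["trace"] = []
    for name, agent in plan:
        ctx = agent(ctx)
        ctx["trace"].append(name)
    return ctx


# Hypothetical monthly concentrations (mg/L) and limit.
result = run(PLAN, {"fetch": lambda: [0.12, 0.15, 0.22, 0.18, 0.16],
                    "limit": 0.20})
print(result["trace"])
print(result["exceed_idx"], result["predicted_compliant"])
```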
3.3. Case Study and Performance
3.3.1. Real-Time Water Quality Analysis and Assessment
We implemented a case study focused on real-time water quality analysis and warning, deploying the EnvSentry model across the national surface water quality monitoring network (
Figure A1). The objective was to leverage the model’s capacity for processing massive real-time and historical datasets to achieve rapid identification of pollution anomalies. This addresses a critical limitation of conventional methods, which largely depend on univariate outlier statistics for single pollutants and are ill-suited for rapid localization when confronted with multi-source, multi-scale data. The EnvSentry-driven intelligent agent introduces a more integrated analytical framework that combines multivariate anomaly detection, spatiotemporal correlation analysis, and similarity matching against historical pollution events. This methodology enables the system to move beyond simple detection, providing capabilities for inferring potential pollution etiologies and identifying source-tracing pathways.
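The text does not name the multivariate anomaly detection method. As one standard example of how a multivariate check differs from per-pollutant outlier statistics, the sketch below scores a two-parameter reading by its Mahalanobis distance from historical readings, which accounts for the correlation between parameters. The readings (dissolved oxygen and ammonia nitrogen, in mg/L), the threshold of 3.0, and the restriction to two parameters are assumptions for illustration.

```python
import math
from statistics import mean


def fit_2d(history):
    """Return the mean vector and inverse covariance of two-parameter readings."""
    xs = [p[0] for p in history]
    ys = [p[1] for p in history]
    mx, my = mean(xs), mean(ys)
    n = len(history)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in history) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("degenerate covariance matrix")
    inverse = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inverse


def mahalanobis(point, centre, inverse):
    """Distance of a reading from the historical centre, scaled by covariance."""
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    d2 = (dx * (inverse[0][0] * dx + inverse[0][1] * dy)
          + dy * (inverse[1][0] * dx + inverse[1][1] * dy))
    return math.sqrt(d2)


# Hypothetical history: (dissolved oxygen, ammonia nitrogen) in mg/L.
history = [(8.0, 0.30), (7.8, 0.32), (8.2, 0.28), (7.9, 0.31),
           (8.1, 0.29), (8.0, 0.33), (7.7, 0.35), (8.3, 0.27)]
centre, inverse = fit_2d(history)

THRESHOLD = 3.0   # hypothetical alert threshold
for reading in [(8.0, 0.31), (5.0, 1.20)]:
    distance = mahalanobis(reading, centre, inverse)
    print(reading, "anomalous" if distance > THRESHOLD else "normal")
```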
Integration of EnvSentry optimizes both the detection latency and diagnostic precision of pollution events. Empirical data indicate that the system elevates the accuracy of pollution source inference from the baseline 50% to 75%, while simultaneously compressing the diagnostic and evaluation period by 90% (
Figure 8). This substantial reduction in processing time enables minute-level anomaly identification and causal tracing, thereby providing reliable empirical support for emergency responses.
3.3.2. Air Quality Forecasting and Trend Assessment
In an application for regional air quality forecasting, the EnvSentry model was deployed as a knowledge coordination module. This system leverages comprehensive national networked datasets and high-frequency monitoring feeds to intelligently orchestrate a hybrid predictive framework, synergizing conventional numerical models, such as deterministic chemical transport models, with advanced statistical methods. This approach addresses the limitations of traditional numerical models, especially under complex scenarios. Because these models often exhibit significant deviations and lack sufficient spatiotemporal resolution, they cannot accurately capture local small-scale pollution events driven by atypical meteorology. Consequently, their practical value for operational guidance remains constrained.
The EnvSentry-driven intelligent agent functions as an adaptive post-processing and bias-correction layer. By analyzing and comprehending the complex, non-linear deviation patterns (i.e., systematic biases) between historical model outputs and observational data, the agent performs dynamic, real-time corrections on new model results. It further refines these corrected forecasts by assimilating near-real-time data from local monitoring stations, generating predictions that demonstrate better agreement with observed conditions.
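A highly simplified sketch of such a post-processing layer is given below. The agent is described as handling non-linear deviation patterns; the sketch deliberately substitutes an ordinary least-squares linear fit and a fixed blending weight to keep the idea visible. The function names, the weight of 0.3, and all concentration values are hypothetical.

```python
def fit_linear_bias(forecasts, observations):
    """Least-squares fit of observation = slope * forecast + intercept,
    learned from paired historical model outputs and measurements."""
    n = len(forecasts)
    if n < 2 or n != len(observations):
        raise ValueError("need at least two paired samples")
    mf = sum(forecasts) / n
    mo = sum(observations) / n
    var = sum((f - mf) ** 2 for f in forecasts)
    if var == 0:
        raise ValueError("forecasts must not all be identical")
    cov = sum((f - mf) * (o - mo) for f, o in zip(forecasts, observations))
    slope = cov / var
    intercept = mo - slope * mf
    return slope, intercept


def correct(raw_forecast, slope, intercept):
    """Apply the learned systematic-bias correction to a new model output."""
    return slope * raw_forecast + intercept


def blend(corrected, latest_observation, weight=0.3):
    """Nudge the corrected forecast toward the newest station reading.
    `weight` is an assumed tuning parameter in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * corrected + weight * latest_observation


# Hypothetical history: the model under-predicts low values and
# over-predicts high ones.
past_forecasts = [40.0, 60.0, 80.0, 100.0]
past_observed = [50.0, 65.0, 80.0, 95.0]

slope, intercept = fit_linear_bias(past_forecasts, past_observed)
corrected = correct(120.0, slope, intercept)
final = blend(corrected, latest_observation=100.0)
```

A production system would likely replace the linear fit with a model that conditions on meteorology and location, but the two-stage structure (learn the systematic deviation, then assimilate recent observations) is the same.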
Practically, this system delivers more precise and reliable trend forecasts for regional joint prevention and control authorities. This improved accuracy helps authorities formulate and execute timely mitigation strategies tailored to specific sub-regional needs. Quantitative evaluation confirmed the system’s performance and efficiency. First, the agent-driven workflow streamlined the analytical process, reducing the time required for comprehensive expert-led judgment from an average of 60 min to 20 min. Second, and perhaps more significantly, predictive judgments generated by non-specialist personnel invoking the agent achieved a consistency rate exceeding 85% when compared to forecasts produced by domain experts. This demonstrates the model’s ability to represent expert knowledge and support rapid-response analysis.
3.3.3. Operations and Maintenance (O&M) of the National Ecological Environmental Monitoring Network
In an application focused on the intelligent Operations and Maintenance (O&M) of the national eco-environmental monitoring network, the EnvSentry model was deployed to address significant challenges in data validation and operational efficiency. The network’s O&M workflows were traditionally hindered by the difficulty in identifying data anomalies, the high labor costs and subjective errors associated with manual verification, and inefficiencies in managing frequent maintenance tasks and work order closures.
The EnvSentry solution utilizes LLM-based reasoning, enabling business rules to be entered as natural language descriptions. The model then automatically queries, analyzes, and identifies anomalous data patterns. The system is integrated with agentic tools and offers a dual-mode capability: it can execute pre-encoded rule-based algorithms for routine tasks, or it can allow non-technical business personnel to directly leverage the model’s reasoning for complex diagnostics. This implementation yielded substantial improvements in network integrity and efficiency. Quantitatively, the introduction of the EnvSentry model improved data anomaly detection accuracy by 10%. Furthermore, it achieved a 90% rule intelligence rate, indicating that the model can translate natural language rules into executable logic. This automation reduced the total weekly manual verification workload by approximately 30%, shortening the work order processing cycle and enhancing the overall stability and efficiency of the national monitoring network.
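The dual-mode idea can be sketched as follows. This is an illustration under stated assumptions, not the deployed system: `PRESET_RULES` stands for the pre-encoded algorithms, and `translate_rule` is a deliberately trivial stand-in for the LLM translation step that recognizes only one hypothetical sentence pattern. The rule names, thresholds, and readings are invented for the example.

```python
import re
from typing import Callable, Dict, List, Optional

Rule = Callable[[float], bool]

# Mode 1: pre-encoded rules for routine checks (hypothetical example rule).
PRESET_RULES: Dict[str, Rule] = {
    "negative_reading": lambda v: v < 0,
}


def translate_rule(text: str) -> Optional[Rule]:
    """Mode 2 stand-in: turn a natural-language rule into a predicate.
    A real system would call an LLM here; this stub only understands
    'flag values above N' and 'flag values below N'."""
    match = re.fullmatch(
        r"flag values (above|below) (-?\d+(?:\.\d+)?)",
        text.strip().lower(),
    )
    if match is None:
        return None  # untranslatable: route to manual review
    limit = float(match.group(2))
    if match.group(1) == "above":
        return lambda v: v > limit
    return lambda v: v < limit


def find_anomalies(values: List[float], rule: Rule) -> List[int]:
    """Return the indices of readings flagged by the rule."""
    return [i for i, v in enumerate(values) if rule(v)]


readings = [0.5, -1.0, 12.0, 3.0]  # hypothetical sensor values

routine_hits = find_anomalies(readings, PRESET_RULES["negative_reading"])

nl_rule = translate_rule("Flag values above 10")
nl_hits = find_anomalies(readings, nl_rule) if nl_rule else []

unsupported = translate_rule("check whether the sensor drifted overnight")
```

Returning `None` for rules that cannot be translated keeps the failure explicit, so such rules can fall back to manual verification instead of being applied incorrectly.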
3.4. Performance Comparison with Typical LLMs
We compared the performance of EnvSentry with that of two LLMs, ChatGPT-5 and DeepSeek V3.1, using two questions. The first question asked the models to generate a 72 h air quality forecast for the Beijing-Tianjin-Hebei (BTH) region, and the second asked them to assess the impact of rainfall during this year’s flood season on water quality in the Yellow River Basin (YRB) (Table A1). The assessment focuses on forecast structure, meteorological integration, specificity, and practical utility. The user-level evaluation prompts, representative desensitized agent prompt templates, and task execution rules used to support this comparison are provided in Appendix A.3 to improve methodological transparency and reproducibility.
To move beyond qualitative observation and enable a reproducible quantitative comparison, we evaluated the three models using four standardized criteria: factual accuracy, temporal specificity, structural completeness, and operational practicality. Each criterion was scored on a 0–100 scale according to predefined scoring rules. The evaluation was independently conducted by two annotators using the same set of domain-specific queries and original model outputs reported in this section (Table 1).
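One way to combine such scores is sketched below. The text does not specify how the two annotators' scores were aggregated, so the equal-weight averaging, the disagreement measure, and every score in the example are assumptions for illustration only and are not values from Table 1.

```python
CRITERIA = (
    "factual_accuracy",
    "temporal_specificity",
    "structural_completeness",
    "operational_practicality",
)


def validate(scores):
    """Check that one annotator scored every criterion on the 0-100 scale."""
    for criterion in CRITERIA:
        value = scores[criterion]  # KeyError if a criterion is missing
        if not 0 <= value <= 100:
            raise ValueError(f"{criterion} score {value} is outside 0-100")


def aggregate(annotator_scores):
    """Average each criterion across annotators, then average the criteria
    with equal weights (an assumed scheme) to get one overall score."""
    for scores in annotator_scores:
        validate(scores)
    count = len(annotator_scores)
    per_criterion = {
        c: sum(s[c] for s in annotator_scores) / count for c in CRITERIA
    }
    overall = sum(per_criterion.values()) / len(CRITERIA)
    return per_criterion, overall


def max_disagreement(first, second):
    """Largest absolute gap between two annotators on any criterion."""
    return max(abs(first[c] - second[c]) for c in CRITERIA)


# Hypothetical scores for a single model output.
annotator_1 = dict(zip(CRITERIA, (80, 70, 90, 60)))
annotator_2 = dict(zip(CRITERIA, (90, 70, 80, 80)))

per_criterion, overall = aggregate([annotator_1, annotator_2])
gap = max_disagreement(annotator_1, annotator_2)
```

Reporting the largest per-criterion gap alongside the averages gives a quick indication of how closely the two annotators agree.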
For Query 1, the comparative outputs from the three LLMs revealed distinct approaches and varying levels of sophistication in generating a 72 h air quality forecast for the BTH region. ChatGPT-5 produced a well-structured, general-audience forecast using a generic Day 1–3 format. It correctly identified key meteorological drivers and included public health recommendations. However, its output remained hypothetical and non-specific, lacking temporal anchoring to a real forecast cycle, and therefore resembled a reusable template rather than a situation-specific prediction.
DeepSeek V3.1 demonstrated enhanced specificity by anchoring its forecast to explicit calendar dates (24th–26th November). Its output included a meteorological explanation, describing the progression from cold air intrusion to weakening pressure fields and poor dispersion conditions, including mention of specific phenomena like nocturnal floating dust. This indicated a competent synthesis of typical weather-pollution linkages, though it omitted a formal forecast citation.
In contrast, EnvSentry generated an output most closely aligned with operational forecasting standards. It began by citing a specific forecast issue date, establishing critical provenance. The forecast employed precise, standardized air quality terminology such as Excellent to Good, and provided the highest spatial resolution, identifying nuanced risk areas such as the foothills of Hebei. It communicated the overall trend while appropriately qualifying short-term peaks and localized risks, demonstrating a superior grasp of forecast confidence and spatial variability. Consequently, the EnvSentry output demonstrates superior operational verisimilitude and practical utility within both professional and public communication contexts.
For Query 2, ChatGPT-5 provided a concise, bullet-point summary of general impact mechanisms, including increased runoff, pollutant flushing, a dilution effect, and microbial changes. Its response is factually sound for a generic explanation but exhibits key limitations for the specific query. It makes no reference to the year 2024 mentioned in the prompt, offers no quantitative data, and lacks any temporal or spatial nuance specific to the Yellow River Basin (YRB). The output reads as a static, textbook-style overview suitable for foundational education but devoid of situational analysis or report-like structure.
DeepSeek V3.1 demonstrated a notable advance in conceptual framing and narrative coherence. It introduced a framework covering short-term shocks and long-term purification, capturing the competing temporal dynamics of flood impacts. Its response elaborated on specific pollutant types and processes such as channel scouring with greater technical detail than ChatGPT-5. While it anchored the response to 2024, it still operated on a conceptual level without integrating concrete, time-bound monitoring data, such as specific pollutant concentration trends or geographic hotspots. Its output is a well-reasoned, mechanistically rich discussion but remains a qualitative analysis.
EnvSentry generated a response that closely emulates a formal environmental assessment report. Its output is organized into clear sections, including the overall situation and key findings, mirroring professional scientific communication. Crucially, it incorporated specific data points (simulated or recalled), such as a V-shaped trend, a change in the proportion of Class I-III water bodies from 79.8% to 86.1%, the count of Inferior Class V sections, and named vulnerable tributaries (e.g., the Malian River) and water bodies (e.g., Ulansuhai Lake). Furthermore, it extended the analysis to include regionally differentiated findings, a mechanism summary, and actionable policy recommendations. This synthesis of quantitative metrics, spatial specificity, and a standardized reporting format results in an output of higher practical relevance and functional utility for a domain expert or decision-maker.
4. Discussion
Several research-oriented large language models have emerged for climate and environmental sciences. For example, systems such as ClimateGPT, WaterGPT, and OceanGPT have adapted LLMs to specific environmental subdomains [17,18,19,29,30], and frameworks like EnvGPT have made strides in interdisciplinary knowledge integration through supervised fine-tuning [12]. EnvSentry differs from these models in its system architecture. Whereas they primarily refine a single model’s knowledge and reasoning capabilities, the multi-engine collaborative mechanism of EnvSentry explicitly acknowledges the absence of a universal yet domain-specialized LLM. Through dynamic task routing across specialized engines, EnvSentry balances deep cognitive reasoning with real-time structured operations and thereby circumvents the limitations of a single generalized model.
A core advantage of EnvSentry lies in its systematic knowledge grounding. Similar to ChatClimate, which grounds its responses in the authoritative IPCC AR6 reports to mitigate hallucination [23], EnvSentry implements an enhanced RAG paradigm built upon a massive, hierarchically structured knowledge base. However, EnvSentry extends this concept beyond a single document corpus. The system utilizes an architecture comprising a structured knowledge hub, active knowledge injection, and multi-engine collaborative verification. By integrating national standards, technical guidelines, literature, and case databases, this framework enables the precise retrieval of technical specifications (e.g., HJ 694-2014) and ensures that outputs are both verifiable and compliant with regulatory frameworks. This addresses the critical issue of factual reliability, a challenge that persists even in advanced general models.
Furthermore, EnvSentry’s design as an agentic workflow orchestrator moves beyond a pure Q&A system. It transforms the LLM from an endpoint into a planning and dispatch center that decomposes complex user queries into executable schedules. By leveraging a suite of specialized agents and MCP tools, it connects language understanding with practical environmental modeling, achieving a closed-loop integration of perception, analysis, and decision-making that is only nascent in other environmental AI applications.
Despite the operational efficiencies demonstrated by the EnvSentry framework, multi-engine orchestration inherently introduces complex structural trade-offs that necessitate critical evaluation. First, regarding infrastructure scalability, the expanding topology of coordinated autonomous agents increases systemic communication overhead and algorithmic scheduling complexity during peak environmental monitoring periods. Second, the reliance on high-fidelity inference models such as DeepSeek R1 substantially elevates the cumulative computational cost and operational latency, presenting clear bottlenecks for massive real-time data ingestion pipelines. Finally, a critical vulnerability lies in the potential cascaded propagation of reasoning inconsistencies across interconnected agents; an unresolved cognitive anomaly or semantic deviation generated by the primary processing layer can systematically distort the downstream contextual payloads ingested by specialized analytical or multimodal engines.
The comparative analysis of the three LLMs (ChatGPT-5, DeepSeek V3.1, EnvSentry) reveals a clear spectrum of capability in addressing domain-specific environmental queries, ranging from general knowledge summarization to near-professional situational reporting. This comparison shows the influence of model architecture, training data specialization, and prompting constraints on output utility for scientific applications. ChatGPT-5 serves as a competent general explanatory tool, suitable for foundational education on the topic but limited by its static knowledge cutoff and lack of domain-specific data integration. DeepSeek V3.1 offers greater technical depth and reasoning, potentially providing a more satisfactory answer for users seeking a detailed process-based understanding, yet it is still constrained by the absence of current-year facts. EnvSentry demonstrates specialized domain superiority. By generating a response that incorporates time-bound data, geographic specifics, and a formal reporting structure, it delivers stronger domain-specific performance for professionals or stakeholders requiring situation-aware analysis. Compared to ChatGPT-5 and DeepSeek V3.1, EnvSentry illustrates the evolution from a general-purpose explanatory tool to a domain-specialized analytical assistant. This highlights a broader principle in AI application: while foundational LLMs provide broad accessibility to knowledge, maximizing utility in specialized scientific and operational fields increasingly requires tailored models or robust retrieval-augmented generation systems capable of accessing and formatting domain-specific information.
Despite these advancements, several challenges persist, aligning with broader limitations in the field. First, while the reasoning engine handles complex tasks, the depth of causal reasoning in novel, multi-factorial environmental scenarios remains a frontier. Enhancing this capability may require integration with more sophisticated symbolic reasoning or mechanistic models, an area where systems like ClimateGPT have begun exploration by synthesizing interdisciplinary research [17,31]. Second, model performance is heavily dependent on data standardization and quality. Inconsistencies in data formats and governance across the vast monitoring network can hinder optimal performance, a common bottleneck for data-driven AI systems. Our future strategy involves completing the migration to a unified domestic database and refining the knowledge hierarchy to mitigate this issue. Third, the computational demand of multi-engine orchestration and large-scale knowledge retrieval presents significant resource constraints. This reflects the high costs associated with training and fine-tuning large models such as EnvGPT [12,32]. To distribute the computational load efficiently, the proposed cloud-edge-terminal computing architecture offers a targeted infrastructure solution. Finally, operational security and semantic hallucination remain non-trivial risks: the integration of retrieval-augmented generation and multi-engine verification within EnvSentry significantly mitigates factual errors, yet the vulnerability to unsubstantiated output generation in edge cases persists. Future research will therefore incorporate methods such as federated learning for data privacy and knowledge anchoring for output fidelity, improving system robustness and reliability in line with rigorous source citation practices.
5. Conclusions
The EnvSentry model represents a critical advancement in the application of large language model and agent technology to the eco-environmental monitoring domain. By innovating a multi-engine collaborative framework grounded in robust data governance and a comprehensive knowledge base, EnvSentry has effectively transformed the paradigm from delayed, fragmented analysis to real-time, integrated, and intelligent decision-making. The demonstrated performance improvements across diverse application scenarios validate the feasibility of an integrated sensing, analysis, and decision-making framework for environmental governance. Nevertheless, large-scale deployment of EnvSentry still depends on continuous data standardization, stable computational resources, secure access control, and human verification for low-confidence outputs. These practical constraints should be addressed before fully autonomous environmental governance can be achieved in real-world monitoring networks.
Future development of EnvSentry will move towards a highly autonomous, interpretable, and collaborative environmental intelligence agent. A critical developmental pathway involves extending EnvSentry’s capabilities to the terminal layer of environmental monitoring by embedding lightweight models directly within front-end sensing devices. This edge-level integration will enable in situ data processing and preliminary analysis at the data source, facilitating end-to-end artificial intelligence deployment across the entire environmental monitoring, data governance, and decision-making pipeline. Further work will prioritize the joint advancement of multimodal data fusion and high-resolution spatiotemporal prediction models. To strengthen scientific rigor, this integrated analytical framework will incorporate domain-specific mechanistic models. Together, these developments are intended to yield a fully closed-loop intelligent environmental management system.