Article

ForestGPT and Beyond: A Trustworthy Domain-Specific Large Language Model Paving the Way to Forestry 5.0

by Florian Ehrlich-Sommer 1, Benno Eberhard 2 and Andreas Holzinger 1,*
1 Human-Centered AI Lab, Institute of Forest Engineering, Department of Ecosystem Management, Climate and Biodiversity, BOKU University, 1190 Vienna, Austria
2 Italian National Research Council-Institute of BioEconomy (CNR IBE), 50019 Sesto Fiorentino, Italy
* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3583; https://doi.org/10.3390/electronics14183583
Submission received: 21 July 2025 / Revised: 1 September 2025 / Accepted: 5 September 2025 / Published: 10 September 2025

Abstract

Large language models (LLMs) such as Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly used across domains, yet their generic training data and propensity for hallucination limit reliability in safety-critical fields like forestry. This paper outlines the conception and prototype of ForestGPT, a domain-specialised assistant designed to support forest professionals while preserving expert oversight. It addresses two looming risks: unverified adoption of generic outputs and professional mistrust of opaque algorithms. We propose a four-level development path: (1) pre-training a transformer on curated forestry literature to create a baseline conversational tool; (2) augmenting it with Retrieval-Augmented Generation to ground answers in local and time-sensitive documents; (3) coupling growth simulators for scenario modeling; and (4) integrating continuous streams from sensors, drones and machinery for real-time decision support. A Level-1 prototype, deployed at Futa Expo 2025 via a mobile app, successfully guided multilingual visitors and demonstrated the feasibility of lightweight fine-tuning on open-weight checkpoints. We analyse technical challenges, multimodal grounding, continual learning, safety certification, and social barriers including data sovereignty, bias and change management. Results indicate that trustworthy, explainable, and accessible LLMs can accelerate the transition to Forestry 5.0, provided that human-in-the-loop guardrails remain central. Future work will extend ForestGPT with full RAG pipelines, simulator coupling and autonomous data ingestion. Whilst exemplified in forestry, a complex, safety-critical, and ecologically vital domain, the proposed architecture and development path are broadly transferable to other sectors that demand trustworthy, domain-specific language models under expert oversight.

1. Introduction

The introduction of publicly available LLMs (Large Language Models) such as ChatGPT, the "Chat Generative Pre-Trained Transformer" [1], led to disruptions in many industries and private applications, with a huge impact on society [2].
The possibility of obtaining a broad range of information, digested in a form suited to the question, or more precisely the prompt, posed to the LLM has made it faster and simpler for everyone to obtain specific information [3]. In recent months, multiple LLMs have emerged and are being widely used. Some include the already mentioned OpenAI ChatGPT, Google Gemini, and Anthropic Claude Opus [4], as well as more recent models such as the OpenAI o3 family (December 2024–June 2025), ChatGPT 5 (August 2025), DeepSeek-R1 (January 2025; updated May 2025), Google Gemini 2.5 Pro (June 2025), xAI Grok 4 (mid 2025), and Alibaba Qwen 3 (mid 2025).
GPT-5 introduces a new “Thinking” mode optimised for multi-step workflows, while Gemini 2.5 Pro and Claude Sonnet 4 expand to 1M-token contexts, enabling full-document compliance workflows without retrieval.
These LLMs operate mostly on freely available online data, possibly supplemented by specific training data in the background, which makes them well suited to handling everything from everyday tasks to more specific questions. Nevertheless, users may have already realised that, like all tools, these tools have limitations. What matters is understanding the tool, much as a handyman needs to know how to use the tools at hand to accomplish a given task.
When experts work with LLMs, it is common to obtain substandard answers that do not really address the question asked; the model may even hallucinate and give completely wrong answers. This is of particular importance for critical tasks, which in this study means forestry. Special expert knowledge is needed to allow a forest operations professional to trust the answers of an LLM, and even though methods like RAG (Retrieval-Augmented Generation) are already in use to make the sources of the data more transparent, the forest expert might still receive a false answer [5,6]. These limitations can lead to two problems in the future: first, experts do not check the output carefully enough and therefore act on wrong information; second, experts do not use the tools at all because they do not trust the output. To overcome both downsides, a domain-specific system needs to be developed that forest experts can trust, and a clear human-in-the-loop approach needs to be followed. Explainability, trustworthiness, and accessibility will be the cornerstones to bring this technology into the forest industry and really move towards Forestry 5.0 [7].
This paper describes the development of ForestGPT, introduces the initial prototype, and examines the main barriers that must be addressed to facilitate widespread adoption within the forest industry and beyond.
The development of ForestGPT is envisioned as a multi-level progression toward a robust, domain-specific language model tailored to forestry professionals.
The first level (Level 1) focuses on the creation of a transformer-based assistant pretrained on extensive forestry-related literature. The assistant serves as a general-purpose chatbot to support practitioners, students, and decision-makers across the forestry sector. Its scope is to provide foundational knowledge and practical insight into key areas such as silviculture, forest yield analysis, and forest engineering. By drawing on a large corpus of domain-specific texts, Level 1 lays the groundwork for subsequent enhancements that integrate dynamic reasoning over context-specific and temporally sensitive data.
The second level (Level 2) concentrates on Retrieval-Augmented Generation (RAG) for integrating local and dynamic knowledge, such as recent harvest plans or regional guidelines.
The third level (Level 3) integrates forest growth simulators, enabling users to explore silvicultural decisions through data-driven scenario modeling and long-term projections. This structured approach combines human-centered design, multimodal interfaces, and domain-specific language modeling to support decision-making at the stand scale and beyond.
The current final level (Level 4) will additionally access all available forest-specific data that is continuously and autonomously generated by various technologies, such as stationary sensors, established aerial data collection, and autonomous robotic platforms that roam the forest. The amount of data collected by these systems enables ForestGPT to evaluate the optimal timber harvesting machine configuration. Drawing on parameters such as slope degree, soil moisture and resistance, understory density, the presence of local hindrances such as stumps or rocks, and the prevailing stem diameter class, the system can identify the most suitable machinery setup, considering both productivity and ecological constraints [8,9]. This way, ForestGPT transforms from a knowledge engine into a field-ready operational assistant.
Once completed, it will offer a wide range of advanced functions and skills in one place. It will read and interpret texts, maps, and tables, cite sources through RAG, run forest growth simulations, and suggest optimal timber harvesting machine configurations. Despite this complexity, it will remain simple to use, requiring no special software knowledge. Users will interact with it just like with today’s chatbots, by asking questions in plain language. The key difference is that ForestGPT will deliver far more specialized and powerful support for forestry tasks. An outline of a proposed timeline is available in Figure 1.
This paper provides a simple explanation of the background and technologies required for LLMs, while also explaining certain terms that have already been used in this introduction and more. The possibilities and challenges are highlighted to help professionals understand what can be done and where they need to be wary in their trust toward LLMs. The first prototype of a Level 1 ForestGPT system is presented and the initial results of the first field test are discussed. Finally, we outline what future work will look like.

2. Background

This section will provide an overview of the most important aspects when it comes to LLMs and together with the initial prototype will give non-experts an understanding of how they could produce their own version of ForestGPT. The background comprises the following sections: (1) What is an LLM and how does it work? (2) What is RAG and how is it used today? (3) What problems can arise when using LLMs? (4) How can domain-specific models be built today?

2.1. What Is an LLM?

Large Language Models (LLMs) can be conceptualized as computational systems, implemented as deep learning networks [10], particularly Transformers, that have been trained on extensive corpora comprising publicly available text, scientific publications, etc., but also large-scale proprietary datasets and images.
Although the resulting capabilities may appear remarkable, their underlying mechanisms can be explained relatively simply without recourse to advanced mathematical formalism [11].
Through large-scale pre-training, LLMs acquire the core capability to predict the most probable subsequent token, given a preceding sequence of tokens, where a token represents a discrete textual unit such as a word fragment or symbol.
Iterating this predictive process trillions of times during training transforms an uninitialized neural architecture into a system capable of performing various natural language processing, and increasingly vision, tasks [12], including drafting legal contracts, translating technical documentation, or summarizing complex forest management plans in seconds.
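For illustration, the following minimal sketch shows this next-token mechanism in action, using the open-source Hugging Face transformers library and the small GPT-2 checkpoint purely as a stand-in for a modern LLM (the model choice and the prompt are ours, chosen for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used purely for illustration; any causal LM predicts the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Thinning a young spruce stand improves"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

# Probability distribution over the *next* token, given the whole prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}  p={prob:.3f}")

Sampling one of these candidates, appending it to the prompt, and repeating is all that "generation" means at the mechanical level.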

2.1.1. How an LLM Learns to Write: From Blank Slate to Fluent Assistant

Training an LLM resembles raising an exceptionally studious apprentice forester. First, we hand the apprentice every textbook, article, and set of field notes ever published; next, we teach proper etiquette and safety; finally, we let the apprentice answer questions in real time under a supervisor’s eye.
Table 1 breaks these phases down.
The engine that makes this possible is the Transformer [13]. Its hallmark is self-attention [1], a mechanism that lets the model inspect every word (or token) in the prompt concurrently and decide which parts matter most when predicting the next token.
By aggregating information across the whole prompt instead of marching through it sequentially, the Transformer can keep track of long-distance dependencies. This, for example, is exactly the ability a forestry assistant needs when it must remember that the species list you mentioned 2000 tokens earlier constrains the thinning recommendations you request at the end of your message [14].
Because attention can be applied layer upon layer, modern models reach hundreds of Transformer blocks stacked atop one another, allowing nuanced reasoning about context, style, and factual detail.
In practical terms, that depth means an LLM can smoothly switch registers (e.g., from a formal report to a conversational explanation) and blend disparate strands of information (silviculture, economics, wildlife management) in a single response.
This deep architectural capacity, however, is only the foundation. To make models practically useful and aligned with user expectations, especially in high-stakes, knowledge-intensive fields like forestry, additional training phases are layered on after pre-training. These post-training stages are where today's models learn to follow instructions, reason step-by-step, and reliably ground their answers in human values and domain-specific standards.
Instruction tuning is just one part of the modern post-training process for large language models. Today's models typically go through several additional stages to further refine their performance. This often starts with supervised fine-tuning (SFT), using curated question–answer pairs, followed by preference fine-tuning (PFT) or reinforcement learning (RL) [15]. A well-known method is reinforcement learning from human feedback (RLHF), but newer techniques are more scalable: reinforcement learning from AI feedback (RLAIF) uses strong teacher models to give feedback at lower cost, while reinforcement learning with verified rewards incorporates correctness signals from external tools or domain-specific validators [16]. Broader reinforcement learning objectives can also be used to shape skills like reasoning depth or planning. Algorithmic approaches include Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and the newer Kahneman–Tversky Optimization (KTO), each balancing stability and data efficiency in different ways [17]. Alongside this, knowledge distillation helps a smaller "student" model learn not only final answers but also structured reasoning from a larger "teacher" model. Together, these techniques form a multi-layered alignment process aimed at producing models that are safer, more reliable, and better aligned with human values and expert standards, which is especially critical in high-stakes domains like forestry.
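To make one of these preference-based objectives concrete, the sketch below implements the core DPO loss directly from summed per-sequence log-probabilities. This is a textbook formulation under our own variable names; a production pipeline would typically rely on a maintained training library instead:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # How much more the trained policy prefers the chosen answer than the
    # frozen reference model does (and likewise for the rejected answer).
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the margin between the two ratios apart, scaled by beta.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()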

2.1.2. Hosted Versus Open-Weight Models: Choosing the Right Deployment Path

Once an organisation decides to use an LLM, the next question is where the model should live. Two broad options exist:
  • Hosted models: Closed-weight services exposed via a cloud API (e.g., OpenAI GPT-4o and GPT-5, Anthropic Claude 3.5 Sonnet and Opus, Google Gemini 2.5 Pro, xAI Grok 4).
  • Open-weight models: Freely downloadable checkpoints that you can operate on your own hardware (e.g., Meta Llama 3.1 405B and 70B, Mistral Mixtral 8 × 22B, Google Gemma 3 27B, DeepSeek-V2, Qwen 2 72B).
Table 2 contrasts the two along dimensions that matter to day-to-day practice [18].
For forestry stakeholders, the decision often balances three forces:
  • Sovereignty versus scale. A national forest agency that cannot allow inventory data to leave the country is likely to prefer open weights. A start-up building a climate-risk dashboard may accept hosted services to reach market faster.
  • Custom analytics. Integrating proprietary growth simulators or LIDAR pipelines often pushes teams towards self-hosting, where they can splice new modules directly into the inference stack [31].
  • Budget profile. Hosted models let you start with no capital outlay and scale elastically; self-hosting offers predictable costs once GPU clusters are amortised.

2.1.3. Building Trust in Domain-Specific Settings

Regardless of the deployment route, experience shows that LLMs deliver the most reliable value when embedded in a layered architecture:
  • Domain fine-tuning. By re-training on silviculture manuals, regional yield tables, and equipment maintenance logs, the model gains a forestry “accent”, reducing the need for verbose prompts.
  • Retrieval-Augmented Generation (RAG). A search component fetches the most relevant passages from current regulations, growth projections, or remote-sensing feeds and appends them to the user prompt, ensuring the answer is grounded in up-to-date evidence.
  • Human oversight and guardrails. Output filters enforce professional and legal standards, while audit logs allow reviewers to trace how a recommendation was formed. Treating the model as a junior analyst, excellent at drafting but never the final authority, preserves accountability.
With those safeguards in place, forestry professionals can exploit the remarkable breadth and speed of modern LLMs to streamline reporting, scenario analysis, and stakeholder communication, all while remaining confident that scientific rigour, data privacy, and regulatory compliance are preserved [32].

2.2. What Is RAG?

Retrieval-Augmented Generation, usually shortened to RAG, is an architectural pattern that blends two complementary abilities:
  • Neural retrieval—locating and returning the most relevant pieces of information from a very large, often private, document collection (think vector databases, geospatial layers, PDF manuals, or sensor logs).
  • Large-language-model generation—having a transformer-based model read those retrieved passages and draft concise, task-specific answers for the user.
By marrying these components, a system is no longer limited to whatever the LLM learned during pre-training [33]. Instead, every answer can be grounded in fresh, authoritative material that is fetched on-the-fly [34].

2.2.1. How the Pipeline Works, Step by Step

  • Chunking the corpus (corpora are usually split into overlapping blocks of 200–1000 tokens; overlap preserves context, while fixed length simplifies indexing)—Before the system ever sees a user, an offline job slices the document collection into fixed-size passages and stores metadata such as title, source URL, publication date, spatial extent, or sensor ID. Each passage is converted to a dense vector using a text-embedding model. Recent systems often use domain-adapted embedding models (e.g., BGE-M3, E5, or in-house forestry-tuned variants) to improve alignment with expert language.
  • Query embedding—The user’s raw question (e.g., “Which spruce-thinning guidelines apply in Austria’s montane zone?”) is passed through the same embedding model, yielding a query vector in the same semantic space.
  • Vector search—The query vector is compared to all passage vectors (typically via an Approximate Nearest-Neighbour index such as FAISS, ScaNN, or Milvus). The top k passages with the highest cosine similarity are returned. Many production stacks add a re-ranking stage that feeds the top 100–200 candidates to a bi-encoder or cross-encoder for finer scoring. Cohere Rerank v3.5 and BGE-Reranker-v2 offer state-of-practice accuracy and speed. For multilingual forestry corpora, e5-mistral-7B excels at dense embeddings across EU languages.
    Some newer models, such as Mistral’s Miqu or Cohere’s Command R+, are explicitly trained for retrieval-augmented generation and can score or re-rank passages natively. In addition to static retrieval, agentic strategies such as Self-RAG (dynamic retrieve-or-not decisions) and GraphRAG (graph-grounded summarisation) now underpin large-scale forestry copilots.
  • Context assembly—The original question, the k retrieved snippets, and optional system instructions are concatenated into a single prompt. A typical template might read as follows:
      [SYSTEM] You are a forestry assistant. Cite your sources in [x]
      format.
      [USER QUESTION] <original question>
      [SOURCE 1] ...
      [SOURCE 2] ...
      ...
    Developers must balance two factors here: placing enough evidence to answer accurately, yet staying within the model’s context-window budget. Many models in 2025 (e.g., Claude 3.5, Gemini 1.5 Pro, GPT-4o) support extended contexts (128 k–1 M tokens), allowing richer document history or multi-turn data inclusion. However, selective inclusion and source prioritisation remain important.
  • Generation—The LLM ingests the enriched prompt and predicts the next tokens, weaving together its own prior knowledge and the freshly supplied evidence. Well-designed templates instruct the model to quote or paraphrase each snippet and attach an inline citation marker such as “[1].” Advanced prompting (e.g., chain-of-thought scaffolds, scratchpad reasoning, tool plans) can further increase answer quality. Some newer models, like Claude or DeepSeek-R1, are trained to follow such structured reasoning prompts natively.
  • Answer post-processing—Citation markers are replaced with hyperlinks or footnotes, the answer is truncated or formatted to specification, and the system optionally logs which passages were used, which is crucial for auditability. For benchmarking, we use CRAG and ARES to quantify retrieval quality and hallucination mitigation. For continuous evaluation during deployment, we incorporate RAGAS. If verifier models are available, the generated answer can be automatically scored for factual consistency, confidence, or hallucination risk before being shown to the user.
  • Iterative refinement (optional)—Some RAG systems loop: if the first draft seems too vague, the LLM itself proposes follow-up queries (“please retrieve recent wind-throw bulletins”), the retriever runs again, and a second-pass answer is generated. This loop can also be automated using scratchpad-based reasoning, where the model explicitly reflects on uncertainty and issues its own intermediate retrieval instructions.
Figure 2 highlights this cycle in a simple fashion.
Because the retrieval happens at inference time, any newly published regulation, satellite scene, or machine log can influence the response without retraining the model, a decisive advantage in forestry, where storm damage maps, bark-beetle alerts, and local ordinances change month by month.
While Figure 2 visualises a streamlined pipeline applied to clean text chunks, this oversimplifies the data reality in operational forestry. In practice, inputs may include low-resolution satellite annotations, unstructured field notes, partially scanned legacy documents, or metadata-deprived shapefiles. These inconsistencies pose substantial barriers for downstream language model reasoning. Therefore, before text chunking and embedding can occur, robust data normalization workflows are essential. These might include OCR correction, schema mapping to standards for geospatial metadata, or PDF parsing via tools like DeepSearch and GROBID. Only after such preprocessing can the model engage in semantic search and reasoning.
We stress that for Level 3–4 applications, the integrity of this upstream pipeline is as critical as the model’s downstream reasoning ability.
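To ground the pipeline description above, the following minimal sketch covers chunk embedding, indexing, retrieval, and prompt assembly using the sentence-transformers library and a FAISS index. The two passages, the embedding model choice, and the prompt template are illustrative placeholders, not ForestGPT’s actual corpus or configuration:

import faiss
from sentence_transformers import SentenceTransformer

# Placeholder corpus: in practice these are the pre-chunked passages.
passages = [
    "Thinning guideline AT-M1: montane spruce stands ...",
    "Harvest plan 2025, district 7: cable yarding on slopes >40% ...",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Offline: embed every chunk and store it in a cosine-similarity index
# (normalised vectors + inner product = cosine similarity).
vecs = embedder.encode(passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

# Online: embed the query the same way and fetch the top-k passages.
query = "Which spruce-thinning guidelines apply in Austria's montane zone?"
qvec = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(qvec, k=2)

# Assemble the citation-aware prompt for the generator LLM.
context = "\n".join(f"[SOURCE {i+1}] {passages[j]}" for i, j in enumerate(ids[0]))
prompt = f"You are a forestry assistant. Cite sources as [x].\n{context}\n[USER QUESTION] {query}"
print(prompt)

A production system would add the metadata filters, re-ranking stage, and citation post-processing described above; the core retrieve-then-generate loop, however, is no more complicated than this.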

2.2.2. Human-in-the-Loop (HITL) for Complex Forestry Tasks

While recent frontier models (e.g., GPT-5, Gemini 2.5 Pro, Claude 3.5) have dramatically improved long-context reasoning and planning capabilities, fully automated deployment in high-stakes forestry domains still carries significant risk. To manage these risks, we propose a tightly integrated Human-in-the-Loop (HITL) approach as a foundational safeguard, especially for Level 3 and 4 decision-support systems.
HITL design ensures that model-generated outputs, such as silvicultural recommendations, compliance memos, or stand interventions, are always reviewed, corrected, or approved by qualified experts. This framework supports auditability, contextual understanding, and fail-safe handling in cases where model uncertainty or hallucinations are likely.
More importantly, HITL allows for iterative interaction between foresters and the assistant system. For example, when a thinning prescription is generated, the expert can reject, modify, or request clarification before committing to a management action. This interactivity promotes higher confidence, contextual awareness, and institutional learning.
Technically, this architecture can use abstention mechanisms (e.g., verifier-guided refusals, low-confidence scoring, or structured citations) to trigger human review. These mechanisms are increasingly native to verifier-based systems such as o3-pro and Claude 3.5. In agentic settings, HITL can also intervene at decision checkpoints—e.g., which satellite layer to select, or which forest model to invoke—prior to full pipeline execution.
In our planned pilots, HITL effectiveness will be assessed through user studies that measure (i) task success rate, (ii) decision latency, (iii) intervention frequency, and (iv) trust calibration. This ensures that even when high-performing LLMs are involved, forestry experts remain in the loop for strategic and safety-critical choices.
Newly available sources include ESA’s Biomass P-band SAR (April 2025) and GEDI L4A/L4H data with uncertainty measures, both of which are streamable into ForestGPT’s retriever layer. Combined with modern reasoning-centric models, this architecture enables both rapid updates and traceable multi-step outputs in dynamic field conditions.

2.2.3. Benefits of RAG at a Glance

  • Timeliness—Up-to-date information (storm-track bulletins, pest alerts, log-price indices) flows straight into the answer on the day it is published.
  • Transparency—Inline citations or expandable excerpts let practitioners verify numbers before prescriptions reach the field.
  • Data sovereignty—Proprietary corpora stay inside a controlled retrieval store; only the short excerpts selected by the retriever are exposed to the LLM.

2.2.4. Where RAG Is Already in Use, and Why These Examples Matter for Forestry

  • WildfireGPT
Employs a multi-agent, RAG-based pipeline to fuse hotspot imagery, weather forecasts, and other relevant information, delivering location-specific risk assessments during extreme fire seasons. Wildfires are an increasing global threat, driven by rising temperatures, prolonged droughts, and land-use changes that intensify their frequency, severity, and impact on ecosystems, economies, and human health. The same template, combining remote-sensing grids with scholarly work, translates to storm, beetle, or wind-throw analysis in forestry [35].
  • Agricultural Advisory Bots
Bots such as Farmer.CHAT ingest extension leaflets and multilingual Q&A transcripts to answer region-specific agronomy questions. Forestry extension faces an identical need: tailoring guidance to particular species mixes, elevations, and socio-economic contexts [36].
  • Palantir AIP’s Ontology-Augmented Generation
An enterprise superset of RAG, it binds live operational data to chat. A comparable stack could integrate harvest-machine telemetry, weigh-bridge data, and pulp-mill demand forecasts so that an agent can suggest log allocation in peak season [37]. Table 3 highlights all of these use cases.
These deployments demonstrate that RAG already operates in mission-critical environments [38], where data is large, heterogeneous, and time-sensitive, exactly the landscape encountered by modern forest enterprises. By adapting their design patterns, a forestry-specific assistant can anchor its recommendations in the freshest inventories, policies, and sensor feeds while still leveraging the fluent reasoning of a state-of-the-art LLM.
  • Industry Landscape Note (Non-Exhaustive)
Beyond the academic and open-source literature, several companies are actively shipping AI-enabled remote-sensing and MRV tools that may be relevant to Forestry 5.0 deployments. For example, Planet Labs provides a Forest Carbon Monitoring product suite with quarterly global releases reporting canopy cover, canopy height, and aboveground carbon density at near-tree scale [39,40]. On the MRV software side, platforms such as CarbonAi advertise digital MRV workflows (data capture, automated calculation and verification, and portfolio dashboards) for greenhouse-gas reduction projects [41]. Many more companies provide, e.g., Earth observation, including Maxar Technologies, Spire Global, and Pixxel; we do not evaluate or endorse specific vendors, and these examples are provided only to situate ForestGPT within an evolving industry ecosystem.

2.3. Problems When Using LLMs

LLMs exhibit remarkable capabilities in language generation. However, they remain fundamentally statistical systems that operate by identifying and reproducing patterns in data rather than reasoning as human experts do [42,43].
While they can simulate human expert-like responses and support complex tasks, their outputs are not grounded in true understanding or domain-specific judgment. Probably the most well-known and serious problem is that of hallucinations. A hallucination is the generation of output that is fluent and seemingly plausible but factually incorrect, logically inconsistent, or unsupported by the model’s training data. In technical terms, hallucination arises because LLMs are probabilistic sequence predictors: as we know, they generate the next token based on learned statistical patterns rather than explicit verification against an authoritative knowledge base. When the internal representation lacks the correct information, or when the prompt induces over-generalisation, the model may produce purely fabricated facts, misattributed sources, or non-existent entities that are syntactically well-formed but semantically false.
In safety-critical contexts such as legal, medical, or scientific applications, hallucinations remain a major risk. Recent state-of-the-art models have reduced, but not eliminated, this issue, prompting ongoing research into mitigation strategies including retrieval-augmented generation (RAG), verifier models, fact-checking pipelines, and uncertainty quantification [44].
As such, their integration into professional contexts must be approached with a clear awareness of their probabilistic nature, limitations in reasoning, and the need for human oversight in critical decision-making processes [45].
LLMs generate text by calculating which token is most likely to follow the tokens they have already produced [46]. That prediction process allows remarkable fluency, yet it also introduces predictable failure modes that every practitioner should recognise.
Nevertheless, there are already implementations in place, such as coupling LLMs to forests’ responses to droughts [47]. Table 4 summarises the main pitfalls at a glance.

2.3.1. Digging Deeper into Hallucinations

Hallucination is the most surprising pitfall because the output looks authoritative. The root cause is simple: the model is rewarded for producing the most probable continuation, not the most factual continuation. If its internal statistics suggest that a fabricated citation “fits” better than silence, the model will invent one.
  • Why probability outranks truth—During training, the model never checks the real world; it only compares its guess to the next token in its dataset. Fluency, not factuality, drives the optimisation.
    Why fluent text can still be wrong. During pre-training, LLMs are optimised to maximise the likelihood of the next token given preceding tokens (cross-entropy/MLE; the objective is written out after this list). This objective rewards plausible continuations, not ground truth. Hence models can produce confident, fluent statements that are nevertheless unfaithful to reality or to the given sources [44,54]. Calibration work shows models can sometimes assess their own uncertainty, but reliability varies by task and format [55].
  • Why confidence can be misleading—The same mechanism that makes the text read smoothly also lends it an air of certainty. The model does not “know” it is guessing. It simply continues the pattern it has learned.
  • Mitigation strategies
    a.
    Use Retrieval-Augmented Generation (RAG) to inject verifiable passages into the prompt so the model can quote rather than invent [56].
    b.
    Ask for citations or step-by-step reasoning; missing or circular references are red flags.
    c.
    Keep a human-in-the-loop, especially when decisions have financial, ecological, or, above all, safety implications [57].
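In standard notation, the pre-training objective referenced in the first bullet is the next-token negative log-likelihood (a textbook formulation, not specific to any particular model):

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})

where x_t is the token at position t and x_{<t} its preceding context. Nothing in this objective rewards factual accuracy; it rewards only predictive fit to the training text, which is precisely why fluent hallucinations arise.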

2.3.2. Why Context Length and Freshness Matter

LLMs have a limited “attention span,” i.e., the number of tokens they can process at once. This is typically referred to as the context window. Earlier models like GPT-3.5 and PaLM 2 operated with windows of 4 k to 32 k tokens, which constrained the length of prompts, documents, or multi-step reasoning processes that could be handled without truncation.
As of mid-2025, multiple commercial and open-weight models now support much longer context windows, with several reaching or exceeding 1 million tokens. Notable examples include Gemini 1.5 Pro (Google), which operates across 128 k to 1 M tokens depending on configuration, and Claude Sonnet 4 (Anthropic), which supports up to 1 M tokens in beta. GPT-4o/GPT-5 (OpenAI) and DeepSeek-R1 also offer extended contexts in the 128 k–200 k range, enabling broader document understanding, traceable dialogues, and structured planning.
In forestry, this opens up applications such as passing in complete forest stand histories, regulatory guidelines, multi-year logging schedules, or scientific reports without the need for summarisation or external memory tools. However, long-context capabilities still require thoughtful prompting and retrieval logic to avoid performance degradation. Models may still prioritise local attention, and token limits do not guarantee consistent use of all provided input.

2.3.3. Bias, Privacy, and Prompt Craft

  • Bias—Because the model averages over its data, it reproduces both the wisdom and the prejudice embedded in that data. Domain fine-tuning with balanced corpora and explicit bias-evaluation tests helps reduce this risk.
  • Privacy leaks—If a snippet appears frequently enough in the training set, the model may regurgitate it verbatim. Guardrails that scan for personal data, as well as policy controls on what enters the fine-tune corpus, are standard defences.
  • Prompt engineering—Short, vague prompts invite vague answers. Overly long prompts may waste context budget. Clear, specific, and well-structured questions guide the model down a more reliable probability path. In general, the information the user puts into the prompt needs to be selected carefully to prevent the model from anchoring on certain aspects [58].

2.3.4. Prompt Injection and Prompt-Library Governance

Prompt injection. When an LLM ingests untrusted text (web pages, PDFs, emails), crafted strings can override system instructions or exfiltrate data. This risk is now formally catalogued by OWASP (LLM01: Prompt Injection) and national guidance [59,60]. We plan to cross-reference OWASP’s 2025 LLM Top 10 with ForestGPT mitigations, to adopt MITRE ATLAS for red teaming, and to include Llama Guard 3 as a prompt/response classifier. Our release process aligns with ISO/IEC 42001. Mitigations include the following: (i) least-privilege tool configs; (ii) input isolation/sanitisation (strip markup, normalise URLs, block network access unless needed); (iii) content-based allow/deny rules; (iv) output handling guards; (v) red-team tests with known attack corpora; and (vi) human-in-the-loop release gates for high-impact actions.
Prompt libraries. Organisations increasingly maintain shared prompt templates. We recommend version control, approval workflows, change logs, differential tests against a regression suite, and restricted secrets (never hard-code credentials into templates). Store prompts as code with CI (continuous integration) checks and security review.
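As a minimal illustration of the “prompts as code” pattern recommended above, the snippet below stores a versioned template together with ownership metadata and a regression check. The template text, version label, and test are hypothetical examples, not ForestGPT’s actual prompt library:

# Hypothetical "prompt as code" registry entry with a regression test.
PROMPTS = {
    "compliance_memo/v3": {
        "owner": "forest-ops-team",          # placeholder metadata
        "approved": "2025-08-01",
        "template": (
            "You are a forestry compliance assistant. "
            "Cite every rule you rely on as [x]. Question: {question}"
        ),
    }
}

def render(name: str, **kwargs) -> str:
    return PROMPTS[name]["template"].format(**kwargs)

def test_template_requires_citations_and_has_no_secrets():
    entry = PROMPTS["compliance_memo/v3"]
    assert "Cite" in entry["template"]          # citation instruction present
    assert "API_KEY" not in entry["template"]   # never hard-code credentials

test_template_requires_citations_and_has_no_secrets()
print(render("compliance_memo/v3", question="Is clear-cutting allowed here?"))

Run in CI, such checks catch template regressions before a changed prompt ever reaches production.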
Understanding these limitations does not diminish the value of LLMs; rather, it equips practitioners to harness their speed and fluency while keeping critical decisions, such as harvest planning or habitat conservation, firmly under expert oversight. It is important to understand that even though the LLM works on its own, biases from humans still influence it, and one needs to be aware of them [61].

3. How to Build Your Own Domain Expert LLM

One does not need a super-computer farm or a PhD in deep learning to create a language model that “speaks forestry” [62]. What one needs is a capable open-weight model, a well-curated set of forestry texts, and a few days of adapter training on a modern GPU workstation. The outline below walks through the essentials, leaving a full Retrieval-Augmented Generation (RAG) implementation as a future improvement of the work presented later.

3.1. Step 1—Choose a Strong Open-Weight Base

Pick a model that balances reasoning power with hardware realism.
We briefly summarise representative LLMs across the current open- and closed-weight landscape, based on their architectural transparency, licensing, and practical usability for research or deployment.
OpenAI’s GPT-3.5 and GPT-4 are available only via API and operate in a closed loop; both have strong fluency, with GPT-4 showing improved reasoning and reduced hallucination. Google’s PaLM 2 powers Bard, but is not directly accessible as a standalone model. Anthropic’s Claude series uses a “constitutional AI” framework and is accessible via API. Meta’s LLaMA 2 and TII’s Falcon are open-weight models with fine-tuning support. The Mistral family (Mistral 7B and Mixtral 8 × 7B) provides excellent performance at moderate scale (about 10B parameters) and is released under an open license.
Several performant open-weight models released in 2024–2025 also remain widely used and deployable on consumer hardware:
  • LLaMA 3.1 70B—near-GPT-4 quality, needs ≈48 GB VRAM in 8-bit form.
  • Mixtral 8 × 22B—Mixture-of-Experts; fast and memory-frugal (≈36 GB).
  • Gemma 3 27B—compact; runs on a single 24 GB card after quantisation.
  Update (2024–2025 models): Since the initial drafting of this manuscript, several high-performing models have emerged that are highly relevant for domain-specific applications such as Forestry 5.0. These include the following:
  • GPT-4o (OpenAI)—Introduced in 2024 with strong multi-modal reasoning, long context, and native tool integration.
  • Claude 3.5 Sonnet (Anthropic)—Improved summarisation, transparency, and faithfulness at scale. Alternative: Claude 3.5 Haiku, a smaller, faster version for real-time tasks.
  • Gemini 2.5 Pro (Google)—State-of-the-art reasoning and vision–language performance with deep Vertex AI integration. Alternative: Gemini Flash for latency-sensitive use cases.
  • DeepSeek-R1 (DeepSeek-AI)—Open-weight model optimised for chain-of-thought and math reasoning [23]. Alternative: Miqu-1.5 (Mistral) or Command R+ (Cohere) for strong retrieval-native performance.
  • Qwen 3 (Alibaba)—Strong coding and multilingual reasoning model with open weights.

3.2. Step 2—Curate a Forestry Corpus

Such a corpus would comprise the following sources:
  • Authoritative sources: Silviculture handbooks, National forest acts, FSC/PEFC guidelines.
  • Grey literature: Extension leaflets, Thesis PDFs, Training manuals.
  • Data clean-up: Remove boilerplate, Deduplicate, Tag each file with basic metadata (country, species, elevation).
Aim for 1–3 GB of high-quality text; more is not always better if quality drops. As a side note, forest inventories, harvest permits, and machine logs usually sit in multi-table databases rather than a single flat file. Recent work on Neural RELAGGS shows that even state-of-the-art propositionalisation still requires task-specific, learned aggregations to turn those linked tables into features that downstream models (including LLMs) can digest [63].
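A small sketch of the clean-up step under our own assumptions (the file layout, tag values, and output name are placeholders): deduplicate files by content hash and tag each survivor with basic metadata, as suggested above:

import hashlib
import json
from pathlib import Path

seen_hashes = set()
records = []

for path in Path("corpus/").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:        # exact duplicate -> skip
        continue
    seen_hashes.add(digest)
    records.append({
        "file": path.name,
        "sha256": digest,
        "country": "AT",             # placeholder metadata tags
        "species": "spruce",
        "elevation_band": "montane",
    })

Path("corpus_index.json").write_text(json.dumps(records, indent=2))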

3.3. Step 3—Fine-Tune with Low Rank Adapters (LoRA)

Instead of re-training billions of parameters, you insert tiny LoRA adapters and train only those:
  • Cuts GPU memory to a quarter of full fine-tune needs.
  • Typical run: 5 epochs, sequence length 2000 tokens, learning rate 2 × 10⁻⁴.
  • Time cost: ≈10–20 h on 4–8 A100 GPUs, depending on base size.
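A minimal LoRA setup of this kind, sketched with the Hugging Face peft library; the base checkpoint, rank, and target modules are illustrative choices rather than a prescription:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base checkpoint; substitute whichever open-weight model you chose.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank: the "low rank" in LoRA
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # train only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of all weights

Because only the adapter weights receive gradients, optimizer state and gradient memory shrink accordingly, which is what makes workstation-scale fine-tuning feasible.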

3.4. Step 4—Add an Instruction and Safety Layer

Rather than assuming a large expert-written set from scratch, we recommend a staged approach:
  • Seed (200–500 items): extract Q&A from manuals and guidelines; have domain experts rewrite for clarity and safety.
  • Synthesize (1–5 k): use the model to draft variations and edge cases; experts triage and correct (governance: two-person review, versioning).
  • Harden (500–2 k): add “red-team” prompts (adversarial/ambiguous) and safety refusals; tag each item with policy references.
This keeps expert time focused on review, not blank-page authoring.
Train a second LoRA adapter (or use Direct Preference Optimisation) so the model follows these patterns.

3.5. Step 5—Evaluate on Forestry Tasks

Off-the-shelf benchmarks miss domain gaps. Create a small test set:
  • Yield check: “Calculate volume removal under XYZ Thinning Guideline for a 40-year-old spruce stand.”
  • Reg-trace: “Does §3(2)b allow clear-cutting on slopes > 60°?” Please note this is just an example; it does not refer to any specific paragraph in any actual regulation.
  • Geo-reason: “Given that we want to plant at an elevation of 1200 m, suggest three fitting species.”
  • Score answer correctness and clarity: Iterate on the corpus or hyper-parameters as needed.
  • Beyond Qualitative Evaluation: Formal Metrics for Level 3–4
While early-stage pilots can rely on human feedback to assess clarity and helpfulness, advanced capabilities like policy compliance, schedule generation, or plan verification demand structured, repeatable benchmarks. For this, we propose using Chain-of-Reasoning Assessment Grid (CRAG) for tracing multi-step logic, ARES (Agent Reasoning Evaluation Suite) for assessing decision sequences, and RAGAS for measuring retrieval faithfulness and grounding. These tools allow us to (a) audit each reasoning step, (b) verify tool-chain correctness (e.g., did the GIS call return valid compliance buffers?), and (c) detect hallucinations or premature generalizations. Combined with expert adjudication, these benchmarks ensure that Level 3–4 agents remain transparent, explainable, and auditable in high-stakes forestry contexts.
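A toy harness in the spirit of this step, under our own assumptions: the two test items and the expected answer fragments are invented, and ask() is a placeholder for whatever model endpoint is under evaluation:

# Toy evaluation harness for a forestry test set.
TEST_SET = [
    # (question, substring a correct answer must contain)
    ("Does the guideline allow clear-cutting on slopes > 60%?", "no"),
    ("Suggest three species for planting at 1200 m elevation.", "larch"),
]

def ask(question: str) -> str:
    # Placeholder: route the question to your fine-tuned model endpoint here.
    return "No, clear-cutting is not permitted on slopes above 60%."

correct = 0
for question, must_contain in TEST_SET:
    answer = ask(question).lower()
    correct += int(must_contain in answer)
print(f"accuracy: {correct}/{len(TEST_SET)}")

Substring checks only score factual recall; the CRAG, ARES, and RAGAS suites mentioned above become necessary once multi-step reasoning and retrieval grounding enter the picture.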

3.6. Step 6—Deploy with Simple Guardrails

  • Max context length: Cap prompts to avoid GPU overload.
  • Profanity/person-data filter: Basic regex or open-source guardrail toolkit.
  • Human spot-check: Log 5% of outputs for weekly review.
Figure 3 gives a clear overview of the stepwise process for text-based inputs, currently ignoring the messy nature of operational forest data.
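A minimal guardrail sketch along the lines of the list above, with deliberately simple placeholder patterns (a real deployment would use a tokenizer-based length check and a maintained PII filter):

import re

MAX_PROMPT_WORDS = 4000            # crude cap; replace with a real token count

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s/-]{7,}\d")

def guard(prompt: str) -> str:
    # Cap prompt length to protect the context budget and the GPU.
    if len(prompt.split()) > MAX_PROMPT_WORDS:
        raise ValueError("prompt exceeds context budget")
    # Redact obvious personal data before the prompt reaches the model.
    prompt = EMAIL.sub("[email removed]", prompt)
    prompt = PHONE.sub("[phone removed]", prompt)
    return prompt

print(guard("Contact ranger@example.org or +43 660 1234567 about stand 47B."))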
  • Preview: How RAG Slots in
Once the specialist model performs reliably, the natural next upgrade is a Retrieval-Augmented Generation (RAG) layer that injects fresh evidence (new regulations, sensor feeds, or inventory snapshots) into the prompt at answer time.
In practice, this adds four moving parts: (1) split trusted forestry documents into 300–800-token chunks and enrich them with metadata; (2) embed those chunks with a text-embedding model and store them in a vector index (e.g., FAISS or Milvus); (3) at query time, retrieve and optionally re-rank the top-k passages, then concatenate them with the user prompt under a citation-aware template; (4) let the LLM draft an answer that quotes the retrieved sources. Implementing that pipeline involves engineering choices (index refresh cadence, chunk size, re-ranker selection, and citation formatting) that would take too long to cover in this article, so a full treatment is deferred to a forthcoming, RAG-focused paper.
The result: In roughly a week of work, even a small research team can produce a forestry-aware assistant that achieves the following:
  • Understands silviculture jargon (e.g., “crown-class removal”, “Selective Thinning”, “Thinning from above”);
  • Cites relevant national regulations;
  • Adopts a safe, professional tone.
Future versions can bolt on the RAG layer sketched above, trading modest extra infrastructure for live data access and inline citations.

3.7. Possibilities in Forestry

Once a forestry-specific LLM (with RAG, multimodal vision, and secure edge-deployment) reaches maturity, it can underpin an entire ladder of digital tools, starting with modest “helper” apps and climbing to fully augmented machine workspaces.
1. 
Entry–level helpers
  • Forest Management Guidelines quick-reference
    A smartphone chat widget answers questions such as “What is the minimum post-harvest basal area for spruce on a site-class II stand?” and cites the exact clause from the regional thinning ordinance. This level resembles the Chatbot presented in the next section.
  • Field–note summariser
    Rangers dictate voice memos; the model tags GPS, species, and damage codes, pushing structured JSON to the inventory database.
  • Extension leaflets on demand
    Students request “one-page cheat sheets” on topics like resin tapping or cable-yarding set-up; the model lays out the leaflet in Microsoft Word or Microsoft PowerPoint.
2. 
Context–aware decision support
  • Stand-level prescription engine
    The assistant consumes inventory snapshots, growth-model outputs, and local market prices, then proposes treatment schedules with NPV (Net Present Value) and carbon metrics, all traceable back to cited equations.
  • Regulatory compliance checker
    Foresters paste a coupe map, the model flags parcels that breach retention-tree rules, and drafts the exemption request letter if needed.
  • Wildfire and pest bulletin generator
    Each morning the LLM gathers satellite hotspots, drought indices, bark-beetle trap counts, and ECMWF (European Centre for Medium-Range Weather Forecasts) forecasts, then produces a colour-graded PDF that local forest managers can review at a glance, including possible disturbances. An example is wildfire modeling in Australia [64].

3.7.1. Limitations of Text-Centric Assumptions

Although Levels 1–2 of our architecture work primarily with natural-language corpora and structured snippets (e.g., regulation PDFs or forestry manuals), Level 3–4 agents require capabilities beyond raw text processing. These include interpreting GIS layers, validating inputs from sensors (e.g., UAV point clouds or soil moisture logs), and integrating multi-format data streams into a coherent plan. Given the heterogeneity of formats, ranging from GeoTIFFs and LiDAR files to multilingual XML feeds, model architectures must either rely on upstream adapters or be fine-tuned on multi-modal inputs. Emerging solutions like Point-Bind, MiniGPT-3D, and Segment Anything v2 provide new pathways for incorporating spatial and video data directly into the LLM loop. However, full integration remains a work in progress, and any deployment beyond Level 2 must build tightly coupled preprocessing pipelines and tool agents to translate messy inputs into interpretable features.

3.7.2. Modality Integration Is Non-Trivial

We acknowledge that expecting a single LLM to handle natural language, spatial formats, tabular data, and procedural rules is not yet realistic without a surrounding ecosystem of specialized agents. A forestry copilot operating at Level 3–4 must rely on tool augmentation, such as calling a GIS buffer engine, querying a silvicultural database, or invoking a regulatory ruleset validator. Rather than assuming seamless end-to-end reasoning inside a monolithic model, we propose a layered architecture: LLMs manage task decomposition and semantic synthesis, while dedicated tools handle formal or structured input reasoning. We view this not as a limitation of LLMs, but as a pragmatic design principle for safety-critical deployments.
3. 
Multimodal, sensor–linked workflows
  • Vision-enabled forest inventory update
    Drone ortho-mosaics stream through a vision-tuned branch of the model; species and diameter-at-breast-height (DBH) estimates automatically populate Geographic Information System (GIS) layers, which the text branch can instantly query in plain language.
  • Voice-first dispatch
    Dispatchers ask, “Find the nearest skidder crew qualified for cable extraction and within 30 min of compartment 47B.” The model glues telematics, skill rosters, and road-condition feeds into a single answer with route suggestions.
  • Adaptive hauling optimisation
    Live mill quotas, truck GPS, and fuel prices flow into the LLM, which recommends load redistribution every two hours, broadcasting updates to drivers via a chat interface.
It is critical not only to look at the LLM itself but also to assess the data quality and security of all the obtained information. Especially with IoT approaches, security has to be one of the highest priorities [65].
To illustrate how domain-specific agents can support operational forestry tasks, we present two worked examples that demonstrate regulatory reasoning and silvicultural planning with verifiable outputs.

3.7.3. Worked Example A—Regulatory Compliance

The user uploads a coupe map; the agent retrieves regional slope and riparian-buffer rules, runs a 30 m GIS buffer around waterways, flags intersecting polygons, and drafts a compliance memo with citations. A sketch of the buffer-and-flag step follows below.
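The buffer-and-flag step might look as follows with the geopandas library; the file names, layer contents, and parcel_id column are placeholders, and a metric coordinate reference system is assumed so that the 30 m buffer is meaningful:

import geopandas as gpd

# Placeholder inputs: the uploaded coupe polygons and a waterway layer.
coupes = gpd.read_file("coupe_map.gpkg")
waterways = gpd.read_file("waterways.gpkg").to_crs(coupes.crs)

# 30 m riparian buffer around all waterways, merged into one geometry.
buffer_zone = waterways.buffer(30).unary_union

# Flag every coupe polygon that intersects the buffer.
coupes["breaches_buffer"] = coupes.intersects(buffer_zone)
for _, row in coupes[coupes["breaches_buffer"]].iterrows():
    print(f"Parcel {row.get('parcel_id', '?')} intersects the 30 m riparian buffer")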

3.7.4. Worked Example B—Thinning Prescription with Constraints

Given stand inventory (species, DBH, basal area) and a target post-harvest basal area, the agent proposes a thinning-from-above schedule, verifies slope and soil constraints via rules, and outputs a traceable table (pre/post BA, removals, residuals) with cited guideline paragraphs; a numerical sketch follows below. Figure 4 helps to understand the simulation coupling.
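The numerical core of such a prescription can be sketched in a few lines; the inventory, the 24 m²/ha target, and the class-wise removal rule are invented for illustration:

import pandas as pd

# Invented stand inventory by diameter class.
stand = pd.DataFrame({
    "dbh_cm":   [18, 24, 30, 36, 42],
    "stems_ha": [220, 180, 140, 90, 40],
})
# Basal area per class (m^2/ha): stems * pi * (dbh in m / 2)^2.
stand["ba_ha"] = stand["stems_ha"] * 3.14159 * (stand["dbh_cm"] / 200) ** 2

TARGET_BA = 24.0                               # illustrative post-harvest target
surplus = stand["ba_ha"].sum() - TARGET_BA

# Thinning from above: take basal area from the largest classes first.
removed = {}
for idx, row in stand.sort_values("dbh_cm", ascending=False).iterrows():
    take = min(row["ba_ha"], max(surplus, 0.0))
    removed[idx] = take
    surplus -= take

stand["removed_ba"] = pd.Series(removed)
stand["residual_ba"] = stand["ba_ha"] - stand["removed_ba"]
print(stand.round(2))
print(f"post-harvest basal area: {stand['residual_ba'].sum():.1f} m²/ha")

The printed table is exactly the traceable pre/post record the agent would attach to its cited prescription.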
4. 
Fully augmented machine workspaces
  • Heads-up prescription overlay
    Inside the harvester cab, an AR visor highlights which stems to cut, leave, or prune based on real-time stem measurements, habitat buffers, and operator-set objectives (e.g., maximise saw-log length while preserving 20% basal area).
  • Conversational machine control
    The operator says, “Switch to fuel-saving mode and recalculate the optimal cutting pattern for a 26 cm top.” The LLM translates that intent into control parameters and passes them to the PLC (Programmable Logic Controller), confirming changes verbally. A very recent example of this is a ChatGPT-based controlling device that used EEG (Electroencephalography) signals to trigger movements [66].
  • Embedded safety guardian
    Proximity sensors and computer-vision feeds stream into the same model; if a hiker or machine part enters a danger zone, the system issues an audio warning, logs the event, and, when policy requires, pauses the head hydraulics autonomously.
  • Continuous learning loop
    Post-shift, cut logs are scanned at the mill, deviations between planned and realised assortments feed back into the model’s fine-tune set, refining future prescriptions without manual spreadsheet wrangling.
Each rung of the ladder (Figure 5) builds on the previous one. The modular nature of LoRA adapters, RAG indices, and edge deployment containers lets organisations climb as high as budgets and risk tolerance allow, re-using the same foundation model at every step.

4. Level 1 Demonstration at Futa Expo 2025

To test the practical value of a Level 1 forestry-specific GPT, an app was built and deployed during Futa Expo, the forestry-machinery trade fair in Barberino di Mugello, Italy (4–6 July 2025). Development relied on two no-code tools, Glide (Glide Apps v3.46.0) and Chatbase (v2.1), which together produced a mobile interface with integrated AI. The app ran in Italian and English so both local and international visitors could use it. Figure 6 shows the app setup and the no-code interaction between the two tools. No custom code was written, verifying that a simple working solution can already be achieved with no-code tools alone. As the tool was only used for the duration of the expo, the chatbot is not available for online testing.
  • Screen layout
  • Intro—Button opens the fairground map that marks the demonstration trails where exhibitors operate their machines.
  • Fair concept—Overview of the 17 thematic categories. Four central themes frame the rest: (i) equipment for agricultural tractors; (ii) advanced mechanisation (harvester-forwarder or skidder-forwarder systems); (iii) cable-yarding technology for steep or sensitive soils; (iv) chippers, shredders and firewood processors for energy wood.
  • Exhibitor list—Directory of all 61 exhibitors, each with flagship product, short description and a Google-Maps link to the stand.
  • Browse by category—The same 61 firms are listed by thematic category to support topic-oriented exploration.
  • Facilities map—Toilets, emergency services, and parking, each pin linked to navigation.
  • Feedback—Form for visitor comments.
  • AI assistant—The core innovation. Powered by gpt-3.5-mini via Chatbase and trained on ≈1 MB of curated data: (a) exhibitor and technology descriptions; (b) fair logistics; (c) educational notes on all 17 categories.
  • Functionality
The chatbot guided visitors with queries such as “Where is the skidder demonstration?” or “Which exhibitors are showing firewood processors?”. Beyond orientation it acted as a didactic aide, placing mechanised forest operations within the wider wood-production chain and explaining benefits such as improved environmental performance, worker safety, mitigation of labour shortages and higher productivity.
  • Result
This deployment stands as a proof-of-concept for ForestGPT Level 1, a domain-tuned chatbot delivering accurate logistical and educational support in a real-world, multilingual setting. All users who tried the application reported it reliable and useful, confirming the promise of GPT-based assistants in operational forestry contexts.

5. 2024–2025 Update: Reasoning-Centric LLMs and Implications for Forestry 5.0

Since late 2024, the LLM landscape has shifted from “general chat” systems toward reasoning-centric models that explicitly allocate more test-time compute to thinking, use tools natively, and handle vastly longer contexts. This section summarises the most relevant releases, OpenAI o3 family (Dec 2024–Jun 2025), DeepSeek-R1 (early 2025), Google Gemini 2.5 Pro (early 2025), xAI Grok 4 (mid-2025), and Alibaba Qwen 3 (mid-2025), and articulates concrete consequences for ForestGPT and Forestry 5.0. An overview can be found in Table 5.

5.1. Why These Models Matter Now

Reasoning-centric models change three assumptions that underpinned our initial plan:
  • Test-time compute as a capability lever. New models execute deliberate multi-step internal reasoning before responding, closing gaps on math, coding and scientific tasks that were persistent in earlier LLMs.
  • Native tool use and verification. Browsing, code execution and structured tool calling are becoming first-class citizens, enabling verifiable answers that combine language with calculators, solvers or retrieval.
  • Million-token context (in production). Long context reduces prompt engineering overhead and allows direct ingestion of standards, maps, logs and regulatory corpora without heavy pre-digestion.

5.2. Implications for the ForestGPT Roadmap

The reasoning shift enriches, but does not obsolete, our four-level plan. We recommend the following updates:
  • Model selection and deployment.
For sovereign/on-premise settings (e.g., national inventories), DeepSeek-R1 or Qwen 3 distilled sizes are credible open-weight bases for Level 1–2. For enterprise or research pilots requiring maximal long-context and multimodality, Gemini 2.5 Pro and Grok 4 are strong hosted options. o3 sets a new high bar for deliberate reasoning and ships with more mature integration features—reliable tool-use (web/code/files) and long-context variants—making systems engineering and verification simpler than earlier o1 previews [22,24,25,26].
Pilot studies will evaluate success rate, decision latency, interruption count, and abstention handling. Verifier-style refusal models and Self-RAG reflection tokens will help calibrate human–AI handoff thresholds.
  • Reasoning+RAG, not reasoning vs. RAG.
Test-time reasoning reduces hallucinations but does not solve freshness or local specificity. Our Level 2 RAG remains central; the update is to let the model plan retrieval (tool-use) and to verify numerical claims with code tools (e.g., growth-model stubs, unit checks) before replying.
  • Verifier-first outputs for safety.
Add a verification layer that runs deterministic checks on model proposals (e.g., slope limits, buffer constraints, basal-area minima) and feeds failures back for self-correction; a minimal sketch of such a verifier follows at the end of this list. Reasoning models are better at using these verifiers when exposed as tools.
  • Million-token workflows.
Exploit long context (Gemini 2.5 Pro; Qwen 3-Coder variants) to load entire regional guidelines, harvest plans and equipment manuals per query, reducing brittle chunking. Keep provenance tags so Level 2 can still cite sources.
  • Evaluation refresh.
Beyond generic Question Answering (QA), include constraint-satisfaction and units/consistency checks, score tool-use success (e.g., correct GIS buffer computation) and traceability (citations aligned to outputs). Benchmarks like GPQA/AIME are informative but insufficient for forestry operations.
  • Cost and latency planning.
Reasoning often increases test-time compute. Mitigate this via (i) a router (a cheap model for rote tasks, a reasoning model for hard cases; see the second sketch after the list), (ii) time budgets for internal deliberation, or (iii) caching of verified intermediate results (e.g., per-stand constants).
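
To make the verifier-first pattern concrete, the following minimal Python sketch runs deterministic checks over a model-proposed harvest prescription. All field names and thresholds here are illustrative assumptions, not values from our deployment; in the full loop the failure messages would be appended to the conversation so the reasoning model can self-correct before a human reviews the draft.

```python
from dataclasses import dataclass

@dataclass
class HarvestProposal:
    """Structured output parsed from the LLM's draft prescription."""
    slope_percent: float        # terrain slope of the proposed unit
    stream_buffer_m: float      # distance kept to the nearest watercourse
    residual_basal_area: float  # m^2/ha remaining after thinning

# Illustrative constraint set; real limits come from regional regulations.
RULES = [
    ("slope_limit", lambda p: p.slope_percent <= 35.0,
     "ground-based harvesting proposed on slope > 35%"),
    ("stream_buffer", lambda p: p.stream_buffer_m >= 30.0,
     "riparian buffer narrower than 30 m"),
    ("basal_area_min", lambda p: p.residual_basal_area >= 18.0,
     "residual basal area below 18 m^2/ha"),
]

def verify(proposal: HarvestProposal) -> list[str]:
    """Run all deterministic checks; return human-readable failure messages."""
    return [msg for name, check, msg in RULES if not check(proposal)]

draft = HarvestProposal(slope_percent=42.0, stream_buffer_m=25.0,
                        residual_basal_area=21.5)
failures = verify(draft)
if failures:
    # In the full loop, these messages are fed back to the reasoning model
    # as tool output so it can revise the proposal (self-correction).
    print("Verifier rejected draft:", "; ".join(failures))
```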
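
The cost and latency mitigations can be prototyped just as compactly. The sketch below is a hedged illustration: the model identifiers are placeholders, the complexity heuristic stands in for a trained router classifier, and the cached per-stand constants are hypothetical values.

```python
import functools

# Placeholder model identifiers; swap in real endpoints per deployment.
CHEAP_MODEL = "small-instruct"
REASONING_MODEL = "reasoning-large"

def is_hard(query: str) -> bool:
    """Crude heuristic; a production router would use a trained classifier."""
    triggers = ("optimise", "trade-off", "simulate", "schedule")
    return len(query) > 200 or any(t in query.lower() for t in triggers)

def route(query: str) -> str:
    """Send rote lookups to the cheap model, hard cases to the reasoner."""
    return REASONING_MODEL if is_hard(query) else CHEAP_MODEL

@functools.lru_cache(maxsize=4096)
def stand_constants(stand_id: str) -> tuple:
    """Cache verified per-stand intermediates so repeated queries skip
    expensive re-derivation (values below are purely illustrative)."""
    return ("site_index", 28.5), ("mean_dbh_cm", 31.2)

print(route("What are the opening hours of the demo site?"))  # small-instruct
print(route("Simulate thinning options that optimise NPV."))  # reasoning-large
```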

5.3. Updated Level-Wise Outlook

Level 1 (domain fine-tune): swap the base to an R1/Qwen 3 open-weight model where sovereignty is critical; retain hosted reasoning models for comparative pilots.
Level 2 (RAG): upgrade to agentic RAG, in which the model formulates retrieval sub-queries, re-ranks evidence, and calls a code tool to sanity-check numbers before drafting.
Level 3 (simulator coupling): expose growth simulators as callable tools with schema-validated IO (a minimal validation sketch follows below); require the reasoning model to explain why a scenario meets constraints and attach citations.
Level 4 (sensor-linked): reasoners plan multi-step tool use (telemetry → soil/moisture inference → machine selection), while a rule engine gates any action with safety constraints.
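
As an illustration of schema-validated simulator IO, the sketch below uses the jsonschema package to reject malformed tool calls before they ever reach the simulator. The schema fields, value ranges, and the stubbed simulator output are assumptions for illustration only, not the interface of any particular growth model.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical IO schema for a growth-simulator tool exposed to the LLM.
SIM_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "stand_id": {"type": "string"},
        "thinning_intensity": {"type": "number", "minimum": 0, "maximum": 0.4},
        "horizon_years": {"type": "integer", "minimum": 1, "maximum": 50},
    },
    "required": ["stand_id", "thinning_intensity", "horizon_years"],
    "additionalProperties": False,
}

def run_growth_simulator(args: dict) -> dict:
    """Validate the model's tool call before touching the simulator;
    the simulator itself is stubbed here."""
    validate(instance=args, schema=SIM_INPUT_SCHEMA)  # raises on malformed IO
    # ... call the real growth simulator here ...
    return {"stand_id": args["stand_id"], "volume_m3_ha": 312.4,
            "citation": "regional yield table (placeholder reference)"}

try:
    run_growth_simulator({"stand_id": "A-17", "thinning_intensity": 0.9,
                          "horizon_years": 20})
except ValidationError as err:
    # The error text is returned to the model so it can repair its call.
    print("Rejected tool call:", err.message)
```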

5.4. Take-Home Message

The 2024–2025 generation delivers materially better reasoning, tool use, and long-context handling. For forestry, that translates into more reliable prescriptions, stronger auditability, and simpler ingestion of real-world artefacts (maps, manuals, logs). Our architecture was designed for grounding and oversight; with these models, we can pursue the same principles, but with fewer workarounds and better verifiability.

5.5. Methodology

This work combines the following: (i) a scoping review of LLM, RAG and forestry-AI literature to establish requirements; (ii) a design-science artifact—the four-level ForestGPT roadmap—grounded in domain constraints (data sovereignty, certification, safety); (iii) a formative field evaluation via a Level-1 prototype at Futa Expo 2025 (multilingual, curated scope) to assess feasibility and usability; and (iv) a human-centered AI framing (oversight, explainability, verifiability) to align system behaviour with professional practice. Threats to validity include limited prototype scope and potential selection bias in sources; we mitigate these via transparent assumptions, citations, and a verifier-first architecture.

6. Discussion

The prototype explored in this paper demonstrates that a forestry-specific language model can already add practical value in day-to-day operations, yet the journey from a fairground helper to a fully fledged, safety-critical decision engine remains long and non-linear. Below, we synthesise the main lessons learned and chart the next milestones on the ForestGPT roadmap.

6.1. Early Impact Versus Long-Term Ambition

At Futa Expo 2025, the Level 1 chatbot was field-tested on a wide range of visitor questions, confirming that even a modestly tuned open-weight model can serve real users in a multilingual setting. The success hinged on three design choices:
a. Lean, curated training data. A highly pruned 1 MB corpus proved sufficient for reliable orientation and technology explanations, underscoring the principle that local relevance beats sheer data volume in narrow domains.
b. Tight scope and guardrails. The bot refused out-of-domain questions (e.g., detailed tax advice) and fell back to human staff for ambiguous requests, an early example of the “LLM as junior analyst” paradigm that underpins trustworthy deployment.
c. User-centred iteration. Daily log reviews during the fair allowed rapid prompt-template tweaks, noticeably reducing hallucinations.
These findings confirm that Levels 1–2 (static knowledge plus RAG) can reach operational maturity with modest teams and hardware. Similar results have already been reported in other sectors such as education [67]. Levels 3–4, however, demand breakthroughs beyond mere language modeling.

6.2. Technical Challenges on the Horizon

  • Multimodal grounding at scale.
Integrating growth simulators (Level 3) and live sensor feeds (Level 4) requires a hybrid architecture in which structured numerical outputs flow through deterministic code, while the LLM handles narration, explanation, and operator dialogue. Achieving sub-second latency when the context prompt already nears 100 k tokens will likely force partial migration to specialised retrieval layers (vector, time-series, geospatial) and larger-than-GPU memory pools.
  • Continual learning without catastrophic drift.
Forestry data are seasonal and region-specific. Allowing the model to ingest new inventories and regulations every quarter risks eroding carefully tuned instruction-following behaviour. Adapter stacking and gated LoRA promotion look promising (a minimal promotion-gate sketch follows this list), but a robust evaluation harness, ideally a public leaderboard of forestry QA pairs, is still missing.
  • Safety certification pathways.
As ForestGPT moves into machine-control loops, it will fall under the same standards that govern industrial automation. The community lacks precedents for certifying probabilistic generators. A likely route is to wrap critical actuation behind formally verified rule engines and treat the LLM as an advisory layer whose outputs are either endorsed or overridden by deterministic checks. Figure 7 displays a radar chart showing how the right protocols can improve overall risk management.
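
To illustrate gated LoRA promotion, the following sketch assumes the Hugging Face transformers and peft libraries; the base-model identifier, the evaluation function, and the accuracy floor are placeholders, not our production setup. A quarterly-trained adapter is merged into the serving model only if it clears a frozen forestry-QA regression suite, guarding against catastrophic drift.

```python
# Minimal sketch of gated LoRA promotion, assuming the Hugging Face
# transformers + peft stack; model id, eval_fn, and threshold are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("open-weight-base")  # placeholder id
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
candidate = get_peft_model(base, config)
# ... fine-tune `candidate` on the quarterly inventory/regulation corpus ...

def promote(candidate_model, eval_fn, floor: float = 0.92):
    """Gate: merge the new adapter only if the frozen forestry-QA regression
    suite (eval_fn -> accuracy) shows no catastrophic drift."""
    if eval_fn(candidate_model) < floor:
        return None                             # keep serving the old adapter
    return candidate_model.merge_and_unload()   # fold LoRA weights into base
```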

6.3. Organisational and Ethical Considerations

  • Barrier to adoption and change management.
Even the most elegant technical solution will fail if end-users do not embrace it. Forestry is a craft with deep-rooted traditions and long equipment life-cycles; introducing an AI assistant therefore collides with institutional inertia, sunk-cost mind-sets, and legitimate concerns about job displacement. Such change can, and will, encounter pressure from regulators, labour unions, and practitioners who mistrust opaque algorithms. Overcoming this barrier requires a clear value narrative, phased roll-outs that demonstrate quick wins, transparent error reporting, and participatory design sessions where foresters co-define acceptable use cases [68].
  • Data sovereignty.
National forest inventories often carry legal restrictions that preclude cloud processing. Our results suggest that an open-weight base running on edge clusters can close most of the quality gap with premium hosted models, provided that domain fine-tuning is rigorous and retrieval pipelines are well engineered.
  • Human capital.
ForestGPT reduces routine workload, yet it raises the bar for digital literacy in this domain sector. Continuous professional development programs will be essential so that rangers, planners and machine operators understand both the power and the caveats of AI-augmented workflows [69].
  • Bias and equitable resource allocation.
Training data drawn disproportionately from industrial, temperate-zone forestry may skew recommendations against smallholders or tropical forestry contexts. Future corpus curation must therefore include a deliberate geographic and socio-economic balance, coupled with bias audits that report model behaviour across under-represented regions and management regimes [70].

7. Conclusions and Future Outlook

The Level 1 prototype of ForestGPT offers promising initial evidence that domain-specialized language models can effectively support forestry practice without imposing significant hardware or data-infrastructure requirements. Through the targeted adaptation of open-weight checkpoints using curated, domain-relevant corpora and the integration of the model into a human-in-the-loop workflow, we have demonstrated practical utility in a multilingual, operational context.
Consequently, our findings highlight the potential for scalable, user-aligned AI solutions in forestry and lay the groundwork for future iterations aimed at deeper domain integration and broader stakeholder engagement, including in application domains beyond forestry.

7.1. Key Takeaways

  • Feasibility. A lightweight training pipeline and a modest corpus already yield a useful assistant when the scope is well defined.
  • Trust is earned. Guardrails, transparent citations, and iterative user feedback remain indispensable for adoption.
  • Change management. Technical progress must be matched by organisational buy-in to overcome resistance and unlock productivity gains.

7.2. Future Outlook

The approach presented in this work, whilst developed and validated in the forestry domain, is intentionally designed to be generalizable across other complex and safety-critical sectors. Domains such as ecosystem management, agriculture, disaster response, environmental monitoring, medicine, health-care and well-being share similar demands for domain-specific knowledge integration, explainability, and human-in-the-loop safeguards.
Forestry represents an ideal proving ground due to its heterogeneous data landscape, operational risks, and ecological significance, combined with a critical need for high usability given the end-user group’s limited exposure to, and acceptance of, new electronic technologies [71,72,73].
Future research should investigate the adaptation of this framework to a broader range of high-stakes application domains by enhancing multimodal processing capabilities, scaling retrieval-augmented pipelines across heterogeneous knowledge sources, and developing robust, domain-aligned evaluation benchmarks.
Key open challenges include enabling continual learning mechanisms that preserve model safety, formalizing trust calibration strategies tailored to distinct expert roles, and ensuring real-time responsiveness in edge-constrained environments. The establishment of a shared methodological foundation for domain-specific LLM development [74,75], grounded in human-centered design principles, adherence to local data sovereignty requirements, and transparent reasoning processes will be critical for the responsible and sustainable scaling of this paradigm beyond the forestry domain.
The immediate next step is to integrate a full Retrieval-Augmented Generation layer that regularly refreshes a vector index of regional guidelines, harvest plans, and sensor logs, so that each user query is matched against the latest evidence and answered with inline citations (a minimal indexing sketch follows below). At the same time, a fully autonomous data-generation setup will continuously stream drone imagery, harvester telemetry, and IoT sensor readings into a central structured store that the model can query without manual preparation. This unified pipeline will underpin growth-simulator calls (Level 3) and real-time machine guidance (Level 4), keeping stand metrics and machine status perpetually up to date. Field pilots will track retrieval latency, citation accuracy, and data-ingestion stability, paving the way for sensor-aware, simulator-coupled decision support. In addition, a fully autonomous data-collection robot will be prepared so that the later development stages can be fed with sufficient accurate data to perform optimally.
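
The sketch below illustrates the intended ingest-and-cite loop in plain Python. It is a minimal sketch under stated assumptions: the embedding function is a stand-in for a real sentence encoder, and the document text, source string, and similarity scoring are hypothetical placeholders rather than our production pipeline.

```python
import hashlib
import time

def embed(text: str) -> list[float]:
    """Stand-in embedding; a real pipeline would call a sentence encoder."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:16]]

INDEX: dict[str, dict] = {}  # doc_id -> {"vec", "text", "meta"}

def ingest(doc_id: str, text: str, source: str) -> None:
    """(Re-)index a document with provenance so answers can cite it inline."""
    INDEX[doc_id] = {"vec": embed(text), "text": text,
                     "meta": {"source": source, "indexed_at": time.time()}}

def retrieve(query: str, k: int = 3) -> list[dict]:
    """Return the top-k documents by (toy) dot-product similarity."""
    qv = embed(query)
    scored = sorted(INDEX.values(),
                    key=lambda d: sum(a * b for a, b in zip(qv, d["vec"])),
                    reverse=True)
    return scored[:k]

ingest("guideline-17", "Riparian buffers of at least 30 m apply to ...",
       source="Regional Harvest Guideline 2025, §4.2 (placeholder)")
for hit in retrieve("how wide must stream buffers be?"):
    print(hit["text"][:40], "->", hit["meta"]["source"])  # inline citation
```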

7.3. Human-in-the-Loop at Scale

As ForestGPT moves toward Level 3–4 capabilities, integrating human-in-the-loop (HITL) mechanisms becomes even more critical. While frontier models like GPT-5 and Claude 3.5 offer high-quality multi-step reasoning, safety-critical forestry applications demand a safeguard layer where human expertise remains central. This not only improves trust calibration, but also allows domain experts to correct, verify, and enrich the AI’s outputs in dynamic and uncertain conditions. In practice, HITL workflows can govern abstention logic, decision checkpoints, and interactive correction loops, especially in cases like thinning schedules, policy interpretation, or terrain-sensitive prescriptions.
They also provide interpretability, as human review can ensure citations, rule references, and numerical estimates are not only plausible, but truly grounded in applicable context. We argue that future domain-specific LLM deployments in forestry must anchor automation efforts in such HITL protocols, making foresters collaborators rather than mere users. This enables gradual handover of routine tasks, while preserving expert oversight for strategic interventions. This blended architecture is especially beneficial in modular rollouts, where low-risk components can be automated first, and critical decisions remain human-governed until system maturity and audit mechanisms are fully in place.
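
As a minimal illustration of such a HITL checkpoint, the sketch below gates delivery of a model answer on its confidence and a task risk class; the thresholds and risk categories are illustrative assumptions that would, in practice, be calibrated per expert role.

```python
from enum import Enum

class Risk(Enum):
    LOW = 1   # e.g., glossary questions, document lookup
    HIGH = 2  # e.g., thinning schedules, terrain-sensitive prescriptions

# Illustrative thresholds; calibrated per expert role in a real deployment.
HANDOFF_THRESHOLD = {Risk.LOW: 0.60, Risk.HIGH: 0.90}

def dispatch(answer: str, confidence: float, risk: Risk) -> str:
    """Deliver the model output, or abstain and hand off to a human expert."""
    if confidence >= HANDOFF_THRESHOLD[risk]:
        return answer
    return ("I am not confident enough to answer this reliably; "
            "forwarding to a forestry expert for review.")

print(dispatch("Thin stand B-3 to 60% stocking.", confidence=0.75,
               risk=Risk.HIGH))  # abstains and escalates to a human
```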

7.4. Caveats and Readiness for Level 3–4

While Level 1–2 applications (retrieval, summarization, guideline recall) are already mature and cost-effective with 2025 LLMs, Level 3–4 scenarios (e.g., adaptive planning, verifiable rule chaining, or cross-modal synthesis) remain contingent on multiple factors: standardized data pipelines, interoperable ontologies, agent–tool integration, and robust evaluation frameworks. We caution against assuming full generalization of reasoning agents across modalities without concrete investments in pre-processing, validation, and task-specific agent orchestration. Future deployments must be rigorously scoped and iteratively tested to reach safe, trustworthy outcomes in real forest operations.

Author Contributions

Conceptualization, A.H. and B.E.; Data curation, F.E.-S. and B.E.; Formal analysis, B.E.; Funding acquisition, A.H.; Investigation, A.H. and F.E.-S.; Methodology, A.H.; Software, F.E.-S. and B.E.; Supervision, A.H.; Validation, F.E.-S. and B.E.; Visualization, F.E.-S.; Writing—original draft, A.H. and F.E.-S.; Writing—review and editing, A.H., F.E.-S. and B.E. All authors have read and agreed to the published version of the manuscript.

Funding

We gratefully acknowledge the funding of the research promotion agency of the province of Lower Austria, Project GFF NÖ FTI-22-I-004, Infrastructure for the realistic testing of AI-supported robot systems in demanding environments without direct energy connection for multiple use cases in forestry—human-robot teaming.

Data Availability Statement

The ForestGPT Level 1 prototype is publicly available at https://www.chatbase.co/chatbot-iframe/wy_SRCiz95zXieg54PjuF (accessed on 14 April 2025).

Acknowledgments

The authors wish to express their sincere gratitude to all three anonymous reviewers for their insightful and constructive comments. The comments of the reviewers have contributed significantly to enhancing the quality and clarity of this manuscript. The authors appreciate the time and effort invested by the reviewers in scrutinizing the manuscript and offering valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest. This work does not raise ethical issues. The authors did not use any generative AI for producing text; they only used standard tools such as Writefull, Grammarly, and DeepL to check spelling and improve language.

Abbreviations

AI: Artificial Intelligence
AIP: Artificial Intelligence Platform
API: Application Programming Interface
AR: Augmented Reality
ARES: Answer-aware Retrieval Evaluation for Semantic search
CI: Continuous Integration
CRAG: Contextualized Retrieval-Augmented Generation
DBH: Diameter at Breast Height
DPO: Direct Preference Optimisation
ECMWF: European Centre for Medium-Range Weather Forecasts
EEG: Electroencephalogram
ERP: Enterprise Resource Planning
ESA: European Space Agency
FAISS: Facebook AI Similarity Search
FSC: Forest Stewardship Council
GEDI: Global Ecosystem Dynamics Investigation
GIS: Geographic Information System
GPS: Global Positioning System
GPT: Generative Pre-Trained Transformer
GPU: Graphics Processing Unit
GROBID: GeneRation Of BIbliographic Data
gRPC: gRPC Remote Procedure Call (originally Google Remote Procedure Call)
HCAI: Human-Centered AI
HITL: Human-in-the-Loop
IEC: International Electrotechnical Commission
IoT: Internet of Things
ISO: International Organization for Standardization
JSON: JavaScript Object Notation
KTO: Kahneman–Tversky Optimisation
LiDAR: Light Detection and Ranging
LLM: Large Language Model
LoRA: Low-Rank Adaptation
NPV: Net Present Value
OAG: Ontology-Augmented Generation
OCR: Optical Character Recognition
PEFC: Programme for the Endorsement of Forest Certification
PFT: Preference Fine-Tuning
PLC: Programmable Logic Controller
PPO: Proximal Policy Optimisation
QA: Question Answering
RAG: Retrieval-Augmented Generation
RAGAS: Retrieval-Augmented Generation Assessment
RELAGGS: Relational Aggregations
REST: Representational State Transfer
RL: Reinforcement Learning
RLAIF: Reinforcement Learning from AI Feedback
RLHF: Reinforcement Learning from Human Feedback
ScaNN: Scalable Nearest Neighbors
SFT: Supervised Fine-Tuning
VRAM: Video Random-Access Memory

References

1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Viswanathan, S., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2017; pp. 6000–6010.
2. Haque, M.A.; Li, S. Exploring ChatGPT and its impact on society. AI Ethics 2025, 5, 791–803.
3. Rane, N. ChatGPT and similar generative artificial intelligence (AI) for smart industry: Role, challenges and opportunities for industry 4.0, industry 5.0 and society 5.0. Challenges Oppor. Ind. 2023, 4, 1–8.
4. Annepaka, Y.; Pakray, P. Large language models: A survey of their development, capabilities, and applications. Knowl. Inf. Syst. 2025, 67, 2967–3022.
5. Meng, W.; Li, Y.; Chen, L.; Dong, Z. Using the Retrieval-Augmented Generation to Improve the Question-Answering System in Human Health Risk Assessment: The Development and Application. Electronics 2025, 14, 386.
6. Arslan, M.; Ghanem, H.; Munawar, S.; Cruz, C. A Survey on RAG with LLMs. Procedia Comput. Sci. 2024, 246, 3781–3790.
7. Holzinger, A.; Schweier, J.; Gollob, C.; Nothdurft, A.; Hasenauer, H.; Kirisits, T.; Häggström, C.; Visser, R.; Cavalli, R.; Spinelli, R.; et al. From industry 5.0 to forestry 5.0: Bridging the gap with human-centered artificial intelligence. Curr. For. Rep. 2024, 10, 442–455.
8. Sundberg, B.; Silversides, C. Operational Efficiency in Forestry: Vol. 1: Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1988; Volume 29.
9. Piragnolo, M.; Grigolato, S.; Pirotti, F. Planning harvesting operations in forest environment: Remote sensing for decision support. In ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences; Copernicus GmbH: Göttingen, Germany, 2019; Volume 4, pp. 33–40.
10. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
11. Wang, Y.; Pan, Y.; Yan, M.; Su, Z.; Luan, T.H. A survey on ChatGPT: AI-generated contents, challenges, and solutions. IEEE Open J. Comput. Soc. 2023, 4, 280–302.
12. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A survey of visual transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 7478–7498.
13. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
14. Gupta, N.; Choudhuri, S.S.; Hamsavath, P.N.; Varghese, A. Fundamentals of Chat GPT for Beginners Using AI; Academic Guru Publishing House: Cambridge, MA, USA, 2024.
15. Kaufmann, T.; Weng, P.; Bengs, V.; Hüllermeier, E. A Survey of Reinforcement Learning from Human Feedback. In Reinforcement Learning: Algorithms, Applications and Open Challenges; Springer: Berlin/Heidelberg, Germany, 2024.
16. Zhang, K.; Zeng, S.; Hua, E.; Ding, N. Ultramedical: Building Specialized Generalists in Biomedicine. Adv. Neural Inf. Process. Syst. 2024, 37, 26045–26081.
17. Hong, J.; Lee, N.; Thorne, J. ORPO: Monolithic Preference Optimization without Reference Model. arXiv 2024, arXiv:2403.07691.
18. Khairat, S.; Niu, T.; Geracitano, J.; Zhou, Z. Performance Evaluation of Popular Open-Source Large Language Models in Healthcare. Stud. Health Technol. Inform. 2025, 328, 215–219.
19. OpenAI. Introducing OpenAI o3 and o4-mini. Updated 10 June 2025: o3-pro availability. 2025. Available online: https://openai.com/index/introducing-o3-and-o4-mini/ (accessed on 14 August 2025).
20. OpenAI. Model Release Notes. o3-pro available in ChatGPT and API (10 June 2025). 2025. Available online: https://help.openai.com/en/articles/9624314-model-release-notes (accessed on 14 August 2025).
21. OpenAI. Introducing GPT-5. 2025. Available online: https://openai.com/index/introducing-gpt-5/ (accessed on 4 September 2025).
22. Guo, D.; DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025.
23. DeepSeek-AI. DeepSeek-R1: Model Card and Checkpoints. Hugging Face Model Repository. 2025. Available online: https://huggingface.co/deepseek-ai/DeepSeek-R1 (accessed on 4 September 2025).
24. Google DeepMind. Gemini 2.5: Our Most Intelligent AI Model. 2025. Available online: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/ (accessed on 14 August 2025).
25. xAI. Introducing Grok 4. Launch Announcement Post. 2025. Available online: https://x.com/xai/status/1943158495588815072 (accessed on 14 August 2025).
26. QwenLM Team. Qwen3: Open-Weight Reasoning-Centric LLM Family. Model Family Repository. 2025. Available online: https://github.com/QwenLM/Qwen3 (accessed on 14 August 2025).
27. QwenLM Team. Qwen3-Coder. Coder Variants and Long-Context Notes. 2025. Available online: https://github.com/QwenLM/Qwen3-Coder (accessed on 14 August 2025).
28. Anthropic. Introducing Claude 3.5 Sonnet. 2024. Available online: https://www.anthropic.com/news/claude-3-5-sonnet (accessed on 20 August 2025).
29. Mistral AI. Mistral AI Models Overview—Mistral Large and Codestral. 2025. Available online: https://docs.mistral.ai/getting-started/models/models_overview/ (accessed on 20 August 2025).
30. Cohere. Cohere Command R+: Optimized for Complex RAG Workflows and Long-Context Tasks. 2024. Available online: https://docs.cohere.com/docs/command-r-plus (accessed on 20 August 2025).
31. Bejar-Martos, J.A.; Rueda-Ruiz, A.J.; Ogayar-Anguita, C.J.; Segura-Sanchez, R.J.; Lopez-Ruiz, A. Strategies for the storage of large LiDAR datasets—A performance comparison. Remote Sens. 2022, 14, 2623.
32. Sarker, I.H. LLM potentiality and awareness: A position paper from the perspective of trustworthy and responsible AI modeling. Discov. Artif. Intell. 2024, 4, 40.
33. Radeva, I.; Popchev, I.; Doukovska, L.; Dimitrova, M. Web application for retrieval-augmented generation: Implementation and testing. Electronics 2024, 13, 1361.
34. Ahn, Y.; Lee, S.G.; Shim, J.; Park, J. Retrieval-augmented response generation for knowledge-grounded conversation in the wild. IEEE Access 2022, 10, 131374–131385.
35. Xie, Y.; Jiang, B.; Mallick, T.; Bergerson, J.D.; Hutchison, J.K.; Verner, D.R.; Branham, J.; Alexander, M.R.; Ross, R.B.; Feng, Y.; et al. WildfireGPT: Tailored large language model for wildfire analysis. arXiv 2024, arXiv:2402.07877.
36. Digital Green. Farmer.CHAT: AI Assistant for Agricultural Extension. 2024. Available online: https://farmerchat.digitalgreen.org/ (accessed on 17 July 2025).
37. Palantir Technologies Inc. Foundry AI Platform (AIP) Overview. 2025. Available online: https://www.palantir.com/docs/foundry/aip/overview (accessed on 17 July 2025).
38. Zarfati, M.; Soffer, S.; Nadkarni, G.N.; Klang, E. Retrieval-Augmented Generation: Advancing personalized care and research in oncology. Eur. J. Cancer 2025, 220, 115341.
39. Planet Labs PBC. Forest Carbon Monitoring—Technical Specification. 2025. Available online: https://docs.planet.com/data/planetary-variables/forest-carbon-monitoring/techspec/ (accessed on 14 August 2025).
40. Planet Labs PBC. Forest Carbon Monitoring—Product Overview. 2025. Available online: https://docs.planet.com/data/planetary-variables/forest-carbon-monitoring/ (accessed on 14 August 2025).
41. CarbonAi Inc. CarbonAi—Digital MRV Software and Tools. 2025. Available online: https://carbonai.ca/ (accessed on 14 August 2025).
42. Ge, Y.; Hua, W.; Mei, K.; Tan, J.; Xu, S.; Li, Z.; Zhang, Y. OpenAGI: When LLM meets domain experts. Adv. Neural Inf. Process. Syst. 2023, 36, 5539–5568.
43. Zhou, J.; Müller, H.; Holzinger, A.; Chen, F. Ethical ChatGPT: Concerns, challenges, and commandments. Electronics 2024, 13, 3417.
44. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
45. Holzinger, A.; Zatloukal, K.; Müller, H. Is Human Oversight to AI Systems still possible? New Biotechnol. 2025, 85, 59–62.
46. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45.
47. Vulova, S.; Horn, K.; Rocha, A.D.; Brill, F.; Somogyvári, M.; Okujeni, A.; Förster, M.; Kleinschmit, B. Unraveling the response of forests to drought with explainable artificial intelligence (XAI). Ecol. Indic. 2025, 172, 113308.
48. Martino, A.; Iannelli, M.; Truong, C. Knowledge injection to counter large language model (LLM) hallucination. In Proceedings of the European Semantic Web Conference, Hersonissos, Greece, 28–29 May 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 182–185.
49. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
50. Jin, T.; Yazar, W.; Xu, Z.; Sharify, S.; Wang, X. Self-Selected Attention Span for Accelerating Large Language Model Inference. arXiv 2024, arXiv:2404.09336.
51. Tian, X. Evaluating the repair ability of LLM under different prompt settings. In Proceedings of the 2024 IEEE International Conference on Software Services Engineering (SSE), Shenzhen, China, 7–13 July 2024; pp. 313–322.
52. Sakib, S.K.; Das, A.B. Challenging fairness: A comprehensive exploration of bias in LLM-based recommendations. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 1585–1592.
53. Gupta, B.B.; Gaurav, A.; Arya, V.; Alhalabi, W.; Alsalman, D.; Vijayakumar, P. Enhancing user prompt confidentiality in Large Language Models through advanced differential encryption. Comput. Electr. Eng. 2024, 116, 109215.
54. Maynez, J.; Narayan, S.; Bohnet, B.; McDonald, R. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1906–1919.
55. Kadavath, S.; Conerly, T.; Askell, A.; Henighan, T.; Drain, D.; Perez, E.; Schiefer, N.; Hatfield-Dodds, Z.; DasSarma, N.; Tran-Johnson, E.; et al. Language Models (Mostly) Know What They Know. arXiv 2022, arXiv:2207.05221.
56. Jiang, Z.; Xu, F.F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 7969–7992.
57. Frering, L.; Steinbauer-Wagner, G.; Holzinger, A. Integrating Belief-Desire-Intention agents with large language models for reliable human–robot interaction and explainable Artificial Intelligence. Eng. Appl. Artif. Intell. 2025, 141, 109771.
58. O’Leary, D.E. An anchoring effect in large language models. IEEE Intell. Syst. 2025, 40, 23–26.
59. OWASP Foundation. OWASP Top 10 for Large Language Model Applications (2025). Includes LLM01: Prompt Injection. 2025. Available online: https://owasp.org/www-project-top-10-for-large-language-model-applications/ (accessed on 14 August 2025).
60. UK National Cyber Security Centre. Thinking About the Security of AI Systems. 2023. Available online: https://www.ncsc.gov.uk/blog-post/thinking-about-security-ai-systems (accessed on 14 August 2025).
61. O’Leary, D.E. Confirmation and specificity biases in large language models: An explorative study. IEEE Intell. Syst. 2025, 40, 63–68.
62. Yi, W.; Zhang, L.; Kuzmin, S.; Gerasimov, I.; Liu, M. Agricultural large language model for standardized production of distinctive agricultural products. Comput. Electron. Agric. 2025, 234, 110218.
63. Pensel, L.; Kramer, S. Neural RELAGGS. Mach. Learn. 2025, 114, 123.
64. Abdollahi, A.; Pradhan, B. Explainable artificial intelligence (XAI) for interpreting the contributing factors feed into the wildfire susceptibility prediction model. Sci. Total Environ. 2023, 879, 163004.
65. Buccafurri, F.; Lazzaro, S. A Framework for Secure Internet of Things Applications. In Proceedings of the 2024 10th International Conference on Control, Decision and Information Technologies (CoDIT), Valletta, Malta, 1–4 July 2024; pp. 2845–2850.
66. Mota, T.d.S.; Sarkar, S.; Poojary, R.; Alqasemi, R. ChatGPT-Based Model for Controlling Active Assistive Devices Using Non-Invasive EEG Signals. Electronics 2025, 14, 2481.
67. Mittal, U.; Sai, S.; Chamola, V.; Sangwan, D. A comprehensive review on generative AI for education. IEEE Access 2024, 12, 142733–142759.
68. Nunes, L.J. The Role of Artificial Intelligence (AI) in the Future of Forestry Sector Logistics. Future Transp. 2025, 5, 63.
69. Holzinger, A.; Fister, I., Jr.; Fister, I.; Kaul, H.P.; Asseng, S. Human-Centered AI in smart farming: Towards Agriculture 5.0. IEEE Access 2024, 12, 62199–62214.
70. Sarkar, D.; Chapman, C.A. The smart forest conundrum: Contextualizing pitfalls of sensors and AI in conservation science for tropical forests. Trop. Conserv. Sci. 2021, 14, 19400829211014740.
71. Retzlaff, C.O.; Gollob, C.; Nothdurft, A.; Stampfer, K.; Holzinger, A. Developing a User-Friendly Interface for Interactive Cable Corridor Planning. Croat. J. For. Eng. 2025, 46, 213–223.
72. Schraick, L.M.; Ehrlich-Sommer, F.; Stampfer, K.; Meixner, O.; Holzinger, A. Usability in Human-Robot Collaborative Workspaces. Univers. Access Inf. Soc. 2024, 24, 1609–1622.
73. Ehrlich-Sommer, F.; Hörl, B.; Gollob, C.; Nothdurft, A.; Stampfer, K.; Holzinger, A. Robot Usability in the Wild: Bridging Accessibility Gaps for Diverse User Groups in Complex Forestry Operation. Univers. Access Inf. Soc. 2025, 24, 2867–2887.
74. Kocic, V.; Lukac, N.; Rozajac, D.; Schweng, S.; Gollob, C.; Nothdurft, A.; Stampfer, K.; Ser, J.D.; Holzinger, A. LLM in the Loop: A Framework for Contextualizing Counterfactual Segment Perturbations in Point Clouds. IEEE Access 2025, 13, 85507–85525.
75. Kraišniković, C.; Harb, R.; Plass, M.; Al Zoughbi, W.; Holzinger, A.; Müller, H. Fine-tuning language model embeddings to reveal domain knowledge: An explainable artificial intelligence perspective on medical decision making. Eng. Appl. Artif. Intell. 2025, 139, 109561.
Figure 1. Proposed timeline for working prototypes of each level.
Figure 2. Simplified RAG query cycle.
Figure 3. Base-level ForestGPT tuning pipeline on the example of pure text-based inputs.
Figure 4. Depiction of the ForestGPT workflow when multimodal data is incorporated.
Figure 5. Simple depiction of the development steps where each one builds on the previous one.
Figure 6. Screenshots: the top section presents the view of the final app, while the bottom section illustrates the interface used for building the app.
Figure 7. Risk radar chart for the separate points that need to be addressed.
Table 1. Stages of pre-training and post-training for LLM alignment and specialization (2024–2025 state of the art).

| Stage | Description | Key Examples |
| --- | --- | --- |
| Pre-training | Unsupervised learning from massive text corpora by predicting the next token (causal language modeling). | GPT-3, LLaMA, PaLM, Falcon, Mistral base models |
| Instruction tuning | Supervised fine-tuning on human-written prompt–response pairs to guide format and tone. | FLAN, Alpaca, OpenAssistant, Baize, LLaMA 2 |
| Preference modeling | Use of human or synthetic ranking data to teach models which completions are preferred. | InstructGPT, RLHF preference phase, QLoRA tuning with ranked datasets |
| Reward modeling | Learn a reward function from human ratings or programmatic rules to guide later tuning. | o3 “rewarded-by-verifier”, Gemini rater pipelines, Constitutional AI rewarders |
| Reinforcement learning (RLHF/RLAIF) | Fine-tune the model using reinforcement learning with human or automated feedback. | ChatGPT (RLHF), Gemini (tool-use agents), DeepSeek-R1 (RLAIF), Claude 3 constitutional fine-tuning |
| Distillation from stronger models | Use a higher-quality model (or ensemble) to label prompts, training a smaller one on these outputs. | Zephyr, DistilGPT-2, Qwen Mini, DeepSeek-MoE |
| Format transfer/reasoning induction | Copy reasoning chains, scratchpads, or tool-use traces from stronger models. | DeepSeek-R1 (verifier traces), Grok 4 (tool-use chains), Gemini 2.5 (step-wise reasoning) |
| Safety and refusal tuning | Teach the model to avoid unsafe or out-of-scope responses via refusals or safety prompts. | o3-pro refusal layers, Claude 3 refusals, Gemini red-teaming outputs |
| Multimodal alignment (optional) | Align text, image, or code modalities using curated data or contrastive objectives. | Gemini, Qwen-VL, GPT-4V, InternLM-XComposer |
Table 2. Representative reasoning-capable LLMs as of mid-2025, focusing on practical deployment and alignment methods.

| Model Family | Release (Public) | Access Type | Context Length | Notable Features (Reasoning) |
| --- | --- | --- | --- | --- |
| OpenAI o3-pro | June 2025 | Hosted (ChatGPT/API) | 128 k tokens | Verifier-first alignment, high-reliability reasoning, tool use (code/web/files) [19,20] |
| OpenAI GPT-5 | August 2025 | Hosted (ChatGPT/API) | 128 k tokens (with “Thinking” mode) | State-of-the-art multi-step reasoning with verifier-backed “Thinking” mode; high robustness in safety-critical workflows [21] |
| DeepSeek-R1 | January 2025/updated May 2025 | Open weights + API | 128 k tokens | Trained via RLAIF with chain-of-verifier reasoning; very strong step-by-step math [22,23] |
| Gemini 2.5 Pro | March–May 2025 | Hosted (Vertex AI) | 1 M tokens | Scratchpad-style reasoning, agentic tool use, long-context verified mode [24] |
| Grok 4 | June 2025 | Hosted (xAI/Twitter) | 128 k tokens (est.) | Tool-use traces and API access, emphasis on dialogue structure and utility [25] |
| Qwen 3/Qwen3-Coder | April–June 2025 | Open weights (Apache 2.0) | 128 k–200 k tokens | Multi-turn reasoning, math/code scratchpads, long-context coder variants [26,27] |
| Claude 3.5 Sonnet | October 2024 | Hosted (Anthropic) | 200 k tokens | Constitutional fine-tuning; strong real-world reasoning and tool integration [28] |
| Mistral Large/Codestral | March/May 2025 | Hosted (API)/open weights | 65 k tokens (est.) | Open-weight base models and API-available reasoning-capable large variants [29] |
| Command R+ (Cohere) | April 2025 | Hosted + weights | 128 k tokens | RAG-tuned foundation model with strong retriever-integrated reasoning [30] |
Table 3. Selected RAG deployments whose requirements echo those of forest management.

| System | Primary Domain | Typical Data Sources | Forestry-Relevant Parallels |
| --- | --- | --- | --- |
| WildfireGPT | Natural-hazard decision support | Near-real-time fire-weather grids, burn-scar satellite scenes, peer-reviewed studies | Integrates dynamic geospatial layers with scientific literature to advise land managers. |
| Farmer.CHAT and Agri-Llama | Smallholder agriculture | Extension manuals, call-centre transcripts, farmer videos | Mirrors forestry extension: multilingual guidance, region-specific best practice, small-holder constraints. |
| Palantir AIP (OAG) | Supply-chain resilience | ERP streams, sensor alerts, optimisation models | Shows how live telemetry plus deterministic solvers can be piped into a chat agent, analogous to combining harvester feeds with wood-flow optimisation. |
Table 4. Typical problems you can encounter when using an LLM, why they occur, and mitigation strategies.

| Problem | What It Looks Like in Practice | Underlying Reason and Mitigation (2025 Update) |
| --- | --- | --- |
| Hallucination | The model invents a regulation, cites a non-existent journal article, or fabricates numerical results. | It always chooses the most statistically probable token sequence; if no reliable pattern exists, it will still produce a fluent guess [48]. Verifier-rewarded models (e.g., o3, DeepSeek-R1) and tool-use fallback chains (e.g., Gemini 2.5) reduce hallucination frequency. |
| Stale knowledge | Answers do not mention a law passed last month or a beetle outbreak reported yesterday. | The model’s weights are frozen at the moment of training; anything published after that date is unknown unless fed in via RAG or agents [49]. RAG-native models like Command R+ or Gemini can inject up-to-date data in context. |
| Limited “attention span” | When a prompt exceeds the model’s context window (e.g., >100 k tokens), early passages are silently dropped, leading to contradictions or omissions. | The Transformer can only process a fixed number of tokens at once; older tokens fall off the back of the window [50]. Models like Gemini 2.5 and Claude 3.5 extend limits to 200 k–1 M tokens, and structured summarization or retrieval mitigates overflow. |
| Prompt sensitivity | Changing a single word in the question yields a noticeably different answer. | Small phrasing shifts alter the statistical path the model follows, much like nudging a marble down a branching maze [51]. Stability improves via prompt libraries, system prompts, and verifier-guided sampling (as in o3 and Claude 3.5). |
| Bias and unfairness | Stereotypical or unbalanced language appears in generated text. | The model reflects patterns present in its training data, including historical biases [52]. 2025 models use fine-grained constitutional training (Claude), RAG-based perspective balancing, and verifier gating to reduce bias. |
| Confidentiality leak | Private fragments from earlier sessions or fine-tuning data surface in a response. | Without strict filtering, the model can echo memorised snippets when they boost token-level probability [53]. Differential privacy, fine-tuning firewalls, and audit logs are increasingly common in enterprise deployments. |
| Inconsistent tool use | The model fails to reliably invoke external tools (e.g., GIS or simulation APIs) or mixes tool output with hallucinated content. | Tool-calling in agents (e.g., Gemini, o3) can drift if reward signals are unclear. New verifier-first tool chains with rejection sampling and system checks improve reliability. |
| Verifier bypass risk | Despite verifier layers, the model produces an incorrect but confident output (e.g., unsafe forest management advice). | Verifier-guided decoding is not foolproof; attacks or distributional drift can cause bypass. Runtime feedback loops, fallback to tools, and human-in-the-loop checkpoints mitigate failure. |
Table 5. Reasoning-centric models relevant to Forestry 5.0 (release window, availability and highlights).

| Model | Release | Access | Context (Headline) | Highlights (Forestry-Relevant) |
| --- | --- | --- | --- | --- |
| OpenAI o3/o3-mini/o3-pro | December 2024–June 2025 | Hosted (ChatGPT/API) | — (reasoning-focused) | Successors to o1 with stronger deliberate reasoning; o3-mini offers speed/latency trade-offs; o3-pro (June 2025) targets highest reliability and native tool use (web/code/files) [19,20] |
| DeepSeek-R1 | January 2025/updated May 2025 | Open weights and API | Varies by size | RL-trained reasoning; open release incl. distilled 1.5–70B checkpoints; performance comparable to o1 on reasoning tasks, at open-source cost/sovereignty [22,23] |
| Gemini 2.5 Pro | March–May 2025 | Hosted (Gemini/Vertex) | 1 M tokens (2 M announced) | Native multimodality; state-of-the-art reasoning/coding; production-grade million-token context useful for long regulations, manuals, and code bases [24] |
| Grok 4 | July 2025 | Hosted (xAI API) | ∼256 k | RL-scaled “native tool use” and real-time search; strong frontier reasoning benchmarks; designed to autonomously plan searches and use code tools [25] |
| Qwen 3 | April–July 2025 | Open weights and API | Up to 256 k (1 M via extrapolation in Coder variants) | Hybrid dense/MoE family with “hybrid reasoning” focus; permissive licensing and long-context variants; strong multilingual support and agentic coding [26,27] |
| OpenAI GPT-5 | August 2025 | Hosted (ChatGPT/API) | 128 k tokens (“Thinking mode”) | New frontier-level baseline from OpenAI with significantly improved multi-step reasoning and the debut of Thinking Mode for better planning, explanation, and reflection; recommended for complex forestry copilots [21] |