Exploring Thematic Evolution in Interdisciplinary Forest Fire Prediction Research: A Latent Dirichlet Allocation–Bidirectional Encoder Representations from Transformers Model Analysis

Zhang, Shuo

doi:10.3390/f16020346

School of Information Management, Nanjing University, Nanjing 210023, China

Forests 2025, 16(2), 346; https://doi.org/10.3390/f16020346

Submission received: 4 January 2025 / Revised: 10 February 2025 / Accepted: 11 February 2025 / Published: 14 February 2025

(This article belongs to the Special Issue Forest Fires Prediction and Detection—2nd Edition)

Abstract

Facing the severe global wildfire challenge and the need for advanced prediction, this study analysed the evolving research in forest fire prediction using an LDA-BERT similarity model. Driven by climate change, human activities, and natural factors, forest fires threaten ecosystems, society, and the climate system. The vast existing literature on forest fire prediction makes it challenging to identify research themes manually. The proposed LDA-BERT model combines LDA and BERT. LDA was used for topic mining, with the optimal number of topics determined by calculating the semantic coherence. BERT was employed in word vector training, using topic word probabilities as weights. The cosine similarity algorithm and normalisation were used to measure the topic similarity. Through empirical research on 13,552 publications from 1980–2023 retrieved from the Web of Science database, several key themes were identified, such as “wildfire risk management”, “vegetation and habitat changes”, and “climate change and forests”. Research trends show a shift from macro-level to micro-level studies, with modern technologies becoming a focus. Multidimensional scaling revealed a hierarchical theme distribution, with themes closely related to forest fires being dominant. This research offers valuable insights for the scientific community and policymakers, facilitating an understanding of these changes and contributing to wildfire mitigation. However, it has limitations, such as subjectivity in the selection of theme-representative words, and needs further improvement in threshold setting and model performance evaluation. Future research can optimise these aspects and integrate emerging technologies to enhance forest fire prediction research.

Keywords:

wildfire; LDA-BERT; topic modelling; text mining; research trends

1. Introduction

Forest fires are recognized as natural disasters that increasingly threaten global ecosystems and human societies [1]. The complex interplay of global climate change, human activities, and natural influences has increased the frequency and scale of wildfires. These fires pose significant challenges, including threats to life, property, and the environment. Forest fires not only jeopardise human communities but also profoundly impact the ecological balance of the atmosphere and the land surface, as well as the global climate system. As such, they have become a critical issue with far-reaching environmental and social implications.

In this context, predicting and detecting wildfires promptly has become critical. The rapid progression and spread of forest fires make early detection and effective responses imperative. However, achieving accurate prediction and effective detection remains a formidable obstacle in forest fire management. Researchers have been working to address this challenge by integrating a wide range of knowledge and techniques, such as meteorology, geographic information systems (GIS) [2], remote sensing [3], data analytics, and machine learning [4]. As a result, the literature on forest fire prediction is growing rapidly in both volume and depth. It contains a large amount of knowledge of high academic value, including essential information such as expert scholars’ perspectives, research methods, and research results. Faced with this mass of academic information, scientific and technological intelligence workers and field researchers can only process it manually and analyse and interpret it subjectively. This is time-consuming and labour-intensive, and it makes it difficult to identify the research themes comprehensively and accurately and to obtain valuable information. How to utilize emerging information technology to quickly and effectively identify the theme content of massive amounts of scientific and technological information, assist scientific knowledge discovery, and improve the efficiency of scientific research [5] is a crucial issue that urgently needs to be solved.

Theme recognition aims to process and analyse large-scale data, quickly extract the research themes, and use descriptors to represent essential information [6]. Researchers have carried out in-depth studies on theme recognition methods, mainly in two directions: co-occurrence analysis and theme modelling. In co-occurrence analysis, word co-occurrence networks are constructed and complex network algorithms are used to identify research topics; in theme modelling, machine learning algorithms are used to mine the topic descriptors hidden in the documents. The existing research is mainly realized by extracting words and calculating the strength of inter-word relationships; however, it is difficult to accurately reveal the meaning of a theme by using isolated words, which lack context, as theme descriptors. Phrases can express richer semantic information than words and are easy to understand and analyse. Therefore, it has become urgent to construct a new method of theme recognition that generates phrase-structured descriptors from the perspective of theme representation [7].

In addition, once theme recognition is completed, accurately revealing the content of the research theme is equally essential [8,9]. However, the related research focuses on improving theme recognition algorithms, theme evolution [10,11], and hotspot distribution [12,13] based on theme words, time periods, and so on, and pays less attention to fine-grained mining of the original textual information to which the theme belongs [14,15]. Sentence-level step structure recognition can classify content from a semantic point of view and effectively identify the sentences that express a text’s purpose, research method, results, and conclusions. In-depth mining of sentences helps divide a topic into blocks according to the semantic step structure, which is significant for revealing the deep, fine-grained scientific knowledge in the text [16,17].

Moreover, research themes are dynamic and constantly evolving. As a scientific field advances, new topics emerge, existing topics mature or gradually decline, and the focus of various topics continues to shift [18,19]. Some topics may split into multiple sub-topics, while others merge to create new research areas. Therefore, analysing the evolution of research themes plays a crucial role in predicting trends and hotspots within scientific fields, facilitating knowledge exchange between researchers within and across disciplines, providing a roadmap for scientific innovation, and helping funding agencies and policymakers monitor the flow of knowledge within a given field.

Understanding the evolution of forest fire prediction research themes is crucial for scientific advancement and policy formulation. Traditional methods such as citation analysis and co-occurrence analysis have been instrumental in this regard. Still, they have limitations, including time lags and a disconnect between analytical outcomes and the original literature. To overcome these challenges, this study applied the LDA-BERT similarity model, which leverages the strengths of Latent Dirichlet Allocation (LDA) [20] and the Bidirectional Encoder Representations from Transformers (BERT) model [21]. LDA, a conventional topic-modelling technique, identifies topics based on word probability distributions but often falls short in capturing the semantic context, particularly in short documents or complex knowledge domains. In contrast, BERT, a pre-trained natural language processing model, excels in understanding the semantic relationships between words within sentences through masked language modelling and next-sentence prediction tasks.

BERT’s deep semantic understanding complements LDA’s topic extraction capabilities. By integrating these two models, we can achieve a more nuanced and accurate identification of research themes. This hybrid approach addresses LDA’s semantic comprehension limitations, allowing for a more in-depth analysis of the literature on forest fire prediction. The primary objective of this study was to provide a comprehensive understanding of the development trends in forest fire prediction research. By exploring the evolution of research themes, we aimed to offer valuable insights to the scientific community and policymakers. These insights are designed to enhance their strategies for mitigating forest fire risks and to foster interdisciplinary collaboration and technological innovation. Ultimately, this research paves the way for improved wildfire prediction methods, essential for humanity to tackle the formidable challenge of forest fires better. The integration of LDA and BERT in this study not only advances the field of text analysis but also contributes to the broader goal of wildfire management and prevention.

2. Related Work

Theme identification research mainly includes three aspects: theme consistency, theme clustering, and theme classification. Its objectivity and rationality are of great significance for obtaining real-time dynamics in a subject area, investigating the current state of industry development, and scientifically formulating strategic plans [22,23]. Using the LDA topic model to identify research topics has gradually become a hotspot in intelligence research [24]. However, the characteristics of topic recognition for complex knowledge bases in artificial intelligence make the traditional LDA model less applicable, and it cannot fully meet practical requirements. When the training corpus is small or its documents are very short, the resulting distributions are poor [25].

The idea that external corpora can help to improve topic representation was first proposed in 2006, and web search results were used to improve the information in short documents [26]. Later, assuming that complex repositories are samples of topics from larger corpora such as Wikipedia and then using the topics found in the larger corpora to help shape the topic representations in the complex repositories yielded better results. However, if the larger corpus has many unrelated topics, this will consume the topic space of the model. An extension of LDA was proposed in 2010, which uses external information about word similarity to smooth the topic-to-word distribution [27]. Likelihood feature (LF) vectors have been used in various topic analysis tasks. The use of LFs to construct a topic analysis model combines the allowed values of various likelihood features to form a high-dimensional space, which makes it ideal for modelling the topics of a larger corpus [24,28].

In scientific and technical intelligence analysis, topic evolution refers to the phenomenon of gradual changes in topics during the development of a discipline, and these changes include both temporal and spatial trends in topic changes, such as topic intensity evolution and topic semantic evolution [29]. Theme intensity evolution refers to the trend of the hotspot degree of the theme concerned over time; theme semantic evolution refers to the trend of the topic content within the theme over time, such as the emergence of new topics and the demise of old topics, as well as the trend of a specific topic related to the spread of several other topics or convergence, and so on [30].

Many scholars have begun to pay attention to the study of the semantic evolution of topics, and they have established analytical models of subject-topic evolution from multiple perspectives, such as the intensity, structure, and content of topics. These models are usually based on keyword co-occurrence networks and use web front-end visualization techniques to draw a knowledge map of topic evolution. However, this analysis relies too much on the manual annotation of subject content evolution and observes subject evolution only through keyword changes; such an approach is deficient in the richness of semantic information.

To solve this problem, researchers have begun using LDA topic models and hidden Markov models to analyse the trends of the semantic evolution of technical topics, which helps predict the future development of technical topics [31]. LDA topic models reveal the semantic relationships between topics within a subject area through topic filtering and association. However, using different LDA models in various stages of topic extraction results in inconsistent vector spaces for topic representation, which requires manual intervention in calculating the topic similarity and performing topic filtering.

Topic evolution research involves analysing topic evolution and identifying topic mutations. Theme evolution analysis focuses on discovering themes and identifying evolutionary paths, while theme mutation analysis concentrates on dynamically monitoring dramatically changing themes. Topic mutation analysis is vital in identifying emerging and cutting-edge topics and is integral to topic semantic evolution analysis [32].

In conclusion, while LDA topic models are instrumental in analysing topic semantic evolution, several challenges persist, such as refining the selection of topics during the growth stages of disciplines, calculating the topic relevance during development, and tracing the evolutionary paths of topics throughout a discipline’s lifecycle. Addressing these issues, researchers have also explored the synergy between LDA and deep learning models like BERT. For example, Xie et al. (2020) [33] examined the utility of LDA and BERT embeddings for monolingual and multilingual topic analysis, underscoring the efficacy of these methods in discerning textual data. Atagün et al. (2021) [34] applied LDA and BERT to topic modelling within the Teknofest setting, exemplifying their use in NLP tasks. Collectively, the amalgamation of LDA and BERT for topic analysis has significantly advanced NLP research, providing a nuanced understanding of text data across multiple domains. These studies serve as valuable precedents for our research on employing the LDA-BERT model in forest fire prediction research, highlighting both the potential and the challenges inherent in using such integrated models.

3. Methods

The LDA-BERT model proposed in this study provides an in-depth analysis of a text collection utilizing a topic-mining method. This method uses the correlation relationship between the text collection topics and calculates the similarity between the documents by combining the documents’ characteristics. The technical route of the research is shown in Figure 1.

As shown in Figure 1, the research framework of this paper was divided into four steps. The first step was text preprocessing. We obtained the relevant literature in the field of forest fire prediction from the Web of Science database and combined the TF-IDF (Term Frequency–Inverse Document Frequency) and TextRank methods to extract feature words from the title and abstract fields. The extracted feature words were merged with the keyword fields and de-duplicated, and custom dictionaries were then used to remove stop words and merge synonyms, eliminating the interference of irrelevant words. The second and third steps were topic mining and similarity measurement: research topics in forest fire prediction were first mined with the LDA model; the BERT model was then combined with the LDA topic model for word-vector conversion, so that each topic was represented as a real-valued vector; and the similarity between topics was calculated from these vectors using the cosine similarity algorithm. The fourth step was visualisation and analysis, where the probabilities output by the LDA model were used as the weights of the topic vectors to produce a vector representation of each dataset and to calculate the associations between the datasets at different stages.

3.1. Data Preprocessing and Thematic Mining

Several methods and techniques were used in this study to improve the accuracy and depth of text analysis. First, two methods, TF-IDF [35] and TextRank [36], were used to extract feature words from the titles and abstracts, and these feature words were merged with the keywords and de-duplicated. Next, a word segmentation dictionary and a stop-word list were constructed, and the feature words were screened and filtered against them to ensure the quality of the model input.

Subsequently, topic mining was performed on the datasets using the LDA topic model. A separate topic model was built for each dataset, and the number of topics was determined by calculating the semantic coherence. The output included “document-topic” probability distributions and “topic-word” distributions, which provided the corpus for the word vector training module and the probability weights for the topic similarity measure. In natural language processing, feature word extraction is central to understanding a text’s content. The two extraction methods used in this study, TF-IDF and TextRank, are described below.

TF-IDF is a commonly used weighting technique that evaluates the importance of a word by calculating the product of its frequency in the text and the inverse of its document frequency in the corpus. Its calculation formula is as follows:

\mathrm{TF\text{-}IDF}_{i, j} = \frac{n_{i, j}}{\sum_{k} n_{k, j}} \times \log \frac{| D |}{| j : t_{i} \in d_{j} | + 1}

(1)

$n_{i, j}$ : The frequency of term $t_{i}$ in document $d_{j}$ .
$\sum_{k} n_{k, j}$ : The total number of terms in document $d_{j}$ .
$| D |$ : The total number of documents in the corpus.
$| j : t_{i} \in d_{j} |$ : The number of documents in which term $t_{i}$ appears.
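
As a rough illustration of Eq. (1), the following Python sketch scores every term of a tokenised document. The function name `tf_idf` and the toy corpus are hypothetical and not taken from the study, and the natural logarithm is an assumption because the text does not state the base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenised document, following Eq. (1)."""
    n_docs = len(docs)
    # |j : t_i in d_j|: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)  # n_{i,j}
        total = len(doc)       # sum_k n_{k,j}
        scores.append({
            term: (count / total) * math.log(n_docs / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus of tokenised titles (illustrative only).
toy_docs = [
    ["forest", "fire", "risk", "fire"],
    ["fire", "weather", "index"],
    ["vegetation", "recovery"],
    ["remote", "sensing", "vegetation"],
]
toy_scores = tf_idf(toy_docs)
```

Because of the "+ 1" in the denominator, a term that appears in every document receives a slightly negative score under this formula, so such terms naturally fall to the bottom of the ranking.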

TextRank, on the other hand, is a method based on a graph ranking algorithm, which treats the text as a graph structure, with words as nodes and co-occurring relationships between words as edges, and calculates the weight of each node through iteration to extract the critical feature words in the text. The calculation formula is as follows:

W S (v_{i}) = (1 - d) + d \times \sum_{v_{j} \in In (v_{i})} \frac{W_{j i}}{\sum_{v_{k} \in Out (v_{j})} W_{j k}} W S (v_{j})

(2)

$W S (v_{i})$ : The weight of node $v_{i}$ in the graph.
d: A damping factor, typically set to 0.85, represents the probability of continuing to the next node.
$In (v_{i})$ : The set of nodes that have edges pointing to $v_{i}$ .
$W_{j i}$ : The weight of the edge from node $v_{j}$ to node $v_{i}$ .
$Out (v_{j})$ : The set of nodes that $v_{j}$ points to.
$W S (v_{j})$ : The weight of node $v_{j}$ .
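
The iteration in Eq. (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the graph is undirected (so the in- and out-neighbours of a node coincide), edge weights are co-occurrence counts within a sliding window, and the names `textrank`, `window`, and `iterations` are hypothetical.

```python
from collections import defaultdict


def textrank(tokens, window=2, d=0.85, iterations=50):
    """Score words with the weighted update of Eq. (2) on a co-occurrence graph."""
    # Assumption: W_ji counts how often two distinct words co-occur within
    # `window` following tokens; the graph is undirected.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for vi in weights:
            rank = 0.0
            for vj, w_ji in weights[vi].items():     # v_j in In(v_i)
                out_sum = sum(weights[vj].values())  # sum over Out(v_j)
                rank += w_ji / out_sum * scores[vj]
            new_scores[vi] = (1 - d) + d * rank
        scores = new_scores
    return scores


# Hypothetical token sequence (illustrative only).
toy_tokens = ["forest", "fire", "risk", "forest", "fire", "spread"]
toy_ranks = textrank(toy_tokens)
```

In this toy sequence "forest" and "fire" co-occur most often, so they end up with the highest weights, while "spread", which appears once at the end, ends up with the lowest.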

The feature words extracted by the above two methods were merged with the keywords of the literature and de-duplicated, and the result was used as the input corpus of the model. Topic mining was then carried out using the LDA topic model. The specific structure of the LDA topic model is shown in Figure 2.

In Figure 2, $α$ is the prior parameter of the Dirichlet distribution over topics, which influences the distribution of topics within a document, and $β$ is the prior parameter of the Dirichlet distribution over words, which affects the distribution of words within a topic. Suppose we have a document collection, M, consisting of m documents, within which K topics exist. Let N denote the total number of words in the collection M and V the number of unique words in this collection. Here, Z represents the set of topics the model generates, while W denotes the set of topic terms produced. The hyperparameters $α$ and $β$ specify the Dirichlet priors over the topics in document m and the terms in topic k, respectively. The posterior parameters $θ_{m}$ and $φ_{k}$ represent the topic distribution within document m and the term distribution within topic k, respectively. Among these, only W is an observed variable; the rest are latent variables or parameters.

The solution to the LDA topic model was derived through parameter estimation using Gibbs sampling, an iterative algorithm for probabilistic inference. The co-occurrence frequency matrix of topics and topic terms in the dataset was computed based on the correspondence between each word and its assigned topic. Ultimately, the conditional probability used in Gibbs sampling to assign each word to a specific topic, given the current state of all other words, was calculated according to the following formula:

p (z_{i} = k \mid \vec{w}, \vec{z}_{\neg i}) \propto \frac{n_{m, \neg i}^{(k)} + α_{k}}{\sum_{k = 1}^{K} (n_{m, \neg i}^{(k)} + α_{k})} \times \frac{n_{k, \neg i}^{(v)} + β_{v}}{\sum_{v = 1}^{V} (n_{k, \neg i}^{(v)} + β_{v})}

(3)

$p (z_{i} = k \mid \vec{w}, \vec{z}_{\neg i})$ : The probability that word $w_{i}$ belongs to topic k, given the current state of all other words.
$n_{m, \neg i}^{(k)}$ : The number of words in document m that are assigned to topic k, excluding the current word i.
$α_{k}$ : The prior parameter for topic k in the Dirichlet distribution.
$n_{k, \neg i}^{(v)}$ : The number of times word v is assigned to topic k, excluding the current word i.
$β_{v}$ : The prior parameter for word v in the Dirichlet distribution.
K: The total number of topics.
V: The total number of unique words in the vocabulary.
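
To make Eq. (3) concrete, the sketch below evaluates the conditional topic distribution for a single word from two count matrices. It is an illustration under stated assumptions: `n_mk` (documents by topics) and `n_kv` (topics by vocabulary) are hypothetical names, both matrices are assumed to already exclude the word being resampled, and the toy counts are invented.

```python
import numpy as np


def topic_conditional(n_mk, n_kv, m, v, alpha, beta):
    """Normalised p(z_i = k | w, z_not_i) for word v in document m, per Eq. (3).

    n_mk[m, k]: words in document m assigned to topic k (current word removed).
    n_kv[k, v]: times word v is assigned to topic k (current word removed).
    alpha: length-K prior vector; beta: length-V prior vector.
    """
    doc_part = (n_mk[m] + alpha) / (n_mk[m] + alpha).sum()
    word_part = (n_kv[:, v] + beta[v]) / (n_kv + beta).sum(axis=1)
    p = doc_part * word_part
    return p / p.sum()  # normalise so the result can be sampled from


# Hypothetical toy state: 1 document, K = 2 topics, V = 3 vocabulary words.
n_mk = np.array([[2, 1]])
n_kv = np.array([[2, 0, 0],
                 [0, 1, 0]])
alpha = np.array([0.5, 0.5])
beta = np.array([0.1, 0.1, 0.1])

p = topic_conditional(n_mk, n_kv, m=0, v=0, alpha=alpha, beta=beta)
# One Gibbs step would then draw the new topic from p (seed chosen arbitrarily).
new_topic = np.random.default_rng(0).choice(len(p), p=p)
```

A full sampler would repeat this step for every word in every document, decrementing the counts before the draw and incrementing them afterwards.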

In applying the LDA model for topic mining, each dataset’s optimal number of topics was determined by calculating the semantic coherence. A higher semantic coherence score indicated a better clustering quality of the topics, with words within a topic being more closely related and contextually meaningful. The formula employed for calculating semantic coherence was as follows:

C (t; V^{(t)}) = \sum_{m = 2}^{M} \sum_{l = 1}^{m - 1} \log \frac{D (v_{m}^{(t)}, v_{l}^{(t)}) + ε}{D (v_{l}^{(t)})}

(4)

$C (t; V^{(t)})$ : The coherence score for topic t.
M: The number of top words used to calculate the coherence.
$v_{m}^{(t)}$ and $v_{l}^{(t)}$ : The mth and lth top words in topic t.
$D (v_{m}^{(t)}, v_{l}^{(t)})$ : The document frequency of the pair of words $v_{m}^{(t)}$ and $v_{l}^{(t)}$ occurring together.
$D (v_{l}^{(t)})$ : The document frequency of word $v_{l}^{(t)}$ .
$ε$ : A small smoothing constant that keeps the logarithm defined when the two words never occur together.
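
A minimal Python sketch of Eq. (4) is given below. The function name `topic_coherence`, the default `eps=1.0`, and the toy documents are assumptions made for illustration; the text only states that the constant is small. The sketch also assumes that every top word occurs in at least one document, so the denominator is never zero.

```python
import math


def topic_coherence(top_words, docs, eps=1.0):
    """Coherence of one topic from document co-occurrence counts, per Eq. (4)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            pair = doc_freq(top_words[m], top_words[l])
            score += math.log((pair + eps) / doc_freq(top_words[l]))
    return score


# Hypothetical tokenised documents (illustrative only).
toy_docs = [
    ["fire", "risk"],
    ["fire", "risk", "weather"],
    ["fire", "smoke"],
    ["soil", "carbon"],
]
related = topic_coherence(["fire", "risk"], toy_docs)
unrelated = topic_coherence(["fire", "carbon"], toy_docs)
```

Words that tend to appear in the same documents give a higher (less negative) score, which is why the score can be compared across candidate topic numbers.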

This methodology extracted salient keywords from the text and established an LDA topic model, providing a rich corpus for word vector training. The document-topic and topic-term probability distributions offered insight into the text content and facilitated the measurement of topic similarity between different texts, thereby supporting the analysis of topic evolution and the fine-grained understanding of textual content.

3.2. Word Vector Training

This study employed the BERT model to train on the dataset, using the probabilities of topic words as the weights assigned to their respective word vectors. This process yielded a real-valued vector representation linking themes and their constituent words. BERT, a natural language processing model pre-trained on a vast corpus of unlabelled data, significantly improves the accuracy of various NLP tasks through its unsupervised masked language modelling (MLM) and next-sentence prediction (NSP) tasks. These tasks enable the model to capture semantic representations at the sentence level, beyond individual words, enhancing the context awareness of keyword-generated word vectors and the precision of similarity computations.

Given a document dataset, $D_{Z}^{T} = \{d_{1}, d_{2}, \dots, d_{n}\}$, within period T, which encompasses Z distinct themes, $t_{1}, t_{2}, \dots, t_{Z}$, each theme, $t_{i}$, generates words, with the jth word being denoted as $t_{i j}$ and its generation probability as $a_{i j}$. In the LDA topic extraction process, words with higher probabilities are deemed more representative of the theme. Consequently, these words are allocated higher weights when constructing the theme vector. To compute the theme vector based on word vectors, the top m words from each theme’s probability distribution are selected, and their probabilities are normalized. The weight $p_{i j}$ for the jth word in the ith theme is calculated as follows:

p_{i j} = \frac{a_{i j}}{\sum_{n = 1}^{m} a_{i n}}

(5)

Therefore, the theme vector $v (t_{i})$ for a specific theme, $t_{i}$, is derived by multiplying the word vectors of its top m words, $v (t_{i n})$, by their respective weights and summing them:

v (t_{i}) = \sum_{n = 1}^{m} p_{i n} \times v (t_{i n})

(6)

To represent the entire dataset $D_{Z}^{T}$ based on its thematic composition, the theme vectors for all Z themes are aggregated:

v (D_{Z}^{T}) = \sum_{i = 1}^{Z} v (t_{i})

(7)

This aggregation reflects the overall thematic structure of the dataset within the specified time frame T, with each theme’s contribution weighted by the semantic richness of its representative words in the context of the entire corpus.
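
The weighting and aggregation in Eqs. (5) to (7) can be sketched with NumPy as follows. The two-dimensional vectors stand in for BERT word embeddings purely for readability, and the function names and toy probabilities are hypothetical rather than taken from the study.

```python
import numpy as np


def theme_vector(word_probs, word_vectors):
    """Weighted sum of the top-m word vectors of one theme, Eqs. (5) and (6)."""
    probs = np.asarray(word_probs, dtype=float)
    weights = probs / probs.sum()                           # Eq. (5)
    return weights @ np.asarray(word_vectors, dtype=float)  # Eq. (6)


def dataset_vector(theme_vectors):
    """Sum of all theme vectors of one dataset, Eq. (7)."""
    return np.sum(np.asarray(theme_vectors, dtype=float), axis=0)


# Hypothetical LDA probabilities and stand-in word vectors for one theme (m = 3).
toy_probs = [0.2, 0.1, 0.1]
toy_vectors = [[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]]
v_theme = theme_vector(toy_probs, toy_vectors)

# A second, equally hypothetical theme vector to show the aggregation step.
v_dataset = dataset_vector([v_theme, [0.25, 0.5]])
```

Normalising the probabilities first means that each theme vector is a convex combination of its word vectors, so themes with different raw probability mass remain comparable.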

3.3. Similarity Measure

In analysing the evolution of research topics, the cosine similarity algorithm was employed to calculate the similarity between themes represented as fixed-dimensional space vectors. This transformation turned the assessment of thematic similarity in the literature into a spatial similarity problem among these vectors. The cosine similarity was computed using the following formula:

S = \cos (θ) = \frac{A \cdot B}{∥ A ∥ \times ∥ B ∥} = \frac{\sum_{i = 1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i = 1}^{n} {(A_{i})}^{2}} \times \sqrt{\sum_{i = 1}^{n} {(B_{i})}^{2}}}

(8)

Here, $A_{i}$ and $B_{i}$ denote the ith components of the vectors representing Theme A and Theme B, respectively, with each vector symbolizing a theme composed of its constituent theme words. The cosine similarity value ranges from $-1$ to 1, with higher values indicating greater similarity in the content represented by the themes. When the BERT model is used for text similarity matching without fine-tuning, the cosine similarity scores tend to be uniformly high, rendering them less informative in practice. To address this, the similarity scores were normalized after they were obtained so that the results could be differentiated more effectively. The normalization followed this equation:

S_{i} = \frac{C S_{i} - \min (C S)}{\max (C S) - \min (C S)}

(9)

where $C S = \{C S_{1}, C S_{2}, \dots, C S_{n}\}$ and $C S_{i}$ represents the cosine similarity outcome between two theme vectors or dataset vectors. This normalized outcome served as the foundation for constructing the thematic evolution graph. Following theories of knowledge evolution and its lifecycle, theme evolution can be categorized into five stages: emergence, inheritance, differentiation, fusion, and extinction. This study systematically analysed thematic evolution within the researched domain by combining manual interpretation with similarity computation outcomes and setting appropriate thresholds.
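
Eqs. (8) and (9) translate into a few lines of NumPy. The sketch below is illustrative only: the function names and the toy scores are hypothetical, the input vectors are assumed to be non-zero, and returning all zeros when every score is identical is a design choice made here, since Eq. (9) is undefined in that case.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors, Eq. (8)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def min_max_normalise(scores):
    """Rescale a set of similarity scores to [0, 1], Eq. (9)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # Assumption: identical scores carry no ranking information.
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


# Hypothetical raw scores that are all high, as described for untuned BERT.
raw_scores = [0.90, 0.95, 1.00]
normalised = min_max_normalise(raw_scores)
```

After rescaling, scores that were squeezed into a narrow band near 1 are spread across the whole unit interval, which makes threshold setting for the evolution graph easier to reason about.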

This methodology enables a comprehensive understanding of how themes evolve and facilitates the identification of trends, hotspots, and potential areas for interdisciplinary collaboration and innovation. Ultimately, it supports more informed decision-making and strategic planning in combating critical challenges such as forest fires.

4. Empirical Research

4.1. Data Acquisition

This study selected research papers on forest fire prediction as the empirical data source. Since its inception, forest fire prediction has been an interdisciplinary research area, particularly integrating information science and computer science. In practice, the methods and technological tools for forest fire prediction have been widely applied across numerous academic domains, encompassing not only disciplines related to information science and computer science but also natural sciences such as geographic information systems, ecology, meteorology, and social sciences including economics, sociology, psychology, and policy management. As a research subject for the evolution of scientific themes, the forest fire prediction field, with its moderate scale and interdisciplinary characteristics that foster distinctions and interconnections among subfields and topics, was an ideal choice for studying the evolution of research interests and related content.

The textual data for this study were retrieved from the Web of Science database, covering 1980 to 2023. The search query criteria were set as “TS = (‘wildfire’ or ‘forest fire’ or ‘wildland fire’ or ‘bush fire’ or ‘vegetation fire’) AND TS = (‘prediction’ or ‘forecast’ or ‘assessment’)”, leading to the identification of 13,552 publications. These search results were exported in a plain text format and further processed using RefWorks to include complete bibliographic references. The text corpus used for theme extraction and word analysis primarily consisted of titles (field label TI) and abstracts (field label AB) from each document.

The metadata from the records downloaded from the Web of Science database were organized by field, and each document was assigned a unique identifier. The data fields extracted included the publication year (PY), title (TI), authors (AU), publication name (SO), and abstract (AB), forming the initial corpus for analysis.
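
The field extraction described above might look like the following sketch. It rests on an assumption about the export layout that should be checked against the actual files: each field starts with a two-character tag, continuation lines are indented, and a line beginning with `ER` closes a record. The function name `parse_wos_export` and the sample record are hypothetical.

```python
def parse_wos_export(text, keep=("PY", "TI", "AU", "SO", "AB")):
    """Parse a tagged plain-text export into one dict per record."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, body = line[:2], line[3:].strip()
        if head == "ER":  # assumed end-of-record marker
            if current:
                record = {"id": len(records) + 1}  # unique identifier
                for field, parts in current.items():
                    sep = "; " if field == "AU" else " "
                    record[field] = sep.join(parts)
                records.append(record)
            current, tag = {}, None
        elif head.strip() == "":  # indented continuation of the previous field
            if tag in keep:
                current[tag].append(body)
        else:  # a new field tag
            tag = head
            if tag in keep:
                current.setdefault(tag, []).append(body)
    return records


# Hypothetical record in the assumed layout (not real bibliographic data).
sample = (
    "PT J\n"
    "AU Doe, J\n"
    "   Roe, R\n"
    "TI A hypothetical title about\n"
    "   wildfire prediction\n"
    "PY 2020\n"
    "ER\n"
    "\n"
    "EF\n"
)
records = parse_wos_export(sample)
```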

Subsequently, text preprocessing was performed to remove stop words. The list of stop words was referenced from commonly adopted lists in the field of information retrieval, encompassing generic words like “of”, “is”, “in”, “and”, “etc.”, and “about”, along with single-letter words and those occurring fewer than five times, to filter out the noise and enhance the quality of subsequent thematic analysis.
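
The filtering rules in the preceding paragraph can be sketched as a single pass over the tokenised corpus. The function name `filter_tokens`, the parameter `min_count`, and the toy data are hypothetical; the threshold of five occurrences follows the text.

```python
from collections import Counter


def filter_tokens(docs, stop_words, min_count=5):
    """Drop stop words, single-letter tokens, and tokens seen fewer than min_count times."""
    counts = Counter(token for doc in docs for token in doc)  # corpus-wide counts
    return [
        [
            token
            for token in doc
            if token not in stop_words
            and len(token) > 1
            and counts[token] >= min_count
        ]
        for doc in docs
    ]


# Hypothetical toy corpus and stop-word list (illustrative only).
toy_docs = [
    ["fire", "of", "a", "fire"],
    ["fire", "fire", "fire", "rare"],
]
cleaned = filter_tokens(toy_docs, stop_words={"of", "is", "in", "and", "about"})
```

Counting over the whole corpus before filtering keeps the frequency threshold independent of the order in which documents are processed.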

4.2. Exploratory Analysis

Between 1980 and 2023, research articles on forest fire prediction were published in various journals, highlighting the vibrant nature of research in this field. Over time, there has been a steady increase in the number of papers, indicating a growing community of scholars studying forest fire prediction. Both the quantity and quality of these publications have garnered significant attention. The temporal distribution of these research papers, as depicted in Figure 3, provides insights into the level of research activity within this specific domain.

This study analysed the literature focusing on forest fire prediction, revealing that this research domain is interdisciplinary, encompassing diverse technical and scientific approaches from fields such as forestry, fire science, remote sensing, atmospheric chemistry, and physics. The top three publishing platforms for wildfire prediction research were the ‘INTERNATIONAL JOURNAL OF WILDLAND FIRE’ (454 articles), ‘ATMOSPHERIC ENVIRONMENT’ (418 articles), and ‘FOREST ECOLOGY AND MANAGEMENT’ (274 articles). Figure 4 illustrates the growth of articles in the top five core journals in this specialized area.

4.3. Theme Mining

After preprocessing, a document–term matrix suitable for model computations was obtained, and the LDA-BERT model was constructed. Perplexity was adopted as the metric for evaluating the model’s performance. As shown in Figure 5, the perplexity curve illustrates that when the number of topics was set to 20, the model achieved a relatively low perplexity across the entire text corpus. Further increases in the number of topics did not significantly reduce the perplexity, so the optimal number of topics, K, was set to 20. The model parameters were configured with α = 50, k = 2.5, and β = 0.01. The modelling process was run for 2000 iterations. The top 30 words with the highest probabilities were extracted from each identified topic and ranked by frequency in descending order. Subsequently, the six most active words with the highest probabilities from each topic were selected to represent the topic’s essence. Finally, other extracted words were combined to refine and label the themes further.
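The word-selection step can be sketched as below, assuming the term-topic distribution for one topic is available as a mapping from word to probability. The function name `top_words` and the example probabilities are hypothetical; only the cut-offs of 30 candidate words and six representative words come from the text.

```python
def top_words(topic_word_probs, n_candidates=30, n_label=6):
    """Split a topic's vocabulary into representative and supporting words.

    topic_word_probs: dict mapping word -> p(word | topic).
    Returns (representative, supporting): the n_label most probable words,
    and the remaining words among the n_candidates most probable ones.
    """
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [word for word, _ in ranked[:n_candidates]]
    return candidates[:n_label], candidates[n_label:]


# Hypothetical probabilities, for illustration only.
example_topic = {"risk": 0.09, "wildfire": 0.08, "management": 0.05,
                 "wildfires": 0.04, "assessment": 0.03, "policy": 0.01}
representative, supporting = top_words(example_topic, n_candidates=5, n_label=3)
```

The supporting words are the ones an analyst would consult when refining a theme label, which is where the subjectivity acknowledged later in the paper enters.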

The perplexity curve illustrates the general trend of the model fit as the number of topics increased. It is crucial to understand that perplexity in LDA, especially when estimated using Gibbs sampling, is a stochastic measure and thus exhibits some degree of variability across different model initializations and runs. Therefore, the absolute values of perplexity should be interpreted with caution. Instead, the focus should be on the overall trend of the curve—the point at which the perplexity is minimized and begins to plateau. While providing confidence intervals for the perplexity would offer a more statistically rigorous assessment of variability for topic number selection in exploratory topic modelling, the general trend indicated by the perplexity curve, considered alongside the semantic coherence and qualitative topic evaluation, is widely accepted as a practical and informative guide. The curve shown in Figure 5 represents the typical perplexity behaviour observed in our experiments.
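As a rough sketch of this selection logic, perplexity can be computed as the exponential of the negative per-token log-likelihood, and the plateau can be located by finding the first topic number after which the relative improvement falls below a tolerance. The tolerance value, the function names, and the curve values below are assumptions for illustration, not figures from this study.

```python
import math


def perplexity(total_log_likelihood, total_tokens):
    """Perplexity = exp(-(log-likelihood of the corpus) / (number of tokens))."""
    return math.exp(-total_log_likelihood / total_tokens)


def pick_k(perplexity_by_k, tol=0.01):
    """Return the smallest K after which adding topics lowers perplexity
    by less than `tol` (as a fraction of the current value)."""
    ks = sorted(perplexity_by_k)
    for k, k_next in zip(ks, ks[1:]):
        gain = (perplexity_by_k[k] - perplexity_by_k[k_next]) / perplexity_by_k[k]
        if gain < tol:
            return k
    return ks[-1]  # no plateau found within the tested range


# Hypothetical curve, for illustration only.
curve = {5: 900.0, 10: 700.0, 15: 600.0, 20: 550.0, 25: 548.0, 30: 547.0}
chosen_k = pick_k(curve)
```

Because Gibbs-sampled perplexity varies between runs, a rule of this kind is best applied to values averaged over several initialisations and checked against topic coherence and qualitative inspection.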

Following the model training, five distinct files were generated: an information file, a term-topic probability distribution file, a document-topic probability distribution file, a document-term-topic distribution file, and a term-topic inference file. These outputs were instrumental in understanding the thematic structure of the text corpus. By leveraging the probability distributions between themes and terms, it became feasible to identify the vocabulary distribution under each theme and the probability of each term belonging to its respective themes. This information allowed for adequate theme labelling, with the distribution of themes and their constituent terms summarized in Table 1.

Based on the document-topic probability distribution, the likelihood of each document belonging to different topics was determined, enabling the calculation of theme strength. Theme strength represents the level of attention a particular theme receives within a given time frame, proportional to the number of documents addressing that theme; thus, a higher theme strength suggests a more prominent or ‘hot’ topic.
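One common way to operationalise theme strength, shown here as an assumption rather than as the exact formula used in this study, is to average each topic's probability over all documents published in a given time window. The function name `theme_strength` and the input layout are hypothetical.

```python
from collections import defaultdict


def theme_strength(doc_topics, doc_years):
    """Mean topic probability per year.

    doc_topics: list of per-document topic probability lists.
    doc_years:  publication year of each document, in the same order.
    Returns {year: [strength of topic 0, strength of topic 1, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for theta, year in zip(doc_topics, doc_years):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, p in enumerate(theta):
            sums[year][k] += p
        counts[year] += 1
    return {year: [s / counts[year] for s in sums[year]] for year in sums}
```

Plotting the resulting values year by year for each topic gives a direct view of which themes gain or lose attention over time.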

The analysis of the provided data and information revealed that discussions spanned a wide range of subjects, including wildfire management, vegetation ecology, the impact of climate change on fires, and risk assessment. While the explicit quantification of “time windows” and “theme perplexity” was not given, the content of the texts allowed for an examination of how themes have evolved and their interconnectedness.

Wildfire Risk Management (Topic 3): Over time, human activities, such as urbanisation encroaching on forested areas (the “wildland-urban interface,” or WUI), have significantly increased the fire risk. Factors like proximity to transportation networks, human presence, and energy infrastructure density affect burn areas, illustrating the effect of human and natural factors on fire susceptibility. Long-term strategies, including predictive systems and economic models, aim to help policymakers mitigate future risks, particularly considering climate change scenarios.

Vegetation and Habitat Changes (Topic 1 and Topic 4): With abandoned agricultural land reverting to vegetation and inadequate forest use leading to fuel accumulation, vegetation patterns directly influence the fire risk. Different vegetation types exhibit varying burn probabilities depending on climate conditions, highlighting the decisive role of the vegetation type and climatic variability in the fire likelihood.

Climate Change and Forests (Topic 6): Climate change significantly affects fire patterns by altering the precipitation and temperature, making vegetation drier and more prone to fires, and increasing the frequency of extreme weather events, exacerbating fire risks. Planning for future fire risks under changing climate scenarios underscores the need for adaptation strategies.

Air Quality and Environmental Impacts (Topic 7): Although direct discussion on particulate matter (PM) concentrations or air quality is limited, it can be inferred that frequent and intense fires will heighten concerns about air pollution, particularly in densely populated WUI regions.

Modelling and Data Analysis (Topic 8 and Topic 9): Recent research trends favour using advanced models and data analytics to predict fire risks, assess management measures’ cost-effectiveness, and forecast future scenarios. Tools like GIS, ecological disturbance models, ignition models, and climate-informed predictions of fire probabilities reflect the growing sophistication in analytical approaches.

In summary, even without detailed metrics for “theme perplexity,” the evolving trends in these themes demonstrate a heightened focus and complexity in research related to wildfire management, ecosystem responses, climate change implications, and predictive capabilities. This evolution underscores the scientific and management communities’ ongoing efforts to adapt to and mitigate the multifaceted impacts of wildfires amidst global environmental changes.

4.4. Thematic Evolution Analysis

By analysing the frequency and occurrence of keywords over time, trends in research topics can be identified. Figure 6 is an intertopic distance map generated by the pyLDAvis tool. It is a screenshot of an interactive display, and only a partial view is presented here. The figure visualizes the latent topics in two-dimensional space through multidimensional scaling (MDS), thereby revealing the correlation between topics. The bubble chart on the left of Figure 6 shows the distribution of the topics. Each bubble represents an independent topic, and the size of the bubble is proportional to the topic’s share of the entire document set. The distance between bubbles reflects the semantic similarity of the topics: the smaller the distance, the more similar the topics’ content. The right side of Figure 6 shows each topic’s 30 most frequent feature words. For each feature word, the light blue bar indicates the word’s overall frequency (weight) in the entire document corpus, while the dark red bar indicates the frequency (weight) of the word in the specific topic. A slider for an adjustable parameter, λ, is provided in the upper right corner of the figure. Users can adjust the λ value to change how the feature words are weighted and thereby explore the feature word distribution of the topic model.
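The effect of the λ slider can be illustrated with the relevance measure defined for LDAvis, λ·log p(w|t) + (1 − λ)·log(p(w|t)/p(w)), where p(w|t) is a word's probability within a topic and p(w) is its probability in the whole corpus. The sketch below uses hypothetical probabilities and function names; it is not the pyLDAvis implementation itself.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word to a topic for a weight lam in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(p_w_given_t / p_w)


def rank_terms(topic_probs, corpus_probs, lam=1.0, top_n=30):
    """Rank a topic's words by relevance, highest first."""
    scores = {w: relevance(p, corpus_probs[w], lam) for w, p in topic_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical probabilities, for illustration only.
topic_probs = {"fire": 0.20, "forest": 0.10, "pm": 0.05}
corpus_probs = {"fire": 0.20, "forest": 0.10, "pm": 0.005}

by_probability = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_lift = rank_terms(topic_probs, corpus_probs, lam=0.0)
```

With λ = 1 the words are ordered purely by their within-topic probability, so corpus-wide frequent words dominate; lowering λ promotes words that are distinctive to the topic even if they are less frequent overall.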

This study systematically analysed the literature corpus using thematic modelling and multidimensional scaling analysis methods. Through visual presentation, we observed the following key findings:

(1) Regarding the spatial distribution of themes, the multidimensional scaling map shows the relative positions of 10 different themes. Among them, Theme 1 is clearly dominant, as confirmed by its larger circular area in the visualization. Theme 2 lies close to Theme 1, suggesting that there may be a robust semantic association between these two themes. In contrast, Themes 6 to 10 are distributed at the left edge of the graph, suggesting that these themes have relatively weak correlations with the core themes.

(2) Regarding the word frequency analysis for Theme 1, by counting the top 30 most relevant words, we found that “fire” and “forest” had the highest relative frequencies, with about 35,000 and 20,000 occurrences, respectively. They were followed by “model,” “data,” “fires,” etc. The distribution pattern of these high-frequency words strongly suggests that Theme 1 focuses mainly on forest fire-related research. It is noteworthy that the appearance of “climate,” “management,” and “risk” further indicates that this research area is closely related to climate change and risk management.

(3) Regarding the statistical characteristics of the theme distribution, the marginal distribution pattern demonstrates the relative weights of different themes, which are quantitatively represented by concentric circles of 2%, 5%, and 10%. This distribution pattern indicates that the research themes exhibit a clear hierarchical structure, with Theme 1 having the most significant weight. In addition, the relevance measure slider (λ = 1) in the upper right corner of Figure 6 regulates the method of calculating the weights of the lexical items, which allows the ranking of terms within each topic to be tuned.

Overall, the results of this visualization reveal the existence of a dominant research theme in the research corpus centred on forest fires and also reflect that the related research involves several cross-disciplinary fields, including climate science, risk assessment, and resource management. This pattern of theme distribution provides an essential empirical basis for understanding the knowledge structure of this research area.

5. Discussion

This study employed the LDA-BERT model to analyse the thematic evolution within forest fire prediction research, identifying key themes and their interrelationships. The model successfully integrated phrase-based (LDA) and contextualized (BERT) representations, offering a more nuanced understanding of the research landscape than traditional topic modelling approaches alone. This is a significant contribution, as it addresses a critical gap in existing research: the ability to capture broad thematic structures and fine-grained semantic meanings simultaneously. Many prior studies have relied on either LDA or similar methods, leading to potential limitations in the topic coherence or contextual understanding. Our combined approach mitigates these issues.

The analysis revealed the inherently interdisciplinary nature of forest fire prediction, encompassing not only traditional fields like forestry and fire science but also increasingly incorporating disciplines such as computer science (through machine learning and data analytics), atmospheric science, ecology, and even social sciences (in risk management and policy). This interdisciplinarity presents both opportunities and challenges. While integrating diverse expertise and methodologies (e.g., GIS, remote sensing, ecological modelling, and socioeconomic analysis) offers a more holistic approach to understanding and mitigating fire risks, it also necessitates overcoming significant integration hurdles. For instance, combining ecological insights (e.g., vegetation dynamics and species-specific fire responses) with computational insights (e.g., predictive model parameters and data uncertainty) requires careful consideration of differing scales, data formats, and underlying assumptions. A computational model might prioritize predictive accuracy based on the available data. At the same time, an ecological perspective might emphasize the importance of understanding the underlying causal mechanisms, even if those mechanisms are difficult to quantify precisely. Bridging these disciplinary divides is crucial for developing practical and applicable prediction and management strategies.

Specifically, our findings highlight several key evolving trends:

Shift from Macro to Micro Level: Early research (pre-2000s, as indicated by the literature review and keyword analysis) predominantly focused on broad-scale factors like general climate patterns and large-scale vegetation classifications. More recent work (post-2010, evident in the increased prominence of terms like “model,” “data,” and “prediction,” and specific technologies) emphasizes finer scale analyses, incorporating detailed data on the fuel moisture, localized weather conditions, and individual fire behaviour. This shift is directly linked to advancements in remote sensing, data availability, and computational power.
Integration of Human Dimensions: While early research acknowledged human influence, the focus has sharpened considerably. The “Wildfire Risk Management” theme explicitly highlights the increasing importance of understanding the wildland-urban interface, human-caused ignitions, and the socioeconomic impacts of fires. This reflects a growing recognition that fire prediction and management are not solely ecological or technological problems but also deeply social ones.
Technological Advancement: The prominence of themes related to “Modelling and Data Analysis” underscores the increasing reliance on sophisticated computational tools. This includes using machine learning algorithms, GIS, and remote sensing data to improve the prediction accuracy, assess the management effectiveness, and understand complex fire dynamics.
Emphasis on Climate Change: The increasing frequency of “climate”- and “change”-related terms is a crucial takeaway. In terms of the interaction among themes, Theme 6 plays an essential role, showing that scientists are increasingly considering the role of climate change in fire regimes.

Although this study has achieved specific results, it also has some limitations. First, in selecting topic-representative words, although the selection was based on probability, there may still have been a certain degree of subjectivity, and the semantic diversity and potential semantic associations of words were not fully considered. Secondly, although the threshold setting had a specific theoretical basis in topic evolution analysis, the adaptability of the threshold under different data characteristics still needs more in-depth exploration, which may affect the accuracy of topic evolution classification. In addition, the model performance evaluation mainly relied on the perplexity index. In actual application scenarios, other performance indicators, such as the interpretability and generalization ability of the model, need to be more comprehensively evaluated.

Future research should address these limitations:

Optimize Theme Representation: Further refine the selection of theme-representative words by incorporating semantic analysis and expert knowledge to improve accuracy. Explore methods for capturing the semantic diversity and relationships between words within themes.
Develop Adaptive Thresholds: Investigate the rationality of threshold settings and develop adaptive thresholding methods to suit different datasets and research questions better. This could involve machine learning techniques that learn optimal thresholds from the data.
Enhance Model Evaluation: Expand the model performance evaluation framework to include metrics that assess the interpretability and generalization ability alongside traditional metrics like perplexity. Consider evaluating the model’s performance in real-world scenarios using case studies or simulations.
Integrate Emerging Technologies: Explore incorporating emerging technologies, such as advanced deep learning algorithms and big data processing techniques, to improve prediction accuracy and timeliness.
Strengthen Interdisciplinary Collaboration: Foster deeper interdisciplinary collaboration to promote knowledge sharing and collaborative innovation. This includes developing common frameworks and tools for researchers from different disciplines.

6. Conclusions

This study utilized the LDA-BERT similarity measurement model to analyse the thematic evolution in forest fire prediction research comprehensively. By combining the strengths of LDA (for topic extraction) and BERT (for a contextualized semantic understanding), we have demonstrated a novel approach that surpasses the limitations of traditional topic modelling techniques, which often struggle to capture both broad thematic structures and fine-grained semantic nuances. This is the key novel contribution of our work: the methodological advancement of combining these two powerful models for a more complete and accurate representation of the research landscape.

Our analysis revealed a clear shift in forest fire prediction research from broad, macro-level investigations towards more focused, micro-level studies incorporating advanced technologies and detailed data. Early research, as evidenced by our literature analysis and keyword trends, primarily focused on general ecological factors and broad climate patterns. More recent research, by contrast, places strong emphasis on detailed data on fuel conditions, localized weather patterns, and human activities, and on sophisticated modelling techniques, including machine learning and GIS. This evolution is driven by the increasing availability of high-resolution data and the growing recognition of the complex, multifaceted nature of fire risks.

Furthermore, the study highlights the field’s increasingly interdisciplinary nature, with a growing emphasis on integrating knowledge and methodologies from diverse disciplines. While this interdisciplinary approach offers significant potential for improving prediction and management strategies, it also presents challenges related to data integration, methodological differences, and communication barriers.

The dominance of “Theme 1” (centred on forest fires, the climate, and risk management) underscores the core focus of the research field. The identified relationships between themes, such as the close connection between “Theme 1” and “Theme 2” (focused on forests and trees specifically), reveal the interconnectedness of different research areas.

While this study offers valuable insights, it is essential to acknowledge its limitations, including the potential data source bias, model parameter sensitivity, and the inherent subjectivity in theme interpretation. Future research should address these limitations by exploring more advanced model integration techniques, expanding the scope of analysis to include diverse data sources and languages, and developing frameworks to facilitate interdisciplinary collaboration. Specifically, future work should focus on creating dynamic topic models to track the evolution of topics within forest fire prediction.

In conclusion, this research provides a valuable overview of the evolving landscape of forest fire prediction research. It offers a novel methodological approach that can be applied to other interdisciplinary research domains. The findings have practical implications for researchers, policymakers, and funding agencies, informing strategic decision-making and promoting more effective approaches to mitigating the growing threat of wildfires in a changing world. The identified limitations and future research directions provide a roadmap for continued advancement in this critical area.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article. The dataset is available on request from the author.

Conflicts of Interest

The author declares no conflicts of interest.


Figure 1. Technical route.

Figure 2. LDA topic model structure.

Figure 3. Temporal distribution of papers.

Figure 4. Top 10 journals in terms of publications.

Figure 5. Changes in perplexity with different numbers of themes.

Figure 6. Keyword trends topic map (multidimensional scaling analysis map, screenshot of part of interactive display).

Table 1. Core themes of research, 1980–2023.

Topic	Five High-Probability Words Related to the Topic
Topic 1	‘species’, ‘plant’, ‘habitat’, ‘vegetation’, ‘sites’
Topic 2	‘forest’, ‘tree’, ‘pine’, ‘forests’, ‘trees’
Topic 3	‘risk’, ‘wildfire’, ‘management’, ‘wildfires’, ‘assessment’
Topic 4	‘soil’, ‘vegetation’, ‘burned’, ‘area’, ‘data’
Topic 5	‘emissions’, ‘biomass’, ‘burning’, ‘carbon’, ‘emission’
Topic 6	‘climate’, ‘change’, ‘soil’, ‘forest’, ‘carbon’
Topic 7	‘pm’, ‘smoke’, ‘air’, ‘concentrations’, ‘quality’
Topic 8	‘forest’, ‘model’, ‘data’, ‘models’, ‘based’
Topic 9	‘fires’, ‘weather’, ‘conditions’, ‘model’, ‘surface’
Topic 10	‘fuel’, ‘moisture’, ‘model’, ‘models’, ‘spread’

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
