Unveiling Technological Evolution with a Patent-Based Dynamic Topic Modeling Framework: A Case Study of Advanced 6G Technologies

Jiang, Jieru; Ying, Fangli; Dhuny, Riyad

doi:10.3390/app15073783

Open Access Article

by Jieru Jiang ¹, Fangli Ying ²,* and Riyad Dhuny ³

¹ Institute of Scientific and Technical Information of Shanghai, Shanghai 200031, China

² Department of Computer Science, State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai 200237, China

³ Department of Creative Arts, Film and Media Technologies, University of Technology, Mauritius, La Tour Koenig, Pointe-aux-Sables 11134, Mauritius

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(7), 3783; https://doi.org/10.3390/app15073783

Submission received: 25 February 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 30 March 2025

(This article belongs to the Special Issue Recent Applications of Machine Learning in Natural Language Processing (NLP))

Abstract

As the next frontier in wireless communication, the landscape of 6G technologies is characterized by rapid evolution and increasing complexity, driven by the need to address global challenges such as ubiquitous connectivity, ultra-high data rates, and intelligent applications. Given the significance of 6G in shaping the future of communication and its potential to revolutionize various industries, understanding the technological evolution within this domain is crucial. Traditional topic modeling approaches struggle to capture temporal changes and uncover semantic relationships in rapidly changing and highly complex patent data, which impedes a comprehensive understanding of technological evolution in this field. This study explores the evolving technologies of 6G in patent data through a novel dynamic topic modeling framework. Specifically, we use large language models to reduce the noise in patent data during pre-processing with a prompt-based summarization technique. We then propose an enhanced dynamic topic modeling framework based on BERTopic to capture the time-aware features of evolving topics across periods. Additionally, we conduct a comparative analysis of contextual embedding techniques and leverage SBERT pre-trained on patent data to extract the content semantics of domain-specific patent data within this framework. Finally, we apply a weak signal analysis method to identify the emerging topics in 6G technology over the periods, which makes the topic evolution analysis more interpretable than traditional topic modeling methods. The empirical results, which were validated by human experts, show that the proposed method can effectively uncover patterns of technological evolution, which suggests its potential to support strategic decision-making in this highly competitive and rapidly evolving technological sector.

Keywords:

technological evolution analysis; dynamic topic modeling; 6G

1. Introduction

In the relentless pursuit of advancing wireless communication technologies, the transition from 5G to 6G emerges as a pivotal milestone, promising to transcend the limitations of current networks and unlock a new era of connectivity [1]. As the next frontier in wireless communication, the landscape of 6G technologies is characterized by rapid evolution and increasing complexity, driven by the need to address global challenges such as ubiquitous connectivity, ultra-high data rates, and intelligent applications [2,3]. Unlike its predecessor, 5G, which has significantly enhanced data rates, reduced latency, and enabled massive machine-type communications, 6G is poised to deliver even more profound advancements by exploiting higher frequency bands, including the mid band (1–24 GHz), the mmWave band (24–92 GHz), and the sub-THz band (92–300 GHz) [4]. Among these, the sub-terahertz band and the upper mid-band (7–24 GHz) have recently garnered significant attention as candidate frequency bands. Recent advancements in new spectrum technologies (NSTs) have highlighted the mid-to-high frequency spectrum (7.125–24.25 GHz, FR3) as the most promising candidate for 6G deployment [5], owing to its advantages in capacity, coverage, and spectral efficiency, as well as its integration with integrated sensing and communication (ISAC) technologies [6]. It has also been identified by the ITU as a critical technology for enabling key 6G applications, which underscores its growing importance in the evolution of next-generation wireless systems. To enable seamless, ubiquitous connectivity in the 6G era, the integrated space–air–ground–sea communication (ISAGSC) framework has emerged as a highly promising heterogeneous network solution to address coverage challenges in harsh environments such as forests, deserts, and oceanic regions [7]. 
Within this framework, AI/ML-enhanced air interface (EAI) technologies have become a key research focus, with 3GPP Release 18 actively exploring the potential of AI/ML-based algorithms to enhance the air interface [8]. Current research focuses on advancements in channel state information feedback for massive multi-input multi-output, reference signals for channel estimation, beam management, and wireless positioning. In the context of communication transmission environments, reconfigurable intelligent surfaces (RISs) [9] are regarded as a transformative technology for 6G. By altering the electrical and magnetic properties of material surfaces to control the reflection of electromagnetic waves, RISs can operate without high-energy-consuming sources such as radio frequency chains, offering cost advantages over massive multi-input multi-output and providing unique wireless communication capabilities for 6G [10]. These improvements are not just incremental; they represent a paradigm shift in the capabilities and applications of wireless networks. The significance of 6G lies in its potential to catalyze the integration of communication, computing, and sensing, thereby enabling a wide array of transformative applications. Understanding the evolution of 6G technologies is crucial for researchers, industry stakeholders, and policymakers as it will shape the future landscape of communication and drive the development of new applications and services that can significantly impact various sectors of society [11]. However, the rapid proliferation and diversification of 6G technologies, coupled with their intricate interdependencies, pose significant challenges for field-related experts in manually analyzing and tracking technological evolution trends, as the dynamic landscape often outpaces traditional analytical capabilities. 
This highlights the necessity for more advanced and automated approaches to effectively navigate and understand the complex and fast-evolving 6G technological domain.

Analyzing patents related to 6G is an effective way to understand the complex technological landscape and fast-evolving trends in this field [11]. Patents are technical documents that contain detailed technological information and play a vital role in revealing the evolutionary trajectory of core technologies [12]. In the fast-evolving field of 6G technologies, the rapidly growing number of patent documents poses an enormous challenge for retrieving and analyzing timely information from patent data to identify technical themes and unveil technological evolution processes. In recent years, based on deep learning (DL) methods for natural language processing, many novel approaches have been developed in the field of patent analysis to reduce costs by automating tasks that previously only domain experts could solve [13]. Among them, topic modeling techniques are widely used in exploratory data analysis for organizing, understanding, and summarizing large amounts of text data [14]. Dynamic Topic Models (DTMs) are a variant of topic models designed specifically for temporal archives [15]. Unlike static topic models, such as Latent Dirichlet Allocation (LDA) [16], they possess the unique ability to adapt and refine their assessments of underlying topics as new documents are added to the corpus, allowing the identification of evolving trends and patterns in the archive [17]. This adaptability feature facilitates the identification of evolving trends and patterns in patent-based technological development [18], particularly in rapidly evolving fields like 6G. DTM-based topic evolution analysis can help track how technical concepts and innovation trends in patents change over time, thereby contributing to a more comprehensive understanding of the technological evolution in this field.

However, these approaches have limitations when applied to the evolving technology of 6G. Firstly, traditional statistical methods built on the bag-of-words (BOW) assumption, such as the Term Frequency–Inverse Document Frequency (TF-IDF) model and LDA, have difficulty capturing topic semantics and handling high-dimensional data, which can make topic extraction and similarity-based connection identification inaccurate and inefficient [19]. For instance, the early research on DTMs [15] compared DTMs with static models in terms of capturing topic evolution in Science articles. The quantitative results showed that the dynamic topic model outperformed LDA-based static topic models in predicting the content of the next year’s Science articles, consistently assigning higher likelihood values. Additionally, in a study comparing LDA and BERTopic on a corpus of 22,101 climate protection articles from Austrian newspapers, LDA-like models often produced topics with very broad terminology. For example, a housing-related topic in LDA might include general terms such as ‘city’, ‘new’, and ‘project’, whereas BERTopic identified more specific salient words such as ‘heat pump’ and ‘oil heating’ for the same kind of topic, making it more interpretable and better at highlighting specific aspects [20]. In contrast to BOW-based methods, DL-based DTM methods such as BERTopic [21] handle complex semantic topic representations by using high-dimensional transformer-based embeddings. A study comparing LDA and BERTopic on Twitter data related to travel and COVID-19 found that LDA identified fewer topics, some of which were rather general and uninformative [17]. In dynamic topic modeling, LDA’s assumption of independent topics and its reliance on word frequency might lead it to overlook topic correlations over time. 
For instance, it may not effectively capture how the ‘government response’ topic evolves in relation to other topics such as ‘travel restrictions’ as the situation changes, making it less suitable for tracking dynamic topic changes. On the other hand, BERTopic, which combines Bidirectional Encoder Representations from Transformers (BERT) with c-TF-IDF, generated more specific and interpretable topics. For example, its c-TF-IDF algorithm can highlight the most representative terms within a cluster, which helps in tracking how topics change over time. Despite this ability to handle complex semantic representations, significant issues arise when applying BERTopic to patent-related technologies, especially within the rapidly changing and complex 6G landscape. First, substantial noise in the patent data, compounded by the difficulty of accurately interpreting domain-specific technological terms, poses a significant hurdle. These factors cause the model either to fail to cluster key technological points effectively or to produce overly general clusters. Consequently, the model struggles to precisely identify and represent the core technological features within the 6G domain, undermining the effectiveness of topic-based analysis in this highly specialized and dynamic field. Second, detecting early signs of emerging trends in the fast-evolving landscape is crucial for strategic planning. These early signs, referred to as weak signals [22], indicate potential but unconfirmed changes. They are characterized by low term and document frequencies but high growth rates; they may later become significant indicators of critical forces and can signal the emergence or disruption of a target technology in 6G, even when their current impact is minimal. 
The existing DTM-based applications, such as RollingLDA [23], also face challenges in identifying and tracking emerging signals at a given moment rather than in a long-term evolution. This challenge necessitates that the model adapts to the crucial temporal dynamics of emerging trends in a complex technological landscape and demonstrates interpretability regarding shifts in industry focus. Moreover, the topics generated by these DTM-based methods often fail to provide meaningful insight for interpreting emerging technologies to enhance strategic planning.

To overcome these challenges, we propose a novel dynamic topic modeling framework for patent analysis that uses Large Language Models (LLMs) to capture the semantics of evolving topics in a complex technological landscape and to identify emerging technology trends in a more interpretable manner. In this study, we collected 78,625 patents in the field of 6G. We performed prompt-based technological summarization with LLMs to filter out irrelevant information, thereby enhancing the performance of the subsequent text clustering models. Then, we compared the performance of various embedding techniques, such as word2vec, doc2vec, BERT, and PatentSBERTa, against the baseline results and used contextual embedding techniques to capture semantics and extract topics from the text data. Furthermore, we adopted an enhanced BERTopic-based dynamic topic modeling framework with random-search-based hyperparameter tuning for technological evolution analysis, which can better capture the underlying temporal relationships between topics. Finally, to quantitatively identify emerging trends and interpret the results to obtain technological insights in this field, we performed weak signal analysis on the generated topics, which enabled us to identify the development trends of technologies in Topic Emerging Maps (TEMs).

The main contributions of this paper are as follows:

We propose a novel dynamic topic modeling framework for patent analysis, which employs an enhanced BERTopic-based approach with hyperparameter optimization to capture the temporal relationships and semantics of evolving topics. Additionally, we use LLMs to pre-process the redundant patent data and reduce noise, thereby enabling automatic tracking of evolving trends in a complex landscape;
Through a comparative analysis of various embedding techniques, we selected an SBERT model pre-trained on patent data and combined it with temporally aligned clustering, achieving better semantic capture and more accurate extraction of evolving topics over time. This improvement enables a more in-depth understanding of the technical content within patent documents and uncovers the underlying relationships between topics over time;
By performing weak signal analysis on the generated topics, we demonstrate that our framework can generate more interpretable results and track technological trends in a complex landscape. The empirical studies show that our framework can serve as an alternative tool that provides meaningful insights for industry strategic planning.

2. Literature Review

To establish the theoretical foundation for our work, we reviewed three key areas that inform our approach to analyzing 6G technological evolution through patent data.

2.1. Embedding-Based Topic Extraction for Patent Analysis

As a paradigm shift in intelligent mobile communication, 6G addresses more complex and demanding challenges, such as achieving truly global coverage, supporting ultra-high data rates, and enabling ultra-low latency communications [1,11]. Given the increasing significance of 6G and the rapid pace of technological innovation in this area, understanding the evolving technology landscape in this sector becomes essential. Patents serve as crucial references for analyzing the core technologies in this field. However, patent text is characterized by specific technical and legal jargon that is often ambiguous [24]. Moreover, patent texts describe novel technologies, so they introduce terms with little prior usage. As a result, traditional keyword-based approaches [25] that rely on co-occurrence statistics, such as pointwise mutual information (PMI) or n-grams [26], often fail to adequately consider context when defining the semantics of words, and they are computationally expensive because of their large and extremely sparse embedding matrices.

In recent years, neural network-based embedding approaches have emerged to map words or sentences into fixed-length numeric vectors. Thereby, latent semantics in massive textual data can be translated into low-dimensional dense space [27]. Word2vec and doc2vec are representative embedding techniques using neural networks based on static embedding that does not change with context once learned [28,29]. Top2vec [30], a variant of Word2Vec, capitalizes on the joint embedding of document and word semantics to find more representative and informative topics than topic models such as TF-IDF or LDA. More recently, the BERT model has utilized the advantages of LLMs and emerged as a focal point for researchers engaged in topic extraction [31]. This approach relies on dynamic embedding techniques, learning novel word vectors from the corpus by considering the contextual information. In the field of patent text mining, PatentBERT [32] and PatentSBERTa [33] focus primarily on patent text analysis and have already been shown to be very effective in these tasks. Compared to static embedding, dynamic embedding might better capture the meaning of words in a context environment; however, it may require more training time and cost.

Even though embedding-based approaches integrated with clustering techniques have been employed for topic extraction and analysis [21], there is still no consensus on how to evaluate these embedding methods for identifying the evolutionary states of topics in patent text analysis. Additionally, applying embedding techniques to noisy patent data without proper pre-processing may degrade the clustering results in the topic modeling process. In this work, we pre-processed redundant patent data using LLMs and conducted a comprehensive comparative analysis of various embedding-based approaches to assess the effectiveness of dynamic topic modeling for exploring technological evolution.

2.2. Dynamic Topic Modeling for Analyzing Technological Evolution

Topic modeling has emerged as a powerful technique in the realm of patent analysis, offering valuable insights into the complex landscape of technological innovation. Many existing methods harness the information in patent documents to identify and classify technical topics. The typical output of a topic modeling algorithm is a set of topics, either fixed or flexible in number, with each topic represented by a list of top words [14]. DTMs introduce a novel approach by creating topics defined by a set of words and associated with a timestamp or time range [15]. They employ a Bayesian framework to model the temporal changes in the relative proportions of topics within documents. This Bayesian modeling enables the identification of evolving trends and patterns over time in the corpus [34]. Furthermore, building upon DTMs, topic evolution analysis [18] has been used in a variety of applications, such as discovering the evolution of research topics and innovations in scientific archives [35] and understanding the evolution trend in patent analysis [36].

In terms of statistical modeling, research on DTMs mainly falls into two categories: Probabilistic Dynamic Topic Models (PDTMs) and Algorithmic Dynamic Topic Models (ADTMs) [34]. Probabilistic topic models assume that each document in a corpus is a mixture of topics, with each topic represented as a probability distribution over the words in the corpus. PDTMs, a subset of these models, assign probabilities to different words and topics across time. This temporal probability assignment allows them to infer the underlying themes and patterns as they change [37,38]. For example, the classic DTM [15], a variant of LDA that incorporates temporal components, was among the first PDTMs. Despite their extensive application across a range of studies [39,40], PDTMs incur high computational costs when confronted with large archives containing vast vocabularies [41]. To address this cost, that work extends the class of tractable priors from Wiener processes to the more general class of Gaussian processes. This approach allows the model to be applied to large collections of text and to explore topics that evolve smoothly over a long period. Despite these efforts to address the problems of PDTMs, they still show less coherence and diversity in topic representation than ADTMs.
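
For reference, the temporal component of the classic DTM [15] is usually written as a Gaussian random walk over the topic parameters. The notation below follows the common presentation of that model and is not defined elsewhere in this text: \beta_{t,k} denotes the natural parameters of topic k in time slice t, \alpha_t the prior on per-document topic proportions, and \sigma^2 and \delta^2 the drift variances.

```latex
\begin{aligned}
\beta_{t,k} \mid \beta_{t-1,k} &\sim \mathcal{N}\!\left(\beta_{t-1,k},\, \sigma^{2} I\right), \\
\alpha_{t} \mid \alpha_{t-1} &\sim \mathcal{N}\!\left(\alpha_{t-1},\, \delta^{2} I\right), \\
p(w \mid \beta_{t,k}) &= \frac{\exp\left(\beta_{t,k,w}\right)}{\sum_{w'} \exp\left(\beta_{t,k,w'}\right)}.
\end{aligned}
```

The random walk in the first two lines is the discrete-time counterpart of a Wiener process prior, which is the class of priors that the Gaussian process extension generalizes.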

In contrast, ADTMs leverage advancements in neural networks to transform the underlying temporal probability distributions of words and documents into a fixed-length vector representation space. These models not only retain crucial words in topic descriptions but also form dense clusters of documents that are highly interpretable. For instance, BERTopic [14] is a widely-used ADTM. It leverages LLMs to generate document representations, thereby embedding context within the generated topics. This model integrates BERT and c-TF-IDF, creating dense clusters of interpretable topics while safeguarding important words in the topic descriptions. By calculating the c-TF-IDF topic representation for each time window, BERTopic enables dynamic topic modeling. This approach allows for a more nuanced understanding of how topics evolve over time, as the c-TF-IDF representation captures the relative importance of terms within each time-specific cluster, and the use of LLMs and BERT ensures that the topics are rich in context and semantic information. ANTM [34] can capture the dynamic topic over the periods using aligned UMAP to enable a flexible number of topics. Additionally, BERTrend [42] extends from BERTopic by operating within an online learning framework to dynamically classify topics.
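
The per-window c-TF-IDF idea described above can be illustrated with a minimal sketch. It assumes the published c-TF-IDF weighting, W(t, c) = tf(t, c) · log(1 + A / tf(t)), where A is the average number of words per class, and it recomputes the weights independently inside each time window. The function names and the toy input layout are hypothetical, and the sketch omits the normalization and tuning steps that the BERTopic library applies, so it should not be read as that library's implementation.

```python
import math
from collections import Counter, defaultdict


def c_tf_idf(docs_by_class):
    """Score terms per class; docs_by_class maps a class label to a list of token lists."""
    # Term frequency per class: all documents of one class are treated as one document.
    tf = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in docs_by_class.items()
    }
    # Frequency of each term across all classes.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A = average number of words per class.
    avg_words = sum(total.values()) / len(tf)
    return {
        label: {t: f * math.log(1 + avg_words / total[t]) for t, f in counts.items()}
        for label, counts in tf.items()
    }


def topic_terms_over_time(records, top_n=3):
    """records: iterable of (time_window, topic, tokens); returns top terms per window and topic."""
    by_window = defaultdict(lambda: defaultdict(list))
    for window, topic, tokens in records:
        by_window[window][topic].append(tokens)
    result = {}
    for window, by_topic in by_window.items():
        scores = c_tf_idf(by_topic)
        result[window] = {
            # Sort by descending score, breaking ties alphabetically for determinism.
            topic: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
            for topic, s in scores.items()
        }
    return result
```

Comparing the returned term lists for the same topic across consecutive windows gives a simple view of how the topic's vocabulary shifts over time.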

Despite their advantages, current dynamic topic modeling approaches cannot be directly applied to identify emerging technological trends in patent analysis. Firstly, the landscape of 6G is complex and full of domain-specific technological terms. These terms often have highly specialized semantics that are difficult for existing models to interpret accurately, leading to the misclassification or omission of crucial information during topic extraction. Secondly, patent data are not only voluminous but also highly redundant. Existing dynamic topic modeling methods struggle to effectively filter out this redundancy, which can be challenging for clustering the evolution of emerging topics in the field. Moreover, the evolving technological landscape in the patent domain is characterized by rapid and discontinuous changes. With their relatively static architectures and algorithms, current models fall short of capturing these dynamic shifts in a timely and accurate manner. In contrast, we propose a novel dynamic topic modeling framework in this study, which harnesses the capabilities of LLMs to mitigate noise in redundant patent data laden with domain-specific technological terms, can better adapt to these challenges, and enables the automatic tracking of evolving technological trends within a complex technological landscape.

2.3. Weak Signal Analysis for Emerging Trend Identification

Emerging technologies in the complex landscape of 6G can have major economic impacts and affect strategic stability. The early detection of new technology trends is critical for strategic planning, as it enables users to identify opportunities and risks quickly and react to them accordingly by formulating appropriate research, development, and innovation strategies [35]. Yet, early identification of emerging technologies in large amounts of patent data remains challenging.

Early research attempted to use co-word analysis to identify emerging research themes in information security by analyzing patterns and trends [43]. Xu et al. [44] assessed the degree of topic association by considering the strength of topic linkages in multiple networks such as co-word and co-author networks. Miao et al. [35] produced a semantic analysis method to extract terms involving products, functions, and technologies from patents. They constructed the technology road mapping according to the term structure and opinions of domain experts. Park et al. [45] presented a quantitative approach to discover potential future technological opportunities from the patent–citation network. Additionally, detecting weak signals, trends, and issues in an evolving technology landscape as early as possible has been proven to be an alternative solution for strategic planning [46]. In emerging topic research, weak signals can be regarded as the early stage of emerging topics, with earliness being their primary characteristic [47]. Weak signals are identified as keywords with low frequency but high growth potential. Analyzing weak signals of emerging topics is an early identification task for specific innovative content, which can help to identify emerging topics early and provide valuable insights into their subsequent development and evolution.

Many past works on weak signal detection are keyword-based. For example, portfolio maps, pioneered by the authors of [48], involve constructing Keyword Emergence Maps (KEMs) and Keyword Issue Maps (KIMs) based on two key metrics: the Degree of Visibility (DoV), which quantifies the frequency of a keyword within a document set, and the Degree of Diffusion (DoD), which measures the document frequency of each keyword. However, KEMs and KIMs have two major drawbacks: by focusing on keywords only, they can miss the context surrounding a weak signal, and their output is a single snapshot, which gives no clear indication of evolution over time. Maitre et al. [49] integrated LDA and Word2Vec to detect weak signals in weakly structured data. Furthermore, El Akrouchi et al. [50] introduced two functions for deep filtering metrics and the evolution of topics, using numerous metrics to identify potential weak signals.
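
As a rough illustration of the two metrics, the sketch below computes DoV and DoD per period. It assumes a formulation commonly used in the keyword portfolio map literature, in which the term frequency (for DoV) or document frequency (for DoD) of a keyword in period j is divided by the number of documents in that period and multiplied by a time discount 1 − tw · (n − j), with n the number of periods. This formula, the function name, and the time weight tw = 0.05 are assumptions for illustration and are not taken from this text.

```python
from collections import Counter


def dov_dod(periods, time_weight=0.05):
    """Compute DoV and DoD per period.

    periods: list of periods, oldest first; each period is a non-empty list
    of documents, and each document is a list of keyword tokens.
    Returns two lists (one dict per period) mapping keyword -> score.
    """
    n = len(periods)
    dov, dod = [], []
    for j, docs in enumerate(periods, start=1):
        num_docs = len(docs)
        # Older periods are discounted; the most recent period has discount 1.
        discount = 1 - time_weight * (n - j)
        tf = Counter(tok for doc in docs for tok in doc)       # total occurrences
        df = Counter(tok for doc in docs for tok in set(doc))  # documents containing the keyword
        dov.append({t: tf[t] / num_docs * discount for t in tf})
        dod.append({t: df[t] / num_docs * discount for t in df})
    return dov, dod
```

A keyword whose DoV and DoD are low on average but grow quickly from one period to the next would be a candidate weak signal under this kind of analysis.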

Traditional topic modeling methods have been useful for weak signal detection, but they have significant limitations. They rely on a pre-set number of topics, which makes them inflexible and requires prior knowledge that is hard to obtain. They also cannot use the context-aware embeddings from modern pre-trained models, leading to less detailed analysis. Furthermore, these methods work statically, overlooking how weak signals change over time. In identifying emerging technological trends in 6G, weak signals, as the early signs of core technologies, may emerge or fade quickly. Ignoring these changes can cause important trends to be misinterpreted or missed. Therefore, inspired by WISDOM [51], we propose an enhanced DTM technique with a continuous weak signal analysis method to handle emerging trends in the fast-evolving technological landscape of 6G.

3. Methodology

As illustrated in Figure 1, the methodology in this study is structured around a comprehensive DTM framework for managing evolving topics over time and exploring emerging technological topics. This framework consists of three main modules: a data acquisition and pre-processing module, a BERTopic-based DTM module, and a weak signal analysis module. First, the data acquisition module collected patent records from the incoPat [52] database, extracted titles and abstracts, and pre-processed them with an LLM-based summarization technique to obtain more relevant input without removing the technical content. Then, after a comparative performance analysis, we deployed a specifically selected pre-trained transformer-based model to generate document-level embeddings. Subsequently, we performed a hyperparameter search to enhance the performance of the BERTopic-based framework. Finally, a weak signal analysis process was applied to interpret the weak signals in the evolving technological field and identify the emerging technological trends over time. Our automatic framework can provide valuable insights, in the form of interpretable analyses of emerging technological trends, for strategic decision-making in the 6G industry. The following subsections describe each module’s process and contributions in detail.

3.1. Data Acquisition and Text Embedding

The first component of our framework focuses on acquiring relevant patent data and transforming it into meaningful vector representations that capture semantic information.

3.1.1. Data Acquisition

We began by collecting 6G patent data in five emerging fields, namely, NSTs, RISs, ISAC, ISAGSC, and EAIs. All the related patent records were retrieved from the incoPat database [52], a globally renowned patent information platform with abundant patent resources. To ensure a comprehensive collection of relevant patents in these fields, this study retrieved the data using specific search formulas set by experts and obtained 78,625 patent records in 6G after an initial screening process that eliminated irrelevant patents. Figure 2 shows that these five technological fields in 6G have been highly active in recent years and contain many emerging technologies, which may attract more industry attention. The patents’ titles and abstracts, being the primary repositories of condensed yet valuable technological information, were extracted for further pre-processing.

Subsequently, the abstracts and titles were pre-processed to reduce noise for subsequent analysis. Inspired by the work of Khandelwal [53], we employed an LLM-based summarization process to eliminate noise in the technical context and enhance the performance of dynamic topic modeling for 6G patent text analysis. In the context of dynamic topic modeling for analyzing technological trends in 6G, the patent data are often voluminous and contain a wealth of information, including technical details, background knowledge, and future research directions. First, neural topic models such as BERTopic show promise in topic modeling but can be further optimized with more concise and directed input. Second, the large amount of noise and domain-specific jargon in 6G patent texts can obscure the underlying themes. Summarization can help distill the essential information, reduce noise, and improve the quality of topic modeling.

3.1.2. LLM-Based Data Pre-Processing

To summarize the large volume of patent data efficiently, we adopted an online API-based LLM approach using ChatGLM (GLM-4-Flash). We chose ChatGLM because it supports seamless parallel processing of massive numbers of queries, which is more cost-effective and efficient than deploying local LLMs.

In this work, we adopted Khandelwal’s [53] long and short summarization strategies using LLMs to set the different numbers of words <number> for content summarization. We further employed role-playing and few-shot in-context learning methods combined with constrained output and enhanced the prompts to further improve the performance of the DTM during the subsequent processing steps, as shown in Figure 3.
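As an illustration of how such a prompt could be assembled, the following minimal sketch combines a role statement, optional few-shot examples, and a word-count constraint. The prompt wording, the function name `build_summary_prompt`, and the mapping of "short" to 15–30 words and "long" to 30–60 words are assumptions made for illustration; the actual prompts are those shown in Figure 3.

```python
# Hypothetical word ranges per strategy (assumed mapping: short = 15-30, long = 30-60).
WORD_RANGES = {"short": (15, 30), "long": (30, 60)}


def build_summary_prompt(title, abstract, strategy="long", examples=()):
    """Assemble a summarization prompt (illustrative wording, not the paper's prompt)."""
    low, high = WORD_RANGES[strategy]
    lines = [
        # Role-playing: give the model a domain persona.
        "You are a patent analyst specializing in 6G wireless technologies.",
        # Constrained output: bound the summary length.
        f"Summarize the patent below in {low}-{high} words, keeping only its core technical content.",
        "Return the summary only, with no preamble.",
    ]
    # Few-shot in-context learning: prepend worked examples if provided.
    for example_text, example_summary in examples:
        lines.append(f"Example patent: {example_text}")
        lines.append(f"Example summary: {example_summary}")
    lines += [f"Title: {title}", f"Abstract: {abstract}", "Summary:"]
    return "\n".join(lines)
```

The resulting string would then be sent to the LLM API of choice; the request itself is omitted here because it depends on the provider's client library.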

The pre-processing steps include removing hyperlinks, punctuations, and numeric values, converting text to lowercase, and eliminating unnecessary whitespaces. After that, tokenization is performed to assign unique identifiers to each unique word in the corpus. Then, Part-Of-Speech (POS) tagging is performed to mark each token with a POS tag, like noun, adjective, verb, or adverb, according to its context, followed by lemmatization to transform tokens into their root forms. Finally, n-grams, which are continuous sequences of n words in the text corpus, are identified to uncover the language’s contextual meanings.
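The cleaning, tokenization, and n-gram steps above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: POS tagging and lemmatization are omitted because they require an NLP library, the function names are hypothetical, and only standalone numbers are removed so that tokens such as "6g" survive (an assumption, since the text does not say how such tokens were handled).

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Remove hyperlinks, punctuation, and standalone numbers; lowercase; trim whitespace."""
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\b\d+\b", " ", text)  # assumption: keep alphanumeric tokens like "6g"
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split the cleaned text into word tokens."""
    return clean_text(text).split()


def ngrams(tokens, n):
    """Return contiguous n-word sequences joined by underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `ngrams(tokenize("RIS-aided MIMO"), 2)` yields bigrams such as `ris_aided`, which resemble the underscore-joined topic labels reported later in the paper.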

3.2. Enhancing BERTopic-Based Framework for Dynamic Topic Modeling

Building upon the embedded representations, we developed an enhanced version of BERTopic to effectively model the temporal evolution of technological topics within the 6G domain.

3.2.1. Embedding Method Benchmark for Topic Extraction

Text embedding represents each unique word in a corpus as an N-dimensional vector. In essence, it maps words into an N-dimensional vector space, from which valuable insights, such as the semantic similarity between different words, can be deduced. To identify an effective and efficient text embedding method for unraveling dynamic technological topic evolution in 6G communication, we evaluated the performance of several embedding approaches in terms of their ability to capture content semantics and extract technological topics. To guarantee a fair comparison during the model selection phase, we executed the document clustering tasks using the same clustering algorithm.

Term Frequency–Inverse Document Frequency (TF-IDF) [54] is an early text representation technique. Essentially, it assigns each term a single scalar weight per document, which can effectively represent the importance of words in a text within the context of a document collection. However, TF-IDF has several limitations when capturing the dynamic nature of technological topics in 6G communication. Firstly, it is a static approach that only considers the frequency of words in documents without considering the semantic context in which they are used. In the rapidly evolving field of 6G, words may have different meanings depending on the specific technological context, and TF-IDF fails to capture these nuances. Furthermore, TF-IDF does not account for the emergence of new words and concepts that are characteristic of dynamic technological evolution. As 6G technology develops, new terminologies are constantly being introduced to describe novel features and technologies. TF-IDF, relying on historical document frequencies, may not be able to assign appropriate weights to these new words, potentially missing important aspects of the evolving technological topics.

In this study, we also investigated static embedding techniques, such as Doc2Vec (built on Word2Vec), which are based on predicting word probabilities to map words into low-dimensional dense vectors. Word2Vec embeddings can be obtained through two neural network methods: Skip-gram and continuous bag-of-words (CBOW). In Skip-gram, the target word is input to predict neighboring words; in CBOW, neighboring words are input to predict the target word. This way, Word2Vec can capture semantic relationships between words, placing related words closer in its vector space. However, Word2Vec has difficulty fully grasping the context-sensitive and specialized meanings of some 6G-specific terms. In addition, its embedding quality is highly dependent on the training data. If the training data do not cover all aspects of the fast-evolving 6G field, the embeddings may not accurately represent word relationships.
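The difference between the two training objectives can be made concrete by generating the training examples each one consumes. The following is a minimal sketch with hypothetical function names and a fixed symmetric context window; it only builds the input/output pairs and does not train a model.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each target word is the input, each neighbor is a word to predict."""
    pairs = []
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the neighboring words are the input, the target word is predicted."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples
```

With the tokens `["reconfigurable", "intelligent", "surface"]` and `window=1`, Skip-gram produces four (input, output) pairs, whereas CBOW produces three (context, target) examples, one per position.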

Therefore, we further introduced contextual embedding techniques. They leverage pre-trained transformer-based LLMs, such as Data2Vec [55] and SBERT [56], to effectively capture the semantic and contextual information of words in a document. The attention mechanism in transformers allows for a more nuanced understanding of the relationships between words, enabling it to handle the complex semantic landscape of 6G technology. Furthermore, we considered the PatentSBERTa [33] in this study, which is specially trained on patent data and has never been studied in a DTM before. We conducted comparative experiments to evaluate these contextual embedding techniques and obtain a suitable embedding model for our DTM framework.

In the context of a document corpus $D$, each document (title and abstract) $d$ is assigned a vector representation $y$. More precisely, the document embedding $y$ for a given document $d$ is obtained through a mapping function $F$, denoted as $F : d \mapsto y \in ℝ^{z}$. This mapping effectively captures both the contextual and semantic information from the corpus. Document embeddings are designed to represent words and documents within a low-dimensional feature vector space. In this space, the embedding dimension $z$ is significantly smaller than the vocabulary size (i.e., the count of distinct words in $D$). This dimensionality reduction is a key characteristic of word embedding methodologies as it enables more efficient processing and representation of text data while still retaining essential semantic and syntactic information. It allows for better generalization and comparison of documents, facilitating topic modeling tasks for 6G.

3.2.2. Hyperparameter Tuning for BERTopic

In the rapidly evolving landscape of 6G technology, the continuous tracking of the ‘topic evolution’ is crucial for the understanding of emerging technologies in 6G. Traditional topic modeling approaches are naturally static and thus do not allow for the modeling of documents ordered sequentially. It is understood that dealing with dynamic, constantly changing volumes of patent data would require transitioning to a DTM. Given the cyclic nature of topic evolution, we adopt the BERTopic-based framework, as shown in the second part of Figure 1.

Initially, BERTopic was fitted as if the data had no temporal aspect, thereby creating a general topic model. After obtaining the document embeddings from the contextual embedding model, the critical components used in our BERTopic-based framework are as follows:

We first performed the Uniform Manifold Approximation and Projection (UMAP) algorithm to reduce their dimensionality. The high-dimensional vectors obtained from the embedding model are often redundant and computationally expensive to process, especially in the subsequent clustering step. UMAP aims to find a low-dimensional representation of the data that preserves the local and global structure of high-dimensional data. It constructs a fuzzy topological representation of the data in the high-dimensional space and then optimizes a low-dimensional representation approximating this topology. In processing 6G patent data, UMAP can transform the high-dimensional vectors (e.g., vectors in $ℝ^{z}$ ) into two- or three-dimensional vectors (e.g., in $ℝ^{2}$ or $ℝ^{3}$ ), which can significantly reduce the computational complexity of the subsequent clustering step and is easy to visualize;
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) was used to group the low-dimensional vectors obtained from UMAP into clusters. HDBSCAN is a hierarchical, density-based clustering method, which allows it to handle clusters of irregular shapes. The topics in the 6G landscape may have complex and non-spherical distributions in the low-dimensional space, and HDBSCAN can effectively identify these clusters. Let $V^{'}$ be the set of low-dimensional vectors obtained from UMAP; HDBSCAN partitions $V^{'}$ into $k$ clusters $C_{1}, \dots, C_{k}$ and a set of outliers $O$;
Cluster tagging using c-TF-IDF is the final step in BERTopic to extract topic representation for each cluster at different time frames. Global representation was utilized to identify the main topics likely to emerge at different timesteps. For each topic and timestep, the c-TF-IDF representation was calculated to identify representative terms for each cluster. The mathematical details of c-TF-IDF are presented in the upcoming section. In the tasks of the DTM, both global fine-tuning and evolutionary fine-tuning were used to enhance the results. Firstly, global fine-tuning was carried out. This method combines the c-TF-IDF representation of a topic at a particular timestep with the global representation. By taking an average of these two representations, the topic representation at that timestep was adjusted to move slightly closer to the global one. This approach enables the topic to maintain some of its unique characteristics while also aligning with the overall trends captured by the global representation. Secondly, evolutionary fine-tuning was performed. Here, the c-TF-IDF representation of a topic at a given timestep was averaged with the c-TF-IDF representation of the same topic at the previous timestep. This process allows the topic representation to gradually change and adapt over time, reflecting the dynamic nature of the data. As new information becomes available at each timestep, the topic representation evolves to incorporate these changes. These keywords effectively summarize the main concepts and characteristics of the cluster, providing valuable insights into the technological topics within that time period;
Finally, integrating KeyBERT for keyword extraction with the Maximal Representation Model yielded promising results. KeyBERT, leveraging pre-trained language models like BERT, can effectively identify the most salient keywords within the text. In the context of 6G patent emerging technologies analysis, these keywords are crucial as they distill the essence of the patent content. When combined with the Maximal Representation Model, the extracted keywords enhance the model’s ability to prioritize informative content. The Maximal Representation Model aims to capture the most significant aspects of the data. By feeding the keywords identified by KeyBERT into this model, it can better focus on the important terms within the data.
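The two fine-tuning steps described in the c-TF-IDF item above amount to simple vector averaging. The following minimal sketch illustrates them on plain Python lists; the function names are hypothetical, and using the already tuned vector of the previous timestep (rather than the raw one) in the evolutionary step is an assumption, since the text does not specify which is used.

```python
def average(a, b):
    """Element-wise mean of two equally sized vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def tune_topic_over_time(per_step, global_repr, global_tuning=True, evolution_tuning=True):
    """Adjust one topic's c-TF-IDF vectors across timesteps.

    per_step: list of c-TF-IDF vectors, one per timestep (chronological order).
    global_repr: the topic's c-TF-IDF vector from the time-agnostic model.
    """
    tuned = []
    previous = None
    for current in per_step:
        vector = list(current)
        if global_tuning:
            # Pull the timestep representation toward the global one.
            vector = average(vector, global_repr)
        if evolution_tuning and previous is not None:
            # Smooth with the previous timestep (assumption: the tuned vector).
            vector = average(vector, previous)
        tuned.append(vector)
        previous = vector
    return tuned
```

For instance, with `per_step = [[1.0, 0.0], [0.0, 1.0]]` and `global_repr = [0.5, 0.5]`, the first timestep becomes `[0.75, 0.25]` and the second is smoothed to `[0.5, 0.5]`, showing how an abrupt change in the raw representation is damped.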

Additionally, we performed a random search for hyperparameter optimization to select the hyperparameters of the UMAP and HDBSCAN models based on the DBCV value [57], which is a score for the goodness of a density-based clustering model spanning [−1, 1]. During each iteration of the random search, we randomly sampled 25% of the abstracts from the data prepared. In the optimal run, we achieved a DBCV score of 0.45. More detailed parameter settings are explained in the Experimental Results section.
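The random search loop can be sketched as follows. This is a minimal sketch, not the authors' implementation: `score_fn` is a hypothetical callable standing in for "fit UMAP and HDBSCAN on the sample and return the DBCV score", and the search space, iteration count, and seed are illustrative assumptions. Only the 25% subsampling per iteration is taken from the text.

```python
import random


def random_search(documents, search_space, score_fn, n_iter=20, sample_frac=0.25, seed=0):
    """Randomly sample hyperparameters and keep the best-scoring combination.

    search_space: dict mapping a parameter name to a list of candidate values.
    score_fn: callable(sample, params) -> float, higher is better (e.g., DBCV in [-1, 1]).
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(documents) * sample_frac))
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one candidate value per hyperparameter.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        # Evaluate on a fresh random subset of the abstracts.
        sample = rng.sample(documents, sample_size)
        score = score_fn(sample, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Evaluating each candidate on a subsample keeps the cost of every iteration low, at the price of a noisier score estimate.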

3.3. Topic Representation and Emerging Trend Analysis

After extracting the evolving topics, we employed specialized techniques to represent them meaningfully and analyze emerging technological trends within the 6G landscape.

3.3.1. Topic Representation

To obtain representative technical keywords for each cluster in each time frame within documents, the topic representation method was employed. This method is crucial for identifying the most relevant terms or phrases that characterize a particular concept in each time frame.

We adopted c-TF-IDF, a variation of TF-IDF that considers the class labels of documents, to compute a list of $m$ representative terms describing each document cluster’s semantic contents. c-TF-IDF calculates the importance of a word $w$ in a document $d$ of class $c$ within a collection of documents $D$ as follows:

$$\text{c-TF-IDF}(w, d, c, D) = \text{TF}(w, d) \times \text{IDF}_{c}(w, D) \tag{1}$$

where $\text{TF}(w, d)$ is the term frequency of word $w$ in a document $d$, and $\text{IDF}_{c}(w, D)$ is the inverse document frequency of word $w$ in the documents of the collection $D$ that belong to class $c$. It is calculated as follows:

$$\text{IDF}_{c}(w, D) = \log \frac{|D_{c}|}{|\{d \in D_{c} : w \in d\}|} \tag{2}$$

where $|D_{c}|$ is the total number of documents in class $c$. For a given document cluster in a particular time frame $t$, we calculated the c-TF-IDF values for all words in the documents within the local cluster. Then, we sorted these words in descending order based on their c-TF-IDF values. The top $m$ words were selected as the representative technical keywords for that cluster in the time frame. Therefore, these keywords can effectively summarize the main concepts and characteristics of the cluster, providing valuable insights into the technological topics within that time period.
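Equations (1) and (2) can be evaluated directly on tokenized documents. The sketch below follows the two equations as written, not the internal implementation of any library; treating TF as a raw count and ranking cluster terms by summing per-document scores are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def idf_c(word, class_docs):
    """Eq. (2): log(|D_c| / number of documents in D_c containing the word)."""
    containing = sum(1 for doc in class_docs if word in doc)
    if containing == 0:
        return 0.0  # word absent from the class; no weight assigned
    return math.log(len(class_docs) / containing)


def c_tf_idf(word, doc, class_docs):
    """Eq. (1): term frequency in the document (raw count, assumed) times IDF_c."""
    return Counter(doc)[word] * idf_c(word, class_docs)


def top_terms(class_docs, m=3):
    """Rank the words of a cluster by their summed scores (aggregation is an assumption)."""
    scores = Counter()
    for doc in class_docs:
        for word, tf in Counter(doc).items():
            scores[word] += tf * idf_c(word, class_docs)
    return [word for word, _ in scores.most_common(m)]
```

Note that, under Eq. (2) as written, a word that occurs in every document of the class receives a weight of $\log 1 = 0$, so the ranking favors terms that are frequent within some documents of the cluster but not ubiquitous across it.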

3.3.2. Emerging Trend Analysis

The early identification of emerging signals and trends in the constantly evolving technology landscape of 6G represents a substantial challenge. By harnessing the concept of weak signals within the extracted topics, this study is dedicated to pinpointing technological topics currently garnering limited attention yet possessing substantial promise for future development. In the realm of 6G patent research, discerning the annual evolution of key technologies is pivotal for foreseeing industry advancements. After leveraging BERTopic to extract evolving topics from 6G patent data, we adopted the Topic Emergence Map (TEM) approach to analyze the emerging trends in the topic evolution. By visualizing each topic’s signal state across different years as a heatmap, we can gain more intuitive insights into the development of 6G technology topics and detect emerging trends.

In order to quantitatively extract future signals and make the result more interpretable, an early study [48] inspired us to employ topic proportions obtained from our time-aware dynamic topic modeling framework and use TEMs to identify future signs of emerging technologies in related 6G fields.

The TEM is a valuable method for identifying future-oriented signals within research topics. Its underlying principle lies in recognizing that the early detection of emerging signals and trends is crucial in a rapidly evolving technological landscape. The TEM is constructed using two key dimensions. The x-axis represents the average topic proportions derived from the BERTopic model, which reflects a topic’s relative significance in the dataset at a given time frame. The y-axis, on the other hand, denotes the rate of increase in topic proportions, measuring how quickly a topic’s importance changes over time. Based on the mean of the x-axis values and a y-axis value of zero, the TEM is partitioned into four quadrants: strong signals (high average proportion and high growth rate), weak signals (low average proportion but high growth rate), latent signals (low average proportion and low growth rate), and not strong but well-known (NSWK) signals (high average proportion and low growth rate).

To create a heatmap based on the TEM principle and analyze the evolution of 6G technology topics, a series of steps is involved. First, the 6G patent data, from which topics have been extracted by BERTopic, need further processing: each topic is associated with its corresponding patent documents and timestamped according to the patent application year. Then, the topic proportions for each year are calculated by dividing the number of patent documents of a specific topic in a year by the total number of patents in that year. Simultaneously, the rate of increase in topic proportions between consecutive years is determined. Afterward, a matrix with rows representing different 6G technology topics and columns representing different years is created, and each cell is colored according to the topic’s TEM quadrant in the corresponding year. Finally, data visualization tools are employed to generate the heatmap. Observing the heatmap allows us to analyze the evolution of 6G technology topics, identify trends, and compare different topics in the same year, enabling a comprehensive understanding of the 6G technology landscape over time.
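The quadrant assignment behind the heatmap can be sketched in a few functions. This is a minimal sketch under stated assumptions: the growth rate is taken as the mean relative change between consecutive years (the text does not give the exact formula), a growth rate of exactly zero is treated as low, and all function names are hypothetical.

```python
from statistics import mean


def topic_proportions(counts_by_year):
    """Share of each topic among all patents of a year: {topic: {year: proportion}}."""
    proportions = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        for topic, n in counts.items():
            proportions.setdefault(topic, {})[year] = n / total if total else 0.0
    return proportions


def mean_growth_rate(series):
    """Mean relative change between consecutive values (assumed definition)."""
    rates = [(b - a) / a for a, b in zip(series, series[1:]) if a > 0]
    return mean(rates) if rates else 0.0


def classify(avg_proportion, growth, x_threshold):
    """Map a topic to its TEM quadrant (x split at the mean, y split at zero)."""
    if growth > 0:
        return "strong" if avg_proportion >= x_threshold else "weak"
    return "NSWK" if avg_proportion >= x_threshold else "latent"


def tem_labels(counts_by_year):
    """Label every topic for the period covered by counts_by_year."""
    years = sorted(counts_by_year)
    stats = {}
    for topic, by_year in topic_proportions(counts_by_year).items():
        series = [by_year.get(year, 0.0) for year in years]
        stats[topic] = (mean(series), mean_growth_rate(series))
    x_threshold = mean(x for x, _ in stats.values())
    return {topic: classify(x, g, x_threshold) for topic, (x, g) in stats.items()}
```

Calling `tem_labels` once per period yields one column of the heatmap matrix; a topic whose share is small but growing is labeled a weak signal.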

We segmented the examined period into three-year intervals with one year overlapping to track the temporal evolution and changes in signal patterns. We refer to these intervals as $\{p_{i}\}$. TEMs were then built for each period, with a signal label assigned to each topic within the TEM across the examined periods.
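One way to generate such intervals is shown below. This is a minimal sketch assuming that "one year overlapping" means consecutive periods share exactly one boundary year; the function name is hypothetical.

```python
def overlapping_periods(start, end, length=3, overlap=1):
    """Return (first_year, last_year) tuples of fixed length sharing `overlap` years."""
    step = length - overlap
    periods = []
    first = start
    while first + length - 1 <= end:
        periods.append((first, first + length - 1))
        first += step
    return periods
```

Under this assumption, a 2016 to 2024 span yields four periods: 2016–2018, 2018–2020, 2020–2022, and 2022–2024.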

4. Experimental Results and Comparative Analyses

The experiments conducted in this study were designed to comprehensively validate the three key components of the methodology proposed in Section 3. Given the dynamic and complex nature of 6G technology and the limitations of existing research methods in accurately analyzing its technological evolution, our experiments aimed to provide robust evidence for the effectiveness of our approach.

4.1. Datasets and Experimental Setup

Our experimental evaluation begins with describing the datasets used and the specific parameters and metrics employed to assess model performance.

4.1.1. Datasets

To explore the technological evolution in the domain-specific fields in 6G, we collected the datasets of five emerging fields in 6G using a carefully crafted search query developed by domain experts. These fields include new spectrum technologies (NSTs), reconfigurable intelligent surfaces (RISs), integrated sensing and communication (ISAC), integrated space–air–ground–sea communication (ISAGSC), and enhanced air interfaces (EAIs). The detailed statistics of the datasets are illustrated in Table 1. Given the restricted number of topics suitable for modeling in the early years, the time period from 2016 to 2024 was selected for performing the DTM. This time frame provides a more comprehensive and representative data source for the research, compensating for the limitations in the initial stages and enabling a more in-depth exploration of the technological evolution in these 6G fields.

4.1.2. Evaluation Metric

As the number of topics produced by a dynamic topic model does not adequately gauge its quality or superiority compared to other models, we assessed the performance of the topic models from the perspectives of human interpretability and diversity. Specifically, we adopted two widely used metrics, namely, topic coherence and topic diversity, for this evaluation.

Topic coherence (TC) [58], referring to the coherence of the top words in a topic, indicates the interpretability of a topic containing $m$ words. If the words in a topic often appear together in the same documents, the topic has high coherence. We employed the popular co-occurrence-based metric $C_{V}$ as the coherence metric [59], which outperforms earlier metrics such as Normalized Pointwise Mutual Information (NPMI) [60], itself an improved version of PMI [61]. The formula for PMI is as follows:

$$\text{PMI}(w_{i}, w_{j}) = \log \frac{P(w_{i}, w_{j})}{P(w_{i}) P(w_{j})} \tag{3}$$

where $P(w_{i})$ and $P(w_{j})$ are the probabilities of words $w_{i}$ and $w_{j}$ being present in the window, and $P(w_{i}, w_{j})$ is the probability that both are present in the window. NPMI normalizes PMI to keep the values between −1 and 1, and the result is raised to the power $\gamma$ for better differentiation. Its formula is:

$$\text{NPMI}(w_{i}, w_{j})^{\gamma} = \left(\frac{\text{PMI}(w_{i}, w_{j})}{-\log P(w_{i}, w_{j})}\right)^{\gamma} \tag{4}$$

The context vector for a word $w_{i}$ is then

$$\vec{v}(w_{i}) = \left\{\sum_{w_{j} \in W} \text{NPMI}(w_{i}, w_{j})^{\gamma}\right\} \tag{5}$$

where $W$ is the vocabulary. For a given topic, the cosine similarity between the context vectors of all its top words was calculated. These similarities were then aggregated, and their average is the topic coherence $C_{V}$. To measure the association between a topic and its time slice, we also used the documents within that time slice as a reference corpus to estimate the probabilities of word occurrences. In practice, a higher TC value implies that the topic exhibits greater coherence within the documents of the time slice.
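Equations (3) and (4) can be illustrated by estimating the probabilities from co-occurrence counts. In this minimal sketch each document is treated as a single window, which is a simplifying assumption; the handling of the edge cases where the joint probability is 0 or 1 is also an assumption, and the function name is hypothetical.

```python
import math


def npmi(windows, w_i, w_j, gamma=1.0):
    """NPMI of two words raised to the power gamma, following Eqs. (3) and (4).

    windows: list of sets of words (here, one set per document).
    Negative NPMI values combined with a non-integer gamma are not handled.
    """
    n = len(windows)
    p_i = sum(w_i in w for w in windows) / n
    p_j = sum(w_j in w for w in windows) / n
    p_ij = sum(w_i in w and w_j in w for w in windows) / n
    if p_ij == 0.0:
        return -1.0  # words never co-occur (conventional lower bound)
    if p_ij == 1.0:
        return 1.0  # words co-occur in every window (assumed upper bound)
    pmi = math.log(p_ij / (p_i * p_j))  # Eq. (3)
    return (pmi / -math.log(p_ij)) ** gamma  # Eq. (4)
```

Using only the documents of one time slice as `windows` mirrors the per-slice reference corpus described above.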

Topic diversity (TD) [62] is used to evaluate the degree of difference in representing a given set of topics. For a given topic within a specific time slice, we calculated the proportion of its 15 top-ranked words that both occur only once and are present within that same time slice. TD is defined as the percentage of unique words in the top $N$ words of the topics. The formula for TD is relatively straightforward:

$$\text{TD} = \frac{\text{Number of unique words in top } N \text{ words}}{\text{Total number of top } N \text{ words}} \tag{6}$$

TD is used to evaluate the distinctiveness of topics. A TD value close to 0 suggests many repeated words among the topics, indicating redundant topics. On the other hand, a value close to 1 implies that the topics are more singular, with a high proportion of unique words. In a news article corpus, high TD values would mean that different topics cover a wide range of distinct concepts, while low TD values might indicate that topics are overlapping and less diverse. Consequently, TD can be used to determine whether repetitive topics exist. This metric is useful for ensuring that the topics generated by a topic model are diverse enough to represent different aspects of the corpus content.
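Equation (6) translates into a few lines of code. The following minimal sketch assumes that each topic is given as a ranked list of words; the function name is hypothetical, and the default of 15 top words follows the setting stated above.

```python
def topic_diversity(topics, top_n=15):
    """Eq. (6): unique words among the top-N words of all topics, divided by their total."""
    top_words = [word for topic in topics for word in topic[:top_n]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)
```

For two topics `["beam", "antenna", "mimo"]` and `["beam", "satellite", "orbit"]` with `top_n=3`, five of the six top words are unique, giving a TD of 5/6.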

4.2. Performance Evaluations

In this section, we present comprehensive empirical findings from our experiments, analyzing the effects of different components of our methodology on topic modeling performance across the 6G patent datasets.

4.2.1. Comparisons of LLM-Based Pre-Processing Techniques

In our research on 6G patent data, the primary objective was to understand how different summarization strategies impact the performance of various tasks related to 6G technological analysis. We collected patent datasets from five emerging fields in 6G from 2016 to 2024. To process this large volume of patent data, we employed ChatGLM, an online API-based LLM service. It was selected for its ability to seamlessly process the patent context in parallel, making it more cost-effective and efficient than local LLMs. For content summarization, we adopted short and long summarization strategies with different constraints on the number of words, which were 15–30 and 30–60, respectively. Additionally, we used refined prompts as proposed in Section 3.1.2 to further improve the results and prevent irrelevant information in the topic extractions. We adopted all-MiniLM-L6-v2, a typical SBERT model, as the default baseline setting in this experiment. We calculated the TC and TD metrics for each period at one-year intervals throughout the entire duration. The values reported for both TC and TD are the averages calculated across all the years.

From the data presented in Table 2 and Figure 4, it is evident that the application of our proposed summarization techniques, when combined with our LLM-based pre-processing, has distinct effects on the performance metrics of TC and TD across different datasets consistently.

For the TC metric in Table 2, compared to the baseline, the ‘Baseline + long-sum’ approach shows an improvement in most of the five datasets. For instance, in the NST field, the TC value increased from 0.4900 to 0.5039, indicating that long summarization (characterized by its ability to preserve detailed information) indeed helps in more complex analysis tasks. Similarly, in the ISAC field, the value rose from 0.4749 to 0.5024. This improvement is consistent with our previous analysis that long summaries are beneficial for exploring complex relationships within 6G technological concepts. However, the ‘Baseline + short-sum’ approach led to a decrease in TC performance across all fields. This decrease is expected, as short summarization sacrifices detailed information crucial for complexity-related tasks.

The baseline model’s TC curve in Figure 4 shows a trend over the years, potentially reflecting the baseline performance in capturing topic-related characteristics. The long-sum and short-sum models, which are based on BERTopic with different summary lengths, exhibit distinct TC patterns. These differences might indicate that the summary length impacts how well the model can represent topics in terms of TC, with variations in the curves suggesting different levels of effectiveness or adaptability to the data over time.

Regarding the TD metric in Table 2, the ‘Baseline + short-sum’ approach demonstrates significant enhancements. In the ISAC field, the TD value increased from 0.6775 to 0.7537, and in the EAI field, it rose from 0.6647 to 0.7400. This shows that short summarization, which focuses on key point extraction, is highly effective for diversity-related tasks. The ‘Baseline + long-sum’ approach also improved the TD performance in some fields such as EAIs, where the value increased from 0.6647 to 0.6983, but the overall improvement was more pronounced with the short-sum method.

In the Supplementary Materials, we included one folder named ‘Heatmaps’ to comprehensively visualize the performance of different embedding methods. These visualizations are presented for each dataset under three conditions: the baseline model, the baseline with short summarization (baseline + short sum), and the baseline with long summarization (baseline + long sum). The heatmaps specifically focus on cosine similarity, which is a crucial metric for evaluating the embedding quality. By comparing the cosine similarity heatmaps across these three scenarios, we can clearly observe the differences in how well each approach captures semantic relationships within the patent data. For example, we can observe how the long summarization might preserve more detailed semantic information, leading to higher cosine similarity values in relevant regions of the heatmap compared to the baseline method. This provides an intuitive understanding of the impact of different summarization strategies and the baseline model on the quality of embeddings, further validating the effectiveness of our proposed methods in handling 6G patent data.

Additionally, to further validate the effectiveness of our LLM-based pre-processing methods, we conducted additional experiments and included 2D visualizations of the embeddings for each dataset in the Supplementary Materials named ‘Clusters’. We utilized UMAP, a powerful dimensionality reduction technique, to project the high-dimensional embeddings into a 2D space for better visualization. The generated HTML-based interactive visualizations offer an intuitive understanding of the impact of LLM-based pre-processing on the patent data. In these visualizations, we can clearly observe the distribution of data points. When LLM-based pre-processing was applied, the data points tended to cluster more tightly, indicating a reduction in noise. For instance, in the RIS dataset, the pre-processed data (both the long summarization approach and short summarization approach) show a more organized structure, with data points forming distinct groups. This suggests that LLM-based summarization effectively filters out irrelevant information, enabling the model to better capture the underlying semantic relationships within the patent data. These visualizations serve as evidence for the noise-reducing capabilities of our LLM-based pre-processing approach, complementing our overall analysis of the dynamic topic modeling framework for 6G patent data.

In summary, our research validates that different LLM-based summarization strategies have specific and predictable impacts on 6G-related task performance. The long-summarization technique enhances topic coherence compared to the baseline, making it a more fitting choice for complex analysis tasks that demand in-depth understanding and comprehensive interpretation. In contrast, short-summarization techniques stand out in key point extraction, where the goal is to quickly distill the most crucial information, and in high-throughput processing tasks, enabling rapid and efficient handling of large volumes of text. These findings can guide future research and application development in 6G patent data analysis, enabling more efficient and accurate utilization of the vast amount of patent information in the 6G domain.

4.2.2. Comparison of Contextual Embedding Techniques

Our experimental study focused on evaluating the performance of different contextual embedding models in a DTM for 6G. We investigated four models: the baseline (SBERT model: all-MiniLM-L6-v2), PatentSBERTa [33], SciBERT [63], and RoBERTa [64]. The comparison was conducted across the five datasets: NSTs, RISs, ISAC, ISAGSC, and EAIs. The results presented in Table 3 and Table 4 show that no single model can be declared an absolute winner across all the evaluated datasets.

The baseline model shows a relatively high TC score of 0.4900 on the NST dataset, yet it does not consistently outperform the other models on the remaining datasets. In some cases, PatentSBERTa scores slightly lower than the baseline on NSTs but higher on ISAC. SciBERT demonstrates strong performance on RISs with a value of 0.5090, but its performance on the other datasets is not uniformly superior. RoBERTa also yields mixed results, with some scores comparable to those of the other models.

This lack of a winner indicates that the choice of model depends on the specific dataset and metric of interest. However, if overall balanced performance is desired, a more in-depth analysis of the weight and importance of each metric would be necessary. Moreover, these results suggest that there is room for further improvement in developing these models to achieve more consistent and superior performance across all the evaluated datasets and metrics.

4.2.3. Results of Hyperparameter Tuning

We further tested our model with hyperparameter tuning, and the experimental results indicate that the optimized hyperparameters significantly enhanced the model’s performance. After conducting random searches based on the DBCV value, the carefully selected parameters for the UMAP and HDBSCAN models brought about substantial improvements in the clustering quality.

For the UMAP model, setting n_neighbors = 50, n_components = 50, min_dist = 0, and metric = ‘cosine’ enabled a more effective dimensionality reduction. The reduced-dimensional vectors retained the essential structure of the high-dimensional data and facilitated the subsequent clustering process. This was evident in the improved performance of the HDBSCAN model, which was able to identify more distinct and meaningful clusters.

With the HDBSCAN model’s parameters set as min_cluster_size = 100, min_samples = 10, metric = ‘euclidean’, and cluster_selection_method = ‘leaf’, the model could better handle the complex distribution of 6G patent data. It effectively separated the data into clusters, minimizing the number of misclassified outliers. As a result, the topic representation obtained from the c-TF-IDF step became more accurate and interpretable.
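The reported settings can be collected in a small configuration, as sketched below. The parameter values are those stated above; the dictionary names and the helper `build_topic_model` are hypothetical, and the helper assumes that the `umap-learn`, `hdbscan`, and `bertopic` packages are installed. All other BERTopic options (embedding model, representation model) are left at their defaults here, which is an assumption.

```python
# Hyperparameters reported in this section.
UMAP_PARAMS = {
    "n_neighbors": 50,
    "n_components": 50,
    "min_dist": 0.0,
    "metric": "cosine",
}
HDBSCAN_PARAMS = {
    "min_cluster_size": 100,
    "min_samples": 10,
    "metric": "euclidean",
    "cluster_selection_method": "leaf",
}


def build_topic_model():
    """Create a BERTopic model with the tuned UMAP and HDBSCAN components."""
    # Imported lazily so the configuration can be inspected without the packages.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    return BERTopic(
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
    )
```

Keeping the values in plain dictionaries makes it straightforward to log them alongside the DBCV score of each run.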

In terms of the evaluation metrics, as shown in Table 5, the optimized model showed a remarkable increase in topic coherence. The top words selected for each cluster were more closely related semantically, with the average topic coherence (TC) value increasing by approximately 40%, on average, across all datasets compared to the non-optimized model. This improvement in TC indicates that the model better captured the underlying concepts within each cluster.

Simultaneously, the topic diversity (TD) also witnessed a positive change, as shown in Table 6. The proportion of unique words in the top-ranked words of each topic increased, resulting in a more diverse set of topics. This was particularly beneficial for comprehensively representing the various technological aspects in the 6G patent data. The TD value increased by about 30%, on average, suggesting that the hyperparameter-tuned model could generate more distinct and non-redundant topics.

In summary, the hyperparameter tuning process was crucial for achieving better performance in our DTM for 6G technology. It provided a more refined way of handling patent data in a complex landscape, leading to more accurate topic representation and a deeper understanding of the emerging trends in 6G technology. Since the present tuning relied on random searches, future research could explore more advanced hyperparameter optimization techniques, such as Bayesian optimization, to further improve the model’s performance and generalization ability.

4.3. Empirical Analysis

In the empirical analysis, our experiments were designed to comprehensively explore the technological evolution in the 6G field. We conducted DTM analysis separately on each of the five 6G patent datasets: NSTs, RISs, ISAC, ISAGSC, and EAIs. Based on the concept of weak signal analysis, we computed a TEM with an accompanying heatmap for each dataset to quantitatively identify weak signals, that is, technologies that are likely to emerge in the near future. These results provide a deeper understanding of the technological evolution in the 6G field, enabling us to identify emerging technologies and trends. This information can support strategic decision-making in the industry, helping stakeholders to plan research and development directions and stay ahead in the highly competitive 6G market.

First, we adopted a BERTopic-based DTM to analyze the overall trends in 6G technology. Figure 5 shows line charts of the top-ranked topics for the five datasets, generated with our DTM framework. All the technologies displayed as top-ranked topics have grown quickly over the years. In the new spectrum technology (NST) dataset, ‘power amplifier_mixer_inductor_stage’ has recently been the top-ranked topic. This is likely because the power amplifier is essential for boosting the signal power. Since 6G requires high-speed data transmission over long distances in a wider spectrum, a power amplifier can ensure that the signal can still be accurately detected after suffering from attenuation during propagation. The mixer is mainly responsible for frequency conversion. It enables the transformation of signals between different frequencies, which is beneficial for signal processing and transmission in the 6G new spectrum. Inductors, on the other hand, work with capacitors to form LC filters, filtering out unwanted signals and noise. They also play a role in energy storage and impedance matching. Similarly, the top-ranked topics for the remaining four subjects can be identified from Figure 5: ‘dielectric substrate_metasurface antenna’ for RISs, ‘radar communication_ofdm’ for ISAC, ‘satellite_network’ for ISAGSC, and ‘beam management’ for EAIs.
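The trend lines in such charts rest on how often each topic occurs in each period. BERTopic provides a `topics_over_time` method for this purpose. The standard-library sketch below only illustrates the underlying counting and is not the library's implementation; the function name and the toy assignments are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def yearly_topic_share(
    assignments: List[Tuple[int, int]],
) -> Dict[int, Dict[int, float]]:
    """Map each topic to its share of documents per year.

    assignments is a list of (year, topic_id) pairs, one per document.
    """
    docs_per_year = Counter(year for year, _ in assignments)
    counts = Counter(assignments)  # keyed by (year, topic_id)
    years = sorted(docs_per_year)
    topics = sorted({topic for _, topic in assignments})
    # A Counter returns 0 for unseen keys, so absent topics get a share of 0.
    return {
        topic: {year: counts[(year, topic)] / docs_per_year[year] for year in years}
        for topic in topics
    }


# Toy data: three documents in 2016 and one in 2017.
toy_assignments = [(2016, 0), (2016, 0), (2016, 1), (2017, 1)]
print(yearly_topic_share(toy_assignments))
```

Normalizing by the number of documents per year keeps the shares comparable when the yearly patent volume changes.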

To further explore the technological evolution in the domain-specific fields of 6G, we employed TEMs with associated heatmaps to quantify emerging technological shifts in the five patent datasets, as shown in Figure 6. Weak signals are topics with a low average proportion but a high growth rate; the weak signals identified in each dataset are listed in Table 7.
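The quadrant logic behind a TEM can be sketched as follows. Several details are assumptions made only for illustration: the growth rate is taken as the mean year-over-year relative change, the cut-offs are the medians across topics, and the low-proportion, low-growth quadrant is labeled as the potential signal. The thresholds and definitions used in the study may differ.

```python
from statistics import mean, median
from typing import Dict, List


def growth_rate(series: List[float]) -> float:
    """Mean year-over-year relative change (assumed definition)."""
    changes = [(b - a) / a for a, b in zip(series, series[1:]) if a > 0]
    return mean(changes) if changes else 0.0


def classify_signals(shares: Dict[str, List[float]]) -> Dict[str, str]:
    """Assign each topic to a quadrant by average share and growth rate."""
    avg = {topic: mean(series) for topic, series in shares.items()}
    growth = {topic: growth_rate(series) for topic, series in shares.items()}
    # Median cut-offs are an assumption, not the study's thresholds.
    avg_cut = median(avg.values())
    growth_cut = median(growth.values())
    labels = {}
    for topic in shares:
        high_share = avg[topic] > avg_cut
        high_growth = growth[topic] > growth_cut
        if high_share and high_growth:
            labels[topic] = "strong signal"
        elif high_growth:
            labels[topic] = "weak signal"  # low share, high growth
        elif high_share:
            labels[topic] = "well-known but non-strong signal"
        else:
            labels[topic] = "potential signal"
    return labels


# Hypothetical yearly shares for four topics.
toy_shares = {
    "topic_a": [0.30, 0.35, 0.40],
    "topic_b": [0.02, 0.04, 0.08],
    "topic_c": [0.40, 0.38, 0.36],
    "topic_d": [0.05, 0.05, 0.04],
}
print(classify_signals(toy_shares))
```

In this toy input, `topic_b` starts from a small share but doubles each year, which is the pattern the weak-signal quadrant is meant to capture.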

For instance, in the NST dataset, the identified weak signals provide valuable insights into the emerging technological directions. Topic 17, ‘17_channel estimation_sparse_millimeter wave channel_deep learning’, indicates that deep learning-based methods for sparse millimeter-wave channel estimation are emerging. In the context of 6G, millimeter-wave communication is crucial for achieving high-data-rate transmission. However, the complex channel environment poses challenges. Deep learning can potentially optimize channel estimation algorithms, improving signal reception and transmission quality. Topic 18, ‘18_millimeter wave radar_wave radar_doppler_target detection’, suggests that millimeter-wave radar technology for Doppler target detection is on the rise. This is important for applications such as 6G-enabled autonomous driving, where precise target detection is essential for vehicle safety. By observing the TEM heatmap for NSTs, we can see how these topics’ importance has changed over time. A topic with a low average proportion but a high growth rate, like these two, is considered a weak signal. Monitoring such topics helps stakeholders in the 6G industry anticipate future technological needs and invest in relevant research and development.

Table 7 lists further examples for the RIS dataset. For instance, Topic 5, ‘5_terahertz_terahertz wave_vanadium dioxide_resonator’, shows that the research on vanadium dioxide resonators for terahertz waves is emerging. Terahertz communication is a key technology for 6G, and vanadium dioxide’s unique properties may enable the development of more efficient terahertz devices. Topic 10, ‘10_holographic_nano brick_micro nano optic_nano optic’, indicates the potential of holographic nano-brick-based micro-nano optics in 6G. This could be applied to improve the performance of optical communication components in RIS systems. The heatmap of the TEM for RISs visually represents the development trends of these topics. It allows us to compare the growth rates and average proportions of different topics, facilitating the identification of technologies that are likely to have a significant impact on the future of RIS-related 6G applications. This information can guide companies in the 6G field to allocate resources more effectively and focus on technologies with high-growth potential.

Regarding the ISAC dataset, Topic 15, ‘15_grating_waveguide_wavelength_coupler’, implies that grating waveguide wavelength couplers are emerging as an important technology. These couplers can play a crucial role in optical-based integrated sensing and communication systems, enabling efficient coupling of optical signals. Topic 18, ‘18_millimeter wave_mimo_hybrid beamforming_mmwave’, shows that millimeter-wave MIMO hybrid beamforming technology is gaining momentum. In 6G, MIMO technology combined with hybrid beamforming can enhance communication capacity and signal coverage. The TEM heatmap for ISAC helps visualize the evolution of these topics. It provides a clear picture of which technologies are emerging and which are more established, enabling researchers and industry players to understand the technological landscape better. This understanding is vital for strategic planning, such as setting research priorities and formulating product development strategies in the ISAC domain of 6G.

In the ISAGSC field, the TEM helps identify emerging trends such as laser communication in satellite systems and edge computing offloading in ground-integrated networks. Weak-signal topics like ‘12_laser communication_satellite laser_optical communication_leo satellite’ and ‘19_edge_edge computing_offloading_ground integrated network’ imply that these areas are emerging as important research directions. This insight is crucial for understanding the future development of 6G integrated space–air–ground–sea communication systems, enabling stakeholders to focus on improving satellite communication and optimizing network computing resources.

In the EAI dataset, our TEM reveals emerging trends in areas like radar signal processing and AI-based terminal device applications. Topics such as ‘13_radar_radar signal_doppler_chirp signal’ and ‘14_ai model_terminal device_application_channel access’ in the weak-signal quadrant suggest that these technologies are evolving. This information is valuable for enhancing the performance of 6G enhanced air interfaces, guiding research toward better radar-based communication and AI-enabled terminal device capabilities.

Overall, we demonstrate that the DTM framework with TEMs offers valuable insights into emerging technologies in each field. It helps in the early-stage identification of promising technologies, guiding strategic decision-making, resource allocation, and research and development planning in the highly competitive 6G industry.

5. Conclusions

In this study, we aimed to address the challenges of analyzing the technological evolution of 6G through patent data. Given the rapid development and complexity of 6G technologies, traditional topic modeling approaches were insufficient. We proposed a novel dynamic topic modeling framework to overcome these limitations and provide more interpretable alternative tools to analyze emerging trends in the field.

Specifically, we used LLM-based summarization as a cost-effective and efficient way to pre-process the large volume of patent data. The experimental results show that long summarization improved topic coherence (TC) in most fields, while short summarization enhanced topic diversity (TD). This indicates that different summarization strategies could be applied according to specific analysis requirements, providing a more flexible and efficient way to handle patent data.

Then, we conducted a comparative analysis of various embedding techniques and found no single model was superior across all metrics and datasets. Furthermore, we optimized the hyperparameters of the BERTopic-based framework. Through random searches based on the DBCV value, we significantly improved the performance of the model. The optimized model showed a remarkable increase in TC (about 40% on average) and TD (about 30% on average), enabling more accurate topic representation and a deeper understanding of emerging trends in 6G technology.

Finally, we applied the weak signal analysis method using TEMs in our dynamic topic modeling framework. By analyzing the five 6G patent datasets (NSTs, RISs, ISAC, ISAGSC, and EAIs), we identified numerous emerging technologies. For example, in the NST dataset, topics like ‘17_channel estimation_sparse_millimeter wave channel_deep learning’ and ‘18_millimeter wave radar_wave radar_doppler_target detection’ were recognized as weak signals, indicating emerging research directions in millimeter-wave communication. In the RIS dataset, ‘5_terahertz_terahertz wave_vanadium dioxide_resonator’ and other topics showed potential for future development in terahertz-related and optical-based RIS technologies.

In conclusion, our proposed dynamic topic modeling framework effectively uncovers 6G technological evolution patterns. It provides valuable insights for strategic decision-making in the 6G industry and enables stakeholders to identify emerging technologies, allocate resources rationally, and stay ahead in the highly competitive 6G technology race. Future research could explore more advanced hyperparameter optimization techniques and expand the scope of patent data analysis to other 6G-related fields to build a more comprehensive understanding of 6G technological development.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app15073783/s1.

Author Contributions

Conceptualization, J.J. and F.Y.; methodology, F.Y.; software, F.Y. and R.D.; validation, J.J. and R.D.; formal analysis, F.Y.; investigation, J.J. and R.D.; resources, J.J.; data curation, J.J.; writing—original draft preparation, J.J.; writing—review and editing, J.J., F.Y., and R.D.; visualization, F.Y.; supervision, J.J.; funding acquisition, F.Y.; project administration, F.Y. and R.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Major Scientific Instruments and Equipments Development Project of the National Natural Science Foundation of China, No. 32327801.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the sensitivity of 6G technology competition-related information.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ADTM	Algorithmic Dynamic Topic Models
BOW	Bag-of-words
BERT	Bidirectional Encoder Representations from Transformers
CBOW	Continuous Bag of Words
DBCV	Density-Based Clustering Validation
DL	Deep Learning
DTM	Dynamic Topic Models
EAI	Enhanced Air Interface
HDBSCAN	Hierarchical Density-Based Spatial Clustering of Applications with Noise
ISAC	Integrated Sensing and Communication
ISAGSC	Integrated Space–Air–Ground–Sea Communication
KEM	Keyword Emergence Maps
KIM	Keyword Issue Maps
LC	Link Clustering
LDA	Latent Dirichlet Allocation
LLMs	Large Language Models
MIMO	Multiple-Input Multiple-Output
NPMI	Normalized Pointwise Mutual Information
NSTs	New Spectrum Technologies
OAM	Orbital Angular Momentum
PDTM	Probabilistic Dynamic Topic Models
PMI	Pointwise Mutual Information
POS	Part-Of-Speech
RIS	Reconfigurable Intelligent Surface
SBERT	Sentence-BERT
TC	Topic Coherence
TD	Topic Diversity
TEMs	Topic Emergence Maps
TF-IDF	Term Frequency–Inverse Document Frequency
UMAP	Uniform Manifold Approximation and Projection

Figure 1. Our dynamic topic modeling framework for exploring the emerging technologies in 6G over time.

Figure 2. The detailed statistics for the five identified emerging technological fields in 6G.

Figure 3. An example of our LLM-based summarization process for enhancing the input data in the pre-processing module.

Figure 4. TC values (left-hand side panels: (a) baseline, (c) baseline + long-sum, and (e) baseline + short-sum) and TD values (right-hand side panels: (b) baseline, (d) baseline + long-sum, and (f) baseline + short-sum) over the years for different topic modeling approaches. The baseline represents the baseline model, while the long and short models are our LLM-based summarization techniques with long- and short-length summaries, respectively, illustrating the performance trends in TC and TD metrics over the years.

Figure 5. This figure shows overall trends and hierarchy clustering for 5 subjects.

Figure 6. This figure shows overall trends for 5 subjects on the left-hand side (Strong Signal (4), Weak Signal (3), Potential Signal (2), Well-known but Non-strong Signal (1), No Signal or Unclassified (0)) and hierarchy clustering on the right-hand side.

Table 1. The statistics of 6G patent datasets used in the experiment.

Patent Subjects	Periods	Documents	Vocabulary	Tokens
NSTs	2016–2024	22,834	18,945	16.952 MB
RISs	2016–2024	7863	9705	6.070 MB
ISAC	2016–2024	14,918	21,715	15.055 MB
ISAGSC	2016–2024	17,240	14,918	12.462 MB
EAIs	2016–2024	15,770	17,876	11.386 MB

Table 2. This table compares the performance of annual TC and TD averages in DTM (From 2016 to 2024). We compare the baseline to our LLM-based pre-processing techniques with long and short summarization techniques. The elements highlighted in red in the table represent the maximum values in the corresponding columns.

Model	Metrics	NSTs	RISs	ISAC	ISAGSC	EAIs
Baseline	TC	0.4900	0.4978	0.4749	0.4683	0.4615
Baseline	TD	0.5968	0.6849	0.6775	0.6542	0.6647
Baseline + long sum	TC	0.5039	0.5002	0.5024	0.4883	0.4717
Baseline + long sum	TD	0.6157	0.6985	0.6923	0.6643	0.6983
Baseline + short sum	TC	0.4005	0.4314	0.4140	0.4062	0.4032
Baseline + short sum	TD	0.6656	0.7159	0.7537	0.6923	0.7400

Table 3. This table shows the comparison of performances of averages of TC in different models. The elements highlighted in red in the table represent the maximum values in the corresponding columns.

Model	NSTs	RISs	ISAC	ISAGSC	EAIs
Baseline	0.4900	0.4978	0.4749	0.4683	0.4615
PatentSBerta	0.4823	0.4953	0.4771	0.4768	0.4532
SciBert	0.4875	0.5090	0.4812	0.4709	0.4571
Roberta	0.4848	0.4967	0.4835	0.4675	0.4615

Table 4. This table shows the comparison of performances of averages of TD in different models. The elements highlighted in red in the table represent the maximum values in the corresponding columns.

Model	NSTs	RISs	ISAC	ISAGSC	EAIs
Baseline	0.5968	0.6849	0.6775	0.6542	0.6647
PatentSBerta	0.5860	0.6813	0.6739	0.6480	0.6645
SciBert	0.5999	0.6735	0.6633	0.6508	0.6733
Roberta	0.5922	0.6703	0.6755	0.6473	0.6724

Table 5. This table shows the comparison of performances of averages of TC in the baseline model and baseline with hyperparameter optimizations. The elements highlighted in red in the table represent the maximum values in the corresponding columns.

Model	NSTs	RISs	ISAC	ISAGSC	EAIs
Baseline	0.4900	0.4978	0.4749	0.4683	0.4615
Baseline + OPT	0.7877	0.6511	0.5498	0.6412	0.5982

Table 6. This table shows the comparisons of performances of averages of TD in the baseline model and baseline with hyperparameter optimizations. The elements highlighted in red in the table represent the maximum values in the corresponding columns.

Model	NSTs	RISs	ISAC	ISAGSC	EAIs
Baseline	0.5968	0.6849	0.6775	0.6542	0.6647
Baseline + OPT	0.8898	0.9080	0.9467	0.9241	0.9463

Table 7. This table shows the weak signals in 5 subjects in patent datasets of 6G.

Dataset	Weak Signals in the Top 20 Topics
NSTs	Topic 17: 17_channel estimation_sparse_millimeter wave channel_deep learning; Topic 18: 18_millimeter wave radar_wave radar_doppler_target detection; Topic 19: 19_graphene_wave absorber_terahertz wave_substrate layer; Topic 20: 20_packaging_packaging structure_millimeter wave chip_wave chip
RISs	Topic 5: 5_terahertz_terahertz wave_vanadium dioxide_resonator; Topic 9: 9_display device_pixel_grating_light emitting; Topic 10: 10_holographic_nano brick_micro nano optic_nano optic; Topic 12: 12_energy efficiency_wireless energy_power_intelligent reflecting surface; Topic 13: 13_intelligent reflecting surface_relay_intelligent reflector_communication system; Topic 14: 14_supercell_vibration_wave_insulation; Topic 15: 15_graphene_absorber_terahertz wave_bias voltage; Topic 16: 16_aerial vehicle_unmanned aerial vehicle_unmanned aerial_unmanned; Topic 17: 17_positioning_mobile device_reference signal_configuration information; Topic 18: 18_wave absorbing_honeycomb_absorber based_metasurface wave; Topic 19: 19_vortex wave_orbital angular_vortex beam_oam
ISAC	Topic 15: 15_grating_waveguide_wavelength_coupler; Topic 16: 16_offloading_sensor network_wireless sensor network_cluster; Topic 17: 17_touch_fingerprint_display device_touch screen; Topic 18: 18_millimeter wave_mimo_hybrid beamforming_mmwave; Topic 19: 19_underwater_marine_boat_navigation
ISAGSC	Topic 11: 11_frequency offset_clock error_doppler frequency_time frequency; Topic 12: 12_laser communication_satellite laser_optical communication_leo satellite; Topic 13: 13_conditional_handover command_handover terrestrial_handover terrestrial network; Topic 14: 14_switching method_satellite switching_beam switching_satellite base station; Topic 15: 15_disclosure relates_system supporting_data transmission rate_pdu; Topic 17: 17_constellation_constellation configuration_satellite constellation_design method; Topic 18: 18_data transmission method_data transmission_transmission method_transmission; Topic 19: 19_edge_edge computing_offloading_ground integrated network
EAIs	Topic 11: 11_transformer_fault_gas_utility; Topic 12: 12_syntactic_query_semantic_word vector; Topic 13: 13_radar_radar signal_doppler_chirp signal; Topic 14: 14_ai model_terminal device_application_channel access; Topic 16: 16_voice_speech_audio_microphone; Topic 17: 17_ldpc_ldpc code_parity check_channel input; Topic 18: 18_amplifier_impedance_transistor_rf signal; Topic 19: 19_image quality_image classification_wavelet_image data
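The topic labels in Table 7 consist of a topic number followed by underscore-separated keywords. A minimal sketch for splitting such a label into its parts is given below; `TopicLabel` and `parse_topic_label` are hypothetical helper names, and the code assumes that keywords may contain spaces but never underscores.

```python
from typing import List, NamedTuple


class TopicLabel(NamedTuple):
    topic_id: int
    keywords: List[str]


def parse_topic_label(label: str) -> TopicLabel:
    """Split an 'id_kw1_kw2_...' label into its numeric id and keyword list."""
    topic_id, *keywords = label.split("_")
    return TopicLabel(int(topic_id), keywords)


# Label copied from the NSTs row of Table 7.
label = "17_channel estimation_sparse_millimeter wave channel_deep learning"
parsed = parse_topic_label(label)
print(parsed.topic_id, parsed.keywords)
```

Splitting the labels this way makes it straightforward to compare keywords across datasets, for example to see that a graphene-based terahertz absorber topic appears as a weak signal in both the NSTs and RISs rows.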

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jiang, J.; Ying, F.; Dhuny, R. Unveiling Technological Evolution with a Patent-Based Dynamic Topic Modeling Framework: A Case Study of Advanced 6G Technologies. Appl. Sci. 2025, 15, 3783. https://doi.org/10.3390/app15073783
