Article

Tech Trend Analysis System: Using Large Language Models and Finite State Chain Machines

by Dragoş Florin Sburlan *, Cristina Sburlan and Alexandru Bobe
Faculty of Mathematics and Informatics, Ovidius University of Constanta, 900470 Constanta, Romania
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2191; https://doi.org/10.3390/electronics14112191
Submission received: 25 April 2025 / Revised: 24 May 2025 / Accepted: 26 May 2025 / Published: 28 May 2025

Abstract: In today’s fast-paced technological environment, spotting emerging trends and anticipating future developments are important tasks in strategic planning and business decision-making. However, the volume and complexity of unstructured data containing relevant information make it very difficult for humans to effectively monitor, analyze, and identify inflection points by themselves. In this paper, we aim to demonstrate the potential of integrating large language models (LLMs), a novel finite state chain machine (FSCM) with output, and graph databases to extract insights from unstructured data, specifically from earnings call transcripts of 40 top Technology Sector companies. The FSCM provides a modular, state-based approach for processing texts, enabling entity and relationship recognition. The extracted information is stored in a knowledge graph, further enabling semantic search and entity clustering. By leveraging this approach, we identified over 20,000 hidden (overlapping) trends and topics across various types. Our experiment on real-world datasets confirms the scalability and effectiveness of the method in extracting valuable knowledge from large datasets. The present work contributes to the field of Natural Language Processing (NLP) by showcasing how the proposed method addresses real-world business problems. The findings shed new light on current trends and challenges faced by tech companies, highlighting the potential for further integration with other NLP methods, leading to more robust and effective outcomes.

1. Introduction

Nowadays, accelerating technological change opens new, unpredictable frontiers of knowledge, productivity, and customer expectations. This is why spotting emerging trends and anticipating future developments have become important tasks in strategic planning, innovation, and business decision-making. Because the complexity of the technological field is expanding at a fast pace, it becomes increasingly challenging for humans to monitor, analyze, and identify significant shifts on their own; hence, automated analytics tools are gaining traction.
Recent advances in the development of Large Language Models (LLMs) substantially enrich the automatic tools designed to perform pattern recognition and insight generation in the fields of trend analysis, market intelligence, and future decision-making. LLMs are well suited for tasks like interpreting unstructured data for a deep understanding of targeted processes and for establishing correlations among complex variables (see [1,2,3,4,5]). In particular, in the finance field, Natural Language Processing has lately played a transformative role in topics like financial narrative processing, financial forecasting, and financial sentiment analysis, which have been explored using modern AI models and tools (see [6]). Specifically, for tasks like financial sentiment analysis, modern approaches leverage deep learning models (convolutional neural networks, recurrent neural networks, and long short-term memory networks) or large language models (like FinBERT). Additionally, techniques and tools that enhance generative AI performance have been developed, like retrieval-augmented generation (RAG), zero-shot prompting, fine-tuning, and so on. Regarding financial forecasting, predictive models have been enriched through the integration of AI models with classical statistical methods and time-series analysis. Similarly, financial narrative processing employs summary creation and information retrieval methods (like data clustering, LLM fine-tuning, and relational augmentation for sectionalized narratives) as well as abstractive techniques for distilled summaries. Knowledge graphs are used for storing gathered information, and they represent the support for unveiling hidden patterns.
In this paper, by using an LLM as language/cognitive support for a finite state-like machine with output, we developed a predictive analytics system which is able to uncover insights from cross-data comparisons within technology companies’ earnings call transcripts.
In this respect, our main contributions lie in the integration and application of three key components: an LLM, a novel Finite State Chain Machine (FSCM) model, and a graph database. The FSCM serves as a modular, state-based processing framework that guides the LLM in tasks such as entity and relationship discovery and structuring data from sequential inputs. Using a graph database to store and query the discovered entities and their relationships, together with powerful machine learning models (Word2Vec, SBERT), we reveal trends and cluster them into technology megatrends. More specifically, we demonstrate a robust process for trend analysis by implementing the following steps: querying the knowledge graph, identifying key “anchor entities” (nodes with high connectivity), clustering them, and analyzing their surrounding context. This approach uncovers subtle nuances and tendencies existing in official company press releases.
Through our proposed approach, we aim to automate key aspects of trend discovery, hence gaining a deeper understanding of what is relevant for the leading technology companies and highlighting the adoption, development, and industry effects of advanced technologies. The application of our research might contribute to real-world settings by allowing the early identification of groundbreaking technologies. Moreover, the benefits and future prospects acquired by using LLMs together with data processing systems as described within the paper might represent the foundation of a powerful analytical framework that can be used in a multitude of circumstances. From this perspective, it is worth mentioning that a similar approach can be employed for AI-assisted trend analysis for newspaper articles and reports. For example, concerning Environmental and Safety Trends, one can use large volumes of text data gathered from news articles, reports, and journals related to environmental monitoring, air quality management, noise pollution control, traffic safety initiatives, emergency response systems, and so on. In this case, the proposed system might produce evidence and clues that allow for the early identification of new challenges and solutions within a city.
A brief overview of the paper is given below:
  • Section 2 introduces the terminology of Large Language Models, which are described as neural networks trained on vast datasets to process text data and learn language patterns. Notions revisited in this section include self-attention and transformer blocks, limitations induced by various parameters, and techniques such as prompt engineering, Retrieval-Augmented Generation (RAG), and Large Language Model Agents.
  • In Section 3 we define the Finite State Chain Machine, a new computational model designed to capture state-based, directed calls of LLMs on sequential inputs. FSCMs with polynomially bounded transition functions prove to be non-universal, but become universal when arbitrary Turing-computable functions are used.
  • The Large-Scale Text Analytics System, an intelligent decision-support architecture for extracting insights from unstructured data, is presented in Section 4. It addresses the limitations of LLMs with static knowledge bases by proposing integration with an external Long-Term Memory (LTM), specifically a knowledge graph, for dynamic information storage and retrieval. The system architecture includes components like the Transcript Receiver, FSCM with LLM support, Graph Database, Knowledge Retrieval Module, Clustering Module, and Data Visualization Module. Heavy-duty tasks for the LLM involve translation, summarization, entity recognition, relation extraction, and entity disambiguation. The process includes structuring data into JSON, error correction, and consolidating entities using similarity metrics. Data are stored in a graph database, which allows semantic searches.
  • Section 5, A Real Application, Finding Technological Trends, describes the application of the system to identify R&D trends by using earnings call transcripts of top tech companies. The process involves extracting data into JSON files using LLMs, which are then loaded into a knowledge graph. An algorithm for trend discovery is presented and several examples of identified trends and types are provided.
  • Section 6 provides an overview of the research and the steps of the proposed trend analysis process, from data collection to trend identification using an LLM. It states the effectiveness and scalability of the developed method. Future lines of research, such as multimodal embeddings, are suggested.

2. Large Language Models

In this paper we assume that the reader is familiar with the main concepts related to Large Language Models. However, in this section, we point out some of the terminology used throughout the paper. For more information on this topic, we indicate [7,8,9].
Large Language Models (LLMs) are advanced and complex neural networks that were initially designed to process and analyze text data. These models learn patterns, relationships, and structures in language as they are trained on extremely large datasets. Thus, LLMs have been successfully used to solve a wide range of tasks and problems, ranging from simple text classification to more complex applications such as conversational dialogue. Later, different (multimodal) architectures were proposed to interpret/generate multimedia content.
LLMs extensively use the self-attention mechanism, which allows the model to weigh the relevance of different words in a sentence without taking into account their positional distance. LLMs are built using transformer blocks, which the model uses to focus on each word and its context within a sentence, to evaluate/transform the input sequence of tokens (the fundamental units that make up a text), and to capture the complex relations between them. Having been trained on a wide and diverse dataset, LLMs “understand” the sense of words as they create statistical models that “properly” group tokens to output meaningful responses. However, despite their impressive ability to generate human-like written texts, LLMs also have limitations. For example, many LLMs have a maximum token capacity, which includes both the input prompt and the model response (in a certain sense, context length represents how much information from the input text the model can consider when generating the output). In this respect, one can assume a context limitation that depends on the number of tokens used by the model at a time. A rough estimation states that 1 token ≈ 4 English symbols (or 100 tokens ≈ 75 English words). Thus, if an LLM has a given token limit and the prompt uses most of it, there will be minimal space left for the response. In this case, the output might be truncated, incomplete, or reduced in size (that is, the response has reduced usability). Yet another important parameter is the embedding length, that is, the dimensionality of the embedding vectors used to represent individual tokens in the model vocabulary. Consequently, a higher embedding length provides a richer representation of the meaning of each token.
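The rough token estimate above can be turned into a simple context-budget check (an illustrative sketch; the 4-characters-per-token ratio is only a heuristic, and a real tokenizer should be used for exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic from the text above: 1 token is about 4 English
    characters (equivalently, ~100 tokens per 75 English words)."""
    return max(1, round(len(text) / 4))

def fits_context(prompt: str, max_context_tokens: int, reserved_for_response: int) -> bool:
    """Check that a prompt leaves enough room in the context window
    for the model's response; otherwise the output may be truncated."""
    return estimate_tokens(prompt) + reserved_for_response <= max_context_tokens
```

For example, a 16,000-character prompt (~4000 estimated tokens) does not fit a 4096-token context once 512 tokens are reserved for the response.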
In this general frame, a known issue (rare for the latest models but still existing) of LLMs is hallucination; these are situations when the LLM generates a readable (and possibly plausible) response to a request, but one that is wrong, nonsensical, or has no basis in reality. Although there is no explicit reason why LLM hallucinations happen, it is assumed that they are due to the data the models are trained on, the limitations of the model architecture (including the decoding strategies), or even the prompts themselves.
Together with the development of LLMs, a new technique emerged: prompt engineering represents the process of crafting effective interactions with an LLM to maximize its performance (in terms of “output quality”). Effective prompts provide a balance between sufficient and relevant guidance for the LLM, which might determine the model’s ability to express itself in a creative and/or analytical way (see [10,11,12,13]). Moreover, in this general framework, prompt patterns and templates capture the dynamic inputs that are used to generate/adapt/process a wide variety of scenarios (see [14,15]).
Retrieval-Augmented Generation (RAG) represents the process of refining an LLM’s response generation proficiency through the incorporation of external knowledge sources, resulting in more informed and relevant output (see [16,17]).
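As a minimal illustration of the RAG idea, the sketch below scores candidate passages by word overlap with the query and prepends the best match to the prompt; production systems would instead use dense vector embeddings and a vector store, and the function names here are our own:

```python
def retrieve(query: str, documents: list) -> str:
    """Toy retriever: score each document by the number of query words
    it shares with the query (real RAG uses embedding similarity)."""
    q = set(query.lower().split())
    return max(documents, key=lambda d: len(q & set(d.lower().split())))

def rag_prompt(query: str, documents: list) -> str:
    """Augment the user query with the most relevant retrieved passage,
    so the LLM can ground its answer in external knowledge."""
    context = retrieve(query, documents)
    return ("Use the following context to answer.\n"
            f"Context: {context}\nQuestion: {query}")
```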
Throughout all AI models and AI-based systems, Large Language Model Agents constitute a major step ahead, as they can perceive their environment, reason towards goals, and take actions to achieve them (see [18,19]). Such agents use LLMs as controllers and they are able to retain information from past interactions or knowledge from a structured data repository (which allows better contextual understanding). In this way, by breaking down complex tasks into simpler and solvable constituents, agents can execute the actions required to achieve their objectives.
Finally, it is worth noting that there exist proprietary/closed-source LLMs (like different versions of OpenAI ChatGPT [20], Google Gemini [21], Microsoft Copilot [22], etc.), open-weight LLMs (like Llama [23], DeepSeek [24]), and even open-source ones. In this paper, we focus on a general-purpose open-weight LLM, running locally, whose performance is very good in terms of general capability (based on existing benchmarks; see, for example, [25]).

3. Finite State Chain Machine

In this section, we introduce a new computation model that aims to capture state-based, directed calls of an LLM on sequential inputs. A Finite State Chain Machine (FSCM) is a state machine that processes a sequence of inputs, maintaining a history of its outputs and using context-dependent functions to determine its transitions and outputs. FSCMs resemble, in a certain sense, the Mealy and Moore machines, which are well-known models of computation (see [26]).
A Finite State Chain Machine is defined as a construct M = (Q, Σ, q_0, h_0, C, F) where
  • Q = {q_0, q_1, …, q_n} is a finite set of states;
  • Σ is the alphabet of M;
  • q_0 ∈ Q is the initial state;
  • h_0 = ϵ is the initial history string;
  • C = {c_0, …, c_n} is a finite set of contexts, c_i ∈ Σ*, 0 ≤ i ≤ n, corresponding to the states of Q;
  • F = {f_0, …, f_n} is a finite set of computable functions corresponding to the states of Q, f_i : Σ* × Σ* × Σ* → Σ* × Q, 0 ≤ i ≤ n, each of which maps a tuple containing a context string, an input string, and a history string to a pair consisting of an output string and a next state.
M processes a list of input strings w_1, w_2, …, w_n as follows:
  • M starts in the initial state q_0. In this case, the context string is c_0, the initial history string is h_0 = ϵ, and the first input string is w_1.
  • At each step, assuming that q is the current state of M, c is the context string associated with the current state, w is the current input string to be processed from the list w_1, w_2, …, w_n, and h is the history string, M performs as follows:
    M computes the new output string and the next state using the function f of the current state: (output, p) = f(c, w, h). If f is not defined for (c, w, h), then M halts, rejecting the list of input strings.
    M updates the history string h with the newly computed output, that is, h = output.
    M changes its state from q to p.
  • Step 2 is repeated until all input strings have been processed (a successful computation) or the machine halts by rejecting them (as described above). In a successful computation, M returns the last computed output string.
A configuration of the FSCM M at any given moment during computation is a 4-tuple (q, h, w_index, output), where q ∈ Q is the current state of M, h ∈ Σ* is the current history string, w_index is the next input string to be processed from the input list, and output ∈ Σ* is the last output generated by the machine. The initial configuration is C_0 = (q_0, ϵ, w_1, ϵ), that is, M starts the computation in the initial state q_0, the initial history string is empty (ϵ = λ), the first input string to be processed is w_1, and the initial output is empty. M performs a transition between two configurations C_1 = (q_i, h_i, w_index, output_i) and C_2 = (q_j, h_j, w_index+1, output_j) if (output_j, q_j) = f_i(c_i, w_index, h_i) and h_j = output_j. In this case, we write C_1 ⇒ C_2.
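The transition relation above can be rendered as a small interpreter (an illustrative sketch; the dictionaries F and C and the function run_fscm are our own naming, not part of the formal definition):

```python
def run_fscm(F, C, q0, inputs):
    """Interpret an FSCM. F maps each state to its function
    f(context, input, history) -> (output, next_state), or None when f
    is undefined on that triple (the machine halts and rejects).
    C maps each state to its context string."""
    state, history = q0, ""          # initial configuration (q0, eps, w1, eps)
    output = None
    for w in inputs:
        result = F[state](C[state], w, history)
        if result is None:           # f undefined: reject the input list
            return None
        output, state = result
        history = output             # history is overwritten with the new output
    return output                    # last computed output on success
```

For instance, a single-state machine whose function appends each input to the history simply concatenates the input list.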
The following example explains in detail the functioning of an FSCM system.
Example 1. 
Let M = (Q, Σ, q_0, h_0, C, F) be a finite state chain machine defined as follows:
  • Q = {q_0, q_1};
  • C = {c_0 = start, c_1 = next};
  • F = {f_0, f_1}, where the functions are defined by the following:
    f_0(c, w, h) = (c + w, q_1) if w starts with a consonant, and (c + w, q_0) otherwise;
    f_1(c, w, h) = (reverse(h) + w, q_0);
  • the list of input strings [w_1 = hello, w_2 = world];
  • initial history string h = ϵ.
The step-by-step computation of M:
1. M starts in state q_0, the context being c_0 = start.
2. M processes the first input string w_1 = hello:
  • M computes f_0(start, hello, ϵ) = (start + hello, q_1) = (starthello, q_1).
  • M updates the history string h = starthello.
  • M performs a transition from state q_0 to state q_1.
3. M processes w_2 = world in state q_1, the context now being c_1 = next:
  • M computes f_1(next, world, starthello) = (reverse(starthello) + world, q_0) = (ollehtrats + world, q_0).
  • M updates the history h = ollehtratsworld.
  • M performs the transition back to state q_0.
4. The computation stops, all the input strings having been processed.
Consequently, for the input list of strings [hello, world], M outputs the last computed string ollehtratsworld.
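Example 1 can be reproduced in a few lines (an illustrative sketch; the state names and helper functions are our own encoding of f_0 and f_1):

```python
def f0(c, w, h):
    """State q0: prepend the context; move to q1 iff w starts with a consonant."""
    next_state = "q1" if w[:1].lower() not in "aeiou" else "q0"
    return c + w, next_state

def f1(c, w, h):
    """State q1: reverse the accumulated history and append w."""
    return h[::-1] + w, "q0"

F = {"q0": f0, "q1": f1}
C = {"q0": "start", "q1": "next"}

state, history = "q0", ""
for w in ["hello", "world"]:
    history, state = F[state](C[state], w, history)
print(history)  # -> ollehtratsworld
```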
Remark 1. 
As described in the above definition, the FSCM M processes lists of strings (the lists have variable but finite sizes), for each of them outputting a new string (seeing M as a multivariate function calculator working on strings is a feature that we will use later in the paper). However, one might simplify the construct by considering only one-symbol input strings; hence, the entire list of input strings in the definition of M can actually be represented as a single string. Correspondingly, M will process an input string (symbol by symbol) and return another one. In this particular case, the language accepted by M can be easily defined as the set of all strings returned by M when it receives as input strings from Σ*. By considering this new simplified setup, we can state Theorem 1, which can shed some light on the capability of the general model, working with lists of strings.
Theorem 1. 
A Finite State Chain Machine with polynomially bounded transition functions is not computationally universal.
Proof. 
Assume by contradiction that the FSCM model with polynomially bounded functions f_i is computationally universal, that is, it can simulate any Turing machine. Consider now the Busy Beaver function BB(n), which gives the maximum number of 1’s that an n-state Turing machine can write on its tape before halting, starting from an empty tape. It is known that BB(n) is not computable by any Turing machine because it grows faster than any computable function (see [27]). By assuming that the FSCM is universal, it follows that it can simulate any n-state Turing machine. Therefore, the FSCM should be able to compute BB(n) for any n. Because all functions f_i, 0 ≤ i ≤ n, are polynomially bounded, the length of the history string h (which is the output of these functions) can only grow polynomially with respect to the input size and previous history. Assume that the growth of the history string is bounded by a polynomial p(x), where x is the length of the input and the previous history.
Consider a Turing machine M_BB that computes BB(n). The output of M_BB is a string of 1’s of length BB(n). If the FSCM simulates M_BB, it must be able to produce a history string of length BB(n). However, due to the polynomial boundedness of the f_i functions, the history string can grow at most at a polynomial rate p(x). Since BB(n) grows faster than any polynomial, there exists an n such that BB(n) > p(x) for any x that the FSCM can generate in the simulation. It follows that the FSCM cannot produce a history string of length BB(n) for all n, contradicting our assumption that it can simulate any Turing machine. Hence, the FSCM model with polynomially bounded functions is not universal.    □
Some simple examples of string polynomial functions include the following:
  • concatenation function, which concatenates two strings s 1 and s 2 ;
  • reversal function, which reverses a given string;
  • palindrome check function, which checks if a string s is a palindrome;
  • word count function, which counts the number of words in a string;
  • string compression function, which compresses a string s by replacing consecutive repeated characters with the character followed by the count.
However, it is worth mentioning that if one considers arbitrary Turing-computable functions for f_i, then the model becomes computationally universal.
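The example string functions listed above can be sketched as follows (illustrative implementations; the run-length encoding format chosen for compress is one of several possibilities):

```python
def concatenate(s1: str, s2: str) -> str:
    return s1 + s2

def reverse(s: str) -> str:
    return s[::-1]

def is_palindrome(s: str) -> bool:
    return s == s[::-1]

def word_count(s: str) -> int:
    return len(s.split())

def compress(s: str) -> str:
    """Run-length encoding: replace each run of repeated characters
    with the character followed by the count, e.g. 'aaabcc' -> 'a3b1c2'."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(s[i] + str(j - i))
        i = j
    return "".join(out)
```

Each of these grows its output at most polynomially in the input length, so they all qualify as transition-function building blocks in the non-universal setting of Theorem 1.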
Regarding the computation of the FSCM, by including the history, local context, and input as arguments to the f_i functions, one is able to capture a significant degree of dynamics. The history string h acts as a form of memory; it stores the sequence of outputs generated by the machine so far. Consequently, by using h as an argument, the functions f_i can access and utilize “information” from past computations. This allows the machine to make decisions based on its previous behavior, capturing temporal dependencies. On the other hand, the local context c_i provides state-specific “information”; hence, by including it, each state can have its own “view” or “perspective” on the current situation. This allows for branching behavior based on state, and for each state to handle input in its own specific way. Moreover, the input w can be seen as the immediate stimulus (coming from the “outside” of the model) that triggers the current transition. This “piece of information” is processed by the FSCM, which determines its output.
Finally, the presented model can be easily modified to capture dynamic changes in the context. For instance, one might add to the definition of an FSCM M a context transition computable function g that allows flexible updates of the context string based on the current state and input string (that is, g : Q × Σ* → C, where C is a finite set of context strings). A simple example for g might be a regular expression-based context. Similarly, one could allow the history string to influence the context of a given state. All these changes would increase the “power” of the machine, but would also significantly increase its complexity and the difficulty of understanding and designing it.
However, the presented formal model was introduced as an abstraction for a broader aim. We intend to use FSCM to model human cognition and decision-making processes, taking into account contextual information and history-dependent behavior. One possible attempt is to augment FSCM with a Large Language Model (LLM). In this respect, LLMs operate on a different design paradigm, implementation approach, or problem-solving strategy compared to traditional, explicitly defined functions. LLMs are known for their ability to generate human-like text based on the input they receive. In the context of Finite State Chain Machines, replacing the computable functions f i with LLMs can significantly enhance the capabilities of the machine. More precisely, the standard FSCMs (as defined above) rely on hand-designed computable functions f i to perform specific tasks, such as state transitions or output generation. By using these types of functions, one is often limited by their simplicity and lack of expressiveness. However, LLMs have shown remarkable capabilities in learning complex patterns and relationships from large datasets.
The gains of using LLMs in this formalism are straightforward. Firstly, by providing the LLM with “well-crafted” prompts (which contain relevant context augmented with history and input strings), one can improve the ability to generate accurate outputs. Moreover, this approach allows for easy modification or extension of the task sequence, hence one can address new use cases very easily. Yet another advantage regards the modularized design, which enables adding new tasks or prompts without significantly affecting the overall performance of the system. Most of the existing Large Language Models impose a maximum token limit for the output, ensuring that it remains manageable, efficient, and cost-effective. In this respect, the bounding of the output size can be considered a form of polynomial constraint.
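A transition function backed by an LLM can be sketched as follows (a hypothetical illustration: llm stands for whatever completion API is used, and the prompt layout combining context, history, and input is our own):

```python
def make_llm_state_function(llm, instruction, next_state):
    """Build an FSCM transition function whose output string is produced
    by an LLM call. `llm` is any callable prompt -> completion (a
    placeholder for a real API); the prompt is assembled from the state's
    context, the task instruction, the history, and the current input."""
    def f(context, w, history):
        prompt = (f"{context}\n{instruction}\n"
                  f"Previous output: {history}\nCurrent input: {w}")
        return llm(prompt), next_state
    return f
```

Because any callable can serve as llm, the surrounding state logic can be tested with a simple stub before plugging in a real model.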

4. Large-Scale Text Analytics System

Large Language Models (LLMs) are complex systems that rely on massive amounts of text data for training. When prompted, they split the input text into smaller units called tokens and generate responses based on the probability of subsequent tokens in their knowledge base.
However, after the training phase is complete, the model weights become static. Hence, to increase the model’s knowledge base, one typically has to retrain or fine-tune the model (by exposing it to the new data). These approaches have limitations due to the cost and computational demands that make them impractical for on-the-fly adaptation.
Nevertheless, a key feature of LLMs is their reliance on sampling methods, such as temperature control, which governs the randomness of their output. By adjusting the temperature, users can modify the likelihood of receiving identical responses to the same prompt. However, this does not enable continuous learning or direct modification of the model knowledge base.
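Temperature control can be illustrated by the standard softmax-with-temperature sampling scheme (a generic sketch of the mechanism, not the internals of any particular LLM):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Scale logits by 1/temperature, apply softmax, and sample a token
    index. A low temperature sharpens the distribution (near-deterministic
    output); a high temperature flattens it (more varied output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):            # inverse-CDF sampling
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

At a very low temperature, the highest-logit token is chosen almost surely, which is why identical prompts then tend to yield identical responses.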
In contrast, an LLM augmented with an external memory that can be updated and queried would allow dynamic/up-to-date information to be stored and recalled as needed rather than relying on pre-existing knowledge from the training dataset. Unlike LLMs, which can only draw upon their static knowledge base, Long-Term Memory (LTM) would enable the model to learn and adapt over time (by capturing, for instance, evolving concepts).
The limitations of LLMs are evident in their inability to answer complex or personal questions that may lie outside their scope. Since these models cannot update their knowledge bases in real-time, they are forced to rely on probability-based responses from their training data. In this regard, the proposed LTM system would offer a significant improvement by providing dynamic memory capability.
To expand the capabilities of the LLM, one may integrate a Long-Term Memory (LTM) component that stores structured information based on some text representation of it (user input, documents, and so on). This will allow users to instruct the system to remember specific details, which will be stored and retrievable at a later time. When queried about these matters in the future, the system will draw upon these LTM data as its primary source of truth, rather than relying on its training dataset. In this way, the system will provide an answer based on the information it has been taught to remember (see [28] for an initial attempt). In particular, a knowledge graph might be considered as a suitable implementation for the LTM component. A knowledge graph is a structured representation of entities and their relationships, allowing efficient querying and retrieval of specific pieces of information (see [29]).
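A minimal in-memory stand-in for the knowledge-graph LTM might look as follows (an illustrative sketch; a production system would use a dedicated graph database such as Neo4j, and the method names here are our own):

```python
class KnowledgeGraph:
    """Minimal in-memory knowledge graph used as a Long-Term Memory:
    entities carry a type and properties; relations are labeled edges."""
    def __init__(self):
        self.entities = {}        # name -> {"type": ..., "properties": ...}
        self.relations = []       # (source, relation, destination) triples

    def add_entity(self, name, etype, properties=None):
        self.entities[name] = {"type": etype, "properties": properties or {}}

    def add_relation(self, source, relation, destination):
        self.relations.append((source, relation, destination))

    def neighbors(self, name):
        """Return entities directly related to `name`, in either direction."""
        out = set()
        for s, _, d in self.relations:
            if s == name:
                out.add(d)
            elif d == name:
                out.add(s)
        return out

# Hypothetical usage with example entities:
kg = KnowledgeGraph()
kg.add_entity("NVIDIA", "Company")
kg.add_entity("GPU", "Technology")
kg.add_relation("NVIDIA", "develops", "GPU")
```

Queries such as "what is linked to this entity?" then reduce to graph traversals, which is exactly the retrieval pattern the LTM needs.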
Although we do not investigate the theoretical implications or limitations of integrating LTM with LLMs in this paper, our primary goal is to develop an architecture for an automated system that can analyze earnings call transcripts or other economic documents and text sources to identify trends, patterns, and insights (see Figure 1). In this way, the designed system would be able to learn from user-provided information, retrieve specific facts efficiently, and perform advanced applications that require contextual understanding and reasoning. From this point of view, the purpose of the system might go beyond simply providing an analysis of economic data.
The system is built on the following main components:
  • Transcript Receiver: Responsible for collecting and preprocessing earnings call transcripts or other texts from various sources.
  • Finite State Chain Machine with LLM support: A modular, state-based processing framework that performs tasks, such as Translation, Summarization, Entity recognition, Entity filtering, Relation extraction, and Structuring data.
  • Graph Database: Stores the extracted information in a graph structure to enable efficient querying and analysis.
  • Knowledge Retrieval Module: Provides an interface for querying and retrieving specific information from the graph database.
  • Clustering Module: Identifies related concepts across several texts.
  • Data Visualization Module: Offers visualization tools to present insights and trends in an intuitive and user-friendly manner.
It is worth mentioning that most heavy-duty tasks are performed by the LLM, excluding the storing and querying of structured data. Here, the LLM is responsible for the following:
  • Translating and summarizing the input text that contains relevant information on any given topic. Translation is required as input texts regarding the same subject might come from different sources, so one has to standardize the content. Summarization is meant to remove unnecessary details and obtain brief content.
  • Finding entities and relations among them and representing them in a structured format.
  • Entity disambiguation. In this case, the LLM is asked to identify multiple entities that actually refer to the same concept or subject and merge them into a single, consistent representation.
The graph database system represents the place where the body of knowledge is stored. In this case, the LTM is constructed around entities and relations among them. When a new sentence/paragraph/document is received, depending on its size, it will be split into chunks of user-defined size (typically, we define the size of a chunk at 400 words, approximately 20–25 sentences). Then, for each chunk, we perform the following tasks:
  • chunk translation into English and summarization (the objective is to uniformize the input text and obtain a brief version of it). Correspondingly, the LLM model is fed with a prompt composed of the context c1 concatenated with the chunk:
    c1 = “Translate the text in English and build a very brief and concise bullet list version by including only the essential and relevant information. You must use the complete names for the identified entities such that each statement in the list could be self understandable and will refer to the same entities as the previous ones. The output must have around 150 words (plus or minus 10 words). Use simple telegraphic and logically connected sentences in plain English. Obey strictly this requirements. Here is the text:”
  • For a summarized English version of a chunk, we produce a structured version of it by identifying interacting entities and relations among them. To this aim, the LLM model is fed with a prompt composed of the context c2 concatenated with the summarized text:
    c2 = “Analyze the given text and break it down into individual statements to build a JSON file with a ‘brief’ key summarizing the text, and an ‘assertions’ key. The assertions is a list where each element has the key ‘stm’ stating the fact or opinion, ‘id’ with a unique ID (’"+id+"_’ + counter), ‘summary’ with a brief summary, ‘domain’ with a listing of relevant comma-separated domains, ‘keywords’ with four relevant comma-separated keywords, and ‘knowledge’ in a structured format with all the entities (defined by ‘name’, ‘type’, and ‘properties’) and at least two relationships among the found entities (defined by ‘source’:‘type’ entity, ‘relation’, and ‘destination’:‘type’ entity) describing the reason why the corresponding entities are linked. Once entities are detected, they must be disambiguated to distinguish between entities with similar or identical names. Identify potential ambiguities among these entities, considering contextual clues such as their relationships with other nodes. Use your understanding of domain-specific knowledge to disambiguate any ambiguous entities. In relationships you have to only use the discovered entities and nothing else. Produce only the JSON output following this structure strictly. The text is:”
    Here, the ID is a unique identifier composed of an ID of the analyzed text and a counter.
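The chunk-splitting step described above can be sketched as follows (a minimal Python illustration; the helper name and the purely word-based splitting are our own simplifications, not the system's actual code):

```python
def split_into_chunks(text, max_words=400):
    """Split a document into chunks of at most `max_words` words.

    Word-count splitting is a simplification; the actual system may also
    respect sentence or paragraph boundaries when forming chunks.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk obtained in this way is then sent to the LLM with the context c1 (and later c2) prepended.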
For example, for a given statement stm extracted from the original text, a snippet of the produced JSON is as follows:

{"stm": "The company saw strong performance across its portfolio and is excited about early
     traction in generative AI.",
"id": "a32_3",
"summary": "Company’s performance and interest in AI",
"domain": "Business, Technology",
"keywords": "Portfolio Performance, Generative AI, Traction, Excitement",
"knowledge": {
     "entities":
          [{"name": "The Company", "type": "organization"},
          {"name": "Generative AI", "type": "concept", "properties": {"field": "AI"}}],
     "relationships":
          [{"source": {"name": "The Company", "type": "organization"},
          "relation": "sees strong performance in",
          "destination": {"name": "Portfolio", "type": "concept"}},
 
          {"source": {"name": "The Company", "type": "organization"},
          "relation": "is excited about",
          "destination": {"name": "Generative AI", "type": "concept"}}]
}}
However, the JSON produced in this step may contain errors, such as typos and misspellings, which need to be corrected before we can aggregate the knowledge gathered from all the chunks. To handle this situation, we employed a library that automatically corrects minor JSON errors (e.g., json-corrector) based on the schema we provide. The schema is used to ensure that the data conform to a specific format, enforcing type constraints, and defining the required properties.
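As an illustration of this kind of correction, the sketch below repairs two frequent LLM output glitches (Markdown code fences around the JSON and trailing commas); it is our own simplification of what a schema-aware corrector such as json-corrector performs, and it omits schema validation entirely:

```python
import json
import re


def repair_json(raw):
    """Best-effort repair of common LLM JSON glitches (simplified sketch).

    Handles Markdown code fences wrapping the JSON and trailing commas
    before a closing bracket/brace. Schema-based validation, as performed
    by the actual system, is not reproduced here.
    """
    s = raw.strip()
    s = re.sub(r"^```(?:json)?\s*|\s*```$", "", s)  # drop code fences
    s = re.sub(r",\s*([}\]])", r"\1", s)            # drop trailing commas
    return json.loads(s)
```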
Once these steps are performed, the resulting JSON files are concatenated into a larger one, which will become the input of the next stage. There, we aim to uniformize and standardize all the discovered entities. To achieve this goal, the LLM model is fed with a prompt composed by concatenating the context c3 with the output JSON:
c3 = “The following JSON file represents the knowledge related to the same subject. Identify all the entities referring to the same concept/object/person and group and standardize them, ensuring consistency across all entities (find all the entities involved in relations and sharing the same name; group them into one, leaving as ‘type’ the most relevant type value which you consider). As output, produce only a new JSON file following strictly the same structure and no other additional text or explanations. Here is the input JSON:”
As a result, one obtains a JSON file containing the knowledge extracted from the input text. It is worth mentioning that the resulting JSON is further processed and all the values are sanitized (for instance, letters are lowercased, extra spaces are removed, special symbols such as ‘&’ and ‘$’ are replaced, and so on).
However, there might be cases where the system discovers, in different analyzed chunks, entities having the same name and type but different properties; in this situation, the properties of all matching entities are merged (their union is taken).
After structuring the input data as described above and performing some additional minor cleanup and adjustment of the output JSONs (relabeling IDs, etc.), the resulting file is further processed. Firstly, since the extracted entities in a given JSON file might still not be fully consolidated (even though, in a previous step, we instructed the LLM to identify, group, and standardize all the entities referring to the same concept/object/person, which removed many inconsistencies), we iterate through all of them. Based on the Levenshtein and Jaro–Winkler distances (see [30,31]) and Word2Vec cosine similarity (see [32]), we computed the similarities between their names and types and merged those that exceeded a certain threshold.
More specifically, supposing that in the JSON file built for a given company, the system identifies the entities:
                  {“name”: “ABC Corp”, “type”: “company”}
                  {“name”: “ABC Co.”, “type”: “organization”}
then the outcome of applying the described method is a single entity:
                  {“name”: “ABC Corp”, “type”: “company”}
In detail, a Word2Vec cosine similarity check can be performed on the types of the entities (in the provided example, organization and company). In this respect, one may consider two words semantically similar if their cosine similarity exceeds a threshold of 0.75. For entities of similar types, we employed a distance-weighted algorithm that combines multiple distance measures to determine the similarity between the entity names. Correspondingly, we computed the Levenshtein and Jaro–Winkler distances between the corresponding names (we used the TheFuzz library for computing the Levenshtein distance and set the threshold to 80; for the Jaro–Winkler distance, we used the jaro-winkler 2.0.3 library and set the threshold to 0.80). If the names are also similar, we merged the respective entities, as our goal is to create a single, formal representation for each entity (ensuring that it is consistently referred to across the JSON file). Since we compare each entity in the JSON with all subsequent entities, the overall procedure has quadratic time complexity with respect to the number of entities.
It is worth mentioning that higher thresholds for the Levenshtein and Jaro–Winkler similarities generally lead to higher precision (fewer false merges) but lower recall (more missed merges), while lower thresholds lead to higher recall but lower precision. The presented thresholds were deduced empirically by testing several values and picking those that consolidated several entities without producing inconsistencies. However, the values considered in this work are representative only of the (particular) input data and have not been shown to scale effectively to other datasets (optimal thresholds may vary depending on factors such as entity labeling conventions, domain-specific terminology, and data consistency).
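A minimal sketch of the name-matching part of this merging step is given below. Here, difflib's SequenceMatcher ratio stands in for TheFuzz's Levenshtein-based score, the Jaro–Winkler similarity is reimplemented in pure Python, and the Word2Vec check on entity types is omitted; the thresholds follow the values given above:

```python
from difflib import SequenceMatcher


def jaro_winkler(s1, s2, p=0.1):
    """Pure-Python Jaro-Winkler similarity (stand-in for the jaro-winkler library)."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if not n1 or not n2:
        return 0.0
    window = max(n1, n2) // 2 - 1
    f1, f2 = [False] * n1, [False] * n2
    matches = 0
    for i, c in enumerate(s1):                      # count matching characters
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not f2[j] and s2[j] == c:
                f1[i] = f2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    t, k = 0, 0                                     # count transpositions
    for i in range(n1):
        if f1[i]:
            while not f2[k]:
                k += 1
            t += s1[i] != s2[k]
            k += 1
    jaro = (matches / n1 + matches / n2 + (matches - t // 2) / matches) / 3
    prefix = 0                                      # Winkler common-prefix bonus
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)


def should_merge(name1, name2, lev_threshold=80, jw_threshold=0.80):
    """Merge two entity names only when both similarity checks pass."""
    a, b = name1.lower(), name2.lower()
    lev_ratio = round(100 * SequenceMatcher(None, a, b).ratio())
    return lev_ratio >= lev_threshold and jaro_winkler(a, b) >= jw_threshold
```

On the example above, `should_merge("ABC Corp", "ABC Co.")` holds, so the two entities would be consolidated into one.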
Remark 2. 
While significant effort has been devoted to designing general-purpose entity disambiguation tools (see [33,34,35]), financial data require particular attention as they are especially prone to ambiguity. In particular, these types of data contain many abbreviations, naming conventions, or specific roles for entities, which might cause challenges with regard to entity classification and the establishment of reliable semantic relationships among them. Although the method introduced above (which combines lexical and semantic similarity measures, such as the Levenshtein and Jaro–Winkler distances and Word2Vec cosine similarity) removes most of the basic inconsistencies, it does not provide meaningful entity fusion and disambiguation, particularly when entities share the same name and type but have distinct contextual properties across multiple document segments.
By integrating LLMs with traditional similarity metrics, a more robust and integrated solution for entity disambiguation can be obtained. LLMs have the ability to leverage contextual embeddings and nuanced understanding of language to accurately distinguish between entities with similar names that refer to different financial concepts (e.g., see the part of the prompt Use your understanding of domain–specific knowledge to disambiguate any ambiguous entities). This hybrid approach also offers better generalization across different formats, variations, and domains, surpassing the limitations of fixed thresholds. Furthermore, this technique could incorporate entity role inference, cross-document reasoning, and learnable merging criteria, making it adaptable to evolving data and more generalizable across diverse financial datasets (see [36]). However, this topic remains a new line of research and was not addressed here.
Once this step is completed, all discovered entities are inserted into the Neo4J graph database. Furthermore, the arcs among inserted entities are added to the graph database. For each relation, apart from its name (which captures the meaning of how the corresponding entities are connected), we store the following:
  • the keywords: several words which are relevant for the relation;
  • the domain: the field/scope of the relation;
  • the statement: a short sentence that briefly describes the relation;
  • the source: a string representing the source of information (e.g., the filename containing the analyzed information);
  • the timestamp: the moment when this relation was established;
  • the index: the order of the relation in the input JSON file(s).
In this setup, keywords are used to easily search for particular relations and their corresponding entities; the domain is used for classification; the statement is used to provide a larger context if needed while searching and discovering relations among entities. We also considered the source, which is useful for tracing back the original document from which the knowledge was extracted. The timestamp and index are used in conjunction with the names of entities, the names of relations, and statements to restore a temporal narrative after performing a search.
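A possible Cypher statement (for use with the Neo4j Python driver) that upserts an entity pair and the relation between them, with the properties listed above, might look as follows. The :Entity label and the generic :REL relationship type are our own assumptions, not the system's actual schema:

```python
def relation_upsert_query():
    """Cypher sketch: upsert two entities and the relation between them.

    The relation name is stored as a property because Neo4j relationship
    types cannot be parameterized; labels here are illustrative.
    """
    return """
    MERGE (a:Entity {name: $src_name, type: $src_type})
    MERGE (b:Entity {name: $dst_name, type: $dst_type})
    MERGE (a)-[r:REL {name: $rel_name}]->(b)
    SET r.keywords = $keywords,
        r.domain = $domain,
        r.statement = $statement,
        r.source = $source,
        r.timestamp = $timestamp,
        r.index = $index
    """
```

Such a query would be executed via `session.run(relation_upsert_query(), src_name=..., ...)` once per extracted relation.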
Once the knowledge graph is built from several input sources, one can find entities, or relations between them, whose name/properties contain the given input word(s). As a direct search for the input word(s) might not return all relevant data, we used several techniques to determine “similar” words and performed the search for all of them. For a given word, the Word2Vec technique was used to obtain its vector representation. Next, based on cosine similarity, one obtains other words that appear in similar contexts; hence, one may perform the search for all of them. In our setup, we used a pre-trained Word2Vec model (see [37]).
Although the Word2Vec technique of representing words as vectors in a high-dimensional space is effective for finding similar words, it is biased in several ways. For instance, the model is trained on large corpora that might not be representative of the analyzed subject/field; hence, the resulting word embeddings might not accurately capture similarities among words within the field (lexical bias). Additionally, Word2Vec representations tend to reflect the part-of-speech (POS) distribution of the training data, which may not accurately represent other linguistic categories. Moreover, in its simplest form, Word2Vec can only handle single words.
A possible solution to all these issues is achieved by finding similar words with a given input by using the LLM itself and a proper context. For example, a prompt that can be used to find similar words may be as follows:
Given the context [document context] provide a list of five closely related words to [input word], comma separated and without any additional explanations.
This particular method is suitable when the context is relevant to the search. Nevertheless, it is more time-consuming than the previous one but may provide better results in certain cases.
Moreover, since one might be interested in a broader context, for a given node found in the search, one might take the k neighboring nodes up to a given depth. For this example, we only considered the keywords and domain properties. The corresponding Neo4J query was designed to find specific relationships in the knowledge graph based on keywords or domains and then explore their neighboring connections up to a certain depth, capturing all relevant paths along the way. In this way, one may extract the corresponding statements and build the respective narrative. However, one should note that higher depth values increase query complexity and potential execution time, as they enable more extensive exploration.
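A sketch of such a query is shown below; since Cypher cannot parameterize path lengths, the depth must be interpolated into the pattern. The property and label usage here are illustrative assumptions rather than the system's exact query:

```python
def neighborhood_query(depth=2):
    """Cypher sketch: find relations whose keywords or domain contain any
    of the search words, then expand up to `depth` hops around their
    endpoints, returning the discovered paths."""
    return f"""
    MATCH (a)-[r]->(b)
    WHERE any(w IN $words WHERE r.keywords CONTAINS w OR r.domain CONTAINS w)
    MATCH p = (a)-[*1..{depth}]-(m)
    RETURN p
    """
```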
The results are clustered by timestamp (or source), and for each such cluster, one can aggregate all the data (in the order specified by the index) and feed the LLM with an input prompt template instantiated with the search results.
Imagine a scenario where:
  • [Entity A1] [Relation R1] [Entity B1]. The assertion is related to domain [Domain D1] and some relevant keywords are [Keyword K1]. The statement describing the relation is: [Statement S1].
  • …
  • [Entity An] [Relation Rn] [Entity Bn]. The assertion is related to domain [Domain Dn] and some relevant keywords are [Keyword Kn]. The statement describing the relation is: [Statement Sn].
  • Generate a brief narrative that summarizes the key findings incorporating relevant details about [input word]. Ensure that the story flows logically and provides context for future exploration.
This yields a list of texts (one for each cluster). Then, two techniques might be used to identify related concepts across different texts:
  • Using SBERT (see [38]) one can generate dense vector representations for all these texts/sentences which further allows to perform efficient similarity comparison between them. In this way, one may automatically discover related concepts, similar ideas, or patterns and trends.
  • Using the LLM and a proper context. For example, the input prompt might be the following:
    Given the texts [text/sentence] and [text/sentence] provide a list of common related concepts or ideas if any.
One can notice different trade-offs between these methods. For example, SBERT computes a 768-dimensional vector representation for an input sentence in tens of milliseconds (depending on the size of the model and the hardware configuration). Next, computing a distance metric (cosine similarity or Euclidean distance) between the vector representations of two input sentences is, in general, computationally cheap. Overall, performing these tasks for every pair of sentences yields quadratic time complexity in the number of texts. It is also worth noting that shorter sentences are represented more accurately by their vector representations than longer ones because the embeddings have a fixed size. On the other hand, running an LLM is more computationally intensive (especially when using larger LLMs), but the model is able to provide good results (in particular, when the input texts are larger).
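The SBERT-based comparison can be sketched as follows, assuming the sentence embeddings have already been computed (e.g., with SBERT's encode method); the toy vectors and the 0.8 threshold below are our own illustrative values, and the quadratic pairwise loop mirrors the complexity noted above:

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def related_pairs(embeddings, threshold=0.8):
    """All pairs (i, j) of texts whose embeddings exceed the similarity
    threshold; quadratic in the number of texts."""
    n = len(embeddings)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine(embeddings[i], embeddings[j]) >= threshold]
```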
In our setup, we used an offline local large language model via Ollama, a tool that allows one to run LLMs locally (see [39]). We employed several models (such as deepseek-r1:70b, see [24], and Llama 3.1 and Llama 3.3, see [23]) and configured the behavior of the model by specifying the temperature (which instructs the model to produce more diverse or creative output as this value increases; in the code snippet below, temperature = 0 produces the most constrained and factual output) and num_ctx (which specifies how many tokens from the input prompt are used to contextualize the generated text).

stream = ollama.generate(
     # model='deepseek-r1:70b',
     model='llama3.3',
     prompt=mycontext + "\"" + mytext_english + "\"",
     stream=False,
     options={'temperature': 0, 'num_ctx': 4096}
)
It should be emphasized that the main limitation of Llama 3.1 remains its occasional inconsistency in strict schema adherence and its reliance on prompt engineering to ensure valid structured outputs (see also [40] for a more detailed study on the model’s ability to generate structured outputs; there, across 24 experiments, the model achieved an average success rate of 82.55%, with performance varying significantly depending on task complexity and prompting strategy).
On the other hand, with deepseek-r1:70b or Llama 3.3, the reliability and consistency of structured output generation improved significantly. However, the output JSON sometimes contained syntactical errors that had to be fixed.
Yet another comment regards the size of the chunk and its influence on entity discovery. For example, in the case of one earnings call transcript text file (more specifically, the Apple Q1/2025 earnings call presentation, without the Q&A part: 3109 words, 18,793 characters), we obtained the following results (Table 1):
If the summarization part is removed, then one obtains the following results (Table 2):
These results deserve discussion: in the case of summarized chunks, the system discovered more entities than in the case of raw text chunks. This can possibly be explained by the prompt definition, which requires a bullet-list version including only the essential and relevant information (about 150 words, or about 5–6 assertion sentences). From each such sentence, the system extracts around 4–5 entities. On the other hand, if summarization is removed, the text contains filler material and the system is not able to directly determine all the existing relevant entities.
Apart from this, it is also important to mention that entity filtering is achieved by employing specific directives in the prompt (see c2 = “…Once entities are detected, they must be disambiguated to distinguish between entities with similar or identical names. Identify potential ambiguities among these entities, considering contextual clues such as their relationships with other nodes. Use your understanding of domain-specific knowledge to disambiguate any ambiguous entities…”). For example, from the existing 167 entities (the case in which summarization is performed and the chunk size is set to 400 words), one obtains 101 entities as a result (of which 96 are unique). The experiment was carried out by considering entity lists and calling the LLM with a prompt defined as above. However, it is important to mention that, depending on the size of the list, one can obtain similar but not identical results (a very close but possibly different number of disambiguated entities). In fact, an empirical remark is that the output “quality” of the employed LLM tends to be better for clear, concise, relatively small, structured, and unambiguous input (this confirms the findings of [41], where simpler and more direct prompts usually outperform complex ones).
In summary, in this section, the goal was to detail the design of an intelligent decision-support architecture capable of generating predictive analytics and actionable advice for users. However, with regard to identifying technological trends, in the next section this architecture will undergo slight modifications to capture the emerging patterns.

5. A Real Application: Finding Technological Trends

To test and validate the proposed system, one might use it to discover trends concerning tech companies’ interests in research and development. To analyze company goals, we collected a dataset of earnings call transcripts for various companies, as well as publicly available news and other freely available sources of information, for a given period of time. For example, one can use the Securities and Exchange Commission’s EDGAR database, which contains filings and transcripts from public companies. Additionally, other information might be taken directly from companies’ press releases as well as from various top (free) platforms for accessing financial data.
Earnings call transcripts represent rich but condensed information about the status of a company at a certain moment, providing an executive’s view of the company’s financial performance, its future business projects and plans, and its general strategy in the evolving market. They are written records taken during regular, scheduled conferences organized for analysts and investors. During these events, the focus is both on the financial results of the company (quarterly, yearly) and on future investment performance, revenue expectations, and risks. These details are provided directly by top executives or during a Q&A session in which they answer questions from analysts.
It is worth mentioning that while earnings call conferences/transcripts are not mandatory for companies, several guidelines and regulations from the Securities and Exchange Commission require companies to disclose information about their business operations and financial results (see Form 8-K, Regulation S-K (Rule 14a-9), Form 10-K, and the eXtensible Business Reporting Language (XBRL) data format). Most companies organize public conferences to present their development and future perspectives, as this is widely considered good practice and an open way to demonstrate their commitment to transparency (for a company, this is a common way to promote an open dialogue with stakeholders and to deliver information beyond the required minimum). From this point of view, these events benefit both the company and the stakeholders: senior executives provide a window into the company, while investors and analysts can gather knowledge and hence make more informed investment decisions.
In our study, we considered several top companies in the Technology sector (selected by market cap/popularity/activity as reported by Yahoo Finance, finance.yahoo.com, see Table 3). The total market cap of listed companies exceeds $20 trillion, making them representative of the tech sector.
The inputs to the developed algorithm were the last known (at the time of writing, i.e., the end of 2024 to the beginning of 2025) earnings call transcripts of the companies mentioned above. Consequently, as described in the previous section, for each company, the algorithm generated a JSON containing an array of objects, each corresponding to a statement in the summarized version of the text. Knowledge extraction from this unstructured data using the LLM provides, for each object, the entities involved as well as the relations among them.
Finally, the resulting JSON files were loaded into the Neo4J graph database, so the knowledge graph was built (see Figure 2). This knowledge graph represents a valuable asset that can be leveraged to gain insights and make predictions.
Based on the data included in the JSON files (built for each company), the resulting knowledge graph contains 9191 nodes and 10,727 relations. The goal is to discover economic trends by querying and analyzing the knowledge graph. To this aim, we implemented the following algorithm:
  • Step 1. Finding the entity node(s) in the knowledge graph that exhibit the highest in- or out-degree.
We order the entity nodes according to their combined in- and out-degree and take the first n of them, where n is greater than the number of studied companies. Empirically, the identified nodes represent the central/most important entities in the knowledge graph (for example, the company, the main strategy, and so on); we call them anchor entities. Since we assumed that the number/sizes of documents per company are roughly uniform (hence, the “semantic information” related to a company is roughly uniform as well), these anchor entities will most probably represent all companies considered in this study.
Once the anchor entities are selected, the following procedure is applied:
(a)
Use k-means clustering to group the anchor entities, based on their attribute values (say, the label attribute), into distinct clusters, using adaptive silhouette analysis to determine the optimal number of clusters.
For example, assuming that the number of anchor entities is 200 (where the number of analyzed earnings call transcripts is 35), the algorithm returned the following set of labels:
{person, concept, thing, group, organization, service, product, location}
By using the silhouette score, one can find that the optimal number of clusters is 3:
{0: [person, thing], 1: [concept, service, product, location], 2: [group, organization]}
(b)
Select the representative anchors by sorting the entities within each cluster based on their combined in- and out-degree and using a cyclic top-k selector (that is, cycling through all clusters, selecting on each pass the highest-value element from each cluster, until we obtain a list of k entities).
For example, we used k = 50, a number large enough to ensure that each company is represented by at least one anchor entity.
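The cyclic top-k selection described in (b) can be sketched as follows (a minimal illustration; the dictionary-based data structures are our own assumptions about how clusters and degrees might be represented):

```python
def cyclic_top_k(clusters, degree, k=50):
    """Cycle through the clusters, on each pass taking the highest-degree
    remaining entity from each cluster, until k entities are selected.

    `clusters` maps a cluster id to a list of entity names; `degree` maps
    an entity name to its combined in- and out-degree.
    """
    pools = [sorted(members, key=lambda e: degree[e], reverse=True)
             for members in clusters.values()]
    selected = []
    while len(selected) < k and any(pools):
        for pool in pools:
            if pool and len(selected) < k:
                selected.append(pool.pop(0))
    return selected
```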
  • Step 2. Contextually relevant subgraph extraction.
For each anchor entity node A discovered in Step 1, we obtain a subgraph by considering the entity nodes at a distance of at most t arcs from A (in our approach, we did not consider the direction of the arcs; hence, for this particular task, the graph is treated as undirected) that are “contextually relevant”. This is performed iteratively by considering only the neighbors that are semantically close to the entity node A. We considered the vector representation of the entity properties (using word embeddings such as Word2Vec) and picked, from the direct neighbors of A, only those that are “similar” to A (for this task, we used the vector dot product). Then, we proceed by considering the neighbors of neighbors, and so on, until we reach distance t. The described procedure is intended to reduce “noise” and to include a more focused context, needed to better highlight key patterns, themes, or entities related to the core topic. In essence, this expansion methodology acts as a contextual spotlight on the neighborhoods closer to the anchor entities.
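A simplified sketch of this similarity-filtered expansion is given below; the adjacency-dictionary representation, toy embedding vectors, and threshold value are illustrative assumptions (the actual system compares Word2Vec embeddings of entity properties):

```python
def contextual_subgraph(graph, anchor, embed, t=2, threshold=0.5):
    """Breadth-first expansion from `anchor`, up to `t` hops, keeping only
    neighbors whose embedding dot product with the anchor passes the
    threshold.

    `graph` is an undirected adjacency dict; `embed` maps each node to
    its embedding vector. The threshold value is illustrative.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    kept = {anchor}
    frontier = [anchor]
    for _ in range(t):
        nxt = []
        for node in frontier:
            for nb in graph.get(node, []):
                if nb not in kept and dot(embed[anchor], embed[nb]) >= threshold:
                    kept.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return kept
```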
  • Step 3. Assembling the core document.
We concatenate the entity-node properties and the corresponding relationship properties to form a text, the core document. As we have already pointed out, each relation has keywords, a domain, a statement, a source, a timestamp, and an index. Correspondingly, for each relation, we have the following text:
“Domain: [domain]; Keywords: [keywords]; Statement: [statement].”
All such texts are sorted by source, timestamp, and index, preserving in this way the original sequence of facts. Moreover, it is assumed that the core document contains data from all companies.
The core document will serve as a unified resource for analysis, potentially capturing broader trends across different entities.
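The assembly of the core document can be sketched as follows (the field names follow the relation properties described above; the record format is our own assumption):

```python
def build_core_document(relations):
    """Sort relation records by (source, timestamp, index) and render each
    as the 'Domain: ...; Keywords: ...; Statement: ...' line shown above."""
    ordered = sorted(relations,
                     key=lambda r: (r["source"], r["timestamp"], r["index"]))
    return "\n".join(
        f"Domain: {r['domain']}; Keywords: {r['keywords']}; "
        f"Statement: {r['statement']}"
        for r in ordered)
```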
  • Step 4. Identification of latent topics in the core document.
We employ an LLM to determine the latent topics existing in the core document, and we store/update them as they are identified. To perform this task, we used the following prompt:
“Analyze the following text and if it contains information about trends related to disruptive technologies, return a JSON that contains a list “trends” in which each element (that is, a discovered disruptive trend) has the properties: “trend_name”—at most four words describing the trend and “type”—one word used to classify the trend. If the text does not contain any information about trends, then do not return anything. Do not add any additional explanations. Here is the text:”
However, as the LLM is not able to discover trends when fed with a large input text (as is the case for the core document obtained in Step 3), we split it into chunks and analyze them individually. In our experiment, each chunk contains at most 10 consecutive sentences from the core document. As an example, for a given chunk, the output is a JSON of the following type:
{"trends": [{"trend_name": "Artificial Intelligence", "type": "Technological"},
       {"trend_name": "AI Adoption", "type": "Technological"},
       {"trend_name": "Cloud Growth", "type": "Digital"},
       {"trend_name": "Cybersecurity Consolidation", "type": "Security"}]}	
After running the algorithm, we obtained the following main types for the discovered trends (see Table 4).
The lists of all trends determined by running the algorithm for the first four types are shown in Appendix A (Technological), Appendix B (Digital), Appendix C (Technology), and Appendix D (Hardware). Finally, for a given type, one might run the clustering algorithm once again; based on the cosine similarity of the vectors corresponding to the identified trends and on the adaptive silhouette score, it is able to determine meaningful classes of trends.
In particular, for the Technological type, one obtains the following classes (here are presented only some values from each class):
  • Cluster 1: [Autonomy Advancement, AI Opportunity, Artificial Intelligence, AI Innovation, AI Adoption, AI Leadership, AI Growth, AI Search, AI Models, Agentic AI, Small AI, Custom AI, Self Driving Cars, AI Training, AI Infrastructure, AI Platform, AI Partnerships, AI Agent Collaboration, AI Services, …]
  • Cluster 2: [Power Electronics, Silicon Photonics, Advanced IoT, Custom Hardware, Edge Computing, High-Speed Networking, Hybrid Integration, Hardware Advancements, High Performance Computing, Advanced Process, Quantum Computing, Programmable Networks, Silicon Software Integration, 3D Dram, Advanced Chip, …]
  • Cluster 3: [Cloud Infrastructure, Cloud Adoption, Cloud Expansion, Multicloud Expansion, Cloud Growth, Cloud Computing, Cloud Migration, Cloud Optimization, Cloud Transformation, Cloud Computing Decline, Multicloud Adoption, Cloud Productivity, Cloud Performance, Cloud Integration, Cloud Automation, Crosscloud Solutions, …]
  • Cluster 4: [IoT Advancements, 5G Adoption, Data Innovation, Platformization, Robotics Growth, Data Analytics, Innovative Technologies, Cybersecurity Advancement, Cybersecurity Demand, Cybersecurity Growth, Wireless Access, 5G Development, 5g Content, Digital Transformation, Smartphone Market, IoT Growth, Robotics Expansion, Robotics Adoption, …]
Based on the content of each cluster, one might give them some relevant names. For example, the above clusters might be named 1: Artificial Intelligence Evolution, 2: Advanced Hardware, 3: Cloud Transformation, and 4: Connected Ecosystems.
For the Digital type, one obtains the following clusters:
  • Cluster 5: [Cloud Growth, Cloud Computing, Cloud Expansion, Cloud Transformation, Cloud Services, Cloud Transition, Cloud Native, Cloud Adoption, Cloud Infrastructure, Cloud Security, Cloud Gaming, Cloud Automation, Cloud Integration, Cloud Subscriptions, Cloud Migration, Cloud Service Dominance, Cloud Deployment, …]
  • Cluster 6: [Social Media, User Generated Content, User Content, Personalized Content, Web Creators, Custom Models]
  • Cluster 7: [AI Assistant, AI Search, AI Migration, Search Growth, AI Adoption, Artificial Intelligence, AI Marketing, AI Related Solutions]
  • Cluster 8: [Cybersecurity Expansion, Cross Service Integration, Unified Experiences, Remote Productivity, Platform Expansion, Platform Shifts, Remote Work, Internet Growth, Platformization, Cyber Security, Collaboration Network, Agentforce Growth]
  • Cluster 9: [Software Updates, Software Demand, CRM Disruption, Performance Tools, Broadband Necessity]
  • Cluster 10: [Autonomous Database, Business Automation, Data Analytics, Data Companies, Data Extraction, Ecommerce Discovery]
  • Cluster 11: [Mobile Apps, Video Monetization, Tech Innovation, Mobile Services, Mobile Offerings, Mobile Products]
  • Cluster 12: [Automated Cropping, Video Features]
For these identified clusters, one might assign the names: 5: Cloud Environment, 6: User-Generated Content, 7: AI-Driven Services, 8: Cybersecurity, 9: Software, 10: Data-Driven Technologies, 11: Mobile, 12: Video Processing.
The clusters obtained for the type Technology are as follows:
  • Cluster 13: [Artificial Intelligence, Reasoning AI, AI Recommendation, AI Data Centers, AI Based, AI Data Center, Quantum Computing, Data Analytics, AI Infrastructure, Conversational AI, AI Innovation, 5g Ai Hpc, AI Adoption, Generative AI, AI Powered, Energy Efficient, Connected Computing, Autonomous Driving, …]
  • Cluster 14: [Digital Media, Fixed Wireless, Materials Innovation, Digital Growth, High Bandwidth, Robotics Innovation, Digital Experience, HBM Capacity, HBM Adoption, Electro Optics, Digital Labor]
  • Cluster 15: [Cloud Growth, Cloud Computing, Cloud Database, Cloud Based, Cloud Services, Cloud Adoption, Cloud Transformation, Cloud Guidance, Data Cloud, Cloud Security, Cloud Optimization, Cross Cloud, Cloud Infrastructure]
The names of these clusters might be 13: Cognitive AI Technologies, 14: Digital Systems, 15: Cloud Infrastructure.
For the Hardware type, one obtains the clusters:
  • Cluster 16: [Advanced Semiconductors, Chipset Advancements, Chip Innovation, Custom Accelerators, Advanced Packaging, SSD Adoption, Advanced SSDs, Custom CPUs, SSD Expansion, SSD Enterprise, Silicon Demand, Nextgen Chips, Custom ASIC, Silicon Photonics, Custom Silicon, Custom Nic, High Yield Chips, Optics Performance]
  • Cluster 17: [Quantum Computing, Copackaged Optics, High Bandwidth, Edge Technologies, Edge Technology, Internet Things, N2 Technology, Edge Devices, Edge AI, Node Technology]
  • Cluster 18: [Electrooptics Growth, Scale Up Fabrics, Wearables Growth, Data Center Growth, HBM Growth, Semiconductor Growth, Rack Scale Solutions, Rack Scale, System Level Scaling]
  • Cluster 19: [Autonomous Vehicles, Advanced Cameras, AI Robotics, Robot Production, Robot Technology, Robotics Growth, Robot Delivery, Gate All Around, Lidar System]
  • Cluster 20: [High Capacity DRAM, Low Power, Thin Film Battery, HBM Revenue, High Capacity Modules, High Performance SSD, Fast Memory, High Bandwidth Memory, Dram Shipments, Compute Memory, Full Rack Solutions, Nand Improvement, Euv Wafers]
  • Cluster 21: [AI Smartphones, AI Glasses, AI Powered PCs, Industrial IoT, AI Appliances, AI Silicon, Custom AI Chips, AI Compute, AI Accelerators, Energy Efficient, Adaptive Computing, AI Powered Laptops, AI XPU, AI Servers, Smart Glasses, AI PCs, AI Custom Chips, Custom Logic]
  • Cluster 22: [Compute Performance, HPC Demand, Next Gen GPUs, Nextgen GPUs, XPU Production, GPU Service, Next Gen XPU, Nextgen XPU, GPU Business, Mainframe Cycle, GPU Growth, GPU Adoption, Gaming CPUs, Compute Infrastructure, GPU Demand, GPU Deployment]
  • Cluster 23: [Server Upgrades, Accelerated Infrastructure, Server Upgrade, Server CPU, Mac Upgrades, Device Upgrade]
  • Cluster 24: [Mobile Storage, Mobile G9managed, Mobile Processors, Internet Devices, Premium Smartphone, Smartphone Replacement, Data Center, Device Capability, Device Innovation, Device Replacement, Mobile Offerings]
In these cases, one might consider the names: 16: Advanced Chip Technologies, 17: Quantum-Enabled Edge Technologies, 18: Scalable Systems, 19: Artificial Intelligence Ecosystems, 20: Next-Gen Storage, 21: Intelligent Edge, 22: Supercomputing Development, 23: High-Efficiency Hardware Infrastructure, 24: Advanced Mobile Infrastructure.
As one can notice, some clusters from different types overlap; this might be attributed to the evolving nature of the technologies involved, where concepts like AI, Cloud, Technology, and Digital often intersect and influence each other. Although the respective clusters could easily be merged based on their similarity, in this paper we keep them distinct, preserving their types (which enables us to capture the intricate patterns within a type).
Concerning the technological trends discovered by employing the above-described method, it is worthwhile to compare them with Deloitte’s Tech Trends 2025 report (see [42]). More precisely, one can notice the profound influence and widespread stakes of artificial intelligence (AI) in the economy (and in future technology trends, in particular). Deloitte’s report sets AI as a common thread underlying nearly every trend and a foundational element of future activities. Similarly, the proposed algorithm also identifies AI as a significant trend type. Beyond AI, there are additional areas of convergence, particularly with respect to the technological domains prioritized. Deloitte’s report discusses trends in hardware driven by AI demands, the challenges and advancements in cybersecurity (specifically, quantum computing threats), and the modernization of core IT systems. Correspondingly, the proposed automated system identified trend types such as Technological, Hardware, Cybersecurity, and Computing, with specific trends related to Quantum Computing, Edge Computing, High Performance Computing, and Hardware Advancements emerging from its analysis. However, while Deloitte’s report provides qualitative insights and strategic context, our approach offers quantitative results in the form of algorithmically derived trend types and clustered trend names, contributing a data-driven perspective to technology trend forecasting.
The contributions from related research endeavors also include the analysis and identification of technology development trends. For instance, in [43], the significance of trend analysis is highlighted for strategic planning, innovation, business decision-making, and future competitiveness. There, too, an automated system and a data-driven methodology are proposed to overcome the challenges of manually identifying trends in large volumes of information. However, that approach differs significantly from ours in the specific methodology, the type of data analyzed, and the nature of the insights produced. More precisely, the authors detail a method for predicting the intensity of interactions between technologies by analyzing the co-occurrence of classification codes in patent data. Their approach utilizes a specific predictive model, the SAFE-LSTM, trained on a patent dataset to forecast future connection strengths between these codes. While our work identifies which trends are being discussed based on company communications, the cited work focuses on predicting how technologies will interact in the future, based on patterns identified in patent filings. Its main emphasis is on predictive modeling and forecasting, specifically generating “predictive trend insights” and a “landscape for future technology interactions”.
Yet another remark regards the model evaluation under different system changes. Firstly, the temperature parameter of the LLM might slightly influence the generated trends. Similarly, different LLM architectures or variants (e.g., model sizes and configurations) may produce different results (for example, Llama 3.1 8B yields similar results, but its structuring of the data into the specified JSON format was flawed). Moreover, the prompts themselves might be polished, rephrased, and optimized: although it is hard to evaluate the generated output semantically for a given prompt, there are useful tools and techniques (see [44,45]) for improving the prompt definition (for example, one can compare semantically similar prompts by the number of generated tokens). Here, we used an empirical approach, analyzing the consistency, relevance, and completeness of the responses generated by several prompt instances.
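As an illustration of this empirical evaluation, the consistency of the trend lists generated by several prompt instances (or temperature settings) can be quantified by their mean pairwise Jaccard overlap. This is a simplified sketch of such a check, not the exact metric used:

```python
# Sketch of an empirical prompt-consistency check: trend lists produced
# by different prompt instances (or temperatures) are compared by their
# mean pairwise Jaccard overlap; low values flag unstable prompts.
def jaccard(a, b):
    """Jaccard similarity of two trend lists (case-insensitive)."""
    a, b = set(map(str.lower, a)), set(map(str.lower, b))
    return len(a & b) / len(a | b) if a | b else 1.0

def prompt_consistency(runs):
    """Mean pairwise Jaccard similarity over the trend lists in `runs`."""
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return sum(jaccard(runs[i], runs[j]) for i, j in pairs) / len(pairs)
```

For example, three runs returning ["AI", "Cloud"], ["ai", "cloud"], and ["AI", "Edge"] yield a consistency of 5/9, roughly 0.56.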
The developed application ran on a Supermicro SYS-540A-TR system (Supermicro Netherlands, ’s-Hertogenbosch, The Netherlands, EMEA Headquarters) with an Intel Xeon Silver 4309Y × 16 processor (Intel, Santa Clara, CA, USA), 64 GB of RAM, a 1 TB SSD, and 4× NVIDIA RTX A5000 GPUs (NVIDIA, Santa Clara, CA, USA) with 24 GB of VRAM each, running Ubuntu 24.04.2 LTS. The system load during a run, as reported by NVTOP (a GPU monitoring tool) and by System Monitor (the default Ubuntu tool for monitoring CPU load, used memory, swap usage, and so on), is presented in Figure 3. The system snapshot was taken during the most intensive load, that is, when running the large language model (namely Llama3.3) to find entities and their relations in a text.
As can be empirically observed (see Figure 3), the computing load during the most intensive tasks is uniformly distributed across all available GPUs and CPUs, leaving enough headroom for further developments. Similarly, the memory consumption of the system stays well below the maximum available.

6. Conclusions

The research aims to extract meaningful insights and patterns from various input sources (e.g., the earnings call transcripts of top companies in the Technology Sector). For this purpose, an LLM-based knowledge extraction system is built to recognize entities and relationships in unstructured data (texts), perform semantic searches on them, and cluster the discovered anchors into coherent classes. These clusters enable us to discover hidden topics that could be pivotal.
The proposed process performs the following steps:
  • Collecting relevant data (e.g., obtaining earnings call transcripts for tech companies).
  • Feeding the data to an LLM that extracts the relevant entities and relationships.
  • Consolidating discovered entities.
  • Storing extracted knowledge in a graph database.
  • Querying the graph database to find the anchor entities.
  • Finding, for each anchor entity, the corresponding numerical vector (using Word2vec) of its label-type attribute, then clustering the anchor entities using K-means and adaptive silhouette analysis.
  • Running a top K filter on the clusters to isolate representative node–entity anchors.
  • For each representative node–entity anchor, finding its neighbors within a specified maximum distance t that are semantically close.
  • Building the core document.
  • Identifying the latent topics in the core document and finding trends.
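To illustrate the top-K filtering step above, the following sketch keeps, for each cluster, the K anchor entities whose embedding vectors lie closest (by cosine similarity) to the cluster centroid; the function and parameter names are illustrative, not those of the actual implementation:

```python
# Sketch of the top-K filter: for each cluster, keep the K anchor entities
# closest (by cosine similarity) to the cluster centroid; these serve as
# the representative node-entity anchors.
import numpy as np

def top_k_representatives(vectors, labels, k=3):
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit vectors
    labels = np.asarray(labels)
    reps = {}
    for c in sorted(set(labels.tolist())):
        idx = np.flatnonzero(labels == c)
        centroid = X[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        sims = X[idx] @ centroid                        # cosine similarities
        reps[c] = idx[np.argsort(-sims)[:k]].tolist()   # top-K by similarity
    return reps
```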
The experiment conducted on real-world datasets demonstrated the effectiveness and scalability of our approach. In this respect, we not only found trends confirmed by professional service firms specialized (among other things) in tech trend reports (see [42]), but also identified unlisted ones.
Moreover, the present work aims to contribute practical methodologies, leveraging LLMs together with finite state chain machines and graph databases as an efficient tool for processing texts to solve real business problems. In this respect, our goal is to facilitate the integration of other NLP techniques to further amplify results within these application realms.
The same ideas and techniques can be used to solve a whole range of problems involving the analysis of large volumes of unstructured text. For example, one might analyze data from a set of newspapers and automatically find anchor entities related to a given subject.
In an even larger context, one can design a long-term memory (LTM) for LLMs based on these concepts. This memory can be continuously updated and queried before making a call to the static knowledge base of the LLM. In this case, the LLM serves both as the language/conversation layer of the system and as an interface for communicating with the LTM.
Moreover, a promising line of research is the development of multimodal embeddings that combine text-based embeddings with other types of data, such as images and audio. This can be achieved through techniques like visual or auditory feature extraction, followed by fusion with existing text-based embeddings. In this way, one can create more comprehensive knowledge representations that capture a wider range of information. For example, in the context of financial analysis, multimedia data often contain valuable insights and context that may not be captured through text-based analysis alone. By incorporating multimodal embeddings into the presented system, the accuracy and effectiveness of knowledge extraction and semantic search could be improved. Furthermore, the proposed methods might be applied to social media data (texts, videos, video transcripts) to automatically infer trends and/or user behavior. However, in this case, significant effort would be needed to adapt the prompts to define a suitable schema for the collected data and to determine the entity types that are relevant for social media data.
It is worth mentioning that repeated calls to the general-purpose LLM are time-consuming. This is why we limit its usage to the main and necessary tasks (as described in the paper). However, one might imagine other use cases where the LLM could be part of the solution:
  • Given some brief context on an arbitrary subject and a user-input search word, the LLM is asked to find other similar and relevant words (in this case, similar words are determined using a broader context that can be set dynamically). The input search word (or pattern), together with the similar ones, is then used in a multiple-term regular expression search over the data stored in the knowledge graph.
  • Once the outcome of a given search is returned by the graph database system, the results are clustered according to their source/date. After this step, the LLM might build a narrative for each cluster, based on the names and properties of its member entities/relationships.
  • The clusters of trends obtained in the last step can be further processed by the LLM (for instance, a simple application might be the automatic cluster naming).
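For the first use case, the multiple-term regular expression search could be sketched as follows; the node label (`Entity`) and property (`name`) are illustrative placeholders, not the actual graph schema:

```python
# Sketch of the multiple-term regular expression search: the user's word
# and the LLM-suggested similar words are merged into one case-insensitive
# regex matched against entity names in the graph database.
import re

def build_regex_search(terms):
    """Build a parameterized Cypher query matching any of `terms`."""
    # (?i) makes the match case-insensitive; each term is regex-escaped
    pattern = "(?i).*(" + "|".join(re.escape(t) for t in terms) + ").*"
    query = "MATCH (e:Entity) WHERE e.name =~ $pattern RETURN e"
    return query, {"pattern": pattern}
```

The returned query and parameters would then be executed through the graph database driver (e.g., `session.run(query, **params)` with the Neo4j Python driver).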
Yet another comment regards the way the input data are processed and the corresponding limitations. A key step in finding trends involves analyzing a core document built by querying the knowledge graph and using an LLM to identify latent topics. It should be noted that LLMs are not designed to unveil (all) trends (or other patterns) when fed a single large input text; this requires splitting the input file into smaller chunks for individual analysis. This chunk-based processing, while practical, may fail to identify trends or patterns that span chunk boundaries or require broader context than is available in individual segments. Additionally, the raw JSON output generated by the LLM for structuring data can be malformed and contain syntactic faults, which can affect the assembly of the knowledge base before subsequent consolidation and cleaning. Consequently, there are certain limitations in the precise and comprehensive identification of tech directions.
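A minimal guard against such malformed JSON output might look as follows; this is a crude sketch (parse, attempt a simple repair by extracting the outermost JSON value, and skip the chunk otherwise), not the consolidation procedure itself:

```python
# Sketch of a guard for malformed LLM output: parse the raw response as
# JSON, attempt a crude repair by extracting the outermost JSON value,
# and return None so the caller can skip the offending chunk.
import json

def parse_llm_json(raw):
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # crude repair: keep only the span between the first '{' (or '[')
    # and the last '}' (or ']'), dropping any surrounding commentary
    for opener, closer in (("{", "}"), ("[", "]")):
        start, end = raw.find(opener), raw.rfind(closer)
        if 0 <= start < end:
            try:
                return json.loads(raw[start:end + 1])
            except json.JSONDecodeError:
                continue
    return None  # still malformed: the caller skips this chunk
```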
One final consideration regards the potential of LLMs to “hallucinate”, i.e., to produce results that are not grounded in reality. Although the presented approach has proven useful in trend analysis, it is essential to acknowledge that LLMs can occasionally generate responses based on their own biases and misunderstandings of context (for example, when the temperature parameter of the LLM is high). This may lead to unreliable trend analysis if hallucinated information is not accurately identified and corrected.

Author Contributions

D.F.S.: Conceptualization, investigation, software, writing. C.S.: Conceptualization, formal analysis, data curation. A.B.: Conceptualization, formal analysis, software. All authors have read and agreed to the published version of the manuscript.

Funding

The work of D.F. Sburlan was supported by CNFIS, Ministerul Educaţiei şi Cercetării, România, grant number CNFIS-FDI-2025-F-0183, Spirit Antreprenorial la Ovidius: Pregătire, Inovare, Lansare, Dezvoltare şi Afaceri de Succes—PILDAS. The work of A. Bobe and C. Sburlan was supported by Ministerul Investiţiilor şi Proiectelor Europene, România, Practică Racordată Actualităţii Constănţene în Tehnologia Informaţiei şi Comunicaţii (PRACTIC), G2024-63816/10.10.2024, within call PEO/71/PEO_P7/OP4/ESO4.5/PEO_A49—Promovarea dezvoltării programelor de studii terţiare de înaltă calitate, flexibile şi corelate cu cerinţele pieţei muncii—STAGII STUDENŢI– Regiuni mai puţin dezvoltate.

Data Availability Statement

The data presented in the study are openly available at https://drive.google.com/drive/folders/1tFlpiImC4A1prX1ULQe26XphjdStWwYJ?usp=sharing (accessed on 25 May 2025). These data were derived from the earnings call transcripts available in the public domain.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. The Trends Identified for the Type Technological

  • AI Initiatives, AI Revolution, Autonomy Advancement, AI Opportunity, Artificial Intelligence, AI Innovation, Product Innovation, AI Adoption, AI Leadership, Power Electronics, AI Growth, Silicon Photonics, AI Search, AI Models, Agentic AI, Small AI, Custom AI, Advanced IoT, IoT Advancements, Self Driving Cars, AI Instances, Open Source AI, Custom Hardware, AI Advancements, Autonomous Vehicles, 5G Adoption, AI Strategy, Edge Computing, AI Development, High-Speed Networking, AI Engineering, AI Expansion, AI Outcomes, Data Innovation, Hybrid Integration, Hardware Advancements, Autonomy Feature, AI Innovations, Platformization, AI Optimization, Transformational AI, Robotics Growth, High Performance Computing, AI Training, Data Analytics, AI Infrastructure, Advanced Process, Quantum Computing, Software Defined, AI Platform, Cloud Infrastructure, Cloud Adoption, AI Partnerships, AI Applications, AI Powered Devices, AI Data Center, AI Compute, AI Powered Tools, Cloud Expansion, AI Performance, AI Data Centers, AI Accelerators, AI Agent Collaboration, AI Collaboration, AI Capabilities, AI Strength, AI Notebooks, Hypergrowth AI, Multicloud Expansion, Semiconductor Growth, Cloud Growth, Advanced Packaging, AI Services, Innovative Technologies, Network Optimization, AI Solutions, Autonomy Features, Cloud Computing, Generative AI, AI Integration, AI Ecosystem, Cybersecurity Advancement, AI Market, AI Research, AI Personalization, XPU Production, Programmable Networks, Cybersecurity Demand, Cybersecurity Growth, Cloud Migration, Unified Data, AI Robotics, AI Momentum, AI Efficiency, Cloud Optimization, AI Scaling, Wireless Access, Silicon Software Integration, HBM Investment Increase, 5G Development, 5g Content, AI Offerings, 3D Dram, Hardware Renewal, Memory Innovations, Cloud Transformation, Mainframe Growth, AI Assist, Cloud Computing Decline, AI Demand Shift, Multicloud Adoption, AI Investments, Advanced Chip, Server Growth, Hardware Launch, AI Implementation, Digital 
Transformation, Smartphone Market, Advanced Memory, Advanced Logic, AI Insights, AI Workflows, AI Era, AI Guided Search, Optical Networking, AI Super Cycle, AI Orchestration, AI Workloads, NAND Improvement, Autonomy Opportunity, Silicon Design, IoT Growth, Language Models, AI Deployments, Compute Performance, Robotics Expansion, Robotics Adoption, Compute Memory Growth, GenAI Development, Copackaged Optics, Robotics Innovations, AI Productivity, Accelerated Computing, Autopilot Software, Real World AI, Node Growth, AI Progress, AI Control, AI Proliferation, AI Advances, AI Retail, Cyber Security, AI Transformation, Cloud Productivity, Multimodal Models, High Performance, AI Processing, Innovation Technologies, High Speed Products, Digital Upgrades, Cloud Performance, AI Automation, Automation Support, HBM4 Development, Virtual Assistant, AI Value, AI Chat, Cloud Integration, Cband Rollout, Linux Adoption, Cloud Automation, AI Deployment, Data Center Growth, Crosscloud Solutions, Autonomous Features, AI Products, Autonomous Transport, AI Assistant, AI Connect, Digital Economy, Digital Labor, AI CRM

Appendix B. The Trends Identified for the Type Digital

  • Cloud Growth, Cloud Computing, Social Media, User Generated Content, User Content, Cloud Expansion, AI Assistant, AI Search, Cloud Transformation, Cybersecurity Expansion, Software Updates, Cloud Services, Cloud Transition, Cloud Native, Autonomous Database, Cloud Adoption, Cloud Infrastructure, Cloud Security, Software Demand, Cross Service Integration, CRM Disruption, Cloud Gaming, Cloud Automation, Business Automation, Mobile Apps, Performance Tools, Data Analytics, Automated Cropping, Cloud Integration, Cloud Subscriptions, Cloud Migration, Video Monetization, Unified Experiences, AI Migration, Personalized Content, Cloud Service Dominance, Cloud Deployment, Search Growth, Remote Productivity, Platform Expansion, Tech Innovation, AI Migrations, AI Adoption, Cloud Products, Artificial Intelligence, Data Companies, Platform Shifts, Remote Work, Internet Growth, Data Extraction, AI Marketing, Platformization, Cloud Consumption, Cyber Security, Hybrid Cloud, Video Features, Ecommerce Discovery, Cloud Environments, Mobile Services, Cloud Business, Web Creators, Mobile Offerings, Cloud Subscription, Mobile Products, Cloud Demand, AI Related Solutions, Collaboration Network, Cloud Commitments, Cloud Momentum, Broadband Necessity, Cloud Scaling, Cloud Efficiency, Custom Models, Agentforce Growth

Appendix C. The Trends Identified for the Type Technology

  • Advanced Semiconductors, Chipset Advancements, Chip Innovation, Quantum Computing, Electrooptics Growth, Autonomous Vehicles, High Capacity DRAM, Low Power, AI Smartphones, AI Glasses, Scale Up Fabrics, Copackaged Optics, AI Powered PCs, Custom Accelerators, Compute Performance, Advanced Packaging, Advanced Cameras, High Bandwidth, Thin Film Battery, Industrial IoT, Wearables Growth, HPC Demand, AI Appliances, Edge Technologies, HBM Revenue, High Capacity Modules, Data Center Growth, SSD Adoption, High Performance SSD, AI Silicon, Custom AI Chips, AI Robotics, Edge Technology, Server Upgrades, Robot Production, AI Compute, AI Accelerators, Advanced SSDs, Custom CPUs, Accelerated Infrastructure, Fast Memory, Mobile Storage, Mobile G9managed, Robot Technology, Internet Things, SSD Expansion, SSD Enterprise, Mobile Processors, High Bandwidth Memory, Next Gen GPUs, Nextgen GPUs, Energy Efficient, Silicon Demand, N2 Technology, Adaptive Computing, Edge Devices, Nextgen Chips, Internet Devices, XPU Production, AI Powered Laptops, AI XPU, Robotics Growth, GPU Service, Next Gen XPU, Nextgen XPU, Dram Shipments, Server Upgrade, GPU Business, AI Servers, Custom ASIC, Mainframe Cycle, Premium Smartphone, Smartphone Replacement, Server CPU, Data Center, Smart Glasses, Robot Delivery, GPU Growth, Silicon Photonics, Gate All Around, HBM Growth, Semiconductor Growth, GPU Adoption, Custom Silicon, AI PCs, AI Custom Chips, Custom Nic, High Yield Chips, Custom Logic, Device Capability, Compute Memory, Gaming CPUs, Compute Infrastructure, Rack Scale Solutions, Rack Scale, Mac Upgrades, System Level Scaling, Full Rack Solutions, Optics Performance, HBM Technology, Lidar System, Nand Improvement, Device Innovation, GPU Demand, GPU Deployment, Edge AI, Device Upgrade, Device Replacement Hardware, Semiconductor Manufacturing, Node Technology, Euv Wafers, Mobile Offerings Hardware

Appendix D. The Trends Identified for the Type Innovative

  • Artificial Intelligence, Custom AI, Digital Experience, Autonomy Progress, Autonomous Vehicles, Autonomous Assets, AI Powered, Custom Solutions Innovative Interconnect Technologies, Virtual Reality, Digital Transformation, Industrial IoT, Quantum Computing, Autonomous Cars, Data Analytics, Custom Products, Digital Labor, Edge Technologies, AI Access Capabilities, Generative AI, Metaverse Expansion, Autonomy Features, Edge Technology, Autonomous Driving, Next Gen AI, AI Ecosystem, AI Startups, Automation Future, Agent Automation, Silicon Integration, New Architecture Savings, Personalized AI, 3D Dram, Robotics Investment, Automation Adoption, Robotics Adoption, Gen AI, Frontier Models, Glasses Computing, Computer Vision, Optical Networking, Silicon Photonics, Augmented Reality, Agentic AI, Copackaged Optics, RandD Investment, Robotics Use, Gate All Around, M4 Based Products, Generative Audio, Generative Ai, Advanced Packaging, Custom Programs, Optical Neural, Mobileye Tech, AI Functionality, Open Source Databases, Autonomous Mode, Autonomous Transport, Autonomous Sales, Agentic Activity, AI Acceleration, Next Gen Products, Digital Technology

References

  1. Castro, A.; Pinto, J.; Reino, L.; Pipek, P.; Capinha, C. Large language models overcome the challenges of unstructured text data in ecology. Ecol. Inform. 2024, 82, 102742. [Google Scholar] [CrossRef]
  2. Perron, B.E.; Luan, H.; Victor, B.G.; Hiltz-Perron, O.; Ryan, J. Moving Beyond ChatGPT: Local Large Language Models (LLMs) and the Secure Analysis of Confidential Unstructured Text Data in Social Work Research. Res. Soc. Work. Pract. 2024; online first. [Google Scholar]
  3. Raja, H.; Munawar, A.; Mylonas, N.; Delsoz, M.; Madadi, Y.; Elahi, M.; Hassan, A.; Serhan, H.A.; Inam, O.; Hernandez, L.; et al. Automated Category and Trend Analysis of Scientific Articles on Ophthalmology Using Large Language Models: Development and Usability Study. JMIR Form. Res. 2024, 8, e52462. [Google Scholar] [CrossRef]
  4. Thapa, S.; Shiwakoti, S.; Shah, S.B.; Adhikari, S.; Veeramani, H.; Nasim, M.; Naseem, U. Large language models (LLM) in computational social science: Prospects, current state, and challenges. Soc. Netw. Anal. Min. 2025, 15, 4. [Google Scholar] [CrossRef]
  5. Tu, X.; He, Z.; Huang, Y.; Zhang, Z.; Yang, M.; Zhao, J. An overview of large AI models and their applications. Vis. Intell. 2024, 2, 34. [Google Scholar] [CrossRef]
  6. Du, K.; Zhao, Y.; Mao, R.; Xing, F.; Cambria, E. Natural language processing in finance: A survey. Inf. Fusion 2025, 115, 102755. [Google Scholar] [CrossRef]
  7. Kamath, U.; Keenan, K.; Somers, G.; Sorenson, S. Large Language Models: A Deep Dive; Springer: Cham, Switzerland, 2024. [Google Scholar]
  8. Atkinson-Abutridy, J. Large Language Models: Concepts, Techniques and Applications; CRC Press: Boca Raton, FL, USA, 2024. [Google Scholar]
  9. Amaratunga, T. Understanding Large Language Models; Apress: Berkeley, CA, USA, 2023. [Google Scholar]
  10. Geroimenko, V. The Essential Guide to Prompt Engineering: Key Principles, Techniques, Challenges, and Security Risks; Springer: Cham, Switzerland, 2025. [Google Scholar]
  11. Giray, L. Prompt engineering with ChatGPT: A guide for academic writers. Ann. Biomed. Eng. 2023, 51, 2629–2633. [Google Scholar] [CrossRef] [PubMed]
  12. Marvin, G.; Hellen, N.; Jjingo, D.; Nakatumba-Nabende, J. Prompt engineering in large language models. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Proceedings of the ICDICI 2023, Tirunelveli, India, 27–28 June 2023; Springer: Singapore, 2024; pp. 387–402. [Google Scholar]
  13. Pawlik, L. How the Choice of LLM and Prompt Engineering Affects Chatbot Effectiveness. Electronics 2025, 14, 888. [Google Scholar] [CrossRef]
  14. Jacobsen, L.J.; Weber, K.E. The Promises and Pitfalls of Large Language Models as Feedback Providers: A Study of Prompt Engineering and the Quality of AI-Driven Feedback. AI 2025, 6, 35. [Google Scholar] [CrossRef]
  15. Polat, F.; Tiddi, I.; Groth, P. Testing prompt engineering methods for knowledge extraction from text. Semant. Web 2025, 16, 1–34. [Google Scholar] [CrossRef]
  16. Han, B.; Susnjak, T.; Mathrani, A. Automating Systematic Literature Reviews with Retrieval-Augmented Generation: A Comprehensive Overview. Appl. Sci. 2024, 14, 9103. [Google Scholar] [CrossRef]
  17. Lakatos, R.; Pollner, P.; Hajdu, A.; Joó, T. Investigating the Performance of RAG and Domain-Specific Fine-Tuning for the Development of AI-Driven Knowledge-Based Systems. Mach. Learn. Knowl. Extr. 2025, 7, 15. [Google Scholar] [CrossRef]
  18. Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model Based Multi-agents: A Survey of Progress and Challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Jeju, Republic of Korea, 3–9 August 2024; pp. 8048–8057. [Google Scholar]
  19. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
  20. AI, ChatGPT, OpenAI. 2025. Available online: https://chatgpt.com (accessed on 25 May 2025).
  21. AI, Gemini, Google. 2025. Available online: https://gemini.google.com/ (accessed on 25 May 2025).
  22. AI, Copilot, Microsoft. 2025. Available online: https://copilot.microsoft.com (accessed on 25 May 2025).
  23. AI, Llama, Meta. 2025. Available online: https://www.llama.com (accessed on 25 May 2025).
  24. AI, Deepseek, High-Flyer. 2025. Available online: https://www.deepseek.com (accessed on 25 May 2025).
  25. Huggingface Leaderboard. 2025. Available online: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?ref=hub.athina.ai (accessed on 25 May 2025).
  26. Linz, P.; Rodger, S.H. An Introduction to Formal Languages and Automata; Jones & Bartlett Learning: Burlington, MA, USA, 2022. [Google Scholar]
  27. Radó, T. On non-computable functions. Bell Syst. Tech. J. 1962, 41, 877–884. [Google Scholar] [CrossRef]
  28. Dumitriu, D.C.; Sburlan, D.F. Enhanced Human-Machine Conversations by Long-Term Memory and LLMs. Int. J. User-Syst. Interact. 2023, 16, 85–102. [Google Scholar]
  29. Hornsteiner, M.; Kreussel, M.; Steindl, C.; Ebner, F.; Empl, P.; Schönig, S. Real-Time Text-to-Cypher Query Generation with Large Language Models for Graph Databases. Future Internet 2024, 16, 438. [Google Scholar] [CrossRef]
  30. Wang, Y.; Qin, J.; Wang, W. Efficient approximate entity matching using jaro–winkler distance. In Web Information Systems Engineering—WISE 2017, Proceedings of the 18th International Conference, Puschino, Russia, 7–11 October 2017; Springer International Publishing: Cham, Switzerland, 2017; pp. 231–239. [Google Scholar]
  31. Recchia, G.; Louwerse, M. A Comparison of String Similarity Measures for Toponym Matching. In Proceedings of the First ACM SIGSPATIAL International Workshop on Computational Models of Place (COMP ’13), Orlando, FL, USA, 5–8 November 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 54–61. [Google Scholar]
  32. Liao, J.; Huang, Y.; Wang, H.; Li, M. Matching Ontologies with Word2Vec Model Based on Cosine Similarity. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021). AICV 2021, Settat, Morocco, 28–30 June 2021; Hassanien, A.E., Haqiq, A., Tonellato, P.J., Bellatreche, L., Goundar, S., Azar, A.T., Sabir, E., Bouzidi, D., Eds.; Advances in Intelligent Systems and Computing. Springer: Cham, Switzerland, 2021; Volume 1377. [Google Scholar]
  33. DBpedia Spotlight. 2025. Available online: https://www.dbpedia-spotlight.org/ (accessed on 25 May 2025).
  34. Falcon 2.0. 2025. Available online: https://labs.tib.eu/falcon/falcon2/ (accessed on 25 May 2025).
  35. Tagme. 2025. Available online: https://tagme.d4science.org/ (accessed on 25 May 2025).
  36. Lu, Y.T.; Huo, Y. Financial Named Entity Recognition: How Far Can LLM Go? In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), Abu Dhabi, United Arab Emirates, 19–20 January 2025; pp. 164–168. [Google Scholar]
  37. Johnson, S.J.; Murty, M.R.; Navakanth, I. A detailed review on word embedding techniques with emphasis on word2vec. Multimed. Tools Appl. 2024, 83, 37979–38007. [Google Scholar] [CrossRef]
  38. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  39. Ollama. 2025. Available online: https://ollama.com/ (accessed on 25 May 2025).
  40. Shorten, C.; Pierse, C.; Smith, T.; Cardenas, E.; Sharma, A.; Trengrove, J.; Luijt, B. StructuredRAG: JSON Response Formatting with Large Language Models. arXiv 2024, arXiv:2408.11061. [Google Scholar]
  41. Nananukul, N.; Sisaengsuwanchai, K.; Kejriwal, M. Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domain. Discov. Artif. Intell. 2024, 4, 56. [Google Scholar] [CrossRef]
  42. Deloitte Tech Trends Report. 2025. Available online: https://www2.deloitte.com/us/en/insights/focus/tech-trends.html (accessed on 25 May 2025).
  43. Chang, Z.X.; Guo, W.; Wang, L.; Shao, H.Y.; Zhang, Y.R.; Liu, Z.H. Forecasting and analyzing technology development trends with self-attention and frequency enhanced LSTM. Adv. Eng. Inform. 2025, 64, 103093. [Google Scholar] [CrossRef]
  44. Promptfoo. 2025. Available online: https://github.com/promptfoo/promptfoo (accessed on 25 May 2025).
  45. PromptPerfect. 2025. Available online: https://promptperfect.jina.ai (accessed on 25 May 2025).
Figure 1. The architecture of an intelligent decision-support system that can provide actionable insights and predictions for interested parties.
Figure 2. A subgraph extracted from the knowledge graph stored in Neo4J.
Figure 3. The system load during the execution of the application.
Table 1. The number of detected entities when considering different chunk sizes.

Chunk Size | No. of Words in Summary | No. of Chars | Total No. of Entities/Unique
400 | 979 | 5933 | 167/98
800 | 596 | 3538 | 79/45
Table 2. The number of detected entities when handling raw text, without summarization.

Chunk Size | Total No. of Entities/Unique
400 | 124/72
800 | 70/31
Table 3. Leading tech companies by capitalization and social impact as reported by Yahoo Finance.

Adobe (ADBE) | Booking Holdings (BKNG) | Meta Platforms (META) | Qualcomm (QCOM)
Advanced Micro Devices (AMD) | Broadcom (AVGO) | Microsoft (MSFT) | Salesforce (CRM)
Alphabet (GOOG) | Cisco (CSCO) | MicroStrategy (MSTR) | ServiceNow (NOW)
Amazon (AMZN) | CrowdStrike (CRWD) | NVIDIA (NVDA) | Super Micro Computer (SMCI)
Apple (AAPL) | Fortinet (FTNT) | NXP Semiconductors (NXPI) | TSMC (TSM)
Applied Materials (AMAT) | Garmin (GRMN) | Oracle (ORCL) | Tesla (TSLA)
AppLovin (APP) | IBM (IBM) | Palantir (PLTR) | Uber (UBER)
ARM Holdings (ARM) | Intel (INTC) | Palo Alto Networks (PANW) | Verizon (VZ)
AT&T (T) | Marvell Technology (MRVL) | Pinterest (PINS) |
Table 4. The types of the discovered trends and their corresponding numbers of occurrences.

Trend Type | Occurrences
Technological | 450
Digital | 228
Technology | 153
Hardware | 145
Financial | 138
Innovative | 109
Software | 102
Innovation | 75
Artificial | 73
Security | 73
Economic | 67
Cybersecurity | 62
Infrastructure | 52
Growth | 43
Transportation | 41
Industrial | 34
Computing | 31
Automotive | 29
AI | 29
Marketing | 25
Adoption | 21
Telecom | 20
Sustainable | 20
Network | 19
Sburlan, D.F.; Sburlan, C.; Bobe, A. Tech Trend Analysis System: Using Large Language Models and Finite State Chain Machines. Electronics 2025, 14, 2191. https://doi.org/10.3390/electronics14112191