Review

An Overview of Recent Advances in Natural Language Processing for Information Systems

by
Douglas O’Shaughnessy
Centre Énergie, Matériaux et Télécommunications (EMT), Institut National de la Recherche Scientifique (INRS), University of Quebec, Montreal, QC H5A 1K6, Canada
Appl. Sci. 2026, 16(2), 1122; https://doi.org/10.3390/app16021122
Submission received: 19 December 2025 / Revised: 14 January 2026 / Accepted: 18 January 2026 / Published: 22 January 2026

Abstract

The crux of information systems is efficient storage and access to useful data by users. This paper is an overview of work that has advanced the use of such systems in recent years, primarily in machine learning, and specifically, deep learning methods. Situating progress in terms of classical pattern recognition techniques for text, we review computational methods to process spoken and written data. Digital assistants such as Siri, Cortana, and Google Now exploit large language models and encoder-only transformer-based systems such as BERT. Practical tasks include Machine Translation, Information Retrieval, Text Summarization, Question-Answering, Sentiment Analysis, Natural Language Generation, Named Entity Recognition, and Relation Extraction. Issues covered include post-training through alignment, parsing, and Reinforcement Learning.

1. Introduction

Recent years have seen considerable interest in facilitating computational access by people to information in databases. There is a huge demand for users to be able to access and manipulate the massive amount of data in various servers (e.g., the Cloud and the Internet of Things). Such information can be in the form of text, images/video, audio (music and speech), and other data (e.g., from sensors). While the recent growth in such accessible data has been massive, ways to reliably and efficiently access this data have not always kept pace. Early question-and-answer systems were very simplistic and limited, with excessive constraints. Recent web search systems and digital assistants (e.g., Siri, Cortana, and Google Now) have become very efficient, but limitations remain.
Most people interact with information systems by queries, either textual (typed) or verbal (speech), using natural language to express their wishes. For efficient data access, such queries must be transformed into suitable representation forms to allow for efficient interaction with stored data. Thus, the research area of natural language processing (NLP) is foremost in facilitating such access. NLP converts natural human output (speech or text) into forms that facilitate data access. Typical text includes documents, email, web pages, articles, reports, and blogs. For example, in the domain of answering a question, a majority of useful natural language content is found in textbooks, encyclopedias, guidelines, and electronic records.
The NLP field has numerous applications in business intelligence and knowledge management; specific cases include Named Entity Recognition, Named Entity Linking, Coreference Resolution, Temporal Information Extraction, Relation Extraction, Knowledge Base Construction and reasoning, paraphrase detection, entailment recognition, discourse analysis, grounded language learning and image retrieval, computer-assisted translation, and grammatical error correction. For example, conversational recommender systems suggest to users potential items of interest [1]. One seeks to efficiently analyze text, speech, and other data, to find relevant knowledge and structured information, and to extract salient facts about specific types of events, entities, and relationships. Many problems in NLP involve multiple input modalities [2], thus requiring a broader view of the field, as we take in this article.
This paper is an overview of recent advances in NLP to permit better access to stored information. We do not give an extensive survey of methods, as that has been well handled in recent papers [3,4,5,6,7]. Instead, we explain NLP methods without great technical or algorithmic detail, avoiding detailed mathematical descriptions to keep the explanations accessible to non-experts. The cited papers have ample details for experts.
To motivate this work, Section 2 starts with a discussion of major applications of NLP for user interaction with information systems. Section 3 then examines how users interrogate such systems and receive useful responses. As computer models of human natural language now dominate the field, a longer Section 4 examines how models of human language handle the complex problems of artificial intelligence in human communication. Two important approaches to specific problems in the area are addressed in Section 5 (how to assign semantic roles to words in user questions) and Section 6 (fine-tuning language models). Section 7 discusses ways to evaluate the performance of information systems. Section 8 details the many specific ways to apply neural networks to user interaction for information systems. Specific sub-topics of modeling text sentences and their topics are examined in Section 9 and Section 10. Section 11 describes specific databases relevant to the field, as well as formal challenges that have motivated much research. We end with an examination of what to look for in the near future (Section 12).

2. Applications

There are many ways for humans to access information systems. In Information Retrieval, a user seeks to obtain desired information from a computer system; the closely related field of Information Extraction derives structured facts from text. One specific version of this interaction is Question-Answering (QA) [8]. Information systems can provide a list of documents, websites, or other data in response to a user query, as web search engines such as Google or Bing do. NLP assists the initial steps of this task via Natural Language Understanding (NLU), to convert inquiries into meta-forms to access the data, and then uses Natural Language Generation (NLG) to format text or speech output to the user. We describe methods for NLU and NLG below.
Specific forms of NL interaction include the following: (1) machine translation, where the input is speech/text in a user’s (native) language and the output is in a different desired (target) language [9]; (2) automatic text summarization, which transforms a text into a more concise version [10]; (3) sentiment analysis, where the affect or emotion present in some text or speech is estimated automatically [11]; (4) text mining, where useful elements are retrieved from broader text [12]; and (5) automatic annotation of web pages [12].
As an example, NLG could access time-series data in an information system and produce a weather forecast [13] or medical report [14]. Summarization of such data could be content selection (what to say) [15], or surface realization (how to say it, i.e., selecting, ordering, and inflecting words) [16]. The objective of content selection is to choose the correct information to output for a given semantic input and communicative goal. For summarization, one seeks to reduce a given document in size, while retaining as much important and relevant information as possible, with minimal distortion [17]. Some systems generate computer source code from natural language descriptions, using specialized large language models (LLMs) such as CodeLLaMa [18] and DeepSeek-Coder [19], as discussed below.
The architecture for an information system should represent models for aspects of business processes. This would include activities, groupings, and their hierarchical relationships, from the point of view of their functions. For a company, this means an overview of its organizational structure (human resources, machines, and their relationships), as well as events that generate data, correspondence, and documents; its products and services; and a schedule for any process chain. Such models often use a modular architecture: (1) an information extraction pipeline to parse interrogations, (2) a query formulation pipeline for automatic generation, and (3) output for interactive query review, refinement, and execution.

3. Methods for Question-Answering

Given the text of a question from a user, one NLP procedure is to first attempt to understand the intent of the question [20]. A common way is via forms of semantic parsing (shallow or deep) [21]. A question is converted into a meaning representation, then mapped to database queries. This applies to both word sense discovery (acquisition of vocabulary) and word sense disambiguation (understanding) [22].
Semantic parsing maps natural language text or utterances into a suitable meaning representation, often based on some formalism or grammar. QA without semantic parsing may use SPARQL (SPARQL Protocol and RDF Query Language) queries on interlinked data represented by Resource Description Framework (RDF) triples [23].
Core problems in QA for information systems include the following: complex query understanding, NL interfaces for cross-databases, real-time updating for dynamic knowledge bases, and handling unstructured documents. Many QA systems do well in answering simple questions, but have difficulty with queries that require significant reasoning (see Section 6). Most NL query systems assume well-formed single-sentence questions, but complex questions require more detailed interactive exchanges. They often involve multi-hop reasoning, constrained relations, or numerical operations [24]. Even multiple interrelated but simple questions require the capacity to process requests sequentially, in order to access information in relational databases. Task-oriented dialogue systems tend to require pre-defined semantic slots (e.g., simple SQL queries), only function for restricted domains, and are unable to handle the diverse semantics of most practical user queries. It is useful to do knowledge extraction and management from unstructured documents [25], as most texts of relevance are such.

4. Language Models (LMs)

Many applications for information systems involve LMs, which are computational systems that can be used to process human natural language. Large LMs (LLMs) are extensive and complex models that learn very detailed statistical properties and semantics, based on vast amounts of training text. As with all artificial neural networks (ANNs) (Section 8), LLMs exploit the availability of huge numbers of training examples, which provide knowledge to understand and transform language.
The basic task of an LM is the estimation of the probability of a specified sequence of text “tokens,” i.e., short units of text, usually words, but sometimes sub-word units (e.g., byte-pairs) and sometimes sets of words. Simple LMs use N-gram statistics, i.e., probabilities of sequences of N successive words in ordinary text. As LMs have grown in size and power (owing to advances in computer memory and speed), training texts have drawn on all sorts of available data on the internet and elsewhere (private companies also exploit their access to confidential data).

4.1. Basic N-Gram Models

Unigrams (N = 1) are individual words, and so a unigram probability is simply the general estimated likelihood of any given word among all possible text words (e.g., common “function” words such as a, an, the, with, for, etc., have relatively high probabilities, as they occur relatively more often than “content” words, such as nouns, verbs, adjectives, and adverbs). Bigrams are sequences of two successive words in text; trigrams use three words, etc. One can readily see that using large values of N for N-grams uses exponentially large memory. However, it is also the case that natural language often has correlations for large values of N. For example, subject noun phrases and their corresponding verbs usually have high correlation (e.g., animals eat, fish swim, water flows). Many languages allow for much correlation across many words in a sequence (e.g., in German, the verb occurs at the end of a sentence in general). Older NLP systems used tokenizers and part-of-speech taggers to identify the syntactic role of each word [26]. More modern systems use neural networks to indirectly do such tasks.
A challenge for basic LMs is to reliably estimate probabilities of successive words that are generally conditioned on long text sequences. The use of N-grams tends to assume the validity of a Markov assumption, i.e., that probabilities can be approximated by limiting the range of conditional probabilities. Simply extending N-grams to large ranges of N not only risks exponential memory growth, but also severe under-training, despite the increasing availability of lots of training text data. There are various approaches to mitigate this problem, such as class LMs that group words into categories based on attributes such as color and size (e.g., rather than have separate individual stored likelihoods for “blue ball” and “red ball,” use merged statistics for COLOR + ball). For more efficient LMs, one can cluster various word classes or allow for word sequences to skip some words [27].
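The N-gram estimates described above can be sketched in a few lines. This is a minimal illustration on a toy corpus, not a practical LM (real systems train on billions of tokens and add smoothing):

```python
from collections import Counter

# Toy corpus; a real LM would train on vastly more text.
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# "the cat" follows "the" in 2 of the 3 occurrences of "the":
print(bigram_prob("the", "cat"))  # 0.666...
```

Note that any bigram unseen in training receives probability zero here, which is exactly the sparsity problem that the smoothing and back-off methods below address.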
When combining statistics of N-grams into a single global probability, one can use “back-off” principles: when a higher-order N-gram has too few training examples, the model falls back to N-grams of lower N, whose statistics are based on larger numbers of training examples (i.e., shorter contexts are observed more often, and suffer less from data sparsity). There are many other variations for LMs [28]: the Good-Turing estimator [29], deleted interpolation [30], Katz backoff [31], and Kneser–Ney smoothing [32].
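As a concrete sketch of one such scheme, deleted interpolation blends higher- and lower-order estimates with tunable weights (the weight value below is arbitrary, chosen only for illustration; in practice it is tuned on held-out data):

```python
def interpolated_prob(p_bigram, p_unigram, lam=0.7):
    """Deleted interpolation: blend the higher-order (bigram) and
    lower-order (unigram) estimates with weight lam."""
    return lam * p_bigram + (1 - lam) * p_unigram

# An unseen bigram (p_bigram = 0) still receives probability mass
# from the unigram term, avoiding zero-probability predictions:
print(interpolated_prob(0.0, 0.01))  # 0.003 (approximately)
```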
Relationships among words in a text go well beyond simple statistics of word sequences such as N-grams. Relation extraction systems find closed-domain relationships. Open-domain relation extraction systems use specific phrases to define word relationships. To discover such relationships, sentences can be analyzed using techniques such as part-of-speech tagging, dependency parsers, or Named Entity Recognizers [33].
One method is multi-relational learning, which consists of the following: (1) statistical relational learning, such as Markov-logic networks, which directly encode multi-relational graphs using probabilistic models [34]; (2) path ranking methods, which explicitly explore the large relational feature space of relations with random walk [35]; and (3) embedding-based models, which embed multi-relational knowledge into low-dimensional representations of entities and relations via tensor/matrix factorization, Bayesian clustering framework, and neural networks [12].

4.2. Large Language Models

In recent years, research has developed LMs with very complex systems called Large Language Models (LLMs) [36,37]. They can provide a very detailed understanding of text, using a limited window of focus. With an advanced neural architecture, an LLM processes written instructions or questions from users and generates natural language text as output. After pre-training on large amounts of general information, LLMs have had great recent success in solving a wide range of information tasks by conditioning their models on a very small number of examples (“few-shot”) or even only using instructions describing the task (“zero-shot”). Conditioning an LLM is called “prompting” (Section 6.2), with either manual prompts or automatic ones.
LLMs use autoregressive Transformer neural architectures (Section 8.2.3), where word tokens are iteratively predicted. On the other hand, alternative masked diffusion language models (MDLMs) are not constrained to generate data sequentially. In comparison to masked language models, the MDLM objective is a principled variational lower bound, and it supports NLG by ancestral sampling [38].
Pre-Trained Language Models (PLMs) such as BERT [39] are commonly used in many applications due to their favorable performance. Other notable PLMs are the open-source BLOOM [40], LLaMa-1/2 [37], OPT [41], and Falcon [42]; these have performance similar to closed “product” pre-trained LLMs such as GPT-3 [43], Chinchilla [44], ChatGPT, Google’s Bard, PaLM [45], and Claude [46]. The latter are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety (reliability), but incurs high costs in computation and human annotation. With (commercial) private training, data and model architecture details are generally not shared with the general public.
LLMs can do a huge amount of unsupervised pre-training on textual corpora, and then use a limited amount of supervised fine-tuning (SFT) on high-quality data from an appropriate domain, which may depend upon the application. This latter stage can be expensive, but it is used to align the general model with specific human preferences. Prime methods for this fine-tuning are Reinforcement Learning (RL) from Human Feedback (RLHF) [47,48] and Direct Preference Optimization (DPO) [49].
RLHF fits a reward model to human preferences and uses RL to optimize an LM to produce responses with high rewards, while constraining it to remain close to the original model (KL-divergence constraint). Unfortunately, RLHF is far more complex than supervised learning. RLHF learns from training environments through interaction and rewards [50] and has three stages: SFT, reward modeling, and RL. Other related methods are syntax fine-tuning, knowledge preservation fine-tuning, and task-oriented fine-tuning [51]. DPO instead directly optimizes an LM using a loss function derived from the classic Bradley–Terry (BT) model [52]; this follows human preferences without explicit reward modeling or reinforcement learning. DPO increases the relative log probability of preferred responses, while including a dynamic, per-example importance weight. DPO uses a change of variables to define the preference loss as a function of the policy directly.
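The DPO loss just described can be sketched for a single preference pair; the log-probabilities below are illustrative placeholders, not outputs of a real model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy vs. reference log-probabilities
    of the preferred (w) and dispreferred (l) responses."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy already favors the preferred response more strongly
# than the reference model does, the margin is positive and the loss
# drops below log(2) (its value at a zero margin):
print(dpo_loss(-1.0, -3.0, -2.0, -2.5))
```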
As ANNs are almost universally based on huge numbers of examples and their huge networks are impossible to debug, there has been a recent effort toward “explainable AI,” which may eventually allow for ANNs that can be interpreted [53]. As of now, however, LLMs sometimes generate incorrect (though otherwise plausible) output information, known as “hallucinations” [54]. By dynamically integrating real-time search results from several search engines, LLMs can process contextually relevant information and reduce the chances of generating a hallucinated response [55].

4.3. Brief Summary of Use of Language Models

The importance of LMs was made very clear in the 1980s, when they were first integrated with acoustic analysis for automatic speech recognition, rendering a major increase in accuracy. They have recently been dominating most NLP applications, as they furnish an efficient means to interact with massive databases in a highly user-friendly fashion (e.g., no programming). LMs have evolved from simple, basic N-gram conditional likelihoods to massive neural networks.

5. Semantic Role Labeling (SRL)

Semantic hierarchies can offer ways to organize textual (NLP) knowledge.
SRL can be used to identify predicates and arguments in text, and to label their semantic roles in sentences. This may involve joint classification of semantic arguments [56,57], feature engineering for SRL [58,59], or WordNet [60]. In the ontology YAGO [61], categories in Wikipedia are linked to WordNet, which is limited by the scope of Wikipedia. These operations often involve automatic discovery of hypernym (word with a broad meaning or class) versus hyponym (more specific meaning; subordinate in a class) relations, e.g., “is-a” relations (“X is a dog”).
In SRL, each input sentence is analyzed in terms of possible propositions related to target verbs in the sentence. Other recognized constituents in the sentence are checked to fill semantic roles for the verb. The semantic arguments used are items such as agent, patient, and instrument, or adjuncts (locative, cause, temporal, and manner).
As a major objective in information systems is to extract meaning from text, one method for this uses distributional semantics. Distributional semantic models (DSMs) employ vectors that follow context, where target words in large corpora serve as proxies for meaning representations [62]. They use geometric techniques to measure similarity in the meaning of corresponding words [63]. This follows the Distributional Hypothesis [64], where words with similar linguistic contexts are likely to have similar meanings. DSMs represent lexical entries using vectors (embeddings) that demonstrate their distribution in text corpora.
Early DSMs established distributional vectors with word co-occurrence frequencies. Then, prediction DSMs learned vectors with shallow ANNs from local surrounding words. Now, contextual DSMs use deep neural networks to generate contextualized vectors for words. Contextual embeddings (e.g., ELMo and BERT) improve on global word representations (e.g., Word2Vec) [1,65].
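The geometric similarity measure used by DSMs is typically cosine similarity; a minimal sketch, with hypothetical 3-dimensional embeddings (real models use hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented toy embeddings: "cat" and "dog" share contexts, "car" does not.
cat, dog, car = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.1, 0.2, 0.9]
print(cosine(cat, dog) > cosine(cat, car))  # related words score higher
```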

6. Fine-Tuning LLMs

The important step of fine-tuning (transfer learning) for LLMs has several versions: prefix, adapter, task-oriented, and parameter-efficient. In prefix tuning, trainable tokens (soft prompts) are added to both the input and internal layers [66]; their parameters are shared. In these decoder-only models, instead of a causal mask, a fully visible mask is used for the prefix part of the input sequence, and a causal mask is used for the target sequence.
With adapter tuning, one adds small neural network modules to the original model, then fine-tunes them on specific tasks, thus fine-tuning a few external parameters instead of entire LLMs [67].
As for task-oriented fine-tuning [68], Parameter-Efficient Fine-Tuning [69] optimizes the subset of parameters fine-tuned, thereby reducing the overall computational complexity. One version of this is Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and inserts trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks [70]. This is one of several ways to do domain adaptation [71], i.e., update LLMs to different domains.
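The LoRA update can be sketched as follows: the frozen weight matrix W is adapted by adding a scaled low-rank product B·A, so only B and A are trained (a toy 2 × 2 example; real layers have thousands of dimensions):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, B, A, alpha=1.0):
    """LoRA: adapt a frozen weight as W + (alpha/r) * B @ A,
    where B (d x r) and A (r x d) are the only trainable matrices."""
    r = len(A)  # rank of the low-rank update
    BA = matmul(B, A)
    return [[w + (alpha / r) * ba for w, ba in zip(wrow, barow)]
            for wrow, barow in zip(W, BA)]

# Rank-1 update of a 2x2 weight: 2*2*1 = 4 trainable values; the savings
# over training W directly grow rapidly with the layer dimension.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]   # d x r, with r = 1
A = [[0.2, 0.4]]      # r x d
print(lora_effective_weight(W, B, A))
```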

6.1. Retrieval-Augmented Generation

Most LLMs base their training on huge prior fixed sets of data, which can limit performance on specific or dynamic topics that deviate from earlier training. An important recent approach called Retrieval-Augmented Generation (RAG) integrates real-time knowledge retrieval with LLM generation [72]. RAG uses a neural retriever to find the best text sections by dense similarity; these are concatenated with the query and conditioned into an NLG component. This grounds outputs in current, domain-specific data.
The knowledge that an LLM retains is stored in the many parameters of the model, which are difficult to update once they have been established via huge pre-training. RAG is semi-parametric: it pairs a non-parametric corpus database with a parameterized model. RAG combines a pre-trained retriever with a pre-trained sequence-to-sequence model (i.e., a generator) and does end-to-end fine-tuning to facilitate knowledge use. RAG dynamically fetches data from external knowledge sources and uses such retrieved information to organize answers, combining the advantages of generative models with the flexibility of retrieval models [72]. It improves performance by grounding answers in external knowledge, which lowers hallucinations and renders responses more accurate. Unlike most ANNs, which are fully opaque, RAG uses accessible sources, thus allowing users to check answer accuracy.
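A minimal sketch of the RAG retrieval step, with toy 2-dimensional embeddings and invented passages (a real system would use a dense neural retriever over a large indexed corpus):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_and_prompt(query_vec, query_text, docs, k=2):
    """Minimal RAG step: rank passages by dense similarity to the query
    embedding, then concatenate the top-k passages with the query to
    form the generator's input."""
    ranked = sorted(docs, key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    context = "\n".join(d["text"] for d in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query_text}"

# Invented passages with toy embeddings:
docs = [
    {"text": "Montreal is in Quebec.", "vec": [0.9, 0.1]},
    {"text": "Pandas eat bamboo.",     "vec": [0.1, 0.9]},
    {"text": "Quebec is in Canada.",   "vec": [0.8, 0.2]},
]
prompt = retrieve_and_prompt([1.0, 0.0], "Where is Montreal?", docs)
print(prompt)
```

The irrelevant passage is excluded, so the generator conditions only on retrieved evidence, which is what reduces hallucination.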

6.2. Prompts

NLP often uses fine-tuning to raise model performance by use of instructions (“prompts”), without needing to modify existing model parameters [73]. In “few-shot prompting,” one provides a few examples to the model for a specific task. LLMs can be fed with step-by-step reasoning examples, e.g., Chain-of-Thought (CoT) prompting [74], Automatic Prompt Engineer [75], and Chain of Code [76]. One can tune such systems by appending learnable tokens to the input of the model. CoT decomposes complex problems into smaller sub-problems, then solves them with NL reasoning steps. This works best if the knowledge to answer an input question exists in its context or in the model’s parameters (e.g., common sense reasoning, learned from much pre-training).
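Few-shot prompting can be sketched as simple prompt assembly: worked examples are prepended to the new query in the input–output format the model should imitate (the demonstrations below are invented):

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: demonstrations followed by the new
    query, ending where the model is expected to continue."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {query}\nA:"

examples = [("2 + 2?", "4"), ("3 + 5?", "8")]  # hypothetical demonstrations
print(few_shot_prompt(examples, "7 + 6?"))
```

Chain-of-Thought prompting works the same way, except each demonstration's answer includes intermediate reasoning steps rather than only the final result.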
Prompting is commonly used to improve LLMs for downstream applications. Generally, one would try to employ well-designed text prompt sequences in an LLM’s input. Finding an effective prompt from a small training set for a specific task is a discrete optimization problem with a large search space, and classic gradient-based optimization methods are not always useful.

6.2.1. Hard Versus Soft Prompts

Recent advancements in prompt optimization include (1) hard representation-based greedy methods and (2) soft representation-based gradient methods; both leverage gradient information to accelerate prompt optimization. Hard representation methods perform discrete optimization using hard tokens as intermediate representations: (1) AutoPrompt [77] uses a gradient-based token search strategy to locate useful prompts, and (2) Greedy Coordinate Gradient [78] uses a greedy strategy to advance optimization. Projected gradient descent methods [79] map a soft prompt to the nearest token in the vocabulary at each step of regular gradient descent.
Discrete (hard) prompt methods typically use hand-crafted sequences of human-interpretable tokens for model behavior [80]. However, many suitable prompts are only discovered by trial and error. Soft prompts have continuous-valued language embeddings, obtained through gradient-based optimization.

6.2.2. AI Agents

Some NLP approaches combine reasoning and acting with LLMs to handle language reasoning and decision-making tasks (e.g., an AI agent, such as ReAct [81]). In “zero-shot” prompting, a user typically enters a question text and receives a result with no additional input. More recently, AI agents can permit more complex user interactions, using planning, loops, reflection, and other control devices [82]. These agents use LMs to execute intended goals over multiple iterations and augment the model’s inherent reasoning capacity to accomplish a task.
AI agents do task planning, which needs reasoning capacity to do task decomposition, multi-plan selection, external module-aided planning, reflection and refinement, and memory-augmented planning [82]. As a result, the model decomposes a task into sub-tasks, selects a plan from among several options, exploits and revises any external plan, or uses external information to improve the plan. One of an agent’s abilities to solve complex problems is calling multiple tools that enable the agent to interact with external data sources, or to send or retrieve information from existing APIs. However, an issue that remains is avoiding both over-thinking when faced with simple reasoning tasks and under-thinking with more difficult questions [83].
The agent system called RAISE [84] adds a memory that resembles the short-term and long-term memory in people, by using a “scratchpad” for the former and similar previous examples for the latter. The system called Reflexion [85] uses success state, current trajectory, and persistent memory as metrics for relevant feedback to the agent.

7. Distance Metrics for NLP

There are several ways to measure success in training for information systems. Mean Absolute Error has been used for regression tasks [86], whereas for classification tasks, one generally uses Precision, Recall, and F1-score [87]. For recommendation tasks, Mean Reciprocal Rank [88] is in common usage. NLG tasks employ metrics like BLEU [89].
The reward model (RM) [90] is a metric that estimates the degree to which model outputs align with human preferences. RMs typically follow the classic BT model [52], but one also uses regression paradigms [91] and the “LLM as a judge” approach [92]. The Area Under the Receiver Operating Characteristic Curve is a common threshold-independent metric [93]. Human coding metrics include the Jaccard Score [94].
One can measure the performance of a topic model by the log-likelihood of the model on held-out test documents, i.e., its predictive accuracy. Another measure is perplexity, the exponential of the negative average log-likelihood per token; lower perplexity indicates better prediction. Other common measures are topic coherence and topic diversity [7]. For coherence, point-wise mutual information measures the proximity of words in each topic based on relative co-frequencies with each other. Topic diversity uses the percentage of unique words in a topic set.
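The relation between perplexity and average log-likelihood can be sketched directly (the token log-probabilities here are illustrative):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-average log-likelihood per token); lower is better."""
    avg = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg)

# A model assigning every token probability 0.25 has perplexity 4,
# i.e., it is as uncertain as a uniform choice among 4 tokens:
print(perplexity([math.log(0.25)] * 10))  # 4.0
```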

8. Neural Networks in Information Systems

As with many computer applications in recent years, artificial neural networks (ANNs) have come to dominate the field of information systems. At the cost of significant computer resources, ANNs provide an automatic processing tool to accomplish a wide range of artificial intelligence tasks, including all those mentioned above for NLP. While termed “neural,” such processing algorithms are only very loosely associated with the natural nervous system found in humans and other living beings. The elemental basis of both natural and artificial neural systems is the neuron (called a node in ANNs), which transforms a weighted set of inputs (from other neurons) to a single output in a nonlinear operation. By arranging huge numbers of such neurons/nodes in increasingly complex architectures, arbitrarily complex tasks can be accomplished.
The basic ANN feedforward architecture (“multi-layer perceptron”) arranges neurons in layers, whereby the outputs of one layer feed into a successive (higher) layer, and so forth. The initial lowest layer accepts input data from sensors, and the highest layer can provide a classification of the input into a set of predetermined classes (i.e., do pattern recognition). Alternatively, the output can be a set of data that corresponds to a transformation of the input data (e.g., text-to-speech synthesis or NLG). The nonlinear processing in ANNs allows for analysis and synthesis of complex data patterns, but needs huge network models and much computation. The term deep neural network (DNN) refers to an ANN with several layers of nodes, in contrast to shallow models such as support vector machines [95]. Intermediate “hidden” layers in an ANN refine features of analysis (e.g., as data progresses layer-to-layer, in a well-trained ANN, the outputs at each layer gradually move toward more relevant features for a given task). The number of nodes in each layer varies greatly, and is often chosen empirically.

8.1. Fundamentals of ANNs

For NLP classification, such as text interpretation, the training of one node in an ANN can be viewed as optimizing the placement of hyperplane boundaries for class regions in a representation space. In a biological (natural) neuron, dendrites route incoming electrical signals to the cell body, whose axon then yields a binary output (a brief pulse in time) when the weighted sum of its inputs exceeds a threshold. The output of an ANN node follows an activation function y = φ(wx + b), where w is an N-dimensional weight vector, b is a scalar bias, x is an N-dimensional set of inputs for the node, and φ(.) is a nonlinearity (e.g., sigmoid, rectifier, or tanh). For each node, its parameters specify the location of a hyperplane in a virtual space; the node’s output indicates on which side of the hyperplane the input lies. Huge DNNs, with many millions of parameters, are often used to attain enough precision for a desired task.
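The node equation y = φ(wx + b) can be sketched with a sigmoid nonlinearity; the weights and inputs below are arbitrary illustrations:

```python
import math

def node_output(w, x, b):
    """One ANN node: y = phi(w . x + b), with phi the sigmoid function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Two inputs on opposite sides of the node's hyperplane w.x + b = 0:
print(node_output([1.0, -1.0], [2.0, 0.5], -0.5))  # > 0.5 (positive side)
print(node_output([1.0, -1.0], [0.0, 2.0], -0.5))  # < 0.5 (negative side)
```

The sign of w·x + b determines which side of the hyperplane the input lies on, which is exactly the geometric picture described above.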
Training of an ANN's parameters starts from initial estimates chosen randomly (or pre-trained, using much unlabeled data). Iterative training alters these parameters incrementally so as to minimize a loss (or cost) function. Directly optimizing task accuracy is not feasible because ANN training requires a differentiable loss, so that a product chain of derivatives (back-propagation) can indicate the direction and amount by which to alter each parameter at every training iteration, typically via gradient descent. Loss functions approximate a penalty for ANN classification errors. They vary greatly across applications, but a common one is cross-entropy [96], which is based on the log-likelihood of the training data; for regression with Gaussian-distributed outputs, the negative log-likelihood instead reduces to the least-squares (L2) error.
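The iterative loss-minimization loop can be sketched for a single sigmoid node trained with cross-entropy on toy two-class data (the data, learning rate, and iteration count below are illustrative choices, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data: two Gaussian clusters (illustrative only)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(+1, 0.5, (50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])  # target labels

w, b = np.zeros(2), 0.0
lr = 0.5                                 # learning rate, chosen ad hoc

for _ in range(200):                     # iterative gradient-descent updates
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid outputs
    # Cross-entropy loss L = -mean(t*log y + (1-t)*log(1-y));
    # its gradient w.r.t. the pre-activation is simply (y - t)
    grad = y - t
    w -= lr * (X.T @ grad) / len(t)
    b -= lr * grad.mean()

y = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # outputs under the final parameters
accuracy = np.mean((y > 0.5) == t)
```

Each update moves the node's hyperplane a small step in the direction of steepest loss decrease, exactly the incremental adjustment described above.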

8.2. Common ANN Architectures

In a basic (fully connected) ANN, all nodes in each layer provide their outputs to all nodes in the next layer. Each connection requires a multiply-and-add operation (hence the popularity of graphics processing units, which specialize in such calculations).
Such a general approach is usually excessively costly (in memory and computation). The useful information in most data (including text) is distributed very non-uniformly in its dimensions (e.g., in time). Thus, most applications use versions of the more efficient ANN components discussed below.

8.2.1. Convolutional Neural Networks (CNNs)

For many tasks, relevant information in an input data sequence is largely localized (i.e., found in limited spans); e.g., in images, objects have useful edges that occupy a small percentage of the full data range. In addition, in localized regions, data may display pseudo-random variations. Such cases suggest the use of filters to exploit these factors. It is also common to view data as a tensor (e.g., a matrix, if the data are conveniently viewed in two dimensions). For example, while audio data such as speech is a signal showing air pressure as a function of time, a spectral display notes energy as a function of both time and frequency; smoothing edges of a wide-band spectrogram is useful in ASR. In such cases, small patches of the 2-D data are multiplied element-wise by a small square weight matrix (e.g., 3 × 3) and the results summed, as the filter slides across the input [97]. One may also apply 1-D convolution to vectors, in cases such as text, which do not benefit from a 2-D display. CNNs thus usually have alternating layers of “convolution” (weighted multiply-and-add filters) and pooling; the pooling layers reduce the size of the representation from the previous layer, and thus perform data reduction.
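The sliding filter and the pooling step can be sketched in 1-D (the input sequence and the edge-detecting filter below are illustrative):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide a small weight filter along a 1-D sequence (no padding)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Reduce resolution by keeping the maximum of each non-overlapping window."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.array([0., 0., 1., 1., 1., 0., 0., 0.])
edge_filter = np.array([-1., 1.])    # illustrative filter responding to rising edges
feature_map = conv1d_valid(x, edge_filter)  # localized responses at the edges
pooled = max_pool(feature_map)              # coarser map after data reduction
```

The feature map responds only where the signal changes, illustrating how convolution exploits localized information while pooling discards resolution the task does not need.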

8.2.2. Recurrent Neural Networks (RNNs)

To further take advantage of the uneven distribution of pertinent data across dimensions, nodes in ANNs can be weighted across time. For example, CNNs usually exploit data correlations over small local ranges, but recurrence can be helpful to utilize significant correlations over long ranges. RNNs feed their outputs back as inputs at the next time step, using distributed hidden states that store information about past inputs [98]. In gated variants, specialized network gates (called input, forget, and output gates) control the flow of information across time steps. Typical RNN variants are long short-term memory (LSTM) and gated recurrent unit (GRU) networks.
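A minimal sketch of the recurrence, using a simple (ungated, Elman-style) RNN cell with illustrative random weights and inputs:

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Elman-style recurrence: the hidden state h carries information
    about all past inputs forward through time."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)  # feedback of the previous hidden state
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(1)
W_x = rng.normal(0, 0.5, (4, 3))   # input-to-hidden weights (illustrative sizes)
W_h = rng.normal(0, 0.5, (4, 4))   # hidden-to-hidden (recurrent) weights
b = np.zeros(4)
xs = rng.normal(0, 1.0, (6, 3))    # a sequence of six 3-dimensional inputs
states = rnn_forward(xs, W_x, W_h, b)
```

The hidden state at step t depends on every earlier input, which is what lets recurrence capture the long-range correlations mentioned above; gated variants (LSTM, GRU) add learned gates to control this flow.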

8.2.3. Attention

A major recent development in the field of ANNs is called attention—a way to improve performance by emphasizing weights for specific sections of data; this is another method to exploit the non-uniform distribution of information. The focus of attention is determined mathematically as correlations among data, using matrix operations (multiplications and sums) over queries (inputs), keys (features), and values (desired outputs). In encoder–decoder (cross-) attention, the queries come from a decoder layer and the keys and values come from encoder outputs; in self-attention, all three are derived from the same layer. Attention can be readily applied across various dimensions, including time, frequency, and ANN layers. While the principle of such correlation-based attention is sound, the simple correlation mechanisms used are often inadequate to fully exploit the complex information distribution in data.
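The query–key–value computation can be sketched as standard scaled dot-product attention (the shapes below are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: correlations between queries and keys
    produce softmax weights that mix the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key correlations
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = rng.normal(size=(3, 8))   # 3 query positions, dimension 8 (illustrative)
K = rng.normal(size=(5, 8))   # 5 key positions
V = rng.normal(size=(5, 8))   # one value vector per key
out, w = attention(Q, K, V)
```

Each output row is a weighted mixture of the value vectors, with the weights concentrating on the keys most correlated with that query.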
Neural models built on attention are often called Transformers [99]. While recurrent networks (RNNs) process inputs in sequence, Transformers process all positions in parallel: input tokens are mapped through an embedding table and combined with positional encodings. These encodings are typically fixed sinusoidal functions of position (or learned vectors) that preserve information about token order.
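One common choice of positional encoding, the fixed sinusoidal scheme from the original Transformer paper, can be sketched as follows (sizes are illustrative):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings: each position gets a unique pattern of
    sines and cosines whose wavelengths form a geometric progression,
    so the model can recover token order without recurrence."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sines in even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosines in odd dimensions
    return pe

pe = positional_encoding(50, 16)  # added to token embeddings before attention
```

In practice these vectors are simply added to the token embeddings, giving otherwise order-blind attention layers access to position.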
A major disadvantage of Transformers is their need for more computation than other ANN approaches; the cost of self-attention grows quadratically with the length of the data sequence. Nonetheless, the Transformer architecture can learn general structural information from high-dimensional data and exploit the huge amounts of generally available unlabeled data through self-supervised and unsupervised learning. RNN encoder–decoder models with attention are called sequence-to-sequence models.

8.3. Neural LLM Architectures

Many information systems that use analysis and synthesis (e.g., of text) apply an approach involving coding and decoding of data. (This is analogous to the inverse operations of radio transmission and reception.) In an ANN realization of such an encoder/decoder structure, the initial network layers act as an encoder that automatically learns hidden latent features of the data in a compressed representation, and the ensuing layers act as a decoder that forms a reconstruction. In many audio and video coding schemes, the decoder steps proceed in inverse fashion to the encoder steps. When such an encoder–decoder is trained to reconstruct its own input from unlabeled data, it is called an autoencoder.
Many neural systems use deep generative models that try to learn complex data distributions through unsupervised training. The GAN (generative adversarial network) architecture [100] uses two networks, a generator and a discriminator, in a minimax game. The generator produces data from a low-dimensional latent representation space, often starting from a simple Gaussian noise vector. The discriminator learns to distinguish between “real” training data and “fake” generator outputs. The generator's objective is to maximize the discriminator's error rate, while the discriminator tries to minimize its own error rate.
The encoder processes an input sentence into a “hidden” representational form, which attempts to capture relationships between words and the overall textual context. For example, the widely used model BERT employs only an encoder (no decoder), with bi-directional attention.
In an autoencoder, the decoder utilizes the hidden space of the system to generate a target output text; it converts the abstract representation into specific, contextually relevant expressions. Examples of such encoder–decoder models include PLBART [101], CodeT5 [102], AlphaCode [103], and CoTexT [104].
For many applications, it is not necessary to use both an encoder and a decoder. For example, GPT uses only a decoder; indeed, most current LLMs are decoder-only with causal attention, i.e., earlier tokens in a text do not attend to ensuing tokens. Other decoder-only systems include Google Gemini [105] and Claude.
The relative effectiveness of encoder–decoder versus decoder-only architectures remains an active research question. Decoder-only approaches can be more parameter-efficient because a single module learns representations for both source and target sequences jointly, whereas the encoder–decoder approach requires separate components for each.
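The causal attention used in decoder-only models can be sketched by adding a mask to standard self-attention, so that no token attends to later positions (shapes illustrative):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Decoder-only (causal) self-attention: position i may attend only to
    positions j <= i, so attention never flows from future tokens."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # forbid attending to the future
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                # softmax over allowed keys
    return w @ V, w

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))      # 4 tokens, dimension 8 (illustrative self-attention)
out, w = causal_attention(X, X, X)
```

The first token can attend only to itself, and the weight matrix is lower-triangular, which is exactly what allows such models to be trained for next-token prediction.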

8.4. Brief Summary of Neural Networks in Natural Language Processing

The dominant methods for information system interactions use deep ANNs and LLMs. Virtually all such models combine convolutional, recurrent, or attention-based (Transformer) layers. Many are versions of either encoder–decoder or decoder-only architectures.

9. Sentence Models

Distributional semantic models (DSMs) use vectors that keep track of text contexts (e.g., co-occurring words) in which target terms appear in a large corpus as proxies for meaning representations, and apply geometric techniques to these vectors to measure the similarity in meaning of the corresponding words [106].
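The geometric-similarity idea behind DSMs can be sketched with hypothetical co-occurrence vectors and cosine similarity (the words, contexts, and counts below are invented for illustration):

```python
import numpy as np

# Hypothetical co-occurrence counts for three target words over four
# context words, e.g. ("drink", "road", "engine", "cup")
vectors = {
    "coffee": np.array([8., 0., 0., 6.]),
    "tea":    np.array([7., 0., 0., 5.]),
    "car":    np.array([0., 9., 7., 0.]),
}

def cosine(u, v):
    """Similarity of meaning proxies: cosine of the angle between context vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_coffee_tea = cosine(vectors["coffee"], vectors["tea"])  # near 1: similar contexts
sim_coffee_car = cosine(vectors["coffee"], vectors["car"])  # 0: disjoint contexts
```

Words appearing in similar contexts end up with nearly parallel vectors, realizing the distributional hypothesis that context distributions serve as proxies for meaning.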
A general class of basic sentence models is that of Neural Bag-of-Words (NBoW) models [107]. These generally consist of a projection layer that maps words, sub-word units, or N-grams to dense embeddings, which are then combined with a simple operation such as summation or averaging. Large-scale word embeddings can provide context for topic models (discussed in the next section). The Bag-of-Words approach ignores the sequential ordering of words in text.
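A minimal NBoW sketch, with a hypothetical random embedding table standing in for trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical embedding table: each vocabulary word maps to a dense vector
vocab = ["the", "film", "was", "great", "terrible"]
embedding = {w: rng.normal(size=8) for w in vocab}

def nbow(sentence):
    """Neural bag-of-words: project each word to its embedding, then combine
    by summation -- word order is deliberately ignored."""
    return np.sum([embedding[w] for w in sentence.split()], axis=0)

a = nbow("the film was great")
b = nbow("great was the film")   # same multiset of words, different order
```

Because the combination is a sum, any reordering of the same words yields an identical sentence vector, which is both the model's simplicity and its limitation.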

10. Topic Models

In NLP, a topic model attempts to obtain a general idea of the topics (subjects) of an input text [108]. Topic models use unsupervised machine learning to compress many documents into a short summary that captures the most prevalent subjects in a corpus. When applied to a set of documents, a topic model estimates a set of underlying (latent) topics, each of which describes a semantic concept. Early topic models used Latent Semantic Indexing (LSI) or PLSA [109,110]: starting from a word–document matrix, a singular value decomposition reduces the dimensionality. Another approach is Latent Dirichlet Allocation (LDA), which uses the Dirichlet distribution, a bag-of-words model, and the same document–term matrix as in LSI [111]. In LDA, a topic is a distribution over words, and each word in a text comes from a mixture of multinomial distributions with a Dirichlet prior.
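The LSI step can be sketched with a toy word–document count matrix and a truncated SVD (the matrix and the choice of two latent topics are illustrative):

```python
import numpy as np

# Toy word-document count matrix: rows = 5 terms, columns = 4 documents
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
], dtype=float)

# LSI: singular value decomposition, truncated to k latent topics,
# reduces the dimensionality of the document representations
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                       # number of latent topics kept
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T    # each document as a k-dim topic vector
```

Documents sharing vocabulary (here, the first two columns versus the next two) land close together in the reduced topic space, even when they share no individual word.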
Before neural approaches, most topic models were probabilistic graphical models (such as LDA) or used non-negative matrix factorization (NNMF). The former model the document-generation process with topics as latent variables, estimating model parameters through Variational Inference [112] or Markov chain Monte Carlo methods such as collapsed Gibbs sampling [113]. In probabilistic generative models, the random variables are assumed to come from certain prior distributions. NNMF directly finds topics by decomposing a term–document matrix into two low-rank factor matrices.
A simple representation underlying many topic models is the “bag-of-words” model, which represents a document by a vector of counts over the vocabulary, one count per word. In graph-based approaches [114], words are nodes and their co-occurrences are encoded as weighted edges. Large-scale word embeddings, which represent words as continuous vectors in a low-dimensional space, can provide additional context for topic models.
Neural Topic Models (NTMs) [2] use DNNs to model the distributions of topic models [5]. Unlike earlier topic models, NTMs can estimate model parameters through automatic gradient back-propagation, adopting deep neural networks such as the Variational Autoencoder [115] to model latent topics. This flexibility enables adjustments to model structures to fit diverse application scenarios. They typically use the re-parameterization trick for Gaussian distributions. In addition, NTMs can handle large-scale datasets via parallel computing hardware such as GPUs.
One can use LLMs to improve topic modeling. Topic models rely largely on word co-occurrences to infer latent topics, but such information is scarce in short texts such as tweets (the data-sparsity problem). Probabilistic topic models such as PLSA and LDA thus work well mainly on long texts. For short-text topic modeling, the Biterm Topic Model (BTM) [116] and the Dirichlet Multinomial Mixture (DMM) model [97] are common approaches. BTM directly constructs topic distributions over unordered word pairs (biterms), while DMM applies auxiliary pre-trained word embeddings to introduce external information from other sources; both use classic Bayesian inference. Contrastive learning has also been used for topic models [117].
Several extensions of BTM and DMM have also been proposed, such as the Generalized Polya Urn-DMM [118] with word embeddings and the Multiterm Topic Model [119]. In addition, Semantics-assisted Non-negative Matrix Factorization [120] was recently proposed as an NMF topic model that incorporates word-context semantic correlations, solved by a block coordinate descent algorithm.
A recent method called Adversarial-neural Topic Model (ATM) [121] uses a generator network to learn a projection function between a document–topic distribution and a document–word distribution. A discriminator network determines if an input document is real or fake, and its output signal helps the generator to construct a more realistic document from random noise drawn from a Dirichlet distribution.

11. Research Challenges and Benchmarks

To encourage comparable research in the field, several technical meetings and evaluation efforts have been organized. One of the first was the Text REtrieval Conference (TREC), which held several editions in the 1990s [122]. A later effort was the Cross-Language Evaluation Forum [8].
The Document Understanding Conference has examined query-focused multi-document summarization since 2004, focusing on complex queries with specific answers [123]. A preposition disambiguation task was a focus of SemEval 2007 [124].
Benchmarks or datasets to assess language models’ reasoning include the following:
CriticBench [125]—evaluates an LM’s capability to critique solutions and correct mistakes in reasoning tasks;
MathCheck [126]—synthesizes solutions containing erroneous steps using the GSM8K dataset [127];
PRM800K [128]—builds on the MATH problems [129];
Process reward models [91]—explicitly assess and guide reasoning at the step or trajectory level;
The Stanford Question Answering Dataset [130] and DROP [131]—often used as reading comprehension benchmarks;
HellaSwag [132]—advances common sense NL inference via the use of Adversarial Filtering, which is a data collection paradigm where discriminators iteratively select an adversarial set of machine-generated wrong answers;
Flan [133]—a large publicly available set of tasks and methods for instruction tuning;
TREC CAsT 2019 [134]—a conversational assistance track;
Older knowledge bases such as DBPedia [135], Freebase [136], Yago2 [137], and FrameNet [138];
Propbank [139]—a large hand-annotated text corpus.
Other possibilities include the use of expert annotators or crowdsourcing (e.g., Amazon’s Mechanical Turk [140]).
Early software includes MetaMap [141], used to identify concepts in text.
SEMREP [142] can detect some relations using hand-crafted rules. As for commercial systems, there are Google Now (with Google’s Knowledge Graph [143]) and Facebook Graph Search [144]. An earlier system was the Unified Medical Language System [145], with 2.7 million concepts from over 160 source vocabularies. In Distant Supervision, a knowledge base such as Wikipedia or Freebase is used to automatically tag training examples from text corpora [146].
In a related direction, some interactions with databases involve sentence compression, where an input text is transformed into a shorter output text, while retaining the general meaning (i.e., omitting lesser detail). Examples include the Ziff-Davis corpus [147] and the Edinburgh compression corpora [148].

12. Computational Issues

Major factors in all information systems are their computational requirements, i.e., storage space and mathematical operations. In general, the trend has been toward massive systems. For example, LLaMa-2 [37] uses 13 billion parameters, each typically involved in a multiply-and-add operation per processed token. Fine-tuning a 16-bit LLaMa model with 65 billion parameters requires more than 780 gigabytes of memory [149]. The massive recent growth in model size is readily seen in the GPT series: GPT-1 had 117 million parameters, GPT-2 had 1.5 billion, and GPT-3 had 175 billion [150].
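The rough arithmetic behind such memory figures is easily sketched; the helper below is a back-of-the-envelope estimate of weight storage only, and the factor-of-several overhead of full fine-tuning (gradients plus optimizer states) is noted as an assumption rather than an exact accounting:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory (in GB) to store n_params model weights at a given precision."""
    return n_params * bits_per_param / 8 / 1e9

# 65-billion-parameter model stored in 16-bit precision: weights alone
weights_16bit = weight_memory_gb(65e9, 16)   # = 130 GB
# Full fine-tuning also keeps gradients and optimizer states (e.g., Adam
# adds roughly two extra values per parameter, often in 32 bits),
# multiplying this figure several times over -- consistent in order of
# magnitude with the >780 GB cited above.
quantized_4bit = weight_memory_gb(65e9, 4)   # = 32.5 GB after 4-bit quantization
```

The same arithmetic shows why 4-bit quantization (as in QLoRA, discussed next) brings a 65B model within reach of commodity hardware.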
The recent QLoRA method uses paged optimizers to limit memory spikes. It applies a high-precision technique to quantize a pre-trained model to 4 bits per parameter and adds a small set of learnable low-rank adapter weights, tuned by back-propagating gradients through the quantized weights. So-called Double Quantization further quantizes the quantization constants themselves. The paged optimizers avoid the gradient-checkpointing memory spikes that otherwise occur when processing a mini-batch with long sequence lengths.
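The low-rank adapter idea underlying LoRA/QLoRA can be sketched as follows; this is an illustrative full-precision forward pass (QLoRA additionally stores the frozen base matrix in 4-bit form), with hypothetical layer width and rank:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64                                  # illustrative layer width
W = rng.normal(size=(d, d))             # frozen pre-trained weight matrix

r = 4                                   # low-rank adapter rank, r << d
A = rng.normal(0, 0.01, size=(r, d))    # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized
                                        # so the adapter starts as a no-op

def adapted_forward(x):
    """LoRA-style forward pass: frozen W plus a learnable low-rank update BA.
    Only A and B (2*r*d values) are tuned, not the d*d base weights."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
y0 = adapted_forward(x)                 # identical to W @ x before any training
extra_params = 2 * r * d                # 512 trainable values vs. 4096 frozen in W
```

Training touches only the small A and B matrices, which is why gradients can be back-propagated through a quantized, frozen base model at a fraction of full fine-tuning cost.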
The pre-trained model CodeBERT [151] has a total of 125M parameters, resulting in a model size of 476 MB. Recently proposed models like Codex [152] and CodeGen [153] have over 100 billion parameters and over 100 GB in size.
Training even a relatively small model like PolyCoder (2.7 billion parameters) [154] on eight NVIDIA RTX 8000 GPUs in a single machine takes about six weeks.
Pruning model parameters and reducing their quantization precision are among the efforts to reduce model size (“light-weighting”) [155] while trying to maintain performance.

13. Conclusions and Future Work

This overview examined recent developments in information systems in the context of classical approaches. As with many application areas of artificial intelligence, the field has become dominated by deep learning methods, as they provide rapid and reasonable solutions, at the cost of significant computer memory and computation needs. There remain many strengths and weaknesses in the current approaches in the field. LLMs with RAG support dominate, with fine-tuning being the focus of much active research.
Much recent work emphasizes the role of contextual information, the handling of uncontrolled dialogue flow, and the combination of orchestration, tool use, and conversational capabilities. Agent collaboration, the estimation of latent information, and federated dialogue generation have become increasingly important.
Problems that remain include queries with colloquial omissions and ambiguous references, which cause difficulty for effective NL retrieval and generation. Current conversational AI systems often provide generic, one-answer responses that ignore individual user characteristics and do not adapt well. Determining questions that are out-of-scope, as well as estimating user intent, are among active areas of interest. Current taxonomies of conversational behavior often overgeneralize or remain domain-specific.
Future generative AI systems should be capable of creating new content when prompted, which requires aspects of thinking, problem-solving, and knowledge creation. Current frameworks of traditional knowledge management typically manage pre-existing content in ways that may be inadequate. As such, future knowledge systems must be dynamic and collaborative.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Jannach, D.; Manzoor, A.; Cai, W.; Chen, L. A survey on conversational recommender systems. ACM Comput. Surv. (CSUR) 2021, 54, 1–36. [Google Scholar] [CrossRef]
  2. Zhang, C.; Yang, Z.; He, X.; Deng, L. Multimodal intelligence: Representation learning, information fusion, and applications. IEEE J. Sel. Top. Signal Process. 2020, 14, 478–493. [Google Scholar] [CrossRef]
  3. Liu, Q.; Kusner, M.J.; Blunsom, P. A survey on contextual embeddings. arXiv 2020, arXiv:2003.07278. [Google Scholar] [CrossRef]
  4. Zhao, H.; Phung, D.; Huynh, V.; Jin, Y.; Du, L.; Buntine, W. Topic modelling meets deep neural networks: A survey. arXiv 2021, arXiv:2103.00498. [Google Scholar] [CrossRef]
  5. Long, L.; Wang, R.; Xiao, R.; Zhao, J.; Ding, X.; Chen, G.; Wang, H. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv 2024, arXiv:2406.15126. [Google Scholar] [CrossRef]
  6. Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large language models for data annotation and synthesis: A survey. arXiv 2024, arXiv:2402.13446. [Google Scholar] [CrossRef]
  7. Wu, X.; Nguyen, T.; Luu, A.T. A survey on neural topic models: Methods, applications, and challenges. Artif. Intell. Rev. 2024, 57, 18. [Google Scholar] [CrossRef]
  8. Peñas, A.; Magnini, B.; Forner, P.; Sutcliffe, R.; Rodrigo, A.; Giampiccolo, D. Question answering at the cross-language evaluation forum 2003–2010. Lang. Resour. Eval. 2012, 46, 77–217. [Google Scholar]
  9. Wang, H.; Wu, H.; He, Z.; Huang, L.; Church, K.W. Progress in machine translation. Engineering 2022, 18, 143–153. [Google Scholar] [CrossRef]
  10. El-Kassas, W.S.; Salama, C.R.; Rafea, A.A.; Mohamed, H.K. Automatic text summarization: A comprehensive survey. Expert Syst. Appl. 2021, 165, 113679. [Google Scholar] [CrossRef]
  11. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780. [Google Scholar] [CrossRef]
  12. Topaz, M.; Murga, L.; Gaddis, K.M.; McDonald, M.V.; Bar-Bachar, O.; Goldberg, Y.; Bowles, K.H. Mining fall-related information in clinical notes: Comparison of rule-based and novel word embedding-based machine learning approaches. J. Biomed. Inform. 2019, 90, 103103. [Google Scholar] [CrossRef] [PubMed]
  13. Angeli, G.; Liang, P.; Klein, D. A simple domain-independent probabilistic approach to generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 502–512. [Google Scholar]
  14. Lee, S.H. Natural language generation for electronic health records. npj Digit. Med. 2018, 1, 63. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, H.; Cai, J.; Xu, J.; Wang, J. Pretraining-based natural language generation for text summarization. In Proceedings of the Conference on Computational Natural Language Learning, Hong Kong, China, 3–4 November 2019; pp. 789–797. [Google Scholar]
  16. Mille, S.; Belz, A.; Bohnet, B.; Graham, Y.; Pitler, E.; Wanner, L. The first multilingual surface realisation shared task (SR’18): Overview and evaluation results. In Proceedings of the Workshop on Multilingual Surface Realisation, Melbourne, Australia, 19 July 2018; pp. 1–12. [Google Scholar]
  17. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 2024, 30, 1134–1142. [Google Scholar] [CrossRef]
  18. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code LLaMa: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar] [CrossRef]
  19. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K.; et al. DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence. arXiv 2024, arXiv:2401.14196. [Google Scholar] [CrossRef]
  20. Weld, H.; Huang, X.; Long, S.; Poon, J.; Han, S.C. A survey of joint intent detection and slot filling models in natural language understanding. ACM Comput. Surv. 2022, 55, 1–38. [Google Scholar] [CrossRef]
  21. Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1533–1544. [Google Scholar]
  22. Ide, N.; Véronis, J. Introduction to the special issue on word sense disambiguation: The state of the art. Comput. Linguist. 1998, 24, 1–40. [Google Scholar]
  23. Pan, J.Z. Resource description framework. In Handbook on Ontologies; Springer: Berlin/Heidelberg, Germany, 2009; pp. 71–90. [Google Scholar] [CrossRef]
  24. Lan, Y.; He, G.; Jiang, J.; Jiang, J.; Zhao, W.X.; Wen, J.R. Complex knowledge base question answering: A survey. IEEE Trans. Knowl. Data Eng. 2022, 35, 11196–11215. [Google Scholar] [CrossRef]
  25. Mahadevkar, S.V.; Patil, S.; Kotecha, K.; Soong, L.W.; Choudhury, T. Exploring AI-driven approaches for unstructured document analysis and future horizons. J. Big Data 2024, 11, 92. [Google Scholar] [CrossRef]
  26. Mielke, S.J.; Alyafeai, Z.; Salesky, E.; Raffel, C.; Dey, M.; Gallé, M.; Raja, A.; Si, C.; Lee, W.Y.; Sagot, B.; et al. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv 2021, arXiv:2112.10508. [Google Scholar] [CrossRef]
  27. Ney, H.; Essen, U.; Kneser, R. On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 1994, 8, 1–38. [Google Scholar] [CrossRef]
  28. Chen, S.F.; Goodman, J. An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. 1999, 13, 359–394. [Google Scholar] [CrossRef]
  29. Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264. [Google Scholar] [CrossRef]
  30. Jelinek, F.; Mercer, R.L. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice; Gelsema, E.S., Kanal, L.N., Eds.; North-Holland: Amsterdam, The Netherlands, 1980; pp. 381–397. [Google Scholar]
  31. Katz, S. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust. Speech Signal Process. 1987, 35, 400–401. [Google Scholar] [CrossRef]
  32. Kneser, R.; Ney, H. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995; pp. 181–184. [Google Scholar] [CrossRef]
  33. Kühnel, L.; Fluck, J. We are not ready yet: Limitations of state-of-the-art disease named entity recognizers. J. Biomed. Semant. 2022, 13, 26. [Google Scholar] [CrossRef]
  34. Getoor, L.; Friedman, N.; Koller, D.; Pfeffer, A.; Taskar, B. Probabilistic relational models. In Introduction to Statistical Relational Learning; MIT Press: Cambridge, MA, USA, 2007; Volume 8. [Google Scholar]
  35. Zhang, X.; Zhan, K.; Hu, E.; Fu, C.; Luo, L.; Jiang, H.; Jia, Y.; Yu, F.; Dou, Z.; Cao, Z.; et al. Answer complex questions: Path ranker is all you need. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 449–458. [Google Scholar] [CrossRef]
  36. Sanderson, K. GPT-4 is here: What scientists think. Nature 2023, 615, 773. [Google Scholar] [CrossRef]
  37. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. LLaMa 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  38. Sahoo, S.; Arriola, M.; Schiff, Y.; Gokaslan, A.; Marroquin, E.; Chiu, J.; Rush, A.; Kuleshov, V. Simple and effective masked diffusion language models. Adv. Neural Inf. Process. Syst. 2024, 37, 130136–130184. [Google Scholar] [CrossRef]
  39. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805v2. [Google Scholar]
  40. Scao, T.L.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M. Bloom: A 176b-parameter open-access multilingual language model. arXiv 2022, arXiv:2211.05100. [Google Scholar] [CrossRef]
  41. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar] [CrossRef]
  42. Penedo, G.; Malartic, Q.; Hesslow, D.; Cojocaru, R.; Cappelli, A.; Alobeidli, H.; Pannier, B.; Almazrouei, E.; Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv 2023, arXiv:2306.01116. [Google Scholar] [CrossRef]
  43. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  44. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.D.L.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. arXiv 2022, arXiv:2203.15556. [Google Scholar] [CrossRef]
  45. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  46. Anthropic, P.B.C. Introducing Claude. 14 March 2023. Available online: https://www.anthropic.com/news/introducing-claude (accessed on 17 January 2026).
  47. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  48. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  49. Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. Adv. Neural Inf. Process. Syst. 2023, 36, 53728–53741. [Google Scholar]
  50. Yuan, W.; Pang, R.Y.; Cho, K.; Li, X.; Sukhbaatar, S.; Xu, J.; Weston, J.E. Self-Rewarding Language Models. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Available online: https://openreview.net/forum?id=0NphYCmgua (accessed on 17 January 2026).
  51. Dong, G.; Yuan, H.; Lu, K.; Li, C.; Xue, M.; Liu, D.; Wang, W.; Yuan, Z.; Zhou, C.; Zhou, J. How abilities in large language models are affected by supervised fine-tuning data composition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 177–198. [Google Scholar]
  52. Bradley, R.A.; Terry, M.E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 1952, 39, 324–345. [Google Scholar] [CrossRef]
  53. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38. [Google Scholar] [CrossRef]
  54. Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Comput. Linguist. 2025, 51, 1–46. [Google Scholar] [CrossRef]
55. Chen, J.; Huang, X.; Li, Y. Dynamic supplementation of federated search results for reducing hallucinations in LLMs. 2024. Available online: https://osf.io/preprints/osf/x5vge_v1 (accessed on 17 January 2026).
56. Toutanova, K.; Haghighi, A.; Manning, C.D. A global joint model for semantic role labeling. Comput. Linguist. 2008, 34, 161–191.
57. Johansson, R.; Nugues, P. Dependency-based syntactic–semantic analysis with PropBank and NomBank. In Proceedings of the Conference on Computational Natural Language Learning, Manchester, UK, 16 August 2008; pp. 183–187. Available online: https://aclanthology.org/W08-2123.pdf (accessed on 17 January 2026).
58. Zhou, J.; Xu, W. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; pp. 1127–1137.
59. Björkelund, A.; Hafdell, L.; Nugues, P. Multilingual semantic role labeling. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, Boulder, CO, USA, 4 June 2009; pp. 43–48.
60. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41.
61. Suchanek, F.M.; Kasneci, G.; Weikum, G. YAGO: A large ontology from Wikipedia and WordNet. J. Web Semant. 2008, 6, 203–217.
62. Boleda, G. Distributional semantics and linguistic theory. Annu. Rev. Linguist. 2020, 6, 213–234.
63. Lenci, A. Distributional models of word meaning. Annu. Rev. Linguist. 2018, 4, 151–171.
64. Sahlgren, M. The distributional hypothesis. Ital. J. Linguist. 2008, 20, 33–53.
65. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365.
66. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190.
67. Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.P.; Bing, L.; Xu, X.; Poria, S.; Lee, R. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 5254–5276.
68. Rastogi, A.; Zang, X.; Sunkara, S.; Gupta, R.; Khaitan, P. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. AAAI Conf. Artif. Intell. 2020, 34, 8689–8696.
69. Han, Z.; Gao, C.; Liu, J.; Zhang, J.; Zhang, S.Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv 2024, arXiv:2403.14608.
70. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
71. Laparra, E.; Mascio, A.; Velupillai, S.; Miller, T. A review of recent work in transfer learning and domain adaptation for natural language processing of electronic health records. Yearb. Med. Inform. 2021, 30, 239–244.
72. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997.
73. Guo, Q.; Wang, R.; Guo, J.; Li, B.; Song, K.; Tan, X.; Liu, G.; Bian, J.; Yang, Y. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv 2023, arXiv:2309.08532.
74. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
75. Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large language models are human-level prompt engineers. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
76. Li, C.; Liang, J.; Zeng, A.; Chen, X.; Hausman, K.; Sadigh, D.; Levine, S.; Fei-Fei, L.; Xia, F.; Ichter, B. Chain of code: Reasoning with a language model-augmented code emulator. arXiv 2023, arXiv:2312.04474.
77. Shin, T.; Razeghi, Y.; Logan, R.L., IV; Wallace, E.; Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv 2020, arXiv:2010.15980.
78. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023, arXiv:2307.15043.
79. Tian, J.; He, Z.; Dai, X.; Ma, C.Y.; Liu, Y.C.; Kira, Z. Trainable projected gradient method for robust fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7836–7845.
80. Wen, Y.; Jain, N.; Kirchenbauer, J.; Goldblum, M.; Geiping, J.; Goldstein, T. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Adv. Neural Inf. Process. Syst. 2023, 36, 51008–51025.
81. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
82. Masterman, T.; Besen, S.; Sawtell, M.; Chao, A. The landscape of emerging AI agent architectures for reasoning, planning, and tool calling: A survey. arXiv 2024, arXiv:2404.11584.
83. Chen, X.; Xu, J.; Liang, T.; He, Z.; Pang, J.; Yu, D.; Song, L.; Liu, Q.; Zhou, M.; Zhang, Z.; et al. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv 2024, arXiv:2412.21187.
84. Liu, N.; Chen, L.; Tian, X.; Zou, W.; Chen, K.; Cui, M. From LLM to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv 2024, arXiv:2401.02777.
85. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652.
86. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
87. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91.
88. Chapelle, O.; Metzler, D.; Zhang, Y.; Grinspan, P. Expected reciprocal rank for graded relevance. In Proceedings of the ACM Conference on Information and Knowledge Management, Hong Kong, China, 2–6 November 2009; pp. 621–630.
89. Reiter, E. A structured review of the validity of BLEU. Comput. Linguist. 2018, 44, 393–401.
90. Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; Christiano, P.F. Learning to summarize with human feedback. Adv. Neural Inf. Process. Syst. 2020, 33, 3008–3021.
91. Wang, C.; Deng, Y.; Lyu, Z.; Zeng, L.; He, J.; Yan, S.; An, B. Q*: Improving multi-step reasoning for LLMs with deliberative planning. arXiv 2024, arXiv:2406.14283.
92. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623.
93. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
94. Singh, R.; Singh, S. Text similarity measures in news articles by vector space model using NLP. J. Inst. Eng. (India) Ser. B 2021, 102, 329–338.
95. Wahba, Y.; Madhavji, N.; Steinbacher, J. A comparison of SVM against pre-trained language models (PLMs) for text classification tasks. In International Conference on Machine Learning, Optimization, and Data Science; Springer Nature: Cham, Switzerland, 2022; pp. 304–313.
96. Rosenfeld, R. A maximum entropy approach to adaptive statistical language modelling. Comput. Speech Lang. 1996, 10, 187.
97. Yin, J.; Wang, J. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 233–242.
98. Yin, W.; Kann, K.; Yu, M.; Schütze, H. Comparative study of CNN and RNN for natural language processing. arXiv 2017, arXiv:1702.01923.
99. So, D.; Mańke, W.; Liu, H.; Dai, Z.; Shazeer, N.; Le, Q.V. Searching for efficient transformers for language modeling. Adv. Neural Inf. Process. Syst. 2021, 34, 6010–6022.
100. Cheng, J.; Yang, Y.; Tang, X.; Xiong, N.; Zhang, Y.; Lei, F. Generative adversarial networks: A literature review. KSII Trans. Internet Inf. Syst. 2020, 14, 4625–4647.
101. Ahmad, W.; Chakraborty, S.; Ray, B.; Chang, K.W. Unified pre-training for program understanding and generation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 2655–2668.
102. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv 2021, arXiv:2109.00859.
103. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with AlphaCode. Science 2022, 378, 1092–1097.
104. Pham, C.M.; Hoyle, A.; Sun, S.; Resnik, P.; Iyyer, M. TopicGPT: A prompt-based topic modeling framework. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; pp. 2956–2984.
105. Imran, M.; Almusharraf, N. Google Gemini as a next-generation AI educational tool: A review of emerging educational technology. Smart Learn. Environ. 2024, 11, 22.
106. Lenci, A.; Sahlgren, M.; Jeuniaux, P.; Cuba Gyllensten, A.; Miliani, M. A comparative evaluation and analysis of three generations of Distributional Semantic Models. Lang. Resour. Eval. 2022, 56, 1269–1313.
107. Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; Daumé, H., III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; pp. 1681–1691.
108. Churchill, R.; Singh, L. The evolution of topic modeling. ACM Comput. Surv. 2022, 54, 1–35.
109. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407.
110. Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 50–57.
111. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
112. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
113. Steyvers, M.; Griffiths, T. Probabilistic topic models. In Handbook of Latent Semantic Analysis; Psychology Press: London, UK, 2007; pp. 439–460.
114. Cataldi, M.; Di Caro, L.; Schifanella, C. Emerging topic detection on Twitter based on temporal and social terms evaluation. In Proceedings of the International Workshop on Multimedia Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1–10.
115. Kingma, D.P.; Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
116. Yan, X.; Guo, J.; Lan, Y.; Cheng, X. A biterm topic model for short texts. In Proceedings of the International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 1445–1456.
117. Nguyen, T.; Luu, A.T. Contrastive learning for neural topic model. Adv. Neural Inf. Process. Syst. 2021, 34, 11974–11986.
118. Li, C.; Wang, H.; Zhang, Z.; Sun, A.; Ma, Z. Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 165–174.
119. Wu, X.; Li, C. Short text topic modeling with flexible word patterns. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; pp. 1–7.
120. Shi, T.; Kang, K.; Choo, J.; Reddy, C.K. Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In Proceedings of the World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1105–1114.
121. Wang, R.; Zhou, D.; He, Y. ATM: Adversarial-neural topic model. Inf. Process. Manag. 2019, 56, 102098.
122. Voorhees, E.; Harman, D. Overview of the Sixth Text Retrieval Conference (TREC-6). Inf. Process. Manag. 2000, 36, 3–35.
123. Nenkova, A. Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference. In Proceedings of the 20th National Conference on Artificial Intelligence, Pittsburgh, PA, USA, 9–13 July 2005.
124. Navigli, R.; Litkowski, K.C.; Hargraves, O. SemEval-2007 task 07: Coarse-grained English all-words task. In Proceedings of the International Workshop on Semantic Evaluations, Prague, Czech Republic, 23–24 June 2007; pp. 30–35.
125. Lin, Z.; Gou, Z.; Liang, T.; Luo, R.; Liu, H.; Yang, Y. CriticBench: Benchmarking LLMs for critique-correct reasoning. arXiv 2024, arXiv:2402.14809.
126. Zhou, Z.; Liu, S.; Ning, M.; Liu, W.; Wang, J.; Wong, D.F.; Huang, X.; Wang, Q.; Huang, K. Is your model really a good math reasoner? Evaluating mathematical reasoning with checklist. arXiv 2024, arXiv:2407.08733.
127. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168.
128. Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let's verify step by step. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
129. Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv 2021, arXiv:2103.03874.
130. Rajpurkar, P.; Jia, R.; Liang, P. Know what you don't know: Unanswerable questions for SQuAD. arXiv 2018, arXiv:1806.03822.
131. Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv 2019, arXiv:1903.00161.
132. Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv 2019, arXiv:1905.07830.
133. Longpre, S.; Hou, L.; Vu, T.; Webson, A.; Chung, H.W.; Tay, Y.; Zhou, D.; Le, Q.V.; Zoph, B.; Wei, J.; et al. The Flan collection: Designing data and methods for effective instruction tuning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 22631–22648.
134. Dalton, J.; Xiong, C.; Callan, J. TREC CAsT 2019: The conversational assistance track overview. arXiv 2020, arXiv:2003.13624.
135. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. DBpedia: A nucleus for a web of open data. In Proceedings of the International Semantic Web Conference, Busan, Republic of Korea, 11–15 November 2007; pp. 722–735.
136. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250.
137. Hoffart, J.; Suchanek, F.M.; Berberich, K.; Lewis-Kelham, E.; De Melo, G.; Weikum, G. YAGO2: Exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the International Conference Companion on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 229–232.
138. Baker, C.F.; Fillmore, C.J.; Lowe, J.B. The Berkeley FrameNet project. In Proceedings of the International Conference on Computational Linguistics, Montreal, QC, Canada, 10–14 August 1998.
139. Palmer, M.; Gildea, D.; Kingsbury, P. The Proposition Bank: An annotated corpus of semantic roles. Comput. Linguist. 2005, 31, 71–106.
140. Crowston, K. Amazon Mechanical Turk: A research tool for organizations and information systems scholars. In Shaping the Future of ICT Research. Methods and Approaches: IFIP WG 8.2, Working Conference; Springer: Berlin/Heidelberg, Germany, 2012; pp. 210–221.
141. Aronson, A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the AMIA Symposium, Washington, DC, USA, 3–7 November 2001; p. 17.
142. Fiszman, M.; Rindflesch, T.C.; Kilicoglu, H. Integrating a hypernymic proposition interpreter into a semantic processor for biomedical texts. In Proceedings of the AMIA Annual Symposium, Washington, DC, USA, 8–12 November 2003; p. 239.
143. Fensel, D.; Şimşek, U.; Angele, K.; Huaman, E.; Kärle, E.; Panasiuk, O.; Toma, I.; Umbrich, J.; Wahler, A. Introduction: What is a knowledge graph? In Knowledge Graphs: Methodology, Tools and Selected Use Cases; Springer: Cham, Switzerland, 2020; pp. 1–10.
144. Huang, J.T.; Sharma, A.; Sun, S.; Xia, L.; Zhang, D.; Pronin, P.; Padmanabhan, J.; Ottaviano, G.; Yang, L. Embedding-based retrieval in Facebook search. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Online, 23–27 August 2020; pp. 2553–2561.
145. Lindberg, D.A.; Humphreys, B.L.; McCray, A.T. The Unified Medical Language System. Yearb. Med. Inform. 1993, 2, 41–51.
146. Mintz, M.; Bills, S.; Snow, R.; Jurafsky, D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; pp. 1003–1011.
147. Knight, K.; Marcu, D. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artif. Intell. 2002, 139, 91–107.
148. Clarke, J.; Lapata, M. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of the International Conference on Computational Linguistics, Sydney, Australia, 17–21 July 2006; pp. 377–384.
149. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient fine-tuning of quantized LLMs. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115.
150. Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694.
151. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155.
152. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374.
153. Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv 2022, arXiv:2203.13474.
154. Xu, F.F.; Alon, U.; Neubig, G.; Hellendoorn, V.J. A systematic evaluation of large language models of code. In Proceedings of the ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022; pp. 1–10.
155. Wang, C.H.; Huang, K.Y.; Yao, Y.; Chen, J.C.; Shuai, H.H.; Cheng, W.H. Lightweight deep learning: An overview. IEEE Consum. Electron. Mag. 2022, 13, 51–64.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
