Review

Big Loop and Atomization: A Holistic Review on the Expansion Capabilities of Large Language Models

JIUTIAN Team, China Mobile Research Institute, Beijing 100053, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9466; https://doi.org/10.3390/app15179466
Submission received: 30 June 2025 / Revised: 15 August 2025 / Accepted: 23 August 2025 / Published: 28 August 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, yet they face significant limitations in real-world applications. To overcome these limitations, research areas such as tool learning, model collaboration, agents, and multi-agent systems have increasingly drawn attention. However, current studies are often conducted in isolation, lacking a unified framework for systematic integration, which hinders the synergy among closely related research efforts. To address this gap, this study brings together, for the first time, tool learning, model collaboration, and agent-related fields under a unified framework based on the concepts of the “Big Loop” and “Atomization”. In this framework, atomic components refer to fundamental units such as models, tools, and agents. The Big Loop is formed through interactions among these atomic components to achieve end-to-end task completion; for example, tool calling requires the integration of agent modules, tool retrieval models, and the tools themselves, while multi-agent systems require coordination among multiple agent units. This review first clarifies the foundational concepts of the Big Loop and Atomization and elaborates on the advantages of the Big Loop over a single LLM. It then systematically introduces the construction of atomic components, the scheduling of these components within the Big Loop, and the optimization of the overall system. The paper also discusses existing challenges and outlines future research directions. This work aims to offer a systematic perspective for both academia and industry, and to chart a course for exploration in the emerging and highly promising field of Big Loop and Atomization.

1. Introduction

In recent years, large language models (LLMs) have made groundbreaking progress in areas such as natural language processing, knowledge reasoning, and text generation. They have demonstrated powerful capabilities in language understanding and generation, and have been widely applied in tasks like question answering, machine translation, and summarization [1,2,3,4,5]. However, despite these remarkable achievements, LLMs still face numerous limitations in aspects such as knowledge timeliness, complex task handling, and computational efficiency, which significantly hinder their further development and practical deployment.
To address these bottlenecks, research directions based on LLMs—such as tool learning, model collaboration, and multi-agent systems—have attracted increasing attention in both academia and industry, yielding rapid advancements. These studies introduce external tools, integrate the capabilities of multiple models, or build collaborative agent systems to enhance model performance, offering novel ideas and methodologies. However, most of these efforts have been carried out independently, lacking systematic integration and a unified theoretical framework. As a result, the synergy between research outcomes remains limited, constraining our comprehensive understanding and deep application of these technologies.
This review builds upon the framework of Big Loop and Atomization [6], systematically integrating seemingly isolated research directions such as tool learning, model collaboration, and multi-agent systems. We tightly link the concepts of Atomization and Big Loop with current research hotspots and provide clear definitions: atomic components encompass basic units such as models, tools, and agents, while the Big Loop achieves end-to-end task completion through interactions among these components. On this basis, the review comprehensively illustrates the advantages of Big Loop over standalone LLMs and explores the construction of atomic components, their scheduling within Big Loop, and methods for optimizing such systems. The framework of the Big Loop and Atomization is presented in Figure 1. We also discuss existing challenges and future research directions.
While numerous prior surveys have focused on tool learning [7,8,9], model collaboration [10,11], agents [12,13,14], and multi-agent systems [15], these works typically review each direction in isolation. The systematic AI survey [6] introduced the concepts of Big Loop and Atomization for the first time, but only at a broad conceptual level, without a detailed methodological categorization. Surveys on tool learning and intelligent agents generally offer little detail on tool construction and pay scant attention to constructing atomic models in a targeted manner, while surveys on model collaboration focus more on model integration and lack discussion of scheduling and planning in closed-loop systems as well as tool-based collaboration. This survey incorporates model collaboration and tool invocation into a unified framework, enabling the two fields to draw on each other's cutting-edge advances and move toward mutual integration.
In summary, this study distinguishes itself from prior reviews with three main innovations:
  • First, we propose a systematic Construction–Scheduling–Optimization taxonomy for Big Loop and Atomization, deconstructing the overall system into core methodological components. This allows us to build a comprehensive theoretical framework and provide clearer definitions for both concepts. This is also the first systematic review dedicated specifically to Big Loop and Atomization.
  • Second, by applying the Big Loop and Atomization framework, we unify and integrate fragmented research efforts—especially by treating tools and models as atomic components—thereby fostering cross-pollination among different research paradigms.
  • Finally, the review delves deeply into real-world challenges and future directions of Big Loop and Atomization, offering a forward-looking perspective to support the transition of these technologies from research to real-world deployment.
As LLM technologies continue to evolve rapidly, Big Loop systems based on LLMs are becoming increasingly integrated into critical domains. Understanding their architecture and operational mechanisms is vital not only for researchers but also for policymakers, industry practitioners, and society at large. This review aims to provide foundational insights into this emerging field, clarify research trajectories, and point the way forward for future research and applications—ultimately contributing to the development of more intelligent and efficient AI systems.
In the subsequent sections of this survey, we first provide more detailed definitions of Atomization and Big Loop in Section 2. Furthermore, we elaborate on the advantages of Big Loop over individual models from multiple perspectives in Section 3. Subsequently, we introduce how to construct atomic components in Section 4, how to schedule these components within the Big Loop in Section 5, and how to optimize both the atomic components and the scheduling process in Section 6. This survey also discusses the challenges and future directions of Big Loop and Atomization in Section 7. Finally, we conclude the entire review in Section 8.

2. Definition of Big Loop and Atomization

Big Loop systems in AI applications exhibit diverse forms, ranging from commonly seen retrieval-augmented question answering systems to complex tool-calling workflows and multi-agent collaborative systems addressing real-world scenarios. From a formal perspective, a Big Loop can be represented as follows:
L = Scheduling(Collection(A))
where A denotes the set of atomic components, Collection(A) represents a collection of such components, Scheduling defines the mechanism for orchestrating or interacting among these components, and L represents the complete Big Loop system. This formalism emphasizes the essence of a Big Loop: achieving a task loop through the scheduling of multiple atomic components.
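To make this formalism concrete, the following is a minimal Python sketch of the abstraction; the component names and the sequential scheduler are illustrative assumptions rather than a prescribed implementation.

from typing import Callable, Dict, List

# An atomic component is anything callable here: a tool, a model, or an agent.
AtomicComponent = Callable[[str], str]

def scheduling(components: Dict[str, AtomicComponent], plan: List[str], task: str) -> str:
    # A trivial Scheduling mechanism: run components in a planned order,
    # feeding each component's output to the next.
    state = task
    for name in plan:
        state = components[name](state)
    return state

# Collection(A): a toy collection of atomic components.
collection = {
    "retriever": lambda q: q + " [+retrieved facts]",
    "reasoner": lambda q: "answer derived from: " + q,
}

# L = Scheduling(Collection(A)): the loop completes the task end to end.
print(scheduling(collection, ["retriever", "reasoner"], "capital of France?"))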
The fundamental distinction between a Big Loop and a single large language model (LLM) lies in that the former completes tasks via coordinated interaction among diverse atomic components whereas the latter relies solely on internally pre-trained knowledge. This interactive mechanism enables Big Loop to integrate multi-source model capabilities, call specialized tools, and handle dynamically evolving domain knowledge—thus exhibiting stronger adaptability and interpretability in complex tasks. Atomic components, as the foundational units of Big Loop, directly determine overall system performance through their construction and interaction characteristics.

2.1. Classification of Atomic Components

The core criterion for classifying atomic components is their interactive adaptability, which includes two aspects: the constraints on interaction formats with other components and the number of component types that they can interact with. This classification framework unifies tools, models, and agents into a coherent atomic system. Below, we detail three core types of components:

2.1.1. Tool Components

Tool components are defined by integrating different research perspectives: some studies distinguish tools from APIs by viewing tools as collections of APIs [16,17], while others regard each API as an independent tool [18,19]. This review adopts the latter definition—treating each callable API as a basic atomic tool. Complex tools can be constructed by composing basic tools, forming a hierarchical “tool–subtool” structure.
Tool components exhibit the following characteristics:
  • Non-learnability: The logic of the tool is fixed and cannot be optimized through data-driven training;
  • Strict formatting: Calls must adhere to predefined input–output formats; missing parameters may cause failure;
  • Deterministic execution: The same input will always yield the same output, with little to no randomness in execution.
Typical applications include retrieval tools [20,21], calculators [22,23,24,25], etc., which are functionally simple but highly efficient in execution.
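The following minimal sketch illustrates all three properties with a toy calculator tool; the parameter schema and error convention are illustrative assumptions.

def calculator_tool(args: dict) -> dict:
    # A toy atomic tool: non-learnable (fixed logic), strictly formatted
    # (required parameters), and deterministic (same input, same output).
    for key in ("op", "a", "b"):
        if key not in args:
            return {"error": "missing required parameter: " + key}
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    if args["op"] not in ops:
        return {"error": "unsupported op: " + args["op"]}
    # Deterministic execution: no randomness, no learned state.
    return {"result": ops[args["op"]](args["a"], args["b"])}

print(calculator_tool({"op": "add", "a": 2, "b": 3}))  # {'result': 5}
print(calculator_tool({"op": "add", "a": 2}))          # explicit failure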

2.1.2. Model Components

Model components are typically based on neural network architectures, trained for specific tasks, and exhibit the following core features:
  • Learnability: They can be optimized through data training to improve task performance;
  • Flexible interaction: They impose looser constraints on input formats than tools, tolerating semantically equivalent variations in expression.
It is worth noting that general-purpose LLMs (e.g., GPT-4 [26]) can be considered as special types of model components. Compared to task-specific models, general-purpose LLMs can handle a broader range of tasks, although they may lack in-domain specialization.

2.1.3. Agent Components

Agents are considered a key path toward artificial general intelligence (AGI), incorporating cognitive characteristics inspired by human abilities (e.g., memory, planning, tool use) [12,13,14]. Compared to model components, agent components aim to emulate human cognitive mechanisms and often adopt specific roles to perform assigned tasks.
When an agent operates as part of a multi-agent system, it becomes subject to interaction with other agents or coordination within a multi-agent scheduler. In this context, it qualifies as an atomic component within the Big Loop framework for multi-agent systems.

2.2. Hierarchical Composability of Big Loop

Big Loop systems are constructed following a task-oriented principle, characterized by their hierarchical composability:
  • Task decomposition mechanism: Complex application scenarios can be decomposed into multiple subtasks, each corresponding to an independent sub-loop. For example, a medical diagnosis system can be divided into four sub-loops: “symptom collection–examination suggestion–diagnostic reasoning–treatment planning”;
  • Component reusability principle: Basic atomic components can be reused across different loops; for instance, the retrieval tool can serve both the symptom collection and diagnostic reasoning stages;
  • Standardized interaction protocols: By defining unified interaction interfaces (e.g., function call formats, data transmission protocols), smooth coordination across loops can be ensured.
Such hierarchical structures endow Big Loop systems with excellent scalability. The composable design not only reduces system construction costs but also provides a clear path for subsequent functional expansion and iteration. The existing workflow-based orchestration methods in software engineering and the hierarchical task network (HTN) planning in artificial intelligence share certain similarities with this hierarchical composability. However, hierarchical composability primarily focuses on two aspects: one is the reusability of atomic components, meaning that a single atomic component can be incorporated into the Big Loops of multiple different tasks, and the functions of atomic components possess a certain degree of generality; the other is that more advanced functional components can be formed by combining current atomic components, and even a closed loop of subtasks can be created.
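A brief sketch of this composability, reusing one retrieval tool across two of the medical sub-loops mentioned above; all names are hypothetical.

# Hypothetical illustration: one retrieval tool reused by two sub-loops.
def retrieval_tool(query: str) -> str:
    return "documents for '" + query + "'"

def symptom_collection(patient_report: str) -> str:
    # Sub-loop 1 reuses the shared retrieval tool.
    evidence = retrieval_tool("symptom checklist for: " + patient_report)
    return "structured symptoms from " + patient_report + " (" + evidence + ")"

def diagnostic_reasoning(symptoms: str) -> str:
    # Sub-loop 3 reuses the same atomic component.
    evidence = retrieval_tool("differential diagnosis for: " + symptoms)
    return "diagnosis based on " + symptoms + " (" + evidence + ")"

# Sub-loops compose into a higher-level loop through a standardized interface
# (plain strings stand in for a shared data-transmission protocol here).
print(diagnostic_reasoning(symptom_collection("fever and cough")))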

3. Advantages of Big Loop Systems

With the continuous advancement of LLMs, their capacity expansion paradigm has gradually shifted from relying solely on increased model parameters to a more structured architecture based on the collaboration of “atomic components.” These components include independent-function tools (such as search engines, calculators, database APIs, etc.) and specific models (e.g., small language models, SLMs), which collectively form a Big Loop system through orderly interactive collaboration. This section analyzes the notable advantages of Big Loop systems over traditional single-model systems from multiple dimensions. Figure 2 outlines the advantages of Big Loop systems.

3.1. Enhanced Capability

Although LLMs possess general capabilities, they still face limitations when handling domain-specific tasks and are prone to generating seemingly reasonable yet factually incorrect “hallucinations” [27,28,29,30]. Big Loop systems can significantly improve domain expertise and mitigate hallucination issues through the following methods:
  • Acquisition of Domain Knowledge: Big Loop systems enhance the ability of LLMs to dynamically access external tools, enabling them to retrieve and integrate external knowledge. By integrating database tools, LLMs can access structured databases to perform specific information retrieval and complex queries, effectively expanding their knowledge base [31,32,33]. The retrieval process often relies on specific atomic models, such as vector models [34,35,36,37,38], query optimization modules [39,40,41], and re-ranking components [42,43].
  • Use of Domain-Specific Tools: Integrating tools specific to certain domains enhances the LLM’s domain knowledge [22,23,24,25]. For instance, LLMs can use online calculators or mathematical tools to perform complex calculations, solve equations, and conduct statistical analyses [23,44,45,46]; external programming resources (such as Python compilers and interpreters) allow LLMs to execute and refine code based on feedback, improving code quality [47,48,49]. This approach compensates for the lack of professional knowledge and enhances practical utility in specialized scenarios.
  • Calling of Specialized Models: In domains such as finance [50,51], law [52,53], healthcare [54,55], and education [56,57], dedicated models can be trained to meet specific needs. Even though general-purpose models have evolved into multi-modal ones, they still struggle with certain niche modalities (e.g., hyperspectral images [58], point clouds [59], LiDAR [60], MRI [61]). General models can significantly enhance their domain-specific performance by calling specialized models.

3.2. Functionality Expansion

Big Loop systems possess robust capabilities for functional expansion:
  • Access to Real-Time Information: By integrating search engine tools, LLMs can access the latest information [20,21]; with weather tools, they can provide real-time weather updates, forecasts, and historical data [17,62]; interaction with map tools enables geographic data access and location-based queries [7]; in the financial sector, real-time exchange rate APIs and stock market tools allow for precise valuation of international asset portfolios [63].
  • Interaction with External Environments: LLMs are essentially language processors and lack the capability to independently execute external operations such as booking meeting rooms or flights [9], scheduling [63], setting reminders [64], or filtering emails [65]. Big Loop systems can integrate project management and workflow tools to manage tasks, monitor progress, and optimize processes [65]; integrate online shopping assistants to streamline the shopping process [66]; and utilize spreadsheet tools for direct execution of data analysis and visualization [7].

3.3. Computational Efficiency

As LLMs grow in size, the computational resources required for inference increase dramatically, making accelerated inference a key demand. Lightweight smaller LLMs play a crucial role in accelerating inference of larger models [11,67]. The main approaches to such collaborative acceleration fall into three categories:
  • Input Compression: Smaller LLMs compress inputs to shorten context length and achieve efficient computation [68,69,70,71,72].
  • Speculative Decoding: Smaller LLMs speculatively generate multiple draft tokens, which the larger LLM verifies in parallel, reducing its token generation burden [73,74]; a minimal sketch follows this list.
  • Offloading High-Frequency Tasks: High-frequency tasks such as translation [75], summarization [76], and text rewriting [77] can be efficiently handled by smaller specialized models. Big Loop systems delegate these tasks to small models, enhancing overall computational efficiency.
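Below is a minimal greedy sketch of the accept/reject skeleton behind speculative decoding. The draft_model and target_accepts functions are toy stand-ins introduced for illustration; real systems compare draft and target token distributions and verify all drafts in a single parallel forward pass of the large model.

import random

def draft_model(prefix, k):
    # Stand-in for a small LLM that cheaply proposes k draft tokens.
    return ["d%d" % (len(prefix) + i) for i in range(k)]

def target_accepts(prefix, token):
    # Stand-in for the large LLM's verification of one draft token.
    random.seed(hash((tuple(prefix), token)))  # deterministic toy check
    return random.random() < 0.7

def speculative_decode(prompt, k=4, max_len=12):
    out = list(prompt)
    while len(out) < max_len:
        for tok in draft_model(out, k):
            if len(out) >= max_len:
                break
            if target_accepts(out, tok):
                out.append(tok)               # accepted draft: cheap token
            else:
                out.append("t%d" % len(out))  # rejected: large model's own token
                break
    return out

print(speculative_decode(["<s>"]))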

3.4. User-Centered Design

Big Loop systems demonstrate notable advantages in user-centered design:
  • Enhanced Interpretability and User Trust: Most existing LLMs operate as “black boxes” with non-transparent decision processes, lacking interpretability [78,79]. This leads to user concerns about the reliability of responses, especially in high-stakes domains like healthcare and finance, where interpretability is critical [7,51]. Big Loop systems, by orchestrating multiple components, can expose the decision-making process, increasing transparency. Even when errors occur, users can quickly identify the source, improving understanding and trust and facilitating more effective human–AI collaboration.
  • Privacy Protection: LLM applications face major challenges in protecting user privacy [80]. Directly transmitting personal data to general models risks data leakage. Big Loop systems can locally deploy small models to handle user requests and only transmit privacy-stripped queries to server-side LLMs [81] (a minimal sketch follows this list); additionally, LLMs can utilize federated learning to transfer knowledge without transmitting raw data [82,83].
  • Customization Capability: Since Big Loop systems consist of multiple atomic components, specific small models or tools can be optimized or replaced according to user needs, without relying on large-scale labeled datasets or extensive training, thereby enabling easier customization [84].
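As a hedged illustration of the locally deployed privacy-stripping pattern in [81], the following sketch uses toy regular expressions as a stand-in for a local small model; real deployments would use a learned PII detector.

import re

def local_privacy_scrubber(query: str) -> str:
    # Toy stand-in for a locally deployed small model: strips obvious PII
    # before the query leaves the device (the regexes are illustrative only).
    query = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", query)
    query = re.sub(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", "<PHONE>", query)
    return query

def server_side_llm(query: str) -> str:
    # Stand-in for the remote general-purpose LLM.
    return "LLM response to: " + query

user_query = "Email john.doe@example.com or call 555-123-4567 about my rash."
# Only the privacy-stripped query is transmitted to the server.
print(server_side_llm(local_privacy_scrubber(user_query)))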

4. Construction of Atomized Components

4.1. Tool Construction

4.1.1. Using Human Tools Directly as Model Tools

Big Loop systems can directly integrate existing human tools into the model calling framework. For real-time information unknown to LLM agents, knowledge retrieval tools (such as search engines) can help LLM agents to quickly obtain the latest knowledge [20,21], overcoming the limitations of the knowledge base during the training phase. WebGPT [85] successfully achieved deep integration of online search engines and LLMs by incorporating commercial APIs; ToolCoder [86] uses DuckDuckGo as the search engine. In addition, dedicated query tools such as weather queries, map services, and real-time exchange rate interfaces can provide domain-specific real-time data.
In terms of computational capacity expansion, Big Loop systems significantly enhance LLM agents’ ability to execute complex code and numerical computations by leveraging computational tools like Python interpreters and calculators [23,44,45,46,47,48,49]. RLEF [87] optimizes code generation performance through an end-to-end reinforcement learning framework, enabling LLMs to learn feedback from code executors; CodeActAgent [88] dynamically updates operation strategies based on interaction with code interpreters; Toolformer [63] integrates various tools, including calculators, greatly improving model performance on mathematical tasks; ART [89] demonstrates significant advantages in mathematical reasoning and complex computations by invoking external tools like calculators.
In the field of external environment interaction, Big Loop systems can achieve functionalities such as meeting room booking, flight ticket ordering [9], schedule management [63], reminder setting [64], and email filtering [65] through tool calls. Moreover, systems can integrate project management and workflow tools to enable task allocation, progress tracking, and process optimization [65]; connect to online shopping assistants to simplify procurement processes [66]; and support direct execution of data analysis and visualization by LLMs through spreadsheet processing tools [7]. These tools, originally designed to improve human work efficiency, can be directly utilized by LLMs.

4.1.2. Manually Constructed Tools

Researchers have designed manually constructed tools specifically for LLMs, with typical examples being the MCP [90] and the A2A protocol [91], whose core goals are to enhance interaction capabilities between LLMs and external systems and improve multi-agent collaboration efficiency.
For important tools required by large models, a more direct approach is to manually design corresponding tools in a targeted manner. TravelAgent is a travel planning system driven by large language models. In addition to invoking real-time tools such as those for cities, flights, hotels, restaurants, and destinations, it also incorporates manually designed distance calculation tools and time conversion tools to deliver superior planning capabilities.
Another category of manually constructed tools consists of interaction protocols between large models. The MCP (Model Context Protocol) is an open protocol that standardizes how applications provide context to LLMs, establishing secure links between LLMs and data sources, supporting AI agent and workflow construction. This protocol allows AI agents to securely and efficiently integrate various data sources and APIs, expanding capability boundaries; A2A (agent-to-agent protocol), proposed by Google, is an open standard aimed at enabling automatic discovery, secure communication, task sharing, and real-time coordination among different AI agents, abstracting away underlying framework and vendor differences. If the MCP can be viewed as the “dialogue interface” between AI agents and external tools, A2A serves as the “collaboration bridge” between agents, resolving fragmentation in agent ecosystems and promoting multi-agent systems to operate as coordinated teams on complex tasks.

4.1.3. Automatic Construction of Tools by LLMs

Traditional large language models (LLMs) often face challenges such as verbose code, complex logic, and difficult verification when generating code-based solutions. This has prompted researchers to explore enabling LLMs to acquire automated tool construction capabilities similar to human developers, aiming to enhance the conciseness, accuracy, and generality of solutions. Take the calculation of the rate of change in tabular data as an example: traditional LLMs rely on multi-step raw function calls (e.g., data slicing, value extraction, arithmetic operations), whereas human developers would create a dedicated function (e.g., calc_rate_of_change) to encapsulate these operations, making the solution more understandable and usable. This human development pattern provides a crucial insight for LLM-based tool construction.
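A minimal version of the encapsulated function described above might look as follows; the exact signature is an illustrative assumption.

def calc_rate_of_change(table, column, start, end):
    # Encapsulates slicing, value extraction, and arithmetic into one
    # reusable tool, replacing multi-step raw function calls.
    v_start, v_end = table[start][column], table[end][column]
    return (v_end - v_start) / v_start

sales = [{"month": "Jan", "revenue": 100.0}, {"month": "Feb", "revenue": 125.0}]
print(calc_rate_of_change(sales, "revenue", 0, 1))  # 0.25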
LATM [92] employs GPT-4—powerful yet resource-intensive—as the “tool creator” to design tools for a range of tasks, with these tools implemented as Python utility functions. GPT-3.5, in turn, acts as the “tool user,” leveraging the tools built by the creator to solve problems. Compared to having GPT-4 take on both roles, LATM achieves comparable performance on various complex reasoning tasks, including those in the Big-Bench suite, while significantly reducing inference costs. CREATOR [93] enables large language models to independently create tools using documents and code, decoupling abstract tool creation from specific decision-making and execution processes, thereby improving performance on math and table-related tasks.
Beyond tool construction via code, some works focus on constructing tools from textual knowledge. REFTOOL [94] breaks through the inherent capability boundaries of LLMs by extracting and generating tools from external reference materials such as textbooks. Its tools are directly derived from verified external knowledge, significantly enhancing accuracy and reliability. KTCE [95] defines tools as executable forms of domain knowledge and proposes a “problem-knowledge-tool” paradigm. It abstracts knowledge from training data through LLMs, constructs hierarchical knowledge trees, and further induces atomic tools. Although KTCE’s knowledge originates from training data, this method successfully transforms implicit knowledge into explicit executable tools.
To improve tool reusability, some studies attempt to construct more complex advanced tools based on basic ones. TROVE [96] proposes a training-agnostic tool construction method, building a verifiable and efficient function toolbox through a “use-grow-prune” mechanism. Its core lies in learning and selecting reusable advanced functions from problem–solution pairs. REGAL [97] defines tool construction as a process of “learning reusable function libraries through code refactoring,” reorganizing code while maintaining consistent execution results and extracting general patterns through iterative verification.
For dynamic task environments, researchers have further explored the ability of LLM agents to autonomously discover tools. ASI [98] refers to tool construction as “skill induction,” where agents autonomously learn and create advanced skills represented by executable programs through online interaction; these programs encapsulate primitive actions or sequences of learned skills. The SKILLWEAVER framework [99] supports agents in synthesizing structured skills (APIs) by exploring website environments and equips them with self-testing and debugging capabilities to ensure robustness in complex scenarios.

4.1.4. Tool Documentation Optimization

Although many tools already exist or have been constructed, without the necessary tool descriptions, models often struggle to know how to invoke them, especially human-designed tools. Thus, numerous studies have emphasized the importance of tool documentation. Experiments on six tasks across visual and language modalities have shown that zero-shot prompts containing only documentation significantly outperform few-shot prompts without documentation [100]. Additionally, because original tool documentation tends to be lengthy, some studies first extract key information (e.g., tool names, parameter schemas, usage scenarios) from the documentation before inputting it to the agent. This ensures that the tool descriptions provided to the agent are concise yet contain the core elements, helping the agent to quickly understand tool functions and invoke them accurately. For instance, tool parameter descriptions can be indirectly corrected through an LLM re-ranker to ensure that parameter names and values align with the agent's reasoning logic [101]. DRAFT [102] dynamically optimizes tool documentation by analyzing feedback and attempt results from interactions between LLMs and external tools. It comprises three learning phases: experience collection, learning from experience, and documentation rewriting, iteratively improving documentation quality. Extensive experiments across multiple datasets demonstrate that DRAFT's feedback-based iterative optimization significantly enhances documentation quality, enabling LLMs to better understand and utilize tools.
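The collect–learn–rewrite cycle that DRAFT describes can be sketched schematically as follows; every function name here is a hypothetical placeholder rather than DRAFT's actual interface.

def collect_experience(llm, tool, doc, probes):
    # Phase 1 (experience collection): attempt tool calls guided by the doc.
    calls = [llm(doc, q) for q in probes]
    return [(call, tool(call)) for call in calls]

def learn_from_experience(experience):
    # Phase 2 (learning from experience): a toy failure summary.
    failures = [call for call, result in experience if "error" in str(result)]
    return "failing call patterns: %s" % failures

def rewrite_documentation(doc, lessons):
    # Phase 3 (documentation rewriting): revise the doc with the lessons.
    return doc + "\n# Revision note: " + lessons

def draft_style_loop(llm, tool, doc, probes, rounds=3):
    for _ in range(rounds):  # iterate collect -> learn -> rewrite
        experience = collect_experience(llm, tool, doc, probes)
        doc = rewrite_documentation(doc, learn_from_experience(experience))
    return doc

# Toy stubs standing in for the LLM and the external tool.
llm = lambda doc, q: {"op": "add"} if "add" in q else {"op": "pow"}
tool = lambda call: {"ok": 1} if call["op"] == "add" else {"error": "unsupported op"}
print(draft_style_loop(llm, tool, "calculator: supports add", ["add 1 2", "pow 2 3"]))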
In tool construction research, whether in tool creation or documentation optimization, many existing works can be categorized under prompt engineering. Within prompt engineering, one approach involves directly designing relevant prompts; the other leverages feedback results from tool invocations for optimization. Examples include REGAL [97], which learns reusable function libraries through refactoring, and DRAFT [102], which optimizes tool documentation based on feedback from tool interactions.

4.2. Model Construction

How to construct atomized models that are differentiated from general LLMs and serve as fundamental components, complementing the capabilities of general models, is the core topic of this section. We elaborate on four dimensions: architecture design, task paradigms, training strategies, and knowledge distillation.

4.2.1. Architectural Innovations

Deep learning architectures show diversified development trends, with different models exhibiting unique advantages on specific tasks:
  • Limitations and Improvements of Transformer: Despite dominating natural language processing, Transformers suffer efficiency bottlenecks in long-context memory handling (e.g., quadratic computational complexity) and high computational costs, prompting researchers to explore hybrid architectures. For example, recurrent neural networks (LSTM/GRU) remain indispensable for temporal data (such as financial sequences, speech signals) by capturing long-term dependencies through gating mechanisms [103,104].
  • Continuous Evolution and Cross-modal Fusion of CNNs: Convolutional neural networks, through innovations like residual connections (ResNet), depthwise separable convolutions (MobileNet), and attention mechanisms (CBAM), maintain core status in image recognition. For example, visual-language models (such as CLIP) combining CNNs achieve more accurate cross-modal understanding by local feature extraction aligned with global semantics [105,106].
  • Challenges and Frontiers in Graph Neural Networks (GNNs): GNNs excel in relational tasks like social network analysis and molecular structure prediction but face issues of poor generalization (e.g., over-smoothing) and large-scale graph processing efficiency. Topological deep learning, with higher-order neighborhood aggregation and dynamic graph representations, is gradually overcoming traditional GNN expressive limitations [107].
  • Technical Iterations of Generative Models: GANs generate highly realistic images via adversarial training but suffer from mode collapse; VAEs offer structured latent spaces through variational inference but face blurry outputs; diffusion models balance generation quality (e.g., Stable Diffusion) and training stability via denoising processes, becoming the mainstream for multi-modal generation [103].

4.2.2. Task Paradigms

The generative language-modeling paradigm of current LLMs does not handle all tasks well. Retrieval tasks, for example, involve processing massive document collections, and evaluating an output often requires a reward model that produces a scalar score. Such tasks typically necessitate training dedicated models.
Embedding Models: For retrieval, a key approach is training dense retrieval models. As the core module of the RAG paradigm, DPR enhances semantic alignment between queries and document embeddings via contrastive learning fine-tuning. For example, multi-vector retrieval divides documents into fine-grained segments to improve retrieval accuracy for complex queries [108]. With the rapid development of LLMs, efforts focus on building efficient embedding models. BGE M3-Embedding achieves three innovations via self-knowledge distillation: multilingual support (covering 100+ languages), multitask capability (dense retrieval/sparse retrieval/semantic similarity), and multi-granularity processing (from short sentences to 8192-token documents) [34]. The E5 model uses LLM-generated synthetic data to enhance training diversity, surpassing traditional embeddings on MTEB benchmarks [105].
Reward Models: Reward models are trained on human preference datasets typically consisting of paired “chosen” and “rejected” responses. During training, models learn to assign higher scores to “chosen” responses and lower scores to “rejected” ones, fitting human preference patterns [109,110,111].
Common training objectives maximize likelihood of human preference. For example, the widely used Bradley–Terry (BT) model defines the probability of one response outperforming another, optimized by minimizing negative log-likelihood loss [110,111]. Once trained, reward models accept new prompts and responses generated by LLMs or other generative models, outputting a single scalar reward value predicting response quality or alignment with human preferences [109,111].
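Concretely, on a single preference pair the Bradley–Terry objective reduces to a logistic loss on the reward margin, as in the following pure-Python sketch (the scalar rewards are toy values):

import math

def bt_loss(r_chosen, r_rejected):
    # Negative log-likelihood of the Bradley-Terry model:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Toy scalar rewards a reward model might assign to a response pair.
print(bt_loss(r_chosen=2.1, r_rejected=0.4))  # small loss: preference respected
print(bt_loss(r_chosen=0.4, r_rejected=2.1))  # large loss: preference violated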

4.2.3. Training Strategies

Training strategies are crucial for building atomized models tailored for specific functions:
  • Supervised Fine-tuning (SFT): Using domain-specific labeled data to adapt pretrained LLMs to target tasks.
  • Reinforcement Learning with Human Feedback (RLHF): Incorporates human feedback signals into policy optimization for more aligned behavior.
  • Self-Supervised Learning: Leverages massive unlabeled data via pretext tasks (e.g., masked modeling, contrastive learning) to enhance representations.
  • Knowledge Distillation: Transfers knowledge from large teacher models to smaller student models, enabling lightweight yet performant components.

4.2.4. Knowledge Distillation

Knowledge distillation is a key technology in model Atomization. It enables compressing large, complex models into smaller, more efficient ones without significant performance loss, supporting deployment in resource-constrained environments.
Distillation techniques include the following:
  • Logits Distillation: Matching the output probabilities of teacher and student models (a minimal sketch follows below).
  • Feature Distillation: Aligning intermediate layer representations.
  • Relation Distillation: Preserving inter-sample or inter-class relationships.
  • Data-Free Distillation: Generating synthetic data for distillation without original training data.
Combined with multi-task and transfer learning, distillation enables creating specialized atomic models that complement general LLMs, enhancing overall system efficiency and capability.
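As an illustration of the first technique, logits distillation is commonly implemented as a temperature-scaled KL divergence between teacher and student distributions; the following is a minimal sketch with toy logits.

import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_kl(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # the classic logits-distillation objective (scaled by T^2).
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]  # large model's logits for three classes (toy)
student = [2.0, 1.2, 0.5]  # small model's logits
print(distillation_kl(teacher, student))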

5. Big Loop Scheduling

Currently, the inherent deficiencies of LLMs in long-term planning, real-time environment adaptation, and hallucination resistance are driving AI systems to evolve from single all-purpose LLM architectures to modular collaborative orchestration systems [112,113,114,115]. Big Loop scheduling, through fine-grained task decomposition and component orchestration mechanisms, effectively addresses core challenges such as outdated knowledge, biased factual reasoning, and insufficient domain expertise [112,113]. Big Loop scheduling marks an important paradigm innovation in AI system design, centered on efficiently handling complex tasks by collaboratively integrating general LLMs, task-specific models, and external tools as atomic components. A typical architecture uses a large language model (LLM) as the central coordination hub, dynamically scheduling components to respond to user demands. This architectural innovation targets the limitations of traditional single models and significantly enhances scalability, efficiency, and environmental adaptability in solving complex problems [116]. Besides the centralized control paradigm, decentralized interactive scheduling modes have become an important exploration direction.

5.1. Task Planning

Task planning is the core of Big Loop scheduling. It determines how to decompose a complex user request into a series of executable subtasks and select appropriate atomic components for each subtask. Effective task planning significantly improves system efficiency, accuracy, and robustness. Research on LLM-driven scientific agents also emphasizes the critical role of the “planner” in decomposing and managing scientific tasks [112].

5.1.1. Task Decomposition

Task decomposition typically refers to breaking down a task into a series of fixed steps that remain unchanged during execution. This approach suits scenarios with clear task structures and preset execution paths, e.g., “booking a flight” can be decomposed into fixed steps such as “query flights,” “select flight,” “fill passenger info,” and “payment.”
Although chain-of-thought (CoT) prompting is primarily used to enhance LLM reasoning capabilities, its idea of breaking complex problems into intermediate steps provides inspiration for task decomposition [63,117]. By prompting LLMs to generate explicit reasoning steps, it not only enhances interpretability but also improves performance on reasoning-intensive tasks.
Recent methods such as SoftCoT [118] and Coconut [119] explore reasoning in continuous latent spaces using “soft thinking tokens” instead of discrete token sequences. This approach aims to improve efficiency and alleviate catastrophic forgetting often accompanying full-model fine-tuning, which is crucial for complex reasoning tasks. For example, SoftCoT uses a lightweight auxiliary model to speculatively generate these soft tokens and then maps them into the main LLM’s representation space through a trainable projection module.
Some studies have explored task planning from a graph structure perspective [120]. Subtasks can naturally be modeled as a graph, where nodes represent subtasks and edges denote the dependency relationships between them. Therefore, task planning is essentially a decision-making problem that involves selecting and invoking a connected path or subgraph within the corresponding graph. Graph-based task planning can mitigate the biases in the attention mechanism of large language models and the shortcomings of autoregressive losses in graph-structured tasks. Extensive experiments have demonstrated that graph-neural-network-based methods can outperform existing solutions even without training, and that performance can be further improved with a small amount of training. Moreover, this performance advantage becomes more pronounced as the scale of the task graph increases [120].
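To illustrate the graph view of planning, the sketch below models subtasks as a dependency graph and derives a valid execution order topologically (Python 3.9+); the flight-booking subtasks reuse the example from Section 5.1.1 and are illustrative only.

from graphlib import TopologicalSorter

# Nodes are subtasks; each entry maps a subtask to the subtasks it depends on.
dependencies = {
    "select_flight": {"query_flights"},
    "fill_passenger_info": {"select_flight"},
    "payment": {"select_flight", "fill_passenger_info"},
}

# Planning = selecting a valid execution path through the subtask graph.
plan = list(TopologicalSorter(dependencies).static_order())
print(plan)  # ['query_flights', 'select_flight', 'fill_passenger_info', 'payment']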

5.1.2. Component-Based Planning

Component-based planning can be mainly categorized into two types of methods: prompting-based and fine-tuning-based. ReAct [121] integrates the invocation of external tools into the planning process. In two interactive decision-making benchmark tests, ReAct, with only one or two in-context example prompts, outperformed imitation learning and reinforcement learning methods in absolute success rates by 34% and 10%, respectively. To avoid manually designing task-specific demonstration examples, some studies have proposed the Automatic Reasoning and Tool-use (ART) framework [89], which selects demonstration examples of multi-step reasoning and tool usage from a task library when faced with new tasks to be solved. STRIDE [122] designs an agent framework equipped with memory functions and dedicated tools to enhance the strategic decision-making capabilities of LLMs. Recent studies have encapsulated capabilities such as web search, code execution, and structured memory into independent agents, which collaborate with the core LLM through clear invocation rules. By designing a special token triggering mechanism, the reasoning model embeds tokens like [Web-Search], [Code], and [Mind-Map] during generation, enabling dynamic invocation of external agents. One key innovation is the “Mind-Map agent,” which constructs a structured knowledge graph to store reasoning contexts and track logical relationships, ensuring coherence in long reasoning chains involving extensive tool usage [123].
ToolChain* is an efficient tree-search planning algorithm for LLM agents [124]. This algorithm constructs the entire action space as a decision tree, where each node represents a potential API function call involved in the solution plan. By integrating the A* search algorithm with task-specific cost function design, it can effectively prune high-cost branches that may contain erroneous actions and identify the valid path with the lowest cost as the solution. Extensive experiments on multiple tool usage and reasoning tasks show that ToolChain* efficiently balances exploration and exploitation in a large action space. It outperforms state-of-the-art baseline models on planning and reasoning tasks by an average of 3.1% and 3.5%, respectively, while reducing the required time by factors of 7.35 and 2.31. To address the complex API coupling issues commonly encountered in academic query processing, a solution-based LLM API academic information retrieval method called SOAY has been proposed [125]. This method uses code containing solutions as the reasoning medium, where “solutions” refer to pre-constructed API call sequences. The inclusion of solutions reduces the difficulty for models in understanding complex relationships between APIs, while code improves reasoning efficiency. Compared with state-of-the-art LLM API-based baseline methods, SOAY achieves performance improvements ranging from 34.58% to 75.99%.
Prompting methods typically rely on models with large parameter scales. To enhance the planning capabilities of smaller models, some studies have decomposed agent capabilities into three modules: a planner, an invoker, and a summarizer [126]. Each component is implemented by a single LLM, focusing on specific capabilities and collaborating with the other components to complete tasks. To train this framework effectively, a two-stage training paradigm is introduced: in the first stage, the backbone LLM is fine-tuned on the entire dataset without distinguishing subtasks, allowing the model to comprehensively understand the tasks; in the second stage, the fine-tuned LLM is used to instantiate the planner, invoker, and summarizer, which are then continuously fine-tuned on their respective subtasks. Evaluations on various tool usage benchmarks show that the proposed multi-LLM framework outperforms traditional single-LLM methods, highlighting its effectiveness and advantages in tool learning.
Toollink [127] uses ChatGPT to filter useful tools related to the problem from the toolset and generate natural language planning (CoS-Planning); based on the planning, ChatGPT further generates code to implement tool invocation (CoS-Calling); manual verification of the correctness of generated results ensures the relevance of planned tools and the execution effectiveness of invocation code; and the verified planning and invocation data are incorporated into datasets, forming tool planning and tool invocation sub-datasets, respectively. The final dataset is used to train open-source models to enhance their tool planning capabilities.
The ToolLLaMA model [65] utilizes a depth-first search-based decision tree (DFSDT) mechanism, combined with over 16,000 real-world APIs for multi-step reasoning, which effectively improves the performance of tool-augmented LLMs compared with traditional chain-of-reasoning mechanisms. However, this method only uses successful paths in the decision tree (also called the reasoning tree) for supervised fine-tuning (SFT), ignoring potential learning opportunities in failed paths. ToolPrefer-LLaMA [128] further leverages these previously ignored failed explorations in the decision tree: LLMs are first fine-tuned on successful tool usage expert trajectories, and then direct preference optimization (DPO) is applied with preference data to update the LLM's policy. ToolPrefer-LLaMA significantly outperforms baseline models in almost all test scenarios and exhibits better generalization to unseen APIs.
In addition, some studies focus on tool learning under budget constraints: before using tools, they first estimate the usefulness of candidate tools from past experience under the budget, and then use dynamic programming to construct plans [129]. Another study considers scalability issues when dealing with APIs that are irreversible and have a significant impact on the system (such as database deletion APIs) [130]. Processes in which each API call is time-consuming, as well as those requiring forward-looking planning (such as automated action pipelines), pose similarly complex challenges; to address them, the authors design SwissNYF, a suite of black-box algorithms for planning and verification tasks.
Moreover, some studies consider invoking tools only when necessary [131]. Ideally, external tools should be invoked only when the LLM is not confident in its answer, so the model self-assesses whether it can answer a question directly or needs to call an external tool. This work takes a supervised learning approach, generating labels from closed-book question-answering tasks via a hallucination-masking mechanism. Experiments on question-answering tasks with models trained using parameter-efficient fine-tuning show that the model answers 78.2% of known queries directly and chooses to search for 77.2% of unknown queries, reducing the API invocation rate to 62%. TRICE [132], a two-stage end-to-end tool learning framework based on execution feedback, enables models to learn continuously from tool execution feedback and thereby master when and how to use tools effectively. Experimental results and further analysis indicate that TRICE improves the accuracy of tool usage while reducing over-reliance on tools, enabling LLMs to use tools selectively.

5.1.3. Adaptive Planning

Adaptive planning refers to dynamically refining or adjusting the initial task decomposition based on task execution status or environmental feedback. Task decomposition evolves from simple fixed sequences (inspired by basic CoT) to more dynamic, graph-like structures (ToT, GoT), indicating that internal LLM reasoning mechanisms are developing toward greater flexibility and efficiency to better support complex, multi-step tasks. This approach is more flexible and can handle uncertainty or changes during task execution.
The Tree-of-Thought (ToT) framework exemplifies adaptive planning by allowing the model to explore different reasoning paths and self-evaluate, dynamically adjusting its thinking during execution [117]. This aligns with the concept of adaptive task planning, where subtasks are adjusted based on feedback, demonstrating “deliberate problem-solving” capabilities [117]. Similarly, the Graph-of-Thought (GoT) framework [117] organizes diverse reasoning paths to explore multiple possibilities and select higher-quality steps, enabling more structured task decomposition.
Reinforcement learning (RL) methods are widely used in adaptive planning to learn optimal decision policies through environment interaction [116]. For instance, in multi-agent collaboration, a centralized coordinator can be trained by RL to adaptively prioritize and schedule agents according to evolving task states [116].
Recent research like PSALM (Predicting Semantics of Actions with Language Models) combines LLMs with symbolic planners to iteratively learn action semantics (preconditions and postconditions) through closed-loop environment interaction and feedback [133]. This method leverages LLM commonsense reasoning to infer domain-specific rules and dynamically refine environmental understanding, significantly improving planning success rates [133]. This shows that LLMs do not directly replace traditional planning but enhance it [112,133,134].
Dynamic retrieval-augmented generation (Dynamic RAG) is another emerging adaptive paradigm, which dynamically decides when and what external knowledge to retrieve during LLM generation, enabling real-time adaptation to the evolving information needs of LLMs [113]. This is particularly beneficial for multi-hop reasoning and complex generation tasks, allowing for the iterative integration of retrieved knowledge during generation [113].
Auto-Curriculum Expert Iteration (Auto-CEI) enhances LLM reasoning by exploring reasoning trajectories and correcting erroneous paths, reducing error accumulation [115]. It employs an automatic curriculum that adjusts rewards to encourage longer reasoning before admitting incapability, balancing confidence and conservatism. This helps to align LLM responses with their true capabilities, improving robustness in logic, mathematics, and planning tasks [115].
These advances indicate that hybrid neuro-symbolic planning is becoming mainstream. The flexible reasoning of LLMs (e.g., inferring semantics or generating heuristic rules) combined with the rigor of symbolic planners (ensuring execution and state tracking) is a powerful and necessary approach to solving complex real-world planning problems [112,133,134]. This suggests that LLMs enhance rather than replace traditional planners. Furthermore, advances in dynamic RAG and Auto-CEI highlight the urgent need for systems to adapt their information acquisition and reasoning processes in real time according to changing context and internal model states, surpassing static preprocessing and achieving truly dynamic closed-loop feedback during generation and execution.

5.1.4. Proactive Planning

Proactive planning emphasizes human–machine collaboration and is particularly suitable for open-domain or ambiguous task scenarios, where tasks may be insufficiently defined and require multiple interactions with humans to be further clarified. This human–machine collaboration aims to improve system performance, reliability, and safety [114].
Humans play an indispensable role in proactive planning; they can provide necessary clarifications, context, or domain knowledge, offer critical feedback and corrections, and perform essential supervision and control. This is especially important in high-risk or sensitive scenarios such as healthcare, privacy, or security [114]. In robotics, the LLM A* framework leverages the commonsense knowledge of LLMs and human feedback to facilitate few-shot, near-optimal path planning [135]. Prompts are used to provide environmental information to the LLM and convey human feedback on intermediate planning results, making the entire planning process transparent and code-free [135].
The system needs to proactively communicate with users through interactive dialogues to gradually clarify task details. This is implemented via a dialogue management module responsible for understanding user intents, identifying missing information, and actively initiating clarifying conversations [114,136,137]. Recent research focuses on training LLMs to proactively ask clarifying questions when necessary by simulating expected outcomes of future dialogue turns to learn when to inquire [136]. This enables LLMs to learn that posing clarifying questions can lead to more accurate and tailored subsequent responses [136].
LLMs have also made significant progress in multi-turn interaction capabilities, enabling them to maintain context, generate coherent responses over multiple dialogue turns, and dynamically interact as intelligent agents with users or environments [114,137]. For example, the ‘clem:todd’ framework systematically evaluates these task-oriented dialogue systems, including their ability to handle goal-driven conversations and manage ambiguities [137].
These developments indicate that human–machine collaboration has become an inevitable choice for complex, ambiguous, or high-risk tasks rather than a mere optional feature. For such tasks, fully autonomous systems are currently infeasible or undesirable. Therefore, the future of Big Loop systems lies in robust human–machine collaboration, where humans provide necessary guidance, feedback, and supervision. Moreover, multi-turn dialogue systems are not only user interfaces but increasingly integral parts of the planning process itself, allowing tasks to be iteratively defined and goals and constraints to be dynamically adjusted through dialogue. This means that dialogue itself is a dynamic, evolving “plan” where the system and user collaboratively define and refine tasks rather than the system receiving a fully predefined task upfront.

5.2. Atomic Component Selection

Atomic component selection refers to assigning corresponding subtasks to the most suitable atomic components for completion. This involves evaluating different components’ capabilities, costs, and efficiencies to achieve optimal task execution.

5.2.1. Basic Selection Methods

Basic selection methods typically consider only choosing the most accurate atomic component. A common approach is retriever-based tool selection. This method matches subtask descriptions with predefined tool functionality descriptions, using information retrieval techniques (such as BM25, vector similarity search) to find the most relevant tools [138]. This is especially prevalent in retrieval-augmented generation (RAG) systems, where the retriever module identifies relevant external knowledge (tools or documents) based on encoded inputs [112,138]. Retrievers typically consist of an encoder for encoding inputs, an efficient indexing system supporting approximate nearest neighbor search (e.g., IVFPQ, HNSW), and a datastore storing external knowledge [138].
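Below is a minimal, embedding-free sketch of retriever-based tool selection; bag-of-words cosine similarity stands in for the dense encoders and ANN indexes (e.g., HNSW) noted above, and the tool descriptions are hypothetical.

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tools = {  # tool name -> functionality description (illustrative)
    "weather_api": "get current weather forecast for a city",
    "calculator": "perform arithmetic and numeric computation",
    "flight_search": "query available flights between two cities",
}

def select_tool(subtask):
    # Match the subtask description against each tool's description.
    q = Counter(subtask.lower().split())
    return max(tools, key=lambda name: cosine(q, Counter(tools[name].lower().split())))

print(select_tool("find the weather forecast in Beijing"))  # weather_api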
Another approach is LLM-based tool selection. With the advancement of LLM capabilities, LLMs can directly understand the semantics of subtasks and select appropriate tools based on their internal knowledge or external tool descriptions. LLMs can be trained to recognize when external tools need to be invoked and generate the necessary call parameters [63]. Toolformer is an example where LLMs learn to use external tool APIs through self-supervised learning [63]. LLMs are also used as “data selectors” in instruction fine-tuning, identifying high-quality instruction data, which aligns with the concept of selecting optimal tools [139]. Additionally, the capabilities of LLM agents themselves are being evaluated, such as their ability to reproduce code from research papers, which involves selecting and using appropriate tools/APIs [140].
The development of these basic selection methods indicates that LLMs are increasingly becoming intelligent routers. They are capable not only of simple information retrieval but also of semantically aware routing or “selection” of tools and other models. This suggests a more complex, semantic-aware routing mechanism driven by LLMs’ understanding abilities rather than relying solely on keyword or embedding matching.

5.2.2. Advanced Selection Strategies

Advanced selection strategies require consideration of multiple objectives simultaneously to address real-world constraints and demands [115]. This typically necessitates more complex decision modules, possibly combining multi-objective optimization, reinforcement learning, or heuristic rules.
Cost-effectiveness is an important consideration. When multiple components can complete the task, priority is given to the component with the lowest calling cost. For example, for simple text summarization tasks, lightweight, low-cost models are preferred over large, expensive ones [116].
Latency sensitivity is critical for tasks requiring real-time responsiveness. Such tasks prioritize components with fast execution and low response times. For instance, in voice assistant scenarios, local models with slightly lower accuracy may be preferred over highly accurate models requiring cloud inference [116].
Data privacy and security require prioritizing locally running components or rigorously certified cloud services when handling sensitive or private data. This is especially important in finance, healthcare, and similar fields. It often involves trade-offs between privacy and knowledge transfer [141].
Robustness and reliability are core considerations in critical tasks. Even at a somewhat higher cost, more reliable components may be chosen, and redundant calls may be considered to ensure system stability, low error rates, and fault recovery. The Auto-Curriculum Expert Iteration (Auto-CEI) method enhances LLM reasoning and aligns the model's responses with its actual capabilities, helping to reduce hallucinations and improve reliability [115]. However, the inherent unpredictability of LLM agents and the accumulation of uncertainty across interactions pose significant robustness challenges for multi-agent systems [115].
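These criteria can be folded into a single selection policy. The sketch below scores candidates with an illustrative weighted sum plus a hard privacy constraint; the fields, weights, and candidate numbers are assumptions for exposition, not values prescribed by the cited works.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float  # expected task success rate in [0, 1]
    cost: float      # normalized calling cost in [0, 1], lower is better
    latency: float   # normalized response time in [0, 1], lower is better
    local: bool      # runs on-premises (relevant for data privacy)

def utility(c: Candidate, w_acc=0.5, w_cost=0.2, w_lat=0.3,
            require_local=False) -> float:
    """Weighted-sum utility with a hard privacy constraint; the weights
    are illustrative assumptions, not values from the cited surveys."""
    if require_local and not c.local:
        return float("-inf")  # privacy rules the candidate out entirely
    return w_acc * c.accuracy - w_cost * c.cost - w_lat * c.latency

candidates = [
    Candidate("large_cloud_llm", accuracy=0.95, cost=0.9, latency=0.8, local=False),
    Candidate("small_local_llm", accuracy=0.82, cost=0.1, latency=0.2, local=True),
]
best = max(candidates, key=lambda c: utility(c, require_local=True))
print(best.name)  # -> small_local_llm: the privacy constraint excludes the cloud model
```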
In multi-agent systems, coordinators can adaptively promote more efficient agents while suppressing less efficient ones based on evolving task states. This dynamic orchestration is trained via reinforcement learning to enable flexible and evolvable collective reasoning [116].
The development of these advanced strategies indicates that component selection is shifting from a single “best” criterion to a multi-objective optimization problem, reflecting a mature understanding of real-world system design considerations (such as cost, latency, privacy, and robustness). This heralds the transition of LLM-driven applications from research prototypes toward deployable production-grade systems. Moreover, robustness is no longer merely a property of individual components but rather how the entire multi-agent system handles error accumulation, knowledge drift, and conflict alignment [115]. This requires dynamic monitoring, uncertainty quantification, and adaptive governance rules, surpassing static mitigation strategies to ensure overall system stability and reliability.

5.3. Atomic Component Calling

Atomic component calling is a critical step in the execution phase, transforming the planning and selection decisions into actual operations. This includes accurately generating tool calls and ensuring semantic compatibility between models.

5.3.1. Adhering to Calling Formats

Tool calling primarily requires accurately generating the format and parameters needed to call the tool. Different tools usually have distinct API interfaces and parameter requirements. LLMs need the capability to convert natural language subtask descriptions into specific tool API call formats, including identifying necessary parameters, filling in parameter values, and handling optional parameters. This is often achieved through function calling or tool usage capabilities.
For example, OpenAI’s function calling feature allows models to generate function call parameters conforming to a specific JSON schema based on user input. This demands a deep understanding of the tool’s API documentation and the ability to accurately extract required information from user intent.
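As an illustration, the snippet below defines a hypothetical `get_weather` function in the JSON-schema style used by OpenAI-compatible function-calling APIs; the field layout follows that format, but the function itself is invented for this example.

```python
# Tool definition in the JSON-schema style used by OpenAI-compatible
# function-calling APIs; get_weather itself is a hypothetical example.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Given "How hot is it in Rome?", a function-calling model is expected to
# emit a structured call such as:
#   {"name": "get_weather", "arguments": "{\"city\": \"Rome\"}"}
```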
Enhancing function calling capabilities is an important direction in LLM research. Various methods are being explored to improve LLM function calling abilities, including different prompt formats to integrate function descriptions, merging function call data with instruction-following data, and self-supervised learning to boost performance across domains. Meanwhile, industry efforts are also developing benchmarks to evaluate LLM function calling abilities [63].
The ability of LLM agents to reproduce code from research papers has also become a key benchmark for assessing their tool usage and API interaction capabilities. However, challenges remain, such as failures in paper parsing, incomplete understanding of problem context, and errors in cross-file retrieval, all of which directly affect the agents’ ability to generate correct tool calls and interact with complex codebases [140].
These advances indicate that LLMs are shifting from pure text generation tasks toward direct interaction with external systems and environments. The capability to reliably generate structured API calls (function calls) means that LLMs are no longer merely language models but agents capable of performing actions. Furthermore, difficulties encountered by LLM agents in code reproduction—especially regarding cross-file dependencies and incomplete problem understanding—reveal grounding challenges. LLMs must not only know how to invoke APIs but also accurately comprehend contexts and dependencies within complex environments (such as codebases) to make correct calls.

5.3.2. Semantic Alignment Between Models

Model calling requires semantic alignment between the input and output of two model components. When the output of one model serves as input to another, ensuring that the information is correctly understood and utilized is crucial [116]. This is a key issue in multi-agent systems [116].
Subsequent models need to receive clear, explicit instructions containing all necessary context to correctly perform their tasks. This may involve formatting, extracting, or augmenting the output from the preceding model. Ensuring that different models share a common understanding of task background, goals, and constraints can be achieved by passing shared contextual information between models or using a unified knowledge base.
Outputs from the previous model may need parsing, filtering, or restructuring to fit the input format or semantic requirements of the next model. For example, one model may output free-form text while the next requires structured JSON data [116]. The LLM-wrapper approach demonstrates how LLMs enable “black-box” semantic adaptation of vision-language models (VLMs) [142]. It leverages the reasoning capabilities of LLMs to process VLM outputs (e.g., bounding boxes) and convert them into natural language prompts for further reasoning or to select the most relevant outputs [142]. This highlights the role of LLMs as intelligent adapters between different modalities or model types.
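A minimal adapter between two such models might look like the following sketch, which extracts an embedded JSON object from free-form upstream text and degrades gracefully when parsing fails; the helper name and fallback format are assumptions, and a production system might instead re-prompt an LLM to reformat the output.

```python
import json
import re

def to_structured(raw_output: str) -> dict:
    """Extract the first JSON object embedded in free-form model output;
    fall back to wrapping the raw text if no valid JSON is found."""
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"text": raw_output.strip()}  # degrade gracefully

upstream = 'Sure! Here is the result: {"entity": "Paris", "type": "city"}'
print(to_structured(upstream))  # -> {'entity': 'Paris', 'type': 'city'}
```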
The concept of semantic alignment also extends to ensuring understanding across different languages and modalities. In cross-lingual alignment, studies show that aligning LLM internal representations (e.g., intermediate layers) can significantly improve cross-lingual transfer in fine-tuned models, especially for low-resource languages [143]. This indicates that explicit alignment objectives can bridge representation gaps. Omni-multi-modal LLMs (Omni-MLLMs) aim to achieve unified multi-modal understanding and generation by mapping various non-linguistic modalities (vision, audio, 3D, IMU) into the LLM embedding space [144]. They realize this through complex encoding (continuous, discrete, hybrid) and alignment (projection, embedding) mechanisms, enabling interaction and understanding of arbitrary modality combinations within a single model [144].
The challenge of semantic alignment essentially involves defining a robust “language” or universal representation for different atomic components to communicate effectively. This goes beyond simple data transfer to ensure shared understanding of context, intent, and knowledge. The development of LLM-wrappers and explicit intermediate-layer alignment objectives suggests that specialized architectural components will serve as semantic “adapters” or “translators” between different models and modalities. This implies that direct, unmediated communication between highly specialized components is often insufficient, requiring proactive transformation and alignment of semantic representations.

5.4. Future Directions

Big Loop scheduling has rapidly advanced under the impetus of LLMs as powerful coordinators. In task planning, research has shifted from fixed decomposition to adaptive and proactive strategies, integrating complex reasoning mechanisms (such as CoT, ToT, SoftCoT) and human–machine collaboration approaches. The selection of atomic components has also evolved to consider multi-objective criteria (cost, latency, privacy, robustness) and to leverage both retriever-based and LLM-based intelligent routing. Component calling has been enhanced through robust function calling capabilities and specialized work on semantic alignment between models, including cross-lingual and cross-modal understanding. Trends in neuro-symbolic hybrid methods and human–machine collaboration highlight pragmatic pathways for building reliable and effective systems.
Despite significant progress, the field of Big Loop scheduling still faces multiple open research challenges and promising future directions:
  • Robustness and Reliability in Complex Multi-Agent Systems: Handling error accumulation, knowledge drift, and conflict alignment remain key challenges in highly dynamic LLM multi-agent systems [115]. Future work needs to focus on quantifiable uncertainty management, formal verification, and adaptive runtime monitoring.
  • Explainability and Transparency: As LLM-driven closed-loop systems grow increasingly complex, understanding their decision-making processes and ensuring transparency are crucial for building trust and facilitating debugging. This includes explaining internal reasoning steps of LLMs as well as how components interact.
  • Ethical Considerations and Governance: The growing autonomy and impact of these systems demand sound ethical frameworks, bias mitigation strategies, and clear accountability mechanisms, especially in sensitive domains [112,114]. This also involves addressing potential issues such as hallucination propagation and implicit collusion [115].
  • Scalability and Efficiency: Although modularization improves scalability, optimizing computational costs and latency across networks of models and tools remains an ongoing challenge, particularly for real-time and resource-constrained environments [116].
  • Generalization to Novel and Open-Ended Tasks: Enhancing these systems’ ability to generalize to truly novel, underspecified, or open-ended tasks without extensive retraining or human intervention is a critical long-term goal. This includes improving LLMs’ inference of new action semantics and adaptation to unknown environments [133].
  • Unified Evaluation Framework: Developing comprehensive and unified benchmarks that assess the end-to-end performance of Big Loop systems—considering multi-turn interaction, dynamic adaptation, and multi-objective criteria—is vital to guide future research [114,137,140].

6. Big Loop Optimization

Big Loop optimization mainly focuses on optimizing atomic components or the scheduling process of the Big Loop system itself. This optimization is typically guided by feedback signals from other components or external sources. Components can be tools or models, and such optimization is generally non-differentiable. Current feedback signals can be categorized into natural language feedback and numerical signals, where numerical signals include scalar forms such as system performance metrics and training losses. The optimization targets include atomic component construction (e.g., models and tools) and Big Loop scheduling (e.g., topology and workflows). It is important to note that Big Loop optimization is not strictly distinguishable from the construction of atomic components and the scheduling of Big Loops; instead, it places greater emphasis on optimizing these aspects based on feedback from the Big Loop.

6.1. Atomic Component Construction Optimization

6.1.1. Model Optimization

Model optimization focuses on tuning parameters of core components like LLMs, mainly driven by natural language feedback or numerical signals:
  • Supervised Fine-Tuning (SFT): Directly optimizes model parameters using labeled data. For example, SiriuS [145] collects high-performance reasoning trajectories to perform independent supervised fine-tuning on multi-role LLM nodes; multi-agent fine-tuning [146] generates training data via multi-agent debates to optimize both generative and critic models.
  • Reinforcement Learning (RL): Guides model optimization through reward functions. MAPoRL [147] trains verifier LLMs to assign correctness rewards for multi-agent debates, combining impact-aware rewards to promote cooperation; GPTSwarm [148] uses the REINFORCE algorithm to optimize the connectivity probability distribution within an agent swarm.
  • Natural Language Feedback Optimization: Utilizes LLM-generated textual guidance to update models. For example, TextGrad [149] employs an evaluator LLM to generate textual loss signals and a gradient estimator LLM to produce node optimization suggestions, simulating natural language backpropagation; LLM-AutoDiff [149] introduces temporal gradient accumulation of textual feedback for cyclic system structures to optimize multi-component LLM workflows.
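The natural-language feedback loop can be sketched as follows, loosely mirroring the evaluator/gradient-estimator split described for TextGrad; `llm()` stands in for any chat model, and the prompts are illustrative rather than taken from the cited implementation.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-tuned chat model."""
    raise NotImplementedError

def optimize_prompt(task_prompt: str, examples: list[str], steps: int = 3) -> str:
    """Iteratively refine a component's prompt from textual feedback:
    an evaluator critique plays the role of a loss, and a rewrite step
    plays the role of a gradient update on the prompt."""
    for _ in range(steps):
        outputs = [llm(task_prompt + "\n" + x) for x in examples]
        critique = llm(
            "Critique these outputs and state, in one paragraph, how the "
            f"instructions that produced them could be improved:\n{outputs}"
        )
        task_prompt = llm(
            "Rewrite the instructions below to address the critique, "
            "keeping the task unchanged.\n"
            f"Instructions:\n{task_prompt}\nCritique:\n{critique}"
        )
    return task_prompt
```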

6.1.2. Tool Optimization

Tool optimization includes the creation, updating, and merging of tools, aiming to improve tool–system synergy:
  • Tool Creation and Integration: The ToolOptimizer component in Agent Symbolic Learning (ASL) [150] supports tool creation and optimizes node additions and deletions via PipelineOptimizer; ADAS treats Python code as a first-class system representation and iteratively designs workflows containing tool calls (such as web search and code-interpreter integration) driven by a meta-LLM.
  • Tool Parameter Tuning: DSPy [151] applies rejection sampling (its BootstrapFewShot optimizer) to generate high-quality in-context examples for optimizing tool-call parameters (a minimal sketch follows this list); MIPRO jointly optimizes tool instructions and demonstration configurations using Bayesian surrogate models.
  • Collaborative Tool Optimization: GASO [152] addresses the ignored interactions among sibling tools in TextGrad by proposing semantic gradient descent, aggregating context-aware gradients to optimize credit assignment across tools.
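As referenced above, demonstration bootstrapping for tool-call parameters can be sketched as rejection sampling over execution traces, in the spirit of DSPy's bootstrapped few-shot optimization; `program`, `metric`, and the trace format here are simplified assumptions rather than the library's actual API.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Rejection-sampling style demonstration selection: run the program,
    keep only traces the metric accepts, and reuse them as in-context
    examples. `example.inputs` / `example.label` are assumed fields."""
    demos = []
    for example in trainset:
        prediction = program(example.inputs)
        if metric(example.label, prediction):  # accept successful traces only
            demos.append((example.inputs, prediction))
        if len(demos) >= max_demos:
            break
    return demos
```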

6.2. Big Loop Scheduling Optimization

Optimization of Big Loop scheduling focuses on dynamically adjusting system topology and workflows, achieving efficient scheduling through structural flexibility combined with learning signals:
  • Topology Optimization: Topology optimization can be divided into two categories. The first involves fixed-structure adjustment. For example, Trace updates all LLM prompts at once based on global natural language feedback, using minimal subgraphs to reduce LLM calls; DSPy [151] uses guided sampling to optimize tool-calling order within fixed pipelines, enhancing overall task efficiency. The second category involves flexible structure design. AFlow [153] applies Monte Carlo Tree Search (MCTS) to explore optimal system topologies within predefined operator spaces, preserving historical design experience to support efficient iterative improvement (a simplified search sketch follows this list); DebFlow [154] introduces a multi-agent debate framework with debater and judge roles that dynamically generates optimized workflow structures through collaborative debate.
  • Dynamic Workflow Generation: The generation of dynamic workflows primarily depends on numerical signals or natural language feedback. In numerical-signal-driven scenarios, DyLAN [155] models multi-round debates as a temporal feed-forward network, incrementally optimizing the workflow structure by pruning ineffective agents and adaptively reconnecting the surviving nodes; ScoreFlow uses a Score-DPO strategy to sample candidate workflows from meta-LLMs, collecting preference data based on task-performance differences to improve its decision policy. In natural-language-feedback-driven scenarios, AutoFlow [156] prompts meta-LLMs to generate candidate workflows in CoRE syntax, applying average scores on task data as rewards for reinforcement fine-tuning; MAS-GPT constructs planner–query–MAS training sets, fine-tuning meta-LLMs to produce optimal workflows adapted to different tasks.
  • System Representation and Execution Optimization: Scheduling logic is implemented using graph structures or code representations. For example, GPTSwarm [148] models the system as a hierarchical graph of nodes–agents–swarms, optimizing cross-level connection weights; FlowReasoner combines Python code representations with multi-purpose rewards to dynamically generate efficient workflows that include conditional logic and loops.
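To make topology search concrete, the following is a deliberately simplified greedy stand-in for the MCTS-based exploration used by systems such as AFlow; the operator names, the mutation rule, and the toy metric are assumptions for exposition, not the cited algorithm itself.

```python
import random

OPS = ["generate", "critique", "revise", "vote"]  # hypothetical operators

def propose(workflow: list[str]) -> list[str]:
    """Mutate a workflow by inserting a random operator at a random slot."""
    w = workflow.copy()
    w.insert(random.randrange(len(w) + 1), random.choice(OPS))
    return w

def search(evaluate, init=("generate",), iters=20):
    """Greedy hill-climbing over operator sequences; a simplified
    stand-in for MCTS-based topology exploration."""
    best, best_score = list(init), evaluate(list(init))
    for _ in range(iters):
        cand = propose(best)
        score = evaluate(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy metric: reward workflows that end in a "vote" step, penalize length.
best, score = search(lambda w: (w[-1] == "vote") - 0.01 * len(w))
print(best, score)
```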

7. Challenges and Future Trends

Although Big Loop optimization has made significant progress in improving the performance of composite AI systems, the field still faces numerous challenges that point to future research directions.

7.1. Technical Challenges

Non-differentiability and Optimization Complexity: Since Big Loop optimization involves various non-differentiable components (e.g., tool calling, discrete decision modules), traditional gradient-based optimization methods cannot be directly applied. This necessitates the development of new optimization algorithms such as sampling-based methods (e.g., Monte Carlo Tree Search in reinforcement learning) and meta-learning approaches (e.g., learning optimizer hyperparameters to adapt to different tasks) to tackle complex non-differentiable systems. Furthermore, as system scale and complexity increase, the optimization search space grows exponentially, making it a pressing challenge to efficiently find optimal solutions within such vast spaces.
Inter-component Collaboration and Communication Overhead: In composite systems, effective collaboration among multiple components requires efficient communication and synchronization mechanisms. Different components may operate at varying speeds and use diverse data formats and interfaces, leading to communication delays and data inconsistencies. For example, in multi-agent systems, information exchange and decision coordination among agents can be hindered by limited communication bandwidth. Future research should explore more effective inter-component communication protocols and collaboration strategies, such as asynchronous communication to reduce wait times and adaptive data format conversion methods to enhance compatibility.
Model Explainability and Transparency: As Big Loop optimization systems grow increasingly complex, understanding their decision-making processes becomes more difficult. This is especially critical in domains such as healthcare and finance, where decision explainability is essential. For instance, in medical diagnosis, physicians need to comprehend the AI system’s reasoning to make final decisions. Therefore, developing explainability techniques—such as visualization tools that display system workflows and decision logic, and explanation generation algorithms that provide natural language rationales—is crucial for building user trust in Big Loop optimization systems.

7.2. Data Challenges

Data Scarcity and Annotation Difficulty: In certain specialized domains or tasks, obtaining sufficient high-quality data for optimization can be extremely challenging. For example, in rare disease diagnosis, limited case numbers make it difficult to gather extensive training data. Moreover, annotating accurate reference data for complex system outputs often requires domain experts, which is time-consuming and costly. Potential solutions include leveraging transfer learning (transferring knowledge from related domains), semi-supervised learning (combining small amounts of labeled data with large unlabeled datasets), and active learning (enabling models to select the most valuable data for annotation).
Data Privacy and Security: Big Loop optimization systems typically involve collecting and processing vast amounts of user data, making data privacy and security critical concerns. For instance, smart home systems collecting user lifestyle data may risk privacy violations. Thus, it is necessary to develop privacy-preserving techniques such as federated learning (training models collaboratively without sharing raw data) and differential privacy (adding noise to data to protect individual privacy), ensuring data security and regulatory compliance during use.

7.3. Optimization Strategy Challenges

Long-term Stability and Adaptability: Big Loop optimization systems need to maintain stable performance over time amid constantly changing environments. However, real-world task demands, data distributions, and external conditions can change unpredictably, requiring systems to adapt dynamically. For example, in e-commerce recommendation systems, user preferences may vary with seasons and trends. Future research must design online learning-based optimization strategies that enable systems to perceive environmental changes in real time and adjust optimization parameters dynamically to sustain long-term performance advantages.
Multi-objective Optimization Balance: Big Loop optimization often involves simultaneously considering multiple objectives such as system performance, resource consumption, and explainability. Balancing these goals is challenging—for example, improving performance might increase computational costs, while enhancing explainability may reduce predictive accuracy. Therefore, developing multi-objective optimization algorithms like Pareto optimization (finding a set of non-dominated solutions that are relatively optimal across objectives) and weighted sum methods (assigning weights to different objectives to convert them into a single-objective problem) is necessary to meet the diverse requirements of various applications.
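For instance, a minimal Pareto filter over candidate system configurations can be written as follows; the two objectives (performance and negated cost, both higher-is-better) are illustrative.

```python
def pareto_front(solutions):
    """Keep the non-dominated solutions: drop any solution for which some
    other, distinct solution is at least as good on every objective."""
    return [
        s for s in solutions
        if not any(other != s and all(o >= v for o, v in zip(other, s))
                   for other in solutions)
    ]

# Objectives per candidate: (performance, -cost), both higher-is-better.
candidates = [(0.95, -0.9), (0.82, -0.1), (0.80, -0.5)]
print(pareto_front(candidates))  # -> [(0.95, -0.9), (0.82, -0.1)]
```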

7.4. Future Trends

Automation and Intelligent Optimization: In the future, Big Loop optimization is expected to achieve higher levels of automation and intelligence. Automated optimization tools will be able to select appropriate optimization algorithms and tune parameters according to task requirements, while dynamically monitoring and improving system performance during operation. For instance, using meta-learning techniques, systems can autonomously learn the best optimization strategies for different tasks without manual intervention. On the intelligence front, the optimization process will more effectively leverage prior knowledge and real-time feedback, such as guiding optimization via knowledge graphs and interacting with environments through reinforcement learning to continuously improve strategies.
Cross-domain Application Expansion: As technologies mature, Big Loop optimization will see broader application across many fields. Beyond current AI system optimization, it will play key roles in industrial manufacturing (optimizing production processes and supply chain management), urban transportation (optimizing traffic flow control and intelligent logistics), and energy management (optimizing energy distribution and grid scheduling). Through such cross-domain expansion, Big Loop optimization will provide general solutions for complex real-world problems, driving intelligent upgrades and sustainable development across industries.

8. Conclusions

This study systematically constructs an integrated framework of the Big Loop and atomic components for the first time. By incorporating tool learning, model collaboration, and multi-agent systems into a unified paradigm, it breaks through the capability boundaries of a single model. Through the structured interaction of atomic components (tools, models, agents), the Big Loop system achieves hierarchical compositionality and dynamic adaptability in task processing, demonstrating significant advantages in professional knowledge acquisition, functional expansion, computational efficiency, and user-centered design.
The full-process analysis from atomic component construction to loop scheduling optimization shows that the core competitiveness of the Big Loop system lies in the following:
  • Modular Collaboration: Forming a complementary system of “professional capability–generalization capability–decision-making capability” through the deterministic execution of tools, the learnability of models, and the environmental adaptability of agents.
  • Dynamic Optimization Mechanism: Achieving the joint iteration of component parameter tuning and topological structure optimization by combining natural language feedback and numerical signals.
  • Hierarchical Task Decomposition: Reducing the implementation cost of complex tasks through sub-loop reuse and standardized interaction protocols.
However, challenges such as the optimization complexity of non-differentiable components, cross-modal semantic alignment, and data privacy still need to be addressed. Future research can focus on the design of automated optimization algorithms and interdisciplinary applications in key fields such as healthcare and finance, promoting the evolution of the Big Loop system from a theoretical framework to a general intelligent infrastructure.

Author Contributions

Conceptualization, Z.H., Y.H. and J.F.; methodology, Z.H.; formal analysis, Y.H.; resources, J.F. and C.D.; writing—original draft preparation, Z.H.; writing—review and editing, Z.H. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Beijing Natural Science Foundation (L222006) and China Mobile Holistic Artificial Intelligence Major Project Funding (R22105ZS, R22105ZSC01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  2. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  3. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  4. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  5. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388. [Google Scholar] [CrossRef]
  6. Feng, J. Systematic artificial intelligence. J. Beijing Univ. Posts Telecommun. 2024, 47, 1. [Google Scholar]
  7. Qin, Y.; Hu, S.; Lin, Y.; Chen, W.; Ding, N.; Cui, G.; Zeng, Z.; Zhou, X.; Huang, Y.; Xiao, C.; et al. Tool learning with foundation models. ACM Comput. Surv. 2024, 57, 1–40. [Google Scholar] [CrossRef]
  8. Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.-R. Tool learning with large language models: A survey. Front. Comput. Sci. 2025, 19, 198343. [Google Scholar] [CrossRef]
  9. Wang, Z.; Cheng, Z.; Zhu, H.; Fried, D.; Neubig, G. What Are Tools Anyway? A Survey from the Language Model Perspective. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  10. Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. arXiv 2024, arXiv:2411.03350. [Google Scholar] [CrossRef]
  11. Lu, J.; Pang, Z.; Xiao, M.; Zhu, Y.; Xia, R.; Zhang, J. Merge, ensemble, and cooperate! a survey on collaborative strategies in the era of large language models. arXiv 2024, arXiv:2407.06089. [Google Scholar] [CrossRef]
  12. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
  13. Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The rise and potential of large language model based agents: A survey. Sci. China Inf. Sci. 2025, 68, 121101. [Google Scholar] [CrossRef]
  14. Luo, J.; Zhang, W.; Yuan, Y.; Zhao, Y.; Yang, J.; Gu, Y.; Wu, B.; Chen, B.; Qiao, Z.; Long, Q.; et al. Large language model agent: A survey on methodology, applications and challenges. arXiv 2025, arXiv:2503.21460. [Google Scholar] [CrossRef]
  15. Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 8048–8057. [Google Scholar]
  16. Patil, S.G.; Zhang, T.; Wang, X.; Gonzalez, J.E. Gorilla: Large Language Model Connected with Massive APIs. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  17. Xu, Q.; Hong, F.; Li, B.; Hu, C.; Chen, Z.; Zhang, J. On the tool manipulation capability of open-source large language models. arXiv 2023, arXiv:2305.16504. [Google Scholar] [CrossRef]
  18. Anantha, R.; Bandyopadhyay, B.; Kashi, A.; Mahinder, S.; Hill, A.W.; Chappidi, S. ProTIP: Progressive Tool Retrieval Improves Planning. arXiv 2023, arXiv:2312.10332. [Google Scholar] [CrossRef]
  19. Li, M.; Zhao, Y.; Yu, B.; Song, F.; Li, H.; Yu, H.; Li, Z.; Huang, F.; Li, Y. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 3102–3116. [Google Scholar]
  20. Xiong, H.; Bian, J.; Li, Y.; Li, X.; Du, M.; Wang, S.; Yin, D.; Helal, S. When search engine services meet large language models: Visions and challenges. IEEE Trans. Serv. Comput. 2024, 17, 4558–4577. [Google Scholar] [CrossRef]
  21. Jin, B.; Zeng, H.; Yue, Z.; Yoon, J.; Arik, S.; Wang, D.; Zamani, H.; Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv 2025, arXiv:2503.09516. [Google Scholar]
  22. He-Yueya, J.; Poesia, G.; Wang, R.; Goodman, N. Solving Math Word Problems by Combining Language Models with Symbolic Solvers. In Proceedings of the 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23, Los Angeles, CA, USA, 15 December 2023. [Google Scholar]
  23. Kadlčík, M.; Štefánik, M.; Sotolar, O.; Martinek, V. Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12101–12108. [Google Scholar]
  24. Jin, Q.; Yang, Y.; Chen, Q.; Lu, Z. Genegpt: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics 2024, 40, btae075. [Google Scholar] [CrossRef]
  25. Kim, Y.; Park, C.; Jeong, H.; Chan, Y.S.; Xu, X.; McDuff, D.; Lee, H.; Ghassemi, M.; Breazeal, C.; Park, H.W. Mdagents: An adaptive collaboration of llms for medical decision-making. Adv. Neural Inf. Process. Syst. 2024, 37, 79410–79452. [Google Scholar]
  26. OpenAI. ChatGPT. 2025. Available online: https://openai.com/index/chatgpt/ (accessed on 1 August 2025).
  27. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  28. Liu, H.; Xue, W.; Chen, Y.; Chen, D.; Zhao, X.; Wang, K.; Hou, L.; Li, R.; Peng, W. A survey on hallucination in large vision-language models. arXiv 2024, arXiv:2402.00253. [Google Scholar] [CrossRef]
  29. Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of multimodal large language models: A survey. arXiv 2024, arXiv:2404.18930. [Google Scholar] [CrossRef]
  30. Bang, Y.; Ji, Z.; Schelten, A.; Hartshorn, A.; Fowler, T.; Zhang, C.; Cancedda, N.; Fung, P. Hallulens: Llm hallucination benchmark. arXiv 2025, arXiv:2504.17550. [Google Scholar] [CrossRef]
  31. Liang, L.; Bo, Z.; Gui, Z.; Zhu, Z.; Zhong, L.; Zhao, P.; Sun, M.; Zhang, Z.; Zhou, J.; Chen, W.; et al. Kag: Boosting llms in professional domains via knowledge augmented generation. In Proceedings of the Companion the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 334–343. [Google Scholar]
  32. Zhang, Q.; Chen, S.; Bei, Y.; Yuan, Z.; Zhou, H.; Hong, Z.; Dong, J.; Chen, H.; Chang, Y.; Huang, X. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv 2025, arXiv:2501.13958. [Google Scholar]
  33. Cheng, M.; Luo, Y.; Ouyang, J.; Liu, Q.; Liu, H.; Li, L.; Yu, S.; Zhang, B.; Cao, J.; Ma, J.; et al. A survey on knowledge-oriented retrieval-augmented generation. arXiv 2025, arXiv:2503.10677. [Google Scholar]
  34. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv 2024, arXiv:2402.03216. [Google Scholar]
  35. AI21. Embeddings. Available online: https://docs.ai21.com/docs/embeddings-api (accessed on 1 August 2025).
  36. Amazon. Amazon Titan Text Embeddings Models. Available online: https://docs.aws.amazon.com/bedrock/latest/userguide/titanembedding-models.html (accessed on 1 August 2025).
  37. OpenAI. Embeddings. Available online: https://platform.openai.com/docs/guides/embeddings (accessed on 1 August 2025).
  38. Voyage AI. Embeddings. Available online: https://docs.voyageai.com/docs/embeddings (accessed on 1 August 2025).
  39. Dhuliawala, S.; Komeili, M.; Xu, J.; Raileanu, R.; Li, X.; Celikyilmaz, A.; Weston, J. Chain-of-Verification Reduces Hallucination in Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 3563–3578. [Google Scholar]
  40. Peng, W.; Li, G.; Jiang, Y.; Wang, Z.; Ou, D.; Zeng, X.; Xu, D.; Xu, T.; Chen, E. Large language model based long-tail query rewriting in taobao search. In Proceedings of the Companion the ACM on Web Conference 2024, Singapore, 13–17 May 2024; pp. 20–28. [Google Scholar]
  41. Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1: Long Papers, pp. 1762–1777. [Google Scholar]
  42. Zhang, Y.; Li, M.; Long, D.; Zhang, X.; Lin, H.; Yang, B.; Xie, P.; Yang, A.; Liu, D.; Lin, J. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv 2025, arXiv:2506.05176. [Google Scholar]
  43. Yang, E.; Yates, A.; Ricci, K.; Weller, O.; Chari, V.; Van Durme, B.; Lawrie, D. Rank-K: Test-Time Reasoning for Listwise Reranking. arXiv 2025, arXiv:2505.14432. [Google Scholar]
  44. Shao, Z.; Huang, F.; Huang, M. Chaining Simultaneous Thoughts for Numerical Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2533–2547. [Google Scholar]
  45. Gou, Z.; Shao, Z.; Gong, Y.; Yang, Y.; Huang, M.; Duan, N.; Chen, W. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  46. Veerendranath, V.; Shah, V.; Ghate, K. Calc-CMU at SemEval-2024 Task 7: Pre-Calc-Learning to Use the Calculator Improves Numeracy in Language Models. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Vienna, Austria, 7–11 May 2024; pp. 1468–1475. [Google Scholar]
  47. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. Pal: Program-aided language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 10764–10799. [Google Scholar]
  48. Wang, X.; Wang, Z.; Liu, J.; Chen, Y.; Yuan, L.; Peng, H.; Ji, H. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  49. Hu, Q.; Long, Q.; Wang, W. BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking. arXiv 2025, arXiv:2504.02467. [Google Scholar]
  50. Zhang, X.; Yang, Q. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 4435–4439. [Google Scholar]
  51. Liu, X.-Y.; Wang, G.; Yang, H.; Zha, D. FinGPT: Democratizing Internet-scale Data for Financial Large Language Models. In Proceedings of the NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, New Orleans, LA, USA, 15 December 2023. [Google Scholar]
  52. Yue, S.; Chen, W.; Wang, S.; Li, B.; Shen, C.; Liu, S.; Zhou, Y.; Xiao, Y.; Yun, S.; Huang, X.; et al. Disc-lawllm: Fine-tuning large language models for intelligent legal services. arXiv 2023, arXiv:2309.11325. [Google Scholar]
  53. Fei, Z.; Zhang, S.; Shen, X.; Zhu, D.; Wang, X.; Ge, J.; Ng, V. InternLM-Law: An Open-Sourced Chinese Legal Large Language Model. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 9376–9392. [Google Scholar]
  54. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef]
  55. Chen, J.; Wang, X.; Ji, K.; Gao, A.; Jiang, F.; Chen, S.; Zhang, H.; Dingjie, S.; Xie, W.; Kong, C.; et al. HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  56. Dan, Y.; Lei, Z.; Gu, Y.; Li, Y.; Yin, J.; Lin, J.; Ye, L.; Tie, Z.; Zhou, Y.; Wang, Y.; et al. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv 2023, arXiv:2308.02773. [Google Scholar]
  57. Yu, J.; Zhu, J.; Wang, Y.; Liu, Y.; Chang, H.; Nie, J.; Kong, C.; Chong, R.; Liu, X.; An, J.; et al. Taoli Llama. Available online: https://github.com/blcuicall/taoli (accessed on 22 August 2025).
  58. Pang, L.; Cao, X.; Tang, D.; Xu, S.; Bai, X.; Zhou, F.; Meng, D. Hsigene: A foundation model for hyperspectral image generation. arXiv 2024, arXiv:2409.12470. [Google Scholar]
  59. Xue, L.; Yu, N.; Zhang, S.; Panagopoulou, A.; Li, J.; Martín-Martín, R.; Wu, J.; Xiong, C.; Xu, R.; Niebles, J.C.; et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27091–27101. [Google Scholar]
  60. Yang, S.; Liu, J.; Zhang, R.; Pan, M.; Guo, Z.; Li, X.; Chen, Z.; Gao, P.; Li, H.; Guo, Y.; et al. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 9247–9255. [Google Scholar]
  61. Tak, D.; Garomsa, B.A.; Chaunzwa, T.L.; Zapaishchykova, A.; Pardo, J.C.C.; Ye, Z.; Zielke, J.; Ravipati, Y.; Vajapeyam, S.; Mahootiha, M.; et al. A foundation model for generalized brain MRI analysis. medRxiv 2024. [Google Scholar] [CrossRef]
  62. Huang, Y.; Shi, J.; Li, Y.; Fan, C.; Wu, S.; Zhang, Q.; Liu, Y.; Zhou, P.; Wan, Y.; Gong, N.Z.; et al. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  63. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551. [Google Scholar]
  64. Zhuang, Y.; Yu, Y.; Wang, K.; Sun, H.; Zhang, C. Toolqa: A dataset for llm question answering with external tools. Adv. Neural Inf. Process. Syst. 2023, 36, 50117–50143. [Google Scholar]
  65. Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv 2023, arXiv:2307.16789. [Google Scholar]
  66. Yao, S.; Chen, H.; Yang, J.; Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. Adv. Neural Inf. Process. Syst. 2022, 35, 20744–20757. [Google Scholar]
  67. Miao, X.; Oliaro, G.; Zhang, Z.; Cheng, X.; Wang, Z.; Wong, R.Y.Y.; Chen, Z.; Arfeen, D.; Abhyankar, R.; Jia, Z. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv 2023, arXiv:2305.09781. [Google Scholar]
  68. Ali, M.A.; Li, Z.; Yang, S.; Cheng, K.; Cao, Y.; Huang, T.; Hu, G.; Lyu, W.; Hu, L.; Yu, L.; et al. Prompt-saw: Leveraging relation-aware graphs for textual prompt compression. arXiv 2024, arXiv:2404.00489. [Google Scholar]
  69. Huang, X.; Zhang, L.L.; Cheng, K.T.; Yang, F.; Yang, M. Fewer is more: Boosting LLM reasoning with reinforced context pruning. arXiv 2023, arXiv:2312.08901. [Google Scholar]
  70. Liu, J.; Li, L.; Xiang, T.; Wang, B.; Qian, Y. TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 9796–9810. [Google Scholar]
  71. Fei, W.; Niu, X.; Zhou, P.; Hou, L.; Bai, B.; Deng, L.; Han, W. Extending Context Window of Large Language Models via Semantic Compression. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 5169–5181. [Google Scholar]
  72. Li, J.; Lan, Y.; Wang, L.; Wang, H. PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models. arXiv 2024, arXiv:2403.17411. [Google Scholar]
  73. Huang, K.; Guo, X.; Wang, M. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths. In Proceedings of the Workshop on Efficient Systems for Foundation Models II@ ICML2024, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  74. Liu, J.; Wang, Q.; Wang, J.; Cai, X. Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 3027–3043. [Google Scholar]
  75. Lu, Y.; Zhu, W.; Li, L.; Qiao, Y.; Yuan, F. LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 10748–10772. [Google Scholar]
  76. Yun, T.; Oh, J.; Min, H.; Lee, Y.; Bang, J.; Cai, J.; Song, H. ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback. arXiv 2025, arXiv:2503.21332. [Google Scholar]
  77. Shu, L.; Luo, L.; Hoskere, J.; Zhu, Y.; Liu, Y.; Tong, S.; Chen, J.; Meng, L. Rewritelm: An instruction-tuned large language model for text rewriting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18970–18980. [Google Scholar]
  78. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable ai: A review of machine learning interpretability methods. Entropy 2020, 23, 18. [Google Scholar] [CrossRef]
  79. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 20. [Google Scholar] [CrossRef]
  80. Chen, K.; Zhou, X.; Lin, Y.; Feng, S.; Shen, L.; Wu, P. A Survey on Privacy Risks and Protection in Large Language Models. arXiv 2025, arXiv:2505.01976. [Google Scholar] [CrossRef]
  81. Chong, C.J.; Hou, C.; Yao, Z.; Talebi, S.M.S. Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models. arXiv 2024, arXiv:2408.07004. [Google Scholar] [CrossRef]
  82. Vu, M.; Nguyen, T.; Jeter, T.; Thai, M.T. Analysis of Privacy Leakage in Federated Large Language Models. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), Palacio de Congresos de València, Valencia, Spain, 2–4 May 2024. [Google Scholar]
  83. Ahmadi, K.; Kim, H.W.; Sharma, R. An Interactive Framework for Implementing Privacy-Preserving Federated Learning: Experiments on Large Language Models. arXiv 2025, arXiv:2502.08008. [Google Scholar]
  84. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  85. Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv 2021, arXiv:2112.09332. [Google Scholar]
  86. Zhang, K.; Zhang, H.; Li, G.; Li, J.; Li, Z.; Jin, Z. Toolcoder: Teach code generation models to use api search tools. arXiv 2023, arXiv:2305.04032. [Google Scholar] [CrossRef]
  87. Gehring, J.; Zheng, K.; Copet, J.; Mella, V.; Carbonneaux, Q.; Cohen, T.; Synnaeve, G. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv 2024, arXiv:2410.02089. [Google Scholar] [CrossRef]
  88. Wang, X.; Chen, Y.; Yuan, L.; Zhang, Y.; Li, Y.; Peng, H.; Ji, H. Executable code actions elicit better llm agents. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  89. Paranjape, B.; Lundberg, S.; Singh, S.; Hajishirzi, H.; Zettlemoyer, L.; Ribeiro, M.T. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv 2023, arXiv:2303.09014. [Google Scholar]
  90. Descope. What Is the Model Context Protocol (MCP) and How It Works. Available online: https://www.descope.com/learn/post/mcp (accessed on 22 August 2025).
  91. Surapaneni, R.; Jha, M.; Vakoc, M.; Segal, T. Announcing the Agent2Agent Protocol (A2A). Google Developers Blog 2025. Available online: https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/ (accessed on 22 August 2025).
  92. Cai, T.; Wang, X.; Ma, T.; Chen, X.; Zhou, D. Large Language Models as Tool Makers. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  93. Qian, C.; Han, C.; Fung, Y.; Qin, Y.; Liu, Z.; Ji, H. CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Suzhou, China, 5–9 November 2023; pp. 6922–6939. [Google Scholar]
  94. Liu, X.; Yin, D.; Wu, Z.; Feng, Y. RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation. arXiv 2025, arXiv:2505.21413. [Google Scholar]
  95. Ma, Z.; Huang, Z.; Liu, J.; Wang, M.; Zhao, H.; Li, X. Automated creation of reusable and diverse toolsets for enhancing llm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 24821–24830. [Google Scholar]
  96. Wang, Z.Z.; Neubig, G.; Fried, D. TROVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 51177–51191. [Google Scholar]
  97. Stengel-Eskin, E.; Prasad, A.; Bansal, M. ReGAL: Refactoring Programs to Discover Generalizable Abstractions. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 46605–46624. [Google Scholar]
  98. Wang, Z.Z.; Gandhi, A.; Neubig, G.; Fried, D. Inducing programmatic skills for agentic tasks. arXiv 2025, arXiv:2504.06821. [Google Scholar] [CrossRef]
  99. Zheng, B.; Fatemi, M.Y.; Jin, X.; Wang, Z.Z.; Gandhi, A.; Song, Y.; Gu, Y.; Srinivasa, J.; Liu, G.; Neubig, G.; et al. Skillweaver: Web agents can self-improve by discovering and honing skills. arXiv 2025, arXiv:2504.07079. [Google Scholar]
  100. Hsieh, C.-Y.; Chen, S.-A.; Li, C.-L.; Fujii, Y.; Ratner, A.; Lee, C.-Y.; Krishna, R.; Pfister, T. Tool documentation enables zero-shot tool-usage with large language models. arXiv 2023, arXiv:2308.00675. [Google Scholar]
  101. Lumer, E.; Subbiah, V.K.; Burke, J.A.; Basavaraju, P.H.; Huber, A. Toolshed: Scale tool-equipped agents with advanced rag-tool fusion and tool knowledge bases. arXiv 2024, arXiv:2410.14594. [Google Scholar]
  102. Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.-R. From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  103. Bengesi, S.; El-Sayed, H.; Sarker, M.K.; Houkpati, Y.; Irungu, J.; Oladunni, T. Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers. IEEE Access 2024, 12, 69812–69837. [Google Scholar] [CrossRef]
  104. Bulatov, A.; Kuratov, Y.; Kapushev, Y.; Burtsev, M. Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22–27 February 2024; Volume 38, pp. 17700–17708. [Google Scholar]
  105. Khan, S.H.; Almaktoof, A.; Abo-Al-Ez, K. A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions. Sensors 2025, 25, 531. [Google Scholar]
  106. Haruna, Y.; Qin, S.; Chukkol, A.H.A.; Yusuf, A.A.; Bello, I.; Lawan, A. Exploring the synergies of hybrid convolutional neural network and Vision Transformer architectures for computer vision: A survey. Eng. Appl. Artif. Intell. 2025, 144, 110057. [Google Scholar] [CrossRef]
  107. Zhou, Y.; Feng, L.; Ke, Y.; Jiang, X.; Yan, J.; Yang, X.; Zhang, W. Towards vision-language geo-foundation model: A survey. arXiv 2024, arXiv:2406.09385. [Google Scholar]
  108. Jayaram, R.; Dhulipala, L.; Hadian, M.; Lee, J.D.; Mirrokni, V. MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encoding. Adv. Neural Inf. Process. Syst. 2024, 37, 101042–101073. [Google Scholar]
  109. Cao, Y.; Yao, L.; McAuley, J.; Sheng, Q.Z. Reinforcement Learning for Generative AI: A Survey. arXiv 2023, arXiv:2308.14328. [Google Scholar] [CrossRef]
  110. Christian, B.; Kirk, H.R.; Thompson, J.A.F.; Summerfield, C.; Dumbalska, T. Reward Model Interpretability via Optimal and Pessimal Tokens. arXiv 2025, arXiv:2506.07326. [Google Scholar] [CrossRef]
  111. Anonymous. Decomposed Reward Models: Learning Multi-Dimensional Human Preferences from Binary Comparisons. arXiv 2025, arXiv:2502.13131. [Google Scholar]
  112. Ren, S.; Ren, S.; Jian, P.; Ren, Z.; Leng, C.; Xie, C.; Zhang, J. Towards scientific intelligence: A survey of llm-based scientific agents. arXiv 2025, arXiv:2503.24047. [Google Scholar] [CrossRef]
  113. Su, Y.; Ai, Q.; Zhan, J.; Dong, Q.; Liu, Y. Dynamic and Parametric Retrieval-Augmented Generation. arXiv 2025, arXiv:2506.06704. [Google Scholar] [CrossRef]
  114. Tran, K.-T.; Huang, W.C.; Wu, Y.; Chen, Y.; Miao, C.; Nguyen, H.; Zhou, Y.; Zhang, W.; Fang, L.; He, L. A Survey on Large Language Model based Human-Agent Systems. arXiv 2025, arXiv:2505.00753. [Google Scholar] [CrossRef]
  115. Zhao, Z.; Dong, H.; Saha, A.; Xiong, C. Automatic Curriculum Expert Iteration for Reliable LLM Reasoning. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
  116. Tran, K.-T.; Dao, D.; Nguyen, M.-D.; Pham, Q.-V.; O’Sullivan, B.; Nguyen, H.D. Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv 2025, arXiv:2501.06322. [Google Scholar] [CrossRef]
  117. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 4–9 December 2023; Volume 36, pp. 11809–11822. [Google Scholar]
  118. Xu, Y.; Guo, X.; Zeng, Z.; Miao, C. Softcot: Soft chain-of-thought for efficient reasoning with llms. arXiv 2025, arXiv:2502.12134. [Google Scholar]
Figure 1. The framework of Big Loop and Atomization, including atomic component construction, Big Loop scheduling, and Big Loop optimization.
Figure 2. The advantages of Big Loop.