Article

Harnessing the Power of Large Language Models for Automated Code Generation and Verification

by Unai Antero 1,*, Francisco Blanco 1, Jon Oñativia 1, Damien Sallé 1 and Basilio Sierra 2

1 Industry and Transport Division, TECNALIA, Basque Research and Technology Alliance (BRTA), 20009 Donostia-San Sebastian, Spain
2 Department of Computer Sciences and Artificial Intelligence, University of the Basque Country (UPV/EHU), 20018 Donostia, Spain
* Author to whom correspondence should be addressed.
Robotics 2024, 13(9), 137; https://doi.org/10.3390/robotics13090137
Submission received: 4 August 2024 / Revised: 28 August 2024 / Accepted: 5 September 2024 / Published: 11 September 2024
(This article belongs to the Section AI in Robotics)

Abstract
The cost landscape in advanced technology systems is shifting dramatically. Traditionally, hardware costs took the spotlight, but now, programming and debugging complexities are gaining prominence. This paper explores this shift and its implications, focusing on reducing the cost of programming complex robot behaviors, using the latest innovations from the Generative AI field, such as large language models (LLMs). We leverage finite state machines (FSMs) and LLMs to streamline robot programming while ensuring functionality. The paper addresses LLM challenges related to content quality, emphasizing a two-fold approach using predefined software blocks and a Supervisory LLM.

1. Introduction

Robots excel at repetitive tasks in controlled environments, but as the world is complex and ever-changing, there are many other possible markets and applications that would benefit from using robots. The introduction of robots into these unstructured environments requires improving the intelligence and adaptability of the robots, which directly impacts the complexity of their software. Consequently, programming or configuring them demands more expert programmers and additional time for programming, testing, and debugging, which in turn raises costs. Traditionally, when contemplating the expenses linked to the development and deployment of advanced robots, the focus has predominantly been on hardware costs, as the sophisticated mechanical components, sensors, processors, and associated infrastructure have historically dominated budgetary considerations. However, a compelling shift is currently underway: the cost associated with programming and debugging these intricate systems is increasingly overshadowing hardware expenses. This paradigm shift fundamentally alters our perception of the economic dimensions of cutting-edge technology. Specifically, it underscores that the software-related aspects of technology play a pivotal role in both cost and safety considerations. This paper presents an approach to program and configure robotic applications using large language models as a tool to reduce the time required for the programming (and the costs associated with such time).

1.1. Problem Statement

The advancement of robotics, autonomous vehicles, and comparable intricate systems entails a complex interplay of hardware and software challenges. Although prioritizing high-quality hardware remains crucial, it is increasingly evident that the primary financial hurdle resides in the software domain, specifically during the design, development, and debugging stages. The robust, dependable, and adaptable software underpinning these systems not only impacts costs significantly but also profoundly influences their overall success.
The transition toward software as the most pivotal aspect of systems has been propelled mainly by the need for robots to operate seamlessly in a diverse range of real-world environments, which means navigating dynamic terrains and intelligently interacting with their surroundings. Achieving this level of functionality and adaptability necessitates a corresponding degree of software complexity, which includes intricate algorithms for perception, decision making, and control.
However, this complexity also magnifies the intricacies of software debugging and maintenance. As advanced robots and related systems incorporate these advanced technologies, their software becomes more complex to design, develop, configure, and maintain.

1.2. Research Objectives

In this paper, the shifting cost dynamics of complex system development are examined, with a focus on the prominence of programming and debugging expenses. The implications of this shift and its impact on decision-making processes related to technology investment are explored. By comprehending this transformation, strategies and resource allocation can be adjusted to effectively address the challenges of contemporary technology development, ensuring the efficient and cost-effective management of software complexity.
The budget of a robotic project has to consider many costs (the robotic hardware itself, physical installation and integration, software and programming, maintenance and support, training, etc.). But in situations where the robot has to be frequently programmed and retasked, its programming and integration can account for up to 50–70% of the cost of a robot application [1].
Complex software is associated with increased costs, slower updates, the need for expert programmers, more errors, and challenging maintenance [2]. This paper proposes a large-language-model-based methodology to simplify robot software programming (including robot retasking or reprogramming), making it accessible even to non-programmers. This addresses two cost drivers: the time required to program robots and the expertise barrier for programming such software.
The focus of this study centers on the utilization of finite state machines (FSMs) as the foundation for controlling complex robot behaviors. FSMs provide a structured and systematic method for defining and orchestrating the actions of a robot, rendering them an essential tool in the toolkit of robotics programmers. However, we extend beyond the conventional boundaries of FSMs. Instead, we harness the cutting-edge capabilities of large language models (LLMs) to revolutionize how we program such FSMs, particularly in the domain of advanced robots. By integrating LLM technology into robotics operating in the real world, we aim to demonstrate how these advanced AI systems can function as intelligent supervisors, simplifying the programming process.
This research paper explores how a combination of advanced language models, creative programming methods, and automated monitoring can greatly impact the cost and safety of robotic applications.
The paper is organized as follows: in Section 2, we present FSMs and take a close look at LLMs to understand their strengths and challenges, focusing on content quality and reliability when generating code; we also introduce the idea of automated monitoring of generated content using LLMs, employing them in a way inspired by Generative Adversarial Networks (GANs), to improve code quality and safety in the fast-paced world of robotics and artificial intelligence. In Section 3, the design of the proposed system architecture is explained. In Section 4, implementations in a synthetic environment and on a physical robot are described, and the performance of our proposed approach is evaluated. Finally, conclusions are presented in Section 5.

2. Background

When addressing the behavior of intricate systems, such as robots, a multitude of methods are available for consideration. However, this research paper directs its focus toward FSMs due to their significant historical impact within the field of robotics. FSMs offer simplicity, efficiency, and deterministic behavior, rendering them well-suited for modeling systems characterized by clearly defined states. Nevertheless, programming FSMs can pose challenges, particularly in large-scale systems, leading to potential errors. Additional limitations stem from their constrained abstraction level, which may result in verbose code and the difficulty of adapting FSMs to changes. Maintaining FSMs can be burdensome, and developers new to FSM programming often encounter a learning curve. While certain tools (such as graphical interfaces) can mitigate some of these challenges, a definitive approach to expediting the development of FSM-based systems remains elusive.
We posit that this methodology holds significant potential for advancing the rapidly evolving fields of robotics and artificial intelligence.

2.1. Finite State Machines in Robotics

Robots play a pivotal role in various domains, from autonomous vehicles navigating urban environments to robots operating in industrial settings. These robots must exhibit complex and dynamic behaviors while ensuring safety, reliability, and efficiency. Achieving such a delicate balance is a formidable challenge, as a slight error in control can have far-reaching consequences. Many approaches have historically been used to achieve this (such as Behavior Trees and finite state machines), but finite state machines are among the most widely used. For this reason, this paper builds on FSMs, considering them a well-established [3] and powerful tool for addressing this challenge [4].
FSMs provide a systematic approach to model and manage intricate tasks, allowing robots to transition between discrete states, each representing a specific behavior or operation [5]. The discrete nature of FSMs simplifies the task of controlling complex behaviors, making it easier to design, understand, and maintain robot-control systems [6].
FSMs are inherently modular and can be hierarchically structured, enabling the decomposition of complex tasks into manageable subtasks. This modularity promotes code reusability and facilitates the testing and verification of individual states. Additionally, FSMs promote a clear and intuitive representation of robot behaviors, making it easier for human operators and engineers to understand and predict robot actions, even in intricate scenarios.
In this paper, the exploration is focused on enhancing the ease and efficiency of FSM coding while ensuring alignment with the original project specifications.

Flexbotics

Flexbotics is a framework designed and developed by Tecnalia [7] to streamline the development of robot behaviors. Its core component, the Flexbotics Execution Manager, serves as the application engine that can execute complex tasks represented as a step-by-step execution of a sequence of skills [8]. It manages the connections between the input and output parameters required by each skill, adapting them as specified in the plan. Importantly, the Execution Manager is versatile and can be applied to a wide range of robotic tasks, including advanced manipulation, computer vision, mobility, and navigation.
In the context of Flexbotics, a “skill” refers to the implementation of a specific capability or functionality, such as perception processing or robotic motion. These skills are designed to be standardized for seamless triggering by the Execution Manager, regardless of their complexity or purpose. Flexbotics is one of the many tools available (such as [9]) that can be run on top of the ROS [10] (Robot Operating System) in order to execute complex tasks by the consecutive execution of simpler skills.
Flexbotics also introduces the concept of a “process”, which defines a succession of simple operations (called “skills”) needed to achieve a global task or ‘mission’. This process description involves connecting the input and output of skills, allowing the Execution Manager to determine which operation to trigger and with which parameters during runtime.
The hierarchical nature of the process description supports the reuse of skills and can be used to break down intricate processes into reusable operations. Importantly, process descriptions in Flexbotics are implemented using human-readable data-serialization languages (JSON or YAML, both commonly used for writing configuration files) for readability and ease of process creation.
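As a purely illustrative sketch (not Tecnalia's actual Flexbotics API), a skill with a standardized trigger interface and named input/output parameters could look as follows; all class and parameter names here are assumptions:

from abc import ABC, abstractmethod
from typing import Any, Dict


class Skill(ABC):
    """A reusable robot capability with named input and output parameters."""

    name: str = "GenericSkill"

    @abstractmethod
    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Run the skill and return its output parameters."""


class MoveRobot(Skill):
    """Hypothetical motion skill: move the robot to a named location."""

    name = "MoveRobot"

    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        location = inputs["LOCATION"]
        # ... trigger the actual motion stack here ...
        return {"REACHED_LOCATION": location}

A standardized interface of this kind is what allows an execution manager to chain skills by wiring the outputs of one skill to the inputs of the next, as described above.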

2.2. Large Language Models (LLMs)

Language models (LMs) try to model the generative likelihood of word sequences in order to predict the probability of the next word in a phrase or to identify missing words, for example. In the last decades, research on this topic has evolved from statistical language models (based on statistical learning methods) to neural language models (which characterize the probability of word sequences by neural networks) to pretrained language models (PLMs) (which, instead of learning fixed word representations, try to capture context-aware word representations by first pretraining a neural network) [11].
Working with PLMs, researchers found that as the size of the PLM increases (measured by the number of its internal parameters), the language models gain a significant performance improvement and also start to show some special abilities (e.g., reasoning and learning). To distinguish language models at different parameter scales, the research community coined the term large language models (LLMs) for PLMs with parameters in the range of tens or hundreds of billions.
OpenAI, an AI research organization, has been at the forefront of developing such advanced artificial intelligence technologies. One of their most significant contributions is the GPT series, with the introduction of GPT-3 in June 2020 marking a pivotal moment in the technical landscape. GPT-3 (Generative Pretrained Transformer 3) is an autoregressive language model that leverages deep learning to produce human-like text. With 175 billion parameters, it represented, when introduced, one of the largest and most powerful AI language models ever created, and it was very successful at bringing AI-based text processing to a wide range of end-users.
While ChatGPT gained unprecedented popularity as the fastest-growing consumer product, it also triggered a global AI competition with notable alternatives like Anthropic (founded by former OpenAI researchers), Google DeepMind’s Gemini, Cohere AI (which caters to business customers, distinguishing itself from consumer-oriented large language models by OpenAI), Aleph Alpha (which also targets enterprise customers rather than the mass market), or some open-source models (like the “LLAMA” models released by Meta), to name a few.
The LLAMA (“Large Language Model Meta AI”) models from Meta are designed to be highly versatile and robust, and their open-source nature allows for customization and fine-tuning to meet the specific needs of this project. Because of this, the Meta-Llama-3-70B-Instruct model [12], running on Deep Infra [13] computation resources, is used in this work.
The rapid advancement of LLMs has significantly transformed the landscape of natural language processing [14]. These models, such as GPT-3 (“Generative Pre-trained Transformer” version 3) and their successors, have exhibited remarkable capabilities in understanding and generating human-like text. As in other areas where humans were hard to beat by automated systems (such as image identification, reading comprehension, etc.), it is expected that the capabilities of LLMs will improve in the near future, as shown in the trend depicted in Figure 1.
In various contexts, LLMs demonstrate potential efficacy when applied to specific development tasks that are typically slow and error-prone for human developers. A notable instance of this is the automated generation of unit tests [16,17].
However, alongside their immense potential, concerns about ethical behavior, content quality, and reliability have risen [18]. Thus, it is generally considered that using LLMs to generate code presents several risks and challenges [19].
At their core, LLMs lack a true understanding of the code they generate, potentially leading to logic errors [20,21]. And of course, they were not designed for code generation: they are language models and may produce code that does not adhere to best practices. This can result in software that does not fully follow the specifications or which is difficult to maintain and debug [19].
Security is an additional significant concern, as the code generated by LLMs may inadvertently introduce vulnerabilities if not carefully reviewed and tested. The generated code might not follow secure coding practices, putting systems at risk [22,23].
Of course, as mentioned before, debugging can be complex, as the code generated by LLMs can be convoluted and challenging to decipher, increasing the time and effort required to identify and fix issues.
It should also be noted that LLMs do not possess domain-specific knowledge, leading to code that may not meet specific industry or application requirements. This can be a particular concern when developing safety-critical systems or highly specialized software [24].
Ethical concerns also arise when using LLMs to generate code, especially in applications where code quality and safety are paramount, such as medical devices or autonomous vehicles. Ensuring that the code generated by LLMs meets regulatory and ethical standards is a critical consideration [25,26].

2.3. Guaranteeing the Safety and Quality of Automatically Generated Code

Ensuring the safety and quality of automatically generated code, particularly in the context of technologies like LLMs, holds significant importance across diverse applications, including robotics and software development. Until recently, the predominant approaches for automatically monitoring the quality and safety of the code produced by LLMs were mainly the following ([27,28]):
Static-code-analysis tools: These tools assess the generated code without executing it, examining code for adherence to coding standards, potential bugs, and readability. They can identify issues related to coding style and practices, promoting code quality and maintainability.
Dynamic code analysis: This approach focuses on evaluating code during runtime. It helps identify runtime errors, memory issues, and runtime vulnerabilities. Dynamic-code-analysis tools can be employed to ensure code safety by detecting issues that may not be apparent in static analysis.
Testing and test automation: Automated testing frameworks are essential to ensure the functional correctness of LLM-generated code. Test automation helps identify defects and verify that the code behaves as expected, contributing to both code quality and safety.
Machine-learning-based code analysis: Leveraging machine learning models for code analysis can assist in identifying security vulnerabilities and ensuring compliance with coding best practices. These models can be trained to recognize patterns of code that may pose risks or fall short of quality standards.
Security scanners: Automated security-scanning tools are pivotal in identifying vulnerabilities within LLM-generated code. They help mitigate security risks by flagging potential weaknesses or exploitable points in the code.
Model interpretability: The code generated by LLMs often exhibits complexity and challenges in comprehension. To address this, enhancing model interpretability becomes crucial, aiming to render the decision-making processes of LLMs transparent and comprehensible. This transparency is essential for ensuring code quality and safety.
In this paper, an alternative approach is presented: utilizing a second LLM to oversee the quality and safety of code produced by an initial LLM, as detailed in the subsequent section.

2.4. Proposed Approach

Guaranteeing the safety and quality of automatically generated code, especially when using technologies like LLMs, is crucial in various applications, including robotics, software development, and more.
This research paper posits that the quality and safety of the code produced by LLMs for FSMs can be substantially improved through a dual-pronged approach.
First, the foundation of the FSM can be established by utilizing a collection of human-programmed and carefully curated “software blocks” (some ‘skills’ previously programmed). These software blocks are predefined, trusted components that encapsulate well-established logic, reducing the likelihood of errors and vulnerabilities.
Secondly, to further improve the quality and safety of the generated FSM code, a second LLM can be employed to supervise the entire process. Inspired by the idea of using adversarial techniques to produce a desired result (see, for example, [29,30,31,32]), our goal here is to create a pipeline that automates both the code generation and its validation. The innovative application of two LLMs in a manner akin to GANs introduces a novel paradigm for enhancing code quality and safety in the context of FSMs. By employing two distinct LLMs, we create a dynamic interplay that mirrors the adversarial relationship seen in GANs. These LLMs collaborate to generate the FSM code, each playing a unique role:
Generator LLM: this LLM acts as the “generator,” producing initial code candidates. It draws upon its vast language knowledge and context to propose a plan or sequence for the FSM implementations.
Discriminator or Supervisory LLM: The second LLM serves as the “discriminator”. Its purpose is to critically evaluate the generated code snippets. It assesses correctness, adherence to safety guidelines, and overall quality.
Similar to GANs, the two LLMs can engage in an iterative process. The generator strives to create high-quality FSM code, while the discriminator provides feedback, pushing for improvements.
The adversarial nature of this interaction encourages both LLMs to excel: the generator seeks to outwit the discriminator, while the discriminator becomes more discerning over time. If any anomalies or potential issues are detected, the Supervisory LLM can provide feedback and recommendations for improvements.
By combining these two approaches, the inherent creativity and code-generation capabilities of LLMs are harmonized with the reliability and safety assurance provided by human-programmed software blocks and a Supervisory LLM. This fusion not only expedites the development process but also produces high-quality, safe, and reliable FSM code, making it a promising approach for the efficient and dependable automation of complex systems such as robots.
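A minimal sketch of this iterative interplay is shown below, written as a generic loop that takes hypothetical generate_plan and review_plan helpers wrapping the two LLMs (minimal sketches of both helpers are given in Section 3); it is an illustration of the idea, not the actual implementation:

def produce_validated_plan(generate_plan, review_plan, context, user_request, max_rounds=5):
    """Iterate between a Generator LLM and a Supervisory LLM until the plan passes review."""
    feedback = None
    for _ in range(max_rounds):
        plan = generate_plan(context, user_request, feedback)          # generator proposes
        verdict, feedback = review_plan(context, user_request, plan)   # supervisor critiques
        if verdict == "OK":
            return plan
    raise RuntimeError("No valid plan produced within the allowed number of iterations")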

2.4.1. Human-Programmed and Carefully Curated “Software Blocks”

In the context of this research, Flexbotics serves as the foundational technology: its human-programmed (and curated) blocks act as the base on which this work builds. Using Flexbotics-curated “skills” as a base technology that works in both physical and synthetic environments, we can focus on the other research topics: how to use a first LLM to assemble those skills and how to use a second LLM to verify that the results are valid.

2.4.2. Use of a Second LLM to Verify Results from a First One

Inspired by the adversarial approach in GANs, this paper explores how to use one LLM to supervise the results obtained by another previous LLM. The Supervisory LLM acts like a discriminator or validator, guiding and evaluating the generative LLM outputs to ensure high-quality content, fostering continuous improvement in the performance of the generative model.
This methodology aims to address several critical concerns. Firstly, it provides a structured framework for enhancing the ethical and qualitative aspects of the content generated by the LLM. The Supervisory LLM establishes benchmarks for the generative LLM, encouraging it to produce content that aligns with predefined ethical and quality standards. Secondly, this methodology enables real-time adjustments and the fine-tuning of LLM behavior based on ever-evolving societal norms and ethical considerations. In an era where the appropriateness of generated content can change rapidly, this adaptability is of great value.
This paper explores the theoretical framework and practical implementation of this “GAN-inspired approach” for creating a program for robots in different situations, using LLMs as a base technology. Throughout this process, the goal is to use LLM technology that seamlessly translates some initial specification (provided as text in a human language) into machine-readable code, enabling robots to perform complex tasks autonomously. This not only simplifies the interaction between humans and robots but also opens up new possibilities for automation and efficiency in various fields, from household chores to industrial operations.

3. Architecture

Plain LLMs, used as-is, cannot be relied upon to successfully create machine-readable code. They have to be properly guided and controlled to minimize errors and to guarantee that the output is the desired one. While LLMs are adept at understanding and generating human-like text, they may still produce ambiguous or incorrect code without proper oversight. This is due to the inherent complexity of programming languages and the specific requirements of robotic tasks, which demand precision and accuracy. Therefore, simply relying on LLMs without additional layers of control can lead to suboptimal or even hazardous outcomes, particularly in critical applications like robotics.
In the scope of this paper, a complete software architecture was developed to guarantee that the generated code follows a set of previously defined steps, as visually depicted in the accompanying Figure 2.
The methodology consists of the following steps:

3.1. Prepare Context Information

In LLMs, “context information” refers to the additional information, guidance, or instructions provided to the model to help it understand and generate more contextually relevant responses. In our case, it is guidance that helps the robot carry out the mission description specified by the user, providing the robot with information about its environment, the elements around it (and their characteristics), and the allowed actions on them.
One example is shown below:
You are an assistant that is starting with an initial user request:
create a plan for a robot. This step-by-step plan allows the robot
to achieve what the user requested.
The robot is located in a kitchen with certain elements around it.
The robot can only manipulate appliances in the kitchen and
pick-and-place elements.
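Such context can be assembled programmatically from the environment description and the list of allowed actions. A minimal sketch is shown below (the helper name and data structures are assumptions for illustration, not the actual implementation):

def build_context(environment_elements, allowed_actions):
    """Assemble the context prompt from the environment description and the allowed actions."""
    lines = [
        "You are an assistant that is starting with an initial user request:",
        "create a plan for a robot. This step-by-step plan allows the robot",
        "to achieve what the user requested.",
        "The robot is located in a kitchen with the following elements around it:",
    ]
    lines += [f"- {name}: {description}" for name, description in environment_elements.items()]
    lines.append("The robot can only perform the following actions on these elements:")
    lines += [f"- {action}" for action in allowed_actions]
    return "\n".join(lines)

# Example call (descriptions and action names are illustrative):
context = build_context(
    {"Fridge": "an appliance, initially closed, containing an egg and a lettuce"},
    ["move to a location", "pick up an object", "open or close an appliance"],
)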

3.2. Code Generation—Step 1: Automatically Generates a Plan from the Specification Provided by the User

Using the available context information, the LLM generates a plan to achieve the goals indicated by the user. In other words, the Generator LLM will try to reach the final state specified by the user using the available resources and allowed actions.
This plan will be a text description that contains a list of actions that the robot should execute to achieve the desired result.
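A minimal sketch of this step is shown below, assuming an OpenAI-compatible chat endpoint for the Meta-Llama-3-70B-Instruct model; the base URL, environment variable name, and model identifier used on Deep Infra are assumptions of this sketch:

import os
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model identifier on Deep Infra.
client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                api_key=os.environ["DEEPINFRA_API_KEY"])
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def generate_plan(context, user_request, feedback=None):
    """Ask the Generator LLM for a step-by-step plan (code generation, step 1)."""
    messages = [{"role": "system", "content": context},
                {"role": "user", "content": user_request}]
    if feedback:  # suggestions returned by the Supervisory LLM, if any
        messages.append({"role": "user", "content": "Revise the plan considering: " + feedback})
    response = client.chat.completions.create(model=MODEL, messages=messages)
    # response.usage.total_tokens reports the token count used as a complexity measure in Section 4
    return response.choices[0].message.content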

3.3. Plan Validation: Evaluate Plan and Used Elements

A second Supervisory LLM will check the plan created in the previous step, performing a triple check:
  • Actions should follow a certain logical order (for example: an object has to be picked up before it is put somewhere, something has to be opened before it can be closed, ...).
  • Actions do not overflow the robot’s capabilities (for example: picking up more than one element at the same time).
  • A validation to check that the robot’s actions will create a final state of the system that matches the final state indicated by the user (for example: validating that ‘all doors should be closed’).
If the plan presents some errors, the Supervisory LLM will create and send some suggestions to the Generator LLM, so it can replan the actions. This loop will be repeated until the validation detects no errors (or, in our implementation, until the loop is repeated more than a certain number of times).
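For illustration, the triple check can be encoded in the instructions given to the Supervisory LLM, and its reply parsed into a verdict plus feedback for the Generator LLM. The prompt wording, the ‘OK’ reply convention, and the client setup below are assumptions for this sketch, not the prompts actually used:

import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint, as in Section 3.2
                api_key=os.environ["DEEPINFRA_API_KEY"])
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

SUPERVISOR_INSTRUCTIONS = (
    "You are a strict reviewer of robot plans. Check that: "
    "(1) actions follow a logical order (pick before place, open before close); "
    "(2) actions do not exceed the robot's capabilities (e.g., one object in hand at a time); "
    "(3) the final state of the system matches the user's request. "
    "Reply 'OK' if the plan is valid; otherwise, list the problems found."
)

def review_plan(context, user_request, plan):
    """Ask the Supervisory LLM to validate a plan and return (verdict, feedback)."""
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SUPERVISOR_INSTRUCTIONS},
            {"role": "user", "content": f"{context}\n\nRequest: {user_request}\n\nPlan:\n{plan}"},
        ],
    ).choices[0].message.content
    verdict = "OK" if reply.strip().startswith("OK") else "ISSUES"
    return verdict, reply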

3.4. Code Generation—Step 2: Convert the Plan into a Sequence of Allowed Actions (“Skills” That the Robot Can Execute)

Using the already-validated plan, the Generator LLM will now convert the plan into a valid JSON representation of an FSM that uses a set of allowed “skills” (human-developed curated software modules).
For example, the plan:
  •  Step 1) Move to the fridge location.
  •  Step 2) Open the fridge.
  •  Step 3) Pick up the egg.
  •  Step 4) Close the fridge.
will be converted into a syntactically correct JSON:
{
    "MISSION_NAME": "Pick egg",
    "TASKS": [
        {
            "NAME": "MoveToFridge",
            "SKILL": "MoveRobot",
            "PARAMETERS": {
                "LOCATION": "Fridge_location"
            }
        },
        {
            "NAME": "OpenFridge",
            "SKILL": "OpenObject",
            "PARAMETERS": {
                "objectId": "id_fridge"
            }
        },
        {
            "NAME": "PickEgg",
            "SKILL": "PickupObject",
            "PARAMETERS": {
                "objectId": "id_egg"
            }
        },
        {
            "NAME": "CloseFridge",
            "SKILL": "CloseObject",
            "PARAMETERS": {
                "objectId": "id_fridge"
            }
        }
    ]
}
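For illustration, such a mission description can be executed by a very simple sequential executor that dispatches each task to the corresponding skill. The sketch below uses a hypothetical skill registry as a stand-in for the Flexbotics Execution Manager:

import json

# Hypothetical skill registry standing in for the Flexbotics Execution Manager.
SKILLS = {
    "MoveRobot":    lambda p: print("moving to", p["LOCATION"]),
    "OpenObject":   lambda p: print("opening", p["objectId"]),
    "PickupObject": lambda p: print("picking up", p["objectId"]),
    "CloseObject":  lambda p: print("closing", p["objectId"]),
}

def run_mission(mission_json):
    """Execute the tasks of a mission JSON strictly in sequence."""
    mission = json.loads(mission_json)
    for task in mission["TASKS"]:
        SKILLS[task["SKILL"]](task["PARAMETERS"])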

4. Experimental Setup and Validation

The approach presented in this paper is evaluated on two different setups:
  • A synthetic setup based on the iTHOR environment [33]. iTHOR is an environment within the AI2-THOR framework, which includes a set of interactive objects and more than 120 scenes and provides accurate modeling of the physics of the world. The AI2-THOR framework is a popular simulated environment for embodied AI research, providing a set of realistic and interactive scenarios for testing and training AI systems.
  • A physical setup based on a dual-arm robot that performs pick-and-place operations on a factory assembly line.

4.1. Synthetic Setup

A synthetic environment was set up in a simulation, providing a low-cost, risk-free space to explore and refine solutions. Simulations enable unlimited retries, efficient exploration, and scalability, allowing the system to generalize to diverse environment configurations.
This approach can greatly reduce the need for extensive real-world experience. By leveraging simulation, it is possible to accelerate the development process, reduce the risk of physical damage, and create more robust and capable robots.
Additionally, it should be noted that simulation allows for the testing of complex scenarios, edge cases, and corner cases that may be difficult or impossible to replicate in the real world. Simulation facilitates the collection of large amounts of data, which can be used to train and fine-tune the LLM-based system, leading to an improved performance and accuracy.
In the scope of this paper, the iTHOR simulated environment was used for testing and validation (more specifically, we used the iTHOR default simulated kitchen).
In order to work with such an iTHOR simulated environment, it was necessary to develop a middleware that allowed for interactions between the Flexbotics and the iTHOR software. Such middleware, which acts as an intermediary layer between the Flexbotics system and the iTHOR simulator, translates the actions and commands generated by our LLM-based system into a format that can be understood by the iTHOR environment, and vice versa.
This enables our system to interact with the simulated kitchen environment, allowing us to test and refine its performance in a realistic and controlled setting, as illustrated in Figure 3.
Some specific tasks that the middleware performs include the following:
  • Translating high-level action commands from the Flexbotics system into low-level motor control commands that can be executed in the iTHOR environment.
  • Converting sensor data from the AI2-THOR environment into a format that can be processed by the Flexbotics system.
  • Handling communication between the Flexbotics system and the iTHOR environment, ensuring that actions are executed correctly and that the system receives accurate feedback.
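For illustration, the core of such a translation layer can be sketched as follows, assuming the ai2thor Python package’s Controller interface; the scene name and the skill-to-action mapping are assumptions of this sketch:

from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # assumed identifier of the default kitchen scene

# Assumed mapping from Flexbotics-style skills to iTHOR actions.
SKILL_TO_THOR = {
    "PickupObject": "PickupObject",
    "OpenObject": "OpenObject",
    "CloseObject": "CloseObject",
}

def execute_skill(skill, parameters):
    """Translate a high-level skill call into an iTHOR controller step and return its outcome."""
    event = controller.step(action=SKILL_TO_THOR[skill], objectId=parameters["objectId"])
    return event.metadata["lastActionSuccess"]  # feedback passed back to the Execution Manager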
The iTHOR robot simulation environment is a highly interactive and versatile platform designed to simulate household environments for AI and robotics research. It enables the testing and training of intelligent agents in realistic settings, offering rich visual and physical interactions with objects and spaces, thereby facilitating advancements in tasks like navigation, object manipulation, and human–robot interaction.
There are 120 scenes in iTHOR, evenly spread across kitchens, living rooms, bedrooms, and bathrooms. For the purpose of this paper, we will use the default kitchen, shown in Figure 4.
The iTHOR “default kitchen” includes many elements that can be used to test different kinds of missions. In the scope of this paper, the following ones were considered (and are provided as context to the LLMs when required):
  • Fridge: an appliance that is initially closed. It initially contains an egg and a head of lettuce.
  • Egg: a pickable food initially located inside the fridge.
  • Lettuce: a pickable food initially located inside the fridge.
  • Stove: an appliance for cooking.
  • StoveBurner: the burner of the stove appliance.
  • Pan: a frying pan, a tool used for cooking.
  • Plate: an empty plate container located on top of the countertop.
  • Countertop: there are two countertops: one aligned with the fridge and a kitchen island countertop.
  • Basin: a kitchen sink.
  • Toaster: an appliance on the countertop that can be switched on and off.
  • Knife: a slicing tool inside a drawer.
  • Tomato: located on the kitchen island countertop.
  • Bread: a bread loaf on the kitchen island countertop.
  • Drawer: a drawer that contains a knife.
The locations of these elements are shown in Figure 5.
A context text file was created to explain the following to the Generator LLM:
  • Environment and elements in it: which elements can be found in the synthetic setup around the robot (the ones in the simulated kitchen shown above) and their initial state.
  • Allowed robot actions (basic actions that can later be translated to Flexbotics skills):
    moving around.
    picking up and using tools.
    acting on elements (switching on/off, opening/closing, slicing food, etc.).
    moving elements around.
  • Translation of robot actions into Flexbotics skills: describing how to generate a valid JSON from a high-level plan.
With the software developed using the methodology presented in this paper, it is now possible for a non-expert programmer to easily create programs for the robot just by writing a couple of lines of text (for example, “I want the robot to pick the tomato and put it into the plate”).

Synthetic Setup: Experimental Results

Although some initial effort is required to properly detail the context to the Generator LLM, that effort makes sense if we later want to reprogram the robot to perform different missions in the same environment (as it greatly reduces the required programming time compared to doing it manually).
As they are presented in this paper, the Generator LLM and Supervisory LLM form a non-reactive offline programming system. It is assumed that the robot already includes a low-level, low-latency system that copes with perception, obstacle avoidance, and navigation in its environment, so these tasks are out of the scope of the approach proposed in this paper and therefore out of scope for these tests (e.g., the location information of elements is provided to the robot as a predefined table).
It is important to note that during these validation tests, we will measure the total number of “tokens” processed by the Generator LLM in its initial phase (code-generation step 1). This measurement aims to indirectly assess the complexity of the task assigned to the system. “Tokens” refer to words, character sequences, or combinations of words and punctuation that LLMs use to break down and process text. Therefore, a more complex task will typically involve a larger number of tokens.
Some user-specified missions (and the corresponding output plan from the system) are shown in the next user requests:
  • USER REQUEST 1: “Put the tomato in the fridge. Make sure that in the end the fridge is closed.”
  • Total number of tokens used by Generator LLM: 1047.
    PLAN prepared by Generator LLM, and validated by Supervisory LLM.
    Step 1: Pick the tomato
    - PICK_ELEMENT on Tomato:
    • Move to tomato_location
    • Pick the tomato
    ################
    Step 2: Move to the fridge
    - MOVING to fridge:
    • Move to fridge
    ################
    Step 3: Open the fridge
    - ACT on Fridge:
    • Open the fridge
    ################
    Step 4: Place the tomato in the fridge
    - PLACE_ELEMENT on Tomato:
    • Move to fridge
    • Place the tomato in the fridge
    ################
    Step 5: Close the fridge
    - ACT on Fridge:
    • Close the fridge
    ################
This sequence is trivial and maybe too simple, but it can be used as an initial and quick benchmark to compare the required programming time between the LLM-based approach and human developers. For this test, four different user profiles were selected:
  • S1: a junior user with no prior experience generating JSON files.
  • S2: a junior user with some previous experience generating JSON files.
  • S3: a senior programmer with experience generating JSON files.
  • S4: a senior programmer with experience programming robots and generating JSON files.
All these users had the task of creating a JSON that could achieve the indicated task (put the tomato in the fridge).
Table 1 summarizes the time (in minutes and seconds) required by each subject to create a JSON and check that it fulfills the specification. In these results, ‘LLM advantage’ measures the performance improvement in percentage when using the LLM.
Although we are not comparing the code (just the required time), it can already be seen that even for the programming of simple robot tasks, an LLM-based automated system reduces time.
It should be noted that preparing the “context” information also takes some time that the human developers did not need. However, this effort is not repeated when the robot is later retasked in the same environment, as the context can be reused.
If we want to program the robot to perform other tasks, the human developers will need some extra time, while the LLM-based system can perform the programming from just a short indication (that even non-programmers can create), as seen in the following requests:
  • USER REQUEST 2: “I want the robot to slice some tomato and then put it on a plate.”
  • Total number of tokens used by Generator LLM: 1107.
    PLAN prepared by Generator LLM, and validated by Supervisory LLM.
    Step 1: Get the Knife
    - PICK_TOOL on Knife:
    • MOVE to Drawer
    • ACT on Drawer: Open
    • PICK_TOOL on Knife
    ################
    Step 2: Move to Tomato
    - MOVING to Tomato_Location
    ################
    Step 3: Slice the Tomato
    - ELEMENT_MODIFICATION on Tomato: Slice
    • ELEMENT_MODIFICATION on Tomato: Slice (using the picked Knife)
    - PLACE_ELEMENT on Knife: Put back in Drawer
    • MOVE to Drawer
    • PLACE_ELEMENT on Knife: Put back in Drawer
    ################
    Step 4: Pick the Sliced Tomato
    - PICK_ELEMENT on Tomato:
    • MOVE to Tomato_Location
    • PICK_ELEMENT on Tomato
    ################
    Step 5: Move to Plate
    - MOVING to Plate
    ################
    Step 6: Put the Sliced Tomato on the Plate
    - PLACE_ELEMENT on Tomato: Put on Plate
    • MOVE to Plate
    • PLACE_ELEMENT on Tomato: Put on Plate
After the validation, the result of the FSM JSON was successfully tested on iTHOR, as seen in Figure 6.
LLM technology does not come without limitations. A glimpse of the limits of the LLM-based approach presented in this paper can be seen in the next request:
  • USER REQUEST 3: “Prepare a toast of sliced bread, and put it on the plate with a slice of tomato on top of it. ”
  • Total number of tokens used by Generator LLM: 1427.
    PLAN prepared by Generator LLM, and validated by Supervisory LLM.
    Step 1: Slice the bread
    ################
    - PICK_TOOL on Knife:
    • Move to the drawer
    • ACT on Drawer: open
    • Move to the knife location
    • PICK the knife
    - MOVE to bread location
    - ELEMENT_MODIFICATION on Bread: slice
    • Use the knife to slice the bread
    - PICK_PLACE on Knife:
    • Move to the drawer
    • PLACE the knife
    • ACT on Drawer: close
    - PICK_ELEMENT on sliced Bread:
    • Move to the bread location
    • PICK the sliced bread
    ################
    Step 2: Toast the sliced bread
    ################
    - MOVE to toaster location
    - ACT on Toaster: switch on
    • ACT on Toaster: switch on
    - PLACE_ELEMENT on sliced Bread:
    • PLACE the sliced bread in the toaster
    - ACT on Toaster: switch off
    • Wait for the toasting process to finish
    • ACT on Toaster: switch off
    - PICK_ELEMENT on toasted Bread:
    • Move to the toaster location
    • PICK the toasted bread
    ################
    Step 3: Slice the tomato
    ################
    - PICK_TOOL on Knife:
    • Move to the drawer
    • ACT on Drawer: open
    • Move to the knife location
    • PICK the knife
    - MOVE to tomato location
    - ELEMENT_MODIFICATION on Tomato: slice
    • Use the knife to slice the tomato
    - PICK_PLACE on Knife:
    • Move to the drawer
    • PLACE the knife
    • ACT on Drawer: close
    - PICK_ELEMENT on sliced Tomato:
    • Move to the tomato location
    • PICK the sliced tomato
    ################
    Step 4: Assemble the plate
    ################
    - PICK_ELEMENT on Plate:
    • Move to the plate location
    • PICK the plate
    - PLACE_ELEMENT on Plate:
    • Move to a temporary location (e.g. countertop)
    • PLACE the plate
    - PICK_ELEMENT on toasted Bread:
    • Move to the toaster location
    • PICK the toasted bread
    - PLACE_ELEMENT on toasted Bread:
    • Move to the plate location
    • PLACE the toasted bread on the plate
    - PICK_ELEMENT on sliced Tomato:
    • Move to the tomato location
    • PICK the sliced tomato
    - PLACE_ELEMENT on sliced Tomato:
    • Move to the plate location
    • PLACE the sliced tomato on top of the bread
Although the request was successfully finished by the combination of the Generator LLM and the Supervisory LLM, it required more than one correction (in this case, the Generator LLM created a plan where it tried to open the drawer twice). In the end, the code was successfully validated in the simulated iTHOR environment.
Requests as complex as the last one may create problems for LLMs. In our case, the Generator LLM can fail at generating the code, or the Supervisory LLM may fail to validate things as its reasoning logic fails.
Modern large language models offer powerful natural language interpretation and/or generation capabilities but can still struggle when confronted with very long input prompts that require sophisticated reasoning. In our tests (measuring complexity by the number of elements to manipulate and/or the consecutive actions on them), complex requests tend to fail, and incorrect action sequences are proposed by the system. In this situation, when a certain limit is reached, both the Generator LLM and the Supervisory LLM start having problems (the former cannot create a valid plan, while the latter identifies some correct sequences as problematic).
As an example of a request that is too complex, consider USER REQUEST 4: “Prepare a fried egg using the pan and the stove, then prepare a toast of sliced bread, and put that (together with the egg and some sliced lettuce) on a plate.”, a request that requires a total of 1961 tokens in the Generator LLM (step 1).
This request required four iterations between the Generator LLM and the Supervisory LLM, as in the first iterations, the Generator LLM tried to pick up a couple of items at the same time, and then the Supervisory LLM failed to identify some action sequences as correct. In the end, an output JSON was generated, but the task was not successfully finished in the simulation.
This degradation is examined in more detail in Section 4.3.

4.2. Physical Robot

For the physical setup, the proposed architecture was implemented in a dual-arm robot equipped with a stereo camera and a robotic platform designed for the manipulation of parts in industrial environments, as seen in Figure 7.
This dual-armed robotic platform is composed of the following:
  • A dual-arm Kawada Nextage robot
  • A stereo camera.
  • A set of vitro-ceramic electric cooker components that must be assembled by the robot.
In this setup, the robot has to be programmed to pick up and place different components into a base, in order to assemble a vitro-ceramic electric cooker.
The programming time benchmark in the physical setup follows the same approach as for the synthetic setup.
In this case, the four subjects were given the following task: to create a robot program to assemble the vitro-ceramic electric cooker (which means picking up an electronic component and putting it in its correct position).
The goal here was to verify that the approach presented in this paper can be used in a real environment, so a very simple goal was defined.
As the physical setup did not allow for much flexibility, the generator–supervisory approach worked very well for simple pick-and-place tasks in this robot.
The sequence should be as follows: put component number one in position 2 and component number three in position 1 of the vitro-ceramic cooker. The final result must be that the elements are correctly placed.
The available commands (provided as context to the Generator LLM) are as follows:
"LookForObject": Use artificial vision to locate an object.
Needs a parameter (named "object_id") indicating the id of
the object to locate
"PickObject": Use the robot grippers to pick up an object.
Needs two parameters: the arm to use (named "arm", which could
be "left" or "right") and the object id (named "object_id")
"MoveTo": Moves the robot arm to a certain location.
Needs two parameters: the arm to use (named "arm", which could
be "left" or "right") and the position id (named "location_id")
"Ungrasp": opens the robot gripper
In this case, the time used by the subjects to finish the JSON programming task is as shown in Table 2.

4.3. Results after Validation on Synthetic and Physical Setups

From the outset, it was evident that the LLMs had a significant advantage in speed (using LLMs can accelerate the development time by more than 90%). While the human developers took their time to plan, code, and integrate various components, the LLMs swiftly produced a working version of the program. It is true that the LLMs have to be previously programmed and prepared for this task (providing them with the right context), but once set up and working in a well-defined domain, they can quickly generate and refine the code, allowing them to complete the task far more quickly than their human counterparts.
When we consider debugging and testing, the LLMs again had the upper hand. The working approach presented here iterates until a successful code is generated, necessitating minimal time for debugging and testing (this way, testing is seamlessly integrated into the development process). In contrast, the human developers spent considerable time identifying and fixing bugs, resulting in an iterative cycle of code revisions and retests.
When it came to the quality of the final result (understood as producing code that successfully achieves what the user required in the simulated or real environment), success depended on certain factors.
The proposed approach can tame the stochastic nature of LLMs (LLMs can sometimes produce inconsistent or random outputs for the same input) through the use of a dual Generator LLM–Supervisory LLM approach and by dividing the processing in each LLM into different steps, but the initial findings from our research show the reasoning limitations of current LLMs. Current LLMs work well for simple tasks, but they often struggle with tasks that require complex logical reasoning or multistep problem solving.
Researchers claim that LLMs are not capable of performing real reasoning [34], as instead of learning how to reason, LLMs learn statistical features that inherently exist in logical-reasoning problems, or they consider that while current LLMs may possess abstract task-solving skills to an extent, they often also rely on narrow (and non-transferable) procedures for reasoning and task solving [35].
Moving beyond a barrier of about 1500 total tokens (in code-generation step 1) degrades the reasoning capabilities of the LLM used (Llama 3.0 70B-Instruct), and this impacts both the Generator LLM and the Supervisory LLM. Even with multiple verification loops between them, beyond a certain limit, the Generator LLM and Supervisory LLM cannot create a plan that successfully satisfies the initial request from the user.
Similar behavior has been observed in other LLMs, and research suggests splitting complex requests into a list of smaller ones [36] (in our case, maybe inform the user that the requested specification is beyond the limits, and that it should be divided into a sequence of smaller requests).
As an example of the performance in our tests, Table 3 shows a sample list of user requests (in the simulated iTHOR environment) and their output using the Generator LLM–Supervisory LLM approach.
This table includes the following columns:
  • User Request: the mission requested by the user.
  • Actions in JSON: the number of sequential actions created in the generated JSON.
  • Total Tokens: the total number of tokens evaluated by the Generator LLM in the first step of code generation (as a measure of the complexity of the user request).
  • Verification Loops: the number of times the Supervisory LLM requested a change in the plan.
  • Successful Simulation: indicates if the mission was fully executed in the simulation and the goals indicated by the user were achieved.
As seen in Table 3, there is a certain number of tokens that the system can handle, and beyond that limit, the behavior quickly degrades. In our tests, moving beyond 1500 total tokens produces non-optimal plans and code, so the limit for the LLM used seems clear (some quick tests using the next LLAMA model, Llama 3.1, did not alleviate this situation).
The approach presented in this paper could be considered an example application of the rapidly growing field of prompt-based reasoning with LLMs, and as indicated in [37], most works on reasoning in LLMs are still experimental and limited. In any case, it should be noted that at present, researchers work on both improving the reasoning capabilities of current models [38] and on the development of new, more powerful models, so it is expected that these limits will be increased in the near future.
This degradation of reasoning is a common problem in the current generation of LLM-based prompt-based reasoning systems, and alleviating its effects is a very active AI research topic [39]. In general, although the degradation limits may differ, as the number of processed tokens increases, the reasoning degrades. This degradation can be clearly identified in Figure 8, prepared by [40].
With the perspective of new LLMs that can reason beyond the limits of current models, it is clear that LLMs pose a very interesting research area for automated programming. The agility of LLMs facilitates rapid prototyping, offering a valuable asset for accommodating evolving requirements and swiftly iterating FSM designs. The obtained results show the feasibility of the approach, as the development and testing time (compared to human developers) is significantly reduced.

4.4. Scalability

An important aspect of this approach is scalability. In the proposed automatic solution, the following should be noted:
  • Handling multiple FSMs: One of the key advantages of employing LLMs for FSM development is their scalability. LLMs can readily manage the creation of many different FSMs, a capability that is especially valuable in projects that require numerous state machines.
  • Resource efficiency: Scalability in this context means that projects can efficiently allocate resources to address the coding needs of multiple FSMs concurrently. Compared to manual coding, which might be constrained by developer availability, the automated LLM approach offers resource efficiency.
  • Consistency across projects: LLMs contribute to consistency across multiple FSMs within a project. Since they generate code based on standardized models, the risk of variations in coding style and quality is significantly reduced. This is especially valuable when numerous FSMs must work together cohesively.
  • Streamlining complex systems: In large-scale projects or applications where multiple interdependent FSMs are required, LLMs can streamline the development process. They ensure that individual FSMs align with the overall project goals and requirements, facilitating the creation of complex systems.
  • Rapid deployment: The ability to scale quickly and efficiently is essential for projects with dynamic requirements. LLMs enable rapid deployment and adaptation to changing project needs, making them suitable for scenarios where scalability is a vital consideration.
This approach could be extended to the automatic generation of code for other different missions.

5. Conclusions and Future Work

The presented work introduces a novel method for the automated generation of code in the context of programming robots for complex tasks, using LLM agents specialized in either writing or verifying code. The proposed solution presents an innovative approach that entails a two-step pipeline: first, it generates code automatically (interpreting a text-based initial specification), and then it verifies that the code is safe and functional (aligned with the initial specification).
The implications of these findings are profound. LLMs have the potential to revolutionize software development by significantly enhancing productivity and code quality. Their ability to deliver rapid, efficient, and high-quality code can alleviate many of the bottlenecks currently faced by human developers, especially in large-scale and complex projects.
To validate the proposed solution, two different setups were prepared: first, a synthetic environment (based on a well-known home-environment simulator), and then a real physical environment with a two-armed robot in charge of component assembly.
For future steps, the first research line would be to work on ways to improve the reasoning capabilities of available LLMs, allowing current LLMs to work on more complex tasks (“divide-and-conquer” seems to be the most promising approach here).
The next step is to introduce reactivity into this LLM agent-based approach (as the approach presented here prepares an “offline” program that is downloaded to the robot and not modified during its execution).
Additionally, another important issue is to analyze the feasibility of running LLMs locally (to avoid issues with the privacy and/or reliability of cloud-based LLM systems). This could be achieved using more specialized and optimized LLMs (using, for example, model-quantization techniques, which reduce the size of large language models by lowering the precision of their weights).
An evolution of the system may run the LLMs locally, close to the robot, allowing for the close-to-real-time adaptation of the robot’s behavior to changes in its environment or task.

Author Contributions

Conceptualization, U.A. and B.S.; investigation, U.A.; methodology, U.A. and B.S.; resources, U.A. and B.S.; software, U.A. and F.B.; supervision, J.O. and B.S.; validation, F.B.; writing—original draft, U.A.; writing—review and editing, U.A., F.B., J.O., D.S. and B.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the ELKARTEK Research Program of the Basque Government, project #KK-2024/00050. The APC was funded by Fundación TECNALIA Research and Innovation.

Data Availability Statement

The experiment data and software code (without the code owned by Tecnalia) are publicly accessible at https://github.com/uantero/paper_robotics-13-00137 (accessed on 4 September 2024). The code is limited to its use with the iTHOR environment and includes a very simple sequential state machine instead of Tecnalia’s Flexbotics framework.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. IFR Secretariat Blog—AI in Robotics. 2022. Available online: https://ifr.org/post/ai-in-robotics-blog (accessed on 4 September 2024).
  2. Ogheneovo, E. On the Relationship between Software Complexity and Maintenance Costs. J. Comput. Commun. 2014, 2, 51631. [Google Scholar] [CrossRef]
  3. Ben-Ari, M. Finite state machines. In Elements of Robotics; Springer: Berlin/Heidelberg, Germany, 2018; pp. 55–61. [Google Scholar] [CrossRef]
  4. Balogh, R.; Obdržálek, D. Using Finite State Machines in Introductory Robotics: Methods and Applications for Teaching and Learning; Springer: Berlin/Heidelberg, Germany, 2019; pp. 85–91. [Google Scholar] [CrossRef]
  5. Zhou, H.; Min, H.; Lin, Y.; Zhang, S. A Robot Architecture of Hierarchical Finite State Machine for Autonomous Mobile Manipulator; Springer: Berlin/Heidelberg, Germany, 2017; pp. 425–436. [Google Scholar] [CrossRef]
  6. Foukarakis, M.; Leonidis, A.; Antona, M.; Stephanidis, C. Combining Finite State Machine and Decision-Making Tools for Adaptable Robot Behavior. In Proceedings of the Universal Access in Human-Computer Interaction—Aging and Assistive Environments: 8th International Conference, UAHCI 2014, Heraklion, Crete, Greece, 22–27 June 2014. [Google Scholar]
  7. Herrero, H.; Outón, J.L.; Puerto, M.; Sallé, D.; López de Ipiña, K. Enhanced Flexibility and Reusability through State Machine-Based Architectures for Multisensor Intelligent Robotics. Sensors 2017, 17, 1249. [Google Scholar] [CrossRef] [PubMed]
  8. Herrero, H.; Outón, J.L.; Esnaola, U.; Sallé, D.; López de Ipiña, K. Development and evaluation of a Skill Based Architecture for applied industrial robotics. In Proceedings of the 2015 4th International Work Conference on Bioinspired Intelligence (IWOBI), San Sebastian, Spain, 10–12 June 2015; pp. 191–196. [Google Scholar] [CrossRef]
  9. Diab, M.; Pomarlan, M.; Beßler, D.; Akbari, A.; Rosell, J.; Bateman, J.; Beetz, M. SkillMaN—A skill-based robotic manipulation framework based on perception and reasoning. Robot. Auton. Syst. 2020, 134, 103653. [Google Scholar] [CrossRef]
  10. Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Wheeler, R.; Ng, A. ROS: An Open-Source Robot Operating System. In Proceedings of the ICRA Workshop on Open Source Software, Kobe, Japan, 12–17 May 2009; Volume 3. [Google Scholar]
  11. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [CrossRef]
  12. META. Meta-LLama-3-70B-Instruct. 2024. Available online: https://huggingface.co/meta-llama/Meta-Llama-3-70B (accessed on 4 September 2024).
  13. DEEPINFRA. Meta-LLama-3-70B-Instruct. 2024. Available online: https://deepinfra.com/meta-llama/Meta-Llama-3-70B-Instruct (accessed on 4 September 2024).
  14. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar] [CrossRef]
  15. Henshall, W. TIME: 4 Charts That Show Why AI Progress Is Unlikely to Slow Down. 2023. Available online: https://time.com/6300942/ai-progress-charts/ (accessed on 4 September 2024).
  16. Schäfer, M.; Nadi, S.; Eghbali, A.; Tip, F. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. arXiv 2023, arXiv:2302.06527. [Google Scholar] [CrossRef]
  17. D’Silva, V.; Kroening, D.; Weissenbacher, G. A Survey of Automated Techniques for Formal Software Verification. Trans. Comp.-Aided Des. Integ. Cir. Sys. 2008, 27, 1165–1178. [Google Scholar] [CrossRef]
  18. Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar] [CrossRef]
  19. Ye, W.; Ou, M.; Li, T.; Chen, Y.; Ma, X.; Yanggong, Y.; Wu, S.; Fu, J.; Chen, G.; Wang, H.; et al. Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility. arXiv 2023, arXiv:2305.10235. [Google Scholar] [CrossRef]
  20. Bisk, Y.; Zellers, R.; Bras, R.L.; Gao, J.; Choi, Y. PIQA: Reasoning about Physical Commonsense in Natural Language. arXiv 2019, arXiv:1911.11641. [Google Scholar] [CrossRef]
  21. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  22. Kocoń, J.; Cichecki, I.; Kaszyca, O.; Kochanek, M.; Szydło, D.; Baran, J.; Bielaniewicz, J.; Gruza, M.; Janz, A.; Kanclerz, K.; et al. ChatGPT: Jack of all trades, master of none. Inf. Fusion 2023, 99, 101861. [Google Scholar] [CrossRef]
  23. Yang, X.; Li, Y.; Zhang, X.; Chen, H.; Cheng, W. Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization. arXiv 2023, arXiv:2302.08081. [Google Scholar] [CrossRef]
  24. De Wynter, A.; Wang, X.; Sokolov, A.; Gu, Q.; Chen, S.Q. An evaluation on large language model outputs: Discourse and memorization. Nat. Lang. Process. J. 2023, 4, 100024. [Google Scholar] [CrossRef]
  25. Delobelle, P.; Tokpo, E.; Calders, T.; Berendt, B. Measuring Fairness with Biased Rulers: A Comparative Study on Bias Metrics for Pre-trained Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 15 July 2022; Carpuat, M., de Marneffe, M.C., Meza Ruiz, I.V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1693–1706. [Google Scholar] [CrossRef]
  26. Domnich, A.; Anbarjafari, G. Responsible AI: Gender bias assessment in emotion recognition. arXiv 2021, arXiv:2103.11436. [Google Scholar] [CrossRef]
  27. ESA. Guide to Software Verification and Validation; ESA PSS-05-10 Issue 1, Revision 1, March 1995; ESA: Paris, France, 1995. [Google Scholar]
  28. Peleska, J.; Haxthausen, A.; Schnieder, E.; Tarnai, G. Object Code Verification for Safety-Critical Railway Control Systems. In Proceedings of the Formal Methods for Automation and Safety in Railway and Automotive Systems (FORMS/FORMAT 2007), Braunschweig, Germany, 25–26 January 2007; pp. 184–199. [Google Scholar]
  29. Iyyer, M.; Wieting, J.; Gimpel, K.; Zettlemoyer, L. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. arXiv 2018, arXiv:1804.06059. [Google Scholar] [CrossRef]
  30. Alzantot, M.; Sharma, Y.; Elgohary, A.; Ho, B.J.; Srivastava, M.; Chang, K.W. Generating Natural Language Adversarial Examples. arXiv 2018, arXiv:1804.07998. [Google Scholar] [CrossRef]
  31. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. arXiv 2018, arXiv:1801.04354. [Google Scholar] [CrossRef]
  32. Zhao, Z.; Dua, D.; Singh, S. Generating Natural Adversarial Examples. arXiv 2018, arXiv:1710.11342. [Google Scholar] [CrossRef]
  33. Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Deitke, M.; Ehsani, K.; Gordon, D.; Zhu, Y.; et al. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv 2022, arXiv:1712.05474. [Google Scholar] [CrossRef]
  34. Zhang, H.; Li, L.H.; Meng, T.; Chang, K.W.; den Broeck, G.V. On the Paradox of Learning to Reason from Data. arXiv 2022, arXiv:2205.11502. [Google Scholar] [CrossRef]
  35. Wu, Z.; Qiu, L.; Ross, A.; Akyürek, E.; Chen, B.; Wang, B.; Kim, N.; Andreas, J.; Kim, Y. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. arXiv 2024, arXiv:2307.02477. [Google Scholar] [CrossRef]
  36. Wu, Z.; Bai, H.; Zhang, A.; Gu, J.; Vydiswaran, V.V.; Jaitly, N.; Zhang, Y. Divide-or-Conquer? Which Part Should You Distill Your LLM? arXiv 2024, arXiv:2402.15000. [Google Scholar] [CrossRef]
  37. Plaat, A.; Wong, A.; Verberne, S.; Broekens, J.; van Stein, N.; Back, T. Reasoning with Large Language Models, a Survey. arXiv 2024, arXiv:2407.11511. [Google Scholar] [CrossRef]
  38. Huang, J.; Chang, K.C.C. Towards Reasoning in Large Language Models: A Survey. arXiv 2023, arXiv:2212.10403. [Google Scholar] [CrossRef]
  39. Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; Chen, H. Reasoning with Language Model Prompting: A Survey. arXiv 2023, arXiv:2212.09597. [Google Scholar] [CrossRef]
  40. Levy, M.; Jacoby, A.; Goldberg, Y. Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of Large Language Models. arXiv 2024, arXiv:2402.14848. [Google Scholar] [CrossRef]
Figure 1. AI performance on benchmarks (relative to human performance) [15].
Figure 2. Steps in the proposed methodology, indicating relevant Large Language Model (LLM) roles.
Figure 3. Connection to iTHOR simulated environment.
Figure 4. iTHOR simulated kitchen.
Figure 5. Simulated kitchen showing element location.
Figure 6. Tomato slice on plate, as requested by the user.
Figure 7. Real (physical) test device and environment.
Figure 8. LLM reasoning degradation.
Table 1. Development time for a successful mission in the synthetic setup (put tomato in fridge) for users S1–S4 and for the Large Language Model (LLM).

Synthetic  | S1       | S2       | S3       | S4       | User Mean | LLM  | LLM Adv.
Generation | 14 m 2 s | 6 m 52 s | 4 m 27 s | 3 m 54 s | 7 m 18 s  | 14 s | 97%
Check      | 1 m 16 s | 1 m 6 s  | 56 s     | 28 s     | 26 s      | 12 s | 78%
Total      | 15 m 8 s | 8 m 8 s  | 5 m 23 s | 4 m 20 s | 8 m 14 s  | 26 s | 95%
Table 2. Development time for a successful mission in the physical setup (mount components).

Physical     | S1      | S2       | S3       | S4       | User Mean | LLM  | LLM Adv.
Generate FSM | 4 m 5 s | 2 m 2 s  | 1 m 20 s | 54 s     | 2 m 5 s   | 15 s | 88%
Check FSM    | 57 s    | 52 s     | 49 s     | 22 s     | 45 s      | 13 s | 71%
Total        | 5 m 2 s | 2 m 54 s | 2 m 9 s  | 1 m 16 s | 2 m 50 s  | 14 s | 92%
Table 3. Example requests run using the Generator LLM–Supervisory LLM methodology.

User Request | Actions in JSON | Total Tokens | Verification Loops | Successful Simulation
Open the fridge | 2 | 851 | 0 | Yes
Pick up the egg from the fridge | 5 | 948 | 0 | Yes
Put the tomato in the fridge. Make sure that in the end, the fridge is closed | 7 | 1047 | 0 | Yes
I want the robot to slice some tomato and then put it on a plate | 12 | 1107 | 1 | Yes
Slice the bread and make toast | 19 | 1237 | 1 | Yes
Prepare toast from the sliced bread and put that on a plate with a slice of the tomato on top of it | 24 | 1427 | 2 | Yes
Put the egg on the plate. Slice the tomato and put it onto the plate. Put the pan in the basin. Pick up the bread and put it onto the countertop | 23 | 1454 | 0 | Yes
Put the frying pan in the basin. Slice and toast some bread and put that on the plate with an egg. When you are finished with the knife, leave it in the basin | 26 | 1507 | 1 | Yes
Pick up the knife and slice the bread and the tomato. Then, toast the bread and put it onto the plate. Then, cook the tomato slice in the pan and put it onto the plate | 21 | 1820 | 2 | No
Pick up the knife and slice the bread and the tomato. Then, toast the bread, fry it, and put it onto the plate. Then, cook the tomato slice in the pan and put it onto the plate | 41 | 1677 | 2 | No
Prepare a fried egg using the pan and the stove, and then prepare toast from the sliced bread and put that (together with the egg and some sliced lettuce) on a plate | 30 | 1961 | 4 | No
