Article

Prompt-Driven and Kubernetes Error Report-Aware Container Orchestration

Niklas Beuter, André Drews and Nane Kratzke
Expert Group AI in Applications, Institute for Interactive Systems, Lübeck University of Applied Sciences, 23562 Lübeck, Germany
*
Author to whom correspondence should be addressed.
Future Internet 2025, 17(9), 416; https://doi.org/10.3390/fi17090416
Submission received: 5 August 2025 / Revised: 4 September 2025 / Accepted: 8 September 2025 / Published: 11 September 2025
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))

Abstract

Background: Container orchestration systems like Kubernetes rely heavily on declarative manifest files, which serve as orchestration blueprints. However, managing these manifest files is often complex and requires substantial DevOps expertise. Methodology: This study investigates the use of Large Language Models (LLMs) to automate the creation of Kubernetes manifest files from natural language specifications, utilizing prompt engineering techniques within an innovative error- and warning-report–aware refinement process. We assess the capabilities of these LLMs using Zero-Shot, Few-Shot, Prompt-Chaining, and Self-Refine methods to address DevOps needs and support fully automated deployment pipelines. Results: Our findings show that LLMs can generate Kubernetes manifests with varying levels of manual intervention. Notably, GPT-4 and GPT-3.5 demonstrate strong potential for deployment automation. Interestingly, smaller models sometimes outperform larger ones, challenging the assumption that larger models always yield better results. Conclusions: This research highlights the crucial impact of prompt engineering on LLM performance for Kubernetes tasks and recommends further exploration of prompt techniques and model comparisons, outlining a promising path for integrating LLMs into automated deployment workflows.

1. Introduction

In the dynamic field of cloud-native computing, Kubernetes has emerged as a crucial tool for transforming the deployment, scaling, and management of containerized applications. Despite its advantages, the intricate architecture of Kubernetes and similar platforms heavily relies on manifest files, which are used to define the desired operational state declaratively. These manifest files serve as essential blueprints for orchestrating containers; however, managing them can be challenging and often demands considerable expertise [1]. This situation highlights a potential for automation and optimization. We are particularly interested in how non-fine-tuned models can be used to generate declarative descriptions of the intended operational state of resources such as deployments, services, or pods that are to be provisioned and operated in a Kubernetes cluster. If this could be done in natural language, it would drastically reduce Kubernetes's somewhat steep learning curve for DevOps engineers.
Concurrently, there has been notable progress regarding Large Language Models (LLMs) [2,3], showcasing their ability to generate human-like text [4]. As these models advance in complexity and capability, they offer new possibilities for programming in high-level languages. A pertinent question arises: Can LLMs generate declarative deployment instructions for Kubernetes or similar systems [5]? If so, this could simplify the creation of Kubernetes manifests, making them more accessible to DevOps engineers and potentially reducing the need for deployment-specific languages [6].
An underexplored connection between Kubernetes and LLMs lies in prompt engineering. Utilizing LLMs in this way could address many of the challenges Kubernetes faces, particularly in manifest management. As demonstrated in [7], prompt engineering could transform cloud computing and Kubernetes management, leading to more intelligent and efficient systems.
Our earlier work [7] established that systematic refinement, prompt engineering, and prompt chaining could enhance the output of less powerful, smaller language models. However, it also revealed that unexpected ‘disasters’ could occur in such prompt chains during the refinement stage, degrading the manifest files despite well-intentioned refinement prompts. Notably, our study showed that the refinement stage produced worse outcomes than the initial zero-shot generation in nearly all instances. This shortcoming was primarily due to sequential refinement steps overwriting the outputs of preceding steps.
This paper explores ways to optimize this iterative refinement process by implementing tool-supported self-feedback, following the concepts proposed by [8]. Specifically, we use the error messages that the Kubernetes command-line tool kubectl reports when a manifest file is applied as feedback. For clarity, we also present the essential parts and results of our initial study [7] in this extended paper. The reader is referred to the original research [7] for further details.
Overall, our research contributes to the technical field of container orchestration and expands the growing body of knowledge on the practical applications of LLMs and prompt engineering in technology and cloud-native computing.

2. Background and Related Work

In Kubernetes, manifest files, typically written in YAML or JSON, define the desired state of operations, such as pods, services, and controllers. These files are essential for deploying and managing applications within Kubernetes. However, as systems scale, managing these files becomes increasingly difficult. Challenges include maintaining configuration consistency, updating features, and ensuring security compliance. The complexity is further amplified by the proliferation of microservices [9,10]. Integrating AI and machine learning, particularly through large language models, offers significant potential to improve the management and generation of these manifests. By automating tasks and optimizing configurations, these technologies promise to simplify management and enhance the efficiency and reliability of container orchestration.

2.1. Large Language Models

Large Language Models (LLMs), such as OpenAI’s GPT series, have significantly advanced natural language processing by understanding, generating, and manipulating written text. These models have evolved from simple origins to complex systems with impressive linguistic capabilities, transitioning from rule-based systems to neural network architectures that learn from vast datasets to produce contextually rich text [11]. Their expanding role in automation and data processing enables the automation of complex language tasks, including document summarization, code generation, language translation, and content creation [12]. LLMs can analyze text to extract insights and trends, supporting business and technology strategies. Their precise language processing capabilities hold promise in various domains such as healthcare, finance, customer service, and system management, including Kubernetes. In these areas, they can streamline tasks like manifest file generation, error diagnosis, and configuration optimization, thereby reducing manual work and enhancing efficiency.

2.2. Prompt Engineering

Training LLMs for domain-specific applications typically involves an extensive pre-training phase for general language comprehension, followed by a specialized fine-tuning phase. Recently, there has been a shift towards a “pre-train, prompt, predict” methodology, which reduces computational demands and utilizes specialized datasets through prompt engineering [13,14]. Prompt engineering entails crafting strategic inputs (prompts) to direct LLMs in producing the desired outputs. In the Kubernetes context, prompt engineering could markedly enhance LLMs’ capability to manage technical tasks, such as generating or optimizing manifest files, diagnosing deployment problems, and recommending configuration best practices without needing task-specific fine-tuning. While not extensively studied, prompt engineering presents a promising approach to making Kubernetes management more intuitive and efficient, potentially lowering technical barriers and improving system reliability.

2.3. Related Work

Current research on integrating LLMs with Kubernetes highlights several promising but limited approaches. Lanciano et al. propose utilizing specialized LLMs to analyze Kubernetes deployment files, assisting non-experts in design and quality assurance [15]. Xu et al. introduce CloudEval-YAML, a benchmark designed to evaluate LLMs in generating cloud-native application code, with a focus on YAML and a dataset supplemented by unit tests [16]. Komal et al. suggest a pipeline leveraging LLMs for anomaly detection and auto-remediation in microservices, aiming to improve system stability [17]. These methods generally depend on training specialized LLMs. In contrast, our research explores the use of standard LLMs combined with straightforward prompt engineering to automate Kubernetes configurations for security and compliance, setting it apart from the reliance on specialized models.
The effectiveness of refinement strategies in NLP is contingent upon the specific task and domain. While approaches like SR-NLE and SPEFT have shown promise in enhancing model performance [18], their success is not universal. Domain-specific challenges and the complexity of tasks necessitate tailored strategies to achieve optimal results.

3. Prompting Set-Up

A collection of base prompts is defined to represent different deployment scenarios in Kubernetes. These scenarios include the deployment of five commonly used services: MongoDB, Redis, PostgreSQL, MySQL, and NGINX. Each base prompt specifies the task of generating a single YAML file that contains all Kubernetes resources necessary for deploying the respective service. This formulation ensures that the language model is instructed to consolidate multiple Kubernetes objects—such as Deployments, Services, PersistentVolumeClaims, or NetworkPolicies—into one coherent manifest, thereby simplifying both generation and application. By diversifying the base prompts across multiple database engines and a web server, the prompt-generation script covers a representative range of cloud-native workloads that differ in their storage requirements, runtime characteristics, and networking needs. The complete prompts are applied uniformly to all LLMs.
In addition to these task-specific instructions, the script defines a set of common constraints that apply universally to all generated manifests. These constraints encapsulate the established best practices for secure, stable, and policy-compliant deployments in Kubernetes, specifically the following:
  • Deployment resources: Every manifest must include a Deployment object, ensuring that the service runs with managed pod replicas, automatic restarts, and version-controlled rollouts.
  • Persistent storage: If the service requires data persistence, a PersistentVolumeClaim (PVC) must be generated. This ensures that critical application data is not lost when pods are rescheduled.
  • Volume mounting: Any defined PersistentVolumeClaim must be properly mounted into the application container, thereby linking storage to the running service.
  • Security context: Containers must explicitly declare a securityContext with privileged set to false. This restriction reduces the risk of privilege escalation attacks and enforces least-privilege principles.
  • Resource management: Containers must declare both resource requests and limits. This guarantees predictable performance, prevents resource starvation, and improves cluster stability under load.
  • Service exposure: A Service object must be generated to expose the required network ports, thereby enabling communication between the deployed application and other workloads or clients.
  • Network policies: Finally, a NetworkPolicy must be defined to restrict incoming and outgoing traffic to only the namespaces that are strictly required. This constraint enforces a zero-trust networking model and limits the attack surface.
By combining base prompts that target different Kubernetes workloads with a uniform set of best-practice constraints, the script produces complete and standardized LLM input prompts. These prompts guide the model toward generating Kubernetes manifests that are not only syntactically correct but also aligned with operational, security, and resource management guidelines. The reader can find detailed information in Appendix A, which shows which prompt templates were used to generate the initial Kubernetes manifests (see Appendix A.1), how these were enhanced with best practices (see Appendix A.2), and how automated error correction was added (see Appendix A.3).
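To illustrate how base prompts and the common constraints are combined into standardized LLM input prompts, the following is a minimal Python sketch. The base prompt wording, the constraint phrasing, and the helper names are illustrative assumptions and do not reproduce the exact script used in this study (see Appendix A for the actual prompt templates).

```python
# Minimal sketch: assembling standardized input prompts from base prompts and
# common best-practice constraints (illustrative, not the study's exact script).

SERVICES = ["MongoDB", "Redis", "PostgreSQL", "MySQL", "NGINX"]

BASE_PROMPT = (
    "Create a single YAML file containing all Kubernetes resources "
    "required to deploy {product}."
)

COMMON_CONSTRAINTS = [
    "Include a Deployment resource for the service.",
    "If the service requires data persistence, generate a PersistentVolumeClaim.",
    "Mount any defined PersistentVolumeClaim into the application container.",
    "Set securityContext.privileged to false for all containers.",
    "Declare resource requests and limits for all containers.",
    "Generate a Service object exposing the required network ports.",
    "Define a NetworkPolicy restricting traffic to the strictly required namespaces.",
]

def build_prompt(product: str) -> str:
    """Combine the base prompt for a service with the uniform constraint list."""
    constraints = "\n".join(f"- {c}" for c in COMMON_CONSTRAINTS)
    return (
        f"{BASE_PROMPT.format(product=product)}\n\n"
        f"Observe the following constraints:\n{constraints}"
    )

if __name__ == "__main__":
    for service in SERVICES:
        print(build_prompt(service))
        print("-" * 60)
```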

4. Methodology

Advanced prompt engineering can guide LLMs to more effectively understand the intricacies of Kubernetes manifests, ensuring best practices in container security and operations. This research intends to connect the advanced language capabilities of LLMs with the technical requirements of Kubernetes management, aiming to improve DevOps efficiency and security in Kubernetes operations. Our objective is to utilize the inherent knowledge base of these LLMs [11] to create accurate Kubernetes configurations. We explored different LLMs and prompt engineering techniques to assess their suitability for this task. We aimed to leverage standard LLMs without specific fine-tuning.
Our approach to the analysis was divided into two phases (see Figure 1). In the first applicability analysis phase [7] (see Section 4.1), we analyzed the general applicability of LLMs for generating Kubernetes manifest files. We were interested in how well LLMs can generate Kubernetes manifest files (zero-shot capability) and whether this generation could be optimized with simple examples (few-shot capability). In the first phase, we were also interested in whether it made sense to refine more complex deployment scenarios step-by-step or generate them all at once. The first phase was heavily manual, particularly in the evaluation stage, and depended on DevOps experts and their assessment of the generated results. Our aim was to determine the strengths and limitations of LLMs for the specific purpose and to obtain a more reliable assessment of whether LLMs are suitable for container orchestration and DevOps use cases.
In the second optimization and automation phase (see Section 4.2), we addressed the weaknesses recognized in the first phase. These related, in particular, to increasing the automated evaluation of the generated manifest files and optimizing the iterative refinement process, which proved to be a weak point in the first phase.

4.1. Phase 1: Feasibility Analysis

Although prompt engineering is still a very young and dynamic field, several distinct prompting techniques can be derived from existing prompt engineering overviews [13]. The following methods seem very promising given the current state of knowledge and were used to derive our research questions.
LLMs are tuned to follow instructions and are pre-trained on large amounts of data, which enables them to perform some tasks out of the box (zero-shot). For example, an LLM can generate text from a single prompt without any examples given as input. This works astonishingly well for simple tasks like categorization [19].
We addressed the following research questions (RQs) in this phase:
  • RQ 1 (Zero-Shot Capability):
    We aim to determine how well LLMs can generate Kubernetes manifests out-of-the-box.
Although LLMs demonstrate remarkable zero-shot capabilities, they fall short on more complex tasks when using the zero-shot setting. In these cases, prompting can enable in-context learning, where we provide a guess of expected output text within a prompt—so-called demonstrations (e.g., Kubernetes manifest files)—to steer the model to better performance. These demonstrations serve as conditioning for subsequent examples where we induce the model to generate a response. According to [20], few-shot prompting requires models of sufficient size [21].
  • RQ 2 (Few-Shot Capability):
    We are therefore interested in examining whether larger LLMs produce better results in few-shot settings.
To enhance the performance and reliability of LLMs, an essential prompt engineering technique involves breaking down complex tasks into smaller, manageable subtasks. This approach starts by prompting the LLM with one subtask at a time. The response generated for each subtask then becomes the input for the next prompt in the sequence. This method of sequentially linking prompts allows the LLM to tackle complex tasks that might be challenging to address in a single, comprehensive prompt. Prompt chaining not only improves the LLM's ability to handle intricate tasks, but also increases the transparency and controllability of LLM applications. This approach makes debugging and analyzing the model's responses at each stage easier, facilitating targeted improvements where needed. A frequently used framework in this context is LangChain [22].
  • RQ 3 (Prompt-Chaining Capability):
    We aim to determine whether Kubernetes manifests can be gradually refined with prompt chaining in order to add capabilities that LLMs do not “retrieve from their memory” by default in zero-shot settings.
The techniques mentioned above appear to be the most promising for an initial explorative analysis based on the current state of knowledge. Nevertheless, techniques such as Chain-of-Thought [23,24], Self-Consistency [25], Generated Knowledge Prompting [26], Tree of Thoughts [27,28], Automatic Reasoning and Tool-use [29], Program-Aided Language Models [30], ReACT Prompting [31], and Retrieval Augmented Generation [32] should also be investigated in a systematic screening in the future. In particular, we found that problems arose in the refinement stage of our approach. We therefore investigated a prompting strategy called Iterative Refinement with Self-Feedback [8] in the second phase of our study.
We iteratively feed LLM-generated manifest files and any resulting error messages back into a Kubernetes command-line client until no error messages are returned. If no error messages are produced in the first loop cycle, the loop terminates immediately with a working file. If errors still occur when the upper limit of iteration cycles is reached, the LLM is considered unable to generate a working file.
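A minimal sketch of this error-report-aware loop is shown below. It assumes a generic `generate` callable that wraps an LLM, uses `kubectl apply --dry-run=server` via `subprocess` as the validation tool, and caps the number of iteration cycles (ten in our experiments, see Appendix A.3). The prompt wording and function names are illustrative assumptions rather than the exact implementation.

```python
import subprocess

MAX_ITERATIONS = 10  # upper limit of iteration cycles (as used in our experiments)

def kubectl_dry_run(manifest: str) -> tuple[bool, str]:
    """Validate a manifest against the cluster with a server-side dry run."""
    result = subprocess.run(
        ["kubectl", "apply", "--dry-run=server", "-f", "-"],
        input=manifest, capture_output=True, text=True,
    )
    return result.returncode == 0, result.stderr

def refine_until_deployable(generate, initial_prompt: str) -> str | None:
    """Iteratively feed kubectl error reports back into the LLM (illustrative sketch)."""
    manifest = generate(initial_prompt)
    for _ in range(MAX_ITERATIONS):
        ok, errors = kubectl_dry_run(manifest)
        if ok:
            return manifest  # no error messages: working file
        correction_prompt = (
            "The following Kubernetes manifest was rejected:\n"
            f"{manifest}\n\nkubectl reported these errors:\n{errors}\n\n"
            "Return a corrected version of the complete manifest."
        )
        manifest = generate(correction_prompt)
    return None  # the LLM could not produce a working file within the iteration limit
```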

4.1.1. Analyzed Use Case (NoSQL DB) Considering Real-World Constraints

We examine basic prompt engineering methods like zero-/few-shot and Prompt-Chaining to assess if non-fine-tuned LLMs (e.g., GPT-3.5, GPT-4, Llama2/3, Mistral) can efficiently generate Kubernetes manifest files. Our goal is to determine the effectiveness of these LLMs and to identify which prompt engineering techniques, as discussed in Section 4.1, are most effective for designing and optimizing manifest generation.
Our exploratory approach centers on deploying and operating a NoSQL database (such as MongoDB or similar systems), which is commonly used for web applications within Kubernetes. Although this may not seem like a particularly complex use case, it allows us to look at all relevant aspects and cross-resource relationships between concepts such as Ingress, Service, Deployment, StatefulSet, and PersistentVolumeClaim as the general components of a Kubernetes manifest file. We also consider security and operational aspects (such as security contexts, network policies, and the avoidance of resource monopolization), which are often not included in the standard Kubernetes examples found on the web that were presumably used to train the language models. For such real-world constraints, we follow the recommendations of the ‘Kubernetes Hardening Guide’ [33]:
  • The database or application containers should not run with elevated privileges (securityContext.privileged: false).
  • The database/application should be accessible only within its namespace, necessitating the correct generation of a NetworkPolicy.
  • The database/application containers should not monopolize resources, requiring the generation of memory and CPU resource limits.
Furthermore, we expect the LLM to derive the necessary manifests even if they are not explicitly requested in the prompt. An experienced DevOps engineer would develop the following manifests for the setting described above. We use this DevOps experience as a benchmark for our expectations of the LLMs:
  • Deployment (or StatefulSet including a PersistentVolumeClaimTemplate).
  • Correct Volume mounts in Deployment/StatefulSets.
  • PersistentVolumeClaim (unless the LLM opts for a StatefulSet).
  • Service.
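Such expectations and constraints can also be checked programmatically. The following is a minimal sketch, assuming PyYAML and a manifest provided as a multi-document YAML string, of how the checks listed above could be automated; it is an illustration and not the exact evaluation tooling used in this study.

```python
import yaml  # PyYAML

def check_expectations(manifest_yaml: str) -> dict[str, bool]:
    """Check a multi-document manifest against the expectations listed above (illustrative)."""
    docs = [d for d in yaml.safe_load_all(manifest_yaml) if isinstance(d, dict)]
    kinds = {d.get("kind") for d in docs}

    # Workload: either a Deployment or a StatefulSet (possibly with volumeClaimTemplates).
    workload = next((d for d in docs if d.get("kind") in ("Deployment", "StatefulSet")), None)
    containers, claim_templates = [], []
    if workload:
        pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
        containers = pod_spec.get("containers", [])
        claim_templates = workload.get("spec", {}).get("volumeClaimTemplates", [])

    return {
        "workload_present": workload is not None,
        "persistent_storage": "PersistentVolumeClaim" in kinds or bool(claim_templates),
        "volumes_mounted": any(c.get("volumeMounts") for c in containers),
        "not_privileged": bool(containers) and all(
            c.get("securityContext", {}).get("privileged") is False for c in containers
        ),
        "resources_limited": bool(containers) and all(
            "limits" in c.get("resources", {}) and "requests" in c.get("resources", {})
            for c in containers
        ),
        "service_present": "Service" in kinds,
        "network_policy_present": "NetworkPolicy" in kinds,
    }
```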

4.1.2. Generation and Evaluation Strategy

Our evaluation utilized a prompt chain (as depicted in Figure 2) that began with a zero-shot prompt to generate initial manifests. This was followed by a second stage of iterative refinement to ensure that operational constraints were met, using a specific check-and-refinement prompt template. The following check-and-refinement prompts (slightly shortened for presentation) were applied in the refinement stage in this sequence:
  • Verify that a Deployment manifest has been generated for the database.
  • Verify that a PersistentVolumeClaim manifest has been generated for the database.
  • Ensure that the PersistentVolumeClaim manifest is mounted within the database container.
  • Ensure that the container’s securityContext sets privileged to false.
  • Ensure that the containers have appropriate resource/limit settings.
  • Verify that a service manifest addressing the database port has been generated.
  • Ensure that a Network Policy restricts database port access within the namespace.
The resulting manifest files from the draft and refinement stages were analyzed by Kubernetes experts and tools (kubectl apply --dry-run) to assess whether the generated manifests adequately described the situation and were valid and deployable on Kubernetes (kubectl apply). A DevOps expert identified and corrected errors found with the tool, making the minimum necessary changes to achieve a deployable result. In the second phase of this research, we automated these manual analysis steps to increase assessment objectivity and evaluate larger deployments and datasets. However, this semi-automated approach was adequate for our initial analysis to derive a research position and direction. This evaluation was conducted for the following manifest generation strategies, based on Figure 2:
  • Zero-Shot: The prompt did not explicitly specify the constraints to be met. Consequently, the refinement stage depicted in Figure 2 was not executed.
  • Zero-Shot + Constraints: The prompt explicitly specified the constraints to be met. However, no incremental refinement was carried out for each constraint individually. Therefore, the refinement stage shown in Figure 2 was not executed in this case either.
  • Few-Shot + Refinement: The prompt did not specify the constraints to be met. However, the draft stage results were explicitly refined iteratively for each constraint during the refinement stage illustrated in Figure 2.
The main difference between Zero-Shot+Constraints and Few-Shot+Refinement is that, in the former, an LLM must consider all constraints simultaneously, while in the latter, it can process and improve upon each constraint one at a time.
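To make the chaining mechanics of the refinement stage concrete, the following sketch applies the check-and-refinement prompts listed above one at a time, with the output of each step becoming the input of the next. The `llm` callable and the surrounding prompt wording are illustrative assumptions rather than our exact LangChain implementation.

```python
# Refinement stage of the prompt chain in Figure 2 (illustrative sketch).
REFINEMENT_PROMPTS = [
    "Verify that a Deployment manifest has been generated for the database.",
    "Verify that a PersistentVolumeClaim manifest has been generated for the database.",
    "Ensure that the PersistentVolumeClaim manifest is mounted within the database container.",
    "Ensure that the container's securityContext sets privileged to false.",
    "Ensure that the containers have appropriate resource/limit settings.",
    "Verify that a service manifest addressing the database port has been generated.",
    "Ensure that a Network Policy restricts database port access within the namespace.",
]

def refine_by_chaining(llm, draft_manifest: str) -> str:
    """Apply each check-and-refinement prompt in sequence to the draft manifest."""
    manifest = draft_manifest
    for check in REFINEMENT_PROMPTS:
        prompt = (
            f"{check}\n\nIf this is not fulfilled, adjust the manifest accordingly "
            f"and return the complete, updated manifest:\n\n{manifest}"
        )
        manifest = llm(prompt)  # the output of one step becomes the input of the next
    return manifest
```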

4.2. Phase 2: Optimization and Automation

We found that, in phase 1, the Few-Shot+Refinement approach in particular led to significant losses (or, at best, showed no significant effect) and was therefore not worth the runtime and token expenditure involved. This surprised us, as we had high expectations for this approach in particular. In the second phase, we therefore extended this approach to tool-based refinement, using the prompt chain displayed in Figure 2 with a tool-based self-refinement stage. This approach, shown in Figure 3, has the advantage that the generated manifest files can be checked automatically by a tool, and the result can, in turn, influence the generation process. To check the manifest files, we utilized a Kubernetes Python package that offers a function (named kubectl) to load the manifest files onto Kubernetes with a dry run. The returned message delivers feedback on the success or failure status of the call.
The concept of “self-refinement” refers to an iterative approach in which a large language model (LLM) generates an initial output and then uses its own feedback, or feedback generated externally (e.g., by a compiler or a similar tool), to progressively improve this output [8]. This process, termed “SELF-REFINE,” involves two main steps: feedback and refinement. The model generates an output and then provides feedback on it, identifying areas needing improvement. The model then uses this feedback to refine the output, and the cycle is repeated until the desired quality is achieved. This method does not require supervised training data or additional training, but instead leverages the model’s existing capabilities to enhance its performance across diverse tasks [8].
The concept was introduced by [8], and the authors report that a “SELF-REFINE” approach leads to significant improvements across various tasks. For instance, when applied to the GPT-4 model, SELF-REFINE resulted in an 8.7% absolute increase in code optimization performance. The most significant gains were observed in preference-based tasks such as Dialogue Response Generation, where the GPT-4 model’s preference score improved by 49.2%.
This led us to the following research question, further analyzed in phase 2 (and this paper):
  • RQ 4 (Self-Refinement Capability):
    Is it possible to increase the quality of generated container orchestration manifest files with tool-based self-refinement?
Recent advancements in generative models and AI-powered development tools suggest that it is feasible for such systems not only to generate initial manifest files but also to iteratively refine them toward higher quality by applying best practices, correcting errors, and incorporating feedback. However, the actual effectiveness of tool-based self-refinement, where the generation tool itself revisits and improves its previous outputs without direct human intervention, remains largely unexplored. We were interested in whether self-refinement mechanisms within generation tools truly enhance the quality of orchestration manifests. By investigating this question, we aim to assess the tangible benefits and limitations of automated self-refinement in practice, understand where such approaches succeed or fall short, and provide insights for tool designers and practitioners seeking more reliable manifest generation solutions.

5. Results

Table 1 shows the language models that were selected for evaluation based on their current popularity (OpenAI) or reported performance for self-hosting (Llama2/3, Mistral) at the time of each phase.
All self-hosted machine learning models were run via HuggingFace’s Text Generation Inference Interface, enabling AWQ quantization [39]. We worked with the non-fine-tuned base models from HuggingFace, except for the Mistral model. For Mistral, we specifically used a model fine-tuned for coding assistance to more effectively evaluate the potential effects of fine-tuning. The models were used programmatically with the LangChain library (https://pypi.org/project/langchain, accessed on 7 September 2025) and OpenAI (https://pypi.org/project/langchain-openai, accessed on 7 September 2025) or the Text Generation Inference Interface from HuggingFace (https://pypi.org/project/text-generation, accessed on 7 September 2025). We used LangChain’s default values and set the temperature parameter to 0.
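For illustration, the following sketch shows how both kinds of models can be addressed programmatically with the libraries mentioned above. The model identifier, the endpoint URL, and the decoding parameters are placeholders or assumptions, not the exact configuration of our experiments.

```python
from langchain_openai import ChatOpenAI  # OpenAI models via LangChain
from text_generation import Client       # HuggingFace Text Generation Inference client

# Managed OpenAI model with deterministic decoding (temperature 0).
gpt4 = ChatOpenAI(model="gpt-4", temperature=0)

# Self-hosted model served by a Text Generation Inference endpoint (placeholder URL).
tgi = Client("http://localhost:8080")

def generate_openai(prompt: str) -> str:
    """Generate a completion with the managed OpenAI model."""
    return gpt4.invoke(prompt).content

def generate_tgi(prompt: str) -> str:
    """Generate a completion with a self-hosted TGI model (greedy decoding)."""
    return tgi.generate(prompt, max_new_tokens=2048, do_sample=False).generated_text
```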
We did not conduct a detailed cost, token consumption, or runtime analysis, as this was a feasibility study. Nevertheless, the incurred costs and inference times remained within acceptable limits. From a computational cost perspective, generating each manifest file and obtaining the Kubernetes response takes approximately a few seconds for all LLMs; depending on the loop depth, it may take a few tens of seconds. Self-hosted LLMs, such as the Llama models, ran on NVIDIA A100 (VRAM: 40 GB) and A6000 (VRAM: 48 GB) machines.

5.1. Explanation of Phase 1 Results

The key questions of phase 1 were which strategy best fulfills all of the required constraints, whether there are differences between the LLMs, and whether the approach works at all. The results are displayed in Figure 4. All models succeeded in generating a functional deployment, but their adherence to operational constraints varied.
Fulfilment was evaluated on a scale from 0.0 (no requirements met) to 1.0 (all requirements met), where a score of 1.0 indicates the potential for a fully automatic, error-free deployment in Kubernetes. Scores below 1.0 necessitated manual corrections, detailed in the original study [7].
GPT-4 and GPT-3.5 achieved the highest fulfilment scores, demonstrating their capability for fully automatic deployment. The free models, Llama2 and Mistral, had lower fulfilment levels, with simpler zero-shot approaches outperforming iterative refinement strategies. Interestingly, the smaller 7B Llama2 model performed as well as or slightly better than the 13B version. The smallest Mistral model outperformed Llama2 in zero-shot tasks, but not when operational constraints were included in the prompt.
This remains an empirical observation, since the exact training procedures of the foundation models are unknown. Different factors might cause this behavior, e.g., a different proportion of Kubernetes manifest files in the training data of the different models.

5.2. Explanation of Phase 2 Results

Based on the results from phase 1, it is evident that all models require some degree of refinement or correction for many of the generated manifest files. So far, we had focused on a NoSQL use case for manifest file creation. In phase 2, we expanded the scope to five use cases (NoSQL, Redis, PostgreSQL, MySQL, NGINX) to validate the results across a broader range of scenarios. As shown in Figure 3, the tool-based correction iteration is intended to effectively support the automated generation of correct manifest files.
Of the 30 generated use cases, 20 succeeded immediately with the large language models (LLMs). Our focus therefore shifts to the remaining ten cases, where corrections were needed to ensure that the manifest files ran successfully. Figure 5 illustrates the number of corrected versus uncorrectable manifest files for each LLM. Notably, all of the models could repair at least five of these cases, two models successfully handled six cases, and one even handled seven. LLaMA3 70B and GPT-4(o) stood out as the only models capable of restoring all but two or three of the original manifest files. Overall, phase 2 resolved eight out of ten failed cases without user intervention, leaving just two unresolved (see Figure 6). The LLaMA3 70B model achieved the best performance, with a standard deviation of only 3.39 iterations across all manifest files.
The comparison plot in Figure 6 illustrates the performance of the various LLMs in generating manifest files for different use cases and the iterations required to repair them if the initial version fails. The X-axis represents the initially failing use cases. The Y-axis displays the iterations each model requires to achieve a successful run. A value of ‘−1’ indicates that at least one manifest file was not successfully repaired after all iterations for that specific use case. For the other cases, the Y-axis shows the number of iterations required to get the use case running.
The variation in use cases revealed that certain failures could not be resolved, even with the automated approach. These failures typically involved missing information, such as the correct namespace or values represented by placeholders, necessitating manual intervention by the user or more initial information provided for the LLM (e.g., the namespace).

5.3. Discussion of Results

So, what conclusions can be drawn?
Research Question 1 (RQ1): 
How well can LLMs generate Kubernetes manifests out-of-the-box?
All LLMs successfully generated Kubernetes manifest files, correctly recognizing the semantic relationships between components such as Deployment, PersistentVolumeClaim, Service, and NetworkPolicy. However, most cases required some manual adjustments. The two commercial GPT models showed the potential for fully automated database deployment without user intervention. Coding-optimized models like Mistral performed better on simple zero-shot prompts than Llama2, but did not surpass the GPT models. Further investigation is needed to assess the generalizability of these findings.
RQ2: 
Do larger LLMs generate better results in zero-/few-shot settings?
Commercial LLMs like GPT-4 and GPT-3.5 outperform free models such as Llama2 and Mistral, although larger models (13B) do not necessarily produce better results than smaller ones (7B). Our results indicate that the quality of results depends on both model size and training data; for example, Mistral, optimized for coding tasks, performs better in zero-shot tasks than Llama2. Similarly, the superior performance of GPT-3.5 and GPT-4 is likely due to more extensive training data. Prompt engineering is crucial, as proper techniques can enable free models to nearly match GPT-4’s performance, suggesting that further research should explore the role of prompt engineering in enhancing LLM outcomes.
RQ3: 
Is it worthwhile to gradually refine Kubernetes manifests with prompt chaining?
Our initial hypothesis was that iterative refinement would improve the quality of Kubernetes manifests across various LLMs by addressing specific optimization aspects. Contrary to expectations, the results varied significantly. Incremental refinement showed minimal benefits for commercial models like GPT-4 and GPT-3.5, which already performed well with basic prompt engineering. Conversely, this approach negatively impacted the performance of free models like Llama2 and Mistral, possibly due to the overwriting of earlier optimizations over seven iterations. This suggests that the predefined structure of manifest files limits the effectiveness of iterative refinements due to their low complexity. Interestingly, the final refinement stage focusing on SecurityPolicy yielded effective policies, raising questions about the optimality of the iterative strategy and the organization of result integration. This discrepancy highlights a potential area for future research, particularly the impact of increasing complexity on the effectiveness of refinement prompts. These findings can be used to optimize the refinement stage, ensuring that later steps do not “overwrite” previous ones.
RQ4: 
Is it possible to increase the quality of generated container orchestration manifest files with tool-based self-refinement?
The integration of tools and the iterative process of automatic correction validate our assumption, as many syntactic errors and missing attributes in the manifest files were successfully addressed. However, the overall iterative process proved to be more complex than initially anticipated. Since large language models (LLMs) generate responses based on statistical patterns, we observed significant variation in their iterative outputs. This required extensive parsing, exception handling, and prompt engineering to manage the wide range of responses produced by different LLMs. Notably, LLMs often included extraneous information around or even within the manifest files, such as Python comments, necessitating the extraction of the improved manifest file from the output. In some cases, the model’s response did not contain a corrected manifest file at all, which also had to be detected and accounted for during the iteration process.
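As an example of the parsing this required, the following is a minimal sketch of a helper that extracts a Kubernetes manifest from a raw LLM response, handling fenced code blocks, surrounding chatter, and responses that contain no manifest at all. The heuristics shown are illustrative and simpler than the full parsing and exception handling used in our pipeline.

```python
import re
import yaml  # PyYAML

FENCE_RE = re.compile(r"```(?:ya?ml)?\s*\n(.*?)```", re.DOTALL)

def extract_manifest(response: str) -> str | None:
    """Pull a Kubernetes manifest out of an LLM response, or return None if absent."""
    # Prefer fenced code blocks; otherwise fall back to the whole response.
    match = FENCE_RE.search(response)
    candidate = match.group(1) if match else response

    # Drop chatty lines before the first recognizable YAML document.
    lines = candidate.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(("apiVersion:", "kind:", "---")):
            candidate = "\n".join(lines[i:])
            break
    else:
        return None  # no recognizable manifest in the response

    try:  # reject candidates that are not even parseable YAML
        docs = [d for d in yaml.safe_load_all(candidate) if isinstance(d, dict)]
    except yaml.YAMLError:
        return None
    return candidate if any("kind" in d for d in docs) else None
```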
Interestingly, we found that, while some models were unable to correct the manifest file in a single iteration, they succeeded after several iterations. This indicates that, even when the model fails to provide the correct improvement initially, it may recover in subsequent iterations.
Errors in the automated generation of small Kubernetes YAML files by LLMs can be attributed to several factors. Model limitations play a key role: limited context length can prevent proper handling of dependencies between resources, and the model may occasionally “hallucinate” fields or parameters (e.g., apiVersion or kind). Data-related causes stem from training on incomplete, outdated, or unbalanced examples across different Kubernetes APIs and YAML structures. In particular, cases such as security contexts, which are not found in standard tutorials on the Internet, were often overlooked by the models in the initial generation iterations because they occur less frequently in the training data. Additionally, generative randomness affects outputs: probabilistic token generation can produce different YAML files, even for the same prompt, particularly at higher temperature settings. Typically, such cases are addressed by applying more code- or configuration-specialized models, such as OpenAI Codex or Code Llama, which achieve more reliable results in benchmarks like HumanEval or MBPP, highlighting the importance of specialized training data [40]. However, we were able to demonstrate that even complex requirements can be generated by standard models and iterative refinements. It is only necessary to specify boundary conditions in prompts that can be derived from best practices, such as the NSA ’Kubernetes Hardening Guide’.
Once the setup was established, we maintained consistent general prompts, and no manual intervention was required, aside from the automatic inclusion of dynamically generated failure information. This suggests that integrating tools with LLMs offers a promising approach to automating pipelines.
For future projects, our experience highlights that applying LLMs to product-centric tasks requires a deeper understanding and fine-tuning of structured outputs. This will help minimize the effort needed later to parse the LLM-generated content for the desired manifest file.

5.4. Limitations to Consider

This study investigates prompt engineering in Kubernetes, specifically assessing large language models’ (LLMs) capabilities in expressing Kubernetes operational states via YAML. Our findings, which focus on single-component and typical database deployments, are preliminary and context-specific, cautioning against their generalization for broader LLM performance assessments. Acknowledging the exploratory nature of our work, we emphasize its role in laying foundational knowledge for future, more complex studies. Although our initial research aligns with the existing literature, highlighting the utility of LLMs in DevOps, it deliberately avoids the challenges of multi-service, interconnected deployments to ensure a solid baseline for subsequent investigation. Our phased research approach is designed to enhance our systematic understanding of LLMs and Kubernetes deployments, setting the stage for a comprehensive exploration of these technologies’ interplay in future studies.

6. Conclusions and Outlook

This extended study reinforces the potential of large language models (LLMs), such as GPT-4 and GPT-3.5, for automating Kubernetes deployments by generating manifest files from natural language inputs. Notably, performance did not always correlate with model size; smaller models like LLaMA2 and Mistral 7B sometimes outperformed larger ones, highlighting the importance of optimization and prompt-engineering strategies.
We introduced a tool-based self-refinement approach to address limitations observed in our initial study, where the refinement stage often failed to yield significant improvements. This enhanced iterative process employed a Python-based Kubernetes toolchain to automatically validate and refine manifest files through feedback from dry-run outputs. Results show that this approach significantly improved manifest file accuracy by automating much of the error detection and correction, particularly in more complex deployment scenarios. However, the process proved more complex than anticipated due to variations in LLM outputs, necessitating robust parsing, exception handling, and prompt engineering.
This study found that, while many errors were successfully corrected through multiple iterations, some, particularly those involving missing information, still required manual intervention. However, using a standardized prompt chain with integrated tool feedback proved effective across various and typical deployment scenarios (NoSQL, Redis, PostgreSQL, MySQL, NGINX).
These findings confirm that, while tool-based self-refinement strategies improve the quality of generated manifest files, challenges remain. These include handling the variability of LLM outputs and integrating models into deployment pipelines. Future research should prioritize optimizing these self-refinement techniques and refining prompt strategies to accommodate more deployment scenarios.
Our findings suggest that, with the right strategies and tool integrations, LLMs can significantly enhance automated deployment pipelines. However, it is important to note that achieving this will require the ongoing optimization of their interactions with automated tools. Nevertheless, advancing LLM capabilities will likely enhance automated deployment workflows, potentially reshaping traditional DevOps roles.

Key Take-Aways

  • LLMs can generate Kubernetes manifests out-of-the-box, but most outputs need refinement or corrections.
  • GPT-4 and GPT-3.5 performed best, sometimes enabling fully automated deployments.
  • Open-source models (LLaMA2/3, Mistral) worked, but had lower success rates; performance did not always scale with model size.
  • Prompt-only iterative refinement was ineffective: it sometimes worsened results for smaller models.
  • Tool-based self-refinement (using Kubernetes dry-runs) greatly improved outcomes, fixing most failed cases automatically.
  • Some failures remained unrecoverable without human input, usually due to missing contextual info (e.g., namespaces).
  • Best practice: Combine LLMs with automated validation tools for reliable DevOps automation.

Author Contributions

Conceptualization, N.K.; methodology, N.K. and A.D.; software, N.B. and A.D.; validation, N.B. and A.D.; formal analysis, A.D.; investigation, N.B. and A.D.; resources, N.K. and N.B.; data curation, A.D.; writing—original draft preparation, N.K.; writing—review and editing, A.D. and N.B.; visualization, N.K. and N.B.; supervision, N.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Acknowledgments

We thank Ralph Hänsel and Christian Töbermann for providing GPU resources via JupyterHub and Jonas Flodin, Max Sternitzke, and Patrick Willnow for managing JupyterHub and our Kubernetes infrastructure. Only with your support and expertise was this study possible.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Here, we explain the prompting and the prompt templates in more detail.

Appendix A.1. Zero-Shot Prompt

Zero-shot prompting means that an AI model solves a new task solely through a clear prompt, without receiving any prior examples or explanations of the task. In the case of our study, this means, for example, that one gives the model a simple instruction such as the following prompt:
  • Create required manifests to deploy a MongoDB database in Kubernetes.
and “hopes” that a valid Kubernetes manifest file is created. In fact, GPT-4 generated the example shown in Listing A1:
Listing A1. Generated example MongoDB Deployment in Kubernetes.
Of course, we can do this for any product like NGINX, MySQL, …, and end up with the following prompt template, where {{PRODUCT}} denotes the respective tool, framework, database, or whatever is to be deployed:
  • Create required manifests to deploy a {{PRODUCT}} in Kubernetes.
The outcome presented above for MongoDB was only slightly modified for clarity, demonstrating the language model’s ability to generate a complete deployment manifest using a basic zero-shot approach, including a PersistentVolumeClaim and a Service manifest (the latter not shown for the sake of clarity). This indicates that certain LLMs can independently create valid Kubernetes manifest files without requiring specialized tuning. However, whether deployment manifests, persistent volume claims, and service manifests are not only syntactically correct, but also logically compatible is another matter entirely. Investigating this question was the focus of the present study. Our evaluation employed a prompt chain (as illustrated in Figure 2) that began with a zero-shot prompt to create such initial manifests. This was followed by a second stage of iterative refinement to ensure that operational constraints were met, using a specific check-and-refinement prompt template.
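Programmatically, this zero-shot template can be rendered and sent to a model as in the following small sketch; the substitution of {{PRODUCT}} is shown literally, and the `llm` callable stands for any of the models listed in Table 1.

```python
ZERO_SHOT_TEMPLATE = "Create required manifests to deploy a {{PRODUCT}} in Kubernetes."

def zero_shot_prompt(product: str) -> str:
    """Render the zero-shot template for a concrete product."""
    return ZERO_SHOT_TEMPLATE.replace("{{PRODUCT}}", product)

# Example usage in the draft stage of the prompt chain in Figure 2 (illustrative):
# manifest_draft = llm(zero_shot_prompt("MongoDB database"))
```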

Appendix A.2. Refinement Prompting

These zero-shot suggestions (see Appendix A.1) are usually syntactically correct and often even work out of the box, but they frequently do not take real-world constraints into account. For such real-world constraints, we have based our study on the recommendations of the ‘Kubernetes Hardening Guide’ [33] and refined the set of generated Kubernetes manifest files with the following prompt (Listing A2):
Listing A2. Refinement Prompt-Template.

Appendix A.3. Error Correction Prompting

These generated and refined Kubernetes manifests (see Appendix A.2) took into account the constraints specified in the refinement prompting, such as best practices from the ‘Kubernetes Hardening Guide’ [33], and usually extended the original manifest files with appropriate settings. Whether they were syntactically correct was then checked with a dry run using kubectl. Of course, errors can occur in this process, leading to rejection by the Kubernetes cluster (e.g., due to syntax errors, incorrectly set parameters, faulty naming, etc.). In the final correction step, these errors were therefore iteratively corrected using the following prompt, repeating the process as many times as necessary until the errors no longer occurred (see Listing A3). In our study, this was attempted up to ten times. If more than ten iterations were required, the attempt was considered unsuccessful.
Listing A3. Error Correction Prompt Template.

References

  1. Kratzke, N. Cloud-Native Computing: Software Engineering von Diensten und Applikationen für die Cloud; Carl Hanser Verlag GmbH Co KG: München, Germany, 2023. [Google Scholar]
  2. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Zhu, K.; Chen, H.; Yang, L.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. arXiv 2023, arXiv:2307.03109. [Google Scholar] [CrossRef]
  3. Kaddour, J.; Harris, J.; Mozes, M.; Bradley, H.; Raileanu, R.; McHardy, R. Challenges and applications of large language models. arXiv 2023, arXiv:2307.10169. [Google Scholar] [CrossRef]
  4. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Barnes, N.; Mian, A. A comprehensive overview of large language models. arXiv 2023, arXiv:2307.06435. [Google Scholar] [CrossRef]
  5. Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Yun, L.; Cui, H.; Xuchao, Z.; Zhao, T.; et al. Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv 2023, arXiv:2305.18703. [Google Scholar]
  6. Quint, P.C.; Kratzke, N. Towards a Lightweight Multi-Cloud DSL for Elastic and Transferable Cloud-native Applications. arXiv 2019, arXiv:1802.03562. [Google Scholar]
  7. Kratzke, N.; Drews, A. Don’t Train, Just Prompt: Towards a Prompt Engineering Approach for a More Generative Container Orchestration Management. In Proceedings of the 14th International Conference on Cloud Computing and Services Science—Volume 1: CLOSER. INSTICC, SciTePress, Angers, France, 2–4 May 2024; pp. 248–256. [Google Scholar] [CrossRef]
  8. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. arXiv 2023, arXiv:2303.17651. [Google Scholar] [CrossRef]
  9. Tosatto, A.; Ruiu, P.; Attanasio, A. Container-based orchestration in cloud: State of the art and challenges. In Proceedings of the 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, Santa Catarina, Brazil, 8–10 July 2015; pp. 70–75. [Google Scholar]
  10. Sultan, S.; Ahmad, I.; Dimitriou, T. Container security: Issues, challenges, and the road ahead. IEEE Access 2019, 7, 52976–52996. [Google Scholar] [CrossRef]
  11. Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A.H.; Riedel, S. Language models as knowledge bases? arXiv 2019, arXiv:1909.01066. [Google Scholar] [CrossRef]
  12. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. arXiv 2023, arXiv:2308.10620. [Google Scholar] [CrossRef]
  13. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM J. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  14. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering in Large Language Models: A comprehensive review. arXiv 2023, arXiv:2310.14735. [Google Scholar] [CrossRef]
  15. Lanciano, G.; Stein, M.; Hilt, V.; Cucinotta, T. Analyzing Declarative Deployment Code with Large Language Models. CLOSER 2023, 2023, 289–296. [Google Scholar]
  16. Xu, Y.; Chen, Y.; Zhang, X.; Lin, X.; Hu, P.; Ma, Y.; Lu, S.; Du, W.; Mao, Z.M.; Zhai, E.; et al. CloudEval-YAML: A Realistic and Scalable Benchmark for Cloud Configuration Generation. arXiv 2023, arXiv:2401.06786. [Google Scholar]
  17. Komal, S.; Zakeya, N.; Raphael, R.; Harit, A.; Mohammadreza, R.; Marin, L.; Larisa, S.; Ian, W. ADARMA Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models. In Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering, Las Vegas, NV, USA, 11–14 September 2023; CASCON ’23. pp. 200–205. [Google Scholar]
  18. Liu, X.; Thomas, A.; Zhang, C.; Cheng, J.; Zhao, Y.; Gao, X. Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 31932–31945. [Google Scholar] [CrossRef]
  19. Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
  20. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  21. Kaplan, J.; McCandlish, S.; Henighan, T.J.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
  22. Topsakal, O.; Akinci, T.C. Creating Large Language Model Applications Utilizing LangChain: A Primer on Developing LLM Apps Fast. In Proceedings of the International Conference on Applied Engineering and Natural Sciences, Konya, Turkey, 10–12 July 2023. [Google Scholar]
  23. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Hsin Chi, E.H.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
  24. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2022, arXiv:2205.11916. [Google Scholar]
  25. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Hsin Chi, E.H.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
  26. Liu, J.; Liu, A.; Lu, X.; Welleck, S.; West, P.; Bras, R.L.; Choi, Y.; Hajishirzi, H. Generated Knowledge Prompting for Commonsense Reasoning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Online, 1–6 August 2021. [Google Scholar]
  27. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv 2023, arXiv:2305.10601. [Google Scholar] [CrossRef]
  28. Long, J. Large Language Model Guided Tree-of-Thought. arXiv 2023, arXiv:2305.08291. [Google Scholar] [CrossRef]
  29. Paranjape, B.; Lundberg, S.M.; Singh, S.; Hajishirzi, H.; Zettlemoyer, L.; Ribeiro, M.T. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv 2023, arXiv:2303.09014. [Google Scholar]
  30. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided Language Models. arXiv 2022, arXiv:2211.10435. [Google Scholar]
  31. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv 2022, arXiv:2210.03629. [Google Scholar]
  32. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. NIPS’20. [Google Scholar]
  33. National Security Agency (NSA); Cybersecurity Directorate; Endpoint Security; Cybersecurity and Infrastructure Security Agency (CISA). Kubernetes Hardening Guide. 2022. Available online: https://media.defense.gov/2022/Aug/29/2003066362/-1/-1/0/CTR_KUBERNETES_HARDENING_GUIDANCE_1.2_20220829.PDF (accessed on 7 September 2025).
  34. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  35. Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; et al. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. arXiv 2023, arXiv:2303.10420. [Google Scholar] [CrossRef]
  36. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  37. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  38. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  39. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Dang, X.; Han, S. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2023, arXiv:2306.00978. [Google Scholar] [CrossRef]
  40. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
Figure 1. Methodology pursued to investigate the applicability of LLMs to generate container orchestration manifest files.
Figure 2. Phase 1: Analyzed a prompt chain composed of a drafting zero-shot and an iteratively refining few-shot stage (taken from [7]).
Figure 3. Phase 2: An evolved prompt chain consisting of a design zero-shot and an iterative self-refining stage.
Figure 4. Which degrees of fulfilment were achieved with which LLM and prompt strategies? For more details, see [7]. It can be seen that the refinement did not significantly influence the results for good models, but led to losses compared to the simpler constraint-based approach for smaller or lower-performing models.
Figure 5. Which levels of fulfillment were achieved by each LLM? All of the models successfully recover at least five out of ten use cases. Notably, Llama 70B and GPT versions demonstrate even higher performance, recovering from six to seven cases.
Figure 6. Details for phase 2: All of the initially failed use cases, as well as the iterations needed to correct these files, are plotted for each large language model. Failing files are denoted with a ‘−1’. Three use cases are only corrected by one or two models each. 70B Llama and GPT4 variants outperformed the others in these cases. However, two use cases remained unrecoverable by any of the models.
Table 1. Analyzed large language models (self-hosted models were operated using AWQ quantization on the mentioned NVIDIA GPUs).
| LLM | Service | GPU | VRAM | Phase | Remarks |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | Managed | ? | ? | 1 + 2 | OpenAI (details unknown, [34]) |
| GPT-4o | Managed | ? | ? | 1 | OpenAI (details unknown, [34]) |
| GPT-3.5-turbo | Managed | ? | ? | 1 + 2 | OpenAI (details unknown, [35]) |
| Llama2 13B | Self-host | A6000 | 46.8 Gi | 1 | Chat model [36] |
| Llama2 7B | Self-host | A2 or A4000 | 14.7 Gi | 1 | Chat model [36] |
| Mistral 7B | Self-host | A2 or A4000 | 10.8 Gi | 1 | Fine-tuned for coding [37] |
| Llama3 70B | Self-host | A6000 | 47.8 Gi | 2 | Instruct model [38] |
| Llama3 13B | Self-host | A6000 | 23.3 Gi | 2 | Instruct model [38] |
| Llama3 8B | Self-host | A2 or A4000 | 14.3 Gi | 2 | Instruct model [38] |
