Article

Large Language Model-Driven Framework for Automated Constraint Model Generation in Configuration Problems

1 Ericsson Nikola Tesla d.d., Krapinska 45, 10000 Zagreb, Croatia
2 University of Zagreb Faculty of Electrical Engineering and Computing, 10000 Zagreb, Croatia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6518; https://doi.org/10.3390/app15126518
Submission received: 2 April 2025 / Revised: 2 June 2025 / Accepted: 4 June 2025 / Published: 10 June 2025

Abstract

Constraint satisfaction problems (CSPs) are widely used in domains such as product configuration, scheduling, and resource allocation. However, formulating constraint models remains a significant challenge that often requires specialized expertise in constraint programming (CP). This study introduces the Automatic Constraint Model Generator (ACMG), a novel framework that leverages fine-tuned large language models (LLMs) to automate the translation of natural language problem descriptions into formal CSP models. The ACMG employs a multi-step process involving semantic entity extraction, constraint model generation, and iterative validation using the MiniZinc solver. Our approach achieves state-of-the-art (SOTA) or near-SOTA results, demonstrating the viability of LLMs in simplifying the adoption of CP. Its key contributions include a high-quality dataset for fine-tuning, a modular architecture with specialized LLM components, and an empirical validation showing promising results on complex configuration tasks. By bridging the gap between natural language and formal constraint models, the ACMG significantly lowers the barrier to CP, making it more accessible to non-experts while maintaining a high level of robustness for industrial applications.

1. Introduction

Recent years have seen significant advancements in large language models (LLMs), contributing to the increased prominence of the artificial intelligence (AI) field. Models trained on extensive datasets exhibit significant proficiency in multiple Natural Language Processing (NLP) tasks, including text generation, summarization, and translation. This advancement has led to the practical application of AI in many domains.
Work on backtracking algorithms [1] and network consistency [1] helped constraint programming (CP) to become a basic AI paradigm between 1965 and 1985. Approaches to solving constraint satisfaction problems (CSPs) have become central to CP [1]. Though Freuder (1996, cited in [2]) saw CP as reaching the ideal of “users stating problems and computers solving them”, it remains “somewhat of an art” [1] with two ongoing challenges: (1) the need for specialized modeling expertise [3] and (2) its limited accessibility for non-programmers. This work addresses these problems by using artificial intelligence-driven automation to investigate the possibility of autonomously generating constraint code for CSPs using LLMs.
Two main directions for integrating CP and LLMs have been investigated recently: (1) using CP to improve LLM capabilities (e.g., factuality checks via constraint-based attention analysis [4] or hybrid text generation [5]) and (2) leveraging LLMs to improve CP (e.g., automating constraint model generation). Our work is directed at the latter. Section 3 contains a detailed review of past methods.
Inspired by developments in transformer architectures [6] and deep learning [7], the artificial intelligence revolution has made transformative progress in code generation possible. While early methods depended on strict rule-based systems [8], current transformer-based LLMs such as ChatGPT (GPT-4) [9] may parse natural language instructions to build functional code, dynamically lowering programming barriers [8]. This area has sparked significant research interest and commercial development, as seen by the development of tools such as AlphaCodium (v1.0) [10] and GitHub Copilot (v1.97) [11].
However, several limitations remain unresolved, particularly regarding programming language coverage. Existing LLMs primarily excel with mainstream high-level languages, offering limited support in specialized domains, low-level languages, and less common programming languages. This gap constrains the utility of LLMs in systems programming and niche applications [8]. CP, in particular, falls into the category of less common programming languages.
Potential solutions include advanced transfer learning techniques to adapt knowledge from resource-rich languages [8], as well as collaborations with domain specialists to develop tailored training datasets. Enhancing the multilingual code generation ability of LLMs could significantly expand their practical applications, as noted by the authors of [8]. LLMs have demonstrated emerging cognitive abilities, including deliberate, logical reasoning (System-2 reasoning per Kahneman’s dual-process theory [12]) [13], and instruction-following capabilities [14,15], enabling junior developers to generate code through natural language descriptions.
Constraint optimization and satisfaction problems (COPs and CSPs) are important in many fields, such as scheduling, planning, and resource allocation. In our study, we focus on pure CSPs, of which COPs are an extension, although the Automatic Constraint Model Generator (ACMG) engine works on COPs as well. Formulating solutions to these problems, called models, can be a challenging and time-consuming task, often requiring specialized domain knowledge and mathematical expertise. Recent developments in large language models have opened up new possibilities for automating the process of generating constraint models [16]. The introduction of the transformer model created an opportunity to address existing challenges in CP, but first, the limitation of low-resource programming languages needs to be solved. Our study aims to address this limitation by creating high-quality datasets in collaboration with domain experts and using LLMs to address the aforementioned challenges in CP.
This study seeks to bridge the research gap by investigating the feasibility of leveraging LLMs to automatically generate constraint codes for constraint satisfaction problems (contribution 1). Support for low-resource, low-level, and domain-specific programming languages [8] has been identified as one of the possible research gaps still present in the area of code generation using LLMs. We propose an approach that extracts semantic information from natural language descriptions of constraint satisfaction problems by employing fine-tuned LLMs (contribution 2). A recent study [17] utilized a different method [18] that did not yield satisfactory results, and the authors suggested that fine-tuning LLMs might be a good alternative. Fine-tuned LLMs are used to generate corresponding CSP models (contribution 3). With this approach, we aim to create complex constraint satisfaction models that describe complex configurations as their primary use case. In this study, we also present the datasets we created to fine-tune our LLMs (contribution 4). Generating synthetic data for a domain-specific programming language for which there is not much data available on the Internet is important, as noted by the authors of [8,17]. Furthermore, to facilitate future research, we have made our datasets [19] and code publicly available [20] (Supplementary Materials).
The remainder of this study is organized as follows: Section 2 presents the research background necessary to understand our methodology, including key concepts such as constraint programming, constraint satisfaction problems, configuration, large language models, fine-tuning, and prompt engineering (PE) methods. Section 3 covers related work and discusses how our approach advances existing research. Section 4 introduces the proposed Automatic Constraint Model Generator, a general framework designed to automatically generate constraint models from natural language prompts. The ACMG framework consists of two core components: semantic entity extraction and constraint model generation. Section 5 presents the evaluation of the proposed approach, including the datasets used, experimental results, and findings from an ablation study. Section 6 discusses key findings, the implications for industry and academia, limitations, and directions for future work. Finally, Section 7 concludes this study.

2. Background

2.1. Constraint Programming

Constraint programming (CP) serves as a robust conceptual framework for articulating a diverse array of combinatorial problems. The primary challenge in CP is applying it to more engineering-focused disciplines, thereby enhancing the accessibility and user-friendliness of constraint technology for individuals without programming expertise [1]. CP has numerous applications in artificial intelligence, operations research, and other areas of Computer Science and related disciplines. The concept of CP is that the user specifies the constraints, and then a general-purpose constraint solver is utilized to address them. Constraints are essentially relationships, and a constraint satisfaction problem (CSP) outlines the relationships that must be upheld among the specified decision variables.

2.2. Constraint Satisfaction Problem

A constraint satisfaction problem, as defined in [1], is a triple P = ⟨X, D, C⟩, where X = ⟨x1, x2, ..., xn⟩ represents an n-tuple of variables, D = ⟨D1, D2, ..., Dn⟩ is a corresponding n-tuple of domains such that each variable xᵢ takes its values from Dᵢ, and C = ⟨c1, c2, ..., ct⟩ is a t-tuple of constraints. Each constraint cj is defined as a pair ⟨R(Sj), Sj⟩, where Sj denotes the set of variables involved in the constraint (its scope) and R(Sj) is the relation over these variables, specifying a subset of the Cartesian product of their domains.
A solution [1] to the CSP is an n-tuple A = ⟨a1, a2, ..., an⟩, where each aᵢ ∈ Dᵢ and every constraint cj is satisfied when A is projected onto Sj. The problem may require finding all solutions (sol(P)), checking whether at least one solution exists, or determining that the CSP is unsatisfiable (i.e., sol(P) = ∅). The CSP framework, while simple, demonstrates broad applicability across various domains such as artificial intelligence, operations research, scheduling, supply chain management, graph algorithms, computer vision, and computational linguistics.
Figure 1 shows an example of a constraint satisfaction problem and its solution. There are three variables, X1, X2, and X3, with domains D(X1) = D(X2) = D(X3) = {1, 2, 3} and constraints c12 ≡ (X1 < X2) and c23 ≡ (X2 = X3). On the left side is the problem definition, and on the right side is the solution to the CSP inspired by an example from the authors of [1].
An instantiation I on a subset of variables Y = (x1, ..., xₖ) ⊆ X in a CSP P = (X, D, C) assigns values (v1, ..., vₖ) to the variables, denoted as I = ((x1, v1), ..., (xₖ, vₖ)), where each pair (xᵢ, vᵢ) indicates that xᵢ takes the value vᵢ. The instantiation is valid if every assigned value vᵢ belongs to the domain of xᵢ (i.e., I[xᵢ] ∈ D(xᵢ) for all xᵢ ∈ Y). It is locally consistent if it is both valid and satisfies every constraint c ∈ C whose variables X(c) are fully included in Y. If I fails to meet these conditions, it is locally inconsistent. Figure 2 (left) provides examples of instantiations.
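For concreteness, the CSP of Figure 1 can be written directly in MiniZinc and solved through the MiniZinc Python API used later in this paper. The following is a minimal illustrative sketch (ours, not reproduced from the paper's figures), assuming the minizinc-python package and a local MiniZinc installation with the Gecode solver:

```python
# Minimal sketch of the CSP from Figure 1: three variables with domain {1, 2, 3},
# constraints X1 < X2 and X2 = X3, solved via the minizinc-python package.
import minizinc

model = minizinc.Model()
model.add_string(
    """
    var 1..3: x1;  % D(X1) = {1, 2, 3}
    var 1..3: x2;  % D(X2) = {1, 2, 3}
    var 1..3: x3;  % D(X3) = {1, 2, 3}
    constraint x1 < x2;   % c12
    constraint x2 = x3;   % c23
    solve satisfy;
    """
)

solver = minizinc.Solver.lookup("gecode")
result = minizinc.Instance(solver, model).solve()
print(result["x1"], result["x2"], result["x3"])  # one solution, e.g., 1 2 2
```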

2.3. Configuration

Configuration refers to the process of assembling customized systems from generic, mass-produced components, as described in [1]. This task is widely performed across industries such as computer manufacturing, automotive production, furniture design, telecommunications, and travel services, where components are standardized yet allow for personalized solutions. Customers provide specific requirements, and configurations are created by selecting and adapting component instances to meet these requirements. Constraint programming, particularly through constraint satisfaction problems, has emerged as an effective approach for configuration tasks due to its ability to systematically explore solution spaces while managing combinatorial complexity. CSPs excel at defining clear boundaries for possible configurations and identifying component interactions. Instantiations (on the left in Figure 2) or partial instantiations (on the right in Figure 2) are starting points for configurations. Both refer to the solved CSP on the right side of Figure 1.
The suitability of CP/CSPs for configuration stems from four key advantages: First, their flexible, declarative modeling approach allows for natural problem representation. Second, they can diagnose over-constrained scenarios by explaining why no solution exists and suggesting constraint relaxations. Third, they support interactive configuration processes where user choices dynamically constrain subsequent options through constraint propagation. Finally, they can incorporate and reason about user preferences during the configuration process. These capabilities make CP/CSPs particularly valuable for developing customized solutions while maintaining efficiency and user satisfaction.
Figure 3 shows an example of a configuration task. There are eight available slots and two types of boards: Board A and Board B. Board As can only go into odd slots, and Board Bs can go into any slot, but there cannot be three or more together. There are three Board As and five Board Bs. The task is to allocate all boards to available free slots.
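As an illustration only (this encoding is ours, not taken from the paper's figures), the board-allocation task above can be expressed as a small CSP in MiniZinc and solved with the minizinc-python package:

```python
# Sketch of the configuration task in Figure 3 as a CSP: 8 slots, 3 Board As
# (odd slots only) and 5 Board Bs (no run of three or more consecutive Bs).
import minizinc

model = minizinc.Model()
model.add_string(
    """
    int: A = 1; int: B = 2;
    array[1..8] of var 1..2: slot;                             % board type placed in each slot
    constraint sum(i in 1..8)(bool2int(slot[i] = A)) = 3;      % exactly three Board As
    constraint sum(i in 1..8)(bool2int(slot[i] = B)) = 5;      % exactly five Board Bs
    constraint forall(i in 1..8 where i mod 2 = 0)(slot[i] != A);  % As only in odd slots
    constraint forall(i in 1..6)(                              % no three consecutive Board Bs
        not (slot[i] = B /\\ slot[i+1] = B /\\ slot[i+2] = B));
    solve satisfy;
    """
)
result = minizinc.Instance(minizinc.Solver.lookup("gecode"), model).solve()
print(result["slot"])  # one admissible allocation, e.g., [2, 2, 1, 2, 1, 2, 1, 2]
```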

2.4. Large Language Models

Large language models (LLMs) are extensive, pre-trained statistical language models that utilize neural network architectures [21]. Transformer-based neural language models with vast parameter counts, typically in the tens to hundreds of billions, are commonly referred to as large language models. These models are pre-trained on extensive text datasets, and most existing LLMs are split into the following three families: PaLM [22], LLaMA [23], and GPT-4 [24]. There is also a recent newcomer, Claude [25]. During the inference process, LLMs with model parameters θ, denoted as pθ, create text based on input text x by predicting each subsequent token in a sequence y, relying on the tokens that precede it. This process is described by the conditional probability pθ(y|x); more precisely, it is given by the following formula:
p_\theta(x) = \prod_{i=1}^{n} p_\theta\big(x[i] \mid x[1 \ldots i-1]\big).
Here, x[i] denotes a single generated token.

2.5. Fine-Tuning and Instruction Tuning

The pre-trained foundation model requires fine-tuning to be useful for a specific task. Fine-tuning (FT) the model on labeled data can improve its results and reduce the complexity of prompt engineering, serving as an alternative to retrieval-augmented generation (RAG) [17]. Other reasons to fine-tune include exposing the model to new or proprietary data not covered during pre-training. An important reason to fine-tune large language models is to align their responses with human expectations when provided with instructions through prompts. This process, known as instruction tuning, involves further training LLMs on a dataset of supervised pairs, thus bridging the gap between an LLM’s next-word prediction objective and the user’s goal of having the model adhere to human instructions [21,26].
Fine-tuning adjusts the model’s parameters, θ, to better predict a target dataset, D = {(x, y)}, where x is the input and y is the output. The objective is to minimize the negative log-likelihood of the data:
\mathcal{L}_{\mathrm{finetune}}(\theta) = -\frac{1}{|D|} \sum_{(x, y) \in D} \log p_\theta(y \mid x),
where L_finetune is the loss function and pθ(y|x) is the model’s conditional probability of output y given input x.
Instruction tuning (IT) involves further fine-tuning the model using a dataset, Dinst = {(xinst, yinst)}, where xinst includes the task instructions and yinst is the expected response. The goal is to align the model’s behavior with human instructions. The loss function is similar but adapted to instruction-focused tasks:
\mathcal{L}_{\mathrm{inst}}(\theta) = -\frac{1}{|D_{\mathrm{inst}}|} \sum_{(x_{\mathrm{inst}},\, y_{\mathrm{inst}}) \in D_{\mathrm{inst}}} \log p_\theta(y_{\mathrm{inst}} \mid x_{\mathrm{inst}}).

2.6. Prompt Engineering

Prompt engineering (PE) [27] has emerged as a promising technique for leveraging the reasoning abilities of LLMs. It uses natural language to write task descriptions, called prompts, to guide model output. Compared with other techniques [26,28,29], such as fine-tuning [30], instruction tuning [31], and in-context learning [32], prompt engineering involves crafting an input x_prompt to guide the model pθ toward generating the desired output y_desired. PE does not modify the model parameters but strategically structures the input to influence the output.
The objective is to maximize the probability of the desired output:
x_{\mathrm{prompt}} = \underset{x_{\mathrm{template}}}{\arg\max}\; p_\theta(y_{\mathrm{desired}} \mid x_{\mathrm{template}}),
where x_template is a structured input prompt (e.g., instruction, examples, or context) and y_desired is the target output. In practice, there are zero-shot prompting (ZSP) techniques, where x_prompt is a simple instruction, and Few-Shot Prompting (FSP) techniques, where x_prompt includes demonstrations (x_i, y_i) to improve pθ(y_desired | x_prompt):
x_{\mathrm{prompt}} = \mathrm{Concat}\big(x_{\mathrm{instruction}}, \{(x_i, y_i)\}_{i=1}^{k}\big).
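To make the last two formulas concrete, the following small Python sketch (ours; the example texts are hypothetical) builds a zero-shot prompt from an instruction alone and a few-shot prompt by concatenating the instruction with demonstration pairs:

```python
# Illustrative construction of zero-shot and few-shot prompts, mirroring
# x_prompt = Concat(x_instruction, {(x_i, y_i)}_{i=1..k}); example texts are hypothetical.
def build_prompt(instruction, demonstrations=()):
    """Zero-shot if no demonstrations are given, few-shot otherwise."""
    parts = [instruction]
    for x_i, y_i in demonstrations:          # optional (input, output) demonstration pairs
        parts.append(f"Input: {x_i}\nOutput: {y_i}")
    return "\n\n".join(parts)

zsp_prompt = build_prompt("Generate a MiniZinc model for the following problem description.")
fsp_prompt = build_prompt(
    "Generate a MiniZinc model for the following problem description.",
    [("Three variables in 1..3 with x1 < x2 and x2 = x3.",
      "var 1..3: x1; var 1..3: x2; var 1..3: x3; "
      "constraint x1 < x2; constraint x2 = x3; solve satisfy;")],
)
```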

3. Related Work

Traditionally, CP and LLMs have developed independently of each other, with little collaboration. Recent advances in LLMs present opportunities to integrate these fields. Two approaches are emerging: using CP to improve LLM capabilities, and leveraging LLMs to enhance CP. Several studies [4,5] have explored the first approach. The authors of [4] explored why LLMs sometimes generate factually incorrect information. The authors found that by analyzing the attention patterns of large language models, they could predict when the model might make a factual error. They call this method SAT Probe, and it essentially looks at how much attention LLMs pay to specific constraints within a given prompt. The more attention paid to these constraints, the more likely the LLM is to generate factually accurate text. The authors of [5] discussed how to combine constraint programming with large language models for improved text generation. The authors argued that while constraint programming excels at handling structural constraints, it struggles with incorporating “meaning”, which LLMs are adept at. The study introduced GenCP, an approach that integrates an LLM into a CP solver. This allows the LLM to generate text while the CP framework ensures all constraints are met. The authors also demonstrated that GenCP outperforms traditional Natural Language Processing (NLP) methods such as Beam Search, producing higher-quality results more efficiently. Essentially, their study proposes a synergistic approach where LLMs and CP work together to overcome their individual limitations, opening up new possibilities for text generation under constraints. Our study focuses on the second approach, using large language models to enhance constraint programming.
Several studies have explored using large language models (LLMs) for constraint programming (CP) code generation, including the generation of constraint models [17,18,19]. All of these studies recognized the complexity of the constraint model generated, and the most successful ones split the problem into two parts: the recognition of optimization problem entities, called Named Entity Recognition (NER), and generation of problem formulation. Building on this research, a study [17] proposed a generative model capable of automatically generating CP models from natural language descriptions. It uses an NER method similar to [18] to extract semantic entities. Then, it uses retrieval-augmented generation (RAG) [33] with in-context learning (ICL) [32] to create a blueprint model with extracted semantic entities. RAG uses embedding to find semantically similar predefined models, and those retrieved semantically similar natural language descriptions are used as ICL examples when prompting an LLM to generate a formal model.
While the potential of using large language models for code generation has been explored [8,34], there is a lack of research on their use to generate complex constraint satisfaction models. Existing work has focused on exploring the ability of LLMs to generate simple optimization models [16]. That study used a vanilla method to generate basic constraint optimization problem (COP) models with the necessary constraints and variables; the vanilla method generated only elementary models, for both constraint satisfaction and optimization. The authors of [16] did not define a specific problem description in natural language but, rather, a natural language task that prompts an LLM to generate an arbitrary model in order to explore the ability of LLMs to generate rudimentary COP and CSP models. Constraint satisfaction problems are well suited to defining the closed space of a configuration [1,17]. In contrast, our study defines specific, not arbitrary, configuration tasks in natural language and asks LLMs to solve them.
The largest sets of tasks explored in [17] are linear optimization problems and logic grid puzzles, which can be expressed as CSPs, together with a small corpus of constraint programming problems from the COP and CSP domains based on an introductory university course. The authors of [17] used Few-Shot Prompting (FSP) and noted that zero-shot prompting (ZSP) would not yield any success on their tasks. FSP outperforms ZSP because it provides explicit input–output examples, enabling the model to infer task specifics through in-context learning. This reduces ambiguity in task interpretation, aligns with the model’s training paradigm (which incorporates patterns from data), and facilitates complex reasoning via demonstrations [32]. Our study uses ZSP and, on the same tasks as those used in [17], obtained better results; this is explained in more detail in Section 5.
The authors of [17] used an overall process similar to that of our study, as defined in [2], consisting of two phases: an NER phase and a generation phase. In the first phase, the authors of [17] used NER as defined in [35,36,37], which is an ensemble approach with an encoder transformer architecture, while our study uses LLMs from the GPT family (i.e., a decoder transformer architecture with a fine-tuning strategy, as described in Section 5). The authors of [17] reported in their conclusion that including NER, as proposed in the NL4OPT challenge [38], did not always improve the quality of the final models.
The difference in the second phase is that our study uses fine-tuned LLMs from the GPT family, whereas the authors of [17] used RAG. Fine-tuning allows for a better adjustment of LLM parameters for a specific task compared to the vanilla LLMs and RAG reported in [17], which use the most similar entries retrieved from the embedding database as FSP demonstration examples. The studies reported in [16,17] used modest CSP datasets. Other studies have focused only on linear programming or on the NER phase for constraint optimization problems [18,19]. Key similarities and differences between the ACMG and [17] are presented in Table 1.
Our study builds upon the conclusions of [8,17]: the authors of [17] concluded that advancing towards a CP modeling assistant requires a higher-quality dataset that pairs problem descriptions with their CP models, while a similar conclusion in [8] states that partnership with domain experts could inform the development of targeted datasets and strategies for fine-tuning. Our dataset is relatively small compared to some computer-generated datasets or datasets based on data accessible on the Internet. However, building on the conclusions reported in [8,39], we ensured that our dataset is of very high quality through close cooperation with domain experts, considering the “remarkable impact of high-quality data in honing a language model’s proficiency in code-generation tasks”, as stated in [39].

4. Automatic Constraint Model Generator (ACMG)

To address the two challenges identified in Section 1, namely developing constraint-based tools that are easier for non-programmers to use and promoting the wider adoption of constraint programming (CP), we created a general framework for the automatic generation of constraint models from natural language prompts. This framework is inspired by ideas from the studies reported in [16,17,18,19,40]. Our approach leverages the powerful natural language understanding capabilities of large language models (LLMs) to extract relevant textual information and then generate a corresponding constraint model. This section consists of three subsections: Section 4.1 explains the main terms we need to introduce and define; Section 4.2 explains the ACMG’s process and data flow; and, finally, in Section 4.3, we cover the ACMG’s architecture in detail.

4.1. Main Components

To explain the ACMG, we introduce the following basic terms:
Message from the user: A message from the user is a prompt written in a natural language consisting of a constraint task and a model. The description in Figure 3 is an example of such a prompt, where the model and constraint task are intertwined (see last paragraph of Section 2.3).
Constraint task: A constraint task is the partial instantiation of constraint satisfaction problem (CSP) variables described in natural language. For example, in the Game of 24 model, a constraint task is a set of four numbers, e.g., 3, 5, 7, 9.
Constraint model: This model describes the general constraints or rules that apply to the problem. For example, a sudoku model can be described as follows: Sudoku is a logic-based number placement puzzle in which a 9 × 9 grid is filled with digits from 1 to 9. Each row, each column, and each of the nine 3 × 3 subgrids must contain all the digits from 1 to 9 without repetition. Note: for brevity, in the rest of this paper we use the terms “model” and “constraint model” interchangeably.
Model name: The model name serves as a unique textual identifier for the constraint model. It is automatically generated by the LLM and then manually validated and stored in a catalog for future reference.
LLMs: The two different architectural configurations of the ACMG used the following LLMs: GPT-3.5-turbo-0125 and GPT-4o-mini-2024-07-18. In each distinct architectural configuration of the ACMG, we only used one of those two large language models. Vanilla LLM (vLLM) means we used the default LLM without fine-tuning it using our specially created dataset. We did not use a combination of GPT-3.5-turbo-0125 and GPT-4o-mini-2024-07-18 in one distinct configuration of the ACMG, but we did use, as described in Section 5, a combination of the vanilla LLM and one or more fine-tuned LLMs (fLLMs) of the same type (e.g., GPT-3.5-turbo-0125).
In the ACMG, LLMs are fine-tuned using OpenAI’s supervised learning pipeline [9], with input data formatted as jsonl (JSON Lines), where each entry contains a structured prompt–completion pair (e.g., {“prompt”: “...”, “completion”: “...”}). Tokenization was handled internally by OpenAI’s subword tokenizer (based on Byte-Pair Encoding for GPT-3.5/GPT-4), ensuring consistency across examples. The fine-tuning adhered to OpenAI’s default hyperparameters: a batch size of 1 (sequential processing per example), 1 epoch (default setting to balance convergence and computational cost), and a learning rate multiplier of 1 (applied to OpenAI’s undisclosed base rate). The optimizer and early stopping criteria were not user-configurable, as these are proprietary to OpenAI’s pipeline. Training concluded after 1 epoch with a final loss of 0.057, indicating stable convergence. While an ablation study of the hyperparameters (e.g., epochs, batch size) might further optimize performance, we prioritized reproducibility and cost efficiency by opting for the default settings.
Parser: A Parser is a pre-trained foundational LLM that is fine-tuned and instruction-tuned using our dataset [19]. The output from the Parser increases the validity and correctness of downstream tasks, i.e., the generated constraint model. The Parser is used to separate the user’s message into two parts: the model and the constraint task. Both are textual messages written in natural language.
Recognizer: A Recognizer is a pre-trained foundational LLM that is fine-tuned and instruction-tuned using our dataset [19]. The output from the Recognizer increases the validity and correctness of downstream tasks, i.e., the generated constraint model. The Recognizer extracts semantic entities from the message containing the model. The goal is to reduce ambiguity by (i) identifying relevant CSP variables, (ii) extracting the domains of each variable, (iii) extracting the main constraints, and (iv) extracting input parameters for narrowing down the solution to a single instance.
Generator: A Generator is a pre-trained foundational LLM that is fine-tuned and instruction-tuned using our dataset of constraint models [19]. The Generator takes the extracted semantic entities from the Recognizer and a message describing the model in natural language, and generates the final constraint model in the MiniZinc [41] modeling language.
Evaluator: An Evaluator is a pre-trained foundational LLM from the GPT family of models; the base model we used is vanilla GPT-3.5-turbo. We did not fine-tune an LLM for this role; we used a vanilla LLM. We will investigate this choice further in future research.
MiniZinc solver: The MiniZinc modeling framework is a high-level, declarative programming language for defining and solving constraint satisfaction and optimization problems. MiniZinc is solver-independent, meaning the underlying solver best suited to the given problem can be chosen freely. There are other solver-independent modeling frameworks besides MiniZinc; most notable are Savile Row [42] with Essence Prime [43] and CPpy [44], both of which support different solvers in the backend. Comparing them is an open research gap, and a unified benchmark is required. We chose MiniZinc because we consider it one of the most versatile and powerful modeling frameworks currently available, based on the feedback we received from our domain experts and its strong support for Python 3.13.3 integration.
Errors and warnings: As stated in [41,45], the MiniZinc solver reports two types of faults: errors, which prevent execution, and warnings, which do not halt execution but indicate potential issues. During our evaluation of the ACMG, we encountered the following MiniZinc errors: syntax, type, assertion failed, include, cyclic, and evaluation errors. We also encountered warnings such as “variable shadows another variable with the same name” and “model inconsistency detected before search”. For more information about MiniZinc errors and warnings, please refer to [41,45].
File types: MiniZinc uses files with two possible extensions: mzn and dzn. Mzn files describe the problem modeled in the MiniZinc modeling language, and dzn files are used as data files. The goal is to keep a data-agnostic model in the mzn file, independent of the data: for example, the rules of the n-queens problem stay the same whether the data specify four queens (the four-queens problem) or eight (the eight-queens problem). Dzn files are optional, as data can be embedded directly into mzn files; for example, a sudoku puzzle is traditionally 9 × 9, and if there is no need for a more general model, this constant can be embedded into the mzn file. More details about MiniZinc modeling can be found in [45]. We generate dzn files together with mzn files, meaning we ask an LLM, via a prompt, to return both mzn and dzn files. We did not conduct a formal ablation study on whether generating the mzn and dzn files in one response or in two yields better results, but we noticed that generating them together worked better. This version of the ACMG generates both mzn and dzn files in one response.
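As an illustration of the mzn/dzn split (our own sketch, not the paper's generated output), the data-agnostic n-queens model below keeps the board size n as a parameter; assigning n through the MiniZinc Python API plays the role of a dzn file containing “n = 8;”:

```python
# Sketch of the mzn/dzn split for the n-queens example: the model (mzn content) is
# data-agnostic, and the data (what a dzn file would contain) are assigned separately.
import minizinc

mzn = """
int: n;                                    % parameter: board size (comes from data)
array[1..n] of var 1..n: q;                % q[i] = column of the queen in row i
include "alldifferent.mzn";
constraint alldifferent(q);                               % one queen per column
constraint alldifferent([q[i] + i | i in 1..n]);          % one queen per diagonal
constraint alldifferent([q[i] - i | i in 1..n]);          % one queen per anti-diagonal
solve satisfy;
"""

model = minizinc.Model()
model.add_string(mzn)
instance = minizinc.Instance(minizinc.Solver.lookup("gecode"), model)
instance["n"] = 8                          # the dzn-equivalent data: n = 8;
print(instance.solve()["q"])
```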

4.2. ACMG Process

The generic building blocks of the ACMG process are provided in Figure 4. The input message prepared by the ACMG engine (yellow) is the input for the next step of the ACMG process (dark blue outline). In this step, the ACMG aggregates the input message with the corresponding control prompt (green); which control prompt is selected depends on the present phase of the ACMG process. This aggregated prompt is sent to the LLM, which generates an output prompt; that is, a message (light blue). Note that in Figure 4, there is a light blue box with the text v/f LLM, denoting that this can be either a vanilla or fine-tuned LLM. The ablation study (Section 5.3) we conducted demonstrates the consequences of using different combinations of vanilla and fine-tuned LLMs.
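The building block can be pictured as a single function that concatenates the phase-specific control prompt with the prepared input message and forwards the result to the selected LLM. The sketch below is our reconstruction, not taken from the ACMG source code, and assumes the openai (>=1.0) Python SDK:

```python
# Our reconstruction of the generic ACMG building block in Figure 4: aggregate the
# phase-specific control prompt (green) with the prepared input message (yellow)
# and forward it to a vanilla or fine-tuned LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_building_block(input_message: str, control_prompt: str,
                       llm: str = "gpt-3.5-turbo") -> str:
    aggregated = f"{control_prompt}\n\n{input_message}"       # aggregated prompt
    response = client.chat.completions.create(
        model=llm,                                            # vanilla or fine-tuned model id
        messages=[{"role": "user", "content": aggregated}],
    )
    return response.choices[0].message.content                # output message (light blue)
```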
The key steps to our approach are provided in Figure 5. Figure 5a consists of a specific instantiation of the building block shown in Figure 4 (steps 7 and 8 are omitted in Figure 5a as they do not require communication with the LLM). The pipeline decomposes constraint modeling into four formally distinct phases (see Appendix A). The full data flow is shown in Figure 5b.
The numbers in the figures refer to the steps of the process, detailed descriptions of which are provided below:
Step 1.
A natural language prompt (i.e., the user’s message) is forwarded to the LLM to discern the content of the user’s message. With help from the specially designed control prompt [46], an attempt is made to extract the constraint task and model description.
Step 2.
The input user message is separated into a constraint task and a model. Each has a separate flow, indicated by 2.a and 2.b. In addition, in this step, with help from the LLM, a name is generated for the new model that is about to be created.
Step 3.
At this stage, the fine-tuned LLM (the Recognizer) is used to recognize (i.e., extract) the relevant semantic entities from the prompt. Those extracted entities and the original messages that contain the model described in natural language are forwarded to step 4. This step is inspired by studies of the authors of [18,19]. Figure 5 and Figure 6 show steps 3 and 4, respectively. Green is the control prompt fetched from the memory module, yellow are the prompts prepared by the ACMG engine, and blue is the response from the LLM.
Figure 7 shows an example of an aggregated input prompt based on the example shown in Figure 6. This prompt is constructed from the model, which is parsed from the input user message. The input user message is presented in the description of Figure 3, which is provided in the last paragraph of Section 2.3. The input user message is parsed in step 2, and the model and the constraint task are extracted. In Figure 4, the top box on the left contains the parsed model, while the bottom box on the left contains the control prompt for the recognition step. These two prompts are concatenated; the result of this concatenation is presented in Figure 7 and sent to the LLM, while Figure 8 shows an example of the prompts used during the ACMG process’s Generator phase.
Step 4.
Next, another fine-tuned LLM (the Generator) uses the semantic entities and the model description from step 2 to generate a constraint model in the MiniZinc modeling language. The generated MiniZinc model is forwarded to step 6.
Step 5.
The constraint task is still in natural language form. The constraint task, the valid MiniZinc model generated in step 6, and the relevant control prompt are sent to the LLM, which transforms the constraint task into a data MiniZinc file, i.e., dzn file [46]. Once this is carried out, the generated dzn file is loaded into the MiniZinc solver, and the ACMG can now solve the constraint task in the next step. Figure 9 (bottom left box) contains a generated MiniZinc data file (dzn).
Step 6.
The generated MiniZinc model is evaluated by the MiniZinc solver, which checks for syntactic correctness. If the solver reports warnings or errors, the following repair process is initiated:
1.
Inner Repair Loop (Model Regeneration):
The solver’s output (including the model, warnings, errors, and the original natural language description) is sent back to the Generator.
The Generator attempts to regenerate a corrected MiniZinc model, addressing the reported issues.
This inner loop is executed a maximum of three times. If the model remains invalid after three attempts, the process proceeds to the outer loop.
2.
Outer Repair Loop (Semantic Regeneration):
The unresolved warnings/errors, along with the original natural language model description, are returned to Step 3 (Semantic Entity Generation).
The system regenerates the semantic entities that may have caused the persistent issues.
The outer loop also runs at most three times. If the model still fails validation after these attempts, the ACMG engine terminates with an error. Otherwise, the process advances to Step 7.
This iterative repair mechanism, inspired by [16], ensures robustness by combining localized model fixes (inner loop) with higher-level semantic adjustments (outer loop); a code-level sketch of this loop structure is given after Step 8 below. For clarity, Figure 5b (see also Appendix B) illustrates the finite-state process governing these transitions.
Step 7.
The constraint task is solved with a MiniZinc solver, which already has the model loaded into it. The output from the MiniZinc solver is sent to the user who initiated this process. This step is not shown in Figure 5a, as it does not include communication with the LLM.
Step 8.
Finally, if the generated MiniZinc model is invalid, the ACMG engine stops. Otherwise, if there are no warnings or errors, the model is stored in the database for future use, along with its unique identifier (the model name obtained in step 2). The stored model is then loaded into a MiniZinc solver for processing. This step is not shown in Figure 5a, as it does not include communication with the LLM.
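The control flow of Steps 3–6, including the inner and outer repair loops, can be summarized by the following sketch (our reconstruction; extract_entities, generate_model, and validate_with_minizinc are hypothetical placeholders standing in for the Recognizer, Generator, and Evaluator calls):

```python
# Sketch of the Step 6 repair mechanism: an inner loop that regenerates the MiniZinc
# model from solver feedback, wrapped in an outer loop that regenerates the semantic
# entities. The helper functions are hypothetical placeholders.
MAX_INNER, MAX_OUTER = 3, 3

def repair_loop(nl_description: str):
    for _ in range(MAX_OUTER):                        # outer loop: semantic regeneration
        entities = extract_entities(nl_description)   # Step 3 (Recognizer)
        model = generate_model(entities, nl_description)       # Step 4 (Generator)
        for _ in range(MAX_INNER):                    # inner loop: model regeneration
            issues = validate_with_minizinc(model)    # Step 6 (Evaluator + MiniZinc)
            if not issues:                            # no warnings or errors
                return model
            model = generate_model(entities, nl_description, feedback=issues)
    raise RuntimeError("ACMG engine terminated: model still invalid after repair attempts")
```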

4.3. ACMG Architecture

The overall system architecture is shown in Figure 10. The main components of the Automatic Constraint Model Generator engine are as follows:
Orchestrator (OR): The OR is responsible for controlling the flow from the user’s input to the given output. The OR fetches relevant control prompts from the Memory Controller and invokes different components depending on which stage of the process it is in. This module orchestrates the generation of semantic entities, the MiniZinc model, model evaluation, regeneration, and solving.
Memory Controller (MC): The MC consists of a model database and a table of help prompts. The model database consists of unique models that can be fetched by a unique identifier, the model’s name. The table of help prompts has the following prompts: prompt for assessing user message content; prompt for extracting the constraint task and the model; prompt for classifying the model; prompt for naming the model; prompt for extracting relevant semantic entities; prompt for generating the MiniZinc model; prompt for checking if the generated model validated using MiniZinc is OK or has warnings and/or errors; prompt for transforming warnings and errors into an acceptable format for model and constraint task regeneration; and a prompt for assigning values from the constraint task given in natural language to the model variables that can be solved using the MiniZinc solver (all of these prompts are provided in [46]).
Parser (P): The Parser is responsible for separating the model and the constraint task from the input/user message. The Parser is a fine-tuned, pre-trained foundational LLM. We prepared high-quality data with manual annotations to train the Parser [19].
MiniZinc solver (MCS): The MiniZinc solver has a Python interface called the MiniZinc Python API. This lets the ACMG interact directly with MiniZinc models and solvers from Python code. MiniZinc is used to model problems with constraints and aims to find solutions that satisfy them. A typical MiniZinc model consists of decision variables representing the unknown quantities in the problem to be solved, constraints that restrict the possible values for the decision variables, and an objective function (optional) that evaluates the quality of each solution. An objective function is applied when a user is attempting to minimize or maximize a variable (e.g., minimize cost, maximize profit). However, as we are interested in constraint satisfaction problems and not constraint optimization, we did not use an objective function; instead, we used parameters to narrow down configurations, i.e., the instantiation of a CSP. MiniZinc is solver-agnostic, meaning it can work with various back-end solvers (such as Gecode, CPLEX, CBC, etc.). The user can choose the solver depending on the type of problem and the efficiency required (e.g., for CSP, Gecode [47,48,49] or OR-Tools [50] are best suited).
Recognizer (R): The Recognizer is responsible for extracting semantic entities from the user message using a control prompt written for this task [46]. The authors of study [17] use the NER method [38] as a specialized framework trained for entity tagging in an optimization context. This can be applied to linear programming and constraint optimization problems, not constraint satisfaction problems. We fine-tuned and instruction-tuned a pre-trained foundational LLM. We prepared high-quality data with manual annotations to train the Recognizer [19]. Figure 9 (top left box) shows the relevant semantic entities extracted, i.e., the decision variables, domains of those variables, and constraints between decision variables.
Generator (G): The Generator is responsible for generating MiniZinc code based on the output from the Recognizer and the model described in natural language. LLM foundational models have parameter knowledge related to the generation of MiniZinc models. We aimed to improve upon that parameter knowledge further; therefore, we fine-tuned and instruction-tuned the model on a dataset of MiniZinc models, their natural language descriptions, and the generated MiniZinc code [19]. Figure 9 (right box) contains the generated MiniZinc code (mzn file).
Evaluator (E): The Evaluator checks the generated MiniZinc model for warnings and errors using the MiniZinc Python API and provides feedback to the Generator for model validation and regeneration; a sketch of this validation call is given at the end of this subsection. We drew inspiration for this component from the process described by the authors of [16]. Each phase of validation and regeneration has its own control prompt [46].
Gateway (GW): The Gateway module enables flexible LLM assignments—each component can deploy either gpt-3.5-turbo or gpt-4o-mini depending on the architectural configuration being tested (see Section 5.3 and Appendix C). All models are fine-tuned except the Evaluator, which uses vanilla LLMs to maintain impartial validation. The Gateway module centralizes all LLM interactions, storing model assignments (e.g., Parser→ft:gpt-3.5-turbo) and API parameters.
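The Evaluator's validation call can be realized with the MiniZinc Python API roughly as follows; this is an illustrative sketch (function name and error handling are ours), not the exact ACMG implementation:

```python
# Illustrative validation of a generated MiniZinc model string via the MiniZinc
# Python API; in the ACMG, the returned fault text would be transformed by a
# control prompt and fed back to the Generator for regeneration.
import minizinc

def validate_model(mzn_code: str, solver_name: str = "gecode"):
    """Return None if the generated model parses and solves, else the fault text."""
    try:
        model = minizinc.Model()
        model.add_string(mzn_code)                 # generated mzn content
        solver = minizinc.Solver.lookup(solver_name)
        minizinc.Instance(solver, model).solve()   # raises on syntax/type errors
        return None
    except Exception as exc:                       # MiniZinc errors surface here
        return str(exc)
```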

5. Validation

Our study investigates several research questions (RQs) regarding the automated generation of MiniZinc constraint models from natural language descriptions. RQ1. Can an automated system generate MiniZinc constraint models from natural language task descriptions with both high syntactic correctness and solution validity? RQ2. Would such a system perform better using vanilla or fine-tuned pre-trained LLMs? RQ3. Does employing multiple LLMs for different system components improve the results? RQ4. Do newer LLMs generate MiniZinc constraint models with higher syntactic correctness and solution validity than older versions? RQ5. Does maintaining message history help the system generate improved constraint models? These questions collectively aim to advance our understanding of the optimal approaches for translating natural language to constraint models.
In order to perform any validation, it was necessary to create datasets for the fine-tuning of the LLMs. In Section 5.1, we describe in detail the datasets created in this study. Section 5.2 identifies experiments that we believe best validate our research questions. In Section 5.3, we describe in detail the ablation study we conducted. Finally, Section 5.4 shows the results of our experiments and the ablation variants.

5.1. Datasets

In addition to the ACMG process and its architecture, the main contribution of this study is the datasets we created. All datasets were created manually in collaboration with domain experts in the field of MiniZinc and/or telecommunication equipment configurations. Each dataset was preprocessed into jsonl format, with prompts and completions standardized to avoid truncation or padding mismatches. OpenAI’s tokenizer automatically managed subword segmentation, ensuring consistent input encoding. No additional tokenization normalization was applied, as the pipeline’s built-in methods align with the models’ pretraining standards.
Domain experts annotated the dataset using guidelines covering entity extraction (variables, domains, constraints) and MiniZinc translation. Inter-annotator agreement was measured (κ = 0.82), and all MiniZinc models were validated for syntax/solution correctness. Discrepancies were resolved via consensus.
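For illustration, a dataset row can be serialized to jsonl and submitted for fine-tuning roughly as follows. This sketch is ours and assumes the openai (>=1.0) Python SDK, whose chat fine-tuning endpoint expects a messages-style jsonl onto which the prompt–completion pairs described above map; the file name and the example row are hypothetical:

```python
# Sketch of serializing annotated dataset rows to jsonl and starting a fine-tuning
# job. The file name and example row are hypothetical placeholders.
import json
from openai import OpenAI

rows = [
    {"messages": [
        {"role": "user", "content": "Sudoku is a logic-based number placement puzzle ..."},
        {"role": "assistant", "content": "array[1..9, 1..9] of var 1..9: grid; ..."},
    ]},
]

with open("mzc-50.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")      # one training example per line

client = OpenAI()                            # reads OPENAI_API_KEY from the environment
with open("mzc-50.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")
```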
We produced the following datasets:
CSPNERMZC-50: This dataset [19] comprises 50 constraint satisfaction problems described in natural language with manually annotated semantic entities (input parameters, variables, domains, and constraints) and corresponding MiniZinc code. The dataset consists of seven columns:
A.
The model and constraint task are described in natural language.
B.
Model only.
C.
Constraint task only.
D.
Model-relevant semantic entities.
E.
Constraint task-relevant semantic entities.
F.
MiniZinc model (mzn file).
G.
MiniZinc constraint task (dzn file).
The segmentation of the CSPNERMZC-50 dataset into smaller subsets is presented in Figure 11. These smaller datasets were used to build the different datasets described in the remainder of this subsection (e.g., CSPNER-50 is a combination of CSP-50 and NER-50, as shown in Figure 11).
CSP-50: This dataset [19] comprises 50 constraint satisfaction problems described in natural language. The dataset consists of three columns:
  • The model and constraint task are described in natural language.
  • Model only.
  • Constraint task only.
CSPNER-50: This dataset [19] comprises 50 constraint satisfaction problems described in natural language with manually annotated semantic entities (input parameters, variables, domains, and constraints). The dataset consists of five columns:
  • The model and constraint task are described in natural language.
  • Model only.
  • Constraint task only.
  • Model-relevant semantic entities.
  • Constraint task-relevant semantic entities.
MZC-50: This dataset [19] comprises 50 MiniZinc constraint models, their natural language descriptions, and the generated MiniZinc code. The dataset consists of five columns:
  • The model and constraint task are described in natural language.
  • Model only.
  • Constraint task only.
  • MiniZinc model (mzn file).
  • MiniZinc constraint task (dzn file).
VAL-10: This dataset consists of 10 constraint satisfaction problems that are used for validation [51].
In addition to the four datasets described above, we also used the dataset in [16] and the Mixed CP Dataset from [17] for benchmarking; however, the dataset in [16] only contains arbitrary tasks, so we discarded it from our results. As the Mixed CP Dataset comprises 13 CSPs and 5 COPs, we only used its CSPs.

5.2. Experiments

The results from experiment 1 were validated automatically with help from the MiniZinc Python library, while the results from experiments 2 and 3 had to be validated manually, but only once for both sets of experiments:
1.
Evaluation of Generated Constraint Model’s Syntax Correctness
In this experiment, we quantified the percentage of generated constraint models that exhibited syntactical correctness, defined as the absence of errors or warnings during parsing and validation.
2.
Evaluation of Generated Constraint Model’s Solution Correctness
We further assessed the percentage of generated constraint models that, while syntactically correct (i.e., free of errors or warnings), failed to align with the ground truth solution derived from the original problem definition. These models, though valid in syntax, produced solutions that deviated from the expected behavior. Such models may represent plausible solutions to related or analogous problems but do not accurately address the specific constraints and requirements of the target problem.
3.
Evaluation of Generated Constraint Model’s Validity
Finally, we measured the percentage of constraint models generated by the ACMG engine that not only passed the MiniZinc validation checks without errors or warnings but also produced solutions consistent with the ground truth. These models were deemed fully valid, as they exhibited both syntactical correctness and functional alignment with the expected behavior defined by the input prompt. A valid model must precisely reflect the user’s intent and adhere to the problem’s specified constraints.
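The three measurements can be computed from two per-model flags, as in the following sketch (the record structure is ours):

```python
# Sketch of the three evaluation metrics described above, computed from per-model
# flags (syntactically_ok, matches_ground_truth).
def summarize(results):
    """results: list of (syntactically_ok: bool, matches_ground_truth: bool) pairs."""
    n = len(results)
    syntax_ok   = sum(ok for ok, _ in results)                  # experiment 1
    wrong_sol   = sum(ok and not good for ok, good in results)  # experiment 2
    fully_valid = sum(ok and good for ok, good in results)      # experiment 3
    return {"syntax_correct_%": 100 * syntax_ok / n,
            "syntax_ok_wrong_solution_%": 100 * wrong_sol / n,
            "fully_valid_%": 100 * fully_valid / n}
```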

5.3. Ablation Study

To properly answer RQ2 and RQ3, we devised different variants of the ACMG’s architectural configuration. We segmented the two-step process from prior studies [2,18,38] into a multi-step process. The ACMG comprises four modules that utilize an LLM, as defined in Section 4. To address RQ2 and RQ3, we designed an ablation study with the following six variants (see Appendix D):
1.
Without a fine-tuned model; only the vanilla LLM is used.
2.
One fine-tuned model was used as the Parser, Recognizer, and Generator. The Evaluator was a vanilla LLM. The one LLM was fine-tuned with one large, combined dataset (fine-tuned with CSPNERMZC-50 [19]).
3.
Two fine-tuned models were used: one for the Recognizer (fine-tuned with CSPNER-50 [19]) and one for the Generator (fine-tuned with MZC-50 [19]). The vanilla LLM was used for syntax correction and parsing.
4.
Two fine-tuned models were used: one for the Recognizer (fine-tuned with CSPNER-50 [19]) and one for the Generator (fine-tuned with CSPNERMZC-50 [19]). The vanilla LLM was used for syntax correction and parsing.
5.
Three fine-tuned models were used, one for each module: the Parser (fine-tuned with CSP-50 [19]), the Recognizer (fine-tuned with CSPNER-50 [19]), and the Generator (fine-tuned with MZC-50 [19]).
6.
Three fine-tuned models were used, one for each module: the Parser (fine-tuned with CSP-50 [19]), the Recognizer (fine-tuned with CSPNER-50 [19]), and the Generator (fine-tuned with CSPNERMZC-50 [19]).
In addition to the six base variants described above, we ran each with and without a message history; the first six experiments were run without a message history and the second six with a message history. Furthermore, we used only GPT-3.5-turbo in one experiment and GPT-4o in the second. The fine-tuning process employed default hyperparameters as specified by OpenAI’s pipeline, with the key settings summarized in Table 2.
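For readability, the six base variants can be summarized as a mapping from module to model type and fine-tuning dataset (our own compact encoding of the list above):

```python
# Compact summary (our own encoding) of the six base ablation variants: each maps
# the Parser / Recognizer / Generator to a vanilla LLM ("vLLM") or a fine-tuned LLM
# with its fine-tuning dataset; the Evaluator is always a vLLM.
ABLATION_VARIANTS = {
    1: {"Parser": "vLLM", "Recognizer": "vLLM", "Generator": "vLLM"},
    2: {"Parser": "fLLM:CSPNERMZC-50", "Recognizer": "fLLM:CSPNERMZC-50",
        "Generator": "fLLM:CSPNERMZC-50"},                      # one shared fLLM
    3: {"Parser": "vLLM", "Recognizer": "fLLM:CSPNER-50", "Generator": "fLLM:MZC-50"},
    4: {"Parser": "vLLM", "Recognizer": "fLLM:CSPNER-50", "Generator": "fLLM:CSPNERMZC-50"},
    5: {"Parser": "fLLM:CSP-50", "Recognizer": "fLLM:CSPNER-50", "Generator": "fLLM:MZC-50"},
    6: {"Parser": "fLLM:CSP-50", "Recognizer": "fLLM:CSPNER-50",
        "Generator": "fLLM:CSPNERMZC-50"},
}
# Variants 7-12 repeat 1-6 with message history enabled within the same LLM session.
```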

5.4. Results

All experiments were run using GPT-3.5-turbo and GPT-4o, and the results of those experiments using our dataset are provided in Figure 12.
Table 3 describes the legend in Figure 12 in more detail. The valid column indicates a successful result if the LLM generated valid MiniZinc syntax, and an unsuccessful result otherwise. The solution column indicates a successful result if the MiniZinc model generated by the LLM corresponds to the problem described by the input prompt; if it differs from what was stated in the user input prompt, the result is marked as unsuccessful.
Figure 12a,b are the results obtained using the dataset used by the authors of [51], and Figure 12c,d are those obtained when using the Mixed CP Dataset from the study in [17]. The different types of experiment are explained in more detail in the ablation study. Experiment type 0 indicates an experiment where the ACMG was not used, but plain zero-shot prompting was employed on the same problem for the corresponding vanilla LLM, GPT 3.5 Turbo, or GPT 4o. Experiments (1–6) correspond to ablation studies without message history, while experiments (7–12) use message history. The message history is the complete message history obtained during the ACMG process from when the input user prompt is received until the solution, the output prompt, is returned to the user. A detailed analysis of the results is provided below.
Each of the four graphs in Figure 12a–d has 13 different architectural configurations on the x-axis and different user prompts (experiments) on the y-axis. In Figure 12a,b, there are 10 different experiments (our 10 original CSPs [51]), and in Figure 12c,d there are 13 different experiments (the 13 CSPs taken from the Mix-CP dataset from the study in [17]). In each configuration set up, all experiments were run sequentially. If the message history was turned on, that means it was used within the same LLM session. The ablation study (Section 5.3) provides further detail about the choice of dataset or subset of the dataset, if any was used for fine-tuning.
The first configuration, 0, indicates that one vanilla LLM (vLLM) was used in a simple zero-shot prompting technique, meaning that the user prompt was sent to the vLLM with a control prompt asking it to generate MiniZinc files, mzn and dzn, with the vLLM returning both files. This case was used as the baseline. In addition, case 0 did include message history.
In the following six configurations, 1–6, message history was not used, while the last six configurations, 7–12, use message history. The results show that message history influenced the results in a positive way, pushing both the validity and correctness of the generated constraint models higher. These two groups of configurations, 1–6 and 7–12, are the same in all other aspects, with configuration 1 being equivalent to configuration 7 and so forth.
In configuration 1, one vLLM was used by all actors in the ACMG process (the Parser, the Recognizer, the Generator, and the Evaluator). In Figure 12a,b, configuration 1 shows worse results. In the case of Figure 12a, having the messaging history turned off probably had a detrimental effect, while in Figure 12b, due to all experiments being run in the same session, some basic assumptions were interpreted incorrectly, which influenced the results of all experiments.
Configurations 2 and 8 employ one common fine-tuned LLM (fLLM) that is used by the Parser, the Recognizer, and the Generator. The Evaluator was a vLLM, as our dataset does not influence the evaluation process. For this configuration, we observed an improvement in both the validity and correctness of the generated constraint models.
Configurations 3–4 and 9–10 use four distinct LLMs: two vLLMs used by the Parser and the Evaluator, and two fLLMs used by the Recognizer and the Generator. Each fLLM is fine-tuned with a different dataset, as described in more detail in the ablation study (Section 5.3). The difference between configurations 3/9 and 4/10 is the dataset used to fine-tune the Generator’s LLM: configurations 3 and 9 used a specialized dataset (a subset tailored to the generation of constraint models), whereas configurations 4 and 10 used the complete dataset. Using the specialized dataset yielded better results in terms of the percentage of valid and correct constraint models generated, even though both datasets contain the same data. Furthermore, configurations 3 and 9 indicate that two fLLMs work better than one or no fLLM.
Configurations 5–6 and 11–12 again use four distinct LLMs, but only one is a vLLM, the one used by the Evaluator. The three fLLMs are used by the Parser, the Recognizer, and the Generator. The difference between configurations 5/11 and 6/12 is again the dataset used to fine-tune the Generator’s LLM: configurations 5 and 11 used the specialized dataset (a subset tailored to the generation of constraint models), whereas configurations 6 and 12 used the complete dataset. These configurations confirm that specialized datasets yield better results in terms of the percentage of valid and correct constraint models generated. They also reaffirm that more fLLMs yield better results, as seen in configurations 5 and 11.
All aforementioned behaviors and conclusions hold for all experiments in Figure 12a–d. It was interesting to observe the effect of different LLM versions: the experiments in Figure 12a, the new original tests created by us [51], were run on GPT-3.5-turbo, while those in Figure 12b were run on GPT-4o, and the results were almost identical. This was not the case for the experiments taken from Mixed-CP [17], which were run on GPT-3.5-turbo in Figure 12c and on GPT-4o in Figure 12d; here, we observed clear improvements in Figure 12d. We suspect this was caused by data leakage, as the Mixed-CP dataset is based on an introductory university course on constraint programming.

6. Discussion

Before starting the research, we formulated several research questions (RQs) that we sought to answer. To answer them, we developed an ACMG engine prototype, the ACMG process, and the ACMG data flow; constructed the datasets described in Section 5.1; and conducted validations. Next, we summarize our key findings.
Regarding RQ4, in most cases we observed no difference between GPT-3.5-turbo and GPT-4o, which we attribute to the scarcity of MiniZinc data available for LLMs to be trained on; where differences did appear, the newer LLM produced higher-quality and better-performing constraint models than the older version. Regarding RQ5, we confirmed that maintaining message history yielded better results in most cases, highlighting the importance of context for the ACMG engine and for LLMs in general.
Regarding RQ2 and RQ3, we found that using multiple fine-tuned models (i.e., one for each module) produces better results than using a single fine-tuned model for all modules, two fine-tuned models, or none. The generation of constraint models from natural language has long been recognized as a complex problem, and to simplify the task it was split into two parts [2,18,38]: NER and model generation. Moreover, different methods were developed for solving the NER [18,37,38] and model generation [17,18,38] tasks. Our hypothesis is that by segmenting this process even further, introducing parsing and evaluation steps, we can boost performance in a manner similar to how step-by-step decomposition in Chain-of-Thought prompting [52] boosted the performance of LLMs. Furthermore, by fine-tuning the LLMs, we adjusted their weights for specific sub-tasks, effectively specializing each actor for its role, which boosted performance even further.
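To illustrate this specialization, a minimal sketch of routing each actor to its own model is given below; the fine-tuned model identifiers are hypothetical placeholders for the IDs returned by the fine-tuning jobs, and the system and user prompts are supplied by the caller.

from openai import OpenAI

client = OpenAI()

# Hypothetical model IDs; the Evaluator keeps a vanilla model, as in our configurations.
ACTOR_MODELS = {
    "parser": "ft:gpt-3.5-turbo:org::parser",
    "recognizer": "ft:gpt-3.5-turbo:org::recognizer",
    "generator": "ft:gpt-3.5-turbo:org::generator",
    "evaluator": "gpt-3.5-turbo",
}

def ask(actor, system_prompt, user_prompt):
    # Send one sub-task to the LLM specialized for that actor's role.
    response = client.chat.completions.create(
        model=ACTOR_MODELS[actor],
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
    )
    return response.choices[0].message.content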
The ACMG engine generated MiniZinc constraint models with valid syntax in up to 80% of cases and with valid solutions in up to 30% of cases, which is promising, as the problems provided were not trivial and required a significant understanding of the constraint programming domain. This shows that the ACMG engine has great potential in helping junior- and mid-level constraint developers write constraint models. Furthermore, in terms of RQ1, in which we asked whether an automated system could generate valid MiniZinc constraint models with high accuracy from natural language task descriptions, we believe the ACMG engine is superior to other solutions, as we used zero-shot prompting as opposed to the few-shot prompting used in [17]; moreover, we obtained better results than the studies in [16,17].

6.1. Limitations

Carefully crafted prompts written in natural language are required for good results. Poor or ambiguous task descriptions will result in incorrect or incomplete constraint models.
The context size of LLMs remains a limitation; for instance, GPT-3.5-turbo is still restricted to 16k tokens. Given this constraint, complex constraint satisfaction problems (CSPs) with numerous variables, intricate constraints, and elaborate logical structures may prove challenging. For an ACMG engine configuration that relies on message history and GPT-3.5-turbo, the 16k token limit (which, until recently, was only 4k tokens) can pose significant limitations for complex CSPs: such problems often require extensive textual descriptions, and the ACMG process itself adds further token overhead due to its composite nature, combining the NER and generation phases as well as three inner and three outer loops. These factors can easily exceed the 16k token window. While multiple GPT versions are available via the API, only GPT-4-turbo and GPT-4o currently support a 128k context size. Other model families may offer larger context windows, but they often lack fine-tuning capabilities or are less user-friendly.
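Because the combined NER and generation phases, the repair loops, and the accumulated message history all consume tokens, it can be useful to check whether a session still fits the model's context window before each call. The sketch below uses the tiktoken library; the window sizes and the reply reserve are illustrative values, not part of the ACMG implementation.

import tiktoken

CONTEXT_LIMITS = {"gpt-3.5-turbo": 16_000, "gpt-4o": 128_000}  # approximate windows

def fits_context(messages, model="gpt-3.5-turbo", reserve_for_reply=2_000):
    # Count the tokens of all message contents and leave room for the reply.
    encoding = tiktoken.encoding_for_model(model)
    used = sum(len(encoding.encode(m["content"])) for m in messages)
    return used + reserve_for_reply <= CONTEXT_LIMITS[model]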
While this study achieved its objectives through OpenAI’s managed service, we note two hardware-related limitations: (1) the inability to specify or benchmark physical hardware configurations and (2) dependence on OpenAI’s proprietary scaling decisions. These constraints are inherent to commercial LLM APIs but warrant consideration for studies requiring hardware-level reproducibility.
The quality and diversity of the training datasets have an impact on model performance. The ACMG engine and datasets for fine-tuning presented in this study were created with the configuration task in mind. This means that they can be used for other constraint programming problems; however, they might not yield SOTA results.

6.2. Future Work

Further improvements to the ACMG engine are already being planned. First, we plan to add a method for classifying the problem so that we can use the best solver supported by MiniZinc. As mentioned in the methodology section, a unified benchmark for solver-independent modeling frameworks is required.
Furthermore, we plan to support other flows, such as task-only or model-only flows. Additionally, we aim to enrich the model storage with embeddings and to use stored models as input when semantically similar models are requested, inspired by the authors of [17]. A study by the authors of [53] suggests that RAG outperforms fine-tuning for less popular knowledge. This claim should be explored: is it valid for code generation in general, and for CP in particular? CP is specific, and the current SOTA for generating CP models consists of two parts, as first defined by the authors of [2]. We believe that further advancement towards a CP modeling assistant will involve a combination of fine-tuned LLMs and RAG. In the future, the LLM will be used not only to name the model but also to create a natural language description of it; both functionalities will be implemented and improved in future RAG extensions of the ACMG.
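A minimal sketch of the planned embedding-based model storage is given below, assuming the OpenAI embeddings endpoint; the embedding model name and the similarity threshold are illustrative choices rather than final design decisions.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    # One embedding vector per natural language model description.
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(out.data[0].embedding)

def most_similar(description, stored):
    # stored maps a model name to the embedding of its description.
    # Returns the name of the closest stored model, or None if nothing is close enough.
    query = embed(description)
    best_name, best_score = None, -1.0
    for name, vector in stored.items():
        score = float(np.dot(query, vector) /
                      (np.linalg.norm(query) * np.linalg.norm(vector)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= 0.80 else None  # illustrative threshold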
We plan to expand the datasets used for fine-tuning the Recognizer and the Generator with more and higher-quality examples, and to improve the quality of the existing datasets by involving additional MiniZinc experts, in order to test whether this further improves the quality of the generated MiniZinc models. Additional improvements could be achieved by using an adversarial, Self-Instruct-style method to create a much larger artificially generated dataset from our high-quality seed dataset. Such datasets could increase the performance of the ACMG by an additional 30%, according to [54], and a further 30% improvement in accuracy could be achieved if longer or repeated fine-tuning is carried out with that same dataset [55]. The datasets we created do not tackle the evaluation step, so additional high-quality datasets should be created specifically with evaluation in mind.
Further research is required to determine how comments in the code used for fine-tuning influence fine-tuning performance. Additional questions include whether fine-tuning on datasets whose code contains comments performs differently from fine-tuning on datasets without comments, and how the quality of those comments affects overall fine-tuning performance.
Lastly, more experiments should be conducted with other LLM families, especially recent newcomers such as DeepSeek [56] and Qwen [57].

7. Conclusions

The Automatic Constraint Model Generator (ACMG) is a novel approach for automating constraint satisfaction problem (CSP) modeling that leverages fine-tuned large language models (LLMs) to translate natural language descriptions into formal MiniZinc constraint models. Our experiments demonstrated that the ACMG can generate syntactically correct MiniZinc models in up to 80% of cases and produce valid solutions matching the ground truth in 30% of cases, significantly advancing the state of automated constraint programming. Our ablation studies revealed key insights: fine-tuned LLMs substantially outperform vanilla models; a modular architecture with specialized components for parsing, recognition, and generation yields superior results; and maintaining message history enhances performance through improved contextual awareness. While GPT-4o and GPT-3.5-turbo showed comparable performance, likely due to the limited MiniZinc-specific training data available to LLMs, the framework's success highlights the potential of LLMs to democratize constraint programming. Current limitations include sensitivity to input prompt quality and challenges with highly complex CSPs, suggesting directions for future work such as dataset expansion, the integration of retrieval-augmented generation techniques, and the exploration of alternative LLM architectures. By making our datasets and implementation publicly available, we aim to foster further research in this emerging field at the intersection of Natural Language Processing and constraint programming. The ACMG framework represents a significant step toward making constraint technology more accessible to non-experts while maintaining the rigor required for industrial and academic applications.

Supplementary Materials

The following supporting information can be downloaded from the repository available at https://github.com/erobpen/ACMG/ (accessed on 6 June 2025).

Author Contributions

Conceptualization, R.P., D.P. and M.V.; methodology, R.P. and D.P.; software, R.P. and D.P.; validation, R.P., D.P. and M.Š.; formal analysis, R.P., D.P. and M.V.; investigation, R.P.; resources, R.P. and D.P.; data curation, R.P. and M.Š.; writing—original draft preparation, R.P.; writing—review and editing, R.P., D.P. and M.V.; visualization, R.P.; supervision, D.P. and M.V.; project administration, R.P.; funding acquisition, R.P., D.P. and M.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ericsson Nikola Tesla d.d.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in Dataset for Fine-Tuning LLM to Generate MiniZinc at https://doi.org/10.34740/KAGGLE/DSV/11207997 (accessed on 6 June 2025), [19].

Acknowledgments

The authors would like to thank Ericsson Nikola Tesla d.d. for supporting this research.

Conflicts of Interest

Authors Roberto Penco and Marko Šoštarić were employed by the company Ericsson Nikola Tesla d.d. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Ericsson Nikola Tesla d.d. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Appendix A

Table A1. Component Role Definitions.
Component | Input | Output | Role
Parser | Natural language prompt | {Model, Task} segments | Instruction following
Recognizer | Model description | Structured entities | Constraint NER
Generator | Entities + model desc. | Constraint model (MiniZinc code) | Constraint-to-code mapping
Evaluator | Generated code | Validation diagnostics | Solution verification

Appendix B

Algorithm A1. Constraint Model Generation with Iterative Repair Mechanism
Input: model (natural language), task, recognizer_model, generator_model
Output: status (SATISFIED/UNSATISFIED), result
1: procedure ITERATIVE-REPAIR-REGENERATION
2:   model_name ← GENERATE-MODEL-NAME(model)
3:   for repair_cycle ← 1 to 3 do
4:     recognized ← RECOGNIZER.extract_entities(model)
5:     warnings ← null
6:     for gen_attempt ← 1 to 3 do
7:       constraint_model ← GENERATOR.generate(
8:         model, recognized, warnings)
9:       task_dzn ← FORMATTER.reformat_task(task, constraint_model)
10:        status, result, warnings ← SOLVER.evaluate(
11:         constraint_model, task_dzn)
12:       if status = SATISFIED then
13:         MEMORY.store_model(model_name, constraint_model)
14:         return (status, result)
15:   return (UNSATISFIED, null)
16: procedure GENERATE
17:   if warnings = null then
18:     prompt ← PROMPTS.initial(model, recognized)
19:   else
20:     prompt ← PROMPTS.repair(model, recognized, warnings)
21:   return LLM.query(prompt)

Appendix C

Table A2. Ablation Configurations (Component LLM View).
Component | Base Options | Tuning Status | Ablation Variants
Parser | gpt-3.5/gpt-4o | Both | V1 (vanilla), V2–V6 (tuned)
Recognizer | gpt-3.5/gpt-4o | Both | V1 (vanilla), V2–V6 (tuned)
Generator | gpt-3.5/gpt-4o | Both | V1 (vanilla), V2–V6 (tuned)
Evaluator | gpt-3.5/gpt-4o | Vanilla only | All variants

Appendix D

Table A3. Ablation Configurations (Component Dataset Fine-Tuning View).
 | V1 1 | V2 | V3 | V4 | V5 | V6
vLLM 5 | P, R, G, E 3 | E | P, E | P, E | E | E
fLLM(CSPNER-50) 2 | – | – | R | R | R | R
fLLM(CSPNERMZC-50) | – | P, R, G | – | G | – | G
fLLM(MZC-50) | – | – | G | – | G | –
fLLM(CSP-50) | – | – | – | – | P | P
LLM Qty 4 | 1 | 2 | 3 | 3 | 4 | 4
1 VX—variant number for architectural configuration of ACMG; V1 is Variant 1. 2 fLLM(Dataset) represents the distinct LLM fine-tuned with our dataset. 3 P, R, G, E—Parser, Recognizer, Generator, and Evaluator, respectively. 4 LLM Qty—the distinct number of LLMs in that configuration variant. 5 Vanilla LLM.

References

  1. Rossi, F.; van Beek, P.; Walsh, T. (Eds.) Handbook of Constraint Programming; Elsevier: Amsterdam, The Netherlands, 2006. [Google Scholar] [CrossRef]
  2. Tsouros, D.; Verhaeghe, H.; Kadıoğlu, S.; Guns, T. Holy Grail 2.0: From natural language to constraint models. arXiv 2023, arXiv:2308.01589. [Google Scholar] [CrossRef]
  3. Freuder, E.C.; O’Sullivan, B. Grand challenges for constraint programming. Constraints 2014, 19, 150–162. [Google Scholar] [CrossRef]
  4. Yuksekgonul, M.; Chandrasekaran, V.; Jones, E.; Gunasekar, S.; Naik, R.; Palangi, H.; Kamar, E.; Nushi, B. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. arXiv 2023, arXiv:2309.15098. [Google Scholar] [CrossRef]
  5. Régin, F.; De Maria, E.; Bonlarron, A. Combining constraint programming reasoning with large language model predictions. arXiv 2024, arXiv:2407.13490. [Google Scholar] [CrossRef]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  8. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. arXiv 2024, arXiv:2406.00515. [Google Scholar] [CrossRef]
  9. OpenAI. ChatGPT: A Large Language Model. 2022. Available online: https://chat.openai.com (accessed on 1 April 2025).
  10. Ridnik, T.; Kredo, D.; Friedman, I. Code generation with AlphaCodium: From prompt engineering to flow engineering. arXiv 2024, arXiv:2401.08500. [Google Scholar] [CrossRef]
  11. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
  12. Kahneman, D. Thinking, Fast and Slow; Farrar, Straus and Giroux: New York, NY, USA, 2011. [Google Scholar]
  13. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Chen, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv 2023, arXiv:2305.10601. [Google Scholar] [CrossRef]
  14. Lee, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar] [CrossRef]
  15. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. arXiv 2022, arXiv:2210.11416. [Google Scholar] [CrossRef]
  16. Almonacid, B. Towards an automatic optimisation model generator assisted with generative pre-trained transformers. arXiv 2023, arXiv:2305.05811. [Google Scholar] [CrossRef]
  17. Michailidis, K.; Tsouros, D.; Guns, T. Constraint modelling with LLMs using in-context learning. In Proceedings of the 30th International Conference on Principles and Practice of Constraint Programming (CP 2024), Girona, Spain, 2–6 September 2024; Shaw, P., Ed.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Wadern, Germany, 2024; Volume 307, pp. 20:1–20:27. [Google Scholar] [CrossRef]
  18. Dakle, P.P.; Kadıoğlu, S.; Uppuluri, K.; Politi, R.; Raghavan, P.; Rallabandi, S.; Srinivasamurthy, R.S. Ner4Opt: Named entity recognition for optimization modelling from natural language. In Machine Learning, Optimization, and Data Science; Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R., Sciacca, V., Eds.; Springer: Cham, Switzerland, 2023; pp. 299–319. [Google Scholar] [CrossRef]
  19. Penco, R. Dataset for Fine-Tuning LLM to Generate Minizinc [Data Set]. Kaggle. 2025. Available online: https://doi.org/10.34740/KAGGLE/DSV/11207997 (accessed on 1 April 2025).
  20. ACMG Code Repository. Available online: https://github.com/erobpen/ACMG/ (accessed on 1 April 2025).
  21. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.06196. [Google Scholar] [CrossRef]
  22. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar] [CrossRef]
  23. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  24. OpenAI. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  25. Enis, M.; Hopkins, M. From LLM to NMT: Advancing low-resource machine translation with Claude. arXiv 2024, arXiv:2404.13813. [Google Scholar] [CrossRef]
  26. Zhang, S.; Dong, L.; Li, X.; Zhang, S.; Sun, X.; Wang, S.; Li, J.; Hu, R.; Zhang, T.; Wu, F.; et al. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792. [Google Scholar] [CrossRef]
  27. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  28. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. [Google Scholar] [CrossRef]
  29. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv 2021, arXiv:2107.13586. [Google Scholar] [CrossRef]
  30. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 1 April 2025).
  31. Khashabi, D.; Min, S.; Khot, T.; Sabharwal, A.; Tafjord, O.; Clark, P.; Hajishirzi, H. UnifiedQA: Crossing format boundaries with a single QA system. arXiv 2020, arXiv:2005.00700. [Google Scholar] [CrossRef]
  32. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  33. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar] [CrossRef]
  34. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided language models. arXiv 2022, arXiv:2211.10435. [Google Scholar] [CrossRef]
  35. Doan, X.-D. VTCC-NLP at nl4opt competition subtask 1: An ensemble pre-trained language models for named entity recognition. arXiv 2022, arXiv:2212.07219. [Google Scholar] [CrossRef]
  36. Ning, Y.; Liu, J.; Qin, L.; Xiao, T.; Xue, S.; Huang, Z.; Liu, Q.; Chen, E.; Wu, J. A novel approach for auto-formulation of optimization problems. arXiv 2023, arXiv:2302.04643. [Google Scholar] [CrossRef]
  37. Wang, K.; Chen, Z.; Zheng, J. Opd@nl4opt: An ensemble approach for the NER task of the optimization problem. arXiv 2023, arXiv:2301.02459. [Google Scholar] [CrossRef]
  38. Ramamonjison, R.; Yu, T.T.; Li, R.; Li, H.; Carenini, G.; Ghaddar, B.; He, S.; Mostajabdaveh, M.; Banitalebi-Dehkordi, A.; Shi, Y.; et al. NL4Opt competition: Formulating optimization problems based on their natural language descriptions. arXiv 2023, arXiv:2303.08233. [Google Scholar] [CrossRef]
  39. Gunasekar, S.; Zhang, Y.; Aneja, J.; Mendes, C.C.; Del Giorno, A.; Gopi, S.; Javaheripi, M.; Kauffmann, P.; De Rosa, G.; Saarikivi, O.; et al. Textbooks Are All You Need. arXiv 2023, arXiv:2306.11644. [Google Scholar] [CrossRef]
  40. Gangwar, N.; Kani, N. Highlighting named entities in input for auto-formulation of optimization problems. In Machine Learning, Optimization, and Data Science; Nicosia, G., Pardalos, P., Umeton, R., Giuffrida, G., Sciacca, V., Eds.; Springer: Cham, Switzerland, 2022; pp. 129–143. [Google Scholar] [CrossRef]
  41. Nethercote, N.; Stuckey, P.J.; Becket, R.; Brand, S.; Duck, G.J.; Tack, G. MiniZinc: Towards a standard CP modelling language. In Proceedings of the 13th International Conference on Principles and Practice of Constraint Programming, Providence, RI, USA, 25–29 September 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 529–543. [Google Scholar] [CrossRef]
  42. Nightingale, P. Savile Row Manual. arXiv 2021, arXiv:2201.03472. [Google Scholar] [CrossRef]
  43. Akgün, Ö.; Frisch, A.M.; Gent, I.P.; Jefferson, C.; Miguel, I.; Nightingale, P.; Salamon, A.Z. Towards Reformulating Essence Specifications for Robustness. arXiv 2021, arXiv:2111.00821. [Google Scholar] [CrossRef]
  44. Guns, T. Increasing Modeling Language Convenience with a Universal N-Dimensional Array, CPpy as Python-Embedded Example. In Proceedings of the 18th Workshop on Constraint Modelling and Reformulation at CP (ModRef 2019), Stamford, CT, USA, 30 September 2019. [Google Scholar]
  45. Stuckey, P.J.; Marriott, K.; Tack, G. MiniZinc Documentation (Version 2.8.3). MiniZinc. 2024. Available online: https://www.minizinc.org/doc-latest/en/index.html (accessed on 1 April 2025).
  46. ACMG Control Prompts Repository. Available online: https://github.com/erobpen/ACMG/tree/main/appendices/Appendix_1 (accessed on 1 April 2025).
  47. Schulte, C.; Stuckey, P.J. Efficient constraint propagation engines. ACM Trans. Program. Lang. Syst. 2008, 31, 2. Available online: https://www.gecode.org/papers/SchulteStuckey_TOPLAS_2008.pdf (accessed on 1 April 2025). [CrossRef]
  48. Tack, G. Constraint Propagation: Models, Techniques, Implementation. Ph.D. Dissertation, Saarland University, Saarbrücken, Germany, 2009. Available online: https://www.gecode.org/papers/Tack_PhD_2009.pdf (accessed on 1 April 2025).
  49. Schulte, C.; Tack, G.; Lagerkvist, M.Z. Modeling and Programming with Gecode. 2010. Available online: https://www.gecode.org/doc-latest/MPG.pdf (accessed on 1 April 2025).
  50. Google OR-Tools. Google Optimization Tools. 2010. Available online: https://developers.google.com/optimization (accessed on 1 April 2025).
  51. ACMG Validation Test Suite Repository. Available online: https://github.com/erobpen/ACMG/tree/main/appendices/Appendix_7 (accessed on 1 April 2025).
  52. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
  53. Soudani, H.; Kanoulas, E.; Hasibi, F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. arXiv 2024. [Google Scholar] [CrossRef]
  54. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv 2022, arXiv:2212.10560. [Google Scholar] [CrossRef]
  55. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022, arXiv:2203.15556. [Google Scholar] [CrossRef]
  56. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  57. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
Figure 1. An example of a CSP and its solution.
Figure 2. Instantiations and a partial instantiation from Figure 1.
Figure 3. Configuration task example.
Figure 4. Generic building blocks of the ACMG process.
Figure 5. Overview of the ACMG process (a) and its data flow (b).
Figure 6. Example from Figure 3 of prompts used during the ACMG process’s Recognizer phase.
Figure 7. Example of an aggregated input prompt from the example shown in Figure 6.
Figure 8. Example from Figure 3 of the prompts used during the ACMG process’s Generator phase.
Figure 9. LLM response from Figure 7 and Figure 8.
Figure 10. Architecture overview of the Automatic Constraint Model Generator (ACMG).
Figure 11. Segmentation of the created CSPNERMZC-50 dataset into subsets.
Figure 12. The results of the four different groups of experiments: (a) VAL-12 GPT 3.5 turbo; (b) VAL-12 GPT 4o; (c) Mixed-CP GPT 3.5 turbo; and (d) Mixed-CP GPT 4o.
Table 1. Summarized differences between the ACMG engine, constraint modeling with LLMs using in-context learning [17], and an automatic optimization model generator assisted by generative pre-trained transformers [16].
 | ACMG | [17] | [16]
Multi-step process | Yes | Yes | No
NER | Yes | Yes | No
RAG | No | Yes | No
Encoder LLM in NER | No | Yes | No
Decoder LLM in NER | Yes | Yes | No
Fine-tuning LLM | Yes | No | No
Multiple LLMs | Yes | No | No
Prompting technique | Zero-Shot | Few-Shot | Zero-Shot
LLM-guided process | Yes | No | Partially
Table 2. Fine-tuning hyperparameters.
Parameter | Value | Notes
Input format | jsonl | Prompt-completion pairs
Tokenization | OpenAI BPE | Subword segmentation (GPT-3.5/4)
Batch size | 1 | Sequential processing
Epochs | 1 | Default setting
Learning rate | Multiplier = 1 | Base rate undisclosed by OpenAI
Training loss | 0.057 | Final convergence metric
Table 3. Legend description for Figure 12.
Legend | Valid | Solution
Fail | Incorrect | Incorrect
Partial success | Correct | Incorrect
Success | Correct | Correct
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
