1. Introduction
Large language models (LLMs), such as GPT, Claude, Gemini, Llama, DeepSeek, and Qwen, are deep neural networks trained on massive text datasets to understand, generate, and manipulate both human and programming languages [1]. LLMs have demonstrated remarkable capabilities in many software development tasks [2], such as code generation [3,4,5,6], code summarization [7,8], comment generation [9], test case generation [10,11], and bug fixing [12,13]. While LLMs excel at solving code-related tasks, their application to higher-level software design tasks—particularly domain modeling—remains relatively underexplored in the literature. This oversight is significant given modeling’s critical role in software development and maintenance.
In object-oriented methodology, domain modeling is the process of translating a business requirement into a class diagram that reflects the core concepts and relationships of the business domain. The process fundamentally differs from coding: it requires conceptual abstraction, contextual comprehension, and semantic pattern identification beyond syntactic manipulation. Existing research has primarily focused on guiding LLMs to construct domain models through two main approaches: (1) improved prompting techniques, such as in-context learning and chain-of-thought reasoning [14,15]; and (2) decomposing the modeling process into subtasks [16,17], such as entity identification and relationship extraction, rather than generating the entire model in a single step.
In this paper, we propose a new perspective on LLM-based domain modeling: can we enhance the quality of generated models through an automated process of review and revision? Although rapid advances now allow LLMs to handle complex tasks via simple prompts, they remain plagued by the persistent issue of hallucination [18]. Based on this observation, this paper proposes MoRe, an LLM-based approach to domain model generation with self-refinement. First, MoRe generates a draft domain model by prompting an LLM with a simple instruction and a textual domain description. This draft is then fed into a three-step self-refinement pipeline, where it undergoes sequential enhancement and correction: (1) model correction, (2) model reduction, and (3) rule-based model fixing. Note that the general idea of self-refinement was proposed by Madaan et al. [19] and has been widely adopted in many LLM-based SE tasks. This work applies the idea to domain modeling and proposes a new hybrid pipeline that combines LLMs with rule-based checkers.
For evaluation, we first conducted an experiment using 30 domain modeling problems across four open-source LLMs: DeepSeek-V3.2, DeepSeek-R1-Distill-Llama-8B, Qwen3-Coder-480B-A35B-Instruct, and Qwen3-Coder-30B-A3B-Instruct. This experiment compared MoRe against two baseline approaches. The results demonstrate that MoRe improves the quality of generated domain models, as measured by F1 scores, over the baselines for most of the LLMs tested. Additionally, we performed an ablation study to assess the contribution of each step in the self-refinement pipeline. The findings confirm that the complete three-step procedure achieves the most consistent and balanced performance across different evaluation metrics and LLMs.
The contributions of this paper are summarized as follows:
- We propose MoRe, an LLM-based approach to domain model generation with a hybrid self-refinement pipeline.
- We construct a benchmark for domain model generation that contains 30 domain requirements along with reference models. Additionally, we implement a semantic model matcher to assess how closely a generated model matches a reference model.
- We conduct an empirical evaluation and demonstrate the effectiveness of MoRe.
The remainder of this paper is organized as follows: Section 2 covers background and related work. Section 3 details the MoRe system, while Section 4 describes its semantic matching algorithm. Section 5 outlines the experimental design, and Section 6 presents the results. Section 7 addresses the research questions, and Section 8 examines the work’s strengths and limitations. Finally, the last section concludes and suggests future work.
3. The MoRe Approach
This section proposes MoRe, an LLM-based approach to domain model generation with hybrid self-refinement. The overall workflow of MoRe is illustrated by example in Figure 2. The process of converting a textual requirement into a domain model consists of four major steps. Step ➀ is initial generation, in which an LLM is prompted to produce an initial domain model based on the given domain requirement. Step ➁, model correction, follows: the LLM is asked to detect specific semantic issues in the initial model and correct them. In step ➂, model reduction, the corrected model is sent again to the LLM to identify and remove redundant fragments. Finally, step ➃, rule-based fixing, employs a rule-based checker to scan the current model. This produces two categories of issues: confirmed issues with known fixes and pending issues. For the first category, MoRe applies algorithmic fixes using predefined strategies. For the second, MoRe queries the LLM to decide how to resolve them. After addressing these issues, MoRe outputs the final domain model. The prompt templates used in MoRe are presented in Appendix A.
3.1. Initial Generation
The first step of MoRe is to generate a draft domain model from the given requirement using an LLM. Since MoRe focuses primarily on the refinement phase, we use only a simple prompt to guide the model generation. The prompt begins by defining the LLM’s role and task: “You are a domain modeling expert for UML class diagrams. You must carefully analyze <Description> and create a domain model that represents the domain knowledge provided.” It then provides the domain description and specifies the expected output format. To facilitate comprehension and generation by LLMs, we consistently use PlantUML as the encoding format for domain models when interacting with the LLM.
Upon receiving this prompt, the LLM returns an initial domain model draft. This draft, however, is likely to contain various semantic and syntactic flaws. MoRe is designed to refine the model and minimize such flaws, thereby enhancing its overall quality. To understand what flaws LLMs tend to produce, we used the same simple prompt to interact with DeepSeek V3.2 and Qwen3-Coder-480B-A35B-Instruct and collected 60 samples. By analyzing these samples, we identified a set of recurring issues and quantified their frequency, as summarized in Table 1. The frequency was calculated per sample, counting each issue only once per sample regardless of how many times it occurred within that sample. We group these issues into three types: semantics missing, conceptual error, and redundancy. Based on these common issues, MoRe proceeds to the refinement phase.
Semantics missing means that the generated model is incomplete, resulting in the absence of critical semantic elements. Table 1 lists two issues of this type, both related to reference names, as these can be effectively addressed. While a generated model may also exhibit other issues of this type—such as missing implied classes—relying on LLMs to analyze them could introduce many false positives and thereby degrade model quality.
Conceptual error refers to cases where an LLM incorrectly applies modeling concepts while constructing a domain model. Table 1 outlines four such issues. For instance, misuse of class as primitive type describes the inclusion of a class that semantically corresponds to a primitive type.
Redundancy refers to situations in which the generated model contains over-designed fragments that can be removed without affecting its meaning. For example, if a class in the model has no attributes and participates in no relationships—as in the case of an unnecessary class—it can be considered redundant and may be removed.
3.2. Model Correction
The second step of MoRe is model correction, which aims to address the issues of semantics missing and conceptual error. This step is further divided into two sequential subtasks: name correction and concept correction.
MoRe first performs name correction by instructing an LLM to refine the names of associations and compositions based on the domain requirement. The LLM is prompted to act as a senior modeler and to examine every reference in the model. It is given three guidelines: (1) each relationship must receive a meaningful, semantics-based name; (2) generic or verb-based labels (e.g., has, contains, refers_to) are prohibited; and (3) the output must adhere to a specified format that is easy to parse. For example, as shown in Figure 2, the relationship name has in the initial model is changed to departments in the corrected model.
Afterward, MoRe performs concept correction using the LLM. The LLM is instructed to review the domain model thoroughly and scan for any conceptual misuse. MoRe provides three guidelines for this task: (1) a complete list of conceptual errors that the LLM must verify; (2) an instruction to adjust any unreasonable inheritance hierarchies, as fixing conceptual errors may alter inheritance relationships; and (3) restrictions on the format of the output model for ease of parsing. For example, as shown in Figure 2, the relationship has in the initial model is a misuse of inheritance as a reference. Hence, in the corrected model, MoRe changes it to a composition. MoRe also directs the LLM to produce a reasoning analysis before outputting the revised model, ensuring this analysis is available for subsequent refinement. The entire conversation history from the name correction phase remains visible during concept correction, enabling the LLM to preserve naming consistency and avoid logical contradictions.
3.3. Model Reduction
The third step of MoRe, called model reduction, aims to minimize redundant fragments—such as unnecessary classes, attributes, and references—in the domain model. MoRe begins by extracting the revised domain model from the LLM’s latest response. It then instructs the LLM to review and simplify the model based on the domain requirements, guiding the reduction process step by step.
First, the LLM is asked to identify redundant elements. To support its judgment, MoRe provides explicit criteria:
For a class: if it holds structural features or provides behavioral functionality to the system, it is not redundant.
For an attribute: if it is mentioned in the requirements and is not derived, it is considered essential.
For a reference: if it represents a structural relationship between objects that must be preserved (i.e., stored), rather than a transient runtime invocation, it is considered essential.
Afterward, the LLM is requested to adjust and simplify the over-designed fragments. For example, as shown in Figure 2, the attribute address is removed in the simplified model because it is not implied by the requirement.
Following the reduction step, MoRe queries the LLM again to detect and fix any grammatical errors that may remain in the domain model. This additional check is necessary because the reduction process can introduce syntax issues when the LLM adjusts the model structure. To ensure the output conforms to valid syntax, the prompt includes a carefully constructed example that illustrates the supported PlantUML grammar.
Finally, MoRe invokes a customized parser to convert the domain model from PlantUML into an Ecore model, because the last step of MoRe uses a rule-based checker that works only on Ecore models.
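To illustrate what such a conversion involves, the sketch below parses a tiny subset of PlantUML into a simple in-memory model. It is a minimal illustration only: MoRe’s actual parser targets Ecore and supports a much larger grammar, and the two regular expressions and the model shape here are our own assumptions.

```python
import re

# Hypothetical two-pattern grammar: class declarations and
# composition/association arrows with an optional relationship name.
CLASS_RE = re.compile(r"^class\s+(\w+)")
REL_RE = re.compile(r"^(\w+)\s+(-->|\*--)\s+(\w+)(?:\s*:\s*(\w+))?")

def parse_plantuml(text):
    """Convert a small PlantUML fragment into a dict-based model."""
    classes, refs = [], []
    for line in text.splitlines():
        line = line.strip()
        m = CLASS_RE.match(line)
        if m:
            classes.append(m.group(1))
            continue
        m = REL_RE.match(line)
        if m:
            src, kind, tgt, name = m.groups()
            refs.append({"src": src, "tgt": tgt, "name": name,
                         "composition": kind == "*--"})
    return {"classes": classes, "references": refs}
```

A line such as `Company *-- Department : departments` thus becomes a composition reference named `departments` from `Company` to `Department`.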
3.4. Rule-Based Fixing
While the previous two refinement steps are LLM-driven, the last step of MoRe is rule-based fixing with a model checker. The checker consists of a set of algorithmically implemented scanners, each dedicated to detecting a specific kind of modeling issue within an Ecore model. It should be noted that the issues checked and fixed in this step are concrete instances of the common issues listed in Table 1. If an issue identified by the checker can be confirmed and resolved using a predefined strategy, MoRe directly fixes it by modifying the Ecore model. Otherwise, MoRe forwards an issue description—along with potential resolution strategies—to the LLM, requesting it to confirm the issue and select an appropriate repair strategy. MoRe then fixes (or ignores) the issue according to the LLM’s feedback.
We now detail the issues supported by our rule-based checker and their corresponding fixing strategies, in the order in which they are processed.
1. The issue “id attribute” arises when a class contains both an attribute such as xyz_id and a reference to another class named xyz. In such cases, the attribute is considered an unnecessary structural feature and should be removed. To identify these issues, we defined a regular expression that matches potential id attributes in each class. For each candidate, we further verify whether a corresponding reference exists that can be paired with the attribute. Once an id attribute is confirmed, it is simply removed.
2. The issue “is-a association” occurs when an association is given a name that implies an inheritance relationship. This step serves as a double check for the misuse of association as inheritance. We employ a regular expression to detect any association named with terms such as is, is-a, inherits, extends, or similar variants. All is-a associations are replaced with inheritance relationships.
3. The issue “empty child classes” arises when all child classes of a parent class are empty, i.e., they contain no attributes, participate in no references, and have no subclasses. This issue is a special case of unnecessary classes. To fix empty child classes, we introduce an enumeration to capture the subtype distinctions, add a type attribute to the parent class, and finally remove the now-redundant child classes.
4. The issue “unused enumeration” occurs when the model contains an enumeration that is never referenced. It is also a special case of an unnecessary class. For every enumeration, we traverse the model to check whether it is used as an attribute type. If not, we remove the enumeration from the model.
5. The issue “features to be pulled up” is detected when identical attributes appear across sibling classes, indicating they should be moved to the common superclass. Features to be pulled up are a special case of unnecessary structural features. The check involves traversing each parent’s inheritance tree, collecting child class features, and flagging any feature that occurs in multiple siblings based on name and type. The fix removes these features from all sibling subclasses and adds them once to the parent class.
6. The issue “empty class” arises when a class is empty (see the issue empty child classes). While this is a special instance of unnecessary classes, we cannot automatically confirm that an empty class is truly redundant, because it may be the result of parsing errors during the PlantUML-to-Ecore conversion, which could have caused its original definitions to be lost. Hence, we must consult an LLM for confirmation. If the LLM determines that the class is redundant, MoRe removes it.
7. The issue “derived reference” arises when a reference can be derived from two other references. It is a special case of unnecessary structural features. For example, if a Professor teaches a Course and the Course is enrolled by a Student, the direct relationship Professor-->Student may be derivable from the chain Professor-->Course and Course-->Student and could therefore be redundant. To check for this issue, we traverse all reference triples in the model and verify whether they form a transitive triangle. For each candidate triple, the LLM is asked to determine whether one of the references is semantically derived. If the LLM confirms that a reference is redundant, it is removed.
8. The issue “enumeration mirroring class hierarchy” arises when an enumeration is defined such that each of its literals semantically corresponds to a distinct subclass of a parent class. Detecting this issue cannot rely on simple string matching between enumeration literals and subclass names. Therefore, we employ LLM-assisted verification: for every parent class that contains an enumeration attribute, we present both the enumeration and the corresponding inheritance hierarchy to the LLM and request a judgment on whether they encode the same classification scheme. If so, we remove the enumeration and the attribute.
9. The issue “bidirectional reference pair” occurs when two classes are connected by two opposite references that represent the same logical relationship, resulting in unnecessary structural features. For example, in Figure 2, departments and affiliatedTo between Company and Department can be viewed as a bidirectional reference pair. Because we cannot determine whether two opposite references denote the same relationship based solely on structure, we send all candidate pairs of opposite references to the LLM and ask it to assess whether they represent the same semantic association and, if so, to indicate which one can be safely removed. Finally, MoRe fixes this issue according to the LLM’s feedback (since Ecore supports bidirectional references, MoRe turns a confirmed bidirectional reference pair into a properly configured bidirectional reference).
10. The issue “duplicate reference pair” occurs when two classes are connected by two references that share the same direction and essentially represent the same semantic relationship, introducing unnecessary structural features. To check for this issue, we identify all reference pairs between the same two classes that have the same direction (i.e., share the same source and target), and then use the LLM to determine whether they are semantically equivalent and, if so, to indicate which reference should be removed.
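The first two scanners above are purely regex-based. As a rough illustration, they could be sketched as follows; the in-memory model shape and the exact patterns are our own assumptions, not MoRe’s actual Ecore API.

```python
import re

# Hypothetical model shape:
#   classes: {class_name: {"attributes": [...], "references": {ref: target}}}
ID_ATTR = re.compile(r"^(?P<base>\w+?)_?[Ii][Dd]$")
IS_A = re.compile(r"^(is([-_ ]?a)?|inherits?|extends?)$", re.IGNORECASE)

def remove_id_attributes(classes):
    """Issue 1: drop an attribute like 'department_id' when the owning
    class also holds a reference to a class named 'Department'."""
    for cls in classes.values():
        targets = {t.lower() for t in cls["references"].values()}
        kept = []
        for attr in cls["attributes"]:
            m = ID_ATTR.match(attr)
            if m and m.group("base").lower() in targets:
                continue  # confirmed id attribute: remove it
            kept.append(attr)
        cls["attributes"] = kept

def fix_is_a_associations(associations, inheritances):
    """Issue 2: replace associations whose name implies inheritance
    with inheritance relationships (child = source, parent = target)."""
    kept = []
    for name, src, tgt in associations:
        if IS_A.match(name):
            inheritances.append((src, tgt))
        else:
            kept.append((name, src, tgt))
    return kept
```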
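The fix for empty child classes (issue 3) can likewise be sketched structurally. The model shape and the generated names (`<Parent>Type`, upper-cased literals, a `type` attribute) are illustrative assumptions; the real fix operates on Ecore models.

```python
def collapse_empty_children(model):
    """If every child of a parent class is empty (no attributes, no
    references, no subclasses), replace the children with an enumeration
    plus a 'type' attribute on the parent."""
    classes = model["classes"]
    children = {}
    for name, c in classes.items():
        if c["parent"]:
            children.setdefault(c["parent"], []).append(name)

    def is_empty(name):
        in_refs = any(name in (s, t) for _, s, t in model["references"])
        return (not classes[name]["attributes"]
                and not in_refs
                and name not in children)

    for parent, kids in children.items():
        if all(is_empty(k) for k in kids):
            enum_name = parent + "Type"  # illustrative naming scheme
            model["enums"][enum_name] = [k.upper() for k in kids]
            classes[parent]["attributes"].append(("type", enum_name))
            for k in kids:
                del classes[k]
    return model
```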
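For the derived-reference check (issue 7), the structural part, namely enumerating transitive triangles before the LLM is consulted, can be sketched as below; references are represented as simple (name, source, target) tuples for illustration.

```python
def transitive_triangles(references):
    """Collect triples (direct, leg1, leg2) where a direct reference
    A->C could be derived from the chain A->B and B->C. Each candidate
    still needs semantic confirmation (by an LLM in MoRe's pipeline)."""
    by_src = {}
    for ref in references:
        by_src.setdefault(ref[1], []).append(ref)
    triangles = []
    for direct in references:
        _, a, c = direct
        for leg1 in by_src.get(a, []):
            b = leg1[2]
            if leg1 is direct or b == c:
                continue
            for leg2 in by_src.get(b, []):
                if leg2[2] == c:
                    triangles.append((direct, leg1, leg2))
    return triangles
```

On the Professor/Course/Student example from the text, this flags the direct Professor-->Student reference as potentially derivable.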
4. Semantic Model Matcher
How LLM-generated models should be evaluated is an open issue. Existing research typically uses manual inspection, automatic Ecore model matchers, or LLM-based judgment. To minimize human bias and LLM instability, this paper proposes a semantic model matcher that realizes a deterministic matching algorithm.
First, we formally define a domain model m as a tuple of six model components, m = (CLS, ATT, REF, INH, ENM, LIT), where:
CLS is the set of classes; each class has a class name and a set of structural features (i.e., attributes and references);
ATT is the set of attributes; each attribute has a name and a (primitive) type; we assume that an attribute is always defined within an owner class;
REF is the set of references (i.e., associations and compositions); each reference is defined by a name, a source class, and a target class; we assume that a reference is owned by its source class;
INH is the set of inheritance relationships; each inheritance relationship is defined by a child class and a parent class and is denoted as (c, p), where c is the child class and p is the parent class;
ENM is the set of enumerations; each enumeration is defined by a name and a set of enumeration literals;
LIT is the set of enumeration literals; each literal is defined by a name; we assume that every literal belongs to an enumeration.
The following functions are also provided to traverse the model:
features(c) returns the structural features defined in the class c;
literals(e) returns the literals defined in the enumeration e;
owner(f) returns the owner class/enumeration of the structural feature/literal f; if f is a structural feature, then owner(f) ∈ CLS; if f is an enumeration literal, then owner(f) ∈ ENM;
src(f) and tgt(f) return the source and the target class of a reference f; note that owner(f) = src(f).
4.1. Matching Algorithm
The matcher takes two domain models m′ and m as input and produces a match consisting of mappings from m′ to m. Conceptually, match can be regarded as a partial injective function: for each element x in m′—which may be a class, attribute, reference, inheritance relationship, enumeration, or literal—the function yields either the matched element in m, or the symbol ⊥ if no corresponding element exists in m. We denote by match⁻¹(y) the element in m′ that matches a given element y in m (if such a match exists).
The overall framework of the matcher is heavily inspired by the state-of-the-art open-source model matcher EMF Compare (https://eclipse.dev/emfcompare/, accessed on 12 February 2026). However, EMF Compare generally relies on unique id values or edit distances to determine whether two elements are matched. Since an LLM may generate synonyms that are literally different from the expected name but semantically equivalent, EMF Compare cannot be applied straightforwardly. Instead, our matcher matches two models based on their semantic embedding vectors. The main matching algorithm is presented in Algorithm 1. Assume that S is a meta-variable ranging over {CLS, ATT, REF, INH, ENM, LIT}. Given x ∈ m′.S, the key idea of this algorithm is to use a similarity function sim to find the closest match y within m.S—x and y must be matchable and sim(x, y) must achieve the highest similarity.
| Algorithm 1 Algorithm of the semantic model matcher |
- 1: match ← ∅
- 2: for S in {CLS, ATT, REF, INH, ENM, LIT} do
- 3: for x in m′.S do
- 4: y ← findClosest(x, m.S)
- 5: if y ≠ ⊥ then
- 6: match[x] ← y
- 7: end if
- 8: end for
- 9: end for
- 10: return match
|
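The top-level loop of Algorithm 1 can be sketched in Python as follows. This is a simplified version: it omits Algorithm 2’s mutual double-check, and the element shapes, threshold values, and helper signatures are illustrative assumptions rather than the actual implementation.

```python
def match_models(m1, m2, matchable, sim, thresholds):
    """Greedy sketch of Algorithm 1. m1 and m2 map each component kind
    (CLS, ATT, ...) to a list of elements; matchable and sim are
    caller-supplied callbacks; thresholds holds the per-kind lower
    bounds on similarity."""
    match = {}
    taken = set()  # ids of elements of m2 already matched (injectivity)
    for kind in ("CLS", "ATT", "REF", "INH", "ENM", "LIT"):
        for x in m1.get(kind, []):
            best, best_sim = None, thresholds[kind]
            for y in m2.get(kind, []):
                if id(y) in taken or not matchable(x, y, match):
                    continue
                s = sim(x, y)
                if s > best_sim:  # keep the closest admissible candidate
                    best, best_sim = y, s
            if best is not None:
                match[x] = best
                taken.add(id(best))
    return match
```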
findClosest(x, Y), as defined in Algorithm 2, is responsible for matching x. First, it constructs a candidate set candY from every unmatched y in Y that is matchable to x and has a similarity to x exceeding the threshold θ_S. Second, it sorts candY in descending order of similarity, producing the list sortedY. It then examines the first element y in sortedY and verifies whether x is also the closest unmatched element for y. If this condition holds, y is returned as the closest match for x. Otherwise, the algorithm proceeds to check the next y in sortedY.
| Algorithm 2 Algorithm of findClosest(x, Y) |
- 1: candY ← {y ∈ Y | y is unmatched ∧ matchable(x, y) ∧ sim(x, y) > θ_S}
- 2: sort candY in descending order by sim(x, ·), yielding sortedY
- 3: for y in sortedY do
- 4: if dbchk then
- 5: x′ ← the closest unmatched element for y
- 6: if x′ = x then
- 7: return y
- 8: end if
- 9: else
- 10: return y
- 11: end if
- 12: end for
- 13: return ⊥
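The candidate filtering and mutual double-check of Algorithm 2 can be sketched as below; `closest_for` stands in for the reverse lookup (“the closest unmatched element for y”), and all names are illustrative.

```python
def find_closest(x, candidates, sim, theta, closest_for):
    """Sketch of Algorithm 2: keep candidates above the threshold,
    sort by similarity, and accept y only if x is in turn the closest
    element for y (the double-check)."""
    cand = [y for y in candidates if sim(x, y) > theta]
    cand.sort(key=lambda y: sim(x, y), reverse=True)
    for y in cand:
        if closest_for(y) == x:
            return y  # mutual closest pair found
    return None  # plays the role of ⊥
```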
The function matchable(x, y) checks whether x and y can be matched, as shown in Algorithm 3. For classes and enumerations, the function always returns true. For attributes and literals, the function returns true only when match(owner(x)) = owner(y) holds, that is to say, two attributes/literals can be matched only when their owner classes/enumerations are already matched. For two references, the function returns true only when the source and target classes of x and y are already matched, i.e., match(src(x)) = src(y) ∧ match(tgt(x)) = tgt(y), or match(src(x)) = tgt(y) ∧ match(tgt(x)) = src(y). For two inheritance relationships, say x = (c₁, p₁) and y = (c₂, p₂), the function returns true if and only if match(c₁) = c₂ ∧ match(p₁) = p₂ holds, that is to say, two inheritance relationships can be matched only when their child and parent classes are already matched.
| Algorithm 3 Algorithm of matchable(x, y) |
- 1: if x, y are classes or enumerations then
- 2: return true
- 3: else if x, y are attributes or literals then
- 4: return match(owner(x)) = owner(y)
- 5: else if x, y are references then
- 6: if match(src(x)) = src(y) ∧ match(tgt(x)) = tgt(y) then
- 7: return true
- 8: else if match(src(x)) = tgt(y) ∧ match(tgt(x)) = src(y) then
- 9: return true
- 10: else
- 11: return false
- 12: end if
- 13: else if x, y are inheritance relationships then
- 14: (c₁, p₁) ← x
- 15: (c₂, p₂) ← y
- 16: return match(c₁) = c₂ ∧ match(p₁) = p₂
- 17: else
- 18: return false
- 19: end if
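Algorithm 3 translates almost directly into code. In the sketch below, elements are dicts with a `kind` key and, for simplicity, owners and endpoints are class names, so the running match maps generated names to reference names; this flat schema is an assumption for illustration.

```python
def matchable(x, y, match):
    """Sketch of Algorithm 3 over dict-shaped elements."""
    if x["kind"] != y["kind"]:
        return False
    if x["kind"] in ("class", "enum"):
        return True
    if x["kind"] in ("attribute", "literal"):
        return match.get(x["owner"]) == y["owner"]
    if x["kind"] == "reference":
        same = (match.get(x["src"]) == y["src"]
                and match.get(x["tgt"]) == y["tgt"])
        flipped = (match.get(x["src"]) == y["tgt"]
                   and match.get(x["tgt"]) == y["src"])
        return same or flipped
    if x["kind"] == "inheritance":
        return (match.get(x["child"]) == y["child"]
                and match.get(x["parent"]) == y["parent"])
    return False
```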
The function sim(x, y) serves as the core of the matching process. Its algorithm is presented in Algorithm 4. First, it generates two strings from x and y using the serialization function str. Second, it computes the embedding vectors of the strings using an embedding function embed. In our implementation, we used a sentence transformer called MiniLM-L6-V2 [36] to compute the embedding vectors. Finally, the function returns the cosine similarity between the two vectors.
| Algorithm 4 Similarity function sim(x, y) |
- 1: s₁ ← str(x)
- 2: s₂ ← str(y)
- 3: v₁ ← embed(s₁)
- 4: v₂ ← embed(s₂)
- 5: return cos(v₁, v₂)
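The final step of Algorithm 4 is a plain cosine similarity; in practice the input vectors would come from a sentence transformer such as MiniLM-L6-V2, but the function itself is independent of the embedding model.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two embedding vectors,
    defined as dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```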
The behavior of the function str varies according to the type of the input value x:
If x is a class, then str(x) returns a string concatenation of the constant string "class", the name of x, and str(f) for every feature f in features(x).
If x is an attribute, then str(x) returns a string concatenation of the attribute type’s name and the attribute’s name.
If x is a reference, then str(x) returns a string concatenation of the source class’s name, the reference’s name, and the target class’s name.
If x is an enumeration, then str(x) returns a string concatenation of the constant string "enum", the name of x, and the name of every literal l in literals(x).
If x is an enumeration literal, then str(x) simply returns the literal name.
If x is an inheritance relationship, then str(x) returns the constant string "inheritance", because inheritance relationships are matched solely based on their child and parent classes, and their embedding similarity does not affect the matching process.
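The serialization rules above can be sketched as a single dispatch function; the dict-based element schema and field names here are illustrative, not the actual implementation.

```python
def to_string(x):
    """Sketch of the str serialization step; x is a dict with a 'kind'
    key and the fields used below (an assumed schema)."""
    kind = x["kind"]
    if kind == "class":
        feats = " ".join(to_string(f) for f in x["features"])
        return f"class {x['name']} {feats}".strip()
    if kind == "attribute":
        return f"{x['type']} {x['name']}"
    if kind == "reference":
        return f"{x['source']} {x['name']} {x['target']}"
    if kind == "enum":
        lits = " ".join(l["name"] for l in x["literals"])
        return f"enum {x['name']} {lits}".strip()
    if kind == "literal":
        return x["name"]
    # inheritance relationships are matched structurally only
    return "inheritance"
```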
4.2. Selection of Hyper-Parameters
The matcher uses six hyper-parameters θ_CLS, θ_ATT, θ_REF, θ_INH, θ_ENM, and θ_LIT as the lower bounds of similarity for matching classes, attributes, references, inheritance relationships, enumerations, and literals, respectively. To select appropriate values for these hyper-parameters, we conducted a pilot experiment with the help of the reference models in our problem set. Note that θ_INH is fixed, as inheritance matching depends only on class matching, not on embedding similarity.
We illustrate the threshold determination process using θ_CLS as an example. For each class c in the reference models, we first prompt an LLM to generate 20 semantically equivalent mutants of c by replacing the class name with synonyms and altering its structural features, and then compute the similarity sim(c, c′) for each mutant c′. Concurrently, we compute the similarity between c and mutants derived from other classes in the same model. The threshold is then chosen to optimally separate the higher similarity scores (from equivalent mutants) from the lower ones (from non-equivalent mutants). By repeating this process for attributes, references, enumerations, and literals, we determine their hyper-parameters.
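The separation step of this pilot procedure can be sketched as follows, assuming we already hold similarity scores for equivalent mutants (positives) and non-equivalent ones (negatives). Maximizing F1 over candidate cut points is one plausible reading of "optimally separate"; the actual criterion may differ.

```python
def best_threshold(pos_sims, neg_sims):
    """Pick the cut point that best separates positive similarity
    scores from negative ones, by maximizing F1 over all observed
    score values used as thresholds."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(pos_sims) | set(neg_sims)):
        tp = sum(1 for s in pos_sims if s > t)  # positives kept
        fp = sum(1 for s in neg_sims if s > t)  # negatives leaking through
        fn = len(pos_sims) - tp                 # positives filtered out
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```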
We performed the selection procedure on DeepSeek V3.2 and GLM-5. Figure 3 depicts how F1 scores vary across different threshold values. Based on Figure 3, we make two main observations: (1) The optimal threshold varies across tasks. For instance, the best thresholds for classes and enumerations lie between 0.6 and 0.7, whereas for attributes, the optimal threshold falls between 0.4 and 0.5. (2) For a given task, different LLMs exhibit similar performance trends, though their absolute results differ. For example, DeepSeek achieved the highest F1 score for classes when θ_CLS is set to 0.6, while for GLM-5, the optimal θ_CLS is around 0.65.
To select the optimal thresholds, we first analyzed the results from DeepSeek and GLM-5 shown in Figure 3. For a given task, we took the x-value at which the corresponding curve reached its highest point as the optimal threshold for that LLM. Then the smaller of the two optimal thresholds from the two LLMs was chosen as the value for the respective hyper-parameter. Since the matcher uses these thresholds to filter implausible matches, a lower value risks pairing semantically dissimilar elements; however, poorer-performing approaches generally benefit from lower thresholds.
Finally, we fixed θ_CLS, θ_ATT, θ_REF, θ_ENM, and θ_LIT to the values obtained by this procedure.
6. Results
We applied six approaches—MoRe, Simple, Iterative, MoRe-MC, MoRe-MR, and MoRe-RF—to generate domain models from the 30 domain requirements in our problem set. MoRe, Simple, and Iterative were tested on the four selected LLMs, while MoRe-MC, MoRe-MR, and MoRe-RF were tested on DeepSeek V3.2. For each generated model and its corresponding reference model, we computed a match using the semantic model matcher. Based on these matches, we derived an F1 score for each set of generated models. To ensure statistical reliability, the entire process was repeated five times, and the average results were calculated.
Appendix B presents two concrete examples of the model generation.
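The derivation of per-component F1 scores from a match can be sketched as below: precision counts how many generated elements found a partner, recall counts how many reference elements were covered. The flat model shape is illustrative.

```python
def f1_scores(match, generated, reference):
    """match maps generated elements to reference elements;
    generated/reference map each component kind to element lists."""
    scores = {}
    for kind in reference:
        gen, ref = generated.get(kind, []), reference[kind]
        hit = sum(1 for x in gen if x in match)  # matched generated elements
        prec = hit / len(gen) if gen else 0.0
        rec = hit / len(ref) if ref else 0.0
        scores[kind] = (2 * prec * rec / (prec + rec)) if prec + rec else 0.0
    return scores
```

For example, a generated model with three classes, two of which match a two-class reference model, yields precision 2/3, recall 1, and F1 = 0.8 for CLS.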
Table 4 presents the F1 scores for the three model generation approaches to be compared—Simple, Iterative, and MoRe—across six model components (CLS, ATT, REF, INH, ENM, LIT) using four different LLMs. The first column indicates the approach, while the second column specifies the LLM used for generation. The subsequent columns report the F1 scores for each model component, along with their standard deviations across five generations.
F1 scores vary across components and LLMs (see Table 4 for the exact ranges). The highest score for each component is achieved by MoRe with a specific LLM: CLS by DS-V3.2 and Q3C-30B; ATT by Q3C-480B and Q3C-30B; REF by Q3C-30B; INH by DS-V3.2; ENM by DS-V3.2 and Q3C-30B; and LIT by DS-V3.2 and Q3C-30B. Standard deviations are small in most cases, with the smallest occurring for ATT under Simple and the largest for INH under Simple with DS-R1-8B; broader intervals appear primarily for INH and ENM.
Table 5 reports the two-tailed p-values obtained from two significance testing methods (the Mann–Whitney U test and the permutation test), comparing MoRe with the baseline approaches across different evaluation metrics. The left panel shows results from the Mann–Whitney U test, while the right panel presents results from the permutation test (with 1,000,000 permutations), which provides a more robust alternative by empirically estimating the null distribution through random reassignment of group labels.
The p-values from comparisons between MoRe and the Iterative baseline are low in most cases, indicating statistical significance. Mann–Whitney U tests are significant for all LLMs and metrics except for DS-R1-8B on INH, ENM, and LIT. Permutation tests yield even smaller p-values elsewhere and the same non-significant results for that model.
Against the Simple baseline, most metrics also show significant differences, with the Qwen3 series significant in both tests across nearly all metrics. Non-significant results again appear for DS-R1-8B on CLS, ATT, and REF, with near-threshold values for ENM and LIT; the permutation test generally produces slightly smaller p-values.
Table 6 presents the results of an ablation study conducted using DS-V3.2 and Q3C-480B. It is organized with “LLM” and “Approach” as row variables and the six evaluation metrics (CLS, ATT, REF, INH, ENM, LIT) as column variables. Four approaches are compared: MoRe and three ablated variants, denoted MoRe-MC, MoRe-MR, and MoRe-RF.
For DS-V3.2, MoRe leads on CLS (0.828), INH (0.708), ENM (0.768), and LIT (0.756). The MC variant tops ATT (0.702). The RF variant tops REF (0.554). The MR variant is the lowest, for example scoring 0.78 on CLS. For Q3C-480B, MoRe leads on CLS (0.822), ATT (0.75), and REF (0.564). The MC variant tops ENM (0.768) and LIT (0.746). The MR variant tops INH (0.71). This model shows higher ATT scores overall. Across LLMs, MoRe delivers top or near-top scores on most metrics.
We performed a significance test to compare MoRe with its ablation variants; the results show that, in most cases, the differences between the variants and MoRe were not significant. Since the F1 score is the harmonic mean of precision and recall, this averaging may have masked the significance of the underlying differences. To analyze the impact of the variants in depth, we conducted separate significance tests for precision and recall; the results are shown in Table 7. Each cell’s value is the p-value from a Mann–Whitney U test, indicating the significance of the difference between a variant and MoRe for that category. Significant cells are additionally colored to indicate the direction of the effect: red cells indicate a decrease (i.e., the variant’s result is lower than MoRe’s), and blue cells indicate an increase.
For DS-V3.2, MoRe-MR and MoRe-RF significantly decrease precision across most categories: MoRe-MR shows in 5/6 metrics, MoRe-RF in 3/6 with borderline decreases () in the remaining three. MoRe-MC shows only INH precision decreasing significantly (). Recall is largely unaffected, with only MoRe-MR showing a borderline CLS decrease ().
For Q3C-480B, MoRe-MC yields significant precision increases for CLS () and INH (), but significant recall decreases for CLS () and REF (). MoRe-MR shows significant precision decreases for ATT, ENM, LIT () and borderline REF decrease (), with significant LIT recall increase (). MoRe-RF causes significant precision decrease only for ATT (), but significant recall decreases for CLS () and borderline ATT decrease ().
7. Answers to Research Questions
According to
Table 4,
MoRe consistently achieves the highest or near-highest F1 scores across most categories (CLS, ATT, REF, INH, ENM, LIT) for the three most capable LLMs (DS-V3.2, Q3C-480B, and Q3C-30B). The significance test results in
Table 5 also confirm that
MoRe substantially improves the quality of the generated models, particularly improving the generation of classes, attributes, and associations/compositions, compared to the baseline methods for most of the LLMs we tested.
MoRe outperforms the Simple approach by introducing structured refinement steps. While the Simple approach generates a model in one shot, it is prone to omissions, inconsistent relationships, and superficial understanding. MoRe's multi-stage process—generate, critique, and refine—allows it to identify and correct such errors post hoc.
The stark superiority of MoRe over the Iterative baseline empirically demonstrates the vulnerability of an unstructured, naive multi-step generation process. The Iterative approach, which constructs a domain model incrementally, suffers from conceptual drift, error propagation, and hallucination. In contrast, MoRe provides a controlled refinement loop with a specific, error-focused critique, preventing aimless drift and anchoring improvements to the initial valid structure.
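Such a controlled generate-critique-refine loop can be sketched as follows. This is a minimal illustration with the LLM calls replaced by deterministic stubs; the function names, the duplicate-class check, and the stubbed model are illustrative assumptions, not MoRe's actual prompts or rules:

```python
def generate(requirement):
    # Stub for the initial LLM generation step; here it returns a
    # draft model that contains a duplicated class on purpose.
    return {"classes": ["Order", "Order", "Customer"]}

def critique(model):
    # Stub for the critique step: return a list of concrete,
    # error-focused issues (an empty list means the model is clean).
    issues = []
    if len(model["classes"]) != len(set(model["classes"])):
        issues.append("duplicate class")
    return issues

def refine(model, issues):
    # Stub for the refinement step: apply targeted fixes to the
    # existing structure rather than regenerating from scratch.
    if "duplicate class" in issues:
        model = {"classes": sorted(set(model["classes"]))}
    return model

def self_refine(requirement, max_rounds=3):
    model = generate(requirement)
    for _ in range(max_rounds):
        issues = critique(model)
        if not issues:  # stop as soon as the critique finds nothing
            break
        model = refine(model, issues)
    return model
```

The key design point is that each round is anchored to the previous model and driven by an explicit issue list, which is what distinguishes this loop from unconstrained iterative regeneration.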
Notably, the Iterative approach suffers catastrophic performance drops on DS-R1-8B and Q3C-30B. While MoRe also drops on DS-R1-8B, its performance with Q3C-30B remains comparable to that of larger models. This indicates that MoRe is more robust than the Iterative approach when using mid-sized LLMs. One possible explanation is that MoRe uses clearer, shorter prompts, making them easier for mid-sized LLMs to follow. Its rule-based fixes also reduce its reliance on the capabilities of the LLM.
While MoRe excels at fixing redundancy and inconsistencies, it may be less effective if the initial generation is fundamentally flawed or misunderstands the core domain, as refinement operates on an existing structure.
Table 4 further highlights that
MoRe’s performance varies considerably across different LLMs. The most pronounced improvements over Simple are observed with Q3C-30B, which shows substantial gains across nearly all evaluation categories. In contrast, both DS-V3.2 and Q3C-480B exhibit more moderate, yet consistent, performance improvements.
A possible explanation for these results is that Q3C-30B occupies a unique performance sweet spot: it is sufficiently large to comprehend the abstractions and instructions required for domain modeling, yet small enough to benefit disproportionately from MoRe's step-by-step scaffolding. While larger models (DS-V3.2 and Q3C-480B) also improve consistently under MoRe—underscoring the general utility of the self-refinement strategy—their gains are more modest, which implies that they have already internalized many of the relevant reasoning patterns. Conversely, the smallest model appears overwhelmed by the decomposition, resulting in performance degradation.
The smaller DS-R1-8B model yields mixed outcomes, with notable degradation on several structural components (INH, ENM, LIT). This suggests that MoRe, which involves complex reasoning and long, intricate prompts, requires a minimum level of LLM capability. Specifically, the LLM must be able to (1) process and follow long, multi-constraint instructions, and (2) generate outputs with precise syntactic adherence (e.g., valid PlantUML code). The execution logs reveal that DS-R1-8B frequently generated syntactically invalid PlantUML and failed to follow our instructions.
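The syntactic-adherence requirement can be probed cheaply before committing to a model. The following is an illustrative sketch of such a check, assuming only the @startuml/@enduml envelope and brace balance need verifying; it is not the validator used in our pipeline, and a real check would parse the diagram fully:

```python
def looks_like_valid_plantuml(text):
    # Minimal well-formedness check for a PlantUML class diagram:
    # the @startuml/@enduml envelope must be present and every
    # opening brace must have a matching closing brace.
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "@startuml" or lines[-1] != "@enduml":
        return False
    return text.count("{") == text.count("}")

# A well-formed diagram passes the check...
good = "@startuml\nclass Order {\n  +total : float\n}\n@enduml"
# ...while output missing the envelope and a closing brace fails,
# the kind of error we observed in DS-R1-8B's logs.
bad = "class Order {\n  +total : float\n@enduml"
```

Running a batch of sample generations through a check like this gives a quick, model-agnostic signal of whether a candidate LLM meets the syntax-adherence bar.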
For AI-assisted domain modeling, selecting an LLM based solely on generic benchmark scores is inadequate. Practitioners should instead evaluate candidate LLMs in conjunction with the intended approaches (e.g., MoRe) to assess the task performance. A medium-sized LLM like Q3C-30B, when paired with MoRe, may surpass larger models used with simpler prompts, offering a favorable balance of cost and accuracy for many projects. Even when access to very large language models is limited—for instance, in environments with restricted external connectivity—high-quality domain modeling remains achievable by applying structured, decompositional prompting to capable mid-sized LLMs.
Based on
Table 6, removing any component generally degrades the overall performance of
MoRe. Their integration enables
MoRe to achieve stable, well-rounded performance across different backbone LLMs.
MoRe consistently achieves top or near-top performance across most metrics for both models, and the ablation variants outperform it only on individual metrics.
However, as shown in
Table 7, each component affects performance in distinct ways and to varying degrees. Overall, the components in
MoRe contribute to a notable improvement in precision, albeit with a slight trade-off in recall. This outcome is expected, as the refinements introduced in
MoRe are primarily designed to eliminate errors, redundancies, and inconsistencies in modeling, thereby enhancing precision. Among all components,
model reduction exerts the most substantial impact on overall performance—particularly on precision—followed by
rule-based fixing. Moreover, the effects of individual components vary across different LLMs. For instance, with DS-V3.2, all components enhance precision; however, for Q3C-480B,
model correction reduces precision in CLS and INH while improving recall in CLS and REF. These variations reflect differences in architecture and capability across LLMs and further underscore that it is the integration of all components that endows
MoRe with robust performance.
These results confirm that the effectiveness of MoRe stems from the synergistic integration of complementary components. This aligns with the inherent nature of object-oriented modeling, where accurate domain modeling requires balancing consistency, conciseness, extensibility, and flexibility. Each component addresses a distinct dimension of this challenge: model reduction captures design conciseness, model correction enforces structural and hierarchical validity, and rule-based fixing refines local logical conformance. Their combination allows MoRe to manage these trade-offs dynamically instead of over-optimizing for one criterion.
The ablation patterns further reveal that MoRe supports task-aware adaptation based on modeling priorities. Variants without certain components can excel on specific metrics such as ATT, REF, ENM, LIT, or INH, depending on the backbone LLM. For instance, a user working with DeepSeek V3.2 who wants to maximize the quality of reference generation may try MoRe-RF. This suggests that MoRe can be deployed in its full configuration for balanced, robust performance, or adjusted to emphasize particular modeling properties when domain requirements warrant.