Article

A Hybrid Knowledge Extraction Method to Support Early Concurrent Engineering in the Aerospace Industry †

1 CRAN, CNRS, Université de Lorraine, F-54000 Nancy, France
2 Airbus SAS, F-31700 Blagnac, France
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in PLM 2025.
Aerospace 2026, 13(4), 337; https://doi.org/10.3390/aerospace13040337
Submission received: 18 February 2026 / Revised: 27 March 2026 / Accepted: 30 March 2026 / Published: 3 April 2026

Abstract

In the early stages of concurrent engineering, the ability to assess design change impact is fundamentally limited by the availability of expert knowledge. Knowledge-Based Engineering (KBE) provides structured approaches for the capture, formalization, management, and diffusion of knowledge within complex organizations. KBE has increasingly turned toward ontology-based methodologies, leveraging their robust framework for shared conceptualization and their reasoning capabilities. Integrated with Model-Based Systems Engineering (MBSE), such Ontology-Based Engineering (OBE) methodologies provide the necessary infrastructure for knowledge-driven workflows in a Digital Engineering (DE) context. Such integration is critical for complex engineering sectors such as the aerospace industry. However, the traditional knowledge acquisition process is expert-centric and, consequently, resource-intensive. Meanwhile, the digital transformation of the industry has led to an explosion of data volumes, drawing increasing attention to statistical approaches. This study implements a hybrid knowledge acquisition method within the OBE framework and an MBSE environment. Specifically, this method combines human expertise and interpretable machine learning techniques to formalize knowledge models and instantiate them with concrete design rules. Applied to a real-world use case involving workload estimation, this paper aims to enhance cross-domain collaboration during the conceptual design phase of new aircraft.

1. Introduction

Digital Engineering (DE) is transforming complex engineering sectors such as the aerospace industry. The advent of DE has extended traditional Systems Engineering (SE) practices with enhanced modeling and simulation capabilities across the product lifecycle [1,2]. Such transformations provide compelling opportunities to reduce the time and costs of commercial aircraft development while ensuring better quality of the delivered products. To this end, one key endeavor for aircraft development is the consideration of all lifecycle issues during conception in order to avoid heavy and costly design changes [3].
Addressing this concern, Concurrent Engineering (CE) is defined as a “systematic approach to the integrated, concurrent design of products and their related processes, including manufacturing and support” [4]. The modeling and simulation capabilities of DE allow linking the multiple lifecycle phases through digital artifacts. This connectivity enables augmented, multidisciplinary analysis and cross-domain impact assessments, both of which are essential for CE. Thus, organizations must develop a systematic approach to integrate DE into their existing processes and systems to allow for a seamless flow of data, information, knowledge, and expertise [5].
One key aspect of DE is Model-Based Systems Engineering (MBSE), which leverages digital representations for systems requirement management, design, verification, and validation [6]. MBSE drives the transformation of engineering organizations into collaborative environments. Consequently, it serves as a cornerstone for CE, particularly during early stages such as conceptual design [7].
However, during the conceptual design phase, the effective implementation of CE remains highly dependent on domain experts and empirical knowledge to evaluate the multi-disciplinary impacts of design decisions [8]. Therefore, the ability to elicit, manage, and reuse engineering knowledge is also key to meet the challenges of an increasingly competitive industry.
Knowledge-Based Engineering (KBE) addresses the challenges associated with the capture, formalization, management, and diffusion of knowledge within complex organizations [9]. Nevertheless, problems of interoperability and adoption have hindered its development [10].
The democratization of ontologies in recent years has introduced Ontology-Based Engineering (OBE) as an increasingly important branch of KBE. OBE emerges as a solution to the conventional issues associated with expert systems, offering a shared model-based knowledge representation and reasoning capabilities [11].
The increasing adoption of digital representations to capture and transmit product engineering knowledge is also visible in other domains of engineering practice. Recent studies have explored the digital reconstruction of legacy manufacturing systems to preserve technical knowledge and support analysis in digital environments, such as the virtual modeling of traditional machining tools [12].
Nevertheless, KBE and OBE remain resource-intensive, notably due to traditional expert-centric knowledge acquisition processes. The explosion of industrial data volume brought by DE has raised new questions about more systematic knowledge capitalization strategies [10].
In the current era of Digital Engineering transformation, MBSE serves as the cornerstone of complex system design. However, in the context of early Concurrent Engineering, incorporating knowledge that supports design change impact analysis is crucial. While Ontology-Based Engineering provides compelling capabilities to integrate MBSE and KBE, the acquisition of actionable knowledge from large-scale industrial datasets remains a persistent challenge.
This paper implements a hybrid knowledge acquisition method in a real-world industrial use case that necessitates augmented workload estimation capabilities. Framed within an OBE methodology, the proposed approach leverages ontological models formalized in an MBSE environment to guide interpretable machine learning studies. By systematically acquiring and integrating knowledge, this method intends to complement expert-centric knowledge acquisition approaches to support CE processes.
The remainder of this paper is structured as follows: Section 2 provides an overview of the related work addressing Digital Engineering, Knowledge-Based Engineering, and knowledge acquisition approaches. Section 3 details the method employed to acquire knowledge supporting early Concurrent Engineering. The results of a preliminary implementation of the aforementioned method are presented in Section 4. Finally, discussions are held in Section 5, while Section 6 concludes this paper.

2. Related Work

2.1. Digital Engineering

Digital Engineering is transforming the industry with new engineering processes fostering the elaboration and reusability of digital models. In opposition to the traditional document-centric, paper-based working methods, digital models can be easily shared and edited, supporting collaborative engineering processes and enhancing information cohesiveness.
As DE is a broad concept encompassing diverse fields and domains [5], it remains an ongoing transformation with significant endeavors remaining. Incremental maturity degrees for DE transformation were defined by McDermott et al. [6] as: data integration, semantic integration, augmented engineering, and finally, fully implemented DE. The authors highlight that the integration of digital models with technological innovations such as Semantic Web Technologies (SWT), ontologies, and machine learning provides a framework for enhanced decision-making. These envisioned maturity increments follow a classical path: first integrate the involved data, then provide meaning and context, and finally, integrate high-end knowledge representation, reasoning, and simulation.
Model-Based Systems Engineering is a basis for DE and Concurrent Engineering as it supports the modeling of requirements and system architectures, as well as the automatic cascading of design changes for the systematic exploration of design solutions [13]. In the aerospace industry, Lee et al. [14] highlighted the need to quickly assess and visualize the impact that changes in product and manufacturing design have on aircraft performance, production rate, backlog, and profitability.
Such assessments and impact visualizations are facilitated by the information traceability of MBSE. Nevertheless, these mechanisms rely on the integration of additional knowledge in early design stages.

2.2. Knowledge-Based Engineering

As stated in Section 1, the conceptual design phase is characterized by a high degree of abstraction with limited knowledge about the system of interest. Therefore, the capture, modeling, and reuse of knowledge that supports design decisions is key to supporting early Concurrent Engineering [8].
Knowledge-Based Engineering is defined as “the implementation of knowledge management methods and instruments, which support computational systems that make organizational knowledge the centerpiece of engineering design” [15]. KBE promotes the acquisition and reuse of knowledge facilitating cross-functional collaboration by centralizing system information and design rules.
Consequently, KBE systems support decision-making through the effective propagation of design modifications across components, systems, and domains [16]. Bruggeman et al. [8] proposed a generic model to capture and organize knowledge for automatic manufacturing consideration during the conceptual design of aircraft structures. This model has helped identify trends and rank manufacturing concepts, supporting trade-off decisions. Adopting a comparable approach, ontologies and semantic technologies were employed to identify inconsistencies between product and manufacturing system design while ensuring information interoperability [17].
Dunbar et al. [18] leverage a digital thread framework and a reasoning layer to infer knowledge based on heterogeneous data sources. Through descriptive logic and reasoners, the perspective of engineers can be automatically enriched with new relationships and insights. Nevertheless, the authors mention a high-effort cost to capture the initial knowledge and establish the cross-domain links.
Although exhibiting encouraging capabilities, these approaches remain locally specialized and frequently lack reusability and adoption. The high costs of implementation and maintenance, coupled with a lack of robust knowledge management and formalization methods, were identified as issues impeding the development of generic KBE approaches [10]. Some researchers state that interoperability problems in the industry remain unresolved because solutions are developed around local needs, failing to consider a wider scope of application [19]. The authors identify the need for a consistent use of ontologies for data access and reasoning across the product lifecycle. Concurring with the previous statement, Sun et al. [20] also identify ontologies as a means to acquire, manage, and transfer tacit knowledge for product design.

2.3. Ontology-Based Engineering

The increasing trend of KBE approaches employing ontologies for a unified knowledge representation has introduced Ontology-Based Engineering [11]. OBE leverages explicit specifications and structured conceptualizations for improved knowledge representation and reasoning. OBE enhances complex system design by allowing domains to be modeled via axiomatic definitions and taxonomic structures, thus embedding enriched knowledge within the engineering process. Complex system design activities are supported by OBE methodologies that use ontological models to enhance collaboration through knowledge capture, management, and sharing [21].
Through ontologies, Curran et al. [22] proposed a methodology for the early and continuous use of multidisciplinary knowledge leading to more accurate estimation of manufacturing consequences. Mas et al. [23] proposed an OBE methodology named Models for Manufacturing (MfM) that leverages an agnostic approach for knowledge capture and reuse to support complex manufacturing system design. This methodology is based on MOKA (Methodology for Knowledge-based engineering Applications) [24] and is employed at the core of MBSE environments in numerous industrial applications [25,26,27].
For example, MfM has been applied to establish a semantic-driven tradespace framework by converging ontology engineering and MBSE to optimize the digital continuity and interoperability in aircraft manufacturing system design [26]. In this context, the ontology serves as a semantic core to integrate the modeling, simulation and requirement management activities. The authors state that the model-based knowledge representation and formalization of ontologies allow capturing domain knowledge in a persisting way that is compatible with the MBSE framework and the perspective of human common sense.
In a similar context, Arista et al. [27] incorporate experts’ knowledge to support manufacturing system design via semantic integration and ontology reasoning. In this paper, the authors support manufacturing system design tradespace by employing MBSE to structure and manage ontologies. The developed application ontology enhances interoperability and information enrichment, re-using captured knowledge for complex system design.
However, Mas et al. [21] consider that OBE systems still hold some limitations, in particular regarding the use of various knowledge sources (e.g., domain experts, existing documentation, industrial databases). In this paper, the authors identify new technologies such as Big Data (BD) and Artificial Intelligence (AI) as important parts for future knowledge acquisition approaches.

2.4. Knowledge Acquisition for Product Design

Knowledge acquisition is a key enabler of Knowledge Management (KM). Acting as a primary process, it is indispensable to the further activities of formalization, structure, and redistribution [10].
The literature employs various terms to describe knowledge acquisition. A primary distinction exists between human-centric methods (such as knowledge capture, elicitation, and formalization), and data-centric methods, which focus on knowledge extraction from technical data. Knowledge acquisition serves as an umbrella term encompassing both approaches.
Traditionally, knowledge acquisition has long considered human experts as its main source. Nevertheless, the explosion of data quantity induced by the digital transformation of the industry raises new questions about extracting knowledge from large databases.
Indeed, knowledge elicitation from domain experts can be tedious due to preparation processes, interactions, knowledge curation, and re-formalization. The industry’s digital transformation, in conjunction with advancements in AI, has made industrial databases an increasingly compelling source of knowledge.
In the context of PLM and the data-driven manufacturing era, the applications of interpretable machine learning algorithms to extract valuable insights across lifecycles are increasing continuously [28,29].
Regression analysis has been used to estimate the cost of aerospace components in the early conception stage to offer a better cost understanding [30]. Furthermore, Tang et al. [31] have employed surrogate models to predict aircraft performance as well as supply chain outputs. In this paper, the authors combined Design of Experiments (DoE) and sensitivity analysis to extract key cross-domain drivers supporting decision-making in early design phases.
Taking a different approach, Random Forest (RF) models have been leveraged to support the design space exploration of Turbine Rear Structures [32]. A highlight of this study is the extraction of design parameters’ importance and if-then rules. Following a similar approach, Decision Tree (DT) models were employed to extract production rules [33]. In this paper, the authors identify the need for a bridge between manual knowledge curation and automatic data-driven knowledge generation. Recently, Li et al. [34] have proposed a scalable knowledge capture approach blending human expertise with computational methods to improve knowledge management practices across various industries.
A survey of hybrid expert systems between 1988 and 2010 unveiled hybrid knowledge systems based on fuzzy neural networks [35]. Nevertheless, most of these hybrid expert systems are hybrid in their utilization of knowledge rather than in their method of acquisition. In practice, most hybrid frameworks maintain a rigid dichotomy, conducting expert-centric elicitation and data-driven extraction as independent, parallel processes.
The hybrid notion within the context of data-driven acquisition remains a nuanced topic. Given that all data-driven extraction studies are inherently designed and oriented by humans, the primary concern is not whether expert influence exists, but rather its degree of application.

2.5. Gap Analysis

The digital transformation of the industry has incurred significant opportunities to support the early Concurrent Engineering of complex products and their manufacturing systems. While Model-Based Systems Engineering provides a strong basis for data integration, engineers still lack perspectives toward impact assessment in the early design stage to conduct trade-off analysis.
Knowledge-Based Engineering provides structured approaches for the capture, formalization, management, and diffusion of knowledge supporting engineering activities. The limitations of KBE in terms of deployment, adoption, and interoperability are addressed by the emergence of Ontology-Based Engineering, which provides explicit specifications and structured conceptualizations for improved knowledge representation and reasoning.
OBE methodologies such as MfM can be employed at the core of the MBSE environment. This integration serves as the unification of MBSE and KBE. Nevertheless, systematically extracting knowledge from the large-scale datasets generated by Digital Engineering, and integrating these insights into OBE frameworks, remains a persistent challenge.
The interpretability of Random Forest and Decision Tree algorithms allows complex data interactions to be translated into actionable rule-based logic. The semantic flexibility of MBSE and OBE frameworks then allows these rules to be formalized and executed, providing robust decision support across the engineering lifecycle.
Therefore, there is a need to develop a structured method leveraging data-driven techniques that complements the traditional expert-centric knowledge acquisition processes of MBSE and OBE.

3. Hybrid Knowledge Acquisition Method

3.1. Method Overview

A previous contribution has leveraged MfM to support early Concurrent Engineering [36]. In this paper, several ontological models are described in an MBSE environment as an agnostic approach to manage knowledge and interfaces between tools. MfM is composed of four ontological models: Scope, Data, Semantic, and Behavior models [23]. The elaboration of ontological models necessitates a significant knowledge acquisition process. MfM’s methodology guidelines recommend modeling sessions with domain experts in order to formalize this knowledge.
MfM has had numerous industrial applications employing SysML models for ontology development [21,27]. This easily maintainable, collaborative approach intends to facilitate the acquisition of complex knowledge from different perspectives. The most difficult knowledge to formalize is arguably that described by the Behavior model. Such knowledge often remains buried in experts’ minds as tacit know-how that is difficult to formalize.
This paper is an extended version of our paper published in PLM 2025 [37]. This work leverages the hybrid knowledge acquisition method illustrated in Figure 1, to support the elaboration of MfM’s Behavior model through a combination of expert-centric formalization and data-driven approaches. The objectives of this method are twofold: first, to strengthen interconnections between aircraft and manufacturing domains by formalizing cross-domain knowledge patterns, and second, to exploit these patterns as the foundation of machine learning studies.

3.2. Expert-Centric Formalization

The first part of this method consists of the formalization of cross-domain patterns identifying key concepts and attributes involved in a specific interaction between product and manufacturing engineering. Semi-structured modeling sessions are conducted with domain experts, according to the MfM methodology. This formalization is carried out through a 5-step protocol: contextualization, enumeration, extension, selection, and verification [36].
The objectives of the contextualization step are to present the scope for the knowledge acquisition study to the domain experts. This helps to frame the modeling sessions with an adapted degree of granularity. The second step is the enumeration of all cross-domain knowledge patterns considered important in the defined scope. These patterns describe categories of cross-domain interactions, and therefore have a high degree of abstraction. Then, the most promising patterns are extended with concrete concepts and attributes. This detailed mapping of design impacts between product and manufacturing engineering serves as a cross-domain bridge within the MBSE environment, ensuring information traceability and highlighting specific levers for design optimization. Finally, according to the needs of the project and the availability of the information involved, a knowledge pattern is selected to drive interpretable machine learning studies.
This manual and iterative process exploits the experience and intuition of domain experts. Conducted in a MBSE environment, this knowledge formalization augments the perspectives of product and manufacturing architects and supports impact assessment analysis.

3.3. Data-Driven Extraction

The second part of the method involves the data-driven extraction of engineering rules that describe a specific cross-domain knowledge pattern. To achieve this, interpretable machine learning techniques are applied to large-scale industrial databases.
In order to achieve a proper engineering rule extraction, an iterative process of data exploration, data preparation, pre-processing, model training, model evaluation, and, finally, rules extraction activities is conducted.
The data exploration gathers all databases containing the identified attributes of the selected cross-domain knowledge pattern. From these heterogeneous sources, a data preparation activity is conducted in order to structure all the required information into a single, integrated dataset. Then, traditional pre-processing techniques are applied to clean the data and prepare it for the training process. Afterward, interpretable machine learning models are trained on the pre-processed input dataset. Interpretable models such as Decision Tree or Random Forest models are privileged for their rule extraction capability [33,38]. Subsequently, the trained models are evaluated against metric thresholds adapted to the selected knowledge.
By default, this approach exploits the Random Forest model. This choice is based on the Stable and Interpretable RUle Set (SIRUS) algorithm [38] that uses the internal structure of RF models to extract a concise set of rules imitating the model’s behavior, which can then be integrated into MBSE and ontology models.
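To make this concrete, the rule extraction step can be sketched with scikit-learn: a shallow Random Forest is trained, and its most frequent root splits are counted across trees. This is only a simplified illustration of the frequency-counting intuition behind SIRUS, not the actual algorithm, and the toy dataset and feature names (`weight`, `volume`) are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def extract_frequent_rules(X, y, feature_names, n_trees=100, top_k=5):
    """Train a shallow Random Forest and count its most frequent root
    splits across trees -- a simplified sketch of the SIRUS idea."""
    rf = RandomForestRegressor(n_estimators=n_trees, max_depth=2, random_state=0)
    rf.fit(X, y)
    counts = Counter()
    for est in rf.estimators_:
        tree = est.tree_
        if tree.feature[0] >= 0:  # negative values mark leaf nodes
            key = (feature_names[tree.feature[0]],
                   round(float(tree.threshold[0]), 1))
            counts[key] += 1
    # Each rule is reported with the fraction of trees that contain it.
    return [(f"{feat} <= {thr}", n / n_trees)
            for (feat, thr), n in counts.most_common(top_k)]

# Hypothetical toy data: workload is driven mainly by weight.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))              # columns: weight, volume
y = 100.0 * (X[:, 0] > 5) + rng.normal(0, 1, 500)  # step effect of weight
rules = extract_frequent_rules(X, y, ["weight", "volume"])
```

In the actual SIRUS algorithm, paths of depth up to two are extracted and post-processed into a stable, minimal rule set over quantized features; the sketch above only captures the idea of keeping the splits that recur most often across the forest.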

4. Workload Estimation Knowledge Acquisition

Through an industrial collaboration with Airbus SAS, this experiment was conducted as part of an innovative project focused on developing a new hydrogen fuel tank. The Concurrent Engineering approach required short iteration loops between product and manufacturing architects. According to our industrial partners, the principal bottleneck to shorter iteration loops was the workload estimation, which is an activity relying strongly on empirical experience.
The objective of this study is to apply the hybrid knowledge acquisition method in order to elicit design rules for workload estimation. Please note that any technical data in the remainder of this section has been modified for confidentiality purposes, as it involves sensitive information from an internal project. Nevertheless, these modifications have been made with the intention of preserving coherence. The authors emphasize that the focus here is on the general approach rather than specific technical outcomes.

4.1. Expert-Centric Formalization

A series of modeling sessions was conducted with diverse domain experts, including manufacturing preparators, industrial system experts, material experts and industrial architects. This iterative expert-centric formalization yielded significant results by identifying concrete cross-domain parameters (e.g., weight, volume, material, and process category) from the product and manufacturing domains that play a significant role in workload estimation. The resulting cross-domain pattern associated with workload estimation is illustrated in Figure 2.
It is important to note that the identified parameters were not unanimously considered essential by all experts for workload estimation. For instance, the involvement of the material parameter was not unanimous among the experts. A decision was made to initially retain this parameter, with the option to remove it later if its inclusion proves to be a source of complications.
This identification process served to formalize the concepts and attributes essential to workload estimation. Consequently, this model brings value by augmenting cross-domain links between the MBSE architectures and their involved attributes. By formalizing the impact relationship between systems, the information traceability capability of MBSE environments has the potential to facilitate the identification of effective levers supporting early co-architecture activities.
Further, the second phase of the hybrid knowledge acquisition method aims to extract executable rules instantiating this pattern from historical records. During the final step of the human-centric formalization, the experts directed us toward relevant data sources. Specifically, most information about product engineering was accessible in a particular legacy database, while the manufacturing engineering information was maintained within the company’s Enterprise Resource Planning (ERP) tool.

4.2. Data-Driven Extraction

Guided by the established workload estimation pattern (Figure 2), the data-driven extraction phase was initiated. This approach intends to leverage the robustness and scalability of interpretable machine learning models applied on large scale datasets. More specifically, it aims to extract explicit rules of tacit knowledge through interpretable machine learning models such as Random Forest [38].

4.2.1. Data Exploration - E1

A thorough exploration of a substantial part of the organization’s data sources found traces of the parameters identified by the workload estimation pattern. Based on the experts’ guidance during the previous modeling sessions, we prioritized data from recent programs, as their recency supports greater data reliability.
  • Product engineering information: Through Configuration Management (CM) databases and digitalized drawings’ meta-data, the targeted information was found at a Design Solution granularity level.
  • Manufacturing engineering information: The absence of an explicit workload parameter necessitated finding a variable tracking the process time. Within the ERP, concepts such as Routings and Work Orders were identified with such information.
    A Routing represents a generic list of operations defining the overall product build process.
    A Work Order is a specific instantiation of a Routing for an individual aircraft.
Although the Work Order database contained a significantly larger number of instances, consultation with domain experts strongly advised against using it due to unreliable processing time information. Indeed, the timestamps of Work Orders can significantly differ from reality due to the way operators register this information.
Consequently, we selected the Routing dataset, accepting the trade-off of a smaller sample size (approximately 50,000 instances) in exchange for the assurance that the time information was measured using the rigorous and reliable MTM-UAS (Methods-Time Measurement Universal Analyzing System) standard. This decision therefore prioritizes data quality and measurement rigor over considerable data volume.

4.2.2. Data Preparation - E2

The subsequent challenge was the integration of the identified sources of information. The datasets were stored in heterogeneous formats with no explicit link between each other (e.g., product engineering data from the legacy database and manufacturing engineering data from the ERP system).
To create a unified and meaningful input for the machine learning training, significant preparation activities are required:
  • Transformation: Converting the disparate data structures and formats into a common, standard representation (e.g., unit homogenization between kg and g, or hours and minutes).
  • Aggregation: Summarizing or combining related data points to the appropriate level of detail required for workload estimation (e.g., summing the measured process time of all operations in a Routing).
  • Filtering: Removing data that falls outside the scope of the model’s training (e.g., keeping only Routings related to the recent programs).
  • Joining: Establishing synthetic or derived linkages between the datasets (e.g., matching aircraft IDs in the Routing data with corresponding attributes in the aircraft drawing data).
These collective transformation, aggregation, filtering, and joining activities are essential to achieve a homogeneous and meaningful input dataset comprising all the targeted parameters for workload estimation.
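The four activities above can be sketched with pandas as follows. All table and column names (`routing_id`, `weight_g`, etc.) are illustrative placeholders, not the actual Airbus schemas:

```python
import pandas as pd

# Hypothetical extracts from the ERP and the legacy drawing database.
routings = pd.DataFrame({
    "routing_id":  ["R1", "R1", "R2", "R3"],
    "aircraft_id": ["A1", "A1", "A2", "A9"],
    "program":     ["recent", "recent", "recent", "legacy"],
    "op_time_min": [30.0, 45.0, 120.0, 60.0],
})
drawings = pd.DataFrame({
    "aircraft_id": ["A1", "A2"],
    "weight_g":    [12000.0, 34000.0],  # stored in grams in the legacy source
})

# Transformation: homogenize units (g -> kg, minutes -> hours).
drawings["weight_kg"] = drawings["weight_g"] / 1000.0
routings["op_time_h"] = routings["op_time_min"] / 60.0

# Filtering: keep only Routings from recent programs.
recent = routings[routings["program"] == "recent"]

# Aggregation: total process time per Routing.
workload = (recent.groupby(["routing_id", "aircraft_id"], as_index=False)
                  ["op_time_h"].sum())

# Joining: attach product attributes through the aircraft ID.
dataset = workload.merge(drawings[["aircraft_id", "weight_kg"]], on="aircraft_id")
```

The resulting `dataset` holds one row per Routing with both the manufacturing target (total process time) and the product features, which is the shape required for the subsequent training step.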

4.2.3. Pre-Processing - E3, Loop 1

From an initial input dataset comprising 21,846 data points, standard pre-processing techniques are applied to reduce biases and facilitate the machine learning training [39].
  • Data Cleaning:
Initial checks are performed to identify and address any incorrect or missing data. This involves scrutinizing logical inconsistencies:
  • Missing values: A significant number of lines in the input dataset can have at least one missing value. Dealing with missing values can be achieved either by removing the whole line or by filling the missing data point with a synthetic value. This choice depends on the importance of the feature involved. Less critical features can be filled synthetically to preserve the information contained in the rest of the line while others might justify a simple line removal.
  • Incorrect values: Data points with logically impossible characteristics, such as negative values for parameters like volume or weight, were removed.
  • Outlier values: Data points with excessive workload values were flagged as outliers. According to domain experts, any workload exceeding 2500 Industrial Minutes (IMs) was considered as an outlier. While these data points may not all be fundamentally incorrect, they represent rare exceptions that could introduce significant bias and impede the model learning process.
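A minimal pandas sketch of these three cleaning operations, using hypothetical column names and the expert-defined 2500 IM outlier cap mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset; column names are illustrative.
df = pd.DataFrame({
    "weight_kg":   [10.0, -5.0, 8.0, np.nan, 12.0],
    "material":    ["alu", "cfrp", None, "alu", "titanium"],
    "workload_im": [300.0, 400.0, 500.0, 600.0, 4000.0],
})

# Missing values: drop rows missing a critical feature (weight),
# fill a less critical one (material) with a placeholder category.
df = df.dropna(subset=["weight_kg"])
df["material"] = df["material"].fillna("unknown")

# Incorrect values: physically impossible entries are removed.
df = df[df["weight_kg"] > 0]

# Outliers: workloads above the expert-defined 2500 IM cap are excluded.
df = df[df["workload_im"] <= 2500]
```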
  • Categorical Feature Balancing:
Then, the dataset is balanced to address the issue of categorical values occurring too infrequently. These non-representative instances can introduce noise into the training process. While a 1% heuristic is frequently applied for such thresholds, the iterative nature of our pre-processing revealed that this value resulted in an excessive number of columns during a subsequent encoding phase. Consequently, a stricter threshold of 1.5% was adopted to maintain a manageable feature space while ensuring the inclusion of statistically significant categories.
This balancing is illustrated in Figure 3, which shows the initial distribution of ATA and SubATA values. As high variability in these parameters would introduce noise into the dataset, the least represented categories are excluded. Removing a few of them thus greatly improves the stability and distribution of these features. The retained categories mostly correspond to the ATA 92 (Electric installation), 53 (Fuselage), 57 (Wings), and 21 (Air conditioning) chapters.
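A minimal sketch of this balancing step, assuming a pandas DataFrame and an illustrative column name:

```python
import pandas as pd

def balance_categories(df: pd.DataFrame, column: str,
                       min_share: float = 0.015) -> pd.DataFrame:
    """Drop lines whose category in `column` represents less than
    `min_share` of the dataset (the 1.5% threshold adopted above)."""
    shares = df[column].value_counts(normalize=True)
    kept = shares[shares >= min_share].index
    return df[df[column].isin(kept)]
```
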
  • Discrete Feature Segmentation:
Afterward, since we intend to leverage the SIRUS algorithm [38], each discrete feature is segmented to its closest 10th quantile. This value stabilization helps to define common thresholds, making it possible to regroup the Random Forest rules and extract the most frequent ones.
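This segmentation can be sketched as follows; mapping each value to its nearest decile value is a simplifying assumption of how "closest 10th quantile" is computed, and the actual implementation may differ.

```python
import numpy as np
import pandas as pd

def segment_to_deciles(series: pd.Series) -> pd.Series:
    """Map each value of a discrete feature to its closest decile
    (10th-quantile) value, so that identical split thresholds can
    recur across the forest's trees."""
    deciles = series.quantile(np.arange(0.1, 1.0, 0.1)).to_numpy()
    # Index of the nearest decile for every value in the series.
    nearest = np.abs(series.to_numpy()[:, None] - deciles[None, :]).argmin(axis=1)
    return pd.Series(deciles[nearest], index=series.index)
```
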
  • Categorical Feature encoding:
To make the categorical features interpretable by the machine learning model, they are subsequently encoded. The two most common encoding techniques are Label Encoding and One-Hot Encoding.
  • Label Encoding: Assigns a numerical value (1, 2, 3, etc.) to each category. The main drawback of this technique is that it may introduce an artificial hierarchy or ordinal relationship that does not exist in reality.
  • One-Hot Encoding (Privileged): This technique generates a new Boolean column for each distinct categorical value. A 1 in a column indicates that the data point belongs to that specific category, while a 0 indicates the opposite. This method effectively represents categories without imposing any misleading ordinal relationships. However, it generates a larger feature space (more columns) and might cause complexity issues when a large number of categories are involved.
In order to avoid the unwanted ordinal relationship generated by the Label Encoding technique, and supported by the previous Categorical feature balancing step, which significantly reduced the number of distinct categories, One-Hot Encoding was selected. This resulted in a total of 231 columns, including discrete features.
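The selected One-Hot Encoding corresponds to pandas' `get_dummies`; the feature names below are illustrative, not the actual dataset columns.

```python
import pandas as pd

# Each distinct category becomes a Boolean column, avoiding the
# artificial ordering a Label Encoding would introduce.
df = pd.DataFrame({"ata": ["92", "53", "92", "57"],
                   "weight": [4.0, 2.5, 3.1, 6.0]})
encoded = pd.get_dummies(df, columns=["ata"], prefix="ata")
# 'weight' is kept as-is; 'ata' expands to ata_53, ata_57, ata_92.
```
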
  • Train/Test Split:
Finally, the fully pre-processed dataset was split into two subsets: a training set and a test set. This is a crucial step in machine learning methodology to ensure the model’s generalizability. By intentionally keeping the test subset separate and out of the training process, the primary objective is to avoid a phenomenon known as “overfitting”. Overfitting occurs when a model learns the training data, including its noise, instead of learning the logic behind it. While this produces excellent results on the training data, it significantly reduces the model’s ability to make generic predictions when presented with new data.
As Random Forest models are not overly susceptible to overfitting, and because our pre-processed dataset does not contain an extremely large number of lines, the target size of the test set was set at around 1000 data points. This allowed us to allocate 85% of the data for model training while keeping 15% as a testing set (1112 data points).
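The split can be reproduced with scikit-learn's `train_test_split`; the data below is synthetic and the fixed seed is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 85/15 split mirroring the proportions above; X and y are placeholders.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42  # fixed seed for reproducibility
)
```
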
The pre-processing techniques employed and their resulting number of lines in the input dataset are summarized in Table 1. One can notice that a considerable part of the initial dataset was removed during the Data cleaning step (more than 50%). This was mainly due to the weight feature, which was considered important but was often missing from the source databases.

4.2.4. Model Training - E4, Loop 1

After the pre-processing step, a Random Forest Regressor model was iteratively trained, with the coefficient of determination (R2) and the Root Mean Square Error (RMSE) as the primary performance metrics.
R2 quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. To put it more simply, it represents a measure of how well a regression model explains the variability of the data around its mean.
The formula for the coefficient of determination is described in Equation (3) as one minus the ratio of the unexplained variation (SSres) to the total variation (SStot), thereby yielding a result of at most 1 (and typically between 0 and 1 for a useful model). The unexplained variation (SSres) is the sum of the squares of the errors, described in Equation (1), where f is the prediction and y is the actual data point. The total variation (SStot) is the sum of the squares of the differences between the actual data points and their mean, as described in Equation (2).
$$SS_{\mathrm{res}} = \sum_{i} (y_i - f_i)^2 = \sum_{i} e_i^2 \quad (1)$$
$$SS_{\mathrm{tot}} = \sum_{i} (y_i - \bar{y})^2 \quad (2)$$
$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} \quad (3)$$
However, the consideration of what constitutes a "good" R2 is heavily dependent on the context. Some studies require values extremely close to 1, while others consider 40% an acceptable threshold. For the specific task of early workload estimation in this project, consultation with domain experts established an R2 threshold of 70% as the criterion for demonstrating a correct model fit to the data. It is important to note that the aim of this project is not to be highly precise, but to perform quick approximations that enable short iteration loops.
Nevertheless, relying only on a percentage metric like R2 can remain abstract for some stakeholders. To provide a more concrete and interpretable measure of model performance, we included the RMSE as a complementary metric. RMSE, as defined in Equation (4), represents an absolute measure of the mean error. By using the same units as the predicted workload values (IM), it allows stakeholders to grasp the magnitude of the prediction errors. However, as this metric is highly sensitive to outliers, it is rarely used as the main metric of regression models.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - f_i)^2} \quad (4)$$
By defining a dual-metric specification of R2 greater than 70% and an RMSE below 300 IM, we established both relative and absolute measures for determining the successful completion of the training process. This combined assessment ensures that the model not only explains a significant portion of the data variance but also maintains a practically acceptable level of estimation error.
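For concreteness, a minimal NumPy implementation of the two metrics defined in Equations (1)-(4):

```python
import numpy as np

def r2_score(y: np.ndarray, f: np.ndarray) -> float:
    """Coefficient of determination, Equations (1)-(3)."""
    ss_res = np.sum((y - f) ** 2)           # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

def rmse(y: np.ndarray, f: np.ndarray) -> float:
    """Root Mean Square Error, Equation (4), in the units of y (IM)."""
    return float(np.sqrt(np.mean((y - f) ** 2)))
```
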
With these performance metrics established, a Grid Search analysis is executed to systematically fine-tune the hyperparameters of the Random Forest Regressor model. This optimization process involves exploring key parameters such as the number of decision trees in the forest (the number of estimators) and their maximum depth.
As illustrated in Figure 4, various sets of hyperparameter values were iteratively evaluated through systematic cross-validation. This rigorous approach ensures that the selection of optimal hyperparameters is based on stable performance metrics across five data folds. Such hyperparameter values are detailed in Table 2.
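Such a Grid Search with five-fold cross-validation can be sketched with scikit-learn; the parameter grid and synthetic data below are illustrative, the values actually explored being those reported in Table 2.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the pre-processed industrial dataset.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    cv=5,           # five data folds, as in the study
    scoring="r2",   # primary performance metric
)
grid.fit(X, y)
best = grid.best_params_  # hyperparameters with the best mean CV score
```
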

4.2.5. Model Evaluation - E5, Loop 1

The testing set, which was kept out of the training phase, is then employed to demonstrate the machine learning model’s capability to generalize on new data. As illustrated in Figure 5, the measured workload values are compared against their associated predictions. The accuracy of these predictions (represented by the blue data points) is visually confirmed by their proximity to the central red line (representing perfect predictions).
Thus, the model demonstrates strong performance, achieving R2 = 94% and RMSE = 165 IM. While a clear trend can easily be noticed, the plot also indicates the presence of a few outliers, with error increasing for high workload values. By meeting the predefined metric requirements (R2 > 70% and RMSE < 300 IM), this evaluation confirms the successful learning process for early workload estimation.

4.2.6. Rules Extraction - E6, Loop 1

Once the successful learning process is confirmed, the stakeholders could use the model (considered as a “black box”) as an additional simulation tool. However, our approach leverages interpretable machine learning models to extract the learned knowledge in the form of if-then rules, which can then be incorporated into MBSE environments and ontology models.
The SIRUS algorithm proposes to extract a stable rule set imitating the behavior of the trained Random Forest model by deriving if-then rules from its inner structure while preserving the model's accuracy. Importantly, these rules should not be considered individually: the complete rule set has to be used for each prediction, by averaging the results of each rule.
As seen previously, SIRUS relies on the segmentation of discrete features toward their closest 10th quantiles to stabilize the splits of each tree with homogeneous values. Then, each path of each tree is collected, treated, and aggregated into the resulting rules listed in Table 3.
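A minimal sketch of how such a rule set is then applied: every rule outputs its then- or else-value, and the prediction averages over all rules. The conditions and output values below are purely illustrative, not actual extracted rules.

```python
# Each rule: (condition on a data point x, output if true, output if false).
# Conditions and values are hypothetical, for illustration only.
rules = [
    (lambda x: x["weight"] > 4.2, 900.0, 400.0),
    (lambda x: x["ata"] == "53", 750.0, 350.0),
    (lambda x: x["volume"] > 1.8, 820.0, 380.0),
]

def predict(x: dict) -> float:
    """Average the outputs of the complete rule set for one data point,
    as SIRUS-style predictions require."""
    outputs = [t if cond(x) else f for cond, t, f in rules]
    return sum(outputs) / len(outputs)
```
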
The coherence and industrial relevance of the extracted rules for early workload estimation were initially confirmed through a qualitative validation with industrial experts. However, as a rigorous follow-up confirmation, a quantitative validation was executed on the testing set, mirroring the evaluation step of Section 4.2.5.
The results were considerably below our expectations, with R2 = 38% and RMSE = 7782 IM. After first suspecting an error in our implementation of the SIRUS algorithm, we identified that the nature of SIRUS induces the emergence of shallow rules (mostly with a depth of one or two, and rarely three).
In the context of this specific workload estimation problem, the maximum tree depth appears to be a key hyperparameter for our input data. As shown in Figure 4, satisfactory results only start to be achieved with a depth of eight.
Although the extracted individual rules appear logical, the complete, complex knowledge captured by the deep structure of the RF model was not fully extracted into this simplified rule set. We emphasize that this is a context-specific limitation. In scenarios where models comprising shallow trees are sufficient, the SIRUS approach would be appropriate. However, since our objective is to extract a rule set that represents a maximum of the learned knowledge, such rules were considered insufficient.
Consequently, a second iteration loop is necessary in order to extract a set of rules capable of describing workload estimation more accurately. The analysis of the hyperparameter heatmap (Figure 4) indicated that, while the maximum tree depth was an important factor, the number of trees was relatively less critical for optimal performance. Based on this evidence, we attempted a second loop of the data-driven knowledge extraction by exploiting a single Decision Tree model.

4.2.7. Pre-Processing, Model Training, and Model Evaluation - E3, E4, E5, Loop 2

E3 - Planning to exploit a single DT model, we keep the previous pre-processing steps (data cleaning, categorical feature balancing and encoding). The only difference is the removal of the discrete feature segmentation step that was previously applied to features such as weight and volume. This segmentation was a specific requirement for the SIRUS algorithm, but is not necessary when training a standard DT model. Similarly to the first iteration loop, we kept a training set separated from the testing set as a safeguard against overfitting.
E4 - We then conduct another Grid Search analysis to fine-tune the hyperparameters of the Decision Tree Regressor model. The resulting hyperparameters are detailed in Table 4.
E5 - As shown in Figure 6, while the performance metrics for the single DT model are slightly lower than those achieved by the Random Forest model, they still meet the specified metric requirements. Specifically, the model yielded R2 = 89% and RMSE = 184 IM. Satisfying the minimum requirements established for the project (R2 > 70% and RMSE < 300 IM), this confirms the model's prediction capability for early workload estimation. Nevertheless, larger errors can be noticed than with the RF model.

4.2.8. Rule Extraction - E6, Loop 2

Once the successful training of the DT model for workload estimation has been confirmed, the process of extracting an explicit rule set mimicking its behavior is quite straightforward, since the inner structure of a DT model is a rule set itself. We thus systematically collect all paths from the root node to each leaf. Each of these unique paths constitutes a precise if-then rule that describes the model's predictive behavior. A subset of the extracted rules is detailed in Table 5.
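With scikit-learn, this direct path collection can be sketched via `export_text`, which renders every root-to-leaf path of a fitted tree as an indented if-then branch; the data below is synthetic, for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the pre-processed industrial dataset.
X, y = make_regression(n_samples=100, n_features=3, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Each indented branch ending in 'value: ...' is one if-then rule.
rules_text = export_text(tree, feature_names=["f0", "f1", "f2"])
print(rules_text)
```

Limiting `max_depth` here mirrors the remark below: the depth directly controls how many rules the extraction yields.
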
The major drawback of this direct extraction method is the resulting large volume of rules compared to the SIRUS algorithm. While the magnitude of the rule set can be managed by limiting the maximum tree depth, the resulting number of rules still generates considerable complexity.
The core motivation for this rule extraction step was primarily to demonstrate the principles of interpretable machine learning and to support the trust that can be placed in predictive models.

4.2.9. Validation (F5)

The validation step has been conducted on both extracted rule sets: the concise, yet imprecise, set obtained through the SIRUS algorithm, and the precise, yet lengthy, set constituting the Decision Tree model.
It has already been established that the SIRUS rule set did not capture all the complexity of the knowledge involved for workload estimation. However, after discussion with experts, it appears that these few rules remain coherent and could be used at a very high degree of abstraction. This qualitative validation allows architects to obtain an estimation when the complete context is still undefined. Furthermore, its concise format facilitates its integration and human interpretation.
On the other hand, the rule set constituting the DT model was quantitatively evaluated during the learning process with RMSE and R2 metrics. While these rules represent a more accurate version of the captured knowledge, their large numbers make them more cumbersome to integrate and reuse in a KBE system.

4.3. Knowledge Integration and Execution

Finally, the explicit rules resulting from the hybrid knowledge acquisition method must be incorporated and executed. For this experiment, an example originating from a real case study involving the development of a new hydrogen fuel tank is used. Figure 7 shows an illustrative Computer Aided Design screenshot. Note that this figure has been AI-generated for confidentiality reasons.
This section addresses the incorporation of the captured knowledge in an Ontology-Based Engineering system for the workload estimation of a welding assembly. Through the expressivity of MBSE and ontology modeling languages, both rule sets can be integrated within the OBE system.
As illustrated in Figure 8, the SIRUS rule set was incorporated in the workload estimation pattern. Thus, these rules are available to engineers and directly linked to the MBSE artifacts. Such knowledge, though imprecise, helps to estimate a first degree of magnitude during early co-architecture activities, thereby demonstrating the capability of MBSE to represent knowledge in the form of if-then rules.
Later in the co-architecture process, both the product and manufacturing MBSE architectures were extracted into a knowledge graph to ensure interoperability within the Digital Engineering ecosystem. This knowledge graph, supported by a Neo4J infrastructure, allows for flexible queries that answer the multi-perspective concerns of diverse stakeholders and supports more detailed trade-offs. The lengthy rule set constituting the trained Decision Tree model was then integrated with the Semantic Web Rule Language (SWRL), as described by Kim et al. [40], using the Protégé tool. This more precise knowledge becomes accessible to all stakeholders, with ontology reasoning capabilities that enable executable cross-domain impact assessments.
In the context of this experiment, rule n°226 enables the estimation of the workload for assembling the hydrogen fuel tank parts, as depicted in Figure 9. This knowledge incorporation permits architects to estimate the workload of a whole series of operations by simply using an explicit if-then rule captured from past production records. Domain experts confirmed the coherence of this estimation, although such designs have not yet been produced in real life.
Instead of asking domain experts, often different people, for each independent operation, this hybrid knowledge acquisition approach has provided architects with the means to quickly estimate the workload at a higher degree of granularity. This greatly shortens iteration loops without requiring resource-intensive in-depth analysis from domain experts.

5. Discussion

This method implementation opens several points of discussion. The decision to prioritize a Random Forest model over an individual Decision Tree model often depends on the requirements for rule extraction rather than raw predictive performance. In scenarios where a tree depth of three or fewer is sufficient, the SIRUS algorithm can be employed to derive a rule set that is both precise and concise. However, in studies where greater tree depth is required, the optimal choice depends strictly on the intended application of those rules. As we were in this situation, we adopted a dual approach. First, we integrated SIRUS’s imprecise rule set into the MBSE environment to facilitate preliminary estimations during early-stage system design with a short, human-readable rule set. Secondly, we incorporated the extensive rule set from the DT model into the whole knowledge graph, enabling high-fidelity impact assessments through ontology reasoning.
The authors identify no significant scalability risks associated with the selected machine learning models. By tuning hyperparameters during training, both the model performance and volume of extracted rules can be effectively controlled. However, in scenarios where the input data is insufficient to establish a successful learning process, more traditional and resource-intensive knowledge acquisition approaches should be privileged.
Nevertheless, a critical point remains in the degree of confidence engineers can have toward the extracted rules. While knowledge formalized with experts tends to naturally adapt toward expected changes in new design contexts, a question can be raised regarding knowledge extracted from databases, which can be seen as a frozen image from the past.
In order to ensure that the rules extracted with the machine learning approaches are reliable, the justification of the training metrics alone is insufficient. While these metrics do lend some credibility to the extracted rules, one might question whether they should be used when developing a new system.
The extraction and incorporation of if-then rules make such knowledge explicit and verifiable with concrete evidence. Therefore, by retrieving all historical data points that satisfy a specific rule, architects gain a quantifiable measure of scenario frequency. These frequencies allow for the assignment of statistical weights to individual rules, establishing a formal degree of reliability. Beyond simple quantification, this approach also enables a qualitative review of historical instances to assess their relevance to current projects. This dual-layered analysis provides an empirical justification for the extracted knowledge, though the necessity for manual oversight remains resource-intensive.
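The frequency-based weighting described above can be sketched as follows, where the condition and column names are illustrative, not actual extracted rules:

```python
import pandas as pd

def rule_support(df: pd.DataFrame, condition) -> float:
    """Share of historical data points satisfying a rule's condition,
    usable as a statistical weight for that rule."""
    matched = df[df.apply(condition, axis=1)]
    return len(matched) / len(df)
```

The matched subset can also be inspected directly, supporting the qualitative review of historical instances mentioned above.
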
To systematically evaluate the knowledge reliability within a new product design context, data samples from older programs were structured using a consistent pre-processing methodology. By applying the knowledge extracted from recent datasets to these older programs, we can simulate how effectively the rules generalize to a new program environment.
As shown in Figure 10 and Figure 11, both pre-trained RF and DT models were tested with older program data samples. Although reduced performance was observed, both models maintained metrics exceeding the established thresholds of R2 > 70% and RMSE < 300 IM, as summarized in Table 6. Consistent with the initial training phase, the RF model slightly outperformed the DT model. These results suggest that the core knowledge remains highly relevant for early-stage workload estimation in new design contexts.
However, technological breakthroughs are the most significant and obvious constraints for the reusability of captured knowledge in a new product design context. The introduction of new categorical values (such as a novel material, or process categories) that were unknown during the original training process would significantly decrease the model’s predictive performance. In such scenarios, the captured knowledge would be unable to reliably predict the outcomes for these new data points.
In the authors’ opinion, another important discussion regards the implications for knowledge capitalization within the Digital Engineering ecosystem. The data preparation and pre-processing activities conducted in this research have unveiled information storage issues.
Inconsistencies between categorical features of similar data points have highlighted a fragmented landscape and a lack of homogenization across programs and databases. Further, the prevalence of manually entered data impeded the systematic reuse of some information for knowledge capitalization. Moreover, despite the high volume of historical records, frequent data incompleteness necessitated strict pre-processing filters, which may introduce selection biases into the extracted knowledge.
Therefore, the authors emphasize that establishing standardized protocols for data governance, structure, and storage is fundamental to the successful DE transformation and knowledge capitalization process.

6. Conclusions

As part of the Digital Engineering transformation, this paper studied the application of a hybrid knowledge acquisition method that considers both human experts and industrial databases as its sources. In the context of early Concurrent Engineering, this method aims to extract design rules in order to facilitate cross-domain impact assessment, thus shortening iteration loops between product and manufacturing engineering. Expert-centric knowledge formalization and data-driven rule extraction were conducted toward the capture of explicit knowledge for workload estimation.
Expert-centric knowledge formalization was performed through several modeling sessions with key domain stakeholders in a real-world use-case study. Within the framework of the MfM methodology, and by exploiting the modeling capability of Model-Based Systems Engineering, a SysML model was developed to identify the key concepts and attributes fundamental to workload estimation. This cross-domain knowledge pattern acted as an additional bridge between product and manufacturing system architectures.
Employing this knowledge pattern as a guide, data-driven studies were then conducted to capture tacit knowledge from historical records. Through the robustness and scalability of interpretable machine learning algorithms applied on large industrial datasets, explicit engineering if-then rules were extracted.
The application of this hybrid method resulted in two rule sets with different degrees of abstraction. The first concise rule set, yet imprecise, was incorporated in the MBSE knowledge pattern. Although this approach provided architects with incomplete workload estimation knowledge, it allowed them to estimate a first degree of magnitude during early-stage MBSE activities. Thus, they could design their architectures without relying on a resource-intensive domain expert’s empirical experience.
Later during the Concurrent Engineering process, the second rule set, more precise yet lengthy, was integrated into a shared knowledge graph. This permitted any stakeholder to access detailed, explicit, and executable workload estimation knowledge with reasoning capabilities. Cross-domain impact assessments could then be carried out, significantly shortening the iteration loops between product and manufacturing system design.
The method leveraged in this paper has been accomplished within the MfM methodology and considers MBSE as a fundamental technology brick for engineering activities of complex systems. The authors emphasize that this approach is not intended to replace other knowledge acquisition approaches. Rather, it aims to complement existing methods of knowledge capitalization within Ontology-Based Engineering systems by integrating more data-driven techniques, embracing the Digital Engineering transformation.

Author Contributions

E.D.: writing—review & editing, writing—original draft, visualization, software, methodology, investigation, formal analysis, data curation, conceptualization. A.A.: writing—review & editing, validation, supervision, funding acquisition. R.A.: supervision, resources, methodology, conceptualization. E.L.: supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted under the framework of an industrial chair project with Airbus, which provided funding for the research. The funding source had no role in the study design, data collection, analysis, interpretation, or writing of the manuscript.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors employed generative AI to create the illustration in Figure 7. This approach was taken to represent the system visually while maintaining the confidentiality of proprietary data. The authors would also like to thank Airbus colleagues for their support and precious insights.

Conflicts of Interest

Author Rebeca Arista was employed by the company Airbus SAS. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
BD: Big Data
CAD: Computer-Aided Design
CE: Concurrent Engineering
CM: Configuration Management
CPKT: Cross-Project Knowledge Transfer
DE: Digital Engineering
DoE: Design of Experiment
DT: Decision Tree
ERP: Enterprise Resource Planning
IM: Industrial Minute
KBE: Knowledge-Based Engineering
KM: Knowledge Management
MBSE: Model-Based Systems Engineering
MfM: Models for Manufacturing
MOKA: Methodology for Knowledge-based engineering Applications
MTM-UAS: Methods-Time Measurement Universal Analyzing System
OBE: Ontology-Based Engineering
PLM: Product Lifecycle Management
RF: Random Forest
RMSE: Root Mean Square Error
SE: Systems Engineering
SIRUS: Stable and Interpretable RUle Set
SWRL: Semantic Web Rule Language
SWT: Semantic Web Technologies

References

  1. INCOSE. Systems Engineering Vision 2035—Engineering Solutions for a Better World; Technical Report; INCOSE: San Diego, CA, USA, 2021. [Google Scholar]
  2. Bajaj, M.; Friedenthal, S.; Seidewitz, E. Systems modeling language (SysML v2) support for digital engineering. Insight 2022, 25, 19–24. [Google Scholar] [CrossRef]
  3. Tuegel, E.J.; Kobryn, P.; Zweber, J.V.; Kolonay, R.M. Digital thread and twin for systems engineering: Design to retirement. In Proceedings of the 55th AIAA Aerospace Sciences Meeting, Grapevine, TX, USA, 9–13 January 2017; p. 0876. [Google Scholar] [CrossRef]
  4. Verhagen, W.J. Concurrent Engineering in the 21st Century: Foundations, Developments and Challenges; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; Chapter 2. [Google Scholar]
  5. Suprun, E.; Elsawah, S. Fundamentals for Digital Engineering. In The Guide to the Systems Engineering Body of Knowledge (SEBoK); The Trustees of the Stevens Institute of Technology: Hoboken, NJ, USA, 2026. [Google Scholar]
  6. McDermott, T.; Henderson, K.; Salado, A.; Bradley, J. Digital engineering measures: Research and guidance. Insight 2022, 25, 12–18. [Google Scholar] [CrossRef]
  7. Akundi, A.; Lopez, V. A review on application of model based systems engineering to manufacturing and production engineering systems. Procedia Comput. Sci. 2021, 185, 101–108. [Google Scholar] [CrossRef]
  8. Bruggeman, A.; Bansal, D.; La Rocca, G.; van der Laan, T.; van den Berg, T. Model-based approach for the automatic inclusion of production considerations in the conceptual design of aircraft structures. J. Phys. Conf. Ser. 2024, 2716, 012022. [Google Scholar] [CrossRef]
  9. La Rocca, G. Knowledge based engineering: Between AI and CAD. Review of a language based technology to support engineering design. Adv. Eng. Inform. 2012, 26, 159–179. [Google Scholar] [CrossRef]
  10. Kuegler, P.; Dworschak, F.; Schleich, B.; Wartzack, S. The evolution of knowledge-based engineering from a design research perspective: Literature review 2012–2021. Adv. Eng. Inform. 2023, 55, 101892. [Google Scholar] [CrossRef]
  11. Lentes, J. Ontology-Based Engineering—An Overview. In Proceedings of the Building Resilience into Production: Contemporary Challenges for the Future; Dragomir, M., Popescu, D., Huang, C.Y., Chiu, S.F., Quezada, L., Eds.; Springer: Cham, Switzerland, 2025; pp. 264–270. [Google Scholar] [CrossRef]
  12. Jiménez-Galea, J.J.; Martín-Martín, M.Á.; Martín-Béjar, S. Digitalization of Legacy Machining Tools: A Case Study of a Manual Drill Press. Heritage 2026, 9, 73. [Google Scholar] [CrossRef]
  13. Bruggeman, A.L.M.; La Rocca, G. From requirements to product: An mbse approach for the digitalization of the aircraft design process. In Proceedings of the INCOSE International Symposium, Honolulu, HI, USA, 15–20 July 2023; Volume 33, pp. 1688–1706. [Google Scholar] [CrossRef]
  14. Lee, M.; Ceisel, J.; Liu, Z.; Mavris, D. A parametric, preliminary structural analysis and optimization tool with manufacturing cost considerations. In Proceedings of the 53rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference 20th AIAA/ASME/AHS Adaptive Structures Conference 14th AIAA, Honolulu, HI, USA, 23–26 April 2012; p. 1750. [Google Scholar] [CrossRef]
15. Azevedo, M.; Tavares, S.; Soares, A.L. The digital twin as a knowledge-based engineering enabler for product development. In Proceedings of the Boosting Collaborative Networks 4.0: 21st IFIP WG 5.5 Working Conference on Virtual Enterprises, Valencia, Spain, 23–25 November 2020; Proceedings 21; Springer: Berlin/Heidelberg, Germany, 2020; pp. 450–459.
16. La Rocca, G.; Van Tooren, M. Enabling distributed multi-disciplinary design of complex products: A knowledge based engineering approach. J. Des. Res. 2007, 5, 333–352.
17. Szejka, A.L.; Canciglieri, O., Jr.; Mas, F. Knowledge-based expert system to drive an informationally interoperable manufacturing system: An experimental application in the Aerospace Industry. J. Ind. Inf. Integr. 2024, 41, 100661.
18. Dunbar, D.; Hagedorn, T.; Blackburn, M.; Dzielski, J.; Hespelt, S.; Kruse, B.; Verma, D.; Yu, Z. Driving digital engineering integration and interoperability through semantic integration of models with ontologies. Syst. Eng. 2023, 26, 365–378.
19. Ameri, F.; Sormaz, D.; Psarommatis, F.; Kiritsis, D. Industrial ontologies for interoperability in agile and resilient manufacturing. Int. J. Prod. Res. 2022, 60, 420–441.
20. Sun, X.; Huang, R.; Jiang, Z.; Lu, J.; Yang, S. On tacit knowledge management in product design: Status, challenges, and trends. J. Eng. Des. 2024, 36, 1673–1710.
21. Mas, F.; Arista, R.; Skrzek, M.; Oliva, M.; Morales-Palma, D.; Szejka, A.L. A Framework for Ontology-Based Engineering Systems: Advances and Open Questions About Knowledge Capture and Use in Aerospace Manufacturing. In Innovative Intelligent Industrial Production and Logistics; Springer: Cham, Switzerland, 2025; pp. 166–174.
22. Curran, R.; Verhagen, W.J.; Van Tooren, M.J.; Van Der Laan, T.H. A multidisciplinary implementation methodology for knowledge based engineering: KNOMAD. Expert Syst. Appl. 2010, 37, 7336–7350.
23. Mas, F.; Racero, J.; Oliva, M.; Morales-Palma, D. A preliminary methodological approach to Models for Manufacturing (MfM). In Proceedings of the Product Lifecycle Management to Support Industry 4.0: 15th IFIP WG 5.1 International Conference, PLM 2018, Turin, Italy, 2018; Proceedings 15; Springer: Cham, Switzerland, 2018; pp. 273–283.
24. Stokes, M. Managing Engineering Knowledge: MOKA: Methodology for Knowledge Based Engineering Applications; Professional Engineering Publishing: London, UK, 2001; Volume 3.
25. Hu, X.; Arista, R.; Zheng, X.; Lentes, J.; Sorvari, J.; Lu, J.; Ubis, F.; Kiritsis, D. Ontology-based system to support industrial system design for aircraft assembly. IFAC-PapersOnLine 2022, 55, 175–180.
26. Zheng, X.; Hu, X.; Arista, R.; Lu, J.; Sorvari, J.; Lentes, J.; Ubis, F.; Kiritsis, D. A semantic-driven tradespace framework to accelerate aircraft manufacturing system design. J. Intell. Manuf. 2022, 35, 175–198.
27. Arista, R.; Zheng, X.; Lu, J.; Mas, F. An Ontology-based Engineering system to support aircraft manufacturing system design. J. Manuf. Syst. 2023, 68, 270–288.
28. Tao, F.; Cheng, J.; Qi, Q.; Zhang, M.; Zhang, H.; Sui, F. Digital twin-driven product design, manufacturing and service with big data. Int. J. Adv. Manuf. Technol. 2018, 94, 3563–3576.
29. Oksuz Gurdal, B.; Testik, O.M. A Framework for Product Life Cycle Management Based Digital Twin Implementation in the Aerospace Industry. Appl. Stoch. Model. Bus. Ind. 2025, 41, e70001.
30. Muia, T.; Salam, A.; Bhuiyan, N. A comparative study to estimate costs at Bombardier Aerospace using regression analysis. In Proceedings of the 2009 IEEE International Conference on Industrial Engineering and Engineering Management, Hong Kong, China, 8–11 December 2009; IEEE: Piscataway, NJ, USA, 2009.
31. Tang, Z.; Pinon-Fischer, O.J.; Mavris, D.N. Identification of key factors in integrating aircraft and the associated supply chains during early design phases. In Proceedings of the 14th AIAA Aviation Technology, Integration, and Operations Conference, Atlanta, GA, USA, 16–20 June 2014.
32. Dasari, S.K.; Cheddad, A.; Andersson, P. Random forest surrogate models to support design space exploration in aerospace use-case. In Proceedings of the Artificial Intelligence Applications and Innovations: 15th IFIP WG 12.5 International Conference, AIAI 2019; Proceedings 15; Springer: Cham, Switzerland, 2019; pp. 532–544.
33. Ali, M.; Ali, R.; Khan, W.A.; Han, S.C.; Bang, J.; Hur, T.; Kim, D.; Lee, S.; Kang, B.H. A data-driven knowledge acquisition system: An end-to-end knowledge engineering process for generating production rules. IEEE Access 2018, 6, 15587–15607.
34. Li, Z. Leveraging Computational Algorithms for Effective Explicit and Tacit Knowledge Capture: A Hybrid Approach Combining Expert Interviews, Machine Learning, and Data Mining Techniques. Appl. Comput. Eng. 2024, 114, 15587–15607.
35. Sahin, S.; Tolun, M.R.; Hassanpour, R. Hybrid expert systems: A survey of current approaches and applications. Expert Syst. Appl. 2012, 39, 4609–4617.
36. Duverger, E.; Arista, R.; Aubry, A.; Levrat, E. An ontology-based engineering methodology to support early concurrent engineering. IFAC-PapersOnLine 2025, 59, 2838–2843.
37. Duverger, E.; Levrat, E.; Aubry, A.; Arista, R. A Hybrid Knowledge Extraction Approach to Support Early Concurrent Engineering in the Aerospace Industry. In Product Lifecycle Management. PLM in the Age of Model-Based Engineering in Industry; Springer: Cham, Switzerland, 2025; pp. 98–108.
38. Bénard, C.; Biau, G.; da Veiga, S.; Scornet, E. Interpretable Random Forests via Rule Extraction. Proc. Mach. Learn. Res. 2021, 130, 937–945.
39. Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2018.
40. Kim, K.Y.; Ahmed, F. Semantic weldability prediction with RSW quality dataset and knowledge construction. Adv. Eng. Inform. 2018, 38, 41–53.
Figure 1. Hybrid knowledge acquisition method.
Figure 2. Workload estimation pattern as a Behavior model component.
Figure 3. Initial ATA/subATA distribution (a) and balanced ATA/subATA distribution (b).
Figure 4. Grid Search analysis heatmap.
Figure 5. Random Forest Regressor—performance evaluation.
Figure 6. Decision Tree Regressor—performance evaluation.
Figure 7. Illustrative CAD of the tank assembly (generated with Gemini 3 Pro Image).
Figure 8. Extracted rules formalized in the behavior model.
Figure 9. SWRL rule example in Protégé (a) and Rule inferred in the knowledge graph (b).
Figure 10. Random Forest performance on older program data.
Figure 11. Decision Tree performance on older program data.
Table 1. Data volume engaged in pre-processing (initial dataset of 21,846 lines).

| Pre-Processing Technique | Total Remaining Lines |
|---|---|
| Data cleaning | 9339 |
| Categorical feature balancing | 7413 |
| Discrete feature segmentation | 7413 |
| Categorical features encoding | 7413 |
| Train/test split (85%/15%) | 6301 (train) / 1112 (test) |
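The pipeline summarized in Table 1 could be sketched as follows with pandas and scikit-learn. The column names (`workload`, `ata`, `category`) and the balancing threshold are illustrative assumptions, not the authors' actual schema; the discrete feature segmentation step is omitted because its binning scheme is not specified here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(df: pd.DataFrame, min_class_size: int = 50):
    # 1. Data cleaning: drop incomplete rows and non-positive targets.
    df = df.dropna()
    df = df[df["workload"] > 0]
    # 2. Categorical feature balancing: keep only sufficiently
    #    represented ATA classes (threshold is an assumption).
    counts = df["ata"].value_counts()
    df = df[df["ata"].isin(counts[counts >= min_class_size].index)]
    # 3. Categorical feature encoding: one-hot encode nominal columns.
    df = pd.get_dummies(df, columns=["ata", "category"])
    # 4. Train/test split (85% / 15%), as in Table 1.
    return train_test_split(df, test_size=0.15, random_state=42)
```

The split ratio matches the last row of Table 1; all other parameter values are placeholders.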
Table 2. Random Forest hyperparameter values.

| Hyperparameter | Value |
|---|---|
| n_estimators | 200 |
| max_depth | 12 |
| bootstrap | True |
| max_features | "sqrt" |
| criterion | "squared_error" |
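Assuming the scikit-learn API that these hyperparameter names suggest, the Table 2 configuration corresponds to:

```python
from sklearn.ensemble import RandomForestRegressor

# Random Forest configured with the values from Table 2.
rf = RandomForestRegressor(
    n_estimators=200,            # number of trees in the ensemble
    max_depth=12,                # cap on tree depth
    bootstrap=True,              # sample with replacement per tree
    max_features="sqrt",         # features considered per split
    criterion="squared_error",   # split quality measure
    random_state=0,              # not in Table 2; fixed for reproducibility
)
```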
Table 3. Most occurring extracted rules.

| Frequency | Rule Body | Rule Head |
|---|---|---|
| 0.21 | IF volume ≤ 1,618,204,019.5 mm³ | THEN workload = 789.04 IM, ELSE workload = 2204.72 IM |
| 0.19 | IF size_Z ≤ 4421.3 mm | THEN workload = 944.6 IM, ELSE workload = 4331.9 IM |
| 0.12 | IF category = "assembly" | THEN workload = 3614.4 IM, ELSE workload = 1322.7 IM |
| 0.11 | IF ata = "21" | THEN workload = 326.7 IM, ELSE workload = 1403.6 IM |
| 0.08 | IF weight ≤ 9556.4 g | THEN workload = 886.6 IM, ELSE workload = 1642.5 IM |
| 0.07 | IF size_Y > 1516.1 mm | THEN workload = 1626.6 IM, ELSE workload = 634.4 IM |
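Each rule in Table 3 is a single threshold split with a THEN/ELSE workload estimate, in the style of the SIRUS rule extraction of [38]. A minimal sketch of such a numeric rule as a Python object — the class and field names are illustrative, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class SplitRule:
    """A single-split rule: IF feature <= threshold THEN ... ELSE ..."""
    feature: str        # e.g. "volume"
    threshold: float    # split value
    then_value: float   # workload estimate (IM) when feature <= threshold
    else_value: float   # workload estimate (IM) otherwise

    def predict(self, part: dict) -> float:
        if part[self.feature] <= self.threshold:
            return self.then_value
        return self.else_value

# First rule of Table 3:
volume_rule = SplitRule("volume", 1_618_204_019.5, 789.04, 2204.72)
```

Categorical rules such as `category = "assembly"` would need an equality test instead of the threshold comparison.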
Table 4. Decision Tree hyperparameter values.

| Hyperparameter | Value |
|---|---|
| max_depth | 12 |
| criterion | "squared_error" |
| splitter | "best" |
Table 5. Decision Tree rules subset.

| Rule Body | Rule Head | Sample Basis |
|---|---|---|
| IF size_z > 6501.4 mm AND weight > 4.3 kg AND category ≠ "Electrical assembly" AND material = "ALUMINIUM" | THEN workload = 4667.7 IM | Based on 1271 samples |
| IF size_z ≤ 2566.8 mm AND volume ≤ 3,309,015,256.6 mm³ AND category = "Electrical assembly" AND volume > 654,746,152.5 mm³ AND weight > 1.4 kg AND size_y ≤ 3006.1 mm AND (ata ≠ "57" & subata ≠ "5") AND (ata ≠ "92" & subata ≠ "8") AND (ata ≠ "54" & subata ≠ "5") | THEN workload = 450.4 IM | Based on 71 samples |
| IF size_z ≤ 10,277.0 mm AND volume ≤ 6,136,550,323.2 mm³ AND composition ≠ "STAINLESS_STEEL" AND weight ≤ 10.4 kg AND weight > 0.351 kg AND size_y ≤ 2273.6 mm AND size_x ≤ 5456.7 mm AND ata ≠ "38" AND (ata ≠ "53" & subata ≠ "3") | THEN workload = 89.6 IM | Based on 56 samples |
| IF size_z ≤ 8754.0 mm AND volume ≤ 6,307,480,845.7 mm³ AND composition ≠ "ALUMINIUM" AND weight ≤ 3.3 kg AND weight > 1.1 kg AND size_y ≤ 1936.8 mm AND size_x ≤ 4648.3 mm AND (ata = "54" & subata = "1") AND volume > 5,227,431,757.5 mm³ | THEN workload = 1871.5 IM | Based on 52 samples |
Table 6. RF and DT model performance on older data.

| Model | R² | RMSE |
|---|---|---|
| Random Forest | 81% | 211 IM |
| Decision Tree | 78% | 228 IM |
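The Table 6 scores can be computed from model predictions with scikit-learn's standard metrics; the arrays below are toy values for illustration, not the older program data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true vs. predicted workloads (IM) on a held-out program.
y_true = np.array([300.0, 900.0, 1500.0, 2100.0])
y_pred = np.array([350.0, 850.0, 1600.0, 2000.0])

r2 = r2_score(y_true, y_pred)                        # coefficient of determination
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error, in IM
```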