Article

Large Language Models for Structured Information Processing in Construction and Facility Management

1 School of Business, University of Applied Science and Arts Northwestern Switzerland, 4600 Olten, Switzerland
2 Institute for Information Systems, University of Applied Science and Arts Northwestern Switzerland, 4600 Olten, Switzerland
* Author to whom correspondence should be addressed.
Electronics 2025, 14(20), 4106; https://doi.org/10.3390/electronics14204106
Submission received: 11 September 2025 / Revised: 8 October 2025 / Accepted: 11 October 2025 / Published: 20 October 2025
(This article belongs to the Special Issue Deep Learning Approaches for Natural Language Processing)

Abstract

This study examines how the integration of structured information affects the performance of large language models (LLMs) in the context of facility management. The aim is to determine to what extent structured data such as maintenance schedules, room information, and asset inventories can improve the accuracy, correctness, and contextual relevance of LLM-generated responses. We focused on scenarios involving function calling of a database with building information. Three use cases were developed to reflect different combinations of structured and unstructured input and output. The research follows a design science methodology and includes the implementation of a modular testing prototype, incorporating empirical experiments using various LLMs (Gemini, Llama, Qwen, and Mistral). The evaluation pipeline consists of three steps: user query translation (natural language into SQL), query execution, and final response (translating the SQL query results into natural language). The evaluation was based on defined criteria such as SQL execution validity, semantic correctness, contextual relevance, and hallucination rate. The study found that the use cases involving function calling are mostly successful. Execution validity improved to as much as 67% when schema information was provided.

1. Introduction

Large language models (LLMs) have evolved from generic text-completion systems into multimodal reasoning engines capable of translating user questions into executable code, structured database queries, and domain-specific recommendations. Their capabilities are particularly beneficial for construction and facility management (FM), where experts manage a wide range of data: building information models, maintenance logs, floor plans, inspection reports, and regulatory manuals, which are often stored in various formats and scattered across siloed systems. Transforming this diverse information landscape into actionable knowledge is a long-standing challenge that currently limits data-driven decision making and leads to significant operating costs.

1.1. Problem Statement

Companies in the construction and facility management sector manage large volumes of data such as operational manuals, inspection reports, maintenance logs, and building plans. A major problem in this regard is the handover phase, when the large number of documents produced and maintained during the construction phase must be transferred to support building operations over the remaining lifecycle [1]. Because this information is often unstructured and stored in various formats, finding relevant documents frequently requires significant domain expertise and expert input [2].
The handover process and the subsequent use of information may be supported by a more formalized construction-operations building information exchange using specific data formats such as the Industry Foundation Classes (IFC) model [3]. In the ideal case, building information modeling (BIM) may lead to a detailed representation of the building which can be used for subsequent processes such as facility management [4,5]. However, issues such as costs and data quality remain, and the practical usage of the models may be limited by the users' understanding of the data structures in daily operations involving information retrieval [6,7]. Thus, a solution providing information access in a user-friendly way would still be desirable.
A possible solution, which would allow a user to find the required information by communicating in natural language, is a chatbot driven by an LLM. Recent LLM advances have significantly improved natural language understanding and generation, enabling powerful applications in information-intensive domains. One prominent technique, Retrieval-Augmented Generation (RAG), allows LLMs to combine retrieved documents with natural language prompts to generate sufficiently accurate and context-aware responses [8]. However, the effectiveness of such systems is highly dependent on multiple factors, including the structure and format of the underlying data, the retrieval strategy, and the prompting methodology. In practice, the quality and consistency of generated responses vary considerably, and LLMs are prone to hallucinations, i.e., producing information not grounded in source data, particularly when lacking a domain-specific context [8].

1.2. Thesis Statement and Research Questions

Our study investigates how the integration of structured information, such as asset inventories, maintenance schedules, and environmental monitoring data, can enhance the performance of LLMs in domain-specific applications. Focusing on facility management, this study explores how structured and unstructured data can be combined to improve retrieval accuracy, reduce hallucinations, and enable reliable function calling in technical support scenarios.
Basically, three application scenarios can be considered:
  • Unstructured Input + Unstructured Output: A user submits a free-text question and the LLM responds with a narrative explanation based on retrieved manuals or reports. For instance, a user might ask for the types of floor covers (e.g., for planning cleaning operations), expecting a text answer providing further details.
  • Unstructured Input + Structured Output: In this scenario, the user submits a natural language query and the LLM responds by generating a structured output, such as an SQL query. This is a use case to support a scenario where BIM was already used involving a database with relevant information. For instance, a user might ask for all the cold storage rooms with their areas.
  • Structured Input + Structured Output: The user provides structured parameters and the LLM generates a corresponding structured response, such as a command, database query, or summarized data output, to support automated workflows. This is an advanced scenario with a human user familiar with structured information access, e.g., in the form of SQL queries. It could also include automated information access by an application, e.g., a software for planning operations.
In this paper, we focus on three use case variants which are based on the usage of function calling involving an SQL database and distinguishing the involved steps of the process:
  • Use Case 1 considers only the first step from a user query in natural language involving the LLM-based user interface to a structured output assuming to be a suitable SQL query.
  • Use Case 2 goes one step further and includes the evaluation of the SQL query by the related database software.
  • Use Case 3 considers the end-to-end process by including the evaluation and presentation of the SQL query result through the LLM.
By comparing these variants across multiple metrics (accuracy, response quality), this paper aims to identify best practices for leveraging structured data to augment LLM performance in real-world facility management environments.
Our study addresses one main research question (MRQ), which is subdivided into four sub-research questions (SRQs):
  • MRQ: Assuming a scenario based on using information from a relational database, how does the integration of unstructured/structured data impact the retrieval accuracy of LLM-based facility management systems?
  • SRQ1: What specific parameters (such as for the prompt specification) are relevant and most effective for handling and extracting insights from structured facility management data?
  • SRQ2: What are the advantages and limitations of different methods/techniques for providing structured or unstructured input and output to LLMs in terms of processing efficiency and response correctness?
  • SRQ3: How do different structured and unstructured input strategies (e.g., structured queries, database schema, guided prompts) affect the ability of LLMs to retrieve and process information in a facility management context?
  • SRQ4: What factors (e.g., input/output formats, LLM model choice, retrieval technique) most significantly influence the success of structured-data-enhanced LLM applications in real-world scenarios?
In particular, our thesis is that integrating structured information, such as structured prompts and function calling via a relational database, can enhance the performance of LLMs in domain-specific applications by improving response accuracy, reducing hallucinations, and enabling practical use in professional contexts such as facility management.

1.3. Scope

Our study focuses exclusively on the evaluation of LLMs for text-to-SQL-to-text generation in a facility management context. When defining the project scope, we deliberately excluded related capabilities such as document summarization, RAG, and function calling via external APIs. While API use may provide advantages regarding the robustness and reliability of a solution, using SQL queries as the target format for retrieving external information allows more freedom for experimentation and better insight into LLM capabilities (as API usage may involve particular results being calculated externally without LLM involvement). Thus, it was decided to opt for the PostgreSQL text-to-SQL-to-text solution.
Another reason for this decision was to conduct controlled experiments while avoiding technical dependency on an existing solution, including the related integration effort.
The core objective of this study is to assess whether LLMs can reliably achieve the following:
  • Understand domain-specific natural language queries;
  • Generate syntactically and semantically valid SQL statements;
  • Correctly reflect user intent based on the database schema;
  • Support complex filtering, aggregation, and conditional logic;
  • Communicate results back to the user in an accurate and context-aware manner.
All use cases are based on real-world facility management scenarios such as room data lookup, maintenance scheduling, asset inventory, and cost-related queries. By narrowing the scope in this way, we ensure methodological clarity, practical feasibility within time constraints, and deeper insights into the potential of LLMs in structured enterprise data environments.
For the specific experiments of our study, we had access to data from an Innosuisse Project (107.417 IP-SBM), which involved the project partners LIBAL Schweiz GmbH, FHNW (University of Applied Sciences and Arts Northwestern Switzerland), and ZHAW (Zurich University of Applied Sciences). Access to a sample database was provided by LIBAL.
By addressing the aforementioned research questions, our study contributes to a better understanding of how a solution for providing access to building-related information based on function calling and LLM usage could be designed. In addition, it shows how the reliability of responses can be improved considering typical application scenarios, which are investigated and evaluated during a series of experiments.
This paper is structured as follows. A literature review is provided in the next section. Section 3 presents the research methodology of the study. The experimental design is further specified in Section 4. Section 5 discusses the implementation. The results are presented in Section 6. Section 7 provides a discussion and conclusions.

2. Literature Review

2.1. Literature Search

To identify the relevant literature for this study, a structured search strategy was applied to ensure the inclusion of high-quality, domain-specific academic sources. The objective was to gain a comprehensive understanding of current research on enhancing LLMs with structured data with a focus on facility management. Particular attention was given to techniques aimed at mitigating hallucinations and improving reliability, such as retrieval-augmented generation, structured prompting, and ontology-based knowledge integration.
Academic databases, including Google Scholar and arXiv, were consulted. The search process followed a top–down strategy, beginning with broad terms and gradually narrowing the focus. To further support the search process, AI-assisted tools such as ChatGPT (GPT-4, OpenAI) were used to explore keyword combinations, identify related concepts and map relevant research directions. This allowed for a more comprehensive and efficient literature exploration.
The search was guided by carefully selected keyword combinations, including LLMs, LLM hallucinations, retrieval-augmented generation, structured data + language models, structured prompting, function calling + NLP, ontologies + LLMs, facility management + language models, knowledge graph + LLM. Boolean operators (AND, OR) and database-specific filters such as publication year, subject relevance and citation count were applied to refine the results. Moreover, we searched for specific publications on information management in construction and facility management using keywords such as BIM or IFC. Additionally, forward and backward citation tracking was used to identify seminal works and influential publications.
Priority was given to peer-reviewed journal articles, systematic literature reviews, and conference proceedings published within the last five years to ensure up-to-date coverage of technological developments. The retrieved sources were assessed for their academic credibility, relevance to the research questions, and applicability to the facility management domain. This approach provided a robust foundation for identifying key concepts, current trends, and existing research gaps. It also informed the design of our research methodology.

2.2. Information Management in Facility Management

Facility management involves a wide range of activities expected to be executed effectively and involving various stakeholders such as building owners, operators, tenants, facility managers, and professional advisors [9]. During the construction phase of a building, usually a large number of documents (e.g., operational manuals, inspection reports, maintenance logs, and building plans) are generated and collected, which are handed over to the operational phase of the building lifecycle afterwards. While this information is usually updated and intensively used during the construction phase, information usage may be impeded during the subsequent building lifecycle [1]. For instance, some documents related to infrastructure or equipment may only be needed irregularly, e.g., in case of a necessary repair, which makes information retrieval time consuming and costly.
As a possible solution to that, building information modeling (BIM) is often considered as an approach aiming at detailed digital representations of the physical and functional characteristics of buildings including related assets [4,5,10,11,12]. Often, BIM is based on specifying the related information in relational databases and following standardized data formats such as the Industry Foundation Classes (IFC) model [3,13,14,15,16].
Nevertheless, the usage of BIM includes various issues such as the cost of deriving such models and the data quality [6,7,17,18]. Even if a complete and correct BIM model is available, difficulties may arise, e.g., when users are unaware of query syntax or database schemas [19,20]. This gives rise to considering more user-friendly solutions, e.g., the conversion of natural language queries into an SQL format [21]. Moreover, the usage of LLMs could be considered, providing a capable chatbot interface on the one hand and comprehensive information access on the other, using techniques such as retrieval-augmented generation for document access [2] or function calling for database access [22].

2.3. Enhancing LLMs with Unstructured or Structured Data

LLMs have made impressive progress in natural language understanding and generation, enabling their application across a wide range of domains. However, one of the most persistent challenges in using LLMs for domain-specific tasks, particularly in high-stakes fields such as healthcare, finance and facility management, is their tendency to generate hallucinations. These are outputs that are unfaithful to the source data [8,23]. Hallucinations may result from insufficient context, ambiguity in the input, or a lack of structured grounding. In facility management, such errors can lead to incorrect maintenance schedules or faulty asset tracking [2].
Hallucinations are typically categorized into factual and ungrounded types. Factual hallucinations refer to completely incorrect outputs, whereas ungrounded hallucinations result from insufficient contextual alignment [23]. In the domain of facility management, this can directly impact operational efficiency and safety, e.g., in case of wrong planning of maintenance and repair activities. Dettmers et al. [24] further highlighted that, in particular, LLMs that have been quantized (compressed for efficiency) may suffer from hallucinations despite fine-tuning for improved performance.
To reduce hallucinations and improve response quality, retrieval-augmented generation (RAG) has emerged as a key technique. It integrates external knowledge into the LLM’s generation process by retrieving relevant documents or facts from a structured source prior to generating a response [8,25]. In facility management, RAG systems have been employed to retrieve operational manuals, inspection reports, and maintenance logs, enhancing the contextuality and precision of outputs [2].
Figure 1 illustrates the RAG workflow. When a user submits a query, the system first searches a connected knowledge base to retrieve relevant documents. These are used to form the context, which is then passed to the LLM to generate a grounded, accurate response. This feedback loop ensures that the model has access to real-world, domain-specific data during the generation process.
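As a minimal sketch of this loop (in Python, with hypothetical retriever and LLM client interfaces; not the pipeline used in this study, which excludes RAG from its scope), retrieval and grounded generation could be combined as follows:

    def answer_with_rag(question, retriever, llm, k=3):
        # Retrieve the k most relevant passages from the connected knowledge base
        passages = retriever.search(question, top_k=k)  # hypothetical retriever API
        context = "\n\n".join(p.text for p in passages)
        # Pass the retrieved context to the LLM so the answer is grounded in source data
        prompt = (f"Answer the question using only the context below.\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        return llm.generate(prompt)  # hypothetical LLM client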
Studies by Borgeaud et al. [26] and Liu et al. [27] show that using structured sources, such as ontologies or knowledge graphs, allows LLMs to generate domain-relevant and context-aware responses. For example, linking structured asset data with maintenance history enables the model to make accurate predictions about upcoming service needs.
In addition to retrieval, structured prompting is another effective method to improve the LLM output consistency. Structured prompts guide the model by specifying the input format, for example, by requesting the output as a table, a checklist, or a JSON object, and by clearly defining the expected output type and domain-specific constraints [28]. Irugalbandara [29] introduced meaning-typed prompting, where prompts are designed to reflect the semantic meaning of the input and the desired output. This ensures that the model accurately interprets both the structure and the intent of the information, enhancing the reliability across different tasks.
In facility management, such prompt engineering techniques can be employed to structure user queries for function calling scenarios. In general, function calling means that an LLM can reliably communicate with external tools to enable effective tool usage and interaction via external APIs. For example, LLMs can automatically generate SQL commands or structured maintenance reports, especially when combined with well-structured backend knowledge.
The specific problem of translating natural language into SQL queries with tool support has gained significant interest in recent years under the term “Text-to-SQL.” Originally, related techniques focused on NLP with machine learning approaches (especially deep learning techniques), which require sufficiently comprehensive datasets for training [30,31]. For training and evaluation purposes, significant effort has been made to provide suitable datasets. Some of the most well-known datasets available for benchmarking purposes are Spider and WikiSQL [32,33,34]. During recent years, LLMs became an attractive approach for Text-to-SQL applications due to their increased capabilities of coping with variation in text and their pretrained availability, which reduces the need for additional training or fine-tuning [35]. Recent versions of LLMs have shown promising capabilities when used directly for SQL query generation (within a zero-shot or few-shot setting) [36].
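For illustration, a zero-shot Text-to-SQL prompt of this kind can be sketched as follows (Python, with a hypothetical llm.generate client and an abbreviated schema excerpt):

    SCHEMA_EXCERPT = "space(id, name, description_txt, area, floor_id); floor(id, name, elevation)"

    def text_to_sql(question, llm):
        # Zero-shot setting: no worked examples, only the schema and the task description
        prompt = (f"Given the following PostgreSQL schema:\n{SCHEMA_EXCERPT}\n"
                  f"Write a single SQL query that answers: {question}\n"
                  f"Return only the SQL statement.")
        return llm.generate(prompt)  # hypothetical LLM client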
Ontologies provide a formal representation of knowledge by defining relationships between concepts. Liu et al. [27] showed that ontologies can be leveraged by LLMs to interpret complex domain-specific queries with higher accuracy. In facility management systems, ontologies provide the structural foundation for linking technical terms from equipment logs to preventive maintenance schedules or energy usage data. By formally defining the relationships between such domain-specific entities, ontologies enable large language models to achieve a deeper semantic understanding and thereby improve the precision and reliability of their outputs. Gao et al. [8] demonstrated that SQL outputs derived from structured inputs are more consistent and interpretable than free text answers, particularly in mission-critical environments. The use of structured outputs, such as database queries or function calls, supports the integration of LLMs into automated workflows, which may advance professional FM systems.
A major challenge in facility management is the unstructured nature of much of the data, such as PDF manuals, scanned inspection forms or multilingual documents. Techniques such as vector-based document embedding, table parsing and semantic segmentation are used to convert unstructured sources into structured formats [37,38].
BGE M3 Embedding [37] offers a multifunctional, multilingual embedding framework that enables fine-grained control over document retrieval and alignment. Its capabilities suggest potential value for handling multilingual documents in global facility management contexts, where diverse language sources and heterogeneous document formats are common [37]. Günther et al. [39] highlighted the potential of token-optimized embedding strategies, which improve retrieval efficiency by reducing token overhead and optimizing document segmentation in large-scale data environments.
Newer LLM functionalities such as asynchronous function calling [28] assume that the model can respond with executable output rather than just text. This is especially useful in scenarios where an LLM must trigger an action, such as generating a service ticket or updating a maintenance log. The ability to transition between structured and unstructured output formats is key for deploying LLMs in business-critical environments such as facility management.
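As a hedged illustration of such executable output, a function-calling exchange could take the following form (tool name, parameters, and values are hypothetical and only indicate the structured shape of the response):

    # Hypothetical tool definition exposed to the model by the host application
    CREATE_TICKET_TOOL = {
        "name": "create_service_ticket",
        "parameters": {"asset_id": "string", "issue": "string", "priority": "string"},
    }

    # Instead of free text, the model returns a structured call that the host system executes
    model_response = {
        "tool": "create_service_ticket",
        "arguments": {"asset_id": "HVAC-12", "issue": "filter replacement", "priority": "medium"},
    }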

2.4. Summary and Research Gap

In summary, the integration of structured data through techniques such as retrieval-augmented generation (RAG), structured prompting, ontologies, and advanced embeddings can significantly enhance the performance of large language models (LLMs) in facility management by reducing hallucinations, improving response accuracy and enabling automation.
However, despite these advancements, a comparative evaluation of these methods in heterogeneous real-world environments is still lacking. In this context, heterogeneous refers to environments characterized by diverse document formats, such as PDFs, scanned inspection reports, technical drawings and multilingual data sources, as well as varying system architectures. Most studies examine individual techniques in isolation and are often limited to general NLP benchmarks. This creates a gap in understanding which combination of input and output structuring, model configuration and data augmentation strategies yields the best performance in domain-specific, practical applications.
This study addresses this gap by systematically evaluating different structuring approaches across input and output modalities and assessing their impact on LLM performance in facility management use cases.

3. Research Methodology

This study adopts a design science research (DSR) methodology to investigate how different configurations of structured and unstructured data inputs, outputs, and retrieval strategies affect the performance of large language models (LLMs) in the domain of construction and facility management. The approach follows a structured research lifecycle, including awareness, suggestion, implementation, and evaluation phases, as proposed by Hevner et al. [40] and Peffers et al. [41]. This methodology is particularly suitable for problem solving in information systems where new artifacts (e.g., prompt strategies, data pipelines, or retrieval frameworks) are created and empirically evaluated.

3.1. Problem Awareness

The awareness phase identified the core challenges faced by facility management professionals when interacting with large volumes of heterogeneous data. These challenges include inefficient document search, inconsistent response quality, and limited integration between structured systems and unstructured sources.
To ground the study in real-world requirements, several semi-structured stakeholder interviews and Q&A workshops were conducted with technical project staff and decision-makers in a facility management environment. These sessions helped define the scope of the experimentation, prioritize tasks such as query automation or building component lookup, and determine the practical feasibility of integration using an Application Programming Interface (API). The retrieved literature pointed out the need for better user support including a user interface enabling natural language communication as provided by LLMs while still providing accurate access to information.
Based on this input, the central research focus was formulated: testing how different LLM architectures, retrieval methods, and prompt engineering strategies perform when processing structured and unstructured data. While usually LLMs are considered to work with unstructured text input and output, structured input may be, for instance, database schemas or ontology information, while structured outputs may be function calls such as database queries.

3.2. Suggested Solution

In the suggestion phase, theoretical and technological solutions were identified and mapped to the identified challenges. The main outcome of this phase was a high-level artifact design concept including a modular testing framework that allows for comparison between multiple input/output and data retrieval configurations.
The specific components of the suggested solution are outlined below:
  • Prompt engineering with different configurations (zero-shot, few-shot, and guided prompts);
  • RAG;
  • Function calling and structured output (e.g., SQL, JSON) generation;
  • Embedding-based document search, including the use of multilingual models such as BGE-M3.
Concrete experiments were designed to compare how LLMs perform across the following four dimensions:
  • Input Format: unstructured queries vs. structured parameter input;
  • Output Format: free-text explanations vs. structured SQL/JSON output;
  • Data Type: unstructured document corpus (e.g., reports, manuals) vs. structured relational data (PostgreSQL);
  • Retrieval Strategy: zero-context vs. embedding-based RAG with semantic chunking.
Each configuration was tested against realistic facility management queries derived from the stakeholder requirements.
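Conceptually, the tested configurations span a small grid over these four dimensions; the following sketch (Python, with illustrative labels rather than the exact configuration names used internally) shows how such a matrix can be enumerated:

    from itertools import product

    input_formats   = ["unstructured_query", "structured_parameters"]
    output_formats  = ["free_text", "structured_sql_json"]
    data_types      = ["document_corpus", "postgresql_relational"]
    retrieval_modes = ["zero_context", "embedding_rag_semantic_chunking"]

    # Each combination corresponds to one candidate test configuration
    configurations = list(product(input_formats, output_formats, data_types, retrieval_modes))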

3.3. Data Sources

The experimental dataset consists of both structured and unstructured data derived from an ongoing facility management project:
The unstructured data are outlined below:
  • Operation and maintenance manuals (PDF);
  • Building and infrastructure plans;
  • Environmental reports and concept papers;
  • Financial summaries related to the building operation.
The structured data stem from a PostgreSQL database including the following information:
  • Room and asset metadata;
  • Maintenance schedules;
  • Inspection logs;
  • Equipment classifications.
The database schema and its entity-relationship structure were made available by LIBAL Schweiz GmbH. These structured data provide the foundation for the text-to-SQL generation and structured reasoning use cases.
In addition, the LIBAL API, a commercial Common Data Environment (CDE) platform, was partially integrated for selected experiments (e.g., function calling tests), although its deterministic behavior limited its applicability for LLM learning or benchmarking. Finally, we decided to use a PostgreSQL database clone directly without the API.

3.4. Experimental Evaluation

The experimental evaluation of the solution focused on qualitative and quantitative metrics commonly used in LLM benchmarking studies (see, e.g., [8,23]). Each response generated by the model was scored by manual and automatic approaches comprising the following criteria:
  • Correctness: factual accuracy based on comparison with database or document ground truth;
  • Hallucination: degree to which fabricated or irrelevant information was included;
  • Contextual Relevance: semantic match between the query intent and output;
  • Execution Validity: the syntactic and semantic correctness of the structured outputs (SQL queries).
In cases where numerical queries were issued (e.g., retrieving an area size or asset count), validation was performed using deterministic queries against the PostgreSQL backend.
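For such numerical checks, validation against the backend can be sketched as follows (Python with psycopg2; the reference query and tolerance are illustrative assumptions, not the exact validation scripts used):

    import psycopg2

    def validate_numeric_answer(llm_value, reference_sql, dsn, tol=0.01):
        # Run a deterministic reference query against the PostgreSQL backend
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(reference_sql)
            expected = cur.fetchone()[0]
        # Compare the model's numeric answer to the ground truth within a small tolerance
        return abs(float(llm_value) - float(expected)) <= tol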
Both user-generated and synthetic prompts provided by the industry partner were used to ensure wide test coverage. Some prompts (see below) included embedded metadata or structured hints to test the LLMs’ sensitivity to the input formulation.
During the experiments, various artifacts were produced and collected, such as detailed logs, performance metrics, and structured comparisons between the LLM configurations. These artifacts include the following:
  • A comparative results matrix for all tested use case configurations;
  • Annotated examples of the outputs;
  • Taxonomy of input/output structures and their impact on model behavior;
  • Practical guidelines for prompt design and structured data preparation in facility management LLM deployments.
The artifact contributes to both academic research and practical implementation by showcasing how structured data can significantly improve LLM reliability and applicability in technical real-world domains while still providing a user-friendly communication interface.

4. Experiment Design and Implementation

This section outlines the technical and methodological setup of the experiments conducted to evaluate the performance of LLMs in facility management tasks. The experiments were designed to investigate how the integration of structured and unstructured data, varying prompt strategies, and different model configurations impact the accuracy, consistency, and reliability of the LLM-generated outputs.

4.1. Integration with Facility Management Data

LIBAL provides a CDE platform designed specifically for the building owner. It serves as a central hub for managing and exchanging structured information throughout the entire lifecycle of a building, from design and construction to operation and maintenance. The software is tightly integrated with the BIM2FM (Building Information Modeling to Facility Management) process, helping to transform project data into a trusted, complete digital building model. This model can be continuously updated and reused for ongoing building management, renovation, or demolition [42].
The LIBAL API is the primary way to integrate the software with other systems and automate the exchange of data, which are stored in a PostgreSQL database. The API allows software applications, platforms, and services involved in a building project to connect and interact with the building data in a structured and secure way.
One current application of the LIBAL API was given as an example during one of the stakeholder meetings: The API is used as a function calling tool for calculating the area of a specific facility in square meters. It is essential to mention that the LLM or the agent cannot influence the result of the function calling tool, meaning all math calculations are made inside the function to provide accurate and consistent results.
However, due to the static and deterministic nature of the LIBAL API, it was decided that the experiments should mainly focus on LLM-generated SQL queries using the PostgreSQL Libal database. This approach allows for much broader experimentation and variation in prompt styles, model behavior, and data usage.

4.2. Setup and Tools

To ensure reproducibility and effective testing, a modular and flexible setup was implemented. Table 1 summarizes the primary components used. In particular, four families of recent LLMs were included to cover a variety of models and both on-premises and off-premises hosting scenarios for possible practical usage. Other state-of-the-art models were not explored to keep the research design manageable or for cost reasons.

4.3. Experiment Design

Three use case variants were tested based on different combinations of the input and output formats. Each use case reflects a practical scenario in facility management and was designed to assess the model’s ability to generate accurate, structured and unstructured outputs, especially SQL queries, from various types of input.

4.3.1. Use Case 1: Unstructured Input → Structured Output

The LLM receives a natural language prompt and is instructed to generate a structured SQL query based on its understanding of the user’s intent.
  • Example Input: “What is the area of Room 203 in Building A?”
  • Expected Output: SELECT area FROM rooms WHERE room_number = '203' AND building_id = 'A'.

4.3.2. Use Case 2: Unstructured Input → Function Calling

In this scenario, the LLM receives a free-text query and decides whether to invoke a predefined external function (e.g., via the Libal API) to retrieve data such as calculated room areas. However, due to the deterministic nature of the Libal API (which does not allow the LLM to modify inputs or outputs), this use case was limited in scope and used primarily for supplementary insights.
  • Example Input: “Calculate the total floor area of Building A using the Libal platform.”
  • Example Output: CALL calculate_floor_area('Building A') → returns 1254.67 m².

4.3.3. Use Case 3: Structured Input → Unstructured Output

The LLM receives a structured prompt with parameters, such as the initial user prompt, the initial SQL query, and the SQL execution result. The parameters are then used to generate an unstructured response in natural language to answer the question.
  • Example Input:
    prompt = (f"Question: On which floors are training rooms available?\n"
              f"SQL Query: SELECT id, name FROM space WHERE description_txt LIKE '%training%';\n"
              f"SQL Result: [(74592546, '00.415'), (74622500, '01.127'), (74632504, '01.535')]")
  • Example Output: The rooms with IDs 74592546, 74622500, and 74632504, named ‘00.415’, ‘01.127’, and ‘01.535’, are designated for training purposes.

4.4. Prompt Strategies

To ensure a realistic and varied evaluation, six representative prompts were selected from a set of typical facility management questions provided by the Innosuisse project team domain experts. These questions reflect actual stakeholder needs, including space management, cleaning logistics, and infrastructure maintenance. While six prompts are not enough for deeper statistical insights such as using significance testing, they provide a sufficient basis for qualitative insights in a practical application setting.
The prompts were grouped into three complexity levels based on the SQL reasoning difficulty:
1. Simple prompts (easy)
These prompts involve basic data retrieval, typically requiring a single-table lookup without additional filtering or logic.
  • “On which floors are restrooms located?”
→ Cleaning Manager—Location overview of sanitary rooms by floor.
  • “What types of floor coverings exist?”
→ Cleaning Manager—Overview of the floor material types in the building.
These questions were classified as low complexity because they only required selecting values from a single column.
2. Medium-complexity prompts (medium)
These prompts require combining multiple data points, such as room identifiers and their associated attributes, potentially including SQL table joins.
  • “List all cooling rooms with their area sizes.”
→ Logistics Manager—Room numbers and area retrieval for cold storage zones.
  • “What are the area sizes of the sanitary rooms on each floor?”
→ Cleaning Manager—Per-floor breakdown of sanitary room sizes.
These prompts often required SELECT queries with multiple fields and optional grouping by room type or floor.
3. Complex prompts (hard)
These prompts include conditional logic, comparisons, table joins, or advanced filtering—all of which increase the complexity of SQL generation.
  • “Which training room has the largest area?”
→ Quality Manager—Selection of the training room with the maximum area.
  • “Which rooms are suitable for training?”
→ Quality Manager—Filtering for rooms labeled as suitable for training.
These were considered high complexity due to the need for ranking, filtering, and condition-based logic, which challenged the model’s reasoning ability.
Each model was tested with at least two prompts per complexity level. This systematic distribution enabled a meaningful comparison of each model's robustness and adaptability to different query types.
This classification not only ensured coverage of real-world data needs but also allowed us to evaluate how different prompt formulations affect the accuracy and consistency of LLM-generated queries. The prompt strategies used in this study thus served as a key variable in analyzing LLM performance across use cases.

4.5. Evaluation Criteria

Each use case was evaluated using a combination of qualitative and quantitative criteria based on metrics commonly used in LLM benchmarking studies (e.g., [8,23]). These criteria aim to assess the performance and practical reliability of the model outputs in a structured facility management context:
  • Correctness: measures whether the generated SQL query produces the correct results based on the known ground truth from the PostgreSQL database.
  • Hallucination: evaluates the degree to which the model includes fabricated, irrelevant, or unsupported information in its response.
  • Contextual relevance: checks how well the generated output aligns with the user’s original query intent and the relevant database schema.
  • Execution validity: verifies whether the generated SQL queries are syntactically correct and can be executed without error on the PostgreSQL database. This criterion ensures that the model output is not only logically plausible but also technically usable in real-world systems.
All prompts were executed in a controlled environment using a consistent, standardized system prompt with a fixed temperature value (temperature = 0) on a stable PostgreSQL test database. In preliminary evaluations, we conducted experiments with temperature > 0, which led to more varied responses that were more frequently incorrect. Therefore, and based on a discussion with project stakeholders, we decided to conduct the experiments only with temperature = 0. Where appropriate, we used both user-generated and synthetic prompts to challenge the LLM's reasoning abilities at different complexity levels.

5. Implementation Architecture

This section discusses the technical implementation of the artifact, outlines each sequential step of the pipeline, and discusses the specific setup constraints.

5.1. Text-to-Query-to-Text Pipeline

A modular testing pipeline was implemented to systematically evaluate the performance of different LLMs in processing facility management-related prompts. This setup uses a realistic query-handling system and consists of three distinct phases:
  • User query translation (write_query): The LLM receives a user prompt in either an unstructured or structured form and generates a corresponding SQL query. This step tests the model's ability to understand the intent and develop the correct logic. This phase corresponds to the previously defined Use Case 1: Unstructured Input → Structured Output.
  • Query execution (execute_query): The generated SQL query is executed against a PostgreSQL database containing facility management data. This step ensures that the SQL is valid, executable, and returns meaningful results.
  • Response generation (generate_answer): The result is post-processed and returned as a natural language unstructured or structured response, depending on the test configuration. This phase corresponds to the defined Use Case 3: Structured Input → Unstructured Output.
Figure 2 shows the architecture of the artifact, which illustrates the modular testing pipeline used to evaluate the LLM performance in the facility management queries. It includes three sequential steps: query generation (write_query), query execution (execute_query), and result generation (generate_answer). This process reflects a typical flow from prompt to SQL to execution to answer.
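A simplified sketch of these three phases is shown below (Python; the llm.generate client and the database connection handling are placeholders for the actual implementation):

    def write_query(user_question, llm, system_prompt):
        # Phase 1: translate the natural-language question into an SQL query
        return llm.generate(f"{system_prompt}\nQuestion: {user_question}")

    def execute_query(sql, conn):
        # Phase 2: run the generated SQL against the PostgreSQL database
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

    def generate_answer(user_question, sql, rows, llm):
        # Phase 3: turn the structured result back into a natural-language answer
        prompt = (f"Question: {user_question}\n"
                  f"SQL Query: {sql}\n"
                  f"SQL Result: {rows}\n"
                  f"Answer:")
        return llm.generate(prompt)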

5.2. Setup for the Experiments

As the PostgreSQL database schema is large and requires at least approximately 10,000 tokens, it was decided to include only relevant tables from the schema. This setup allows us not to exceed the standard LLM context window, which is often limited to 8192 tokens. The only exceptions to this are the large-context-window Gemini models capable of fitting 1 million tokens. However, the provided relevant tables were consistent across all tests for comparable testing results.
The relevant tables identified for obtaining answers to most user questions included the following:
  • “facility”: provides the overall context for the building/site and is referenced by most other relevant tables;
  • “facility_attributes”: stores additional properties for facilities;
  • “floor”: contains information about building floors (e.g., name, elevation) and links spaces to a specific floor;
  • “floor_attributes”: stores additional properties for floors;
  • “space”: rooms and spaces including name, category (e.g., “training,” “restroom”), area, and link to floor;
  • “space_attributes”: attributes relating to spaces (e.g., detailed room function or specific equipment);
  • “attribute”: defines types of attributes that can be assigned to entities (e.g., area, floor covering, function of room);
  • “type”: a general-purpose typing table (room types, materials, …);
  • “material”: details about specific materials being used (such as for floor coverings).
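For illustration, this table subset can be serialized into the {table_info} placeholder used in the system prompt roughly as follows (column names are abbreviated and partly assumed; the actual schema description is generated from the database):

    RELEVANT_TABLES = {
        "facility":            ["id", "name"],                              # overall building/site context
        "facility_attributes": ["facility_id", "attribute_id", "value"],
        "floor":               ["id", "facility_id", "name", "elevation"],
        "floor_attributes":    ["floor_id", "attribute_id", "value"],
        "space":               ["id", "floor_id", "name", "description_txt", "area"],
        "space_attributes":    ["space_id", "attribute_id", "value"],
        "attribute":           ["id", "name"],                              # e.g., area, floor covering
        "type":                ["id", "name"],                              # room types, materials, ...
        "material":            ["id", "name"],                              # e.g., floor covering materials
    }

    def serialize_table_info(tables):
        # Produce the compact "table(column, ...)" listing injected into the prompt
        return "; ".join(f"{t}({', '.join(cols)})" for t, cols in tables.items())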
For our test scenario, the selection of these tables appears rather straightforward. Notably, various tables appear redundant for typical use cases in facility management because BIM models include a lot of information that is more relevant for other purposes, such as construction. We assume that a corresponding reduction of the model might also be useful in a realistic application scenario. However, the selection of tables may influence the results, and future research should also investigate experimental setups with a varying coverage of model tables.
As discussed above (see Section 4.4), all LLMs use a temperature parameter that is set to 0 to ensure consistency across tests and minimize randomization. Additionally, the system prompts for instructing the LLM were identical across all tests except those providing the database schema.
The following system prompt was used: “Given an input question, create a syntactically correct {dialect} query to run to help find the answer. You can order the results by a relevant column to return the most interesting examples in the database. Never query for all the columns from a specific table, only ask for the few relevant columns given the question.” In addition, we used an extended version that includes the database schema in the prompt: “Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table. Only use the following tables: {table_info}.”
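Assembling the two system prompt variants (with and without schema) can be sketched as follows; the constants abbreviate the prompt texts quoted above, and filling {dialect} and {table_info} is shown with placeholders:

    BASE_PROMPT = (
        "Given an input question, create a syntactically correct {dialect} query to run "
        "to help find the answer. [...] only ask for the few relevant columns given the question."
    )
    SCHEMA_SUFFIX = (
        " Pay attention to use only the column names that you can see in the schema description. "
        "[...] Only use the following tables: {table_info}."
    )

    def build_system_prompt(dialect="PostgreSQL", table_info=None):
        prompt = BASE_PROMPT.format(dialect=dialect)
        if table_info is not None:  # schema-aware setup of the experiments
            prompt += SCHEMA_SUFFIX.format(table_info=table_info)
        return prompt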

5.2.1. Test Questions

All experiments were conducted with each LLM; all of the models are multilingual and capable of interpreting questions in English and German. The experiments involved six identified questions selected from a list provided by the Innosuisse project team (see Section 4.4). These questions were kept in their original German form without any modifications, as this may facilitate better integration with the database schema, which contains values in both English and German. Table 2 shows the English versions of the questions and the expected output. Additionally, all questions correspond to entries in the PostgreSQL Libal database, meaning that an LLM should be able to retrieve the relevant data if a correct SQL query is constructed. The questions are presented in the following order: two easy, two medium, and two hard complexity questions, as previously described.

5.2.2. Evaluation Setup and Rules

Table 3 shows the defined rules used for evaluating each run. The evaluation consisted of manual and automated evaluations with the support of LLM-as-a-judge, a method that uses LLMs to evaluate quality, correctness, contextual relevance, and hallucinations.
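A minimal sketch of an LLM-as-a-judge call for the automated part of the evaluation is given below (the exact judging prompt wording, rating scale, and output parsing of the study may differ):

    def judge_response(question, expected_values, model_answer, judge_llm):
        # Ask a separate LLM to rate the answer against the ground truth
        prompt = (
            "You are an evaluator. Given a question, the expected answer values, and a model answer, "
            "rate the answer.\n"
            f"Question: {question}\n"
            f"Expected values: {expected_values}\n"
            f"Model answer: {model_answer}\n"
            "Return JSON with the fields correctness (0/1), contextual_relevance (0/1), hallucination (0/1)."
        )
        return judge_llm.generate(prompt)  # hypothetical client; the JSON is parsed downstream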

6. Results

This section presents the results of the conducted experiments to assess the performance of the LLMs in a text-to-SQL-to-text scenario within a facility management context. The objective was to evaluate how well different models can generate executable SQL queries, handle varying levels of prompt complexity, and produce contextually accurate responses based on a predefined database schema.
The modular testing prototype produced 120 runs (6 questions × 10 models × 2 setups), testing ten LLM models with and without the database schema.
All responses were evaluated on the basis of four key dimensions as discussed above.
  • Correctness (alignment with the expected answers);
  • Execution validity (syntax and logic of the generated SQL);
  • Hallucination (presence of false or fabricated content);
  • Contextual relevance (semantic fit to the prompt).
The detailed results in percent are shown in Figure 3. For each evaluation criterion, these values are calculated as averages over the considered sample (60 results corresponding to 6 test questions involving 10 LLMs).

6.1. Key Findings

The experiments confirm that providing a database-schema context is a decisive factor for both the executability and the quality of the SQL query produced by LLMs. When the schema is included in the prompt, the models are far more likely to generate syntactically valid SQL, achieve higher semantic accuracy, and hallucinate less.
Key findings at a glance:
  • Execution validity (share of queries that run without error) rises from 0% without the schema to 35% with the schema. Without an explicit context, every generated query in our test set failed to execute.
  • Correctness (at least one of the expected answer values is present in the SQL or the natural-language answer) improves from 38% to 58% of all runs when the schema is supplied.
  • Hallucinations (answers containing no meaningful overlap with expected values) drop from 65% without the schema to 50% with the schema, a 15-percentage-point reduction that still indicates that half of the schema-aware answers lacked grounding.
These results underscore the importance of the structured context input. Only by integrating the underlying database structure into the prompt can LLMs generate reliable, usable, and semantically accurate outputs. Figure 3 illustrates the effect of the schema context on the four central evaluation metrics in direct comparison. The hallucination rate is shown as an inverse hallucination rate so that for each criterion, a longer bar corresponds to a better result.
The heatmap (see Figure 4) summarizes the average performance scores across the core evaluation criteria—Execution Validity, Correctness, Contextual Relevance, and Hallucination—for each tested LLM. A higher score indicates better performance, except in the case of hallucination, where lower values are favorable (i.e., fewer hallucinations). The color scale emphasizes strengths in green and weaknesses in red, allowing for intuitive model benchmarking.
The analysis was based on an Excel dataset with over 120 entries, containing prompt texts, models, generated SQL queries, execution outcomes, model outputs, and reference answers. In addition to the quantitative analysis, qualitative observations regarding prompt sensitivity, hallucinations, and model-specific patterns were documented.
The primary findings show a more nuanced picture than initially assumed. Models with larger context windows and advanced prompt alignment do not automatically excel across all metrics:
  • Execution validity is led by Gemini 2.5 (0.33) and Qwen-3-32b (0.33), while most other systems remain below 0.25.
  • The models Qwen-3-32b (1.00) and LLaMA 3.3-70b (0.92) achieved the highest correctness scores in the evaluation. While LLaMA 3.3-70b shows high semantic alignment with the expected output, it is characterized by a low execution validity score (0.08). By contrast, Qwen-3-32b maintains a high level of correctness while exhibiting a moderate hallucination rate (0.42), which suggests relatively stable output behavior under these evaluation conditions.
  • LLaMA 4-Scout-17b-16e achieved the highest score in contextual relevance (0.75).
  • The lowest hallucination rates were observed for LLaMA 4-Scout (0.25) and LLaMA 3.3 (0.42), indicating comparatively stable output generation. In contrast, substantially higher hallucination scores were recorded for Mistral-Saba-24b (0.83) and the Gemini 1.x series (0.75), suggesting a greater tendency toward generating factually incorrect or unsupported content.
The heatmap reveals that the execution validity and hallucination rate are not necessarily correlated. For example, LLaMA 4-Scout-17b-16e generated no executable SQL queries (execution validity = 0) yet maintained the lowest hallucination rate (0.25), indicating accurate but non-runnable outputs. In contrast, Mistral-Saba-24b also failed to generate executable queries but exhibited the highest hallucination rate (0.83), producing both invalid and misleading results. These examples show that failure to execute does not imply a higher risk of hallucination—both aspects must be evaluated separately.
Models without reliable schema access or with limited database-reasoning pretraining often failed to generate runnable SQL even on simple prompts.
These results directly support the main research question (MRQ) by showing that both the structured input format and the precise schema alignment are critical for retrieval accuracy and consistency.

6.2. Additional Findings

After filtering our Excel dataset based on our testing criteria and removing all errors and empty rows, we obtained 22 responses. For these, the occurrence of hallucinations and correctness appeared to be independent of the level of difficulty. Figure 5 provides an overview of how often the tested models were able to generate SQL code that executed successfully. The best-performing model in this context is qwen-3-32b with four correct executions. Close behind are various versions of Google’s Gemini models (e.g., gemini-2.5-flash-preview-05-20, gemini-2.0-flash, and gemini-1.5-flash), each achieving three correct executions. Models such as llama-3.3-70b and meta-llama/llama-4-maverick achieved moderate performance with two successful runs, while others, including llama-4-scout-17b and llama-3.1-8b, managed only one correct result.

6.3. Observed Hallucinations

Despite the clearly expressed prompts, several models exhibited hallucinations:
  • Fictitious room labels: Some models invented room numbers or names that did not exist in the dataset.
  • Incorrect attribute assignments: In several responses, materials or room functions were mentioned that did not match the actual query or database content.
  • Unjustified generalizations: In some cases, the models generated broad statements such as “There are several rooms with parquet flooring,” even though such information was not supported by the retrieved context.
Models were particularly prone to these hallucinations when they lacked explicit access to the database schema or when the prompt structure was too vague.

6.4. Contextual Relevance

The analysis revealed notable differences among the models in their ability to semantically interpret user intent. In particular, LLaMA-4-Scout-17b-16e-instruct achieved the highest contextual relevance (0.75), followed by LLaMA-3.3-70b and meta-LLaMA-4-Maverick-17b-128e-instruct, each scoring 0.58 (see Figure 4). These models exhibited a strong alignment between natural-language queries and the semantic structure of the returned answers.
In contrast, the Gemini 2.5 flash preview models—especially versions 04-17 (0.33) and 05-20 (0.25)—performed significantly worse. This suggests that these models struggle to retain contextual relevance, particularly in the absence of explicit schema integration or domain-specific grounding.
Simple queries such as “Which floor coverings are available?” or “On which floors are sanitary rooms located?” were generally interpreted correctly by the higher-performing models. However, more complex requests such as “Which training room is the largest?” often revealed a mismatch. While the intent was correctly understood, the resulting answers were frequently inaccurate. This indicates a misalignment between the comprehension and execution logic during the response generation.
A clear pattern also emerged regarding prompt structure. Structured prompts—such as those using JSON inputs or fixed SQL output templates—more consistently resulted in contextually relevant responses compared to free-form natural language queries. This trend is further illustrated in Figure 3, which compares the model performance with and without schema access. When the schema context was provided, contextual relevance increased from approximately 35% to 50%, while the hallucination rates decreased from 65% to 50%. These results highlight the importance of structured input and prompt engineering, especially for mid- to high-complexity queries.

7. Discussion and Conclusions

Our study shows that adding structured information, such as a database schema, to the prompt helps LLMs provide more accurate and reliable answers in facility management tasks. Without this structured input, most models failed to generate usable SQL queries. When the schema was included, execution validity improved from 0% to 35% on average, with a top value of 67% for the best-performing model.
Models such as Qwen and LLaMA 3.3 performed especially well in terms of correctness, generating accurate queries even for more complex tasks, while LLaMA 4 achieved the best results for contextual relevance and hallucination rates. Structured prompts and clear input formats were also critical in reducing hallucinations and improving contextual understanding.
These findings are important for companies that want to use LLMs in technical areas like facility management. They show that proper prompt engineering, structured inputs, and model selection are key for building reliable systems. Our results highlight the conditions under which LLMs can reliably generate structured outputs and the factors critical in minimizing hallucinations and ensuring semantic accuracy. The following practical implications for the deployment of LLMs in real-world facility management scenarios can be expressed on the basis of our results:
  • Importance of schema-based prompting: Providing explicit access to database schema information significantly increases execution validity and correctness. LLMs frequently generate invalid or semantically incorrect queries without a structured context. Therefore, table definitions or attribute structures should be embedded directly in prompt templates, especially for tasks involving automated data retrieval.
  • Structured inputs improve robustness: Structured inputs—such as JSON-based queries or parameterized prompts—substantially enhance consistency and reduce hallucinations. Input formalization should be prioritized in practice, particularly for tasks such as space allocation, asset reporting, or cleaning logistics.
  • Prompt engineering as a critical skill: Although not the focus of our study, the experiments showed that the design and formulation of prompts directly affect the quality of LLM outputs. Task-specific, structured prompts aligned with the database schema and user intent are crucial. Organizations should develop internal guidelines for prompt design or provide tools to support structured prompt generation.
  • Model selection based on contextual needs: Performance differences between the models were significant. Advanced models such as Qwen, Gemini 2.5 Flash, and LLaMA 3.3 70B demonstrated stronger performance in SQL generation and contextual understanding. Model selection should be based on empirical benchmarking that reflects specific use case requirements and available structured input.
  • Integration into existing systems: The evaluated text-to-SQL-to-text pipeline is suitable for integration into facility management platforms via APIs or modular agent architectures. LLM-based assistants can be embedded to translate user queries into SQL and return natural language summaries—eliminating the need for manual database interaction.
  • Data privacy and system architecture: Facility management data often include sensitive information. Locally hosted LLMs or secure hybrid deployments are preferred over public APIs. Measures such as data minimization, encryption, and internal compliance with privacy regulations should be standard.
  • Use of hybrid architectures: Combining LLMs with rule-based validation and RAG can increase reliability and safety. Schema metadata can be retrieved dynamically and added to prompts, while rule-based systems validate the generated queries before execution (a minimal sketch of such a guarded pipeline follows this list).
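As a minimal illustration of the guarded text-to-SQL-to-text flow described above, the following sketch combines schema-based prompting, a rule-based validation step, and result verbalization. It is not the authors' implementation: ask_llm is a placeholder for any chat-model call (for example via LangChain), and sqlite3 stands in for the PostgreSQL LIBAL database.

```python
# Minimal sketch of a guarded text-to-SQL-to-text flow. ask_llm is a placeholder for
# any chat-model call (e.g., via LangChain), and sqlite3 stands in for the PostgreSQL
# LIBAL database; all identifiers are illustrative assumptions.
import re
import sqlite3
from typing import Callable

def validate_sql(sql: str) -> bool:
    """Rule-based guard: accept a single read-only SELECT statement only."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement input
        return False
    if not re.match(r"(?is)^\s*select\b", stripped):
        return False
    forbidden = ("insert", "update", "delete", "drop", "alter", "create")
    return not any(re.search(rf"(?i)\b{kw}\b", stripped) for kw in forbidden)

def answer_question(question: str, schema: str, conn: sqlite3.Connection,
                    ask_llm: Callable[[str], str]) -> str:
    # Step 1: natural language -> SQL, with the schema embedded in the prompt.
    sql = ask_llm(f"Schema:\n{schema}\n\nWrite one SELECT statement for: {question}")
    # Step 2: rule-based validation before the query touches the database.
    if not validate_sql(sql):
        return "The generated query was rejected by the validation rules."
    rows = conn.execute(sql).fetchall()
    # Step 3: SQL result -> natural-language answer.
    return ask_llm(f"Question: {question}\nQuery result: {rows}\nAnswer briefly.")
```

In practice, the validation step could be extended with schema-aware checks, for example verifying that referenced tables and columns actually exist, before the query reaches the database.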
However, our study faces various limitations, which indicate avenues for future research and development:
  • Schema completeness: Only selected tables were included in the schema because of token limitations. Future studies could test larger-context models, such as Gemini variants with a one-million-token context window, on the full database schema to assess performance improvements. In addition, further research is needed on systematically reducing BIM models (including their database schemas) to make them more suitable for facility management applications.
  • Prompt diversity and reusability: More prompt variations could be introduced, including multilingual input and more abstract queries, to test the models’ adaptability and robustness.
  • Test sample size: For future work, we suggest more comprehensive studies with a larger set of test questions, which would improve the reliability of the results and allow for significance testing.
  • Bias in evaluation: Some evaluation steps used automated tools or LLM-as-a-judge methods. Although helpful, these can introduce bias. In a manual screening of our results we did not find such issues, probably because a ground truth was available and the evaluation criteria were clearly defined. Nevertheless, a more diverse human evaluation could help ensure fairness and improve the quality of the results.
  • Consistency tests: Multiple runs with the same prompt could be used to test output stability (assuming a temperature parameter > 0); a short sketch of such a test follows this list. This would help us better understand model reliability in production settings.
  • Expand to RAG and function calling: While document-based RAG was excluded from this study, combining text-to-SQL function calling with RAG could leverage information that is not stored in the database but resides in various documents, as is common in construction and facility management. This would allow for more advanced applications in these industries.
  • Long-term integration testing: Future research could include real-time tests in operational systems to evaluate latency, user experience, and scalability under practical conditions.
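As an illustration of the consistency test suggested above, the following sketch (our own, with ask_llm as a placeholder for any non-deterministic chat-model call) repeats the same prompt several times and reports how often the most frequent answer is returned.

```python
# Minimal sketch of the proposed consistency test: repeat the same prompt several times
# (with temperature > 0 on the model side) and measure output stability. ask_llm is a
# placeholder for any non-deterministic chat-model call.
from collections import Counter
from typing import Callable

def consistency_rate(prompt: str, ask_llm: Callable[[str], str], runs: int = 10) -> float:
    """Return the share of runs that produced the most frequent (normalized) answer."""
    answers = [ask_llm(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs
```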

Author Contributions

Conceptualization, methodology, software, K.B., R.T. and E.K.; validation, T.H.; investigation, writing—original draft preparation, K.B., R.T. and E.K.; writing—review and editing, T.H.; visualization, K.B., R.T. and E.K.; supervision, T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The detailed results of the study are available from the authors in the form of an Excel sheet. The evaluation prototype can be accessed at the following GitHub repository: https://github.com/kiril-buga/LLM-research, accessed on 2 July 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lindkvist, C.; Whyte, J. Challenges and opportunities involving facilities management in data handover: London 2012 case study. In Proceedings of the AEI 2013: Building Solutions for Architectural Engineering, State College, PA, USA, 3–5 April 2013; American Society of Civil Engineers: Reston, VA, USA, 2013; pp. 670–679. [Google Scholar] [CrossRef]
  2. Krütli, D.; Hanne, T. Augmenting LLMs to Securely Retrieve Information for Construction and Facility Management. Information 2025, 16, 76. [Google Scholar] [CrossRef]
  3. William East, E.; Nisbet, N.; Liebich, T. Facility management handover model view. J. Comput. Civ. Eng. 2013, 27, 61–67. [Google Scholar] [CrossRef]
  4. Pinti, L.; Bonelli, S. A methodological framework to optimize data management costs and the Hand-Over phase in cultural heritage projects. Buildings 2022, 12, 1360. [Google Scholar] [CrossRef]
  5. Hosseini, M.R.; Roelvink, R.; Papadonikolaki, E.; Edwards, D.J.; Pärn, E. Integrating BIM into facility management: Typology matrix of information handover requirements. Int. J. Build. Pathol. Adapt. 2018, 36, 2–14. [Google Scholar] [CrossRef]
  6. Zhu, L.; Shan, M.; Xu, Z. Critical review of building handover-related research in construction and facility management journals. Eng. Constr. Archit. Manag. 2021, 28, 154–173. [Google Scholar] [CrossRef]
  7. Abdelkarim, S.B.; Ahmad, A.M.; Naji, K. A BIM-based framework for managing handover information loss. J. Manag. Eng. 2024, 40, 04024030. [Google Scholar] [CrossRef]
  8. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar] [CrossRef]
  9. Atkin, B.; Brooks, A. Total Facility Management, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2021. [Google Scholar]
  10. Dixit, M.K.; Venkatraj, V.; Ostadalimakhmalbaf, M.; Pariafsai, F.; Lavy, S. Integration of facility management and building information modeling (BIM): A review of key issues and challenges. Facilities 2019, 37, 455–483. [Google Scholar] [CrossRef]
  11. Matarneh, S.T.; Danso-Amoako, M.; Al-Bizri, S.; Gaterell, M.; Matarneh, R. Building information modeling for facilities management: A literature review and future research directions. J. Build. Eng. 2019, 24, 100755. [Google Scholar] [CrossRef]
  12. Altohami, A.B.A.; Haron, N.A.; Ales Alias, A.H.; Law, T.H. Investigating approaches of integrating BIM, IoT, and facility management for renovating existing buildings: A review. Sustainability 2021, 13, 3930. [Google Scholar] [CrossRef]
  13. Guo, H.; Zhou, Y.; Ye, X.; Luo, Z.; Xue, F. Automated mapping from an IFC data model to a relational database model. J. Tsinghua Univ. 2020, 61, 152–160. [Google Scholar]
  14. Barzegar, M.; Rajabifard, A.; Kalantari, M.; Atazadeh, B. An IFC-based database schema for mapping BIM data into a 3D spatially enabled land administration database. Int. J. Digit. Earth 2021, 14, 736–765. [Google Scholar] [CrossRef]
  15. Dai, C.; Cheng, K.; Liang, B.; Zhang, X.; Liu, Q.; Kuang, Z. Digital twin modeling method based on IFC standards for building construction processes. Front. Energy Res. 2024, 12, 1334192. [Google Scholar] [CrossRef]
  16. Toldo, B.M.; Modolo, A.; Zanchetta, C.; Bock, B.S. SQL Relational Database Usage for Integration Between BIM Models and Quantity Take-Off Platforms. In Proceedings of the New Frontiers of Construction Management (CMW 2024), Ravenna, Italy, 7–8 November 2024; Construction Management Workshop. Springer Nature: Cham, Switzerland, 2024; pp. 165–176. [Google Scholar]
  17. Pidgeon, A.; Dawood, N. BIM adoption issues in infrastructure construction projects: Analysis and solutions. J. Inf. Technol. Constr. 2021, 26, 263–285. [Google Scholar] [CrossRef]
  18. Pinti, L.; Codinhoto, R.; Bonelli, S. A review of building information modelling (BIM) for facility management (FM): Implementation in public organisations. Appl. Sci. 2022, 12, 1540. [Google Scholar] [CrossRef]
  19. Masataka, H.; Yutaka, W. Making software based on human-driven design case study: SQL for non-experts. In Proceedings of the 2022 IEEE 15th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), Penang, Malaysia, 19–22 December 2022; IEEE: New York, NY, USA, 2022; pp. 264–270. [Google Scholar] [CrossRef]
  20. Taipalus, T. The effects of database complexity on SQL query formulation. J. Syst. Softw. 2020, 165, 110576. [Google Scholar] [CrossRef]
  21. Baig, M.S.; Imran, A.; Yasin, A.U.; Butt, A.H.; Khan, M.I. Natural language to SQL queries: A review. Int. J. Innov. Sci. Technol. 2022, 4, 147–162. [Google Scholar] [CrossRef]
  22. Shorten, C.; Pierse, C.; Smith, T.B.; D’Oosterlinck, K.; Celik, T.; Cardenas, E.; Monigatti, L.; Hasan, M.S.; Schmuhl, E.; Williams, D.; et al. Querying Databases with Function Calling. arXiv 2025, arXiv:2502.00032. [Google Scholar] [CrossRef]
  23. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  24. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023, arXiv:2305.14314. [Google Scholar] [CrossRef]
  25. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. Available online: https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf (accessed on 2 July 2025).
  26. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Driessche, G.B.V.D.; Lespiau, J.-B.; Damoc, B.; Clark, A.; et al. Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 June 2022; Microtome Publishing: Brookline, MA, USA, 2022; pp. 2206–2240. Available online: https://proceedings.mlr.press/v162/borgeaud22a.html (accessed on 2 July 2025).
  27. Liu, X.; Sun, J.; Lei, A.; Zhu, J. Research and Applications of Large Language Models for Converting Unstructured Data into Structured Data. In Proceedings of the 2024 3rd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), Hangzhou, China, 11–13 October 2024; IEEE: New York, NY, USA, 2024; pp. 305–308. [Google Scholar] [CrossRef]
  28. Gim, I.; Lee, S.; Zhong, L. Asynchronous LLM Function Calling. arXiv 2024, arXiv:2412.07017. [Google Scholar] [CrossRef]
  29. Irugalbandara, C. Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation. arXiv 2024, arXiv:2410.18146. [Google Scholar] [CrossRef]
  30. Finegan-Dollak, C.; Kummerfeld, J.K.; Zhang, L.; Ramanathan, K.; Sadasivam, S.; Zhang, R.; Radev, D. Improving text-to-sql evaluation methodology. arXiv 2018, arXiv:1806.09029. [Google Scholar] [CrossRef]
  31. Katsogiannis-Meimarakis, G.; Koutrika, G. A survey on deep learning approaches for text-to-SQL. VLDB J. 2023, 32, 905–936. [Google Scholar] [CrossRef]
  32. Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv 2018, arXiv:1809.08887. [Google Scholar] [CrossRef]
  33. Zhong, V.; Xiong, C.; Socher, R. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv 2017, arXiv:1709.00103. [Google Scholar] [CrossRef]
  34. Mitsopoulou, A.; Koutrika, G. Analysis of text-to-SQL benchmarks: Limitations, challenges and opportunities. In Proceedings of the 28th International Conference on Extending Database Technology, EDBT, Barcelona, Spain, 25–28 March 2025; OpenProceedings.org: Konstanz, Germany, 2025; pp. 199–212. Available online: https://datagems.eu/wp-content/uploads/2025/05/paper-41.pdf (accessed on 2 July 2025).
  35. Liu, X.; Shen, S.; Li, B.; Ma, P.; Jiang, R.; Zhang, Y.; Fan, J.; Li, G.; Tang, N.; Luo, Y. A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? IEEE Trans. Knowl. Data Eng. 2025, 37, 5735–5754. [Google Scholar] [CrossRef]
  36. Hong, Z.; Yuan, Z.; Zhang, Q.; Chen, H.; Dong, J.; Huang, F.; Huang, X. Next-generation database interfaces: A survey of LLM-based text-to-SQL. IEEE Trans. Knowl. Data Eng. 2025, 1–20. [Google Scholar] [CrossRef]
  37. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv 2024, arXiv:2402.03216. [Google Scholar] [CrossRef]
  38. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  39. Günther, M.; Ong, J.; Mohr, I.; Abdessalem, A.; Abel, T.; Akram, M.K.; Guzman, S.; Mastrapas, G.; Sturua, S.; Wang, B.; et al. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv 2024, arXiv:2310.19923. [Google Scholar] [CrossRef]
  40. Hevner, A.R.; March, S.T.; Park, J.; Ram, S. Design science in information systems research. MIS Q. 2004, 28, 75–105. [Google Scholar] [CrossRef]
  41. Peffers, K.; Tuunanen, T.; Rothenberger, M.A.; Chatterjee, S. A design science research methodology for information systems research. J. Manag. Inf. Syst. 2007, 24, 45–77. [Google Scholar] [CrossRef]
  42. CDE für Bauherren und Betreiber (CDE for Building Owners and Operators). LIBAL. Available online: https://www.libal-tech.de/libal-common-data-environment-cde/ (accessed on 3 June 2025).
Figure 1. RAG system [2].
Figure 2. Architecture of the artifact.
Figure 3. Impact of the schema context on model performance in terms of execution validity, correctness, contextual relevance, and hallucination (all in %) for the considered 60 samples (6 test questions × 10 models).
Figure 4. Comparative heatmap showing model performance in terms of execution validity, correctness, contextual relevance, and hallucination (all in %) for the considered 12 samples (6 test questions × 2 setups). Red corresponds to weak values, yellow to values around 0.5, and green to strong values.
Figure 5. SQL execution overview: number of correct SQL executions.
Table 1. Tool setup.
Component | Description
Evaluated LLMs | Gemini (Google): gemini-1.5-flash, gemini-2.0-flash, gemini-2.5-flash-preview-04-17, gemini-2.5-flash-preview-05-20. Llama (Meta): llama-3.1-8b, llama-3.3-70b, llama-4-scout-17b-16e-instruct, llama-4-maverick-17b-128e-instruct. Qwen (Alibaba Cloud): qwen-3-32b. Mistral (Mistral AI): mistral-saba-24b
LLM Inference Providers | Cerebras (Llama 3.1, 3.3, Scout, and Qwen models); Gemini API (all Gemini models); Groq (Mistral and Llama 4 Maverick models)
AI Frameworks | LangChain (structured/unstructured input/output for writing the query, executing it, and generating the final answer for all listed LLMs)
Data Sources | PostgreSQL Libal database (structured data), PDFs (manuals, plans), Libal API
Development Tools | Python 3.12, Jupyter Notebooks 7.4, PyCharm 2024.3, VS Code 1.97, GitHub
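For illustration, the following minimal sketch shows how two of the evaluated models could be instantiated through LangChain chat-model integrations. It assumes the langchain-google-genai and langchain-groq packages, API keys supplied via environment variables, and provider-specific model identifiers, which may differ from the short names listed above.

```python
# Minimal sketch (not the authors' code) of instantiating two of the evaluated models
# through LangChain chat-model integrations. Assumes the langchain-google-genai and
# langchain-groq packages are installed and that GOOGLE_API_KEY / GROQ_API_KEY are set;
# the provider-specific model identifiers below are assumptions and may differ.
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_groq import ChatGroq

gemini = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
llama4 = ChatGroq(model="meta-llama/llama-4-maverick-17b-128e-instruct", temperature=0)

# Both models expose the same .invoke() interface, so the text-to-SQL-to-text pipeline
# can swap providers without changing the surrounding code.
print(gemini.invoke("On which floors are restrooms available?").content)
```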
Table 2. Questions for the evaluation.
Questions | Expected Output | Intention | Expected Information Source
On which floors are restrooms available? | 1F, 2F, 3F, 4F, 5F, 6F, 7F, 8F | Planning cleaning effort | LIBAL database
What types of floor covers are there? | Stone and cement floors, stone and tile floors, plastic coverings and linoleum, parquet and cork parquet, unknown, mastic asphalt and rubber floors, textile flooring | Determination of the floor covers | LIBAL database
Give me all the cold storage rooms with their areas. | Cold storage room 00.631 has a total area of 6.299 m², cold storage room 00.632 has a total area of 6.299 m², cold storage room 00.636 has a total area of 5.098 m², cold storage room 00.637 has a total area of 8.075 m², cold storage room 00.638 has a total area of 22 m², cold storage room 00.645 has a total area of 7.234 m² | Survey of the storage area | LIBAL database
What are the areas of the individual restrooms? | B1 = 0 m², 1F = 118.65 m², 2F = 39.14 m², 3F = 39.14 m², 4F = 39.14 m², 5F = 39.14 m², 6F = 39.14 m², 7F = 39.14 m², 8F = 39.14 m², 9F = 39.14 m², 10F = 0 m² | Planning cleaning effort | LIBAL database
Which training room has the largest area? | The training room 00.415 is the largest with an area of 41.473 m² | Conduction of trainings | LIBAL database
Which rooms can be used for training? | There are the training rooms 00.415, 01.127 and 01.535 | Conduction of trainings | LIBAL database
Table 3. Evaluation rules for each criterion.
Criterion | Rule for the Criterion Evaluation
Execution Validity | sql_execution_result does not contain “error,” “failed,” “fehler” or “empty value”.
Correctness | At least one value mentioned from expected_answer appears in the answer OR everything in expected_answer is found in the answer/query.
Contextual Relevance | Answer shares > 1 meaningful value with expected_answer.
Hallucination | No meaningful overlap with expected_answer values.
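The rules in Table 3 lend themselves to simple automation. The following sketch (our illustration, not the evaluation code used in the study) treats expected_answer as a list of expected value strings and applies plain substring matching; the study additionally relied on manual screening and LLM-as-a-judge checks.

```python
# Minimal sketch of automating the rules in Table 3. expected_answer is treated as a
# list of expected value strings and matching is plain substring matching; this is an
# illustration rather than the evaluation code itself.
def execution_validity(sql_execution_result: str) -> bool:
    errors = ("error", "failed", "fehler", "empty value")
    return not any(token in sql_execution_result.lower() for token in errors)

def matched_values(answer: str, expected_answer: list[str]) -> int:
    return sum(1 for value in expected_answer if value.lower() in answer.lower())

def correctness(answer: str, query: str, expected_answer: list[str]) -> bool:
    all_found = all(v.lower() in answer.lower() or v.lower() in query.lower()
                    for v in expected_answer)
    return matched_values(answer, expected_answer) >= 1 or all_found

def contextual_relevance(answer: str, expected_answer: list[str]) -> bool:
    return matched_values(answer, expected_answer) > 1

def hallucination(answer: str, expected_answer: list[str]) -> bool:
    return matched_values(answer, expected_answer) == 0
```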