Article

Quantitative Evaluation and Domain Adaptation of Vision–Language Models for Mixed-Reality Interpretation of Indoor Environmental Computational Fluid Dynamics Visualizations

Division of Sustainable Energy and Environmental Engineering, Graduate School of Engineering, The University of Osaka, 2-1 Yamadaoka, Suita 565-0871, Osaka, Japan
*
Author to whom correspondence should be addressed.
Technologies 2026, 14(3), 157; https://doi.org/10.3390/technologies14030157
Submission received: 21 January 2026 / Revised: 22 February 2026 / Accepted: 2 March 2026 / Published: 4 March 2026
(This article belongs to the Section Construction Technologies)

Abstract

In built environmental design, incorporating building user participation and verifying indoor thermal performance at early design stages have become increasingly important. Although Computational Fluid Dynamics (CFD) analysis is widely used to predict indoor thermal environments, its results are difficult for non-expert stakeholders to interpret, even when visualized using Mixed Reality (MR). Interpreting CFD visualizations in MR requires quantitative reasoning that explicitly cross-references visual features with legend information, rather than relying on prior color–value associations learned from natural images. This study investigates the capability of Vision–Language Models (VLMs) to interpret MR visualizations of CFD results and respond to user queries. We focus on indoor temperature distributions and airflow velocities visualized in MR. A novel dataset was constructed, consisting of MR images with CFD results superimposed onto real indoor spaces, paired with domain-specific question–answer annotations requiring legend-based reasoning. Using this dataset, a general-purpose VLM (Qwen2.5-VL) was fine-tuned. Experimental results show that the baseline model achieved less than 30% accuracy, whereas fine-tuning improved accuracy to over 60% across all categories while largely preserving general reasoning performance. These results demonstrate that domain adaptation enables VLMs to quantitatively interpret physical information embedded in MR visualizations, supporting non-experts’ understanding of built environmental design.

1. Introduction

In recent years, consensus-building among diverse stakeholders, including building occupants, has become increasingly important in architectural and built environmental design to enhance occupant satisfaction [1]. People spend approximately 90% of their time indoors [2], and this proportion has further increased since the COVID-19 pandemic [3]. As a result, indoor environmental conditions play a critical role in occupants’ productivity, health, and overall well-being [4]. Accordingly, involving occupants in the design process has become essential for achieving higher levels of satisfaction with indoor environments.
Despite its importance, effective consensus-building remains challenging. In this study, the term “non-experts” primarily refers to clients, building owners, and building occupants who are often the final decision-makers but lack technical training in building physics. Non-experts often struggle to comprehend information that requires specialized expertise, including the interpretation of two-dimensional (2D) drawings and technical design documents [5]. This difficulty creates a substantial gap in understanding between experts (e.g., architects and engineers) and non-experts, which frequently results in mismatched expectations and reduced post-occupancy satisfaction [6].
To mitigate this gap, various methods for visually presenting design information have been explored [7]. Among them, Mixed Reality (MR) has attracted increasing attention [8,9]. By superimposing digital design models onto real-world spaces, MR enables users to intuitively understand spatial relationships, scale, and geometry directly on site. Following the definition by Milgram and Kishino, MR refers to a continuum in which real and virtual environments are combined [10]. In this study, Augmented Reality (AR) is treated as a subset of MR, as the precise mixing ratio between real and virtual content is not explicitly specified.
However, consensus-building in environmental design requires consideration not only of visible spatial elements—such as geometry and furniture layout—but also of invisible physical factors, including temperature, airflow, and acoustics. Among these, thermal comfort is a particularly influential determinant of occupant productivity and comfort [11]. To support design evaluation, prior studies have proposed visualizing otherwise invisible physical phenomena by conducting Computational Fluid Dynamics (CFD) simulations and overlaying the results onto indoor spaces using MR, typically in the form of contour plots or vector fields [12]. Nevertheless, users are susceptible to perceptual biases when interpreting colormaps [13]. As a result, simply visualizing CFD results in this manner does not allow non-experts to intuitively understand indoor environmental performance or comfort levels. Consequently, disparities in interpretation ability between experts and non-experts persist, hindering effective information sharing and consensus-building in environmental design. Moreover, the miscommunication and the translation effort demanded of experts, both stemming from this information asymmetry, drive up overall project costs [6].
One potential approach to addressing this issue is the application of Vision–Language Models (VLMs), which can jointly process visual and linguistic information. While VLMs have demonstrated strong performance in various tasks, most existing studies focus on natural images and everyday visual content [14,15]. In contrast, limited research has examined whether VLMs can interpret scientific visualization data that represent physical quantities, such as CFD simulation results commonly used in architectural and environmental design [16]. Moreover, when VLMs are applied to MR images, real-world background imagery may introduce visual noise that interferes with the recognition of virtual objects. To date, few studies have systematically evaluated whether VLMs can accurately interpret virtual information embedded within MR scenes [17].
Therefore, assessing whether VLMs can correctly recognize and reason about color-encoded physical information—such as contour plots superimposed onto real indoor environments—is essential for evaluating the feasibility of integrating MR and VLMs in future environmental design workflows.
In this study, we construct a novel dataset comprising MR images that visualize indoor CFD analysis results, specifically contour plots and isolines, along with corresponding question–answer pairs. Using this dataset, we evaluate the reasoning and interpretation capabilities of VLMs with respect to scientific visualization data and examine the effectiveness of domain adaptation through fine-tuning. By enabling VLMs to verbalize and explain visualized physical information, this research aims to support non-expert stakeholders in understanding future indoor environments, thereby facilitating smoother consensus-building and improving satisfaction with indoor environmental design.

2. Literature Review

2.1. Development and Current Status of Vision–Language Models

In recent years, Large Language Models (LLMs) have advanced rapidly in the field of artificial intelligence, with state-of-the-art models exhibiting language understanding and generation capabilities comparable to those of humans [18]. However, because LLMs process only textual information, their applicability to real-world tasks is inherently limited. Many practical scenarios rely heavily on visual information, such as spatial relationships and geometric configurations, which cannot be fully conveyed through text alone [19]. This limitation is particularly pronounced in architectural and environmental design, where complex indoor layouts and spatial arrangements are difficult to express precisely using language only, and where LLMs struggle to infer accurate spatial structures from textual descriptions [20]. Consequently, extending AI applications to environmental design requires models with advanced visual and spatial understanding capabilities.
To overcome these limitations, VLMs, which jointly process visual and linguistic information, have been actively developed [21,22]. VLMs typically integrate the language processing capabilities of LLMs with vision encoders—such as Vision Transformers [23]—enabling them to perform image-based tasks including image captioning, visual question answering, and multimodal reasoning.
As VLMs have matured, their application scope has expanded beyond natural images, such as everyday scenes and landscapes, to the interpretation of graphical and diagrammatic data, including charts, plots, and technical illustrations [24,25,26]. Effective interpretation of such graphical data requires the ability to process high-resolution images and achieve high optical character recognition (OCR) accuracy. Recent models, such as Qwen2.5-VL [27], address these requirements by dynamically splitting high-resolution input images into manageable segments, allowing them to process images at their original resolution. As a result, these models demonstrate superior OCR performance and enhanced capabilities for interpreting complex graphical information.
Nevertheless, existing benchmarks for graphical data understanding [24,25,26] are largely limited to graphs with clearly defined axes or discrete data points. As a result, the ability of VLMs to interpret physical quantities encoded through continuous color variations—such as those found in scientific visualizations—has not been systematically evaluated.
In addition, applying general-purpose VLMs to specialized technical domains remains challenging. Because these models are predominantly trained on natural images and generic diagrams, they often lack sufficient domain-specific knowledge required for accurate interpretation in specialized fields. One commonly adopted approach to addressing this limitation is fine-tuning, which enables domain adaptation by further training a pre-trained general-purpose model using domain-specific datasets [28].

2.2. Adaptation of VLMs to Visualized Physical Information and MR Images

To utilize VLMs to explain CFD analysis results, models must be capable of interpreting images that visualize inherently invisible physical quantities, such as temperature distributions and airflow velocities. One of the few studies evaluating VLMs in this context is that of Moshtaghi et al. [16], who constructed a paired dataset of RGB and thermal images to assess the ability of VLMs to interpret thermal information. Their results showed that general-purpose models exhibited insufficient performance, achieving a maximum accuracy of only 17.24%, and the evaluation focused primarily on qualitative judgments rather than quantitative interpretation.
Beyond thermal imagery, several studies have examined VLM performance on other types of physical visualizations. Examples include ClimaQA [29], which targets weather maps, and research evaluating the interpretation of stress distribution maps [30]. However, these studies similarly do not assess the quantitative interpretation capabilities of VLMs with respect to physical quantities encoded in visual form.
In the context of indoor environmental evaluation, accurate interpretation of physical quantities—rather than merely identifying qualitative trends—is essential for explaining factors such as thermal comfort. As noted above, however, systematic evaluations of the quantitative interpretation capabilities of VLMs for such physical information remain largely unexplored.
Moreover, research integrating MR and VLMs is still limited. Most existing studies [31,32] propose systems in which real-world images are provided as input to VLMs, and the inference results are overlaid onto the real-world view. Consequently, few studies have explicitly examined whether VLMs can correctly interpret MR images in which virtual objects are superimposed onto real-world scenes.
Duan et al. [17] investigated the understanding of MR scenes by VLMs and reported that VLMs are capable of recognizing objects displayed virtually via MR. However, the virtual objects considered in their study were tangible entities, such as animals and furniture. To date, no studies have evaluated VLM performance on MR images that visualize inherently invisible physical phenomena—such as airflow or heat—through virtual overlays.

2.3. Applications of AI and VLMs in Built Environmental Design

In built environmental design, conducting prior CFD analyses of indoor environments and sharing the results among stakeholders is widely recognized as essential. However, non-expert stakeholders often struggle to understand these analysis results due to their technical complexity. Existing AI-based studies related to CFD analysis [33] primarily focus on accelerating simulations or enabling real-time design iterations using techniques such as convolutional neural networks (CNNs). In contrast, relatively little attention has been paid to improving the interpretability of CFD results for non-expert users.
Zhu et al. [34] proposed a system that visualizes CFD analysis results at multiple spatial scales within an MR environment. While this approach enhances spatial understanding, simply displaying physical quantities such as temperature and airflow remains insufficient for enabling non-experts to assess indoor comfort in a concrete and intuitive manner.
Separately, several studies have explored the application of VLMs to environmental evaluation tasks. For example, Zhang et al. [35] proposed a system that evaluates thermal comfort in urban spaces based on landscape images using VLMs and reported a positive correlation between model predictions and expert assessments. Similarly, the effectiveness of VLMs for perceptual evaluation of urban safety has been demonstrated [36]. However, because these approaches rely on landscape images as input, they cannot account for environmental factors such as temperature and airflow, which are not directly observable in such imagery.

2.4. Contribution

In response to the challenges identified above, this study aims to evaluate the quantitative interpretation accuracy of general-purpose VLMs with respect to physical quantities, as well as the extent to which this accuracy can be improved through domain adaptation. The novel contributions of this study are summarized as follows:
We introduced a novel approach by combining VLMs with MR and CFD. While existing studies primarily focus on passively displaying visualization results to users, this integration makes it possible to support non-experts, such as building owners and occupants, in understanding the design content.
We constructed a unique dataset specifically designed for the quantitative interpretation of physical visualizations. Unlike conventional datasets that rely on qualitative judgments, our dataset requires the extraction of quantitative values from visualized CFD results (e.g., contour plots) superimposed onto real-world backgrounds.
We investigated the quantitative reasoning capabilities of VLMs regarding physical quantities. By rigorously evaluating the model’s ability to cross-reference images with legend information, we verified the feasibility and practical boundaries of applying VLMs to scientific environmental visualizations.
The outcomes of this research provide a foundational technology for supporting non-experts in understanding CFD analysis results used during the design process. By facilitating comprehension of future indoor environments, this approach aims to promote smoother consensus-building and improve satisfaction with indoor environmental design. In addition, this study extends the application scope of VLMs to the interpretation of invisible physical information, laying the groundwork for future VLM-based environmental evaluation systems.

3. Dataset Construction

3.1. Dataset Overview

In this study, to evaluate and enhance the ability of VLMs to interpret images visualizing physical quantities, we constructed a dataset consisting of MR images and associated question–answer pairs. In the MR images, indoor CFD analysis results are overlaid onto real-world indoor scenes.
Unlike existing heatmap datasets [16], which primarily target qualitative judgments such as identifying which object or region is hotter, the proposed dataset includes questions that require the extraction of quantitative values based on legend information. Using this dataset, we aim to verify whether VLMs can quantitatively infer physical quantities from visualized scientific data.
Each dataset entry comprises an MR image, legend information provided in either image or text form, a question, an answer, and an explanation intended to support the reasoning process. The MR images include contour plots representing temperature distributions and isoline maps representing airflow velocity. Legend information consists of color bar images corresponding to the temperature contours and textual descriptions defining the numerical meaning of the isoline colors. An example data sample is shown in Figure 1.
The dataset was constructed through the following steps:
  • Creation of three-dimensional models of the target space and execution of CFD analyses;
  • Generation of MR images with superimposed CFD analysis results;
  • Creation of question–answer pairs and explanatory texts based on the MR images.

3.2. CFD Analysis

This subsection describes the CFD analysis methodology used to generate indoor environmental data for MR visualization. First, the analysis domain was defined, and a three-dimensional model of the target indoor space was constructed. Mesh generation was then performed, followed by transient CFD simulations using an appropriate solver.
Room 410 of the M3 Building at The University of Osaka (W 6.940 × D 6.555 × H 3.200 m) was selected as the target space. A three-dimensional model of the room was created using FreeCAD. To introduce variability into the dataset, two layout configurations were prepared: an empty room without furniture (Pattern A) and a furnished room equipped with desks (Pattern B).
Meshes were generated using cfmesh, with a base cell size of 100 mm. Local mesh refinement was applied in regions of geometric or physical importance: a cell size of 50 mm around the desks due to their geometric complexity, and 25 mm around the air conditioner inlets and outlets. Figure 2 shows the interior view of Room 410 and the generated meshes. Although existing objects such as shelves appear in the interior photograph, they were excluded from the CFD model; conversely, the desks in Pattern B were introduced as virtual objects.
CFD simulations were conducted using the open-source software OpenFOAM (v2306). The buoyantPimpleFoam solver, which supports unsteady flow analysis with buoyancy effects, was employed, and the standard k–ε turbulence model was adopted. This configuration enabled the output of temperature contour plots and airflow isoline maps at arbitrary time steps.
To diversify the dataset, multiple analysis cases were created by varying the air conditioner operation mode (cooling or heating) and outlet angle (0° or 60°). The outlet angle was defined relative to the ceiling surface, with positive values indicating a downward direction. CFD simulations were conducted for all combinations of these conditions. Detailed analysis parameters are summarized in Table 1 and Table 2.
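The enumeration of analysis cases described above can be sketched as a Cartesian product of the varied conditions. This is an illustrative reconstruction, not the paper's actual scripts; the parameter names are assumptions.

```python
from itertools import product

# Hypothetical sketch of how the CFD analysis cases could be enumerated;
# parameter names are illustrative, not taken from the study's tooling.
layouts = ["A", "B"]            # empty room / furnished room
modes = ["cooling", "heating"]  # air conditioner operation mode
angles = [0, 60]                # outlet angle relative to the ceiling (deg)

cases = [
    {"layout": lay, "mode": m, "outlet_angle_deg": a}
    for lay, m, a in product(layouts, modes, angles)
]
print(len(cases))  # 8 combinations
```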

3.3. MR Image Generation Method

This subsection describes the procedure used to generate MR images for the dataset. First, CFD analysis results were visualized, after which the visualized outputs were imported into a game engine and overlaid onto real-world images to create MR scenes.
CFD results obtained as described in Section 3.2 were visualized using ParaView (v6.0.1). Data were extracted at selected time steps and included temperature contour plots and airflow velocity isolines on three vertical cross-sections—near the air conditioner (1), at the room center (2), and near a wall (3)—as well as one horizontal cross-section at a height of 1.1 m above the floor (4), corresponding to the occupied zone. The locations of these cross-sections are illustrated in Figure 3.
To increase dataset diversity, both the temperature range and the colormap type were varied across analysis cases during visualization. The visualized geometries were exported in .ply format along with the corresponding colormap images. The colormaps used were Fast (a rainbow-based scheme), Fast (Reversed) (its inverted variant), Viridis (a perceptually uniform purple–yellow scheme), and Magma (a perceptually uniform black–magenta–yellow scheme). These colormaps are shown in Figure 4.
Airflow velocity was visualized using isolines rather than vector arrows. Although vector arrows are commonly used in fluid analysis, their apparent length varies with viewpoint in MR environments, making quantitative interpretation difficult. Moreover, coloring arrows by velocity would introduce color overlap with temperature contours when both are displayed simultaneously. Therefore, isolines were employed to represent airflow velocity using colored curves, facilitating clearer identification of velocity ranges.
MR image generation was performed using the game engine Unity 2022.3.3f1. Visualization data exported from ParaView in .ply format were converted to .fbx format to enable import into Unity. The imported .fbx models were rendered using an Unlit shader in Unity. This specific material configuration prevented the virtual objects from being altered by virtual environmental lighting, thereby preserving the exact RGB values of the original colormaps. Background images, contour plots, and isoline maps were then integrated. A total of 18 photographs of Room 410, captured from various viewpoints using a smartphone, were prepared as background images (examples are shown in Figure 5). Spatial alignment was achieved using the Visual Positioning System provided by Immersal (Helsinki, Finland) [37], after which CFD results were superimposed to generate MR images.
To diversify MR presentation conditions, the opacity (alpha channel) of the superimposed analysis objects was randomly assigned a value within a range of 0.7 to 1.0. For each MR image, only a single cross-section was displayed, as simultaneous visualization of multiple sections leads to occlusion and hinders accurate color identification. Two display patterns were generated: images showing only temperature contours, and images showing both temperature contours and airflow isolines. Multiple MR images were created from a single background image by varying the superimposed cross-section and the time step of the CFD results.
Finally, the generated MR images were rendered and exported at a resolution of 640 × 480 pixels.
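The opacity randomization described above amounts to per-pixel alpha blending of the rendered CFD overlay with the background photograph. The following is a minimal full-frame sketch of that compositing step (the actual rendering is performed in Unity, and blending applies only where the virtual object is present):

```python
import numpy as np

rng = np.random.default_rng(0)

def composite(background, overlay, alpha=None):
    """Alpha-blend a rendered CFD overlay onto a background photograph.
    Arrays are float32 RGB in [0, 1]. Simplified full-frame blend; a
    sketch of the idea, not the Unity rendering pipeline itself."""
    if alpha is None:
        alpha = rng.uniform(0.7, 1.0)  # opacity range used for the dataset
    return alpha * overlay + (1.0 - alpha) * background

bg = np.zeros((480, 640, 3), dtype=np.float32)  # stand-in for a photo
ov = np.ones((480, 640, 3), dtype=np.float32)   # stand-in for a contour plot
mr = composite(bg, ov, alpha=0.7)
```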

3.4. Creation of Question–Answer Pairs

The question–answer pairs in the dataset were categorized into the following three types:
  • Temperature interpretation;
  • Airflow interpretation;
  • Integrated interpretation of temperature and airflow.
For the temperature interpretation category, questions were designed to evaluate the ability to extract quantitative information from contour plots in MR images. Typical examples include questions such as “What is the maximum temperature?” Answers were obtained directly from the numerical CFD results. In addition, explanation texts included bounding box coordinates identifying the relevant image regions, and the prompts instructed the model to output both the temperature value and the corresponding bounding box.
For airflow-related questions, the objective was to assess interpretation of airflow velocity visualized using discrete isolines. Because precise velocity values cannot be determined at positions between isolines, questions were divided into two types:
Type 1: Questions asking for the color of isolines present in the image;
Type 2: Questions asking for the airflow velocity range at specified coordinates.
Type 1 questions evaluate whether a VLM can correctly recognize isoline colors in MR images with real-world background noise. Type 2 questions assess whether the model can infer velocity ranges by identifying isoline colors around the specified coordinates and comparing them with legend information. Answers for airflow-related questions were manually created.
For integrated temperature–airflow interpretation, both qualitative and quantitative questions were designed to evaluate the ability to synthesize multiple types of physical information. Qualitative questions asked about differences in temperature distributions across regions with varying airflow velocities, and answers were manually generated based on visual inspection. Quantitative questions required identifying specific temperatures within regions defined by airflow velocity ranges. Answers for these questions were generated using an automated script that extracts RGB values from target regions and maps them to temperature values using the color bar. Although this approach does not rely on raw CFD numerical outputs, it ensures consistency between the visual input presented to the VLM and the expected answers.
In addition to MR images and question–answer pairs, the dataset includes legend information and explanation texts. Legend information comprises temperature color bar images and textual descriptions explaining the numerical meaning of airflow isolines. Specifically, the legend text defines the following relationships:
Magenta line: 1.0 m/s, representing discharged airflow from the air conditioner;
White line: 0.25 m/s, representing the draft risk threshold at which airflow may cause thermal discomfort;
Black line: 0.1 m/s, representing the boundary of air stagnation, with enclosed regions indicating stagnant air.
Explanation texts were added to support learning of the reasoning process. Templates corresponding to each question category were defined, and explanation texts were automatically generated using a rule-based approach. Because this script-based approach dynamically populated the templates with specific data attributes (e.g., the applied colormap and the answer value), the actual generation process completed almost instantaneously; however, designing and refining the templates required approximately six hours of manual effort. To prevent the VLM from relying on prior knowledge, we provided explicit prompts during both the fine-tuning and inference phases, instructing the model to cross-reference the MR images with the legend information. Specifically, these prompts explicitly directed the model to “strictly cross-reference the MR image with the legend.”
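The rule-based generation above can be sketched as string templates populated per question category. The template wording and field names below are illustrative assumptions, not the actual templates used in the study.

```python
# Minimal sketch of rule-based explanation generation; the template
# wording and field names are hypothetical.
TEMPLATES = {
    "temperature_max": (
        "Cross-referencing the contour with the {colormap} legend "
        "({t_min}-{t_max} degC), the hottest region maps to {answer} degC."
    ),
}

def make_explanation(category, **fields):
    """Fill the template for the given question category."""
    return TEMPLATES[category].format(**fields)

text = make_explanation(
    "temperature_max", colormap="Viridis", t_min=20, t_max=30, answer=29.5
)
```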

3.5. Dataset Details

The dataset constructed using the procedures described above consists of 313 MR images and 879 question–answer pairs. A detailed breakdown by category is provided in Table 3. Each data entry includes an MR image, a legend image, legend text, a question, an answer, and an explanation.
For the experiments described in subsequent sections, the primary dataset (containing Fast, Fast (Reversed), and Viridis colormaps) was divided into training, validation, and evaluation sets with an approximate ratio of 6:2:2. To prevent data leakage, data splitting was performed at the background image level rather than by individual question–answer instances. This ensures that no real-world background photograph is shared across the training, validation, and evaluation subsets. Stratified sampling was employed to ensure that the distribution of question categories in each subset closely matched the full dataset.
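Splitting at the background-image level rather than per question–answer instance can be sketched as grouping records by photograph before assigning subsets. This is a simplified illustration; the paper additionally applies stratified sampling over question categories, which is omitted here.

```python
import random

def split_by_background(records, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split QA records into train/val/eval at the background-image level
    so no photograph is shared across subsets (sketch only; category
    stratification from the paper is omitted)."""
    backgrounds = sorted({r["background"] for r in records})
    random.Random(seed).shuffle(backgrounds)
    n = len(backgrounds)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    assignment = {
        b: ("train" if i < n_train else "val" if i < n_train + n_val else "eval")
        for i, b in enumerate(backgrounds)
    }
    out = {"train": [], "val": [], "eval": []}
    for r in records:
        out[assignment[r["background"]]].append(r)
    return out

# Toy example: 50 QA records over 10 background photographs.
records = [{"background": f"bg{i % 10}", "qid": i} for i in range(50)]
splits = split_by_background(records)
```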
Furthermore, to rigorously evaluate the model’s ability to apply legend-based reasoning to unseen colormaps, an independent test dataset (containing the Magma colormap) was exclusively reserved for the evaluation phase and was not used during training or validation. To isolate the effect of the unseen colormap from potential background-induced biases, the real-world background photographs used for this dataset were strictly limited to the exact same set of images as those in the standard evaluation subset.
In the subsequent sections, the primary evaluation subset and the independent test dataset (containing the Magma colormap) are collectively referred to as the evaluation set. The full dataset is provided in the Supplementary Materials.

4. Evaluation

This section evaluates the effectiveness of the dataset constructed in Section 3 and examines the impact of domain adaptation through fine-tuning. Specifically, the performance of the VLM on the evaluation dataset is assessed both before and after fine-tuning. In addition, the effect of fine-tuning on the model’s general reasoning capabilities is investigated to identify potential trade-offs between domain-specific accuracy and generalization.

4.1. Evaluation Setup

In this study, Qwen2.5-VL-7B-Instruct [27] was employed as the baseline model. This model is known for its high recognition accuracy of fine-grained textual and geometric details in images. Because the MR images used in this study include small numerical labels within temperature color bars, high OCR accuracy is essential; therefore, this model was selected as an appropriate baseline.
Low-Rank Adaptation (LoRA) [38] was adopted for fine-tuning. LoRA enables efficient domain adaptation while significantly reducing computational cost by updating only a small subset of model parameters. The training subset of the dataset was used for model optimization, while the validation subset was used to monitor loss convergence and prevent overfitting during training. The main hyperparameter settings are summarized in Table 4, and the computational environment is detailed in Table 5.
Regarding the computational cost using the hardware environment described in Table 5, the fine-tuning process for five epochs required approximately 163 min in wall-clock time (222 min in CPU time). Furthermore, the fine-tuning process for ten epochs required approximately 324 min in wall-clock time (450 min in CPU time).
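The parameter savings that make LoRA attractive here can be illustrated numerically: rather than updating a full weight matrix W of size d × k, LoRA trains two low-rank factors B (d × r) and A (r × k), and the adapted weight is W + (α/r)·BA. The dimensions below are illustrative, not the actual layer sizes of Qwen2.5-VL.

```python
import numpy as np

# Numerical illustration of LoRA's low-rank update (dimensions are
# illustrative, not the actual Qwen2.5-VL layer sizes).
d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))    # frozen pre-trained weight
B = np.zeros((d, r))               # initialized to zero, as in LoRA
A = rng.standard_normal((r, k))

W_adapted = W + (alpha / r) * (B @ A)  # equals W before any training step

full_params = d * k
lora_params = d * r + r * k
ratio = lora_params / full_params      # ~3% of the full parameter count
```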

4.2. Evaluation Methodology

This subsection describes the verification experiments conducted in this study and the evaluation metrics used to assess model performance.

4.2.1. Verification Experiments

To assess the reasoning capabilities of general-purpose VLMs on the constructed dataset, and to verify both the effectiveness of the constructed dataset and the impact of domain adaptation through fine-tuning, the following models were compared:
Qwen2.5-VL-7B-Instruct (without fine-tuning): The baseline model;
LLaVA-NeXT-Llama3-8B [39] (without fine-tuning);
MiniCPM-V-2.6-8B [40] (without fine-tuning);
Epoch-5 Model: Qwen2.5-VL-7B-Instruct fine-tuned on the constructed dataset for five epochs;
Epoch-10 Model: Qwen2.5-VL-7B-Instruct fine-tuned on the constructed dataset for ten epochs.
For each model, category-wise performance was evaluated on the evaluation set to analyze the effects of domain adaptation and performance changes across training epochs. The fine-tuned models, training scripts, and inference scripts are provided in the Supplementary Materials.
In addition to domain-specific performance, generalization capability was evaluated using MMBench [41], a widely used benchmark for assessing perception and reasoning in VLMs. By comparing MMBench scores before and after fine-tuning, the impact of domain adaptation on the model’s inherent general reasoning ability was quantitatively assessed.

4.2.2. Evaluation Metrics

To improve the reliability of model predictions, multiple inference runs were conducted for each sample in the evaluation set, and the final prediction was determined through statistical aggregation. Specifically, three inference runs were performed per question. For qualitative questions, the most frequently occurring answer (mode) was selected, while for quantitative questions, the median value was used to mitigate the influence of outliers.
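For illustration, this aggregation step can be written as follows; the function names are illustrative and are not taken from the released scripts.

```python
from collections import Counter
from statistics import median

def aggregate_qualitative(answers):
    """Majority vote (mode) over repeated inference runs for qualitative questions."""
    return Counter(answers).most_common(1)[0][0]

def aggregate_quantitative(values):
    """Median over repeated inference runs, mitigating outlier predictions."""
    return median(values)

# Three inference runs per question, as in the evaluation protocol.
print(aggregate_qualitative(["Zone A", "Zone B", "Zone A"]))  # -> Zone A
print(aggregate_quantitative([22.4, 23.1, 30.0]))             # -> 23.1
```

The median is preferred over the mean for quantitative answers because a single hallucinated value (e.g., 30.0 °C above) would otherwise shift the aggregate substantially.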
Based on these aggregated predictions, evaluation metrics were applied according to the question type.
For qualitative questions, accuracy was computed by automatically checking for exact matches between predicted answers and ground-truth answers. Because the baseline model frequently failed to adhere to the specified output format, its qualitative responses were manually reviewed to determine correctness.
For quantitative questions, the Mean Absolute Error (MAE) was computed using Equation (1) to measure the absolute difference between predicted and simulation reference values or visualized data values. In addition, because the range of the temperature color bar varied across samples, accuracy was also evaluated in a relative manner. Specifically, predictions were considered correct if the absolute error fell within 10% of the corresponding color bar range, as defined by Equations (2) and (3).
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{x}_i - x_i \right| \quad (1)$$

$$f(x_i) = \begin{cases} 1, & \text{if } \left| \hat{x}_i - x_i \right| \le R_i \times 0.10 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \times 100 \quad (3)$$
Here, $N$ denotes the total number of data samples, $\hat{x}_i$ is the predicted value, $x_i$ is the simulation reference value or visualized data value, and $R_i$ represents the color bar range of the $i$-th sample.
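For illustration, Equations (1)–(3) translate directly into code; the function names below are illustrative, not those of the released evaluation scripts.

```python
def mae(preds, refs):
    """Mean Absolute Error, Equation (1)."""
    return sum(abs(p - r) for p, r in zip(preds, refs)) / len(preds)

def relative_accuracy(preds, refs, ranges, tol=0.10):
    """Accuracy under the per-sample tolerance of Equations (2)-(3):
    a prediction is correct if its absolute error falls within 10%
    of that sample's color bar range R_i."""
    correct = sum(1 for p, r, R in zip(preds, refs, ranges)
                  if abs(p - r) <= R * tol)
    return 100.0 * correct / len(preds)

# A 20 degC color bar range gives a +/-2.0 degC tolerance per sample.
preds, refs, ranges = [23.0, 28.0], [22.0, 25.0], [20.0, 20.0]
print(mae(preds, refs))                        # -> 2.0
print(relative_accuracy(preds, refs, ranges))  # -> 50.0
```

Normalizing by the per-sample color bar range $R_i$ makes the accuracy comparable across samples whose temperature scales differ.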

4.3. Evaluation Results

This subsection presents the results obtained from the evaluation procedures described in Section 4.2. Table 6 shows category-wise accuracy and MAE for each general-purpose VLM (Qwen2.5-VL-7B-Instruct, LLaVA-NeXT-Llama3-8B, and MiniCPM-V-2.6-8B) on the evaluation set. Table 7 and Table 8 summarize the category-wise accuracy, MAE, and time required for inference on the evaluation set, specifically comparing the performance of the baseline Qwen2.5-VL-7B-Instruct model before and after fine-tuning (Epoch-5 and Epoch-10). A comparative visualization of the accuracy is provided in Figure 6. For Category 1 (temperature interpretation), Figure 7a,b show the accuracy and MAE, respectively, across different colormap types. Figure 8 illustrates the accuracy of each model for individual question types within Category 2 (airflow interpretation).
Across all categories, the fine-tuned models consistently outperformed the baseline model, demonstrating clear accuracy improvements and confirming the effectiveness of domain adaptation.
To provide a qualitative comparison between the baseline and fine-tuned models, Figure 9 presents representative samples from each category along with the corresponding model outputs. Furthermore, the evaluation of generalization performance summarized in Table 9 shows a decrease of approximately 2.9 percentage points in the MMBench score after fine-tuning, indicating a modest trade-off between domain-specific performance gains and general reasoning capability.

5. Discussion

5.1. Effectiveness of Domain Adaptation

As shown in Table 6, all evaluated general-purpose VLMs (Qwen2.5-VL-7B-Instruct, LLaVA-NeXT-Llama3-8B, and MiniCPM-V-2.6-8B) exhibited low accuracy and high MAE across all categories prior to fine-tuning. These results across different model architectures indicate that general-purpose VLMs lack the ability to interpret scientific visualizations representing physical quantities. In particular, the poor performance on quantitative questions suggests a failure to cross-reference color legends with contour plots in MR images to infer temperature values.
In contrast, as reported in Table 7, the Qwen2.5-VL model fine-tuned on the constructed dataset achieved substantially higher accuracy and lower MAE across all categories. This improvement indicates that the VLM successfully acquired domain-specific reasoning and interpretation capabilities for scientific visualization. Notably, the marked performance gain in the “Integrated Interpretation of Temperature and Airflow” category demonstrates that the model acquired the ability to simultaneously interpret multiple physical quantities in MR images and integrate them to perform complex reasoning.

5.2. Category-Wise Analysis

5.2.1. Temperature Interpretation

As reported in Table 7, the baseline model achieved an accuracy of only 12% for temperature interpretation, indicating an inability to cross-reference image features with legend information. Figure 7 and Figure 9 further show that performance degraded significantly when the Viridis and Magma colormaps were used, compared to the rainbow-based Fast colormap. This is likely because rainbow-based colormaps are widely used in thermography and weather visualizations, allowing the model to exploit color–value associations learned during pretraining. In contrast, the Viridis and Magma colormaps are less commonly encountered in general visual data, and therefore, the model was unable to rely on prior associations, resulting in poor performance.
For the fine-tuned model, however, Figure 7 shows that accuracy with Viridis exceeded that achieved with Fast and Fast (Reversed). This can be attributed to the fact that Viridis maintains a unique, monotonic correspondence between color and value, whereas Fast and Fast (Reversed) introduce ambiguity: identical colors may correspond to different numerical values depending on the legend orientation. Consequently, accurate interpretation using Fast requires rigorous cross-referencing between the MR image and the legend. The superior performance with Viridis, therefore, indicates that the fine-tuned model relied on explicit legend-based reasoning rather than prior assumptions, resulting in more robust temperature interpretation. Furthermore, to rigorously evaluate the model's ability to apply legend-based reasoning to unseen colormaps, a test was conducted using the completely unseen Magma colormap. For this novel visual representation, the interpretation accuracy improved significantly from approximately 10% prior to fine-tuning to 50% with the fine-tuned model. Because the model was never exposed to the Magma colormap during training, this substantial performance gain indicates that the model acquired a generalizable capability to cross-reference legend information with spatial visual features, regardless of the underlying color scheme.
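The legend-based reasoning discussed here has a simple algorithmic analogue. The sketch below, which is purely illustrative and not code from this study, maps a queried pixel color to a physical value by nearest-neighbor lookup against colors sampled from the color bar; the Viridis-like RGB samples are approximate.

```python
def color_to_value(pixel_rgb, legend):
    """legend: list of (value, (r, g, b)) pairs sampled along the color bar.
    Returns the value whose legend color is nearest (squared Euclidean
    distance in RGB space) to the queried pixel color."""
    def dist2(c1, c2):
        return sum((a - b) ** 2 for a, b in zip(c1, c2))
    return min(legend, key=lambda vc: dist2(vc[1], pixel_rgb))[0]

# Approximate Viridis samples for a 20-30 degC legend (illustrative values).
legend = [(20.0, (68, 1, 84)), (25.0, (33, 145, 140)), (30.0, (253, 231, 37))]
print(color_to_value((40, 140, 138), legend))  # -> 25.0
```

Note that this lookup is only well defined when the colormap is injective, which is exactly why the reversed Fast colormap forces the model to consult the legend orientation rather than color alone.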

5.2.2. Airflow Interpretation

As shown in Figure 8, the baseline model achieved an accuracy exceeding 50% for Type 1 questions, which ask about the colors present in the image. This suggests that the model retains basic perceptual abilities, such as recognizing isolines and their colors in MR visualizations. In contrast, for Type 2 questions—requiring inference of airflow velocity ranges at specific coordinates—the accuracy dropped to approximately 5%. This result indicates that the baseline model lacks the ability to interpret legend information and perform the reasoning necessary to estimate values between isolines.
The fine-tuned model (Epoch-5), by contrast, achieved accuracies of approximately 98% for Type 1 questions and 42% for Type 2 questions. These results demonstrate that domain adaptation enabled the VLM to associate recognized visual features with legend information and convert them into physical quantities. The substantial improvement for Type 2 questions further indicates that the model acquired spatial reasoning capabilities, allowing it to identify regions between isolines and perform visual interpolation based on adjacent values.
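The visual interpolation described above also has a simple numerical analogue. The following sketch (not part of the study's pipeline; names are illustrative) estimates a velocity at a point between two isolines from the point's distances to each line, assuming an approximately linear variation between them.

```python
def interpolate_between_isolines(v_inner, v_outer, d_inner, d_outer):
    """Linearly interpolate a value at a point lying between two isolines,
    weighted by the point's distances to the inner and outer lines."""
    total = d_inner + d_outer
    return v_inner + (v_outer - v_inner) * d_inner / total

# A point equidistant between the 0.2 m/s and 0.4 m/s isolines.
print(interpolate_between_isolines(0.2, 0.4, 1.0, 1.0))  # -> 0.3
```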

5.2.3. Integrated Interpretation of Temperature and Airflow

As shown in Table 7, the baseline model achieved an accuracy of only 17.86% in Category 3. This low performance is primarily attributable to its inability to correctly infer airflow velocity ranges, as discussed in Section 5.2.2.
In contrast, the fine-tuned model achieved an accuracy of approximately 65%. This result suggests that the training process enabled the model to sequentially perform multiple inference steps: first, deriving airflow velocity ranges from legend information, and second, inferring temperature values by cross-referencing contour plots with corresponding legends. This capability to integrate and reason over multiple physical quantities suggests that future VLM-based systems may be able to assess indoor thermal comfort using complex, multimodal information and provide accessible explanations to non-expert users.

5.3. Model Generalization Performance and Training Efficiency

While fine-tuning on domain-specific data improves task performance, it may also degrade a model's general reasoning capabilities. To assess this effect, we evaluated the models using MMBench. As shown in Table 9, the fine-tuned model exhibited a performance decrease of approximately 2.9 percentage points, with the score dropping from 87.3% to 84.4%. Given the substantial gains in quantitative interpretation accuracy achieved through fine-tuning, this reduction is relatively modest. The results, therefore, indicate that the model retains sufficient general reasoning ability despite domain adaptation.
Training efficiency and the impact of training iterations were also examined. As shown in Table 7, the model’s accuracy converged rapidly, reaching a high level after only five epochs. However, when training was extended to ten epochs, the accuracy on the test dataset degraded. This decline indicates that excessive training led to overfitting; the model overly adapted to the specific features of the training data, thereby reducing its generalizability to unseen data. These results demonstrate that effective domain adaptation is best achieved with a small number of training iterations. This efficiency suggests that rapid fine-tuning tailored to specific indoor environments or analysis conditions will be feasible, enabling swift system adaptation in practical deployments.

5.4. Limitations and Future Work

This study has three primary limitations.
The first limitation concerns the diversity of spatial conditions. All CFD analysis results in the constructed dataset were derived from a single room geometry. Consequently, the model’s performance in spaces with different geometries or complex furniture arrangements has not been validated. Future work should therefore incorporate CFD results from a wider variety of spatial configurations to improve and evaluate the model’s generalizability. However, the fine-tuned models demonstrated a degree of spatial and visual generalizability within the same environment. Because our evaluation set included viewpoints and colormaps that were not used during fine-tuning, the observed accuracy improvements on these unseen images confirm that the model did not merely memorize specific spatial layouts or color distributions. Having acquired the ability to systematically cross-reference visual features with the color legend, the VLM is expected to reasonably predict values even for unextracted or intermediate planes.
The second limitation relates to practical applicability. The primary objective of this study was to assess whether a VLM can accurately interpret the current state of an indoor environment, such as reading temperatures from images. However, real-world design processes require more than descriptive understanding; they demand explanations of causal mechanisms and proposals for design improvements. Future work will therefore focus on constructing datasets that include explanatory texts describing cause–effect relationships and specific improvement strategies. Training on such data would enable the VLM not only to interpret visualized CFD results but also to provide causal explanations and actionable recommendations, thereby supporting non-experts throughout the design process.
The third limitation pertains to the flexibility of user queries. The current model was fine-tuned using structured question templates, and even within these trained categories, the interpretation accuracy currently remains between approximately 50% and 80%. Given this performance level, it is highly likely that the model cannot yet provide valid responses to queries that deviate significantly from the training scope—such as abstract evaluations of thermal comfort or the spatial implications of placing virtual objects. To bridge this gap, future research must focus on constructing expanded datasets that explicitly link fundamental physical quantities to human-centric indices. For instance, to train the model for queries regarding thermal comfort—an index easily understood by non-experts—it would be essential to create a new dataset that guides the VLM to first extract temperature and airflow values from the visualizations and subsequently calculate the Predicted Mean Vote (PMV) index based on those extracted physical quantities. Training on such a multi-step reasoning process would enable the VLM to derive complex, abstract comfort evaluations directly from basic physical parameters, thereby broadening its practical applicability in real-world design scenarios.
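To make the envisioned multi-step pipeline concrete: once air temperature and velocity have been extracted from the visualization, a deterministic PMV stage could apply the Fanger model of ISO 7730. The sketch below is a standard textbook implementation under assumed inputs (mean radiant temperature taken equal to air temperature, met = 1.2, clo = 0.5); it is not part of this study's system.

```python
import math

def fanger_pmv(ta, tr, vel, rh, met=1.2, clo=0.5):
    """Predicted Mean Vote per the steady-state Fanger model (ISO 7730).
    ta: air temperature (degC), tr: mean radiant temperature (degC),
    vel: relative air velocity (m/s), rh: relative humidity (%),
    met: metabolic rate (met), clo: clothing insulation (clo)."""
    pa = rh * 10.0 * math.exp(16.6536 - 4030.183 / (ta + 235.0))  # vapour pressure, Pa
    icl = 0.155 * clo          # clothing insulation, m2.K/W
    m = met * 58.15            # metabolic rate, W/m2 (external work assumed zero)
    fcl = 1.0 + 1.29 * icl if icl <= 0.078 else 1.05 + 0.645 * icl
    hcf = 12.1 * math.sqrt(vel)
    taa, tra = ta + 273.0, tr + 273.0
    # Iteratively solve for the clothing surface temperature.
    tcla = taa + (35.5 - ta) / (3.5 * icl + 0.1)
    p1 = icl * fcl
    p2 = p1 * 3.96
    p3 = p1 * 100.0
    p4 = p1 * taa
    p5 = 308.7 - 0.028 * m + p2 * (tra / 100.0) ** 4
    xn, xf = tcla / 100.0, tcla / 50.0
    hc = hcf
    for _ in range(150):
        if abs(xn - xf) <= 0.00015:
            break
        xf = (xf + xn) / 2.0
        hcn = 2.38 * abs(100.0 * xf - taa) ** 0.25
        hc = max(hcf, hcn)
        xn = (p5 + p4 * hc - p2 * xf ** 4) / (100.0 + p3 * hc)
    tcl = 100.0 * xn - 273.0
    # Heat loss components (W/m2).
    hl1 = 3.05e-3 * (5733.0 - 6.99 * m - pa)            # skin diffusion
    hl2 = 0.42 * (m - 58.15) if m > 58.15 else 0.0      # sweat evaporation
    hl3 = 1.7e-5 * m * (5867.0 - pa)                    # latent respiration
    hl4 = 0.0014 * m * (34.0 - ta)                      # dry respiration
    hl5 = 3.96 * fcl * (xn ** 4 - (tra / 100.0) ** 4)   # radiation
    hl6 = fcl * hc * (tcl - tr)                         # convection
    ts = 0.303 * math.exp(-0.036 * m) + 0.028
    return ts * (m - hl1 - hl2 - hl3 - hl4 - hl5 - hl6)
```

In such a pipeline, the VLM would supply `ta` and `vel` from the MR visualization, while `tr`, `rh`, `met`, and `clo` would come from user input or assumed defaults; training data could then pair visualizations with the resulting PMV values.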

6. Conclusions

This study investigated the application of VLMs to support non-experts’ understanding of CFD analysis results, with the goal of facilitating consensus-building in architectural and environmental design. A fundamental prerequisite for this application is the ability of VLMs to accurately interpret visualizations of physical quantities.
To evaluate this capability, we constructed a dataset consisting of MR images with superimposed CFD analysis results and corresponding question–answer pairs. These images visualize temperature contour plots and airflow velocity isolines within real-world indoor spaces. Using this dataset, we fine-tuned a general-purpose VLM and evaluated its interpretation and reasoning performance.
The dataset enabled systematic evaluation of the model’s ability to cross-reference legend information with quantitative values, as well as its capacity to interpret inherently invisible physical phenomena—such as heat and airflow—when visualized using MR.
The main findings are summarized as follows:
The baseline model achieved accuracies below 30% across all categories, indicating that general-purpose VLMs lack sufficient quantitative reasoning ability for interpreting visualized physical quantities such as temperature and airflow.
Fine-tuning the model on the constructed dataset improved accuracy by over 40 percentage points across all categories, demonstrating that domain adaptation enables effective cross-referencing between image features and legend information.
Training convergence was achieved within five epochs, with no significant improvement observed at ten epochs, indicating that effective domain adaptation can be achieved with minimal training effort.
The fine-tuned model enables quantitative interpretation of CFD analysis results, allowing building users to understand future indoor environments during the design phase. This capability is expected to facilitate smoother consensus-building and improve satisfaction with indoor environmental quality.

Supplementary Materials

The following supporting information can be downloaded at https://doi.org/10.5281/zenodo.18728525. The materials are organized into the following folders: Dataset, Fine-Tuned Models, Inference Results, and Scripts for Fine-Tuning and Inference.

Author Contributions

Conceptualization, S.F. and T.F.; methodology, S.F.; software, S.F.; validation, S.F.; formal analysis, S.F.; investigation, S.F.; resources, S.F.; data curation, S.F.; writing—original draft preparation, S.F.; writing—review and editing, S.F. and T.F.; visualization, S.F.; supervision, T.F.; project administration, S.F.; funding acquisition, T.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, grant number 23K11724.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset, fine-tuned models, and inference results presented in this study are available on Zenodo at https://doi.org/10.5281/zenodo.18728525.

Acknowledgments

During the preparation of this study, the authors used Gemini 3.0 Pro for the purpose of improving the readability and language of the text. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Robertson, T.; Simonsen, J. Challenges and opportunities in contemporary participatory design. Des. Issues 2012, 28, 3–9. [Google Scholar] [CrossRef]
  2. Matz, C.J.; Stieb, D.M.; Davis, K.; Egyed, M.; Rose, A.; Chou, B.; Brion, O. Effects of age, season, gender and urban-rural status on time-activity: Canadian Human Activity Pattern Survey 2 (CHAPS 2). Int. J. Environ. Res. Public Health 2014, 11, 2108–2124. [Google Scholar] [CrossRef]
  3. Morris, E.A.; Speroni, S.; Taylor, B.D. Going nowhere faster: Did the covid-19 pandemic accelerate the trend toward staying home? J. Am. Plan. Assoc. 2025, 91, 361–379. [Google Scholar] [CrossRef]
  4. Al Horr, Y.; Arif, M.; Kaushik, A.; Mazroei, A.; Katafygiotou, M.; Elsarrag, E. Occupant productivity and office indoor environment quality: A review of the literature. Build. Environ. 2016, 105, 369–389. [Google Scholar] [CrossRef]
  5. Calderon-Hernandez, C.; Paes, D.; Irizarry, J.; Brioso, X. Comparing virtual reality and 2-dimensional drawings for the visualization of a construction project. In Proceedings of the ASCE International Conference on Computing in Civil Engineering 2019, Atlanta, Georgia, 17–19 June 2019; pp. 17–24. [Google Scholar] [CrossRef]
  6. Ivić, I.; Cerić, A. Risks caused by information asymmetry in construction projects: A systematic literature review. Sustainability 2023, 15, 9979. [Google Scholar] [CrossRef]
  7. Lange, E. Integration of computerized visual simulation and visual assessment in environmental planning. Landsc. Urban Plan. 1994, 30, 99–112. [Google Scholar] [CrossRef]
  8. Grossman, R.L. The case for cloud computing. IT Prof. 2009, 11, 23–27. [Google Scholar] [CrossRef]
  9. Garbett, J.; Hartley, T.; Heesom, D. A multi-user collaborative BIM-AR system to support design and construction. Autom. Constr. 2021, 122, 103487. [Google Scholar] [CrossRef]
  10. Milgram, P.; Kishino, F. A Taxonomy of Mixed Reality Visual Displays. IEICE Trans. Inf. Syst. 1994, 77, 1321–1329. [Google Scholar]
  11. Huizenga, C.; Abbaszadeh, S.; Zagreus, L.; Arens, E.A. Air quality and thermal comfort in office buildings: Results of a large indoor environmental quality survey. Healthy Build. 2006, III, 393–397. [Google Scholar]
  12. Fukuda, T.; Yokoi, K.; Yabuki, N.; Motamedi, A. An indoor thermal environment design system for renovation using augmented reality. J. Comput. Des. Eng. 2019, 6, 179–188. [Google Scholar] [CrossRef]
  13. Sibrel, S.C.; Rathore, R.; Lessard, L.; Schloss, K.B. The relation between color and spatial structure for interpreting colormap data visualizations. J. Vis. 2020, 20, 7. [Google Scholar] [CrossRef]
  14. Zhou, X.; Liu, M.; Yurtsever, E.; Zagar, B.L.; Zimmer, W.; Cao, H.; Knoll, A.C. Vision language models in autonomous driving: A survey and outlook. IEEE Trans. Intell. Veh. 2024, 1–20. [Google Scholar] [CrossRef]
  15. Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A survey on vision-language-action models for embodied ai. arXiv 2024, arXiv:2405.14093. [Google Scholar] [CrossRef]
  16. Moshtaghi, M.; Khajavi, S.H.; Pajarinen, J. RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models. arXiv 2025, arXiv:2503.19654. [Google Scholar] [CrossRef]
  17. Duan, L.; Xiu, Y.; Gorlatova, M. Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble. In Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Saint-Maro, France, 8–12 March 2025; pp. 156–161. [Google Scholar] [CrossRef]
  18. OpenAI. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  19. Li, Z.; Wu, X.; Du, H.; Liu, F.; Nghiem, H.; Shi, G. A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 1587–1606. [Google Scholar] [CrossRef]
  20. Zhong, X.; Meng, X.; Li, Y.; Fricker, P.; Liang, J.; Koh, I. An agentic vision-action framework for generative 3D architectural modeling from sketches. Int. J. Archit. Comput. 2025, 23, 679–700. [Google Scholar] [CrossRef]
  21. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26296–26306. [Google Scholar] [CrossRef]
  22. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2025, arXiv:2304.10592. [Google Scholar] [CrossRef]
  23. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  24. Methani, N.; Ganguly, P.; Khapra, M.M.; Kumar, P. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1527–1536. [Google Scholar] [CrossRef]
  25. Masry, A.; Do, X.L.; Tan, J.Q.; Joty, S.; Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 2263–2279. [Google Scholar] [CrossRef]
  26. Xia, R.; Ye, H.; Yan, X.; Liu, Q.; Zhou, H.; Chen, Z.; Shi, B.; Yan, J.; Zhang, B. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. IEEE Trans. Image Process. 2025, 34, 7436–7447. [Google Scholar] [CrossRef]
  27. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2. 5-vl technical report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
  28. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 13–23. [Google Scholar]
  29. Manivannan, V.V.; Jafari, Y.; Eranky, S.; Ho, S.; Yu, R.; Watson-Parris, D.; Ma, Y.; Bergen, L.; Berg-Kirkpatrick, T. ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models. arXiv 2024, arXiv:2410.16701. [Google Scholar] [CrossRef]
  30. Lautenschlager, S. True colours or red herrings?: Colour maps for finite-element analysis in palaeontological studies to enhance interpretation and accessibility. R. Soc. Open Sci. 2021, 8, 211357. [Google Scholar] [CrossRef]
  31. Xu, F.; Nguyen, T.; Du, J. Augmented reality for maintenance tasks with ChatGPT for automated text-to-action. J. Constr. Eng. Manag. 2024, 150, 04024015. [Google Scholar] [CrossRef]
  32. Fan, H.; Zhang, H.; Ma, C.; Wu, T.; Fuh, J.Y.H.; Li, B. Enhancing metal additive manufacturing training with the advanced vision language model: A pathway to immersive augmented reality training for non-experts. J. Manuf. Syst. 2024, 75, 257–269. [Google Scholar] [CrossRef]
  33. Calzolari, G.; Liu, W. Deep learning to replace, improve, or aid CFD analysis in built environment applications: A review. Build. Environ. 2021, 206, 108315. [Google Scholar] [CrossRef]
  34. Zhu, Y.; Fukuda, T.; Yabuki, N. Integrating animated computational fluid dynamics into mixed reality for building-renovation design. Technologies 2019, 8, 4. [Google Scholar] [CrossRef]
  35. Zhang, D.; Xiong, Z.; Zhu, X. Evaluation of Thermal Comfort in Urban Commercial Space with Vision–Language-Model-Based Agent Model. Land 2025, 14, 786. [Google Scholar] [CrossRef]
  36. Zhang, J.; Li, Y.; Fukuda, T.; Wang, B. Urban safety perception assessments via integrating multimodal large language models with street view images. Cities 2025, 165, 106122. [Google Scholar] [CrossRef]
  37. Immersal SDK. Available online: https://immersal.com/ (accessed on 16 January 2026).
  38. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar] [CrossRef]
  39. Cheng, D.; Huang, D.; Zhu, Z.; Zhang, X.; Zhao, X.W.; Luan, Z.; Dai, B.; Zhang, Z. On Domain-Adaptive Post-Training for Multimodal Large Language Models. arXiv 2024, arXiv:2411.19930. [Google Scholar] [CrossRef]
  40. Yao, Y.; Yu, T.; Zhang, A.; Wang, C.; Cui, J.; Zhu, H.; Cai, T.; Li, H.; Zhao, W.; He, Z.; et al. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv 2024, arXiv:2408.01800. [Google Scholar] [CrossRef]
  41. Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. Mmbench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, Milano, Italy, 29 September–4 October 2024; pp. 216–233. [Google Scholar] [CrossRef]
Figure 1. Example of a data sample included in the constructed dataset.
Figure 2. Interior view of Room 410 and the corresponding computational meshes for each layout pattern.
Figure 3. Locations of the extracted cross-sections used for CFD visualization.
Figure 4. Colormap types and examples of the corresponding temperature contour plots used in the dataset: (a) Fast, (b) Fast (Reversed), (c) Viridis, and (d) Magma. The temperature scale is presented in degrees Celsius (°C).
Figure 5. Examples of real-world background images used for generating MR images in the dataset.
Figure 6. Category-wise accuracy (%) for the baseline model (Qwen2.5-VL-7B-Instruct) and fine-tuned models (Epoch-5 and Epoch-10) on the evaluation set.
Figure 7. (a) Accuracy (%) and (b) Mean Absolute Error (MAE, °C) for Category 1 (Temperature interpretation) across different colormap types for the baseline model (Qwen2.5-VL-7B-Instruct) and fine-tuned models (Epoch-5 and Epoch-10).
Figure 8. Accuracy (%) of the baseline model (Qwen2.5-VL-7B-Instruct) and fine-tuned models (Epoch-5 and Epoch-10) for Category 2 (Airflow interpretation) across different question types.
Figure 9. Representative data samples from each category and the corresponding outputs of the baseline model and the Epoch-10 fine-tuned model. Quantitative outputs represent temperature values in degrees Celsius (°C) as defined by the legend. If the MR image includes airflow isolines, the associated legend text described in Section 3.4 is also provided. Red dots indicate the coordinates referenced in the questions; these dots are shown for illustration only and are not included in the MR images input to the VLM.
Table 1. Boundary conditions applied to the air conditioner in the CFD analysis. Temperatures are given in degrees Celsius (°C) and velocities in meters per second (m/s).

| Item | Cooling | Heating |
|---|---|---|
| Initial Temperature (°C) | 30 | 10 |
| Outlet Temperature (°C) | 22 | 30 |
| Outlet Speed (m/s) | 3.2 | 3.2 |
| Direction (°) | 0, 60 | 0, 60 |
Table 2. Boundary conditions of the construction materials used in the CFD analysis. Temperatures are given in degrees Celsius (°C) and thermal transmittance in W/(m2·K).

| Item | Wall | Ceiling | Floor |
|---|---|---|---|
| Thermal Transmittance (W/(m2·K)) | 1.38 | 1.80 | - |
| Cooling: Initial wall temperature (°C) | 30 | 30 | 30 |
| Cooling: Outdoor temperature (°C) | 35 | 35 | - |
| Heating: Initial wall temperature (°C) | 10 | 10 | 10 |
| Heating: Outdoor temperature (°C) | 5 | 5 | - |
Table 3. Category structure of the constructed dataset.

| Category | Example Question | Samples |
|---|---|---|
| 1. Temperature Interpretation | What is the highest temperature in the image? | 430 |
| 2. Airflow Interpretation | What is the airflow velocity at the specified coordinates? | 307 |
| 3. Integrated Interpretation of Temperature and Airflow | What is the temperature within the specified flow velocity range? | 142 |
| All | - | 879 |
Table 4. Primary hyperparameters used for fine-tuning.

| Learning Rate | 0.0002 |
| Batch Size | 1 |
| Gradient Accumulation Steps | 16 |
| Optimizer | AdamW |
| LoRA Rank | 16 |
Table 5. Computational environment used for fine-tuning and evaluation.

| CPU | Intel Core i9-14900K |
| GPU | NVIDIA GeForce RTX 4090 |
| RAM | 64 GB |
| OS | Windows 11 Education 25H2 |
Table 6. Category-wise accuracy (%) and Mean Absolute Error (MAE, °C) for each general-purpose VLM (Qwen2.5-VL-7B-Instruct, LLaVA-NeXT-Llama3-8B, and MiniCPM-V-2.6-8B) on the evaluation set. Upward arrows (↑) indicate that higher values are better, and downward arrows (↓) indicate that lower values are better.

| Model | Temperature Accuracy [%] ↑ | Temperature MAE [°C] ↓ | Airflow Accuracy [%] ↑ | Integrated Accuracy [%] ↑ | Integrated MAE [°C] ↓ |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 12.20 | 3.36 | 38.98 | 17.86 | 3.67 |
| LLaVA-NeXT-Llama3-8B | 10.97 | 3.91 | 15.25 | 21.43 | 8.18 |
| MiniCPM-V-2.6-8B | 1.50 | 9.64 | 22.95 | 6.9 | 6.52 |
Table 7. Category-wise accuracy (%) and Mean Absolute Error (MAE, °C) for the baseline model (Qwen2.5-VL-7B-Instruct) and the fine-tuned models (Epoch-5 and Epoch-10) on the evaluation set. Upward arrows (↑) indicate that higher values are better, and downward arrows (↓) indicate that lower values are better.

| Model | Temperature Accuracy [%] ↑ | Temperature MAE [°C] ↓ | Airflow Accuracy [%] ↑ | Integrated Accuracy [%] ↑ | Integrated MAE [°C] ↓ |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 12.20 | 3.36 | 38.98 | 17.86 | 3.67 |
| Epoch-5 | 54.27 | 1.26 | 79.66 | 64.29 | 0.96 |
| Epoch-10 | 53.05 | 1.23 | 74.58 | 67.86 | 0.77 |
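The two metric types in Tables 6 and 7 can be sketched as follows: exact-match accuracy over all answers, and MAE over numeric answers. This is a minimal illustration with made-up values, not the paper's actual evaluation script.

```python
# Exact-match accuracy (%) and Mean Absolute Error over numeric answers,
# as used for the category-wise scores in Tables 6 and 7.
def accuracy(preds, golds):
    """Percentage of predictions that exactly match the reference answer."""
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mae(preds, golds):
    """Mean absolute error between predicted and reference numeric values."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)

# Illustrative predicted vs. reference temperatures (°C), not real data.
preds = [26.0, 28.5, 24.0]
golds = [26.0, 27.0, 25.0]
print(round(accuracy(preds, golds), 2))  # 33.33
print(round(mae(preds, golds), 2))       # 0.83
```

MAE complements accuracy here: a prediction can miss the exact value yet still be numerically close, which matters for legend-based temperature reading.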
Table 8. Category-wise wall-clock time and CPU time (in seconds, s) required for inference by the baseline model (Qwen2.5-VL-7B-Instruct) and the fine-tuned models (Epoch-5 and Epoch-10) on the evaluation set. Downward arrows (↓) indicate that lower values are better.

| Model | Temperature Wall Time [s] ↓ | Temperature CPU Time [s] ↓ | Airflow Wall Time [s] ↓ | Airflow CPU Time [s] ↓ | Integrated Wall Time [s] ↓ | Integrated CPU Time [s] ↓ |
|---|---|---|---|---|---|---|
| Baseline Model | 2.31 | 6.62 | 6.66 | 10.05 | 4.13 | 7.89 |
| Epoch-5 | 5.98 | 10.29 | 7.58 | 11.51 | 2.92 | 7.19 |
| Epoch-10 | 8.24 | 11.54 | 7.76 | 12.10 | 2.73 | 7.21 |
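The two timing columns in Table 8 measure different things: wall-clock time is elapsed real time, while CPU time counts only time the process spent executing on the CPU. A minimal sketch of how both can be captured in Python (the workload below is a stand-in, not the actual VLM inference call):

```python
import time

def timed(fn, *args):
    """Run fn and return (result, wall-clock seconds, CPU seconds)."""
    w0, c0 = time.perf_counter(), time.process_time()
    result = fn(*args)
    wall = time.perf_counter() - w0   # elapsed real time
    cpu = time.process_time() - c0    # process CPU time only
    return result, wall, cpu

def workload():
    # Stand-in for model inference: sleep adds wall time but no CPU time,
    # the loop adds both.
    time.sleep(0.05)
    return sum(i * i for i in range(100_000))

_, wall, cpu = timed(workload)
assert wall >= 0.05  # the sleep shows up in wall-clock time
```

This distinction explains why CPU time can exceed wall time in Table 8: CPU time sums over all threads, so a multithreaded inference pipeline accumulates CPU seconds faster than real time passes.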
Table 9. MMBench scores used to evaluate the general reasoning performance of the baseline model (Qwen2.5-VL-7B-Instruct) and the fine-tuned models (Epoch-5 and Epoch-10). Accuracy is measured in percentage (%), and upward arrows (↑) indicate that higher values are better.

| | Baseline Model | Epoch-5 | Epoch-10 |
|---|---|---|---|
| Accuracy [%] ↑ | 87.3 | 84.4 | 84.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Futamura, S.; Fukuda, T. Quantitative Evaluation and Domain Adaptation of Vision–Language Models for Mixed-Reality Interpretation of Indoor Environmental Computational Fluid Dynamics Visualizations. Technologies 2026, 14, 157. https://doi.org/10.3390/technologies14030157
