Quantitative Evaluation and Domain Adaptation of Vision–Language Models for Mixed-Reality Interpretation of Indoor Environmental Computational Fluid Dynamics Visualizations
Abstract
1. Introduction
2. Literature Review
2.1. Development and Current Status of Vision–Language Models
2.2. Adaptation of VLMs to Visualized Physical Information and MR Image
2.3. Applications of AI and VLMs in Built Environmental Design
2.4. Contribution
- We introduced a novel approach combining VLMs with MR and CFD. Whereas existing studies primarily focus on passively displaying visualization results to users, this integration makes it possible for non-experts, such as building owners and occupants, to understand the design content.
- We constructed a unique dataset specifically designed for the quantitative interpretation of physical visualizations. Unlike conventional datasets that rely on qualitative judgments, our dataset requires the extraction of quantitative values from visualized CFD results (e.g., contour plots) superimposed onto real-world backgrounds.
- We investigated the quantitative reasoning capabilities of VLMs regarding physical quantities. By rigorously evaluating the model’s ability to cross-reference images with legend information, we verified the feasibility and practical boundaries of applying VLMs to scientific environmental visualizations.
3. Dataset Construction
3.1. Dataset Overview
- Creation of three-dimensional models of the target space and execution of CFD analyses;
- Generation of MR images with superimposed CFD analysis results;
- Creation of question–answer pairs and explanatory texts based on the MR images (one such record is sketched below).
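As a concrete illustration of the third step, the following is a minimal sketch of one question–answer record in JSON Lines form. The field names, file path, and layout are assumptions for illustration, not the authors’ actual schema.

```python
# A minimal sketch of one question-answer record for supervised fine-tuning.
# Field names and the JSONL layout are assumptions, not the authors' schema.
import json

record = {
    "image": "mr_images/cooling_0deg_0001.png",  # hypothetical MR image path
    "category": "airflow_interpretation",
    "question": "What is the airflow velocity range at the specified coordinates?",
    "answer": "Between 0.25 m/s and 1.0 m/s",
}

# Append the record to a JSONL file, one record per line.
with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```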
3.2. CFD Analysis
3.3. MR Image Generation Method
3.4. Creation of Question–Answer Pairs
- Temperature interpretation;
- Airflow interpretation;
- Integrated interpretation of temperature and airflow.
- Type 1: Questions asking for the color of isolines present in the image;
- Type 2: Questions asking for the airflow velocity range at specified coordinates.
- Magenta line: 1.0 m/s, representing discharged airflow from the air conditioner;
- White line: 0.25 m/s, representing the draft risk threshold at which airflow may cause thermal discomfort;
- Black line: 0.1 m/s, representing the boundary of air stagnation, with enclosed regions indicating stagnant air (a lookup over these thresholds is sketched below).
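Because the legend fixes a one-to-one mapping between isoline color and velocity threshold, a Type 2 answer reduces to a threshold lookup. The sketch below encodes the mapping above; the band labels returned by the classifier are illustrative phrasing, not the dataset’s exact answer strings.

```python
# Isoline colors and the airflow velocity thresholds they denote (from the legend).
ISOLINE_VELOCITY_MS = {
    "magenta": 1.0,   # discharged airflow from the air conditioner
    "white": 0.25,    # draft-risk threshold for thermal discomfort
    "black": 0.1,     # boundary of air stagnation
}

def classify_velocity(v_ms: float) -> str:
    """Map a velocity to the band implied by the legend thresholds.

    The returned phrasing is illustrative, not the dataset's answer format.
    """
    if v_ms >= ISOLINE_VELOCITY_MS["magenta"]:
        return "discharged airflow (>= 1.0 m/s)"
    if v_ms >= ISOLINE_VELOCITY_MS["white"]:
        return "possible draft risk (0.25-1.0 m/s)"
    if v_ms >= ISOLINE_VELOCITY_MS["black"]:
        return "low airflow (0.1-0.25 m/s)"
    return "stagnant air (< 0.1 m/s)"
```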
3.5. Dataset Details
4. Evaluation
4.1. Evaluation Setup
4.2. Evaluation Methodology
4.2.1. Verification Experiments
- Qwen2.5-VL-7B-Instruct (without fine-tuning): the baseline model (see the inference sketch after this list);
- LLaVA-NeXT-Llama3-8B [39] (without fine-tuning);
- MiniCPM-V-2.6-8B [40] (without fine-tuning);
- Epoch-5 Model: Qwen2.5-VL-7B-Instruct fine-tuned on the constructed dataset for five epochs;
- Epoch-10 Model: Qwen2.5-VL-7B-Instruct fine-tuned on the constructed dataset for ten epochs.
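For reference, the following is a minimal sketch of single-image inference with the baseline Qwen2.5-VL-7B-Instruct via the Hugging Face transformers API. The image path and question are illustrative, and the exact inference settings used in the experiments are not specified here.

```python
# Minimal single-image inference sketch for the baseline model.
# Requires a recent transformers release with Qwen2.5-VL support;
# the image path and question are illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("mr_images/cooling_0deg_0001.png")  # hypothetical path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the highest temperature in the image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```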
4.2.2. Evaluation Metrics
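The result tables below report accuracy (in percent, higher is better) and mean absolute error (MAE, in °C, lower is better). A minimal sketch of the two metrics follows, assuming exact-match scoring and that numeric values have already been parsed from the model outputs; the paper’s precise matching criterion may differ.

```python
from typing import Sequence

def accuracy_percent(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the reference answers.

    Exact-match scoring is an assumption; the paper's criterion may differ.
    """
    return 100.0 * sum(p == t for p, t in zip(preds, targets)) / len(targets)

def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error over numeric (e.g., temperature in deg C) answers."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)
```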
4.3. Evaluation Results
5. Discussion
5.1. Effectiveness of Domain Adaptation
5.2. Category-Wise Analysis
5.2.1. Temperature Interpretation
5.2.2. Airflow Interpretation
5.2.3. Integrated Interpretation of Temperature and Airflow
5.3. Model Generalization Performance and Training Efficiency
5.4. Limitations and Future Work
6. Conclusions
- The baseline model achieved accuracies below 40% across all categories, indicating that general-purpose VLMs lack sufficient quantitative reasoning ability for interpreting visualized physical quantities such as temperature and airflow.
- Fine-tuning the model on the constructed dataset improved accuracy by more than 40 percentage points across all categories, demonstrating that domain adaptation enables effective cross-referencing between image features and legend information.
- Training convergence was achieved within five epochs, with no significant improvement observed at ten epochs, indicating that effective domain adaptation can be achieved with minimal training effort.
- The fine-tuned model enables quantitative interpretation of CFD analysis results, allowing building users to understand future indoor environments during the design phase. This capability is expected to facilitate smoother consensus-building and improve satisfaction with indoor environmental quality.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Robertson, T.; Simonsen, J. Challenges and opportunities in contemporary participatory design. Des. Issues 2012, 28, 3–9. [Google Scholar] [CrossRef]
- Matz, C.J.; Stieb, D.M.; Davis, K.; Egyed, M.; Rose, A.; Chou, B.; Brion, O. Effects of age, season, gender and urban-rural status on time-activity: Canadian Human Activity Pattern Survey 2 (CHAPS 2). Int. J. Environ. Res. Public Health 2014, 11, 2108–2124. [Google Scholar] [CrossRef]
- Morris, E.A.; Speroni, S.; Taylor, B.D. Going nowhere faster: Did the COVID-19 pandemic accelerate the trend toward staying home? J. Am. Plan. Assoc. 2025, 91, 361–379. [Google Scholar] [CrossRef]
- Al Horr, Y.; Arif, M.; Kaushik, A.; Mazroei, A.; Katafygiotou, M.; Elsarrag, E. Occupant productivity and office indoor environment quality: A review of the literature. Build. Environ. 2016, 105, 369–389. [Google Scholar] [CrossRef]
- Calderon-Hernandez, C.; Paes, D.; Irizarry, J.; Brioso, X. Comparing virtual reality and 2-dimensional drawings for the visualization of a construction project. In Proceedings of the ASCE International Conference on Computing in Civil Engineering 2019, Atlanta, GA, USA, 17–19 June 2019; pp. 17–24. [Google Scholar] [CrossRef]
- Ivić, I.; Cerić, A. Risks caused by information asymmetry in construction projects: A systematic literature review. Sustainability 2023, 15, 9979. [Google Scholar] [CrossRef]
- Lange, E. Integration of computerized visual simulation and visual assessment in environmental planning. Landsc. Urban Plan. 1994, 30, 99–112. [Google Scholar] [CrossRef]
- Grossman, R.L. The case for cloud computing. IT Prof. 2009, 11, 23–27. [Google Scholar] [CrossRef]
- Garbett, J.; Hartley, T.; Heesom, D. A multi-user collaborative BIM-AR system to support design and construction. Autom. Constr. 2021, 122, 103487. [Google Scholar] [CrossRef]
- Milgram, P.; Kishino, F. A Taxonomy of Mixed Reality Visual Displays. IEICE Trans. Inf. Syst. 1994, 77, 1321–1329. [Google Scholar]
- Huizenga, C.; Abbaszadeh, S.; Zagreus, L.; Arens, E.A. Air quality and thermal comfort in office buildings: Results of a large indoor environmental quality survey. Healthy Build. 2006, III, 393–397. [Google Scholar]
- Fukuda, T.; Yokoi, K.; Yabuki, N.; Motamedi, A. An indoor thermal environment design system for renovation using augmented reality. J. Comput. Des. Eng. 2019, 6, 179–188. [Google Scholar] [CrossRef]
- Sibrel, S.C.; Rathore, R.; Lessard, L.; Schloss, K.B. The relation between color and spatial structure for interpreting colormap data visualizations. J. Vis. 2020, 20, 7. [Google Scholar] [CrossRef]
- Zhou, X.; Liu, M.; Yurtsever, E.; Zagar, B.L.; Zimmer, W.; Cao, H.; Knoll, A.C. Vision language models in autonomous driving: A survey and outlook. IEEE Trans. Intell. Veh. 2024, 1–20. [Google Scholar] [CrossRef]
- Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A survey on vision-language-action models for embodied AI. arXiv 2024, arXiv:2405.14093. [Google Scholar] [CrossRef]
- Moshtaghi, M.; Khajavi, S.H.; Pajarinen, J. RGB-Th-Bench: A dense benchmark for visual-thermal understanding of vision language models. arXiv 2025, arXiv:2503.19654. [Google Scholar] [CrossRef]
- Duan, L.; Xiu, Y.; Gorlatova, M. Advancing the understanding and evaluation of AR-generated scenes: When vision-language models shine and stumble. In Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Saint-Malo, France, 8–12 March 2025; pp. 156–161. [Google Scholar] [CrossRef]
- OpenAI. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Li, Z.; Wu, X.; Du, H.; Liu, F.; Nghiem, H.; Shi, G. A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 1587–1606. [Google Scholar] [CrossRef]
- Zhong, X.; Meng, X.; Li, Y.; Fricker, P.; Liang, J.; Koh, I. An agentic vision-action framework for generative 3D architectural modeling from sketches. Int. J. Archit. Comput. 2025, 23, 679–700. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26296–26306. [Google Scholar] [CrossRef]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar] [CrossRef]
- Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
- Methani, N.; Ganguly, P.; Khapra, M.M.; Kumar, P. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1527–1536. [Google Scholar] [CrossRef]
- Masry, A.; Do, X.L.; Tan, J.Q.; Joty, S.; Hoque, E. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 2263–2279. [Google Scholar] [CrossRef]
- Xia, R.; Ye, H.; Yan, X.; Liu, Q.; Zhou, H.; Chen, Z.; Shi, B.; Yan, J.; Zhang, B. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. IEEE Trans. Image Process. 2025, 34, 7436–7447. [Google Scholar] [CrossRef]
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL technical report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 13–23. [Google Scholar]
- Manivannan, V.V.; Jafari, Y.; Eranky, S.; Ho, S.; Yu, R.; Watson-Parris, D.; Ma, Y.; Bergen, L.; Berg-Kirkpatrick, T. ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models. arXiv 2024, arXiv:2410.16701. [Google Scholar] [CrossRef]
- Lautenschlager, S. True colours or red herrings?: Colour maps for finite-element analysis in palaeontological studies to enhance interpretation and accessibility. R. Soc. Open Sci. 2021, 8, 211357. [Google Scholar] [CrossRef]
- Xu, F.; Nguyen, T.; Du, J. Augmented reality for maintenance tasks with ChatGPT for automated text-to-action. J. Constr. Eng. Manag. 2024, 150, 04024015. [Google Scholar] [CrossRef]
- Fan, H.; Zhang, H.; Ma, C.; Wu, T.; Fuh, J.Y.H.; Li, B. Enhancing metal additive manufacturing training with the advanced vision language model: A pathway to immersive augmented reality training for non-experts. J. Manuf. Syst. 2024, 75, 257–269. [Google Scholar] [CrossRef]
- Calzolari, G.; Liu, W. Deep learning to replace, improve, or aid CFD analysis in built environment applications: A review. Build. Environ. 2021, 206, 108315. [Google Scholar] [CrossRef]
- Zhu, Y.; Fukuda, T.; Yabuki, N. Integrating animated computational fluid dynamics into mixed reality for building-renovation design. Technologies 2019, 8, 4. [Google Scholar] [CrossRef]
- Zhang, D.; Xiong, Z.; Zhu, X. Evaluation of Thermal Comfort in Urban Commercial Space with Vision–Language-Model-Based Agent Model. Land 2025, 14, 786. [Google Scholar] [CrossRef]
- Zhang, J.; Li, Y.; Fukuda, T.; Wang, B. Urban safety perception assessments via integrating multimodal large language models with street view images. Cities 2025, 165, 106122. [Google Scholar] [CrossRef]
- Immersal SDK. Available online: https://immersal.com/ (accessed on 16 January 2026).
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022. [Google Scholar] [CrossRef]
- Cheng, D.; Huang, D.; Zhu, Z.; Zhang, X.; Zhao, X.W.; Luan, Z.; Dai, B.; Zhang, Z. On Domain-Adaptive Post-Training for Multimodal Large Language Models. arXiv 2024, arXiv:2411.19930. [Google Scholar] [CrossRef]
- Yao, Y.; Yu, T.; Zhang, A.; Wang, C.; Cui, J.; Zhu, H.; Cai, T.; Li, H.; Zhao, W.; He, Z.; et al. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv 2024, arXiv:2408.01800. [Google Scholar] [CrossRef]
- Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. MMBench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 216–233. [Google Scholar] [CrossRef]
| Item | Cooling | Heating |
|---|---|---|
| Initial Temperature (°C) | 30 | 10 |
| Outlet Temperature (°C) | 22 | 30 |
| Outlet Speed (m/s) | 3.2 | 3.2 |
| Direction (°) | 0, 60 | 0, 60 |
| Condition | Item | Wall | Ceiling | Floor |
|---|---|---|---|---|
| - | Thermal Transmittance (W/(m²·K)) | 1.38 | 1.8 | 0 |
| Cooling | Initial wall temperature (°C) | 30 | 30 | 30 |
| Cooling | Outdoor temperature (°C) | 35 | 35 | - |
| Heating | Initial wall temperature (°C) | 10 | 10 | 10 |
| Heating | Outdoor temperature (°C) | 5 | 5 | - |
| Category | Example Question | Samples |
|---|---|---|
| 1. Temperature Interpretation | What is the highest temperature in the image? | 430 |
| 2. Airflow Interpretation | What is the airflow velocity at the specified coordinates? | 307 |
| 3. Integrated Interpretation of Temperature and Airflow | What is the temperature within the specified flow velocity range? | 142 |
| All | - | 879 |
| Hyperparameter | Value |
|---|---|
| Learning Rate | 0.0002 |
| Batch Size | 1 |
| Gradient Accumulation Steps | 16 |
| Optimizer | AdamW |
| LoRA Rank | 16 |
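The hyperparameters above correspond to a standard LoRA fine-tuning setup; a minimal sketch using the Hugging Face peft and transformers libraries follows. The target modules, LoRA alpha, output directory, and mixed-precision setting are assumptions not reported in the table.

```python
# Sketch of a LoRA fine-tuning configuration matching the table above.
# target_modules, lora_alpha, output_dir, and bf16 are assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                  # LoRA rank (from the table)
    lora_alpha=32,                         # assumed; not reported
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen2.5-vl-cfd-lora",      # hypothetical
    learning_rate=2e-4,                    # 0.0002 (from the table)
    per_device_train_batch_size=1,         # batch size (from the table)
    gradient_accumulation_steps=16,        # from the table
    optim="adamw_torch",                   # AdamW (from the table)
    num_train_epochs=5,                    # Epoch-5 model; 10 for Epoch-10
    bf16=True,                             # assumed mixed precision
)
```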
| Component | Specification |
|---|---|
| CPU | Intel Core i9-14900K |
| GPU | NVIDIA GeForce RTX 4090 |
| RAM | 64 GB |
| OS | Windows 11 Education 25H2 |
| Model | Temperature Accuracy [%] ↑ | Temperature MAE [°C] ↓ | Airflow Accuracy [%] ↑ | Integrated Accuracy [%] ↑ | Integrated MAE [°C] ↓ |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 12.20 | 3.36 | 38.98 | 17.86 | 3.67 |
| LLaVA-NeXT-Llama3-8B | 10.97 | 3.91 | 15.25 | 21.43 | 8.18 |
| MiniCPM-V-2.6-8B | 1.50 | 9.64 | 22.95 | 6.90 | 6.52 |
| Model | Temperature Accuracy [%] ↑ | Temperature MAE [°C] ↓ | Airflow Accuracy [%] ↑ | Integrated Accuracy [%] ↑ | Integrated MAE [°C] ↓ |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 12.20 | 3.36 | 38.98 | 17.86 | 3.67 |
| Epoch-5 | 54.27 | 1.26 | 79.66 | 64.29 | 0.96 |
| Epoch-10 | 53.05 | 1.23 | 74.58 | 67.86 | 0.77 |
| Model | Temperature Wall Time [s] ↓ | Temperature CPU Time [s] ↓ | Airflow Wall Time [s] ↓ | Airflow CPU Time [s] ↓ | Integrated Wall Time [s] ↓ | Integrated CPU Time [s] ↓ |
|---|---|---|---|---|---|---|
| Baseline Model | 2.31 | 6.62 | 6.66 | 10.05 | 4.13 | 7.89 |
| Epoch-5 | 5.98 | 10.29 | 7.58 | 11.51 | 2.92 | 7.19 |
| Epoch-10 | 8.24 | 11.54 | 7.76 | 12.10 | 2.73 | 7.21 |
| Metric | Baseline Model | Epoch-5 | Epoch-10 |
|---|---|---|---|
| Accuracy [%] ↑ | 87.3 | 84.4 | 84.4 |