Buildings
  • This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

14 November 2025

Reproducibility and Validation of a Computational Framework for Architectural Semantics: A Methodological Study with Japanese Architectural Concepts

1 Graduate School of Science and Engineering, Ritsumeikan University, Kusatsu 525-8577, Japan
2 Department of Architecture and Urban Design, College of Science and Engineering, Ritsumeikan University, Kusatsu 525-8577, Japan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Analysis, Conservation, and Refurbishment Methods of Heritage Architecture Based on Modern Technology

Abstract

Architectural discourse is a specialised language whose key terms shift with context, which complicates empirical claims about meaning. This study tests whether a rigorously audited, reproducible NLP framework can recover a core theoretical distinction in architectural language, the split between conceptual and physical terms, using Japanese terms as a focused case. The objective is to evaluate contextual embeddings against static baselines under controlled conditions and to release an end-to-end pipeline that others can rerun exactly. We assemble a ~1.98-million-word corpus spanning architecture, history, philosophy, and theology; train Word2Vec (CBOW and Skip-gram) models and fine-tune BERT on the same sentences; derive embeddings; and cluster terms with k-means and agglomerative methods. Internal validity is assessed with the Adjusted Rand Index (ARI) against a phenomenological gold-standard split, and external validity by correlation with WordSim-353. Robustness is examined through a negative-control relabelling and a definitional audit comparing FULL and CLEAN corpora. Seeds, versions, and artefacts are pinned for exact reruns in the archived environment; identical results across different hardware are not claimed. BERT cleanly recovers the split, with ARI 0.852 (FULL) and 0.718 (CLEAN), whereas both Word2Vec models hover near chance. BERT and CBOW show no seed variation, while Skip-gram is unstable across seeds. We provide a transparent, reusable methodology, with released assets, that enables falsifiable and scalable claims about architectural semantics.
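
To make the internal-validity measure concrete, the following is a minimal sketch of how an Adjusted Rand Index can be computed between a gold-standard split and a set of cluster assignments. It is a standard-library re-implementation of the usual pair-counting formula and is not the study's released code. The ten labels and the three cluster assignments are hypothetical placeholders chosen for illustration, not data or results from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(gold) != len(pred):
        raise ValueError("labelings must have the same length")
    if len(gold) < 2:
        raise ValueError("need at least two items to form a pair")
    # Count item pairs that share a contingency cell, a gold class,
    # or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(gold, pred)).values())
    sum_gold = sum(comb(c, 2) for c in Counter(gold).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_gold * sum_pred / comb(len(gold), 2)  # chance agreement
    max_index = (sum_gold + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical gold split for ten terms (assumption, not the paper's list).
gold = ["conceptual"] * 5 + ["physical"] * 5

# Hypothetical cluster ids; ARI ignores how clusters are named.
perfect = [1] * 5 + [0] * 5        # recovers the split exactly
one_off = [1] * 4 + [0] * 6        # one conceptual term misplaced
alternating = [0, 1] * 5           # unrelated to the gold split

ari_perfect = adjusted_rand_index(gold, perfect)
ari_one_off = adjusted_rand_index(gold, one_off)
ari_alternating = adjusted_rand_index(gold, alternating)

print(f"perfect:     {ari_perfect:.3f}")
print(f"one_off:     {ari_one_off:.3f}")
print(f"alternating: {ari_alternating:.3f}")
```

Because the index is corrected for chance, a clustering unrelated to the gold split scores close to zero, which is why values "near chance" and a negative-control relabelling are meaningful reference points for a score such as 0.852.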
