Peer-Review Record

Reproducibility and Validation of a Computational Framework for Architectural Semantics: A Methodological Study with Japanese Architectural Concepts

Buildings 2025, 15(22), 4107; https://doi.org/10.3390/buildings15224107
by Gledis Gjata and Satoshi Yamada
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 23 October 2025 / Revised: 7 November 2025 / Accepted: 11 November 2025 / Published: 14 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a computational framework for examining polysemous architectural concepts using both static and contextual word embeddings. The study demonstrates high reproducibility and validation performance, particularly with BERT models, in distinguishing physical and conceptual semantics within architectural discourse. The framework contributes to bridging qualitative interpretive theory with quantitative NLP methods, advancing reproducible research in computational humanities.

Recommendation:
Major Revision

Final note:
Please refer to the detailed comments in the attachment for specific suggestions.

Comments for author File: Comments.pdf

Author Response

We thank the reviewer for these insightful comments. The manuscript has undergone significant revisions to address both reviewers' comments; our point-by-point responses are given below.

  • Comment 1:
    The abstract should be revised to clearly and concisely address the essential components of a scientific abstract. Specifically, it should include: (1) the research purpose—what problem is being addressed and why it is significant; (2) the objective of the study—what the research aims to achieve; (3) the methodology—what approach or techniques were used to conduct the study; (4) the key findings—what are the main results and their implications; and (5) the research contribution—what is the originality or novelty of this work and its value to the field. Presenting these elements will help the reader quickly understand the scope, significance, and contribution of the study.

    Answer:
    We rewrote the abstract to explicitly cover the five required elements. It now states the problem and significance up front, namely that context sensitivity in architectural language hinders empirical claims and motivates an audited, reproducible NLP approach. The objective is made explicit: to test whether contextual embeddings outperform static baselines on a theory-driven conceptual versus physical split, using Japanese terms as a focused case, and to release an end-to-end, rerunnable pipeline.
    Methodology is summarised concretely: a ~1.98-million-word corpus across architecture, history, philosophy, and theology; Word2Vec (CBOW, Skip-gram) and a fine-tuned BERT trained on the same sentences; clustering via K-Means and Agglomerative; internal validity via ARI against a phenomenological gold split; an external check via WordSim-353; and robustness via a negative-control relabelling and a definitional audit (FULL vs CLEAN), with seeds, versions, and artefacts pinned for exact reruns. Key findings are now reported with exact figures: BERT recovers the split with ARI 0.852 (FULL) and 0.718 (CLEAN), Word2Vec hovers near chance with Skip-gram unstable across seeds, and seed variance is negligible for BERT and CBOW. Contribution and novelty are spelt out: a transparent, reproducible framework that enables falsifiable, scalable claims about architectural semantics and clarifies when general benchmarks fail to proxy domain success.
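
    For illustration, a minimal sketch of the clustering-versus-gold evaluation summarised above (Python; `term_vectors`, `gold_labels`, and the two-cluster setup are assumed placeholders, not the released pipeline itself):

        # Minimal sketch: score K-Means and Agglomerative clusterings of term
        # embeddings against a gold binary split via Adjusted Rand Index.
        # `term_vectors` (28 x d matrix) and `gold_labels` (0 = physical,
        # 1 = conceptual) are assumed inputs.
        import numpy as np
        from sklearn.cluster import AgglomerativeClustering, KMeans
        from sklearn.metrics import adjusted_rand_score

        def ari_for_split(term_vectors, gold_labels, seed=42):
            km = KMeans(n_clusters=2, random_state=seed, n_init=10).fit_predict(term_vectors)
            agg = AgglomerativeClustering(n_clusters=2).fit_predict(term_vectors)
            # ARI is invariant to cluster-label permutation, so cluster ids
            # need not match the gold coding.
            return {
                "kmeans_ari": adjusted_rand_score(gold_labels, km),
                "agglomerative_ari": adjusted_rand_score(gold_labels, agg),
            }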



  • Comment 2:
    The authors are requested to provide supporting evidence or a reference for this statement. Currently, the claim appears to be asserted without sufficient justification. Adding an appropriate citation, data source, or explanation would enhance the credibility and academic rigor of the manuscript.

    Answer:
    Citation added.



  • Comment 3:
    The manuscript would benefit from a clearer and more explicit articulation of the research target. While the general topic is introduced, the specific objectives and scope of the study are not sufficiently highlighted. The authors are encouraged to clearly define the central research question or hypothesis and emphasize how the study aims to address it. Strengthening the focus in this way will help readers better understand the study's purpose and contextualize its contributions within the broader field.

    Answer:
    We rewrote the final part of the Introduction to make the target, scope, and objectives explicit and operational. It now opens with a precise aim, namely to test whether a contextual language model, within an audited and fully reproducible pipeline, recovers theoretically grounded distinctions in Japanese architectural discourse more reliably than static baselines. We then state a single central research question in one sentence, followed by compact objectives that map directly to the analyses reported later: test contextual vs static on the conceptual–physical split using ARI, examine task alignment by relating ARI to WordSim-353, run a negative-control relabelling to probe sensitivity, and audit definitional bias by comparing FULL and CLEAN corpora. To anchor the scope, we explicitly state that the contribution is methodological rather than ethnographic, utilising Japanese terms as a tractable, theory-rich testbed. We make the evidential checks visible at the outset by summarising the three validation layers and the negative control, so readers know what is being claimed and how it will be tested before entering the Methods section.



  • Comment 4:
    It is recommended that the authors include a brief overview of the manuscript structure at the end of the Introduction. Providing a concise roadmap of the subsequent sections will improve the clarity and readability of the paper, helping readers to better follow the flow of the argument and understand how the research is organized. This is a common academic convention and enhances the overall coherence of the manuscript.

    Answer:
    We added a concise roadmap paragraph at the end of the Introduction to signpost the paper’s flow. The paragraph now states, in order, that Section 2 details the corpus, preprocessing, models, and the three validation layers including the negative control and definitional audit; Section 3 reports results for clustering alignment, benchmark alignment, and robustness under the CLEAN vs FULL manipulation; Section 4 interprets findings with limitations and implications for architectural theory; Section 5 concludes and outlines extensions. This brief overview makes the argumentative path explicit, helping readers anticipate where each objective is addressed.



  • Comment 5:
    The authors may consider adding the corresponding Japanese terms alongside these key architectural concepts. Including the original Japanese expressions could improve clarity and cultural precision, particularly for terms with nuanced or context-dependent meanings. This addition would help readers better understand the conceptual depth and linguistic subtleties inherent in the discussed terminology.

    Answer:
    To address the reviewer's comment and establish a clear vocabulary for this analysis, the paper now draws upon a core set of 28 Japanese architectural and aesthetic terms. These include conceptual philosophies such as ma (間), mu (無), and wabi-sabi (侘寂); architectural features such as engawa (縁側), tokonoma (床の間), and shōji (障子); and garden or boundary elements such as shakkei (借景), roji (路地), and torii (鳥居).
    A complete glossary of all 28 terms, with definitions, is provided in Appendix A.
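
    For concreteness, the gold split can be held as a plain mapping; the assignments below are an illustrative sketch only, with the authoritative labels being those in Appendix A:

        # Illustrative sketch of the conceptual-vs-physical gold split as a dict.
        # Assignments shown here are examples; Appendix A is authoritative.
        GOLD_SPLIT = {
            "ma": "conceptual", "mu": "conceptual", "wabi-sabi": "conceptual",
            "engawa": "physical", "tokonoma": "physical", "shōji": "physical",
            "shakkei": "physical", "roji": "physical", "torii": "physical",
            # ...remaining terms per Appendix A
        }
        gold_labels = [1 if v == "conceptual" else 0 for v in GOLD_SPLIT.values()]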



  • Comment 6:
    The model training section lacks a clear justification for the chosen hyperparameters and configuration settings. The authors should explain why these specific values (e.g., vector size, window size, learning rate, epochs) were selected and how they impact model performance. Without such rationale, it is difficult to assess whether the configurations are optimal or simply arbitrary.

    Answer:
    We revised the model training section to justify hyperparameters and show they were not arbitrary. For BERT, we report a systematic grid over learning rate, batch size, epochs, and gradient accumulation, with selection by validation loss and an overfitting check on training curves. Although the lowest-loss setting emerged at 3e-5 with 3 epochs, logs indicated onset of overfitting after epoch two, so we adopted gradient accumulation 2 to reduce effective step size and stabilise generalisation. For Word2Vec, we ran a grid over vector size, window, learning rate, and architecture (CBOW vs Skip-gram). The numerically best CBOW run used a very large window, but nearest-neighbour inspection exposed a hubness-like collapse around the token “kami”. Because our goal is a robust, comparable baseline rather than chasing a corpus-specific artefact, we standardised both architectures on moderate settings (vector_size 100, window 9, min_count 3), which suppress the “kami” concentration and yield fairer CBOW–Skip-gram comparisons.
    All search ranges, trials, losses, and chosen checkpoints are logged in the repository, with seeds, software versions, and configs pinned. This makes the trade-offs auditable and the rationale clear: minimise validation loss without overfitting for BERT, avoid hubness and ensure architectural parity for Word2Vec, and prioritise stability and reproducibility over fragile, corpus-idiosyncratic optima.
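
    A minimal gensim sketch of the standardised baseline settings described above (`sentences` is an assumed tokenised corpus; the logged configs in the repository are authoritative):

        # Sketch of the standardised Word2Vec baselines with the settings above.
        # `sentences` is an assumed list of tokenised sentences; workers=1 keeps
        # training deterministic for a fixed seed.
        from gensim.models import Word2Vec

        common = dict(vector_size=100, window=9, min_count=3, seed=42, workers=1)
        cbow = Word2Vec(sentences, sg=0, **common)      # CBOW architecture
        skipgram = Word2Vec(sentences, sg=1, **common)  # Skip-gram architecture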



  • Comment 7:
    The section does not specify how the reproducibility measures were independently verified or tested. While deterministic settings and artefact logging are described, it remains unclear whether reproducibility was validated across different systems or environments. The authors should provide empirical evidence or comparative results demonstrating that the reported reproducibility claims hold beyond the described hardware and software setup.

    Answer:
    We added an explicit cross-environment verification and reported the results. Section 2.7 now documents three archived reruns under two distinct Python stacks, for example Torch 2.5.1 with Transformers 4.56.0 versus Torch 2.2.2 with Transformers 4.51.3. With fixed seeds and deterministic settings, BERT and CBOW reproduced the same ARI values and cluster assignments on both FULL and CLEAN across these environments. Skip-gram likewise matched exactly for any given seed, with variability arising only across seeds, not across stacks. Each run ships a machine snapshot with code, config and corpus SHA-256 hashes plus environment metadata, allowing byte-level audit of differences. We state the boundary clearly: byte-identical outputs on heterogeneous hardware are not claimed, but environment-level reproducibility beyond a single setup is empirically demonstrated.
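
    A minimal sketch of the per-run snapshot pattern described above (file names and manifest layout are assumptions; the repository's actual manifests are authoritative):

        # Sketch: record SHA-256 hashes of run inputs plus environment metadata
        # so reruns on other stacks can be audited byte-by-byte.
        import hashlib, json, platform, sys

        def sha256_of(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    h.update(chunk)
            return h.hexdigest()

        snapshot = {
            "corpus_sha256": sha256_of("corpus_full.txt"),  # assumed file name
            "config_sha256": sha256_of("config.yaml"),      # assumed file name
            "python": sys.version,
            "platform": platform.platform(),
        }
        try:
            import torch, transformers
            snapshot["torch"] = torch.__version__
            snapshot["transformers"] = transformers.__version__
        except ImportError:
            pass

        with open("run_snapshot.json", "w") as f:
            json.dump(snapshot, f, indent=2)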



  • Comments 8 and 11 (identical text, addressed together):
    The text within several figures appears too small to be easily readable. The authors should enlarge the font size of all labels, legends, and annotations to ensure clarity and consistency across the figures. Improving text legibility is essential for effective communication of visual information and compliance with publication-quality standards.

    Answer:
    We consolidated paired images into single composite figures with shared axes and unified legends, raised font sizes and line weights, and exported all revised figures at high resolution. Captions were tightened and panel labels were added for direct cross-comparison. For the co-occurrence analysis, the full multi-panel view is too dense for print, so the manuscript now presents a single representative comparison of the same target term across the two corpora, with the complete set available in the public repository.
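
    A minimal matplotlib sketch of the consolidation pattern (shared axes, unified legend, enlarged fonts, high-resolution export); the data and labels are placeholders:

        # Sketch of the composite-figure pattern used for the revisions:
        # two panels with a shared y-axis, one figure-level legend, larger
        # fonts, and a 300 dpi export. Data are placeholders.
        import matplotlib.pyplot as plt

        plt.rcParams.update({"font.size": 14, "lines.linewidth": 2})

        x = list(range(5))
        y_full, y_clean = [1, 3, 2, 4, 3], [1, 2, 2, 3, 3]  # placeholder data

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
        line1, = ax1.plot(x, y_full)
        line2, = ax2.plot(x, y_clean)
        ax1.set_title("(a) FULL corpus")
        ax2.set_title("(b) CLEAN corpus")
        fig.legend([line1, line2], ["FULL", "CLEAN"], loc="upper center", ncol=2)
        fig.savefig("figure_composite.png", dpi=300, bbox_inches="tight")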



  • Comment 9:
    The section describes the use of multiple random seeds but does not provide a clear quantitative assessment of model stability across these seeds. The authors should include statistical measures (e.g., variance, confidence intervals, or effect size) to demonstrate how seed variation influences model outcomes and confirm the robustness of their reported results. Without such analysis, the claim of model stability remains insufficiently supported.

    Answer:
    We have added a quantitative cross-seed stability analysis and report it in the Results with exact statistics. For each model, we compute the mean ARI, sample SD, 95% t-interval for the mean, and the observed range across six seeds (n = 6, df = 5). BERT is seed-invariant on both corpora (FULL ARI = 0.8521, CLEAN ARI = 0.7182; SD = 0, identical min–max), which indicates that variance is driven by corpus manipulation rather than stochasticity. CBOW is likewise seed-invariant at ARI = 0.1326 on both corpora (SD = 0). Skip-gram shows material seed sensitivity: FULL mean ARI = −0.1002 (SD 0.0226; 95% CI [−0.1239, −0.0766]; min–max [−0.1096, −0.0542]); CLEAN mean ARI = −0.1079 (SD 0.0040; 95% CI [−0.1120, −0.1037]; min–max [−0.1096, −0.0998]). Figure 4 visualises per-seed ARI for K-Means on FULL and CLEAN; Table 1 summarises the descriptive statistics; Appendix B provides complete per-seed results, including permutation p-values and bootstrap CIs. This resolves the ambiguity: stability for BERT and CBOW is empirically demonstrated (ΔARI = 0 across seeds), while Skip-gram's variability is quantified and bounded.
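
    A minimal sketch of the per-model descriptive statistics reported above (`aris` is an assumed list of per-seed ARI values for one model and corpus):

        # Sketch: mean, sample SD, 95% t-interval (n = 6, df = 5), and the
        # observed range across seeds for a given model/corpus condition.
        import numpy as np
        from scipy import stats

        def seed_stats(aris, confidence=0.95):
            a = np.asarray(aris, dtype=float)
            n = len(a)
            mean, sd = a.mean(), a.std(ddof=1)  # sample SD
            t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
            half = t_crit * sd / np.sqrt(n)
            return {
                "mean": mean,
                "sd": sd,
                "ci95": (mean - half, mean + half),
                "range": (a.min(), a.max()),
            }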



  • Comment 10:
    The section presents aggregate results but lacks a statistical comparison or significance testing between models. The authors should provide a formal evaluation (e.g., t-test, ANOVA, or non-parametric equivalent) to determine whether the observed performance differences between BERT, CBOW, and Skip-gram are statistically significant. Without such comparative analysis, the claim that BERT is “superior” remains descriptive rather than empirically validated.

    Answer:
    We added a formal, model-level comparative analysis. ARI is re-computed under paired resampling of the 28 terms with replacement (and over seeds where applicable), yielding a sampling distribution for the pairwise differences between BERT, CBOW, and Skip-gram within each corpus. We report ΔARI with bootstrap 95% CIs and paired permutation tests for the null of equal performance, together with effect sizes (Cliff's δ). A repeated-measures Friedman test, with corpus as a blocking factor, confirms a main effect of model, and Wilcoxon signed-rank post-hoc tests with Holm correction show BERT > CBOW and BERT > Skip-gram on both FULL and CLEAN (all p < 0.001; large effects, δ ≈ 1.0). L2 normalisation improves Skip-gram, which nonetheless remains significantly below BERT. The revised Results now include these statistics (figure caption updated, new comparative table added), so the claim of superiority is supported by inferential tests rather than descriptive gaps.
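
    A minimal sketch of the paired resampling over terms (bootstrap CI for ΔARI) and a paired sign-flip permutation test; inputs and helper names are assumptions:

        # Sketch: bootstrap the ARI difference between two models by resampling
        # the 28 terms with replacement, and test paired per-replicate
        # differences with a sign-flip permutation. Inputs are assumed arrays:
        # `gold` (gold split), `labels_a`/`labels_b` (per-term cluster ids).
        import numpy as np
        from sklearn.metrics import adjusted_rand_score

        rng = np.random.default_rng(42)

        def bootstrap_delta_ari(gold, labels_a, labels_b, n_boot=10_000):
            gold, a, b = map(np.asarray, (gold, labels_a, labels_b))
            n = len(gold)
            deltas = np.empty(n_boot)
            for i in range(n_boot):
                idx = rng.integers(0, n, size=n)  # resample terms with replacement
                deltas[i] = (adjusted_rand_score(gold[idx], a[idx])
                             - adjusted_rand_score(gold[idx], b[idx]))
            lo, hi = np.percentile(deltas, [2.5, 97.5])
            return deltas.mean(), (lo, hi)  # ΔARI and its bootstrap 95% CI

        def sign_flip_pvalue(paired_diffs, n_perm=10_000):
            d = np.asarray(paired_diffs, dtype=float)
            obs = d.mean()
            flips = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))
            null = (flips * d).mean(axis=1)  # null of equal performance
            return float((np.abs(null) >= abs(obs)).mean())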



  • Comment 12:
    The section does not clarify how the “negative control” setup ensures that the observed decline in performance is statistically meaningful rather than coincidental. The authors should quantify the sensitivity test using appropriate statistical analysis (e.g., confidence intervals, p-values, or effect size) to demonstrate that the observed performance drop is significant. Additionally, it remains unclear whether other potential confounding factors—such as corpus imbalance or overfitting—were ruled out during this validation.

    Answer:
    We revised the negative-control section to include formal statistics and to rule out confounds. The relabelling changes only the gold label for a single term (tokonoma) while holding the corpus, embeddings, clustering outputs, seeds, and code fixed, so it is an evaluation-only perturbation rather than a retraining or sampling change. We now report paired ΔARI relative to the valid hypothesis with bootstrap 95% CIs and paired permutation tests across the six seeds: FULL drops from 0.8521 to 0.7190 (ΔARI = −0.1331; CI excludes 0; p < 0.001) and CLEAN drops from 0.7182 to 0.5970 (ΔARI = −0.1212; CI excludes 0; p < 0.001), with large effect sizes (Cliff's δ ≈ 1.0). As an additional check, we constructed a null distribution from 1,000 random single-term flips of the gold labels and show that the observed ΔARI for BERT falls in the extreme tail (p < 0.001). CBOW and Skip-gram exhibit no statistically coherent response under the same tests. Potential confounds are addressed directly: corpus balance is unchanged, since the manipulation is a single-item swap in a 28-term set; overfitting cannot explain the effect because the models are not retrained for the negative control; and cross-environment reruns reproduce the same paired differences. These additions demonstrate that the performance decline under the falsified label is statistically meaningful rather than coincidental.
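
    A minimal sketch of the single-term flip null described above (the paper's exact null construction is authoritative; names here are assumptions):

        # Sketch: compare the ΔARI from the observed tokonoma relabelling with
        # a null of random single-term gold-label flips, holding the cluster
        # assignment fixed. `gold` is the valid 0/1 split; `pred` the clusters.
        import numpy as np
        from sklearn.metrics import adjusted_rand_score

        rng = np.random.default_rng(42)

        def flip_one(gold, idx):
            flipped = gold.copy()
            flipped[idx] = 1 - flipped[idx]
            return flipped

        def single_flip_null(gold, pred, observed_idx, n_null=1000):
            gold, pred = np.asarray(gold), np.asarray(pred)
            base = adjusted_rand_score(gold, pred)
            obs_delta = adjusted_rand_score(flip_one(gold, observed_idx), pred) - base
            null = np.array([
                adjusted_rand_score(flip_one(gold, rng.integers(len(gold))), pred) - base
                for _ in range(n_null)
            ])
            # Tail probability: how often a random flip hurts at least as much.
            return obs_delta, float((null <= obs_delta).mean())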



  • Comment 13:
    The section draws strong conclusions about the superiority of contextual embeddings but lacks quantitative evidence to support the claim beyond descriptive comparisons. The authors should perform statistical significance testing between models and report confidence intervals to substantiate the stated differences. Additionally, the discussion does not examine potential biases introduced by corpus size, domain specificity, or data imbalance, which could have contributed to the observed performance gap between BERT and Word2Vec.

    Answer:
    We strengthened the claim with formal model-level comparisons and explicit intervals. Using the six seeds as paired replicates, we calculated ΔARI per seed and corpus and ran a Friedman test with corpus as a blocking factor, followed by Wilcoxon signed-rank post-hoc tests with Holm correction. Results: BERT > CBOW and BERT > Skip-gram on both FULL and CLEAN (all p < 0.001), with large effect sizes (Cliff's δ ≈ 1.0). We also report bootstrap 95% CIs for the pairwise ΔARI and exact permutation p-values for each model's ARI, so the superiority claim is inferential rather than descriptive. These statistics and intervals are now included alongside the ARI tables and figure captions.

    We also address possible bias sources directly. Corpus size is held constant across models, which are trained on the same sentence set, so differences are not attributable to unequal data exposure. Domain specificity is part of the research question, and we make this explicit by contrasting the theory-driven ARI objective with the generic WordSim-353 proxy, showing that BERT's advantage holds regardless of proxy alignment. Class priors are fixed across all conditions, and the negative-control relabelling flips a single item without retraining, preserving corpus balance. Token coverage checks confirm that all 28 target terms exceed the Word2Vec frequency threshold (min_count = 3), removing an out-of-vocabulary confound, while the CLEAN vs FULL comparison rules out definitional inflation as the driver of BERT's gains. Cross-seed and cross-environment analyses show BERT and CBOW are seed-invariant, and Skip-gram's variability is quantified and bounded. Together, these controls and tests indicate that the observed gap reflects model architecture rather than artefacts of corpus size, domain skew, or imbalance.
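
    A minimal sketch of the omnibus and post-hoc testing pattern (scipy/statsmodels; the per-replicate ARI arrays are assumed inputs aligned by seed and corpus):

        # Sketch: Friedman omnibus test across the three models, then pairwise
        # Wilcoxon signed-rank tests with Holm correction. `ari_*` are assumed
        # arrays of per-replicate ARI values aligned by (seed, corpus).
        from scipy.stats import friedmanchisquare, wilcoxon
        from statsmodels.stats.multitest import multipletests

        def compare_models(ari_bert, ari_cbow, ari_sg):
            _, p_omnibus = friedmanchisquare(ari_bert, ari_cbow, ari_sg)
            raw = {
                "BERT vs CBOW": wilcoxon(ari_bert, ari_cbow).pvalue,
                "BERT vs Skip-gram": wilcoxon(ari_bert, ari_sg).pvalue,
                "CBOW vs Skip-gram": wilcoxon(ari_cbow, ari_sg).pvalue,
            }
            _, p_holm, _, _ = multipletests(list(raw.values()), method="holm")
            return p_omnibus, dict(zip(raw.keys(), p_holm))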

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a novel study on architectural semantics for Japanese terminology. Here are several major revisions and suggestions:

  • Line 101: Add a citation for Word2Vec and include its full name when it first appears.

  • Line 105: Add a citation for BERT and include its full name when it first appears.

  • Line 163: RQ2 is not really a research question. Similarly, in Line 171, H2 is not a valid hypothesis; fine-tuning and comparison are standard steps for learning-based methods.

  • The Research Questions and Hypotheses section reads like a student thesis format. It is recommended to highlight the research gap and identify the contributions using bullet points.

  • Line 182: Provide more details about the data mining process used to retrieve data from 845 webpages.

  • Line 205: Provide more information on the paper selection process for the 10 peer-reviewed papers.

  • Figure 1: Difficult to compare side-by-side; please combine them into one figure and improve the resolution.

  • Figures 2–3: Difficult to read; consider improving layout and clarity.

  • Figure 5: The two subfigures can be plotted together as one figure.

  • Figures 6–14: Hard to read; most figures should be redone with better clarity and labeling.

  • Overall, the author should spend more time and attention presenting the results carefully and improving figure quality.

Author Response

We thank the reviewer for these insightful comments. The manuscript has undergone significant revisions to address both reviewers' comments; our point-by-point responses are given below.

  • Comment 1: Line 101: Add a citation for Word2Vec and include its full name when it first appears.

    Response:
    Addressed, reference added.

  • Comment 2: Line 105: Add a citation for BERT and include its full name when it first appears.

    Response:
    Addressed, reference added.

  • Comment 3: Line 163: RQ2 is not really a research question. Similarly, in Line 171, H2 is not a valid hypothesis; fine-tuning and comparison are standard steps for learning-based methods.

    The Research Questions and Hypotheses section reads like a student thesis format. It is recommended to highlight the research gap and identify the contributions using bullet points.


    Answer:
    We agree with the reviewer's comments and have restructured the section to address these concerns. The thesis-style list has been replaced with a short aim-and-scope statement, one central question, and a compact set of objectives, followed by separate “Research gap” and “Contributions” paragraphs. Procedural steps, such as fine-tuning and model comparison, are no longer presented as hypotheses and have been moved to the Methods section.
    RQ2 has been reframed as a substantive question about benchmark–task alignment, and H2 now states a falsifiable claim about rank-order divergence between ARI and WordSim-353 rather than restating the procedure. We also clarify which elements are confirmatory (H1, H3) and mark the alignment analysis as exploratory. The negative-control relabelling and definitional audit are presented as validity checks within the framework, not as standalone research questions.


  • Comment 4: Line 182: Provide more details about the data mining process used to retrieve data from 845 webpages.

    Answer:
    We addressed the comment by expanding the data-mining description to specify seeds, crawl logic, filtering, and the audit trail. The crawl begins with the query “architecture” on Britannica and Wikipedia, then follows hyperlinks recursively to related topics, excluding proper nouns and off-topic technical domains. For each page, we retain only the main-body prose and technical figure captions, and we record the URL, title, site, access date (July 2025), and a content checksum. The method logs seed queries, recursion depth, timestamps, and configuration hashes, with the scripted crawls released in the public repository, ensuring the snapshot is both time-bounded and reproducible. We also detail the scholarly augmentation: ten peer-reviewed articles drawn as a simple random sample from the eligible J-STAGE and ProQuest results, after screening for at least one target term, and used solely as corpus text, not for labels or metrics. These additions make the crawl-and-expand pipeline, the inclusion and exclusion rules, and the reproducibility controls explicit.
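
    A minimal sketch of the seeded, depth-bounded crawl with an audit trail (seed URLs, filters, and depth are placeholders; the released scripts are authoritative):

        # Sketch: breadth-first crawl from seed pages, keeping main-body text,
        # logging URL, title, depth, and a content checksum per page.
        import hashlib
        from collections import deque
        from urllib.parse import urljoin

        import requests
        from bs4 import BeautifulSoup

        SEEDS = ["https://en.wikipedia.org/wiki/Architecture"]  # assumed seed
        MAX_DEPTH = 2                                           # assumed bound

        def crawl(seeds=SEEDS, max_depth=MAX_DEPTH):
            seen, queue, log = set(seeds), deque((u, 0) for u in seeds), []
            while queue:
                url, depth = queue.popleft()
                soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
                body = soup.get_text(" ", strip=True)  # simplified main-body extraction
                log.append({
                    "url": url,
                    "title": soup.title.string if soup.title else "",
                    "depth": depth,
                    "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
                })
                if depth < max_depth:
                    for a in soup.find_all("a", href=True):
                        nxt = urljoin(url, a["href"])
                        if nxt.startswith("https://en.wikipedia.org/wiki/") and nxt not in seen:
                            seen.add(nxt)
                            queue.append((nxt, depth + 1))
            return log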


  • Comment 5: Line 205: Provide more information on the paper selection process for the 10 peer-reviewed papers.

    Answer:
    We clarified the paper selection procedure. We queried J-STAGE and ProQuest with “Japanese/ space/ Engawa”, screened for peer-reviewed journal articles that mentioned at least one target term (e.g., engawa, ma, mu, shakkei) in the title, abstract, or main text, and then drew a simple random sample of ten from the eligible set to minimise selection bias while diversifying venues and registers. These papers were used solely as corpus text, not to set labels or tune evaluation metrics, and for each we logged the database source, journal, DOI or stable URL, access date (July 2025), and a content checksum to preserve reproducibility.

  • As comments 6, 7, 8 and 9 raise the same issue, and we implemented the same remedy across all of them, we address them together here.
    -Comment 6: Figure 1: Difficult to compare side-by-side; please combine them into one figure and improve the resolution.
    -Comment 7: Figures 2–3: Difficult to read; consider improving layout and clarity.
    -Comment 8: Figure 5: The two subfigures can be plotted together as one figure.
    -Comment 9: Figures 6–14: Hard to read; most figures should be redone with better clarity and labeling.

    Answer:
    We consolidated paired panels into single composite figures with shared axes and unified legends, raised font sizes and line weights, and exported all revised figures at high resolution. Captions were tightened and panel labels added for direct cross-comparison. For the Co-occurrence Analysis, the full multi-panel view is too dense for print. Therefore, the manuscript now presents a single representative comparison of the same target term across the two corpora, with the complete set available in the public repository.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Accepted

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all previous comments.
