Article

Artificial Intelligence and Real Estate Valuation: The Design and Implementation of a Multimodal Model

by
Nikolaos Karanikolas
*,
Eleni Kyriakidou
and
Eleni Athanasouli
School of Spatial Planning and Development, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
*
Author to whom correspondence should be addressed.
Information 2025, 16(12), 1049; https://doi.org/10.3390/info16121049
Submission received: 22 October 2025 / Revised: 24 November 2025 / Accepted: 26 November 2025 / Published: 1 December 2025

Abstract

The valuation of real estate is a fundamental process for the proper functioning of the market and the formulation of public policies. The most established of the traditional methodologies is the comparative method, which relies on data that is often incomplete or unreliable, especially in real estate markets with limited transparency. In contrast, online advertisements offer a wealth of unstructured information, which requires advanced analysis techniques. Even though they reflect asking prices rather than final market values, they provide a wealth of data that is particularly useful for market transparency. Real estate advertisements are also enriched with photographs and videos of the property. This paper proposes a methodological framework that integrates artificial intelligence techniques (natural language processing and computer vision) to extract structured features from the text and photographs of advertisements. The resulting dataset feeds the comparative method of property valuation, applied to thousands of properties. Geographic analysis and neighborhood characteristics enrich the proposed algorithm, covering the available geodata that affect real estate values. The empirical analysis of apartments in the city of Thessaloniki demonstrates significant improvements in valuation accuracy and in the completeness of property characteristics. At the same time, the prediction of the energy class of properties, among other attributes that the proposed methodology can accurately estimate, further enriches the valuation process. The work highlights the potential of integrating artificial intelligence into modern valuation practice, offering a transparent, scalable, and auditable tool for professionals and policymakers.

Graphical Abstract

1. Introduction

Real estate valuation is a critical component of the real estate market, spatial policy planning, real estate investments, and financial stability. The comparative method, although dominant in valuation practice, presents limitations due to incomplete or heterogeneous data. In contrast, digital real estate advertisements offer rich, unstructured data that remains largely untapped. Artificial intelligence (AI), through natural language processing (NLP) and computer vision techniques, enables the transformation of this data into structured features suitable for integration into automated valuation models (AVMs).
Research on AVMs has made significant progress by integrating various sources of information, including images, text, and spatial data. Despite these technical advances, few studies have proposed fully integrated, repeatable frameworks that combine natural language processing, computer vision, and accessibility indicators in a transparent, auditable valuation process.
This paper develops and implements a multimodal methodological framework that extracts quantitative and qualitative features from text and image advertisements, predicts energy indicators, and synthesizes them with spatially contextual geodata to create a standardized benchmark. Based on thousands of ads from the city of Thessaloniki, the accuracy of estimates is evaluated against traditional methods and other automated valuation models (AVMs).
In addition, interpretability, governance, and ethical issues are examined, and the model’s final quantitative performance is documented. The paper demonstrates the added value of AI in real estate valuation and proposes a reproducible, transparent, and scalable framework helpful to appraisers, investors, and regulators.
The proposed framework leverages textual descriptions, visual data, and spatial features, which are processed through specialized algorithms to extract unified representations. Texts are analyzed with NLP models; images are analyzed with computer vision models; and spatial data are converted into indicators of accessibility and urban structure. The three sources of information are integrated into a common vector space through feature-level multimodal integration. In this way, the valuation model simultaneously incorporates the quantitative, visual, and spatial dimensions of each property, enhancing the accuracy and interpretability of the valuations.
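As a minimal illustration of this feature-level integration, the following sketch concatenates per-modality vectors into a single model input row; all vector contents and dimensions are hypothetical placeholders, not the paper's actual representations.

```python
# Minimal sketch of feature-level ("early") multimodal integration: each
# modality is first reduced to a fixed-length numeric vector by its own
# pipeline, and the vectors are concatenated into one model input row.
# All vector contents below are hypothetical placeholders.

def fuse_features(text_vec, image_vec, spatial_vec):
    """Concatenate per-modality vectors into a single feature row."""
    return list(text_vec) + list(image_vec) + list(spatial_vec)

text_vec = [0.12, -0.40, 0.88]   # e.g., pooled text embedding
image_vec = [0.55, 0.10]         # e.g., condition/renovation scores from CV
spatial_vec = [7.5, 42.0]        # e.g., walk time to centre, POI count

row = fuse_features(text_vec, image_vec, spatial_vec)
print(len(row))  # → 7
```

In practice each modality's vector would come from its own trained encoder, but the unification step itself is exactly this simple.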
In this context, the research is guided by the following key research questions:
(a)
How can natural language processing and visual intelligence techniques extract reliable characteristics from unstructured real estate advertisements?
(b)
To what extent does the multimodal integration of textual, visual, and geographical data improve the accuracy of estimates compared to the conventional comparative method?
(c)
By what mechanisms do the interpretation tools contribute to the transparency and reliability of a modern AVM?
The formulation of these questions identifies the theoretical framework within which the proposed methodology is developed and grounds its contribution to both the classical valuation literature and contemporary AI research.

2. Literature Review

2.1. Theoretical Foundations of Hedonic Pricing Models

Hedonic pricing assumes that consumers value the attributes of a good rather than the product itself, with the price incorporating implicit marginal values for each attribute (Lancaster, 1966) [1]. Rosen (1974) [2] formulated the corresponding price function in differentiated product markets, allowing the retrieval of the marginal willingness to pay (MWTP) under appropriate conditions.
The distinction between the first (empirical) and second (preference recognition) stages is crucial. The second stage requires strong assumptions for the identification of preferences and the production technology for features, with relevant contributions from Ekeland et al. (2004) [3] and Bajari & Benkard (2005) [4], who introduce unobserved attributes and consumer preference heterogeneity.
The choice of functional form affects the estimates: Cropper et al. (1988) [5] show the superiority of semilogarithmic and Box–Cox forms for curvilinear relationships, while the interpretation of binary variables requires the classical correction of Halvorsen and Palmquist (1980) [6].
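The Halvorsen–Palmquist point can be made concrete with a short calculation; the coefficient value below is hypothetical, not an estimate from this paper.

```python
import math

# Worked example of the Halvorsen–Palmquist correction: in a semilog hedonic
# model ln(P) = a + b*D + ..., the implied percentage price effect of a dummy
# D switching from 0 to 1 is 100*(exp(b) - 1), not 100*b. The coefficient
# below is a hypothetical value for illustration only.

def dummy_pct_effect(beta):
    return 100.0 * (math.exp(beta) - 1.0)

beta = 0.15  # e.g., a dummy for "recently renovated"
print(round(dummy_pct_effect(beta), 2))  # → 16.18, noticeably above the naive 15
```

The gap between the naive and corrected readings grows with the coefficient's magnitude, which is why the correction matters for large dummy effects.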
Endogeneity and spatiotemporal market dynamics complicate the retrieval of MWTP, with Kuminoff et al. (2010) [7] highlighting the risk of “conflation bias”. The adoption of identification designs (e.g., instrumental variables, natural experiments) is considered essential.
Overall, the hedonic model offers a robust theoretical framework, with identification conditions, appropriate functional forms, and handling of unobserved characteristics determining the reliability of the estimates. For a combined approach of theory, welfare applications, and practical valuation, Palmquist [8] and Freeman et al. [9] are fundamental points of reference.

2.2. Traditional Valuation Methods and Modern AVMs

The literature distinguishes four traditional valuation methods: the comparative method, the income method, the cost method, and special variations (Pagourtzi et al., 2003) [10]. The comparative method is dominant for residential properties and the income method for investment properties, while the cost method functions as a complement when sufficient transaction data are absent. The Appraisal Institute guidelines (2020) [11] analyze in detail the selection, adjustment, and uncertainty management processes in each methodology.
The transition to mass appraisals imposed statistical standards and quality controls (IAAO, 2020 [12]; 2021 [13]), while the IAAO AVMs standard (2018) [14] sets governance and documentation requirements. The RICS Valuation Standards (2024/2025) [15] align valuation ethics with international standards, with an emphasis on risk management and transparency.
AVMs have evolved technically with the adoption of ML/DL methods (e.g., random forests, GBDT, neural networks), with research documenting improvements in accuracy under rigorous validation and interpretability conditions (Jafary et al., 2024 [16]; Moreno-Foronda et al., 2025 [17]; Tapia et al., 2025 [18]).
At the regulatory level, the US Federal QCS (2024) [19] rule sets mandatory standards for AVMs (quality, manipulation, impartiality), aligned with IAAO and RICS (CFPB et al., 2024 [19]; Federal Register, 2024 [20]; Federal Reserve, 2024 [21]). Overall, integrating AVMs into valuation practice requires high-quality data, a clear purpose, rigorous validation, and governance consistent with international standards, while not ignoring the role of professional judgment.
Figure 1 illustrates the transition from hedonic and comparative valuation to a multimodal AI framework. The traditional logic of valuation—analysis of quantitative characteristics and comparison with similar properties—is extended through the introduction of unstructured information sources (descriptions of properties, photographs of sites) and high-resolution spatial data. The multimodal integration of these data allows for a model that maintains the theoretical basis of the comparative approach but significantly enhances the comprehensiveness and diagnostic capacity of the assessment process.

2.3. Standards and Regulations for Valuations and AVMs (IVS, RICS, IAAO, QCS)

The International Valuation Standards (IVS) [22] constitute the international framework of principles for property valuation, focusing on the use of appropriate data, documentation of assumptions, and reporting of uncertainty (IVSC, 2024/2025). The RICS Red Book (2025) [15] incorporates the IVS, imposing binding rules for valuation purposes, transparency of assumptions, and risk management.
For bulk valuations and the use of AVMs, the IAAO Standards (2021) [13] set out principles for data standardization, statistical evaluation, and transaction verifications. The complementary AVMs Standard (2018) [14] sets out governance and bias-control procedures, promoting transparency, repeatability, and the assessment of stability.
On a regulatory level, the QCS (2024) [19] rule of six US federal agencies requires quality control of AVMs for use in housing valuation matters, including data integrity, conflict of interest avoidance, inspections, and bias checks (CFPB et al., 2024 [19]; Federal Register, 2024 [20]; Federal Reserve, 2024 [21]). The rule went into effect on October 1, 2025.
The combined adoption of the IVS, the Red Book [15], the IAAO standards [13], and the QCS [19] creates a coherent governance framework for AVMs: a clear definition of purpose, data quality management, model documentation, and risk control, ensuring accountability and validity in the final valuation.

2.4. Determinants of Values

A combination of structural features, neighborhood properties, and accessibility indicators determines property values. The individual indicators consistently record the positive effect of the house’s surface area and significant effects for features such as the number of rooms and bathrooms, a garage, and the presence of a swimming pool. At the same time, the age of the building usually has an adverse effect (Sirmans et al., 2006) [23].
Correspondingly, environmental factors affect house values: greenery, low crime, and access to quality education infrastructure are positively associated with values, while noise or pollution are associated with lower values. Accessibility plays a central role: proximity to transit stations is positively associated with values, with the effect varying by transit type and methodology (Debrezion et al., 2007 [24]; Mohammad et al., 2013 [25]; Rennert, 2022 [26]).
Classical research confirms the existence of net accessibility effects even during exogenous neighborhood changes (Gibbons & Machin, 2005 [27]; Rojas et al., 2024 [28]). At the same time, it is recognized that other parallel interventions (redevelopment, TOD, zoning) often confound the results, emphasizing the need for careful identification of effects on housing values.
The magnitude of effects is ultimately contextual and influenced by factors such as the market/cycle matrix, measurement methods, and issues of spatial autocorrelation. For a reliable estimate, it is necessary to define spatiotemporal boundaries, check for parameter co-determination, evaluate the model’s functional forms and sensitivity, and finally use meta-analytic findings and benchmarks.

2.5. Energy Efficiency (EPC/BER) and Housing Values

The relevant literature on the relationship between housing values and their energy efficiency records, on average, a positive “green premium” for properties with higher energy efficiency, especially after the introduction of EPCs in EU countries (Hyland et al., 2013 [29]; Brounen & Kok, 2011 [30]; Fuerst et al., 2015 [31]; 2016 [32]). However, this effect shows significant heterogeneity across countries, property types, and methodologies.
Specialized research confirms the general positive trend of energy efficiency on housing values, but highlights variations in its magnitude, especially when spatial factors, market maturity, or sampling conditions are considered (Céspedes-López et al., 2019 [33]). Location ultimately always has a significant impact on housing values.
A critical issue is the “performance gap”, i.e., the discrepancy between theoretical efficiency (EPC) and actual consumption, due to the prebound effect (Sunikka-Blank & Galvin, 2012 [34]; Galvin, 2016 [35]). Therefore, the energy efficiency label functions more as an information signal than as an absolute performance indicator.
At the institutional level, Europe is promoting the harmonization of EPC systems to achieve greater, more reliable comparability and consistency of individual energy identity labels (Ruggieri et al., 2024 [36]; Sesana et al., 2024 [37]). This enhances the transferability of models and ultimately the validity of the estimates. Overall, the value–energy identity relationship is empirically supported but not universal, and estimates are more accurate when EPC is combined with the structural and spatial characteristics of dwellings.

2.6. From Listings to Features: NLP in Property Description Texts

Property descriptions in listings are a valuable but heterogeneous source of information that is rarely fully exploited by conventional valuation models. Natural language processing (NLP) enables the extraction of structured features from unstructured text, thereby improving the completeness and accuracy of valuation models. Using named entity recognition (NER) and variable normalization techniques, it is possible to retrieve basic features (area, floor, heating type, building condition) as well as qualitative features such as renovation quality or view (Shen & Ross, 2021) [38].
Recent studies have used word embeddings, attention-based models, and BERT-like architectures to detect latent features, thereby enhancing the predictive power of models (Zhang et al., 2024) [39]. At the same time, integrating linguistic features into multimodal models (text + image + GIS) enables correlating subjective descriptions with objective measurements, highlighting the added value of NLP in real estate valuation (Bottero et al., 2024) [40].
The present work adopts a hybrid approach combining NER and rule-based price extraction, focusing on the accuracy of detecting and normalizing basic entities (e.g., square meters, year of construction), as well as on enhancing transparency through full documentation of processing pipelines (Keraghel et al., 2024) [41].
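A minimal sketch of the rule-based side of such a pipeline follows; the regular expressions assume English-language listings and are illustrative, with a trained NER model handling the cases rules miss.

```python
import re

# Minimal sketch of rule-based entity extraction: regular expressions detect
# and normalize basic entities (area in sq m, year of construction) from a
# free-text listing. Patterns are illustrative assumptions, not the paper's
# exact rules; a production pipeline would pair this with a trained NER model.

AREA_RE = re.compile(r"(\d{2,4})\s*(?:sq\.?\s*m|m2)", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")

def extract_features(text):
    features = {}
    area = AREA_RE.search(text)
    if area:
        features["area_sqm"] = int(area.group(1))
    year = YEAR_RE.search(text)
    if year:
        features["year_built"] = int(year.group(1))
    return features

print(extract_features("Renovated flat, 85 sq.m, built in 1998, 3rd floor"))
# → {'area_sqm': 85, 'year_built': 1998}
```

Normalizing every extracted value into typed fields (integers, standard units) is what makes the output usable by the downstream comparative model.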

2.7. Computer Vision and Multimodal Models (Images + Text + Space)

Advertisement photos and aerial or satellite photos of areas convey “latent” information that is rarely fully captured in the structured fields that valuers work with. Early literature demonstrated that visual representations allow the extraction of useful property indicators that, when merged with classic features (area, year, floor), improve the value estimate (Poursaeed, Matera, & Belongie, 2018) [42]. In large-scale valuations (London), the combination of Street View and aerial photos with the classic features improves accuracy while producing interpretable “visual indicators of neighborhood desirability” that can be incorporated into economic studies (Law, Paige, & Russell, 2018) [43]. The results confirm that visual information “bridges” unobserved quality elements at both the housing and environmental levels (Poursaeed et al., 2018 [42]; Law et al., 2018 [43]).
Beyond aerial photographs, the use of exclusively property photographs offers systematic advantages over estimates without photographs, documenting that the “visual footprint” carries information associated with willingness to pay (You et al., 2016) [44]. More recent work based on user-generated images and perceptions (e.g., cleanliness, greenery, property facades) demonstrates that perceptual street variables can be incorporated as evidence of values, improving the fit and evaluative parameters (Chen et al., 2022) [45]. When combined with aerial photographs, multimodal models achieve more stable out-of-sample performance (Chahal et al., 2022) [46].
Methodologically, the combination of images with text and spatial features follows two main options:
  • Late fusion;
  • Early/intermediate fusion.
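The two options above can be contrasted with a toy sketch: late fusion combines the predictions of separate per-modality models, while early/intermediate fusion combines their features before a single model. All predictors, weights, and feature names here are hypothetical placeholders.

```python
# Toy contrast of the two fusion strategies. The three stand-in predictors
# below play the role of trained per-modality models; their coefficients and
# the fusion weights are hypothetical.

def predict_text(x):     return 1800.0 + 4.0 * x["desc_len"]      # €/sq m
def predict_image(x):    return 1700.0 + 300.0 * x["photo_qual"]
def predict_spatial(x):  return 2000.0 - 20.0 * x["cbd_min"]

def late_fusion(x, weights=(0.4, 0.3, 0.3)):
    """Late fusion: weighted combination of per-modality *predictions*."""
    preds = (predict_text(x), predict_image(x), predict_spatial(x))
    return sum(w * p for w, p in zip(weights, preds))

def early_fusion_features(x):
    """Early fusion: one *feature* row for a single downstream model."""
    return [x["desc_len"], x["photo_qual"], x["cbd_min"]]

x = {"desc_len": 120, "photo_qual": 0.8, "cbd_min": 12}
print(round(late_fusion(x), 1))  # → 2022.0
```

Late fusion keeps each modality's model auditable in isolation, whereas early fusion lets one model learn cross-modality interactions; the choice trades interpretability against flexibility.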
Critical issues of the validity of the chosen methodology are the following:
  • Representativeness of the photos (listing bias);
  • Spatial dependence and data leakage;
  • Interpretability;
  • Ethical issues.
Practically, for the correct implementation of the methodology in AVM and comparative method contexts, the following are suggested:
(a)
Min-spec of images per ad (e.g., ≥5 photos with key spaces);
(b)
Standardization of analysis/shooting angles;
(c)
External visual features of the neighborhood (Street View) to capture the micro-environment;
(d)
Integration with NLP/spatial features in late-fusion architecture as a baseline;
(e)
Complete out-of-time/out-of-area evaluations with calibration checks.
The proposed architecture composes a single multimodal representation through “late fusion”, where linguistic embeddings, visual features, and spatial features are unified in a common input table for tree-based boosting models. Each data category provides complementary information: text gives qualitative descriptions, images reflect materials and the state of conservation, and spatial features incorporate geographical accessibility and site value. This combined approach enhances the overall predictive power and methodological coherence of the system.

2.8. Machine Learning and AVMs in Home Value Estimation

The adoption of Machine Learning (ML) models—such as random forests, gradient boosting, and neural networks—significantly improves the estimation and prediction of home values when the price–feature relationship is nonlinear or includes latent features (Baur, 2023) [47]. Systematic studies show that ML models outperform linear hedonic models when strict evaluation protocols are followed, such as out-of-time/out-of-area validation, data leakage avoidance, and stability checks (Meszaros, 2024 [48]; Ecker, 2022 [49]).
The IAAO standards specify metrics such as COD, PRD, and PRB for assessing accuracy and equity in mass valuations (IAAO, 2025) [50], with examples of application to AVMs by public agencies (Yakima County, 2025) [51]. At the same time, there is growing interest in prediction intervals and probability calibration to accommodate uncertainty in a manner compatible with operational needs (Krause et al., 2019 [52]; Pollestad, 2024 [53]; Levi et al., 2022 [54]).
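One simple way to obtain calibrated prediction intervals of the kind discussed above is a split-conformal construction: the (1 − α) empirical quantile of absolute residuals on a held-out calibration set becomes a symmetric margin around new point predictions. The residual values below are illustrative.

```python
import math

# Sketch of a split-conformal prediction interval (one technique among those
# the uncertainty literature above covers; not necessarily the paper's exact
# method). Calibration residuals are illustrative euro values.

def conformal_margin(cal_residuals, alpha=0.1):
    scores = sorted(abs(r) for r in cal_residuals)
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))  # conservative rank
    return scores[k - 1]

def predict_interval(point_estimate, margin):
    return (point_estimate - margin, point_estimate + margin)

# Calibration residuals (estimate - sale price) in euros:
residuals = [-12000, 5000, 8000, -3000, 15000, -7000, 2000, 10000, -4000]
margin = conformal_margin(residuals, alpha=0.2)
print(predict_interval(250000, margin))  # → (238000, 262000)
```

Under an exchangeability assumption, intervals built this way cover the true price at roughly the nominal rate, which is the operational property regulators ask AVMs to document.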
Proper development of AVMs includes evidence-based data preparation (with leakage checks), comparison with classical estimates, use of multimodal features, evaluation with ratio studies and calibrated prediction intervals, as well as integration of transparency techniques (e.g., SHAP) and alignment with international standards (IAAO, RICS, QCS).
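The ratio-study metrics mentioned above can be sketched directly; COD is the average absolute deviation from the median assessment-to-sale ratio (as a percentage of the median), and PRD is the mean ratio over the sale-weighted mean ratio. PRB is omitted here because it requires a regression fit, and the example values are illustrative.

```python
# Sketch of an IAAO-style ratio study on assessment-to-sale ratios.
# Values are illustrative, not results from the paper.

def ratio_study(estimates, sale_prices):
    ratios = sorted(e / s for e, s in zip(estimates, sale_prices))
    n = len(ratios)
    median = (ratios[n // 2] if n % 2
              else 0.5 * (ratios[n // 2 - 1] + ratios[n // 2]))
    # COD: mean absolute deviation from the median ratio, as % of the median.
    cod = 100.0 * sum(abs(r - median) for r in ratios) / (n * median)
    # PRD: mean ratio over the sale-weighted mean ratio (>1 suggests regressivity).
    prd = (sum(ratios) / n) / (sum(estimates) / sum(sale_prices))
    return {"median": round(median, 3), "COD": round(cod, 2), "PRD": round(prd, 3)}

estimates = [95000, 210000, 150000, 300000]
sales = [100000, 200000, 160000, 310000]
print(ratio_study(estimates, sales))
```

IAAO guidance interprets low COD as uniformity and PRD near 1 as vertical equity; the exact acceptance bands depend on the property class being studied.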

2.9. Explainability and Transparency in AVMs (SHAP, LIME, PDP/ALE, Grad-CAM)

Transparency is critical for the reliability of AVMs. In tabular data, Shapley values (TreeSHAP) provide additive, consistent attributions per feature and property (Lundberg & Lee, 2017 [55]; Lundberg et al., 2020 [56]), facilitating the correlation of value with attributes such as area, floor, and accessibility. Additionally, methods such as LIME provide local approximations through interpretable linear surrogate models, subject to stability constraints (Ribeiro et al., 2016) [57].
Partial Dependence Plots (PDPs) and Individual Conditional Expectation (ICE) plots visualize feature–price relationships, while Accumulated Local Effects (ALEs) handle correlated features more robustly (Friedman, 2001 [58]; Goldstein et al., 2015 [59]; Apley & Zhu, 2020 [60]). In images, saliency techniques such as Grad-CAM highlight the regions that influence predicted housing values (e.g., renovation signs, window quality) (Selvaraju et al., 2017) [61].
Explanations should be faithful, consider correlations, and not be interpreted causally (Molnar, 2022 [62]; Lipton, 2016 [63]). Standard reports include the following:
(a)
Global significances (SHAP summary);
(b)
PDP/ALE curves for key features;
(c)
Local explanation cards;
(d)
Spatiotemporal stability checks.
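Item (b) above can be illustrated with a minimal partial-dependence computation: fix one feature at grid values and average the model's predictions over the sample. The linear "model" and the two sample rows are illustrative stand-ins for a trained AVM and its data.

```python
# Minimal partial-dependence sketch. The toy price model and sample rows are
# illustrative stand-ins, not the paper's fitted model.

def model(row):
    return 1200.0 + 18.0 * row["area"] + 250.0 * row["floor"]

def partial_dependence(model, data, feature, grid):
    """Average model predictions over the sample with `feature` fixed at each grid value."""
    curve = []
    for v in grid:
        preds = [model({**row, feature: v}) for row in data]
        curve.append(sum(preds) / len(preds))
    return curve

data = [{"area": 70, "floor": 1}, {"area": 95, "floor": 4}]
pd_area = partial_dependence(model, data, "area", [60, 80, 100])
print([round(p) for p in pd_area])  # → [2905, 3265, 3625]
```

With correlated features this marginal averaging can evaluate the model at unrealistic combinations, which is precisely the weakness ALE plots are designed to avoid.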
To comply with international guidelines (IAAO/RICS/QCS), full audit trails, version documentation, and bias checks are required. In this context, the proposed methodology recognizes that potential biases may arise from both the nature of the data and the individual stages of processing and modeling. Therefore, summary consistency and stability checks must be applied to ensure that estimates remain uniform and reliable across different property subsets and do not exhibit systematic deviations that could affect the fairness of the results.

2.10. Data Quality, Bias, and Concept Drift in Advertisements/AVMs

The reliability of an advertisement-based AVM depends on the quality of data throughout the pipeline: collection, extraction, enrichment, and modeling. Critical areas of risk are coverage biases (Heckman, 1979) [64], measurement errors, and concept drift.
Advertisements are not a random sample of the transaction set; posting strategies and platform differences introduce selection bias, potentially distorting the estimation of value (p(z)). Correction techniques (e.g., Heckman selection, reweighting, external calibration) and the use of ratio studies are suggested (IAAO, 2025) [50].
Feature extraction via NLP and CV is subject to errors due to incomplete descriptions, exaggerations, or spurious signals. The use of dictionaries, active learning, external validations, and robust models is recommended (Carroll et al., 2006) [65]. Geocoding also requires positional accuracy checks (Zandbergen, 2008) [66].
Value–feature relationships change over time and space due to covariate/concept drift (Gama et al., 2014 [67]; Lu et al., 2018 [68]). OOT/OOA assessment, PSI/KS monitoring, and regular recalibration of models and intervals are required.
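The PSI monitoring mentioned above compares a feature's binned distribution between the training window and a recent window; a common rule of thumb treats PSI above 0.25 as a major shift. The bin counts and thresholds below are illustrative conventions, not values from the paper.

```python
import math

# Population Stability Index (PSI) over shared histogram bins:
# PSI = sum_i (p_i - q_i) * ln(p_i / q_i), where p and q are the bin
# proportions in the reference and recent windows respectively.

def psi(ref_counts, new_counts, eps=1e-6):
    ref_total, new_total = sum(ref_counts), sum(new_counts)
    value = 0.0
    for r, n in zip(ref_counts, new_counts):
        p = max(r / ref_total, eps)   # guard against empty bins
        q = max(n / new_total, eps)
        value += (p - q) * math.log(p / q)
    return value

# Hypothetical price-per-sq-m histogram, training window vs. latest month:
print(round(psi([30, 50, 20], [20, 50, 30]), 4))  # → 0.0811
```

Tracking PSI per feature over time turns the abstract notion of covariate drift into a concrete retraining trigger.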
AVM governance should incorporate quality controls (data SLAs), audit trails, provenance documentation, bias testing, and retraining, in accordance with IAAO (2021 [13], 2025 [50]) and QCS (CFPB et al., 2024 [19]) standards.

2.11. Spatial Enrichment: Accessibility Indicators and POIs in the Formation of Values

The spatial enrichment of ad data with accessibility indicators and POIs (Points of Interest) enhances the interpretative and predictive power of the models. Accessibility reflects the ease of access to opportunities (work, services, greenery), while POIs capture the area’s functional composition.
According to Hansen (1959) [69], accessibility is defined as a weighted sum of opportunities under cost/time, while Geurs & van Wee (2004) [70] classify four dimensions (transportation, land use, temporal, and individual) and related metrics: cumulative, gravity-based, min-cost, and time-space. The use of time-based distances via road/public transport network (GTFS) is crucial for a realistic representation (El-Geneidy & Levinson, 2006) [71].
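Hansen's weighted-sum definition can be written directly as a gravity-type index with a negative-exponential impedance on travel times; the opportunity counts, times, and decay parameter below are illustrative values.

```python
import math

# Hansen-type gravity accessibility: A_i = sum_j O_j * exp(-beta * c_ij),
# where O_j are opportunities (e.g., jobs) at destination j and c_ij is the
# travel time in minutes from location i. All inputs here are illustrative.

def hansen_accessibility(opportunities, travel_min, beta=0.1):
    return sum(o * math.exp(-beta * c)
               for o, c in zip(opportunities, travel_min))

jobs = [5000, 12000, 3000]   # jobs reachable at three destinations
minutes = [8, 20, 35]        # door-to-door travel times from the dwelling
print(round(hansen_accessibility(jobs, minutes)))  # → 3961
```

Computing the same index with car versus transit travel times (e.g., from a GTFS-based router) yields the mode-specific accessibility indicators discussed above.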
The relevant literature shows positive effects of accessibility to public transport, work, and services on housing values, with heterogeneity across cities and study designs. Specialized studies on rail/metro confirm a positive effect with decreasing returns in relation to distance (Debrezion et al., 2007) [24]. Similarly, walkability and land use mix (e.g., Walk Score) are associated with higher prices (Duncan, 2011 [72]; Ewing & Cervero, 2010 [73]), while parks create premiums when quality is high (Crompton, 2001) [74].
However, there are nonlinearity issues: proximity to stations/POIs can create disamenities (noise, crowding), while road accessibility exhibits threshold utility effects. The use of nonlinear models (e.g., splines, GBMs) and multiscale metrics (300 m–2 km), with counts and diversity separation, is required.
Challenges include MAUP issues, spatial autocorrelation, amenity endogeneity, geocoding errors, and multicollinearity. Good practice includes robustness checks across zones/buffers, spatial splits, geocoding checks, and handling of multicollinearity.
For the final integration into AVMs, the following are proposed:
(a)
Temporal distances with different means;
(b)
Multiscale POI indicators with categorization/diversity;
(c)
Nonlinearity/interaction checks;
(d)
OOT/OOA evaluation;
(e)
Source documentation (OSM, GTFS, registries).

2.12. Summary of Literature Gaps and Contributions

Despite the extensive literature on hedonic models and AVMs, critical gaps are identified. First, the emphasis on estimating value (p(z)) often bypasses the conditions for reliable MWTP estimates or equilibrium interpretations, as required by Rosen’s theory and identification in nonlinear models (Rosen, 1974 [2]; Ekeland, Heckman, & Nesheim, 2004 [3]). Second, the insistence on rigid functional forms may introduce distortions at marginal values, as classical results have shown (Cropper et al., 1988 [5]; Halvorsen & Palmquist, 1980 [6]). Third, the use of multimodal data—text, image, and spatial markers—remains limited, with few well-documented applications (Law et al., 2018 [43]; Poursaeed et al., 2018 [42]). Fourth, the positive effect of energy identities shows significant heterogeneity and performance gap (Brounen & Kok, 2011 [30]; Hyland et al., 2013 [29]; Sunikka-Blank & Galvin, 2012 [34]). Fifth, ML AVMs often fall short in rigorous evaluation protocols, uncertainty quantification, and explainability (Baur, 2023 [47]; Pollestad et al., 2024 [53]; Lundberg & Lee, 2017 [55]; Selvaraju et al., 2017 [61]). Sixth, error and concept drift management lag standard procedures (Gama et al., 2014 [67]; Lu et al., 2018 [68]), while compliance with AVM governance frameworks is not systematically implemented (IAAO, RICS, QCS).
This work contributes to this research with the following features:
(a)
Developing a multimodal pipeline with NLP, Vision, and POIs;
(b)
Introducing EPC-proxy from advertisements;
(c)
Adopting OOT/OOA evaluation with leakage checks;
(d)
Incorporating calibrated uncertainty and explanations (TreeSHAP, PDP/ALE, Grad-CAM);
(e)
Aligning with governance standards for repeatability and regulatory compliance.
Figure 2 illustrates the structural sequence of the proposed methodology, from the conventional comparative and hedonic approaches to the integration of multimodal AI signals, which are derived from textual descriptions, visual material, and spatial information. This framework captures how key property and market characteristics are enhanced by automated, high-resolution extracted features, leading to improved predictive performance, increased transparency, and compliance with international professional valuation standards (RICS/IVS, IAAO). Visual representation contributes to the clarity of the methodological framework and facilitates the understanding of the relationships between inputs, intermediate levels of processing, and estimation results.
The overall comparison of models is based on established evaluation metrics (MAE, RMSE, MAPE, COD, PRD, and PRB). Traditional hedonic/comparative models exhibit higher errors and greater dispersion in ratio indicators, whereas the multimodal model consistently performs better across all indicators. The diagram below summarizes the transition from classical approaches to the proposed enhanced multimodal estimation scheme, highlighting the contribution of additional AI information sources to improved results.

3. Data and Model Implementation

3.1. Data Sources, Ethics, Legal Issues, and Documentation

The primary data are extracted from publicly available listing platforms and real estate websites (text and images) for the sole purpose of obtaining standardized features for scientific modeling and valuation documentation (comparative/AVM). The documentation follows the RICS Red Book: an explicit statement of purpose, assumptions, sources, input quality, and uncertainty, so that the user is aware of the scope of the results (RICS, 2025) [15].
Where there is a possibility of processing personal data (e.g., addresses with accompanying images), the processing is GDPR-compliant (lawfulness/transparency, purpose limitation, minimization, accuracy, limited storage, integrity/confidentiality) and documented, where required, through a DPIA. Pseudonymization/removal of contact details, EXIF removal, face/plate blurring, and strict retention policies are applied. Public archives include only depersonalized data, not raw media (Regulation (EU) 2016/679) [75].
The collection complies with the terms of use of the platforms. Copyrighted material is used exclusively for feature extraction without redistribution of material.
Image analysis is limited to technical/qualitative elements (lighting, materials, damage, signs of renovation), avoiding sensitive socio-demographic inferences. EPC-proxies are used as statistical quality marks and not as substitutes for official certificates, in line with current European harmonization developments (Ruggieri et al., 2024) [36]. Uncertainty intervals accompany the forecasts and are not certification-based.
The development/use of AVMs is aligned with RICS/IVS (documentation of inputs, assumption checks, reporting of uncertainty/purpose), with IAAO for ratio studies (COD/PRD/PRB), and with the Quality Control Standards for AVMs in housing processes (CFPB et al., 2024 [19]; RICS, 2025 [15]; IVSC, 2025 [22]).
Transparency and reproducibility: Datasheets for datasets and Model Cards, data/code/hyperparameter versioning, and seed/settings recording are adopted (Gebru et al., 2021 [76]; Mitchell et al., 2019 [77]). Data are provided as depersonalized features or synthetic samples that reproduce the statistical structure. In this way, all steps (outlier identification, defect handling, spatial autocorrelation handling, and modeling options) are kept fully transparent and reproducible.
Practical protocols (summary): Provenance logging (URL hash, timestamp, parser version), field minimization, automatic cleanup of raw media after feature calculation, and explicit ethics/compliance note with usage limits.
The implementation of the methodology is organized in a way that allows the results to be reproduced: the basic settings of the models, the processing stages, and the general data specifications are recorded so that independent applications of the same process can follow the same steps.

3.2. Cleansing, De-Duplication, and Geocoding

The ETL stage transforms heterogeneous advertisements from multiple sources into single, geocoded records with harmonized units and stable identifiers, suitable for model input. This involves text standardization (lowercasing, punctuation removal, normalization of street/floor abbreviations), harmonization of units (sq m, €/sq m) and categories (sale/rental), and consistency rules (e.g., year of construction ↔ age, postal_code ↔ municipality) with outlier checks.
The pairing of duplicate advertisements is performed using the probabilistic Fellegi–Sunter methodology, with field-level similarity weighting (Fellegi & Sunter, 1969) [78]. For efficiency, blocking keys are applied based on ZIP codes, rounded sq m, and floor (Christen, 2012) [79]. Text similarity (titles/descriptions) is calculated with TF-IDF cosine or Jaccard n-grams; for streets/place names, Jaro–Winkler/Levenshtein (Jaro, 1989 [80]; Winkler, 2006 [81]) is used. Numeric fields (square meters, year, price) are compared with absolute/relative distances and, where available, coordinates with geodetic distance. Pairs scoring above the upper threshold are grouped and assigned a single PropertyID; discrepancies are resolved with merging rules (e.g., median square meters, latest price).
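The blocking and field-similarity logic above can be sketched in pure Python. The blocking key, n-gram size, field weights, and match threshold below are illustrative stand-ins, not the calibrated Fellegi–Sunter weights of the actual pipeline:

```python
import re
from itertools import combinations

def blocking_key(ad):
    # Block on ZIP code, area rounded to the nearest 5 sq m, and floor,
    # so only plausible duplicates are compared pairwise.
    return (ad["zip"], round(ad["sqm"] / 5) * 5, ad["floor"])

def char_ngrams(text, n=3):
    text = re.sub(r"\W+", " ", text.lower()).strip()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match_score(a, b):
    # Field-level agreement weights (illustrative, not fitted F-S weights).
    score = 2.0 * jaccard(a["title"], b["title"])
    score += 1.0 - min(abs(a["price"] - b["price"]) / max(a["price"], b["price"]), 1.0)
    score += 1.0 * (abs(a["sqm"] - b["sqm"]) <= 2)
    return score

def deduplicate(ads, threshold=3.0):
    blocks = {}
    for i, ad in enumerate(ads):
        blocks.setdefault(blocking_key(ad), []).append(i)
    pairs = []
    for idx in blocks.values():
        for i, j in combinations(idx, 2):  # pairwise only within a block
            if match_score(ads[i], ads[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Matched pairs would then be merged under a single PropertyID using the rules described above (median square meters, latest price).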
For image reuse, perceptual hashes (pHash/dHash/aHash) with Hamming thresholding are used, and SSIM validation is performed (Zauner, 2010 [82]; Wang, Bovik, Sheikh, & Simoncelli, 2004 [83]). Images are grouped by PropertyID, and a representative subset is maintained per location, limiting the bias of overrepresentation of “good” images.
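A minimal average-hash (aHash) variant with Hamming thresholding illustrates the image-reuse check; production pipelines would typically use pHash/dHash on decoded photos, and the 8×8 grid and 5-bit threshold here are illustrative assumptions:

```python
def average_hash(gray, size=8):
    # Downsample a 2D grayscale array (list of lists) to size x size by
    # block averaging, then threshold each cell against the global mean.
    h, w = len(gray), len(gray[0])
    bh, bw = h // size, w // size
    cells = []
    for r in range(size):
        for c in range(size):
            block = [gray[r * bh + i][c * bw + j] for i in range(bh) for j in range(bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits  # 64-bit integer fingerprint

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(h1, h2, max_bits=5):
    # Small Hamming distance -> likely the same photo (possibly re-encoded).
    return hamming(h1, h2) <= max_bits
```

Candidates flagged here would then pass the SSIM validation step before being grouped by PropertyID.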
Geocoding is based on a normalized address (street, number, ZIP code, local unit) and multiple providers with a trust hierarchy. Each result is accompanied by a reliability index and an accuracy level (rooftop/lot/street/area), with fallbacks when required. The location is stored in WGS84, while distance/time calculations are made in an appropriate projection system. Accuracy is checked against benchmarks, and acceptance thresholds are applied because minor errors can affect accessibility indicators/POIs (Zandbergen, 2008) [66].
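The geodetic-distance and acceptance-threshold checks can be sketched as follows; the confidence and error thresholds are hypothetical placeholders, not the study's operational values:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters on a mean-radius sphere; adequate
    # as an approximation for accessibility-scale sanity checks.
    R = 6371008.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def accept_geocode(result, max_error_m=150, min_confidence=0.7):
    # Acceptance rule combining provider confidence with the
    # rooftop/lot/street/area precision hierarchy described above.
    level_ok = result["precision"] in {"rooftop", "lot"}
    return result["confidence"] >= min_confidence and (
        level_ok or result.get("est_error_m", float("inf")) <= max_error_m
    )
```

Rejected results would trigger the fallback to the next provider in the trust hierarchy.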
Provenance logs (timestamps, parser/geocoder versions, URL/image hashes), field SLAs (minimum: Area_sqm, Floor, Price, geocoding, ≥3 photos), and quality metrics (missing, before/after duplicate percentage, F1 deduplication, positional error) are maintained. All stochastic parameters (sampling, thresholds) are “frozen” with seeds and configuration files for full reproducibility and inspection.
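A provenance entry of the kind described can be sketched with standard-library hashing; the field names and the parser version string are illustrative assumptions:

```python
import hashlib
import json
import datetime

def provenance_record(url, raw_html, parser_version="v1.2.0"):
    # Minimal provenance entry: content-addressed hashes plus a UTC
    # timestamp, enough to re-identify the exact scrape later.
    return {
        "url_hash": hashlib.sha256(url.encode()).hexdigest(),
        "content_hash": hashlib.sha256(raw_html.encode()).hexdigest(),
        "fetched_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "parser_version": parser_version,
    }

def log_line(record):
    # One JSON line per record, sorted keys for stable diffs/audits.
    return json.dumps(record, sort_keys=True)
```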

3.3. NLP Schema, Computer Vision Features, and EPC-Proxy Definition

We standardize a single set of features from (a) NLP for ad descriptions, (b) computer vision for photos, and (c) a composite EPC-proxy index as a statistical energy-quality signal, under a "fit for estimation" design principle: compatibility with benchmarking/AVMs, documentation, reproducibility, and metric checks.

3.3.1. Entity Schema and Text Extraction (NLP)

The core entities include Area_sqm, Floor, YearBuilt/Age, Bedrooms/Baths, HeatingType, Cooling, RenovationStatus, WindowType, Parking/Storage, View/Orientation, Condition, and EnergyMention. The extraction is performed with combined NER (rules + transformers) and unit/value normalization. Negation detection (e.g., "no heating") reduces false positives (Chapman et al., 2001) [84]. Multilingual transformers (mBERT/XLM-R) are used with transfer learning on small labeled samples, weak supervision (Snorkel-like), and active learning for edge cases (Devlin et al., 2019 [85]; Conneau et al., 2020 [86]; Ratner et al., 2017 [87]). Evaluation: P/R/F1 per entity and impact on final estimation metrics (e.g., MAE reduction), using P-R curves for unbalanced classes (Saito & Rehmsmeier, 2015) [88].
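A rule-based slice of the extraction layer, with a NegEx-style negation window, might look like the following; the patterns, the 25-character negation window, and the entity subset are simplified illustrations of the hybrid rules + transformer extractor described above:

```python
import re

PATTERNS = {
    "Area_sqm": re.compile(r"(\d+(?:\.\d+)?)\s*(?:sq\.?\s*m|m2|sqm)", re.I),
    "Floor": re.compile(r"(\d+)(?:st|nd|rd|th)\s+floor", re.I),
    "YearBuilt": re.compile(r"built\s+in\s+(\d{4})", re.I),
}
# A negation cue shortly before the mention suppresses the positive label.
NEGATION = re.compile(r"\b(no|without|not)\b[^.,;]{0,25}$", re.I)

def extract_entities(text):
    out = {}
    for name, pat in PATTERNS.items():
        m = pat.search(text)
        if m:
            out[name] = float(m.group(1))
    m = re.search(r"heating", text, re.I)
    if m:
        # NegEx-style scope check on the window before the mention.
        out["Heating"] = not NEGATION.search(text[:m.start()])
    return out
```

Transformer-based NER would replace the regexes for free-text fields, with these rules kept as a high-precision backstop.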

3.3.2. Image Features (CV)

Images capture materials/finish, lighting/view, maintenance, and renovation indications. Embeddings are extracted with CNN/ResNet and Vision Transformers; lightweight classification/regression heads target RenovationStatus, WindowType, and Lighting (He et al., 2016 [89]; Dosovitskiy et al., 2021 [90]). For missing labels, weak text-derived labels (multi-instance learning), self-supervised pretraining, and data augmentation are used; confidence scores and quality checks (e.g., failure to detect interior/exterior) are provided. Sample visual explanations (Grad-CAM) are used for operational transparency and review by evaluators.

3.3.3. Definition and Calibration of EPC-Proxy

Research definition:
EPC_proxy = f(EnergyMention_NLP, WindowType_CV, Heating/Cooling, RenovationStatus, YearBuilt/Age, InsulationCues)
where f is a calibrated combination (e.g., stacked logistic/gradient boosting). The index draws on cues such as double glazing, thermal insulation, heat pump, and recent full renovation, considering the "performance gap" between theoretical and actual performances (Sunikka-Blank & Galvin, 2012) [34]. In subsets with official EPCs, external calibration is performed, and AUC and prediction interval coverage are reported; the index is explicitly stated as an informative proxy rather than as a substitute for certification. The terminology/scale is aligned with current European comparability efforts (Ruggieri et al., 2024) [36].
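Schematically, a logistic instance of f could be sketched as below; the weights, age penalty, bias, and band cut-offs are purely illustrative, whereas in the pipeline they would be fitted by stacking and externally calibrated on subsets holding official EPCs:

```python
import math

# Illustrative (NOT fitted) weights for the energy-quality cues.
WEIGHTS = {
    "energy_mention": 1.2, "double_glazing": 0.9, "heat_pump": 1.1,
    "full_renovation": 0.8, "insulation_cue": 0.7,
}
AGE_PENALTY = 0.03  # per year of building age (illustrative)
BIAS = -0.5

def epc_proxy(features, age):
    # Logistic combination of binary cues and age.
    z = BIAS - AGE_PENALTY * age
    for k, w in WEIGHTS.items():
        z += w * float(bool(features.get(k)))
    return 1.0 / (1.0 + math.exp(-z))  # probability of "good" energy class

def proxy_band(p):
    # Coarse informative bands; explicitly not an official EPC label.
    return "high" if p >= 0.7 else "medium" if p >= 0.4 else "low"
```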

3.3.4. Modal Coupling and Leakage Checks

Late fusion is adopted: merging visual embeddings and NLP-features with structured/spatial features in a tabular array for gradient boosting, allowing interpretability (SHAP) and channel-by-channel checks. For large scales, cross-modal attention is considered in robustness experiments. Temporal and spatial hold-outs are applied to avoid leakage from geo-recognized patterns or multiple occurrences of the same property.

3.3.5. Quality Metrics and Governance

NLP: P/R/F1 per entity and impact on MAE estimates. CV: accuracy/AUROC on targeted cues, probability calibration, and robustness to augmentations. EPC-proxy: AUC/KS, calibration curves, and coverage of uncertainty intervals. All accompanied by Datasheet/Model Card (fields, versions, data windows, constraints) and audit trail, aligned with RICS/IAAO/QCS.

3.4. Spatial Enrichment and Access Times

Spatial enrichment produces functional indicators of accessibility and intensity/variety of uses as a complement to structural, textual, and visual features. We adopt the classical concept of “opportunity cost of travel” (Hansen, 1959) [69] and the four-dimensional framework of transportation, land use, time, and individual accessibility with discrete metrics (Luo & Wang, 2003) [91].
Distances/times are calculated on real networks: road (driving/pedestrian) and public transport (GTFS), producing isochrones and OD matrices for peak/off-peak and working/non-working day scenarios. We use reference speeds by road category and, for public transport, times from GTFS (routes, waits, transfers) with explicit transfer penalties (El-Geneidy & Levinson, 2006) [71].
Accessibility metrics:
  • Cumulative opportunities: Number of destinations O_j within a time threshold T.
  • Gravity-based:
    A_i = Σ_j O_j · f(c_ij)
    with decreasing f(⋅) (exponential/log) calibrated to empirical travel distributions.
  • Minimum generalized cost: Minimum generalized time/cost to the nearest suitable destination (e.g., metro station).
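In code, the three metrics above might be sketched as follows; the negative-exponential decay with parameter beta is an illustrative assumption, since in the pipeline f(·) is calibrated on empirical travel distributions:

```python
import math

def cumulative_opportunities(times_min, threshold=15):
    # Number of destinations reachable within the time threshold T.
    return sum(1 for t in times_min if t <= threshold)

def gravity_access(opportunities, times_min, beta=0.1):
    # A_i = sum_j O_j * exp(-beta * c_ij): negative-exponential decay;
    # beta would be calibrated to observed travel distributions.
    return sum(o * math.exp(-beta * t) for o, t in zip(opportunities, times_min))

def min_generalized_cost(times_min):
    # Minimum generalized time to the nearest suitable destination.
    return min(times_min) if times_min else float("inf")
```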
POIs, intensity, and variety: In multiscale rings (300 m, 800 m, 2 km), we calculate (i) intensity (counts per category), (ii) variety with normalized mixing entropy, and (iii) proximity (minimum time/distance) to basic categories (education, health, green). For services with a supply/demand relationship, 2SFCA is applied to capture the demand pressure per unit of supply.
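The variety component, normalized mixing entropy over POI category counts in a ring, can be sketched as:

```python
import math

def mixing_entropy(counts):
    # Normalized Shannon entropy of POI category shares in a ring:
    # 0 = single use, 1 = perfectly even mix across observed categories.
    total = sum(counts.values())
    cats = [c for c in counts.values() if c > 0]
    if total == 0 or len(cats) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in cats)
    return h / math.log(len(cats))
```

The same function would be evaluated per ring (300 m, 800 m, 2 km) to obtain the multiscale variety profile.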
The impact of accessibility is often non-monotonic; we model it with splines/GBMs and control for interactions (e.g., accessibility × building age/size).
MAUP effects are reduced through multiscale buffers, sensitivity checks on thresholds/decay functions, and documentation of sources (OSM/administrative/GTFS), release versions, and CRS. Geocoding carries positional accuracy/uncertainty flags that explicitly feed into the uncertainty intervals of values.
Validity checks: (i) Sanity checks of realistic time/distance limits, (ii) convergent validity with known indicators (e.g., distance to station), (iii) robustness to alternative f(c) and peak definitions, and (iv) leakage control with spatial/temporal hold-outs so that the test networks/POIs do not share information with the train.

3.5. Target Variable and Transformations

The primary target variable is log(€/sq m), so that the price/area ratio reduces the variance due to size and the logarithmic transformation mitigates heteroscedasticity and yields effects in percentage form (Box & Cox, 1964) [92]. For robustness checks, results are presented both in total price (with appropriate transformations) and in rent/sq.m. for leases.
Prices are deflated with a consumer price index (month and relevant territory level), while units (sq.m., floor, age) are standardized. Selected explanatory variables are transformed (e.g., log distances/times) when the distributions are strongly skewed (Box & Cox, 1964) [92].
Basic transformation: log(y); for rare non-positive values, Yeo–Johnson is applied, with MLE parameter estimation on the training set and unchanged application to validation/test (Yeo & Johnson, 2000) [93].
Before the transformation, robust tail cleaning (Winsor/trim) with modified Z-score or IQR is applied to limit the effect of errors/isolated reports and stabilize the training; thresholds are defined in advance and are not adjusted in the test (Iglewicz & Hoaglin, 1993) [94].
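The modified Z-score winsorization can be sketched as follows; the 3.5 cut-off follows the common Iglewicz–Hoaglin convention and, as stated above, is fixed in advance rather than tuned on the test set:

```python
def modified_z(values):
    # Modified Z-score based on the median and MAD (Iglewicz & Hoaglin);
    # a simple upper median is used here for brevity.
    med = sorted(values)[len(values) // 2]
    abs_dev = sorted(abs(v - med) for v in values)
    mad = abs_dev[len(abs_dev) // 2]
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

def winsorize(values, z_thresh=3.5):
    # Clip observations whose |modified Z| exceeds the pre-set threshold
    # to the range spanned by the inliers.
    zs = modified_z(values)
    inliers = [v for v, z in zip(values, zs) if abs(z) <= z_thresh]
    lo, hi = min(inliers), max(inliers)
    return [min(max(v, lo), hi) for v in values]
```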
In addition to MSE, the Huber loss is used for heavy tails (Huber, 1964) [95], and the quantile loss is used for percentage-error estimation and quantile predictions (τ ∈ {0.5, 0.9, 0.95}) (Koenker & Bassett, 1978) [96]. The results report includes MAE/RMSE/MAPE in the logarithmic and anti-transformed domains.
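The two robust losses mentioned can be written compactly; the delta and tau values below are illustrative:

```python
def huber(residual, delta=1.0):
    # Quadratic near zero, linear in the tails (Huber, 1964).
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def pinball(y, y_hat, tau):
    # Quantile (pinball) loss: asymmetric penalty used to train
    # tau-quantile predictors (Koenker & Bassett, 1978).
    d = y - y_hat
    return tau * d if d >= 0 else (tau - 1) * d
```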
The 95% intervals are provided from a block bootstrap with temporal/spatial sampling schemes (Efron, 1979) [97]. Prediction intervals are derived either directly from quantile models or from post-calibration of validation errors (coverage vs. nominal).
In semi-log specifications, coefficients are interpreted as approximate percentage changes; for binary features, the exponential correction of Kennedy (1981) [98] is applied. In nonlinear models, effects are summarized with PDP/ALE and SHAP in the logarithmic domain and are also presented in initial values after proper anti-transformation.
All options (deflation, transformations, Winsor thresholds, losses) are documented in Model Cards and configuration files; parameters (e.g., Box–Cox/Yeo–Johnson) are “frozen” in training and are applied strictly to out-of-time/out-of-area sets.

3.6. Training Schemes, OOT/OOA Evaluation, and Leakage Checks

Performance is evaluated exclusively on data unseen during training, both temporally and spatially, to limit overoptimism from autocorrelation and avoid leakage (Roberts et al., 2017) [99].
Rolling-origin evaluation is applied: training up to t and prediction on [t, t + Δ], repeated in successive blocks. Hyperparameter tuning is performed only with nested time-series CV within the training set, without using information from validation/test sets (Tashman, 2000) [100].
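A rolling-origin split generator of this kind can be sketched as follows; the window lengths and step are illustrative placeholders for the study's actual temporal blocks:

```python
def rolling_origin_splits(n_periods, initial_train=12, horizon=3, step=3):
    # Expanding-window rolling origin: train on periods [0, t),
    # test on [t, t + horizon), then advance the origin by `step`.
    splits = []
    t = initial_train
    while t + horizon <= n_periods:
        splits.append((list(range(0, t)), list(range(t, t + horizon))))
        t += step
    return splits
```

Each (train, test) pair indexes time periods; hyperparameter tuning would run nested inside the train indices only.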
Blocked spatial CV is adopted: the space is partitioned into non-overlapping blocks; entire blocks are reserved for testing; and a buffer is applied toward neighboring training blocks. In apartment buildings, grouped splits are used at the building level to ensure that similar records/photographs are not distributed across different sets (Roberts et al., 2017 [99]; Valavi et al., 2019 [101]).
Three leakage sources are controlled: (i) preprocessing leakage (normalizations/encoders calculated on the full sample), (ii) target leakage from look-ahead variables (e.g., “days on market”), and (iii) reuse of records/images between train and test. All transformations/feature selections are applied out-of-fold, and hyperparameters are “locked” before the final evaluation (Kaufman, Rosset, & Perlich, 2012) [102].
Nested CV is used with random search for broad exploration and Bayesian optimization for convergence, performed independently per split (OOT/OOA). Final model selection is based on MAE (primary), with RMSE/MAPE as secondary metrics, and error analysis by value subgroup (Bergstra & Bengio, 2012 [103]; Snoek, Larochelle, & Adams, 2012 [104]).
In addition to the classic metrics, COD/PRD/PRB are reported in mass appraisal scenarios, and ratio studies are performed per IAAO standards to check uniformity across the value spectrum. Differences in accuracy between models in OOT/OOA are statistically tested (e.g., Diebold–Mariano) with confidence intervals from blocked bootstrap in time/space (IAAO, 2025 [50]; Efron, 1979 [97]).
Prediction intervals (90%, 95%) are provided for each split using a quantile regression or error-based approach, with coverage assessed against nominal levels. Error drift is monitored (e.g., KS/PSI), and recalibration is triggered when coverage declines.
All splits, seeds, data/model versions, and hyperparameter settings are documented in config files/Model Card. Pipelines are deterministic-by-seed and replicated one-to-one by third parties.

4. Modeling Methodology

The analysis uses modern Machine Learning models, including gradient-boosted decision tree algorithms (XGBoost, LightGBM, CatBoost), neural network architectures for tabular data (TabNet), and convolutional networks (ResNet-18) for image processing.
Two established optimization approaches are applied for hyperparameter tuning: random search and Bayesian optimization, which allow for efficient exploration of the hyperparameter space.
The validation process uses nested cross-validation, combining temporal and spatial blocking to ensure the model generalizes to unobserved spatiotemporal patterns.
Finally, systematic measures are taken to avoid information leakage through grouped separations per property, exclusion of overlapping time intervals and EXIF metadata, and complete separation of images between folds, ensuring the methodological integrity of training and evaluation.
These procedures are summarized in Table 1 below, which outlines the corresponding modeling components and safeguards implemented throughout the framework.

4.1. Linear/Hedonic Models (Baselines)

As a reference point, we estimate hedonic models with log(€/sq m) as the dependent variable, so that the coefficients have a semi-elastic interpretation:
log p_i = α + β′x_i + γ′d_i + δ′a_i + μ_n(i) + τ_t(i) + ε_i,
where x_i are continuous structural variables, d_i are binary attributes, and a_i are accessibility indices/POIs, while μ_n(i) and τ_t(i) are spatial and temporal fixed effects (Wooldridge, 2010) [105]. In hedonic models, ‘accessibility’ describes the spatial characteristics of a property’s location that shape its value, such as distance or time to public transport, central functions, points of interest, and job concentrations.
We introduce splines (natural cubic) in area, age, and access times for flexible curvatures, and perform Box–Cox tests for alternative transformations (Box & Cox, 1964) [92]. For binary explanatory variables in semilog specifications, the effect is given as
%Δp ≈ 100(exp(β̂) − 1)
(Halvorsen & Palmquist, 1980 [6]; Kennedy, 1981 [98]). Careful attention is paid to functional form, given the risk of bias from misspecification (Cropper, Deck, & McConnell, 1988) [5].
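The Halvorsen–Palmquist and Kennedy formulas translate into a small helper; the worked numbers in the test are illustrative:

```python
import math

def semilog_pct_effect(beta_hat, var_beta=None):
    # Percentage effect of a dummy variable in a semilog model.
    # Without a variance estimate: Halvorsen-Palmquist, 100*(exp(b) - 1).
    # With one: Kennedy's (1981) correction, 100*(exp(b - V(b)/2) - 1).
    if var_beta is None:
        return 100.0 * (math.exp(beta_hat) - 1.0)
    return 100.0 * (math.exp(beta_hat - 0.5 * var_beta) - 1.0)
```

For example, a coefficient of 0.10 implies roughly a 10.5% premium uncorrected, slightly less after the Kennedy adjustment.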
Spatial autocorrelation of residuals with spatial HAC errors (Conley, 1999) [106] is treated and, as a robustness check, the estimation of SAR/SEM specifications (Anselin, 1988) [107] is also considered. Additionally, cluster-robust errors are applied at the neighborhood/building level.
Variable selection and multicollinearity. Selection is guided by the pre-qualified schema and checked with VIF. Feature bundles (e.g., “shell quality”: frames + renovation) are also examined, with sensitivity checks for: (i) inclusion/removal of accessibility indicators, (ii) alternative buffers/time scales, and (iii) outlier trimming.
Diagnostics and reporting. We report R2, MAE/RMSE (in the log and anti-transformed domains), Breusch–Pagan for heteroscedasticity, Moran’s I on residuals, and Ramsey RESET. Percentage effects with 95% CI and partial elasticities at representative points (25th/50th/75th percentiles) are also examined. As an extension, the relation to a second-stage hedonic in a limited context is also considered, with explicit indication of the additional identification conditions (Rosen, 1974) [2].
Linear/hedonic baselines offer interpretability, control over invariant spatial/temporal differences, and a clear diagnostic picture, functioning as benchmarks and “guides” for identifying nonlinearities that ML models exploit.

4.2. Machine Learning Models (GBMs, RF, Neural for Tabular) and Training Practices

Model Selection. GBMs (XGBoost/LightGBM/CatBoost) are chosen for their high performance on heterogeneous, nonlinear regression problems and for strong OOS performance under rigorous validation (Friedman, 2001 [58]; Chen & Guestrin, 2016 [108]; Ke et al., 2017 [109]). RFs serve as robust baselines (Jafary et al., 2024) [16], while CatBoost/ordered boosting efficiently handles categories and limits target leakage in encodings (Prokhorenkova et al., 2018) [110]. Tabular neural networks (embeddings/sparse architectures) are considered complementary, with the awareness that GBMs often remain a reference point (Arik & Pfister, 2019 [111]; Gorishniy et al., 2021 [112]).
Preprocessing/missing values. Numerical features are preserved in the natural or logarithmic domain. Categories: out-of-fold target/impact encoders for GBMs/RF, while CatBoost implements ordered target statistics within the algorithm (Prokhorenkova et al., 2018) [110]. Missing values: LightGBM/XGBoost route “missing” to a separate branch (Ke et al., 2017 [109]; Chen & Guestrin, 2016 [108]); in neural networks, masking or simple imputation (median + indicator) is used.
Hyperparameter tuning/regularization. Nested tuning within the training set (no leakage) is used, with random search for exploration and Bayesian optimization for convergence (Bergstra & Bengio, 2012 [103]; Snoek, Larochelle, & Adams, 2012 [104]).
  • GBMs: Learning rate, number of trees, maximum depth/leaf, subsampling (lines/features), min-data-in-leaf, with early stopping in OOT validation.
  • RF: Number of shallow trees, depth, min samples, max features.
  • Neural: Width/depth, embeddings, dropout/weight decay, one-cycle or cosine schedulers.
The imposition of monotonic constraints (where justified) for logical consistency is also examined (Friedman, 2001) [58].
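A minimal random-search loop over a GBM-style grid, with a frozen seed for reproducibility, can be sketched as below; the search space and the stand-in objective are illustrative, since the real objective is out-of-fold MAE from nested CV, refined afterwards by Bayesian optimization:

```python
import random

# Illustrative GBM-style search space (not the study's actual grid).
SPACE = {
    "learning_rate": [0.01, 0.03, 0.1],
    "max_depth": [4, 6, 8],
    "subsample": [0.7, 0.85, 1.0],
}

def sample_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def random_search(objective, n_trials=20, seed=42):
    # Frozen seed -> identical trial sequence on every rerun.
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = objective(cfg)  # lower is better (e.g., OOF MAE)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```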
Hyperparameter selection is performed with nested cross-validation without leakage, combining random search and Bayesian optimization within the training set. The basic parameters of GBMs, RFs, and neural networks (learning rate, trees, depth, subsampling, dropout) are recorded, and early stopping is used with strict OOT/OOA splits for the final check. All seeds, settings, and versions are stored so that the results can be fully reproduced.
Combinations and ablations. The main model is a late-fusion GBM (fusing structured + NLP + CV + spatial features). Stacked generalization with a light meta-model is used within strict OOT/OOA frameworks (Wolpert, 1992) [113]. Ablations are also performed: (i) structured only, (ii) +NLP, (iii) +CV, (iv) +spatial, reporting ΔMAE/RMSE with significance tests of the differences.
Metrics/robustness. Primary MAE, secondary RMSE/MAPE. The algorithm also trains quantile models (0.05/0.95) to produce prediction intervals. Robustness checks: blocked OOT/OOA, stress tests on tails (Winsorized vs. raw), and error analysis by subgroups (price/age/area quantiles).
Explainability/transparency. Global/local explanations are provided with TreeSHAP (GBMs/RF) and PDP/ALE for relationship forms, as well as error analysis by geographic zone. For the visual channel, Grad-CAM on samples is also used to check face validity.
Infrastructure/reproducibility. All experiments are deterministic-by-seed and recorded (MLflow runs, library versions, hyper-configs, feature hashes). The pipelines are idempotent and one-to-one repeatable for each OOT/OOA split.
Conclusion. The strategy prioritizes GBMs as a combination of performance and interpretability, with RF for stability and tabular neural networks as an exploratory extension for large-scale, strongly nonlinear regimes.

4.3. Avoiding Overfitting and “Fair” Comparison of Models

Principles. Comparison is valid only with strict training–evaluation separation, leakage control, parity of experimental conditions, proper quantification of uncertainty, and full reproducibility.
(a)
Training vs. evaluation: All preprocessing, feature selection, and hyperparameter tuning are performed within training with nested CV; the outer fold (or OOT/OOA block) is used only for final error estimation, limiting model selection bias (Varma & Simon, 2006 [114]; Cawley & Talbot, 2010 [115]).
(b)
Leakage control: All transformations are performed out-of-fold (scalers, target/impact encoders, feature selection). Look-ahead variables (e.g., “days on market”) and duplicate appearances of the same property/image between train–test are excluded, in accordance with established leakage detection/avoidance guidelines (Kaufman, Rosset, & Perlich, 2012) [102].
(c)
Equality of experimental conditions: All models are trained on the same OOT/OOA splits, with the same set of features and the same target/input transformations. The tuning budget (iterations/time) is equalized, and the selection criteria (e.g., MAE in the outer fold) are common (Bergstra & Bengio, 2012) [103].
(d)
Stochasticity and statistical comparison: Accuracy differences are accompanied by blocked bootstrap confidence intervals (time/space) and Diebold–Mariano tests for prediction (Efron, 1979 [97]; Diebold & Mariano, 1995 [116]). Benjamini–Hochberg FDR is applied for multiple comparisons; corrected tests (e.g., 5 × 2 cv) are used as a robustness check for small samples (Benjamini & Hochberg, 1995 [117]; Dietterich, 1998 [118]).
(e)
Reporting and reproducibility: MAE/RMSE/MAPE tables per split, learning curves (error vs. train size), and subgroup diagnostics (e.g., by value/area/age quotient) are published. All seeds, configs, feature hashes, and library versions are documented so that results are one-to-one reproducible.
Conclusion. The combination of nested evaluation, strict leakage control, equal tuning budget, and documented statistical comparison ensures that performance differences reflect true generalization and not artifacts of experimental design.

4.4. Explainability and Decompositions (SHAP, PDP/ALE/ICE, Visual Explanations)

Target: The algorithm provides transparent, reproducible explanations that link the estimated value to interpretable attributes, aligned with the requirements of appraisal practice and compliance.
SHAP: Local and global decompositions. For RF/GBMs, TreeSHAP is used for exact Shapley values and decomposition of predictions into per-attribute contributions at local (per-property) and global levels (Lundberg & Lee, 2017 [55]; Lundberg, Erion, & Lee, 2020 [56]). We report (i) global summaries (mean |SHAP|), (ii) SHAP dependence plots for nonlinearities/thresholds, and (iii) SHAP interaction values (e.g., Area × Accessibility). To mitigate the influence of input correlations, we apply grouped SHAP and consistency checks (Apley & Zhu, 2020) [60].
Relationship form: PDP/ALE/ICE. The form of the relationship for continuous features (area, age, access times, EPC-proxy) is quantified with PDP and ICE, preferring ALE when inputs are correlated to relax the independence assumption (Friedman, 2001 [58]; Goldstein et al., 2015 [59]; Apley & Zhu, 2020 [60]). Monotonicities, saturation points, and functional thresholds (e.g., distance to a public transit station) are documented.
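The PDP computation is a short loop over a feature grid; the toy model in the test is an assumption for illustration, and ALE would replace this averaging when the feature is correlated with the others:

```python
def partial_dependence(predict, X, feature_idx, grid):
    # PDP (Friedman, 2001): average the model's prediction over the
    # sample while forcing one feature to each grid value in turn.
    curve = []
    for g in grid:
        preds = []
        for row in X:
            row2 = list(row)
            row2[feature_idx] = g  # intervene on the feature of interest
            preds.append(predict(row2))
        curve.append(sum(preds) / len(preds))
    return curve
```

Plotting `grid` against the returned curve reveals monotonicities, saturation points, and thresholds of the kind documented above.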
Visual explanations (CV channel). For selected samples, Grad-CAM maps highlight image regions critical for prediction (e.g., window frames, material condition, condition of wet areas), used exclusively for face-validity diagnostics, not as labels (Selvaraju et al., 2017) [61].
Stability/robustness of explanations. We assess (i) SHAP variance under bootstrap/seed changes, (ii) the temporal/spatial stability of global rankings, and (iii) PDP/ALE robustness to reasonable preprocessing changes. We report similarity indices (e.g., Spearman) and present discrepancies where they appear (Alvarez-Melis & Jaakkola, 2018) [119].
Counterfactuals. Limited, realistic “what-if” scenarios are designed around a few modifiable attributes (e.g., window/bathroom upgrades), with explicit disclaimers about causality and plausibility/feasibility constraints (Wachter, Mittelstadt, & Russell, 2018) [120].
Compliance and reporting package. The explainability package is aligned with RICS/IVS, IAAO ratio studies, and QCS (audit trails, bias testing). Each result is accompanied by (i) a global SHAP summary, (ii) 2–3 PDP/ALE of key attributes, (iii) a local SHAP card per property for operational reporting, and (iv) 2–3 Grad-CAM examples with qualitative commentary. All explanations are stated to be descriptive, not causal (Molnar, 2022) [62].

4.5. Uncertainty, Prediction Intervals, and Calibration

The goal is to provide point values and prediction intervals (PIs) with proper coverage and sharpness, evaluating calibration (coverage ≈ nominal) and sharpness separately via appropriate scoring rules (Gneiting & Raftery, 2007) [121].

4.5.1. Model-Centered Approaches

Quantile regression: Training at τ ∈ {0.05, 0.95} (pinball loss)
PI = [q̂_0.05(x), q̂_0.95(x)]
captures heteroscedasticity without distributional assumptions (Koenker & Bassett, 1978) [96].
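The pinball loss underlying these quantile models can be sketched directly; minimizing it over a constant recovers the empirical quantile, which quantile models generalize per covariate profile x. The grid-search minimizer below is a didactic device, not the training procedure:

```python
def pinball_loss(y, q, tau):
    # Asymmetric quantile loss: under-predictions weighted by tau,
    # over-predictions by (1 - tau).
    d = y - q
    return tau * d if d >= 0 else (tau - 1) * d

def best_constant_quantile(ys, tau, grid):
    # The constant minimizing mean pinball loss is the empirical
    # tau-quantile of the sample (ties possible on a discrete grid).
    return min(grid, key=lambda q: sum(pinball_loss(y, q, tau) for y in ys))
```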
Probabilistic models: Estimate p(y∣x) (e.g., NGBoost for Gaussian/t); PIs from quantiles, evaluation with CRPS (Duan et al., 2020 [122]; Gneiting & Raftery, 2007) [121].
Epistemic uncertainty is addressed using bootstrap ensembles or Bayesian methods in regions with low data density.

4.5.2. Error-Based Approaches

The absolute error |y − ŷ| or squared error (y − ŷ)² is modeled as a function of features/residual features to produce heteroskedastic PIs around ŷ(x). This is suitable for rolling recalibration when the error regime changes (Pollestad, 2024) [53].

4.5.3. Conformal Prediction

  • Split conformal over validation for guaranteed finite-sample coverage.
  • CV+/jackknife+ for more stable/narrow intervals.
  • Weighted/locally adaptive versions for covariate shift (Vovk et al., 2005 [123]; Barber et al., 2021 [124]; Romano et al., 2019 [125]).
Model-agnostic approach, ideal for OOT/OOA.
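Split conformal prediction reduces to a quantile of calibration residuals; this symmetric-interval sketch uses the finite-sample (n+1) correction, while the weighted/locally adaptive variants cited above adapt the width per observation:

```python
import math

def split_conformal_interval(calib_y, calib_pred, alpha=0.10):
    # Split conformal: the (1 - alpha) finite-sample quantile of absolute
    # calibration residuals gives a half-width q such that
    # [y_hat - q, y_hat + q] covers with probability >= 1 - alpha.
    scores = sorted(abs(y - p) for y, p in zip(calib_y, calib_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    return scores[min(k, n) - 1]

def predict_interval(y_hat, q):
    # Model-agnostic: wraps any point predictor's output.
    return (y_hat - q, y_hat + q)
```

Because the wrapper only needs held-out residuals, it applies unchanged to any of the OOT/OOA models above.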

4.5.4. Calibration of Quantiles/Probabilities

Isotonic or spline calibration maps are fitted on validation data only and “frozen” before the OOT/OOA test. Checks use PIT histograms (probabilistic calibration) and reliability diagrams for regression quantiles (Kuleshov et al., 2018 [126]; Gneiting & Raftery, 2007 [121]; Levi et al., 2022 [54]).

4.5.5. Uncertainty Assessment

We report (i) coverage vs. nominal (overall and by subgroup: price, area, age, zones), (ii) mean PI width and sharpness, (iii) the interval score (Winkler/Gneiting–Raftery) and CRPS when a full density is available, and (iv) rolling backtests for coverage drift (Winkler, 1972 [127]; Gneiting & Raftery, 2007 [121]).

4.5.6. Tracking and Recalibration

In operation: PSI/KS for error drift, coverage tracking per month/zone, and triggered recalibration when coverage declines. In mass appraisal, we combine PI-metrics with ratio studies (COD/PRD/PRB) to achieve error uniformity (IAAO, 2025) [50].
Practical protocol. (1) Quantile GBM (0.05/0.95); (2) split conformal in OOT validation for final width adjustment; (3) isotonic calibration of quantiles; (4) 95% PI + CRPS/interval score reporting; (5) coverage dashboards by subgroup and monthly recalibration where required. The combination offers sharp intervals, coverage guarantees, and operational monitoring.

4.6. Model Governance and Compliance (RICS/IVS, IAAO, QCS)

The governance framework establishes the operational reliability of the AVM through three lines of defense: development (ownership/technical choices), independent validation (verification/validation), and internal control–compliance (policies/monitoring), in alignment with RICS/IVS, IAAO, and QCS.
(a)
Roles and life cycle. Model owner (purpose, assumptions, limits of use/risks), data steward (origin/licenses, quality, GDPR/retention), independent validator (technical verification, reproducibility, stress tests), change manager (versioning, change log, rollback). Cycle: design → development → independent validation → approval → production → monitoring/recalibration → decommissioning.
(b)
Documentation/transparency. Datasheets (sources, licenses/ToS, biases), Model Cards (purpose, data window, attributes, metrics, constraints), full audit trail (seeds/configs/libraries), uncertainty reports (PI coverage) and statements that the AVM does not replace inspection/certification, as required by the Red Book (RICS/IVS), are adopted.
(c)
Quality and ratio studies (IAAO). In mass appraisal, ratio studies (COD/PRD/PRB) are regularly performed by category and zone on OOT/OOA samples, with a parallel PI coverage report to ensure error uniformity (IAAO).
(d)
QCS requirements and bias testing. QCS (CFPB et al., 2024) [19] mandates documented QA/VV before and during use, prevention of conflicts of interest, and systematic bias testing for AVMs. Subgroup errors (area/price range/type) and parity indicators are monitored, and mitigation measures are implemented (feature review, monotonic constraints, reweighting).
(e)
Leakage, drift, and change checks. Out-of-fold preprocessing, avoid look-ahead variables, group splits at building level; monthly PSI/KS on inputs/residuals, coverage tracking of PIs and triggers for recalibration/retrain; standard change control with RFCs, acceptance criteria (MAE/COD/coverage), shadow/A-B before full rollout.
(f)
Ethics and privacy. ToS/license and GDPR compliance (minimization, pseudonymization, EXIF removal, face/plate blurring). Extraction/inference of sensitive attributes is avoided. EPC-proxies are declared as information signals, not certifications.
(g)
Independent validation and reproducibility. Each major release undergoes cold-start verification on fresh OOT/OOA, transparency package (SHAP/PDP/ALE/indicative Grad-CAM), and backtesting. Data/code are released as versioned artifacts for one-to-one iteration.
(h)
Reporting to stakeholders. Standard dashboards with MAE/RMSE/COD/PRD/PRB, PI coverage/width, drift, bias tests, and change log. Each prediction is accompanied by a 95% PI and a local explanation card.

4.7. Implementation, MLOps, and Reproducibility (Pipeline, Registries, Monitoring)

The design principle focuses on an operational AVM that minimizes ML technical debt through a strict pipeline, full traceability, a feature store, a model registry, automated testing, and production monitoring (Sculley et al., 2015) [128].
Architectural layers.
  • Data layer: Versioned repositories (raw → curated) with proactive shape/range/uniqueness checks before each run to avoid “silent” failures.
  • Feature store: Unified feature definitions (structured, NLP, CV, spatial), offline/online computation equivalence, and version tags per set (Baylor et al., 2017) [129].
  • Training layer: Reproducible runs with MLflow, stable seeds, and environment snapshots; accompanying Model Cards (Mitchell et al., 2019 [77]).
  • Model registry: Versions with metadata (data window, OOT/OOA metrics, PI coverage, bias tests) and approval gates.
  • Serving layer: Batch (mass evaluations) and lightweight online endpoint with low coupling from upstream sources (Paleyes, Urma, & Lawrence, 2020 [130]).
Testing and quality.
  • ML Test Score: Unit/integration tests in ETL, input distribution checks, training repeatability, explanation stability, alarms on deviations.
  • Repro CI/CD: Every change goes through CI with synthetic fixtures and small-sample re-training for early detection of regressions (Sculley et al., 2015 [128]).
  • Experimental control: Mandatory ablations and parity with baselines before promotion.
Promotion/changes. Shadow deployment → canary/blue–green with SLOs (MAE, COD, PRD/PRB, PI coverage) and automatic rollback on SLO violation or drift (Humble & Farley, 2010) [131].
Monitoring and drift.
  • Data/feature drift: PSI/KS on basic inputs and EPC-proxy.
  • Prediction and uncertainty drift: Rolling PI coverage (90/95%), interval score/CRPS, comparison against the “last stable” version.
  • Bias/stability dashboards: Errors per subgroup (price/age/area quotients, zones) and global SHAP stability over time; findings → mitigation (reweighting, monotonic constraints, retrain) (Gama et al., 2014) [67].
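The PSI checks in the monitoring bullets above can be sketched as follows. This is a minimal, illustrative implementation assuming decile bins taken from the baseline window; the 0.2 alert threshold mentioned in the comment is a common industry convention, not a value fixed by this framework:

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between a baseline window and a current
    window of one feature; values above ~0.2 are a common retrain trigger."""
    ref = sorted(baseline)
    # decile cut points taken from the baseline distribution
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def shares(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(1 for e in edges if x > e)] += 1
        # floor to avoid log(0) on empty buckets
        return [max(c / len(xs), 1e-6) for c in counts]

    p, q = shares(baseline), shares(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

In production such a function would run monthly per input feature and per residual series, with the output compared against the drift dashboard thresholds.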
Security, privacy, access. Least privilege, encryption in transit/at rest, PII reduction in logs, DLP policies. All artifacts with provenance (checksums, URL/image hashes) and retention compliant with GDPR/ToS.
Auditability and reproducibility. Each prediction carries (i) model-version and feature-set version, (ii) local explanation card (SHAP), (iii) 95% PI, and (iv) timestamps and data snapshot ID for accurate reconstruction by a third party (Baylor et al., 2017 [129]; Mitchell et al., 2019 [77]).
Summary. Aligning with MLOps practices reduces vulnerability to data/software changes, enhances trust/compliance, and guarantees scientific repeatability of results.
Additional technical details that ensure the reproducibility of the methodology, such as configuration files, feature definitions, OOT/OOA schemas, and illustrative code snippets, are summarized in Appendix A.2, on reproducibility. A complementary set of minimal, depersonalized code examples illustrating the preprocessing, splitting, training, interval estimation, and logging steps is provided in Appendix A.3.

4.8. Limitations, Threats to Validity, and Limits of Generalizability

Internal validity.
(a)
Measurement errors: Inaccuracies in critical attributes (sq m, floor, renovation status) may introduce attenuation or nonlinear biases; domain dictionaries, double coding, and robust losses mitigate these without nullifying them (Carroll et al., 2006) [65].
(b)
Information leakage: implicit look-ahead or reappearance of the same property/images between train–test; out-of-fold pipelines and grouped splits substantially reduce the risk (Kaufman, Rosset, & Perlich, 2012) [102].
(c)
Concept/covariate drift: abrupt changes in preferences/regulations can degrade performance; monitoring and (re)calibration/retraining are required (Gama et al., 2014) [67].
External validity and transferability. Transfer to a new market requires similarity of distributions and stability of mechanisms; transportability theory suggests that the same explanatory variables do not guarantee the same results when the dependencies/selection mechanisms change (Pearl & Bareinboim, 2014) [132]. OOA assessment is necessary and, operationally, re-weighting/propensity adjustment for strong deviations. Non-random sampling of advertisements creates selection bias that is methodologically correctable but difficult to eliminate (Heckman, 1979) [64].
Causal interpretation. Models are predictive/estimative; high accuracy does not imply causality without appropriate design/assumptions. SHAP/PDP/ALE are descriptive; additional identification assumptions are required for marginal willingness to pay (Shadish, Cook, & Campbell, 2002 [133]; Rosen, 1974) [2].
Spatial confounding and endogeneity. Proximity to public transit/POIs is potentially endogenous (infrastructure is placed in already attractive areas). Spatial stability, SAR/SEM, and robustness help, but spurious correlations may persist without physical experiments or instrumentation (Anselin, 1988) [107].
Technical debt and reproducibility. Complex pipelines carry risks of feature drift/contract violation; even with MLOps, continuous testing, monitoring, and full documentation are required (Sculley et al., 2015) [128].
Ethics/legal. Text/image processing is GDPR/ToS compliant; no sensitive features are extracted. EPC-proxies are explicitly declared as information signals, not as formal certifications.
Practical mitigations (summary). Gold sets and double encoding for critical entities; out-of-fold preprocessing and grouped splits; OOT/OOA with blocked resampling; conformal PIs for reliable coverage; drift dashboards and scheduled retrains; clear purpose/boundary statements in reports, aligned with RICS/IVS.
In addition to issues of measurement and spatial endogeneity, a significant limitation is the sample’s representativeness, as data from online classifieds do not reflect a random or complete representation of inventory or actual transactions. This fact may create systematic distortions in subgroups of properties or regions. Furthermore, the model’s transferability to other markets or time periods is limited, as spatial structures, feature distributions, and pricing mechanisms differ. Although the OOT/OOA schemes reduce the risk of overfitting to the specific sample, careful recalibration is required in the face of market changes or changes in the property mix.

Ethical Issues, Data Privacy, and Legal Compliance

Multimodal property valuation is based on highly sensitive data, and therefore ethical and legal issues constitute a critical constraint requiring special and systematic consideration. The text and image data of the advertisements are anonymized and processed in accordance with the principles of the GDPR (minimization, purpose limitation, lawfulness), while elements that could allow the re-identification of natural persons are removed. At the same time, risks of algorithmic bias that may reflect spatial or socio-economic disparities are assessed, with corresponding adjustments to the modeling process. The proposed framework is accompanied by explainable AI practices to ensure that the contributing factors to the estimates remain transparent and auditable by valuers, users, and regulators.

5. Model Tests

5.1. Descriptive Statistics and Diagnostic Tests

A brief presentation of the data profile and the readiness checks before modeling is a crucial step. For continuous variables (€/sq m, area, age, access times), report the median, IQR, and MAD, as well as skewness/kurtosis for a quick assessment of distribution shape (Tukey, 1977) [134]. For categorical variables (floor, heating, parking, etc.), give frequencies/proportions and indicative barplots. The dependent variable is defined as log(€/sq m); this transformation reduces heteroskedasticity and skewness.
Outliers and cleaning. According to Iglewicz & Hoaglin (1993) [94], the detection of extreme values is performed by two robust methods: (a) calculation of the modified Z-score with respect to the median and MAD, and (b) application of IQR rules to identify observations that exceed predetermined limits. For each detected outlier, it is recorded whether it is a measurement error or a rare but valid observation. The final treatment is performed via Winsorization or trimming, using thresholds defined in advance and explicitly documented in the config file, ensuring the process is fully transparent and reproducible.
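The two robust rules can be sketched as below; the 3.5 cut-off for the modified Z-score and the 1.5×IQR fences follow the cited convention, while combining the flags with a logical OR is an illustrative choice rather than a fixed part of the pipeline:

```python
import statistics

def modified_z_scores(xs):
    """Modified Z-score: 0.6745 * (x - median) / MAD (Iglewicz & Hoaglin)."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    if mad == 0:
        return [0.0] * len(xs)
    return [0.6745 * (x - med) / mad for x in xs]

def iqr_bounds(xs, k=1.5):
    """Tukey fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

def flag_outliers(xs, z_cut=3.5, k=1.5):
    """Flag observations failing either the modified-Z or the IQR rule."""
    lo, hi = iqr_bounds(xs, k)
    zs = modified_z_scores(xs)
    return [abs(z) > z_cut or not (lo <= x <= hi) for x, z in zip(xs, zs)]
```

Each flagged observation would then be reviewed and either corrected, Winsorized, or kept as a rare but valid case, with the decision recorded in the config file.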
Preprocessing follows clear, documented protocols: missingness detection and management (MCAR/MAR) with heatmaps and regressions; imputation options recorded in config files; and outlier detection using IQR and modified Z-score. Winsorization/trimming decisions are based on predefined rules. This ensures full transparency and reproducibility of preprocessing.
Missingness and non-response. As proposed by Little & Rubin (2019) [135], the rate of missingness per variable is calculated, and co-occurrence patterns are examined via heatmaps. Then, it is assessed whether the gaps are MCAR or MAR using logistic regressions in which a “missing” indicator serves as the dependent variable. Depending on the results, the data are treated either by simple methods (median imputation or the introduction of missingness indicators) or by multiple imputation, which is used specifically for robustness analyses. All steps, controls, and decisions are documented in the config to ensure reproducibility.
Balance between splits. For OOT/OOA sets, comparison of distributions of key features with Standardized Mean Differences (SMD), KS-tests, and indicative QQ-plots; the goal is statistical similarity between train/val/test so that the evaluation is not affected by artificial mix shifts (Austin, 2009) [136].
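The SMD check per feature can be sketched as follows; the |SMD| < 0.1 balance threshold mentioned in the comment is the common convention from the cited literature, and the helper itself is illustrative:

```python
import statistics

def smd(split_a, split_b):
    """Standardized mean difference for one feature across two splits;
    |SMD| < 0.1 is a common balance threshold (Austin, 2009)."""
    m1, m2 = statistics.fmean(split_a), statistics.fmean(split_b)
    v1, v2 = statistics.variance(split_a), statistics.variance(split_b)
    pooled_sd = ((v1 + v2) / 2) ** 0.5
    return (m1 - m2) / pooled_sd if pooled_sd else 0.0
```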
Correlation and multicollinearity. Presentation of Spearman ρ for continuous measures and VIF at an interpretive baseline, grouping closely related features (e.g., multiscale accessibility indices) to avoid over-parameterization and estimation instability (Kutner, Nachtsheim, & Neter, 2004) [137].
Space and time. Heatmaps of pre-modeling values/residuals and Moran’s I on simple-regression residuals are designed to indicate spatial autocorrelation; they confirm the need for blocked spatial/temporal splits, which are partially implemented (Anselin, 1988 [107]; Roberts et al., 2017 [99]). The results of Moran’s I on the residuals of the basic specifications indicated positive spatial autocorrelation in some submarkets, confirming the need for spatial correction. For this reason, spatial HAC errors were applied, while SAR/SEM models were also evaluated as a robustness check.
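Moran’s I on residuals can be sketched with a dense weight matrix as below; the neighbor definition and any row-standardization of weights are analyst choices, and this helper is a minimal illustration rather than the production routine:

```python
def morans_i(values, weights):
    """Moran's I statistic; weights[i][j] is the spatial weight between
    observations i and j (zero diagonal). Positive values indicate clustering
    of similar residuals; values near zero indicate no spatial autocorrelation."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)
```

With a binary contiguity matrix, similar residuals among neighbors push the statistic toward +1, motivating the blocked spatial splits and HAC/SAR/SEM corrections described above.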
Brief readiness summary. (i) log(€/sq m) stabilizes dispersion, (ii) outliers are manageable with predefined Winsor rules, (iii) missing values are mapped/controlled, (iv) splits are balanced (low SMD), and (v) documented spatial dependence → adopt OOT/OOA.

5.2. Baseline Performance vs. ML

Evaluation of linear/hedonic baselines vs. GBMs/RF/neural networks on the same OOT/OOA splits, with MAE as the primary metric and RMSE/MAPE as secondary metrics. ML models (mainly GBMs) are expected to outperform by capturing nonlinearities/interactions (Friedman, 2001 [58]; Chen & Guestrin, 2016 [108]; Ke et al., 2017 [109]), while RFs act as a robust benchmark (Breiman, 2001) [138]. Report the ΔMAE% vs. the best linear baseline and boxplots of errors per split to show out-of-sample variation.
Significance of differences. Performance differences between models are tested (a) with paired MAE differences per split and 95% CIs from blocked bootstrap (time/space) (Efron, 1979) [97] and (b) with Diebold–Mariano tests on time blocks for predictive accuracy (Diebold & Mariano, 1995) [116]. A minimum practical benefit (e.g., ≥2–3% ΔMAE) is set to distinguish substantial from marginal improvements.
Uniformity and functional indicators. In mass appraisal scenarios, the metrics are accompanied by ratio studies (COD/PRD/PRB) per category/zone, according to IAAO, to test for uniformity of errors across the range of values (IAAO, 2025) [50]. The results are presented in a single table: (i) MAE/RMSE/MAPE, (ii) COD/PRD/PRB, (iii) coverage of 90/95% PIs and interval score/CRPS (Gneiting & Raftery, 2007) [121].
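The blocked-bootstrap CI for the mean paired difference can be sketched as below, operating on time-ordered per-observation differences |e_A| − |e_B|; the block length, replicate count, and seed are illustrative defaults, not values prescribed by the framework:

```python
import random

def blocked_bootstrap_ci(diffs, block=30, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean of time-ordered paired error differences,
    resampling contiguous blocks to respect serial dependence."""
    rng = random.Random(seed)
    n = len(diffs)
    n_blocks = -(-n // block)  # ceil: blocks needed to rebuild length-n series
    means = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(max(1, n - block + 1))
            sample.extend(diffs[start:start + block])
        sample = sample[:n]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```

A CI entirely below zero would indicate that model A’s absolute errors are systematically smaller; the practical-benefit threshold (≥2–3% ΔMAE) is then applied on top of statistical significance.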
Calibration and uncertainty. For each model, report coverage vs. nominal (90/95%), mean PI width, and interval score/CRPS. Where necessary, apply meta-calibration (e.g., isotonic) in the quantile space, so that the probabilistic predictions are well calibrated.

5.3. Ablation Studies and Channel Contribution

The aim is to quantify the total and marginal contribution of each feature channel. A hierarchical protocol is defined as follows:
A: Only structured (structural/transactional) → B: A + NLP → C: B + CV → D: C + space (accessibility/POIs). Each step is trained/evaluated on the same OOT/OOA splits (§3.6). The primary metric is MAE; secondary RMSE/MAPE and, for uncertainty, interval score/CRPS. We report ΔMAE% per transition (A→B, B→C, C→D) with 95% CI from blocked bootstrap (Efron, 1979) [97] and Diebold–Mariano test on time blocks (Diebold & Mariano, 1995) [116].
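The cumulative A→B→C→D protocol can be sketched as a loop over nested channel sets; `train_eval` stands in for the actual fit-and-score routine on one split, and the channel labels are illustrative:

```python
# Nested feature sets for the hierarchical ablation protocol.
CHANNELS = {
    "A": ["structured"],
    "B": ["structured", "nlp"],
    "C": ["structured", "nlp", "cv"],
    "D": ["structured", "nlp", "cv", "spatial"],
}

def ablation_table(train_eval, splits):
    """Mean MAE per cumulative step and marginal gain (delta-MAE %) per transition.
    `train_eval(channels, split)` is assumed to return the OOT/OOA MAE."""
    mae = {step: sum(train_eval(cols, s) for s in splits) / len(splits)
           for step, cols in CHANNELS.items()}
    steps = list(CHANNELS)
    deltas = {f"{a}->{b}": 100 * (mae[a] - mae[b]) / mae[a]
              for a, b in zip(steps, steps[1:])}
    return mae, deltas
```

The resulting ΔMAE% per transition would then be paired with the blocked-bootstrap CIs and Diebold–Mariano tests described in the text.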
Group importance and interactions. The ablations are complemented with grouped importance via grouped permutation and grouped SHAP (Lundberg & Lee, 2017 [55]). SHAP interaction values capture whether gains arise from synergies (e.g., EPC-proxy × age, NLP × CV). To avoid bias from correlations among accessibility indicators, ALE is used to characterize relationship shapes (Apley & Zhu, 2020) [60].
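Grouped permutation importance can be sketched as below: all columns of a group are permuted jointly (with the same row order), preserving within-group correlation. The row-as-dict layout and the `predict` callable are illustrative assumptions, not the production interfaces:

```python
import random

def grouped_permutation_importance(predict, X, y, group, n_rep=10, seed=0):
    """Average MAE increase when all columns in `group` are permuted jointly
    across rows, keeping their within-group correlation structure intact."""
    rng = random.Random(seed)

    def mae(rows):
        return sum(abs(predict(r) - t) for r, t in zip(rows, y)) / len(y)

    base = mae(X)
    increases = []
    for _ in range(n_rep):
        perm = list(range(len(X)))
        rng.shuffle(perm)
        shuffled = [dict(row) for row in X]
        for i, j in enumerate(perm):
            for col in group:
                shuffled[i][col] = X[j][col]  # same permutation for every column
        increases.append(mae(shuffled) - base)
    return sum(increases) / n_rep
```

A near-zero value for a group (e.g., a bundle of correlated accessibility indices) suggests the model does not rely on it, complementing the grouped SHAP view.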
Cost–benefit and robustness. For each channel addition, report (i) computational/operational cost (ETL/annotation/serving time), (ii) accuracy gain (ΔMAE) and uncertainty improvement (interval score reduction), and (iii) impact on uniformity (ΔCOD/PRD/PRB). We construct Pareto diagrams (benefit vs. cost) and stress tests (Winsor rules; alternatively, buffers/decays) to check the robustness of the conclusions (Kutner, Nachtsheim, & Neter, 2004) [137].
Short reading. (a) Report where the most significant marginal gain occurs (usually A→B or B→C for advertisements); (b) document whether the spatial channel improves mainly the error tails and PI coverage; (c) when the addition of a channel does not pass a practical threshold (e.g., 2–3% ΔMAE), document its non-adoption in production.

5.4. Subgroup Performance, Uniformity, and Ratio Studies

Examination of the uniformity of errors per subgroup (price, area, age, property type, zones) and a check of whether price-related deviation appears. In a mass appraisal environment, follow the IAAO ratio studies: calculation of the Median/Mean Assessment Ratio, COD (dispersion), PRD (price-related differential), and PRB (price-related bias) per category/geographic unit (IAAO, 2025) [50]. The uncertainty intervals of the indicators are estimated with the blocked bootstrap (time/space).
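The core ratio-study indicators can be sketched as follows, using the standard IAAO definitions (COD as the mean absolute deviation from the median ratio, PRD as the mean ratio over the sales-weighted mean ratio); the function is a minimal illustration, and PRB, which requires a regression of ratios on value, is omitted here:

```python
import statistics

def ratio_study(assessed, sale_prices):
    """IAAO-style ratio study summary: median ratio, COD (uniformity, in %),
    and PRD (price-related differential; values above ~1.03 suggest
    regressivity, i.e., relative over-assessment of low-value properties)."""
    ratios = [a / s for a, s in zip(assessed, sale_prices)]
    med = statistics.median(ratios)
    cod = 100 * statistics.fmean(abs(r - med) for r in ratios) / med
    prd = statistics.fmean(ratios) / (sum(assessed) / sum(sale_prices))
    return {"median_ratio": med, "COD": cod, "PRD": prd}
```

In the proposed framework such summaries would be computed per category and geographic unit, with blocked-bootstrap intervals around each indicator.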
Diagnostic bias. (i) Residual vs. value plots and local regressions (LOWESS) to detect systematic over-/underestimation along with the price. (ii) Conditional coverage of PIs (90/95%) per subgroup to reveal uncertainty asymmetries (Gneiting & Raftery, 2007) [121]. (iii) Spatial correlation of residuals (Moran’s I) and heatmaps to identify pockets of systematic error (Anselin, 1988) [107].
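The conditional-coverage diagnostic in point (ii) can be sketched as a per-subgroup tally; the grouping labels and array layout are illustrative:

```python
from collections import defaultdict

def coverage_by_group(y_true, pi_low, pi_high, groups):
    """Empirical coverage of prediction intervals per subgroup; each entry is
    compared against the nominal level (e.g., 0.95) to reveal asymmetries."""
    hits, totals = defaultdict(int), defaultdict(int)
    for y, lo, hi, g in zip(y_true, pi_low, pi_high, groups):
        totals[g] += 1
        hits[g] += int(lo <= y <= hi)
    return {g: hits[g] / totals[g] for g in totals}
```

Subgroups whose empirical coverage falls materially below nominal would then be candidates for the targeted recalibration described below.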
Mitigations and functional corrections. In addition to the procedures for detecting deviations through ratio studies, residual diagnostics, and subgroup controls, targeted bias mitigation techniques were applied. In particular, error rescaling, reweighting of observations in cases of unevenly distributed characteristics, and functional constraints (e.g., monotonicity) on variables where the economic background imposes a specific form of relationship have been used. These interventions enhance uniformity and limit systematic deviation in subgroups.
When price-related bias or uneven coverage is identified:
  • Recalibration (quantile or error-based) in validation, targeted per zone/subgroup.
  • Monotonic constraints on key features (e.g., area) and grouped explanations (grouped SHAP) to identify associations that create pseudo-monotonicities.
  • Reweighting/propensity when OOT/OOA sets have a different mix.
  • Policy notes in reports: transparent reporting of ratio studies findings and implications for AVM use. In scenarios falling under QCS (USA), we explicitly document bias testing and mitigation plans (CFPB et al., 2024) [19].
Summary. Report in a condensed form: (i) COD/PRD/PRB by key category, (ii) conditional coverage PIs, (iii) spatial pockets with Moran’s I, and (iv) corrective actions that improve uniformity without appreciable loss of MAE.

5.5. Uncertainty and Calibration

Coverage and sharpness. The 95% prediction intervals (PI) from quantile GBM had an average coverage of 94.7% in OOT and 94.2% in OOA splits; after split-conformal correction, the coverage was aligned to 95.0–95.3% with a slight increase in width (+2.1% median) and an improvement in interval score (↓ 3–5%) (Vovk, Gammerman, & Shafer, 2005 [123]; Gneiting & Raftery, 2007 [121]). PIT histograms show a nearly flat distribution after isotonic quantile calibration (Kuleshov, Fenner, & Ermon, 2018) [126].
Comparison with alternatives. The probabilistic version (NGBoost) achieved similar coverage but narrower PIs at mid-range values (↓ interval score by ~2%), while at the extremes, the conformal correction of quantile-GBM remained more reliable (Gneiting & Raftery, 2007 [121]; Duan et al., 2020 [122]). The jackknife+/CV+ PIs were marginally narrower than split-conformal, with consistent coverage across time blocks (Barber, Candès, Ramdas, & Tibshirani, 2021) [124].
Subgroup evaluation. By value quantile, the pre-conformal coverage showed under-coverage at high values (−1.5 to −2.0 percentage points), which was eliminated after correction; the mean width remained proportional to the value (stabilized on a logarithmic scale). By zone, mild over-coverage was detected in areas of low data density; locally weighted conformal with similarity weights reduced the difference (Romano, Patterson, & Candès, 2019) [125].
Operational reading. Suggested defaults: (i) quantile-GBM, (ii) split-conformal correction in OOT validation, and (iii) isotonic meta-calibration. In production, coverage vs. nominal is monitored monthly by zone, and recalibration is triggered when the deviation exceeds 1 percentage point, keeping the sharpness/coverage trade-off within predefined limits.
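The split-conformal correction of quantile intervals (conformalized quantile regression, following Romano et al.) can be sketched as below; the symmetric margin and the ceil((n+1)(1−α)) rank are the standard recipe, while the function names and array layout are illustrative:

```python
import math

def cqr_margin(cal_low, cal_high, cal_y, alpha=0.05):
    """Conformity scores measure how far each calibration target falls outside
    its raw quantile interval (negative if strictly inside). The
    ceil((n+1)(1-alpha))-th smallest score is the margin added on both sides."""
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(cal_low, cal_high, cal_y))
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

def conformalize(lo, hi, margin):
    """Widen (or shrink, if margin < 0) a raw interval by the margin."""
    return lo - margin, hi + margin
```

A negative margin narrows intervals that over-cover on calibration data; a positive margin widens under-covering ones, which matches the small width increase reported above.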

5.6. Explainability (Global/Local)—Summary of Findings

Global picture. TreeSHAP summaries in the best GBM show as top factors (by |SHAP|): location/neighborhood (fixed/spatial features), area, age/year (with nonlinearities), accessibility indicators/POIs, and EPC-proxy. The ranking is maintained stably in OOT and OOA splits; Spearman correlations of global rankings remain high, indicating transferability of essential features (Lundberg & Lee, 2017 [55]; Lundberg, Erion, & Lee, 2020 [56]).
Relationship shape. SHAP dependence and ALE curves show the following:
  • Area: Decreasing marginal benefit (log-relation) and soft saturation thresholds.
  • Age: U–type shape (old/historical premium, with penalty at intermediate ages) with moderation when renovation status = full.
  • Accessibility: Positive slopes up to medium values and inversion very close to busy nodes (comfort/nuisance trade-off).
  • EPC-proxy: Stable positive signal, stronger in areas with lower average energy stock.
ALEs reduce independence bias in correlated indicators (Apley & Zhu, 2020) [60].
Interactions. Significant SHAP interactions emerge: (Area × Accessibility), (Age × RenovationStatus), and (EPC-proxy × Age). In particular, the combination of age + comprehensive renovation reverses much of the “age penalty”, confirming a corrective role of quality cues (Friedman, 2001) [58].
Local explanations. The SHAP cards per property document the mix of positive/negative contributions (e.g., +area, +EPC-proxy, −age, −distance to green). In high-value underestimations, local explanations often point to non-standard combinations (e.g., large area + unique amenities), which are helpful for case review.
Stability and controls. The bootstrapped variance of global SHAP is low for the top 5–7 factors; greater instability occurs in secondary spatial/POI indicators. The inference curves (PDP/ALE) remain consistent under alternative transformations, supporting robustness of the interpretations (Goldstein, Kapelner, Bleich, & Pitkin, 2015 [59]; Apley & Zhu, 2020 [60]).

5.7. Visual Explanations, Case Studies, and Robustness

Visual explanations and cases. Grad-CAM heatmaps are selectively generated on photographs (windows, bathrooms, damage) for face validity of visual cues and presented in “panels” per property (Selvaraju et al., 2017) [61]. These are supplemented with counterfactual “what-if” scenarios (e.g., complete renovation, window upgrade) under realistic cost/feasibility constraints and an explicit note of non-causality (Wachter, Mittelstadt, & Russell, 2018 [120]; Molnar, 2022 [62]).
Robustness and sensitivity. A condensed stress-test package is performed: (i) alternative splits (time-blocked, blockCV space), (ii) different Winsor/trim rules, (iii) monotonic constraints on key features, (iv) input/transformation perturbations, and (v) seed/bootstrapped iterations for MAE and global SHAP stability (Efron, 1979 [97]; Valavi, Elith, Lahoz-Monfort, & Guillera-Arroita, 2019 [101]). Accuracy differences are evaluated with Diebold–Mariano tests on time blocks, with 95% CIs reported (Diebold & Mariano, 1995) [116]. For prediction intervals, coverage vs. nominal is confirmed after conformal correction and subgroup/zone controls (Romano, Patterson, & Candès, 2019) [125]. Overall, the underlying trends remain consistent; when deviations are identified, trade-offs are documented (e.g., slightly wider PIs for better coverage) and canary/shadow rules are implemented before adoption into production.
Figure 3 presents the overall data preprocessing and documentation framework, organized into four fundamental sections. It illustrates outlier detection and correction procedures, the systematic analysis and handling of missing data, and the handling of spatial autocorrelation in accordance with the respective literature approaches. At the same time, it highlights the framework of transparency and reproducibility through the use of Datasheets, Model Cards, and versioning. The framework emphasizes the precise documentation of all preprocessing options and the need for a unified record of settings. Overall, the diagram captures a coherent and controlled pipeline that enhances the validity and reliability of the analytical results.

6. Results

6.1. Discussion of Results and Practical Implications

Summary of key findings. Multimodal models (structured + NLP + CV + spatial) consistently outperformed linear/hedonic baselines in OOT/OOA evaluations with substantial MAE reduction and interval score improvement, while maintaining good coverage after conformal correction. Accessibility and usage indicators contributed in a manner consistent with the accessibility theoretical framework, while the EPC-proxy provided an additive signal in markets with incomplete energy documentation, aligned with “green premium” assumptions.
Interpretation of relationships. SHAP/ALE decompositions showed (a) a logarithmic price–area relationship (diminishing marginal benefit), (b) an “age penalty” that is strongly mitigated by full renovation, (c) a positive but non-monotonic accessibility effect (benefit up to a threshold—then potential nuisance), and (d) a consistently positive EPC-proxy signal, stronger where the average energy stock is lower (Lundberg & Lee, 2017 [55]; Apley & Zhu, 2020 [60]). These patterns are consistent with hedonic theory and with empirical results for transport accessibility (Debrezion, Pels, & Rietveld, 2007) [24].
What they mean for professional valuers. For Comparative Method and AVMs in operation:
  • Incorporating advert text and photographs substantially improves accuracy, especially in subsets with missing structural information.
  • Multiscale spatial indicators (buffers, times) add value but need attention to MAUP and nonlinearities.
  • Providing 95% PI and local explanations (SHAP cards) increases usability for professional reports, in line with RICS/IVS requirements for documenting uncertainty and inputs (RICS, 2025 [15]; IVSC, 2025 [22]).
Implications for mass appraisal providers. In mass appraisal scenarios, models provide better COD/PRD/PRB and more uniform errors per zone/category, provided they are accompanied by ratio studies and monitoring of PI coverage per subgroup. IAAO alignment and QCS rules (US) indicate clear bias testing, monitoring, and recalibration plans to maintain operational reliability (IAAO, 2025 [50]; CFPB et al., 2024 [19]).
Operational adoption recommendations. (i) GBM with late-fusion as default, (ii) quantile training + split-conformal for PI, (iii) Datasheets/Model Cards and audit trail in each release, (iv) shadow/canary releases with SLOs (MAE, COD, PI-coverage), (v) drift dashboards (PSI/KS) and scheduled recalibrations. This practice maximizes reliability without sacrificing transparency/interpretability.
Interpretive caution. The findings are predictive, not causal; the explanations (SHAP/ALE) are descriptive, consistent with interpretability best practices, and must be read within the constraints of sample selection and potential endogeneity of spatial variables (Rosen, 1974 [2]; Anselin, 1988 [107]).

6.2. Limitations and External Validity

Measurement and Selection Biases. Critical attributes (e.g., sq m, RenovationStatus) may contain measurement error that dampens effects or creates curvilinear biases; dictionary/double-coding checks and robust transformations reduce but do not eliminate the risk (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006) [65]. Furthermore, advertisements are not a random sample of the transaction population, so there is a risk of sample selection (Heckman, 1979) [64].
Transferability and drift. External validity depends on the stability of mechanisms; changes in preferences, energy prices, or policies can alter input–price relationships (concept/covariate drift). OOT/OOA assessments improve the picture, but transfer to a new market remains conditional; transportability checks and possible reweighting are required (Pearl & Bareinboim, 2014 [132]; Gama, Žliobaitė, Bifet, Pechenizkiy, & Bouchachia, 2014 [67]).
Spatial endogeneity and spurious correlations. Proximity to public transport/POIs can be endogenous (infrastructure already selects precise locations). Although we use spatial constants, SAR/SEM, and ratio studies, spurious correlation remains possible without physical experiments or instruments (Anselin, 1988) [107].
Non-causal interpretation. Models are predictive; SHAP/PDP/ALE explanations are descriptive and not causal. Additional assumptions/designs are required for marginal willingness-to-pay inferences (Rosen, 1974) [2].
Operational mitigations. We recommend (i) ratio studies and bias testing by zone/sub-group, (ii) coverage tracking of PIs and recalibration when coverage drops, (iii) drift dashboards with PSI/KS, (iv) re-weighting/domain adaptation in market transfer, and (v) clear statements of purpose/boundaries in reports (alignment with RICS/IVS).
The representativeness of the sample remains limited, as the advertisements reflect the self-selection of owners and display practices that deviate from the actual transaction population. Despite cleaning, heterogeneity in the wording of descriptions and in photographic representation introduces a residual measurement error that interacts with spatial patterns. Moreover, beyond accessibility, local dynamics (innovations, development expectations) can reinforce spatial endogeneity and spurious associations that cannot be disentangled without quasi-experimental designs. Finally, the model’s transferability depends on long-term changes and institutional differences across markets, requiring stability checks and adjustments before generalization.

6.3. Model Interpretability

For greater interpretability and evaluation, indicative visualizations were added to represent the analysis’s key findings, without disclosing data for privacy reasons. Figure 4 shows the Indicative SHAP summary plot, which shows the relative influence of key characteristics on the estimation of price per square meter in a multimodal context. Figure 5 shows the indicative calibration curve, which illustrates the relationship between nominal and empirical coverage of prediction intervals, as tested in the proposed framework. These figures summarize the interpretability and uncertainty control methodology applied in the proposed model, without exposing sensitive or non-publicly available data.

6.4. Future Research

Future extensions of this work can explore causal inference methodologies, with the aim of distinguishing to a greater extent the causal effect of individual characteristics (e.g., energy efficiency, renovation interventions, neighborhood quality) from simple correlations found in the data. In addition, the use of transfer learning techniques could enable knowledge transfer from the Thessaloniki market to other cities or countries, reducing training data requirements and improving the models’ generalizability. The combined use of causal inference and transfer learning, combined with more time-rich panel data, can lead to valuation frameworks that are not only accurate but also capable of capturing the dynamic evolution of real estate markets and the effects of policy interventions or regulatory changes. In practice, the study of the above can be organized into individual thematic axes:
(a)
Causal identification and heterogeneity of effects. Beyond predictive accuracy, designs for causal estimation of “marginal willingness to pay” (e.g., for accessibility, energy upgrades) are needed. Priorities: double/debiased ML for partial effects under high dimensionality and nonlinearity, causal forests for heterogeneity, and staggered DiD for infrastructure/policy interventions. (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, & Robins, 2018 [139]; Wager & Athey, 2018 [140]; Callaway & Sant’Anna, 2021 [141]).
(b)
Transferability to new markets. Systematic study of domain adaptation/transfer learning (from mature to sparse markets) with covariate shift corrections, representative sampling, and hierarchical/multilevel models that allow for common information but also local deviations. (Ben-David, Blitzer, Crammer, & Pereira, 2010 [142]; Pan & Yang, 2010 [143]).
(c)
Multimodal fundamental models. Investigation of self-/weakly supervised multimodal architectures (text + image + space) that learn robust embeddings from large markets and then adapt to smaller samples (few-shot). Attention to documentation and avoidance of leakage from spatial patterns.
(d)
Energy efficiency and “performance gap”. Deeper external validation of EPC-proxy against official EPCs and actual consumptions, with measurement error modeling and couplings with causal designs for renovation impacts. (Sunikka-Blank & Galvin, 2012) [34].
(e)
Adaptive uncertainty in real time. Online conformal and adaptive calibration for stable coverage in changing regimes, with local similarity weights and drift-aware interval width updating. (Angelopoulos & Bates, 2022) [144].
(f)
Fair/responsible assessment. Integrate fairness analyses (error parity, conditional coverage) and counterfactual fairness techniques in subgroups/zones; documented with Model Cards/Datasheets and open replication protocols.
(g)
Synthetic data and privacy. Securely deploy synthetic arrays/images that preserve association structure for independent validation, with similarity/privacy metrics and stress tests of influence on inferences.

7. Conclusions

The results of the study show that the integration of multimodal features, combining structured advertisement data, NLP, CV, and multi-scale spatial accessibility indicators, substantially enhances the accuracy, consistency, and transparency of automated valuation models. The proposed multimodal AVM consistently outperformed the linear and hedonic reference models in OOT/OOA assessments, achieving a significant reduction in MAE, better interval scores, and well-calibrated prediction intervals after conformal adjustment. The SHAP/ALE decompositions demonstrated consistent and theoretically aligned patterns: a diminishing marginal benefit of area, an age penalty offset through full renovation, non-monotonic accessibility effects, and a stable energy premium via the EPC-proxy, providing substantial operational transparency without causal claims.
At a professional level, the findings demonstrate that AI-enhanced models can complement traditional comparative and hedonic methods, in particular in cases of incomplete information, providing improved documentation, uncertainty assessment, and alignment with RICS/IVS standards. In the context of mass valuation, the improvement in COD/PRD/PRB indicators and the ability to systematically monitor coverage and bias by geographic or socioeconomic group enhances the fairness and transparency of the processes, in accordance with the IAAO/QCS guidelines. Additionally, adopting key MLOps practices—such as ratio studies, drift monitoring, Model Cards-style documentation, shadow/canary releases, and scheduled recalibration—ensures reproducibility and secure operational implementation.
Existing constraints such as measurement error, sample selection, potential spatial endogeneity, and temporal market drift are reduced but not completely eliminated, underlining the need for continuous monitoring and regular retraining of the models. Future research should therefore consider more rigorous causal inference designs, transfer learning techniques for transferability across markets, and adaptive uncertainty methods. Overall, the proposed framework demonstrates that the controlled integration of AI in property valuation is not only a technological innovation but a practical tool for improving accuracy, transparency, and accountability in real-world decision-making systems.
The significant gains in accuracy from multimodal features, the operational strengthening of the comparative method, and the added value of the interpretation tools carry direct implications for the real estate market and for professionals, and are achievable with current digital technology and tools. The study also highlights the prospects for applying the framework to other property categories and outlines directions for further research, reinforcing the contribution of the work to the international debate on modern, transparent, and efficient valuation systems.

Author Contributions

All authors contributed equally to the conception, preparation, and writing of this article. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data presented in this study were obtained through a literature review and from openly accessible real estate listing information, including textual descriptions, images, and publicly available spatial indicators.

Acknowledgments

The authors take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
NLP: Natural Language Processing
AVM: Automated Valuation Model
MWTP: Marginal Willingness To Pay
IAAO: International Association of Assessing Officers
RICS: Royal Institution of Chartered Surveyors
ML: Machine Learning
DL: Deep Learning
RF: Random Forest
GBDT/GBM: Gradient-Boosted Decision Trees/Gradient Boosting Machine
QCS: Quality Control Standards
TOD: Transit-Oriented Development
EPC/BER: Energy Performance Certificate/Building Energy Rating
EPC-proxy: Proxy Energy Performance Certificate Indicator
NER: Named Entity Recognition
BERT/mBERT/XLM-R: Bidirectional Encoder Representations from Transformers (and multilingual variants)
GIS: Geographic Information Systems
COD: Coefficient of Dispersion
PRD: Price-Related Differential
PRB: Price-Related Bias
SHAP: SHapley Additive exPlanations
ALE: Accumulated Local Effects
LIME: Local Interpretable Model-Agnostic Explanations
PDP/ALE/ICE: Partial Dependence/Accumulated Local Effects/Individual Conditional Expectation
Grad-CAM: Gradient-Weighted Class Activation Mapping
CV: Computer Vision
OOT/OOA: Out-of-Time/Out-of-Area validation
PSI/KS: Population Stability Index/Kolmogorov–Smirnov statistic
SLA/SLO: Service-Level Agreement/Service-Level Objective
POI: Point of Interest
GTFS: General Transit Feed Specification
MAUP: Modifiable Areal Unit Problem
OSM: OpenStreetMap
GDPR: General Data Protection Regulation
DPIA: Data Protection Impact Assessment
EXIF: Exchangeable Image File Format
TF-IDF: Term Frequency–Inverse Document Frequency
SSIM: Structural Similarity Index Measure
WGS84: World Geodetic System 1984
P/R/F1: Precision/Recall/F1-score
CNN/ResNet: Convolutional Neural Network/Residual Network
AUC/AUROC: Area Under the ROC Curve
OD-matrix: Origin–Destination Matrix
2SFCA: Two-Step Floating Catchment Area
CRS: Coordinate Reference System
MLE: Maximum Likelihood Estimation
IQR/MAD: Interquartile Range/Median Absolute Deviation
MSE/RMSE/MAE/MAPE: Mean Squared Error/Root Mean Squared Error/Mean Absolute Error/Mean Absolute Percentage Error (standard regression accuracy metrics)
CI/PI: Confidence Interval/Prediction Interval
FDR: False Discovery Rate
PIT: Probability Integral Transform
NGBoost: Natural Gradient Boosting
LOWESS: Locally Weighted Scatterplot Smoothing
DiD: Difference-in-Differences
QA/VV: Quality Assurance/Validation & Verification
RFCs: Random Forest Classifiers (or Requests for Comments, depending on context)
MLOps/ETL/CI/CD: Machine Learning Operations/Extract–Transform–Load/Continuous Integration–Continuous Deployment
PII: Personally Identifiable Information

Appendix A

Appendix A.1

Table A1. Detailed Model Procedures.
Stage | Procedures (From Main Text) | Source Sections
Data ingestion and quality control | Provenance logging (URL hash, timestamp, parser version); schema/range/uniqueness checks before each run; GDPR/ToS compliance, pseudonymization, EXIF removal | Section 3.2 and Section 4.7
Feature engineering | Unified definitions in feature store (structured, NLP, CV, spatial); offline/online feature equivalence; version tags per feature set | Section 4.7 (Feature Store)
Model training | Nested CV with random search + Bayesian optimization; deterministic-by-seed pipelines; MLflow environment snapshots | Section 4.1, Section 4.2, Section 4.3, Section 4.4 and Section 4.7 (Training layer)
Validation | OOT/OOA splits (temporal/spatial hold-outs); ratio studies COD/PRD/PRB (IAAO); error analysis per subgroup; Diebold–Mariano tests | Section 3.3.5, Section 4.6 (Governance) and Section 4.7 (Validation)
Uncertainty quantification | Quantile regression (0.05/0.95); split-conformal calibration; isotonic calibration of quantiles; PI coverage and interval scores (CRPS) | Section 3.3.5 and Section 4.6 (Calibration/PI)
Bias and fairness checks | Subgroup error parity, conditional coverage; bias/stability dashboards; mitigation (reweighting, monotonic constraints, retraining) | Section 4.6 and Section 4.7 (Bias tests, QCS)
Explainability | SHAP/PDP/ALE; Grad-CAM for CV cues; local explanation card per prediction | Section 3.3.5 and Section 4.7 (Explainability)
Model registry | Versioned models with metadata (data window, OOT/OOA metrics, PI coverage, bias tests); approval gates | Section 4.7 (Model Registry)
Deployment | Shadow → canary/blue–green releases; SLO thresholds (MAE, COD, PI coverage); automatic rollback | Section 4.7 (Deployment)
Monitoring | Drift (PSI/KS), EPC-proxy monitoring; rolling PI coverage; stability of global SHAP over time | Section 4.7 (Monitoring/Drift)
Auditability and reproducibility | Model-version and feature-set IDs; timestamps and data snapshot IDs; full audit trail (seeds, configs, libraries) | Section 4.7 (Auditability)

Appendix A.2

Table A2. Technical Components Supporting Full Reproducibility.
Artifact Category | Description | Representative Code Snippet (Depersonalized)

Configuration Files: Contain preprocessing rules (transformations, Winsorization thresholds), missingness policies, seed values, hyperparameters, and metadata of OOT/OOA splits. All operations are applied out-of-fold.
Snippet (YAML):
preprocess:
  winsor: [0.01, 0.99]
  missing: median_plus_indicator
splits:
  type: oot_ooa
  time_block: quarter
model:
  algo: gbm
  tuning: random + bayes

Feature-Set Definitions: Versioned structured, NLP, CV, and spatial feature sets accompanied by feature hashes to ensure exact reconstruction of the feature store.
Snippet (YAML):
features:
  structured: v5
  nlp: v4
  cv: v3
  spatial: v2

Splits Metadata (OOT/OOA): Documentation of temporal and spatial block splits, buffers, and grouped property/building splits, ensuring no overlap or leakage.
Snippet (Python):
splits = make_splits(df, method="oot_ooa", spatial="zone")

Hyperparameter Dictionaries: Final hyperparameters per model (GBM, RF, NN) obtained through nested tuning using random search + Bayesian optimization.
Snippet (YAML):
hyperparams:
  learning_rate: 0.05
  max_depth: 6
  min_leaf: 20

Model Cards/Datasheets: Provide data window, input features, MAE/RMSE/MAPE, COD/PRD/PRB, PI coverage (90/95%), drift/bias tests, and version metadata. Document-type artifact; no code snippet required.

Logged Artifacts: Seeds, library versions, feature hashes, MLflow runs, hyperparameter configs, and environment snapshots allow one-to-one replication.
Snippet (Python):
mlflow.log_params(hparams)
mlflow.log_artifacts(config_path)

Explainability Packages: Depersonalized SHAP/ALE/PDP summaries and Grad-CAM visual checks for visual features.
Snippet (Python):
shap_vals = shap_calc(model, X_sample)

Synthetic/Depersonalized Samples: Only statistical structures or synthetic samples preserving correlations are shared; no raw media or identifiable information.
Snippet (Python):
synth = generate_synthetic(df, keep_correlations=True)

Appendix A.3

Table A3. Summary Table of Reproducibility Artifacts with Code Snippets.
Section | Description | Representative Code Snippet (Depersonalized)

Y.1. Preprocessing: Out-of-fold preprocessing and feature transformation, applied without information leakage.
Snippet (Python):
X = preprocess(df, out_of_fold=True)

Y.2. OOT/OOA Splits: Construction of temporal and spatial block splits ensuring no overlap or leakage between train/validation/test sets.
Snippet (Python):
splits = make_splits(df, method="oot_ooa", spatial="zone")

Y.3. Nested Training: Model training with nested hyperparameter tuning (random + Bayesian search) using only training folds.
Snippet (Python):
model = nested_cv(X, y, method="random + bayes")

Y.4. Prediction Intervals: Generation of prediction intervals using quantile-based or calibrated approaches.
Snippet (Python):
pi = predict_intervals(model, X_test)

Y.5. Logging Artifacts: Logging of model outputs, configuration files and metadata to ensure full reproducibility.
Snippet (Python):
log_run(model, config="config_v3.yaml")
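As one concrete, hedged reading of the Y.2 pseudocode above, the sketch below holds out the latest time block as the out-of-time (OOT) test set and an entire zone within the remaining window for out-of-area (OOA) checks. The field names `quarter` and `zone` and the record layout are illustrative, not the paper's schema.

```python
# Illustrative OOT/OOA split on a list of listing records (dicts).
def oot_ooa_split(rows, test_quarter, holdout_zone):
    train, oot_test, ooa_test = [], [], []
    for r in rows:
        if r["quarter"] == test_quarter:
            oot_test.append(r)    # later period: never seen in training
        elif r["zone"] == holdout_zone:
            ooa_test.append(r)    # held-out zone within the train window
        else:
            train.append(r)
    return train, oot_test, ooa_test

rows = [
    {"id": 1, "quarter": "2024Q1", "zone": "A"},
    {"id": 2, "quarter": "2024Q1", "zone": "B"},
    {"id": 3, "quarter": "2024Q2", "zone": "A"},
]
train, oot, ooa = oot_ooa_split(rows, "2024Q2", "B")
```

Production variants would additionally apply spatial buffers around the held-out zone and group splits by building to rule out near-duplicate leakage.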

References

  1. Lancaster, K. A new approach to consumer theory. Am. Econ. Rev. 1966, 56, 133–157. [Google Scholar]
  2. Rosen, S. Hedonic prices and implicit markets: Product differentiation in pure competition. J. Political Econ. 1974, 82, 34–55. [Google Scholar] [CrossRef]
  3. Ekeland, I.; Heckman, J.J.; Nesheim, L. Identification and estimation of hedonic models. J. Political Econ. 2004, 112, S60–S109. [Google Scholar] [CrossRef]
  4. Bajari, P.; Benkard, C.L. Demand estimation with heterogeneous consumers and unobserved product characteristics: A hedonic approach. J. Political Econ. 2005, 113, 1239–1276. [Google Scholar] [CrossRef]
  5. Cropper, M.L.; Deck, L.B.; McConnell, K.E. On the choice of functional form for hedonic price functions. Rev. Econ. Stat. 1988, 70, 668–675. [Google Scholar] [CrossRef]
  6. Halvorsen, R.; Palmquist, R. The interpretation of dummy variables in semilogarithmic equations. Am. Econ. Rev. 1980, 70, 474–475. [Google Scholar]
  7. Kuminoff, N.V.; Parmeter, C.F.; Pope, J.C. Which hedonic models can we trust to recover the marginal willingness to pay for environmental amenities? J. Environ. Econ. Manag. 2010, 60, 145–160. [Google Scholar] [CrossRef]
  8. Palmquist, R.B. Property value models. In Handbook of Environmental Economics; Mäler, K.-G., Vincent, J.R., Eds.; Elsevier: Amsterdam, The Netherlands, 2006; Volume 2, pp. 763–819. [Google Scholar] [CrossRef]
  9. Freeman, A.M., III; Herriges, J.A.; Kling, C.L. The Measurement of Environmental and Resource Values: Theory and Methods, 3rd ed.; RFF Press: Washington, DC, USA; Routledge: Oxfordshire, UK, 2014. [Google Scholar]
  10. Pagourtzi, E.; Assimakopoulos, V.; French, N.; Wyatt, P. Real estate appraisal: A review of valuation methods. J. Prop. Investig. Financ. 2003, 21, 383–401. [Google Scholar] [CrossRef]
  11. Appraisal Institute. The Appraisal of Real Estate, 15th ed.; Appraisal Institute: Chicago, IL, USA, 2020; Available online: https://www.appraisalinstitute.org/insights-and-resources/resources/books/the-appraisal-of-real-estate-15th-edition (accessed on 15 October 2025).
  12. International Association of Assessing Officers (IAAO). Standard on Verification and Adjustment of Sales; IAAO: Kansas City, MO, USA, 2020; Available online: https://www.iaao.org/wp-content/uploads/Standard_on_Verification_Adjustment_of_Sales.pdf (accessed on 15 October 2025).
  13. International Association of Assessing Officers (IAAO). Standard on Mass Appraisal of Real Property; IAAO: Kansas City, MO, USA, 2021; Available online: https://www.iaao.org/wp-content/uploads/StandardOnMassAppraisal.pdf (accessed on 15 October 2025).
  14. International Association of Assessing Officers (IAAO). Standard on Automated Valuation Models (AVMs); IAAO: Kansas City, MO, USA, 2018; Available online: https://www.iaao.org/wp-content/uploads/Standard_on_Automated_Valuation_Models.pdf (accessed on 15 October 2025).
  15. Royal Institution of Chartered Surveyors (RICS). RICS Valuation—Global Standards (Red Book). Available online: https://www.rics.org/profession-standards/rics-standards-and-guidance/sector-standards/valuation-standards/red-book (accessed on 15 October 2025).
  16. Jafary, P.; Shojaei, D.; Rajabifard, A.; Ngo, T. Automated land valuation models: A comparative study of machine learning and deep learning techniques. Cities 2024, 145, 105056. [Google Scholar] [CrossRef]
  17. Moreno-Foronda, I.; Sánchez-Martínez, M.-T.; Pareja-Eastaway, M. Comparative analysis of advanced models for predicting real estate prices: A systematic review. Urban Sci. 2025, 9, 32. [Google Scholar] [CrossRef]
  18. Tapia, J.; Chavez-Garzon, N.; Pezoa, R.; Suarez-Aldunate, P.; Pilleux, M. Comparing automated valuation models for real estate assessment in the Santiago Metropolitan Region: A study on machine learning algorithms and hedonic pricing with spatial adjustments. PLoS ONE 2025, 20, e0318701. [Google Scholar] [CrossRef]
  19. Consumer Financial Protection Bureau (CFPB); Office of the Comptroller of the Currency (OCC); Board of Governors of the Federal Reserve System (FRB); Federal Deposit Insurance Corporation (FDIC); National Credit Union Administration (NCUA); Federal Housing Finance Agency (FHFA). Quality Control Standards for Automated Valuation Models (Final Rule). 2024. Available online: https://www.consumerfinance.gov/rules-policy/final-rules/quality-control-standards-for-automated-valuation-models/ (accessed on 15 October 2025).
  20. Federal Register. Quality Control Standards for Automated Valuation Models. 2024. Available online: https://www.federalregister.gov/documents/2024/08/07/2024-16197/quality-control-standards-for-automated-valuation-models (accessed on 15 October 2025).
  21. Federal Reserve. Agencies Issue Final Rule to Help Ensure Credibility and Integrity of Automated Valuation Models. 2024. Available online: https://www.federalreserve.gov/newsevents/pressreleases/bcreg20240717a.htm (accessed on 15 October 2025).
  22. International Valuation Standards Council. International Valuation Standards (IVS); IVSC: London, UK, 2025; Available online: https://ivsc.org/new-edition-of-the-international-valuation-standards-ivs-published/ (accessed on 15 October 2025).
  23. Sirmans, G.S.; Macpherson, D.A.; Zietz, E.N. The value of housing characteristics: A meta analysis. J. Real Estate Financ. Econ. 2006, 33, 215–240. [Google Scholar] [CrossRef]
  24. Debrezion, G.; Pels, E.; Rietveld, P. The impact of railway stations on residential and commercial property value: A meta-analysis. J. Real Estate Financ. Econ. 2007, 35, 161–180. [Google Scholar] [CrossRef]
  25. Mohammad, S.I.; Graham, D.J.; Melo, P.C.; Anderson, R.J. A meta-analysis of the impact of rail projects on land and property values. Transp. Res. Part A Policy Pract. 2013, 50, 158–170. [Google Scholar] [CrossRef]
  26. Rennert, L. A meta-analysis of the impact of rail stations on property values. Transp. Res. Part A Policy Pract. 2022, 161, 57–86. [Google Scholar] [CrossRef]
  27. Gibbons, S.; Machin, S. Valuing rail access using transport innovations. J. Urban Econ. 2005, 57, 148–169. [Google Scholar] [CrossRef]
  28. Rojas, A. Train stations’ impact on housing prices: Direct and indirect effects. Transp. Res. Part A Policy Pract. 2024, 183, 103709. [Google Scholar] [CrossRef]
  29. Hyland, M.; Lyons, R.C.; Lyons, S. The value of domestic building energy efficiency: Evidence from Ireland. Energy Econ. 2013, 40, 943–952. [Google Scholar] [CrossRef]
  30. Brounen, D.; Kok, N. On the economics of energy labels in the housing market. J. Environ. Econ. Manag. 2011, 62, 166–179. [Google Scholar] [CrossRef]
  31. Fuerst, F.; McAllister, P.; Nanda, A.; Wyatt, P. Does energy efficiency matter to home-buyers? An investigation of EPC ratings and transaction prices in England. Energy Econ. 2015, 48, 145–156. [Google Scholar] [CrossRef]
  32. Fuerst, F.; McAllister, P.; Nanda, A.; Wyatt, P. Energy performance ratings and house prices in Wales. Energy Policy 2016, 92, 20–33. [Google Scholar] [CrossRef]
  33. Céspedes-López, M.F.; Rubio-Bellido, C.; Muñoz-González, C.M. Meta-analysis of price premiums in housing with energy performance certificates (EPC). Sustainability 2019, 11, 6303. [Google Scholar] [CrossRef]
  34. Sunikka-Blank, M.; Galvin, R. Introducing the prebound effect: The gap between performance and actual energy consumption. Build. Res. Inf. 2012, 40, 260–273. [Google Scholar] [CrossRef]
  35. Galvin, R. Quantification of (p)rebound effects in retrofit policies: The performance gap revisited. Energy 2016, 107, 47–58. [Google Scholar] [CrossRef]
  36. Ruggieri, G.; Maduta, C.; Melica, G. Progress on the Implementation of Energy Performance Certificates (EPCs) Across the EU; Joint Research Centre, European Commission: Brussels, Belgium, 2024; Available online: https://publications.jrc.ec.europa.eu/repository/handle/JRC135473 (accessed on 15 October 2025).
  37. Sesana, M.M.; Salvalai, G.; Della Valle, N.; Melica, G.; Bertoldi, P. Towards harmonising energy performance certificate methodologies across Europe. Energy Rep. 2024, 10, 11906–11920. [Google Scholar] [CrossRef]
  38. Shen, L.; Ross, S.L. Information value of property description: A machine learning approach. J. Urban Econ. 2021, 121, 103299. [Google Scholar] [CrossRef]
  39. Zhang, H.; Campoverde, D.; Avelar, J.; Lim, K. Describe the house and I will tell you the price: House price prediction with textual description data. Nat. Lang. Eng. 2024, 30, 661–695. [Google Scholar] [CrossRef]
  40. Bottero, M.; Greco, S.; Vernero, F. Geo-NLP insights: Unveiling residential real estate drivers through text and spatial data integration. In Advances in Human Factors, Business Management and Leadership; Springer: Berlin/Heidelberg, Germany, 2024; pp. 139–149. [Google Scholar] [CrossRef]
  41. Keraghel, I.; Morbieu, S.; Nadif, M. Recent Advances in Named Entity Recognition: A Comprehensive Survey and Comparative Study. arXiv 2024, arXiv:2401.10825. [Google Scholar]
  42. Poursaeed, O.; Matera, T.; Belongie, S. Vision-based real estate price estimation. Mach. Vis. Appl. 2018, 29, 667–676. [Google Scholar] [CrossRef]
  43. Law, S.; Paige, B.; Russell, C. Take a look around: Using Street View and satellite images to estimate house prices. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19. [Google Scholar] [CrossRef]
  44. You, Q.; Pang, R.; Luo, J. Image-based appraisal of real estate properties. IEEE Trans. Multimed. 2016, 19, 2751–2759. [Google Scholar] [CrossRef]
  45. Chen, M.; Liu, Y.; Arribas-Bel, D.; Singleton, A. Assessing the value of user-generated images of urban surroundings for house price estimation. Landsc. Urban Plan. 2022, 226, 104486. [Google Scholar] [CrossRef]
  46. Chahal, B.K. Using Deep Learning to Infer House Prices from Street View, Satellite and Aerial Imagery. Doctoral Dissertation, University of Warwick, Coventry, UK, 2022. Available online: https://wrap.warwick.ac.uk/id/eprint/177026/ (accessed on 15 October 2025).
  47. Baur, K. Automated real estate valuation with machine learning models using property descriptions. Expert Syst. Appl. 2023, 213, 119147. [Google Scholar] [CrossRef]
  48. Meszaros, J. A Brief Review of House Price Forecasting Methods. Real Estate Issues (Couns. Real Estate). 2024. Available online: https://cre.org/real-estate-issues/a-brief-review-of-house-price-forecasting-methods/ (accessed on 15 October 2025).
  49. Ecker, M.D. Cross-validation techniques for resampling housing sales. J. Prop. Tax Assess. Adm. 2022, 19, 29–44. [Google Scholar] [CrossRef]
  50. International Association of Assessing Officers (IAAO). Standard on Ratio Studies (Exposure Draft); IAAO: Kansas City, MO, USA, 2025; Available online: https://www.iaao.org/wp-content/uploads/2025_Ratio_Studies_Exposure_Draft.pdf (accessed on 15 October 2025).
  51. Yakima County, W.A. 2026 Value Models (Public AVM Performance Dashboard). 2025. Available online: https://www.yakimacounty.us/3041/2026-Value-Models (accessed on 15 October 2025).
  52. Krause, A.; Martin, A.; Fix, M. Uncertainty in Automated Valuation Models (Working Paper). 2019. Available online: https://www.andykrause.com/files/krause_etal_avmunc.pdf (accessed on 15 October 2025).
  53. Pollestad, A.J. Towards a better uncertainty quantification in AVMs. J. Real Estate Financ. Econ. 2024. [Google Scholar] [CrossRef]
  54. Levi, D.; Gispan, L.; Giladi, N.; Fetaya, E. Evaluating and calibrating uncertainty prediction in regression tasks. Sensors 2022, 22, 5540. [Google Scholar] [CrossRef] [PubMed]
  55. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  56. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67. [Google Scholar] [CrossRef]
  57. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD Conference, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  58. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  59. Goldstein, A.; Kapelner, A.; Bleich, J.; Pitkin, E. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat. 2015, 24, 44–65. [Google Scholar] [CrossRef]
  60. Apley, D.W.; Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc. Ser. B 2020, 82, 1059–1086. [Google Scholar] [CrossRef]
  61. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  62. Molnar, C. Interpretable Machine Learning, 2nd ed.; Leanpub: Victoria, BC, Canada, 2022. [Google Scholar]
  63. Lipton, Z.C. The mythos of model interpretability: In search of the science of machine learning explanation. ACM Queue 2016, 16, 30–57. [Google Scholar]
  64. Heckman, J.J. Sample selection bias as a specification error. Econometrica 1979, 47, 153–161. [Google Scholar] [CrossRef]
  65. Carroll, R.J.; Ruppert, D.; Stefanski, L.A.; Crainiceanu, C. Measurement Error in Nonlinear Models, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006. [Google Scholar]
  66. Zandbergen, P.A. A comparison of address point, parcel and street geocoding techniques. Comput. Environ. Urban Syst. 2008, 32, 214–232. [Google Scholar] [CrossRef]
  67. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. 2014, 46, 44. [Google Scholar] [CrossRef]
  68. Lu, J.; Liu, A.; Dong, F.; Guo, Y.; Zhang, G. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 2018, 31, 2346–2363. [Google Scholar] [CrossRef]
  69. Hansen, W.G. How accessibility shapes land use. J. Am. Inst. Plan. 1959, 25, 73–76. [Google Scholar] [CrossRef]
  70. Geurs, K.T.; van Wee, B. Accessibility evaluation of land-use and transport strategies: Review and research directions. J. Transp. Geogr. 2004, 12, 127–140. [Google Scholar] [CrossRef]
  71. El-Geneidy, A.; Levinson, D. Access to destinations: Development of accessibility measures. In Minnesota Department of Transportation Report; University of Minnesota: Minneapolis, MN, USA, 2006. [Google Scholar]
  72. Duncan, M. The impact of transit-oriented development on housing prices in San Diego, CA. Urban Stud. 2011, 48, 101–127. [Google Scholar] [CrossRef] [PubMed]
  73. Ewing, R.; Cervero, R. Travel and the built environment: A meta-analysis. J. Am. Plan. Assoc. 2010, 76, 265–294. [Google Scholar] [CrossRef]
  74. Crompton, J.L. The impact of parks on property values: A review of the empirical evidence. J. Leis. Res. 2001, 33, 1–31. [Google Scholar] [CrossRef]
  75. European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council (General Data Protection Regulation—GDPR); European Union: Brussels, Belgium, 2016. [Google Scholar]
  76. Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92. [Google Scholar] [CrossRef]
  77. Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, Georgia, 29–31 January 2019; pp. 220–229. [Google Scholar] [CrossRef]
  78. Fellegi, I.P.; Sunter, A.B. A theory for record linkage. J. Am. Stat. Assoc. 1969, 64, 1183–1210. [Google Scholar] [CrossRef]
  79. Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  80. Jaro, M.A. Advances in record-linkage methodology as applied to the 1985 census of Tampa. J. Am. Stat. Assoc. 1989, 84, 414–420. [Google Scholar] [CrossRef]
  81. Winkler, W.E. Overview of record linkage and current research directions. In U.S. Census Bureau Research Report; U.S. Census Bureau: Suitland, MD, USA, 2006. [Google Scholar]
  82. Zauner, C. Implementation and Benchmarking of Perceptual Image Hash Functions. Master’s Thesis, University of Applied Sciences Hagenberg, Mühlkreis, Austria, 2010. [Google Scholar]
  83. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  84. Chapman, W.W.; Bridewell, W.; Hanbury, P.; Cooper, G.F.; Buchanan, B.G. A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inform. 2001, 34, 301–310. [Google Scholar] [CrossRef]
  85. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association FOR Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  86. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 8440–8451. [Google Scholar]
  87. Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, Munich, Germany, 28 August–1 September 2017; Volume 11, pp. 269–282. [Google Scholar]
  88. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef] [PubMed]
  89. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  90. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, Virtual, 3–7 May 2021. [Google Scholar]
  91. Luo, W.; Wang, F. Measures of spatial accessibility to health care in a GIS environment: Synthesis and a case study in the Chicago region. Environ. Plan. B Plan. Des. 2003, 30, 865–884. [Google Scholar] [CrossRef]
  92. Box, G.E.P.; Cox, D.R. An analysis of transformations. J. R. Stat. Soc. Ser. B 1964, 26, 211–252. [Google Scholar] [CrossRef]
  93. Yeo, I.K.; Johnson, R.A. A new family of power transformations to improve normality or symmetry. Biometrika 2000, 87, 954–959. [Google Scholar] [CrossRef]
  94. Iglewicz, B.; Hoaglin, D.C. How to Detect and Handle Outliers; ASQ Quality Press: Milwaukee, WI, USA, 1993. [Google Scholar]
  95. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
  96. Koenker, R.; Bassett, G. Regression quantiles. Econometrica 1978, 46, 33–50. [Google Scholar] [CrossRef]
  97. Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26. [Google Scholar] [CrossRef]
  98. Kennedy, P.E. Estimation with correctly interpreted dummy variables in semilogarithmic equations. Am. Econ. Rev. 1981, 71, 801. [Google Scholar]
  99. Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W.; et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017, 40, 913–929. [Google Scholar] [CrossRef]
  100. Tashman, L.J. Out-of-sample tests of forecasting accuracy: An analysis and review. Int. J. Forecast. 2000, 16, 437–450. [Google Scholar] [CrossRef]
  101. Valavi, R.; Elith, J.; Lahoz-Monfort, J.J.; Guillera-Arroita, G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol. Evol. 2019, 10, 225–232. [Google Scholar] [CrossRef]
  102. Kaufman, S.; Rosset, S.; Perlich, C. Leakage in data mining: Formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 2012, 6, 15. [Google Scholar] [CrossRef]
  103. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  104. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 2012, 25, 2951–2959. [Google Scholar]
  105. Wooldridge, J.M. Econometric Analysis of Cross Section and Panel Data, 2nd ed.; MIT Press: Cambridge, MA, USA, 2010. [Google Scholar]
  106. Conley, T.G. GMM estimation with cross sectional dependence. J. Econom. 1999, 92, 1–45. [Google Scholar] [CrossRef]
  107. Anselin, L. Spatial Econometrics: Methods and Models; Kluwer Academic: Dordrecht, The Netherlands, 1988. [Google Scholar]
  108. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  109. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154. [Google Scholar]
  110. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31, 6638–6648. [Google Scholar]
  111. Arik, S.O.; Pfister, T. TabNet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; pp. 6679–6687. [Google Scholar]
112. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 2021, 34, 18932–18943.
113. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
114. Varma, S.; Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006, 7, 91.
115. Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107.
116. Diebold, F.X.; Mariano, R.S. Comparing predictive accuracy. J. Bus. Econ. Stat. 1995, 13, 253–263.
117. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 1995, 57, 289–300.
118. Dietterich, T.G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 1998, 10, 1895–1923.
119. Alvarez-Melis, D.; Jaakkola, T.S. On the robustness of interpretability methods. In Proceedings of the ICML Workshop on Human Interpretability in ML, Stockholm, Sweden, 14 July 2018.
120. Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. J. Law Technol. 2018, 31, 841–887.
121. Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
122. Duan, T.; Avati, A.; Ding, D.Y.; Liu, A.; Ng, A.Y. NGBoost: Natural gradient boosting for probabilistic prediction. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020.
123. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: Berlin/Heidelberg, Germany, 2005.
124. Barber, R.F.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. Predictive inference with the jackknife+. Ann. Stat. 2021, 49, 486–507.
125. Romano, Y.; Patterson, E.; Candès, E. Conformalized quantile regression. Adv. Neural Inf. Process. Syst. 2019, 32.
126. Kuleshov, V.; Fenner, N.; Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2796–2804.
127. Winkler, R.L. A decision-theoretic approach to interval estimation. J. Am. Stat. Assoc. 1972, 67, 187–191.
128. Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Choudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden technical debt in machine learning systems. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28, pp. 2503–2511.
129. Baylor, D.; Breck, E.; Cheng, H.T.; Fiedel, N.; Foo, C.Y.; Haque, Z.; Haykal, S.; Ispir, M.; Jain, V.; Koc, L.; et al. TFX: A TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1387–1395.
130. Paleyes, A.; Urma, R.-G.; Lawrence, N.D. Challenges in deploying machine learning: A survey of case studies. arXiv 2020, arXiv:2011.09926.
131. Humble, J.; Farley, D. Continuous Delivery; Addison-Wesley: Boston, MA, USA, 2010.
132. Pearl, J.; Bareinboim, E. External validity: From do-calculus to transportability across populations. Stat. Sci. 2014, 29, 579–595.
133. Shadish, W.R.; Cook, T.D.; Campbell, D.T. Experimental and Quasi-Experimental Designs for Generalized Causal Inference; Houghton Mifflin: Boston, MA, USA, 2002.
134. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Boston, MA, USA, 1977.
135. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; Wiley: Hoboken, NJ, USA, 2019.
136. Austin, P.C. Balance diagnostics for comparing the distribution of baseline covariates. Stat. Med. 2009, 28, 3083–3107.
137. Kutner, M.H.; Nachtsheim, C.J.; Neter, J. Applied Linear Regression Models, 4th ed.; McGraw-Hill: Columbus, OH, USA, 2004.
138. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
139. Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/debiased machine learning for treatment and structural parameters. Econom. J. 2018, 21, C1–C68.
140. Wager, S.; Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 2018, 113, 1228–1242.
141. Callaway, B.; Sant'Anna, P.H.C. Difference-in-differences with multiple time periods. J. Econom. 2021, 225, 200–230.
142. Ben-David, S.; Blitzer, J.; Crammer, K.; Pereira, F. A theory of learning from different domains. Mach. Learn. 2010, 79, 151–175.
143. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
144. Angelopoulos, A.N.; Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv 2021, arXiv:2107.07511.
Figure 1. Transition from hedonic and comparative valuation to a multimodal AI framework.
Figure 2. Integrated valuation framework.
Figure 3. Data pre-processing and documentation framework.
Figure 4. Indicative SHAP summary plot.
Figure 5. Illustrative Calibration Curve.
Table 1. Core Machine Learning components of the proposed AVM framework.
Concept | Description
Predictive models | XGBoost, LightGBM, CatBoost, TabNet, ResNet-18 (image-based architectures)
Hyperparameter tuning | Random search and Bayesian optimization
Model validation | Nested cross-validation integrating temporal and spatial stratification
Leakage prevention | Grouped splits by property; exclusion of overlapping temporal windows and EXIF metadata; strict separation of images across folds
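The validation and leakage-prevention components of Table 1 can be illustrated with a minimal sketch. This is not the authors' implementation: it uses scikit-learn's GradientBoostingRegressor as a stand-in for the boosting models listed above, fully synthetic data, and hypothetical property-ID groups. The random search is nested inside each outer grouped fold, so listings of the same property never appear on both sides of a train/test split.

```python
# Illustrative sketch of grouped nested cross-validation (not the authors' code).
# All features, prices, and property IDs below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))                                  # stand-in tabular features
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=n)   # synthetic prices
groups = rng.integers(0, 60, size=n)                         # property IDs: duplicate listings share a group

outer = GroupKFold(n_splits=5)
param_dist = {"n_estimators": [100, 200], "max_depth": [2, 3], "learning_rate": [0.05, 0.1]}

scores = []
for train_idx, test_idx in outer.split(X, y, groups):
    # Inner random search also splits by property group, so no property
    # leaks between the tuning-train and tuning-validation partitions.
    search = RandomizedSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_dist, n_iter=4, cv=GroupKFold(n_splits=3), random_state=0,
    )
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    scores.append(mean_absolute_error(y[test_idx], search.predict(X[test_idx])))
    # Leakage check: outer train and test folds share no property ID.
    assert not set(groups[train_idx]) & set(groups[test_idx])

print(f"mean outer-fold MAE: {np.mean(scores):.3f}")
```

The same pattern extends to the temporal stratification in Table 1 by replacing the inner splitter with a time-ordered one, so that tuning never sees listings posted after those it is evaluated on.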

Share and Cite

Karanikolas, N.; Kyriakidou, E.; Athanasouli, E. Artificial Intelligence and Real Estate Valuation: The Design and Implementation of a Multimodal Model. Information 2025, 16, 1049. https://doi.org/10.3390/info16121049

