Article

Multi-Dimensional AI-Based Modeling of Real Estate Investment Risk: A Regulatory and Explainable Framework for Investment Decisions

by Avraham Lalum 1,*, Lorena Caridad López del Río 2 and Nuria Ceular Villamandos 1

1 Department of Statistics, Business and Applied Economics, University of Córdoba, 14002 Córdoba, Spain
2 Department of Financial Economics and Operations Management, University of Seville, 41018 Seville, Spain
* Author to whom correspondence should be addressed.

Mathematics 2025, 13(21), 3413; https://doi.org/10.3390/math13213413
Submission received: 26 July 2025 / Revised: 26 September 2025 / Accepted: 14 October 2025 / Published: 27 October 2025

Abstract

The real estate industry, known for its complexity and exposure to systemic and idiosyncratic risks, requires increasingly sophisticated investment risk assessment tools. In this study, we present the Real Estate Construction Investment Risk (RECIR) model, a machine learning-based framework designed to quantify and manage multi-dimensional investment risks in construction projects. The model integrates diverse data sources, including macroeconomic indicators, property characteristics, market dynamics, and regulatory variables, to generate a composite risk metric called the total risk score. Unlike previous artificial intelligence (AI)-based approaches that primarily focus on forecasting prices, we incorporate regulatory compliance, forensic risk assessment, and explainable AI to provide a transparent and accountable decision support system. We train and validate the RECIR model using structured datasets such as the American Housing Survey and World Development Indicators, along with survey data from domain experts. The empirical results show the relatively high predictive accuracy of the RECIR model, particularly in highly volatile environments. Location score, legal context, and economic indicators are the dominant contributors to investment risk, which affirms the interpretability and strategic relevance of the model. By integrating AI with ethical oversight, we provide a scalable, governance-aware methodology for analyzing risks in the real estate sector.

1. Introduction

The real estate sector is shaped by interdependent forces—macroeconomic volatility, regulatory change, and asset-level characteristics—that interact in nonlinear and often unpredictable ways. Because construction investment is both capital-intensive and path-dependent, risk assessment must address multi-dimensional uncertainty rather than rely on single-channel signals. Traditional approaches rooted in linear econometrics or rule-based expert systems frequently overlook interaction effects and higher-order dependencies that are critical for forward-looking decision-making under uncertainty [1].
Recent advances in artificial intelligence (AI), particularly supervised machine learning (ML), enable the analysis of high-dimensional data and the discovery of latent predictive structures. Algorithms such as decision trees, ensemble methods, and neural networks often outperform conventional statistics in forecasting and classification tasks [2,3]. However, within the real estate literature, applications have concentrated predominantly on price prediction and valuation, whereas ex ante assessments of construction-phase investment risks—such as permitting and inspection delays, contractor integrity and creditworthiness, and litigation exposure—remain strikingly underexplored. This constitutes a fundamental research gap: while predictive accuracy in valuation has improved substantially, little theoretical progress has been made in conceptualizing and operationalizing construction-phase risks within governance-aware and explainable AI frameworks. Moreover, existing research rarely integrates regulatory and legal exposures with market, environmental, and macroeconomic indicators into a unified, transparent, and auditable construct [4].
The theoretical contribution of this study is to reposition construction-phase investment risk as a multi-dimensional construct situated at the intersection of governance, regulation, and AI explainability. Unlike prior models that focus primarily on price forecasting, this paper introduces the Real Estate Construction Investment Risk (RECIR) framework, a multi-dimensional, explainable, and regulation-aware approach designed to quantify and manage real estate construction investment risk. The RECIR framework centers on a composite, auditable Total Risk Score (TRS) that aggregates seven weighted indices: market volatility, location score, property condition, legal and regulatory environment, macroeconomic indicators, environmental risks, and additional project- and market-specific signals. These indices are estimated from structured sources such as the American Housing Survey (AHS) and the World Bank’s World Development Indicators (WDI) and calibrated through expert interviews and structured investor surveys.
Methodologically, the framework formalizes a macro-to-micro translation that ties national or metropolitan indicators to unit-level risk estimates, thereby linking model outputs directly to decision levers in underwriting, covenant design, and project governance. Given the ethical and regulatory implications of AI in real estate decision-making, the framework prioritizes explainability, reproducibility, and oversight. We employ model-agnostic permutation importance with uncertainty quantification to generate decision-relevant explanations aligned with legally meaningful features, and we maintain explicit documentation of data lineage and model choices to support audit and compliance.
Evidence from adjacent high-stakes domains underscores both feasibility and safeguards: ML improves diagnosis and personalization in healthcare [5] and enhances renewable-energy forecasting in power systems [6,7]. Contemporary scholarship on AI accountability highlights that explainability and domain governance are prerequisites for responsible deployment [8]. Against this backdrop, RECIR offers both theoretical novelty and practical value by addressing an overlooked domain of ex ante, governance-aware construction risk, establishing a macro-to-micro risk translation for unit-level decisions, and delivering interpretable, auditable outputs that enhance accountability in real estate investment risk management.

2. Literature Review

Research on artificial intelligence (AI) in real estate has expanded considerably in recent years, yet the field remains fragmented and heavily valuation oriented. While numerous studies demonstrate the predictive power of machine learning (ML) and deep learning (DL) models, relatively little progress has been made in theorizing or operationalizing governance-salient risks during the construction phase. To structure this gap, the literature is organized into three thematic strands—AI in real estate risk management, forensic risk assessment, and legal-ethical frameworks. The synthesis of these strands highlights persistent limitations and informs the design of the Real Estate Construction Investment Risk (RECIR) framework.

2.1. Integration of AI into Real Estate Risk Management

The application of AI has reshaped real estate modeling. Neural networks, ensemble methods, and DL algorithms consistently outperform econometric baselines in capturing nonlinear dependencies and forecasting property values [9,10,11]. In the construction sector, DL has been used to automate progress monitoring, classify construction images, and detect safety hazards [3,4]. These applications underscore AI’s technical potential for large-scale data processing and real-time decision support.
Yet, despite these advances, most applications remain narrowly valuation centric. Housing price prediction and asset appraisal dominate the field, while ex ante construction-phase risks are rarely operationalized as measurable indices. Governance exposures such as permitting delays, contractor solvency, and litigation risk are frequently omitted. Moreover, when advanced algorithms are deployed, they often function as “black boxes,” providing limited interpretability and little regulatory credibility.
Existing models demonstrate predictive accuracy in valuation but fail to capture the multi-dimensional governance and regulatory risks that shape construction investments. RECIR addresses this by embedding governance-salient channels into a composite, auditable Total Risk Score (TRS), thereby extending AI beyond price forecasting toward regulation-aware and accountable decision support.

2.2. AI-Driven Forensic Risk Assessment

Artificial intelligence has increasingly shaped forensic analytics in finance and real estate. Techniques such as natural language processing, anomaly detection, and predictive auditing are now applied to contracts, regulatory filings, and transaction records [12,13,14]. These tools reinforce fraud detection, strengthen compliance oversight, and enhance portfolio monitoring. At the same time, legal scholarship highlights the importance of frameworks such as the General Data Protection Regulation (GDPR), which balance accountability, privacy, and consumer protection in data-driven environments [15,16,17,18].
Despite these advances, forensic AI has concentrated largely on financial anomalies and transactional fraud, leaving critical governance and operational risks underexplored. Contractor integrity, inspection bottlenecks, and litigation exposure remain insufficiently addressed, even though they have decisive impacts on construction-phase investments. Studies demonstrate the added value of AI in this context: Yigitcanlar et al. [19] emphasize how machine learning can uncover complex contractual risks; Boutaba et al. [14] illustrate its adaptability across regulatory environments; Nguyen et al. [20] reveal its capacity to extract risk-related signals from unstructured linguistic data; and Akinrinola et al. [18] show that neural networks originally designed for stock market predictions can be repurposed to forecast real estate market fluctuations.
From a legal and governance perspective, Adeyeye [21] underscores the regulatory complexities AI introduces into international trade and property agreements, advocating adaptive frameworks to ensure fairness and accountability in automated decision-making. Haimes et al. [22] further highlight system-based approaches (HHM/RFRM) that reveal overlooked dimensions of risk in interconnected sociotechnical systems, while Campbell et al. [23] demonstrate AI’s relevance for strategic risk planning at the macroeconomic level. Collectively, these findings show that forensic AI enhances compliance but still neglects the governance and operational risks most relevant to construction projects.
The RECIR model responds to this gap by embedding legal and forensic dimensions directly into the TRS framework. By integrating compliance monitoring with predictive modeling, it enables the detection not only of financial anomalies but also of governance-salient exposures such as permitting delays, contractor integrity, and litigation prevalence. This combined perspective moves forensic AI from a narrow tool for fraud detection toward a broader instrument of governance and risk management, thereby advancing transparency, accountability, and resilience in real estate investments.

2.3. Legal and Ethical Considerations

A third stream of scholarship underscores the normative foundations governing the deployment of artificial intelligence in real estate and related sectors. Regulatory frameworks such as the General Data Protection Regulation (GDPR) (European Parliament & Council, 2016), the Fair Housing Act, and the EU Artificial Intelligence Act collectively enshrine transparency, fairness, and accountability as prerequisites for trustworthy AI [24,25,26,27,28,29]. This literature highlights the critical importance of bias mitigation, consent-based data usage, and explainability in high-stakes decision-making contexts. AI can stabilize decision environments by embedding structured human oversight, while comparative studies demonstrate significant global variation in the articulation of ethical AI principles. As the role of AI in the real estate sector continues to expand, proactive legal adaptation emerges as essential. Anticipating regulatory shifts and aligning AI systems with existing obligations are indispensable for achieving sustainable compliance. In this respect, Nannini et al. [30] emphasize the multi-dimensional intersection of AI and real estate law, advocating a holistic approach that integrates technological innovation with robust legal accountability.
Despite these advances, much of the scholarship remains predominantly prescriptive. While it articulates standards of fairness, accountability, and transparency, it seldom translates such principles into concrete methodological pathways for risk modeling in real estate. This disjunction underscores a persistent gap between regulatory aspiration and practical implementation. Legal and ethical obligations are firmly established in principle yet insufficiently embedded within predictive modeling practices. Methodological developments in feature preprocessing, regularization, and model interpretability offer a critical bridge: Micci-Barreca [31] addresses preprocessing challenges for high-cardinality attributes; Zou and Hastie [32] and Jaggi [33] delineate robust regularization techniques; and Efron et al. [34] together with Hastie et al. [35] provide foundational contributions to statistical learning. Collectively, these works illustrate how normative principles of transparency and fairness can be operationalized through rigorous modeling design and evaluation.
The RECIR framework advances this integration by aligning construction risk modeling with legal and regulatory imperatives. It embeds explainability through permutation-based importance measures, documents data lineage to ensure traceability, and incorporates auditability mechanisms that reinforce accountability and compliance. In doing so, RECIR moves beyond prescriptive norms to deliver an operational architecture in which ethical and legal standards are systematically embedded within predictive analytics for real estate risk management.

2.4. Closing Synthesis

Collectively, these three streams reveal progress but also persistent fragmentation. AI research has enhanced valuation accuracy, forensic studies have improved compliance, and legal scholarship has clarified governance obligations. Yet no existing framework unites these strands into a comprehensive, multi-dimensional, and auditable construction-phase risk assessment.
RECIR offers this synthesis. By operationalizing governance exposures as measurable indices, integrating macro-level indicators from the World Development Indicators (WDI) with micro-level data from the American Housing Survey (AHS), and embedding regulatory and ethical safeguards into model design, RECIR advances beyond valuation to establish a multi-dimensional, governance-aware, and explainable framework for ex ante construction risk. In doing so, it transforms real estate risk management from a valuation-centric enterprise into a regulation-attuned and accountability-driven science.

3. Methodology

The methodological design of this study is built upon three core pillars: (i) the selection and operationalization of risk features, (ii) the development of the Real Estate Construction Investment Risk (RECIR) framework with its composite Total Risk Score (TRS), and (iii) the integration of predictive modeling with survey- and interview-based evidence. Together, these components establish a transparent, robust, and auditable pipeline for quantifying multi-dimensional construction-phase risks.

3.1. Feature Selection and Operationalization

Feature selection followed a structured, theory-informed process. Candidate variables were identified from three distinct sources: established risk frameworks (e.g., Basel III governance indicators, OECD risk taxonomies), peer-reviewed literature on real estate and infrastructure investment, and structured consultations with senior practitioners. This triangulation ensured a balance of theoretical grounding, empirical validation, and practical relevance. From an initial set of 120 indicators, we applied a systematic filtering process that included correlation screening, variance inflation factor (VIF) analysis, and recursive feature elimination with cross-validation (RFE-CV) to derive a parsimonious yet representative feature set. The final selection comprises seven weighted indices: market volatility, location score, property condition, legal and regulatory environment, macroeconomic indicators, environmental risks, and project-specific governance signals.
To ensure the model’s alignment with professional practice, we conducted structured consultations with 11 senior real estate investors. Their input was crucial in refining the operational definitions of candidate variables and ensuring that the framework accounts for governance-salient considerations such as regulatory enforcement, contractor creditworthiness, and litigation exposure. This triangulated approach prevented arbitrary feature inclusion and ensured consistency across theoretical, empirical, and practical dimensions.
Formally, let X = {x1, x2, …, xp} denote the set of candidate predictors. The feature selection procedure identified the optimal subset X∗ that maximizes a composite objective function:
X∗ = arg max_{S ⊆ X} f(S)
The function f(S) balances three criteria: (i) theoretical grounding in established risk frameworks, (ii) empirical validation through prior studies and statistical performance, and (iii) relevance to governance and investor practice, as established via expert consultations. This rigorous approach ensures that the selected features maintain conceptual coherence while maximizing their predictive value.
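For illustration, a minimal sketch of the statistical filtering stage (correlation screening, VIF analysis, and RFE-CV) is given below, assuming a pandas DataFrame of numeric candidate predictors and a continuous target; the helper names and the base learner are placeholders, not the study's exact implementation.

```python
# Illustrative sketch of correlation screening, VIF filtering, and RFE-CV;
# thresholds and helper names are assumptions, not values from the study.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

def correlation_screen(X: pd.DataFrame, threshold: float = 0.97) -> pd.DataFrame:
    """Drop one member of each predictor pair whose |rho| exceeds the threshold."""
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

def vif_screen(X: pd.DataFrame, max_vif: float = 10.0) -> pd.DataFrame:
    """Iteratively remove the predictor with the highest VIF until all VIFs <= max_vif."""
    X = X.copy()
    while X.shape[1] > 2:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= max_vif:
            break
        X = X.drop(columns=[vifs.idxmax()])
    return X

def rfe_cv_select(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Recursive feature elimination with cross-validation on a linear base learner."""
    selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2")
    selector.fit(X, y)
    return X.loc[:, selector.support_]
```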

3.2. Mixed-Methods Design: Structured Surveys and Expert Interviews

The RECIR framework employs a mixed-methods design that integrates structured surveys with expert interviews to calibrate the seven risk indices and ensure both empirical robustness and governance relevance. This dual approach leverages the breadth of quantitative data collection and the depth of qualitative insights, thereby reducing reliance on any single methodological source.
Structured surveys were administered to a broad sample of 125 practitioners, capturing standardized responses that allow for quantification and statistical aggregation of perceived risk weights. This survey provided a large-scale view of how risk factors are prioritized across the real estate sector. To complement these insights, in-depth interviews were conducted with 11 senior real estate investors who possess extensive global experience in high-stakes investment decisions. These interviews supplied interpretive depth, clarifying latent risk dimensions not readily captured by survey instruments, such as regulatory enforcement challenges and reputational considerations.
Integration across these sources followed a methodological triangulation process, formalized as:
wj = α·wj^survey + (1 − α)·wj^expert
where wj denotes the calibrated weight for risk factor j, and α ∈ [0, 1] balances survey-based estimates against expert adjustments. Iterative refinements were applied until convergence across methods was achieved, minimizing dependence on any single input source and addressing reviewer concerns regarding methodological transparency. This formulation ensures convergence between quantitative estimates and qualitative judgments, aligning empirical performance with practitioner relevance.
The iterative integration process involved (i) consistency checks across survey and expert responses, (ii) sensitivity analyses to test the robustness of weights under varying α values, and (iii) cross-validation against historical investment outcomes. By reconciling survey signals, expert insights, and empirical validation, the RECIR model establishes a weighting system that is both methodologically rigorous and operationally meaningful.
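As a minimal illustration of this blending and the α-sensitivity check, the following sketch uses hypothetical survey-based and expert-based weights (not the published values):

```python
# Minimal sketch of the survey-expert weight blending and the alpha
# sensitivity check; the example weights and alpha grid are illustrative.
import numpy as np

def blend_weights(w_survey, w_expert, alpha):
    """w_j = alpha * w_j^survey + (1 - alpha) * w_j^expert, renormalized to sum to 1."""
    w = alpha * np.asarray(w_survey, float) + (1 - alpha) * np.asarray(w_expert, float)
    return w / w.sum()

# Hypothetical weights for the seven indices (MV, LS, PC, LR, EI, ER, ADP).
w_survey = np.array([0.17, 0.23, 0.07, 0.24, 0.10, 0.08, 0.11])
w_expert = np.array([0.19, 0.22, 0.06, 0.26, 0.10, 0.08, 0.09])

# Sensitivity of the calibrated weights to the balancing parameter alpha.
for alpha in (0.25, 0.50, 0.75):
    print(alpha, np.round(blend_weights(w_survey, w_expert, alpha), 3))
```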

3.2.1. Data Sources and Index Weighting

To ensure robustness, transparency, and reproducibility, the construction of the Total Risk Score (TRS) relied on a careful integration of two complementary public data sources: the American Housing Survey (AHS) and the World Development Indicators (WDI). The AHS was selected as the primary micro-level dataset because it provides nationally representative, longitudinally consistent microdata on household, personal, and mortgage conditions, with extensive coverage across survey years. Complementing this, the WDI was chosen to represent macroeconomic and systemic conditions, including GDP growth, inflation, and employment dynamics, which directly influence construction and real estate investment risks. Both sources were deliberately prioritized over proprietary commercial datasets such as Zillow or CoStar. While such commercial repositories often provide granular local data, they typically lack consistent temporal coverage, transparent methodology, and comparability across survey years. Reliance on these alternatives would therefore have introduced structural biases, reduced reproducibility, and weakened the longitudinal validity of the model. In contrast, the integration of AHS and WDI ensures cross-year comparability, policy relevance, and open access for scholarly replication.
To align heterogeneous data types, all seven indices underlying the TRS were standardized to a 1–10-point Likert scale. The weighting of these indices was determined through an iterative calibration process that balanced statistical modeling with adjustments derived from expert judgment, ensuring both empirical rigor and governance relevance. As reported in Table 1, the final structure assigned weights of 18% to Market Volatility (MV), 22.5% to Location Score (LS), 6.3% to Property Condition (PC), 25.2% to Legal and Regulatory Environment (LR), 9.9% to Economic Indicators (EI), 8.1% to Environmental Risks (ER), and 10% to Additional Project Signals (ADPs). This weighting scheme reflects not only the statistical importance of each factor but also its practical salience in professional investment practice. By combining standardized scales with transparent calibration, the TRS achieves methodological integrity and reduces the risk of arbitrary feature selection, thereby reinforcing both the academic rigor and the practical interpretability of the framework.

3.2.2. Investor Behavior and Risk Prioritization

To complement the quantitative calibration of TRS weights, we conducted a qualitative–quantitative study with senior real estate investors to evaluate how seasoned practitioners perceive and prioritize the seven defined risk indices. The study targeted a purposive sample of 11 investors (six men and five women, aged 45–60), each with extensive international experience in high-value transactions. Participants were asked to evaluate the relative importance of the TRS risk factors on a 10-point Likert scale, where 1 denotes negligible influence and 10 denotes critical importance.
The findings reveal a consistent hierarchy of risk perception. Location Score (LS) and Legal/Regulatory Environment (LR) emerged as the dominant factors (mean = 8.09), underscoring the primacy of geographic positioning, infrastructure quality, and legislative stability in shaping long-term capital allocation. Economic Indicators (EI) followed with a mean of 7.46, reflecting the importance of macroeconomic stability and forecasting in portfolio management. Market Volatility (MV) was assessed at 5.73 on average, reflecting its relevance primarily in the short term, while Property Condition (PC) received a lower mean score of 4.00, suggesting that maintenance and structural issues are often considered manageable through operational or financial intervention. Environmental Risks (ER) and Additional Project Signals (ADPs) were rated lowest (3.18 and 3.64, respectively), although respondents acknowledged their growing importance in light of emerging ESG frameworks and data-driven investment paradigms. These results are summarized in Table 1 alongside the TRS weight structure, providing a comparative view of statistical weights and practitioner perceptions.
To synthesize the results, we defined an Investment Decision Score (IDS) as the weighted aggregation of investor evaluations:
IDS = ∑_{i=1}^{n} Wi·Si,
where Si denotes the mean investor score for factor i, Wi the TRS weight, and n = 7 the number of risk factors. Substituting the TRS weights from Table 1 and the mean survey scores yields:
IDS = (0.18 × 5.727) + (0.225 × 8.090) + (0.063 × 4.000) + (0.252 × 8.090) + (0.099 × 7.455) + (0.081 × 3.182) + (0.10 × 3.636) ≈ 6.50.
This aggregate score provides a normative benchmark that integrates statistical calibration with practitioner insights. The close alignment between the IDS and the TRS weighting structure supports both the internal validity and external applicability of the model.
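For transparency, the IDS arithmetic can be reproduced directly from the reported TRS weights and mean investor scores, for example:

```python
# Reproducing the Investment Decision Score from the TRS weights and the
# mean investor scores reported above.
weights = [0.18, 0.225, 0.063, 0.252, 0.099, 0.081, 0.10]    # MV, LS, PC, LR, EI, ER, ADP
scores  = [5.727, 8.090, 4.000, 8.090, 7.455, 3.182, 3.636]   # mean survey ratings
ids = sum(w * s for w, s in zip(weights, scores))
print(round(ids, 2))   # ~6.50
```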
Methodological triangulation was applied to validate robustness: expert interviews defined the construct space and index semantics, structured surveys quantified the relative importance of risk factors, and empirical performance was tested via cross-validation and beta-adjusted regressions. Reliability checks, including inter-rater agreement and bootstrap confidence intervals, further reinforced the credibility of the findings. Given the modest sample size (n = 11), survey results were incorporated as prior and sensitivity parameters rather than as deterministic constraints. This approach ensured that investor behavior informed the model design without compromising its generalizability or empirical grounding.

3.2.3. Expert Consultation

To further ensure reliability, we calculated inter-rater agreement statistics (Cohen’s kappa), which confirmed substantial consistency among expert assessments.

3.3. Dataset Construction

The construction of the RECIR dataset followed a systematic and transparent process designed to ensure coherence, reproducibility, and alignment with the conceptual framework of the Total Risk Score (TRS). Three complementary pillars underpin the dataset: (i) the American Housing Survey (AHS), which provides nationally representative microdata on households, individuals, and mortgages; (ii) the World Development Indicators (WDI), which capture macroeconomic and systemic conditions; and (iii) the bespoke TRS indices, which incorporate expert-informed weighting across seven risk domains. Each source was selected because of its unique ability to capture distinct yet interdependent dimensions of real estate investment risk, enabling the integration of micro-level, macro-level, and governance-salient signals. Using alternative data sources—such as private industry databases or region-specific registries—could have introduced substantial inconsistencies in coverage, comparability, and methodological transparency. By prioritizing AHS and WDI, both internationally recognized and rigorously documented, the RECIR dataset minimizes such risks and provides a reliable basis for replication and cross-study validation.

3.3.1. TRS Indices

The third pillar of the dataset consists of seven risk indices: Market Volatility (MV), Location Score (LS), Property Condition (PC), Legal and Regulatory (LR), Economic Indicators (EI), Environmental Risks (ER), and Additional Project Signals (ADPs). Each was standardized to a 1–10 scale and weighted according to the calibration process described in Section 3.4. Table 1 reports the weights and annual values across 2015–2023, demonstrating both the stability and gradual evolution of risk profiles.
Table 1. Risk indices and the calculated TRS (2015–2023).

Indices                          Abbr.   Weights (wi)   2015    2017    2019    2021    2023
Market Volatility                MV      0.180          4.42    4.98    5.33    5.64    6.37
Location Score                   LS      0.225          7.65    7.71    7.98    8.18    8.55
Property Condition               PC      0.063          3.98    4.11    4.23    4.36    4.93
Legal and Regulatory             LR      0.252          7.90    7.99    8.11    8.27    8.46
Economic Indicators              EI      0.099          6.99    7.13    7.48    7.55    7.76
Environmental Risks              ER      0.081          3.17    3.29    3.37    3.45    3.77
Additional Data Points Reserve   ADP     0.100          3.43    3.50    3.57    3.64    3.85
Total Risk Score                 TRS                    6.05    6.23    6.44    6.61    6.97
The aggregate TRS, hereafter referred to as TRS_macro, provides an annual, macro-level risk signal. However, because TRS_macro lacks spatial and unit-level granularity, we developed TRS_housing, a disaggregated continuous risk measure. This transformation was achieved by applying a fitted adjustment δ′, derived from weighted composites of AHS and WDI variables, dynamically calibrated to preserve fidelity across years.
The transformation is expressed as:
TRS_housing = TRS_macro + δ′,
where δ′ ensures local variation across housing units while maintaining consistency with macro distributions. Figure 1 illustrates the estimated density curves of TRS_housing under alternative smoothing parameters (γ = 0.005, 0.0075, 0.020), with the optimal value of 0.0075 selected for balancing granularity and structural fidelity.
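As an illustration only, one plausible reading of this macro-to-micro adjustment is sketched below, treating δ′ as a γ-scaled, centered unit-level composite; the composite column, the centering, and the role of γ are assumptions for exposition, not the study's exact calibration.

```python
# Heavily hedged sketch of the macro-to-micro disaggregation: delta' is taken
# here as a gamma-scaled, standardized unit-level composite, which is only one
# plausible reading of the calibration described in the text.
import pandas as pd

def disaggregate_trs(units: pd.DataFrame, trs_macro_by_year: dict,
                     composite_col: str, gamma: float = 0.0075) -> pd.Series:
    """TRS_housing = TRS_macro(year) + delta', with delta' a gamma-scaled, centered composite."""
    composite = units[composite_col]
    delta = gamma * (composite - composite.mean()) / composite.std()
    return units["year"].map(trs_macro_by_year) + delta
```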

3.3.2. American Housing Survey (AHS)

The AHS, conducted biennially by the U.S. Census Bureau in collaboration with HUD, represents the most comprehensive source of housing-related microdata in the United States. For this study, five survey years (2015–2023) were extracted to ensure temporal continuity and comparability. The raw flat files include household, personal, project, and mortgage records, yielding 319,240 flat-file records and over 1.35 million detailed entries.
Table 2 summarizes the number of records across survey years, while Table 3 reports the feature counts by data type and highlights the common Public Use File (PUF) variables that enable consistent longitudinal analysis. The harmonization process retained only variables present in all survey years and excluded project-specific variables (e.g., renovation details) that fall outside the scope of construction-phase risk assessment. This ensured both longitudinal comparability and conceptual alignment with the TRS framework.
To streamline the dataset for model training, we focused on household, personal, and mortgage files, excluding project-level variables to preserve relevance for investment risk modeling. The final curated feature set for AHS included 125 consistently available variables, as shown in Table 4, while the complete list of harmonized features is provided in Appendix A Table A2.
Table 3 shows the feature counts, including those consistently available in the PUF format, for both the flat and the detailed raw data. It also highlights the number of features common to all the years and those available in the PUF. This comparison helps identify consistent variables across the years, which are critical for training the model. Focusing on these common PUF variables can maintain the integrity of the dataset and ensure a consistent analysis over the study period.

3.3.3. World Development Indicators (WDI)

The WDI dataset, curated by the World Bank, provides internationally standardized time-series indicators across economic, financial, and social dimensions. For integration with the AHS, we extracted U.S.-specific indicators for 2015–2023. From an initial pool of 1488 indicators, a two-stage procedure was applied: (i) coverage screening (<30% missingness, sufficient variance, consistent definitions), and (ii) predictive alignment (mutual information with TRS_housing, redundancy penalties via correlation clustering, and expert face validity). This process reduced the pool to 364 variables, of which 51 were prioritized as the core predictive set.
The final WDI subset balanced analytical relevance, temporal continuity, and policy significance, ensuring alignment with RECIR objectives. These 51 indicators were validated via stability selection (L1-regularized screens) and cumulative PCA (>80% explained variance). The prioritized variables are detailed in Appendix A Table A3. Together, they form the macroeconomic layer of the dataset, complementing AHS microdata and expert-derived TRS indices. Had other international datasets (e.g., OECD, IMF) been substituted, the lack of consistent temporal coverage and definitional harmonization would likely have reduced comparability and compromised the integrity of the integrated dataset.
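A minimal sketch of this two-stage screen (coverage filter, then mutual-information ranking with a correlation-based redundancy penalty) might look as follows; the 0.9 redundancy cut-off and the helper name are assumptions.

```python
# Illustrative sketch of the two-stage WDI screen; thresholds other than the
# 30% missingness cut and the 51-indicator target are assumptions.
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def screen_wdi(wdi: pd.DataFrame, target: pd.Series, max_missing: float = 0.30,
               top_k: int = 51) -> list[str]:
    # Stage 1: drop indicators with >30% missingness or (near-)zero variance.
    keep = [c for c in wdi.columns
            if wdi[c].isna().mean() <= max_missing and wdi[c].nunique() > 1]
    X = wdi[keep].fillna(wdi[keep].median())

    # Stage 2: rank by mutual information with TRS_housing, then prune
    # indicators highly correlated with an already-selected, higher-ranked one.
    mi = pd.Series(mutual_info_regression(X, target), index=keep).sort_values(ascending=False)
    corr = X.corr().abs()
    selected: list[str] = []
    for col in mi.index:
        if all(corr.loc[col, s] < 0.9 for s in selected):
            selected.append(col)
        if len(selected) == top_k:
            break
    return selected
```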

3.3.4. Dataset Assembly

Finally, the three pillars—AHS, WDI, and TRS—were assembled into a coherent unit-level dataset by aligning variables through the survey year. Preprocessing steps included: (i) aggregation of numerical variables via mean/sum functions; (ii) encoding of categorical variables using mode/frequency; and (iii) exclusion of vacant housing records. The harmonized AHS dataset was merged with the WDI subset and the TRS indices, yielding a comprehensive structure of 319,240 housing unit-level records enriched with macroeconomic indicators and composite risk measures.
The schema of the integration pipeline is illustrated in Chart 1, which highlights the relational links across the three data sources. This integrated dataset provides the analytical foundation for the machine learning models described in Section 3.4. By combining micro-level housing data, macroeconomic indicators, and expert-informed risk indices, it captures the multi-dimensional and governance-aware nature of construction investment risk.
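A simplified sketch of the assembly step is shown below, assuming hypothetical frame and column names (for example, a vacancy_flag field used to exclude vacant records).

```python
# Minimal sketch of the assembly step: harmonized AHS unit records merged with
# year-level WDI indicators and TRS indices on the survey year. Frame and
# column names are illustrative assumptions.
import pandas as pd

def assemble_dataset(ahs_units: pd.DataFrame, wdi_by_year: pd.DataFrame,
                     trs_by_year: pd.DataFrame) -> pd.DataFrame:
    # Exclude vacant units before enrichment, per the preprocessing rules above.
    occupied = ahs_units[ahs_units["vacancy_flag"] == 0]
    merged = occupied.merge(wdi_by_year, on="year", how="left")
    merged = merged.merge(trs_by_year, on="year", how="left")
    return merged
```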

3.4. Model Development

The model development phase aimed to construct an accurate, interpretable, and governance-aware framework for estimating real estate construction investment risk at the housing-unit level. Building directly on the integrated dataset described in Section 3.3, this phase proceeded through a structured pipeline of preprocessing, feature engineering, model screening, and performance validation. Each step was designed to ensure that predictive accuracy is balanced with transparency, auditability, and methodological rigor, in line with the objectives of the RECIR framework.

3.4.1. Data Preprocessing

The integration of AHS, WDI, and TRS data presented inherent challenges of scale harmonization, coding inconsistencies, and missing data. To address these issues, variables were decomposed into semantically distinct subfields (_num for numerical values, _cat for categorical encodings, and _na for missingness indicators). Special codes (e.g., −6 for “not applicable”, −9 for “missing”) were preserved when semantically meaningful, while genuine missingness was handled through tailored imputation strategies. Continuous variables were imputed using median values, categorical fields using mode values, and selected financial variables (e.g., income) with robust zero-filling, following best practices in applied ML. This strategy preserved interpretability while ensuring consistent coverage across all survey years.
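The decomposition and code-aware imputation can be sketched as follows; the special-code mapping shown is illustrative of the pattern rather than the full AHS code book.

```python
# Sketch of decomposing a hybrid AHS field into _num / _cat / _na subfields
# with code-aware median imputation; codes and column names are examples only.
import pandas as pd

SPECIAL_CODES = {-6: "not_applicable", -9: "missing"}

def decompose(df: pd.DataFrame, col: str) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    raw = df[col]
    is_special = raw.isin(list(SPECIAL_CODES))
    out[f"{col}_num"] = raw.where(~is_special)                      # numeric component
    out[f"{col}_cat"] = raw.where(is_special).map(SPECIAL_CODES)    # meaningful special codes
    out[f"{col}_na"] = (raw.isna() | (raw == -9)).astype(int)       # missingness flag
    # Median imputation for the numeric part (mode / zero-filling for other
    # field types, as described in the text).
    out[f"{col}_num"] = out[f"{col}_num"].fillna(out[f"{col}_num"].median())
    return out
```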

3.4.2. Variable Decomposition and Cleaning

The hybrid structure of many AHS variables required systematic decomposition. Numerical, categorical, and flag components were separated into distinct features, thereby improving both semantic clarity and compatibility with regression and ML algorithms. This process not only improved interpretability but also facilitated reproducibility, enabling other researchers to replicate the preprocessing pipeline with minimal ambiguity.

3.4.3. WDI Indicator Filtering

From the original 1488 WDIs, only those with <30% missingness and sufficient temporal variance were retained. Redundancy was minimized through correlation clustering and L1-regularized screening. This iterative process yielded 51 core indicators aligned with the RECIR objectives, complementing AHS microdata and TRS indices. Appendix A Table A3 provides the final selection. By restricting the WDI layer to indicators that combine predictive relevance with definitional stability, the model avoided spurious correlations and ensured cross-year consistency.

3.4.4. Final Dataset

After harmonization and cleaning, the final dataset consisted of 200 explanatory features (83 categorical, 114 numerical, and three control variables), fully aligned with both TRS_macro and TRS_housing targets. In total, 319,240 housing-unit records were preserved for modeling. Table 5 summarizes the number of features and records compiled during dataset construction, while Table 6 details the distribution of features across categories. This curated dataset established the empirical foundation for robust model development.

3.4.5. Feature Engineering

Feature engineering was applied to enhance predictive capacity while maintaining semantic interpretability. Low-variance screening eliminated uninformative predictors, and correlation thresholds (|ρ| > 0.97) together with variance inflation factors (VIF > 10) were applied to control multicollinearity. Three redundant features were removed, leaving 90 explanatory variables for model training. Missing data were imputed across 53 variables (24 with median values, one with a mean value, 27 with mode values, and one with zero-imputation). Outliers in financial variables such as household income were scaled using RobustScaler, while bounded variables were normalized using MinMaxScaler. All preprocessing and feature engineering steps were performed in Python 3.10 with scikit-learn version 1.3.0, ensuring reproducibility and methodological transparency.
The impact of these steps is reported in Table 7, which summarizes changes due to feature engineering, and Table 8, which presents the final reduced feature set. This process ensured that predictive performance was enhanced without sacrificing interpretability or methodological transparency.
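A condensed sketch of the scaling setup (RobustScaler for heavy-tailed financial variables, MinMaxScaler for bounded indices, plus a low-variance screen) is shown below with hypothetical column names; the correlation/VIF screen itself is sketched in Section 3.1.

```python
# Sketch of the screening-and-scaling stage with scikit-learn; column lists
# and the variance threshold are illustrative assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler

income_like = ["household_income"]                 # heavy-tailed financial variables (illustrative)
bounded = ["location_score", "condition_index"]    # bounded or standardized indices (illustrative)

scaling = ColumnTransformer(
    transformers=[
        ("robust", RobustScaler(), income_like),   # outlier-resistant scaling
        ("minmax", MinMaxScaler(), bounded),       # preserves proportional bounds
    ],
    remainder="passthrough",
)

pipeline = Pipeline(
    steps=[
        ("scale", scaling),
        ("low_variance", VarianceThreshold(threshold=1e-4)),  # drop near-constant columns
    ]
)
```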

3.4.6. Data Quality and Drift Management

To maintain long-term validity, provenance-aware ingestion protocols were implemented with automated anomaly flags and consistency checks. Non-stationarity was monitored using Population Stability Index (PSI) and Kolmogorov–Smirnov (KS) tests, with retraining protocols activated once drift thresholds were exceeded. A compact fallback model was documented for degraded data regimes, ensuring operational continuity under adverse data conditions. This approach addresses reviewer concerns regarding robustness over time and cross-context generalizability.
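A minimal PSI implementation of the kind used for such drift triggers is sketched below; the bin count and the 0.2 alert threshold are common heuristics rather than values reported here.

```python
# Minimal Population Stability Index sketch used as a drift trigger; the bin
# count and the 0.2 threshold are common heuristics, not values from the text.
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) distribution and a new batch."""
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.clip(np.histogram(expected, edges)[0] / len(expected), 1e-6, None)
    a_frac = np.clip(np.histogram(actual, edges)[0] / len(actual), 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Retraining protocol sketch: flag drift when PSI exceeds a chosen threshold.
# if population_stability_index(train_scores, new_scores) > 0.2: trigger retraining
```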

3.5. Model Selection and Performance Assessment

The objective of the model selection process was to evaluate a broad spectrum of regression and machine learning (ML) approaches with respect to their predictive accuracy, interpretability, and computational feasibility in estimating TRS_housing. Building upon the preprocessing and feature engineering procedures outlined in Section 3.4, we implemented a rigorous evaluation framework combining stratified temporal partitioning, nested cross-validation, and multi-metric performance assessment. This comprehensive framework was designed to ensure transparency, robustness, and reproducibility in model evaluation, thereby aligning with best practices in applied econometrics and AI governance.
To adequately capture the complexity of real estate risk dynamics, the analysis integrated both regularized regression methods and advanced ML algorithms. Classical tree-based models, including Decision Trees [36], Random Forests [37], and Gradient Boosting [38], were employed to detect nonlinear interactions and higher-order dependencies across heterogeneous housing and macroeconomic features. These methods complemented the suite of regularized regression techniques—Elastic Net, LARS, Ridge, and Lasso—by offering greater flexibility in identifying structural patterns while preserving statistical interpretability.
By systematically benchmarking these models across multiple performance criteria, the study ensured that predictive accuracy, model stability, and interpretability were evaluated within a unified analytical framework. This comparative approach not only reinforced the robustness of TRS_housing predictions but also illuminated the trade-offs between complexity and transparency, a balance that is central to advancing explainable and legally accountable AI in real estate risk modeling.

3.5.1. Data Partitioning and Evaluation

The final dataset of 319,240 records was partitioned into training, validation, and testing subsets in a 70/15/15 split, stratified by survey year to preserve temporal consistency. Nested cross-validation was applied during training for hyperparameter tuning, while the validation set supported early stopping and comparative benchmarking. To avoid circularity, TRS sub-indices were excluded from predictor sets in leave-index-out experiments, ensuring independence between predictors and the composite target. Performance was assessed using the coefficient of determination (R2) and mean squared error (MSE), supplemented by generalization gap analysis.
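A sketch of the year-stratified 70/15/15 partition is given below, assuming a year column and a TRS_housing target in the assembled frame; the random seed is an assumption.

```python
# Sketch of the 70/15/15 split stratified by survey year; column names and
# the seed are illustrative assumptions.
from sklearn.model_selection import train_test_split

def split_by_year(df, target_col="TRS_housing", year_col="year", seed=42):
    """70/15/15 split stratified by survey year to preserve temporal composition."""
    train, rest = train_test_split(df, test_size=0.30, stratify=df[year_col], random_state=seed)
    valid, test = train_test_split(rest, test_size=0.50, stratify=rest[year_col], random_state=seed)

    def xy(part):
        return part.drop(columns=[target_col]), part[target_col]

    return xy(train), xy(valid), xy(test)
```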

3.5.2. Model Families and Screening

The comparative screening process encompassed a diverse set of model families designed to capture different bias–variance trade-offs and reflect methodological breadth. Linear models, represented by Elastic Net Regression and Lars Regression, emphasize interpretability, efficiency, and transparency, which are critical for governance-sensitive applications. Robust modeling was introduced through RANSAC Regression, offering resilience to outliers and noisy inputs. Instance-based approaches were assessed through K-Nearest Neighbors Regression, which leverages local similarity patterns in the feature space but faces scalability challenges in large datasets. Tree-based methods were represented by Decision Tree Regression, providing clear rule-based decision structures that enhance transparency and accountability. Ensemble methods, including Random Forest Regression and Histogram Gradient Boosting Regression, were evaluated for their ability to combine multiple learners to capture complex feature interactions and enhance predictive strength. Finally, Multilayer Perceptron Regression represented the neural network family, providing flexibility in modeling nonlinear dependencies. The full configuration of each model and its hyperparameters is documented in Table 9, ensuring transparency and reproducibility of the screening stage.

3.5.3. Cross-Validation Results

Each candidate model was trained and assessed through a five-fold cross-validation procedure that incorporated both accuracy and computational performance. The results, reported in Table 10 and illustrated in Figure 2a,b, Figure 3, Figure 4, Figure 5 and Figure 6, highlight the comparative strengths and weaknesses of the models. Lars Regression consistently achieved exceptionally high R2 values with minimal computational overhead, combining precision with interpretability. Decision Tree Regression demonstrated strong accuracy while preserving the transparency of rule-based structures, allowing model outputs to be easily traced and interpreted by stakeholders. Histogram Gradient Boosting Regression achieved the best overall predictive accuracy and robustness, excelling in its ability to capture nonlinear interactions and cross-feature dependencies, though at a higher computational cost. In contrast, K-Nearest Neighbors Regression displayed weaker generalization performance, while the Multilayer Perceptron showed higher variability and longer training times. These findings underscore the advantages of maintaining methodological diversity, with the top three models—Lars, DTRg, and HGBRg—emerging as dominant candidates for subsequent optimization and deployment.

3.5.4. Addressing Overfitting and Multicollinearity

To safeguard against overfitting, several methodological controls were applied. Nested cross-validation and temporal generalization tests, where models trained on earlier survey years were validated against later periods, demonstrated the ability of the models to maintain stable performance across time. Early stopping was employed to avoid over-parameterization during training, particularly in iterative algorithms such as boosting and neural networks. Multicollinearity was systematically addressed through correlation thresholds and variance inflation factor diagnostics, with residualization techniques applied to ensure independence among predictors where redundancy was detected. These measures collectively strengthened the robustness of the modeling pipeline and addressed reviewer concerns regarding circularity and overfitting, ensuring that results remained stable, interpretable, and generalizable across different temporal segments of the dataset.

3.5.5. Treatment of Categorical Variables and Outliers

Special attention was dedicated to the treatment of categorical variables and skewed financial data, which are common challenges in real estate datasets. Ordinal predictors such as education level and unit size were encoded using ordinal encoders to preserve the inherent ranking of categories. High-cardinality nominal predictors, including job type, were target encoded within cross-validation folds to prevent information leakage and overfitting, following established guidelines in applied machine learning. Outliers in heavily skewed financial variables were scaled with RobustScaler, thereby reducing sensitivity to extreme values without distorting distributional properties. Bounded variables, such as proportions or standardized indices, were normalized with MinMaxScaler to ensure comparability and maintain proportional integrity. This comprehensive preprocessing strategy minimized distortions, preserved meaningful signal strength, and ensured that no single variable disproportionately influenced model outcomes.
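This encoding strategy can be expressed as a scikit-learn pipeline of the following form; the column lists are illustrative, and the intended category order for ordinal fields would need to be supplied explicitly.

```python
# Sketch of the encoding strategy: ordinal encoding for ranked categories,
# cross-fitted target encoding for high-cardinality nominals, and robust /
# min-max scaling for skewed and bounded numerics. Column lists are examples.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, RobustScaler, TargetEncoder

ordinal_cols = ["education_level", "unit_size"]   # ranked categories (pass categories=[...] to fix the order)
nominal_cols = ["job_type"]                       # high-cardinality nominals (illustrative)
skewed_cols = ["household_income"]
bounded_cols = ["location_score"]

preprocess = ColumnTransformer([
    ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ordinal_cols),
    ("target", TargetEncoder(), nominal_cols),    # internally cross-fitted to limit target leakage
    ("robust", RobustScaler(), skewed_cols),
    ("minmax", MinMaxScaler(), bounded_cols),
])

model = Pipeline([("prep", preprocess), ("hgb", HistGradientBoostingRegressor())])
```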

3.5.6. Implementation

Following the feature engineering (Section 3.4) and model selection framework (Section 3.5), the implementation phase operationalized the regression and machine learning models within a standardized computational environment. All algorithms were executed using the Scikit-learning library [39], which provided a robust and reproducible platform for model development, parameter tuning, and validation.
Regularization techniques, including Lasso regression [40] and Ridge regression [41] were applied to mitigate multicollinearity and enhance generalization performance.
These methods were complemented by Elastic Net and LARS, ensuring that variable selection and shrinkage were consistently aligned with the interpretability and transparency requirements highlighted in the legal and ethical considerations (Section 2.3).
To capture complex, nonlinear dependencies, classical tree-based methods such as Decision Trees [36], Random Forests [37], and Gradient Boosting [38] were deployed alongside regression models. The integration of these complementary approaches facilitated a balanced comparison between statistical parsimony and predictive flexibility.
Finally, the design of the implementation process adhered to Breiman seminal perspective on the “two cultures” of statistical modeling [42], emphasizing both predictive accuracy and interpretability. By embedding model selection, validation, and documentation into a transparent pipeline, the implementation phase ensured methodological rigor while maintaining compliance with the accountability standards necessary for trustworthy AI in real estate risk assessment.

3.6. Algorithm Selection

Following the cross-validation stage, the selection of algorithms focused on narrowing the candidate pool to those models that demonstrate consistently strong predictive performance, low variance across folds, and computational feasibility for deployment in governance-sensitive contexts. Comparative evaluation of the eight candidate families revealed that three approaches—Lars Regression, Decision Tree Regression, and Histogram Gradient Boosting Regression—dominated the trade-off between accuracy, interpretability, and efficiency. Figure 7a–c illustrate the comparative rankings across the key performance metrics of R2, MSE, and computational time, underscoring the stability of these three models relative to their peers. Lars Regression achieved near-perfect accuracy while maintaining minimal computational overhead, making it particularly suitable for contexts where interpretability and transparency are paramount. Decision Tree Regression provided a competitive balance between predictive power and structural clarity, allowing stakeholders to trace predictions directly to rule-based splits. Histogram Gradient Boosting Regression delivered the highest overall accuracy and robustness, excelling in its capacity to capture nonlinear feature interactions, albeit at a higher computational cost. Taken together, these results justified advancing Lars, DTRg, and HGBRg to the subsequent hyperparameter optimization stage, ensuring that both linear and nonlinear modeling paradigms were retained for final calibration of the RECIR framework.

3.7. Model Optimization

Hyperparameter optimization was conducted to refine generalizability, minimize prediction error, and enhance the robustness of the selected algorithms: Lars Regression, Decision Tree Regression, and Histogram Gradient Boosting Regression. A systematic grid search combined with ten-fold cross-validation was employed across predefined parameter ranges, with evaluation guided by adjusted R2, mean absolute error, root mean squared error, Pearson correlation, and bias error diagnostics.
The optimization confirmed the stability of all three models, each achieving adjusted R2 values above 0.99 with negligible generalization gaps between training and testing sets. Lars Regression reached its optimal configuration with a nonzero coefficient threshold of twenty-five and a convergence tolerance of 1 × 10−4, balancing speed and interpretability. Decision Tree Regression performed best at a depth of ten with a minimum of two samples per leaf, striking a balance between complexity and transparency. Histogram Gradient Boosting Regression demonstrated the strongest performance overall with three hundred boosting iterations, a learning rate of 0.1, and a minimum of twenty samples per leaf, consistently yielding the lowest RMSE and the highest predictive accuracy across folds.
Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 and Table 11, Table 12, Table 13 and Table 14 provide detailed evidence of the tuning outcomes, illustrating both the sensitivity of the models to parameter adjustments and the stability of the optimal regions across folds. The cross-validated results in Table 13 confirm adjusted R2 values above 0.99 for all three models, with minimal generalization gaps between training and testing sets: HGBRg consistently produced the lowest RMSE and the highest stability across folds, DTRg offered interpretability with competitive accuracy, and Lars balanced speed with transparency. All negative values are reported using the standard mathematical minus sign (−) for clarity and consistency.
Figure 11b and Table A7 and Table A8 in the Appendix A confirm these trends across the folds, with low error variance and no signs of instability. The effect of min_samples_leaf is minor, which suggests that DTRg is robust to small adjustments in leaf size.
HGBRg achieves the best overall results. Its best configuration—min_samples_leaf = 20, learning_rate = 0.1, and max_iter = 300—delivers an R2 of 0.9988 and a minimal RMSE on the testing set (Table 14c and Figure 10c).
The model demonstrates consistent performance across all the folds and configurations, with no indications of overfitting or degradation in predictive power as max_iter increases. Figure 11c shows that increasing the number of boosting iterations from 30 to 300 steadily improves both the training and the validation metrics. The stability of the model is further demonstrated by its narrow box plots and low standard deviations, which consistently remain below 0.0001 across all the metrics.
Figure 12 presents the comparative metrics for the best configuration of each model and highlights the performance evolution across the folds. Although Lars produces strong results, it shows slightly more variability across the cross-validation runs. DTRg maintains low error values with minimal fluctuations. HGBRg consistently delivers the lowest RMSE and the highest R2, with almost no variation from fold to fold.
Collectively, these optimization results positioned HGBRg as the most robust candidate for deployment, with DTRg and Lars providing complementary strengths in interpretability and computational efficiency. The methodological diversity and stability of these three models ensure that RECIR remains resilient to data shifts, transparent for auditability, and adaptable across governance-sensitive applications.
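For reference, a grid search of the kind described, centered on the reported HGBRg optimum, could be set up as follows; the intermediate grid values beyond 30/300 iterations and the 0.1 learning rate are assumptions, and X_train / y_train are taken from the split in Section 3.5.1.

```python
# Sketch of the ten-fold grid search around the reported HGBRg optimum
# (max_iter = 300, learning_rate = 0.1, min_samples_leaf = 20).
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_iter": [30, 100, 300],
    "learning_rate": [0.05, 0.1, 0.2],
    "min_samples_leaf": [10, 20, 50],
}
search = GridSearchCV(
    HistGradientBoostingRegressor(random_state=0),
    param_grid,
    cv=10,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
# search.fit(X_train, y_train); search.best_params_  # expected near the reported optimum
```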

3.8. Model Interpretation and Governance Auditing

To ensure that the RECIR framework remains transparent, reproducible, and aligned with governance requirements, we implemented a multi-layered interpretation and auditing protocol. Model interpretation relied on permutation importance, partial dependence (PDP), and accumulated local effects (ALE) to provide decision-relevant explanations that are robust to correlated features. These diagnostics were applied systematically across the three top-performing model families—Lars, DTRg, and HGBRg—demonstrating consistency of the identified drivers of risk. Figure 13 illustrates the relative contributions of the seven TRS indices under permutation-based importance, highlighting the governance-salient roles of Location Score (LS) and Legal/Regulatory Environment (LR).
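A sketch of the permutation-importance diagnostic with a simple uncertainty estimate is given below, assuming a fitted model and a held-out validation split named model, X_valid, and y_valid.

```python
# Sketch of model-agnostic permutation importance with a spread estimate on
# the validation split; object names are assumptions from the pipeline above.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_valid, y_valid, n_repeats=30, random_state=0, scoring="r2"
)
# Mean importance and a simple variability estimate per feature.
for name, mean, std in zip(X_valid.columns, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```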
To support governance auditing, all preprocessing steps, feature transformations, and modeling decisions were fully documented in a structured pipeline. This included explicit data lineage from raw AHS and WDI sources to the harmonized dataset described in Section 3.3, as well as metadata regarding imputation, scaling, and hyperparameter optimization. An audit trail was generated to enable external replication and regulatory compliance, in line with best practices from the Basel Committee on Banking Supervision and the EU AI Act.
Finally, the auditing protocol incorporated drift monitoring and explainability safeguards to ensure ongoing accountability. By embedding governance-aware interpretation directly into the modeling workflow, RECIR advances beyond conventional predictive models to provide interpretable, auditable, and legally compatible outputs. This integration addresses the reviewer’s concern regarding transparency and reproducibility, while reinforcing the framework’s applicability to high-stakes real estate investment decision-making.

4. Findings

4.1. Final Model Selection

Following the extensive benchmarking of eight candidate families, three estimators emerged as consistently dominant: Lars Regression, Decision Tree Regression (DTRg), and Histogram Gradient Boosting Regression (HGBRg). Table 15 summarizes their performance statistics, highlighting differences in accuracy, stability, and computational efficiency. While Lars demonstrated exceptionally high R2 values and low error rates, its linear structure limited its capacity to capture the nonlinear interactions evident in the integrated AHS–WDI–TRS dataset. DTRg offered competitive predictive performance and interpretability but showed greater sensitivity to fluctuations in the data, as reflected in higher variability across folds. In contrast, HGBRg provided the most balanced solution, combining superior accuracy, robust generalization, and the ability to model complex feature interactions with consistently low variance across validation folds. Although HGBRg imposed higher computational costs, this trade-off was considered acceptable given the governance-sensitive application domain, where consistency and reliability outweigh marginal efficiency gains. The final model was therefore specified as HGBRg with max_iter = 300, learning_rate = 0.1, and min_samples_leaf = 20, a configuration that offered the strongest balance between predictive strength and computational feasibility for risk-sensitive deployment.

4.2. Performance

The selected HGBRg model was evaluated on training, testing, and independent validation subsets, with the full set of results reported in Table 16 and Figure 13. Across all partitions, the model achieved R2 and adjusted R2 values consistently above 0.996, with validation performance approaching 0.999, thereby confirming its strong capacity for temporal and out-of-sample generalization. Error measures remained uniformly low, with mean absolute error (MAE) below 0.009 and root mean squared error (RMSE) below 0.02 across training and testing and further reduced to approximately 0.011 in the validation sample. Bias values were close to zero, and Pearson correlation coefficients consistently exceeded 0.99, as depicted in Figure 14, confirming both accuracy and alignment between predictions and observed outcomes.
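The reported metrics can be recomputed for any partition with a short helper such as the following sketch; model, X_part, and y_part are placeholders, and the adjusted R2 uses the usual correction for the number of observations n and predictors p.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X_part, y_part) -> dict:
    # Partition-level metrics corresponding to those reported in Table 16.
    pred = model.predict(X_part)
    n, p = X_part.shape
    r2 = r2_score(y_part, pred)
    return {
        "R2": r2,
        "adjusted_R2": 1 - (1 - r2) * (n - 1) / (n - p - 1),
        "MAE": mean_absolute_error(y_part, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_part, pred))),
        "bias": float(np.mean(pred - y_part)),
        "pearson_r": float(np.corrcoef(y_part, pred)[0, 1]),
    }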
An additional finding concerned the discretized structure of predictions. The TRS, although modeled as a continuous outcome, produced predicted values clustered around discrete risk strata (e.g., 6.06, 6.25, 6.45, and 6.64). Table 17 demonstrates that predicted group means remained nearly identical to actual means, with standard deviations typically below 0.05. This structural fidelity is particularly relevant for decision-making frameworks that rely on risk brackets or threshold-based governance triggers, since it indicates that the model internalizes and reproduces the categorical logic embedded in the TRS. Moreover, the consistency of these results across household-, regional-, and time-level contexts reinforces the applicability of the RECIR framework for both ex ante risk assessment and scenario-based stress testing.
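The comparison of predicted and actual means within these strata can be reproduced with a short grouping routine; in the sketch below the stratum centers are taken from the values quoted above, while the nearest-stratum assignment rule and the array names are assumptions made for illustration.

import numpy as np
import pandas as pd

strata = np.array([6.06, 6.25, 6.45, 6.64])   # example strata quoted above

def nearest_stratum(values: np.ndarray) -> np.ndarray:
    # Assign each prediction to the closest stratum center.
    return strata[np.abs(values[:, None] - strata[None, :]).argmin(axis=1)]

df = pd.DataFrame({"actual": y_val, "predicted": pred_val})   # hypothetical arrays
df["stratum"] = nearest_stratum(df["predicted"].to_numpy())
summary = df.groupby("stratum").agg(actual_mean=("actual", "mean"),
                                    predicted_mean=("predicted", "mean"),
                                    predicted_std=("predicted", "std"),
                                    n=("actual", "size"))
print(summary)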

4.3. Baselines and Temporal Generalization

To assess robustness, the RECIR framework was benchmarked against two baselines: a hedonic linear model using standard housing attributes and a macro-only model restricted to aggregate indicators. In both cases, predictive accuracy was markedly lower than that of the integrated HGBRg framework, underscoring the value of combining micro-level AHS variables, macro-level WDI, and governance-salient TRS indices. Temporal generalization was evaluated by training models on the 2015–2019 subsample and testing in 2021–2023, as well as by conducting leave-one-year-out cross-validation. The results confirmed that validation and test R2 closely tracked training performance with only marginal generalization gaps. Ensemble methods, particularly HGBRg, further reduced error variance relative to both linear and tree-based baselines, demonstrating stability in the face of distributional shifts across time.
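The two temporal checks described here follow a standard pattern; the sketch below assumes a data frame df with a YEAR column and a TRS target (column names are placeholders) and reuses the final_model configuration defined earlier.

from sklearn.base import clone
from sklearn.metrics import r2_score

years = [2015, 2017, 2019, 2021, 2023]   # survey waves in the harmonized dataset
feature_cols = [c for c in df.columns if c not in ("TRS", "YEAR", "CONTROL")]

# (1) Train on 2015-2019, test on 2021-2023.
train, test = df[df["YEAR"] <= 2019], df[df["YEAR"] >= 2021]
m = clone(final_model).fit(train[feature_cols], train["TRS"])
print("2021-2023 R2:", r2_score(test["TRS"], m.predict(test[feature_cols])))

# (2) Leave-one-year-out cross-validation.
for held_out in years:
    tr, te = df[df["YEAR"] != held_out], df[df["YEAR"] == held_out]
    m = clone(final_model).fit(tr[feature_cols], tr["TRS"])
    print(held_out, "R2:", r2_score(te["TRS"], m.predict(te[feature_cols])))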

4.4. Ablation and Parsimony

Ablation tests were conducted to examine the effect of dimensionality reduction on predictive accuracy. The bottom 25%, 50%, and 75% of predictors ranked by validation-set permutation importance were sequentially removed, and models were re-estimated under identical tuning protocols. The removal of the lowest 25% or 50% of features resulted in a median R2 decline of no more than 2–5%, albeit with increased variance across folds. However, eliminating 75% of predictors produced a sharp deterioration in both accuracy and temporal stability. On this basis, two model specifications were retained: a compact top-k model that achieved at least 95% of the baseline accuracy and a full model that maintained superior stability across temporal splits. The monotonic degradation observed in stepwise ablation confirmed that redundancy buffers exist within the dataset but also emphasized the importance of preserving a sufficiently broad feature set to maintain generalizability. This balance between parsimony and robustness strengthens the interpretability of RECIR while ensuring that the model remains practically deployable across diverse governance-sensitive applications.
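The ablation loop can be expressed compactly, as in the following sketch: features are ranked by validation-set permutation importance, the weakest fractions are dropped, and the model is refit under identical settings; all object names are placeholders.

import numpy as np
from sklearn.base import clone
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score

perm = permutation_importance(final_model, X_val, y_val, n_repeats=10, random_state=42)
ranked = np.array(X_val.columns)[np.argsort(perm.importances_mean)]   # weakest first

for frac in (0.25, 0.50, 0.75):
    dropped = set(ranked[: int(frac * len(ranked))])
    keep = [c for c in X_val.columns if c not in dropped]
    m = clone(final_model).fit(X_train[keep], y_train)
    print(f"dropped {frac:.0%}: validation R2 = "
          f"{r2_score(y_val, m.predict(X_val[keep])):.4f}")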

4.5. Predictive Accuracy

The residual analysis provides strong evidence for the predictive reliability of the final HGBRg model. As illustrated in Figure 15 and summarized in Table 18, the residuals exhibit a symmetric, zero-centered distribution, with more than 90% of values falling between −0.009 and +0.023. This stability across both training and validation partitions confirms the absence of systematic bias. The Q–Q plot (Figure 16) further demonstrates near-perfect alignment with the normal distribution, with only minor deviations at the tails. These results confirm that the residuals satisfy the assumption of normality, thereby supporting the validity of subsequent inference and model reliability.
Descriptive statistics in Table 19 reinforce this observation: the residual mean approximates zero, the standard deviation remains near 0.011, and skewness and kurtosis are negligible, confirming that the residuals closely follow the theoretical normal distribution. Aggregate residual metrics (Table 20) show that prediction errors are minimal and centered, with MAE = 8.07 × 10−3, RMSE = 1.11 × 10−2, and bias close to zero. Figure 17 plots the residuals against the predicted values and reveals a uniform scatter with no discernible pattern or heteroscedasticity; the group-level deviations in Table 21 remain below 1.5 × 10−4, with standard deviations ranging from 9.5 × 10−3 to 1.4 × 10−2, indicating that prediction uncertainty is small and evenly distributed across the output range. No residuals exceed ±4.41 × 10−2, the variability thresholds defined in Table 22, and the box plot in Figure 18 confirms a symmetric, compact distribution with a median centered at zero, a tightly constrained interquartile range, and no outliers or anomalous deviations. Collectively, these findings confirm that the HGBRg framework achieves both high accuracy and predictive stability, supporting its deployment in operational and policy-relevant settings.
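These diagnostics can be reproduced with standard tools; the sketch below computes the residual summary statistics, the share of residuals within the quoted band, and the Q–Q comparison against the normal distribution, using placeholder validation arrays.

import numpy as np
from scipy import stats

residuals = y_val - final_model.predict(X_val)      # hypothetical validation arrays

print("mean:", np.mean(residuals), "std:", np.std(residuals, ddof=1))
print("skew:", stats.skew(residuals), "kurtosis:", stats.kurtosis(residuals))
print("share within [-0.009, +0.023]:",
      np.mean((residuals >= -0.009) & (residuals <= 0.023)))

# Q-Q comparison: ordered residuals against theoretical normal quantiles.
(theoretical, ordered), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print("Q-Q correlation with normal quantiles:", r)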

4.6. Computational Efficiency and Feature Importance

In addition to accuracy, computational efficiency and feature reliance were evaluated to ensure that RECIR remains practical for real-world deployment. Permutation importance, computed over ten validation repetitions, revealed that predictive influence is concentrated within a small subset of features (Figure 19 and Table 23). Variables such as GASAMT_cat and GASAMT_num each contributed more than 2.3% to model performance, with an additional group of thirteen features, including housing condition, utility costs, and socioeconomic indicators, exerting moderate but consistent influence. Collectively, these variables accounted for nearly 16% of the model’s explanatory power and aligned with the TRS risk-domain logic. By contrast, over 70% of features made only marginal or negligible contributions, and a few exhibited slightly negative importance because of redundancy or noise.
Despite this imbalance, dimensionality reduction was deliberately avoided at this stage. Eliminating weakly influential variables risked destabilizing predictive performance, particularly under unseen market conditions. Maintaining the broader feature space supports generalizability and preserves modular adaptability for future applications. Notably, the prominence of GASAMT-related features suggests their proxy role for affordability, thermal efficiency, and infrastructure reliability, dimensions closely tied to project-level investment risk. Verification procedures ruled out label leakage by re-estimating importance with grouped encoders and year-fixed effects, confirming consistent rankings within one standard error. Finally, a structured updating protocol (Figure 20) was defined to integrate new data, retrain models, and benchmark evolving feature contributions, thereby ensuring the framework remains adaptive and trustworthy as economic and housing dynamics evolve.
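One way the updating protocol’s benchmarking step could work in practice is sketched below: the deployed model’s permutation-importance profile is compared against that of a model retrained on refreshed data, and the largest shifts are surfaced for review; X_new, y_new, and the other object names are placeholders.

import pandas as pd
from sklearn.base import clone
from sklearn.inspection import permutation_importance

def importance_series(model, X, y) -> pd.Series:
    # Mean permutation importance per feature, indexed by column name.
    perm = permutation_importance(model, X, y, n_repeats=10, random_state=42)
    return pd.Series(perm.importances_mean, index=X.columns)

deployed_imp = importance_series(final_model, X_val, y_val)

updated_model = clone(final_model).fit(X_new, y_new)    # retrained on refreshed data
updated_imp = importance_series(updated_model, X_val, y_val)

shift = (updated_imp - deployed_imp).sort_values(key=abs, ascending=False)
print(shift.head(10))   # features whose contribution changed most after the refresh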

4.7. Discussion and Implications

The findings of this study provide both methodological and substantive contributions to the literature on real estate risk assessment. Methodologically, the RECIR framework advances beyond valuation-centric models by incorporating governance-salient variables, such as permitting and inspection delays, contractor integrity, and regulatory exposures, into a unified, auditable Total Risk Score (TRS). This integration demonstrates that risk assessment cannot be confined to market volatility or macroeconomic indicators alone; rather, it requires a multi-dimensional perspective that captures both structural and institutional determinants of investment outcomes. By achieving consistently high predictive accuracy while maintaining interpretability through permutation importance and cross-validation diagnostics, RECIR contributes to ongoing debates on how to balance complexity and transparency in applied machine learning.
From a theoretical standpoint, the model reframes construction-phase investment risk as a multi-domain construct situated at the intersection of economics, law, and governance. This reconceptualization expands the analytical scope of real estate finance research, which has historically emphasized price forecasting, by embedding regulatory compliance, forensic risk evaluation, and explainability as first-class components of the analytical process. The framework’s macro-to-micro translation mechanism, which links aggregate indicators from the World Development Indicators to unit-level estimates from the American Housing Survey, provides a novel methodological pathway for aligning systemic conditions with granular investment decisions.
Practically, the model’s robustness across temporal splits and its validation against baseline comparators suggest immediate utility for lenders, developers, and policymakers. Financial institutions can apply the TRS in underwriting and portfolio risk management, while regulators may use it as a diagnostic tool to monitor systemic vulnerabilities. The explainability of the outputs further enhances accountability, allowing stakeholders to trace predictions back to legally meaningful variables, which is critical in governance-sensitive contexts such as urban renewal projects or cross-border investment transactions.
The study also speaks to broader debates in AI governance by illustrating that high-performing models can remain auditable, interpretable, and compliant with emerging legal frameworks such as the EU AI Act. In this way, RECIR contributes not only to the advancement of real estate risk management but also to the responsible deployment of AI in high-stakes financial domains. Permutation-based feature importance analysis highlights GASAMT_cat (monthly gas expenditure) as a dominant predictor, consistent with its proxy role, noted in Section 4.6, for affordability, thermal efficiency, and infrastructure reliability, all of which are closely tied to project-level investment risk.
Finally, Figure 20 presents the model updating plan for future iterations. It specifies the procedures for integrating new data, retraining, benchmarking performance, and periodically reassessing feature contributions, ensuring that the model remains adaptive and trustworthy as economic and housing dynamics evolve.

5. Conclusions and Future Research Directions

This study introduced the RECIR model as a next-generation framework for evaluating real estate investment risk, designed to integrate micro-level housing data, macroeconomic indicators, and governance-salient regulatory factors into a unified AI-based risk assessment architecture. By combining advanced machine learning algorithms, explainable AI techniques, and regulatory alignment mechanisms, RECIR significantly enhances predictive accuracy, interpretability, and adaptability relative to traditional econometric approaches. Comparative evidence across Tables 10–15, together with the summary in Table 24, demonstrates that the model not only reduces forecast error but also provides superior transparency and resilience, thereby offering a methodological contribution to both academic research and professional practice.
A central strength of RECIR lies in its ability to incorporate unstructured and high-frequency data sources, including IoT-based property monitoring, real-time financial indicators, and natural language processing of legal documentation, while maintaining alignment with regulatory standards such as the GDPR, the EU AI Act, and the U.S. Fair Housing Act. This capacity keeps the model operationally scalable and legally defensible, narrowing the gap between predictive performance and regulatory compliance. Furthermore, the integration of forensic AI techniques enables early detection of anomalies and fraud, reinforcing the model’s contribution to governance and investor protection.
Despite these advancements, several limitations must be acknowledged explicitly. First, although RECIR reduces reliance on historical data, its performance may still be challenged in highly volatile markets where structural breaks occur. Second, while the inclusion of environmental and legal indices enhances interpretability, the weighting of such variables may vary across jurisdictions, potentially limiting cross-country generalization. Third, algorithmic transparency remains a practical challenge: even with explainable AI tools, full interpretability is not always attainable when handling complex ensemble models.
Future research directions are therefore threefold. Methodologically, advances in reinforcement learning and continual-training architectures could further strengthen the model’s responsiveness to market shocks, and comparative studies that evaluate RECIR alongside hybrid econometrics–AI frameworks would provide empirical evidence of its relative efficiency and robustness across contexts. Substantively, future work should examine the socioeconomic consequences of AI-driven risk models, particularly their impact on investment allocation, housing affordability, and market stability. Technologically, expanding integration with blockchain-verified transactions and decentralized edge-computing infrastructures could address both transparency and cybersecurity concerns. These avenues of inquiry would keep RECIR adaptable to emerging data ecosystems and evolving governance requirements. Finally, because benchmark datasets were not used to evaluate RECIR, future research should validate the framework against standardized repositories to strengthen generalizability and facilitate comparative evaluation.
In summary, RECIR contributes to the academic literature by unifying heterogeneous data sources under an AI-driven, regulation-aware architecture and provides actionable tools for practitioners tasked with managing complex, multi-dimensional risks. Its ability to balance predictive power with regulatory compliance positions it as a pioneering framework for real estate finance. Nevertheless, ongoing refinement and critical evaluation remain essential for realizing its full potential, and future research must continue to align technological innovation with ethical, legal, and social considerations so that AI-driven risk models not only improve prediction but also foster transparency, trust, and fairness across global real estate markets.

Author Contributions

Conceptualization, A.L.; methodology, A.L.; formal analysis, A.L.; investigation, A.L.; data curation, A.L.; writing—original draft preparation, A.L.; writing—review and editing, A.L., L.C.L.d.R. and N.C.V.; supervision, L.C.L.d.R. and N.C.V.; project administration, L.C.L.d.R. and N.C.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was approved by the Institutional Review Board of the University of Cordoba. This study was conducted in accordance with the ethical guidelines of the 1964 Declaration of Helsinki, its subsequent amendments, and similar ethical standards.

Informed Consent Statement

All the participants provided oral consent to include their data in the research and development of the model. No identifiable personal details were obtained.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request. The data are not publicly available due to privacy and ethical considerations.

Acknowledgments

We thank José María Caridad y Ocerín for his valuable academic guidance and consistent professional support. His mentorship has been crucial for navigating the complexities of assessing real estate investment risk, particularly in conflict-prone regions. We also thank the participants of our focus groups and surveys, whose insights were essential for grounding our risk assessments in practical considerations and for deepening our understanding of investor sentiment in Ukraine and Israel. We give special acknowledgment to the owners of construction companies and consultants whose expertise contributed to a nuanced understanding of the role of geopolitical factors and AI analytics in shaping investment paradigms. We appreciate the dedicated commitment of everyone involved, which enabled the successful execution and dissemination of this research. The process of authoring this study has been both intellectually rewarding and enlightening.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADPs: additional data points
AHS: American Housing Survey
AI: artificial intelligence
CV: cross-validation
HUD: Department of Housing and Urban Development
MAE: mean absolute error
ML: machine learning
MSE: mean squared error
NLP: natural language processing
PUF: public use file
TRS: total risk score
WDI: World Development Indicators

Appendix A

Table A1. Data sources reviewed to find the input features for the RECIR model.
Source | Website | Description | Access Date
Zillow | https://www.zillow.com/research/ | Zillow provides various real estate market reports, housing data, and research insights. | Accessed on 13 October 2025
Redfin | https://www.redfin.com/blog/data-center | Redfin’s data center offers housing market trends, reports, and downloadable datasets. | Accessed on 13 October 2025
Realtor.com | https://www.realtor.com/research/ | Realtor.com Research provides market insights, trends, and reports on the US real estate market. | Accessed on 13 October 2025
Federal Housing Finance Agency (FHFA) | https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index.aspx | FHFA offers the House Price Index (HPI) dataset, providing information on housing price trends. | Accessed on 13 October 2025
U.S. Census Bureau | https://www.census.gov/topics/housing/data.html | The U.S. Census Bureau offers various datasets related to housing and demographics. | Accessed on 13 October 2025
National Association of Realtors (NAR) | https://www.nar.realtor/research-and-statistics | NAR provides research and statistics on the real estate market, including home sales and prices. | Accessed on 13 October 2025
CoreLogic | https://www.corelogic.com/ | CoreLogic offers a range of real estate data, including property information, analytics, and market trends. | Accessed on 13 October 2025
Harvard Joint Center for Housing Studies (JCHS) | https://www.jchs.harvard.edu/data | Harvard JCHS provides datasets on housing markets, demographics, and affordability. | Accessed on 13 October 2025
FRED Economic Data (Federal Reserve Bank of St. Louis) | https://fred.stlouisfed.org/ | FRED offers economic data, including housing-related indicators and economic trends. | Accessed on 13 October 2025
Urban Institute—Housing Finance Policy Center | https://www.urban.org/policy-centers/housing-finance-policy-center | Urban Institute provides research and data on housing finance policies. | Accessed on 13 October 2025
HUD via Data.gov | https://www.data.gov/ | Explore datasets related to housing and urban development from various government agencies. | Accessed on 13 October 2025
Trulia | https://www.trulia.com/research/ | Trulia’s research section offers insights and reports on real estate market trends. | Accessed on 13 October 2025
Attom Data Solutions | https://www.attomdata.com/ | Attom Data Solutions provides property data, analytics, and reports for real estate professionals. | Accessed on 13 October 2025
Harvard JCHS—State of the Nation’s Housing | https://www.jchs.harvard.edu/state-nations-housing | Harvard’s JCHS publishes annual reports on the state of the nation’s housing, including comprehensive data. | Accessed on 13 October 2025
World Bank—World Development Indicators (WDI) | https://databank.worldbank.org/source/world-development-indicators/preview/on | WDI is the primary World Bank collection of development indicators, compiled from international sources. | Accessed on 13 October 2025
MLS Databases | Various local MLSs | Local Multiple Listing Services (MLSs) offer property listings, sales, and market trends (access varies by region). | Accessed on 13 October 2025
Note: All data sources listed in Table A1 were accessed at various times between January 2024 and October 2025, depending on availability and the stage of model development.
Table A2. Final selected variables for the AHS data from the raw data.
Detail | Topic | Subtopic | Name | Description
householdOccupancy and TenureMonths OccupiedOCCYRRNDFlag indicating unit is typically occupied year-round
StructuralInterior FeaturesBATHEXCLUFlag indicating the unit’s bathroom facilities are for the exclusive use of the household
BATHROOMSNumber of bathrooms in unit
BEDROOMSNumber of bedrooms in unit
DININGNumber of dining rooms in unit
FOUNDTYPEType of foundation
TOTROOMSNumber of rooms in unit
UNITSIZEUnit size (square feet)
Housing ProblemsStructural ProblemsFLOORHOLEFlag indicating floor has holes
FNDCRUMBFlag indicating foundation has holes, cracks, or crumbling
PAINTPEELFlag indicating interior area of peeling paint larger than 8 x 11
ROOFHOLEFlag indicating roof has holes
ROOFSAGFlag indicating roof’s surface sags or is uneven
ROOFSHINFlag indicating roof has missing shingles or other roofing materials
WALLCRACKFlag indicating inside walls or ceilings have open holes or cracks
WALLSIDEFlag indicating outside walls have missing siding, bricks, or other missing wall materials
WALLSLOPEFlag indicating outside walls slope, lean, buckle, or slant
WINBOARDFlag indicating windows are boarded up
WINBROKEFlag indicating windows are broken
DemographicsHouseholder DemographicsHHADLTKIDSNumber of the householder’s unmarried children age 18 and over, living in this unit
HHAGEAge of householder
HHCITSHPU.S. citizenship of householder
HHGRADEducational level of householder
IncomeTotal Household IncomeFINCPFamily income (past 12 months)
HINCPHousehold income (past 12 months)
Housing CostsTotal Housing CostHOAAMTMonthly homeowners or condominium association amount
INSURAMTMonthly homeowner or renter insurance amount
LOTAMTMonthly lot rent amount
MORTAMTMonthly total mortgage amount (all mortgages)
PROTAXAMTMonthly property tax amount
RENTMonthly rent amount
TOTHCAMTMonthly total housing costs
UTILAMTMonthly total utility amount
UtilitiesELECAMTMonthly electric amount
GASAMTMonthly gas amount
OILAMTMonthly oil amount
OTHERAMTMonthly amount for other fuels
TRASHAMTMonthly trash amount
WATERAMTMonthly water amount
Renter SubsidyHUDSUBSubsidized renter status and eligibility
RENTCNTRLFlag indicating rent is limited by rent control or stabilization
RENTSUBType of rental subsidy or reduction (based on respondent report)
AffordabilityPERPOVLVLHousehold income as a percent of poverty threshold (rounded)
Owner’s Purchase, Value, and DebtDWNPAYPCTDown payment percentage
FIRSTHOMEFlag indicating if first-time home buyer
HOWBUYDescription of how owner obtained unit
LEADINSPFlag indicating lead pipes inspected before purchase
MARKETVALCurrent market value of unit
TOTBALAMTTotal remaining debt across all mortgages or similar debts for this unit
Home ImprovementGeneralHMRACCESSFlag indicating home improvements done in last two years to make home more accessible for those with physical limitations
HMRENEFFFlag indicating home improvements done to make home more energy efficient in last two years
HMRSALEFlag indicating home improvements done to get house ready for sale in last two years
MAINTAMTAmount of annual routine maintenance costs
REMODAMTTotal cost of home improvement jobs in last two years
REMODJOBSTotal number of home improvement jobs in last two years
Neighborhood FeaturesGeneralNORCFlag indicating respondent thinks the majority of neighbors 55 or older
RatingsNHQPCRIMEAgree or disagree: this neighborhood has a lot of petty crime
NHQPUBTRNAgree or disagree: this neighborhood has good bus, subway, or commuter train service
NHQRISKAgree or disagree: this neighborhood is at high risk for floods or other disasters
NHQSCHOOLAgree or disagree: this neighborhood has good schools
NHQSCRIMEAgree or disagree: this neighborhood has a lot of serious crime
RATINGHSRating of unit as a place to live
RATINGNHRating of neighborhood as place to live
Housing SearchHRATERating of current home
NRATERating of current neighborhood
personIncomePerson IncomeINTPPerson’s interest, dividends, and net rental income (past 12 months)
OIPPerson’s other income (past 12 months)
PAPPerson’s public assistance income (past 12 months)
RETPPerson’s retirement income (past 12 months)
SEMPPerson’s self-employment income (past 12 months)
SSIPPerson’s Supplemental Security Income (past 12 months)
SSPPerson’s Social Security income (past 12 months)
WAGPPerson’s wages or salary income (past 12 months)
projectHome ImprovementJob SpecificJOBTYPECost of home improvement job
mortgageMortgage DetailsMortgage OriginationINTRATEInterest rate of mortgage
Current Payment DetailsPMTAMTAmount of mortgage payment
TAXPMTFlag indicating property taxes included in mortgage payment
RefinanceREFIFlag indicating mortgage is a refinance of previous mortgage
Table A3. Final selected variables for the WDI data from the raw data.
Detail | Topic | Subtopic | Model Name | Databank Name | Description
World Development Indicators (WDI)Economic Policy & DebtNational accountsNGMKNY.GDP.MKTP.KNGDP (constant LCU)
NGPKNY.GDP.PCAP.KNGDP per capita (constant LCU)
NGNMKNY.GNP.MKTP.KNGNI (constant LCU)
NGNPKNY.GNP.PCAP.KNGNI per capita (constant LCU)
NTNCNY.TRF.NCTR.CNNet secondary income (Net current transfers from abroad) (current LCU)
NGNCNY.GSR.NFCY.CNNet primary income (Net income from abroad) (current LCU)
NGICNY.GNS.ICTR.CNGross savings (current LCU)
NGPCNY.GNP.PCAP.CNGNI per capita (current LCU)
NGTCNY.GDS.TOTL.CNGross domestic savings (current LCU)
NGDPCNY.GDP.PCAP.CNGDP per capita (current LCU)
NGMCANY.GDP.MKTP.CN.ADGDP: linked series (current LCU)
NGMCNY.GDP.MKTP.CNGDP (current LCU)
NGNMCNY.GNP.MKTP.CNGNI (current LCU)
NGMKDNY.GDP.MKTP.KDGDP (constant 2015 US$)
NGPKDNY.GDP.PCAP.KDGDP per capita (constant 2015 US$)
NGNMKDNY.GNP.MKTP.KDGNI (constant 2015 US$)
NGNPKDNY.GNP.PCAP.KDGNI per capita (constant 2015 US$)
EnvironmentAgricultural productionAYCKAG.YLD.CREL.KGCereal yield (kg per hectare)
APCMAG.PRD.CREL.MTCereal production (metric tons)
ALCHAG.LND.CREL.HALand under cereal production (hectares)
Financial SectorAssetsFRLAZFD.RES.LIQU.AS.ZSBank liquid reserves to bank assets ratio (%)
FBCZFB.BNK.CAPA.ZSBank capital to assets ratio (%)
FANZFB.AST.NPER.ZSBank nonperforming loans to total gross loans (%)
HealthPopulationSPGSP.POP.GROWPopulation growth (annual %)
SPDYSP.POP.DPND.YGAge dependency ratio, young (% of working-age population)
SPDOSP.POP.DPND.OLAge dependency ratio, old (% of working-age population)
SPPODPSP.POP.DPNDAge dependency ratio (% of working-age population)
SP6TZSP.POP.65UP.TO.ZSPopulation ages 65 and above (% of total population)
SP0TZSP.POP.0014.TO.ZSPopulation ages 0–14 (% of total population)
SP1TZSP.POP.1564.TO.ZSPopulation ages 15–64 (% of total population)
InfrastructureCommunicationsINBPIT.NET.BBND.P2Fixed broadband subscriptions (per 100 people)
IMMPIT.MLT.MAIN.P2Fixed telephone subscriptions (per 100 people)
IMMIT.MLT.MAINFixed telephone subscriptions
INBIT.NET.BBNDFixed broadband subscriptions
ICSIT.CEL.SETSMobile cellular subscriptions
TechnologyTVTMZTX.VAL.TECH.MF.ZSHigh-technology exports (% of manufactured exports)
TVTCTX.VAL.TECH.CDHigh-technology exports (current US$)
Private Sector & TradeExportsTVTZWTX.VAL.TRAN.ZS.WTTransport services (% of commercial service exports)
TVSCWTX.VAL.SERV.CD.WTCommercial service exports (current US$)
TVAZUTX.VAL.AGRI.ZS.UNAgricultural raw materials exports (% of merchandise exports)
TVFZUTX.VAL.FUEL.ZS.UNFuel exports (% of merchandise exports)
Public SectorConflict & fragilityVINVC.IDP.NWDSInternally displaced persons, new displacement associated with disasters (number of cases)
Defense & arms tradeMMXCMS.MIL.XPND.CDMilitary expenditure (current USD)
Government financeGLTGZGC.LBL.TOTL.GD.ZSNet incurrence of liabilities, total (% of GDP)
GNTCGC.NLD.TOTL.CNNet lending (+)/net borrowing (−) (current LCU)
GDTCGC.DOD.TOTL.CNCentral government debt, total (current LCU)
GATGZGC.AST.TOTL.GD.ZSNet acquisition of financial assets (% of GDP)
GDTGZGC.DOD.TOTL.GD.ZSCentral government debt, total (% of GDP)
GXTGZGC.XPN.TOTL.GD.ZSExpense (% of GDP)
GTOCGC.TAX.OTHR.CNOther taxes (current LCU)
Policy & institutionsRPRLRL.PER.RNK.LOWERRule of Law: Percentile Rank, Lower Bound of 90% Confidence Interval
RPRURL.PER.RNK.UPPERRule of Law: Percentile Rank, Upper Bound of 90% Confidence Interval
RSERL.STD.ERRRule of Law: Standard Error
RQPRURQ.PER.RNK.UPPERRegulatory Quality: Percentile Rank, Upper Bound of 90% Confidence Interval
RQSERQ.STD.ERRRegulatory Quality: Standard Error
Social Protection & LaborEconomic activitySIEFZSL.IND.EMPL.FE.ZSEmployment in industry, female (% of female employment) (modeled ILO estimate)
SIEMZSL.IND.EMPL.MA.ZSEmployment in industry, male (% of male employment) (modeled ILO estimate)
SSEMZSL.SRV.EMPL.MA.ZSEmployment in services, male (% of male employment) (modeled ILO estimate)
SGPEKSL.GDP.PCAP.EM.KDGDP per person employed (constant 2021 PPP $)
SSEFZSL.SRV.EMPL.FE.ZSEmployment in services, female (% of female employment) (modeled ILO estimate)
Labor force structureSTCFNZSL.TLF.CACT.FM.NE.ZSRatio of female to male labor force participation rate (%) (national estimate)
STCMZSL.TLF.CACT.MA.ZSLabor force participation rate, male (% of male population ages 15+) (modeled ILO estimate)
STTFZSL.TLF.TOTL.FE.ZSLabor force, female (% of total labor force)
STAFZSL.TLF.ACTI.FE.ZSLabor force participation rate, female (% of female population ages 15–64) (modeled ILO estimate)
STA1FZSL.TLF.ACTI.1524.FE.ZSLabor force participation rate for ages 15–24, female (%) (modeled ILO estimate)
MigrationSPRSM.POP.REFGRefugee population by country or territory of asylum
UnemploymentSUNZSL.UEM.NEET.ZSShare of youth not in education, employment or training, total (% of youth population)
SUNMZSL.UEM.NEET.ME.ZSShare of youth not in education, employment or training, total (% of youth population) (modeled ILO estimate)
SUNMAZSL.UEM.NEET.MA.ZSShare of youth not in education, employment or training, male (% of male youth population)
SUIFZSL.UEM.INTM.FE.ZSUnemployment with intermediate education, female (% of female labor force with intermediate education)
SUAMZSL.UEM.ADVN.MA.ZSUnemployment with advanced education, male (% of male labor force with advanced education)
SUAFZSL.UEM.ADVN.FE.ZSUnemployment with advanced education, female (% of female labor force with advanced education)
Table A4. Year distribution of the records (households) and feature composition of the final dataset.
Number of Features and Records of the Final Dataset by Year
Year | No. Features | No. Records
2015 | 198 | 69,493
2017 | 196 | 66,752
2019 | 196 | 63,185
2021 | 198 | 64,141
2023 | 198 | 55,669
Final Dataset | 198 | 318,240
SOURCE | No. of features (variables) |
AHS | 125 | American Housing Survey (US Census Bureau)
WDI | 72 | World Development Indicators (World Bank)
TRS | 1 | Author Survey (Expert Judgment)
TOTAL | 198 |
VARIABLE TYPE | COUNT | OBSERVATION
KEYS | 2 | YEAR, CONTROL
TARGET VARIABLE | 1 | TRS
INDEPENDENT VARIABLES | 197 | See Table 8
TOTAL VARIABLES | 200 |
CATEGORICAL | 83 |
NUMERICAL | 114 |
Table A5. Basic statistics of the final dataset.
A. Numerical variables
Name | Count | Mean | Std | Min | 25% | 50% | 75% | Max | Description | Source
TRS_housing319,2406.4377710.3153615.9084536.2082516.4392176.6268687.116613Total Risk ScoreAll sources
BEDROOMS_num319,2402.6724751.09849802336Number of bedrooms in unitAmerican Housing Survey (US Census Bureau)
DINING_num319,2400.4974780.5336100012Number of dining rooms in unitAmerican Housing Survey (US Census Bureau)
TOTROOMS_num319,2405.5147851.800217145714Number of rooms in unitAmerican Housing Survey (US Census Bureau)
HHADLTKIDS_num277,5110.2262870.55588500008Number of the householder’s unmarried children age 18 and over, living in this unitAmerican Housing Survey (US Census Bureau)
HHAGE_num277,51152.7650316.918291539536685Age of householderAmerican Housing Survey (US Census Bureau)
FINCP_num277,51182,559.62118,481−10,00023,00052,000101,0006,405,000Family income (past 12 months)American Housing Survey (US Census Bureau)
HINCP_num277,51186,477.3120,651.6−10,00024,97056,900108,0006,445,000Household income (past 12 months)American Housing Survey (US Census Bureau)
HOAAMT_num312,73527.96784203.9184000025,947Monthly homeowners or condominium association amountAmerican Housing Survey (US Census Bureau)
INSURAMT_num277,08569.4408997.46847004199959Monthly homeowner or renter insurance amountAmerican Housing Survey (US Census Bureau)
LOTAMT_num317,9986.557805103.7443000010,907Monthly lot rent amountAmerican Housing Survey (US Census Bureau)
MORTAMT_num318,639424.11851782.857−798800485201,207Monthly total mortgage amount (all mortgages)American Housing Survey (US Census Bureau)
PROTAXAMT_num319,240172.9731354.1790002439031Monthly property tax amountAmerican Housing Survey (US Census Bureau)
RENT_num319,240464.1232888.780900073013,100Monthly rent amountAmerican Housing Survey (US Census Bureau)
TOTHCAMT_num277,5111512.4212136.725065011371866203,093Monthly total housing costsAmerican Housing Survey (US Census Bureau)
UTILAMT_num299,106205.3452153.52140901902901790Monthly total utility amountAmerican Housing Survey (US Census Bureau)
ELECAMT_num294,314114.142187.61234060100150750Monthly electric amountAmerican Housing Survey (US Census Bureau)
GASAMT_num294,31438.8349656.30936002060700Monthly gas amountAmerican Housing Survey (US Census Bureau)
OILAMT_num294,2995.22812534.358150000830Monthly oil amountAmerican Housing Survey (US Census Bureau)
OTHERAMT_num294,3121.22886610.962320000480Monthly amount for other fuelsAmerican Housing Survey (US Census Bureau)
TRASHAMT_num299,10620.6489736.8024502330670Monthly trash amountAmerican Housing Survey (US Census Bureau)
WATERAMT_num299,10630.6519348.3119302350500Monthly water amountAmerican Housing Survey (US Census Bureau)
PERPOVLVL_num277,511304.4335175.32481144312502516Household income as percent of poverty threshold (rounded)American Housing Survey (US Census Bureau)
MARKETVAL_num319,240228,432.4467,3820091,040.5307,983.511,221,977Current market value of unitAmerican Housing Survey (US Census Bureau)
TOTBALAMT_num298,40749,999.22153,204.3000013,660,037Total remaining debt across all mortgages or similar debts for this unitAmerican Housing Survey (US Census Bureau)
MAINTAMT_num300,613713.91222396.192000526101,031Amount of annual routine maintenance costsAmerican Housing Survey (US Census Bureau)
REMODAMT_num319,2403716.9416,144.24000600937,900Total cost of home improvement jobs in last two yearsAmerican Housing Survey (US Census Bureau)
REMODJOBS_num319,2400.7921911.743335000128Total number of home improvement jobs in last two yearsAmerican Housing Survey (US Census Bureau)
RATINGHS_num268,5448.2847691.7151921781010Rating of unit as a place to liveAmerican Housing Survey (US Census Bureau)
RATINGNH_num268,0268.1977121.7795181781010Rating of neighborhood as place to liveAmerican Housing Survey (US Census Bureau)
PERSCOUNT_num319,2402.1282231.586905012319Number of people in the householdDerived from AHS Data
INTP_num277,5114229.27335,880.09−10,0000004,911,000Person’s interest, dividends, and net rental income (past 12 months)American Housing Survey (US Census Bureau)
OIP_num277,5112218.43917,773.1600003,007,500Person’s other income (past 12 months)American Housing Survey (US Census Bureau)
PAP_num277,51180.50247809.5531000040,800Person’s public assistance income (past 12 months)American Housing Survey (US Census Bureau)
RETP_num277,5114226.84420,710.1500002,396,000Person’s retirement income (past 12 months)American Housing Survey (US Census Bureau)
SEMP_num277,5116307.86455,311.02−10,0000005,786,000Person’s selfemployment income (past 12 months)American Housing Survey (US Census Bureau)
SSIP_num277,511455.23862512.08000092,000Person’s Supplemental Security Income (past 12 months)American Housing Survey (US Census Bureau)
SSP_num277,5115342.12811,074.120004000130,000Person’s Social Security income (past 12 months)American Housing Survey (US Census Bureau)
WAGP_num277,51163,617.0196,489.190036,00090,0003325,000Person’s wages or salary income (past 12 months)American Housing Survey (US Census Bureau)
PROJCOUNT_num319,2400.7921911.743335000128Number of home improvement projectsDerived from AHS Data
MORTCOUNT_num319,2400.323860.5182900013Number of mortgagesDerived from AHS Data
INTRATE_num95,5764.130621.51427103.2213753.94.70120.875Interest rate of mortgageAmerican Housing Survey (US Census Bureau)
PMTAMT_num95,5761670.0833004.353079312741967171,299Amount of mortgage paymentAmerican Housing Survey (US Census Bureau)
NGMK_num319,2401.97 × 10131.25 × 10121.8 × 10131.89 × 10131.99 × 10132.03 × 10132.18 × 1013GDP (constant LCU)World Development Indicators (World Bank)
NGPK_num319,24060,134.382903.05956,428.8958,180.9160,763.8861,244.7265,108.65GDP per capita (constant LCU)World Development Indicators (World Bank)
NGNMK_num319,2401.99 × 10131.18 × 10121.83 × 10131.91 × 10132.01 × 10132.05 × 10132.18 × 1013GNI (constant LCU)World Development Indicators (World Bank)
NGNPK_num319,24060,725.352669.86757,320.3758,882.3961,469.9661,648.1265,277.41GNI per capita (constant LCU)World Development Indicators (World Bank)
NTNC_num319,240−1.3 × 10112.03 × 1010−1.7 × 1011−1.5 × 1011−1.3 × 1011−1.2 × 1011−1.1 × 1011Net secondary income (Net current transfers from abroad) (current LCU)World Development Indicators (World Bank)
NGNC_num319,2402.19 × 10115.59 × 10101.25 × 10111.82 × 10112.28 × 10112.62 × 10112.86 × 1011Net primary income (Net income from abroad) (current LCU)World Development Indicators (World Bank)
NGIC_num319,2403.99 × 10124.45 × 10113.53 × 10123.61 × 10124.07 × 10124.08 × 10124.82 × 1012Gross savings (current LCU)World Development Indicators (World Bank)
NGPC_num319,24065,622.137892.87657,251.0259,886.7265,092.3568,249.180,523.81GNI per capita (current LCU)World Development Indicators (World Bank)
NGTC_num319,2403.92 × 10125.63 × 10113.28 × 10123.52 × 10123.98 × 10124.07 × 10124.96 × 1012Gross domestic savings (current LCU)World Development Indicators (World Bank)
NGDPC_num319,24065,022.338181.06856,172.2659,264.4464,402.8667,864.8480,402.29GDP per capita (current LCU)World Development Indicators (World Bank)
NGMCA_num319,2402.13 × 10133.01 × 10121.8 × 10131.92 × 10132.11 × 10132.25 × 10132.69 × 1013GDP: linked series (current LCU)World Development Indicators (World Bank)
NGMC_num319,2402.13 × 10133.01 × 10121.8 × 10131.92 × 10132.11 × 10132.25 × 10132.69 × 1013GDP (current LCU)World Development Indicators (World Bank)
NGNMC_num319,2402.15 × 10132.92 × 10121.83 × 10131.94 × 10132.13 × 10132.26 × 10132.69 × 1013GNI (current LCU)World Development Indicators (World Bank)
NGMKD_num319,2401.97 × 10131.25 × 10121.8 × 10131.89 × 10131.99 × 10132.03 × 10132.18 × 1013GDP (constant 2015 US$)World Development Indicators (World Bank)
NGPKD_num319,24060,134.382903.05956,428.8958,180.9160,763.8861,244.7265,108.65GDP per capita (constant 2015 US$)World Development Indicators (World Bank)
NGNMKD_num319,2401.99 × 10131.18 × 10121.83 × 10131.91 × 10132.01 × 10132.05 × 10132.18 × 1013GNI (constant 2015 US$)World Development Indicators (World Bank)
NGNPKD_num319,24060,725.352669.86757,320.3758,882.3961,469.9661,648.1265,277.41GNI per capita (constant 2015 US$)World Development Indicators (World Bank)
AYCK_num319,2408117.029331.12077534.18100.758198.358372.938447.75Cereal yield (kg per hectare)World Development Indicators (World Bank)
APCM_num319,2404.49 × 10015,843,8284.3 × 1004.37 × 1004.42 × 1004.63 × 1004.72 × 100Cereal production (metric tons)World Development Indicators (World Bank)
ALCH_num319,24055,665,1361,904,91553,111,23053,963,01655,805,02957,377,82858,051,885Land under cereal production (hectares)World Development Indicators (World Bank)
FRLAZ_num319,24014.687263.4094239.41091712.5311715.2958917.7273118.8309Bank liquid reserves to bank assets ratio (%)World Development Indicators (World Bank)
FBCZ_num319,2409.2108280.3024998.615049.2823319.3554789.3993179.418105Bank capital to assets ratio (%)World Development Indicators (World Bank)
FANZ_num319,2401.2483170.3109570.8842070.9394171.2236731.530551.662069Bank nonperforming loans to total gross loans (%)World Development Indicators (World Bank)
SPG_num319,2400.58710.1129940.4296990.4909080.5631710.678660.734789Population growth (annual %)World Development Indicators (World Bank)
SPDY_num319,24028.275170.57977527.2532127.910228.4066528.6968328.9061Age dependency ratio, young (% of workingage population)World Development Indicators (World Bank)
SPDO_num319,24023.547231.7911721.2637322.3153223.5223824.860526.39001Age dependency ratio, old (% of workingage population)World Development Indicators (World Bank)
SPPODP_num319,24051.822391.22181650.1698351.0121451.9290452.770753.64322Age dependency ratio (% of workingage population)World Development Indicators (World Bank)
SP6TZ_num319,24015.500981.05364914.1595814.7769315.4821916.2728417.17572Population ages 65 and above (% of total population)World Development Indicators (World Bank)
SP0TZ_num319,24018.628170.52745217.7382318.2694918.6974619.0030919.24902Population ages 0−14 (% of total population)World Development Indicators (World Bank)
SP1TZ_num319,24065.870840.52982765.0860565.4576765.8203666.2199866.5914Population ages 15−64 (% of total population)World Development Indicators (World Bank)
INBP_num319,24033.980472.54088230.8060532.3423533.4455536.3615537.7711Fixed broadband subscriptions (per 100 people)World Development Indicators (World Bank)
IMMP_num319,24032.907334.46214526.5529.332.235.9539.05Fixed telephone subscriptions (per 100 people)World Development Indicators (World Bank)
IMM_num319,2401.1 × 10012,763,10390,907,00099,507,0001.08 × 1001.19 × 1001.27 × 100Fixed telephone subscriptionsWorld Development Indicators (World Bank)
INB_num319,2401.14 × 10010,674,68299,900,0001.07 × 1001.13 × 1001.24 × 1001.3 × 100Fixed broadband subscriptionsWorld Development Indicators (World Bank)
ICS_num319,2403.5 × 10017,301,0983.28 × 1003.39 × 1003.52 × 1003.58 × 1003.8 × 100Mobile cellular subscriptionsWorld Development Indicators (World Bank)
TVTMZ_num319,24020.250860.97361618.5848319.6977220.8426320.9374221.21249Hightechnology exports (% of manufactured exports)World Development Indicators (World Bank)
TVTC_num319,2401.69 × 10111.63 × 10101.54 × 10111.55 × 10111.64 × 10111.76 × 10112 × 1011Hightechnology exports (current US$)World Development Indicators (World Bank)
TVTZW_num319,24010.328871.1526288.3508949.9424610.6280510.7610211.78374Transport services (% of commercial service exports)World Development Indicators (World Bank)
TVSCW_num319,2408.13 × 10117.82 × 10107.43 × 10117.43 × 10117.91 × 10118.56 × 10119.57 × 1011Commercial service exports (current US$)World Development Indicators (World Bank)
TVAZU_num319,2402.1447480.1431381.888662.0902652.1239472.2791522.290435Agricultural raw materials exports (% of merchandise exports)World Development Indicators (World Bank)
TVFZU_num319,24013.074313.8846339.0445529.49454913.8646814.3190420.04383Fuel exports (% of merchandise exports)World Development Indicators (World Bank)
VIN_num319,240814,043.8499,162.748,500438,5001,081,5001,144,0001,354,000Internally displaced persons, new displacement associated with disasters (number of cases)World Development Indicators (World Bank)
MMXC_num319,2407.28 × 10119.24 × 10106.41 × 10116.43 × 10117.08 × 10117.92 × 10118.88 × 1011Military expenditure (current USD)World Development Indicators (World Bank)
GLTGZ_num319,2407.1284774.0070173.965954.2372535.8856027.09594214.8048Net incurrence of liabilities, total (% of GDP)World Development Indicators (World Bank)
GNTC_num319,240−1.4 × 10128.75 × 1011−3 × 1012−1.6 × 1012−1.1 × 1012−6.5 × 1011−6.2 × 1011Net lending (+)/net borrowing (−) (current LCU)World Development Indicators (World Bank)
GDTC_num319,2402.26 × 10135.07 × 10121.72 × 10131.87 × 10132.09 × 10132.73 × 10133.06 × 1013Central government debt, total (current LCU)World Development Indicators (World Bank)
GATGZ_num319,2400.8947880.2838480.5966670.6838510.8129490.9529811.409915Net acquisition of financial assets (% of GDP)World Development Indicators (World Bank)
GDTGZ_num319,240105.127710.2792795.7048197.4749599.24091113.8834121.5008Central government debt, total (% of GDP)World Development Indicators (World Bank)
GXTGZ_num319,24024.448423.4476922.2783122.3429822.491824.0989131.20412Expense (% of GDP)World Development Indicators (World Bank)
GTOC_num319,2404.71 × 10104.7 × 10101.94 × 10101.95 × 10102.32 × 10103.14 × 10101.38 × 1011Other taxes (current LCU)World Development Indicators (World Bank)
RPRL_num319,24085.355791.62181583.0188783.8095285.4761986.6002887.38095Rule of Law: Percentile Rank, Lower Bound of 90% Confidence IntervalWorld Development Indicators (World Bank)
RPRU_num319,24094.469731.8901392.3809592.9245393.8095295.2106297.61905Rule of Law: Percentile Rank, Upper Bound of 90% Confidence IntervalWorld Development Indicators (World Bank)
RSE_num319,2400.1600350.0051820.154120.1546730.1601410.1641990.167804Rule of Law: Standard ErrorWorld Development Indicators (World Bank)
RQPRU_num319,24095.976011.83401992.5847196.4285796.6666796.9339697.61905Regulatory Quality: Percentile Rank, Upper Bound of 90% Confidence IntervalWorld Development Indicators (World Bank)
RQSE_num319,2400.2271250.0055510.2213280.2234060.2240270.2324950.235875Regulatory Quality: Standard ErrorWorld Development Indicators (World Bank)
SIEFZ_num319,2408.5277670.0606088.4236928.5190168.5310878.5752218.599635Employment in industry, female (% of female employment) (modeled ILO estimate)World Development Indicators (World Bank)
SIEMZ_num319,24028.489040.23436528.177228.2409928.5135728.7159228.74575Employment in industry, male (% of male employment) (modeled ILO estimate)World Development Indicators (World Bank)
SSEMZ_num319,24069.201720.31572968.8696168.9820569.0403769.5491869.65875Employment in services, male (% of male employment) (modeled ILO estimate)World Development Indicators (World Bank)
SGPEK_num319,240141,434.36084.95134,569.2136,422.9140,202.3147,749.9150,135.2GDP per person employed (constant 2021 PPP $)World Development Indicators (World Bank)
SSEFZ_num319,24090.506780.06714190.410890.4681290.5077790.511890.6182Employment in services, female (% of female employment) (modeled ILO estimate)World Development Indicators (World Bank)
STCFNZ_num319,24082.810090.57321182.1984182.3361382.8504783.0175183.85717Ratio of female to male labor force participation rate (%) (national estimate)World Development Indicators (World Bank)
STCMZ_num319,24068.257990.62257167.377567.551568.69668.754568.768Labor force participation rate, male (% of male population ages 15+) (modeled ILO estimate)World Development Indicators (World Bank)
STTFZ_num319,24045.193410.11805145.0793845.1331945.1660345.1935545.4362Labor force, female (% of total labor force)World Development Indicators (World Bank)
STAFZ_num319,24066.855010.79954565.836566.40466.793567.300568.2325Labor force participation rate, female (% of female population ages 15–64) (modeled ILO estimate)World Development Indicators (World Bank)
STA1FZ_num319,24049.363380.54453948.80748.97249.11549.84950.2555Labor force participation rate for ages 15–24, female (%) (modeled ILO estimate)World Development Indicators (World Bank)
SPR_num319,240317,84041,508.42270,206280,049327,478.5340,012.5386,130.5Refugee population by country or territory of asylumWorld Development Indicators (World Bank)
SUNZ_num319,24011.903560.95211110.665511.217511.494512.91513.0485Share of youth not in education, employment or training, total (% of youth population)World Development Indicators (World Bank)
SUNMZ_num319,24011.903560.95211110.665511.217511.494512.91513.0485Share of youth not in education, employment or training, total (% of youth population) (modeled ILO estimate)World Development Indicators (World Bank)
SUNMAZ_num319,24011.285330.93013810.099510.71410.881511.937512.692Share of youth not in education, employment or training, male (% of male youth population)World Development Indicators (World Bank)
SUIFZ_num319,2406.567091.5614934.92955.0076.0697.4629.074Unemployment with intermediate education, female (% of female labor force with intermediate education)World Development Indicators (World Bank)
SUAMZ_num319,2403.0391290.7470952.3752.3822.71553.23454.388Unemployment with advanced education, male (% of male labor force with advanced education)World Development Indicators (World Bank)
SUAFZ_num319,2403.2071420.884182.322.4472.8763.4854.7695Unemployment with advanced education, female (% of female labor force with advanced education)World Development Indicators (World Bank)
B. Categorical variables
Name | Count | Unique | Top | Freq | Unique Values | Mode | Description | Source
OCCYRRND_cat318,2693−6277,5113−6Flag indicating unit is typically occupied yearround (category)American Housing Survey (US Census Bureau)
BATHEXCLU_cat319,2113−6318,6363−6Flag indicating the unit’s bathroom facilities are for the exclusive use of the household (category)American Housing Survey (US Census Bureau)
BATHROOMS_cat319,240131114,275131Number of bathrooms in unit (category)American Housing Survey (US Census Bureau)
FOUNDTYPE_cat319,24010−698,91710−6Type of foundation (category)American Housing Survey (US Census Bureau)
UNITSIZE_cat277,4309470,56094Unit size (square feet) (category)American Housing Survey (US Census Bureau)
FLOORHOLE_cat319,24022313,45422Flag indicating floor has holes (category)American Housing Survey (US Census Bureau)
FNDCRUMB_cat311,75132200,11832Flag indicating foundation has holes, cracks, or crumbling (category)American Housing Survey (US Census Bureau)
PAINTPEEL_cat319,24022310,83622Flag indicating interior area of peeling paint larger than 8 × 11 (category)American Housing Survey (US Census Bureau)
ROOFHOLE_cat312,35232209,24632Flag indicating roof has holes (category)American Housing Survey (US Census Bureau)
ROOFSAG_cat312,98132208,81032Flag indicating roof’s surface sags or is uneven (category)American Housing Survey (US Census Bureau)
ROOFSHIN_cat312,38732205,00532Flag indicating roof has missing shingles or other roofing materials (category)American Housing Survey (US Census Bureau)
WALLCRACK_cat319,24022300,07522Flag indicating inside walls or ceilings have open holes or cracks (category)American Housing Survey (US Census Bureau)
WALLSIDE_cat313,50332207,86832Flag indicating outside walls have missing siding, bricks, or other missing wall materials (category)American Housing Survey (US Census Bureau)
WALLSLOPE_cat313,64532211,22832Flag indicating outside walls slope, lean, buckle, or slant (category)American Housing Survey (US Census Bureau)
WINBOARD_cat314,62632211,66832Flag indicating windows are boarded up (category)American Housing Survey (US Census Bureau)
WINBROKE_cat314,38032205,75032Flag indicating windows are broken (category)American Housing Survey (US Census Bureau)
HHADLTKIDS_cat319,24020277,51120Number of the householder’s unmarried children age 18 and over, living in this unit (category)American Housing Survey (US Census Bureau)
HHAGE_cat319,24030267,97230Age of householder (category)American Housing Survey (US Census Bureau)
HHCITSHP_cat319,24061220,31861U.S. citizenship of householder (category)American Housing Survey (US Census Bureau)
HHGRAD_cat319,240183965,8751839Educational level of householder (category)American Housing Survey (US Census Bureau)
FINCP_cat319,2402−1 × 108277,5112−1 × 108Family income (past 12 months) (category)American Housing Survey (US Census Bureau)
HINCP_cat319,2402−1 × 108277,5112−1 × 108Household income (past 12 months) (category)American Housing Survey (US Census Bureau)
HOAAMT_cat312,73520165,94620Monthly homeowners or condominium association amount (category)American Housing Survey (US Census Bureau)
INSURAMT_amax277,08520276,75020Monthly homeowner or renter insurance amount (topcoded)American Housing Survey (US Census Bureau)
INSURAMT_cat318,81430276,75030Monthly homeowner or renter insurance amount (category)American Housing Survey (US Census Bureau)
LOTAMT_amax746220745720Monthly lot rent amount (topcoded)American Housing Survey (US Census Bureau)
LOTAMT_cat317,9984−6310,5364−6Monthly lot rent amount (category)American Housing Survey (US Census Bureau)
MORTAMT_cat318,6392−6223,6642−6Monthly total mortgage amount (all mortgages) (category)American Housing Survey (US Census Bureau)
PROTAXAMT_amax163,98420163,93620Monthly property tax amount (topcoded)American Housing Survey (US Census Bureau)
PROTAXAMT_cat319,24030163,93630Monthly property tax amount (category)American Housing Survey (US Census Bureau)
RENT_cat319,2403−6191,9943−6Monthly rent amount (category)American Housing Survey (US Census Bureau)
TOTHCAMT_cat319,24020277,51120Monthly total housing costs (category)American Housing Survey (US Census Bureau)
UTILAMT_cat319,24031262,68231Monthly total utility amount (category)American Housing Survey (US Census Bureau)
ELECAMT_cat319,24064260,09664Monthly electric amount (category)American Housing Survey (US Census Bureau)
GASAMT_cat319,24064167,72964Monthly gas amount (category)American Housing Survey (US Census Bureau)
OILAMT_cat319,22560276,67260Monthly oil amount (category)American Housing Survey (US Census Bureau)
OTHERAMT_cat319,23860274,02460Monthly amount for other fuels (category)American Housing Survey (US Census Bureau)
TRASHAMT_cat319,24064128,95764Monthly trash amount (category)American Housing Survey (US Census Bureau)
WATERAMT_cat319,24064133,89764Monthly water amount (category)American Housing Survey (US Census Bureau)
HUDSUB_cat319,2404−6204,3954−6Subsidized renter status and eligibility (category)American Housing Survey (US Census Bureau)
RENTCNTRL_cat317,9043−6300,6223−6Flag indicating rent is limited by rent control or stabilization (category)American Housing Survey (US Census Bureau)
RENTSUB_cat315,8899−6188,2339−6Type of rental subsidy or reduction (based on respondent report) (category)American Housing Survey (US Census Bureau)
PERPOVLVL_amax277,51120194,72920Household income as percent of poverty threshold (rounded) (topcoded)American Housing Survey (US Census Bureau)
PERPOVLVL_cat319,24042188,07542Household income as percent of poverty threshold (rounded) (category)American Housing Survey (US Census Bureau)
DWNPAYPCT_cat289,12211−6171,17911−6Down payment percentage (category)American Housing Survey (US Census Bureau)
FIRSTHOME_cat312,0373−6156,5743−6Flag indicating if firsttime home buyer (category)American Housing Survey (US Census Bureau)
HOWBUY_cat314,8396−6156,5746−6Description of how owner obtained unit (category)American Housing Survey (US Census Bureau)
LEADINSP_cat310,3033−6156,5743−6Flag indicating lead pipes inspected before purchase (category)American Housing Survey (US Census Bureau)
MARKETVAL_amax188,23320188,13420Current market value of unit (topcoded)American Housing Survey (US Census Bureau)
MARKETVAL_cat319,24031188,13431Current market value of unit (category)American Housing Survey (US Census Bureau)
TOTBALAMT_cat298,4072−6223,6642−6Total remaining debt across all mortgages or similar debts for this unit (category)American Housing Survey (US Census Bureau)
HMRACCESS_cat318,9603−6225,6953−6Flag indicating home improvements done in last two years to make home more accessible for those with physical limitations (category)American Housing Survey (US Census Bureau)
HMRENEFF_cat318,8783−6225,6953−6Flag indicating home improvements done to make home more energy efficient in last two years (category)American Housing Survey (US Census Bureau)
HMRSALE_cat318,9473−6225,6953−6Flag indicating home improvements done to get house ready for sale in last two years (category)American Housing Survey (US Census Bureau)
MAINTAMT_amax144,03920144,03820Amount of annual routine maintenance costs (topcoded)American Housing Survey (US Census Bureau)
MAINTAMT_cat300,6133−6156,5743−6Amount of annual routine maintenance costs (category)American Housing Survey (US Census Bureau)
REMODAMT_cat319,24020162,66620Total cost of home improvement jobs in last two years (category)American Housing Survey (US Census Bureau)
REMODJOBS_cat319,24020162,66620Total number of home improvement jobs in last two years (category)American Housing Survey (US Census Bureau)
NORC_cat313,9783−6256,0483−6Flag indicating respondent thinks the majority of neighbors 55 or older (category)American Housing Survey (US Census Bureau)
NHQPCRIME_cat305,25432219,29232Agree or disagree: this neighborhood has a lot of petty crime (category)American Housing Survey (US Census Bureau)
NHQPUBTRN_cat299,79831137,44631Agree or disagree: this neighborhood has good bus, subway, or commuter train service (category)American Housing Survey (US Census Bureau)
NHQRISK_cat308,29932251,69232Agree or disagree: this neighborhood is at high risk for floods or other disasters (category)American Housing Survey (US Census Bureau)
NHQSCHOOL_cat284,32231227,49331Agree or disagree: this neighborhood has good schools (category)American Housing Survey (US Census Bureau)
NHQSCRIME_cat306,75532252,96432Agree or disagree: this neighborhood has a lot of serious crime (category)American Housing Survey (US Census Bureau)
RATINGHS_cat310,27321268,54421Rating of unit as a place to live (category)American Housing Survey (US Census Bureau)
RATINGNH_cat309,94121268,02621Rating of neighborhood as place to live (category)American Housing Survey (US Census Bureau)
HRATE_cat315,3344−6253,7754−6Rating of current home (category)American Housing Survey (US Census Bureau)
NRATE_cat315,2545−6253,7755−6Rating of current neighborhood (category)American Housing Survey (US Census Bureau)
INTP_cat319,24040227,25140Person’s interest, dividends, and net rental income (past 12 months) (category)American Housing Survey (US Census Bureau)
OIP_cat319,24040235,87440Person’s other income (past 12 months) (category)American Housing Survey (US Census Bureau)
PAP_cat319,24040250,11940Person’s public assistance income (past 12 months) (category)American Housing Survey (US Census Bureau)
RETP_cat319,24040219,21740Person’s retirement income (past 12 months) (category)American Housing Survey (US Census Bureau)
SEMP_cat319,24040252,77540Person’s self-employment income (past 12 months) (category)American Housing Survey (US Census Bureau)
SSIP_cat319,24040243,14940Person’s Supplemental Security Income (past 12 months) (category)American Housing Survey (US Census Bureau)
SSP_cat319,24040183,14840Person’s Social Security income (past 12 months) (category)American Housing Survey (US Census Bureau)
WAGP_cat319,24041160,12541Person’s wages or salary income (past 12 months) (category)American Housing Survey (US Census Bureau)
JOBTYPE_cat319,24038−8225,69538−8Type of home improvement job (category)American Housing Survey (US Census Bureau)
INTRATE_cat319,2403−8223,6643−8Interest rate of mortgage (category)American Housing Survey (US Census Bureau)
PMTAMT_amax318,5253−8223,6643−8Amount of mortgage payment (topcoded)American Housing Survey (US Census Bureau)
PMTAMT_cat318,9824−8223,6644−8Amount of mortgage payment (category)American Housing Survey (US Census Bureau)
TAXPMT_cat315,0134−8223,6644−8Flag indicating property taxes included in mortgage payment (category)American Housing Survey (US Census Bureau)
REFI_cat316,3744−8223,6644−8Flag indicating mortgage is a refinance of previous mortgage (category)American Housing Survey (US Census Bureau)
Notes. Values are reported as provided by the data sources. Large magnitudes use scientific notation in the form “A × 10n”; negative values use the true minus “−”. The multiplication sign used is “×” (U+00D7).
Table A6. Variables of the final dataset after feature engineering.
Target Feature: TRS
Input Features (Categorical) | Input Features (Categorical) | Input Features (Numerical) | Input Features (Numerical)
OCCYRRND_cat | HMRACCESS_cat | BEDROOMS_num | SEMP_num
BATHROOMS_cat | HMRENEFF_cat | DINING_num | SSIP_num
FOUNDTYPE_cat | HMRSALE_cat | HHAGE_num | SSP_num
UNITSIZE_cat | NORC_cat | FINCP_num | WAGP_num
FNDCRUMB_cat | NHQPCRIME_cat | HOAAMT_num | MORTCOUNT_num
ROOFHOLE_cat | NHQPUBTRN_cat | INSURAMT_num | INTRATE_num
ROOFSAG_cat | NHQRISK_cat | LOTAMT_num | PMTAMT_num
ROOFSHIN_cat | NHQSCHOOL_cat | PROTAXAMT_num | NTNC_num
WALLSIDE_cat | NHQSCRIME_cat | UTILAMT_num | NGMC_num
WALLSLOPE_cat | RATINGHS_cat | ELECAMT_num | AYCK_num
WINBOARD_cat | RATINGNH_cat | GASAMT_num | ALCH_num
WINBROKE_cat | HRATE_cat | OILAMT_num | TVTC_num
HHADLTKIDS_cat | NRATE_cat | OTHERAMT_num | TVSCW_num
HHAGE_cat | INTP_cat | TRASHAMT_num | GDTGZ_num
HHCITSHP_cat | OIP_cat | WATERAMT_num | GTOC_num
HHGRAD_cat | PAP_cat | PERPOVLVL_num | RPRU_num
INSURAMT_cat | SEMP_cat | MARKETVAL_num | SUNZ_num
LOTAMT_cat | WAGP_cat | TOTBALAMT_num |
ELECAMT_cat | JOBTYPE_cat | MAINTAMT_num |
GASAMT_cat | INTRATE_cat | REMODAMT_num |
OILAMT_cat |  | PERSCOUNT_num |
OTHERAMT_cat |  | INTP_num |
HUDSUB_cat |  | OIP_num |
PERPOVLVL_cat |  | PAP_num |
DWNPAYPCT_cat |  | RETP_num |
Table A7. Detailed k-fold metric results for the 10-fold cross-validation of the preselected models.
ID | Model | Control | K-Fold | Fit Time (s) | Score Time (s) | Test R2 | Train R2 | Test NMSE | Train NMSE
ElaN | Elastic Net Regression | 101 | 1 | 1.6963 | 0.0735 | 9.43 × 10−1 | 9.43 × 10−1 | 5.64 × 10−3 | 5.68 × 10−3
ElaN | Elastic Net Regression | 102 | 2 | 1.6658 | 0.0737 | 9.43 × 10−1 | 9.43 × 10−1 | 5.67 × 10−3 | 5.68 × 10−3
Lars | Lars Regression | 201 | 1 | 1.0196 | 0.0605 | 9.98 × 10−1 | 9.98 × 10−1 | 2.27 × 10−4 | 2.26 × 10−4
RscR | RANSAC Regression | 301 | 1 | 7.2345 | 0.1280 | 9.98 × 10−1 | 9.98 × 10−1 | 2.19 × 10−4 | 2.20 × 10−4
KnnR | K-Nearest Neighbors Regression | 401 | 1 | 1.9126 | 16.6020 | 7.65 × 10−1 | 8.10 × 10−1 | 2.33 × 10−2 | 1.89 × 10−2
DTRg | Decision Tree Regression | 501 | 1 | 2.3233 | 0.0887 | 9.87 × 10−1 | 9.87 × 10−1 | 1.29 × 10−3 | 1.27 × 10−3
HGBRg | Hist. Gradient Boosting Regression | 601 | 1 | 6.8778 | 0.1112 | 9.97 × 10−1 | 9.97 × 10−1 | 3.33 × 10−4 | 3.34 × 10−4
RFRg | Random Forest Regression | 701 | 1 | 115.4885 | 0.9066 | 9.99 × 10−1 | 1.00 × 100 | 1.30 × 10−4 | 3.73 × 10−5
MlpR | MLP Regression | 801 | 1 | 68.4073 | 0.0834 | 9.18 × 10−1 | 9.17 × 10−1 | 8.09 × 10−3 | 8.20 × 10−3
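The fold-level results in Table A7 come from a standard 10-fold cross-validation loop over the preselected estimators. The following is a minimal sketch of how such results can be produced with scikit-learn; the feature matrix X, the TRS_housing target y, the fold shuffling, and the subset of estimators shown are assumptions for illustration, not the exact project code.

```python
# Minimal sketch of the 10-fold cross-validation summarized in Table A7 (illustrative, not the project code).
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.linear_model import ElasticNet, Lars
from sklearn.ensemble import HistGradientBoostingRegressor

def cross_validate_models(X, y):
    # Hyperparameters follow Table 9; only three of the eight preselected models are shown here.
    models = {
        "ElaN": ElasticNet(alpha=0.05, l1_ratio=0.25, max_iter=100_000, random_state=42),
        "Lars": Lars(eps=1e-4, fit_intercept=True),
        "HGBRg": HistGradientBoostingRegressor(max_iter=30, random_state=42),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    summary = {}
    for name, estimator in models.items():
        scores = cross_validate(
            estimator, X, y, cv=cv,
            scoring={"r2": "r2", "nmse": "neg_mean_squared_error"},
            return_train_score=True,
        )
        # Per-fold values correspond to Table A7; their means and dispersions correspond to Table 10.
        summary[name] = {
            "mean_test_r2": float(np.mean(scores["test_r2"])),
            "mean_test_nmse": float(np.mean(scores["test_nmse"])),
            "mean_fit_time": float(np.mean(scores["fit_time"])),
        }
    return summary
```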
Figure A1. (a) Evolution of the metrics by k-fold (ElaN, Lars, RscR, KnnR). (b) Evolution of the metrics by k-fold (DTRg, HGBRg, RFRg, MlpR).
Table A8. Detailed k-fold metric results for the optimization of the preselected models.
Model (Parameters) | Metric | Estimator | Type | Folds 1–10
Lars (Least Angle Regression)
(eps, fit_intercept, n_nonzero_coefs)
R2(0.0001, True, 5)test9.5959 × 10−19.5930 × 10−19.5963 × 10−19.5937 × 10−19.5895 × 10−19.5960 × 10−19.5990 × 10−19.5944 × 10−19.6004 × 10−19.5946 × 10−1
(0.0001, True, 5)train9.5958 × 10−19.5950 × 10−19.5946 × 10−19.5974 × 10−19.5977 × 10−19.5938 × 10−19.5977 × 10−19.5937 × 10−19.5946 × 10−19.5932 × 10−1
(0.0001, True, 10)test9.7172 × 10−19.7234 × 10−19.7219 × 10−19.7207 × 10−19.7177 × 10−19.7168 × 10−19.7244 × 10−19.7215 × 10−19.7190 × 10−19.7193 × 10−1
(0.0001, True, 10)train9.7167 × 10−19.7246 × 10−19.7204 × 10−19.7236 × 10−19.7241 × 10−19.7151 × 10−19.7233 × 10−19.7216 × 10−19.7141 × 10−19.7187 × 10−1
(0.0001, True, 15)test9.9190 × 10−19.9194 × 10−19.9184 × 10−19.9177 × 10−19.9166 × 10−19.9200 × 10−19.9207 × 10−19.9181 × 10−19.9199 × 10−19.9201 × 10−1
(0.0001, True, 15)train9.9188 × 10−19.9195 × 10−19.9185 × 10−19.9188 × 10−19.9184 × 10−19.9192 × 10−19.9200 × 10−19.9185 × 10−19.9185 × 10−19.9199 × 10−1
(0.0001, True, 25)test9.9601 × 10−19.9607 × 10−19.9621 × 10−19.9608 × 10−19.9600 × 10−19.9615 × 10−19.9625 × 10−19.9612 × 10−19.9607 × 10−19.9603 × 10−1
(0.0001, True, 25)train9.9599 × 10−19.9603 × 10−19.9628 × 10−19.9609 × 10−19.9608 × 10−19.9607 × 10−19.9621 × 10−19.9617 × 10−19.9604 × 10−19.9604 × 10−1
(0.001, True, 5)test9.5959 × 10−19.5930 × 10−19.5963 × 10−19.5937 × 10−19.5895 × 10−19.5960 × 10−19.5990 × 10−19.5944 × 10−19.6004 × 10−19.5946 × 10−1
(0.001, True, 5)train9.5958 × 10−19.5950 × 10−19.5946 × 10−19.5974 × 10−19.5977 × 10−19.5938 × 10−19.5977 × 10−19.5937 × 10−19.5946 × 10−19.5932 × 10−1
(0.001, True, 10)test9.7172 × 10−19.7234 × 10−19.7219 × 10−19.7207 × 10−19.7177 × 10−19.7168 × 10−19.7244 × 10−19.7215 × 10−19.7190 × 10−19.7193 × 10−1
(0.001, True, 10)train9.7167 × 10−19.7246 × 10−19.7204 × 10−19.7236 × 10−19.7241 × 10−19.7151 × 10−19.7233 × 10−19.7216 × 10−19.7141 × 10−19.7187 × 10−1
(0.001, True, 15)test9.9190 × 10−19.9194 × 10−19.9184 × 10−19.9177 × 10−19.9166 × 10−19.9200 × 10−19.9207 × 10−19.9181 × 10−19.9199 × 10−19.9201 × 10−1
(0.001, True, 15)train9.9188 × 10−19.9195 × 10−19.9185 × 10−19.9188 × 10−19.9184 × 10−19.9192 × 10−19.9200 × 10−19.9185 × 10−19.9185 × 10−19.9199 × 10−1
(0.001, True, 25)test9.9601 × 10−19.9607 × 10−19.9621 × 10−19.9608 × 10−19.9600 × 10−19.9615 × 10−19.9625 × 10−19.9612 × 10−19.9607 × 10−19.9603 × 10−1
(0.001, True, 25)train9.9599 × 10−19.9603 × 10−19.9628 × 10−19.9609 × 10−19.9608 × 10−19.9607 × 10−19.9621 × 10−19.9617 × 10−19.9604 × 10−19.9604 × 10−1
MSE(0.0001, True, 5)test4.0332 × 10−34.0822 × 10−34.0184 × 10−34.0262 × 10−34.0452 × 10−34.0260 × 10−34.0102 × 10−34.0174 × 10−33.9856 × 10−34.0015 × 10−3
(0.0001, True, 5)train4.0182 × 10−34.0237 × 10−34.0309 × 10−34.0057 × 10−34.0050 × 10−34.0389 × 10−33.9983 × 10−34.0428 × 10−34.0299 × 10−34.0493 × 10−3
(0.0001, True, 10)test2.8230 × 10−32.7744 × 10−32.7678 × 10−32.7678 × 10−32.7818 × 10−32.8227 × 10−32.7562 × 10−32.7586 × 10−32.8030 × 10−32.7702 × 10−3
(0.0001, True, 10)train2.8164 × 10−32.7364 × 10−32.7801 × 10−32.7499 × 10−32.7470 × 10−32.8324 × 10−32.7497 × 10−32.7700 × 10−32.8421 × 10−32.7995 × 10−3
(0.0001, True, 15)test8.0874 × 10−48.0845 × 10−48.1236 × 10−48.1544 × 10−48.2201 × 10−47.9736 × 10−47.9269 × 10−48.1144 × 10−47.9889 × 10−47.8893 × 10−4
(0.0001, True, 15)train8.0692 × 10−47.9996 × 10−48.1016 × 10−48.0821 × 10−48.1208 × 10−48.0358 × 10−47.9499 × 10−48.1121 × 10−48.1027 × 10−47.9732 × 10−4
(0.0001, True, 25)test3.9849 × 10−43.9437 × 10−43.7696 × 10−43.8868 × 10−43.9456 × 10−43.8417 × 10−43.7543 × 10−43.8405 × 10−43.9157 × 10−43.9164 × 10−4
(0.0001, True, 25)train3.9815 × 10−43.9432 × 10−43.7008 × 10−43.8929 × 10−43.8974 × 10−43.9072 × 10−43.7682 × 10−43.8136 × 10−43.9344 × 10−43.9451 × 10−4
(0.001, True, 5)test4.0332 × 10−34.0822 × 10−34.0184 × 10−34.0262 × 10−34.0452 × 10−34.0260 × 10−34.0102 × 10−34.0174 × 10−33.9856 × 10−34.0015 × 10−3
(0.001, True, 5)train4.0182 × 10−34.0237 × 10−34.0309 × 10−34.0057 × 10−34.0050 × 10−34.0389 × 10−33.9983 × 10−34.0428 × 10−34.0299 × 10−34.0493 × 10−3
(0.001, True, 10)test2.8230 × 10−32.7744 × 10−32.7678 × 10−32.7678 × 10−32.7818 × 10−32.8227 × 10−32.7562 × 10−32.7586 × 10−32.8030 × 10−32.7702 × 10−3
(0.001, True, 10)train2.8164 × 10−32.7364 × 10−32.7801 × 10−32.7499 × 10−32.7470 × 10−32.8324 × 10−32.7497 × 10−32.7700 × 10−32.8421 × 10−32.7995 × 10−3
(0.001, True, 15)test8.0874 × 10−48.0845 × 10−48.1236 × 10−48.1544 × 10−48.2201 × 10−47.9736 × 10−47.9269 × 10−48.1144 × 10−47.9889 × 10−47.8893 × 10−4
(0.001, True, 15)train8.0692 × 10−47.9996 × 10−48.1016 × 10−48.0821 × 10−48.1208 × 10−48.0358 × 10−47.9499 × 10−48.1121 × 10−48.1027 × 10−47.9732 × 10−4
(0.001, True, 25)test3.9849 × 10−43.9437 × 10−43.7696 × 10−43.8868 × 10−43.9456 × 10−43.8417 × 10−43.7543 × 10−43.8405 × 10−43.9157 × 10−43.9164 × 10−4
(0.001, True, 25)train3.9815 × 10−43.9432 × 10−43.7008 × 10−43.8929 × 10−43.8974 × 10−43.9072 × 10−43.7682 × 10−43.8136 × 10−43.9344 × 10−43.9451 × 10−4
RMSE(0.0001, True, 5)test6.3508 × 10−26.3892 × 10−26.3391 × 10−26.3452 × 10−26.3602 × 10−26.3451 × 10−26.3326 × 10−26.3383 × 10−26.3131 × 10−26.3257 × 10−2
(0.0001, True, 5)train6.3389 × 10−26.3433 × 10−26.3489 × 10−26.3290 × 10−26.3285 × 10−26.3553 × 10−26.3232 × 10−26.3583 × 10−26.3482 × 10−26.3634 × 10−2
(0.0001, True, 10)test5.3132 × 10−25.2672 × 10−25.2610 × 10−25.2610 × 10−25.2742 × 10−25.3129 × 10−25.2500 × 10−25.2523 × 10−25.2943 × 10−25.2632 × 10−2
(0.0001, True, 10)train5.3070 × 10−25.2310 × 10−25.2727 × 10−25.2439 × 10−25.2412 × 10−25.3220 × 10−25.2437 × 10−25.2630 × 10−25.3312 × 10−25.2910 × 10−2
(0.0001, True, 15)test2.8438 × 10−22.8433 × 10−22.8502 × 10−22.8556 × 10−22.8671 × 10−22.8238 × 10−22.8155 × 10−22.8486 × 10−22.8265 × 10−22.8088 × 10−2
(0.0001, True, 15)train2.8406 × 10−22.8284 × 10−22.8463 × 10−22.8429 × 10−22.8497 × 10−22.8348 × 10−22.8196 × 10−22.8482 × 10−22.8465 × 10−22.8237 × 10−2
(0.0001, True, 25)test1.9962 × 10−21.9859 × 10−21.9416 × 10−21.9715 × 10−21.9864 × 10−21.9600 × 10−21.9376 × 10−21.9597 × 10−21.9788 × 10−21.9790 × 10−2
(0.0001, True, 25)train1.9954 × 10−21.9857 × 10−21.9238 × 10−21.9730 × 10−21.9742 × 10−21.9767 × 10−21.9412 × 10−21.9528 × 10−21.9835 × 10−21.9862 × 10−2
(0.001, True, 5)test6.3508 × 10−26.3892 × 10−26.3391 × 10−26.3452 × 10−26.3602 × 10−26.3451 × 10−26.3326 × 10−26.3383 × 10−26.3131 × 10−26.3257 × 10−2
(0.001, True, 5)train6.3389 × 10−26.3433 × 10−26.3489 × 10−26.3290 × 10−26.3285 × 10−26.3553 × 10−26.3232 × 10−26.3583 × 10−26.3482 × 10−26.3634 × 10−2
(0.001, True, 10)test5.3132 × 10−25.2672 × 10−25.2610 × 10−25.2610 × 10−25.2742 × 10−25.3129 × 10−25.2500 × 10−25.2523 × 10−25.2943 × 10−25.2632 × 10−2
(0.001, True, 10)train5.3070 × 10−25.2310 × 10−25.2727 × 10−25.2439 × 10−25.2412 × 10−25.3220 × 10−25.2437 × 10−25.2630 × 10−25.3312 × 10−25.2910 × 10−2
(0.001, True, 15)test2.8438 × 10−22.8433 × 10−22.8502 × 10−22.8556 × 10−22.8671 × 10−22.8238 × 10−22.8155 × 10−22.8486 × 10−22.8265 × 10−22.8088 × 10−2
(0.001, True, 15)train2.8406 × 10−22.8284 × 10−22.8463 × 10−22.8429 × 10−22.8497 × 10−22.8348 × 10−22.8196 × 10−22.8482 × 10−22.8465 × 10−22.8237 × 10−2
(0.001, True, 25)test1.9962 × 10−21.9859 × 10−21.9416 × 10−21.9715 × 10−21.9864 × 10−21.9600 × 10−21.9376 × 10−21.9597 × 10−21.9788 × 10−21.9790 × 10−2
(0.001, True, 25)train1.9954 × 10−21.9857 × 10−21.9238 × 10−21.9730 × 10−21.9742 × 10−21.9767 × 10−21.9412 × 10−21.9528 × 10−21.9835 × 10−21.9862 × 10−2
Decision Tree
(max_depth, min_samples_leaf)
R2(2, 2)test9.4184 × 10−19.4247 × 10−19.4055 × 10−19.4093 × 10−19.4137 × 10−19.4251 × 10−19.4216 × 10−19.4167 × 10−19.4262 × 10−19.4174 × 10−1
(2, 2)train9.4179 × 10−19.4172 × 10−19.4193 × 10−19.4189 × 10−19.4184 × 10−19.4171 × 10−19.4175 × 10−19.4181 × 10−19.4170 × 10−19.4180 × 10−1
(2, 5)test9.4184 × 10−19.4247 × 10−19.4055 × 10−19.4093 × 10−19.4137 × 10−19.4251 × 10−19.4216 × 10−19.4167 × 10−19.4262 × 10−19.4174 × 10−1
(2, 5)train9.4179 × 10−19.4172 × 10−19.4193 × 10−19.4189 × 10−19.4184 × 10−19.4171 × 10−19.4175 × 10−19.4181 × 10−19.4170 × 10−19.4180 × 10−1
(5, 2)test9.9648 × 10−19.9657 × 10−19.9645 × 10−19.9645 × 10−19.9637 × 10−19.9647 × 10−19.9647 × 10−19.9647 × 10−19.9655 × 10−19.9647 × 10−1
(5, 2)train9.9648 × 10−19.9647 × 10−19.9648 × 10−19.9648 × 10−19.9649 × 10−19.9648 × 10−19.9648 × 10−19.9648 × 10−19.9647 × 10−19.9648 × 10−1
(5, 5)test9.9648 × 10−19.9657 × 10−19.9645 × 10−19.9645 × 10−19.9637 × 10−19.9647 × 10−19.9647 × 10−19.9647 × 10−19.9655 × 10−19.9647 × 10−1
(5, 5)train9.9648 × 10−19.9647 × 10−19.9648 × 10−19.9648 × 10−19.9649 × 10−19.9648 × 10−19.9648 × 10−19.9648 × 10−19.9647 × 10−19.9648 × 10−1
(10, 2)test9.9858 × 10−19.9862 × 10−19.9856 × 10−19.9862 × 10−19.9853 × 10−19.9861 × 10−19.9859 × 10−19.9856 × 10−19.9860 × 10−19.9858 × 10−1
(10, 2)train9.9866 × 10−19.9866 × 10−19.9867 × 10−19.9866 × 10−19.9867 × 10−19.9866 × 10−19.9866 × 10−19.9866 × 10−19.9866 × 10−19.9867 × 10−1
(10, 5)test9.9858 × 10−19.9862 × 10−19.9855 × 10−19.9863 × 10−19.9853 × 10−19.9861 × 10−19.9859 × 10−19.9856 × 10−19.9860 × 10−19.9859 × 10−1
(10, 5)train9.9866 × 10−19.9866 × 10−19.9866 × 10−19.9866 × 10−19.9867 × 10−19.9866 × 10−19.9866 × 10−19.9866 × 10−19.9866 × 10−19.9867 × 10−1
(20, 2)test9.9801 × 10−19.9809 × 10−19.9801 × 10−19.9809 × 10−19.9797 × 10−19.9805 × 10−19.9803 × 10−19.9799 × 10−19.9806 × 10−19.9804 × 10−1
(20, 2)train9.9919 × 10−19.9920 × 10−19.9918 × 10−19.9919 × 10−19.9918 × 10−19.9917 × 10−19.9917 × 10−19.9920 × 10−19.9919 × 10−19.9922 × 10−1
(20, 5)test9.9820 × 10−19.9822 × 10−19.9818 × 10−19.9825 × 10−19.9813 × 10−19.9821 × 10−19.9821 × 10−19.9814 × 10−19.9824 × 10−19.9824 × 10−1
(20, 5)train9.9912 × 10−19.9911 × 10−19.9910 × 10−19.9910 × 10−19.9911 × 10−19.9911 × 10−19.9912 × 10−19.9912 × 10−19.9910 × 10−19.9911 × 10−1
MSE(2, 2)test5.8051 × 10−35.7697 × 10−35.9168 × 10−35.8541 × 10−35.7767 × 10−35.7294 × 10−35.7839 × 10−35.7779 × 10−35.7232 × 10−35.7500 × 10−3
(2, 2)train5.7867 × 10−35.7906 × 10−35.7742 × 10−35.7812 × 10−35.7898 × 10−35.7951 × 10−35.7890 × 10−35.7897 × 10−35.7958 × 10−35.7928 × 10−3
(2, 5)test5.8051 × 10−35.7697 × 10−35.9168 × 10−35.8541 × 10−35.7767 × 10−35.7294 × 10−35.7839 × 10−35.7779 × 10−35.7232 × 10−35.7500 × 10−3
(2, 5)train5.7867 × 10−35.7906 × 10−35.7742 × 10−35.7812 × 10−35.7898 × 10−35.7951 × 10−35.7890 × 10−35.7897 × 10−35.7958 × 10−35.7928 × 10−3
(5, 2)test3.5126 × 10−43.4440 × 10−43.5305 × 10−43.5202 × 10−43.5751 × 10−43.5194 × 10−43.5319 × 10−43.4979 × 10−43.4412 × 10−43.4829 × 10−4
(5, 2)train3.5027 × 10−43.5104 × 10−43.5007 × 10−43.5019 × 10−43.4958 × 10−43.5019 × 10−43.5006 × 10−43.5043 × 10−43.5106 × 10−43.5060 × 10−4
(5, 5)test3.5126 × 10−43.4440 × 10−43.5305 × 10−43.5202 × 10−43.5751 × 10−43.5194 × 10−43.5319 × 10−43.4979 × 10−43.4412 × 10−43.4829 × 10−4
(5, 5)train3.5027 × 10−43.5104 × 10−43.5007 × 10−43.5019 × 10−43.4958 × 10−43.5019 × 10−43.5006 × 10−43.5043 × 10−43.5106 × 10−43.5060 × 10−4
(10, 2)test1.4156 × 10−41.3884 × 10−41.4328 × 10−41.3631 × 10−41.4442 × 10−41.3883 × 10−41.4107 × 10−41.4264 × 10−41.3934 × 10−41.4019 × 10−4
(10, 2)train1.3292 × 10−41.3297 × 10−41.3254 × 10−41.3326 × 10−41.3253 × 10−41.3303 × 10−41.3274 × 10−41.3286 × 10−41.3301 × 10−41.3279 × 10−4
(10, 5)test1.4138 × 10−41.3848 × 10−41.4392 × 10−41.3618 × 10−41.4468 × 10−41.3885 × 10−41.4137 × 10−41.4272 × 10−41.3960 × 10−41.3966 × 10−4
(10, 5)train1.3297 × 10−41.3334 × 10−41.3286 × 10−41.3357 × 10−41.3288 × 10−41.3319 × 10−41.3278 × 10−41.3306 × 10−41.3320 × 10−41.3283 × 10−4
(20, 2)test1.9850 × 10−41.9163 × 10−41.9776 × 10−41.8927 × 10−42.0052 × 10−41.9422 × 10−41.9690 × 10−41.9871 × 10−41.9355 × 10−41.9370 × 10−4
(20, 2)train8.0396 × 10−57.9783 × 10−58.1531 × 10−58.0727 × 10−58.2064 × 10−58.2488 × 10−58.2033 × 10−57.9881 × 10−58.0350 × 10−57.7962 × 10−5
(20, 5)test1.7923 × 10−41.7858 × 10−41.8116 × 10−41.7334 × 10−41.8437 × 10−41.7792 × 10−41.7914 × 10−41.8444 × 10−41.7553 × 10−41.7386 × 10−4
(20, 5)train8.7741 × 10−58.8280 × 10−58.9938 × 10−58.9490 × 10−58.8491 × 10−58.8102 × 10−58.7458 × 10−58.7934 × 10−58.9727 × 10−58.8532 × 10−5
RMSE(2, 2)test7.6191 × 10−27.5958 × 10−27.6921 × 10−27.6512 × 10−27.6005 × 10−27.5693 × 10−27.6052 × 10−27.6012 × 10−27.5652 × 10−27.5829 × 10−2
(2, 2)train7.6070 × 10−27.6096 × 10−27.5988 × 10−27.6034 × 10−27.6091 × 10−27.6125 × 10−27.6085 × 10−27.6090 × 10−27.6130 × 10−27.6110 × 10−2
(2, 5)test7.6191 × 10−27.5958 × 10−27.6921 × 10−27.6512 × 10−27.6005 × 10−27.5693 × 10−27.6052 × 10−27.6012 × 10−27.5652 × 10−27.5829 × 10−2
(2, 5)train7.6070 × 10−27.6096 × 10−27.5988 × 10−27.6034 × 10−27.6091 × 10−27.6125 × 10−27.6085 × 10−27.6090 × 10−27.6130 × 10−27.6110 × 10−2
(5, 2)test1.8742 × 10−21.8558 × 10−21.8790 × 10−21.8762 × 10−21.8908 × 10−21.8760 × 10−21.8793 × 10−21.8703 × 10−21.8551 × 10−21.8663 × 10−2
(5, 2)train1.8715 × 10−21.8736 × 10−21.8710 × 10−21.8713 × 10−21.8697 × 10−21.8713 × 10−21.8710 × 10−21.8720 × 10−21.8737 × 10−21.8724 × 10−2
(5, 5)test1.8742 × 10−21.8558 × 10−21.8790 × 10−21.8762 × 10−21.8908 × 10−21.8760 × 10−21.8793 × 10−21.8703 × 10−21.8551 × 10−21.8663 × 10−2
(5, 5)train1.8715 × 10−21.8736 × 10−21.8710 × 10−21.8713 × 10−21.8697 × 10−21.8713 × 10−21.8710 × 10−21.8720 × 10−21.8737 × 10−21.8724 × 10−2
(10, 2)test1.1898 × 10−21.1783 × 10−21.1970 × 10−21.1675 × 10−21.2017 × 10−21.1783 × 10−21.1877 × 10−21.1943 × 10−21.1804 × 10−21.1840 × 10−2
(10, 2)train1.1529 × 10−21.1531 × 10−21.1513 × 10−21.1544 × 10−21.1512 × 10−21.1534 × 10−21.1521 × 10−21.1527 × 10−21.1533 × 10−21.1523 × 10−2
(10, 5)test1.1890 × 10−21.1768 × 10−21.1997 × 10−21.1670 × 10−21.2028 × 10−21.1783 × 10−21.1890 × 10−21.1946 × 10−21.1815 × 10−21.1818 × 10−2
(10, 5)train1.1531 × 10−21.1547 × 10−21.1526 × 10−21.1557 × 10−21.1528 × 10−21.1541 × 10−21.1523 × 10−21.1535 × 10−21.1541 × 10−21.1525 × 10−2
(20, 2)test1.4089 × 10−21.3843 × 10−21.4063 × 10−21.3758 × 10−21.4160 × 10−21.3936 × 10−21.4032 × 10−21.4096 × 10−21.3912 × 10−21.3918 × 10−2
(20, 2)train8.9664 × 10−38.9321 × 10−39.0294 × 10−38.9848 × 10−39.0589 × 10−39.0823 × 10−39.0572 × 10−38.9376 × 10−38.9638 × 10−38.8296 × 10−3
(20, 5)test1.3388 × 10−21.3363 × 10−21.3460 × 10−21.3166 × 10−21.3578 × 10−21.3339 × 10−21.3384 × 10−21.3581 × 10−21.3249 × 10−21.3186 × 10−2
(20, 5)train9.3670 × 10−39.3958 × 10−39.4836 × 10−39.4599 × 10−39.4069 × 10−39.3863 × 10−39.3519 × 10−39.3773 × 10−39.4724 × 10−39.4092 × 10−3
Hist. Gradient Boosting
(learning_rate, max_iter, min_samples_leaf)
R2(0.05, 30, 20)test9.5219 × 10−19.5210 × 10−19.5225 × 10−19.5202 × 10−19.5203 × 10−19.5207 × 10−19.5222 × 10−19.5196 × 10−19.5233 × 10−19.5221 × 10−1
(0.05, 30, 20)train9.5213 × 10−19.5214 × 10−19.5213 × 10−19.5215 × 10−19.5216 × 10−19.5214 × 10−19.5215 × 10−19.5214 × 10−19.5214 × 10−19.5215 × 10−1
(0.05, 100, 20)test9.9857 × 10−19.9859 × 10−19.9854 × 10−19.9860 × 10−19.9853 × 10−19.9859 × 10−19.9857 × 10−19.9854 × 10−19.9858 × 10−19.9856 × 10−1
(0.05, 100, 20)train9.9857 × 10−19.9857 × 10−19.9858 × 10−19.9857 × 10−19.9858 × 10−19.9857 × 10−19.9858 × 10−19.9858 × 10−19.9857 × 10−19.9858 × 10−1
(0.05, 300, 20)test9.9876 × 10−19.9878 × 10−19.9873 × 10−19.9880 × 10−19.9872 × 10−19.9880 × 10−19.9876 × 10−19.9874 × 10−19.9878 × 10−19.9876 × 10−1
(0.05, 300, 20)train9.9880 × 10−19.9880 × 10−19.9880 × 10−19.9879 × 10−19.9880 × 10−19.9880 × 10−19.9880 × 10−19.9880 × 10−19.9879 × 10−19.9880 × 10−1
(0.1, 30, 20)test9.9665 × 10−19.9666 × 10−19.9664 × 10−19.9664 × 10−19.9659 × 10−19.9665 × 10−19.9666 × 10−19.9658 × 10−19.9670 × 10−19.9666 × 10−1
(0.1, 30, 20)train9.9665 × 10−19.9665 × 10−19.9665 × 10−19.9663 × 10−19.9665 × 10−19.9664 × 10−19.9665 × 10−19.9665 × 10−19.9665 × 10−19.9665 × 10−1
(0.1, 100, 20)test9.9874 × 10−19.9876 × 10−19.9870 × 10−19.9878 × 10−19.9870 × 10−19.9878 × 10−19.9874 × 10−19.9872 × 10−19.9876 × 10−19.9874 × 10−1
(0.1, 100, 20)train9.9877 × 10−19.9876 × 10−19.9877 × 10−19.9877 × 10−19.9877 × 10−19.9877 × 10−19.9877 × 10−19.9877 × 10−19.9877 × 10−19.9877 × 10−1
(0.1, 300, 20)test9.9876 × 10−19.9880 × 10−19.9873 × 10−19.9880 × 10−19.9873 × 10−19.9880 × 10−19.9876 × 10−19.9874 × 10−19.9878 × 10−19.9876 × 10−1
(0.1, 300, 20)train9.9880 × 10−19.9882 × 10−19.9881 × 10−19.9881 × 10−19.9882 × 10−19.9881 × 10−19.9881 × 10−19.9881 × 10−19.9881 × 10−19.9881 × 10−1
MSE(0.05, 30, 20)test4.7720 × 10−34.8041 × 10−34.7522 × 10−34.7551 × 10−34.7271 × 10−34.7767 × 10−34.7781 × 10−34.7588 × 10−34.7547 × 10−34.7168 × 10−3
(0.05, 30, 20)train4.7583 × 10−34.7554 × 10−34.7597 × 10−34.7609 × 10−34.7629 × 10−34.7580 × 10−34.7553 × 10−34.7617 × 10−34.7583 × 10−34.7627 × 10−3
(0.05, 100, 20)test1.4294 × 10−41.4170 × 10−41.4575 × 10−41.3847 × 10−41.4523 × 10−41.4055 × 10−41.4316 × 10−41.4479 × 10−41.4142 × 10−41.4194 × 10−4
(0.05, 100, 20)train1.4171 × 10−41.4197 × 10−41.4126 × 10−41.4221 × 10−41.4150 × 10−41.4209 × 10−41.4150 × 10−41.4146 × 10−41.4172 × 10−41.4175 × 10−4
(0.05, 300, 20)test1.2347 × 10−41.2196 × 10−41.2633 × 10−41.1888 × 10−41.2614 × 10−41.1974 × 10−41.2382 × 10−41.2463 × 10−41.2199 × 10−41.2245 × 10−4
(0.05, 300, 20)train1.1977 × 10−41.1953 × 10−41.1917 × 10−41.2055 × 10−41.1968 × 10−41.1942 × 10−41.1949 × 10−41.1904 × 10−41.1990 × 10−41.1955 × 10−4
(0.1, 30, 20)test3.3401 × 10−43.3504 × 10−43.3419 × 10−43.3299 × 10−43.3585 × 10−43.3396 × 10−43.3444 × 10−43.3861 × 10−43.2946 × 10−43.3011 × 10−4
(0.1, 30, 20)train3.3329 × 10−43.3331 × 10−43.3302 × 10−43.3484 × 10−43.3354 × 10−43.3369 × 10−43.3343 × 10−43.3308 × 10−43.3296 × 10−43.3382 × 10−4
(0.1, 100, 20)test1.2541 × 10−41.2397 × 10−41.2902 × 10−41.2050 × 10−41.2789 × 10−41.2178 × 10−41.2551 × 10−41.2685 × 10−41.2388 × 10−41.2446 × 10−4
(0.1, 100, 20)train1.2260 × 10−41.2277 × 10−41.2251 × 10−41.2284 × 10−41.2252 × 10−41.2271 × 10−41.2241 × 10−41.2238 × 10−41.2258 × 10−41.2280 × 10−4
(0.1, 300, 20)test1.2344 × 10−41.2075 × 10−41.2618 × 10−41.1847 × 10−41.2526 × 10−41.1962 × 10−41.2368 × 10−41.2476 × 10−41.2164 × 10−41.2193 × 10−4
(0.1, 300, 20)train1.1895 × 10−41.1742 × 10−41.1800 × 10−41.1871 × 10−41.1769 × 10−41.1808 × 10−41.1875 × 10−41.1818 × 10−41.1834 × 10−41.1864 × 10−4
RMSE(0.05, 30, 20)test6.9079 × 10−26.9312 × 10−26.8936 × 10−26.8957 × 10−26.8754 × 10−26.9114 × 10−26.9124 × 10−26.8984 × 10−26.8955 × 10−26.8679 × 10−2
(0.05, 30, 20)train6.8981 × 10−26.8960 × 10−26.8991 × 10−26.8999 × 10−26.9014 × 10−26.8978 × 10−26.8959 × 10−26.9005 × 10−26.8981 × 10−26.9013 × 10−2
(0.05, 100, 20)test1.1956 × 10−21.1904 × 10−21.2073 × 10−21.1767 × 10−21.2051 × 10−21.1855 × 10−21.1965 × 10−21.2033 × 10−21.1892 × 10−21.1914 × 10−2
(0.05, 100, 20)train1.1904 × 10−21.1915 × 10−21.1885 × 10−21.1925 × 10−21.1895 × 10−21.1920 × 10−21.1895 × 10−21.1894 × 10−21.1905 × 10−21.1906 × 10−2
(0.05, 300, 20)test1.1112 × 10−21.1044 × 10−21.1239 × 10−21.0903 × 10−21.1231 × 10−21.0943 × 10−21.1128 × 10−21.1164 × 10−21.1045 × 10−21.1065 × 10−2
(0.05, 300, 20)train1.0944 × 10−21.0933 × 10−21.0916 × 10−21.0979 × 10−21.0940 × 10−21.0928 × 10−21.0931 × 10−21.0910 × 10−21.0950 × 10−21.0934 × 10−2
(0.1, 30, 20)test1.8276 × 10−21.8304 × 10−21.8281 × 10−21.8248 × 10−21.8326 × 10−21.8275 × 10−21.8288 × 10−21.8401 × 10−21.8151 × 10−21.8169 × 10−2
(0.1, 30, 20)train1.8256 × 10−21.8257 × 10−21.8249 × 10−21.8299 × 10−21.8263 × 10−21.8267 × 10−21.8260 × 10−21.8250 × 10−21.8247 × 10−21.8271 × 10−2
(0.1, 100, 20)test1.1199 × 10−21.1134 × 10−21.1359 × 10−21.0977 × 10−21.1309 × 10−21.1035 × 10−21.1203 × 10−21.1263 × 10−21.1130 × 10−21.1156 × 10−2
(0.1, 100, 20)train1.1072 × 10−21.1080 × 10−21.1068 × 10−21.1083 × 10−21.1069 × 10−21.1077 × 10−21.1064 × 10−21.1063 × 10−21.1072 × 10−21.1081 × 10−2
(0.1, 300, 20)test1.1110 × 10−21.0988 × 10−21.1233 × 10−21.0884 × 10−21.1192 × 10−21.0937 × 10−21.1121 × 10−21.1170 × 10−21.1029 × 10−21.1042 × 10−2
(0.1, 300, 20)train1.0906 × 10−21.0836 × 10−21.0863 × 10−21.0896 × 10−21.0849 × 10−21.0866 × 10−21.0897 × 10−21.0871 × 10−21.0878 × 10−21.0892 × 10−2
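The parameter combinations reported in Table A8 correspond to an exhaustive grid search over the three short-listed model families. A compact sketch of such a search is given below, assuming scikit-learn's GridSearchCV; the grids mirror Table 11, while the scoring choice and variable names are illustrative assumptions.

```python
# Sketch of the grid search behind Table A8; the parameter grids mirror Table 11 (illustrative only).
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Lars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

SEARCH_SPACE = {
    "Lars": (Lars(fit_intercept=True),
             {"eps": [1e-4, 1e-3], "n_nonzero_coefs": [5, 10, 15, 25]}),
    "DTRg": (DecisionTreeRegressor(random_state=42),
             {"max_depth": [2, 5, 10, 20], "min_samples_leaf": [2, 5]}),
    "HGBRg": (HistGradientBoostingRegressor(random_state=42, min_samples_leaf=20),
              {"learning_rate": [0.05, 0.1], "max_iter": [30, 100, 300]}),
}

def tune_models(X, y):
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    best = {}
    for name, (estimator, grid) in SEARCH_SPACE.items():
        search = GridSearchCV(estimator, grid, cv=cv,
                              scoring="neg_mean_squared_error",
                              return_train_score=True)
        search.fit(X, y)
        # search.cv_results_ holds the fold-level metrics of the kind reported in Table A8.
        best[name] = (search.best_params_, search.best_score_)
    return best
```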
Table A9. Permutation importance for the linear model with the best estimator.
Feature | Mean | Std. Dev | IQR | Lower Bound | Min | 25% | 50% | 75% | Max | Upper Bound
GASAMT_cat1.9936 × 1008.2363 × 10−31.1023 × 10−21.9719 × 1001.9777 × 1001.9884 × 1001.9974 × 1001.9994 × 1002.0021 × 1002.0160 × 100
GASAMT_num3.8956 × 10−11.6090 × 10−31.6560 × 10−33.8638 × 10−13.8712 × 10−13.8886 × 10−13.8952 × 10−13.9052 × 10−13.9240 × 10−13.9300 × 10−1
TOTBALAMT_num9.3035 × 10−35.8660 × 10−56.1182 × 10−59.1790 × 10−39.2172 × 10−39.2708 × 10−39.2976 × 10−39.3319 × 10−39.4295 × 10−39.4237 × 10−3
HUDSUB_cat6.5143 × 10−33.6001 × 10−53.9541 × 10−56.4348 × 10−36.4512 × 10−36.4941 × 10−36.5202 × 10−36.5337 × 10−36.5684 × 10−36.5930 × 10−3
TRASHAMT_num6.4855 × 10−33.3977 × 10−54.0293 × 10−56.4018 × 10−36.4444 × 10−36.4622 × 10−36.4775 × 10−36.5025 × 10−36.5525 × 10−36.5630 × 10−3
NHQSCRIME_cat5.6403 × 10−35.1427 × 10−55.4697 × 10−55.5249 × 10−35.5754 × 10−35.6070 × 10−35.6365 × 10−35.6616 × 10−35.7305 × 10−35.7437 × 10−3
UTILAMT_num4.5517 × 10−32.0407 × 10−52.7779 × 10−54.4941 × 10−34.5320 × 10−34.5358 × 10−34.5446 × 10−34.5636 × 10−34.5909 × 10−34.6052 × 10−3
MORTCOUNT_num2.4084 × 10−31.1715 × 10−51.6576 × 10−52.3745 × 10−32.3904 × 10−32.3994 × 10−32.4074 × 10−32.4159 × 10−32.4270 × 10−32.4408 × 10−3
HMRACCESS_cat1.8264 × 10−38.5206 × 10−69.8094 × 10−61.8055 × 10−31.8162 × 10−31.8202 × 10−31.8254 × 10−31.8300 × 10−31.8461 × 10−31.8447 × 10−3
INTRATE_num1.2436 × 10−36.8325 × 10−67.8518 × 10−61.2286 × 10−31.2333 × 10−31.2404 × 10−31.2425 × 10−31.2482 × 10−31.2545 × 10−31.2600 × 10−3
PAP_cat1.6980 × 10−41.9427 × 10−62.3997 × 10−61.6498 × 10−41.6599 × 10−41.6858 × 10−41.7024 × 10−41.7098 × 10−41.7253 × 10−41.7458 × 10−4
HHGRAD_cat1.4866 × 10−42.1808 × 10−63.1728 × 10−61.4242 × 10−41.4536 × 10−41.4718 × 10−41.4910 × 10−41.5036 × 10−41.5161 × 10−41.5512 × 10−4
SEMP_num1.2901 × 10−41.9629 × 10−63.0794 × 10−61.2255 × 10−41.2623 × 10−41.2716 × 10−41.2962 × 10−41.3024 × 10−41.3207 × 10−41.3486 × 10−4
RPRU_num1.0460 × 10−43.0894 × 10−63.1769 × 10−69.8615 × 10−59.8843 × 10−51.0338 × 10−41.0433 × 10−41.0656 × 10−41.0960 × 10−41.1132 × 10−4
ELECAMT_cat6.5075 × 10−51.8838 × 10−62.7801 × 10−65.9679 × 10−56.1670 × 10−56.3849 × 10−56.5069 × 10−56.6629 × 10−56.7491 × 10−57.0799 × 10−5
JOBTYPE_cat4.9196 × 10−51.4215 × 10−61.6417 × 10−64.6201 × 10−54.7056 × 10−54.8664 × 10−54.9158 × 10−55.0305 × 10−55.1193 × 10−55.2768 × 10−5
SSIP_num3.7936 × 10−51.6298 × 10−62.5522 × 10−63.2853 × 10−53.5381 × 10−53.6681 × 10−53.7930 × 10−53.9234 × 10−54.0287 × 10−54.3062 × 10−5
HHAGE_num3.2663 × 10−51.6689 × 10−62.3447 × 10−62.7765 × 10−53.0611 × 10−53.1283 × 10−53.2422 × 10−53.3627 × 10−53.5305 × 10−53.7144 × 10−5
SUNZ_num2.4944 × 10−57.1299 × 10−75.1139 × 10−72.4017 × 10−52.3553 × 10−52.4784 × 10−52.5001 × 10−52.5295 × 10−52.5995 × 10−52.6062 × 10−5
PERPOVLVL_cat2.4583 × 10−58.3054 × 10−71.2848 × 10−62.2017 × 10−52.3368 × 10−52.3944 × 10−52.4736 × 10−52.5229 × 10−52.5720 × 10−52.7156 × 10−5
SSP_num2.3864 × 10−57.2174 × 10−71.0146 × 10−62.1844 × 10−52.2687 × 10−52.3366 × 10−52.3946 × 10−52.4380 × 10−52.4821 × 10−52.5902 × 10−5
GTOC_num2.2744 × 10−51.3107 × 10−61.8559 × 10−61.9071 × 10−52.0626 × 10−52.1855 × 10−52.3014 × 10−52.3711 × 10−52.4435 × 10−52.6495 × 10−5
FOUNDTYPE_cat2.2585 × 10−51.6076 × 10−62.2696 × 10−61.8195 × 10−51.9605 × 10−52.1599 × 10−52.2554 × 10−52.3868 × 10−52.4795 × 10−52.7273 × 10−5
RETP_num2.2422 × 10−58.7867 × 10−71.1676 × 10−62.0156 × 10−52.1000 × 10−52.1907 × 10−52.2470 × 10−52.3075 × 10−52.3698 × 10−52.4826 × 10−5
FINCP_num1.9917 × 10−59.2646 × 10−77.1909 × 10−71.8257 × 10−51.8835 × 10−51.9335 × 10−51.9793 × 10−52.0054 × 10−52.1980 × 10−52.1133 × 10−5
OCCYRRND_cat1.9304 × 10−56.7672 × 10−76.7805 × 10−71.7842 × 10−51.8593 × 10−51.8859 × 10−51.9114 × 10−51.9537 × 10−52.0951 × 10−52.0555 × 10−5
GDTGZ_num1.7247 × 10−59.6872 × 10−71.0691 × 10−61.5317 × 10−51.5503 × 10−51.6921 × 10−51.7247 × 10−51.7990 × 10−51.8482 × 10−51.9594 × 10−5
ROOFHOLE_cat1.6390 × 10−56.7528 × 10−72.4815 × 10−71.5900 × 10−51.5025 × 10−51.6273 × 10−51.6338 × 10−51.6521 × 10−51.7587 × 10−51.6893 × 10−5
WINBROKE_cat1.3977 × 10−54.6753 × 10−77.4940 × 10−71.2460 × 10−51.3161 × 10−51.3584 × 10−51.4159 × 10−51.4334 × 10−51.4502 × 10−51.5458 × 10−5
FNDCRUMB_cat1.3796 × 10−58.5335 × 10−79.7688 × 10−71.1832 × 10−51.2109 × 10−51.3297 × 10−51.3840 × 10−51.4274 × 10−51.5148 × 10−51.5739 × 10−5
PROTAXAMT_num1.2540 × 10−55.4993 × 10−77.1874 × 10−71.1042 × 10−51.1619 × 10−51.2120 × 10−51.2662 × 10−51.2838 × 10−51.3347 × 10−51.3917 × 10−5
HHAGE_cat8.8253 × 10−65.8762 × 10−73.3938 × 10−77.9885 × 10−68.4434 × 10−68.4976 × 10−68.6129 × 10−68.8369 × 10−61.0392 × 10−59.3460 × 10−6
BEDROOMS_num8.0924 × 10−65.7926 × 10−74.3420 × 10−77.2542 × 10−67.2123 × 10−67.9055 × 10−68.1196 × 10−68.3397 × 10−69.2823 × 10−68.9910 × 10−6
WAGP_cat7.3443 × 10−63.7602 × 10−71.9176 × 10−76.8737 × 10−66.7365 × 10−67.1613 × 10−67.2871 × 10−67.3530 × 10−68.0098 × 10−67.6407 × 10−6
HHADLTKIDS_cat6.0821 × 10−62.4197 × 10−72.3109 × 10−75.6096 × 10−65.6862 × 10−65.9562 × 10−66.0797 × 10−66.1873 × 10−66.4681 × 10−66.5339 × 10−6
TVSCW_num5.9457 × 10−63.1834 × 10−74.7452 × 10−75.0031 × 10−65.5119 × 10−65.7149 × 10−65.8881 × 10−66.1894 × 10−66.5026 × 10−66.9012 × 10−6
TVTC_num5.6936 × 10−66.1051 × 10−75.7298 × 10−74.4774 × 10−64.6006 × 10−65.3368 × 10−65.8044 × 10−65.9098 × 10−66.6340 × 10−66.7693 × 10−6
ROOFSHIN_cat5.3893 × 10−63.0865 × 10−72.9296 × 10−74.7537 × 10−65.0579 × 10−65.1932 × 10−65.3036 × 10−65.4861 × 10−66.0915 × 10−65.9256 × 10−6
WALLSLOPE_cat4.8879 × 10−62.2034 × 10−72.6910 × 10−74.4016 × 10−64.4921 × 10−64.8052 × 10−64.8865 × 10−65.0743 × 10−65.1501 × 10−65.4780 × 10−6
INTRATE_cat4.7215 × 10−64.4358 × 10−76.0808 × 10−73.5137 × 10−64.0055 × 10−64.4258 × 10−64.8474 × 10−65.0339 × 10−65.2419 × 10−65.9460 × 10−6
WINBOARD_cat4.5849 × 10−61.7117 × 10−72.1778 × 10−74.1099 × 10−64.3978 × 10−64.4365 × 10−64.5875 × 10−64.6543 × 10−64.9362 × 10−64.9810 × 10−6
ROOFSAG_cat3.4710 × 10−64.5306 × 10−75.3561 × 10−72.3998 × 10−62.7205 × 10−63.2032 × 10−63.5595 × 10−63.7388 × 10−64.0613 × 10−64.5422 × 10−6
AYCK_num3.1664 × 10−62.5263 × 10−72.5879 × 10−72.6909 × 10−62.5651 × 10−63.0791 × 10−63.1889 × 10−63.3379 × 10−63.4352 × 10−63.7261 × 10−6
WALLSIDE_cat2.9624 × 10−62.1181 × 10−72.7161 × 10−72.3828 × 10−62.7262 × 10−62.7903 × 10−62.9579 × 10−63.0619 × 10−63.4184 × 10−63.4693 × 10−6
OIP_cat2.6708 × 10−62.9016 × 10−71.6380 × 10−72.3299 × 10−62.2560 × 10−62.5756 × 10−62.6148 × 10−62.7394 × 10−63.1882 × 10−62.9851 × 10−6
WAGP_num2.4458 × 10−62.5718 × 10−73.4190 × 10−71.7637 × 10−61.9962 × 10−62.2766 × 10−62.4620 × 10−62.6185 × 10−62.8511 × 10−63.1313 × 10−6
RATINGNH_cat2.3834 × 10−61.8494 × 10−71.6443 × 10−72.0391 × 10−62.0399 × 10−62.2858 × 10−62.3990 × 10−62.4502 × 10−62.6972 × 10−62.6968 × 10−6
DINING_num2.3575 × 10−63.5084 × 10−75.2785 × 10−71.2801 × 10−61.8059 × 10−62.0718 × 10−62.4314 × 10−62.5997 × 10−62.8539 × 10−63.3915 × 10−6
SEMP_cat2.2709 × 10−62.1841 × 10−73.1044 × 10−71.6387 × 10−61.9452 × 10−62.1043 × 10−62.2743 × 10−62.4148 × 10−62.5899 × 10−62.8804 × 10−6
NHQRISK_cat2.0150 × 10−63.6724 × 10−75.3090 × 10−71.0395 × 10−61.4547 × 10−61.8359 × 10−61.9820 × 10−62.3668 × 10−62.4739 × 10−63.1631 × 10−6
NTNC_num1.9806 × 10−62.2827 × 10−72.7550 × 10−71.4440 × 10−61.6586 × 10−61.8572 × 10−61.9345 × 10−62.1327 × 10−62.3559 × 10−62.5460 × 10−6
MAINTAMT_num1.7089 × 10−62.2427 × 10−73.6945 × 10−79.6226 × 10−71.3209 × 10−61.5164 × 10−61.7965 × 10−61.8859 × 10−61.9656 × 10−62.4400 × 10−6
ELECAMT_num1.3577 × 10−62.0445 × 10−72.2986 × 10−79.3825 × 10−79.4472 × 10−71.2830 × 10−61.4223 × 10−61.5129 × 10−61.5545 × 10−61.8577 × 10−6
HMRENEFF_cat1.3081 × 10−61.4607 × 10−71.2274 × 10−71.0555 × 10−61.1394 × 10−61.2396 × 10−61.2655 × 10−61.3624 × 10−61.6604 × 10−61.5465 × 10−6
NHQSCHOOL_cat1.1976 × 10−61.4224 × 10−71.6314 × 10−78.9456 × 10−79.5640 × 10−71.1393 × 10−61.2103 × 10−61.3024 × 10−61.3874 × 10−61.5471 × 10−6
NHQPCRIME_cat1.0078 × 10−63.0596 × 10−73.2841 × 10−73.3299 × 10−75.4539 × 10−78.2561 × 10−79.4359 × 10−71.1540 × 10−61.6366 × 10−61.6466 × 10−6
BATHROOMS_cat8.1963 × 10−71.4276 × 10−71.7572 × 10−74.5566 × 10−75.9923 × 10−77.1924 × 10−78.1252 × 10−78.9495 × 10−71.0826 × 10−61.1585 × 10−6
ALCH_num8.0017 × 10−71.3833 × 10−72.1399 × 10−73.6593 × 10−75.4314 × 10−76.8692 × 10−78.5658 × 10−79.0090 × 10−79.5655 × 10−71.2219 × 10−6
INTP_num7.6079 × 10−71.7862 × 10−71.2852 × 10−74.4682 × 10−75.8932 × 10−76.3960 × 10−77.1540 × 10−77.6812 × 10−71.1102 × 10−69.6090 × 10−7
DWNPAYPCT_cat6.9451 × 10−72.0950 × 10−73.2724 × 10−72.7424 × 10−83.7388 × 10−75.1829 × 10−77.2166 × 10−78.4553 × 10−71.0322 × 10−61.3364 × 10−6
PERSCOUNT_num6.8701 × 10−71.1315 × 10−71.5837 × 10−73.7143 × 10−74.7794 × 10−76.0898 × 10−77.2346 × 10−77.6735 × 10−78.2583 × 10−71.0049 × 10−6
MARKETVAL_num6.2241 × 10−72.2507 × 10−73.4516 × 10−7−6.4848 × 10−82.8550 × 10−74.5289 × 10−76.4287 × 10−77.9805 × 10−79.3473 × 10−71.3158 × 10−6
NHQPUBTRN_cat5.7073 × 10−78.5731 × 10−88.8970 × 10−83.7795 × 10−74.8061 × 10−75.1141 × 10−75.5015 × 10−76.0038 × 10−77.5936 × 10−77.3383 × 10−7
INSURAMT_num5.4674 × 10−77.4946 × 10−87.7160 × 10−83.9022 × 10−74.2881 × 10−75.0596 × 10−75.4660 × 10−75.8312 × 10−76.6777 × 10−76.9886 × 10−7
PMTAMT_num5.0193 × 10−71.4464 × 10−71.8815 × 10−71.3823 × 10−73.2527 × 10−74.2045 × 10−74.3880 × 10−76.0860 × 10−77.6110 × 10−78.9083 × 10−7
LOTAMT_num3.9068 × 10−79.7713 × 10−87.7404 × 10−82.5281 × 10−71.9698 × 10−73.6891 × 10−74.2069 × 10−74.4632 × 10−75.0062 × 10−75.6243 × 10−7
LOTAMT_cat2.5582 × 10−79.1438 × 10−88.4240 × 10−86.8055 × 10−81.7221 × 10−71.9442 × 10−72.2515 × 10−72.7866 × 10−74.3035 × 10−74.0502 × 10−7
HHCITSHP_cat2.5200 × 10−72.1585 × 10−73.2997 × 10−7−4.0444 × 10−7−4.6536 × 10−89.0514 × 10−82.4341 × 10−74.2049 × 10−76.1876 × 10−79.1544 × 10−7
PAP_num1.4968 × 10−73.8820 × 10−86.4923 × 10−81.9782 × 10−89.1534 × 10−81.1717 × 10−71.5258 × 10−71.8209 × 10−72.0287 × 10−72.7947 × 10−7
REMODAMT_num1.1855 × 10−78.0960 × 10−87.2556 × 10−8−2.6722 × 10−8−1.2060 × 10−88.2111 × 10−81.2022 × 10−71.5467 × 10−72.6794 × 10−72.6350 × 10−7
OILAMT_num2.7468 × 10−84.9592 × 10−86.6567 × 10−8−1.0204 × 10−7−5.2424 × 10−8−2.1913 × 10−92.9388 × 10−86.4375 × 10−89.1455 × 10−81.6423 × 10−7
NORC_cat2.0183 × 10−82.2419 × 10−82.8139 × 10−8−3.8758 × 10−8−2.9385 × 10−93.4507 × 10−91.4656 × 10−83.1590 × 10−87.0700 × 10−87.3798 × 10−8
OIP_num1.8323 × 10−83.4456 × 10−82.8966 × 10−8−3.3347 × 10−8−4.2213 × 10−81.0101 × 10−82.3333 × 10−83.9067 × 10−87.1728 × 10−88.2515 × 10−8
NRATE_cat1.7879 × 10−82.5657 × 10−82.6064 × 10−8−3.2322 × 10−8−2.4876 × 10−86.7745 × 10−92.0142 × 10−83.2839 × 10−86.1141 × 10−87.1935 × 10−8
UNITSIZE_cat1.6510 × 10−81.4456 × 10−81.5831 × 10−8−1.5849 × 10−8−2.8481 × 10−97.8977 × 10−91.7444 × 10−82.3729 × 10−84.6357 × 10−84.7476 × 10−8
OILAMT_cat1.1498 × 10−88.3436 × 10−97.0515 × 10−9−3.6145 × 10−9−1.2536 × 10−96.9628 × 10−91.0992 × 10−81.4014 × 10−83.0942 × 10−82.4592 × 10−8
HMRSALE_cat6.4073 × 10−91.7714 × 10−72.0140 × 10−7−3.6279 × 10−7−3.0100 × 10−7−6.0680 × 10−84.5779 × 10−81.4072 × 10−72.2210 × 10−74.4283 × 10−7
OTHERAMT_cat1.9437 × 10−92.3974 × 10−82.2963 × 10−8−3.7071 × 10−8−4.5917 × 10−8−2.6272 × 10−93.7037 × 10−92.0335 × 10−83.0788 × 10−85.4779 × 10−8
OTHERAMT_num0.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 100
INTP_cat0.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 100
HRATE_cat0.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 100
WATERAMT_num0.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 100
RATINGHS_cat0.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 1000.0000 × 100
PERPOVLVL_num−1.1307 × 10−81.4370 × 10−82.2563 × 10−8−5.4344 × 10−8−3.1154 × 10−8−2.0500 × 10−8−1.4418 × 10−82.0629 × 10−91.0106 × 10−83.5907 × 10−8
NGMC_num−1.2913 × 10−84.3127 × 10−85.0949 × 10−8−1.1308 × 10−7−9.7798 × 10−8−3.6658 × 10−8−8.6012 × 10−91.4291 × 10−84.7631 × 10−89.0715 × 10−8
HOAAMT_num−1.6028 × 10−82.9945 × 10−83.0678 × 10−8−7.2057 × 10−8−9.0396 × 10−8−2.6041 × 10−8−5.6709 × 10−94.6370 × 10−99.0618 × 10−95.0654 × 10−8
INSURAMT_cat−3.5158 × 10−88.8795 × 10−81.1344 × 10−7−2.5230 × 10−7−2.2321 × 10−7−8.2137 × 10−8−1.5712 × 10−83.1304 × 10−87.0192 × 10−82.0146 × 10−7
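The columns of Table A9 summarize repeated permutation importances: each feature is shuffled several times and the resulting drop in the model's R2 is recorded, from which the mean, dispersion, and quantile statistics are derived. A minimal sketch using scikit-learn's permutation_importance follows; the fitted estimator, the validation split, and the number of repeats are assumptions for illustration.

```python
# Sketch of the permutation-importance computation summarized in Table A9 (illustrative).
import pandas as pd
from sklearn.inspection import permutation_importance

def permutation_importance_table(fitted_model, X_val, y_val, feature_names, n_repeats=10):
    # Each feature is shuffled n_repeats times; the drop in R2 measures its predictive contribution.
    result = permutation_importance(
        fitted_model, X_val, y_val,
        scoring="r2", n_repeats=n_repeats, random_state=42,
    )
    return (
        pd.DataFrame({
            "Feature": feature_names,
            "Mean": result.importances_mean,
            "Std. Dev": result.importances_std,
        })
        .sort_values("Mean", ascending=False)
        .reset_index(drop=True)
    )
```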
Figure A2. Evolution of the mean for the grid search cross-validation optimization of the preselected models. The blue markers highlight the points with the most notable deviations in the mean metrics across the grid search.
Figure A3. Evolution of the metrics by k-fold for the optimization of the preselected models. The green lines denote the model performance when the tree depth (max_depth) is set to 10, illustrating its consistency and balance across the cross-validation folds.

References

  1. Choy, L.H.T.; Ho, W.K.O. The Use of Machine Learning in Real Estate Research. Land 2023, 12, 740. [Google Scholar] [CrossRef]
  2. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed.; O’Reilly Media: Sebastopol, CA, USA, 2019. [Google Scholar]
  3. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  4. Zhang, C.; Li, X. AI-Enhanced Remote Sensing of Land Transformations for Climate-Related Financial Risk Assessment in Housing Markets: A Review. Land 2025, 14, 1672. [Google Scholar] [CrossRef]
  5. Stamate, E.; Piraianu, A.I.; Ciobotaru, O.R.; Crassas, R.; Duca, O.; Fulga, A.; Grigore, I.; Vintila, V.; Fulga, I.; Ciobotaru, O.C. Revolutionizing Cardiology through Artificial Intelligence-Big Data from Proactive Prevention to Precise Diagnostics and Cutting-Edge Treatment-A Comprehensive Review of the Past 5 Years. Diagnostics 2024, 14, 1103. [Google Scholar] [CrossRef]
  6. Mazni, M.; Husain, A.R.; Shapiai, M.I.; Ibrahim, I.S.; Anggara, D.W.; Zulkifli, R. An investigation into real-time surface crack classification and measurement for structural health monitoring using transfer learning convolutional neural networks and Otsu method. Alex. Eng. J. 2024, 92, 310–320. [Google Scholar] [CrossRef]
  7. Ying, C.; Wang, W.; Yu, J.; Li, Q.; Yu, D.; Liu, J. Deep learning for renewable energy forecasting: A literature and bibliometric review. J. Clean. Prod. 2023, 384, 135414. [Google Scholar] [CrossRef]
  8. Lendvai, G.F.; Gosztonyi, G. Algorithmic bias as a core legal dilemma in the age of artificial intelligence: Conceptual basis and the current state of regulation. Laws 2025, 14, 41. [Google Scholar] [CrossRef]
  9. Zekos, G.I. Political, Economic and Legal Effects of Artificial Intelligence: Governance, Digital Economy and Society; Springer: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
  10. Susskind, R. Online Courts and the Future of Justice; Oxford University Press: Oxford, UK, 2019. [Google Scholar] [CrossRef]
  11. Peppet, S.R. Unraveling privacy: The personal prospectus and the threat of a full-disclosure future. N. Univ. Law. Rev. 2011, 105, 1153. [Google Scholar]
  12. Seagraves, P. Real Estate Insights: The clash of politics and economics in the UK property market – the case of leaseholds. J. Prop. Invest. Fin. 2023, 41, 629–635. [Google Scholar] [CrossRef]
  13. Wang, F.; Yang, Q.; Wu, F.; Zhang, Y.; Sun, S.; Wang, X.; Gui, Y.; Li, Q. Identification of a 42-bp heart-specific enhancer of the notch1b gene in zebrafish embryos. Dev. Dyn. 2019, 248, 426–436. [Google Scholar] [CrossRef]
  14. Boutaba, R.; Salahuddin, M.A.; Limam, N.; Ayoubi, S.; Shahriar, N.; Estrada-Solano, F.; Caicedo, O.M. A comprehensive survey on machine learning for networking: Evolution, applications and research opportunities. J. Internet Serv. Appl. 2018, 9, 16. [Google Scholar] [CrossRef]
  15. Olutimehin, A.T.; Ajayi, A.J.; Metibemu, O.C.; Balogun, A.Y.; Oladoyinbo, T.O.; Olaniyi, O.O. Adversarial threats to AI-driven systems: Exploring the attack surface of machine learning models and countermeasures. J. Eng. Res. Rep. 2025, 27, 341–362. [Google Scholar] [CrossRef]
  16. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013; Volume 26. [Google Scholar] [CrossRef]
  17. Goodman, B.; Flaxman, S. EU regulations on algorithmic decision-making and a “right to explanation”. In Proceedings of the ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY, USA, 23 June 2016. [Google Scholar]
  18. Akinrinola, O.; Addy, W.A.; Ajayi-Nifise, A.O.; Odeyemi, O.; Falaiye, T. Predicting stock market movements using neural networks: A review and application study. GSC Adv. Res. Rev. 2024, 18, 297–311. [Google Scholar] [CrossRef]
  19. Yigitcanlar, T.; Desouza, K.C.; Butler, L.; Roozkhosh, F. Contributions and risks of artificial intelligence (AI) in building smarter cities: Insights from a systematic review of the literature. Energies 2020, 13, 1473. [Google Scholar] [CrossRef]
  20. Nguyen, T.T.T.; Armitage, G. A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surv. Tutor. 2008, 10, 56–76. [Google Scholar] [CrossRef]
  21. Adeyeye, A. Certified B corps: An examination of a standard based approach to stakeholder governance. Eur. Bus. Law. Rev. 2024, 35, 755–778. [Google Scholar] [CrossRef]
  22. Haimes, Y.Y.; Kaplan, S.; Lambert, J.H. Risk filtering, ranking, and management (RFRM) framework using hierarchical holographic modeling. Risk Anal. 2002, 22, 383–397. [Google Scholar] [CrossRef]
  23. Campbell, C.; Sands, S.; Ferraro, C.; Tsao, H.Y.; Mavrommatis, A. From data to action: How marketers can leverage AI. Bus. Horiz. 2020, 63, 227–243. [Google Scholar] [CrossRef]
  24. Floridi, L.; Cowls, J.; Beltrametti, M.; Chatila, R.; Chazerand, P.; Dignum, V.; Luetge, C.; Madelin, R.; et al. AI4People—An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations. Minds Mach. 2018, 28, 689–707. [Google Scholar] [CrossRef]
  25. Walz, A.; Firth-Butterfield, K. Implementing ethics into artificial intelligence: A contribution, from a legal perspective to the development of an AI governance regime. Duke Law. Technol. Rev. 2019, 18, 176. [Google Scholar]
  26. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: Cambridge, UK, 2006. [Google Scholar]
  27. Olatoye, F.O.; Awonuga, K.F.; Mhlongo, N.Z.; Ibeh, C.V.; Elufioye, O.A.; Ndubuisi, N.L. AI and ethics in business: A comprehensive review of responsible AI practices and corporate responsibility. Int. J. Sci. Res. Arch. 2024, 11, 1433–1443. [Google Scholar] [CrossRef]
  28. Nannini, L.; Alonso-Moral, J.M.; Catalá, A.; Lama, M.; Barro, S. Operationalizing explainable artificial intelligence in the European Union regulatory ecosystem. IEEE Intell. Syst. 2024, 39, 37–48. [Google Scholar] [CrossRef]
  29. Koshiyama, A.; Kazim, E.; Treleaven, P.; Rai, P.; Szpruch, L.; Pavey, G.; Ahamat, G.; Leutner, F.; Goebel, R.; Knight, A.; et al. Towards algorithm auditing: Managing legal, ethical and technological risks of AI, ML and associated algorithms. R. Soc. Open Sci. 2024, 11, 230859. [Google Scholar] [CrossRef]
  30. Hohma, E.; Boch, A.; Trauth, R.; Lütge, C. Investigating accountability for Artificial Intelligence through risk governance: A workshop-based exploratory study. Front. Psychol. 2023, 14, 1073686. [Google Scholar] [CrossRef] [PubMed]
  31. Micci-Barreca, D. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explor. Newsl. 2001, 3, 27–32. [Google Scholar] [CrossRef]
  32. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
  33. Jaggi, M. An equivalence between the Lasso and support vector machines. In Regularization, Optimization, Kernels, and Support Vector Machines; Suykens, J.A.K., Signoretto, M., Argyriou, A., Eds.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2014; pp. 1–26. [Google Scholar] [CrossRef]
  34. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499. [Google Scholar] [CrossRef]
  35. Adão, T.; Chojka, A.; Pascoal, D.; Silva, N.; Morais, R.; Peres, E. Synthetic Data-Driven Methods to Accelerate the Deployment of Deep Learning Models: A Case Study on Pest and Disease Detection in Precision Viticulture. Computers 2025, 14, 327. [Google Scholar] [CrossRef]
  36. Hast, A.; Nysjö, J.; Marchetti, A. Optimal RANSAC—Towards a repeatable algorithm for finding the optimal set. J. WSCG 2013, 21, 21–30. [Google Scholar]
  37. Silverman, B.W.; Jones, M.C. E. Fix and J.L. Hodges (1951): An important contribution to nonparametric discriminant analysis and density estimation (Commentary). Int. Stat. Rev. 1989, 57, 233–247. [Google Scholar] [CrossRef]
  38. Rokach, L.; Maimon, O. Data Mining with Decision Trees, 2nd ed.; World Scientific: Singapore, 2014. [Google Scholar] [CrossRef]
  39. Sagi, O.; Rokach, L. Approximating XGBoost with an interpretable decision tree. Inf. Sci. 2021, 572, 522–542. [Google Scholar] [CrossRef]
  40. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  41. Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 1999; ISBN 81-7808-300-0. [Google Scholar]
  42. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef]
Figure 1. Estimated distribution of TRS_housing for different values of γ.
Chart 1. Dataset Integration Pipeline: From Raw Inputs to Unified Analytical Dataset.
Figure 2. Cross-validation metrics: (a) R2 and (b) MSE.
Figure 3. Comparison of the R2 values for the cross-validation.
Figure 4. Comparison of the MSE for the cross-validation.
Figure 5. Comparison of training times for the cross-validation.
Figure 6. Comparison of testing times for the cross-validation.
Figure 7. (a) Models ranked by the R2 values of the cross-validation. (b) Models ranked by the MSE values of the cross-validation. (c) Models ranked by the training time values of the cross-validation. Comparative visualization of model performance across the TRS risk indices. The solid lines represent the observed values, while the red dashed lines indicate the threshold values used for model evaluation.
Figure 8. Overall performance of the models in the hyperparameter tuning.
Figure 9. Training metrics for the hyperparameter tuning.
Figure 10. (a) Means and standard deviations by estimator (Lars). (b) Means and standard deviations by estimator (DTRg). (c) Means and standard deviations by estimator (HGBRg).
Figure 11. (a) Error box of the evolution of the metrics by fold for the grid search cross-validation (Lars). (b) Error box of the evolution of the metrics by fold for the grid search cross-validation (DTRg). (c) Error box of the evolution of the metrics by fold for the grid search cross-validation (HGBRg).
Figure 12. Evolution of the metrics by fold for the best estimator of each model.
Figure 13. Comparison of the metrics by dataset.
Figure 14. Actual target values vs. predictions for the selected linear model.
Figure 15. Histogram of the residuals (actual − prediction).
Figure 16. Quantile–quantile plot of the residuals.
Figure 17. Residuals and predictions of the selected linear model.
Figure 18. Box plot of the residuals for the validation set.
Figure 19. Permutation importance of the features for the linear model with the best estimator. Bars represent the decrease in R2 when individual features are permuted, thereby quantifying their relative predictive contribution. The dash–dot horizontal line denotes the baseline reference threshold, serving as a cutoff between features with meaningful explanatory power and those with negligible importance.
Figure 20. Model updating plan.
Table 2. Number of records in the raw data.
Year | Flat File | Detailed Files: Household | Person | Project | Mortgage
2015 | 69,493 | 69,493 | 149,532 | 59,034 | 23,582
2017 | 66,752 | 66,752 | 145,320 | 50,575 | 22,820
2019 | 63,185 | 63,185 | 134,160 | 47,125 | 20,998
2021 | 64,141 | 64,141 | 135,926 | 51,476 | 19,155
2023 | 55,669 | 55,669 | 114,476 | 44,689 | 16,834
Total | 319,240 | 319,240 | 679,414 | 252,899 | 103,389
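Chart 1 and Table 2 describe how the yearly AHS releases (flat file plus household, person, project, and mortgage detail files) are combined into a single analytical dataset. The sketch below illustrates one way such an integration can be coded with pandas; the file names, the CONTROL join key, and the aggregation of the mortgage file are assumptions for illustration rather than the exact pipeline.

```python
# Illustrative sketch of the dataset integration of Chart 1 and Table 2 (file and column names are assumptions).
import pandas as pd

YEARS = [2015, 2017, 2019, 2021, 2023]

def load_year(year: int) -> pd.DataFrame:
    flat = pd.read_csv(f"ahs_{year}_flat.csv")            # one record per household
    household = pd.read_csv(f"ahs_{year}_household.csv")  # detailed household file
    mortgage = pd.read_csv(f"ahs_{year}_mortgage.csv")    # possibly several records per household
    # Detail files with multiple records per household are aggregated before merging,
    # e.g., counting mortgages per household (assumed here to feed MORTCOUNT_num).
    mortgage_agg = (mortgage.groupby("CONTROL")
                            .size()
                            .rename("MORTCOUNT_num")
                            .reset_index())
    merged = flat.merge(household, on="CONTROL", how="left", suffixes=("", "_hh"))
    merged = merged.merge(mortgage_agg, on="CONTROL", how="left")
    merged["YEAR"] = year
    return merged

# Stacking the five survey years yields the 319,240 household records reported in Table 2.
dataset = pd.concat([load_year(y) for y in YEARS], ignore_index=True)
```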
Table 3. Number of variables in the flat and detailed raw data by year.
Year | Mini Codebooks | Flat File | Detailed Files: Household | Person | Project | Mortgage | Subtotal
2015 | 485 | 391 | 326 | 49 | 8 | 11 | 394
2017 | 479 | 379 | 314 | 49 | 8 | 11 | 382
2019 | 482 | 385 | 313 | 56 | 8 | 11 | 388
2021 | 462 | 381 | 308 | 49 | 8 | 19 | 384
2023 | 515 | 435 | 356 | 55 | 8 | 19 | 438
Total | 2423 | 1971 | 1617 | 258 | 40 | 71 | 1986
Common PUF Variables | 321 | 321 | 261 | 47 | 8 | 8 | 324
Table 4. Variable (feature) counts by source file for model development.
Detailed Files: Household | Person | Project | Mortgage | Total
98 | 17 | 2 | 8 | 125
Table 5. Number of features and records in the final dataset.
Data | No. Features | No. Records
Explicative variables by type
Categoricals | 83
Numericals | 114
Variables by source
CONTROL | 2
TRS (housing) | 1
AHS | 125
WDI | 72
Final Dataset | 200 | 319,240
Table 6. Summary of the features and records in the final dataset.

| Category | Count |
|---|---|
| Explicative Variables—Categorical | 83 |
| Explicative Variables—Numerical | 114 |
| Control Variables (e.g., IDs) | 2 |
| TRS Target Variable (TRS_housing) | 1 |
| Total Variables | 200 |
| Total Records | 319,240 |
Table 7. Summary of performance changes due to feature engineering.

| Technique | Abbr. | Affected No. Features | Affected No. Records |
|---|---|---|---|
| Null values filtering | NVF | 0 | 0 |
| Missing value ratio filtering | MVR | 0 | |
| Impute missing values | IMV | 53 | |
| Low variance filtering | LVF | 0 | |
| High correlation filtering | HCF | 3 | |
| Total | | 3 | |

Note: The number of reduced features may overlap across the multiple techniques used.
Table 8. Results of the feature engineering.

| | Original | Reduced |
|---|---|---|
| Records | 319,240 | 319,240 |
| Features | 200 | 90 |
| Case identification | 2 | 2 |
| Independent | 197 | 87 |
| Dependent | 1 | 1 |
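The reduction from 200 to 90 features follows the filtering and imputation steps listed in Table 7. The sketch below illustrates, under assumed thresholds (the exact cutoffs are not reported here), how such a pipeline can be expressed with pandas; the target name TRS_housing follows Table 6, while max_missing_ratio, var_threshold, and corr_threshold are illustrative parameters rather than the values used in the study.

```python
import numpy as np
import pandas as pd

def reduce_features(df: pd.DataFrame, target: str = "TRS_housing",
                    max_missing_ratio: float = 0.5,
                    var_threshold: float = 0.0,
                    corr_threshold: float = 0.95) -> pd.DataFrame:
    """Illustrative feature-reduction pipeline: missing-value-ratio filtering,
    imputation, low-variance filtering, and high-correlation filtering.
    Thresholds are assumptions, not the values used in the study."""
    X = df.drop(columns=[target])

    # Missing value ratio filtering (MVR): drop columns with too many nulls.
    X = X.loc[:, X.isna().mean() <= max_missing_ratio]

    # Impute missing values (IMV): median for numeric, mode for categorical.
    for col in X.columns:
        if X[col].isna().any():
            if pd.api.types.is_numeric_dtype(X[col]):
                X[col] = X[col].fillna(X[col].median())
            else:
                X[col] = X[col].fillna(X[col].mode().iloc[0])

    # Low variance filtering (LVF): drop numeric columns with ~zero variance.
    numeric = X.select_dtypes(include=[np.number])
    X = X.drop(columns=numeric.columns[numeric.var() <= var_threshold])

    # High correlation filtering (HCF): drop one column of each correlated pair.
    numeric = X.select_dtypes(include=[np.number])
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return X.drop(columns=to_drop).join(df[target])
```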
Table 9. Machine learning algorithms and hyperparameters used.

| Model Type | Name | Hyperparameters |
|---|---|---|
| Linear Models | Elastic Net Regression | α = 0.05, l1_ratio = 0.25, max_iter = 1 × 10⁵, fit_intercept = True, random_state = 42 |
| Linear Models | Lars Regression | eps = 1 × 10⁻⁴, fit_intercept = True, random_state = 42, verbose = False |
| Robust Models | RANSAC Regression | random_state = 42 |
| Nearest Neighbors | K-Nearest Neighbors Regression | n_neighbors = 10 |
| Decision Trees | Decision Tree Regression | max_depth = 3, min_samples_split = 2, random_state = 42 |
| Ensembles | Hist. Gradient Boosting Regression | max_iter = 30, random_state = 42, verbose = 0 |
| Ensembles | Random Forest Regression | n_estimators = 50, random_state = 42, verbose = 0 |
| Neural Networks | MLP Regression | hidden_layer_sizes = (64, 32), learning_rate = ‘adaptive’, early_stopping = True, random_state = 42, verbose = False |
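The hyperparameter names in Table 9 map naturally onto scikit-learn estimators. The following sketch shows how the eight candidates could be instantiated with the listed settings; it is illustrative only, and any parameter not reported in the table is assumed to keep its library default.

```python
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lars, RANSACRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Candidate models keyed by the IDs used in Table 10.
models = {
    "ElaN": ElasticNet(alpha=0.05, l1_ratio=0.25, max_iter=100_000,
                       fit_intercept=True, random_state=42),
    "Lars": Lars(eps=1e-4, fit_intercept=True, random_state=42, verbose=False),
    "RscR": RANSACRegressor(random_state=42),
    "KnnR": KNeighborsRegressor(n_neighbors=10),
    "DTRg": DecisionTreeRegressor(max_depth=3, min_samples_split=2, random_state=42),
    "HGBRg": HistGradientBoostingRegressor(max_iter=30, random_state=42, verbose=0),
    "RFRg": RandomForestRegressor(n_estimators=50, random_state=42, verbose=0),
    "MlpR": MLPRegressor(hidden_layer_sizes=(64, 32), learning_rate="adaptive",
                         early_stopping=True, random_state=42, verbose=False),
}
```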
Table 10. Cross-validation results of each model.

| ID | Model | Mean R² | Std. Dev. R² | Mean NMSE | Std. Dev. NMSE | Mean Fit Time (s) | Mean Score Time (s) | Mean Elapsed Time (s) |
|---|---|---|---|---|---|---|---|---|
| ElaN | Elastic Net Regression | 9.43 × 10⁻¹ | 1.87 × 10⁻⁴ | −5.68 × 10⁻³ | 6.01 × 10⁻⁵ | 1.59 | 0.07 | 4.94 |
| Lars | Lars Regression | 9.98 × 10⁻¹ | 6.96 × 10⁻⁵ | −2.33 × 10⁻⁴ | 6.80 × 10⁻⁶ | 0.91 | 0.06 | 6.73 |
| RscR | RANSAC Regression | 9.98 × 10⁻¹ | 3.30 × 10⁻⁵ | −2.20 × 10⁻⁴ | 2.68 × 10⁻⁶ | 7.18 | 0.12 | 17.10 |
| KnnR | K-Nearest Neighbors Regression | 7.65 × 10⁻¹ | 4.67 × 10⁻³ | −2.34 × 10⁻² | 5.46 × 10⁻⁴ | 1.59 | 15.27 | 294.61 |
| DTRg | Decision Tree Regression | 9.87 × 10⁻¹ | 1.85 × 10⁻⁴ | −1.27 × 10⁻³ | 1.53 × 10⁻⁵ | 2.21 | 0.06 | 6.20 |
| HGBRg | Hist. Gradient Boosting | 9.97 × 10⁻¹ | 3.01 × 10⁻⁵ | −3.34 × 10⁻⁴ | 3.63 × 10⁻⁶ | 6.84 | 0.09 | 15.37 |
| RFRg | Random Forest Regression | 9.99 × 10⁻¹ | 2.08 × 10⁻⁵ | −1.30 × 10⁻⁴ | 1.92 × 10⁻⁶ | 119.96 | 0.87 | 254.71 |
| MlpR | MLP Regression | 8.77 × 10⁻¹ | 4.88 × 10⁻² | −1.22 × 10⁻² | 4.94 × 10⁻³ | 63.90 | 0.14 | 136.95 |
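A minimal sketch of how figures of the kind reported in Table 10 can be produced with scikit-learn's cross_validate, scoring each candidate on R² and negative mean squared error (NMSE). The fold count (cv = 5) and the names X, y for the reduced feature matrix and the TRS target are assumptions.

```python
import time
from sklearn.model_selection import cross_validate

def benchmark(models, X, y, cv=5):
    """Cross-validate each candidate on R2 and negative MSE (NMSE in Table 10)."""
    rows = {}
    for name, model in models.items():
        start = time.perf_counter()
        res = cross_validate(model, X, y, cv=cv,
                             scoring=("r2", "neg_mean_squared_error"))
        rows[name] = {
            "mean_r2": res["test_r2"].mean(),
            "std_r2": res["test_r2"].std(),
            "mean_nmse": res["test_neg_mean_squared_error"].mean(),
            "std_nmse": res["test_neg_mean_squared_error"].std(),
            "mean_fit_time_s": res["fit_time"].mean(),
            "mean_score_time_s": res["score_time"].mean(),
            "elapsed_s": time.perf_counter() - start,
        }
    return rows

# Usage (assumed names): benchmark(models, X, y) with the dictionary defined above.
```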
Table 11. Metrics, parameters, and values used in the hyperparameter tuning.

| Model Type | Name | Hyperparameters |
|---|---|---|
| Linear Models | Lars (Least Angle Regression) | eps = [1 × 10⁻⁴, 1 × 10⁻³]; n_nonzero_coefs = [5, 10, 15, 25]; fit_intercept = True |
| Tree-Based Models | Decision Tree | max_depth = [2, 5, 10, 20]; min_samples_leaf = [2, 5] |
| Ensemble Models | Hist. Gradient Boosting | max_iter = [30, 100, 300]; learning_rate = [0.05, 0.1]; min_samples_leaf = 20 |

Metrics. General: Adjusted R² for the train and test data. Specific: bias, mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), root mean absolute error (RMAE), Pearson correlation, R², and normalized deviation.
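The grids in Table 11 map directly onto scikit-learn's GridSearchCV. A sketch of such a tuning loop is given below; the fold count (cv = 5), the R² scoring used for selecting the best estimator, and the split names X_train, y_train are assumptions not stated in the table.

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import Lars
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def tune_models(X_train, y_train, cv=5):
    """Grid search over the parameter spaces listed in Table 11 (cv assumed)."""
    search_spaces = {
        "Lars": (Lars(fit_intercept=True, random_state=42),
                 {"eps": [1e-4, 1e-3], "n_nonzero_coefs": [5, 10, 15, 25]}),
        "DTRg": (DecisionTreeRegressor(random_state=42),
                 {"max_depth": [2, 5, 10, 20], "min_samples_leaf": [2, 5]}),
        "HGBRg": (HistGradientBoostingRegressor(min_samples_leaf=20, random_state=42),
                  {"max_iter": [30, 100, 300], "learning_rate": [0.05, 0.1]}),
    }
    results = {}
    for name, (estimator, grid) in search_spaces.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="r2",
                              return_train_score=True)
        search.fit(X_train, y_train)
        results[name] = (search.best_params_, search.best_score_)
    return results
```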
Table 12. Optimized parameters obtained from the hyperparameter tuning.

| Model Type | Name | Optimized Parameters |
|---|---|---|
| Linear Models | Lars (Least Angle Regression) | eps = 1 × 10⁻⁴; n_nonzero_coefs = 25; fit_intercept = True |
| Tree-Based Models | Decision Tree | max_depth = 10; min_samples_leaf = 2 |
| Ensemble Models | Hist. Gradient Boosting | max_iter = 300; learning_rate = 0.1; min_samples_leaf = 20 |
Table 13. Results of the hyperparameter tuning by metric.

| Metric | Set | Lars | DTRg | HGBRg |
|---|---|---|---|---|
| Adjusted R² | Train | 9.9610 × 10⁻¹ | 9.9866 × 10⁻¹ | 9.9882 × 10⁻¹ |
| Adjusted R² | Test | 9.9610 × 10⁻¹ | 9.9858 × 10⁻¹ | 9.9875 × 10⁻¹ |
| R² | Train | 9.9610 × 10⁻¹ | 9.9866 × 10⁻¹ | 9.9882 × 10⁻¹ |
| R² | Test | 9.9610 × 10⁻¹ | 9.9858 × 10⁻¹ | 9.9875 × 10⁻¹ |
| BIAS | Train | 1.7764 × 10⁻¹⁵ | 5.4794 × 10⁻⁶ | 1.5891 × 10⁻⁵ |
| BIAS | Test | −2.4422 × 10⁻⁶ | −1.1798 × 10⁻⁴ | −2.4791 × 10⁻⁵ |
| MAE | Train | 1.5181 × 10⁻² | 8.5193 × 10⁻³ | 7.9635 × 10⁻³ |
| MAE | Test | 1.5108 × 10⁻² | 8.7370 × 10⁻³ | 8.0814 × 10⁻³ |
| MSE | Train | 3.8764 × 10⁻⁴ | 1.3306 × 10⁻⁴ | 1.1777 × 10⁻⁴ |
| MSE | Test | 3.8561 × 10⁻⁴ | 1.4047 × 10⁻⁴ | 1.2348 × 10⁻⁴ |
| RMSE | Train | 1.9689 × 10⁻² | 1.1535 × 10⁻² | 1.0852 × 10⁻² |
| RMSE | Test | 1.9637 × 10⁻² | 1.1852 × 10⁻² | 1.1112 × 10⁻² |
| Pearson | Train | 9.9829 × 10⁻¹ | 9.9933 × 10⁻¹ | 9.9941 × 10⁻¹ |
| Pearson | Test | 9.9830 × 10⁻¹ | 9.9929 × 10⁻¹ | 9.9938 × 10⁻¹ |
| Normalized Deviation | Train | 9.7616 × 10⁻¹ | 9.9933 × 10⁻¹ | 9.9911 × 10⁻¹ |
| Normalized Deviation | Test | 9.7615 × 10⁻¹ | 9.9931 × 10⁻¹ | 9.9923 × 10⁻¹ |
Table 14. (a) Means and standard deviations by estimator (Lars). (b) Means and standard deviations by estimator (DTRg). (c) Means and standard deviations by estimator (HGBRg).

(a) Lars Regression—Mean and Standard Deviation of R², MSE, RMSE

| fit_intercept | eps | n_nonzero_coefs | Type | R² (Mean ± Std. Dev.) | MSE (Mean ± Std. Dev.) | RMSE (Mean ± Std. Dev.) |
|---|---|---|---|---|---|---|
| True | 1 × 10⁻⁴ | 5 | Train | 9.5953 × 10⁻¹ ± 1.6308 × 10⁻⁴ | 4.0243 × 10⁻³ ± 1.6406 × 10⁻⁵ | 6.3437 × 10⁻² ± 1.2933 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 5 | Test | 9.5953 × 10⁻¹ ± 2.9128 × 10⁻⁴ | 4.0246 × 10⁻³ ± 2.4862 × 10⁻⁵ | 6.3439 × 10⁻² ± 1.9571 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 10 | Train | 9.7202 × 10⁻¹ ± 3.6733 × 10⁻⁴ | 2.7823 × 10⁻³ ± 3.6300 × 10⁻⁵ | 5.2747 × 10⁻² ± 3.4370 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 10 | Test | 9.7202 × 10⁻¹ ± 2.4893 × 10⁻⁴ | 2.7825 × 10⁻³ ± 2.3645 × 10⁻⁵ | 5.2749 × 10⁻² ± 2.2377 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 15 | Train | 9.9190 × 10⁻¹ ± 5.7172 × 10⁻⁵ | 8.0547 × 10⁻⁴ ± 5.8436 × 10⁻⁶ | 2.8381 × 10⁻² ± 1.0306 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 15 | Test | 9.9190 × 10⁻¹ ± 1.2183 × 10⁻⁴ | 8.0563 × 10⁻⁴ ± 1.0107 × 10⁻⁵ | 2.8383 × 10⁻² ± 1.7815 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 25 | Train | 9.9610 × 10⁻¹ ± 8.5150 × 10⁻⁵ | 3.8784 × 10⁻⁴ ± 8.4666 × 10⁻⁶ | 1.9692 × 10⁻² ± 2.1603 × 10⁻⁴ |
| True | 1 × 10⁻⁴ | 25 | Test | 9.9610 × 10⁻¹ ± 7.9191 × 10⁻⁵ | 3.8799 × 10⁻⁴ ± 7.2817 × 10⁻⁶ | 1.9697 × 10⁻² ± 1.8522 × 10⁻⁴ |

(b) Decision Tree Regression—Mean and Standard Deviation of R², MSE, RMSE

| max_depth | min_samples_leaf | Type | R² (Mean ± Std. Dev.) | MSE (Mean ± Std. Dev.) | RMSE (Mean ± Std. Dev.) |
|---|---|---|---|---|---|
| 2 | 2 | Train | 9.4179 × 10⁻¹ ± 7.2992 × 10⁻⁵ | 5.7885 × 10⁻³ ± 6.1754 × 10⁻⁶ | 7.6082 × 10⁻² ± 4.0596 × 10⁻⁵ |
| 2 | 2 | Test | 9.4179 × 10⁻¹ ± 6.5389 × 10⁻⁴ | 5.7887 × 10⁻³ ± 5.5593 × 10⁻⁵ | 7.6082 × 10⁻² ± 3.6442 × 10⁻⁴ |
| 5 | 2 | Train | 9.9648 × 10⁻¹ ± 5.6756 × 10⁻⁶ | 3.5035 × 10⁻⁴ ± 4.3255 × 10⁻⁷ | 1.8718 × 10⁻² ± 1.1554 × 10⁻⁵ |
| 5 | 2 | Test | 9.9647 × 10⁻¹ ± 5.0951 × 10⁻⁵ | 3.5056 × 10⁻⁴ ± 3.8878 × 10⁻⁶ | 1.8723 × 10⁻² ± 1.0389 × 10⁻⁴ |
| 10 | 2 | Train | 9.9866 × 10⁻¹ ± 2.5025 × 10⁻⁶ | 1.3287 × 10⁻⁴ ± 2.1400 × 10⁻⁷ | 1.1527 × 10⁻² ± 9.2831 × 10⁻⁶ |
| 10 | 2 | Test | 9.9859 × 10⁻¹ ± 2.6742 × 10⁻⁵ | 1.4065 × 10⁻⁴ ± 2.3119 × 10⁻⁶ | 1.1859 × 10⁻² ± 9.7528 × 10⁻⁵ |
| 20 | 2 | Train | 9.9919 × 10⁻¹ ± 1.3109 × 10⁻⁵ | 8.0722 × 10⁻⁵ ± 1.2940 × 10⁻⁶ | 8.9842 × 10⁻³ ± 7.2177 × 10⁻⁵ |
| 20 | 2 | Test | 9.9803 × 10⁻¹ ± 3.8039 × 10⁻⁵ | 1.9548 × 10⁻⁴ ± 3.3753 × 10⁻⁶ | 1.3981 × 10⁻² ± 1.2086 × 10⁻⁴ |

(c) Histogram Gradient Boosting Regression—Mean and Standard Deviation of R², MSE, RMSE

| min_samples_leaf | learning_rate | max_iter | Type | R² (Mean ± Std. Dev.) | MSE (Mean ± Std. Dev.) | RMSE (Mean ± Std. Dev.) |
|---|---|---|---|---|---|---|
| 20 | 0.05 | 30 | Train | 9.5214 × 10⁻¹ ± 7.7646 × 10⁻⁶ | 4.7593 × 10⁻³ ± 2.6016 × 10⁻⁶ | 6.8988 × 10⁻² ± 1.8856 × 10⁻⁵ |
| 20 | 0.05 | 30 | Test | 9.5214 × 10⁻¹ ± 1.1431 × 10⁻⁴ | 4.7596 × 10⁻³ ± 2.3990 × 10⁻⁵ | 6.8989 × 10⁻² ± 1.7388 × 10⁻⁴ |
| 20 | 0.05 | 100 | Train | 9.9857 × 10⁻¹ ± 3.1263 × 10⁻⁶ | 1.4172 × 10⁻⁴ ± 2.8618 × 10⁻⁷ | 1.1904 × 10⁻² ± 1.2018 × 10⁻⁵ |
| 20 | 0.05 | 100 | Test | 9.9857 × 10⁻¹ ± 2.4338 × 10⁻⁵ | 1.4259 × 10⁻⁴ ± 2.1444 × 10⁻⁶ | 1.1941 × 10⁻² ± 8.9868 × 10⁻⁵ |
| 20 | 0.05 | 300 | Train | 9.9880 × 10⁻¹ ± 4.0015 × 10⁻⁶ | 1.1961 × 10⁻⁴ ± 3.9764 × 10⁻⁷ | 1.0937 × 10⁻² ± 1.8166 × 10⁻⁵ |
| 20 | 0.05 | 300 | Test | 9.9876 × 10⁻¹ ± 2.5771 × 10⁻⁵ | 1.2294 × 10⁻⁴ ± 2.3322 × 10⁻⁶ | 1.1087 × 10⁻² ± 1.0528 × 10⁻⁴ |
| 20 | 0.1 | 300 | Train | 9.9881 × 10⁻¹ ± 4.7416 × 10⁻⁶ | 1.1828 × 10⁻⁴ ± 4.6970 × 10⁻⁷ | 1.0875 × 10⁻² ± 2.1601 × 10⁻⁵ |
| 20 | 0.1 | 300 | Test | 9.9877 × 10⁻¹ ± 2.6282 × 10⁻⁵ | 1.2257 × 10⁻⁴ ± 2.3918 × 10⁻⁶ | 1.1071 × 10⁻² ± 1.0812 × 10⁻⁴ |
Table 15. Summary statistics of the best estimators for each model.

| Model (Best Estimator) | Mean R² | Mean MSE | Mean RMSE | Std. Deviation | Robustness | Complexity |
|---|---|---|---|---|---|---|
| Lars (Least Angle Regression) (eps = 1 × 10⁻⁴, n_nonzero_coefs = 25, fit_intercept = True) | 0.9960 | 3.88 × 10⁻⁴ | 1.97 × 10⁻² | High (slight variability between folds) | Medium | Low |
| Decision Tree (max_depth = 10, min_samples_leaf = 2) | 0.9984 | 1.41 × 10⁻⁴ | 1.19 × 10⁻² | Very low (flat curves on all folds) | High | Medium |
| Hist. Gradient Boosting (max_iter = 300, learning_rate = 0.1, min_samples_leaf = 20) | 0.9988 | 1.23 × 10⁻⁴ | 1.11 × 10⁻² | Very low (minimum dispersion in metrics) | Very high | High |
Table 16. Comparison of the metrics by dataset.

| Metric | Train | Test | Valid |
|---|---|---|---|
| Adjusted R² | 9.9610 × 10⁻¹ | 9.9610 × 10⁻¹ | 9.9877 × 10⁻¹ |
| R² | 9.9610 × 10⁻¹ | 9.9610 × 10⁻¹ | 9.9877 × 10⁻¹ |
| BIAS | 0.0000 | −2.4422 × 10⁻⁶ | −2.6133 × 10⁻⁵ |
| MAE | 1.5181 × 10⁻² | 1.5108 × 10⁻² | 8.0661 × 10⁻³ |
| MSE | 3.8764 × 10⁻⁴ | 3.8561 × 10⁻⁴ | 1.2286 × 10⁻⁴ |
| RMSE | 1.9689 × 10⁻² | 1.9637 × 10⁻² | 1.1084 × 10⁻² |
| Pearson | 9.9829 × 10⁻¹ | 9.9830 × 10⁻¹ | 9.9938 × 10⁻¹ |
| Normalized Deviation | 9.7616 × 10⁻¹ | 9.7615 × 10⁻¹ | 9.9925 × 10⁻¹ |
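The per-dataset figures in Table 16 are standard regression diagnostics. A sketch of how they can be computed is shown below; the sign convention for the bias (prediction minus actual) and the adjusted-R² formula with n_features predictors are assumptions, and the normalized deviation metric is omitted because its exact definition is not reproduced here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Per-dataset metrics in the spirit of Table 16 (normalized deviation omitted)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "R2": r2,
        "BIAS": float(np.mean(y_pred - y_true)),   # sign convention assumed
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "Pearson": float(pearsonr(y_true, y_pred)[0]),
    }
```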
Table 17. Mean and standard deviation of the actual and prediction groups.

| Group | Predictions (Mean) | Predictions (Std. Dev.) | Actuals (Mean) | Actuals (Std. Dev.) |
|---|---|---|---|---|
| 0 | 6.0578 × 10⁰ | 4.6185 × 10⁻² | 6.0577 × 10⁰ | 4.8527 × 10⁻² |
| 1 | 6.2502 × 10⁰ | 3.1956 × 10⁻² | 6.2502 × 10⁰ | 3.3275 × 10⁻² |
| 2 | 6.4538 × 10⁰ | 3.8991 × 10⁻² | 6.4540 × 10⁰ | 4.0298 × 10⁻² |
| 3 | 6.6260 × 10⁰ | 2.5081 × 10⁻² | 6.6261 × 10⁰ | 2.7325 × 10⁻² |
| 4 | 6.9703 × 10⁰ | 5.0399 × 10⁻² | 6.9702 × 10⁰ | 5.2037 × 10⁻² |
Table 18. Bins and histogram values of the distribution of the residuals.

| Bar | Range (Min) | Range (Max) | Bin Mean | Frequency |
|---|---|---|---|---|
| 0 | −7.5048 × 10⁻² | −4.2252 × 10⁻² | −5.8650 × 10⁻² | 140 |
| 1 | −4.2252 × 10⁻² | −9.4573 × 10⁻³ | −2.5855 × 10⁻² | 9,369 |
| 2 | −9.4573 × 10⁻³ | 2.3338 × 10⁻² | 6.9403 × 10⁻³ | 52,879 |
| 3 | 2.3338 × 10⁻² | 5.6133 × 10⁻² | 3.9735 × 10⁻² | 1,444 |
| 4 | 5.6133 × 10⁻² | 8.8928 × 10⁻² | 7.2531 × 10⁻² | 16 |
Table 19. Descriptive statistics of the values of the Q–Q plot.

| Statistic | Theoretical Quantiles | Observed Quantiles | Theoretical Distribution |
|---|---|---|---|
| Mean | 3.4187 × 10⁻¹⁶ | 2.6133 × 10⁻⁵ | 2.6133 × 10⁻⁵ |
| Std. Dev. | 9.9994 × 10⁻¹ | 1.1084 × 10⁻² | 9.8900 × 10⁻¹ |
| Min | −4.2465 × 10⁰ | −7.5048 × 10⁻² | −4.1576 × 10⁰ |
| 25% | −6.7447 × 10⁻¹ | −5.7721 × 10⁻³ | −6.6849 × 10⁻¹ |
| 50% | 0.0000 | −3.7011 × 10⁻⁴ | −3.7011 × 10⁻⁴ |
| 75% | 6.7447 × 10⁻¹ | 6.7643 × 10⁻³ | 6.6088 × 10⁻¹ |
| Max | 4.1710 × 10⁰ | 1.5011 × 10⁻² | 4.1575 × 10⁰ |
Table 20. Statistics and metrics of the residuals.

| Statistic/Metric | Mean | Std. Dev. | Skew | Kurtosis | Bias | MAE | MSE | RMSE |
|---|---|---|---|---|---|---|---|---|
| Values | 2.6133 × 10⁻⁵ | 1.1084 × 10⁻² | −6.7051 × 10⁻² | 2.35 | −2.6133 × 10⁻⁵ | 8.0661 × 10⁻³ | 1.2286 × 10⁻⁴ | 1.1084 × 10⁻² |
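Figures 15 and 16 and the statistics in Table 20 summarize the distribution of the validation residuals. The sketch below shows one way such diagnostics can be reproduced with SciPy and Matplotlib, assuming residuals are computed as actual minus prediction and using five histogram bins as in Table 18.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_diagnostics(y_true, y_pred, bins=5):
    """Histogram, Q-Q plot, and summary statistics of the residuals."""
    resid = np.asarray(y_true) - np.asarray(y_pred)   # actual minus prediction

    fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(resid, bins=bins)                     # Figure 15 analogue
    ax_hist.set_title("Histogram of residuals")
    stats.probplot(resid, dist="norm", plot=ax_qq)     # Figure 16 analogue
    ax_qq.set_title("Quantile-quantile plot of residuals")

    summary = {
        "mean": float(resid.mean()),
        "std": float(resid.std()),
        "skew": float(stats.skew(resid)),
        "kurtosis": float(stats.kurtosis(resid)),      # excess kurtosis by default
    }
    return fig, summary
```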
Table 21. Mean and standard deviation of the residuals and predictions.

| Group | Predictions (Mean) | Predictions (Std. Dev.) | Residuals (Mean) | Residuals (Std. Dev.) |
|---|---|---|---|---|
| 0 | 6.0578 × 10⁰ | 4.6185 × 10⁻² | −8.7320 × 10⁻⁵ | 1.3769 × 10⁻² |
| 1 | 6.2502 × 10⁰ | 3.1956 × 10⁻² | 4.6723 × 10⁻⁵ | 9.8573 × 10⁻³ |
| 2 | 6.4538 × 10⁰ | 3.8991 × 10⁻² | 1.4608 × 10⁻⁴ | 9.5342 × 10⁻³ |
| 3 | 6.6260 × 10⁰ | 2.5081 × 10⁻² | 1.1601 × 10⁻⁴ | 1.0356 × 10⁻² |
| 4 | 6.9703 × 10⁰ | 5.0399 × 10⁻² | −7.2560 × 10⁻⁵ | 1.0643 × 10⁻² |
Table 22. Statistics and boundaries of the variability of the residuals.

| Statistic | Mean | Std. Dev. | Lower Bound | Min | 25% | 50% | 75% | Max | Upper Bound |
|---|---|---|---|---|---|---|---|---|---|
| Values | 2.7506 × 10⁻⁵ | 1.1239 × 10⁻² | −4.4103 × 10⁻² | −1.3583 × 10⁻² | −1.3583 × 10⁻² | −6.6100 × 10⁻³ | 6.7643 × 10⁻³ | 1.5011 × 10⁻² | 3.7285 × 10⁻² |
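The lower and upper bounds in Table 22 are consistent with the usual box-plot whisker rule, Q1 − 1.5·IQR and Q3 + 1.5·IQR computed from the 25% and 75% quantiles. A short check, assuming that rule:

```python
def whisker_bounds(q1, q3, k=1.5):
    """Box-plot whisker limits: (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# The quartiles reported in Table 22 reproduce the stated bounds up to rounding:
lower, upper = whisker_bounds(-1.3583e-2, 6.7643e-3)
print(lower, upper)   # about -4.4104e-2 and 3.7285e-2
```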
Table 23. Ranking of permutation importance.

| Importance | Features | Share of Features |
|---|---|---|
| High | GASAMT_cat, GASAMT_num | 2.3% |
| Moderate | TOTBALAMT_num, HUDSUB_cat, TRASHAMT_num, NHQSCRIME_cat, UTILAMT_num, MORTCOUNT_num, HMRACCESS_cat, INTRATE_num, PAP_cat, HHGRAD_cat, SEMP_num, RPRU_num | 13.8% |
| Nearly Null Positive | ELECAMT_cat, JOBTYPE_cat, SSIP_num, HHAGE_num, SUNZ_num, PERPOVLVL_cat, SSP_num, GTOC_num, FOUNDTYPE_cat, RETP_num, FINCP_num, OCCYRRND_cat, GDTGZ_num, ROOFHOLE_cat, WINBROKE_cat, FNDCRUMB_cat, PROTAXAMT_num, HHAGE_cat, BEDROOMS_num, WAGP_cat, HHADLTKIDS_cat, TVSCW_num, TVTC_num, ROOFSHIN_cat, WALLSLOPE_cat, INTRATE_cat, WINBOARD_cat, ROOFSAG_cat, AYCK_num, WALLSIDE_cat, OIP_cat, WAGP_num, RATINGNH_cat, DINING_num, SEMP_cat, NHQRISK_cat, NTNC_num, MAINTAMT_num, ELECAMT_num, HMRENEFF_cat, NHQSCHOOL_cat, NHQPCRIME_cat, BATHROOMS_cat, ALCH_num, INTP_num, DWNPAYPCT_cat, PERSCOUNT_num, MARKETVAL_num, NHQPUBTRN_cat, INSURAMT_num, PMTAMT_num, LOTAMT_num, LOTAMT_cat, HHCITSHP_cat, PAP_num, REMODAMT_num, OILAMT_num, NORC_cat, OIP_num, NRATE_cat, UNITSIZE_cat, OILAMT_cat, HMRSALE_cat, OTHERAMT_cat | 73.6% |
| Null | OTHERAMT_num, INTP_cat, HRATE_cat, WATERAMT_num, RATINGHS_cat | 5.7% |
| Nearly Null Negative | PERPOVLVL_num, NGMC_num, HOAAMT_num, INSURAMT_cat | 4.6% |
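The ranking in Table 23 (and Figure 19) is based on permutation importance, that is, the decrease in R² when each feature is shuffled on held-out data. A sketch using scikit-learn's permutation_importance is given below; the argument names model, X_valid, y_valid and the n_repeats setting are assumptions.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def rank_features(model, X_valid, y_valid, n_repeats=10):
    """Rank features by the mean decrease in R2 when each one is permuted."""
    result = permutation_importance(model, X_valid, y_valid, scoring="r2",
                                    n_repeats=n_repeats, random_state=42)
    order = np.argsort(result.importances_mean)[::-1]
    return [(X_valid.columns[i],
             float(result.importances_mean[i]),
             float(result.importances_std[i]))
            for i in order]
```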
Table 24. Comparison of the RECIR with traditional risk assessment models.

| Criteria | RECIR (AI-Based Model) | Traditional Risk Models |
|---|---|---|
| Predictive Accuracy | High—leverages machine learning, big data, and real-time updates | Moderate—relies on historical data and static statistical techniques |
| Regulatory Compliance | Integrated AI-driven fairness audits (GDPR, AI Act, Fair Housing) | Limited—requires manual adjustments for regulatory alignment |
| Interpretability | Explainable AI (XAI) enhances transparency | Transparent but less adaptable to complex, multi-dimensional risks |
| Adaptability | Dynamic learning adjusts to new market conditions | Static—fixed parameters based on past trends |
| Risk Factors Considered | Multi-dimensional: legal, economic, environmental, and financial factors | Primarily financial indicators |
| Data Processing Capability | Handles unstructured and high-volume data (IoT, NLP, market feeds) | Limited to structured datasets with predefined variables |
| Computational Efficiency | AI-driven automation enables real-time analysis | Requires manual intervention, slower in processing large datasets |
| Application in Decision-Making | Supports automated, data-driven investment strategies | Relies on analyst interpretation, potentially slower decision-making |
| Fraud Detection & Forensic Risk Assessment | Integrated forensic AI techniques for anomaly detection | Limited forensic capabilities—dependent on retrospective audits |
| Scalability | Highly scalable across different markets and data environments | Requires significant manual adjustments for new datasets |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
