Total Least Squares Estimation in Hedonic House Price Models

In real estate valuation using the Hedonic Price Model (HPM) estimated via Ordinary Least Squares (OLS) regression, subjectivity and measurement errors in the independent variables violate the Gauss–Markov assumption of a non-random coefficient matrix, leading to biased parameter estimates and incorrect precision assessments. In this contribution, the Errors-in-Variables (EIV) model equipped with Total Least Squares (TLS) estimation is proposed to address these issues: it fully accounts for random errors in both the dependent and independent variables. An iterative algorithm is provided, together with posterior accuracy estimates to validate its effectiveness. Monte Carlo simulations demonstrate that TLS provides more accurate solutions than OLS, significantly improving the root mean square error by over 70%. Empirical experiments on datasets from Boston and Wuhan further confirm the superior performance of TLS, which consistently yields a higher coefficient of determination and a lower posterior variance factor, indicating stronger explanatory power for the data. Moreover, TLS shows comparable or slightly superior prediction accuracy. These results make it a compelling and practical method for enhancing the HPM.


Introduction
Real estate holds a pivotal position in the national economy and household investment, necessitating a thorough analysis of its influencing factors and an accurate assessment of its value. It is widely recognized that the market approach is commonly applied in real estate valuation. The Hedonic Price Model (HPM) extends the market approach [1,2]: grounded in supply-demand theory, it uses regression analysis to relate characteristics to transaction prices. Studying the HPM in depth for parameter estimation and housing price prediction is crucial in facilitating well-informed decision-making, ensuring the integrity of real estate transactions, and conducting precise tax assessments [3].
Machine learning (ML) algorithms are increasingly used to analyze the housing market [4][5][6]. Much of the literature [7][8][9] recognizes the undeniable predictive power of ML algorithms; however, they also present a critical limitation due to their 'black box' nature (i.e., lack of model interpretability). It is difficult to discern the role that individual parameters play in value variation or to numerically define the causal relationships between prices and the characteristics of assets [10]. However, this is not the case with the HPM, which facilitates both parameter estimation and interpretation. Pérez-Rave et al. [11] point out that understanding today's complex housing market requires a thorough analysis of the relevant variables and the estimation of their significant impacts. Therefore, our work focuses on expanding the HPM to achieve a more comprehensive and thorough analysis of housing prices.

Hedonic Price Method
Currently, the research on the HPM can be delineated into two levels: practical studies and methodological exploration. For practical research, the HPM serves three main purposes. (1) It is used to construct quality-adjusted house price indexes [12,13] to broadly track property price movements. Numerous countries and organizations have developed their own hedonic indexes, such as the hedonic house price index of the US Census Bureau and the Halifax and Nationwide indexes in the UK. (2) It is used to provide automated valuations (or general appraisals) of properties [14], which is also a critical step in property tax determination in some countries (e.g., the United States and Germany). (3) It is used to explain house price variations or determine the impact of certain characteristics on houses, revealing house price drivers and mirroring the development stages of the real estate market [2]. Housing price drivers can be roughly divided into two categories. The first category focuses on the physical characteristics, such as intrinsic characteristics [15,16] and building properties [17]. The second category examines the impact of public goods on housing prices, such as school quality [18,19], public transportation [20,21], and open spaces [22][23][24].
For methodological research, the classic HPM is estimated by Least Squares (LS) regression [25]. The observed dependent variable (transaction price) is expressed as the linear combination of independent variables, i.e., the implicit prices associated with the structural, neighborhood, and location characteristics of the real estate [26]. As the parameters of interest are estimated based on the formed mathematical model and a certain optimization criterion, the original method has been intensively extended in two aspects. (1) To enhance mathematical modeling, (i) the functional relation has been improved by applying the semiparametric model [27], Box-Cox model [28], and log-log model [29], which aim at improving both the goodness of fit and interpretability of the model; (ii) the stochastic model has been refined by considering the spatial effects via the Spatial Autoregressive (SAR) model [30][31][32], the Spatial Error Model (SEM) [33], and the Geographically Weighted Regression (GWR) [34][35][36]. (2) For the optimization criterion, (i) regularization (or penalized) methods, such as ridge regression and the Least Absolute Shrinkage and Selection Operator (LASSO), are utilized to overcome multicollinearity, which may be caused by the dependency of the independent variables [37][38][39]; (ii) the prior information can be incorporated to consider the advice of experts via Bayesian estimation [40,41] or inequality-restricted least squares [39] by using an informative prior distribution, where the hyperparameters are set according to expert knowledge of the characteristics of the model parameter; (iii) robust methods, such as the least median of squared residuals [42] and normalized interval regression [43], have been utilized to detect outliers and enhance the reliability.

Total Least Squares Estimation
Though previous research has explored the HPM from various perspectives, an aspect remains to be discussed. Real estate transactions lack transparency due to limited access to crucial details like actual transaction prices and real estate facilities on public websites [44,45]. Additionally, many property characteristics are qualitative and subjective, such as views or architectural styles [46,47], and measurement errors in recording key characteristics like the environmental quality of the house all contribute to potential "Errors-in-Variables" (EIV). However, previous studies often ignore the errors in the coefficient matrix, which can result in biased estimation and incorrect accuracy assessment. Anselin and Lozano-Gracia [48] explore EIV using instrumental variables in a two-step regression approach. Unlike this method, Total Least Squares (TLS) directly minimizes the sum of squared orthogonal distances from the data points to the model, thereby simultaneously addressing errors in both dependent and independent variables.
The terminology of TLS was first coined by Golub and van Loan [49], who also proposed the widely used algorithm based on Singular Value Decomposition (SVD). Nowadays, it finds application in the fields of signal processing [50,51], image processing [52,53], and applied geodesy [54][55][56][57][58][59], to name a few. Its aim is to minimize the sum of the squares of all random errors in the model. Though statistically appealing, it is more complicated to compute than LS due to the nonlinearity caused by the interaction of the random effects with the fixed effects of the unknowns. The methods to obtain a TLS solution can be mainly summarized into two categories: the SVD-based methods and the iterative methods. Inherited from SVD, the former method is numerically effective; however, it only works with some restricted structures of the stochastic model. For more details, one can refer to [50,60]. The latter regards it as a nonlinear constrained optimization problem and solves it via linearization and iteration, e.g., the Gauss-Newton method and the Newton method [55,57,61,62]. In contrast to the former method, it is more general in terms of the model setup.
To give the reader an impression of the basic idea of TLS, we here consider a simple example published by Wooldridge [63] (p. 153). It investigates the relationship between the logarithm of the house price and the logarithm of the distance from the house to an incinerator. Therefore, the model consists of two parameters, i.e., the intercept parameter and the distance parameter. We select the first 100 sampled points and fit the data with LS and TLS, respectively. The estimated results are shown in Figure 1, from which we can see that LS only adjusts the data in the vertical direction. In contrast, the direction of the residuals produced by TLS is orthogonal to the fitted line. The essential reason is that TLS considers the random errors in the distance and adjusts such quantities in the estimation. This is why TLS is called orthogonal regression in some studies. The main objective of this contribution is first to apply TLS to hedonic price problems, taking into account errors in the dependent and independent variables, thereby enhancing the reasonable estimation of the hedonic parameters. Our approach leverages simulation experiments and two empirical datasets to comprehensively demonstrate its superiority in real estate evaluation.
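The LS-versus-TLS contrast in this two-parameter example can be reproduced in a few lines. Below is a minimal NumPy sketch (the paper's own computations are done in MATLAB); the data are synthetic stand-ins for the incinerator example, not the actual sampled points, and the classic SVD route applies here because both variables carry the same noise level.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in: a true line with noise in BOTH variables
x_true = np.linspace(8.0, 10.5, 100)       # e.g., log(distance)
y_true = 5.0 + 0.4 * x_true                # e.g., log(price)
x = x_true + rng.normal(0.0, 0.05, 100)
y = y_true + rng.normal(0.0, 0.05, 100)

# LS: minimizes vertical residuals only
b1_ls, b0_ls = np.polyfit(x, y, 1)

# TLS (orthogonal regression): the smallest right singular vector of the
# centered data matrix gives the normal vector of the fitted line
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
a, b = Vt[-1]                              # a*(x - xm) + b*(y - ym) = 0
b1_tls = -a / b
b0_tls = y.mean() - b1_tls * x.mean()
```

With equal noise in both coordinates, the TLS residuals are orthogonal to the fitted line, whereas the LS residuals are vertical, mirroring Figure 1.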

Outline of the Paper
The rest of the paper is organized as follows. In Section 2, we first review LS estimation and then present TLS estimation in the HPM. In Section 3, Monte Carlo simulations are presented to show the advantages of our method. In Section 4, the Boston dataset is analyzed. In Section 5, we analyze a dataset collected in Wuhan. In Section 6, we discuss the three sets of experimental results. Finally, the conclusions are drawn in Section 7.

Method for the HPM
The HPM shows the relationship between a dependent variable (the house transaction price of the i-th sample) and independent variables (the selected m house characteristics for the i-th sample). The regression equation can be written as follows:

y_i ≈ β_0 + β_1 x_{i,1} + · · · + β_m x_{i,m}, (1)

where y_i is the dependent variable (i.e., the transaction price), x_{i,1}, · · · , x_{i,m} are the independent variables (i.e., house characteristics), β_0 is the unknown intercept parameter, and β_1, · · · , β_m are the unknown hedonic parameters. The approximation symbol "≈" is used here since such a relationship does not hold exactly for real collected data. The choice of estimation method depends on how we specify the stochastic information of this equation. If only the random errors of the dependent variables (i.e., y_i) are considered, it becomes a linear model and LS is applied; if the random errors of the independent variables (i.e., x_{i,1}, · · · , x_{i,m}) are additionally considered, it becomes an EIV model and TLS is applied. Next, these two methods are introduced.

Least Squares Regression
In most cases, we only consider the errors in the dependent variables y_i. Collecting the equations of all n sampled points yields the well-known linear Gauss-Markov model

y = Aβ + e_y, e_y ∼ (0, Σ_y = σ² Q_y), (2)

with

e_y = [e_{y1}, e_{y2}, · · · , e_{yn}]^T, (3)

where y is the n-vector of observations; A is the design matrix of order n × (m + 1) with rank (m + 1); e_y is the n-vector of random errors; β is the (m + 1)-vector of the unknown parameters to be estimated; Σ_y is the symmetric positive definite covariance matrix of order n × n; Q_y is the cofactor matrix; and σ² is the (unknown) variance factor (VF).
By minimizing e_y^T Q_y^{-1} e_y, the LS estimator reads

β̂_LS = (A^T Q_y^{-1} A)^{-1} A^T Q_y^{-1} y. (4)

The residual vector reads

ê_{y,LS} = y − A β̂_LS. (5)

Utilizing the error propagation law yields the cofactor matrix of the LS estimator, i.e.,

Q_{β̂_LS} = (A^T Q_y^{-1} A)^{-1}. (6)

The (weighted) Sum of Squared Errors (SSE) reads

SSE_LS = ê_{y,LS}^T Q_y^{-1} ê_{y,LS}. (7)

Since the degree of freedom of the model is (n − m − 1), the a posteriori square root of the VF can be evaluated as

σ̂_LS = sqrt(SSE_LS/(n − m − 1)). (8)

Combining (6) and (8), the covariance matrix of β̂_LS reads

Σ̂_{β̂_LS} = σ̂²_LS (A^T Q_y^{-1} A)^{-1}. (9)

If the cofactor matrix is chosen as Q_y = I_n, all formulations degrade into those for the Ordinary Least Squares (OLS) regression. However, the general covariance matrix Σ_y enables us to consider various model setups, such as spatial correlations [64].
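The estimator, residuals, posterior variance factor, and parameter covariance above can be computed compactly. A minimal NumPy sketch follows (the paper itself uses MATLAB; `ls_estimate` is a hypothetical helper name):

```python
import numpy as np

def ls_estimate(y, A, Qy):
    """Weighted LS for y = A @ beta + e_y, e_y ~ (0, sigma^2 * Qy).

    Returns the estimate, the residual vector, the posterior sqrt of the
    variance factor, and the covariance matrix of the estimate."""
    n, m1 = A.shape                      # m1 = m + 1 (intercept included)
    W = np.linalg.inv(Qy)                # weight matrix Qy^{-1}
    N = A.T @ W @ A                      # normal matrix
    beta = np.linalg.solve(N, A.T @ W @ y)
    e = y - A @ beta                     # residual vector
    sse = e @ W @ e                      # weighted sum of squared errors
    sigma = np.sqrt(sse / (n - m1))      # posterior sqrt of the VF
    Sigma_beta = sigma**2 * np.linalg.inv(N)
    return beta, e, sigma, Sigma_beta
```

With Qy chosen as the identity matrix, this reduces to plain OLS and agrees with a direct least-squares solve.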

Total Least Squares Estimation
In LS estimation, the design matrix A is assumed to be non-stochastic (i.e., errorless). However, due to subjectivity (e.g., the quality of decoration and architectural style) and measurement uncertainty (e.g., the commercial service level of the house), some independent variables cannot be exactly evaluated. Ignoring the stochasticity of the design matrix can cause the LS regression results to be less accurate. Therefore, we address the issue of endogeneity from the specific perspective of EIV by introducing a random matrix E of order n × (m + 1) into the linear model (2), i.e.,

y = (A − E)β + e_y, (10)

with

e_A = vec(E) ∼ (0, Σ_A = σ_A² Q_A = κ² σ² Q_A), (11)

where e_A = vec(E) is the (nm + n)-vector of random errors; vec(·) represents the vectorization operator that stacks the columns of its argument; Q_A is the symmetric positive semi-definite cofactor matrix of order (nm + n) × (nm + n); κ² = σ_A²/σ² is the VF ratio; and σ_A² represents the VF of the design matrix. In practice, the cofactor matrix Q_A is usually determined by considering factors such as the reliability of the data sources, the collection methods, and the nature of the variables themselves.
Since we have the relationship Eβ = (β^T ⊗ I_n) e_A [65], the model (10) can be reformulated as

y = Aβ + B e, e = [e_y^T, e_A^T]^T ∼ (0, Σ = σ² Q), (12)

where ⊗ represents the Kronecker product, B = [I_n, −(β^T ⊗ I_n)], and Q = blkdiag(Q_y, κ² Q_A). Based on this, the general TLS objective reads

min e^T Q^{-1} e subject to y = Aβ + B e. (13)

Note that if some independent variables are fixed (or non-stochastic), the corresponding blocks of Q are zero matrices, which leads to its singularity, i.e., rank(Q) < n(m + 2). Therefore, strictly speaking, the regular inverse Q^{-1} does not exist, since we at least have the errorless column associated with the intercept parameter β_0. However, we still use Q^{-1} to establish the objective, since the singularity caused by such a structure does not affect the final estimate. A similar treatment can be found in, e.g., [57]. Unlike the traditional TLS method based on SVD, the structure of the model is considered here by forming the covariance matrix Σ, which is automatically kept in the estimation.
With the Lagrangian method, we can obtain the iterative solution forms. For simplicity, the derivations are placed in Appendix A. The TLS estimator reads

β̂_TLS = (Ā^T Q_c^{-1} Ā)^{-1} Ā^T Q_c^{-1} (y − Ê β̂_TLS), (14)

where Ā = A − Ê and Q_c = B Q B^T. The residual vector reads

ê = Q B^T Q_c^{-1} (y − A β̂_TLS). (15)

Reshaping the residual vector (15), we can obtain

[ê_y, Ê] = vec^{-1}_{n,m+2}(ê), (16)

where vec^{-1}_{n,m+2}(·) is the inverse operator of vec(·), i.e., restructuring an n × (m + 2) matrix from an (nm + 2n)-vector.
Since the residuals are determined by the unknowns, the final estimate should be obtained by iterating these expressions. The steps of the TLS estimation are summarized in Algorithm 1. Here, ∥·∥₂² = (·)^T (·) is the squared Euclidean norm and ε is a very small positive constant, chosen as 10^{-8} in this paper. In this paper, all computations are implemented in MATLAB. For the computational aspects of TLS, one can refer to Fang [57,62] for a Newton-type iteration.

Algorithm 1 The iterative TLS estimation
Require: y, A, and Σ
1: Obtain the initial values β⁰ by the LS estimation and set i = 0
2: repeat
3: Update the coefficient matrix Ā ← A − E⁰ and the cofactor matrix Q_c
4: Compute the parameter vector β̂ and the residual vector ê
5: Reshape ê to obtain the residual matrix Ê and set δ ← β̂ − β⁰
6: Update the residual matrix E⁰ ← Ê
7: Update the parameter vector β⁰ ← β̂
8: Update i ← i + 1
9: until ∥δ∥ < ε (δ is the difference in the parameter vector between two successive iterations)
10: Calculate σ̂_TLS and Σ̂_{β̂_TLS}

Next, we consider how to perform the precision assessment in the TLS estimation. According to [66], the EIV model can be formed as

y = Aβ + e_c,

where y_c = y − Eβ, Ā = A − E, and e_c = e_y − Eβ. Therefore, we have the element-wise relation

e_{ci} = e_{yi} − e_{a_i}^T β,

where e_{ci} and e_{yi} are the i-th elements of e_c and e_y, respectively, and e_{a_i}^T is the i-th row vector of E. From the definition, we cannot immediately judge which of |e_{ci}| and |e_{yi}| is greater, as the sign of the product e_{a_i}^T β cannot be determined. This is why e_c is called the "total error" by [66] and has been employed for statistical inference, such as hypothesis testing.
Taking it as the model with a fixed coefficient matrix, we can develop the LS ensemble formulations for TLS, such as precision assessment and hypothesis testing. Therefore, we can have the cofactor matrix as

Q_{β̂_TLS} = (Ā^T Q_c^{-1} Ā)^{-1}.

Similarly, the SSE reads

SSE_TLS = ê_c^T Q_c^{-1} ê_c, ê_c = y − A β̂_TLS,

which leads to the posterior square root of the VF

σ̂_TLS = sqrt(SSE_TLS/(n − m − 1))

and the posterior covariance matrix of the parameters

Σ̂_{β̂_TLS} = σ̂²_TLS (Ā^T Q_c^{-1} Ā)^{-1}.
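Putting the estimator, the residual update, and the posterior measures together, the iteration of Algorithm 1 can be sketched in NumPy as follows (the paper's implementation is MATLAB; `tls_estimate` is a hypothetical name, and the updates follow the Lagrangian solution described above):

```python
import numpy as np

def tls_estimate(y, A, Qy, QA, eps=1e-8, max_iter=100):
    """Iterative TLS for the EIV model y = (A - E) @ beta + e_y (a sketch).

    Qy: n x n cofactor matrix of y; QA: n(m+1) x n(m+1) cofactor matrix of
    vec(E) (zero blocks are allowed for errorless columns). Returns the
    estimate, the residual matrix E, the posterior sqrt of the VF, and the
    posterior covariance matrix of the parameters."""
    n, p = A.shape                                   # p = m + 1
    beta = np.linalg.lstsq(A, y, rcond=None)[0]      # LS initial values
    E = np.zeros((n, p))
    for _ in range(max_iter):
        B = np.hstack([np.eye(n), -np.kron(beta, np.eye(n))])
        Q = np.zeros((n + n * p, n + n * p))
        Q[:n, :n] = Qy
        Q[n:, n:] = QA
        Qc_inv = np.linalg.inv(B @ Q @ B.T)          # inverse cofactor of the total error
        Abar = A - E
        N = Abar.T @ Qc_inv @ Abar
        beta_new = np.linalg.solve(N, Abar.T @ Qc_inv @ (y - E @ beta))
        e = Q @ B.T @ Qc_inv @ (y - A @ beta_new)    # stacked residuals [e_y; e_A]
        E = e[n:].reshape((n, p), order='F')         # undo the column-wise vec
        delta = np.linalg.norm(beta_new - beta)
        beta = beta_new
        if delta < eps:
            break
    r = y - A @ beta                                 # total error e_c at the estimate
    sse = r @ Qc_inv @ r
    sigma = np.sqrt(sse / (n - p))                   # posterior sqrt of the VF
    Sigma_beta = sigma**2 * np.linalg.inv(N)
    return beta, E, sigma, Sigma_beta
```

When QA is the zero matrix, the iteration collapses to ordinary LS after the first pass, which is a convenient sanity check on the implementation.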

Coefficient of Determination
The coefficient of determination (R²) is a popular measure of the goodness of fit for linear models. To assess the explanatory power of the effects in β_1, · · · , β_m, the null model can be formed as

y = 1_n β_0 + e_y, e_y ∼ (0, σ² Q_y),

by dropping all these effects in the relation (2) or (10). By performing LS estimation, we have the SSE of the null model,

SSE_0 = y^T (I_n − P)^T Q_y^{-1} (I_n − P) y,

which is also called the (weighted) total sum of squares, with the projector

P = 1_n (1_n^T Q_y^{-1} 1_n)^{-1} 1_n^T Q_y^{-1}.
Therefore, we can have the coefficient of determination as

R² = 1 − SSE/SSE_0. (25)

From the definition, R² ranges from 0 to 1 and indicates how much of the dependent variable's variability (quantified by statistical measures like the variance or standard deviation) can be explained by the independent variables. For example, R² = 0.66 suggests that 66% of the variation in the dependent variable is captured or explained by the model, and the remaining 34% is caused by factors not included in the model or by inherent randomness. However, the coefficient (25) automatically increases if an extra independent variable is added to the model. To address such a limitation, its adjusted version is formed as

R²_adj = 1 − (1 − R²)(n − 1)/(n − m − 1)

by considering the degree of freedom of the model, which ensures that the inclusion of additional variables is justifiably reflected in the overall explanatory power of the model. The adjusted version R²_adj is usually less than the original R². Further, the F-test statistic can be expressed with R² as

F = (R²/m)/((1 − R²)/(n − m − 1)),

which evaluates the overall significance of the independent variables within the HPM when estimating house prices. If the result is significant, it underscores the collective impact of these variables on the housing values, affirming the statistical significance of the HPM analysis; a non-significant outcome, however, suggests that the model lacks statistical efficacy.
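The three fit measures can be computed jointly; a small NumPy sketch under the weighted definitions above (`goodness_of_fit` is a hypothetical helper; with Qy omitted it reduces to the familiar OLS quantities):

```python
import numpy as np

def goodness_of_fit(y, A, Qy=None):
    """R^2, adjusted R^2, and the F statistic for the (weighted) linear model.

    SSE0 comes from the intercept-only null model, i.e., the (weighted)
    total sum of squares."""
    n, m1 = A.shape
    m = m1 - 1
    W = np.eye(n) if Qy is None else np.linalg.inv(Qy)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    e = y - A @ beta
    sse = e @ W @ e
    ones = np.ones(n)
    b0 = (ones @ W @ y) / (ones @ W @ ones)   # LS estimate of the null model
    e0 = y - b0 * ones
    sse0 = e0 @ W @ e0                        # (weighted) total sum of squares
    r2 = 1 - sse / sse0
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - m - 1)
    F = (r2 / m) / ((1 - r2) / (n - m - 1))
    return r2, r2_adj, F
```

For the unweighted case with an intercept, R² equals the squared correlation between the observed and fitted values, which provides a quick cross-check.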

Monte Carlo Simulations
TLS is statistically superior to LS in estimating the EIV model. To show such an improvement quantitatively in the hedonic analysis, Monte Carlo simulations are designed. Xu et al. [67] show that the bias of using LS under the EIV model depends on the parameter magnitudes and the covariance matrix of the random errors (or the noise level in the homogeneous case). Therefore, in order to investigate the superiority of TLS over LS in the hedonic analysis, we generate data from a real house pricing example, so that the magnitudes of the parameters and the structure of the coefficient matrix are close to practical situations. Based on this, we select a real example reported by Wooldridge [63] (p. 211). To ensure the accuracy of the simulations, we employ the TLS method to correct the noisy data and then take the corrected data as the ground truth in the following experiments.
To show the superiority of TLS in hedonic house pricing analysis, we design two experiments: the first is for parameter estimation and the second is for prediction. The dataset consists of n = 88 observations and the regression equation reads

log(price) = β_0 + β_1 log(lotsize) + β_2 log(sqrft) + β_3 bdrms + β_4 colonial, (28)

where "price" is the house price in $1000; "lotsize" is the size of the lot in square feet; "sqrft" is the size of the house in square feet; "bdrms" is the number of bedrooms; and "colonial" is an indicator variable that equals 1 if the house is of a colonial style and 0 otherwise. As "bdrms" and "colonial" are fixed, we assume that the cofactor matrices take the form Q_y = I_n and Q_A = v ⊗ I_n with v = diag{0, 1, 1, 0, 0}, where diag{·} represents the operator that constructs a diagonal matrix according to its argument. Since the Monte Carlo simulations are conducted based on the ground truths, we conduct the TLS estimation for the whole system by setting σ² = σ_A² = 0.10², and we regard the estimates as y and A.
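The Kronecker structure of the cofactor matrix, with zero diagonal entries for the errorless columns (intercept, bdrms, and colonial), can be assembled and sampled from directly; a NumPy sketch (the column weights follow the setup above):

```python
import numpy as np

n = 88                                     # number of observations
v = np.diag([0.0, 1.0, 1.0, 0.0, 0.0])    # zero entries mark errorless columns
QA = np.kron(v, np.eye(n))                 # cofactor of vec(E), order n(m+1)

# Draw a random error matrix E consistent with Sigma_A = sigma_A^2 * QA:
sigma_A = 0.10
rng = np.random.default_rng(0)
E = rng.normal(0.0, sigma_A, (n, 5)) * np.sqrt(np.diag(v))  # scales each column
```

Scaling each column of a standard normal draw by the square root of the corresponding diagonal entry of v yields vec(E) with covariance σ_A² Q_A, so the errorless columns stay exactly zero.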

Parameter Estimation
The purpose of this experiment is to verify that TLS can provide a more accurate parameter estimator than OLS.We set σ = 0.10 and varied σ A from 0.01 to 0.20 with increments of 0.01, conducting a total of 20 experiments to show such an improvement at different noise levels.In order to show the statistical performance, the experiment with a specific noise level was replicated 500 times [68].
The steps of the simulation are listed below.
1. Conduct 500 replicated trials and, in each trial, generate noise according to σ and σ_A and add it to the ground-truth y and A.
2. In each trial, perform the OLS and the TLS estimations.
3. Compute the root mean square error (RMSE) of each parameter for these two schemes.
4. Compute the sum of the RMSEs of the five parameters for OLS and TLS, respectively.
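To make the replication scheme concrete, here is a stripped-down NumPy version for a single-parameter EIV regression (purely illustrative numbers, not the paper's dataset): it repeats the noise generation and estimation 500 times and computes the RMSE of the OLS slope, whose systematic bias toward zero (attenuation) is exactly what TLS is designed to remove.

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 88, 500
beta_true = 0.8
x = rng.uniform(1.0, 3.0, n)       # ground-truth regressor (no intercept here)
sigma, sigma_a = 0.10, 0.20        # noise levels for y and for the regressor

est = np.empty(trials)
for t in range(trials):
    y = beta_true * x + rng.normal(0.0, sigma, n)   # noisy dependent variable
    x_obs = x + rng.normal(0.0, sigma_a, n)         # EIV: noisy regressor
    est[t] = (x_obs @ y) / (x_obs @ x_obs)          # OLS slope through the origin

rmse_ols = np.sqrt(np.mean((est - beta_true) ** 2))
mean_ols = est.mean()              # sits below beta_true due to attenuation
```

The replicated mean lands below the true slope because the noisy regressor inflates the denominator of the OLS estimator; the RMSE bookkeeping mirrors steps 3 and 4 above on a one-parameter scale.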
The computed RMSEs are demonstrated in Figure 2, from which we can see the following.
1. For β_0, β_2, and β_3, the RMSEs of TLS are much smaller than those of OLS. In addition, the improvement becomes more significant as σ_A increases. With σ_A = 0.20, the improvement ratios (the percentage reduction in the RMSE of TLS relative to OLS) for β_0, β_2, and β_3 are 73.72%, 73.39%, and 55.31%, respectively.
2. For β_1 and β_4, the RMSEs of TLS are comparable to those of OLS. However, the magnitudes of the RMSEs of these two parameters are much smaller than those of the other three parameters (particularly β_0, the intercept). In terms of the sum of the RMSEs, TLS significantly outperforms OLS, achieving a 71.22% improvement with σ_A = 0.20.
3. In the setting of the covariance matrix, we assume the coefficients corresponding to β_0, β_3, and β_4 to be errorless. However, the estimates of these parameters are still significantly influenced. Therefore, although we have set some variables to be non-stochastic, the corresponding parameters are also affected in the estimation. More specifically, in the EIV model, all parameters can be biased if LS is applied. For the analytical bias of LS (or, approximately, the difference between LS and TLS), one can refer to [67].

Price Prediction
This part is designed to compare the performance of OLS and TLS in terms of prediction. For such a purpose, we partition the observations (both y and A) into two parts, i.e.,

y = [y_1^T, y_2^T]^T, A = [A_1^T, A_2^T]^T,

where y_1 and y_2 are 50- and 38-dimensional vectors, and A_1 and A_2 are of order 50 × 5 and 38 × 5, respectively. Specifically, the first 50 observations form the training set (y_1 and A_1), and the remaining 38 observations form the validation set (y_2 and A_2). In the simulations, the cofactor matrices for the training set (Σ_1) and the validation set (Σ_2) are constructed similarly to Σ: Σ_1 is formed with Q_{y_1} = I_50 and Q_{A_1} = v ⊗ I_50, and Σ_2 with Q_{y_2} = I_38 and Q_{A_2} = v ⊗ I_38. With σ = 0.10, we consider two experiments by setting σ_A = 0.01 and σ_A = 0.05, respectively. In each experiment, the following steps are conducted 1000 times.

1. Generate noise e_1 from the normal distribution N(0, Σ_1).
2. Reconstruct e_{y_1} and E_1 from e_1, and then form the noisy training data y_1 + e_{y_1} and A_1 + E_1.
3. Perform the estimations to obtain β̂_OLS and β̂_TLS.
4. Repeat the predictions 500 times via the following: (a) generate noise e_2 from the normal distribution N(0, Σ_2); (b) reconstruct e_{y_2} and E_2 from e_2, and then form the noisy validation data; (c) compute the prediction discrepancy norms τ_OLS = ∥y_2 − A_2 β̂_OLS∥ and τ_TLS = ∥y_2 − A_2 β̂_TLS∥; (d) record the ratio of the number of times that TLS has the smaller norm in these 500 predictions.
The results are shown in Figure 3, from which we can see the following. (1) The points (ratios) are more densely distributed between 0.5 and 1.0 (i.e., TLS outperforms OLS). More specifically, the ratios of τ_TLS < τ_OLS in the two cases are 75.50% and 61.90%, respectively. (2) In the case of A_2 with larger uncertainty (i.e., larger σ_A), the ratio is smaller. This phenomenon will be analytically discussed in Section 5.

Boston Dataset Analysis
In this section, the Boston house price dataset is analyzed, which was initially presented by [69] in their hedonic analysis of the demand for clean air. It is popular and has been used in many studies, such as robust estimation [70,71], residual normality analysis [72], and non-parametric estimation [73,74]. The original data of n = 506 census tracts were published by [75] and found to contain several incorrectly coded observations. In our experiment, the corrected dataset provided by [76] is utilized. The descriptions of the dependent and m = 13 independent variables in the dataset are listed in Table 1. To implement TLS, we have to first determine the cofactor matrix Q_A. The values of LSTAT and CRIM are likely to have large uncertainty, as the socioeconomic indicators are often determined based on sample surveys, which are susceptible to sampling and recording biases. In contrast, the variable CHAS is deemed almost fixed, primarily because it is based on clear geographical features with minimal variability and subjectivity. The uncertainty of the remaining variables, which is caused by factors such as outdated data or limitations in measurement methods, is assumed to be the same. Given the uncertainty in the variables, we subjectively set Q_y = I_n and Q_A = v ⊗ I_n with v = diag{0, 1, 0.1, 0.1, 0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 1}.

The results for the three cases can be summarized as follows.
1. The parameter estimates differ. The norms ∥β̂_OLS − β̂_TLS∥₂ in the three cases are 0.0151, 0.1548, and 0.5659, which show that the difference between OLS and TLS becomes significant as the noise level of the design matrix increases.
2. The significance analysis of the parameters differs. For the first two cases, OLS and TLS identify the same significant parameters. However, for the third case, TLS regards AGE as significant while OLS does not.
3. TLS fits the data better than OLS. For R² and R²_adj, TLS produces higher values than OLS, indicating stronger explanatory power for the observed data; for the VF, TLS produces a lower value than OLS, indicating a closer fit to the observed data; for the F-test statistic, indicative of the overall significance of the regression, TLS produces a higher value than OLS, reinforcing the evidence of a statistically sounder model. It is worth mentioning that the effects of EIV on the VF have been systematically investigated by [77], who shows that OLS always overestimates the VF, which verifies our conclusion.
Next, we further analyze the results for different parameters. (i) In all three cases, CRIM, DIS, RAD, TAX, and PTRATIO are regarded as significant, which shows a stable and robust relationship between these variables and the dependent variable. (ii) For NOXSQ, as we vary σ_A, both the coefficient value and the t-statistic change dramatically. This suggests that NOXSQ is a relatively sensitive parameter, which coincides with the analysis in previous work [69], since such a model was formed mainly to investigate the effects of the nitrogen oxide concentration. (iii) For AGE, the significance analysis results differ completely across the model assumptions. This result is consistent with the conclusion of [78], from which we can infer the following reason: older houses are often located in areas closer to urban centers, which may be associated with lower status. Therefore, AGE as a variable demonstrates complex interaction effects, and its significance is subject to change as σ_A increases. TLS and LS yield different significance analysis results because TLS accounts for errors in all variables, leading to different parameter estimates and posterior variances compared to LS. This discrepancy impacts the statistical testing results, as shown in our previous study [67].
The squared residuals are depicted in Figure 4, from which we can see the following. (1) Compared with OLS, TLS has a smaller mean value of the squared residuals. Specifically, as σ_A increases, the average squared residual of TLS decreases significantly; for the largest σ_A, the mean squared residual of TLS is about 41.94% lower than that of OLS. (2) The squared residuals of TLS are more concentrated. In the interval of squared residual values from 0 to 0.10, OLS and TLS with the different σ_A settings contain 470 (91.70%), 471 (93.08%), 475 (93.87%), and 491 (97.04%) points, respectively. (3) OLS produces some extreme residuals; in contrast, TLS, particularly with σ_A = 0.06 and σ_A = 0.10, constrains these extreme residuals more effectively. This indicates that TLS is more appropriate for handling such extreme observed values. (4) It can be seen from the shape of the violin plots for the TLS model that, as σ_A increases, the kernel density estimate (KDE) shows that most of the squared residuals are more concentrated around the median, indicating less variability and more stable model performance.

Practical Tests and Analysis
In this section, "Guanshan Boulevard" in Wuhan is utilized as the study area. The data were manually collected from unrestricted public domains. In addition to parameter estimation, we also analyze the performance of OLS and TLS in house price prediction.

Study Area and Data Source
The studied real estate market is on Guanshan Boulevard in Wuhan, China, whose area is roughly 4.11 square kilometers; see Figure 5. As the employment population grows, the real estate market experiences a boost, creating a mutually reinforcing cycle of development. The listing price of the house and some of the house characteristics are obtained from the Lianjia website (http://www.lianjia.com/, accessed on 23 November 2023) and the AMAP website (http://www.amap.com/, accessed on 23 November 2023). To avoid potential biases caused by real estate market segmentation, the selected residential samples are all commercial projects. Low-rent housing provided by the government as social welfare is not included in the study. In addition, our study only includes closed high-rise residential buildings, excluding villas and bungalows located in the study area. Records with missing data for any feature were removed from the dataset. All data points were recorded from June 2022 to November 2023.

Data Preparation
For the independent variables, we initially collected a series of variables to capture the residential characteristics, referencing mainstream variables [1,2,79,80] and adapting them to the local real estate market.

• Structural attributes. We select management fees, the ratio of elevators to residents, the ratio of parking spaces to residents, the total number of functional rooms, the living room orientation, the building type, the housing year, the green space rate, and the building's floor area ratio.
• Neighborhood attributes. To account for the educational level, we compile diverse data points (number, distance, and quality) for kindergartens, primary schools, and middle schools. For medical services, we assess the distance to the nearest tertiary hospital. For commercial services, we evaluate the availability of nearby supermarkets, malls, and other amenities. For the level of leisure, we count the parks and attractions within a 3-kilometer radius of the residential community.
• Locational attributes. We only select the logarithm of the distance (m) to the nearest metro station and bus station. This is because all house samples are within a small area, and their external location factors, such as the distance to the Wuhan Central Business District (CBD) or the distance to large landscapes (East Lake, etc.), do not show significant changes.
For the dependent variable (i.e., the house price), we have the following three preprocessing steps. (1) Unit price calculation. House prices are preprocessed by calculating the unit price in yuan per square meter from the total listing price and the building area. (2) Floor-level standardization. To mitigate the nonlinear effects of the floor levels, the prices are standardized across different floors using a correction coefficient, following the local guidelines specific to floor-level adjustments. (3) Transaction date adjustment. The transaction dates are adjusted using the average price change rate, converting each transaction's unit price into its value at the valuation time point.
During the exploratory data analysis, we evaluate the significance of each independent variable using p-values and address potential collinearity by calculating variance inflation factors.We meticulously screen the housing characteristic variables pertinent to our study area.In the end, a refined set of m = 10 characteristic variables is selected to construct our HPM; see Table 3.
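The collinearity screening step can be reproduced with a basic variance-inflation-factor computation; a NumPy sketch (`vif` is a hypothetical helper, following the standard definition VIF_j = 1/(1 − R_j²)):

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (no intercept column).

    Each column is regressed on the remaining columns (with an intercept),
    and VIF_j = 1 / (1 - R_j^2)."""
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        yj = X[:, j]
        Aj = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        bj = np.linalg.lstsq(Aj, yj, rcond=None)[0]
        res = yj - Aj @ bj
        sst = (yj - yj.mean()) @ (yj - yj.mean())
        out[j] = 1.0 / (1.0 - (1.0 - res @ res / sst))
    return out
```

Values near 1 indicate no collinearity; a common rule of thumb flags variables with VIF above roughly 10 for removal or combination, which matches the screening described above.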

Parameter Estimation
In this part, we have a total of n = 200 house transaction sample points. We use the actual values (i.e., numerical data) for some variables, rather than converting them into categories or levels (i.e., categorical data), to avoid subjectivity when dividing levels. Despite this, the characteristic variables still exhibit varying levels of noise, which is attributed to the limitations and ambiguities in publicly available information. For instance, the "distance to the nearest hospital" may not be individually measured for each property but is instead represented by the average distance for the entire neighborhood, which fails to accurately reflect the attributes of individual properties. Based on a comprehensive analysis of the accuracy of the obtained data, we set Q_A = v ⊗ I_n with v = diag{0, 0.1, 0.1, 0.2, 0.3, 0.2, 0.1, 0.4, 0.4, 0.1, 0.4}.
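As an illustration, this cofactor matrix can be assembled directly with a Kronecker product. The ordering convention assumed here (v ⊗ I_n, i.e., one n × n identity block per column of the coefficient matrix A) is how the block structure is read in this sketch; the leading zero marks the error-free intercept column:

```python
import numpy as np

n = 200
# Relative variances assumed per column of A (0 = error-free intercept).
v = np.diag([0, 0.1, 0.1, 0.2, 0.3, 0.2, 0.1, 0.4, 0.4, 0.1, 0.4])
Q_A = np.kron(v, np.eye(n))  # cofactor matrix, shape (11n, 11n)
```

For larger problems one would exploit the block-diagonal structure rather than forming the dense (11n) × (11n) matrix explicitly.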
The results for parameter estimation are shown in Table 4. At the 99% confidence level, the critical F-distribution value for the corresponding degrees of freedom is 2.41594. The F-test statistic values for the OLS and TLS methods in this HPM are 82.0777 and 87.7251, respectively, far exceeding this threshold and thereby demonstrating the effectiveness of the HPM. From the remaining performance indicators, the TLS method produces a higher R² and R²_adj and a lower VF than the OLS method, indicating that TLS better fits the model, which is consistent with the conclusion drawn for the Boston dataset. For the parameter estimates in this study area, (1) the two methods agree that the following parameters are significant: PSCHOOL, DHOSPITAL, FEE, COMMERCIAL, RPARKING, BUILDINGTYPE, RGREENING, and NROOMS. This suggests that the importance of these factors in the real estate market is generally recognized. That said, the impact of RPARKING on housing prices, which is usually considered important, is relatively small in this area. The reason may be that residents prefer to park their vehicles in areas with no parking fees, such as on the streets outside the community. This preference reduces the demand for parking spaces within the community, thereby diminishing the influence of onsite parking facilities on property values in this region. (2) However, MSCHOOL and DISTANCE show different results in the two models: (i) MSCHOOL is significant in TLS (−0.1053 **) but not in OLS (−0.0804). The premium on housing prices due to educational resources is primarily a result of the "Nearby Enrollment" policy (i.e., students attend schools based on their residential location). Therefore, homes in districts with high-quality education are particularly favored by parents. This is especially true for primary schools, where enrollment strictly depends on the residential address. For middle schools, while many well-known ones require entrance exams, reducing the impact of location, the need to shorten children's commuting times and to boost the chances of entering a top-tier school at the next educational level still results in a premium for housing near middle schools. Thus, both primary and middle schools (i.e., the compulsory education stage) exhibit a significant school district effect, directly contributing to the rise in housing prices. This aligns with the findings of some Chinese scholars [19,81].
(ii) For DISTANCE, TLS indicates greater significance (−0.0892 ***) than OLS (−0.0642 *). Although our study area is relatively small and traffic is generally good, accessibility still exerts a significant impact on housing prices, even in areas with well-developed transportation.
Different methods can yield a different understanding of the determinants of real estate values. In this experiment, by additionally accounting for the randomness of the independent variables, TLS gains more insight than OLS into factors like MSCHOOL and DISTANCE.
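For contrast with OLS, the classical unweighted TLS solution can be obtained from an SVD of the augmented matrix [A, y]. This minimal sketch illustrates the errors-in-variables idea for the homoscedastic case only; it is not the weighted Gauss-Newton iteration derived in the paper:

```python
import numpy as np

def tls(A: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Classical (unweighted) TLS via the SVD of [A, y].

    Minimal EIV sketch: the estimate comes from the right singular
    vector belonging to the smallest singular value of the augmented
    matrix. Assumes equal, independent noise in all entries.
    """
    C = np.column_stack([A, y])
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[-1]                 # right singular vector, smallest sigma
    return -v[:-1] / v[-1]     # valid when v[-1] != 0
```

With column-wise heteroscedastic noise, as assumed via Q_A above, the SVD shortcut no longer applies and an iterative weighted estimator is required.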

Price Prediction
Besides analyzing the influences on property prices, we further collected 80 validation points y_p,i, together with the independent variables contained in a_p,i (i = 1, ..., 80), to test the performance in prediction. Based on the previously estimated parameters (i.e., Table 4), the predicted prices are obtained as ỹ_p,i = a_p,i^T β̂. By comparing the predicted values with the observed values y_p,i, we can calculate the relative error

RE_i = |ỹ_p,i − y_p,i| / y_p,i × 100%.

The summary statistics of the relative errors are presented in Table 5. In addition, the frequency histograms (together with the KDE) and the boxplots of the relative errors are shown in Figure 6. From these results, we can see the following. (1) In Table 5, most quantities corresponding to TLS are slightly smaller than those of OLS. To be more specific, for 62.50% of the sampled points, TLS provides a more accurate predicted value. (2) For OLS, the relative errors range from 0.07% to 1.94%, while those of TLS range from 0.02% to 1.83%, a marginally narrower range. (3) Both methods show similar patterns in the RE distribution, primarily concentrated in the lower error ranges. However, the frequency of TLS in the highest error range (i.e., the last bin of the histogram) is smaller than that of OLS, suggesting that TLS could offer greater robustness in certain scenarios.
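The relative-error evaluation can be sketched as follows; the function and variable names are illustrative, and the predictions are assumed to come from the estimated parameter vector as in the text:

```python
import numpy as np

def relative_errors(A_p: np.ndarray, y_p: np.ndarray,
                    beta_hat: np.ndarray) -> np.ndarray:
    """Percentage relative errors |y_pred - y| / y for y_pred = A_p @ beta_hat."""
    return 100.0 * np.abs(A_p @ beta_hat - y_p) / np.abs(y_p)

def tls_win_ratio(re_ols: np.ndarray, re_tls: np.ndarray) -> float:
    """Share of validation points where TLS has the smaller relative error."""
    return float(np.mean(re_tls < re_ols))
```

Applied to the OLS and TLS estimates, `tls_win_ratio` yields the kind of head-to-head percentage reported above.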
Therefore, TLS slightly outperforms (or is at least comparable to) OLS in terms of prediction accuracy. Let us analyze this phenomenon from a theoretical perspective. We denote the estimation discrepancy as ϵ = β̂ − β; then, the prediction discrepancy is

ϵ_p,i = ỹ_p,i − y_p,i = a_p,i^T ϵ + e_p,ai^T β − e_p,i,

where e_p,ai and e_p,i are the random errors corresponding to a_p,i and y_p,i, respectively, i.e., y_p,i − e_p,i = (a_p,i − e_p,ai)^T β. Provided that the training observations are statistically independent of the validation observations, taking the expectation of ϵ_p,i yields the prediction bias

E{ϵ_p,i} = ā_p,i^T E{ϵ}.

From this expression, we can see that although TLS has a smaller estimation bias norm (i.e., ∥E{ϵ}∥₂), it is not guaranteed to produce a smaller prediction bias because of the combination ā_p,i^T. In addition, in a single experiment, the prediction discrepancy additionally depends on the uncertainty of the independent variables a_p,i and the uncertainty of y_p,i. Therefore, although TLS greatly outperforms OLS in parameter estimation, its advantage in prediction is marginal, even in simulations.

Discussion
In the simulations, the noise levels were systematically explored by designing experiments with a fixed σ and incrementally increasing σ_A. TLS demonstrates stronger explanatory power and a closer fit to the observed data. The RMSE is significantly reduced, with improvements exceeding 70% for the sum of the RMSEs at σ_A = 0.20. Notably, even the estimates of non-stochastic parameters are influenced by the randomness of other variables. A comparison of the TLS and LS results confirms the following.

•
For parameter estimation, TLS consistently achieves a higher R² and R²_adj, a lower VF, and a higher F-test statistic in the analysis of both the Boston and Wuhan datasets. This demonstrates that TLS has stronger explanatory power and a closer fit to the observed data. Furthermore, TLS aligns more closely with the findings of previous studies [19,69,78,81]. Importantly, TLS effectively bounds extreme data points, enhancing the reliability of the estimates. Moreover, TLS highlights the importance of factors such as educational resources for middle schools and proximity to metro stations, which OLS tends to underestimate in the Wuhan dataset.

•
For price prediction, the performance advantage of TLS over OLS diminishes with increasing uncertainty (i.e., larger σ_A) in the simulations. This is also evident in the Wuhan dataset, in which TLS outperforms OLS for 62.50% of the observations, and most statistics of the relative errors are slightly better. We believe this limited advantage stems from the additional prediction discrepancies that depend on the uncertainties of the dependent and independent variables.
In the real examples, similar conclusions can be drawn. One may note the minor increase in R² observed when applying TLS compared to OLS, which can be attributed to three main factors. (1) Similar magnitudes of R² do not imply similar estimation results. From the definition of R², we can see that it only considers the estimated parameters, while the uncertainty of the data (or the difference in the posterior precision) is completely ignored. In [67], it is shown that LS tends to be optimistic in its precision assessment. Thus, in the real example, although the improvement in R² seems marginal, the significance analysis results are very different. (2) A higher R² does not always indicate better performance. By assuming a higher noise level in the coefficient matrix, the improvement ratio is expected to increase; however, this might not yield meaningful results and could lead to overfitting. (3) The enhancements of TLS over LS are significantly influenced by the noise levels in the coefficient matrix. Despite the minimal changes in R², substantial gains in other metrics, such as the RMSE and the variance factor, are evident. Collectively, these points demonstrate that the advancements provided by TLS are both substantive and beneficial.
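For reference, the fit indicators discussed above can be computed as follows. The variance factor convention used here (SSE over the redundancy n − m, relative to a prior variance σ₀²) is one common choice and may differ from the paper's exact definition:

```python
import numpy as np

def fit_metrics(y: np.ndarray, y_fit: np.ndarray,
                m: int, sigma0_sq: float = 1.0):
    """R-squared, adjusted R-squared, and a posterior variance factor.

    m counts the estimated parameters excluding the intercept; the VF
    formula (SSE / ((n - m) * sigma0_sq)) is an assumed convention.
    """
    n = len(y)
    sse = float(np.sum((y - y_fit) ** 2))
    sst = float(np.sum((y - np.mean(y)) ** 2))
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)
    vf = sse / ((n - m) * sigma0_sq)
    return r2, r2_adj, vf
```

As noted in point (1) above, R² depends only on the fitted values; the VF additionally reflects the assumed data precision, which is why the two indicators can disagree.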

Conclusions and Outlook
This article introduces an advanced approach to analyzing housing prices using TLS estimation within the HPM. It comprehensively addresses errors in both the dependent and independent variables, making it suitable for real estate data characterized by measurement inaccuracies and subjectivity. In addition, a Gauss-Newton-type iterative algorithm is derived, and a posterior precision assessment is given.
In both the simulated and real examples, the application of TLS in the HPM is shown to enhance the explanatory power and accuracy of the model. Our work enriches the HPM framework; it not only facilitates a thorough analysis of the relevant variables but also more accurately assesses their significant impacts. This helps us navigate the complexities of the real estate market and make more informed decisions.
The formulation presented in this paper also has the potential to incorporate correlations. The consideration of spatial effects (such as spatial autocorrelation and heterogeneity) is necessary for spatial data analysis [82-84]. In our formulation, we will next introduce assumptions about the covariance matrix Σ to account for spatial correlations among the sampled points and even cross-correlations between the dependent and independent variables. This will be part of our future research, aimed at refining our method.

Figure 1 .
Figure 1. Fitting results for the house price data in [63]: (a) LS results; (b) TLS results.
Simulation procedure for each repetition: (a) generate the noise e from the normal distribution N(0, Σ); (b) reconstruct e_y and E from e, and then form y = ȳ + e_y and A = Ā + E; (c) perform the estimations to obtain β̂_OLS and β̂_TLS; (d) record the discrepancy vectors ϵ_OLS = β̂_OLS − β and ϵ_TLS = β̂_TLS − β.
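Steps (a)-(d) of one repetition can be sketched as follows. OLS is shown as the estimator; the TLS step would call the paper's weighted iterative estimator, which is not reproduced here, and the independent Gaussian noise model is an assumption for this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def monte_carlo_trial(A_bar: np.ndarray, y_bar: np.ndarray,
                      beta_true: np.ndarray,
                      sigma: float, sigma_A: float) -> np.ndarray:
    """One repetition: perturb the data, re-estimate, record the discrepancy."""
    n, m = A_bar.shape
    # (a)-(b) draw the noise and form the observed y and A
    y = y_bar + rng.normal(0.0, sigma, n)
    A = A_bar + rng.normal(0.0, sigma_A, (n, m))
    # (c) estimate (OLS as a stand-in; TLS would replace lstsq here)
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    # (d) discrepancy vector for this trial
    return beta_hat - beta_true
```

Repeating this over many trials and averaging the squared discrepancies gives the per-parameter RMSEs reported in the simulations.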

Figure 2 .
Figure 2. Results of RMSEs in the simulation of parameter estimation: (a-f) correspond to six parameters.

Figure 3 .
Figure 3. Ratios by which TLS has a smaller prediction discrepancy norm than OLS in 1000 repeated trials.

Figure 4 .
Figure 4. Distributions of squared residuals and the corresponding half violin plots for OLS and TLS in three cases.

Figure 6 .
Figure 6. The relative errors: (a) the frequency histograms (together with the KDE); (b) the boxplots.

Table 1 .
Definitions of dependent and independent variables.

Table 2 .
Estimation results produced by OLS and TLS in three cases (coefficients and test statistics marked with *** and ** are significant at the 99% and 95% levels, respectively).

Table 3 .
Definitions of variables in the example of the Wuhan dataset.

Table 4 .
Estimation results produced by OLS and TLS in the real data (coefficients and test statistics marked with ***, **, and * are statistically significant at the 99%, 95%, and 90% levels, respectively).

Table 5 .
The summary statistics of the relative errors.