Article

Advanced Feature Engineering and Machine Learning Techniques for Highly Accurate Price Prediction of Heterogeneous Pre-Owned Cars

by Imran Fayyaz, G. G. Md. Nawaz Ali * and Samantha S. Khairunnesa
Department of Computer Science and Information Systems, Bradley University, Peoria, IL 61625, USA
*
Author to whom correspondence should be addressed.
Vehicles 2025, 7(3), 94; https://doi.org/10.3390/vehicles7030094
Submission received: 11 July 2025 / Revised: 24 August 2025 / Accepted: 3 September 2025 / Published: 6 September 2025

Abstract

The rapid growth of the automobile industry has intensified the demand for accurate price prediction models in the used car market. Buyers often struggle to determine fair market value due to the complexity of factors such as mileage, brand, model, transmission type, accident history, and overall condition. This study presents a comparative analysis of machine learning models for used car price prediction, with a strong emphasis on the impact of feature engineering. We begin by evaluating multiple models, including Linear Regression, Decision Trees, Random Forest, Support Vector Regression (SVR), XGBoost, Stacking Regressor, and Keras-based neural networks, on raw, unprocessed data. We then apply a comprehensive feature engineering pipeline that includes categorical encoding, outlier removal, data standardization, and extraction of hidden features (e.g., vehicle age, horsepower). The results demonstrate that advanced preprocessing significantly improves predictive performance across all models. For instance, the Stacking Regressor’s R² score increased from 0.14 to 0.8899 after feature engineering. Ensemble methods, such as CatBoost and XGBoost, also showed strong gains. This research not only benchmarks models for this task but also serves as a practical tutorial for fellow researchers, illustrating how engineered features enhance performance in structured ML pipelines. The proposed workflow offers a reproducible template for building high-accuracy pricing tools in the automotive domain, fostering transparency and informed decision making.

1. Introduction

The used car market has experienced significant growth in recent years, driven by rising consumer demand for affordable transportation and the rise of online car marketplaces. Accurately predicting the price of a pre-owned vehicle is a challenging task due to the numerous factors influencing its value, including mileage, brand, model, transmission type, accident history, and overall condition. Traditional car valuation methods rely on expert assessments or basic statistical models, which often fail to capture the complex, nonlinear relationships among vehicle attributes.
With advancements in machine learning (ML), predictive models have become increasingly effective in estimating used car prices based on historical sales data. This study explores the application of various ML algorithms, including XGBRegressor [1], Stacking Regressor [2], Linear Regression [3,4], Random Forest [5], Support Vector Regression (SVR) [6], and Keras-based neural networks [7], to develop an accurate and scalable price prediction model.
A crucial component of this research is feature engineering [8], which plays a vital role in enhancing model performance. Key transformations applied in our pipeline include
  • Extracting horsepower and engine displacement from engine specification strings.
  • Converting categorical features into numerical representations using appropriate encoding techniques.
  • Standardizing transmission types via regular expression-based mapping.
  • Applying target encoding to brand data based on average vehicle price [9].
  • Encoding binary and nominal features using label and binary encoding.
  • Identifying and removing statistical outliers [10].
The dataset used in this study [11] contains 4009 entries and includes attributes such as transmission type, mileage, brand, exterior and interior color, manufacturing year, clean title status, fuel type, accident history, engine details, and selling price. By applying ML techniques to both raw and feature-engineered versions of the data, we aim to build robust models that provide reliable, data-driven price estimations.
By comparing model performance before and after feature engineering, this study offers practical insights into how structured data transformations significantly enhance predictive accuracy. While feature engineering is a well-known practice in machine learning, this study is novel in how it systematically quantifies that impact across multiple regression models, ranging from linear baselines to advanced ensembles, on real-world, heterogeneous car price data. Most existing works either apply standard encodings without comparison or skip transformation impact altogether. Our analysis demonstrates that thoughtful, domain-informed preprocessing can improve R² scores by over 10× in some cases, as seen with Linear Regression. We position this paper not as a model novelty study, but as a reproducible blueprint for performance optimization in structured regression problems, with particular focus on the often under-reported benefits of comprehensive feature engineering.
The remainder of the paper is structured as follows: Section 2 reviews related work and highlights gaps in current car price prediction studies. Section 3 describes the dataset and preprocessing pipeline. Section 4 presents the exploratory data analysis and correlation insights. Section 5 details the feature engineering steps and transformation logic. Section 6 discusses the train–test strategy. Section 7 evaluates various machine learning models with and without feature engineering. Section 8 presents a comparative analysis highlighting the performance improvements. A summary discussion and insights on the obtained results, limitations, and challenges are presented in Section 9. Finally, Section 10 concludes the paper by restating the key findings.

2. Related Work

Numerous studies have applied machine learning techniques to predict used car prices, generally emphasizing algorithm selection while often employing only basic feature processing. Venkatasubbu and Ganesh [12] offered an early example, using multiple linear models (Lasso, multiple Linear Regression, and a Regression Tree) on a small curated dataset of 804 records from Kelley Blue Book. They reported an exceptionally high fit (R² ≈ 0.9927), but the limited scope of data and absence of advanced feature engineering undermined generalizability. More recent works expanded the modeling scope but still faced constraints in data richness or preprocessing. For instance, Zhu [13] experimented with a range of algorithms (Linear Regression, Decision Trees, Random Forests, and a neural network) on U.S. used car data with simple one-hot encoding of categorical variables. The best performance came from an XGBoost model (R² ≈ 0.79), yet the study did not explore more sophisticated preprocessing or ensemble combinations. Similarly, Lin [14] compared regressors on a CarDekho used car dataset that had undergone basic cleaning (handling missing values, outlier removal, etc.), finding that a Random Forest regressor outperformed linear models like ordinary least squares and Ridge regression. However, Lin’s results were constrained by a relatively small sample and signs of overfitting, highlighting the need for larger datasets and more diverse features or encoding techniques to boost model robustness. These studies established foundational models and confirmed that machine learning can capture core price patterns, but they largely relied on limited feature sets and did not fully leverage ensemble methods or deep feature enrichment.
Other researchers have sought to balance accuracy with simplicity or transparency by focusing on a few key predictors and straightforward models. For example, Baral and Zhou [15] restricted their model to the three most influential features (car model name, age, and mileage) and selected a subset of about 2000 records across four popular brands to train and test their models. Using this simplified scenario, they evaluated a trio of algorithms (Linear Regression, Decision Tree, and Random Forest), and found that the Random Forest achieved the highest performance with an R² of approximately 0.856. This minimalist feature approach underscores which factors drive price the most, but by design it may overlook other important variables (e.g., condition, location, and equipment level) that broader models would consider. Gupta et al. [16] likewise implemented a basic predictive pipeline with minimal feature transformation, focusing on standardizing categorical data and then applying conventional learning algorithms. Their study prioritized interpretability and practical feasibility—effectively providing a baseline model for price prediction—but did not report the use of modern ensemble or deep learning techniques, and performance metrics were not prominently discussed. In a related vein, Bukvić et al. [17] combined simple correlation analysis with supervised learning to estimate used vehicle prices. They first selected a small set of features based on correlation strengths and then applied relatively basic models (including a Linear Regression for numeric prediction and classification trees for categorizing price ranges). This approach yielded a respectable accuracy (reported R² ≈ 0.95), yet the modeling strategy was limited in diversity.
In summary, approaches that emphasize only a handful of features or straightforward pipelines can achieve reasonable results, but they tend to forgo the improvements attainable through richer feature engineering and more sophisticated model ensembles.
In contrast, several recent studies leverage advanced ensemble learning and richer feature sets to push predictive accuracy higher. Huang et al. [18] introduced a model fusion framework that combined multiple state-of-the-art learners—including gradient boosting machines (XGBoost, CatBoost, LightGBM) and an artificial neural network—into a single ensemble. This integrated model achieved an R² of 0.9845, demonstrating the power of ensembling diverse algorithms for used car price prediction. Notably, the feature engineering in Huang et al.’s work was limited to an automated forward selection process, and the study did not examine how additional feature transformations might further enhance performance. Another comprehensive approach is offered by Isaev et al. [19], who focused on a large-scale used car dataset scraped from an online marketplace (mashina.kg) comprising about 12,679 records spanning various car attributes. They engineered a broad set of features—including textual descriptions (e.g., whether the car has air conditioning, obtained via web-scraped ad data) converted into categorical indicators—and evaluated several powerful regressors (CatBoost, Random Forest, and Support Vector Machine) to identify the best fit. Isaev et al. reported that a CatBoost ensemble delivered the highest prediction accuracy on their test set (approximately 86% accuracy, corresponding to roughly R² ≈ 0.93), outperforming the other models on that task. These results underscore that boosting algorithms and ensemble techniques can substantially improve the predictive power in this domain. Nonetheless, even these sophisticated studies tend to optimize certain aspects in isolation—for example, focusing on algorithmic prowess (ensembles) without explicitly contrasting the impact of extensive feature engineering, or vice versa. Few provide a holistic examination of how each stage of the modeling pipeline (from data preprocessing to model selection) contributes to overall performance.
In reviewing the literature, a clear gap emerges: no single prior study fully integrates extensive domain-driven feature engineering with a wide-ranging model comparison on a common ground. Most efforts either limit features and concentrate on a few algorithms, or maximize algorithmic complexity on relatively raw features. Performance benchmarking across different modeling strategies under consistent conditions is seldom reported, making it difficult to discern the specific gains from feature engineering versus those from advanced algorithms. The present research addresses this gap by unifying both dimensions—by constructing a comprehensive pipeline that incorporates aggressive feature engineering and then rigorously evaluates a broad spectrum of models—and by explicitly quantifying the contribution of engineered features to model accuracy. Building on insights from the studies above, our approach introduces domain-specific transformations (e.g., parsing textual descriptions into numerical features, encoding categorical variables in multiple ways, and normalizing mileage and age) and applies these to a publicly available used car sales dataset. We then benchmark over ten learning algorithms (including Linear Regression, Support Vector Regression, ensemble tree models like Random Forest, gradient boosters, a deep neural network, and a stacked ensemble) to compare their performance on identical data splits. This enables a direct assessment of how much each layer of improvement—from data preprocessing to choice of model—influences the predictive accuracy. Through this unified methodology, we aim to advance used car price prediction beyond the segmented efforts of prior work, offering a more robust and generalizable solution that addresses the limitations identified in earlier studies.
To further illustrate the performance improvements enabled by feature engineering, Figure 1 presents a visual comparison of R² scores across the relevant evaluated models, both before and after applying transformation techniques. The plot clearly demonstrates that ensemble-based models (particularly the Stacking Regressor, CatBoost, and XGBoost) benefited significantly from structured preprocessing, achieving the highest performance gains. Even simpler models, such as Linear Regression, showed measurable improvement following feature enhancement. The remainder of the paper demonstrates how this feature engineering is performed on the messy, publicly available dataset and how it improves prediction performance dramatically.

3. Datasets and Methods

The pre-owned cars dataset was taken from Kaggle [11]. To obtain accurate car price predictions, we performed extensive feature engineering, including filtering and removing unnecessary data, converting textual data to numerical form, removing outliers, and identifying important features through correlations [20]. A summary of the performed work is shown in the data flow diagram in Figure 2.

4. Exploratory Analysis

4.1. Dataset

We collected the freely available pre-owned car dataset from Kaggle [11]. The dataset consists of 4009 records with 12 features, representing key attributes influencing the price of used cars. These listings originate from U.S.-based online marketplaces, reflecting market conditions and buyer preferences in the United States. Therefore, feature effects, such as the impact of mileage, should be interpreted within the context of the U.S. used car market, where high mileage is typically associated with lower resale value.
The dataset includes the following attributes:
  • Brand: Categorical representation of car brands.
  • Mileage: The total distance the car has traveled.
  • Fuel Type: Encoded as integers, representing fuel categories (e.g., Petrol, Diesel, other, etc.).
  • Transmission: Encoded as integers, indicating Manual or Automatic transmission.
  • Exterior Color & Interior Color: Coded numerical representations of car colors.
  • Accident History: Indicates whether the car has been involved in an accident (binary feature).
  • Clean Title: A binary feature indicating whether the car has a clean ownership record.
  • Price: The target variable representing the selling price of the vehicle.
  • Age: The number of years since the car’s manufacture.
  • Horsepower: The power output of the car’s engine.
  • Engine Displacement: The engine size, measured in liters.
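To make the attribute list concrete, the snippet below builds a tiny two-row stand-in for the Kaggle listings; the column names and values are illustrative assumptions, not the dataset's exact schema:

```python
import io
import pandas as pd

# Hypothetical sample mirroring the raw listing format described above;
# column names and values are illustrative, not the exact Kaggle schema.
raw_csv = """brand,model_year,mileage,fuel_type,engine,transmission,accident,clean_title,price
Ford,2018,"45,000 mi.",Gasoline,2.3L 310HP,10-Speed A/T,None reported,Yes,"$28,500"
BMW,2015,"78,210 mi.",Gasoline,3.0L 320HP,8-Speed A/T,At least 1 accident or damage reported,Yes,"$21,900"
"""

df = pd.read_csv(io.StringIO(raw_csv))
print(df.shape)             # (2, 9)
print(df["price"].iloc[0])  # still a string: "$28,500"
```

Note how mileage and price load as strings, which is exactly what motivates the cleaning and feature engineering steps in Section 5.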

4.2. Graphical Analysis

To visually explore relationships between features, we generated a few visualizations, including a pair plot (Figure 3) and a correlation heatmap. These highlight the most influential variables affecting vehicle price, such as age and mileage, which show strong negative correlations.

4.2.1. Pairplot

To gain a better understanding of the dataset, visualizations were produced for the significant features [17]. A pair plot was generated for price, age, and mileage to observe the relationships between these numerical features. The plot demonstrates expected trends, such as higher-mileage cars tending to have lower prices and vice versa.

4.2.2. Outliers Removal

To ensure the trends observed in the pairplot were not due to extreme values, outliers were later removed from all features using the IQR method. While Figure 4 displays only a subset of scaled features for visualization purposes, the full dataset underwent outlier filtering. This confirms the robustness of the cleaned dataset used in subsequent analysis.

4.2.3. Correlation Analysis: Before and After Feature Engineering

To understand the impact of feature engineering, we analyzed Pearson correlations between features and price before and after transformation.
Figure 5 shows the correlation matrix using minimally processed data. As shown, the relationships with price are weak or misleading due to unstandardized formats, missing values, and unencoded categorical variables. For example,
  • Mileage has a weak negative correlation with price (r = −0.31).
  • Model year shows only a weak positive correlation (r = 0.20).
  • Brand, fuel type, color, and transmission exhibit near-zero correlation due to inconsistent encoding and categorical noise.
  • Horsepower and engine data were not yet extracted, so they do not contribute meaningfully.
Figure 6 shows the correlation matrix after full preprocessing and feature engineering. Key improvements include
  • Mileage now shows a strong negative correlation with price (r = −0.73), confirming that higher mileage reduces car value.
  • Age exhibits a strong negative correlation (r = −0.65), as expected.
  • Horsepower has a strong positive correlation (r = 0.61), now accurately extracted from engine specifications.
  • Brand also correlates well (r = 0.49), reflecting the effect of target encoding.
  • Engine displacement also correlates moderately with price (r = 0.25).
  • Accident history and manual transmission now show clearer negative correlations (r ≈ −0.27 to −0.22).
These improvements validate the importance of structured feature engineering for meaningful pattern extraction in tabular ML tasks.
To generate the refined correlation matrix shown in Figure 6, we first applied a set of feature engineering techniques, including the derivation of car age from manufacturing year, normalization of mileage values, and extraction of numerical fields such as horsepower and engine displacement using regular expressions. Additionally, categorical attributes such as brand, fuel type, and transmission were label encoded to enable numerical correlation analysis. Compared to Figure 5, this updated matrix reveals stronger and more interpretable linear associations, as the transformations reduced noise and aligned feature formats. In the baseline setup, only minimal preprocessing was performed, including column dropping, basic cleanup of missing values, and standard label encoding of categorical features, without any derived attributes or domain-informed transformations. These limitations led to weaker observed correlations in the raw data.

4.2.4. Brand Variance

Figure 7 shows the distribution of car manufacturers in the dataset. The brand feature was target-encoded, meaning each brand was assigned a numeric value based on its average vehicle price. This transformation helps the model capture brand-specific pricing trends more effectively.
The most frequently represented brands are Ford, BMW, and Mercedes-Benz, indicating strong market presence and demand. Other well-represented brands include Chevrolet, Porsche, and Toyota. In contrast, premium manufacturers such as Bugatti, Rolls-Royce, and Maybach appear infrequently, likely due to their high price and limited availability.
This results in a long-tail distribution: a small number of brands dominate the dataset, while many others have minimal representation. Such imbalance can potentially affect the model’s ability to generalize pricing patterns across less common manufacturers.
Figure 8 displays the feature importance derived from correlation coefficients with the target variable, vehicle price. Positive correlations indicate features that increase with price, such as horsepower and brand, while negative correlations reflect inverse relationships. Notably, mileage and age exhibit strong negative correlations (−0.70 and −0.60, respectively), confirming that older, high-mileage vehicles tend to be priced lower. In contrast, horsepower shows a strong positive correlation, aligning with market expectations that more powerful vehicles command higher prices. Categorical features like fuel type, transmission, and interior/exterior color show relatively weaker correlations, suggesting that their influence is either more nuanced or captured through interaction with other variables. This correlation-based ranking guided our selection and transformation of features during the engineering phase.

4.3. Impact of Feature Engineering on Model Performance

To further quantify the effectiveness of feature engineering, we evaluated five regression models on both raw and fully processed datasets. Figure 9 presents a side-by-side comparison of R² scores before and after feature transformation.
The results highlight a substantial improvement in predictive performance across all models. Linear Regression improved from an R² of just 0.03 to 0.79, while the Stacking Regressor achieved a leap from 0.15 to 0.88. Ensemble models such as CatBoost and XGBoost also showed significant gains, reinforcing the value of structured preprocessing.

5. Feature Engineering

Figure 10 shows a sample of the original dataset, including key features such as brand, model, mileage, fuel type, engine specification, transmission, color, accident history, and price. As evident, the raw data contains inconsistent formats, categorical text, and missing values, which makes it unsuitable for direct application of machine learning algorithms.
To address this, comprehensive preprocessing and feature engineering are required. In the following steps, we detail how the dataset was transformed into a structured, model-ready format.
Price prediction for used vehicles depends on factors such as brand, mileage, transmission, and engine specifications. Transforming these into structured, machine-readable features enables more accurate modeling.
This process involves multiple steps:
  • Data Cleaning—Handling missing values, correcting inconsistencies, and ensuring data quality [21].
  • Feature Transformation—Converting categorical variables into numerical values and normalizing numerical features.
  • Feature Selection—Identifying the most relevant variables that contribute to price prediction.
  • Outlier Detection & Handling—Removing or adjusting anomalies that may skew predictions.
  • Feature Scaling—Standardizing or normalizing features for better model performance [22].
Feature engineering enhances a model’s ability to learn from structured data, especially in automotive pricing where inconsistencies and missing values are common. Our preprocessing pipeline ensures that domain-specific patterns, such as the impact of age, mileage, or accident history, are properly represented. As shown in Figure 10, we include a sample of the raw dataset to illustrate the underlying structure of the data used in this study.
In our car dataset, features fall into the following two categories:
  • Numerical Features
    • Mileage (higher mileage often reduces value);
    • Age of the Car (older cars generally have lower prices);
    • Horsepower (higher power may indicate a more expensive car);
    • Engine Displacement (larger engines often correlate with higher prices).
  • Categorical Features
    • Brand & Model (premium brands tend to hold value better);
    • Transmission Type (automatic cars are often priced differently from manual ones);
    • Fuel Type (diesel, petrol, electric, each has different market demand);
    • Exterior & Interior Color (some colors retain value better);
    • Accident History (cars with reported accidents tend to be cheaper);
    • Clean Title (a clean title means no major past issues, which impacts pricing).

5.1. Handling Missing Values

Certain features contained missing values, which were handled as follows:
  • Fuel Type: Missing values were replaced with “not supported.” Invalid characters such as “–” were also standardized to this value.
  • Clean Title: If missing, values were assumed to be “no,” indicating the absence of a clean title.
  • Accident Reports: Missing entries were filled with “None reported,” assuming no known incidents.
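A minimal pandas sketch of these imputation rules, applied to a toy frame (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with the three affected columns; names and values are assumptions.
df = pd.DataFrame({
    "fuel_type": ["Gasoline", np.nan, "–", "Diesel"],
    "clean_title": ["Yes", np.nan, "Yes", np.nan],
    "accident": [np.nan, "At least 1 accident or damage reported",
                 np.nan, "None reported"],
})

# Fuel type: missing values and the invalid "–" both become "not supported".
df["fuel_type"] = df["fuel_type"].replace("–", "not supported").fillna("not supported")
# Clean title: absence is interpreted as "No".
df["clean_title"] = df["clean_title"].fillna("No")
# Accident reports: missing entries assume no known incidents.
df["accident"] = df["accident"].fillna("None reported")
```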

5.2. Data Standardization

To facilitate analysis, numerical fields were cleaned and standardized:
  • Mileage: Originally stored as strings (e.g., “10,000 mi.”), these were converted to numeric values.
  • Price: Symbols such as “$” and commas were stripped to yield numeric values.
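These string-to-number conversions can be sketched with two small helpers; the exact formats handled by the paper's pipeline are assumptions:

```python
import re

def parse_mileage(s: str) -> float:
    """Convert strings like '10,000 mi.' to a number by dropping
    everything except digits and the decimal point."""
    return float(re.sub(r"[^\d.]", "", s))

def parse_price(s: str) -> float:
    """Strip the '$' symbol and thousands separators, e.g. '$28,500'."""
    return float(s.replace("$", "").replace(",", ""))

print(parse_mileage("10,000 mi."))  # 10000.0
print(parse_price("$28,500"))       # 28500.0
```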

5.3. Calculating Vehicle Age

The vehicle’s age was computed by subtracting the model year from the current year. The original year column was then dropped.

5.4. Price Transformation

To reduce skewness, a logarithmic transformation was applied to the price field, improving model performance for algorithms that assume normally distributed input features.
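The transform and its inverse can be expressed with log1p/expm1, which behave well near zero; this sketch assumes the natural logarithm was used:

```python
import math

# Skewed prices: log1p compresses the heavy right tail of the distribution.
prices = [4500.0, 28500.0, 312000.0]
log_prices = [math.log1p(p) for p in prices]

# Predictions made on the log scale are mapped back to dollars with expm1.
recovered = [math.expm1(lp) for lp in log_prices]
```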

5.5. Feature Engineering: Transmission

Raw transmission entries varied significantly (e.g., “6-Speed A/T”, “8-Speed PDK”, “CVT-F”). These were standardized into three primary categories:
  • Automatic: Includes dual-clutch, CVT, and standard automatics;
  • Manual: All manual types regardless of speed;
  • Other: Covers variable or unclear entries.
Additionally, a function was created to scan for common automatic/manual indicators, defaulting unmatched values to “Other.”
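Such a keyword-scanning function can be sketched as follows; the indicator lists are assumptions rather than the paper's exact rules:

```python
import re

def categorize_transmission(raw: str) -> str:
    """Map free-text transmission strings to Automatic / Manual / Other."""
    text = raw.lower()
    if re.search(r"\b(a/t|automatic|cvt|pdk|dct|dual[- ]clutch)\b", text):
        return "Automatic"
    if re.search(r"\b(m/t|manual)\b", text):
        return "Manual"
    return "Other"

# The three categories then map to numeric codes (1/2/3).
TRANSMISSION_CODES = {"Automatic": 1, "Manual": 2, "Other": 3}

print(categorize_transmission("6-Speed A/T"))  # Automatic
print(categorize_transmission("CVT-F"))        # Automatic
print(categorize_transmission("6-Speed M/T"))  # Manual
```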

5.5.1. Numerical Encoding

Final transformation converted categories to numerical values:
transmission_encoded = { 1 if Automatic; 2 if Manual; 3 if Other }

5.5.2. Benefits

The implemented approach provided multiple advantages:
  • Reduced dimensionality while preserving transmission type distinctions;
  • Improved data quality through standardization;
  • Improved model interpretability;
  • Improved the correlation with price.
This approach successfully transformed complex transmission descriptions into a simplified, model-ready feature while preserving essential information for price prediction.

5.6. Feature Engineering: Engine

5.6.1. Data Extraction

Raw engine data contained multiple specifications within a single string (e.g., “2.0L 255HP”, “1.8 Liter Turbo”), requiring structured parsing to extract key numerical attributes:
  • Horsepower: Extracted from descriptions containing HP indicators, representing the vehicle’s power output.
  • Engine Displacement: Measured in liters, representing the total volume of the engine’s cylinders.
To systematically extract these values, Regular Expressions (Regex) were employed for efficient pattern extraction from unstructured text fields.

5.6.2. Regex-Based Parsing and Standardization

Regex [23] patterns were designed to extract the numerical values corresponding to horsepower and displacement:

5.6.3. Horsepower Extraction

  • Regex Pattern: (\d+\.\d+)HP|\d+\.\d+.
  • This pattern captures numeric values preceding an “HP” indicator or standalone decimal values signifying horsepower.
  • Example: From “2.0L 255HP”, the pattern extracts 255 as horsepower.

5.6.4. Engine Displacement Extraction

  • Regex Pattern: (\d+\.\d+L|\d+\.\d+ Liter).
  • This pattern identifies values followed by “L” or “Liter,” ensuring proper detection of engine displacement.
  • Example: From “1.8 Liter Turbo”, the pattern extracts 1.8 as engine displacement.
These extracted values were then subjected to data preprocessing:
  • Removal of unit indicators (e.g., “L”, “Liter”) for displacement.
  • Conversion to numerical format for both attributes.
  • Imputation of missing values using mean-based techniques.
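The extraction and cleanup steps above can be sketched as follows; the regex forms are reconstructions of the patterns described, not the paper's verbatim code:

```python
import re

def parse_engine(spec: str):
    """Extract (horsepower, displacement_liters) from strings such as
    '2.0L 255HP' or '1.8 Liter Turbo'; missing parts come back as None
    and would be mean-imputed downstream."""
    hp_match = re.search(r"(\d+(?:\.\d+)?)\s*HP", spec, flags=re.IGNORECASE)
    disp_match = re.search(r"(\d+(?:\.\d+)?)\s*(?:L\b|Liter)", spec, flags=re.IGNORECASE)
    hp = float(hp_match.group(1)) if hp_match else None
    disp = float(disp_match.group(1)) if disp_match else None
    return hp, disp

print(parse_engine("2.0L 255HP"))       # (255.0, 2.0)
print(parse_engine("1.8 Liter Turbo"))  # (None, 1.8)
```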

5.6.5. Numerical Encoding

Final transformation ensured compatibility with numerical models:
Engine_encoded = { Horsepower, Engine Displacement }
where
  • Horsepower: Extracted and imputed as needed.
  • Engine Displacement: Converted to numerical format.

5.6.6. Benefits

The use of regex-based feature extraction provided multiple advantages:
  • Automated parsing of complex engine descriptions, eliminating manual intervention.
  • Standardization of measurement units for consistent data representation.
Raw textual engine specifications were efficiently transformed into structured numerical features, ensuring they are fully utilized in making predictions.

5.7. Feature Engineering: Interior and Exterior Color

Automotive listings often include inconsistent color names due to manufacturer-specific branding (e.g., “Midnight Black” vs. “Jet Black”). We standardized these descriptions into a fixed set of base colors. Values not matching the predefined list were assigned to an “Other” category.
This normalization step reduced feature sparsity and improved model interpretability during training.
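A minimal sketch of the mapping; the base-color vocabulary and the substring matching are assumptions rather than the paper's exact rules:

```python
# Assumed base-color vocabulary; unmatched names fall through to "Other".
BASE_COLORS = ["black", "white", "gray", "silver", "blue",
               "red", "green", "brown", "beige"]

def normalize_color(raw: str) -> str:
    """Collapse marketing names like 'Midnight Black' to a base color."""
    text = raw.lower()
    for base in BASE_COLORS:
        if base in text:
            return base
    return "Other"

print(normalize_color("Midnight Black"))  # black
print(normalize_color("Jet Black"))       # black
```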

5.8. Outlier Detection and Treatment

Outliers can distort price predictions and introduce noise into model training. We used the Interquartile Range (IQR) method to identify and treat outliers across key numerical features.

5.8.1. IQR Method

The IQR is calculated as follows:
IQR = Q3 − Q1
Values outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] were treated as outliers and adjusted accordingly.

5.8.2. Affected Features

Outlier treatment was applied to the following:
  • Age—To limit unrealistically old/new vehicles;
  • Price—To remove extreme high/low listings;
  • Mileage, Horsepower, Engine Displacement—To reduce skewness and maintain reliable model inputs.
This filtering step improved robustness by reducing noise and aligning feature distributions with real-world expectations.
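The IQR rule above can be sketched for a single numeric column as:

```python
import numpy as np

def iqr_filter(values):
    """Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return arr[(arr >= lo) & (arr <= hi)]

# The extreme listing (500) falls outside the fences and is dropped.
kept = iqr_filter([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
print(kept)
```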

5.9. Categorical Data Encoding

Categorical attributes were encoded using a mix of binary, label, and target encoding, selected based on feature structure and correlation with the target variable.
  • Binary Encoding: Applied to boolean variables such as accident history and clean title.
    Accident History: “At least 1 accident or damage reported” → 1, “None reported” → 0.
    Clean Title: “Yes” → 1, “No” → 0.
  • Label Encoding: Used for ordinal categories like fuel type, exterior color, and interior color.
  • Target Encoding: Initially, one-hot encoding was tested for brand, but it showed weak correlation with price and led to high dimensionality. It was replaced with target encoding, which preserved predictive signal while reducing sparsity.
This strategy provided more compact feature representations while retaining important relationships between categorical variables and the target output.
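The three encoding strategies can be sketched together on a toy frame; the column names are assumptions, and in practice the target-encoding map should be fit on the training split only to avoid leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Ford", "BMW", "Ford", "Porsche"],
    "accident": ["None reported", "At least 1 accident or damage reported",
                 "None reported", "None reported"],
    "clean_title": ["Yes", "Yes", "No", "Yes"],
    "price": [28500.0, 21900.0, 18000.0, 95000.0],
})

# Binary encoding for the boolean-like columns.
df["accident"] = (df["accident"] != "None reported").astype(int)
df["clean_title"] = (df["clean_title"] == "Yes").astype(int)

# Target encoding: each brand is replaced by its mean price.
brand_means = df.groupby("brand")["price"].mean()
df["brand"] = df["brand"].map(brand_means)
print(df["brand"].tolist())  # [23250.0, 21900.0, 23250.0, 95000.0]
```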

6. Train Test Split

The train–test split is a fundamental step in machine learning that ensures unbiased model evaluation. By dividing the dataset into distinct training and testing subsets, the model’s ability to generalize to unseen data can be effectively assessed. This prevents overfitting and provides a reliable estimate of model performance [24].

Dataset Partitioning

The dataset was divided into two subsets:
  • Training Set: Comprising 80% of the data, used for model training and learning patterns.
  • Testing Set: The remaining 20%, used to evaluate model performance on unseen data.
This balanced split ensures the model learns meaningful patterns without overfitting and maintains strong performance on previously unseen data.
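A minimal sketch of the 80/20 split (synthetic features and prices; a fixed seed makes the partition reproducible, as assumed throughout this pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix and price vector (1,000 synthetic listings).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.normal(loc=20000, scale=5000, size=1000)

# 80% for training, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```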

7. Machine Learning Algorithms

This section discusses the various machine learning algorithms utilized in the prediction model for vehicle pricing.

7.1. Linear Regression Performance Analysis

Linear regression is a fundamental algorithm used to establish a linear relationship between dependent and independent variables. It assumes a straight-line relationship and is useful for interpreting feature importance through coefficient values. It was used as a baseline predictor for vehicle price, trained on a scaled feature set and evaluated using standard regression metrics. The model achieved a Root Mean Squared Error (RMSE) of 0.1634, reflecting a small average deviation between predicted and actual prices [25]. Most importantly, the R2 Score [26] was 0.7983, indicating that approximately 79.8% of the variation in car prices is accounted for by the features used in the model. This suggests that Linear Regression provides a solid baseline, although ensemble models might yield better results.
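The baseline fit-and-evaluate loop can be sketched as follows (synthetic data standing in for the scaled feature set; the coefficients and noise level are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features (mileage, age, horsepower, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([-0.5, -0.3, 0.4, 0.1]) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)  # fit scaling on the training split only

model = LinearRegression().fit(scaler.transform(X_train), y_train)
pred = model.predict(scaler.transform(X_test))

# The two metrics reported throughout the paper.
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
```

The same RMSE/R2 computation is reused unchanged for every model that follows.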

Support Vector Regression (SVR) Performance Analysis

Support Vector Regression (SVR) is an extension of Support Vector Machines (SVMs) for regression tasks. It employs a kernel trick to capture nonlinearity in data, making it effective for complex relationships between vehicle attributes and price.
SVR was employed as part of the predictive modeling process for car price prediction, trained on the scaled dataset and evaluated using standard regression metrics. The Root Mean Squared Error (RMSE) was 0.1447, indicating minimal deviation between predicted and actual prices. Most importantly, the R2 Score was 0.8419, meaning that approximately 84.19% of the variance in car prices is explained by the model. This highlights SVR’s effectiveness in modeling nonlinear relationships, showing a notable improvement over Linear Regression models.
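A short sketch of an RBF-kernel SVR on a deliberately nonlinear synthetic target (the C and epsilon values here are illustrative, not the study's tuned settings):

```python
import numpy as np
from sklearn.svm import SVR

# Nonlinear relation that a straight-line model cannot capture.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=500)

# The RBF kernel maps inputs into a higher-dimensional space; C controls
# regularization strength, epsilon the width of the error-insensitive tube.
svr = SVR(kernel="rbf", C=10, epsilon=0.1).fit(X, y)
r2_train = svr.score(X, y)
```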

7.2. Random Forest Regression and Performance Analysis

Random Forest is an ensemble learning method that constructs multiple Decision Trees and aggregates their predictions to enhance accuracy. It mitigates overfitting and effectively captures nonlinear dependencies in datasets, making it well-suited for prediction tasks.
In this study, Random Forest Regression was employed for vehicle price prediction, using 150 estimators (n_estimators) to achieve robust performance through the averaging of predictions. The model achieved a Root Mean Squared Error (RMSE) of 0.1313, indicating minimal variation between predicted and actual prices. Furthermore, the R2 score of 0.8697 demonstrates that approximately 86.97% of the variance in car prices is explained by the model, underscoring the effectiveness of ensemble methods in tackling complex regression problems.
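The 150-tree configuration described above can be sketched as follows (synthetic data with interactions and nonlinearity, which forests handle natively):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(1000, 4))
# Interaction term plus a nonlinearity; the noise level is illustrative.
y = X[:, 0] * X[:, 1] + np.abs(X[:, 2]) + rng.normal(scale=0.05, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

# 150 estimators whose predictions are averaged, as in the text.
rf = RandomForestRegressor(n_estimators=150, random_state=2).fit(X_tr, y_tr)
r2 = rf.score(X_te, y_te)
```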

7.3. XGBoost Regression and Performance Analysis

XGBoost is an optimized version of gradient boosting, renowned for its speed, performance, and ability to handle missing values, feature interactions, and nonlinearity. These attributes make it a robust choice for price prediction tasks.
In this study, XGBoost Regression achieved a Root Mean Squared Error (RMSE) of 0.1262, reflecting minimal deviation between predicted and actual prices. Additionally, the R2 score of 0.8798 indicates that 87.98% of the variance in vehicle prices is explained by the model's input features, highlighting its effectiveness in capturing complex patterns and delivering highly accurate predictions.

7.4. Stacking Regressor and Performance Analysis

Stacking Regressor is an ensemble learning technique that combines multiple base models to enhance predictive accuracy. It utilizes a meta-model trained on the predictions of individual estimators to leverage their combined strengths. The stacked model in this study included the following estimators: SVR (C = 10, gamma = 0.1, epsilon = 0.1), Random Forest (n_estimators = 500, max_depth = 10), and XGBoost (n_estimators = 500, learning_rate = 0.05).
The model achieved a Root Mean Squared Error (RMSE) of 0.1208, demonstrating minimal deviation between predicted and actual prices. Furthermore, the R2 score of 0.8899 indicates that 88.99% of the variance in vehicle prices is explained by the input features, showcasing the superior performance of the Stacking Regressor in capturing complex patterns and delivering accurate predictions.
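The stacked ensemble can be sketched with the base-learner settings quoted above; to keep the example scikit-learn-only, `GradientBoostingRegressor` stands in for XGBoost (the data and seed are illustrative):

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 4))
y = X[:, 0] ** 2 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

# Base learners mirror the configuration in the text; GradientBoostingRegressor
# is a stand-in for XGBoost. The default meta-model (RidgeCV) is trained on
# cross-validated predictions of the base estimators.
stack = StackingRegressor(
    estimators=[
        ("svr", SVR(C=10, gamma=0.1, epsilon=0.1)),
        ("rf", RandomForestRegressor(n_estimators=500, max_depth=10, random_state=4)),
        ("gb", GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                         random_state=4)),
    ],
)
stack.fit(X_tr, y_tr)
r2 = stack.score(X_te, y_te)
```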

7.5. XGBRegressor with Grid Search and Performance Analysis

XGBRegressor, a core implementation of XGBoost, was employed in both base form and with hyperparameter tuning via GridSearch to enhance model performance [27]. Grid Search Cross Validation was used to optimize key parameters, evaluating 729 parameter combinations across 3 cross-validation folds. The best parameters were identified as follows: colsample_bytree = 0.8, learning_rate = 0.1, max_depth = 5, min_child_weight = 1, n_estimators = 300, and subsample = 1.0.
Using these optimized parameters, the retrained model was evaluated and achieved a Root Mean Squared Error (RMSE) of 0.1213, indicating minimal deviation from actual car prices. The R2 score of 0.8888 illustrates that 88.88% of the variance in vehicle prices is explained by the model's input features, confirming the effectiveness of the optimized XGBRegressor in capturing complex patterns and providing accurate predictions.

7.6. Keras Regression and Performance Analysis

Keras Regression leverages deep learning to capture complex patterns for price prediction. The model architecture comprised a sequential neural network with the following configuration: Input layer (12 neurons, ReLU activation), Hidden layer (8 neurons, ReLU activation), Hidden layer (4 neurons, ReLU activation), and Output layer (1 neuron for regression). The network was trained using the Adam optimizer with a loss function of Mean Squared Error (MSE), over 100 epochs with a batch size of 64, and validated on the test set.
The model achieved a Root Mean Squared Error (RMSE) of 0.1602, indicating a small deviation between predicted and actual prices. The R2 score of 0.8061 indicates that 80.61% of the variance in car prices is explained by the input features. While these results are reasonable, optimizing hyperparameters, refining the network architecture, or improving training strategies could further enhance the model's performance.

7.6.1. Early Stopping and Dropout Techniques

Early stopping prevents overfitting by halting training once validation performance stops improving, while dropout randomly deactivates neurons during training, enhancing model generalization by reducing reliance on specific features.
Neural Network Architecture with Early Stopping and Dropout: The model followed the previous Keras architecture, but dropout layers with a rate of 30% were added after each dense layer. Early stopping was configured to monitor validation loss, stopping training after 10 epochs of no improvement. The model was compiled using the Adam optimizer with a Mean Squared Error (MSE) loss function. Training was conducted for a maximum of 100 epochs with a batch size of 64, but early stopping ensured efficiency.
The architecture was as follows: Input layer (12 neurons, ReLU activation + 30% dropout), Hidden layer (8 neurons, ReLU activation + 30% dropout), Hidden layer (4 neurons, ReLU activation + 30% dropout), and Output layer (1 neuron for regression).
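The regularized architecture can be sketched as follows, assuming TensorFlow/Keras is installed (the synthetic data is illustrative; the layer sizes, dropout rate, patience, optimizer, and batch size follow the description above):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 12)).astype("float32")
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=400)).astype("float32")

# 12 -> 8 -> 4 -> 1 with 30% dropout after each dense layer, as in the text.
model = keras.Sequential([
    layers.Input(shape=(12,)),
    layers.Dense(12, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(8, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(4, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1),  # regression output
])
model.compile(optimizer="adam", loss="mse")

# Halt training once validation loss stops improving for 10 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=100, batch_size=64,
                    callbacks=[early_stop], verbose=0)
```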
Performance Metrics: The model achieved a Root Mean Squared Error (RMSE) of 0.2950, showing a significant deviation between predictions and actual values, and an R2 score of 0.3429, indicating that only 34.29% of the variance in car prices was explained by the input features. While dropout and early stopping effectively reduced overfitting, the overall predictive performance was significantly weaker compared to other approaches.

7.6.2. Analysis of Deep Learning Underperformance

Despite their success in vision and language tasks, deep learning models often underperform on small, structured tabular datasets. In our study, a Keras-based neural network was compared against tree-based regressors for used car price prediction. The deep model consistently lagged behind, and the reasons are outlined below.
Data Size and Complexity: Neural networks require large datasets to generalize effectively. With approximately 4000 samples, the dataset was insufficient for the network to learn robust patterns, whereas tree-based models such as XGBoost and Stacking Regressors handled this scale well.
Categorical Feature Handling: Although encoding techniques (e.g., target encoding) were applied, the deep model struggled to extract meaningful relationships from categorical inputs. In contrast, tree-based models natively process such features more effectively.
Overfitting: Regularization methods (dropout, early stopping) were applied, yet the model still generalized poorly. With so little data, the 30% dropout rate proved too aggressive, removing capacity the network needed to learn the underlying patterns.
Performance Comparison: Deep learning performance was inconsistent, with R2 scores ranging from 0.34 to 0.79 and RMSE up to 3.79, far behind the ensemble-based approaches.
Conclusion: Tree-based models remain better suited for small, structured datasets. Future work may investigate hybrid architectures combining deep learning for feature extraction with tree-based regressors for prediction.

7.7. LightGBM Performance Analysis

LightGBM (Light Gradient Boosting Machine) is an efficient and fast boosting algorithm designed for large datasets. It achieves high accuracy while maintaining computational efficiency [28].
Model Training and Performance Analysis: LightGBM was trained on the scaled dataset using default hyperparameters. Upon evaluation, the model demonstrated a Root Mean Squared Error (RMSE) of 0.1257, reflecting minimal deviation from true price values. The R2 Score of 0.8806 shows that 88.06% of the variance in vehicle prices was explained by the input features. These results highlight LightGBM as an efficient and competitive ensemble-based technique for regression tasks.

7.8. CatBoost

CatBoost is a gradient boosting algorithm specifically optimized for categorical data. It natively handles categorical features without extensive preprocessing, making it particularly effective for structured datasets [29].
Model Training and Performance Analysis: CatBoost Regressor was trained on the scaled dataset with default parameter settings. The model achieved a Root Mean Squared Error (RMSE) of 0.1212, indicating high prediction accuracy with minimal deviation from true values. The R2 score of 0.8890 highlights that 88.90% of the variance in car prices was explained by the input features. These results establish CatBoost as one of the most effective models in this study, outperforming traditional regression approaches and several ensemble techniques.

7.9. Comparison Table

Table 1 summarizes the final performance of all evaluated regression models after applying the full feature engineering pipeline. The results are reported using two complementary metrics: RMSE (root mean squared error) and R2 score.
Among all models, the Stacking Regressor achieved the best overall performance, with an RMSE of 0.1208 and an R2 score of 0.8899. Tree-based ensemble models like CatBoost and XGBoost (with or without Grid Search) followed closely, demonstrating strong predictive power with RMSE values around 0.12 and R2 scores above 0.88.
LightGBM and Random Forest also performed well, showing slightly higher errors but maintaining strong R2 scores above 0.86. SVR and the Keras-based neural network achieved moderate performance, with RMSE values of 0.1447 and 0.1602, respectively. Although their R2 scores remain above 0.80, they fall short of the ensemble methods.
Linear Regression, while the simplest model, improved significantly after preprocessing, achieving an R2 of 0.7983. However, the deep learning model incorporating dropout and early stopping showed the weakest performance, likely due to underfitting or insufficient optimization, with an RMSE of 0.2950 and an R2 of just 0.3429.
These results highlight the strength of ensemble tree-based methods on structured tabular data and underscore the importance of both preprocessing and model selection in achieving high predictive accuracy.
While all regression models listed in Table 1 were trained and evaluated under the same pipeline, only a subset of five was selected for detailed before-and-after comparison in Table 2. These models, Linear Regression, Random Forest, XGBRegressor (Base), CatBoost, and the Stacking Regressor, were chosen based on three key criteria:
  • Architectural diversity: The selected models span a spectrum of complexity, ranging from simple linear models (Linear Regression) to ensemble tree-based methods (Random Forest, XGBoost, CatBoost) and meta-learners (Stacking Regressor). This variety allows for a more generalizable understanding of how feature engineering affects different model classes.
  • Interpretability and reliability: These models produced consistently interpretable results with stable performance across runs, making them ideal candidates for a controlled comparison. In contrast, models like SVR and Keras-based neural networks exhibited high sensitivity to hyperparameters, slower training, or inconsistent behavior, which would complicate a fair before-and-after analysis.
  • Analytical clarity: Including all models in before-and-after plots would introduce visual clutter and diminish readability without offering additional insights. The selected models sufficiently represent the performance trends observed across the broader set.
This focused comparison enables clearer visualization of the impact of feature engineering while maintaining methodological rigor.

Explainability Analysis with SHAP

To quantitatively evaluate the interpretability of our predictive models, we employed SHapley Additive exPlanations (SHAP) to analyze feature contributions for the three top-performing models: Stacking Regressor, CatBoost, and XGBoost (Grid Search). SHAP assigns an importance value to each feature by calculating its marginal contribution to the model's predictions, averaged across all possible feature coalitions. This approach provides a consistent, model-agnostic measure of feature influence that satisfies desirable properties such as efficiency, symmetry, and additivity.
Figure 11, Figure 12 and Figure 13 present the mean absolute SHAP values for each model, serving as global interpretability metrics. Across all three models, mileage, age, and horsepower consistently emerge as the most influential predictors, followed by brand and engine displacement. These results strongly align with automotive domain knowledge: older cars or those with higher mileage typically depreciate more, while premium brands and powerful engines retain greater value.
The consistency of SHAP feature rankings across distinct architectures underscores the robustness of our engineered features and validates the reliability of our feature importance assessments.

7.10. Model Performance Comparison

To visually compare the performance of different regression models, we plotted their R2 and RMSE scores side by side. Figure 14 illustrates both how well each model explains the variance in vehicle prices (R2) and their average prediction error (RMSE).
From Figure 14, it is evident that ensemble models like Random Forest, XGBoost, and the Stacking Regressor consistently outperform simpler models such as Linear Regression and Support Vector Regression (SVR). These ensemble methods excel at capturing nonlinear feature interactions, managing multicollinearity, and handling feature noise; all of which are crucial in real-world, heterogeneous datasets like used car listings.
In contrast, simpler models such as Linear Regression lack the capacity to model complex relationships unless supported by extensive feature engineering, and SVR, while more flexible, can be sensitive to feature scaling and hyperparameter tuning. Meanwhile, deep learning models, though highly expressive, underperformed due to the relatively small dataset size and the nature of tabular, structured data, which typically favors Decision Tree-based methods over neural architectures. These results reaffirm the strength of ensemble regressors as a default choice for tabular regression tasks, especially when combined with thoughtful feature transformations.

7.11. Model Evaluation and Cross-Validation Results

To ensure a robust assessment of model performance, each regression algorithm was evaluated using both hold-out test set metrics and k-fold cross-validation. Specifically, 5-fold cross-validation was applied to the training data to estimate the model's generalization ability and to measure stability across different data splits. For each model, we report the mean and standard deviation of the R2 score across the folds, alongside the final R2 score on the independent test set.
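The fold-wise evaluation can be sketched as follows (synthetic data; the estimator and seed are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# 5-fold CV on the training data; report mean and standard deviation of R2,
# as in Table 3.
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=8),
                         X, y, cv=5, scoring="r2")
cv_mean, cv_std = scores.mean(), scores.std()
```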
Table 3 summarizes the cross-validation results for all models considered in this study.
From Table 3, it can be observed that the Stacking Regressor achieved the highest test R2 (0.8899) while also maintaining a strong CV mean R2 (0.8881) with low variance, indicating stable performance. XGBoost (Tuned) and CatBoost followed closely with CV mean R2 scores above 0.89. LightGBM and standard XGBoost also demonstrated competitive performance, achieving test R2 values greater than 0.88. Random Forest and SVR provided solid accuracy with minimal variation across folds, whereas Linear Regression and Keras Regression yielded lower R2 scores, highlighting the benefits of ensemble and boosting techniques for this dataset. The consistently low CV standard deviations across all models indicate that the performance is stable and not overly sensitive to specific train–test splits.
In the following subsection, we compare these accuracy and stability results against the execution times of each model to analyze the trade-off between predictive performance and computational efficiency.

7.12. Time Comparison

Figure 15 presents the execution time for each regression model. As expected, simple models like Linear Regression and base XGBRegressor complete training in under a second. Tree-based models such as Random Forest and CatBoost remain relatively efficient, while ensemble methods like the Stacking Regressor and XGBoost with GridSearch exhibit the highest training time, over 30 s in some cases.
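Such timings are typically collected by wrapping each `fit` call in a wall-clock timer; a minimal sketch (two representative models on synthetic data, names illustrative):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
X = rng.normal(size=(2000, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=2000)

# Record training wall time per model with a monotonic high-resolution clock.
timings = {}
for name, model in [("LinearRegression", LinearRegression()),
                    ("RandomForest", RandomForestRegressor(n_estimators=150))]:
    start = time.perf_counter()
    model.fit(X, y)
    timings[name] = time.perf_counter() - start
```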
Deep learning models, including Keras Regression with dropout and early stopping, also show longer execution times due to iterative training and validation monitoring. Interestingly, techniques like early stopping, while reducing overfitting, introduce additional training overhead.
This analysis highlights an important trade-off: while complex models tend to offer better accuracy, they often come at a significantly higher computational cost. For deployment in real-time systems or large-scale platforms, these time differences could affect scalability. Future work may explore more efficient stacking frameworks, model pruning, or distillation to reduce inference time without compromising performance.

8. Comparative Analysis: With vs. Without Feature Engineering

Feature engineering plays a critical role in enhancing the predictive power of machine learning models, particularly when working with structured tabular datasets. To quantify its impact, we conducted a comparative study using two configurations: one using only minimal preprocessing, and the other incorporating a comprehensive feature engineering pipeline.
In the baseline setup, categorical variables were left largely untouched or processed using simple label encoding without accounting for cardinality or relationship to the target. Additionally, no derived features or transformations were introduced. As a result, most models struggled to capture underlying patterns. For instance, Random Forest and XGBoost achieved R2 scores between 0.12 and 0.14, and RMSE values exceeding 133,000. Linear Regression performed worst, with an R2 of just 0.03—highlighting its inability to model nonlinear relationships in raw tabular data (see Figure 9 and Figure 16).
By contrast, the final version of the dataset underwent extensive feature engineering, which proved transformative. Key techniques included
  • Categorical encoding: High-cardinality fields such as brand and model were encoded using target encoding and frequency-based grouping, improving alignment with the target variable.
  • Outlier removal: Abnormal values in price, mileage, and horsepower were filtered using interquartile range (IQR) logic to reduce noise and variance.
  • Normalization: Continuous features were scaled to stabilize gradient-based learning and improve model convergence.
  • Feature extraction: Additional features such as car age, engine displacement, and power-to-weight ratios were derived from raw attributes, enhancing feature richness.
These transformations resulted in substantial performance gains. For example, the Stacking Regressor improved from an R2 of 0.1486 (minimal preprocessing baseline) to 0.8899 after applying the full feature engineering pipeline, with an RMSE of just 0.1208. The low starting point was expected, as the baseline omitted most transformations in order to highlight the effect of structured preprocessing. Thus, the improvement reflects enhanced data representation and reduced noise, rather than solely algorithmic superiority.
Linear models showed the greatest relative improvement due to their reliance on properly scaled, transformed inputs. Ensemble tree-based models, though more robust, still benefited greatly from the clearer signal provided by the engineered features.
The findings underscore that feature engineering is not merely a supporting step, but a critical component of predictive modeling in structured domains. In many real-world scenarios, enhancing data representation may yield greater improvements than algorithm selection or hyperparameter tuning. Future work should continue to prioritize domain-aware preprocessing techniques to further advance model generalizability and robustness.
To further quantify the magnitude of improvement enabled by feature engineering, we selected five representative models (Linear Regression, Random Forest, XGBoost, CatBoost, and the Stacking Regressor) for focused analysis. These models were chosen based on their performance diversity and widespread use in regression tasks. Figure 17 and Figure 18 illustrate the absolute improvements in R2 and RMSE, respectively, observed after applying the full feature engineering pipeline.
To visually contrast model performance before and after feature engineering, we present a grouped comparison of R2 and RMSE values across five key models. This highlights the absolute difference in predictive power and error between raw and transformed data inputs. The comparison of R2 scores is shown in Figure 9, while the RMSE values before and after feature engineering are presented in Figure 16.
To complement the overall analysis above, we now provide a detailed breakdown of how each model individually responded to the feature engineering process, with reference to the results visualized in Figure 17 and Figure 18.
Interestingly, Linear Regression exhibited the largest absolute improvement in R2 score, rising dramatically from just 0.03 to 0.7983. This 26-fold increase in explanatory power is expected, as linear models are highly sensitive to raw data quality, scaling, and encoding. Initially, unprocessed categorical and inconsistent inputs rendered Linear Regression ineffective. After structured preprocessing, including label and target encoding, feature scaling, and outlier removal, the model could finally uncover the underlying linear relationships between features and price.
Random Forest, a tree-based ensemble method, also saw a notable improvement in R2, increasing from 0.1272 to 0.8697, with a corresponding RMSE reduction of approximately 133,566 units. As a model that naturally handles nonlinearities and feature interactions, Random Forest already performed moderately well on raw data. However, the addition of cleaned, standardized inputs, particularly horsepower, mileage, and encoded categorical variables, boosted its predictive consistency.
XGBoost, a gradient boosting algorithm, improved from 0.1245 to 0.8798 in R2 score and reduced its RMSE by 133,774, confirming its strength in capturing complex relationships. The model’s built-in regularization and high sensitivity to noise make it an ideal candidate for structured but noisy datasets, and it clearly benefited from outlier filtering and transformation of engine and brand features.
CatBoost, known for handling categorical variables natively, increased from an R2 of 0.1490 to 0.8890, with an RMSE reduction of 131,888. Its performance gain demonstrates that even models designed for raw categorical data benefit from additional preprocessing steps such as normalization, feature extraction, and transmission standardization. This also shows that native support does not eliminate the need for domain-informed preprocessing.
Finally, the Stacking Regressor, which combines SVR, Random Forest, and XGBoost, achieved the highest R2 score after feature engineering, rising from 0.1486 to 0.8899. It also showed a substantial RMSE reduction of 131,920 units. This model leveraged the strengths of its diverse base learners, and its improvement underscores how ensemble blending can further amplify the benefits of well-engineered features.
In summary, although ensemble methods such as CatBoost, XGBoost, and Stacking Regressor already possessed strong baseline performance due to their robustness and architectural advantages, they still benefited meaningfully from structured feature engineering. The consistent pattern across all models, each showing RMSE reductions exceeding 130,000 units, highlights the profound impact of thoughtful preprocessing, including outlier removal, encoding strategies, and extraction of latent numerical features from raw strings. These results reinforce the conclusion that feature engineering is often a more effective lever for performance enhancement than model complexity alone.

9. Results and Discussion

9.1. Model Performance Analysis

This research successfully demonstrates the effectiveness of machine learning techniques in predicting used car prices with high accuracy. The comprehensive evaluation of multiple regression algorithms reveals significant performance variations and highlights the critical role of feature engineering in enhancing predictive accuracy.
The experimental results show that ensemble methods consistently outperformed individual models. The Stacking Regressor achieved the highest overall performance with an R2 score of 0.8899 and RMSE of 0.1208, followed closely by CatBoost (R2 = 0.8890, RMSE = 0.1216) and XGBoost with Grid Search (R2 = 0.8888, RMSE = 0.1213). These results demonstrate the effectiveness of ensemble approaches in capturing complex patterns within the used car pricing domain.
Traditional machine learning algorithms showed varied performance levels. Linear Regression, while being the simplest model, achieved a respectable R2 of 0.7983 after proper feature engineering, demonstrating the transformative power of preprocessing. Support Vector Regression and Random Forest achieved moderate to strong performance (R2 = 0.8419 and 0.8697, respectively), confirming their suitability for structured tabular data.
Interestingly, deep learning models underperformed compared to tree-based ensemble methods. The Keras-based neural network with dropout and early stopping achieved only an R2 of 0.3429, highlighting the well-known limitation of deep learning approaches on small, structured datasets. This finding reinforces the superiority of tree-based algorithms for tabular regression tasks when dealing with datasets of moderate size.

9.2. Impact of Feature Engineering

This study emphasizes the critical role of feature engineering, including data normalization, categorical encoding, and outlier removal, in improving model performance. These preprocessing steps transformed raw data into more informative representations, significantly enhancing both model interpretability and predictive accuracy.
When comparing models with and without feature engineering, the impact is profound and consistent across all algorithms. Linear Regression showed the most dramatic improvement, with R2 increasing from 0.0296 to 0.7983—a significant enhancement. Similarly, the Stacking Regressor improved from R2 = 0.1486 to 0.8899, while ensemble methods like XGBoost and CatBoost also demonstrated substantial gains.
Key preprocessing techniques proved particularly effective:
  • Categorical encoding: Target encoding for brand variables and label encoding for ordinal features significantly improved correlation with the target variable.
  • Feature extraction: Regex-based parsing of engine specifications to extract horsepower and displacement created highly predictive numerical features.
  • Outlier removal: IQR-based filtering reduced noise and improved model stability.
  • Data standardization: Normalization of continuous features enhanced convergence for gradient-based algorithms.
The creation of derived features from existing variables, such as vehicle age, power-to-weight ratios, and standardized transmission categories, contributed to stronger signal representation and better generalization capabilities.

9.3. Computational Efficiency Analysis

An important practical consideration revealed by this study is the trade-off between model complexity and computational efficiency. Complex ensemble methods such as stacking, deep learning models with early stopping, and grid-searched XGBoost incurred significantly higher training times (over 30 s in some cases). In contrast, simpler models like Linear Regression and base XGBoost trained in under a second while still achieving reasonably good performance.
This computational analysis highlights critical considerations for real-world deployment. For time-sensitive applications or resource-constrained environments, models with lower training overhead may be preferred, even at a slight cost to accuracy. The results suggest that practitioners should carefully balance accuracy requirements with computational constraints when selecting models for production systems.

9.4. Model Interpretability and Feature Importance

The SHAP analysis across the top-performing models revealed consistent patterns in feature importance. Mileage, vehicle age, and horsepower emerged as the most influential predictors across all models, followed by brand and engine displacement. This consistency across different model architectures validates both the robustness of the engineered features and their alignment with domain knowledge.
The interpretability results confirm expected relationships: vehicles with higher mileage or greater age tend to have lower resale values, while more powerful engines and premium brands command higher prices. This alignment between statistical importance and automotive domain expertise enhances confidence in the model’s reliability and practical applicability.

9.5. Comparative Analysis with Existing Literature

To contextualize our findings within the broader research landscape, Table 4 presents a comprehensive comparison of existing studies on used car price prediction, highlighting their methodologies, performance metrics, and limitations.
Our approach addresses several critical gaps identified in the existing literature. While some studies achieved high reported accuracy, they often suffered from limitations such as small dataset sizes, absence of comprehensive feature engineering, or lack of systematic comparison between raw and processed data. Table 5 summarizes the key contributions of our work in addressing these limitations.
Unlike previous studies that focused primarily on final model accuracy, our research provides a systematic evaluation of the feature engineering impact through controlled before-and-after comparisons. This methodological transparency enables better understanding of performance drivers and provides a replicable framework for similar applications.
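The controlled before-and-after methodology can be illustrated with a minimal synthetic example: the same regressor is scored once on a raw feature and raw target, and once after a simple engineered transformation (deriving vehicle age and modeling log-price). The data-generating process below is invented for illustration and is far simpler than the study's pipeline:

```python
# Illustrative before/after comparison in the spirit of the paper's
# controlled evaluation: identical model, raw vs. engineered features.
# Synthetic data; the decay rate and noise level are invented.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = rng = np.random.default_rng(1)
n = 3000
year = rng.integers(2000, 2024, n)
age = 2024 - year
# Prices decay roughly exponentially with age, so a straight-line fit on
# the raw price is a poor match; log-price vs. age is nearly linear.
price = 35_000 * np.exp(-0.12 * age) * rng.lognormal(0, 0.1, n)

X = year.reshape(-1, 1).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.2, random_state=1)

# "Before": raw model year predicting raw price.
r2_raw = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# "After": engineered vehicle age predicting log-price. (The age derivation
# is affine, so the gain here comes from linearizing the target.)
r2_eng = LinearRegression().fit(2024 - X_tr, np.log(y_tr)).score(2024 - X_te, np.log(y_te))
print(f"R2 raw: {r2_raw:.3f}, R2 engineered: {r2_eng:.3f}")
```

Even this one-feature toy shows a sizeable R2 gap from the transformation alone, which is the effect Tables 1 and 2 quantify at full scale.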

9.6. Practical Implications and Applications

The integration of machine learning into the used car pricing domain has clear practical implications for multiple stakeholders. Automated valuation systems powered by these models can assist consumers in making informed purchasing decisions by providing data-driven price estimates that account for complex feature interactions.
For automotive dealerships, these models offer the potential to optimize pricing strategies, ensuring competitive yet profitable vehicle valuations. The high accuracy achieved (R2 > 0.88 for top models) suggests these tools could significantly improve pricing consistency and market efficiency.
Insurance companies could leverage similar approaches for risk assessment and claim valuations, while financial institutions might integrate these models into loan approval and asset valuation processes. The interpretability provided by SHAP analysis enhances trust and regulatory compliance in these sensitive applications.

9.7. Limitations and Future Research Directions

Despite the promising results, several limitations warrant acknowledgment. The current dataset, while substantial with over 4000 records, represents U.S. market conditions and may not generalize to international markets with different pricing dynamics, regulations, or consumer preferences.
Future research opportunities include several promising directions. Incorporating real-time market trends and economic indicators could improve model responsiveness to market fluctuations. Expanding the dataset to capture broader geographic diversity and a wider range of vehicle conditions would enhance generalizability.
Time-series analysis could reveal seasonal pricing patterns and market trends, while integration of external factors such as fuel prices, economic conditions, and regional preferences might further improve accuracy. Additionally, exploration of more sophisticated ensemble methods or hybrid approaches combining deep learning for feature extraction with tree-based models for prediction presents an interesting research avenue.
The development of dynamic pricing models using reinforcement learning techniques could enable real-time price adjustments based on market conditions and inventory levels. Such systems could provide significant competitive advantages for online marketplaces and dealership networks.

10. Conclusions

This research demonstrates that comprehensive feature engineering combined with ensemble machine learning techniques can achieve highly accurate used car price prediction, with the best model (Stacking Regressor) achieving an R2 score of 0.8899. The study’s primary contribution lies in systematically quantifying the transformative impact of structured preprocessing on model performance across diverse algorithms.
Key findings include the superiority of ensemble methods over individual models for this domain, with tree-based algorithms consistently outperforming deep learning approaches on structured tabular data. The dramatic performance improvements observed across all models, with some showing several-fold increases in R2 scores, underscore that thoughtful feature engineering often provides greater benefits than algorithm selection alone.
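For readers unfamiliar with the ensemble technique behind the top score, a minimal sketch of stacking follows: base learners produce out-of-fold predictions that a meta-learner then combines. The estimator choices here are illustrative stand-ins, not the paper's exact configuration:

```python
# Minimal sketch of stacked generalization with scikit-learn.
# Base learners and meta-learner are illustrative, not the paper's setup.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=15.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=7)),
        ("gb", GradientBoostingRegressor(random_state=7)),
    ],
    final_estimator=RidgeCV(),  # meta-learner combines base predictions
    cv=5,  # out-of-fold predictions keep the meta-learner leakage-free
)
stack.fit(X_tr, y_tr)
print(f"Stacking R2: {stack.score(X_te, y_te):.3f}")
```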
The research addresses critical gaps in existing literature by providing transparent before-and-after comparisons, comprehensive model evaluation, and interpretability analysis through SHAP values. This methodological rigor establishes a replicable framework for similar regression problems in structured domains.
From a practical perspective, the developed models offer immediate applicability for automated valuation systems, benefiting consumers, dealerships, and financial institutions through improved pricing transparency and market efficiency. The computational efficiency analysis provides guidance for deployment decisions, highlighting important trade-offs between accuracy and processing requirements.
Looking ahead, the proposed framework provides a foundation for real-time pricing applications, insurance risk assessments, and dynamic loan evaluations. With continued refinement incorporating market trends, geographic diversity, and temporal dynamics, systems based on this research can become vital tools for digital transformation in automotive and financial industries.
The study reinforces that in the era of big data and advanced algorithms, the quality and engineering of input features remain fundamental to achieving robust, interpretable, and practically valuable machine learning solutions.

Author Contributions

Conceptualization, I.F. and G.G.M.N.A.; methodology, I.F. and G.G.M.N.A.; software, I.F., G.G.M.N.A. and S.S.K.; validation, I.F., G.G.M.N.A. and S.S.K.; formal analysis, I.F. and S.S.K.; investigation, G.G.M.N.A. and S.S.K.; resources, G.G.M.N.A. and S.S.K.; data curation, I.F.; writing—original draft preparation, I.F., G.G.M.N.A. and S.S.K.; writing—review and editing, G.G.M.N.A. and S.S.K.; visualization, I.F.; supervision, G.G.M.N.A. and S.S.K.; project administration, G.G.M.N.A. and S.S.K.; funding acquisition, G.G.M.N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partly supported by the Bradley University Faculty Scholarship Award (FSA) under grant number 1326872.

Data Availability Statement

The dataset used in this study is publicly available on Kaggle [11].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Malik, S.; Harode, R.; Singh, A. XGBoost: A Deep Dive into Boosting (Introduction Documentation). Tech. Rep. 2020. [Google Scholar] [CrossRef]
  2. Pavlyshenko, B. Using stacking approaches for machine learning models. In Proceedings of the 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2018; pp. 255–258. [Google Scholar] [CrossRef]
  3. Sykes, A.O. An Introduction to Regression Analysis. In Chicago Lectures in Law and Economics; Posner, E.A., Ed.; Foundation Press: New York, NY, USA, 2000; Available online: https://chicagounbound.uchicago.edu/law_and_economics/20 (accessed on 7 February 2025).
  4. Shen, J. Linear regression. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA, 2009; p. 1622. [Google Scholar] [CrossRef]
  5. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  6. Zhang, F.; O’Donnell, L.J. Support vector regression. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 2020; pp. 123–140. [Google Scholar] [CrossRef]
  7. Moolayil, J. An introduction to deep learning and Keras. In Learn Keras for Deep Neural Networks: A Fast-Track Approach to Modern Deep Learning with Python; Apress: Berkeley, CA, USA, 2019; pp. 1–16. [Google Scholar] [CrossRef]
  8. Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, 1st ed.; O’Reilly Media, Inc.: Santa Rosa, CA, USA, 2018; Available online: https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/ (accessed on 7 February 2025).
  9. Kosaraju, N.; Sankepally, S.R.; Mallikharjuna Rao, K. Categorical data: Need, encoding, selection of encoding method and its emergence in machine learning models—A practical review study on heart disease prediction dataset using Pearson correlation. In Proceedings of the International Conference on Data Science and Applications (ICDSA 2023), Singapore, 23–24 June 2023; Saraswat, M., Chowdhury, C., Kumar Mandal, C., Gandomi, A.H., Eds.; Springer: Singapore, 2023; pp. 369–382. Available online: https://link.springer.com/chapter/10.1007/978-981-19-6631-6_26 (accessed on 7 February 2025).
  10. Kwak, S.K.; Kim, J.H. Statistical data preparation: Management of missing values and outliers. Korean J. Anesthesiol. 2017, 70, 407. [Google Scholar] [CrossRef] [PubMed]
  11. Najib, T. Used Car Price Prediction Dataset. Available online: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset (accessed on 7 February 2025).
  12. Venkatasubbu, P.; Ganesh, M. Used cars price prediction using supervised learning techniques. Int. J. Eng. Adv. Technol. (IJEAT) 2019, 9, 216–223. Available online: https://www.ijeat.org/wp-content/uploads/papers/v9i1s3/A10421291S319.pdf (accessed on 7 February 2025). [CrossRef]
  13. Zhu, A. Pre-Owned Car Price Prediction Using Machine Learning Techniques. In Proceedings of the 1st International Conference on Data Analysis and Machine Learning (DAML 2023), Setúbal, Portugal, 16–18 July 2023; SciTePress (INSTICC): Setúbal, Portugal, 2023; pp. 356–360. Available online: https://www.scitepress.org/Papers/2023/128101/128101.pdf (accessed on 7 February 2025).
  14. Lin, J. Predicting Used Car Price Based on Machine Learning. In Proceedings of the 1st International Conference on E-commerce and Artificial Intelligence (ECAI 2024), Setúbal, Portugal, 2–4 March 2024; SciTePress: Setúbal, Portugal, 2024; pp. 553–560. Available online: https://www.scitepress.org/Papers/2024/132705/132705.pdf (accessed on 7 February 2025).
  15. Baral, G.; Zhou, J. A hybrid regression method for predicting housing prices. In Proceedings of the 15th International Conference on Applied Human Factors and Ergonomics (AHFE), Istanbul, Turkey, 20–24 July 2024; Available online: https://openaccess.cms-conferences.org/publications/book/978-1-964867-35-9/article/978-1-964867-35-9_163 (accessed on 7 February 2025).
  16. Bharambe, P.; Bagul, B.; Dandekar, S.; Ingle, P. Used Car Price Prediction using Different Machine Learning Algorithms. Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) 2022, 10, 773–780. [Google Scholar] [CrossRef]
  17. Bukvić, L.; Pašić-Krinjar, J.; Fratrović, T.; Abramović, B. Price prediction and classification of used vehicles using supervised machine learning. Sustainability 2022, 14, 17034. [Google Scholar] [CrossRef]
  18. Huang, J.; Yu, Z.; Ning, Z.; Hu, D. Used car price prediction analysis based on machine learning. In Proceedings of the International Conference on Artificial Intelligence and Data Science (ICAID 2023), London, UK, 9–10 December 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 356–364. [Google Scholar] [CrossRef]
  19. Isaev, R.; Abdykalykova, M.; Amanov, A. Predicting passenger car prices with ML models. Preprints 2024. [Google Scholar] [CrossRef]
  20. Hall, M.A. Correlation-Based Feature Selection for Machine Learning. Ph.D. Thesis, The University of Waikato, Hamilton, New Zealand, 1999. Available online: https://ml.cms.waikato.ac.nz/publications/2000/00MH-Correlation.pdf (accessed on 7 February 2025).
  21. Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1–20. Available online: https://personales.upv.es/thinkmind/dl/journals/soft/soft_v10_n12_2017/soft_v10_n12_2017_1.pdf (accessed on 7 February 2025).
  22. Ali, P.J.M.; Faraj, R.H.; Koya, E. Data normalization and standardization: A technical report. Mach. Learn. Tech. Rep. 2014, 1, 1–6. Available online: https://www.scribd.com/document/735613404/DataNormalizationandStandardizationATechnicalReport (accessed on 7 February 2025).
  23. Magyar, D.; Szénási, S. Parsing via regular expressions. In Proceedings of the 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 21–23 January 2021; pp. 235–238. [Google Scholar] [CrossRef]
  24. Tan, J.; Yang, J.; Wu, S.; Chen, G.; Zhao, J. A critical look at the current train/test split in machine learning. arXiv 2021, arXiv:2106.04525. [Google Scholar] [CrossRef]
  25. Kaneko, H. A new measure of regression model accuracy that considers applicability domains. Chemom. Intell. Lab. Syst. 2017, 171, 1–8. [Google Scholar] [CrossRef]
  26. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef] [PubMed]
  27. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059. [Google Scholar] [CrossRef]
  28. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3147. Available online: https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html (accessed on 7 February 2025).
  29. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31, 6638–6648. Available online: https://proceedings.neurips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html (accessed on 7 February 2025).
Figure 1. Comparison of R2 scores before and after feature engineering across all evaluated machine-learning-based regressors.
Figure 2. Process flow diagram for used car price prediction pipeline.
Figure 3. Pair plot showing relationships between price, age, and mileage.
Figure 4. Boxplot of scaled key features after outlier removal. The absence of extreme values confirms that the data were properly filtered prior to model training.
Figure 5. Correlation heatmap before feature engineering.
Figure 6. Correlation heatmap after full feature engineering. Stronger, more interpretable relationships emerge between features and price.
Figure 7. Car distribution in the dataset by brand.
Figure 8. Feature importance plot showing the influence of each variable on model predictions.
Figure 9. Comparison of R2 scores before and after feature engineering across selected models.
Figure 10. Raw sample data from the dataset [11].
Figure 11. Mean SHAP values for Stacking Regressor.
Figure 12. Mean SHAP values for CatBoost.
Figure 13. Mean SHAP values for XGBoost (Grid Search). Mileage, age, and horsepower remain dominant features.
Figure 14. Comparison of different regression models based on R2 and RMSE scores.
Figure 15. Execution time comparison (in seconds) for different regressors.
Figure 16. RMSE score before and after feature engineering.
Figure 17. Improvement in R2 score due to feature engineering.
Figure 18. Reduction in RMSE after feature engineering.
Table 1. Comparison of R2 scores and RMSE for different regression models after applying the proposed feature engineering pipeline.

| Regressor | RMSE | R2 Score |
|---|---|---|
| Stacking Regressor | 0.1208 | 0.8899 |
| CatBoost | 0.1216 | 0.8890 |
| XGBoost (Grid Search) | 0.1213 | 0.8888 |
| LightGBM | 0.1225 | 0.8840 |
| XGBoost Regressor | 0.1262 | 0.8798 |
| Random Forest | 0.1313 | 0.8697 |
| SVR | 0.1447 | 0.8419 |
| Neural Network (Keras) | 0.1602 | 0.8061 |
| Linear Regression | 0.1634 | 0.7983 |
| Neural Network (Dropout, Early Stopping) | 0.2950 | 0.3429 |
Table 2. Comparison of R2 scores and RMSE before and after feature engineering for selected models. RMSE values before feature engineering are on the raw price scale (USD); after feature engineering, the target is standardized.

| Model | R2 (Before) | R2 (After) | RMSE (Before) | RMSE (After) |
|---|---|---|---|---|
| Linear Regression | 0.0296 | 0.7983 | 140,838.13 | 0.1634 |
| Random Forest | 0.1272 | 0.8697 | 133,566.21 | 0.1313 |
| XGBRegressor (Base) | 0.1245 | 0.8798 | 133,774.73 | 0.1262 |
| CatBoost | 0.1490 | 0.8890 | 131,888.52 | 0.1216 |
| Stacking Regressor | 0.1486 | 0.8899 | 131,920.13 | 0.1208 |
Table 3. Five-fold cross-validation and test performance for all models.

| Model | CV Mean R2 | CV Std R2 | Test R2 |
|---|---|---|---|
| Linear Regression | 0.7954 | 0.0061 | 0.7983 |
| Random Forest | 0.8632 | 0.0076 | 0.8695 |
| SVR | 0.8469 | 0.0056 | 0.8419 |
| XGBoost | 0.8809 | 0.0118 | 0.8798 |
| Stacking Regressor | 0.8881 | 0.0095 | 0.8899 |
| Keras Regression | 0.7977 | 0.0077 | 0.8045 |
| XGBoost (Tuned) | 0.8920 | 0.0113 | 0.8874 |
| LightGBM | 0.8843 | 0.0089 | 0.8806 |
| CatBoost | 0.8923 | 0.0095 | 0.8871 |
Table 4. Comparison of existing studies on used car price prediction.

| Study | Models Used | R2/RMSE | Shortcomings |
|---|---|---|---|
| Venkatasubbu & Ganesh (2019) [12] | Lasso, Regression Tree | 0.9927 | Small dataset, no feature engineering |
| Lin (2024) [14] | Linear, Ridge, RF | Not stated | Basic derivation only, no ensembles, limited encoding |
| Zhu (2023) [13] | Linear, RF, ANN, XGBoost | 0.789 (XGBoost) | Only one-hot encoding, no raw vs. engineered comparison |
| Isaev et al. (2024) [19] | RF, SVM, CatBoost | ∼86% accuracy | No baseline benchmark with raw features |
| Huang et al. (2023) [18] | XGBoost, CatBoost, ANN | 0.9845 | Only forward selection, no raw vs. engineered comparison |
| Baral & Zhou (2024) [15] | Ridge + XGBoost (stacked) | RMSE 0.1126 | Non-car domain; preprocessing not benchmarked |
| Bharambe et al. (2022) [16] | Linear, Classifiers | Not stated | Basic encoding, no depth or benchmarking |
| Bukvić et al. (2022) [17] | Linear Regression | 0.95 | No advanced encoding or model diversity |
Table 5. Key contributions of the proposed work.

| Aspect | Our Contribution |
|---|---|
| Model Diversity | 10+ models including XGBoost, CatBoost, SVR, Stacking, and Deep Learning |
| Preprocessing Scope | Comprehensive feature engineering: regex parsing, normalization, categorical encoding (label, binary, target) |
| Benchmarking Strategy | Side-by-side comparison of raw vs. engineered features |
| Explainability | SHAP used for interpretability across top-performing models |
| Robustness | 5-fold cross-validation and runtime evaluation across all models |
| Application Scope | Tailored to U.S. used car listings with domain-specific feature treatments |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
