Machine Learning Modeling of Household Trip Generation by State Using NHTS Data

Naseralavi, Saber; Soltanirad, Mohammad; Ranjbar, Erfan; Lucero, Martin; Gorzin, Fateme; Hakiminejad, Yasaman; Azimi, Shiva; Baghersad, Mahdi; Mazaheri, Akram

doi:10.3390/urbansci9090353


by Saber Naseralavi ¹,*, Mohammad Soltanirad ², Erfan Ranjbar ³, Martin Lucero ², Fateme Gorzin ⁴, Yasaman Hakiminejad ⁵, Shiva Azimi ⁵, Mahdi Baghersad ⁶ and Akram Mazaheri ⁷

¹ Department of Civil Engineering, Shahid Bahonar University of Kerman, Kerman 7616914111, Iran

² Department of Civil, Environmental and Construction Engineering, Texas Tech University, Lubbock, TX 79409, USA

³ Department of Civil, Construction, and Environmental Engineering, University of Delaware, Newark, DE 19716, USA

⁴ Department of Civil and Environmental Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA

⁵ Department of Civil and Environmental Engineering, Villanova University, Villanova, PA 19085, USA

⁶ Department of Civil, Construction, and Environmental Engineering, University of Alabama at Birmingham, Birmingham, AL 35294, USA

⁷ Department of Civil and Environmental Engineering, Tarbiat Modares University, Tehran 1411944961, Iran

* Author to whom correspondence should be addressed.

Urban Sci. 2025, 9(9), 353; https://doi.org/10.3390/urbansci9090353

Submission received: 4 July 2025 / Revised: 22 August 2025 / Accepted: 1 September 2025 / Published: 4 September 2025


Abstract

This study investigates the factors that influence household trip generation across the United States using the National Household Travel Survey (NHTS) dataset. Recognizing the limits of a one-size-fits-all modeling approach, we conduct a two-stage analysis of spatial heterogeneity in travel behavior. Stage one is a benchmark analysis comparing advanced machine learning models (CatBoost and random forest) with a traditional linear regression model. Contrary to prevailing trends in predictive modeling, the results reveal that linear regression not only delivers competitive overall performance but also emerges as the best-performing model in the largest number of states, providing the best balance between predictive accuracy and interpretability. Building on these findings, the second stage applies state-specific linear models to uncover geographic differences in trip generation drivers. The findings highlight extensive spatial heterogeneity: while core demographic variables like household size and the presence of young children show consistent effects across the US, the influence of socio-economic factors such as income and vehicle ownership is highly context-dependent and spatially volatile. These findings highlight the importance of moving beyond black-box modeling and instead implementing place-based, context-sensitive techniques to promote more effective and equitable transportation plans.

Keywords:

trip generation; sustainable transportation planning; sustainable travel behaviors; machine learning modeling; NHTS data

1. Introduction

Urban and regional transportation planning relies fundamentally on understanding and accurately forecasting travel demand. Central to this process is trip generation modeling, which estimates the number of trips produced by analytical units such as households. The accuracy of these initial forecasts has cascading effects on all subsequent planning stages, including trip distribution, mode choice, traffic assignment, and the evaluation of multibillion dollar infrastructure projects [1,2].

Trip generation is particularly sensitive to variations in demographic, economic, and spatial factors. Socioeconomic characteristics (e.g., income, vehicle ownership), land use patterns, and the built environment significantly influence travel behavior [3,4]. As such, models must be both predictively accurate and interpretable, providing actionable insights for planners and policymakers seeking to address congestion and sustainability, while ensuring equitable access across transportation networks.

The use of machine learning (ML) models, particularly black-box algorithms, presents a critical challenge: their lack of interpretability. While these models may offer high accuracy, their internal logic is often opaque, limiting their use in public policy contexts that require causal explanations and transparency [5]. This tradeoff between predictive accuracy and interpretability represents a major tension in applied transportation modeling and a core focus of this study.

This paper addresses two key gaps in the literature: (1) the tradeoff between model interpretability and predictive performance and (2) the assumption of spatial homogeneity in household travel behavior. Based on these considerations, this study hypothesizes that factors influencing household trip generation vary significantly across U.S. states, and no single model can adequately capture these differences. We propose a two-phase, state-level analytical framework using data from the US National Household Travel Survey (NHTS). In Phase I, we benchmark CatBoost, random forest (RF), and linear regression models across all 50 states and the District of Columbia to assess their balance between accuracy and interpretability. In Phase II, we use the selected model (linear regression) to examine how variable effects differ across states, revealing geographical heterogeneity through coefficient analysis and geospatial visualization.

This work makes a methodological contribution while also providing policy-relevant findings. Moving beyond the one-size-fits-all modeling paradigm supports a place-based approach to sustainable, multimodal transportation planning and infrastructure investment.

Analyzing trip generation separately for different areas is important, given the vast ethnic and geographical differences within the United States [6]. Each state differs in population density, economic conditions, and cultural factors, which significantly influence travel behavior. A state-level analysis helps transportation planners to understand regional patterns and requirements that may be overlooked in a national-level analysis [7]. Building on this motivation, the remainder of this paper is organized as follows: Section 2 reviews related literature, Section 3 outlines the methodology, Section 4 describes the results, Section 5 presents the discussion, and Section 6 concludes the study and discusses the limitations and future work.

2. Literature Review

2.1. Conceptual Framework of Household Trip Generation

Household trip generation, the foundation of transportation planning, has been widely studied in the literature. The common framework assumes that travel behavior results from a complex interaction between individual and household characteristics and the physical and social environment in which households exist [1]. This framework categorizes factors influencing trip generation into four major groups.

2.1.1. Demographic Characteristics

Demographic characteristics such as household size, age distribution, and cultural background are consistently associated with trip generation and travel behavior. Because larger households have more travel needs, household size often has a positive relationship with trip frequency across work, school, shopping, and recreational travel [8]. The average age of household members also influences travel patterns: younger households typically make more frequent and diverse trips, while older households travel for different purposes, such as medical care or leisure [9]. Additionally, the proportion of young children increases childcare-related travel, creating a higher trip count to daycare or preschool centers [10]. Gender composition also influences trip generation, as male and female members differ in employment participation and activity patterns [11]. Immigration background is also an important factor, as cultural norms and socioeconomic integration influence mobility choices [12]. Because life cycle classification and household race capture demographic diversity, and household age shapes trip purposes and frequencies, these categories enhance the explanatory power of the data [13].

2.1.2. Economic Characteristics

Economic characteristics such as household income and homeownership status are widely recognized as important determinants of trip generation. Greater household income expands financial capacity, enabling greater participation in discretionary activities such as shopping, leisure, and nonwork travel, increasing overall trip frequency [13,14]. Conversely, low-income households often exhibit constrained travel behavior due to budget limitations, frequently prioritizing essential trips [7]. Homeownership (versus renting) is associated with greater residential stability and long-term neighborhood ties, influencing travel patterns through established social networks and community services [13]. In contrast, renters, particularly those in dense urban areas, frequently rely heavily on public transit and nonmotorized modes of transportation, resulting in different trip purposes and distances [15]. Together, these economic attributes provide critical explanatory power in transportation models and trip generation.

2.1.3. Mobility and Locational Characteristics

Mobility and locational characteristics describe a household’s ability to access transportation resources as well as the spatial context in which travel occurs. The number of household vehicles available is a strong indicator of a household’s trip frequency and its independence in travel decisions [7,16]. Public transit use and ridesharing participation reflect openness to alternative modes and can influence modal shifts, especially in areas with strong shared mobility infrastructure [17,18]. The urban/rural designation is another critical determinant, as urban households typically have greater access to multimodal networks, shorter trip lengths, and higher accessibility compared to rural counterparts [7].

2.1.4. Education, Work Patterns, and Health in Trip Generation

Education, employment flexibility, health, and occupation all significantly influence travel behavior and mobility patterns. Higher educational attainment is often linked to professional or managerial occupations, which tend to result in longer commutes and higher travel demand, whereas lower education levels are often associated with service or manual labor jobs, limiting economic resources and creating shorter, location-dependent travel [19]. Flexible work arrangements can reduce peak-hour congestion and spread travel demand throughout the day, potentially decreasing trip frequency for commuting purposes [20]. Health status is also important; poor health can impede mobility, and chronic medical conditions often demand frequent travel to healthcare facilities [21]. Knowledge-based jobs are more frequently associated with public transportation and active modes, while manual labor jobs are more reliant on private vehicles [22].

These four categories are interdependent. For instance, the effect of income on trip decisions can vary considerably depending on the level of vehicle ownership and the characteristics of the surrounding urban form [1]. Given these complex relationships, it is unlikely that a single set of parameters can fully explain trip generation in all contexts, underscoring the need for geographically sensitive analysis.

2.2. Evolution of Trip Generation Models

In addition to advances in theoretical understanding, trip generation modeling techniques have also undergone significant development. For decades, linear regression and cross-classification were the dominant methods because of their simplicity, ease of implementation, and high interpretability [23]. The ability to present results through clear, meaningful coefficients made these models particularly appealing to policymakers.

The increasing availability of large, complex datasets, such as the NHTS, strains the ability of linear models to capture nonlinear relationships and interactions between variables. This has led to the increased adoption of ML approaches such as random forest, support vector machines, and gradient boosting algorithms (e.g., XGBoost, LightGBM, and CatBoost), which frequently provide better predictive performance [24,25,26].

Among these, CatBoost is particularly well suited to survey-based transportation data because of its optimized handling of categorical variables [27]. However, improvements in predictive accuracy often come at the cost of model interpretability, a persistent challenge in applied data science [28].

2.3. Addressing Spatial Heterogeneity in Trip Generation Research

Despite notable methodological advances, a significant research gap persists concerning spatial heterogeneity in trip generation modeling. Most existing studies continue to rely on national-level models, implicitly assuming that the effects of predictors remain constant across all regions. In a geographically, culturally, and economically diverse country like the United States, this assumption is rarely valid.

International comparative studies (e.g., [7]) have confirmed the existence of structural differences in travel behavior across countries. Likewise, regional or city-level research (e.g., [29,30]) has highlighted variations in travel patterns under different local contexts. However, a comprehensive, systematic, state-level comparative analysis within the U.S. that explicitly quantifies and interprets variations in trip generation factors has received far less attention.

This study fills a gap at the intersection of three key domains: trip generation factors, modeling methods, and spatial analysis. It argues that effective transportation policy requires more than just identifying the “best” prediction model. Instead, understanding which modeling approaches provide the best balance of accuracy and interpretability, as well as how the impact of key variables varies geographically, is critical for planning sustainable and integrated transportation systems. This approach shifts the focus from purely methodological debates to producing actionable, context-specific insights for sustainable transportation planning.

Recent literature on sustainable multimodal transportation trends emphasizes the integration of diverse transport modes to improve accessibility, environmental outcomes, and system resilience. Hsieh (2025) explores the adoption factors and urban integration of shared mobility services within sustainable transport systems, highlighting the importance of infrastructure readiness and user perceptions [31]. Saleem et al. (2024) analyze the nested ecosystem of Mobility as a Service (MaaS), emphasizing the role of cooperative and flexible services in enhancing system resilience [32]. Chen et al. (2024) evaluate sustainable multimodal networks in urban areas, identifying optimization methods that balance efficiency with environmental objectives [33]. These perspectives align with the objectives of the present study, which also seeks to provide insights that can inform policies encouraging mode shifts toward sustainable transport solutions. This approach is consistent with recent sustainable transport research that uses large scale travel survey data to examine modal integration and policy impacts.

3. Methodology

This study tests the core hypothesis that trip generation factors vary across geographic locations, reflecting spatial heterogeneity. To investigate this, we employ a two-phase, multi-scale framework. As illustrated in Figure 1, this approach moves logically from a broad comparative analysis to a focused, in-depth interpretation. The process begins with data preparation and hypothesis formulation. In Phase I, we benchmark the performance of three distinct categories of models. The key findings from this phase guide Phase II, which examines spatial heterogeneity in detail using both state-level and national models. Finally, the results are synthesized through coefficient tables and geospatial visualizations, providing the basis for the discussion and conclusions.

3.1. Data, Sample, and Variables

3.1.1. Data Source and Sample

The foundation of this study is the 2017 NHTS dataset, published by the U.S. Federal Highway Administration (FHWA) [34,35]. After a thorough data cleaning process, including the removal of incomplete records and handling of outliers, the final dataset includes 106,287 households from all 50 states and the District of Columbia. Descriptive statistics for the sample are presented in Table 1.

The NHTS is the most comprehensive national survey of travel behavior in the United States. It collects detailed information on daily trips through travel diaries, online questionnaires, and telephone interviews. The survey captures a wide range of variables, including household characteristics (e.g., family size, income, vehicle ownership), individual demographics (e.g., age, employment status, driver status), and trip details (e.g., purpose, frequency, mode choice) [34,35]. Its large sample size and national coverage make it particularly valuable for examining household trip generation patterns at the state level. For transportation planners and policymakers, NHTS data provide critical insights for infrastructure planning, public transit investment, and promoting sustainable travel modes [31,33,35].

Table 1 lists the variables used in this study along with their descriptions.

3.1.2. Dependent and Independent Variables

The dependent variable in this study is CNTTDHH, the total number of trips made by a household in a single day.

Variable Selection Framework: The variables used in this study were selected through a structured process combining domain theory, data quality considerations, and prior literature. We began with an extensive list of socio-demographic, household, and mobility-related variables commonly associated with trip generation. Variables with more than 20% missing data or minimal variation were excluded to maintain model stability. While the NHTS dataset provides a rich set of household factors, it does not include certain spatial and infrastructural attributes such as regional transport accessibility, land use mix, and proximity to employment centers. Although excluded from this study for data availability reasons, these factors are acknowledged as important contributors to travel behavior and are recommended for inclusion in future work integrating NHTS with complementary spatial datasets. The final variable set was chosen to balance predictive power and interpretability while aligning with established travel behavior theory.

The 25 independent variables, listed in Table 1, are grouped into four main categories based on established findings in the transportation literature:

1. Demographic characteristics

These variables capture the household’s composition and age structure. HHSIZE reflects the number of household members, influencing the need for work, school, and recreational trips. Age_mean measures the average age of members, with younger households often having more frequent daily activities, while older households may have fewer but different types of trips. YOUNGCHILD_prc indicates the proportion of children aged 0–4, a stage that typically increases childcare-related travel. Gender_male_prc represents the proportion of male members, which may relate to different employment and activity patterns. BORNINUS_No_prc measures the share of members not born in the U.S., which can reflect differences in travel behavior due to cultural or socioeconomic factors. HH_RACE and LIF_CYC (life cycle classification) capture demographic diversity and household stage, both of which can shape trip purposes and frequencies.

2. Economic characteristics

These variables represent household financial capacity and residential stability. HHFAMINC (household income) influences the ability to participate in a wider range of activities and make discretionary trips. HOMEOWN distinguishes between owners and renters, often linked to stability, neighborhood choice, and travel opportunities.

3. Mobility and locational characteristics

These variables describe access to transportation resources and the spatial context of travel. HHVEHCNT (vehicle count) and DRVRCNT_prc (proportion of licensed drivers) indicate the household’s ability to use private vehicles. PTUSED and RIDESHARE capture the frequency of public transit and rideshare use, reflecting openness to alternative modes. URBRUR identifies whether the household is located in an urban or rural area, which strongly influences mode choice, trip length, and accessibility.

4. Other social and individual characteristics

These variables capture additional social, educational, and occupational factors that can affect travel. Education levels (EDUC_graduated_prc, EDUC_some_college_prc, EDUC_bachelor_prc) may relate to employment type and travel demand. Work flexibility (FLEXTIME_Yes_prc) can affect trip timing and frequency, while holding multiple jobs (GT1JB_Yes_prc) often increases commuting requirements. HEALTH_Poor_prc and MEDCOND_Yes_prc represent health conditions that may either limit mobility or increase the need for specific trips, such as medical visits. Occupational categories (OCCAT_Clerical_administration_prc, OCCAT_Sales_service_prc, OCCAT_Manufacturing_construction_farming_prc, OCCAT_Professional_managerial_technical_prc) reflect different job types, each with distinct trip patterns.

3.1.3. Handling Categorical Variables

One of the main challenges when working with the NHTS dataset is managing categorical variables. For machine learning models (e.g., CatBoost) that can natively process categorical features, these variables were kept in their original form. However, for the linear regression model, categorical variables were transformed using one-hot encoding, converting each category into a separate binary variable (for example, transforming the 11 household income levels into 11 binary variables). This approach preserves all information contained in the categories while avoiding incorrect assumptions about the order or distance between them.
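The encoding step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `one_hot` and the five sample income codes are hypothetical, and only three income levels are used instead of the 11 NHTS levels.

```python
def one_hot(values, prefix):
    """Map each categorical code to a row of 0/1 indicator columns."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# Hypothetical income codes for five households (not real NHTS records).
income_codes = [1, 3, 2, 3, 1]
columns, rows = one_hot(income_codes, "HHFAMINC")

for row in rows:
    # Each household has exactly one indicator set to 1.
    print(dict(zip(columns, row)))
```

In a regression that includes an intercept, one indicator per variable is commonly dropped as a reference level to avoid perfect collinearity; the text does not state whether that was done in this study.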

In this study, the key categorical variables are HHFAMINC (household income), HH_RACE (race of household respondent), LIF_CYC (life cycle classification), URBRUR (urban or rural location), and HOMEOWN (homeownership status). The first three have multiple categories, and their coding schemes are shown in Table 2, Table 3 and Table 4. Table 2 lists the 11 household income categories, Table 3 presents the five race categories, and Table 4 details the 10 life cycle classifications. URBRUR is coded as 1 = Urban and 2 = Rural, and HOMEOWN is coded as 1 = Owned and 2 = Rented.

Table 5 presents descriptive statistics for all variables, with continuous variables reported as means, standard deviations, minimums, and maximums, and categorical variables summarized as counts and percentages.

3.2. Phase I: Comparative Benchmarking of Models

The primary objective of Phase I is to identify the modeling approach that offers the best balance between predictive accuracy and interpretability for state-level household trip generation. This study employs three distinct models: linear regression, random forest, and CatBoost. The following section provides a concise description of each model along with its key strengths and limitations.

3.2.1. Linear Regression

Linear regression is one of the most widely used statistical techniques in travel behavior modeling and trip generation analysis because of its simplicity, interpretability, and strong theoretical foundation. Linear regression estimates the relationship between a dependent variable and multiple independent variables by fitting a linear equation to observed data, expressed as

y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{k} x_{k} + ε    (1)

where y is the dependent variable (e.g., household trip generation), x_{1}, x_{2}, \dots, x_{k} are the independent variables, β_{0} is the intercept, β_{1}, β_{2}, \dots, β_{k} are the coefficients, and ε is the error term [36,37].

However, linear regression assumes linearity, independence of errors, and homoscedasticity, and it can struggle with multicollinearity among predictors [36]. In addition, it is less capable of capturing complex, nonlinear relationships between variables, which may limit its predictive accuracy when applied to heterogeneous travel behavior patterns [37,38].

In trip generation studies, linear regression is valued for providing easily interpretable coefficients, enabling researchers and policymakers to directly assess the influence of socioeconomic, demographic, and locational variables on travel demand [1]. Its transparency makes it suitable for policy evaluation, as stakeholders can clearly see how changes in explanatory variables may affect travel behavior.
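To make the coefficient interpretation concrete, the following minimal sketch estimates the coefficients of Equation (1) by ordinary least squares using the normal equations. It is an illustration, not the study's implementation: the function names `solve` and `fit_ols` and the toy household data are assumptions, and the data are noise-free so the true coefficients are recovered exactly.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(X, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors."""
    design = [[1.0] + list(row) for row in X]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Toy data: (household size, vehicle count) -> daily trips; illustrative only.
X = [(1, 0), (2, 1), (3, 1), (4, 2), (5, 3)]
y = [3.0 + 2.0 * size + 0.5 * vehicles for size, vehicles in X]

beta = fit_ols(X, y)
print([round(b, 6) for b in beta])  # intercept, size effect, vehicle effect
```

Each fitted coefficient reads directly as the change in expected daily trips for a one-unit change in that predictor, holding the others fixed, which is the transparency the text refers to.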

3.2.2. Random Forest

Random forest (RF) is an ensemble learning method that builds multiple decision trees during training and outputs either the mean prediction (for regression) or the majority vote (for classification) of the individual trees [39]. The algorithm uses bootstrap aggregation (bagging) to reduce variance and improve generalization. Each tree is trained on a bootstrap sample of the original dataset, and at each split, a random subset of predictors is considered, which helps to decorrelate the trees. The prediction function for regression can be expressed as

\hat{y} = \frac{1}{B} \sum_{b = 1}^{B} T_{b} (x)    (2)

where B is the number of trees, and T_{b} (x) is the prediction of the b-th decision tree for input x.

Random forest has several advantages, including robustness to overfitting, the ability to model complex, nonlinear relationships, and an inherent measure of variable importance [40]. However, it can be less interpretable compared to simpler models, and performance may decline when extrapolating beyond the range of the training data [41].
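The bagging and averaging in Equation (2) can be illustrated with a deliberately small sketch. It is not a full random forest: each "tree" is a one-split stump on a single predictor, so the random selection of predictors at each split has nothing to choose from and is omitted. The function names and the toy data are assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single predictor."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds, largest excluded
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical in this sample: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_bagged(xs, ys, n_trees, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Eq. (2): average the B individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


# Toy data: household size -> daily trips (illustrative only).
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 3, 4, 5, 6, 7, 9, 10, 11, 12]
trees = fit_bagged(xs, ys, n_trees=25)
print(predict(trees, 1), predict(trees, 5))
```

Because every prediction is an average of training-set means, the output can never leave the range of the observed target values, which is one way to see the extrapolation limitation noted above.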

3.2.3. CatBoost

CatBoost is a gradient boosting algorithm developed by Prokhorenkova et al. (2018) and Dorogush et al. (2018) [27,42] that has attracted growing interest in recent years across diverse domains, including hydrology and flood prediction [24,25]. It is particularly notable for its efficient handling of categorical variables, which allows it to retain valuable information without requiring prior conversion into numerical form.

A defining feature of CatBoost is ordered boosting, an approach designed to address the “prediction shift” problem, where future information can inadvertently influence the training phase. This method improves generalization by preventing target leakage, thereby enhancing accuracy and reducing overfitting risk [42]. The algorithm uses binary decision trees as base learners, and its prediction can be expressed as

F (x) = \sum_{j = 1}^{M} β_{j} h (x; b_{j})    (3)

where M is the number of base learners, h (x; b_{j}) is the base learner applied to the explanatory variables x, β_{j} are the expansion coefficients, and b_{j} are the parameters of the model [43].

CatBoost performs well across different dataset sizes and has shown strong results in domains requiring accurate modeling of complex, nonlinear relationships. In transportation research, its ability to handle numerous categorical features, such as trip purposes, travel modes, and household attributes, makes it particularly well suited for datasets like the NHTS [27,28,42]. Recent studies using NHTS data found that CatBoost models outperformed other machine learning approaches in estimating variables such as traveler age and gender [44]. These strengths are largely attributed to its ordered boosting strategy, efficient handling of categorical features, and robustness to overfitting [28].

Despite these benefits, CatBoost has some limitations. Like other gradient boosting algorithms, it can be less interpretable than simpler statistical models, making it harder to extract straightforward cause-and-effect relationships. Additionally, although its training speed is competitive, it can require significant computational resources for very large datasets, and its performance may depend on careful parameter tuning [28]. Nevertheless, in scenarios where predictive accuracy takes precedence over model transparency, CatBoost remains a powerful and versatile choice for both research and applied transportation modeling.

For each of the 51 geographic units (50 states plus the District of Columbia), the dataset is split into training (80%) and testing (20%) subsets using stratified sampling on the dependent variable to preserve similar distributions in both sets.
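A stratified 80/20 split of this kind can be sketched as follows for a single geographic unit. The text does not say how the trip counts were grouped into strata, so this sketch assumes each distinct value of CNTTDHH forms its own stratum; the function name `stratified_split` and the 35 toy households are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_share=0.2, seed=0):
    """Split records into train/test, sampling test_share within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    train, test = [], []
    for stratum in sorted(strata):
        group = strata[stratum]
        rng.shuffle(group)
        n_test = round(len(group) * test_share)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy households for one state: 20 with 2 trips, 10 with 6, 5 with 12.
trip_counts = [2] * 20 + [6] * 10 + [12] * 5
households = [{"id": i, "CNTTDHH": t} for i, t in enumerate(trip_counts)]

train, test = stratified_split(households, key=lambda h: h["CNTTDHH"])
print(len(train), len(test))
```

In the workflow described above, a split like this would be repeated independently for each of the 51 geographic units. With real data, very rare trip counts would likely need to be binned together so that every stratum can contribute to both subsets.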

Each model is trained on the training data and evaluated on the testing data using three standard performance metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R²).
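For reference, the three metrics can be computed from their standard definitions as in the sketch below. The observed and predicted trip counts are made-up values chosen so the arithmetic is easy to check by hand.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted daily trips for four households.
observed = [2, 4, 6, 8]
predicted = [3, 4, 5, 10]

print(mae(observed, predicted))   # absolute errors 1, 0, 1, 2 -> 1.0
print(rmse(observed, predicted))  # sqrt(6 / 4)
print(r2(observed, predicted))    # 1 - 6 / 20
```

Lower MAE and RMSE indicate better fit, while higher R² indicates that a larger share of the variance in trip counts is explained.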

The results from this benchmarking phase provide the basis for selecting the most suitable model for the detailed spatial analysis in Phase II.

3.3. Phase II: Spatial Heterogeneity Analysis Using State-Level Models

A key, and somewhat unexpected, finding from Phase I is the strong and often superior performance of the linear regression model. This outcome provides an opportunity for interpretable, in-depth analysis. Therefore, Phase II focuses exclusively on linear regression models to examine the paper’s central hypothesis regarding spatial heterogeneity. This phase involves two parallel analyses:

Nationwide Model: A single linear regression model is trained on the full dataset (all states combined). This model identifies “global drivers” of trip generation and serves as a baseline for comparison.
State-Level Models: Separate linear regression models are trained for each of the 51 geographic units, enabling the estimation of state-specific coefficients and p-values for every predictor. This approach, central to the paper’s novelty, produces 51 distinct effect sizes per variable, revealing regional patterns.
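The contrast between the two analyses above can be sketched with a single predictor and two made-up states. Everything in this example is an assumption for illustration: the state labels "AA" and "BB", the data, and the function names. It uses the closed-form solution for one predictor and omits the p-values that the study also estimates.

```python
from collections import defaultdict


def fit_simple_ols(xs, ys):
    """Closed-form OLS for y = b0 + b1 * x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def fit_by_state(rows):
    """rows: (state, x, y) tuples. Returns {state: (intercept, slope)}."""
    groups = defaultdict(list)
    for state, x, y in rows:
        groups[state].append((x, y))
    return {s: fit_simple_ols(*zip(*pts)) for s, pts in groups.items()}


# Made-up rows: (state, household size, daily trips).
rows = [
    ("AA", 1, 4.0), ("AA", 2, 6.0), ("AA", 3, 8.0),  # 2 extra trips per member
    ("BB", 1, 3.0), ("BB", 2, 4.0), ("BB", 3, 5.0),  # 1 extra trip per member
]

state_fits = fit_by_state(rows)                      # state-level models
nationwide = fit_simple_ols([r[1] for r in rows],    # single pooled model
                            [r[2] for r in rows])

print(state_fits)
print(nationwide)
```

In this toy case the pooled slope falls between the two state slopes, so the nationwide model describes neither state exactly. That is the kind of masking effect the state-level models are meant to expose.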

3.4. Analytical Techniques and Visualization

To extract meaningful insights from the state-level regression models, this study employed two complementary analytical approaches. First, a coefficient robustness analysis was conducted to evaluate the stability of each predictor’s effect across states. For every variable, the percentage of states where the coefficient was statistically significant (p < 0.05) was calculated, along with an assessment of whether the effect was consistently positive, consistently negative, or varied in direction. This approach made it possible to distinguish robust, universal predictors from those whose influence was highly dependent on local context.
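A robustness summary of this kind could be computed as in the sketch below. Two points are assumptions rather than details from the text: the direction is judged only among states where the coefficient is significant, and all coefficients, p-values, and state labels are invented for illustration.

```python
def summarize_robustness(results, alpha=0.05):
    """results: {state: (coefficient, p_value)} for one predictor.

    Returns the percentage of states with p < alpha and a direction label
    based on the signs of the significant coefficients (an assumption).
    """
    significant = [c for c, p in results.values() if p < alpha]
    share = 100.0 * len(significant) / len(results)
    if not significant:
        direction = "none significant"
    elif all(c > 0 for c in significant):
        direction = "consistently positive"
    elif all(c < 0 for c in significant):
        direction = "consistently negative"
    else:
        direction = "mixed"
    return share, direction


# Invented per-state results for two predictors across four states.
hhsize = {"AA": (1.8, 0.001), "BB": (2.1, 0.0004), "CC": (1.5, 0.03), "DD": (0.9, 0.20)}
income = {"AA": (0.4, 0.01), "BB": (-0.3, 0.02), "CC": (0.1, 0.60), "DD": (0.05, 0.80)}

print(summarize_robustness(hhsize))  # (75.0, 'consistently positive')
print(summarize_robustness(income))  # (50.0, 'mixed')
```

A predictor like the first would be classified as robust, while the second would be flagged as context-dependent.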

Second, geospatial visualization techniques were used to illustrate spatial heterogeneity. For each key variable, a choropleth map of the United States was generated, with color gradients representing the effect size (coefficient) in each state. States where the effect was not statistically significant were marked with hatching patterns. This visual format enabled readers to grasp at a glance not only the magnitude and direction of each effect but also its statistical reliability across the country.

4. Results

This section presents the empirical findings obtained through the two-stage methodological framework. First, the results of the comparative analysis of the models are outlined. This is followed by a detailed discussion of the outcomes from the linear regression model at both the national and state levels.

4.1. Benchmarking Results

The performance evaluation of the three models (CatBoost, random forest, and linear regression) across 51 geographic units (50 states plus the District of Columbia) yielded noteworthy results. Table 6 summarizes the overall performance metrics of these models, calculated as weighted averages based on the number of samples in each state.

As shown in Table 6, the overall performance of the three models is very close. CatBoost, with a weighted average R² of 0.323, achieved the best overall performance by a narrow margin over linear regression (0.321) and random forest (0.315). The most noteworthy finding, however, lies in the last column of the table: the linear regression model achieved the highest R² in 23 of the 51 geographic units, more than either CatBoost or random forest (14 each). This result highlights the effectiveness of the linear regression model at the state level and justifies its selection for more in-depth interpretive analysis in the second phase of the study, as it offers the best balance between predictive accuracy and structural transparency.
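The two summary statistics in Table 6, the sample-weighted average R² and the number of states won, can be computed as in the sketch below. The data structure, function names, and numbers are hypothetical toy values chosen to show that a model can lead on the weighted average while another wins more states.

```python
# Hypothetical per-state test results (sample size and R² per model); not study data.
results = {
    "S1": {"n": 100, "r2": {"CatBoost": 0.30, "Linear Regression": 0.32, "Random Forest": 0.28}},
    "S2": {"n": 300, "r2": {"CatBoost": 0.36, "Linear Regression": 0.34, "Random Forest": 0.33}},
    "S3": {"n": 100, "r2": {"CatBoost": 0.25, "Linear Regression": 0.27, "Random Forest": 0.26}},
}

def weighted_avg_r2(res, model):
    """Average a model's R² across states, weighting each state by its sample size."""
    total_n = sum(s["n"] for s in res.values())
    return sum(s["n"] * s["r2"][model] for s in res.values()) / total_n

def count_wins(res):
    """Count, per model, the states where it has the highest R² (ties go to the first listed)."""
    wins = {}
    for s in res.values():
        best = max(s["r2"], key=s["r2"].get)
        wins[best] = wins.get(best, 0) + 1
    return wins

catboost_avg = weighted_avg_r2(results, "CatBoost")
linear_avg = weighted_avg_r2(results, "Linear Regression")
wins = count_wins(results)
```

In this toy example CatBoost has the higher weighted average because it leads in the largest state, while linear regression wins two of the three states.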

Because the aggregate metrics in Table 6 are so close, Figure 2 depicts the full distribution of performance metrics across states, which gives deeper insight into this similarity and into the variation in performance from state to state.

These boxplots clearly illustrate that not only are the median performance values similar, but the entire distributions (including the first and third quartiles) for all three models overlap substantially. This notable overlap reinforces the findings of Table 6, indicating that, in terms of overall performance, no single model decisively outperforms the others. However, the scatter of points within each plot reveals considerable variability in model performance across different states. This close competition at the aggregate level underscores the importance of analyzing the number of state-level wins (last column of Table 6) as a key differentiating metric, providing the basis for the more in-depth interpretive analyses presented later in this paper.

To examine this close competition more precisely at the state level, Figure 3 presents a direct comparison of R² performance between the two primary models, CatBoost, representing advanced machine learning, and linear regression, representing an interpretable baseline. Each dumbbell represents a state, ordered by the average performance of the two models. The connecting line shows the performance gap between the two models for that specific state. This visualization reveals a highly competitive landscape in which neither model consistently outperforms the other across all geographies, further supporting the rationale for analyzing the number of state-level wins.

In some states, such as Kansas (KS) and New Hampshire (NH), both models achieved high R² values, indicating strong predictive capability for travel behavior in these regions. In contrast, in states like Nevada (NV) and North Dakota (ND), both models performed poorly, which may be attributable to higher noise in the data or the influence of unobserved factors.

A key takeaway from this figure is the absence of an absolute winner. Short connecting lines in many states (e.g., California (CA) and Texas (TX)) indicate nearly identical performance between the two models. In other states, CatBoost (blue point) sometimes holds a slight advantage, while in others, linear regression (orange point) performs marginally better. This variable win–loss pattern visually confirms and reinforces the main finding from Table 6, that linear regression achieved the highest number of wins across states. This tight competition further strengthens the rationale for moving beyond simply identifying the best model and instead focusing on an in-depth analysis of spatial heterogeneity using the interpretable linear regression model.

4.2. National Linear Model Results

To establish a baseline for subsequent comparisons, a single linear regression model was fitted to the entire national dataset. Figure 4 illustrates the coefficients of the top twenty statistically significant variables (p < 0.05) from this model. The chart displays standardized coefficients along with the direction of their influence on household daily trip generation. The length of each bar represents the magnitude of the effect, while the color indicates its direction (blue: positive, orange: negative).

The results of the national model indicate that demographic variables exert the strongest influence on trip generation. Specifically, the presence of young children (YOUNGCHILD_prc), with a large negative coefficient (−13.73), is the most significant factor reducing trip frequency. In contrast, variables related to household structure and life cycle (such as various LIF_CYC categories and HHSIZE), as well as educational variables, show positive and statistically significant effects. While this model provides an overall picture of key factors at the national level, as will be shown later, this aggregate view conceals substantial regional variations.

4.3. Spatial Heterogeneity Analysis

4.3.1. Coefficient Consistency Analysis: Identifying Core and Unstable Variables

To quantify the stability of each variable’s effect across the United States, a consistency analysis was conducted on the coefficients obtained from the 51 state-specific regression models (50 states plus the District of Columbia). Table 7 summarizes the results of this analysis, indicating the percentage of states in which each variable was statistically significant, as well as whether the direction of its effect (positive or negative) remained consistent across states.

The results of this analysis reveal a clear distinction between two categories of variables. The first category consists of fundamental variables, whose effects are stable and reliable across most geographic areas. At the top of this group is HHSIZE (household size), which shows a positive and statistically significant influence on trip generation in 86.3% of states. Following closely is YOUNGCHILD_prc (presence of young children), which is significant in 66.7% of states and consistently exhibits a strong negative effect. Variables associated with specific life cycle stages (such as LIF_CYC6 and LIF_CYC4) also demonstrate notable consistency.

The second category includes volatile or context-dependent variables, which, despite their importance in the national model, display much lower stability at the state level. Surprisingly, classic variables such as HHFAMINC (household income) and HHVEHCNT (number of vehicles) are statistically significant in only 19.6% and 9.8% of states, respectively. This finding challenges the common assumption that these are universal predictors.

More importantly, some of these unstable variables also exhibit a sign flip phenomenon. For instance, DRVRCNT_prc (percentage of drivers) has a positive and statistically significant effect in some states but a negative and statistically significant effect in others. This statistical and directional instability strongly suggests that the influence of such variables cannot be interpreted in isolation and is highly intertwined with the economic, social, and built-environment context of each state. These complex geographic patterns will be explored visually in the next section.

4.3.2. Visualization of Geographic Patterns

The core of this study’s findings lies in the analysis of results from the 51 state-specific regression models. To visually illustrate spatial heterogeneity in variable effects, choropleth maps were created for the most important and consistent predictors. Figure 5, Figure 6, Figure 7 and Figure 8 display the geographic distribution of coefficients for four key variables. In these maps, the color of each state represents the magnitude and direction of the coefficient, while a hatched pattern indicates that the coefficient in that state is not statistically significant (p ≥ 0.05).

The map in Figure 5 shows state-level differences in the effect of adding one household member on daily trip generation. A positive (blue) and statistically significant effect in most states highlights the role of household size as a strong and fundamental predictor. Figure 6 illustrates the powerful and consistently negative (orange) influence of young children on travel. The magnitude of this travel-reducing effect reveals notable regional patterns and is particularly strong in the Deep South and Utah. Figure 7 depicts the urban–rural divide through the rural residence coefficient (URBRUR2). The negative coefficient, indicating fewer trips for rural households, is statistically significant in many states, especially in the Northeast, but its magnitude varies considerably. Figure 8 presents the complex and spatially unstable relationship for the percentage of drivers (DRVRCNT_prc), whose coefficient exhibits a sign flip: it is significantly positive in some states (e.g., Minnesota) and significantly negative in Washington, suggesting a strong interaction with local urban context.

These visual results clearly support the study’s hypothesis. As shown, even for fundamental variables such as household size, the magnitude of the effect is not uniform across the country. For more complex variables, such as the percentage of drivers, both the magnitude and the direction of the effect vary geographically. A deeper interpretation of these spatial patterns and their policy implications will be discussed in the following section.

5. Discussion

The findings of this study challenge and enrich the conventional understanding of trip generation modeling in three key areas. This section interprets the observed patterns, links them to the existing literature, and draws out their theoretical implications.

5.1. The Accuracy–Interpretability Paradox: Returning to Transparency

One of the central findings of this study is the strong and unexpected performance of the linear regression model in state-level analyses. In an era where the dominant paradigm in data science emphasizes high-accuracy machine learning algorithms, our results show that in the context of transportation planning, a classic and transparent model can not only be statistically competitive but also provide a far more powerful analytical tool. This finding aligns with a growing body of literature on transparency in policy-oriented modeling. In fields such as urban planning, where decisions carry direct and long-term consequences for citizens’ lives, the ability to interpret and explain the “why” behind a prediction is as valuable as the prediction itself. This study demonstrates that model selection should not be driven solely by a bake-off of accuracy metrics, but rather should be a strategic decision guided by the ultimate goals of the research, particularly the need to generate actionable knowledge.

5.2. Dissecting Spatial Heterogeneity: A New Classification of Trip Generation Factors

The analysis of state-level models allowed us to move beyond a one-dimensional view of variable importance and develop a multilayered classification of trip generation factors. Table 7 and Figure 5, Figure 6, Figure 7 and Figure 8 quantitatively and visually confirm this classification.

The first category comprises fundamental variables. Household size (HHSIZE) and the presence of young children (YOUNGCHILD_prc) function as the core pillars of the model, regardless of geography. The high robustness of these variables, statistical significance in 86.3% and 66.7% of states, respectively, indicates that the basic demographic structure of a household is a universal driving force in trip generation, largely unaffected by regional context.

The second category comprises context-dependent variables. This group, which contains some of the study’s most intriguing findings, includes variables whose effects are deeply intertwined with local characteristics. HHFAMINC (household income) and DRVRCNT_prc (percentage of drivers) are prominent examples. The statistical instability of these coefficients across states, together with the sign flip observed for DRVRCNT_prc, is not random noise but rather a meaningful signal of underlying interaction effects. For example, the negative coefficient for percentage of drivers in Washington State strongly suggests that in a dense built environment with heavy traffic, simply holding a driver’s license does not necessarily translate into higher levels of automobile trip-making. This finding confirms and extends classic theories on the importance of the built environment at a broad geographic scale, demonstrating that a variable’s predictive power cannot be assessed in isolation but must be interpreted alongside other spatial characteristics.

5.3. Theoretical Implications: Toward a Place-Based Theory of Travel Behavior

The findings of this study carry important implications for travel behavior theory. They reveal the limitations of global rationality theories, which assume that factors influence decision-makers in the same way everywhere. In contrast, our results support a place-based theory of travel behavior, in which place functions not merely as a control variable but as an active factor shaping the relationships between variables. This theoretical framework encourages planners to shift from asking “which factors are important?” to asking “which factors matter, in which places, and under what conditions?”

6. Conclusions

This study, aimed at examining spatial heterogeneity in the factors influencing household trip generation across the United States, reached clear and definitive conclusions. We demonstrated that the processes shaping travel behavior are deeply local phenomena, and that unified national models are unable to capture this rich diversity. Three key messages emerge from this research:

Prioritizing Interpretable Models: In public policy contexts, transparent models such as linear regression, which allow us to understand underlying mechanisms, can be more valuable than complex black box models that merely offer higher predictive accuracy.
Identifying Factor Stability: Not all trip generation factors are created equal. Policymakers should distinguish between stable, fundamental factors (such as demographic structure) and volatile, context-dependent factors (such as income).
The Need for Place-Based Planning: Transportation policies should be designed based on a precise understanding of each region’s unique characteristics. A pricing policy that proves effective in one state may be ineffective in another, and investment in public transportation can yield vastly different returns depending on the location.

Ultimately, this research calls for a more nuanced and humble approach to modeling human behavior, one that respects the complexity and diversity of the spaces and communities in which we live.

Despite its notable findings, this study has limitations that pave the way for future research. First, the cross-sectional nature of the NHTS data restricts the analysis of temporal dynamics and causal relationships. Second, our unit of analysis was the state, whereas spatial heterogeneity undoubtedly exists at finer geographic scales (such as counties or metropolitan areas) and warrants further investigation.

Accordingly, several directions for future research are proposed:

Longitudinal Analysis: Use panel data to examine how changes in policies or economic conditions over time affect trip generation coefficients across different states.
Multilevel Modeling: Apply hierarchical models that simultaneously capture variation at the household, county, and state levels to more precisely disentangle the sources of spatial heterogeneity.
Integration of Spatial Datasets: Incorporate more detailed variables related to land use, job density, public transit accessibility indices, and traffic patterns into the models to help explain a greater share of the observed variance in coefficients.
Exploring Mode Choice Heterogeneity: Investigate whether similar spatial heterogeneity exists in the factors influencing the choice of travel mode (car, public transit, walking), as this would be a logical and important next step.

These research directions can help build a more comprehensive and dynamic understanding of the complex geography of transportation in the United States.

Author Contributions

Conceptualization, S.N. and M.S.; methodology, S.N. and M.S.; validation, S.N., M.S., S.A., and M.B.; formal analysis, S.N. and M.S.; data curation, S.N., M.S., F.G., and Y.H.; writing—original draft preparation, S.N., M.S., E.R., and M.L.; writing—review and editing, S.N., M.S., E.R., M.L., F.G., Y.H., S.A., M.B., and A.M.; visualization, S.N. and M.S.; supervision, S.N. and A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the National Household Travel Survey (NHTS), with details available at https://nhts.ornl.gov (accessed on 25 August 2024). The processed data and analysis codes generated for this research are not publicly archived but are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ortúzar, J.d.D.; Willumsen, L.G. Modelling Transport, 4th ed.; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
McNally, M.G. The four-step model. In Handbook of Transport Modelling, 2nd ed.; Hensher, D.A., Button, K.J., Eds.; Pergamon: Oxford, UK, 2007. [Google Scholar]
Ewing, R.; Cervero, R. Travel and the built environment: A meta-analysis. J. Am. Plan. Assoc. 2010, 76, 265–294. [Google Scholar] [CrossRef]
Litman, T. Evaluating Transportation Land Use Impacts: Considering the Impacts, Benefits and Costs of Different Land Use Development Patterns; Victoria Transport Policy Institute: Victoria, BC, Canada, 2023. [Google Scholar]
Rehill, P.; Biddle, N. Transparency challenges in policy evaluation with causal machine learning: Improving usability and accountability. Data Policy 2024, 6, e43. [Google Scholar] [CrossRef]
Pucher, J.; Renne, J.L. Socioeconomics of urban travel: Evidence from the 2001 NHTS. Transp. Q. 2003, 57, 49–77. [Google Scholar]
Giuliano, G.; Dargay, J. Car ownership, travel and land use: A comparison of the US and Great Britain. Transp. Res. Part A Policy Pract. 2006, 40, 106–124. [Google Scholar] [CrossRef]
Mwale, M.; Luke, R.; Pisa, N. Factors that affect travel behaviour in developing cities: A methodological review. Transp. Res. Interdiscip. Perspect. 2022, 16, 100683. [Google Scholar] [CrossRef]
Qawasmeh, B.; Qawasmeh, S.; Al Tawil, A.; Qawasmeh, D. Estimation of trip-based generation models and calibration of mode choice models for the American travel behavior. Open Transp. J. 2024, 18, e26671212348473. [Google Scholar] [CrossRef]
Qawasmeh, B. Estimation of a household trip-based generation model for the state of Michigan. In Sustainable Approaches to Environmental Design, Materials Science, and Engineering Technologies; Springer: Cham, Switzerland, 2025; pp. 105–112. [Google Scholar]
Fisu, A.A.; Syabri, I.; Andani, I.G.A. How do young people move around in urban spaces?: Exploring trip patterns of generation-Z in urban areas by examining travel histories on Google Maps Timeline. Travel Behav. Soc. 2024, 34, 100686. [Google Scholar] [CrossRef]
Lee, S.; Golub, A. Difference in travel behavior between immigrants in the U.S. and U.S.-born residents: The immigrant effect for car-sharing, ride-sharing, and bike-sharing services. Transp. Res. Interdiscip. Perspect. 2021, 9, 100296. [Google Scholar] [CrossRef]
Clifton, K.J.; Larco, N.; Currans, K.M.; Wettach-Glosser, J. Improving Trip Generation Methods for Livable Communities; Transportation Research and Education Center (TREC): Portland, OR, USA, 2017. [Google Scholar]
Bhat, C.R.; Gossen, R. A mixed multinomial logit model analysis of weekend recreational episode type choice. Transp. Res. Part B Methodol. 2004, 38, 767–787. [Google Scholar] [CrossRef]
Salon, D. Neighborhoods, cars, and commuting in New York City: A discrete choice approach. Transp. Res. Part A Policy Pract. 2009, 43, 180–196. [Google Scholar] [CrossRef]
Blumenberg, E.; Pierce, G. Automobile ownership and travel by the poor: Evidence from the 2009 National Household Travel Survey. Transp. Res. Rec. 2012, 2320, 28–36. [Google Scholar] [CrossRef]
Shaheen, S.; Cohen, A.; Zohdy, I. Shared Mobility: Current Practices and Guiding Principles; U.S. Department of Transportation, Federal Highway Administration: Washington, DC, USA, 2016. [Google Scholar]
Clewlow, R.R.; Mishra, G.S. Disruptive Transportation: The Adoption, Utilization, and Impacts of Ride-Hailing in the United States; Institute of Transportation Studies, University of California: Davis, CA, USA, 2017. [Google Scholar]
van Wee, B.; Witlox, F. COVID-19 and its long-term effects on activity participation and travel behaviour: A multiperspective view. J. Transp. Geogr. 2021, 95, 103144. [Google Scholar] [CrossRef]
de Abreu e Silva, J.; Melo, P.C. Home telework, travel behavior, and land-use patterns: A path analysis of British single-worker households. J. Transp. Land Use 2018, 11, 1134. [Google Scholar] [CrossRef]
Abdul Latiff, A.R.; Mohd, S. Transport, mobility and the wellbeing of older adults: An exploration of private chauffeuring and companionship services in Malaysia. Int. J. Environ. Res. Public Health 2023, 20, 2720. [Google Scholar] [CrossRef]
Zhao, P.; Lü, B.; de Roo, G. Impact of the jobs-housing balance on urban commuting in Beijing in the transformation era. J. Transp. Geogr. 2011, 19, 59–69. [Google Scholar] [CrossRef]
Sekhar, S.V.C.; Anand, S.; Karim, M.R. Comparison of regression model and category analysis (a case study). J. East. Asia Soc. Transp. Stud. 1997, 2, 917–929. [Google Scholar]
Szczepanek, R. Daily streamflow forecasting in mountainous catchment using XGBoost, LightGBM and CatBoost. Hydrology 2022, 9, 226. [Google Scholar] [CrossRef]
Aleksandrov, N.; Ermakov, D.; Aziz, A.; Kazenkov, O. Finding the optimal machine learning model for flood prediction on the Amur River. Comput. Nanotechnol. 2022, 9, 11–20. [Google Scholar] [CrossRef]
Gao, Q.; Molloy, J.; Axhausen, K. Trip purpose imputation using GPS trajectories with machine learning. ISPRS Int. J. Geo-Inf. 2021, 10, 775. [Google Scholar] [CrossRef]
Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31, 9516. [Google Scholar] [CrossRef]
Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for big data: An interdisciplinary review. J. Big Data 2020, 7, 94. [Google Scholar] [CrossRef]
Goel, R.; Mohan, D. Investigating the association between population density and travel patterns in Indian cities—An analysis of 2011 census data. Cities 2020, 100, 102656. [Google Scholar] [CrossRef]
Chakraborty, A.; Mishra, S. Land use and transit ridership connections: Implications for state-level planning agencies. Land Use Policy 2013, 30, 458–469. [Google Scholar] [CrossRef]
Hsieh, F.-S. Emerging research issues and directions on MaaS, sustainability and shared mobility in smart cities with multi-modal transport systems. Appl. Sci. 2025, 15, 5709. [Google Scholar] [CrossRef]
Saleem, M.A.; Yasmin, F.; Ismail, H.; Low, D.; Afzal, H. Unlocking the maze: Exploring nested ecosystem of mobility as a service through systematic literature review. J. Adv. Transp. 2024, 2024, 4166852. [Google Scholar] [CrossRef]
Chen, X.; Deng, H.; Guan, S.; Han, F.; Zhu, Z. Cooperation-oriented multi-modal shared mobility for sustainable transport: Developments and challenges. Sustainability 2024, 16, 11207. [Google Scholar] [CrossRef]
Federal Highway Administration. 2017 NHTS Data User Guide; U.S. Department of Transportation: Washington, DC, USA, 2018. [Google Scholar]
Federal Highway Administration. 2022 NextGen National Household Travel Survey Core Data; U.S. Department of Transportation: Washington, DC, USA, 2022. [Google Scholar]
Kutner, M.H.; Nachtsheim, C.J.; Neter, J.; Li, W. Applied Linear Statistical Models, 5th ed.; McGraw-Hill Irwin: New York, NY, USA, 2005. [Google Scholar]
Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis, 6th ed.; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar]
Draper, N.R.; Smith, H. Applied Regression Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 1998. [Google Scholar]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Cutler, D.; Edwards, T.; Beard, K.; Cutler, A.; Hess, K.; Gibson, J.; Lawler, J. Random forests for classification in ecology. Ecology 2007, 88, 2783–2792. [Google Scholar] [CrossRef]
Genuer, R.; Poggi, J.-M.; Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. 2010, 31, 2225–2236. [Google Scholar] [CrossRef]
Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363. [Google Scholar] [CrossRef]
Jabeur, S.; Gharib, C.; Mefteh-Wali, S.; Ben Arfi, W. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technol. Forecast. Soc. Change 2021, 166, 120658. [Google Scholar] [CrossRef]
Bakhtiari, A.; Mirzahossein, H.; Kalantari, N.; Jin, X. Inferring socioeconomic characteristics from travel patterns. J. Reg. City Plan. 2023, 34, 122–136. [Google Scholar] [CrossRef]

Figure 1. The study’s methodological framework.

Figure 2. Distribution of model performance metrics across all 51 geographic units (50 states plus the District of Columbia).

Figure 3. State-by-state comparison of R-squared between the CatBoost and linear regression models.

Figure 4. Top twenty influential variables in the national linear regression model.

Figure 5. Geographic distribution of the household size (HHSIZE) coefficient.

Figure 6. Geographic distribution of the presence of young children (YOUNGCHILD_prc) coefficient.

Figure 7. Geographic distribution of the Rural Residence (URBRUR2) coefficient.

Figure 8. Geographic distribution of the Percentage of Drivers (DRVRCNT_prc) coefficient.

Table 1. Description of the Variables Employed in the Study.

Variables	Description
HHFAMINC	Household Income
HHSIZE	Count of Household Members
HH_RACE	Race of Household Respondent
HOMEOWN	Home Ownership
LIF_CYC	Life Cycle Classification for the Household, Derived by Attributes Pertaining to Age, Relationship, and Work Status
HHVEHCNT	Count of Household Vehicles
Age_mean	Average Age of Household Persons
PTUSED	Count of Public Transit Usage
RIDESHARE	Count of Rideshare App Usage
URBRUR	Household in Urban/Rural Area
DRVRCNT_prc	Percentage of Drivers in the Household
HEALTH_Poor_prc	Percentage of Household Members Reporting Poor Health
Gender_male_prc	Percentage of Male Persons in Household
BORNINUS_No_prc	Percentage of Persons Not Born in the U.S.
EDUC_graduated_prc	Percentage of People Who Graduated in the Household
EDUC_some_college_prc	Percentage of Household Members with Some College Degree
EDUC_bachelor_prc	Percentage of Household Members with a Bachelor’s Degree
FLEXTIME_Yes_prc	Percentage of People with Flex Time in Household
GT1JB_Yes_prc	Percentage of People with More than One Job in Household
MEDCOND_Yes_prc	Percentage of People with a Positive Medical Condition in the Household
YOUNGCHILD_prc	Percentage of People with an Age Between 0 and 4 in the Household
OCCAT_Clerical_administration_prc	Percentage of People with Clerical Administration Jobs in the Household
OCCAT_Sales_service_prc	Percentage of People with Sales or Service Jobs in the Household
OCCAT_Manufacturing_construction_farming_prc	Percentage of People with Manufacturing or Construction, or Farming Jobs in the Household
OCCAT_Professional_managerial_technical_prc	Percentage of People with Professional, Managerial, or Technical Jobs in the Household

Table 2. Household income categories.

Category	Household Income
1	Less than $10,000
2	$10,000 to $14,999
3	$15,000 to $24,999
4	$25,000 to $34,999
5	$35,000 to $49,999
6	$50,000 to $74,999
7	$75,000 to $99,999
8	$100,000 to $124,999
9	$125,000 to $149,999
10	$150,000 to $199,999
11	$200,000 or more

Table 3. Race categories of the household respondent.

Category	Race
1	White
2	Black or African American
3	Asian
4	American Indian or Alaska Native
5	Native Hawaiian or Other Pacific Islander

Table 4. Life cycle classification.

Category	Life Cycle Classification
1	one adult, no children
2	2+ adults, no children
3	one adult, youngest child 0–5
4	2+ adults, youngest child 0–5
5	one adult, youngest child 6–15
6	2+ adults, youngest child 6–15
7	one adult, youngest child 16–21
8	2+ adults, youngest child 16–21
9	one adult, retired, no children
10	2+ adults, retired, no children

Table 5. Summary of descriptive statistics of variables (N = 106,287).

Variable	Category	Characteristic ¹
Daily Household Trips (CNTTDHH)		8.0 (5.6) [1.0, 95.0]
Household Size (HHSIZE)		2.2 (1.2) [1.0, 13.0]
Mean Household Age (Age_mean)		52.3 (18.3) [11.0, 92.0]
Proportion of Young Children (<5) (YOUNGCHILD_prc)		0.0 (0.1) [0.0, 0.8]
Proportion of Drivers (DRVRCNT_prc)		0.9 (0.2) [0.0, 1.0]
Proportion of Males (Gender_male_prc)		0.5 (0.3) [0.0, 1.0]
Household Income Category (HHFAMINC)		6.3 (2.5) [1.0, 11.0]
Home Ownership (HOMEOWN)	Owned	84,183 (79%)
Home Ownership (HOMEOWN)	Rented	22,104 (21%)
Number of Household Vehicles (HHVEHCNT)		2.1 (1.1) [1.0, 12.0]
Public Transit Trips (per month) (PTUSED)		1.7 (6.5) [0.0, 132.0]
Rideshare Trips (per month) (RIDESHARE)		0.5 (2.7) [0.0, 211.0]
Area Type (URBRUR)	Urban	82,065 (77%)
Area Type (URBRUR)	Rural	24,222 (23%)
Proportion with Bachelor’s Degree (EDUC_bachelor_prc)		0.2 (0.3) [0.0, 1.0]
Proportion with Poor Health (HEALTH_Poor_prc)		0.0 (0.1) [0.0, 1.0]

¹ Continuous variables are reported as Mean (SD) [Min, Max]; Categorical as n (%).

Table 6. Weighted Overall Performance Metrics of the Evaluated Models.

Model	Weighted Avg. R²	Weighted Avg. MAE	Number of States Won (by R²)
CatBoost	0.323	3.373	14
Linear Regression	0.321	3.394	23
Random Forest	0.315	3.412	14

Note: The performance metrics represent weighted averages based on the number of observations in each state. Winning cases indicate the number of states in which the model achieved the highest R² value on the test set.

Table 7. Summary of coefficient stability across all state-level models.

Variable	States Modeled	States Significant	% Significant	Positive and Sig.	Negative and Sig.	Sign Flip?
HHSIZE	51	44	86.3%	44	0	No
YOUNGCHILD_prc	51	34	66.7%	0	34	No
LIF_CYC6	51	28	54.9%	28	0	No
LIF_CYC4	51	21	41.2%	21	0	No
LIF_CYC8	51	17	33.3%	16	1	Yes
LIF_CYC10	51	16	31.4%	16	0	No
EDUC_graduated_prc	51	15	29.4%	15	0	No
FLEXTIME_Yes_prc	51	15	29.4%	15	0	No
LIF_CYC2	51	15	29.4%	15	0	No
PTUSED	51	13	25.5%	9	4	Yes
HHFAMINC	51	10	19.6%	10	0	No
MEDCOND_Yes_prc	51	10	19.6%	0	10	No
GT1JB_Yes_prc	51	9	17.6%	9	0	No
EDUC_bachelor_prc	51	8	15.7%	8	0	No
OCCAT_Professional_managerial_technical_prc	51	8	15.7%	0	8	No
BORNINUS_No_prc	51	6	11.8%	0	6	No
DRVRCNT_prc	51	6	11.8%	5	1	Yes
RIDESHARE	51	6	11.8%	4	2	Yes
Gender_male_prc	51	5	9.8%	0	5	No
HHVEHCNT	51	5	9.8%	4	1	Yes
HOMEOWN2	51	5	9.8%	4	1	Yes
Age_mean	51	4	7.8%	0	4	No
LIF_CYC9	51	4	7.8%	4	0	No
OCCAT_Manufacturing_construction_farming_prc	51	3	5.9%	1	2	Yes
EDUC_some_college_prc	51	2	3.9%	2	0	No
HH_RACE1	51	2	3.9%	0	2	No
OCCAT_Sales_service_prc	51	2	3.9%	1	1	Yes
OCCAT_Clerical_administration_prc	51	1	2.0%	0	1	No

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

