Article
Peer-Review Record

Machine Learning in Mode Choice Prediction as Part of MPOs’ Regional Travel Demand Models: Is It Time for Change?

Sustainability 2025, 17(8), 3580; https://doi.org/10.3390/su17083580
by Hannaneh Abdollahzadeh Kalantari 1,*, Sadegh Sabouri 2, Simon Brewer 3, Reid Ewing 1 and Guang Tian 4
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 21 February 2025 / Revised: 29 March 2025 / Accepted: 8 April 2025 / Published: 16 April 2025
(This article belongs to the Section Sustainable Transportation)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this manuscript, the aim is to improve the predictive accuracy of travel demand models based on factors (trip characteristics, socioeconomic factors, built environment characteristics, and regional conditions) that influence transportation mode choices. The Random Forest model is applied to predict transportation mode choices, and it is established that an increase in travel duration and distance is associated with a higher number of trips by car, while household vehicle ownership significantly affects choices between car and public transport. I note some considerations to improve the manuscript.
  • The document details several elements that may be useful to those interested in the topic. However, the reading is not easy, so I suggest revising the document to facilitate readability.
  • Section 2.2.2 includes only one reference, along with an extensive table. It is suggested to complement this with other references, and in Table 2, it should be indicated where the information is sourced from.
  • In Section 3.3.2, it is suggested to include for each variable whether there are previous studies that considered these, or what the authors' criteria are for establishing them.
  • In Section 3, related to methodology, it is recommended to include a flowchart that allows the reader to understand, along with the explanation, the methodology employed.
  • In the discussion section, the results are presented but are not compared with the commonly used Nested Logit (NL) and Multinomial Logit (MNL) models by MPOs. I also suggest strengthening the references so that the results can be better contrasted.
Comments on the Quality of English Language

Dear Editor,
Thank you very much for inviting me to review this manuscript. I believe it is necessary to improve the writing, as the text is very dense and makes reading difficult.

Author Response

Reviewer 1

In this manuscript, the aim is to improve the predictive accuracy of travel demand models based on factors (trip characteristics, socioeconomic factors, built environment characteristics, and regional conditions) that influence transportation mode choices. The Random Forest model is applied to predict transportation mode choices, and it is established that an increase in travel duration and distance is associated with a higher number of trips by car, while household vehicle ownership significantly affects choices between car and public transport. I note some considerations to improve the manuscript.

  1. The document details several elements that may be useful to those interested in the topic. However, the reading is not easy, so I suggest revising the document to facilitate readability.

(Thank you very much for inviting me to review this manuscript. I believe it is necessary to improve the writing, as the text is very dense and makes reading difficult.)

Thank you for your feedback. We acknowledge the readability concerns and have made light revisions to the manuscript to improve clarity and flow.

 

  2. Section 2.2.2 includes only one reference, along with an extensive table. It is suggested to complement this with other references, and in Table 2, it should be indicated where the information is sourced from.

Thank you for your suggestion. We have made the following revisions to address your concerns:

  • Clarifying the source of Table 2: We have added an explicit explanation in Section 2.2.2, stating that the information presented in Table 2 is derived from our independent data collection effort (see the highlighted text in the main manuscript lines 192-196). Specifically, we conducted a survey of 25 randomly selected MPOs and obtained details on their mode choice modeling practices through direct outreach to MPO modelers. Since this dataset is based on primary data rather than previously published reports, there are no external sources to cite for Table 2.
  • Adding more references: In addition to clarifying the source of Table 2, we have incorporated additional references (highlighted in lines 190, 201, 215) to strengthen the discussion in Section 2.2.2 by reviewing more related studies on MPOs' mode choice modeling practices.

We appreciate this valuable feedback, as it helped us enhance the clarity and comprehensiveness of this section.

 

  3. In Section 3.3.2, it is suggested to include for each variable whether there are previous studies that considered these, or what the authors' criteria are for establishing them.

Thank you for your valuable feedback. In response to your suggestion, we have added an explicit explanation in Section 3.2.2 to clarify the basis for selecting our explanatory variables (highlighted in lines 331-337). Specifically, we now mention that our variable selection was informed by previous studies on mode choice modeling, as well as theoretical considerations and contextual relevance. We have also referenced the well-established 5Ds framework for built environment variables to further substantiate our selection criteria.

This revision ensures greater transparency in our methodological approach and aligns our study with existing literature. We appreciate your insightful suggestion, which has helped improve the clarity of this section.

 

  4. In Section 3, related to methodology, it is recommended to include a flowchart that allows the reader to understand, along with the explanation, the methodology employed.

Thank you for your suggestion. In response, we have added a flowchart (Figure 1, page 13) to visually summarize our methodology. This flowchart outlines our approach to mode choice modeling, the hierarchical nature of the dataset, the challenges we encountered (such as class imbalance), and the four methods we tested to address these challenges. It also highlights our final selection of the One-vs-Rest RF method, which demonstrated the best performance. We believe this addition enhances the clarity of our methodology section and improves the reader’s comprehension.
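The One-vs-Rest idea mentioned in this response can be illustrated with a minimal sketch. It assumes nothing about the authors' implementation: the binary scorer below (`fit_mean_scorer`) is a hypothetical stand-in for a binary Random Forest, and the toy trips and feature names are invented for illustration. The point is only the decomposition: one binary model per mode, and the predicted mode is the one whose model scores highest.

```python
def fit_mean_scorer(rows, is_positive):
    """Hypothetical stand-in for a binary Random Forest.

    Scores a row by how much closer it lies to the positive-class mean
    than to the mean of all remaining rows (higher = more positive).
    """
    def mean(points):
        return [sum(col) / len(points) for col in zip(*points)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    pos = mean([r for r, p in zip(rows, is_positive) if p])
    neg = mean([r for r, p in zip(rows, is_positive) if not p])
    return lambda row: sq_dist(row, neg) - sq_dist(row, pos)


def fit_one_vs_rest(rows, labels, fit_binary=fit_mean_scorer):
    """Train one binary scorer per class: that class versus all others."""
    return {c: fit_binary(rows, [y == c for y in labels])
            for c in sorted(set(labels))}


def predict_one_vs_rest(scorers, row):
    """Return the class whose binary scorer gives the highest score."""
    return max(scorers, key=lambda c: scorers[c](row))


# Toy trips (distance in miles, household vehicles); values are made up.
rows = [(0.3, 0), (0.5, 1), (2.0, 0), (2.5, 1), (8.0, 2), (12.0, 2)]
labels = ["walk", "walk", "bike", "bike", "auto", "auto"]
scorers = fit_one_vs_rest(rows, labels)

short_trip = predict_one_vs_rest(scorers, (0.4, 0))
long_trip = predict_one_vs_rest(scorers, (10.0, 2))
```

Because each binary problem pits one mode against everything else, every scorer sees its own mode as the minority, which is one reason this decomposition is often paired with class-imbalance handling.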

 

  5. In the discussion section, the results are presented but are not compared with the commonly used Nested Logit (NL) and Multinomial Logit (MNL) models by MPOs. I also suggest strengthening the references so that the results can be better contrasted.

We appreciate this valuable suggestion. We have revised the discussion section to explicitly compare our ML model results with those from traditional NL and MNL models (highlighted in lines 735-752). Our revised discussion highlights key similarities, such as the consistent effects of travel time and vehicle ownership, while also emphasizing the additional insights provided by ML models, such as capturing non-linear effects and complex interactions in built environment variables. Furthermore, we discuss the implications of these findings for MPO modeling practices, reinforcing how ML models can complement traditional econometric approaches.

 

Reviewer 2 Report

Comments and Suggestions for Authors
  1. It is recommended that an addition be made at the end of the introduction to briefly summarise the main contributions and expected results of the study so that the reader can better understand the significance of the study.
  2. The article does not describe how the household travel survey data from 29 United States regions were obtained and processed. It is recommended that a detailed description of the data sources be added.
  3. Table 1 in the literature review section needs to be adjusted. It is suggested that it be sorted or replaced by inserting references if it is written by year.
  4. What do the abbreviations below MPO in Table 2 mean?
  5. When dealing with missing data values, the text only mentions that systematic processing was carried out to adapt to random forest modelling, and does not describe the specific processing method; different processing methods may have an impact on the results, and the relevant details should be added.
  6. The article does not discuss the selection of the RF model over other machine learning models (e.g., support vector machines, neural networks, etc.). It is recommended to add a discussion of the model selection process, explaining why the RF model is best suited for this study.
  7. It is recommended that a description of the variable selection process be added and how the most representative explanatory variables are selected from the large number of potential variables.
  8. Please check that the percentages in lines 392 and 397 of the text are correct.
  9. Pictures need to be revised. For example, figure 2 (b) lacks a vertical coordinate note; the legend obscures the lines; the illustrations in figures 5, 6 and 7 are too brief; and the horizontal and vertical coordinate sizes of the same picture are not consistent.
  10. Please check the formatting of the tables in the text. It is recommended that the tables be standardised as three-line tables, with bold and thin fonts in table 4.
  11. The text only classifies travelling modes into four categories: walking, cycling, public transport and private cars, and does not cover emerging modes such as shared travel and micro-travel, which is somewhat out of touch with the reality and is recommended to be paid attention to in subsequent studies.
  12. Some statements in the text lack clarity and conciseness, which affects the reader's understanding. It is recommended that the conclusion and outlook contents be narrated separately. In the conclusion part, although the significance of this study for traffic and travel planning is mentioned, there is a lack of specific and feasible application guidance. The results of the study should be combined to provide more detailed recommendations for the implementation of MPO in actual transport planning and policy making.

Author Response

Reviewer 2

  1. It is recommended that an addition be made at the end of the introduction to briefly summarise the main contributions and expected results of the study so that the reader can better understand the significance of the study.

We appreciate the reviewer’s suggestion. We have revised the manuscript accordingly by adding a brief summary highlighting the key contributions and anticipated findings (highlighted in lines 103-112).

 

  2. The article does not describe how the household travel survey data from 29 United States regions were obtained and processed. It is recommended that a detailed description of the data sources be added.
  • We have added a Data Source section to explicitly clarify the origins of the household travel survey data (highlighted in lines 242-248). This section now states that the data were obtained from various Metropolitan Planning Organizations (MPOs) and regional transportation agencies across 29 U.S. regions, collected as part of regional travel demand modeling efforts.
  • The data processing for the household travel survey data was already detailed in our manuscript. It includes:
    • The total number of trips and households in the final dataset.
    • The categorization of trips into home-based work (HBW), home-based other (HBO), and non-home-based (NHB) trips.
    • The handling of missing values for Random Forest (RF) modeling.
    • The variation in non-motorized mode share across different regions.
    • The mode classification, which categorizes mode choice into walking, bicycling, public transit, and private vehicle use.
  • The data source and processing for the built environment were already included in the manuscript. This section covers:
    • The 5D framework variables at the Traffic Analysis Zone (TAZ) level.
    • The integration of parcel-level land use data to compute land use mix.
    • The use of GIS layers to extract intersection density, 4-way intersection percentages, and transit stop densities.
    • The inclusion of population and employment data at the block or block group level.
    • The calculation of regional employment accessibility using travel time skims for both auto and transit modes.
  • We have now added a section describing the sources of the regional variables that influence mode choice, such as regional population and population density, gas prices, and weather conditions (temperature and precipitation). It notes that these factors were obtained from sources such as governmental statistical agencies, climate databases, and fuel price reports, and were integrated to capture broader regional influences on travel behavior (highlighted in lines 325-329).

 

  3. Table 1 in the literature review section needs to be adjusted. It is suggested that it be sorted or replaced by inserting references if it is written by year.

Good point! We have modified Table 1 to present the studies in chronological order (sorted by year) to improve clarity and consistency. This adjustment ensures a more structured presentation of the literature and facilitates easier comparison of research trends over time.

 

  4. What do the abbreviations below MPO in Table 2 mean?

The abbreviations listed under "MPO" in Table 2 represent the official acronyms used by the respective Metropolitan Planning Organizations (MPOs). These abbreviations are not custom-defined but are the standard designations used by each MPO. To clarify this, we have added a note below Table 2 (highlighted in line 231).

 

  5. When dealing with missing data values, the text only mentions that systematic processing was carried out to adapt to random forest modelling, and does not describe the specific processing method; different processing methods may have an impact on the results, and the relevant details should be added.

In handling missing values, we employed a simple case deletion approach, in which observations with missing values were removed from the dataset. Since the proportion of missing data was relatively small, this method maintained the integrity of the dataset without significantly affecting the representativeness of the sample. Additionally, given the robustness of Random Forest (RF) in handling varying sample sizes, removing observations with missing values did not negatively impact the performance of the model.
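As a rough illustration of the case deletion described here, the sketch below drops every record that lacks a value for any required field and reports the share removed. The field names and records are hypothetical and are not taken from the manuscript's dataset.

```python
def drop_incomplete(records, required_fields):
    """Keep only records where every required field is present and not None."""
    return [r for r in records
            if all(r.get(f) is not None for f in required_fields)]


# Made-up trip records for illustration only.
trips = [
    {"mode": "walk", "distance_mi": 0.4, "hh_vehicles": 1},
    {"mode": "auto", "distance_mi": None, "hh_vehicles": 2},  # missing distance
    {"mode": "transit", "distance_mi": 5.1},                  # missing vehicle count
    {"mode": "auto", "distance_mi": 7.8, "hh_vehicles": 2},
]
fields = ["mode", "distance_mi", "hh_vehicles"]

complete = drop_incomplete(trips, fields)
share_dropped = 1 - len(complete) / len(trips)
```

Reporting `share_dropped` alongside the results is a simple way to back up a claim that the proportion of missing data was small.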

 

  6. The article does not discuss the selection of the RF model over other machine learning models (e.g., support vector machines, neural networks, etc.). It is recommended to add a discussion of the model selection process, explaining why the RF model is best suited for this study.

We appreciate the reviewer’s suggestion. To address this, we have added a discussion on the model selection process in the manuscript (highlighted in lines 360-364).

RF was chosen for this study due to the following advantages:

  • Handling of High-Dimensional and Mixed Data – RF effectively handles datasets with both categorical and continuous variables, making it well-suited for mode choice modeling, which involves diverse input features (e.g., socioeconomic, built environment, and trip characteristics).
  • Robustness to Overfitting – Unlike complex models such as neural networks, which require extensive hyperparameter tuning to prevent overfitting, RF uses an ensemble of decision trees and bagging techniques, leading to high predictive accuracy while maintaining generalizability.
  • Interpretability – While models like Support Vector Machines (SVMs) and Neural Networks (NNs) can achieve high accuracy, they are often considered “black-box” models. RF provides feature importance rankings, which allow for a better understanding of the key factors influencing mode choice.
  • Computational Efficiency – Compared to deep learning models, RF requires significantly less computational power and training time while still achieving competitive performance, making it ideal for large-scale transportation datasets.
  • Suitability for mode choice modeling – RF has demonstrated strong predictive performance compared to other models, supporting its suitability for this research.
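The bagging idea behind the "ensemble of decision trees" point can be sketched in a few lines. This is a deliberately simplified illustration, not a Random Forest: it uses one-feature threshold rules (stumps) instead of full trees and omits random feature selection. All names and data below are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a threshold rule: predict one label if x <= t, another otherwise."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right of t, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_majority(models, x):
    """Aggregate the ensemble by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


# Made-up trip distances (miles) and modes.
distance_mi = [0.2, 0.4, 0.6, 0.9, 6.0, 8.0, 11.0, 15.0]
mode = ["walk", "walk", "walk", "walk", "auto", "auto", "auto", "auto"]
ensemble = fit_bagged_stumps(distance_mi, mode)
```

Each stump sees a slightly different resample of the data, so individual thresholds vary; the vote averages that variation out, which is the mechanism the response credits for generalizability.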

 

  7. It is recommended that a description of the variable selection process be added and how the most representative explanatory variables are selected from the large number of potential variables.

Thank you for your insightful suggestion. In response, we have expanded our discussion on the variable selection process to clarify how the most representative explanatory variables were chosen (highlighted in lines 331-337). Specifically, we now highlight that our selection was guided by a combination of prior empirical studies, theoretical frameworks, and contextual relevance to travel behavior research. For built environment factors, we rely on the well-established 5Ds framework, while regional and socioeconomic variables were incorporated based on their demonstrated significance in mode choice literature.

 

  8. Please check that the percentages in lines 392 and 397 of the text are correct.

We appreciate your careful review. We have rechecked the percentages in these lines and confirmed that they are correct.

 

  9. Pictures need to be revised. For example, figure 2 (b) lacks a vertical coordinate note; the legend obscures the lines; the illustrations in figures 5, 6 and 7 are too brief; and the horizontal and vertical coordinate sizes of the same picture are not consistent.

Thank you for pointing this out. We have made the following improvements to the figures:

  • Figure 2(b): Added a vertical axis label and adjusted the legend placement to avoid overlapping with the lines.
  • Figures 5, 6, and 7: Expanded the captions to provide clearer explanations.
  • All figures: Standardized the axis label sizes to ensure consistency across visuals.

 

  10. Please check the formatting of the tables in the text. It is recommended that the tables be standardised as three-line tables, with bold and thin fonts in table 4.

We have revised the tables to follow a standardized three-line format. Additionally, in Table 4, we have applied bold and thin fonts appropriately for better readability and consistency.

 

  11. The text only classifies travelling modes into four categories: walking, cycling, public transport and private cars, and does not cover emerging modes such as shared travel and micro-travel, which is somewhat out of touch with the reality and is recommended to be paid attention to in subsequent studies.

We acknowledge the limitation regarding emerging transportation modes, as noted in our conclusion (lines 808-812). Our analysis is constrained by data availability, which did not include options such as micromobility, ride-hailing, and carpooling. However, we recognize the growing role of these modes and will consider them in future research as more comprehensive datasets become available.

 

  12. Some statements in the text lack clarity and conciseness, which affects the reader's understanding. It is recommended that the conclusion and outlook contents be narrated separately. In the conclusion part, although the significance of this study for traffic and travel planning is mentioned, there is a lack of specific and feasible application guidance. The results of the study should be combined to provide more detailed recommendations for the implementation of MPO in actual transport planning and policy making.

We appreciate the reviewer’s suggestion. Our conclusion already provides specific and feasible application guidance for MPOs, including the potential for adopting machine learning techniques for mode choice modeling, the need to investigate barriers to adoption, and the importance of demonstrating practical benefits. To enhance clarity, we have restructured this section by separating the conclusion from the outlook, so that key takeaways are more explicitly distinguished from future research directions.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper reads well. It is well-structured, while the sample is large and representative. The paper could be published, however, some necessary clarifications are needed. Please find my comments below.

-It is suggested that the use of K is avoided in the abstract when referring to thousands.

-A short section about the background theory of Random Forest models is suggested to be added in the paper.

-It is not clear how the ML is applied. Did the authors divide the set into training and testing and under what ratios? Or did the authors use k-fold cross validation? This is not clear and should be explicitly explained in text.

-If the ML model has been validated on the training set itself, then the prediction performance metrics are not valid. The same applies to the goodness of fit metrics of the other applied models. In that context, I also have concerns, as the predictive performance of some models reach 99%. I suggest that the authors carefully check again and provide the necessary clarifications or otherwise re-run the analysis if needed.

Author Response

Reviewer 3

The paper reads well. It is well-structured, while the sample is large and representative. The paper could be published, however, some necessary clarifications are needed. Please find my comments below.

Thank you.

 

  1. It is suggested that the use of K is avoided in the abstract when referring to thousands.

We appreciate this suggestion and have revised the abstract accordingly. The notation ‘K’ has been replaced with ‘thousand(s)’ to ensure clarity.

 

  2. A short section about the background theory of Random Forest models is suggested to be added in the paper.

Thank you for this recommendation. We have added a short background section on Random Forest models in the methodology section (highlighted in lines 348-358 & lines 360-364). This addition provides an overview of how Random Forest works, its advantages in handling complex datasets, and its relevance to travel mode choice modeling.

 

  3. It is not clear how the ML is applied. Did the authors divide the set into training and testing and under what ratios? Or did the authors use k-fold cross validation? This is not clear and should be explicitly explained in text.

We appreciate this comment and have clarified the data partitioning and validation approach in the methodology section (highlighted in lines 413-418). Specifically, we now state that the dataset was split into an 80-20 ratio for training and testing and that we employed 5-fold cross-validation to ensure robust model evaluation.
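One common way to combine an 80-20 split with 5-fold cross-validation is sketched below. It is an assumption, not something the response states, that cross-validation is run inside the 80% training portion while the 20% test set is held out until the end. The function names, the seed, and the sample size of 1000 are all illustrative.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (fit, validate) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


train_idx, test_idx = train_test_split_indices(1000)
folds = list(k_fold_indices(train_idx, k=5))
```

Keeping the test indices out of every fold is what allows the final test-set metrics to be read as out-of-sample performance.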

 

  4. If the ML model has been validated on the training set itself, then the prediction performance metrics are not valid. The same applies to the goodness of fit metrics of the other applied models. In that context, I also have concerns, as the predictive performance of some models reach 99%. I suggest that the authors carefully check again and provide the necessary clarifications or otherwise re-run the analysis if needed.

We appreciate the reviewer’s concern regarding the model validation process and the high predictive performance metrics. We confirm that all performance metrics were computed using a separate test set rather than the training data. Specifically, we employed an 80-20 train-test split and used cross-validation to further assess the generalizability of our model. These details are explicitly outlined in our Model Validation section (highlighted in lines 688-698).

Regarding the high predictive performance (e.g., 99% accuracy for HBO trips), we have carefully reviewed our results and confirm that this is primarily due to the data distribution. In particular, some trip purposes exhibit strong modal dominance, meaning that a single mode is chosen significantly more frequently than others, which can contribute to high overall accuracy. To mitigate potential overestimation, we also report balanced accuracy, which accounts for class imbalances. The balanced accuracy values (ranging from 94% to 97%) provide a more realistic measure of the model's performance across different travel modes.

To further clarify this in the manuscript, we have explicitly emphasized the importance of balanced accuracy and AUC-ROC in evaluating model performance. Additionally, we examined class-wise performance metrics (precision, recall, and F1-score) to ensure that the model’s predictive power extends beyond majority-class predictions. These details have been clarified in Section 4.1 Performance Measures and Section 5 Model Validation to ensure transparency.
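The difference between overall accuracy and balanced accuracy under modal dominance can be shown with a small worked example. Balanced accuracy is taken here in its usual sense, the unweighted mean of per-class recall. The labels below are invented to make the effect obvious and do not reproduce any figure from the manuscript.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of all observations predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each true class, the share of its observations predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Made-up sample: 95 auto trips and 5 walk trips.
y_true = ["auto"] * 95 + ["walk"] * 5
# A degenerate model that always predicts the dominant mode.
y_pred = ["auto"] * 100

acc = accuracy(y_true, y_pred)            # 0.95
bal = balanced_accuracy(y_true, y_pred)   # 0.5
recalls = recall_per_class(y_true, y_pred)
```

A model that ignores the minority mode entirely still scores 95% accuracy here but only 50% balanced accuracy, which is why a balanced accuracy in the mid-90s is stronger evidence than a 99% overall accuracy on its own.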

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

 Accept in present form

Reviewer 3 Report

Comments and Suggestions for Authors

Authors have addressed my concerns. I do not have any further comments.
