Article

Are Neural Networks Better than Machine Learning? A Comparative Study for Travel Mode Predictions

1
Jiangsu Key Laboratory of Urban ITS, Southeast University of China, Nanjing 210096, China
2
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Nanjing 210096, China
3
Department of Civil and Environmental Engineering, National University of Singapore, 1 Engineering Drive 2, Singapore 117576, Singapore
*
Author to whom correspondence should be addressed.
Systems 2025, 13(12), 1099; https://doi.org/10.3390/systems13121099
Submission received: 29 October 2025 / Revised: 24 November 2025 / Accepted: 2 December 2025 / Published: 4 December 2025

Abstract

Accurately predicting how people choose their travel modes is important in the transportation field. Machine learning (ML) and neural networks (NNs) have gradually become popular in recent years. However, which is better is seldom discussed in previous studies. Therefore, we collect several real-world travel datasets from different countries, and pick five typical ML models, six classic NN models, and ten new NN models for comparison. Some methods for improvement are also considered, including SMOTE, Near-Miss, and using focal loss. The results show that, when looking at the F1-score, the NN models do not perform as well as ML models. While the performances of different classic NN models are similar, making the neural network more complex does not improve the prediction results. Some new NN models can reach the level of ML models on small datasets, but they still perform poorly on large datasets. Given this result, we further discuss two important topics: why NN models do not perform as well here as they do in some other fields, and why this phenomenon is not revealed in many previous papers. In summary, we think this study gives a good reference for future research on predicting travel modes and choosing the right models.

1. Introduction

With rapid urbanization and population growth, urban travel behavior is becoming more complex. Research on travel mode prediction is of great importance: it can help urban planners make better-informed decisions. By grasping travel choice trends, they can allocate resources like roads and public transport facilities more rationally, easing congestion and boosting efficiency. From an environmental perspective, accurate predictions can promote green travel, cutting the transportation sector’s carbon footprint.
For many years, the discrete choice model (DCM) [1] has been the mainstream tool for the analysis of travel mode choices. Many logit models based on maximum likelihood estimation, e.g., multinomial logit (MNL) [1] and mixed multinomial logit (MMNL) [2] models, have been widely studied. But in recent years, with the rapid development of AI techniques, two new families of methods have gradually become popular in this field: machine learning (ML) models and neural network (NN) models [3,4]. Many researchers have found that they are more powerful than DCMs for predicting traffic [5,6,7].
Since both ML models and NN models are popular, a question naturally arises: which one is better? However, this simple question is seldom discussed in the previous papers. Therefore, we want to conduct a systematic comparison between them for travel mode predictions. Firstly, we collect multiple public travel datasets from different countries, including the Netherlands, the UK, and the US. Different sample sizes are considered, ranging from as few as several thousand records to nearly one million records. Next, five typical ML models, six classical NN models, and ten new NN models are chosen for comparison. When focusing on the F1 scores, it is a little surprising that for all the datasets, the performance of NN models is worse than that of ML models. When some typical methods for improvement are used, including SMOTE, Near-Miss, and using focal loss, the situation does not change. Therefore, it is necessary for us to go back to check the details of many previous papers, and try to find the reasons and potential mechanisms. We think such a discussion could be helpful for the future study of this field, including engineering applications.
The rest of this paper is organized as follows. The literature review for this topic is given in Section 2. Section 3 details the various datasets employed in the study. Section 4 presents the main methodology, including the models used, the possible methods for improvement, and the optimizations of model parameters. Section 5 provides the prediction results of various models, and shows a systematic comparison. Section 6 discusses two important questions: Why are these NN models worse? Why is this phenomenon not revealed in many previous studies? Finally, the conclusion is given in Section 7.

2. Literature Review

In this section, we briefly recall the studies about travel mode predictions in recent years, including the approaches with ML models and NN models. Note that we do not use the concept “deep learning” or “deep neural network (DNN)” in this paper, since sometimes it is difficult to say whether a neural network is deep or not. In addition, in some contexts, neural networks are included in broad-sense machine learning. But this paper adopts a narrow definition of machine learning that excludes them.
Firstly, since ML models can automatically find the complex relationships between variables by learning from data instead of making strong assumptions in advance, they have better predictive power than DCMs. For example, Naseri et al. [3] recently emphasized that ensemble learning techniques provide higher accuracy and better interpretability compared to conventional baselines. Li and Kockelman [4] found that the best models for predicting continuous or categorical variables were different, but the average performance of ML models was always better than DCMs. Kashifi et al. [8] claimed that among five ML models, LightGBDT outperformed other models for both under- and over-sampling strategies. Abulibdeh [9] found that XGBoost outperformed other logit models in prediction accuracy, and various trip characteristics significantly influenced mode choices. In the results of Narayanan et al. [10], RF classifier had a slightly higher accuracy than logit models. Similarly, a recent study by Kalantari [11] demonstrated that random forest models consistently outperformed nested logit models across multiple U.S. regions, highlighting the superior predictive power of ensemble learning methods. Le and Teng [5] compared several logit models and two ML models, and claimed all of them can help in predicting the effects of traffic demand management schemes. In addition, by reviewing many articles, Benjdiya et al. [6] found that ML models were more suitable for disaggregate-level analyses, while DCMs were more effective in aggregate-level analyses.
Secondly, inspired by the functions of biological nervous systems, NN models perform very well in many fields, including many transportation-related tasks with complex information. Recently, significant progress has been made in their use for travel mode predictions. For example, Zhang et al. [7] and Nam and Cho [12] proposed DNN frameworks for traffic mode choice, which were better than traditional DCMs and simple NN models in prediction accuracy. Wang et al. [13] found that DNN outperformed various classical DCMs in both prediction and interpretation, including in settings with large sample sizes and high input dimensions. Salas et al. [14] compared five ML/NN classifiers and two DCMs (MNL and MMNL), and found the MMNL model reduced the accuracy gap with ML methods when taste heterogeneity was present. Püschel et al. [15] found that NNs outperformed simultaneous DCMs and the sequential logit model in terms of accuracy and sociodemographic consistency within mobility tool bundles, while the shallow neural network (SNN) was more robust than the deep neural network (DNN). Xia et al. [16] developed an RE-BNN framework by combining the random effect model and Bayesian neural networks. This model outperformed the plain DNN in prediction accuracy. In addition, Wen and Chen [17] proposed a CNN architecture optimized by orthogonal experimental design to predict travel mode choice, and the optimized CNN achieved a very high accuracy.
In addition, we find there are some studies considering the comparisons between various models for travel mode predictions. However, many of them focused on DCMs, and used DCMs as typical benchmarks. In the several papers that do compare ML models with NN models, the findings are quite different: some claim that NN models are better [14,17], others report that ML models could be better [7], while in some comparisons the metrics of the two types of models are very close [18,19]. Since the focus of some papers tends to be proposing a new model or exploring phenomena in a new dataset, rather than comparing model performance, this topic is usually not explored in depth.

3. Data

In this paper, we concentrate on real datasets rather than synthetic ones. To verify the validity of NN models, we excluded datasets with small sample sizes. We identified three publicly available travel mode choice datasets: MPN from the Netherlands, LPMC from the UK, and NHTS from the US. All datasets used in this study are revealed preference (RP) data, which remain a vital tool in recent influential studies for capturing travel behavior shifts [20]. We name them D1, D2, and D3, and show the details in Table 1. Their sample sizes vary widely, ranging from several thousand to several hundred thousand. Additionally, to obtain datasets of varying sizes, we created subsets (D2A, D3A, and D3B) by selecting the initial records from the original data.
In all the datasets, there are many available variables, and the dependent variable in this study is always the travel mode. For D2/D2A, there are only four types of travel modes, while for D1 the number of types is eight. But for D3A/D3B/D3, they initially contained 24 different travel modes. After filtering out irrelevant data, the number of travel modes is reduced to 20. These 20 modes are further divided into four main categories: walking, cycling, car, and public transport, which is the same as in D2/D2A. Such a classification also yields two conditions for evaluations, i.e., D3A/D3B/D3-4 (four-class classification) and D3A/D3B/D3-20 (twenty-class classification).
The statistics of the travel modes in these datasets are shown in Figure 1. We can see the following:
(1) There are some similarities in these datasets. For instance, the car is consistently the primary choice. Furthermore, the data imbalance in all datasets increases prediction difficulty.
(2) For all the subsets, the proportions of all the travel modes are nearly the same as the original ones. Therefore, the performance differences between the subsets primarily result from the sample size.
(3) Some differences in proportions mainly stem from the differences in typical national conditions. For example, the proportion of cycling in D1 (the Netherlands) is higher than that of walking. However, in D2 (the UK) and D3 (the US), the situation is just the opposite. In addition, the proportion of public transport in D3 (the US) is much lower than that in D1 (the Netherlands) and D2 (the UK).
(4) The imbalance is more prominent in D3. In Figure 1d, we only show the seven travel modes with a proportion greater than 1%. In other words, for 13 types in D3A/D3B/D3-20, their proportions are close to zero, and their influence on the final metrics of predictions will be small.
(5) Note that the dependence of travel mode choices upon various variables is a complex topic, which cannot be quickly clarified merely through descriptive statistical charts. For this job, it is better to consider some other approaches, e.g., discrete choice models, which are out of the scope of this paper.
Next, we only consider five main independent variables for each dataset, as shown in Table 2. For D1, the chosen variables are the most important ones mentioned in the brief introduction of MPN. These abbreviations are Dutch words. For both D2 and D3, the chosen variables are age, sex, travel distance, number of family vehicles, and purpose of travel, since they are commonly regarded as important. The full names of the variables are not necessarily the same in D2 and D3 (e.g., DISTANCE vs. TRPMILES), but their meanings are. In addition, we can see that in Table 2, the proportional distribution of categorical variables and the statistical results of continuous variables are basically consistent between the complete dataset and the selected subset.
Note that these independent variables may not be the best predictors for D2 or D3. However, the primary goal of this study is to compare model performance. Therefore, as long as we use the same independent variables for all the models on each dataset, such a comparison could be fair. In addition, it is possible to consider more variables for D2 or D3. However, we find that after adding several variables, the metrics of all models increase simultaneously, but the final comparative conclusions (e.g., the ranking order of model performance) do not change in any way. Therefore, to ensure consistency across all datasets, we chose to use five variables for the subsequent research.

4. Methodology

4.1. Models Used

In this section, the models used for comparisons are introduced. Firstly, five typical ML models are considered, including LR (logistic regression), KNN (k-nearest neighbors), DT (decision tree), RF (random forest), and XGB (eXtreme gradient boosting). They have been widely used in many studies introduced in Section 2.
Note that there are some other possible models, but they are not suitable for this study. For example, as discussed in some previous study about machine learning [21], NB (naive Bayes) is a simple model, and its performance is usually not good enough. SVM (support vector machine) is not recommended in this study, since it is very slow, especially when the dataset is large. It takes many hours to obtain the results when running on D3, and it cannot be accelerated by GPU. LGB (light gradient boosting machine) is also popular in recent years. However, its mechanism and performance are usually similar to that of XGB. Therefore, models like NB, SVM, and LGB are not considered in this paper.
Subsequently, we present an overview of the six classic NN models utilized in this study:
(1) Multi-Layer Perceptron (MLP for short): It is also known as artificial neural network (ANN), which consists of one or more layers. In this study, two MLP models are considered, namely MLP-1 with one hidden layer, and MLP-2 with two hidden layers (see Figure 2).
(2) One-dimensional Convolutional Neural Network (CNN-1D for short): It is a variant of a convolutional neural network specifically designed to process sequential data, such as time series, text, etc. While travel mode choice data is inherently tabular, its application in this context is justified by transforming the set of input features into a structured, meaningful one-dimensional sequence. The key idea is that the ordering of features is based on their intrinsic logical relationships. For example, we can group features into categories, creating a sequence that starts with personal socio-economic attributes (e.g., age, income, occupation), followed by trip-specific characteristics (e.g., distance, purpose, time of day), and finally environmental factors (e.g., weather, POI density). When a one-dimensional convolutional kernel with a size of three slides along this sequence, it can simultaneously observe three consecutive features and learn their localized interaction patterns. A kernel might, for example, become specialized in recognizing a powerful predictive combination like “high-income, weekday, and morning peak-hour”.
The model structure used in this paper is shown in Figure 3, which is inspired by some classical models, including LeNet and VGG-16. Here, the input tabular data first undergoes feature extraction through one or two one-dimensional convolutional layers, then the data is processed by a flattening layer. Next, it enters a fully connected layer for further feature transformation and processing. The dropout layer is used to prevent overfitting, and finally, the classification result is output through the softmax layer.
(3) Two-dimensional Convolutional Neural Network (CNN-2D for short): It is also a typical form of CNN, and the model structure used in this paper is shown in Figure 4. The application of CNN-2D to tabular data is a more conceptual adaptation, requiring the feature vector to be reshaped into a two-dimensional “feature image” or matrix. For instance, a vector of 16 features can be arranged into a four × four matrix. We arrange the features within this grid so that related attributes are positioned close to each other. For example, one row could represent personal attributes, a second row could represent trip details, and a third could represent land-use characteristics. When a two × two convolutional kernel slides over this “feature image”, it can learn high-order patterns that span multiple feature categories simultaneously. For example, one application of the kernel might cover the features [income, car ownership] from the first row and [trip cost, trip purpose] from the second row.
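To make the two reshaping ideas above concrete, the following minimal sketch lists which features a size-three 1D kernel and a two × two 2D kernel would see at each position. The 16 feature names, their ordering, and the choice of stride one without padding are illustrative assumptions; they are not the configuration used in this study.

```python
# Hypothetical feature names; the ordering is an assumption chosen so that
# related attributes sit next to each other.
FEATURES = [
    "age", "sex", "income", "car_ownership",                        # personal
    "distance", "time_of_day", "trip_cost", "trip_purpose",         # trip
    "poi_density", "land_use_mix", "pop_density", "transit_stops",  # land use
    "weather", "temperature", "weekday", "holiday",                 # context
]


def windows_1d(sequence, size=3):
    """Feature groups seen by a 1D kernel (stride 1, no padding)."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def to_grid(sequence, n_cols=4):
    """Reshape a flat feature vector into rows of n_cols entries."""
    return [sequence[i:i + n_cols] for i in range(0, len(sequence), n_cols)]


def windows_2d(grid, size=2):
    """Feature groups seen by a size x size 2D kernel (stride 1, no padding)."""
    patches = []
    for r in range(len(grid) - size + 1):
        for c in range(len(grid[0]) - size + 1):
            patches.append([grid[r + dr][c + dc]
                            for dr in range(size) for dc in range(size)])
    return patches


seq_windows = windows_1d(FEATURES)
grid = to_grid(FEATURES)
grid_windows = windows_2d(grid)

print(seq_windows[0])   # first three consecutive features
print(grid_windows[2])  # a patch spanning the first and second rows
```

With this assumed ordering, the third 2D patch covers income and car ownership from the first row together with trip cost and trip purpose from the second row, which mirrors the example in the text. A different feature ordering would change which interactions each kernel position can observe.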
Next, for further comparison, we introduce ten new NN models. All of them were recently proposed, and the authors claimed that they were designed for tabular data. They are ResNet [22], SNN [23], AutoInt [24], GrowNet [25], NODE [26], DCN-V2 [27], FT-Transformer [28], TabNet [29], DeepNet [30], and GBDT + DNN [31]. Table 3 summarizes the basic features of the ten models and their specific designs for tabular data.
Finally, note that the essence of our study is the classification problem, and the dependent variables in our study are discrete. Therefore, some other typical transport-related models (e.g., STGNNs) are not suitable for this study, since their dependent variables are continuous.

4.2. Possible Methods for Improvement

In order to improve the performance of various models, we also consider three possible methods, including SMOTE, Near-Miss, and using focal loss:
SMOTE (synthetic minority over-sampling technique) is an over-sampling method that generates synthetic samples for the minority class [32]. The algorithm flow is as follows:
(1) For each sample x in the minority class, the Euclidean distance is used as the standard to calculate its distance to all the other minority class samples, and its k nearest neighbors are obtained.
(2) A sampling ratio is set according to the sample imbalance ratio to determine the sampling rate N. For each minority sample x, a number of samples are randomly selected from its k nearest neighbors; each selected neighbor is denoted as $\tilde{x}$.
(3) For each selected neighbor $\tilde{x}$, a new sample is constructed with the original sample x according to the following formula:
$$x_{new} = x + \mathrm{rand}(0,1) \times (\tilde{x} - x)$$
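The interpolation step can be sketched in a few lines of plain Python. This is a minimal illustration of the formula only, under assumed toy data; the function names, the number of neighbors, and the number of synthetic samples per point are hypothetical and are not taken from the experiments in this paper.

```python
import math
import random


def k_nearest(x, points, k):
    """Return the k points closest to x (Euclidean distance), excluding x itself."""
    others = [p for p in points if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def smote(minority, n_new_per_sample, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        neighbours = k_nearest(x, minority, k)
        for _ in range(n_new_per_sample):
            x_tilde = rng.choice(neighbours)   # randomly selected neighbour
            gap = rng.random()                 # rand(0, 1)
            # x_new = x + rand(0, 1) * (x_tilde - x), applied per coordinate
            synthetic.append([xi + gap * (ti - xi) for xi, ti in zip(x, x_tilde)])
    return synthetic


# Made-up minority samples with two continuous features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new_per_sample=2, k=2)
print(len(new_points))
```

Because every synthetic point lies on the segment between a sample and one of its neighbors, the new samples stay inside the region already occupied by the minority class. Note that this sketch assumes continuous features; categorical variables would need a different treatment.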
Near-Miss is an under-sampling technique [33]. It aims to balance the class distribution by eliminating majority class examples. The basic intuitions are as follows:
(1) Find the distances between all the majority class instances and the minority ones. Here, the majority class will be under-sampled.
(2) Then, select the n majority class instances that have the smallest distances to each minority one.
(3) If there are k instances of the minority class, the nearest-neighbor method will result in k × n instances of the majority class.
(4) To find the n closest instances in the majority class, there are several variations of the algorithm, including Version 1, Version 2, and Version 3. We selected Version 2 for its superior performance. This version selects the majority samples whose average distance to the k farthest minority samples is the smallest.
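As an illustration, the sketch below follows the commonly documented definition of Near-Miss Version 2 (for example, in the imbalanced-learn library): keep the majority samples whose average distance to their k farthest minority samples is the smallest. The function name, the toy points, and the parameter values are assumptions made for this example only.

```python
import math


def near_miss_2(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples with the smallest average distance
    to their k farthest minority samples."""
    def score(sample):
        distances = sorted(math.dist(sample, m) for m in minority)
        farthest = distances[-k:]              # the k largest distances
        return sum(farthest) / len(farthest)

    return sorted(majority, key=score)[:n_keep]


# Made-up points with two continuous features.
majority = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [9.0, 9.0]]
minority = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
kept = near_miss_2(majority, minority, n_keep=2, k=2)
print(kept)
```

In this toy example, the two majority points that sit inside the minority cluster are kept, while the two distant ones are discarded, so the retained majority samples are those most likely to matter near the class boundary.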
Focal loss is an innovative loss function for handling imbalanced samples, which was originally proposed by Lin et al. [34] for object detection. It can also be used for predictions of travel mode choices. The traditional cross-entropy loss function pays excessive attention to many easy negative samples, which may limit the model performance.
On the contrary, the formula for focal loss is
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where $p_t$ is the predicted probability of the model for class t, $\alpha_t$ is the weight parameter for balancing positive and negative samples, and $\gamma$ is the focusing parameter. When a sample is correctly classified and $p_t$ approaches one, $(1 - p_t)^{\gamma}$ will approach zero. This reduces the contribution of the loss from easy samples. For the difficult samples when $p_t$ is small, $(1 - p_t)^{\gamma}$ is relatively larger, and the loss function will pay more attention to these samples.
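A short numerical sketch may help to show how strongly the modulating factor down-weights easy samples. The values of the weight parameter (0.25) and the focusing parameter (2.0) below are the defaults suggested in the original focal loss paper and are assumptions here; the values used in this study are not stated in this section.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare an easy, a medium, and a hard sample.
for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}, FL={fl:.4f}, FL/CE={fl / ce:.4f}")
```

The ratio FL/CE equals the weight parameter multiplied by the modulating factor, so it shrinks quickly as the predicted probability of the true class approaches one, while hard samples keep most of their weight.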

4.3. Optimization of the Parameters

In this paper, we consider four widely used indicators: Accuracy, Precision, Recall, and F1 score. The formulas are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\_score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP, TN, FP, and FN refer to true positive, true negative, false positive, and false negative, respectively. The F1 score is a comprehensive metric that contains information about both precision and recall, which is particularly effective for evaluating performance on datasets with imbalanced classes. Thus, in this paper, we choose it as the main indicator for calibrations and comparisons.
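The difference between accuracy and the F1 score on imbalanced data can be illustrated with a small sketch. For a multi-class problem the per-class scores must be averaged; macro-averaging (an unweighted mean over classes) is assumed below, since the averaging scheme is not specified in this section. The labels and predictions are made up.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 with `label` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(class_scores(y_true, y_pred, c)[2] for c in labels) / len(labels)


# A classifier that always predicts the majority mode.
y_true = ["car"] * 8 + ["walk"] * 2
y_pred = ["car"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(round(acc, 3), round(f1, 3))
```

Here the accuracy is 0.8 although the minority mode is never predicted, whereas the macro F1 score is only about 0.44. This is the kind of gap that motivates using the F1 score as the main indicator.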
Next, we present some examples about how we optimize the model parameters. For the five typical ML models, the typical hyper-parameters considered in this paper are listed in Table 4. Except for some variables that are widely understood in the machine learning community (e.g., n_estimators, max_depth, or learning rate), the meanings of other specific parameters are summarized as follows: For the XGB model, parameters such as gamma, reg_alpha, and reg_lambda are pivotal for regularization and controlling the minimum loss reduction, while subsample and colsample_bytree specify the sampling ratios of training instances and features to prevent overfitting. Regarding the tree-based models (DT and RF), the criterion determines the metric for measuring split quality (i.e., Gini impurity or entropy); specifically for RF, bootstrap indicates whether bootstrap samples are employed during tree construction. In the context of KNN, weights decide if the prediction utilizes uniform voting or distance-based weighting, whereas p sets the power parameter for the Minkowski metric. Finally, as for logistic regression (LR), C refers to the inverse of regularization strength, where smaller values imply stronger regularization.
Similarly, the crucial parameters for the six classic NN models are shown in Table 5. We do not consider complex networks with even more layers, since we find that increasing the number of layers does not improve model performance, which will be discussed in Section 5.
The process of finding the optimal hyper-parameters was conducted through an exhaustive and methodical search. We employed a grid search strategy combined with five-fold cross-validation on the training set. For each model, a grid of possible hyper-parameter combinations was defined based on the search ranges in Table 4 and Table 5. To ensure a rigorous and fair comparison between models, we established a systematic protocol for model training, validation, and testing. For each dataset, we first partitioned the data into three independent sets: a training set (70%), a validation set (15%), and a testing set (15%). During training, we set the maximum number of epochs to 50, which is enough for the losses of all the NN models to be nearly unchanged. Additionally, we employed an early stopping mechanism to prevent overfitting. A dual-layered regularization strategy is also considered. Finally, after evaluating all combinations in the grid, we selected the set of hyper-parameters that yielded the highest average cross-validated F1 score as the optimal configuration.
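The selection rule described above (evaluate every grid point with five-fold cross-validation and keep the one with the highest mean F1 score) can be sketched as follows. To stay self-contained, the example uses a toy one-feature KNN classifier, synthetic data, a hypothetical grid, and macro-averaged F1; all of these are assumptions for illustration and do not reproduce the search ranges in Table 4 and Table 5.

```python
import itertools
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors closest training rows (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:n_neighbors]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, using F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


def cross_val_f1(data, folds, n_neighbors):
    """Average macro F1 over the folds, each used once as the held-out part."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [row for i, row in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        y_pred = [knn_predict(train, x, n_neighbors) for x, _ in test]
        scores.append(macro_f1([y for _, y in test], y_pred))
    return sum(scores) / len(scores)


def grid_search(data, grid, k=5):
    """Evaluate every combination in the grid and keep the best mean score."""
    folds = k_fold_indices(len(data), k)
    best_params, best_score = None, -1.0
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid, combo))
        score = cross_val_f1(data, folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Made-up training data: one feature, two overlapping classes.
rng = random.Random(1)
train_set = [(rng.gauss(2.0 * label, 1.0), label)
             for label in (0, 1) for _ in range(50)]
grid = {"n_neighbors": [1, 3, 5, 9]}
best_params, best_score = grid_search(train_set, grid)
print(best_params, round(best_score, 3))
```

The same folds are reused for every grid point so that the candidates are compared on identical splits. In practice, a library routine such as scikit-learn's `GridSearchCV` with an F1-based scorer would typically replace this hand-written loop.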
Here, we show some comparisons between the default values and optimized results in Table 6. This table serves as a typical example, using the D3B-4 dataset, to illustrate the impacts of the hyper-parameter optimization protocol. Generally speaking, the originally effective models (like RF) have little improvement after parameter optimizing. On the contrary, the originally ineffective ones (like KNN or LR) gain more significant improvement, and the differences between default value and optimized value are clearly seen.
Note that there are more parameters in the ten new NN models. The adjustment for them could be very cumbersome and difficult for the users in the field of traffic engineering, which is beyond the scope of this paper. Therefore, for these models, we choose to use the default parameters defined by the authors.

5. Model Results

Firstly, the metrics of all the models are shown in Table 7, Table 8 and Table 9. We find the following:
(1) Generally speaking, when observing the F1 scores, the averaged performance of NN models is worse than that of ML models, as calculated in Table 10. This result is contrary to our expectations and will be discussed later. For the largest dataset (D3-20), the maximum F1 score of NN models is even smaller than the minimum value of ML models: 0.147 (GrowNet) < 0.192 (LR).
(2) Among the five ML models, the performance of RF is the best for large datasets (D3, D3A, and D3B). The main reason is that RF is based on ensemble learning and incorporates random sampling and feature selection mechanisms. This increases the model diversity and makes it more robust against overfitting. However, for smaller datasets, the best choice is not so clear. In addition, LR is always the worst one among the ML models. Such a situation is easy to understand: the linear assumption of LR may lead to limitations when dealing with non-linear data.
(3) The performances of the six classic NN models are similar. For all the datasets, the differences between CNN-1D, CNN-2D, and MLP are not significant. This may be because there are no obvious local correlation patterns among the variables involved, or these local correlations are not key factors in this task. In addition, adding layers does not improve the results for either CNN or MLP. For some situations (e.g., CNN-2D in D3A-20, MLP in D3B-4), the model with one layer could be better than that with two layers. In other words, increasing model complexity is not helpful for such a prediction problem.
(4) The results of the ten new NN models are also unsatisfactory. It seems that they have better performance on smaller datasets. For example, the best one among them in D1 (ResNet) is at least better than some ML models (LR and DT). However, with the increase in sample number, the advantages of new NN models gradually diminish. As shown in Table 10, in D2A/D2, their performances are similar to classic NN models, while in D3A/D3B/D3, they become even worse.
(5) Differences between the datasets are observed. As introduced in Table 1 and Table 2, the sample sizes of D2 and D3A-4 are nearly the same, and their dependent variables (travel modes) and the five independent variables (age, sex, travel distance, number of family vehicles, and purpose of travel) are also the same. However, the metrics of the same models are different in Table 7, Table 8 and Table 9. For example, most accuracies in D3A-4 are higher than those in D2, while most F1 scores in D3A-4 are lower. In other words, the internal characteristics of the datasets selected in this paper actually vary significantly. This also helps us to fully examine and compare the characteristics of various models from different perspectives.
(6) The sizes of datasets are also important. Generally speaking, smaller datasets are easier to predict. In addition, the simple problem with four types is easier than the difficult one with twenty types for all the datasets. This may be because the class imbalance problem is exacerbated in larger datasets, especially in D3.
Note that the results shown in Table 7, Table 8 and Table 9 are the averaged values over ten runs. We also show the standard deviations of these models in Table 11. Since a similar phenomenon can be found in other datasets, for simplicity, we only show the results of D3B as an example. Here the results of LR, KNN, and XGB are not presented, since their standard deviations are always zero in this situation. We can see that the randomness of the results of all ML models is very low, since their model structures are simple: they have a limited number of parameters and a highly deterministic training process. On the contrary, the fluctuations in new NN models are more evident, due to their complexity when training. Nevertheless, even when we consider the best of NN models, they are not better than the averaged results of ML models.
If we pay attention to some other metrics rather than F1 scores, we can see something new. For example, the precisions of NN models are usually higher than that of ML models, while the recalls of NN models are lower. Therefore, we want to improve the recalls by different means.
As discussed in Section 4.2, there are three possible methods, including SMOTE, Near-Miss, and using focal loss. However, they are applicable not only to NN models, but also to ML models. Therefore, we choose to further compare the results of ML models and NN models based on these methods. For simplicity, we only present the typical cases with relatively obvious improvement effects in Table 12, Table 13 and Table 14. We can see the following:
(1) The improvement from SMOTE is significant for many models. For some models, e.g., XGB and RF, the new F1 score is even more than two times the original value. At the same time, the improvement in F1 score from Near-Miss is also considerable. Such phenomena coincide with what we can observe in some previous studies.
(2) From Table 12 and Table 13, we can see that all the recalls have been increased, and the values of recalls and precisions have become nearly the same for the same model. Although the mechanisms of SMOTE and Near-Miss are not the same (e.g., the differences between under-sampling and over-sampling), similar results could be obtained, since both methods change the proportional relationship between FP and FN.
(3) On the contrary, the effect of using focal loss is not ideal in Table 14. The marginal improvement can be nearly ignored. This may be because the imbalance in the weights of positive and negative samples is more severe, making it very difficult to find appropriate parameters.
(4) For all the models, the improvement from SMOTE, Near-Miss, or using focal loss does not change the relative ranking between the results of ML models and NN models. In other words, regardless of using the conventional method or these improved methods, ML models are always better than NN models, including the new ones proposed in recent years. This conclusion seems a bit counterintuitive, but it is what the tests show.
Finally, we also show the running time for each model in Table 15. For simplicity, only the results of the smallest and largest datasets are presented. Here, h/m/s means hour/minute/second. The device we used has an Intel Core i7-12700H CPU, 32 GB of memory, and an NVIDIA GeForce RTX 3060 graphics card. It is easy to understand that ML models are always faster than NN models. Nevertheless, the significant differences between their running times further emphasize the deficiencies of NN models, especially when “the ancient model” (e.g., DT) only needs 4 s and “the advanced model” (e.g., a transformer-based model) needs more than 9 h for a large dataset.

6. Discussions

After comparing different models on different datasets, we find two important topics that need to be discussed:
(1) Why are neural networks not good for travel mode predictions?
In recent years, the effectiveness of neural networks has been demonstrated in many fields, especially in computer vision (CV) and natural language processing (NLP). However, the results seem unsatisfactory in some other fields, e.g., the prediction of travel mode choices. It is easy to understand that the performance of a neural network depends on the amount of data. However, in this paper, we find that the results of both classic and new NN models become even worse when the datasets become larger. For such an unexpected situation, we think the possible explanations are as follows:
The type of input variables. For travel mode predictions, many input variables are discrete. However, most NN models rely on assumptions of continuity, e.g., through gradient-based optimization methods. When faced with discrete variables, these methods may struggle to find optimal solutions because the discrete nature of the data does not allow for smooth gradients [35]. This makes it difficult for the models to capture complex relationships and patterns hidden within the data [36].
The importance of feature engineering. Typical machine learning models can better adapt to large databases through some traditional feature engineering methods. However, neural networks are restricted by the representation of the original data, and require more complex feature extraction and representation learning processes [37]. Recent studies also confirmed that without massive datasets, advanced gradient boosting models often outperform deep learning architectures on tabular travel data [38]. If the input layer and preprocessing steps of the neural network are not carefully designed (e.g., only five variables are directly input), it may not be able to effectively process this data.
The influence of local features. When the amount of data becomes large, the gradient descent process of the neural network may be affected by noise or local optimal solutions. This may result in overfitting to the local features of some samples, which degrades its performance. In contrast, machine learning models can make better use of the newly added data to enhance their generalization ability. For example, RF constructs multiple decision trees by randomly selecting features and samples. It is therefore more robust to noise and local features, and the risk of overfitting is reduced.
The lack of rotation invariance. As stated in Grinsztajn et al. [36], neural networks are rotation-invariant learners, because their learning process does not rely on the direction of features. This characteristic helps neural networks perform outstandingly on data such as images: since the orientation of an image can be arbitrary, neural networks are still able to recognize the same objects. However, structured data (such as tabular data) generally do not have rotational invariance. For example, each feature in travel mode choice data has a fixed position and direction, and if the features are rotated, the performance of the model becomes worse. As a result, the advantages of neural networks cannot be fully utilized when dealing with residents’ travel mode data.
The possibility of overfitting. For some neural networks, especially the new NN models with complex structures, overfitting may be inevitable. Many papers about the new NN models did not report the total number of parameters of these models. But as introduced by NODE [26] and TabNet [29], this number can exceed one million. However, even in the largest dataset of this paper (D3), the number of samples is lower than one million. Under such conditions, the advantages of neural networks cannot be brought into play.
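As a rough illustration of the parameter-to-sample argument above, the sketch below counts the trainable parameters of a fully connected network. The layer sizes are hypothetical (five inputs, two hidden layers at the upper bound of 500 neurons from Table 5, and four output classes), and the helper names are assumptions introduced here rather than part of the study's code.

```python
def mlp_parameter_count(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input and output included.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def parameters_per_sample(n_parameters, n_samples):
    """Ratio of trainable parameters to training samples."""
    return n_parameters / n_samples


# Hypothetical configuration: 5 inputs, two hidden layers of 500 neurons, 4 classes.
n_parameters = mlp_parameter_count([5, 500, 500, 4])
# Sample count of D3 as listed in Table 1.
ratio_d3 = parameters_per_sample(n_parameters, 920_041)
print(n_parameters, round(ratio_d3, 3))
```

A model with more than one million parameters, as reported for NODE [26] and TabNet [29], gives a ratio above one even on D3, the largest dataset considered here.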
Finally, we want to mention that for travel behavior predictions, we cannot completely reject the possibility that there may exist an NN model which is better than ML models. Nevertheless, we find that the existing and commonly used NN models are unable to achieve this goal. At the same time, the design of a new NN model capable of achieving this effect may pose challenges for many users. While such a model may still emerge in the future, using ML models for predictions could be a better choice at present. In addition, in this paper, we only focus on typical tabular data, while some high-dimensional data such as GPS trajectories or image-based modes are not considered. In the future, it is also necessary to study NN superiority boundaries by collecting more data with different forms.
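The rotation-invariance explanation listed above can be illustrated with a small NumPy experiment. In the sense of Grinsztajn et al. [36], a learner is rotation-invariant if rotating the feature space does not change its predictions. The sketch below is not taken from this study: it uses synthetic data, a single axis-aligned split as a stand-in for tree-based models, and a least-squares linear model as the simplest rotation-invariant learner. All of these choices, and all function names, are assumptions made for illustration.

```python
import numpy as np


def best_stump_accuracy(X, y):
    """Best training accuracy of a single axis-aligned threshold split."""
    best = 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            # Allow either side of the split to be labelled as class 1.
            acc = max(np.mean(pred == y), np.mean(pred != y))
            best = max(best, float(acc))
    return best


def least_squares_predictions(X, y):
    """Fitted values of a linear model with intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w


# Synthetic data: the class depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = (X[:, 0] > 0).astype(int)

# Rotate the feature space by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
X_rot = X @ R.T

stump_original = best_stump_accuracy(X, y)
stump_rotated = best_stump_accuracy(X_rot, y)
linear_original = least_squares_predictions(X, y)
linear_rotated = least_squares_predictions(X_rot, y)

print(stump_original, stump_rotated)
print(np.allclose(linear_original, linear_rotated))
```

The axis-aligned split exploits the fact that each original feature is meaningful on its own and loses accuracy once the features are mixed, whereas the rotation-invariant learner produces the same fitted values in both coordinate systems.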
(2) Why are deficiencies of neural networks not revealed in previous studies?
In fact, comparisons between different models for travel mode predictions are not new. Such a topic has been discussed in some previous papers. However, none of them gives the clear conclusion that “ML models are better than NN models” [18,19]. We consider this from several different perspectives:
Firstly, in the studies proposing the ten new models, it is clear that none of them considered datasets about travel behaviors. Instead, the datasets used are about users’ ratings of movies [24], document ranking tasks [25], simulated physics experiments [28], inverse dynamics of an anthropomorphic robot arm [29], etc. Therefore, the comparative experiments in these articles may not have much reference value for our study. Although the data of travel behaviors also belong to “tabular data”, their features could be different from the datasets in other fields. In other words, simply applying these “models designed for tabular data” may not be a good choice for predicting travel behaviors.
Next, when we focus on the studies about travel behaviors, we find some important aspects:
The object of comparisons: Many previous studies concentrate on the performance of discrete choice models (DCMs), and MNL/BL is usually considered as a typical benchmark. At the same time, some other studies considered different types of logit models. For example, Wang et al. [13] and Salas et al. [14] studied the MMNL model when considering heterogeneity. Zhang et al. [7] and Le and Teng [5] used the NL model, while Nam and Cho [12], Xia et al. [16], and Püschel et al. [15] used both NL and CNL models. Wang et al. [19] studied an additional ten generalized linear models for a comprehensive comparison. Theoretically speaking, it is easy to understand that all of them can obtain the natural conclusion that “ML models are better than DCMs”. This is because the assumption of DCMs about utility norms is too simple and linear, and DCMs have trouble handling categorical explanatory variables. However, this is different from the topic discussed in this paper.
The focus of studies: In many papers considering the comparison between different models, this topic is not the main concern. Instead, the main topic is about the proposal of one or more new models, e.g., DNN-A model in Nam and Cho [12], BMTM-DLP model in Lai et al. [39], RE-BNN model in Xia et al. [16], MTLDNN-M model in Bei et al. [40], “the optimized CNN model” in Wen and Chen [17], and several NN models in Kashifi et al. [8], etc. These authors tried to prove that their new models are better than the selected benchmark models (usually logit models, as mentioned in the last paragraph). In other words, the effects of many machine learning models are not fully considered.
The metrics for evaluations: In many previous studies, the metrics for evaluations differ. Some of them are related to specific problems, e.g., Nam and Cho [12], Le and Teng [5], and Xia et al. [16] chose to study the predicted travel mode shares and their absolute errors. Similarly, Püschel et al. [15] considered the deviation in the predicted distributions. Some others focused on the losses or errors during training, including Wang et al. [13] and Abulibdeh [9]. Martín–Baos et al. [18] studied some new indicators which were not widely used, e.g., the GMPCA. The only metric considered by all the studies is accuracy. However, as shown in Table 10, for each dataset, the accuracy of the different models is always similar. If we only look at accuracy, it is too difficult to determine whether NN models are really better.
A similar situation could be observed in other papers and datasets, especially when the papers concentrated on model comparisons (Wang et al. [19]; Martín–Baos et al. [18]). For example, in Figure 2 of Wang et al. [19], the mean accuracy of DNN/RF/Boosting is 57.79/57.05/57.03, while in Table 7 of Martín–Baos et al. [18], that of DNN/RF/XGBoost is 77.23/76.86/78.83. In summary, for travel mode predictions, especially when the sample is imbalanced, we think accuracy is not good enough as a criterion for evaluations.
The size of datasets: In some papers, the metrics of NN models are significantly better than those of ML models, e.g., Wen and Chen [17]. However, since the number of samples in that paper is only 1192, we think the “good performance” of NN models may be due to overfitting. Actually, similar situations could be seen in other studies considering more datasets. For example, in Table 4 of Salas et al. [14], when the sample size is 1000, the accuracy of the NN model (0.840) is much better than that of all the ML models (max = 0.730). However, when the size increases to 3000 or 5000, the accuracy of the NN model declines (0.755) and becomes very close to the other ones (max = 0.748). Therefore, to avoid overfitting, we think it is necessary to choose a dataset that is not too small.
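The point made above about evaluation metrics, namely that accuracy alone can hide large differences on imbalanced samples, can be checked with a small sketch. The class counts and the two sets of predictions below are hypothetical, and macro-averaged F1 is assumed as the averaging scheme for illustration.

```python
import numpy as np


def accuracy_and_macro_f1(y_true, y_pred, n_classes):
    """Return (accuracy, macro-averaged F1) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    f1_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return accuracy, float(np.mean(f1_scores))


# Hypothetical imbalanced sample: 900 trips of class 0 and 100 trips of class 1.
y_true = np.array([0] * 900 + [1] * 100)

# Model A always predicts the majority class.
pred_a = np.zeros(1000, dtype=int)

# Model B misclassifies 50 trips of each class.
pred_b = y_true.copy()
pred_b[:50] = 1      # 50 majority trips predicted as class 1
pred_b[900:950] = 0  # 50 minority trips predicted as class 0

acc_a, f1_a = accuracy_and_macro_f1(y_true, pred_a, n_classes=2)
acc_b, f1_b = accuracy_and_macro_f1(y_true, pred_b, n_classes=2)
print(acc_a, f1_a)
print(acc_b, f1_b)
```

Both hypothetical models reach the same accuracy, yet only the second one identifies any minority-class trips, and the F1 score reflects this difference.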

7. Conclusions

In recent years, the application of neural networks has become very popular in many different fields. However, their effectiveness across all fields still needs to be evaluated. In this paper, we concentrate on the task of travel mode predictions, and we find something different: in five datasets with different sizes and classification features, the F1 scores of six classic NN models and ten new NN models are always worse than those of the five ML models, in terms of both the best and the average results. When considering some methods of improvement, including SMOTE, Near-Miss, and focal loss, the situation does not change. In terms of running speed, ML models are also much faster. Given the different classification characteristics of the tested datasets, we think the generality of such a conclusion could be guaranteed.
Next, we discuss two important topics: (1) why NN models are not as good as expected; (2) why these deficiencies are not revealed in many previous studies. For Topic (1), it may be due to the type of input variables, the importance of feature engineering, the influence of local features, the lack of rotation invariance, and the possibility of overfitting. For Topic (2), the first reason is that the development of these new NN models did not consider datasets of travel behavior. Also, the improvements on other tabular datasets do not lead to better performance when predicting travel mode choices. Next, for the traffic studies, the reasons may include the object of comparisons, the focus of studies, the metrics for evaluations, and the size of datasets, etc. In summary, this study provides an important theoretical basis for future research and engineering applications about travel mode predictions.
Although we have obtained many useful findings, much more work is still required. Firstly, more datasets are needed. Generally speaking, it is not easy to find suitable travel mode choice datasets with sufficiently large samples. Nevertheless, as time passes, more public datasets could emerge and become beneficial for this field. Secondly, the development of AI technologies is very fast. It is possible that some new NN models with more advanced structures will be better than the current ones, and the conclusions of this paper could change in the future. For example, better methods for feature engineering could be helpful. In addition, in order to obtain credible results, we only focus on real datasets in this paper. Whether synthetic datasets can be used still needs to be checked in the future. In summary, if we can accomplish all the aforementioned points, we may be able to propose a more comprehensive decision framework for this topic. Naturally, this will require additional efforts in future research.

Author Contributions

Conceptualization, C.-J.J.; methodology, T.Z. and C.-J.J.; software, T.Z.; validation, T.Z. and Y.S.; formal analysis, T.Z. and C.-J.J.; data curation, T.Z.; writing—original draft preparation, T.Z. and C.-J.J.; writing—review and editing, C.-J.J. and Y.S.; visualization, T.Z.; supervision, D.L.; funding acquisition, C.-J.J. and D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 71801036 and 71971056).

Data Availability Statement

The links for the public datasets used in this paper are: D1: https://github.com/brendadenisse16/Predicting-Travel-Mode-Choices-with-Logistic-Regression-and-SVM (accessed on 23 November 2025); D2: https://www.icevirtuallibrary.com/doi/suppl/10.1680/jsmic.17.00018 (accessed on 23 November 2025); D3: https://nhts.ornl.gov/.

Conflicts of Interest

The authors declare no conflict of interest.

Notes

1
The original names of the travel modes in D2 are not exactly the same as those in D3, e.g., “public transport” vs. “public transit”. Nevertheless, we treat them as the same modes because of their similarity.
2
Since the classic neural networks used in this study are not very “deep”, we do not use the concept of “DNN” in this paper. In addition, we think the CNN and MLP models used in this paper belong to the scope of “ANN”, but we do not put emphasis on this concept.

References

1. McFadden, D. Conditional Logit Analysis of Qualitative Choice Behavior. In Frontiers in Econometrics; Zarembka, P., Ed.; Academic Press: New York, NY, USA, 1974; pp. 105–142.
2. McFadden, D.; Train, K.E. Mixed MNL models for discrete response. J. Appl. Econom. 2000, 15, 447–470.
3. Naseri, H.; Waygood, E.O.D.; Patterson, Z.; Alousi-Jones, M.; Wang, B. Travel mode choice prediction: Developing new techniques to prioritize variables and interpret black-box machine learning techniques. Transp. Plan. Technol. 2025, 48, 582–605.
4. Li, W.; Kockelman, K.M. How does machine learning compare to conventional econometrics for transport data sets? A test of ML versus MLE. Growth Change 2022, 53, 342–376.
5. Le, J.; Teng, J. Understanding influencing factors of travel mode choice in urban-suburban travel: A case study in Shanghai. Urban Rail Transit 2023, 9, 127–146.
6. Benjdiya, O.; Rouky, N.; Benmoussa, O.; Fri, M. On the use of machine learning techniques and discrete choice models in mode choice analysis. LogForum Sci. J. Logist. 2023, 19, 331–345.
7. Zhang, Z.; Ji, C.; Wang, Y.; Yang, Y. A Customized Deep Neural Network Approach to Investigate Travel Mode Choice with Interpretable Utility Information. J. Adv. Transp. 2020, 2020, 5364252.
8. Kashifi, M.T.; Al-Rassas, A.M.; Bakar, K.A.; Al-Japairai, K.A.; Jamali, S.S. Predicting the travel mode choice with interpretable machine learning techniques: A comparative study. Travel Behav. Soc. 2022, 29, 279–296.
9. Abulibdeh, A. Analysis of mode choice affects from the introduction of Doha Metro using machine learning and statistical analysis. Transp. Res. Interdiscip. Perspect. 2023, 20, 100852.
10. Narayanan, S.; Tzenos, P.; Verani, E.; Vlahogianni, E.I. Can Bike-Sharing Reduce Car Use in Alexandroupolis? An Exploration through the Comparison of Discrete Choice and Machine Learning Models. Smart Cities 2023, 6, 1239–1253.
11. Kalantari, H.A.; Sabouri, S.; Brewer, S.; Ewing, R.; Tian, G. Machine learning in mode choice prediction as part of MPOs’ regional travel demand models: Is it time for change? Sustainability 2025, 17, 3580.
12. Nam, D.; Cho, J. Deep neural network design for modeling individual-level travel mode choice behavior. Sustainability 2020, 12, 7481.
13. Wang, S.; Zhao, J.; Lee, D.H. Deep neural networks for choice analysis: A statistical learning theory perspective. Transp. Res. Part B Methodol. 2021, 148, 60–81.
14. Salas, P.; Pezoa, R.; Oliveira, L.; Henríquez, G.; Raveau, S. A systematic comparative evaluation of machine learning classifiers and discrete choice models for travel mode choice in the presence of response heterogeneity. Expert Syst. Appl. 2022, 193, 116253.
15. Püschel, J.; Regue, R.; Gerike, R.; Nagel, K. Comparison of discrete choice and machine learning models for simultaneous modeling of mobility tool ownership in agent-based travel demand models. Transp. Res. Rec. 2024, 2678, 376–390.
16. Xia, Y.; Chen, H.; Zimmermann, R. A random effect bayesian neural network (RE-BNN) for travel mode choice analysis across multiple regions. Travel Behav. Soc. 2023, 30, 118–134.
17. Wen, X.; Chen, X. A New Breakthrough in Travel Behavior Modeling Using Deep Learning: A High-Accuracy Prediction Method Based on a CNN. Sustainability 2025, 17, 738.
18. Martín-Baos, J.A.; Ros, L.G.; García-García, F.; López-Sánchez, A.D.; Soria-Olivas, E.; Pérez-Bernabeu, E. A prediction and behavioural analysis of machine learning methods for modelling travel mode choice. Transp. Res. Part C Emerg. Technol. 2023, 156, 104318.
19. Wang, S.; Kockelman, K.M.; Lemp, J.D. Comparing hundreds of machine learning and discrete choice models for travel demand modeling: An empirical benchmark. Transp. Res. Part B Methodol. 2024, 190, 103061.
20. Shahdah, U.E.; Elharoun, M.; Ali, E.K.; Elbany, M.; Elagamy, S.R. Stated preference survey for predicting eco-friendly transportation choices among Mansoura University students. Innov. Infrastruct. Solut. 2025, 10, 180.
21. Jin, C.; Luo, Y.; Wu, C.; Song, Y.; Li, D. Exploring the Pedestrian Route Choice Behaviors by Machine Learning Models. Int. J. Geo-Inf. 2024, 13, 146.
22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
23. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-Normalizing Neural Networks. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
24. Song, W.; Shi, C.; Xiao, Z.; Wang, Z.; Sun, L.; Rossi, P. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), Beijing, China, 3–7 November 2019; pp. 1161–1170.
25. Badirli, S.; Tufan, A.; Kask, K.; Can, F.; User, H.B.; Yilmaz, A. Gradient Boosting Neural Networks: GrowNet. arXiv 2020, arXiv:2002.07971.
26. Popov, S.; Morozov, S.; Babenko, A. Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data. In Proceedings of the International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020.
27. Wang, R.; Fu, B.; Fu, G.; Wang, M. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1785–1797.
28. Gorishniy, Y.; Rubachev, I.; Gulin, A.; Babenko, A. Revisiting Deep Learning Models for Tabular Data. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021.
29. Arık, S.Ö.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI-21), Online, 2–9 February 2021; Volume 35, pp. 6679–6687.
30. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. arXiv 2017, arXiv:1703.04247.
31. Yan, J.; Chen, J.; Wang, Q.; Chen, D.Z.; Wu, J. Team up GBDTs and DNNs: Advancing Efficient and Effective Tabular Prediction with Tree-hybrid MLPs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), Barcelona, Spain, 25–29 August 2024.
32. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
33. Zhang, J.; Mani, I. KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. In Proceedings of the International Conference on Machine Learning (ICML), Washington, DC, USA, 21–24 August 2003.
34. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 318–327.
35. Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90.
36. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? arXiv 2022, arXiv:2207.08815.
37. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep Neural networks and tabular data: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 7499–7519.
38. Banyong, C.; Hantanong, N.; Nanthawong, S.; Se, C.; Wisutwattanasak, P.; Champahom, T.; Ratanavaraha, V.; Jomnonkwao, S. Machine learning-based analysis of travel mode preferences: Neural and boosting model comparison using stated preference data from Thailand’s emerging high-speed rail network. Big Data Cogn. Comput. 2025, 9, 155.
39. Lai, Z.; Chen, C.; Wang, Y.; Wang, J.; Xu, Z. Travel mode choice prediction based on personalized recommendation model. IET Intell. Transp. Syst. 2023, 17, 667–677.
40. Bei, H.; Liu, J.; Zhang, Y.; Wang, W. Joint prediction of travel mode choice and purpose from travel surveys: A multitask deep learning approach. Travel Behav. Soc. 2023, 33, 100625.
Figure 1. The distribution of travel mode choices in different datasets. (a) D1; (b) D2 and D2A; (c) D3A/D3B/D3-4; (d) D3A/D3B/D3-20 (only the results greater than 1% are shown).
Figure 2. The structures of MLP models. (a) MLP-1; (b) MLP-2.
Figure 3. The structures of CNN-1D models. (a) CNN-1D-1; (b) CNN-1D-2.
Figure 4. The structures of CNN-2D models. (a) CNN-2D-1; (b) CNN-2D-2.
Table 1. The real datasets studied in this paper.
| Dataset | Source | Year | Country | Number of Samples | Number of Total Variables | Number of Types of Travel Modes |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | MPN | 2018 | The Netherlands | 7310 | 56 | 8 |
| D2A | LPMC | 2012–2015 | UK | 20,000 | 36 | 4 |
| D2 | LPMC | 2012–2015 | UK | 81,096 | 36 | 4 |
| D3A | NHTS | 2017 | US | 79,707 | 115 | 4, 20 |
| D3B | NHTS | 2017 | US | 149,445 | 115 | 4, 20 |
| D3 | NHTS | 2017 | US | 920,041 | 115 | 4, 20 |
Table 2. The main independent variables considered for different datasets and their statistics.
| Dataset | Variable | Explanation | Statistics |
| --- | --- | --- | --- |
| D1 | ROLAUTO | Role in travel | Drivers: 78.8%; Passengers: 21.2% |
| | KAFSTV | Travel distance | Mean: 13.00; Median: 4.35; Mode: 1.75; Std: 17.83 |
| | AANTRIT | Number of travels per day | Mean: 1.73; Median: 1.00; Mode: 1; Std: 1.35 |
| | VERPL | Travel features | No new trip: 19.1%; New trip: 79.3%; Trip abroad: 0.9%; Others: 0.7% |
| | KMOTIEF | Purpose of travel | Work: 21.2%; Business-related visit: 1.5%; Personal care: 3.0%; Shopping: 19.4%; Education/courses: 6.4%; Visitation: 8.8%; Social/recreational: 19.5%; Touring: 5.1%; Others: 15.0% |
| D2A | AGE | Age | Mean: 39.60; Median: 38.00; Mode: 35; Std: 19.57 |
| | SEX | Sex | Female: 53.2%; Male: 46.8% |
| | DISTANCE | Travel distance | Mean: 4515.34; Median: 2737.50; Mode: 1286; Std: 4770.33 |
| | OWNERSHIP | Number of family vehicles | Mean: 1.00; Median: 1.00; Mode: 1; Std: 0.75 |
| | PURPOSE | Purpose of travel | Work: 15.8%; Education: 11.0%; Employers’ business: 7.0%; Home-based other: 52.1%; Non-home-based other: 14.1% |
| D2 | AGE | Age | Mean: 39.46; Median: 38.00; Mode: 35; Std: 19.23 |
| | SEX | Sex | Female: 52.6%; Male: 47.4% |
| | DISTANCE | Travel distance | Mean: 4605.26; Median: 2814.00; Mode: 1309; Std: 4782.35 |
| | OWNERSHIP | Number of family vehicles | Mean: 0.98; Median: 1.00; Mode: 1; Std: 0.75 |
| | PURPOSE | Purpose of travel | Work: 16.7%; Education: 11.4%; Employers’ business: 7.1%; Home-based other: 51.0%; Non-home-based other: 13.8% |
| D3A | R_AGE | Age | Mean: 49.09; Median: 53.00; Mode: 65; Std: 20.55 |
| | R_SEX | Sex | Female: 54.1%; Male: 45.9% |
| | TRPMILES | Travel distance | Mean: 11.73; Median: 3.45; Mode: 1.0; Std: 74.16 |
| | HHVEHCNT | Number of family vehicles | Mean: 2.24; Median: 2.00; Mode: 2; Std: 1.20 |
| | TRIPPURP | Purpose of travel | Work: 12.9%; Shopping: 21.3%; Social/recreational: 12.2%; Other: 20.0%; Not home-based: 33.6% |
| D3B | R_AGE | Age | Mean: 49.03; Median: 53.00; Mode: 65; Std: 20.62 |
| | R_SEX | Sex | Female: 53.1%; Male: 46.9% |
| | TRPMILES | Travel distance | Mean: 11.52; Median: 3.44; Mode: 1.0; Std: 82.56 |
| | HHVEHCNT | Number of family vehicles | Mean: 2.24; Median: 2.00; Mode: 2; Std: 1.22 |
| | TRIPPURP | Purpose of travel | Work: 12.7%; Shopping: 21.3%; Social/recreational: 12.4%; Other: 19.8%; Not home-based: 33.8% |
| D3 | R_AGE | Age | Mean: 49.16; Median: 53.00; Mode: 65; Std: 20.64 |
| | R_SEX | Sex | Female: 53.2%; Male: 46.8% |
| | TRPMILES | Travel distance | Mean: 11.36; Median: 3.44; Mode: 1.0; Std: 74.41 |
| | HHVEHCNT | Number of family vehicles | Mean: 2.23; Median: 2.00; Mode: 2; Std: 1.20 |
| | TRIPPURP | Purpose of travel | Work: 12.7%; Shopping: 20.6%; Social/recreational: 11.9%; Other: 20.6%; Not home-based: 33.6% |
Table 3. The basic features of the ten new NN models.
| Model | Basic Feature | Design for Tabular Data |
| --- | --- | --- |
| ResNet | It is a simple variation of the original ResNet. There is an almost clear path from the input to output. | It reuses well-established DL building blocks, which is beneficial for optimization. |
| SNN | It does not face vanishing or exploding gradient problems, since neuron activations automatically converge towards zero mean and unit variance. | It enables self-normalization by using the SELU activation function and specific weight initialization. |
| AutoInt | It can automatically learn high-order feature interactions, and the attention mechanism can show the correlations between different features. | It is designed for recommender systems where the features are sparse and contain high-dimensional tabular data. |
| GrowNet | It uses shallow neural networks as weak learners, and the final output is the weighted sum of all weak learners. | It is faster and easier to train, since it incorporates second-order statistics and a global corrective step for fine-tuning. |
| NODE | It is composed of differentiable oblivious decision trees (ODTs), which is a sequence of k NODE layers following the DenseNet model. | It is fully differentiable and allows constructing multi-layer architectures for end-to-end training. |
| DCN-V2 | It consists of an embedding layer, a cross network with multiple cross layers, and a deep network, which are helpful for learning explicit and implicit features. | The cross layers are designed to learn bounded-degree feature interactions, which are useful for tabular data where feature combinations are crucial. |
| FT-Transformer | It transforms all features (categorical and numerical) to embeddings, and applies a stack of transformer layers to these embeddings. | Based on the transformer, it uses multi-head self-attention and pre-normalization for better performance of tabular data. |
| TabNet | It consists of multiple decision steps. It uses a learnable mask for feature selection at each step, and processes the selected features through a transformer. | It employs sequential attention for instance-wise feature selection, which is beneficial for tabular data with redundant features. |
| DeepFM | It integrates the architectures of factorization machines (FMs) and deep neural networks (DNNs), sharing the same input to learn both low- and high-order feature interactions. | It is designed to capture both low- and high-order feature interactions from raw features without any feature engineering, effectively handling sparse categorical data. |
| GBDT + DNN | It incorporates a GBDT-based feature gate for sample-specific feature selection, followed by simplified sparse MLP blocks for prediction. | It integrates the advantages of GBDTs (efficient feature selection) and DNNs (smooth optimization) to address the model selection dilemma on tabular datasets. |
Table 4. The dominant hyper-parameters of five typical ML models.
| Model | Hyper-Parameter | Search Range |
| --- | --- | --- |
| LR | C | [0.001, 100] |
| | solver | [‘liblinear’, ‘lbfgs’, ‘sag’, ‘saga’] |
| KNN | n_neighbors | [3, 29] |
| | weights | [‘uniform’, ‘distance’] |
| | p | [1, 2] |
| DT | max_depth | [1, 11] |
| | min_samples_split | [2, 10] |
| | min_samples_leaf | [1, 4] |
| | criterion | [‘gini’, ‘entropy’] |
| | max_features | [‘sqrt’, ‘log2’] |
| RF | n_estimators | [10, 200] |
| | max_depth | [1, 15] |
| | min_samples_split | [2, 10] |
| | min_samples_leaf | [1, 4] |
| | max_features | [‘sqrt’, ‘log2’] |
| | bootstrap | [True, False] |
| | criterion | [‘gini’, ‘entropy’] |
| XGB | n_estimators | [10, 200] |
| | max_depth | [1, 9] |
| | learning_rate | [0.01, 0.2] |
| | subsample | [0.6, 1.0] |
| | colsample_bytree | [0.6, 1.0] |
| | gamma | [0, 0.2] |
| | reg_alpha | [0, 0.5] |
| | reg_lambda | [0, 0.5] |
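As an illustration of how the search ranges in Table 4 could be used, the sketch below encodes the RF ranges as a Python dictionary and draws random candidate configurations from it. The search strategy of the study is not restated in this table, so random sampling, the inclusive treatment of the integer ranges, and the names `RF_SEARCH_SPACE` and `sample_configuration` are all assumptions made for illustration.

```python
import random

# RF search ranges copied from Table 4; integer ranges are treated as
# inclusive here (an assumption).
RF_SEARCH_SPACE = {
    "n_estimators": ("int", 10, 200),
    "max_depth": ("int", 1, 15),
    "min_samples_split": ("int", 2, 10),
    "min_samples_leaf": ("int", 1, 4),
    "max_features": ("choice", ["sqrt", "log2"]),
    "bootstrap": ("choice", [True, False]),
    "criterion": ("choice", ["gini", "entropy"]),
}


def sample_configuration(space, rng):
    """Draw one candidate configuration uniformly from the search space."""
    config = {}
    for name, spec in space.items():
        if spec[0] == "int":
            config[name] = rng.randint(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config


rng = random.Random(42)
candidates = [sample_configuration(RF_SEARCH_SPACE, rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

Each candidate would then be trained and scored on a validation split, and the configuration with the best F1 score retained.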
Table 5. The dominant hyper-parameters of six classic NN models.
| Model | Hyper-Parameter | Search Range |
| --- | --- | --- |
| MLP-1 | Number of neurons | [10, 500] |
| | Dropout rate | [0, 0.3] |
| | Batch size | [4, 32] |
| MLP-2 | Number of neurons in Layer 1, 2 | [10, 500] |
| | Dropout rate | [0, 0.3] |
| | Batch size | [4, 32] |
| CNN-1D-1 | Filters | [32, 512] |
| | Kernel size | [3, 5] |
| | Number of neurons | [10, 500] |
| | Pooling size | [2, 3] |
| | Dropout rate | [0, 0.3] |
| | Batch size | [4, 32] |
| CNN-1D-2 | Filters of Layer 1, 2 | [32, 512] |
| | Kernel size of Layer 1, 2 | [3, 5] |
| | Number of neurons of Layer 1, 2 | [10, 500] |
| | Pooling size | [2, 3] |
| | Dropout rate | [0, 0.3] |
| | Batch size | [4, 32] |
| CNN-2D-1 | Filters | [32, 512] |
| | Kernel size | [(3,1), (5,1)] |
| | Number of neurons | [10, 500] |
| | Pooling size | [(2,1), (3,1)] |
| | Dropout rate | [0, 0.3] |
| | Batch size | [4, 32] |
| CNN-2D-2 | Filters of Layer 1, 2 | [32, 512] |
| | Kernel size of Layer 1, 2 | [(3,1), (5,1)] |
| | Number of neurons of Layer 1, 2 | [10, 500] |
| | Pooling size | [(2,1), (3,1)] |
| | Dropout rate | [0, 0.3] |
| | Batch size | [4, 32] |
Table 6. Typical examples about the optimization of the model parameters.
Table 6. Typical examples about the optimization of the model parameters.
(a) The default and optimized parameters of D3B-4.
Model | Hyper-parameter | Default value | Optimized value
KNN | n_neighbors | 5 | 18
 | weights | uniform | distance
 | p | 2 | 1
RF | n_estimators | 100 | 200
 | max_depth | None | None
 | min_samples_split | 2 | 2
 | min_samples_leaf | 1 | 1
 | max_features | auto | log2
 | bootstrap | TRUE | TRUE
 | criterion | gini | entropy
MLP-2 | dense_units1 | 100 | 100
 | dense_units2 | 100 | 100
 | dropout_rate | 0.2 | 0.1
 | batch_size | 32 | 16
CNN-1D-1 | filters | 64 | 128
 | kernel_size | 3 | 3
 | pool_size | 2 | 2
 | dense_units | 100 | 200
 | dropout_rate | 0.2 | 0.01
 | batch_size | 32 | 16
(b) The default and optimized results of D3B-4.
Statistics | KNN (default) | KNN (optimized) | RF (default) | RF (optimized) | MLP-2 (default) | MLP-2 (optimized) | CNN-1D-1 (default) | CNN-1D-1 (optimized)
Accuracy | 0.886 | 0.911 | 0.918 | 0.920 | 0.889 | 0.889 | 0.883 | 0.885
Precision | 0.590 | 0.739 | 0.728 | 0.744 | 0.672 | 0.677 | 0.770 | 0.769
Recall | 0.433 | 0.574 | 0.610 | 0.611 | 0.401 | 0.399 | 0.368 | 0.379
F1 score | 0.471 | 0.630 | 0.655 | 0.658 | 0.434 | 0.459 | 0.402 | 0.416
Table 7. The averaged metrics of the five ML models. The best F1 scores are marked by red fonts.
D1 | LR | KNN | DT | RF | XGB
Accuracy | 0.823 | 0.843 | 0.840 | 0.856 | 0.865
Precision | 0.806 | 0.711 | 0.730 | 0.746 | 0.728
Recall | 0.646 | 0.682 | 0.677 | 0.683 | 0.691
F1 score | 0.637 | 0.808 | 0.741 | 0.784 | 0.818

D2A | LR | KNN | DT | RF | XGB
Accuracy | 0.686 | 0.748 | 0.776 | 0.801 | 0.795
Precision | 0.763 | 0.771 | 0.702 | 0.783 | 0.793
Recall | 0.505 | 0.693 | 0.715 | 0.724 | 0.713
F1 score | 0.507 | 0.723 | 0.708 | 0.748 | 0.743

D2 | LR | KNN | DT | RF | XGB
Accuracy | 0.651 | 0.752 | 0.757 | 0.782 | 0.744
Precision | 0.733 | 0.795 | 0.690 | 0.747 | 0.708
Recall | 0.454 | 0.697 | 0.702 | 0.720 | 0.612
F1 score | 0.458 | 0.730 | 0.696 | 0.732 | 0.638

D3A-4 | LR | KNN | DT | RF | XGB
Accuracy | 0.871 | 0.906 | 0.888 | 0.914 | 0.910
Precision | 0.605 | 0.711 | 0.590 | 0.723 | 0.729
Recall | 0.301 | 0.579 | 0.610 | 0.605 | 0.572
F1 score | 0.571 | 0.627 | 0.600 | 0.650 | 0.626

D3A-20 | LR | KNN | DT | RF | XGB
Accuracy | 0.438 | 0.481 | 0.529 | 0.564 | 0.552
Precision | 0.769 | 0.251 | 0.357 | 0.426 | 0.520
Recall | 0.076 | 0.182 | 0.407 | 0.422 | 0.337
F1 score | 0.205 | 0.544 | 0.391 | 0.444 | 0.385

D3B-4 | LR | KNN | DT | RF | XGB
Accuracy | 0.864 | 0.911 | 0.891 | 0.920 | 0.909
Precision | 0.639 | 0.739 | 0.584 | 0.744 | 0.727
Recall | 0.263 | 0.574 | 0.613 | 0.609 | 0.535
F1 score | 0.508 | 0.630 | 0.598 | 0.658 | 0.593

D3B-20 | LR | KNN | DT | RF | XGB
Accuracy | 0.428 | 0.539 | 0.503 | 0.543 | 0.526
Precision | 0.835 | 0.485 | 0.353 | 0.436 | 0.504
Recall | 0.067 | 0.300 | 0.378 | 0.381 | 0.272
F1 score | 0.131 | 0.378 | 0.364 | 0.403 | 0.315

D3-4 | LR | KNN | DT | RF | XGB
Accuracy | 0.880 | 0.890 | 0.889 | 0.910 | 0.904
Precision | 0.623 | 0.562 | 0.570 | 0.672 | 0.676
Recall | 0.310 | 0.442 | 0.594 | 0.595 | 0.443
F1 score | 0.334 | 0.472 | 0.581 | 0.625 | 0.476

D3-20 | LR | KNN | DT | RF | XGB
Accuracy | 0.451 | 0.458 | 0.466 | 0.489 | 0.487
Precision | 0.765 | 0.230 | 0.292 | 0.332 | 0.413
Recall | 0.102 | 0.162 | 0.299 | 0.316 | 0.165
F1 score | 0.192 | 0.221 | 0.295 | 0.323 | 0.211
Table 8. The averaged metrics of six classic NN models. The best F1 scores are marked by red fonts.
D1 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.821 | 0.854 | 0.822 | 0.851 | 0.833 | 0.852
Precision | 0.807 | 0.835 | 0.808 | 0.819 | 0.818 | 0.833
Recall | 0.649 | 0.679 | 0.651 | 0.678 | 0.661 | 0.676
F1 score | 0.643 | 0.679 | 0.646 | 0.688 | 0.657 | 0.673

D2A | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.708 | 0.712 | 0.710 | 0.711 | 0.708 | 0.710
Precision | 0.781 | 0.768 | 0.779 | 0.756 | 0.781 | 0.631
Recall | 0.525 | 0.529 | 0.530 | 0.530 | 0.527 | 0.531
F1 score | 0.525 | 0.529 | 0.528 | 0.554 | 0.526 | 0.579

D2 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.707 | 0.707 | 0.707 | 0.708 | 0.705 | 0.707
Precision | 0.780 | 0.780 | 0.781 | 0.765 | 0.779 | 0.756
Recall | 0.527 | 0.524 | 0.524 | 0.525 | 0.525 | 0.525
F1 score | 0.526 | 0.524 | 0.524 | 0.525 | 0.524 | 0.550

D3A-4 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.878 | 0.886 | 0.878 | 0.886 | 0.887 | 0.886
Precision | 0.685 | 0.623 | 0.707 | 0.650 | 0.718 | 0.634
Recall | 0.362 | 0.413 | 0.364 | 0.407 | 0.394 | 0.400
F1 score | 0.469 | 0.442 | 0.447 | 0.440 | 0.453 | 0.459

D3A-20 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.470 | 0.477 | 0.465 | 0.478 | 0.478 | 0.478
Precision | 0.503 | 0.328 | 0.428 | 0.374 | 0.476 | 0.388
Recall | 0.138 | 0.190 | 0.140 | 0.190 | 0.144 | 0.155
F1 score | 0.294 | 0.364 | 0.326 | 0.279 | 0.348 | 0.378

D3B-4 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.885 | 0.888 | 0.884 | 0.889 | 0.888 | 0.889
Precision | 0.769 | 0.654 | 0.704 | 0.644 | 0.704 | 0.677
Recall | 0.379 | 0.385 | 0.369 | 0.404 | 0.383 | 0.399
F1 score | 0.416 | 0.427 | 0.430 | 0.440 | 0.466 | 0.459

D3B-20 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.462 | 0.476 | 0.461 | 0.469 | 0.469 | 0.472
Precision | 0.592 | 0.482 | 0.661 | 0.685 | 0.486 | 0.468
Recall | 0.114 | 0.145 | 0.110 | 0.117 | 0.119 | 0.131
F1 score | 0.249 | 0.311 | 0.235 | 0.243 | 0.322 | 0.338

D3-4 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.890 | 0.895 | 0.890 | 0.895 | 0.895 | 0.895
Precision | 0.820 | 0.751 | 0.826 | 0.705 | 0.725 | 0.726
Recall | 0.369 | 0.395 | 0.369 | 0.406 | 0.401 | 0.403
F1 score | 0.406 | 0.430 | 0.404 | 0.422 | 0.433 | 0.438

D3-20 | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2
Accuracy | 0.468 | 0.475 | 0.471 | 0.474 | 0.472 | 0.475
Precision | 0.779 | 0.766 | 0.743 | 0.764 | 0.721 | 0.737
Recall | 0.126 | 0.138 | 0.129 | 0.133 | 0.130 | 0.141
F1 score | 0.116 | 0.127 | 0.119 | 0.122 | 0.118 | 0.129
Table 9. The averaged metrics of the ten new NN models. The best F1 scores are marked by red fonts.
D1 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.843 | 0.835 | 0.858 | 0.838 | 0.768 | 0.840 | 0.834 | 0.841 | 0.832 | 0.848
Precision | 0.769 | 0.804 | 0.823 | 0.822 | 0.781 | 0.819 | 0.815 | 0.843 | 0.819 | 0.810
Recall | 0.690 | 0.674 | 0.720 | 0.677 | 0.585 | 0.681 | 0.669 | 0.671 | 0.660 | 0.684
F1 score | 0.756 | 0.688 | 0.732 | 0.678 | 0.591 | 0.696 | 0.693 | 0.673 | 0.656 | 0.683

D2A | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.710 | 0.705 | 0.753 | 0.705 | 0.545 | 0.708 | 0.706 | 0.708 | 0.709 | 0.710
Precision | 0.642 | 0.657 | 0.667 | 0.779 | 0.664 | 0.653 | 0.781 | 0.785 | 0.782 | 0.783
Recall | 0.530 | 0.530 | 0.696 | 0.528 | 0.415 | 0.530 | 0.525 | 0.523 | 0.528 | 0.529
F1 score | 0.629 | 0.553 | 0.678 | 0.526 | 0.408 | 0.579 | 0.524 | 0.525 | 0.528 | 0.529

D2 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.708 | 0.705 | 0.746 | 0.707 | 0.539 | 0.707 | 0.702 | 0.703 | 0.705 | 0.708
Precision | 0.708 | 0.757 | 0.671 | 0.750 | 0.671 | 0.685 | 0.779 | 0.783 | 0.780 | 0.781
Recall | 0.525 | 0.521 | 0.684 | 0.526 | 0.402 | 0.527 | 0.520 | 0.517 | 0.525 | 0.527
F1 score | 0.551 | 0.547 | 0.676 | 0.525 | 0.389 | 0.576 | 0.520 | 0.520 | 0.524 | 0.527

D3A-4 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.880 | 0.886 | 0.881 | 0.888 | 0.859 | 0.888 | 0.879 | 0.860 | 0.886 | 0.888
Precision | 0.649 | 0.691 | 0.639 | 0.721 | 0.894 | 0.633 | 0.782 | 0.791 | 0.809 | 0.747
Recall | 0.347 | 0.391 | 0.396 | 0.395 | 0.251 | 0.405 | 0.379 | 0.273 | 0.387 | 0.413
F1 score | 0.387 | 0.428 | 0.445 | 0.476 | 0.257 | 0.462 | 0.386 | 0.268 | 0.419 | 0.440

D3A-20 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.473 | 0.475 | 0.485 | 0.481 | 0.476 | 0.481 | 0.474 | 0.439 | 0.479 | 0.480
Precision | 0.524 | 0.551 | 0.506 | 0.620 | 0.602 | 0.487 | 0.796 | 0.870 | 0.743 | 0.715
Recall | 0.146 | 0.143 | 0.171 | 0.142 | 0.148 | 0.157 | 0.124 | 0.066 | 0.127 | 0.152
F1 score | 0.293 | 0.271 | 0.251 | 0.240 | 0.224 | 0.340 | 0.103 | 0.059 | 0.115 | 0.141

D3B-4 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.884 | 0.885 | 0.883 | 0.887 | 0.860 | 0.889 | 0.877 | 0.861 | 0.888 | 0.889
Precision | 0.648 | 0.630 | 0.658 | 0.670 | 0.834 | 0.661 | 0.758 | 0.806 | 0.738 | 0.690
Recall | 0.374 | 0.391 | 0.385 | 0.377 | 0.262 | 0.409 | 0.386 | 0.282 | 0.380 | 0.421
F1 score | 0.412 | 0.449 | 0.428 | 0.415 | 0.248 | 0.440 | 0.395 | 0.278 | 0.413 | 0.446

D3B-20 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.469 | 0.887 | 0.477 | 0.477 | 0.431 | 0.478 | 0.469 | 0.439 | 0.477 | 0.481
Precision | 0.624 | 0.637 | 0.504 | 0.639 | 0.858 | 0.433 | 0.800 | 0.870 | 0.699 | 0.727
Recall | 0.120 | 0.406 | 0.133 | 0.137 | 0.067 | 0.149 | 0.116 | 0.066 | 0.129 | 0.160
F1 score | 0.120 | 0.436 | 0.258 | 0.203 | 0.043 | 0.339 | 0.100 | 0.059 | 0.117 | 0.146

D3-4 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.885 | 0.892 | 0.886 | 0.896 | 0.865 | 0.896 | 0.885 | 0.868 | 0.894 | 0.896
Precision | 0.734 | 0.663 | 0.737 | 0.718 | 0.838 | 0.701 | 0.796 | 0.802 | 0.716 | 0.770
Recall | 0.346 | 0.404 | 0.399 | 0.413 | 0.251 | 0.406 | 0.372 | 0.301 | 0.395 | 0.428
F1 score | 0.383 | 0.436 | 0.399 | 0.445 | 0.233 | 0.443 | 0.394 | 0.298 | 0.432 | 0.459

D3-20 | ResNet | SNN | AutoInt | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
Accuracy | 0.450 | 0.473 | 0.466 | 0.475 | 0.432 | 0.475 | 0.467 | 0.433 | 0.474 | 0.474
Precision | 0.763 | 0.716 | 0.674 | 0.469 | 0.902 | 0.409 | 0.794 | 0.824 | 0.708 | 0.737
Recall | 0.103 | 0.147 | 0.131 | 0.157 | 0.086 | 0.142 | 0.129 | 0.080 | 0.135 | 0.173
F1 score | 0.103 | 0.134 | 0.125 | 0.147 | 0.053 | 0.135 | 0.114 | 0.067 | 0.124 | 0.146
Table 10. The averaged F1 scores for different types of models. The best F1 scores are marked by red fonts.
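Model type | D1 | D2A | D2 | D3A-4 | D3A-20 | D3B-4 | D3B-20 | D3-4 | D3-20
ML | 0.758 | 0.686 | 0.651 | 0.615 | 0.394 | 0.597 | 0.318 | 0.498 | 0.248
Classic NN | 0.664 | 0.557 | 0.529 | 0.452 | 0.332 | 0.440 | 0.283 | 0.422 | 0.122
New NN | 0.685 | 0.548 | 0.536 | 0.397 | 0.204 | 0.392 | 0.182 | 0.392 | 0.115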
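The entries of Table 10 appear to be unweighted means of the per-model F1 scores in Tables 7 to 9. As a quick check under that assumption, the sketch below recomputes the two D1 entries for the ML and classic NN families from the values reported in Tables 7 and 8; the variable names are illustrative.

```python
from statistics import mean

# F1 scores on D1, copied from Table 7 (ML models) and Table 8 (classic NN models).
f1_ml_d1 = {"LR": 0.637, "KNN": 0.808, "DT": 0.741, "RF": 0.784, "XGB": 0.818}
f1_classic_nn_d1 = {
    "CNN-1D-1": 0.643, "CNN-1D-2": 0.679, "CNN-2D-1": 0.646,
    "CNN-2D-2": 0.688, "MLP-1": 0.657, "MLP-2": 0.673,
}

# Assumption: Table 10 is an unweighted mean over the models in each family.
ml_avg = round(mean(f1_ml_d1.values()), 3)
classic_nn_avg = round(mean(f1_classic_nn_d1.values()), 3)

print(ml_avg, classic_nn_avg)  # 0.758 0.664
```

Both values match the D1 column of Table 10, which supports reading the table as a simple average over the models in each family.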
Table 11. The standard deviations of the metrics of some models for D3B.
D3B-4 | DT | RF | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2 | ResNet
Accuracy | 0.000 | 0.000 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.006
Precision | 0.001 | 0.002 | 0.013 | 0.075 | 0.025 | 0.085 | 0.064 | 0.079 | 0.024
Recall | 0.001 | 0.000 | 0.012 | 0.013 | 0.018 | 0.040 | 0.010 | 0.014 | 0.036
F1 score | 0.001 | 0.001 | 0.010 | 0.008 | 0.016 | 0.027 | 0.008 | 0.012 | 0.034

D3B-4 | FT-Transformer | SNN | NODE | TabNet | GrowNet | DCN-V2 | AutoInt | DeepFM | GBDT + DNN
Accuracy | 0.010 | 0.004 | 0.003 | 0.002 | 0.002 | 0.002 | 0.001 | 0.001 | 0.004
Precision | 0.057 | 0.063 | 0.115 | 0.007 | 0.077 | 0.032 | 0.011 | 0.095 | 0.103
Recall | 0.058 | 0.026 | 0.029 | 0.011 | 0.008 | 0.020 | 0.003 | 0.014 | 0.020
F1 score | 0.061 | 0.073 | 0.035 | 0.012 | 0.008 | 0.015 | 0.001 | 0.011 | 0.009

D3B-20 | DT | RF | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | MLP-1 | MLP-2 | ResNet
Accuracy | 0.000 | 0.000 | 0.003 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.009
Precision | 0.001 | 0.001 | 0.040 | 0.026 | 0.036 | 0.021 | 0.026 | 0.051 | 0.028
Recall | 0.001 | 0.001 | 0.005 | 0.009 | 0.004 | 0.006 | 0.004 | 0.009 | 0.007
F1 score | 0.001 | 0.001 | 0.006 | 0.010 | 0.005 | 0.006 | 0.003 | 0.006 | 0.007

D3B-20 | FT-Transformer | SNN | NODE | TabNet | GrowNet | DCN-V2 | AutoInt | DeepFM | GBDT + DNN
Accuracy | 0.007 | 0.002 | 0.001 | 0.002 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001
Precision | 0.045 | 0.018 | 0.070 | 0.048 | 0.037 | 0.053 | 0.061 | 0.039 | 0.029
Recall | 0.011 | 0.023 | 0.014 | 0.012 | 0.009 | 0.008 | 0.004 | 0.006 | 0.010
F1 score | 0.011 | 0.020 | 0.009 | 0.026 | 0.018 | 0.068 | 0.056 | 0.006 | 0.006
Table 12. The effect of SMOTE on some models for D3A-20.
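D3A-20 | XGB (Original) | XGB (SMOTE) | RF (Original) | RF (SMOTE) | MLP-2 (Original) | MLP-2 (SMOTE) | CNN-2D-1 (Original) | CNN-2D-1 (SMOTE) | DCN-V2 (Original) | DCN-V2 (SMOTE)
Accuracy | 0.552 | 0.827 | 0.564 | 0.909 | 0.480 | 0.676 | 0.471 | 0.593 | 0.481 | 0.699
Precision | 0.520 | 0.804 | 0.426 | 0.907 | 0.685 | 0.639 | 0.428 | 0.559 | 0.487 | 0.673
Recall | 0.337 | 0.827 | 0.422 | 0.909 | 0.142 | 0.675 | 0.140 | 0.593 | 0.157 | 0.700
F1 score | 0.385 | 0.811 | 0.444 | 0.908 | 0.195 | 0.647 | 0.326 | 0.561 | 0.340 | 0.677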
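For readers unfamiliar with SMOTE, the following is a minimal sketch of its core step: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its k nearest minority neighbours. The function `smote_sample` and the toy data are hypothetical and written only to illustrate the idea; this section does not state which implementation or settings were used for Table 12.

```python
import math
import random


def smote_sample(minority, k, n_new, rng):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Illustrative sketch only; not the implementation used in the study.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the base itself excluded)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority, k=2, n_new=6, rng=random.Random(0))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.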
Table 13. The effect of Near-Miss on some models for D2.
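D2 | LR (Original) | LR (Near-Miss) | RF (Original) | RF (Near-Miss) | MLP-1 (Original) | MLP-1 (Near-Miss) | CNN-1D-1 (Original) | CNN-1D-1 (Near-Miss) | FT-Transformer (Original) | FT-Transformer (Near-Miss)
Accuracy | 0.651 | 0.607 | 0.782 | 0.855 | 0.705 | 0.741 | 0.707 | 0.763 | 0.702 | 0.747
Precision | 0.733 | 0.599 | 0.747 | 0.855 | 0.779 | 0.745 | 0.780 | 0.766 | 0.779 | 0.767
Recall | 0.454 | 0.607 | 0.720 | 0.855 | 0.525 | 0.741 | 0.527 | 0.762 | 0.520 | 0.747
F1 score | 0.458 | 0.595 | 0.732 | 0.855 | 0.524 | 0.739 | 0.526 | 0.761 | 0.520 | 0.741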
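Near-Miss is an under-sampling method: instead of creating minority samples, it discards majority samples. The sketch below illustrates the NearMiss-1 variant, which keeps the majority points whose mean distance to their k nearest minority points is smallest. The variant, the function name `near_miss_1`, and the toy data are assumptions for illustration; this section does not specify which version was used for Table 13.

```python
import math


def near_miss_1(majority, minority, k, n_keep):
    """Keep the n_keep majority points closest (on average) to the minority class.

    Each majority point is scored by the mean distance to its k nearest
    minority points; the lowest-scoring points are retained.
    """
    def score(point):
        nearest = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=score)[:n_keep]


# Toy data with two features (made-up values).
minority = [(0.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (3.0, 3.0), (1.0, 0.0), (5.0, 5.0), (0.2, 0.8)]

# Under-sample the majority class down to the size of the minority class.
kept = near_miss_1(majority, minority, k=2, n_keep=len(minority))
print(kept)  # [(0.2, 0.8), (0.5, 0.5)]
```

The retained majority points are those nearest the class boundary, which balances the class sizes at the cost of discarding most of the majority data.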
Table 14. The effect of using focal loss on some models for D3B-20.
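D3B-20 | DT (Original) | DT (Focal Loss) | RF (Original) | RF (Focal Loss) | MLP-2 (Original) | MLP-2 (Focal Loss) | CNN-2D-1 (Original) | CNN-2D-1 (Focal Loss) | ResNet (Original) | ResNet (Focal Loss)
Accuracy | 0.503 | 0.499 | 0.543 | 0.545 | 0.472 | 0.478 | 0.461 | 0.468 | 0.469 | 0.467
Precision | 0.353 | 0.332 | 0.436 | 0.448 | 0.468 | 0.717 | 0.661 | 0.647 | 0.624 | 0.420
Recall | 0.378 | 0.379 | 0.381 | 0.377 | 0.131 | 0.134 | 0.110 | 0.132 | 0.120 | 0.122
F1 score | 0.364 | 0.366 | 0.403 | 0.405 | 0.338 | 0.351 | 0.235 | 0.241 | 0.120 | 0.226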
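Focal loss rescales the cross-entropy of each sample by a factor (1 - p_t)^gamma, where p_t is the predicted probability of the true class, so that well-classified samples contribute less to the total loss. The sketch below shows this down-weighting on one easy and one hard sample. The values gamma = 2 and alpha = 1 are common defaults assumed here, not settings reported in this section, and the function names are illustrative.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy of one sample, given the predicted probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss of one sample: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are assumed defaults, not values taken from the study.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


easy, hard = 0.9, 0.1  # predicted probability of the true class

# Ratio of focal loss to cross-entropy equals (1 - p_t)^gamma when alpha = 1.
ratio_easy = focal_loss(easy) / cross_entropy(easy)  # about 0.01
ratio_hard = focal_loss(hard) / cross_entropy(hard)  # about 0.81
```

With these assumed settings the easy sample keeps roughly 1% of its cross-entropy while the hard sample keeps roughly 81%, which is how focal loss shifts training effort toward poorly predicted, often minority-class, samples.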
Table 15. The running times of all the models on two datasets.
Dataset | LR | KNN | DT | RF | XGB | MLP-1 | MLP-2
D1 | <1 s | <1 s | <1 s | <1 s | <1 s | 15 s | 18 s
D3-20 | 30 s | 13 s | 4 s | 120 s | 30 s | 50 min | 54 min

Dataset | CNN-1D-1 | CNN-1D-2 | CNN-2D-1 | CNN-2D-2 | ResNet | SNN | AutoInt
D1 | 23 s | 26 s | 22 s | 26 s | 1 min 20 s | 22 s | 1 min 40 s
D3-20 | 1 h 18 min | 1 h 22 min | 1 h 03 min | 1 h 21 min | 4 h 59 min | 1 h 34 min | 4 h 19 min

Dataset | GrowNet | NODE | DCN-V2 | FT-Transformer | TabNet | DeepFM | GBDT + DNN
D1 | 15 s | 13 s | 19 s | 2 min 30 s | 2 min 14 s | 30 s | 26 s
D3-20 | 2 h 03 min | 1 h 35 min | 1 h 28 min | 9 h 21 min | 6 h 01 min | 2 h 50 min | 1 h 44 min
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, T.; Jin, C.-J.; Song, Y.; Li, D. Are Neural Networks Better than Machine Learning? A Comparative Study for Travel Mode Predictions. Systems 2025, 13, 1099. https://doi.org/10.3390/systems13121099
