Classification of Imbalanced Travel Mode Choice to Work Data Using Adjustable SVM Model

Abstract: The investigation of travel mode choice is an essential task in transport planning and policymaking for predicting travel demands. Typically, mode choice datasets are imbalanced, and learning from such datasets is challenging. This study deals with imbalanced mode choice data by developing an algorithm (SVM AK) based on a support vector machine model and the theory of adjusting kernel scaling. The choice of kernel function was evaluated by applying the likelihood-ratio chi-square and weighting measures. The empirical assessment was performed on the 2017 National Household Travel Survey–California dataset. The performance of the SVM AK model was compared with several other models, including neural networks, XGBoost, Bayesian Network, the standard support vector machine model, and some SVM-based models previously developed to handle imbalanced datasets. The SVM AK model outperformed these models and in some cases improved the classification accuracy of the minority class. For the majority class, the accuracy improvement was substantial. This algorithm can be applied to other tasks in the transport planning domain that deal with uneven data distribution.


Introduction
A considerable share of people's daily trips is associated with their work. Transport planners and engineers attempt to understand work-related travel behaviors and establish strategies for reducing the adverse impacts of motorized transport on traffic, health, and the environment. One of these important behaviors is work mode choice, which refers to the process by which an individual chooses a certain mode for the trip to work. According to the literature, a variety of factors influence work mode choice. Socioeconomic factors [1][2][3], household attributes (e.g., [4]), trip characteristics (e.g., [5]), job-related attributes (e.g., [6][7][8][9]), and the built environment [2,10,11] are among these factors [3,11].
Mode choice data include a wide range of variables and samples. Typically, these data are complex and incomplete [12]. Furthermore, since motorized transport is dominant in most parts of the world, travel surveys yield unbalanced mode choice classes; that is, more people use cars than other commute modes. To date, many studies have investigated the choice of travel mode to work (Table 1). These studies employed both traditional statistical methods and machine learning (ML) techniques. However, the former are criticized for their linearity assumptions concerning mode choice data [13][14][15]. Thus, ML techniques have received more attention recently [16][17][18][19][20][21][22][23]. Classifying new cases on the basis of existing samples is an essential task for ML models. If at least one of the categories comprises fewer samples than the others, the classification process becomes complex [24]. Class imbalance is simply an uneven distribution of data among the categories of the target. Classification algorithms become unreliable when they are dominated by the majority class: new samples tend to be assigned to the majority category, and the minority category is predicted with lower accuracy, which is an undesirable consequence [25]. The support vector machine (SVM) is a well-known ML technique for classification [26] and has also been used as a base for coping with imbalanced data. Batuwita and Palade [27] developed the fuzzy SVM model to deal with imbalanced data in the presence of noise and outliers. Wang and Japkowicz [28] suggested boosting SVMs with asymmetric cost.
Their model adjusts the classifier through cost assignment and compensates for the bias introduced by this adjustment with a combination scheme whose effect is comparable to adjusting the data distribution. Wu and Chang [29] suggested the class-boundary alignment algorithm to augment the SVM model for imbalanced data. They modified the class boundary by transforming the kernel function when the data are represented in a vector space; when the data do not possess a vector-space representation, the modification is performed by adjusting the kernel matrix. To enhance forecast performance, Liu et al. [30] suggested combining an integrated sampling scheme, which mixes over-sampling and under-sampling, with an ensemble of SVMs. These studies investigated the binary classification problem based on the SVM model; however, fewer studies have examined multiclass imbalanced classification based on this model.
Many studies in other domains, including medicine, economy, crash severity, and so on, tried to reduce the issues of imbalanced data, e.g., [31][32][33]. However, to the best of the authors' knowledge, a very small number of studies have investigated the issues of imbalanced travel mode choice data and proposed a solution for it [33].
So far, many scholars have proposed useful strategies for managing class imbalance. These strategies have partially addressed the issue by enhancing classifier performance. However, most models developed for binary class imbalance are unsuitable for multiclass imbalanced datasets such as work mode choice. In addition, few studies have provided a solution for imbalanced mode choice datasets. These shortcomings prompted the authors to address the multiclass imbalanced mode choice data issue and contribute to the body of research on this topic. Thus, this study developed the adjustable kernel-based SVM classification algorithm (SVM AK), which is suitable for handling multiclass imbalanced data. Initially, an estimated hyperplane is obtained using the regular SVM model. Subsequently, the parameter function and the weighting factor for each support vector are determined in every iteration; the likelihood-ratio chi-square test is used to estimate these parameters. Following this, the kernel transformation, that is, the new kernel function, is determined. This kernel transformation enlarges the unequal class boundaries and adjusts the data skewness. Consequently, the developed model corrects the estimated hyperplane and addresses the problem of performance degradation.
The rest of this paper is structured as follows. In Section 2, the source of data, dataset characteristics, and methodology used for improving the performance of the SVM model for classification of imbalanced data are presented, and evaluation metrics are provided. Section 3 presents the results obtained with the model as well as a series of comparisons against other ML models and SVM-based models for classifying imbalanced datasets. Section 4 describes the sensitivity analysis method and its outcomes. Finally, a conclusion of the paper is presented in Section 5.

Data
This study employed the 2017 National Household Travel Survey (NHTS)-California dataset. These data are provided by the US Federal Highway Administration and the California Department of Transportation and are freely available to all researchers and practitioners [45]. The NHTS is the definitive source on public travel behavior in the United States and the only national data source that lets researchers and practitioners examine patterns in personal and household travel. The data comprise daily non-commercial travel by all modes, together with the characteristics of the travelers, their households, and their means of transport. The dataset covers 26,095 California household samples and 458 variables. Records containing incomplete or inaccurate data were eliminated, and, based on the literature, the authors selected the variables linked to work mode choice. The final dataset included 151,597 samples (based on individuals' records), 26 inputs, and one target variable (mode choice to work). The target variable has an uneven distribution of classes, that is, the data are imbalanced. Table 2 shows the composition of the dataset used in this study. Work mode choice has nine classes and, as expected, "car" is the majority class. The imbalance ratio is large (777.5). A list of the variables used in this study is provided in Table 3.
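As an illustration of the imbalance ratio reported above, the following minimal sketch assumes the common definition (size of the largest class divided by the size of the smallest class); the text does not state the definition explicitly. The function name and the class counts are hypothetical and are not the values of Table 2.

```python
def imbalance_ratio(class_counts):
    """Return the largest class size divided by the smallest class size."""
    sizes = list(class_counts.values())
    if not sizes or min(sizes) <= 0:
        raise ValueError("every class needs at least one sample")
    return max(sizes) / min(sizes)


# Hypothetical counts for illustration only; not the values of Table 2.
example_counts = {"car": 900, "walk": 60, "bicycle": 30, "motorcycle/moped": 10}
print(f"imbalance ratio = {imbalance_ratio(example_counts):.1f}")
```

A ratio far above 1 signals that a classifier can reach high overall accuracy while largely ignoring the smallest class.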

Proposed Approach
The algorithm proposed in this study aims to deal effectively with imbalanced mode choice data. It builds on the theory of the adjusting kernel scaling method [46] to manage multi-category imbalanced data. This study combines the SVM classification algorithm with the adjusting kernel scaling technique; the resulting algorithm is named SVM AK.

Standard Support Vector Machine Model
Support Vector Machine (SVM) is a widely used and well-regarded ML technique for classifying data [26]. Its principal idea is to map the input data into a high-dimensional space with the aid of a kernel function so that the categories become linearly separable [47][48][49]. For the binary class problem, the maximum margin that separates the hyperplanes is given by Equation (1). The decision function for SVM based on the optimal pair $(w_0, l_0)$ is expressed by Equation (2), where $\lambda_i$ is the coefficient associated with the $i$th support vector, $a_i$ denotes a data sample, and $i = 1, 2, \ldots, K$. In a higher-dimensional feature space, the inner product $a \cdot a_i$ is replaced by the kernel function $Q(a_i, a_j)$ (Equation (3)). From the regular SVM, the kernel function was selected for estimating the boundaries' space. In the beginning, the dataset $S$ is divided into different samples $S_1, S_2, S_3, \ldots, S_i$, and subsequently the kernel transformation function is implemented (Equation (4)),
where $h(a) = \sum_{i \in SV} \lambda_i y_i \langle a, a_i \rangle + l$ (with $\lambda_i$ the coefficient of the $i$th support vector), $S_i$ refers to the $i$th sample of the training dataset, and $z_i$ is calculated using the likelihood-ratio chi-square, which is described in the following sections.
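The kernelized decision function described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian (RBF) kernel, the value of gamma, and the hand-picked support vectors, coefficients, and bias are all assumptions made for the example.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel Q(a, b) = exp(-gamma * ||a - b||^2) (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def decision_value(a, support_vectors, coeffs, labels, bias, kernel=rbf_kernel):
    """h(a) = sum_i lambda_i * y_i * Q(a, a_i) + l, summed over the support vectors."""
    return sum(lam * y * kernel(a, sv)
               for lam, y, sv in zip(coeffs, labels, support_vectors)) + bias


def predict(a, support_vectors, coeffs, labels, bias):
    """Binary SVM prediction: the sign of h(a)."""
    return 1 if decision_value(a, support_vectors, coeffs, labels, bias) >= 0 else -1


# Toy model with hand-picked (not fitted) values.
svs = [(0.0, 0.0), (2.0, 2.0)]
lams = [1.0, 1.0]
ys = [-1, 1]
l = 0.0
print(predict((1.8, 2.1), svs, lams, ys, l))   # prints 1
print(predict((0.1, -0.2), svs, lams, ys, l))  # prints -1
```

Each query point is assigned the label of the support vector it lies closest to, because the kernel value decays with distance.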

Likelihood-Ratio Chi-Square
The likelihood-ratio chi-square ($G^2$) is a well-known non-parametric test that assesses the independence between the target and an input and is suitable for categorical attributes. $G^2$ measures the relationship among categorical attributes on the basis of their frequency distributions; in other words, it assesses the association between groups. Here, $G^2$ is used to establish the connection between the samples of every class and the parameter $z_i$. Equation (5) presents the analytical formulation for estimating $G^2$,
where $D_o$ and $D_e$ denote the observed and expected frequencies, respectively.
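For reference, the likelihood-ratio statistic is usually written in the form below. This is the standard textbook expression, given here on the assumption that Equation (5) follows it; the cell index $j$ is introduced only for this illustration.

```latex
G^2 = 2 \sum_{j} D_{o,j} \, \ln\!\left( \frac{D_{o,j}}{D_{e,j}} \right)
```

Cells with $D_{o,j} = 0$ contribute nothing to the sum, and $G^2 = 0$ when the observed and expected frequencies coincide in every cell.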

Calculating the Factor of Weighting
Determining the weighting factor is a challenging and vital task when handling imbalanced classes, because finding a suitable weight is comparatively complicated. A practical technique is to assign a smaller weight to the majority class and a larger weight to the minority classes while fulfilling the weight condition $z_i \in (0, 1)$. To deal with the multi-category imbalance issue in the SVM AK algorithm, this study employed Equation (6),
where $C$ and $N$ express the class and training sample sizes, respectively, and $n_i$ is the size of each class, $i = 1, 2, \ldots, K$. To compute the parameter $z_i$, let $S$ denote the dataset comprising $N$ samples and $K$ classes. The $z_i$ parameter is estimated employing Equations (2) and (3). For the $G^2$ value in the optimal distribution, let $X_i = n_i \log \frac{n_i}{N/K}$. $G^2$ is then expressed in terms of the $X_i$, and the parameter $Z_i$ is obtained by placing this $G^2$ value in Equation (8), where $n_i$ is the sample size of the $i$th class and $i = 1, 2, \ldots, K$.
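The quantities in this subsection can be illustrated with a short sketch. It assumes the standard likelihood-ratio form, so that with observed class sizes $n_i$ and a balanced expected size $N/K$ the statistic becomes $G^2 = 2 \sum_i X_i$. The weighting function is a hypothetical inverse-frequency stand-in that merely satisfies the condition of weights in (0, 1) with larger weights for smaller classes; it is not the paper's Equation (6), and the class sizes are invented.

```python
import math


def g2_uniform(class_sizes):
    """G^2 = 2 * sum_i X_i with X_i = n_i * log(n_i / (N / K)) (assumed standard form)."""
    n_total = sum(class_sizes)
    expected = n_total / len(class_sizes)  # N / K, the balanced class size
    x_terms = [n * math.log(n / expected) for n in class_sizes if n > 0]
    return 2.0 * sum(x_terms)


def inverse_frequency_weights(class_sizes):
    """Hypothetical weights proportional to 1 / n_i, normalised to sum to 1.

    Assumed stand-in for illustration; not the paper's Equation (6).
    """
    inverse = [1.0 / n for n in class_sizes]
    total = sum(inverse)
    return [v / total for v in inverse]


sizes = [900, 80, 20]  # hypothetical class sizes
print(round(g2_uniform(sizes), 2))
print([round(w, 3) for w in inverse_frequency_weights(sizes)])
```

For perfectly balanced classes every $X_i$ is zero, so $G^2$ vanishes; the more skewed the class sizes, the larger the statistic.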

Model Development Steps
In the beginning, the NHTS data were prepared and cleansed. These data were then used to obtain the primary partition. Next, the authors determined the weighting factor ($w_i$) and the $Z_i$ parameters for all support vectors in every iteration. The $Z_i$ value was estimated using the likelihood-ratio chi-square test. The kernel transformation function was estimated in the next step. Finally, the model was retrained using the newly estimated kernel matrix $K_{mt}$. Figure 1 shows the flowchart of the suggested algorithm.
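The kernel-rescaling step in the procedure above can be sketched as follows. Because Equation (4) is not reproduced in the text, the sketch assumes the conformal form that is common in the kernel-scaling literature: each kernel entry is multiplied by D(a_i) * D(a_j), with D(a) = exp(-z * h(a)^2) and z chosen per class. The function names, the class labels, and all numeric values are hypothetical.

```python
import math


def scaling_factor(h_value, z):
    """D(a) = exp(-z * h(a)^2): equals 1 on the boundary (h = 0) and decays away from it."""
    return math.exp(-z * h_value ** 2)


def rescale_kernel_matrix(kernel_matrix, h_values, class_of, z_by_class):
    """Return the matrix with entries D(a_i) * D(a_j) * K(a_i, a_j)."""
    d = [scaling_factor(h, z_by_class[c]) for h, c in zip(h_values, class_of)]
    n = len(kernel_matrix)
    return [[d[i] * d[j] * kernel_matrix[i][j] for j in range(n)] for i in range(n)]


# Toy inputs: a 2 x 2 kernel matrix, decision values h(a_i) from a first SVM fit,
# the class of each sample, and one z value per class (all hypothetical).
K = [[1.0, 0.5], [0.5, 1.0]]
h = [0.0, 1.0]
classes = ["car", "walk"]
z = {"car": 0.1, "walk": 0.9}
for row in rescale_kernel_matrix(K, h, classes, z):
    print([round(v, 4) for v in row])
```

Under this assumed form, the rescaled matrix stays symmetric, so it could be passed to any SVM solver that accepts a precomputed kernel for the retraining step.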

Evaluation Metrics
This study employed four criteria to evaluate the performance of the developed models: accuracy, precision, recall, and F1 score. The formulas for calculating these criteria are shown in Equations (11)-(14). Accuracy is the ratio of correctly predicted samples to all samples. Precision is the proportion of true positives among all samples predicted as positive (true positives plus false positives). Recall is the proportion of true positives among all actual positive cases in the data (true positives plus false negatives). The F1 score is the harmonic mean of precision and recall and thus reflects the balance between them,
where, TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.
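The four criteria can be computed directly from the counts defined above. The sketch below uses the standard definitions, on the assumption that Equations (11)-(14) follow them; for a multiclass target such as mode choice, the counts would typically be taken per class in a one-vs-rest fashion, which is also an assumption. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one class treated as "positive".
print(classification_metrics(tp=40, tn=50, fp=10, fn=0))
```

On imbalanced data, accuracy alone can be misleading, which is why precision, recall, and F1 are reported alongside it.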

Models' Development and Evaluation
The authors of this study present the results of the SVM AK model as well as other classification models, including the standard SVM model, BN, ANN model, and some SVM-based models proposed in the literature for handling imbalanced data.
It is challenging to determine the most suitable classification model for handling imbalanced data. The travel mode choice dataset was used for the empirical investigation. Figure 2 shows the nine classes on the x-axis and the number of samples in each class on the y-axis. The figure makes clear that the NHTS dataset has an uneven category distribution, that is, it is imbalanced. Hence, it becomes more complicated to handle such data with regular classification techniques.
The principal intention of this study is to find the most competent classification technique for imbalanced classes. Over time, various scholars have proposed useful techniques for handling the class imbalance issue. Most of the proposed methods address binary class imbalance and are not fit for the multi-category imbalance issue. These shortcomings urged the authors to adjust the algorithm so that it can effectively handle both binary and multi-category imbalance without jeopardizing performance. Such classification will further support urban transport planning and the prediction of travel mode choice. For the empirical assessment, this study employed four well-known conventional classification techniques and SVM-based models proposed in the literature, along with the SVM AK model suggested in this study. The proposed model was compared with the other models to ascertain its effectiveness, fitness, and precision.
This study evaluated the performance of the developed models using the four criteria described in the Evaluation Metrics section. Moreover, the authors employed a 10-fold cross-validation procedure as the validation scheme.
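The 10-fold cross-validation scheme mentioned above can be sketched as a plain index split. This is a generic illustration: the text does not say whether the folds were stratified by class, and the function name, the seed, and the sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so every record contributes once to the reported scores. With strongly imbalanced classes, stratified folds are often preferred so that the rare classes appear in every fold.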
Four criteria, namely accuracy, F1 score, precision, and recall, were used to evaluate the outcomes of the classification algorithms applied in this study. Classification accuracy was used to validate the techniques. As noted, the NHTS dataset has an imbalanced category distribution, which may influence the performance of classification techniques. The overall performance of the models developed in this part of the study is shown in Table 4. All models achieved an overall accuracy above 80%; the SVM AK outperformed the other models, and BN performed worst. The SVM AK also had the best values on the other evaluation criteria. Notably, the SVM AK improved on the standard SVM model (SVM S), which shows the capability of the proposed model to handle imbalanced mode choice data and to enhance the performance of the typical SVM model on such data. An evaluation of the models by class is provided in Figure 3. For the car class, which had the largest sample size, the SVM AK raised the prediction accuracy from 82.33% (BN model) to 99.81%. For the motorcycle/moped class, which had the smallest sample size, the models yielded almost the same accuracy. For most other classes, the SVM AK model achieved better accuracy.
As previously mentioned, the performance of the SVM AK model was compared with some existing SVM-based models that try to alleviate the severe effects of imbalanced data. In these methods, the SVM model was hybridized with techniques including boosting [28], fuzzy methods [27], class-boundary alignment [29], and ensembles [30]. The outcomes of this comparison are presented in Table 5. As can be seen, the SVM AK obtained the highest overall accuracy among all models developed.
Table 5. The overall accuracy of SVM-based models and the SVM AK model.

Sensitivity Analysis
Many factors affect travel mode choice; however, their effects are not the same. Thus, it is necessary to ascertain the magnitude of these effects and identify the most influential factors on travel mode choice. For this purpose, the authors employed the mutual information (MI) test [50], which computes the importance of the inputs. MI is a filter method that captures the association between the inputs and the target; it examines the dependence among variables and quantifies the strength of the connection among them. The MI between an input $D$ and the target $C$ is measured by the information gain, Gain($C$, $D$), where $h$ denotes the number of possible values of $D$, $C_s$ is the subset of $C$ for which $D$ takes the value $D_s$, and Ent($C$) signifies the information entropy. The larger the value of Gain($C$, $D$), the stronger the relationship between $D$ and $C$. Ultimately, the importance of each attribute for predicting travel mode choice was derived from the scores obtained in the MI test.
The outcomes of this analysis are shown in Figure 4. The most important attributes were reason for not walking (walkmore), number of drivers in household (drvcnet), and count of adult household members at least 18 years old (numadlt). On the other hand, the lowest scores belonged to flexibility of work start time (flextime), owned vehicle longer than a year (vehowned), and gender (r_sex). Reasons for not walking among respondents included unsafe street crossings, heavy traffic, and insufficient night lighting. Improvements in these street conditions can encourage people to shift from motorized transport to walking, so it makes sense that this factor is among the most influential for travel mode choice to work [51][52][53]. The significance of the number of drivers in a household can be attributed to its influence on vehicle usage and the generation of more trips. In practice, the likelihood of choosing active transportation options decreases as the number of drivers in a family grows [54]. As the number of adults in a family increases, the need for independent trips rises. Because each adult in the family has different responsibilities, it is not easy to consolidate trips into one. This can easily be one of the principal sources of additional trip generation and use of motorized transportation.
The flexibility of work start time was among the least important factors. This could be attributable to the work culture of the respondents in the US. However, several previous studies showed that the flexibility of work start time influences mode choice, e.g., [54,55]. The possession of a vehicle for longer than a year was also an unimportant factor for predicting the choice of travel mode to work. A possible reason is that people usually look for flexible and convenient travel options to work and are reluctant to replace their private cars with healthy travel modes unless they face new challenges, such as health problems or heavy traffic. Thus, it is sensible that this factor does not influence mode choice substantially.
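The information gain used in the sensitivity analysis above can be illustrated with a small sketch. It assumes the standard definition, Gain(C, D) = Ent(C) minus the size-weighted average of Ent(C_s) over the values of D, with entropy measured in bits; the function names and the toy data are hypothetical and are not taken from the NHTS dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy Ent(C) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(target, feature):
    """Gain(C, D) = Ent(C) - sum_s |C_s| / |C| * Ent(C_s)."""
    n = len(target)
    groups = {}
    for value, label in zip(feature, target):
        groups.setdefault(value, []).append(label)  # C_s: labels where D == D_s
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


# Toy data: one input that determines the mode and one that carries no information.
mode = ["car", "car", "walk", "walk"]
informative = ["far", "far", "near", "near"]
uninformative = ["x", "y", "x", "y"]
print(information_gain(mode, informative))
print(information_gain(mode, uninformative))
```

Ranking the inputs by this score gives an importance ordering of the kind shown in Figure 4.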

Conclusions
In this research, the authors offered a novel method for learning from imbalanced mode choice data using the adjustable kernel-based SVM classification model (SVM AK). The likelihood-ratio chi-square test and weighting measures were used in this method for selecting the kernel function. The resulting kernel transformation function makes it possible to enlarge the class boundaries and offset the irregular class boundaries. The authors also performed a sensitivity analysis, which showed that the reason for not walking (walkmore), the number of drivers in the household (drvcnet), and the count of adult household members at least 18 years old (numadlt) were the most influential factors. On the other hand, the lowest scores were found for flexibility of work start time (flextime), owning a vehicle longer than a year (vehowned), and gender (r_sex), which were the least influential factors on travel mode choice.
The outcomes of this model were compared with those of various SVM-based models and ML models. The authors employed four criteria, namely accuracy, F1 score, precision, and recall, to perform this comparison. The results showed that the SVM AK model achieved the best results and outperformed the other models. The results also showed that this model improved the classification accuracy of most categories, especially the car class, which had the largest sample size.
Prediction of travel mode choice is an essential component of transport planning and traffic engineering. An accurate prediction is viable only if the data are precisely classified. Therefore, precise mode choice classification utilizing the algorithm suggested in this study would be efficacious for enhancing the current transport systems and further boosting the capacities for an efficient response to the worst traffic and transport scenarios.
The classes of travel mode to work in the NHTS dataset differ from those of other mode choice datasets, since this dataset treats the "SUV" category as distinct from the "car" category. Future studies could combine these two classes into a single class of travel mode to work. In the US, private motorized transport is dominant; thus, the NHTS data and, in turn, the results of this study are affected by this issue. Future studies can employ the method developed in this study to predict the choice of travel mode to work in different environments, such as those in which walking, cycling, and public transport are dominant.