Factors Influencing Matching of Ride-Hailing Service Using Machine Learning Method

Abstract: It is common to call a taxi through taxi apps in Korea, and it was believed that an app-taxi service would provide customers with more convenience. However, customers' requests can often be denied, as taxi drivers can decide whether to take calls from customers or not. Therefore, studies on factors that determine whether taxi drivers refuse or accept calls from customers are needed. This study investigated why taxi drivers might refuse calls from customers and which factors influence the success of matching within the service. This study used origin-destination data in Seoul and Daejeon obtained from T-map Taxi, which were analyzed via a decision tree using machine learning. Cross-validation was also performed. Results showed that distance, socio-economic features, and land uses affected the matching success rate. Furthermore, distance was the most important factor in both Seoul and Daejeon. The matching success rate in Seoul was lowest for trips shorter than the average at midnight. In Daejeon, the rate was lowest when the calls were made for trips either shorter or longer than the average distance. This study showed that the matching success for ride-hailing services can be differentiated particularly by the distance of the requested trip depending on the size of the city.


Introduction
With the rapidly growing use of smartphones, customers can make purchases exclusively using apps on a smartphone. Online-to-offline (O2O) service is a business model that draws potential customers through online channels to buy goods in physical stores. The expansion of O2O services has stimulated the growth of ride-hailing services in the transportation sector around the world. For example, the ride-hailing service in China has expanded to various forms of transportation and is becoming more influential in personal mobility [1]. Ride-hailing is different from ride-sharing. Ride-hailing is a service in which riders hire a personal driver who takes them to the exact destination requested. By contrast, ride-sharing is a service in which a vehicle is shared with other riders who have different destinations; ride-sharing is not a personal service.
In Korea, ride-hailing services and taxi-apps offered by companies such as Kakao Taxi and T-map Taxi are becoming increasingly popular. However, despite the belief that an app-taxi service would provide customers with more convenience, there have been complaints about the service offered to customers. Customers' requests can often be denied, as taxi drivers can decide whether to take calls from customers or not after checking their origin and the destination requested.
Sustainability 2019, 11, 5615 3 of 13 identify factors influencing the success rate of matching between customers and taxi drivers as well as cross-city comparison approach.
Thus, this study aims to identify factors influencing matching of ride-hailing services from taxi drivers' point of view using machine learning. A cross-city comparison is also performed to determine the influence of the sizes of cities. In addition, cross validation is carried out to minimize overfitting or selection bias based on the study in [10]. The findings of this study provide insights on how the gap between customers and taxi drivers can be bridged. Furthermore, the study also provides directions on how to improve ride-hailing services in mega cities and medium-sized cities, as well as the features of app-taxi services in such cities.

Machine Learning and Decision Tree
Machine learning is used to construct a decision tree to determine the factors influencing matching and to build models. As a machine learning algorithm, a decision tree charts a set of decision-making rules and uses them to classify observations into a few small groups or to predict their values. Decision trees come in several variants that differ in their splitting criteria and in the stopping rules that prevent further data splitting, including chi-squared automatic interaction detection (CHAID), classification and regression trees (CART), Iterative Dichotomiser 3 (ID3), C4.5 (successor of ID3), and C5.0 (successor of C4.5). The basic structure of a decision tree is shown in Figure 1 [11].
This model can be easily understood owing to its tree-like structure. In addition, it is a non-parametric method, which does not require assumptions of linearity, normality, or homoscedasticity. It is therefore less sensitive to outliers and has been reported to outperform existing statistical models in terms of prediction [12].
In this study, a CART algorithm was used to form a regression tree, as the dependent variables were continuous. The CART algorithm performs binary splits and uses different splitting criteria depending on the type of tree: it forms a classification tree for discrete dependent variables and a regression tree for continuous dependent variables. The classification tree uses the Gini index as its splitting criterion. The Gini index measures the impurity or diversity in each node: it is the probability that two elements drawn at random from the node belong to different groups. It is given in the following equation.
$$G = 1 - \sum_{j=1}^{c} P(j)^2 = 1 - \sum_{j=1}^{c} \left(\frac{n_j}{n}\right)^2$$
where $G$ is the Gini index, $c$ is the number of categories of the target variable, $P(j)$ is the probability that an object in the node belongs to the $j$th category of the target variable, $n$ is the number of observations included in the node being split (the parent node), and $n_j$ is the number of those observations in the $j$th category of the target variable. A classification tree chooses the independent variable that reduces the Gini index to the greatest degree and splits on it to form child nodes, so that the impurity of the child nodes is as low as possible [13]. The decrease in the Gini index is estimated as given in the following equation.
$$\Delta G = G - \frac{n_L}{n} G_L - \frac{n_R}{n} G_R$$
where $\Delta G$ is the decrease in the Gini index, $G$ is the Gini index of the parent node, $n$ is the number of observations in the parent node, $n_L$ and $n_R$ are the numbers of observations in the left and right child nodes, respectively, and $G_L$ and $G_R$ are the Gini indices of the left and right child nodes, respectively.
This study used scikit-learn, a machine learning library for the Python programming language. Scikit-learn provides a user-friendly, efficient, and productive interface for implementing algorithms. It is available in several Python distributions such as Anaconda, Enthought Canopy, and Python(x,y). Anaconda is particularly useful for analyzing mass data and performing predictive analysis [14,15]. Therefore, this study used Anaconda for the decision tree analysis to form a regression tree.
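The two Gini expressions above can be illustrated with a short sketch. This is a minimal plain-Python illustration, not the code used in the study; the function names and the example labels (1 = matched, 0 = refused) are assumptions made for this example.

```python
from collections import Counter

def gini_index(labels):
    """Gini index G = 1 - sum_j (n_j / n)^2 for the labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n_j / n) ** 2 for n_j in counts.values())

def gini_decrease(parent, left, right):
    """Decrease dG = G - (n_L / n) * G_L - (n_R / n) * G_R for a binary split."""
    n = len(parent)
    return (gini_index(parent)
            - len(left) / n * gini_index(left)
            - len(right) / n * gini_index(right))

# Hypothetical node: 1 = matched call, 0 = refused call.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 1]    # a split that isolates all matched calls
right = [0, 0, 0, 0]   # and all refused calls

print(gini_index(parent))                  # maximally impure two-class node
print(gini_decrease(parent, left, right))  # a perfect split removes all impurity
```

For a 0/1 target with success share p, the Gini index equals 2p(1 − p), which is twice the node variance p(1 − p); this is one way to relate the classification criterion above to the variance-type criterion typically used by regression trees.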

Cross-Validation and Model Optimization
Machine learning algorithms often predict better and build more efficient models than conventional methods. However, machine learning is prone to overfitting, and the CART algorithm is no exception. Therefore, a cross-validation technique was used so that the models generalize without overfitting. Cross-validation is used here to balance the complexity of a decision tree against its classification errors. As a tree grows and gains more terminal nodes, its errors on the training data decrease; however, the model may then perform poorly with new data. Therefore, cross-validation was carried out in this study using a cost-complexity function [16,17].
$$R_\alpha(T) = R(T) + \alpha\,|T|$$
where $R_\alpha(T)$ is the cost-complexity of tree $T$, $R(T)$ is the misclassification error of $T$, $\alpha|T|$ is the complexity measure, which depends on $T$, $|T|$ is the total number of terminal nodes in the tree, and $\alpha$ is a parameter. Data for machine learning are divided into two data sets: training data and test data. Part of the data is used as the training data to form a tree in the training phase, and the remaining data become the test data for the test phase. The parameter $\alpha$ is a regularization parameter whose value depends on the order in which the observations enter the training phase of a model. These steps were iterated on random splits of the data.
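The trade-off expressed by the cost-complexity function can be sketched as follows. This is an illustrative plain-Python example under stated assumptions: the candidate trees, their error values, and the values of α are hypothetical and are not taken from the study.

```python
def cost_complexity(error, n_leaves, alpha):
    """Penalized cost: the tree's error plus alpha times its number of terminal nodes."""
    return error + alpha * n_leaves

def best_subtree(candidates, alpha):
    """Return the (name, error, n_leaves) candidate with the lowest penalized cost."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[2], alpha))

# Hypothetical candidates: (label, training error R(T), terminal nodes |T|).
candidates = [
    ("stump", 0.30, 2),
    ("medium", 0.22, 8),
    ("deep", 0.18, 64),
]

for alpha in (0.0, 0.005, 0.05):
    name, error, leaves = best_subtree(candidates, alpha)
    print(f"alpha={alpha}: choose {name} ({leaves} leaves)")
```

With α = 0 the deepest tree wins because only the error counts; as α grows, each extra terminal node costs more and smaller trees are preferred.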
Exhaustive search, also known as brute-force search or generate-and-test, was used to select two parameters for the optimized tree models, namely the maximum depth of the tree and the minimum sample size of leaf nodes [18]. The method evaluates every combination of tree depth between 1 and 10 and minimum sample size of leaf nodes, and then selects the parameters of the best-fitting case [19]. After performing this step, the maximum tree depth and the minimum sample size of leaf nodes were selected as 4 and 1, respectively, for Seoul, and as 4 and 9, respectively, for Daejeon.
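The exhaustive search described above amounts to scoring every parameter pair and keeping the best one. The sketch below shows that loop in plain Python. The scoring function `toy_score` is a hypothetical stand-in for a cross-validated score, constructed so that it peaks at an arbitrary point, and the range of leaf sizes is an assumption, since the source states only the depth range.

```python
from itertools import product

def exhaustive_search(score_fn, depths, leaf_sizes):
    """Score every (max_depth, min_leaf) pair and return the best pair and its score."""
    best_params, best_score = None, float("-inf")
    for depth, min_leaf in product(depths, leaf_sizes):
        score = score_fn(depth, min_leaf)
        if score > best_score:
            best_params, best_score = (depth, min_leaf), score
    return best_params, best_score

def toy_score(depth, min_leaf):
    """Hypothetical stand-in for a cross-validated score (not real results)."""
    return 0.8 - 0.01 * (depth - 4) ** 2 - 0.001 * (min_leaf - 9) ** 2

depths = range(1, 11)      # tree depth between 1 and 10, as stated in the text
leaf_sizes = range(1, 11)  # assumed range; the source does not give it

params, score = exhaustive_search(toy_score, depths, leaf_sizes)
print(params, round(score, 3))
```

In practice, the scoring function would fit a tree with the given parameters and return its cross-validated score; scikit-learn's `GridSearchCV` packages the same idea for estimators such as `DecisionTreeRegressor` with the `max_depth` and `min_samples_leaf` parameters.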

Analysis of Features of App-Taxi Matching
In this study, the daily average trip distances of cities and counties in Korea, denoted as Si and Gun, respectively, were analyzed. Seventy-two cities and counties were divided into five regions; then, the daily average trip distances were calculated for each region according to trip purposes and trip means. This approach was used to compare the effects of app-taxi matching based on the characteristics of cities, such as population and distance. Results show that the capital region, including Seoul, had the longest distances of 9.5 km and 8.8 km for trip purposes and trip means, respectively, whereas the Daejeon region had the shortest distances of 5.8 km and 5.9 km, respectively, as shown in Figure 2. The results for Seoul and Daejeon, representing the longest and shortest trip distances, respectively, were compared.
Figure 2. Comparison of regional trip distance.
The population of Seoul in 2019 was estimated at 9.7 million, and the city covers a surface area of 605.2 km². The population density of the city is 16,136 people/km². On the other hand, Daejeon had a population of 1.5 million in 2019, and the city area is 539.4 km². The population density of the city is 2785 people/km². The geographical sizes of the cities are similar, but the population of Seoul is about 6.5 times that of Daejeon, and its population density is about 5.8 times that of Daejeon.
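The size comparison between the two cities can be checked directly from the figures quoted in this paragraph. The short sketch below only restates those figures; the dictionary layout and the function name are choices made for this example.

```python
# Figures as quoted in the text (population, area in km^2, density in people/km^2).
seoul = {"population": 9.7e6, "area_km2": 605.2, "density": 16136}
daejeon = {"population": 1.5e6, "area_km2": 539.4, "density": 2785}

def ratio(key):
    """Seoul-to-Daejeon ratio for one of the quoted figures."""
    return seoul[key] / daejeon[key]

for key in ("population", "area_km2", "density"):
    print(f"{key}: Seoul / Daejeon = {ratio(key):.2f}")
```

The areas differ by roughly 12%, while the population and density ratios are about 6.5 and 5.8, respectively.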
This study used Origin-Destination (OD) data on successful and unsuccessful (failed) matchings, as decided by taxi drivers of the T-map Taxi service in Seoul and Daejeon Metropolitan City in April 2017 (Table 1). According to the data provided by T-map, there were 30,990 cases in Seoul and 4294 cases in Daejeon. The study focuses on identifying what makes taxi drivers refuse calls from customers. Thus, 21,785 of the cases in Seoul (70.3%) and 3112 of the cases in Daejeon (72.5%) were used in this study to examine whether taxi drivers accept or decline a call. Matching success was set as the dependent variable (1 for success and 0 for failure) to analyze factors influencing matching success. Various features of the origins and destinations recorded in the OD data were set as independent variables, including socio-economic indicators, land uses, station influence area, time, weather, and weekday or weekend.
The socio-economic variables include population density, business density, and employee density of the origins and destinations. They were obtained from the Open Data portal for Seoul and National Statistics for Daejeon. The land use data of the origins and destinations were extracted from GIS data in the National Spatial Data Infrastructure Portal. The influence of subway stations was estimated and used as an independent variable: the station influence area (Sta-Inf) was defined as the area within a radius of 400 m around a subway station. Station data were obtained from the website of the Korea Transport Data Base (KTDB). Weather data were sourced from daily weather information recorded by the Korea Meteorological Administration. Details of the variables are presented in Table 2.
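The station influence variable is a radius test: a trip end counts as inside the influence area if it lies within 400 m of any subway station. The sketch below shows one way such a flag could be computed with the haversine great-circle distance. It is an assumption about the procedure, not the authors' GIS workflow, and the coordinates are illustrative values rather than KTDB station data.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def in_station_influence(point, stations, radius_m=400.0):
    """True if the point lies within radius_m of at least one station."""
    return any(haversine_m(*point, *station) <= radius_m for station in stations)

# Illustrative coordinates only (hypothetical station and trip ends).
stations = [(37.5547, 126.9707)]
near_origin = (37.5560, 126.9720)  # a few hundred metres from the station
far_origin = (37.5647, 126.9707)   # about 1.1 km north of the station

print(in_station_influence(near_origin, stations))
print(in_station_influence(far_origin, stations))
```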

Results of Machine Learning
This study used supervised learning, which predicts a dependent (target) variable based on input data. The input data were separated into a training set and a test set: 75% of the data were used for the training set and 25% for the test set. Table 3 shows the results of the empirical analysis using the machine learning models. Before cross-validation, the explanatory power (R²) for the training set and test set was 0.770 and 0.723, respectively, in Seoul and 0.757 and 0.766, respectively, in Daejeon. As a non-parametric method, the decision tree derives importance ranks for the independent variables. The results show that distance was the most important variable in both Seoul and Daejeon.
After cross-validation, the R² values for the training set and test set in Seoul increased to 0.780 and 0.772, respectively. The values for Daejeon also increased markedly, to 0.836 and 0.834 for the training set and test set, respectively. Thus, cross-validation increased the explanatory power in both cities.
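The 75%/25% split and the R² measure reported above can be sketched in plain Python. This is a minimal illustration, not the study's code; the helper names, the random seed, and the example values are assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical example: 100 records split 75/25.
train, test = train_test_split(range(100))
print(len(train), len(test))

# Hypothetical 0/1 outcomes and predicted success rates.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.1, 0.8, 0.2]
print(round(r_squared(y_true, y_pred), 3))
```

Comparing R² on the training rows with R² on the held-out rows is what reveals overfitting: a large gap suggests the tree has memorized the training data.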
Cross-validation clearly separated the important variables from the unimportant ones: unimportant variables received no importance value. Five variables were identified as important for Seoul, namely Distance X(6), Midnight X(8), Peak time X(9), D_Employee density X(5), and O_Employee density X(2), in order of importance, whereas eight variables were identified as important for Daejeon, namely Distance, Peak time, Weekend, Cloudy, O_Business density, O_Population density, D_Population density, and O_Employee density.
The results shown in Figures 5 and 6 were obtained from the decision tree. The tree consists of 18 leaf nodes in grey squares, 15 internal nodes in white squares, and 1 root node in a slightly larger white square in the middle of the left side. The root node is the start of the tree and allows the classification of the data to be traced. The expressions next to the white square nodes are the conditions for splitting the samples, and the X(n) are the independent variables in Table 3. The solid lines in the figure indicate that the condition for splitting at a node was met, whereas the dashed lines indicate that the condition was not met [20]. For example, in Figure 5, the root node had X(6) ≤ 8.155, with a dashed line on the upper side and a solid line on the lower side. This indicates that Distance X(6) was the splitting variable and that the samples were split at a distance of 8.155 km (the splitting condition). The samples with a distance of 8.155 km or less went to the lower side, whereas the other samples went to the upper side. Of the two values next to each leaf node, the first is the count of failed matching cases and the second is the count of successful cases. Each leaf node also shows the matching success rate on its right side. As previously explained, the decision tree gives the model a clear structure.
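How a call is routed at the root node and how a leaf's success rate follows from its two counts can be sketched as below. The threshold of 8.155 km is the root split reported for Seoul; the function names, branch labels, and leaf counts are hypothetical and chosen only for illustration.

```python
def route_at_root(distance_km, threshold_km=8.155):
    """Follow the root split X(6) <= threshold: 'lower' if the condition is met, else 'upper'."""
    return "lower" if distance_km <= threshold_km else "upper"

def leaf_success_rate(failed, succeeded):
    """Matching success rate of a leaf from its (failed, succeeded) counts."""
    total = failed + succeeded
    return succeeded / total if total else float("nan")

# Hypothetical calls and leaf counts (not values read from Figure 5).
print(route_at_root(3.2))   # short trip: condition met
print(route_at_root(12.0))  # long trip: condition not met
print(leaf_success_rate(failed=30, succeeded=10))
```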
In the results for Seoul (Figure 5), most of the leaf nodes with a matching success rate over 50% were in the upper half of the decision tree diagram. Only 2 out of 8 leaf nodes in the upper half had a matching success rate below 50%, whereas only 1 out of 8 leaf nodes in the lower half had a rate above 50%. This shows that Distance (X(6)), the variable used for the first branch, is the most important factor and that it separates most of the high and low matching success rates. When the trip distance for a call is less than 8.155 km, the call has a high possibility of being refused by a taxi driver. Apart from distance (X(6)), midnight (X(8)) and peak-time (X(9)) were also important factors.
Two patterns of low matching success rates were identified in Seoul (Figure 6). Although the trips for calls in Group 4 (Pattern 1) were long (over 8.155 km), the calls were made at night and the destinations had a relatively low employee density (X(5) ≤ 1.723). Therefore, the matching success rate was low. The calls in Groups 9 and 10 (Pattern 2) were for relatively short-distance trips, which tended to be refused frequently. Pattern 2 had even lower matching success rates because the calls were made at night and the employee density at the origins was high. Patterns 1 and 2 were both for trips at night; however, they differ. Pattern 1 was for long-distance trips with a low possibility of taking another passenger at the destinations, whereas Pattern 2 was for short-distance trips with high demand at the origins. This means that taxi drivers can easily find other passengers at the origins for Pattern 2. The results for Seoul show that taxi drivers clearly made selective choices among calls based on the trip distance and the demand for taxis, especially the demand at origins.
In the results for Daejeon (Figure 7), there were only three leaf nodes with a matching success rate below 50%. This indicates that taxi drivers in Daejeon tended to refuse calls less often than drivers in Seoul. Distance (X(6)) and peak-time (X(9)) were the most important factors, in that order. Distance was the most important factor in Daejeon, as in Seoul. However, instead of midnight (as in Seoul), peak-time was the second most influential factor in Daejeon. Moreover, the matching success rates were influenced by whether the day was a weekday or a weekend and by weather conditions, even at peak-time.
Two patterns of low matching success rate were also identified in Daejeon (Figure 8). The calls in Group 2 were for relatively long-distance trips, which might be expected to be preferable for taxi drivers. However, the calls were made at peak-time during cloudy and rainy weather. Therefore, the matching success rate was low, as taxi drivers had more options owing to high demand. Another important pattern is Pattern 2, for Groups 11 and 12. This pattern covers calls for short-distance trips on weekdays, which had low matching success rates, including the lowest rate of 26.7%. However, under otherwise the same conditions, the rates on weekends were high, at over 50% in Groups 9 and 10. This is because demand differs between weekdays and weekends. Taxi drivers in Daejeon tended to refuse calls for relatively short-distance trips at peak-times on weekdays. This means that taxi drivers in Daejeon also refused calls when the demand was high (peak-times on weekdays). However, refusals in Daejeon occurred at different times from those in Seoul (at peak-times on weekdays rather than at midnight) and occurred less often than in Seoul owing to the difference in the absolute volume of taxi demand.
A comparison of the results of the two cities shows that taxi drivers in Daejeon tended to refuse calls less than those in Seoul. This is possibly due to higher demand for taxis in Seoul than in Daejeon; therefore, taxi drivers in Seoul tend to be selective in choosing a passenger. However, taxis in Daejeon may be more cost competitive for passengers due to the relatively lower demand. Refusals usually occur at peak-time or at night. It is difficult to find clear evidence for this trend owing to the lack of research regarding taxis in Korea; however, it can be partly verified using the actual occupancy rate (AOR) and operating rate (OPR) of taxis in 2014 (Table 4). The AOR indicates the share of the running distance or running time of taxis that is spent with passengers, whereas the OPR indicates the ratio of the number of taxis actually running to the number of taxis permitted (licensed). On average, the OPR of Daejeon was over 10 percentage points greater than that of Seoul. On the other hand, the AOR by distance in Seoul was over 10 percentage points greater than that in Daejeon. This indicates that fewer taxis are running in Seoul than are permitted; therefore, the taxis in Seoul have a higher probability of running with passengers. In Daejeon, most of the permitted taxis are in operation and have a low probability of running with passengers. Therefore, comparing the AORs by distance between the two cities, the taxis in Seoul have a higher likelihood of taking passengers as there is higher demand, whereas the taxis in Daejeon have a lower likelihood of taking passengers owing to low demand and high competition. Although the AORs by time for all taxis and corporate taxis in Seoul were approximately 2 percentage points smaller than those in Daejeon, there is no significant difference between the AORs by time in Seoul and Daejeon considering that the proportion of private taxis was more than 60%.
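The two indicators used in this comparison are simple ratios, and differences between them are expressed in percentage points. The sketch below reflects one reading of the definitions given in the text; the function names and all input values are hypothetical and do not reproduce Table 4.

```python
def operating_rate(running_taxis, licensed_taxis):
    """OPR: share of licensed (permitted) taxis that are actually running."""
    return running_taxis / licensed_taxis

def actual_occupancy_rate(occupied, total):
    """AOR: share of running distance (or running time) spent with passengers."""
    return occupied / total

def percentage_point_gap(rate_a, rate_b):
    """Difference between two rates, expressed in percentage points."""
    return (rate_a - rate_b) * 100

# Hypothetical inputs for two cities (illustration only).
opr_a = operating_rate(running_taxis=700, licensed_taxis=1000)
opr_b = operating_rate(running_taxis=850, licensed_taxis=1000)
aor_a = actual_occupancy_rate(occupied=130.0, total=200.0)  # km with passengers / km driven
aor_b = actual_occupancy_rate(occupied=105.0, total=200.0)

print(round(percentage_point_gap(opr_b, opr_a), 1))
print(round(percentage_point_gap(aor_a, aor_b), 1))
```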
As shown in Figure 3, although the sizes of the two cities were similar, the geographical boundaries of taxi operations in the cities were completely different. The taxis in Daejeon run only within the central area for short distances. In the decision tree models, the branching condition for Distance (X(6)) at the root node was approximately 8.1 km for Seoul, whereas that for Daejeon was 5.4 km. This shows that the running distance of taxis in Daejeon was shorter than that in Seoul, which is consistent with the OD distributions shown in Figure 3. If we assume that taxis continuously take passengers for 1 h and that each trip distance is 8.1 km in Seoul and 5.4 km in Daejeon, the revenue of taxi drivers in Daejeon will be higher. Therefore, from the Daejeon taxi drivers' point of view, taking passengers will be beneficial regardless of distance, and they can easily return to the central area for more passengers.
The other variables of the decision tree models also provide reasonable explanations for the present situation of refusals by taxi drivers. The midnight variable was important in Seoul. It is difficult to take a taxi at night in this city, especially after 11 pm, as demand for taxis increases sharply between 11 pm and 1 am the next day. However, taxis in Seoul run for relatively long distances; thus, it is difficult to take multiple passengers during the limited time, and drivers refuse certain calls to maximize their own revenue. Peak-time and weekend were important variables in Daejeon, indicating that the demand for taxis is high during morning and evening commuting times on weekdays. Moreover, unlike in Seoul, the midnight variable was not important because of the relatively short trip distances between residential areas and business or commercial areas.

Conclusions
In this study, the factors influencing the success of matching for app-taxis in Seoul and Daejeon were identified. Socio-economic variables, land uses, and distance, as well as weather conditions and times of the day, were used as factors that influence the success of matching. The key approach of this study was the use of a machine learning method to build a decision tree for the analysis. In addition, the results obtained before and after performing cross-validation were compared.
Overall, the results showed that machine learning performed well, as the explanatory power (R²) of the decision trees was over 0.7 for both the training set and the test set. Cross-validation had a relatively small effect on the results, except for the decision tree of Daejeon, for which the explanatory power increased markedly.
The results also showed that distance was the most important factor in both cities. In Seoul, taxi drivers tended to prefer calls for long-distance trips during off-peak hours in the day, whereas taxi drivers' preferences in Daejeon seemed more complicated than those in Seoul. However, it can be deduced that taxi drivers in Daejeon were more influenced than drivers in Seoul by the prevailing characteristics of the origins, such as population density or business density.
Consequently, the decision trees built with machine learning performed well, and the study found that distance was the most important factor for matching between taxi drivers and customers for app-taxis. However, the study also has a few limitations. The data used in this study covered a period of only one month, and the cases were limited to Seoul and Daejeon. In addition, the usage of the T-map Taxi service is low; therefore, it is rather difficult to generalize the results of this study at this point. The study should be extended using additional data.