Public Bike Trip Purpose Inference Using Point-of-Interest Data

Public bike-sharing is eco-friendly, connects excellently with other transportation modes, and provides a means of mobility that is highly suitable in the current era of climate change. This study proposes a methodology for inferring the bike trip purpose based on bike-share and point-ofinterest (POI) data. Because the purpose of a trip involves decision-making, its inference necessitates an understanding of the spatiotemporal complexity of human activities. Thus, the spatiotemporal features affecting bike trips were selected from the bike-share data, and the land uses at the origin and destination of the trips were extracted from the POI data. During POI type embedding, the data were augmented considering the geographical distance between the POIs and the number of bike rentals at each bike station. We further developed a ground truth data construction method that uses temporal mobile and POI data. The inference model was built using machine learning and applied to experiments involving bike stations in Seocho-gu, Seoul, Korea. The experimental results revealed that optimal performance was achieved with the use of decision tree algorithms, as demonstrated by a 78.95% overall accuracy and 66.43% F1-score. The proposed method contributes to a better understanding of the causes of movement within cities.


Introduction
Although the sharing economy is accelerating worldwide, there is a limit to the expansion of vehicle sharing owing to transportation and environmental issues in crowded cities. Accordingly, bike-sharing, which reduces vehicle emissions and improves urban mobility, is an attractive alternative [1]. Currently, more than 2000 cities operate bike-sharing systems, which provide users with a flexible tool for making short-distance trips and interchanging between different modes of transport [2]. Bike-sharing is a significant transportation mode because it is often located at the start and end stages of trip chains [3][4][5]. Free-floating car-sharing, which has experienced remarkable growth in European and American urban markets, is a system that allows users to freely rent shared vehicles through smartphones at participating public parking lots [6,7]. In this context, bike-sharing enables users to connect between their departure points and the beginning parking lots and between the end parking lots and their destinations. This expands the service of vehicle sharing, with an increase in bike-sharing leading to an increase in the use of shared vehicles, as well as other public transportation modes. The scenario offers an effective countermeasure against urban traffic congestion. Non-motorized trips (walking and cycling) account for the finer-scale "capillary" flow in cities, revealing the full nature of urban transportation flows [3,8]. Many studies have analyzed bike-share movement patterns to better understand the urban dynamics [1,3,9], while others have focused on examining the variability of land use [10,11]. Zhao et al. [10] used bike-share data as the main basis for identifying land use characteristics. The present study focused on the "why" of traffic, rather than simply analyzing bikeshare movement patterns, which is more relevant to the "where" of traffic. The latter only addresses urban mobility, whereas the former enables inferences on the purposes of the trips and determination of the causes of urban movement [12]. Nevertheless, the reason for a phenomenon and its result is more meaningful to decision-making, rather than a mere observation of the phenomenon. For example, rather than simply observing many cars in a specific area after 6 p.m. (movement pattern analysis), it is more significant to understand the cause of the observation, namely, the rush-hour movement of people from their work to dwelling places (trip purpose inference). This enables drivers to make the decision to avoid roads leading to residential areas at that time.
Traffic purpose inference is essential for understanding traffic behavior for traffic planning, investment decision-making, and estimating traffic demand [13]. However, among studies related to traffic characteristics, trip purpose prediction has received considerably less attention [14][15][16]. Traffic history data are required for trip purpose prediction; however, there are only a few accurate data containing information on human decision-making. Thus, traditional trip purpose inference analyses rely on direct user surveys [1]. User-direct surveys for traffic purposes contain accurate ground truth data and are, therefore, suitable datasets for applying inference algorithms. However, there are limited targets and cost-related limitations.
Nguyen et al. [17] reviewed 25 studies in terms of trip purpose imputation, and classified them into two broad fields, namely, transportation science (TS) and human geography (HG). They indicated that researchers will always find themselves in these two situations in the real world. In other words, researchers would analyze a post-collected dataset or design a survey to create an enhanced inference model. Therefore, it is emphasized that accurate targeting of research domains is crucial. TS-related studies focus on the methods of deriving objects from GPS data, whereas HG-related studies focus on semantic enrichment for GPS trajectories. The latter case focuses on obtaining general knowledge on the mobility and whereabouts of activities, rather than accurate methodologies [17].
For GPS information collected via onboard devices that do not contain ground truth data, various studies have utilized a fusion of survey data (such as household travel survey data), point-of-interest (POI) data, and social media data for trip purpose inference [13,14,18]. However, these make for only a small percentage of related works, and research on the prediction of the purpose of bike-related travel remains insufficient [2]. Bao et al. [1] and Xing et al. [2] inferred bike trip purpose using POI data, considering that trip purpose is correlated with land use at the destination [18][19][20]. However, these studies did not consider the effect of mixed land use in urban areas. Bao et al. [1] classified the trip purposes by clustering based on only a simple frequency count of the POI type near the bike stations. Xing et al. [2] augmented the context of the origin and destination of bike use through POI data but did not consider the specific trip purpose for an integrated location such as a shopping mall. However, the increasing occurrence of mixed land use in urban areas and integrated POIs such as buildings needs to be considered [21,22]. Therefore, the present study aims to solve the problem of mixed land use in the context of origin-destination, which is key to trip purpose inference, using POI embedding technology. POI is a useful tool for defining the meaning of a place. However, owing to the hierarchy of POI types, many factors are overlooked, resulting in information loss [23]. For example, apartments and parks are perceived as completely unrelated POIs because of their different categories, but residential areas and urban parks are generally located very close to each other. Recently, research has been conducted on the meaning of a POI considering its spatial correlations [23][24][25]. This may involve the application of wordembedding technology to the natural language processing domain [26], as has been used to classify urban functional areas [25,27]. As urban functional areas are strongly correlated with the internal socioeconomic activities in spaces, they are not easy to identify from pure remote-sensing images [21]. Therefore, POI data are used as a complementary measure, but the limitation of the hierarchy of POI types requires the use of embedding technology to define the POI type. Consequently, in the present study, POI types customized for bike trips were used, with POI type embedding conducted based on the bike trip purpose. This provides a solution to the urban mixed land use problem.
This work differs from the TS-related existing trip purpose inference studies focusing on the accuracy of the methodology. It is different in that it focuses on obtaining general knowledge on bike mobility and the whereabouts of activities by applying meaningenhancing processes reflecting the real world. To evaluate the proposed method, we also create ground truth data using mobile data and POI data. The contributions of this work to the field and the society are as follows. First, this work proposes a methodology to apply POI embedding techniques to bike trip purpose inference. POI embedding technology is an application of technology in the field of natural language processing (NLP); in the field of spatial analysis, it has only been used for urban functional area classification [25,27] and POI recommendation [26]. To the best of our knowledge, this study is the first to adopt POI embedding techniques for trip purpose inference. Second, personal mobility, as well as existing transportation modes such as buses and taxis, has recently been rapidly increasing worldwide. Given this scenario, the proposed methodology can be customized according to the corresponding mode when inferring the trip purpose for personal mobility. Third, the travel data of personal mobility, including those of shared bicycles, allow us to know the fine flow of personal movement in cities that cannot be observed from the movement patterns of buses or subways. Thus, applying the proposed inference methodology to real-world data for analyzing the trip purpose would lead to a better understanding of the causes of personal movement within cities.
The remainder of this paper is organized as follows. Section 2 reviews previous studies related to trip purpose inference and POI embedding. Section 3 presents the proposed methodology for inferring the bike trip purpose. In Section 4, we describe a bike trip purpose inference experiment conducted using 87 bike stations in Seocho-gu, Seoul. Finally, Section 5 presents the conclusions and limitations of the study and future research directions.

Studies on Trip Purpose Inference
Among traffic-related studies, trip purpose has received relatively less attention [14][15][16]. This is because of the spatial-temporal complexity of human activities required for predicting trip purpose [13,28]. Studies on the inference of trip purpose can be classified according to the main modes of travel. The most studied aspect focuses on the trips of human beings without specifying the transportation mode. In this case, the trip purpose is mostly predicted based on surveys, with the development of related technologies enabling extensive research, including the addition of GPS data that provide the travel routes [15,29]. In addition, the analysis of complex space and time data has become more sophisticated through the fusion of heterogeneous data such as social media data (e.g., tweets) and POI data (e.g., Google API) [14,29,30]. However, predicting the trip purpose only through direct traveler surveys has severely limited the existing research.
Studies have also been conducted on trip purpose inference with respect to the transportation mode. The most studied aspect in this case is the public transportation mode, such as subways, with diversification achieved with the development of smart cards. However, only a few of these studies focused on predicting the trip purpose because a smart card does not include "travel purpose," which requires decision-making by the traveler. Alsger et al. [18] addressed this issue by inferring the trip purpose through the collection of household travel survey (HTS) data. Rule-based modeling was applied to the HTS data, which were obtained through traveler surveys, and the trip purpose was inferred based on the spatial, temporal, and frequency attributes.
In bike-related studies, the analysis of the traditional sharing patterns or travel purposes is often based on traveler satisfaction surveys [13,14,29,[31][32][33]. More recently, the increasing abundance of bike-share data has been used to infer the trip purpose. In particu-lar, trip purpose is highly related to land use at the destination. Thus, with the increasing sophistication of POI data, many researchers have attempted to carry out a spatiotemporal analysis of the fusion of bike-share data and POI data. Bao et al. [1] analyzed bike-share travel patterns and trip purpose using Citi Bike data in New York, USA. Through k-means clustering, they classified the bike stations based on the surrounding POI data and developed a bike travel pattern model using the latent Dirichlet allocation (LDA) method. However, the study had some limitations, such as the clustering of the bike trip purposes using only the simple frequencies of the POI types around the destination. Xing et al. [2] used Mobike data and dockless bike-share data for Shanghai, China, and similar to Bao et al. [1], they combined bike-share data and POI data to analyze the user travel patterns and purposes. In particular, they enriched the trip origin and destination contexts by extracting information on the configurations of nearby POIs. This is different from the approach of Bao et al. [1], who considered only the frequencies of the POI types. However, only the POI type and bike point were used to construct the context, and data that changed over time were excluded. Additionally, the method is limited by its inability to consider integrated POI such as shopping malls.
Chen et al. [34] predicted the trip purpose for taxi users. The study was significant because a large amount of trip character data was integrated into one vector value, and embedding was used to infer the trip purpose. The utilized taxi trajectory data and POI data were generated in New York City, USA. To elaborate on the contexts of the taxi origin and destination, popularity, uniqueness, and distance to the POI were considered through time information and Foursquare check-in data. Three contexts (time, origin, and destination) of the taxi trip were constructed and deep-embedded through an autoencoder. Finally, the taxi trip purpose was extracted by clustering the embedded values.
The foregoing review reveals that traditional studies related to trip purpose inference have mainly employed GPS-based surveys of human trips. However, some recent trip purpose inference works have utilized big data acquired by smart cards or public bikeshare services. Nevertheless, it remains difficult to measure the inference model accuracy if the dataset is not based on a traveler survey. Most of these previous studies attempted to infer the trip purpose from trip history data and POI data, but it was not easy to generate ground truth data. Therefore, when big data are used, they are combined with survey data, or the results obtained by the inference models are clustered and labeled for appropriate traffic purposes. Table 1 compares different studies related to trip purpose inference conducted over the last three years, from which the recent trends can be observed. Unlike the existing studies, this study proposes a methodology to infer bike trip purposes, with the aim of acquiring general knowledge on bicycle mobility and activity sites. Further, this study evaluates the proposed method by generating ground truth data using mobile data and POI data.

Studies Related to POI Embedding
With the development of technologies such as Word2vec for vectorizing information that is difficult to quantify in natural languages, studies are being actively conducted to quantify information in various fields [26,[35][36][37]. Yan et al. [24] developed the Place2vec model, which is used for POI type embedding and takes into account the geographical distance and popularity of the POI, for quantifying POI information distributed over a geospatial space. Place2vec differs from previous methods, which do not consider internal spatial correlations and only use the POI frequency to determine the functional type of the region of interest [25].
Yao et al. [27] and Zhai et al. [25] proceeded further in measuring the similarity and relevance of place types. These studies attempted to extract land use from the results of POI embedding. Zhai et al. [25] simplified the results obtained by Yan et al. [24] and used them for data augmentation based on only the distance between POIs. The functional type of the urban area was extracted from the embedding results. In addition, for the accurate identification of the extraction results, the annotation of each region was supported by mobile phone data and truck origin-destination (OD) data.
Liu et al. [38] and Liu et al. [23] subdivided POI types to avoid loss of information due to hierarchy. For example, Liu et al. [23] specified 488 POI types and observed a correlation between pharmacies, convenience stores, and barbershops. These POI types are usually located close to each other, but the hierarchy of these POI types may obscure their correlation. In addition to subdividing the POI type, different correlations may also be paired when making tuples. For example, Jin et al. [39] embedded the store types instead of POI types and augmented the data by constructing pairs of the type of store and items sold by the store. Table 2 compares different studies on POI embedding. In most studies, the type of POI was defined through data augmentation based on the distance of the POI in the geospatial space. The k-nearest or a buffer was used to generate tuples between the POIs, and Word2vec skip-gram was mainly used for embedding. The present study followed the flow of these previous studies with respect to tuple composition and the use of Word2vec, but a novel POI embedding method was used to extract the land use specific to a bike trip. The POI types were classified according to the bike trip purpose with reference to related studies. In particular, the POI types related to bike leisure were separately classified to avoid information loss due to the hierarchy of POI types. We believe this methodology is beneficial because POI embedding can be customized to suit the mode of transportation when inferring personal mobility trip purposes in the future.

Methods
We developed a method for inferring the purpose of public bike trips based on bikeshare data and POI data. The method involves extracting the features that affect the trip purpose from the data and using them for machine learning to infer the final trip purpose. When selecting the utilized features, the land use information at the bike trip starting and end points were determined by embedding and clustering the POI data for the area around the bike station. In addition, to establish the ground truth data for determining the trip purpose, we standardized the mobility data for each time period in a mesh form and extracted it together with the POI data. The overall method is illustrated in Figure 1.

Feature Extraction for Inference of Trip Purpose
Unlike travel demand forecasting, which is influenced by various factors such as weather and the floating population, the trip purpose can be broadly divided into time and space information. The time of a bike ride and the land uses at the departure point and the destination significantly reflect the trip purpose. Chen et al. [34] created three contexts for time, origin, and destination and used embedding to infer the purpose of a taxi trip. Alsger et al. [18] inferred trip purpose with the aid of a smart card, using the destination land use, start time, and activity duration as the important attributes.
In the present study, the features that influence public bike trip purposes are the riding day, departure time, departure point land use, arrival time, destination land use, trip time, trip distance, and distance between bike stations. The riding day was considered to be the rental date given the improbability of the use of a rented public bike for more than 24 h. The riding day was denoted by 1 for a workday and 0 for a non-workday. The departure and arrival times ranged between 0 and 23, with only time zone information used and minute and second details ignored. To consider the time and spatial factors of the entire trip, the trip time and trip distance were separately inputted. The distance between the departure and arrival stations was considered an important feature owing to the characteristics of bikes used for leisure purposes such as park riding. For example, in the case of a leisure trip, even if the total trip time or distance was long, the departure and destination stations may be the same.
When the above-mentioned features are constructed using bike-share data, POI data for the areas around the stations should be used for land use extraction at the departure point and destination. Considering the distance between public bike stations and the population density of Seoul City, the search radius for each station was set to 250 m, as in previous studies [1,40]. For accurate inference, the POIs were classified into customized types by categorizing based on the purposes of the bike trips. The classification of the purpose of the trips used in previous studies is presented in Table 3.

Feature Extraction for Inference of Trip Purpose
Unlike travel demand forecasting, which is influenced by various factors such as weather and the floating population, the trip purpose can be broadly divided into time and space information. The time of a bike ride and the land uses at the departure point and the destination significantly reflect the trip purpose. Chen et al. [34] created three  In this study, eight bike trip purpose categories were considered, namely, home, work, education, transit, dining, shopping and service, leisure, and bike leisure. The classification was performed by referring to Bao et al. [1], who used bikes as the main object, with "work" added. In addition, POIs in parks and amusement parks specific to bike traffic were identified as "bike leisure." Further, POIs related to "bike leisure" within the search radius were compared with everything within 1000 m of the bike station instead of the standard 250 m. The identification of the land use using POI data for areas around the bike stations is discussed in the next subsection.

POI Type Embedding Related to Bike Trip Purpose
Studies on the embedding of POI types have been conducted in various fields over the last three years, beginning with the work of Yan et al. [23][24][25]38]. In the present study, we developed a POI embedding method specifically for bike trips. The POI type was classified by considering the purpose of the bike trip, while the geographical distance between the POIs and the number of bike rentals per station were used to augment the POI tuples for embedding.
The embedding of the POI type to infer the bike trip purpose was essentially word embedding, which is used to give meaning to a word by considering the location of the central word and the context words. This method applies the concept of word embedding to a POI or place. Figure 2 compares the window size-based Word2vec and the concept of POI type embedding. Because word embedding (skip-gram) is based on window size, if the window size is 1, as shown in Figure 2, training is performed by selecting "quick" and "for" as the context words of the center word "brown." This method is used to predict context words based on a center word. Conversely, POI type embedding utilizes the location of the POI. If the POI is located within 100 m of the central POI, "apartment," it is recognized as the context POI (e.g., supermarket, pharmacy, or park), and a tuple is produced.
concept of POI type embedding. Because word embedding (skip-gram) is based on window size, if the window size is 1, as shown in Figure 2, training is performed by selecting "quick" and "for" as the context words of the center word "brown." This method is used to predict context words based on a center word. Conversely, POI type embedding utilizes the location of the POI. If the POI is located within 100 m of the central POI, "apartment," it is recognized as the context POI (e.g., supermarket, pharmacy, or park), and a tuple is produced. We developed a POI type embedding method specifically for bike trips based on Yan et al. [24]. A skip-gram model for predicting context words from the center word obtained by Word2vec models was utilized, and the difference between the trained probability and the actual probability was measured by cross-entropy. The model is expressed as follows.
where is the learned probability distribution, y is the actual probability distribution, c is the type index, and is the probability of occurrence of m POIs ( , , , … . ), given by Equation (2). We developed a POI type embedding method specifically for bike trips based on Yan et al. [24]. A skip-gram model for predicting context words from the center word obtained by Word2vec models was utilized, and the difference between the trained probability and the actual probability was measured by cross-entropy. The model is expressed as follows.
whereŷ is the learned probability distribution, y is the actual probability distribution, c is the type index, andŷ c is the probability of occurrence of m POIs (t 1 , t 2 , t 3 , . . . t m ), given by Equation (2).ŷ where t c is the center POI. When the naïve Bayes assumption is applied in calculating the probability, y c is always 1. Finally, by converting the score into probability using the softmax function, the POI type can be expressed as a vector. The objective function is given by, where u t denotes the context place-type vectors, v c denotes the center-place-type vectors, and |T| is the cardinality of the POI type.
To give the correct meaning to a POI type located in a geospatial space, Yan et al. [24] proposed augmentation of the number of appearances of the training tuple (t center , t context ) by a factor β. They presented three concepts: (1) naïve spatial context, (2) simple augmented spatial context incorporating the idea of the popularity and distance to a place, and (3) spatial context augmented by information-theoretic, distance lagged (ITDL).
The simple augmented spatial context methodology was modified to suit the research purpose. It was specifically used to augment the popularity of the individual POIs by Yelp data check-in. This was because the number of check-ins represents the relative popularity or dominance of the POI. Thus, factor β can be defined as in Equation (4) below, where P lj is the total number of check-ins to POI l j . In this study, instead of the Yelp check-in data used by Yan et al. [24], we augmented POI tuples by allocating the number of rentals at each bike station among the POIs within the set radius of the station according to their respective distances. For example, if there were 100 bike check-in data at station A, they would be evenly distributed among all nearby POIs; however, their quotas would be proportional to their respective distances from the station.
This was followed by augmentation based on the distance of the POIs. The basic principle of POI type embedding was applied, i.e., the shorter the distance, the deeper the relationship between the POIs. The distance-related factor β is defined as follows.
where |L| is the total number of POIs, d l i , l j is the distance between POI l i and POI l j , and α is the inverse distance factor, which was set to 1 in this study. We applied the distance augmenting factor used by Yan et al. [24]. Factor β, which comprehensively represents the popularity of a POI based on the concept of distance, was defined as follows.
Based on the method suggested by Yan et al. [24], the proposed POI type embedding method for bike trip purpose inference is as follows. First, a tuple is formed for POIs located within a radius of 250 m of the bike station (1000 m in the case of the POIs related to bike leisure). The created tuple is augmented using an augmenting factor β that considers the popularity and distance between the POIs. Figure 3 shows the configuration of all the tuples around the bike stations and their augmentation. As indicated in the table on the right of Figure 3, if k is 3 during the configuration of the k-nearest function, the tuples would be formed by grouping the set of three POIs closest to each POI, and all POIs in the bike station would be included. In this case, because the distance between tuples is shorter and there is an increase in the number of rentals, the augmentation can be performed using factor β, as suggested by Yan et al. [24]. The numbers in parentheses below the POIs in Figure 3 are the values obtained by distributing the total number of bike rentals (assumed to be 1000 times here) among the POIs proportional to their respective distances from the station. This indicates that the POIs closer to the bike stations are more popular. In addition, because the number of rentals varies between bike stations (for example, the number of bicycle rentals at stop A may be 1000, whereas that at stop B may be 50), this method can be applied differently among the stations. The tuples created in this way were trained using TensorFlow, with the number of embedding dimensions set to 70 and applied as in previous studies.

Extracting the Land Use at Each Station
The method for extracting the land use at each bike station based on the value obtained by embedding the POI type is based on the methods presented by Zhai et al. [25] and Yao et al. [27]. First, the POI type embedding values of each station are combined into one using Equation (7), where POItype(v i, k ) is the vector value of the kth POI type at the bike station, and N is the total number of POIs at all the bike stations.
Based on the extracted values, the clustering is performed for the bike stations, with the k-means algorithm used to extract the land use at the stations. The embedding values of the individual bike stations consisting of high-dimensional vectors are grouped for similar land use types by the clustering analysis. The k value, which is the number of clusters in the k-means algorithm, must be determined in advance as an elbow or silhouette value. The process is expressed by Equation (8). In this study, because the final output is not an accurate land use value, the land use value was not specified using additional indicators as in previous studies. Instead, the clustering results were applied as variables of the land use features at the origin and destination to infer the bike trip purpose. tation can be performed using factor β, as suggested by Yan et al. [24]. The numbers in parentheses below the POIs in Figure 3 are the values obtained by distributing the total number of bike rentals (assumed to be 1000 times here) among the POIs proportional to their respective distances from the station. This indicates that the POIs closer to the bike stations are more popular. In addition, because the number of rentals varies between bike stations (for example, the number of bicycle rentals at stop A may be 1000, whereas that at stop B may be 50), this method can be applied differently among the stations. The tuples created in this way were trained using TensorFlow, with the number of embedding dimensions set to 70 and applied as in previous studies.

Extracting the Land Use at Each Station
The method for extracting the land use at each bike station based on the value obtained by embedding the POI type is based on the methods presented by Zhai et al. [25] and Yao et al. [27]. First, the POI type embedding values of each station are combined into one using Equation (7), where , is the vector value of the kth POI type at the bike station, and N is the total number of POIs at all the bike stations.
Based on the extracted values, the clustering is performed for the bike stations, with the k-means algorithm used to extract the land use at the stations. The embedding values of the individual bike stations consisting of high-dimensional vectors are grouped for similar land use types by the clustering analysis. The k value, which is the number of clusters in the k-means algorithm, must be determined in advance as an elbow or silhouette value. The process is expressed by Equation (8). In this study, because the final output is not an accurate land use value, the land use value was not specified using additional indicators as in previous studies. Instead, the clustering results were applied as variables of the land use features at the origin and destination to infer the bike trip purpose. Because the ground truth data for the trip purpose reflect the decision-making of the travelers, they are not easy to obtain. In some previous studies, the data were collected by a traveler survey [13,14,29]. Nevertheless, with the advent of various types of trafficrelated data based on GPS, big data that do not reflect ground truth data have emerged. Nguyen et al. [17] classified studies that infer trip purpose using these types of data, such as HG-related studies. In HG-related traffic-purpose inference studies, ground truth data are optional because common sense or travel patterns from previous surveys are sufficient for validation [17]. Furthermore, rather than focusing on precision at individual levels, they aim to semantically enrich the trajectory and discover general patterns of activity. Because annotation activities with approximations are allowed, most HG-related studies infer trip purpose results with unsupervised learning, such as clustering results without accurate ground truth data [1,2,34].
However, this study proposes a technique to generate ground truth data to evaluate the accuracy of our methodology in a realistic manner-this technique involves quantifying the bike trip purpose at a common-sense-level when a bicycle user gets off at a specific time and place. It is meaningful because HG-related research aims to obtain general knowledge on mobility and the whereabouts of activities. South Korea has a 95% smartphone ownership rate [41]; thus, we utilize mobile aggregated data from the perspective of service population or living population, which differs from the resident population. Thus, ground truth data of trip purpose for anonymous bike-sharing is created based on the assumption that most people would have made specific choices when they were at a particular place and time. Therefore, it reflects the real world, though not the complete ground truth data.
From this perspective, this study utilizes mobile data and POI data to generate ground truth data for bike trip purposes. This required standardization of the available spatial information. The utilized real-time mobile data consisted of the population data provided by Seoul City. The unit of the spatial information was the output area, which is the minimum statistical counting area built to consider the population size (optimum of 500 people), socioeconomic homogeneity (housing type, land price), and land shape based on a basic statistical zone [42]. Accordingly, the National Statistical Office in Korea aggregates and serves census data in small area units (output area) that are smaller than the administrative units. As shown in Figure 4, the standardized unit was unified as mesh data (50 × 50 m) to match the spatial unit of the mobile data that can be used to determine the floating population with respect to time. The data were applied in a buffer zone that included POIs within 250 m of each bike station.  To standardize the output area unit into mesh data, the mesh weights were determined using the building floor area data. These data were used to consider the area where the actual population lives in the geospatial space. In Figure 5, the mesh data (standardized unit), building floor area data, and output area data are presented in the same space. First, the points were arranged at intervals of 5 m in the building polygon, and the building floor area was then equally divided at each point to obtain their attribute values. For example, building A, which is represented in Figure 5, has a building floor area of 1000; if the number of points is 100, each point would weigh 10. If this value is distributed in a mesh, each grid surrounding building A (marked by red dashed lines) would have a different weight. The next step is the determination of the output areas to which the numbered mesh points belong. This step is used to match a mesh area and an output area based on points. To standardize the output area unit into mesh data, the mesh weights were determined using the building floor area data. These data were used to consider the area where the actual population lives in the geospatial space. In Figure 5, the mesh data (standardized unit), building floor area data, and output area data are presented in the same space. First, the points were arranged at intervals of 5 m in the building polygon, and the building floor area was then equally divided at each point to obtain their attribute values. For example, building A, which is represented in Figure 5, has a building floor area of 1000; if the number of points is 100, each point would weigh 10. If this value is distributed in a mesh, each grid surrounding building A (marked by red dashed lines) would have a different weight. The next step is the determination of the output areas to which the numbered mesh points belong. This step is used to match a mesh area and an output area based on points.
In most cases, the range of the output area is wider than the mesh; thus, the former can be calculated by dividing the output areas into the mesh. However, similar to those in the center of Figure 5 (marked in yellow), a mesh may contain multiple output areas, and the unit must, therefore, be standardized based on the value of the point. In this way, the mobile data for each time period constructed in the output area units can be standardized in the mesh units. and the building floor area was then equally divided at each point to obtain their attribute values. For example, building A, which is represented in Figure 5, has a building floor area of 1000; if the number of points is 100, each point would weigh 10. If this value is distributed in a mesh, each grid surrounding building A (marked by red dashed lines) would have a different weight. The next step is the determination of the output areas to which the numbered mesh points belong. This step is used to match a mesh area and an output area based on points. In most cases, the range of the output area is wider than the mesh; thus, the former can be calculated by dividing the output areas into the mesh. However, similar to those in the center of Figure 5 (marked in yellow), a mesh may contain multiple output areas, and the unit must, therefore, be standardized based on the value of the point. In this way, the mobile data for each time period constructed in the output area units can be standardized in the mesh units.

Extraction of Bike Trip Purpose
The method for inferring the bike trip purpose using mobile data standardized in mesh units and POI data around the bike stations is as follows. First, the 250 m-radius buffer zone of the bike station is standardized in mesh units. As shown in Figure 6, the floating population inferred from mobile data is assigned to each mesh; the darker the mesh, the higher the assigned floating population. The calculation is performed by dividing the floating population among the POI data of each mesh. For example, the mesh area in Figure 6 marked by the red dashed lines has two POIs related to shopping and service and transit, and the floating population within this grid is assigned to each of these POI types. In other words, the POIs have values allocated from the floating population, and when these are combined for each bike station, the type with the largest value can be inferred as the bike trip purpose for that station. Accordingly, it can be inferred that the bike trip purpose for the station in Figure 6 is shopping and service. As the mobile data were built in a time zone, this method enables the construction of a large amount of ground truth data.

Extraction of Bike Trip Purpose
The method for inferring the bike trip purpose using mobile data standardized in mesh units and POI data around the bike stations is as follows. First, the 250 m-radius buffer zone of the bike station is standardized in mesh units. As shown in Figure 6, the floating population inferred from mobile data is assigned to each mesh; the darker the mesh, the higher the assigned floating population. The calculation is performed by dividing the floating population among the POI data of each mesh. For example, the mesh area in Figure 6 marked by the red dashed lines has two POIs related to shopping and service and transit, and the floating population within this grid is assigned to each of these POI types. In other words, the POIs have values allocated from the floating population, and when these are combined for each bike station, the type with the largest value can be inferred as the bike trip purpose for that station. Accordingly, it can be inferred that the bike trip purpose for the station in Figure 6 is shopping and service. As the mobile data were built in a time zone, this method enables the construction of a large amount of ground truth data.

Training for Machine Learning
When traffic data are established, there are three types of methods for using the data for trip purpose inference, namely, rule-based, statistical, and machine learning and neural network methods [14,15]. Rule-based methods are the most traditionally employed, with the representative utilized rules being heuristic, land-use-and-purpose matching tables, and closest-POI matching. Statistical methods are used to calculate the probability of each trip purpose with integration of the respondent's personal information or location information using a multinomial logit model or distance-based prob- Figure 6. Method for extracting the bike trip purpose using mobile data and POI data.

Training for Machine Learning
When traffic data are established, there are three types of methods for using the data for trip purpose inference, namely, rule-based, statistical, and machine learning and neural network methods [14,15]. Rule-based methods are the most traditionally employed, with the representative utilized rules being heuristic, land-use-and-purpose matching tables, and closest-POI matching. Statistical methods are used to calculate the probability of each trip purpose with integration of the respondent's personal information or location information using a multinomial logit model or distance-based probability calculation. Machine learning and neural network methods are more suitable for larger amounts of data and are used to build sophisticated and computationally intensive classification and pattern recognition models. Recent studies tend to use statistical or machine learning (and neural network) methods [14].
The trip purpose inference in this study utilized a machine learning model, which enabled the building of a more sophisticated model compared to the use of the rule-based or statistical methods. This is because the possibility of supervised learning through the construction of ground truth data uses more abundant data than that obtained by travel surveys. Owing to the need for a classification model for the multiple classes of the purpose of public bike trips, we implemented the learning using decision tree and random forest models, which have been shown to perform relatively well [14,16,20,[43][44][45][46].
Decision tree algorithms are useful for supervised learning as analytical tools for classifying data into different groups and making predictions through decision rules represented in a tree structure. The key to training decision trees is learning in a way that minimizes classification impurity, which is measured using indexes such as the Gini index and entropy. The advantage of decision tree algorithms is their ability to interpret their results. The generated decision tree segmentation criteria indicate which attributes are used as classification criteria and what the criteria values are.
Random forest is an ensemble method that aggregates predictions using multiple decision tree models. Multiple training data are generated to build the individual decision trees for each dataset and to randomly select variables for constructing a decision tree model. Random forest has the advantage of enabling the rapid building of a model, and even when the data size of the base model is large, a vast decision tree model can be built without the need for data distribution. Accordingly, random forest does not provide information about the significance of individual variables, as in the case of linear or logistic regression models but indirectly determines the importance of the variables, for example, through out-of-bag estimations [47].
The results of training by machine learning are evaluated using the confusion matrix in Table 4. Two evaluation metrics were used to assess the performance of the model, namely, accuracy and F 1 -score, based on Equations (9) and (10), respectively. The accuracy can most intuitively represent the model's performance, while the F 1 -score is a useful performance metric when the classification of the data is unbalanced.

Data
The experimental area was in Seocho-gu, located on the south side of the Han River that runs through the center of Seoul, Republic of Korea. Seocho-gu has a convenient trans-portation system and is a representative business district in Seoul, along with Gangnam-gu. The specific experimental area is a well-developed residential area adjacent to natural environments, such as the Han River and the Yangjae Citizen's Forest, resulting in a high use rate of public bikes compared with other local government areas in Seoul. The experiment period was 1-30 June 2019. The total number of bike stations in Seocho-gu was 87.
The utilized bike-share data consisted of public data available from the Seoul Open Data Plaza (http://data.seoul.go.kr, accessed on 4 March 2021). The data included the number of bikes, rental date and time, rental location number, rental location name, number of rental docks, return date and time, return location number, return location name, number of return docks, usage time, and distance traveled. Overall, there were 132,788 datum units.
The utilized POI data were obtained from commercial maps that are currently used on the Korean market for the CNS. The data were used to check the land use around the bike stations. In total, 19,358 POI datum units for Seocho-gu were used. The original POI types in the data were modified to fit the purpose of this study, i.e., inference of bike trip purpose. The modified POI types are presented in Table 5. POIs within 250 m of a bike station were selected for the determination of land use, with the exception of the cases of bike leisure, for which a radius of 1000 m was used. The mobile data used to produce the ground truth data were also public data obtained from the Seoul Open Data Plaza (http://data.seoul.go.kr, accessed on 4 March 2021). For this purpose, the number of people within a specific space unit at a specific time on a specific day was estimated based on KT mobile phone signals captured from 6000 base stations across Seoul. This method is useful for identifying the floating population, not the residential population, because the population of the output area, which is the smallest statistical unit, corresponds to the time zone.

Extracted Land Use at Bike Stations by POI Type Embedding
The POI types associated with eight bike trip purposes were embedded based on the work of Yan et al. [24] through the application of the popularity and distance concepts. In the configuration of the POI tuples, the k value of the k-nearest function was set to 10, resulting in groups of the 10 closest POIs per POI. The tuples were augmented using the parameter β lj combined presented by Yan et al. [24], and the popularity of a POI was calculated based on the number of bike rentals at the station to which the POI belongs. The final number of POI tuples obtained by the data augmentation was 323,184. The POI type embedding experiments were conducted in Python 3.7.7 and TensorFlow 2.0.0. The number of embedding dimensions was set to 70 and the number of iterations to 20,000. The embedding values according to POI type calculated through training are presented in Table 6. The calculated POI type embedding values were aggregated for each bike station, and k-means clustering analysis was then performed. Because k-means clustering is used to determine the center of the cluster to minimize the sum of squares error (SSE), the optimal value of Cluster k is determined by the elbow technique by checking the SSE. In the present experiments, the SSE was at the minimum value for k = 4, as shown in Figure 7.  The calculated POI type embedding values were aggregated for each bike station, and k-means clustering analysis was then performed. Because k-means clustering is used to determine the center of the cluster to minimize the sum of squares error (SSE), the optimal value of Cluster k is determined by the elbow technique by checking the SSE. In the present experiments, the SSE was at the minimum value for k = 4, as shown in Figure 7. The final clustering results are shown in Figure 8. To visualize the geospatial elements of the experimental area, a layer provided by OpenStreetMap (http://www.openstreetmap.org, accessed on 4 March 2021) was used as a background map. A simple POI type analysis of the clustered results was conducted. Cluster 1 was found to be the center of work, transportation, and shopping, while Cluster 2 was an area with a relatively high amount of bike-related leisure facilities such as nearby forests, playgrounds, and parks. Cluster 3 was observed to be a residential area where schools and academies were dominant, and Cluster 4 an area with few and relatively insignificant POIs. The final clustering results are shown in Figure 8. To visualize the geospatial elements of the experimental area, a layer provided by OpenStreetMap (http://www.openstreetmap. org, accessed on 4 March 2021) was used as a background map. A simple POI type analysis of the clustered results was conducted. Cluster 1 was found to be the center of work, transportation, and shopping, while Cluster 2 was an area with a relatively high amount of bike-related leisure facilities such as nearby forests, playgrounds, and parks. Cluster 3 was observed to be a residential area where schools and academies were dominant, and Cluster 4 an area with few and relatively insignificant POIs. To verify the distribution of POIs in the significant clusters, the regions A and B in Figure 8 are magnified and shown in Figure 9. Because region A is dense with clusters 1, 2, and 3, features can be compared for each cluster. Cluster 1 (marked in red) is the center of work, transportation, and shopping, with various POIs concentrated along the road. Cluster 2 (marked in orange) contains relatively fewer POIs but is highly influenced by POIs related to bike leisure, such as parks and playgrounds. Bike leisure POIs are designed to have an influence range of 1000 meters from the center of a bike station; thus, the POIs outside the buffer range are also in the ambit of this cluster. Consequently, not only the Banpo sports complex in Cluster 2 but also Puleun park and Seorae children's park are in the ambit of Cluster 2. Cluster 3 (marked in yellow) is a residential area where residential POIs are identified at the bottom of the figure. Region B is located in the foothills of a mountain and thus contains relatively fewer POIs. Cluster 4 (marked in green) reflects greenfield characteristics, and Cluster 2 is influenced by the Wissaem children's park outside the buffer range, similar to region A. This confirms that our POI-type embedding design is well customized for bike trips when configuring clusters. To verify the distribution of POIs in the significant clusters, the regions A and B in Figure 8 are magnified and shown in Figure 9. Because region A is dense with clusters 1, 2, and 3, features can be compared for each cluster. Cluster 1 (marked in red) is the center of work, transportation, and shopping, with various POIs concentrated along the road. Cluster 2 (marked in orange) contains relatively fewer POIs but is highly influenced by POIs related to bike leisure, such as parks and playgrounds. Bike leisure POIs are designed to have an influence range of 1000 m from the center of a bike station; thus, the POIs outside the buffer range are also in the ambit of this cluster. Consequently, not only the Banpo sports complex in Cluster 2 but also Puleun park and Seorae children's park are in the ambit of Cluster 2. Cluster 3 (marked in yellow) is a residential area where residential POIs are identified at the bottom of the figure. Region B is located in the foothills of a mountain and thus contains relatively fewer POIs. Cluster 4 (marked in green) reflects greenfield characteristics, and Cluster 2 is influenced by the Wissaem children's park outside the buffer range, similar to region A. This confirms that our POI-type embedding design is well customized for bike trips when configuring clusters.
This methodology would improve the study results if applied to the results of existing works. Only type information is used in POI embedding in this study; however, information about the size or open time of POI can also be used [48]. The influence of a bike leisure POI, such as a park that affects bicycle traffic, may vary depending on its size. We used a buffer of 250 m to design a range that currently affects bike stations-this would present more realistic embedding results if a network distance is used or a service area is applied [24]. The results could be more easily interpreted if additional data are utilized to validate the current clustering results. Zhai et al. [25] extracted urban functional regions using POI embedding and verified the extraction results using truck OD data to interpret the clustered results. This methodology would improve the study results if applied to the results of existing works. Only type information is used in POI embedding in this study; however, information about the size or open time of POI can also be used [48]. The influence of a bike leisure POI, such as a park that affects bicycle traffic, may vary depending on its size. We used a buffer of 250 m to design a range that currently affects bike stations-this would present more realistic embedding results if a network distance is used or a service area is applied [24]. The results could be more easily interpreted if additional data are utilized to validate the current clustering results. Zhai et al. [25] extracted urban functional regions using POI embedding and verified the extraction results using truck OD data to interpret the clustered results.

Results of Public Bike Trip Purpose Inference
The final dataset consisted of features related to bike trip purpose obtained from bike-share data and land use features calculated by POI type embedding. There was a total of eight features, as shown in Table 7. The ground truth data were obtained using the trip purpose values calculated from mobile data and POI data. Among the features,

Results of Public Bike Trip Purpose Inference
The final dataset consisted of features related to bike trip purpose obtained from bike-share data and land use features calculated by POI type embedding. There was a total of eight features, as shown in Table 7. The ground truth data were obtained using the trip purpose values calculated from mobile data and POI data. Among the features, the land use values of the trip origin and destination calculated by POI type embedding were applied to the clustering values derived, as discussed in the previous section. Thereafter, one-hot encoding was performed because of the categorical nature of the data. Of the utilized 132,788 bike-share datum units for Seocho-gu, Seoul, for the month of June 2019, those for 1-20 June were classified as training data, and those for 21-30 June as test data for the experiment. The decision tree and random forest algorithms were employed for bike trip inference. We performed k-fold cross-validation, in which the data were repeatedly divided to train multiple models and measure the generalization performance. Decision tree and random forest require hyperparameter tuning for critical variables and were used to select the optimal hyperparameters with a high predictive performance using Python's GridSearchCV module of Scikit-learn.
The experimental results showed that the optimal performance (accuracy criterion) measured by machine learning was 78.95% for decision tree and 74.08% for random forest. The decision tree model also exhibited better performance in terms of the F1-score, i.e., 66.43%, as determined by the confusion matrix. The experimental results for different bike trips are presented in Table 8, from which an accurate prediction can be observed, especially for shopping and service, work, and transit. An education purpose for a bike trip is relatively less predictable, attributable to the relatively large number of shopping and service POIs in Seocho-gu. In addition, the POI characteristics for shopping and service are combined with those for various other POI types, as the shopping and service POIs are not located alone, similarly impacting other trip purposes. This issue can be resolved by disaggregating the classification of the POI types.  Home  3016  286  14  100  722  54  214  42  4448  Work  240  7008  36  224  968  232  8  20  8736  Education  24  36  284  14  276  30  16  -680  Transit  82  276  14  1686  248  20  2  26  2354  Shopping and service  634  1004  238  178  21,182  892  342  86  24,556  Dining  70  282  54  24  1042  2250  20  -3742  Leisure  154  6  2  -272  32  794  -1260  Bike Leisure  68  22  -12  64  --240  406  Total  4288  8920  642  2238  24,774  3510  1396  414  46,182 NB: The notations in the column headings correspond to the bike trip purposes in each column, in the same order.
Furthermore, to improve the overall accuracy, applying deep learning algorithms may be an alternative, wherein bike-user characteristics (e.g., gender, age) can be added as features. Among the studies related to trip purpose inference, studies using deep learning have exhibited higher accuracy [13,29]. Because deep learning works well for big data, it is required to build data by increasing the spatial and temporal scope of the study, as well as features that affect the trip purpose. Furthermore, entire datasets must be built by combining various user-generated data, such as review data built by real bike users.

Discussion and Conclusions
This paper proposed a methodology for inferring the purpose of bike trips from historical data on shared bikes. Investigations of bike-sharing facilitate a more detailed look at urban mobility, especially the first and last miles of public transportation. Decision-making on various urban issues can, therefore, be supported by inferring the purpose of bike trips. An essential factor in trip purpose inference is the land use of destinations [18][19][20]. Based on this fact, this study resolved the urban mixed land-use problem by applying the POI embedding technology. The bike trip date, departure time, arrival time, total time, total trip distance, and distance between stations were adopted as relevant features from bike-share data. POI data were used to extract the land uses at the origin and destination, which significantly reflect the trip purpose. POIs within a radius of 250 m from the bike stations (1000 m for bike leisure-related POIs) were used to extract the land use through POI type embedding, a methodology developed by Yan et al. [24], with the POI types redefined for the present interest in bike trips. The geographical distance of the POIs and the number of bike rentals at each station were considered as tuple augmentation factors for the POI type embedding. However, if there were hundreds of POIs for each bike station, the POI-type information alone could not capture the biker preferences for visiting the area. Therefore, the number of bike rentals at each station was distributed to obtain popularity information for each POI. We also developed a method for utilizing the temporal mobile data and POI data for the extraction of ground truth data for bike trip purposes. The mobile data for each time zone obtained in the output area units and the POI data obtained in the buffer units around the bike stations were standardized as mesh data for the generation of the ground truth data.
The experiments of this study considered 87 bike stations in Seocho-gu, Seoul. The study period was 1-31 June 2019, and in total, 132,788 public bike-share datum units were utilized. Decision tree and random forest were used for machine learning for bike trip purpose inference, with the hyperparameters adjusted for optimal performance. The land uses at the bike stations deduced by POI type embedding were divided into four clusters, with decision tree exhibiting better performance, having an accuracy of 78.95% and an F1-score of 66.43%.
Results from POI type embedding revealed relatively well the origin and destination land use for bike trips proposed in this study. A single bike station contains 70-100 POIs on average in a buffer zone of 250 m; however, clusters about these mixed land uses were classified well in this study. In particular, as for bike leisure POI, which affects bike trips, the influence range from the center of a bicycle station was set to 1000 m, and the cluster results ( Figure 9) showed that the details are well-reflected. Furthermore, the trip purpose inference results obtained through machine learning indicated that relatively accurate predictions are made for all eight traffic purposes (Table 8). In previous studies related to trip purpose inference, accuracies of 60-90% were only achieved through the use of survey data. This study is of significance as the first step toward using machine learning on spatial data-not survey data-for trip purpose inference.
The results of this study can be helpful not only for bicycles but also for the recent increase in personal mobility trip purpose inference. The bike-specific POI embedding technique proposed in this study can be customized according to the corresponding mobility and used for its trip purpose inference. This study offers a good resource for policymakers for decision-making on urban issues, affording a means of making bike trip purpose inferences using only publicly available bike-share data and POI data. Inferring trip purposes using the methodology proposed in this study and analyzing its patterns will lead to a better understanding of the causes of personal movement within cities.
This study differs from the related works on trip purpose inference using survey data in that it focuses on obtaining general knowledge on bike mobility and the whereabouts of activities by applying meaning-enhancing processes reflecting the real world rather than improving the accuracy of inference from a traffic-engineering perspective. For semantic enhancement of bike-sharing data, land use at origin and destination was extracted using the POI embedding technology. Further, this study evaluated the proposed method by generating ground truth data using mobile data and POI data to infer bike trip purpose at a common-sense level.
However, this study has a limitation that the proposed method for ground truth data generation utilizes aggregated data and not the data at the disaggregated level. Nevertheless, this reflects the difficulty of the travel survey of individuals from a realistic perspective. We believe that this study's results are meaningful in terms of human geography, which aims to semantically enrich and discover general patterns of mobility rather than focusing on precision at individual levels. Additionally, a problem exists that individual characteristics of bicycle users could not be considered when selecting features for trip purpose inference. Because the level of data provided only presents information about the origin, and not the entire OD data, to protect users' privacy.
From the methodological perspective, we limited the POI type to eight during POI embedding and used only type information for data augmentation. This limitation can be addressed by segmenting the POI type and redesigning the algorithm to allow for data augmentation using various information such as the size or open time of POI during POI embedding. In addition, the results can be verified by converging various data such as social media and mobile data to strengthen the meaning of the extracted cluster. Deep learning algorithms that have been proven to deliver relatively good inference results can be utilized to improve the accuracy of trip purpose inference. Because deep learning yields good results only for large datasets, the spatial and temporal range of the experimental sites must be improved. Alternatively, methods to increase the number of features that affect the trip purpose can be considered, including users' personal information combined with various user-generated data, such as the review data built by actual bicycle users.