Trip Extraction of Shared Electric Bikes Based on Multi-Rule-Constrained Homomorphic Linear Clustering Algorithm

Trajectory data include rich interactive information of humans. The correct identification of trips is the key to trajectory data mining and its application. A new method, multi-rule-constrained homomorphic linear clustering (MCHLC), is proposed to extract trips from raw trajectory data. From the perspective of the workflow, the MCHLC algorithm consists of three parts. The first part is to form the original sub-trajectory moving/stopping clusters, which are obtained by sequentially clustering trajectory elements of the same motion status. The second part is to determine and revise the motion status of the original sub-trajectory clusters by the speed, time duration, directional constraint, and contextual constraint to construct the stop/move model. The third part is to extract users' trips by filtering the stop/move model using the following rules: distance rule, average speed rule, shortest path rule, and completeness rule, which are related to daily riding experiences. Verification of the new method is carried out with the shared electric bike trajectory data of one week in Tengzhou city, evaluated by three indexes (precision, recall, and F1-score). The experiment shows that the index values of the new algorithm are higher (above 93%) than those of the baseline methods, indicating that the new algorithm is better. Compared to the baseline velocity sequence linear clustering (VSLC) algorithm, the performance of the new algorithm is improved by approximately 10%, mainly owing to two factors, directional constraint and contextual constraint. The better experimental results indicate that the new algorithm is suitable to extract trips from the sparse trajectories of shared e-bikes and other transportation forms, which can provide technical support for urban hotspot detection and hot route identification.


Introduction
Various forms of trajectory data are collected owing to the popularity of GPS devices and positioning technology. Trajectory data not only include spatio-temporal information of moving objects but also include abundant interactive human-region, human-society, and even person-person information attributes, which reflect the behavior characteristics of individuals or groups [1,2]. Mining the potential information of trajectory data helps to understand and optimize urban decisions. Trajectory cleaning, aimed at identifying or extracting trips from unordered GPS points, is the key to mining this information. The correct identification of trips is helpful for understanding and optimizing urban construction, such as bike lane planning [3] and energy conservation.
Trips can also be identified by trajectory segmentation, i.e., splitting trajectories into homogeneous segments based on some criteria. Trajectory segmentation is an important task in trajectory data processing, as its correctness will largely affect such subsequent analyses as O/D matrix construction, trip purpose identification, and travel mode detection. The methods of trajectory segmentation can be classified into "supervised" and "unsupervised". The criteria used by "supervised" methods for segmenting trajectories are predetermined. The aforementioned algorithms of O/D identification are "supervised", whether the criteria are monotone or non-monotone. The supervised segmentation methods are user-driven, relying on user-defined rules or thresholds. When the segmentation criteria are unknown, the methods are called "unsupervised". Unsupervised algorithms derive the homogeneity of segments based on some cost function, which avoids any control from the user, but they may lack semantics (W-K means [33]) or be time-consuming (GRASP-UTS [34]).
To combine the benefits of both supervised and unsupervised strategies, Junior [35] first proposed a semi-supervised approach RGRASP-SemTS (Reactive Greedy Randomized Adaptive Search Procedure for Semantic semi-supervised Trajectory Segmentation), which exploited labeled and unlabeled data to achieve homogeneity of segments using a cost function based on the minimum description length (MDL) principle. RGRASP-SemTS can obtain more than 50 features to enrich trajectory data, which is useful for semantic trajectory segmentation. To verify algorithm performance, RGRASP-SemTS was performed with the Atlantic hurricane trajectory dataset and grey seals trajectory dataset.
In summary, few works in the literature on trip identification address the sparse trajectory data of vehicles. Therefore, it is worth examining how to correctly extract trips from sparse trajectory data. Shared electric bikes (e-bikes) are selected as an example, as e-bikes can easily travel along wide roads and narrow alleys alike because of their small size, which makes e-bikes a solution for short- to medium-distance trips. However, due to the limitations of the cycling environment and battery life, the trajectory points of shared e-bikes tend to exhibit discontinuities and non-uniformities, and it is difficult to obtain riding status information. A new method, multi-rule-constrained homomorphic linear clustering (MCHLC), is proposed to identify trips from the trajectories of shared e-bikes. From the perspective of the semantic trajectory, the new method constructs the stop/move model based on various criteria, such as speed, time duration, directional constraint, and contextual constraint, and identifies users' trips based on riding experience rules.
The main contributions of this paper are as follows:
(1) A new method is proposed to address sparse trajectory data. Shared e-bike trajectories are taken as an example to verify the performance of the method, which also enriches the research related to shared e-bikes.
(2) The new method can effectively detect temporary stops resulting from specific purposes or behaviors (such as transporting children to school on the way to work), thereby revealing the details of residents' travel behaviors and providing scientific data for urban hotspot detection, hot route analysis, and urban functional zone sectorization.
The remainder of the paper is organized as follows. Section 2 describes the workflow of the new method, which is tested with the experimental data provided in Section 3. Section 4 discusses the role of the two factors (directional change constraint and contextual constraint) of the method and compares the new method with baseline algorithms. We draw conclusions and outline future work in Section 5.

Methodology
Certain related terms are first introduced to explain the new method.

Definition 1 (trajectory).
A trajectory T is a series of discrete spatio-temporal points, each of which is a triple containing geographic coordinates and a timestamp. A spatio-temporal point is expressed as $P_i = (x_i, y_i, t_i)$, $i = 0, 1, \ldots, N$, and $\forall\, 0 \le i < j \le N$, $t_i < t_j$.
As described in Definition 1, a trajectory is essentially a polyline consisting of a series of spatial points that are successive in terms of the timestamp. For example, in Figure 1, polyline T is a trajectory defined by the spatio-temporal point set {a, b, c, d, e, f}.
Figure 1. Illustration of a spatio-temporal trajectory [12].
Mao [36] summarized previous research on the mobility behavior of urban residents and defined a trip as follows. A trip is the movement of a resident from A to B using one or more transportation modes due to certain purposes. A reasonable trip is that the object moves at least a minimum distance in space and lasts for a minimum duration in time. Therefore, a trip corresponds to the movement along a trajectory, represented by a set of sequence points with a higher velocity.
A complete trip connects two consecutive stops: the origin and the destination. A stop along the trajectory indicates that a certain activity has been carried out at a certain location for a period of time. To better understand the behavior of residents, a temporary stay caused by a specific behavior or purpose is also considered a stop. For example, the temporary stay when dropping children off at school on the way to work is considered a stop, while the temporary stay caused by waiting at a traffic light is not a stop. From the perspective of data, a stop corresponds to a sequence of points with a velocity of zero or a very low velocity (such as below four kilometers per hour).
Definition 2 (trajectory element).
A trajectory element E, the basic linear unit of a trajectory, is the line connecting two adjacent points of the trajectory, as shown in Figure 1.
In Figure 1, trajectory T contains five trajectory elements E. The attribute information of a trajectory element, such as the distance, average speed, and motion status, can be calculated from its endpoints. The length of a trajectory element is calculated by Equation (1), which gives the shortest distance between two points over the earth's surface [37,38]:
$$\mathrm{dis}(P_i, P_{i+1}) = R \cdot \arccos\Big( \sin\big(P_i.\mathrm{lat} \cdot \tfrac{\pi}{180^\circ}\big) \sin\big(P_{i+1}.\mathrm{lat} \cdot \tfrac{\pi}{180^\circ}\big) + \cos\big(P_i.\mathrm{lat} \cdot \tfrac{\pi}{180^\circ}\big) \cos\big(P_{i+1}.\mathrm{lat} \cdot \tfrac{\pi}{180^\circ}\big) \cos\big((P_{i+1}.\mathrm{lon} - P_i.\mathrm{lon}) \cdot \tfrac{\pi}{180^\circ}\big) \Big) \tag{1}$$
where $P_i$ and $P_{i+1}$ are the endpoints, and R is the average radius of the reference ellipsoid (WGS84), equal to 6371 km.
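As an illustration of Equation (1), the following is a minimal Python sketch of the element length computed with the spherical law of cosines. The function name `element_length_km` and the clamping step are assumptions introduced here for illustration; they are not taken from the original implementation, which the paper states was written in Java.

```python
import math

EARTH_RADIUS_KM = 6371.0  # average radius R used in Equation (1)


def element_length_km(lat1, lon1, lat2, lon2):
    """Great-circle length of one trajectory element, in kilometres.

    Inputs are latitude/longitude in decimal degrees for the two endpoints.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # Assumption: clamp to [-1, 1] so rounding error cannot break acos
    # when the two endpoints are (almost) identical.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_KM * math.acos(cos_angle)
```

For two points one degree of latitude apart on the same meridian, the result is close to 111.2 km, which is a convenient sanity check. For very short elements the arccosine form loses precision, so a haversine formulation may be preferable in practice.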
The length of the trajectory element divided by the time interval is the average speed, as expressed in Equation (2).
$$E_i.\bar{\vartheta} = \frac{\mathrm{dis}(P_i, P_{i+1})}{t_{i+1} - t_i} \tag{2}$$
The motion status of a trajectory element is obtained by comparing the average speed $E_i.\bar{\vartheta}$ to a given threshold $\vartheta_{thresh}$: if $E_i.\bar{\vartheta}$ is not greater than $\vartheta_{thresh}$, the motion status is stopped, marked as "s"; otherwise, the motion status is moving, marked as "m". The mathematical expression is as follows:
$$E_i.\mathrm{status} = \begin{cases} \text{s}, & E_i.\bar{\vartheta} \le \vartheta_{thresh} \\ \text{m}, & E_i.\bar{\vartheta} > \vartheta_{thresh} \end{cases} \tag{3}$$
The new method is inspired by the velocity sequence linear clustering (VSLC) algorithm [39], which utilizes the semantic information of a trajectory to detect the stops due to the refueling behavior of taxis. Considering the trajectory element as the basic unit, the VSLC algorithm constructs a sequence of elements that have motion status according to Equation (3) and performs sequence merging of those elements with the same motion status to obtain sub-trajectory clusters of moving or stopping. The motion status of the sub-trajectory cluster is decided by the duration criterion. When the duration of the sub-trajectory cluster is shorter than the corresponding minimum duration threshold, the original motion status is pseudo and subject to revision. To build the stop/move model, homomorphic linear clustering is performed again after revising the motion status. Finally, the stops due to refueling are identified by analyzing the stop/move model with semantic information.
The key to the VSLC algorithm is the correct identification of the motion status. The motion status of a temporary stay may be misidentified using only the duration criterion. Moreover, occasional behavior may divide a trip into multiple fragments, resulting in misidentification of the motion status by the duration criterion.
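The average speed and the status labeling of Equations (2) and (3) can be sketched in Python as follows. The `Element` class and the function names are hypothetical helpers introduced for illustration; the 4 km/h default is the walking-speed threshold reported in the experimental section.

```python
from dataclasses import dataclass

SPEED_THRESHOLD_KMH = 4.0  # walking speed, used as the threshold in the experiments


@dataclass
class Element:
    """One trajectory element (hypothetical container for this sketch)."""
    length_km: float   # element length, e.g. from Equation (1)
    duration_s: float  # time interval between the endpoints, > 0 by Definition 1


def average_speed_kmh(element):
    """Equation (2): element length divided by the time interval."""
    return element.length_km / (element.duration_s / 3600.0)


def motion_status(element, threshold_kmh=SPEED_THRESHOLD_KMH):
    """Equation (3): 's' when the speed is not greater than the threshold, else 'm'."""
    return "s" if average_speed_kmh(element) <= threshold_kmh else "m"
```

An element of 50 m covered in one minute (3 km/h) is labeled "s", whereas 200 m in one minute (12 km/h) is labeled "m". The sketch relies on timestamps being strictly increasing, so the duration is never zero.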
To solve such issues, two factors, directional change constraint and contextual semantic constraint, are introduced to determine the motion status of the sub-trajectory clusters. Trips are extracted by analyzing the stop/move model with the help of daily riding experiences. The new algorithm adopts the idea of clustering to extract trips from single trajectories, of which the methodological steps are shown in Figure 2.
To better understand the principle of the MCHLC algorithm, all methodological steps are divided into three parts, which are implemented step by step in Figure 3. The first part is the formation of the original sub-trajectory clusters by sequentially clustering the trajectory elements with the same motion status. The motion status of a trajectory element is determined by comparing the speed and the given threshold. As shown in Figure 3a, one trajectory is composed of multiple trajectory elements marked "s" or "m", which expresses the motion status and is calculated by Equation (3). The sub-trajectory cluster marked "s" or "m" is formed by performing homomorphic linear clustering, as shown in Figure 3b.
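The sequential merging of the first part can be sketched as a run-length grouping over the element statuses. This is a minimal Python illustration under the assumption that a cluster is fully described by its status, total duration, and element count; the function name `linear_clusters` is hypothetical.

```python
from itertools import groupby


def linear_clusters(statuses, durations_s):
    """Merge consecutive elements that share a motion status.

    statuses    -- sequence of "s"/"m" labels, one per trajectory element
    durations_s -- duration of each element in seconds, same length
    Returns a list of (status, total_duration_s, element_count) tuples.
    """
    clusters = []
    start = 0
    for status, run in groupby(statuses):
        count = len(list(run))
        total = sum(durations_s[start:start + count])
        clusters.append((status, total, count))
        start += count
    return clusters
```

For example, the status sequence m, m, s, s, m with durations 60, 60, 60, 120, 60 seconds yields three clusters: a move of 120 s, a stop of 180 s, and a move of 60 s. The same routine can be applied again after the statuses are revised.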
Figure 3 (panels c-h). (c) The status of the cluster in the red circle is misidentified using the duration criterion, which is correctly revised using the directional constraint, as shown in (d). (e) The trajectory is segmented into many segments, each of which is composed of clusters with the motion status in the form of "$m_1 - \cdots - s_n$". (f) The pseudo status is revised using the contextual constraint. (g) New clusters are obtained by performing homomorphic linear clustering again after status revision. (h) A trip is extracted using the rules based on daily riding experiences.
The second part is the construction of the stop/move model, which is the core of the MCHLC algorithm. In this part, the identification and revision of the motion status of the sub-trajectory clusters is the key to constructing the stop/move model. According to the definition of a stop or a move, each motion status should last for a minimum duration. Thus, the duration criterion is first used to determine whether the identification of the sub-trajectory cluster is correct. That is, when the duration of a sub-trajectory cluster is not greater than the corresponding threshold, the current status of the sub-trajectory cluster is pseudo. A pseudo status, labeled "F", indicates that the motion status may be misidentified. Then, the directional constraint criterion is used to evaluate the sub-trajectory cluster with the pseudo motion status. For example, a sub-trajectory cluster with the pseudo status "s" is regarded as a true stop when its direction change angle is near 180 degrees, which implies that the trips alternate. The direction change angle of the cluster is the difference between the heading angles of adjacent trajectory elements, calculated by Equation (4). The heading angle of a trajectory element is calculated from the geographic coordinates of the endpoints, following Equations (5) and (6).
As Figure 3d shows, the misidentified status in the red circle of Figure 3c can be revised using the directional constraint. This shows that the directional constraint is useful for correctly identifying the status of a temporary stay and can reveal detailed information about the trip.
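The duration criterion and the directional constraint can be sketched in Python as follows. This is only an illustration: the heading is computed with the standard initial-bearing formula, which is an assumption and may differ from the paper's Equations (5) and (6), and the `tolerance_deg` parameter is a hypothetical way to express "near 180 degrees", since no tolerance is stated. The Min_Move and Min_Stop values are those reported in the experimental section.

```python
import math

MIN_MOVE_S = 4 * 60  # Min_Move: minimum duration of a normal move
MIN_STOP_S = 3 * 60  # Min_Stop: minimum duration of a normal stop


def is_pseudo(status, duration_s):
    """Duration criterion: a cluster not longer than its threshold is pseudo."""
    limit = MIN_STOP_S if status == "s" else MIN_MOVE_S
    return duration_s <= limit


def heading_deg(lat1, lon1, lat2, lon2):
    """Initial bearing of an element, clockwise from north, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def direction_change_deg(heading_a, heading_b):
    """Difference between two headings, folded into [0, 180]."""
    diff = abs(heading_a - heading_b) % 360.0
    return min(diff, 360.0 - diff)


def confirms_true_stop(change_deg, tolerance_deg=30.0):
    """Hypothetical test for a direction change 'near 180 degrees'."""
    return change_deg >= 180.0 - tolerance_deg
```

A short stop whose incoming and outgoing elements point in nearly opposite directions would then be kept as a true stop, while a short stop with little direction change remains pseudo and is passed to the contextual constraint.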
A whole trip may be divided into fragments by occasional behavior, such as waiting at a traffic light or avoiding pedestrians. Among the fragments, some motion status may be pseudo due to the short duration, resulting in misidentification of the trip. To address this issue, a contextual semantic constraint is introduced to determine and revise the motion status. Generally, a trip ends at a stop where the trips alternate. Therefore, a single trajectory is first segmented into multiple segments. Each segment is composed of multiple sub-trajectory clusters with a motion status in the form of "$m_1 - \cdots - s_n$", where the last sub-trajectory cluster marked "$s_n$" is a true stop. Certain sub-trajectory clusters in each segment may have the pseudo status, except for the last sub-trajectory cluster.
To eliminate misidentifications caused by the pseudo status, the contextual relationships of the clusters in a segment are analyzed according to actual riding experiences, which are summarized in Figure 4. When there is only one pseudo status in the segment, the pseudo status is regarded as noise caused by the statuses of adjacent segments, which should be revised. In Figure 4a, the pseudo status of the moving cluster may be caused by GPS signal drift, so the motion status of the cluster is revised from "m" to "s", denoting a stop. In Figure 4b, the pseudo status of the temporary stay caused by occasional behavior (such as waiting at a traffic light) is in the middle of the trip, thereby splitting the trip into fragments. Therefore, the status of the temporary stay should be revised to "m", denoting a move. If the occasional behavior occurs at the endpoint of a trip, there will be multiple pseudo statuses among the segments, as shown in Figure 4c,d. Then, the pseudo status of each cluster should be revised.
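The segmentation step and the single-pseudo-status revision can be sketched in Python as follows. The `Cluster` class and both function names are hypothetical, and only the cases of Figure 4a,b (exactly one pseudo status in a segment) are covered; the multi-pseudo cases of Figure 4c,d depend on details of the figure and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """A sub-trajectory cluster (hypothetical container for this sketch)."""
    status: str           # "s" or "m"
    pseudo: bool = False  # True when the duration criterion flagged the status


def split_segments(clusters):
    """Cut the cluster sequence after every confirmed (non-pseudo) stop."""
    segments, current = [], []
    for cluster in clusters:
        current.append(cluster)
        if cluster.status == "s" and not cluster.pseudo:
            segments.append(current)
            current = []
    if current:  # trailing clusters that are not closed by a true stop
        segments.append(current)
    return segments


def revise_single_pseudo(segment):
    """Flip the status of the only pseudo cluster in a segment, in place."""
    flagged = [c for c in segment if c.pseudo]
    if len(flagged) == 1:
        cluster = flagged[0]
        cluster.status = "m" if cluster.status == "s" else "s"
        cluster.pseudo = False
    return segment
```

After the revision, adjacent clusters with the same status can be merged again to obtain the stop/move model, for instance a sequence m, s (pseudo), m, s becomes m, m, m, s and collapses into one move followed by one stop.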
Based on the contextual semantic constraint, the status of each cluster is re-identified and revised, and then homomorphic linear clustering is performed again to build the stop/move model.
The third part is trip extraction from the stop/move model using the daily riding rules. According to daily riding experiences, a trip should satisfy the following rules:
(1) Distance rule: According to the definition, a trip should cover at least a minimum distance in space. Here, the minimum distance is set as 500 m. The trip length is the sum of the distances between consecutive points rather than the straight-line distance between the endpoints of the trip.
(2) Average speed rule: Compared to shared bikes, the speed of shared e-bikes is higher. The report on sharing bicycles and urban development in 2017 stated that the average velocity of the fastest shared bicycles in cities is approximately 9.7 km/h [40], and here, we consider that the average velocity of a shared e-bike during a trip should be higher than 10 km/h.
(3) Shortest path rule: Generally, residents select the shortest travel path. If the distance between the endpoints of the trip is not greater than half the trip length, the trip should be filtered out.
(4) Completeness rule: A trip is usually connected by two consecutive stops. However, when a trip is at the endpoint of a trajectory, the trip may not be adjacent to the stop. In this situation, the endpoint of the trip can be decided by evaluating the speed. Considering that the speed gradually changes near the start or end point of the trip, when the speed at a point is not higher than 1.5 times the average speed of the trip, the point is considered an endpoint of the trip.
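The distance, average speed, and shortest path rules can be combined into a single filter, sketched below in Python. The function name and argument names are hypothetical; the thresholds are the values stated in the rules above. The completeness rule concerns endpoint detection rather than filtering and is not covered by this sketch.

```python
MIN_TRIP_LENGTH_KM = 0.5   # distance rule: at least 500 m
MIN_AVG_SPEED_KMH = 10.0   # average speed rule: higher than 10 km/h
MIN_DIRECTNESS = 0.5       # shortest path rule: endpoint distance / trip length


def passes_riding_rules(path_length_km, duration_s, endpoint_distance_km):
    """Return True when a candidate move satisfies rules (1) to (3).

    path_length_km       -- sum of element lengths along the candidate trip
    duration_s           -- riding duration in seconds
    endpoint_distance_km -- straight-line distance between first and last point
    """
    if duration_s <= 0:
        return False
    if path_length_km < MIN_TRIP_LENGTH_KM:
        return False
    average_speed = path_length_km / (duration_s / 3600.0)
    if average_speed <= MIN_AVG_SPEED_KMH:
        return False
    # Filter out detours: endpoints no farther apart than half the trip length.
    if endpoint_distance_km <= MIN_DIRECTNESS * path_length_km:
        return False
    return True
```

For instance, a 2.5 km ride lasting 10 minutes (15 km/h) whose endpoints are 2.0 km apart passes all three rules, while a 3.0 km ride whose endpoints are only 1.0 km apart is rejected by the shortest path rule.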

Experimental Data
The experimental data are the trajectory data of the shared e-bikes in Tengzhou city. As shown in Figure 5, the data cover the whole city center. Shared e-bikes are an emerging green travel mode, which is a solution to short- to medium-distance trips, especially in second- and third-tier cities. Compared to shared bikes, shared e-bikes with electric assistance have a distinct advantage in overcoming cycling barriers, such as longer trips and a challenging topography (hilly or dispersed cities) [41]. Moreover, shared e-bikes attract additional user groups who carry loads when traveling [42] or have physical limitations that do not allow bicycle pedaling [43]. Few existing studies are related to shared e-bikes, most of which were based on survey data to study the performance of e-bikes [44], users' mobility behavior [45][46][47], and travel mode influencing factors [48]. Li and Dai [49] were the first to complete data cleaning of shared e-bike trajectories based on the speed and time interval rules, without considering the trajectory semantic information. To utilize the trajectory semantic information, a new method, the MCHLC algorithm, is proposed to extract trips from the shared e-bike trajectories.
The shared e-bikes in Tengzhou city came from the BeeFly company and are named Mebikes, due to their bee-like appearance. Similar to dockless shared bikes, users can pay using a smart phone to pick up or return Mebikes freely, owing to the development of electronic fence technology. Users can be granted discounts on riding fees when e-bikes are returned to the electronic virtual stations set up using the electronic fence technology, which is an excellent solution to the issue of random parking.
The experimental data were acquired between May 19 and May 26, 2018, involving 516 Mebikes and 98,795 GPS trajectory points. The geographical coordinates are 117°07′14″-117°12′36″E and 35°02′08″-35°07′21″N. Each shared e-bike was equipped with an integrated GPS and communication module. The data were acquired from a specified internet address, to which the GPS information is sent every minute. As Figure 6 shows, the GPS records were stored as individual files keyed by the vehicle ID. All the work was conducted using the Java programming language.
As shown in Figure 6, each of the original GPS points includes the vehicle ID (the StationID), data acquisition time (timestamp), geo-location (latitude and longitude coordinates), and predicted mileage (anticipated mileage). It is not an easy task to extract trips from the raw data without any riding status-related attribute information. The GPS devices are set to collect data once a minute, but only 51.7% of the original data are recorded with a sampling interval of one minute. As shown in Figure 7, the original data have different sampling intervals. Among the data, 82.1% of the sampling intervals of the original data are shorter than 2 minutes, and 10.8% of the data are low-density sampling data, in which the sampling interval is longer than 2 minutes [50]. These statistics show that most of the data are uniform, but the data still exhibit a certain degree of sparsity.
Data sparsity is caused by many factors, one of which is the difference in the travel time span of shared e-bikes. Figure 8 shows that 88.68% of the shared e-bikes are used on three to four days, and only 0.58% of the shared e-bikes are used on one day. The difference in time span causes data sparsity to a certain extent. Moreover, data sparsity is also related to the riding environment, and the battery can also result in discontinuous data. When riding between buildings or trees or along narrow alleys or when the battery runs out, the GPS signal will be weak or may be lost, resulting in discontinuous data or missing data. Such characteristics of shared e-bike trajectories indicate that a new method is needed to extract trips. Notably, the high utilization rate of shared e-bikes and the relatively uniform sampling rate both indicate that the data are usable and analyzable, although the data used in the study are sparse.

Experimental Results
As described in Section 2, the MCHLC algorithm depends on three key parameters, which are summarized in Table 1. Min_Move and Min_Stop are the minimum durations of a move and a stop, respectively, and are used to determine the motion status of a sequence. To reduce the misidentification of stops caused by waiting at traffic lights, Min_Stop is set to 3 minutes. According to the 2017 report on sharing bicycles and urban development, it takes approximately 3 minutes to complete a minimum trip (500 m). Because residents usually choose shared e-bikes rather than shared bicycles for longer trips, Min_Move is set to 4 minutes. In addition, a speed threshold determines the original status of each trajectory element; here it is set to the walking speed of 4 km/h. Direction_Diff is used to distinguish a true stop from a temporary stay caused by a special purpose, such as dropping children off at school on the way to work.
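The way these parameters interact can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Point` class, the helper names, the haversine distance, and the strict `>` comparison against the speed threshold are all assumptions made for illustration. The sketch labels each trajectory element (a pair of consecutive points) by speed, groups consecutive elements of the same status, and reports whether each group reaches the minimum duration for its status.

```python
from dataclasses import dataclass
from itertools import groupby
from math import asin, cos, radians, sin, sqrt

SPEED_THRESHOLD_KMH = 4.0  # walking speed, as stated in the text
MIN_MOVE_S = 4 * 60        # Min_Move from Table 1
MIN_STOP_S = 3 * 60        # Min_Stop from Table 1


@dataclass
class Point:               # hypothetical container for one GPS record
    t: float               # timestamp in seconds
    lat: float
    lon: float


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance in metres (assumed distance measure)."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(h))


def element_status(a: Point, b: Point) -> str:
    """Label one trajectory element as 'move' or 'stop' by its speed."""
    dt = b.t - a.t
    if dt <= 0:
        return "stop"
    speed_kmh = haversine_m(a, b) / dt * 3.6
    return "move" if speed_kmh > SPEED_THRESHOLD_KMH else "stop"


def cluster_by_status(points):
    """Group consecutive elements with the same status into clusters."""
    elements = [(element_status(a, b), a.t, b.t) for a, b in zip(points, points[1:])]
    clusters = []
    for status, group in groupby(elements, key=lambda e: e[0]):
        group = list(group)
        start, end = group[0][1], group[-1][2]
        min_duration = MIN_MOVE_S if status == "move" else MIN_STOP_S
        clusters.append({
            "status": status,
            "start": start,
            "end": end,
            # True when the cluster lasts long enough to count as a normal move/stop
            "meets_min_duration": end - start >= min_duration,
        })
    return clusters
```

Clusters that do not reach the minimum duration are the ones whose status would then need to be revised by the further constraints described in Section 2.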

Table 1. Default parameter settings.

| Parameter | Description | Value |
|---|---|---|
| Min_Move | The minimum duration for a normal move | 4 min |
| Min_Stop | The minimum duration for a normal stop | 3 min |
| Direction_Diff | The angle of the direction change between adjacent points | 180° |

Based on the default settings of the parameters listed in Table 1, the trajectories of shared e-bikes in Tengzhou were processed with the MCHLC algorithm. Invalid points beyond the study area were filtered out before the MCHLC algorithm was applied. A total of 5962 trips were identified from the original data. The statistical results in Table 2 show that the shortest trip is approximately 833 m, while the longest trip reaches 12.5 km. The average trip length is approximately 2.5 km, and the average riding duration is 10 minutes. To further examine the riding behaviors of shared e-bike users, the distribution of the identified trips is analyzed in the form of a bar chart (Figure 9). The majority of the trip lengths are within 1-4 km (accounting for more than 85% of all trips), of which 1-2 km accounts for the largest proportion (approximately 41.4%), followed by 2-3 km (nearly 30%), while 3-4 km accounts for the smallest proportion (only 14.8%). The bar chart shows that the trip length is mostly within 1-4 km when the residents of Tengzhou city choose shared e-bikes to travel. This result confirms that shared e-bikes are an excellent solution for short- to medium-distance travel in Tengzhou city.

To verify the performance of the MCHLC algorithm, 50 shared e-bikes were sampled. A total of 486 trips were selected as reference data from the samples, which were obtained by manual interpretation in ArcGIS Engine against the background of Amap and the OpenStreetMap (OSM) road network. The MCHLC algorithm was evaluated by three data mining indexes (precision, recall, and F1-score), whose mathematical expressions are shown in Equations (7)-(9):
$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$
Here, TP is the number of trips correctly detected, FP is the number of trips incorrectly identified by the algorithm, and FN is the number of trips that the algorithm failed to identify.
Table 3 indicates that all three indexes have a high value, higher than 92%. The results show that the MCHLC algorithm is reliable and suitable to identify trips from sparse trajectory data.
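For readers who want to reproduce this kind of evaluation, the three indexes can be computed from the TP, FP, and FN counts with the standard definitions. The short Python sketch below is illustrative only; the function name `trip_metrics` is hypothetical, and no counts from the study are used.

```python
def trip_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from trip-level counts.

    tp: trips correctly detected
    fp: trips incorrectly identified by the algorithm
    fn: reference trips the algorithm failed to identify
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is reported alongside the other two indexes.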

The Roles of the Directional Change Constraint and Contextual Constraint
Compared to the baseline VSLC algorithm, the two factors, directional constraint and contextual constraint, are mainly introduced in the new method to improve the performance. The directional constraint can effectively identify temporary stays, which may be caused by special travel purposes, even with a short duration. As shown in Figure 10, without the directional constraint, the GPS trajectory points in part A were misidentified as belonging to one trip, while the points were correctly identified as four trips considering the directional constraint, as shown in part B.
In Figure 10b, Trip 1_1 shows that the user departed from the New Star International Cinemas and arrived at the People's Harmony Bay Community (a residential area). The shared e-bike departed from the residential area 5 minutes later in the opposite direction, which indicated that a new trip occurred. Due to the missing data caused by the GPS signal being obscured by tall buildings, the stop cannot be identified using the speed criterion. However, considering the directional constraint, the stop can be identified, and the GPS points were identified as belonging to two trips, Trip 1_1 and Trip 1_2.
The shared e-bike departed from the People's Harmony Bay Community (a residential area) at 14:25 and passed by Beiwen Primary School at 14:34 before arriving at the True Love Shopping Mall at 14:47. Trip 1_2 shows that the shared e-bike departed from the People's Harmony Bay Community (a residential area) at 14:25 and arrived at Beiwen Primary School at 14:34, while Trip 1_3 shows that the shared e-bike departed from Beiwen Primary School at 14:36 in the opposite direction and arrived at the True Love Shopping Mall at 14:47. We conjectured that a resident had dropped off their child at school on the way to the shopping mall, resulting in a two-minute stop at the school. Considering that this two-minute stay was caused by the resident's travel purpose, the stay was considered a stop as identified by the directional constraint, and two trips were distinguished.
A new trip, Trip 1_4, occurred between the True Love Shopping Mall and the Seven Degrees Network Club (a leisure area). The duration of the stop was only two minutes at the True Love Shopping Mall, where many people arrive and leave. We conjectured that someone had picked up the shared e-bike soon after Trip 1_3 had ended and left the mall in the opposite direction. The short stop, misidentified by the speed and duration criteria, was correctly identified by the directional constraint.
It is noted that the directional factor used to identify trips rests on the hypothesis that residents rarely turn 180 degrees during a trip without a special reason or purpose. If a person turns 180° in the middle of a journey, some activity may have occurred. Even if the stay is very short, it can be identified from the directional change, which detects the special purpose during a journey and reveals detailed information about residents' travel behaviors. This is the advantage of our method over other methods. However, the directional change factor does not work if a short activity occurs without a change of direction. For example, in Figure 10, if the primary school were at 90 degrees to the residential area, Trip 1_2 could not be identified because of the short stay and the lack of a direction reversal.
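The direction test described above can be sketched in Python as follows. This is a hedged illustration rather than the published procedure: the function names, the use of the initial great-circle bearing, and the `tolerance_deg` parameter are assumptions (the text specifies only that Direction_Diff is 180°, not how close to 180° a measured change must be).

```python
from math import atan2, cos, degrees, radians, sin


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees within [0, 360)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    y = sin(dlon) * cos(phi2)
    x = cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(dlon)
    return degrees(atan2(y, x)) % 360.0


def direction_change_deg(p_prev, p_mid, p_next):
    """Heading change at p_mid, in degrees within [0, 180].

    Each point is a (lat, lon) tuple.
    """
    b_in = bearing_deg(*p_prev, *p_mid)
    b_out = bearing_deg(*p_mid, *p_next)
    diff = abs(b_out - b_in) % 360.0
    return min(diff, 360.0 - diff)


def is_reversal(p_prev, p_mid, p_next, tolerance_deg=30.0):
    """True when the heading change is within tolerance_deg of 180 degrees.

    tolerance_deg is an assumed parameter, not taken from the paper.
    """
    return direction_change_deg(p_prev, p_mid, p_next) >= 180.0 - tolerance_deg
```

With this sketch, riding north and then returning south through the same point counts as a reversal, whereas a right-angle turn does not, which mirrors the limitation noted for a school located at 90 degrees.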
The pseudo stop caused by waiting at a traffic light can be revised by the contextual constraint. In Figure 11A, the trip extracted by the MCHLC algorithm is theoretically compatible with the shared e-bike departing from the Tengdu Community (a residential area) and arriving at a fitness club (a leisure area), in which the temporary stay (the enlarged figure in Figure 11B) shown with the red rectangle is caused by a traffic light. The temporary stay is misidentified as a stop by the method of Li and Dai, resulting in the whole trip being divided into two different trips (as shown in Figure 11C), which can be revised by the contextual constraint.
A temporary stay, such as waiting at a traffic light or avoiding pedestrians, may divide a trip into multiple fragments with different motion statuses. As shown in Figure 12A, the trip extracted by the MCHLC algorithm is theoretically compatible with the shared e-bike departing from the Sea Moon Community (a residential area) and arriving at the Garden Lane Community (another residential area), in which the pseudo stop caused by a traffic light occurs at the beginning of the trip (shown in the red rectangle), resulting in trip fragmentation. Thus, the beginning point of the trip is misidentified by the duration criterion of the VSLC algorithm, as shown in Figure 12B.
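One simple way to picture the revision of pseudo stops is sketched below in Python. The exact contextual rule is defined in Section 2 and is not reproduced here, so this sketch rests on a stated assumption: a stop shorter than Min_Stop that is flanked by moves on both sides is relabeled as a move, and adjacent clusters with the same status are then merged. The function name and the tuple layout are hypothetical, and the sketch deliberately ignores the directional constraint, which would keep a short stop when the heading reverses.

```python
MIN_STOP_S = 3 * 60  # Min_Stop from Table 1


def merge_pseudo_stops(clusters, min_stop_s=MIN_STOP_S):
    """Relabel short stops between two moves as moves, then merge neighbours.

    clusters: time-ordered list of (status, start_s, end_s) tuples,
              where status is 'move' or 'stop'.
    """
    relabeled = []
    for i, (status, start, end) in enumerate(clusters):
        flanked_by_moves = (
            0 < i < len(clusters) - 1
            and clusters[i - 1][0] == "move"
            and clusters[i + 1][0] == "move"
        )
        if status == "stop" and end - start < min_stop_s and flanked_by_moves:
            status = "move"  # assumed rule: treat as a temporary stay
        relabeled.append((status, start, end))

    merged = []
    for status, start, end in relabeled:
        if merged and merged[-1][0] == status:
            # Extend the previous cluster instead of starting a new one
            merged[-1] = (status, merged[-1][1], end)
        else:
            merged.append((status, start, end))
    return merged
```

In a case like the one in Figure 12, a short move before the traffic light, the wait itself, and the long move after it would be joined into a single move, so the trip would begin at the true departure point.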

Comparison of the Different Methods
Two factors, directional constraint and contextual constraint, are introduced into the MCHLC algorithm, which is inspired by the VSLC algorithm. In addition, speed is one of the criteria used to extract trips in the MCHLC algorithm. Thus, to test the performance of the MCHLC algorithm, two baseline methods are used here: CB-SMoT and VSLC. To understand the role of the directional constraint and contextual constraint better, the experimental result of the MCHLC algorithm is also compared with the results of the direction VSLC and semantic VSLC algorithms. Notably, the direction VSLC algorithm introduces the directional constraint into the VSLC algorithm, and the semantic VSLC algorithm introduces the contextual constraint into the VSLC algorithm.
The experimental results of the five algorithms were evaluated using three indexes (precision, recall, and F1-score), which are compared in Figure 13. The comparison shows that the new algorithm outperforms the others, with all three of its evaluation indexes above 93%.

Compared to the baseline CB-SMoT algorithm, our method (MCHLC) has an obvious advantage for sparse trajectory data. As Figure 13 shows, the evaluation indexes of CB-SMoT are much lower than those of the other methods, especially its recall, which is just 55.14%. The results show that CB-SMoT is not suitable for extracting trips from sparse trajectories. CB-SMoT is an extension of DBSCAN in which a core point is determined by testing the average speed of adjacent points through two parameters, Eps and Mintime. However, when only a few points meet the speed limit, it is difficult for CB-SMoT to discover the stops. For example, Figure 14 contains two different stops. One is a stop with many points in one place, shown in Figure 14B, which CB-SMoT discovers easily. The other is a single point with a large time interval, shown in Figure 14C, which CB-SMoT cannot discover. To address this issue, duration is used to identify stops in our method. For the stop in Figure 14C, the motion status between the two points is a stop, and its duration is greater than the minimum duration of a stop, so it can be discovered by both VSLC and our method.
Compared to the baseline VSLC algorithm, the performance of the MCHLC algorithm is improved by approximately 10%, owing to the two additional factors, the directional constraint and the contextual constraint. Although each constraint can individually improve the accuracy of the algorithm, their influences differ. The results indicate that the contextual constraint improves the accuracy rate of the extracted trips, while the directional constraint increases the number of trips successfully extracted from the reference data. The temporary stays caused by special purposes are mainly detected by the directional constraint, while the contextual constraint reduces the misidentification of temporary stays that are inevitably caused by the occasional behavior of residents and may result in trip fragmentation.

Conclusions
Shared e-bikes, an emerging transport mode, are favored by an increasing number of people. Studying mobility behavior based on shared e-bike trajectories is significant because it can support sound decisions for urban development. Moreover, the trip identification method is worth studying because it is the basis of trajectory data mining and trajectory analysis. In this paper, a new method of trip identification is proposed and tested on the trajectories of shared e-bikes in Tengzhou city. The conclusions drawn from the experimental results are as follows:
(1) The new algorithm, named the MCHLC algorithm, is reliable and suitable for identifying trips from sparse trajectory data of shared e-bikes. Compared to the baseline VSLC method, the three evaluation indexes (precision, recall, and F1-score) are increased by approximately 10% with the MCHLC algorithm, indicating that the new algorithm is better for shared e-bike trajectory data. Compared to the baseline CB-SMoT, the MCHLC algorithm has an obvious advantage in identifying stops in sparse trajectories. Because of missing data, a stop may be a single point. In this case, the stop can be identified by MCHLC with the duration parameter, whereas it cannot be discovered by CB-SMoT.
(2) The performance of the MCHLC algorithm is controlled by the directional and contextual constraints. The former can effectively identify the short stops caused by special purposes, which enriches detailed behavior identification. The latter utilizes semantic relationships to reduce misidentifications, especially the fragmentation caused by occasional behaviors (waiting at traffic lights or avoiding pedestrians).
Our new algorithm can also be applied to other forms of sparse trajectory data. Residents' activities at different scales can be identified by changing the parameter settings. In ongoing work, we will identify urban hotspots and hot travel routes on the basis of shared e-bike trips, thereby providing a scientific decision-making basis for urban planning.