1. Introduction
A well-functioning freight system is essential for efficient supply chain operation and timely delivery of goods. The organizational layouts of freight locations, which serve manufacturing, distribution, and transportation, form the backbone of urban freight networks. Large trucks facilitate critical interactions between metropolitan regions and businesses, connecting markets, logistics hubs, and industrial centers. To align with sustainability goals, freight networks must evolve by incorporating low-carbon technologies, such as electric trucks and alternative fuels, reducing emissions, and enhancing energy efficiency. As cities grow, optimizing these networks for sustainability becomes essential, helping to reduce carbon emissions, improve air quality, and alleviate traffic congestion, all of which contribute to urban well-being. However, data that support the evaluation of the dynamics and interaction patterns of different freight locations, as well as a comprehensive analysis of heavy-duty vehicle (HDV) behavior, remain scarce. Addressing this data gap matters beyond methodological advancement. The freight transportation sector, and heavy-duty trucking in particular, is a major and growing contributor to global greenhouse gas emissions and urban air pollution. Current operational patterns and emissions trajectories of HDV fleets are significantly out of sync with international net-zero climate targets, creating an urgent need for data-driven insights to inform effective policy and planning [1]. A precise understanding of freight vehicle behavior, including trip chains, dwell times, and facility interactions, is fundamental to designing targeted interventions such as zero-emission zones, logistics consolidation schemes, and infrastructure for alternative fuels. Therefore, this study aims to contribute to this urgent research priority by developing and validating a framework to reconstruct and analyze HDV activity patterns from GPS data, thereby providing a scalable tool to support evidence-based urban freight management and sustainability policy.
A truck trip chain or tour is a sequence of trips that includes the origins, destinations, and intermediate stops a vehicle makes during a journey. For commercial vehicles, these trip chains often consist of diverse purposes, such as goods pick-ups, deliveries, and non-freight-related activities like rest or refueling. By connecting various business establishments, truck trip chains create spatial interactions between various facility locations, linking different parts of one or more supply chains. Thus, analyzing the trip chain patterns of commercial vehicles can yield valuable insights into supply chain agents’ decision-making and behavior. In addition, trip chain data are also a fundamental component of freight modeling [
2,
3]. Trip chains have a behavioral foundation; the interconnected trips within a tour are considered collectively to reflect their logistical interactions [
4]. Freight tours, driven by economic decisions aimed at minimizing logistics costs, are particularly suited as the analytical unit for freight movement studies. This makes trip chain analysis essential for understanding the activity patterns, logistical decisions, and travel behaviors of commercial vehicles.
Therefore, high-quality, well-structured trip chain data are needed for freight analysis and modeling. Traditionally, trip chain data have been collected through driver surveys [5,6,7], where truck drivers document trip details, such as stops and purposes. While this method provides rich and accurate data, it is labor-intensive, costly, and often lacks comprehensive and timely coverage, limiting its utility for freight modeling. The advent of GPS technology has revolutionized trip chain data collection. Although GPS data do not directly provide behavioral information, they provide detailed vehicle trajectories, from which trips and trip chains can be reconstructed and behavioral knowledge can be inferred using a variety of techniques [8,9,10,11,12]. Numerous studies have used GPS data for various research purposes, including trip end identification [13,14], trip purpose inference [9,15,16,17,18,19], commercial vehicle travel pattern analysis [9,20], freight modeling [4,21,22], and truck trip chain mining.
Compared to passenger trip chains, which typically follow simpler home-based or work-based patterns, commercial vehicle trip chains are generally more complex. Their identification requires determining the start and end of a tour, for which no unified definition exists. The literature offers various perspectives on freight tour definitions. Some studies define the end of a tour as a return to a base location [22,23,24], while others consider a new tour to begin after a delivery stop is followed by a pick-up stop [25,26]. One study explores various definitions of freight vehicle trip chains, or tours [23]. It evaluated three commonly used definitions: base-based trip chains, trip purpose-based trip chains, and capacity-based trip chains. It compared the trip chains extracted under each definition and demonstrated that the resulting trip chain types depend heavily on the definition used. These differing definitions reflect varied analytical priorities: base-driven methods emphasize activity regularity, while others focus on trip purposes or vehicle capacity usage. For freight modeling and prediction purposes, base-driven methods are widely favored due to their ability to capture the regularity of truck activities, particularly since GPS data typically lack trip purpose and vehicle capacity information.
In general, the reconstruction of trip chains involves three steps: (1) identifying truck stops from raw GPS data; (2) determining trip ends; and (3) identifying distinct stop locations. A speed threshold method is generally used to identify truck stops [19,27]: a truck stop is indicated when the vehicle speed drops below a predefined threshold (e.g., 5 km/h). Once stops are identified, they are classified based on their purpose. Numerous studies have focused on inferring trip purposes from GPS data. Most studies categorize stop purposes into two groups: (1) freight-related activities, such as delivery, unloading, and loading, and (2) non-freight-related activities, such as rest, refueling, and dining. For truck behavior analysis, non-freight-related stops are typically excluded, as freight-related activities are more relevant for understanding truck travel patterns and the relationships between different trip ends. Once trip ends are identified, truck trips and trip chains can be reconstructed based on the behavioral information from these stops.
The third step, identifying distinct truck stop locations, has received relatively little attention. While truck stops refer to any stopping event by a truck, trip stop locations specifically refer to sites such as business establishments, company-owned parking lots, logistics depots, and freight hubs where trucks stop for similar purposes. Accurately identifying these stop locations is crucial for reconstructing trip chains for both individual trucks and groups of trucks. Correct identification ensures that stops at the same location are grouped together rather than treated as distinct stops, improving the accuracy of trip chain reconstruction. For grouped trucks, identifying shared stop locations also facilitates the evaluation of collective behaviors, such as trip chain similarities, and provides insights into stop location characteristics, such as base classifications. This is particularly important as freight transport patterns vary across different facility types and locations [
28].
Despite its importance, relatively few studies have focused on identifying stop locations. Two methods are commonly used in the literature to determine when vehicles stop near each other. One approach employs spatially constrained methods, such as grid-based methods or the Voronoi method, to identify and characterize truck stops within a region [10,29]. Spatially constrained methods divide a space into regions, which can result in an establishment being split across multiple regions. They therefore tend to favor identifying collective behaviors rather than distinguishing individual locations. Another widely used approach involves clustering techniques [24,30,31,32,33], such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is valued for its ability to detect clusters of varying shapes and for its robustness to noise [34]. We tested the DBSCAN algorithm on practical data and found that it cannot effectively differentiate distinct stop locations in areas with high truck stop densities, because spatial constraints are not taken into consideration.
This study therefore aims to uncover trip chains from GPS data, revealing the activity patterns that groups of trucks display and offering a strong data foundation for improving freight planning. For this purpose, this study introduces an integrated method that takes advantage of both the spatially constrained method and the DBSCAN method, namely, a roadway-constrained method that identifies trip ends and stop locations from GPS data, reconstructs trip chain information for individual vehicles, and classifies trucks based on travel behavior. This study contributes to the literature in three key areas:
Development of the Roadway-Constrained Method: A novel, road-constrained method is proposed to accurately identify truck stop locations, addressing the limitations of previous studies that relied on traditional clustering algorithms, which struggled to effectively capture distinct truck stop locations.
Introduction of an Iterative Procedure for Trip Chain Identification: An iterative procedure is introduced to determine truck trip chains, offering empirical evidence that supports the planning of sustainable freight systems.
Identification and Classification of Consistent Truck Travel Behaviors: This study identifies consistent truck travel behaviors and classifies trucks into distinct groups based on trip chain properties, providing deeper insights into truck behavior. By classifying trucks according to these properties, this study enhances the understanding of freight dynamics and offers valuable implications for sustainable freight management.
The proposed framework can accurately extract trip chain data from GPS trajectories and profile trucks into distinct behavioral groups for analysis. By addressing the complexities of trip chain identification and leveraging GPS data, this study contributes to the advancement of freight modeling and the understanding of commercial vehicle travel behavior.
This paper is organized as follows: Section 2 describes the dataset used for this study and defines the truck trip chain. Section 3 introduces the trip chain mining methodology and truck classification methods. Section 4 summarizes the results of this study. Finally, Section 5 concludes this study.
3. Materials & Methodology
The proposed approach consists of the following four major steps: (1) truck stop identification and clustering; (2) trip chain extraction; (3) truck trip chain feature selection; and (4) truck profiling and classification.
Figure 2 outlines the overall framework, and the detailed procedure is described in the following sections.
3.1. Truck Stop Identification
A heuristic-based approach was developed to identify truck stop activities using the available GPS data, which included each truck’s spot speed. Following the method established in a prior study [27], a speed threshold of 5 km/h was applied to determine truck stop behavior. Additionally, consecutive stops that were spatially and temporally close were merged to account for stop-and-go patterns, such as those occurring at parking facilities. The merging rules were defined based on empirical data:
If the distance between two consecutive stops was smaller than a predefined distance threshold and the temporal difference was less than a specified temporal threshold, the two stops were considered to be the same stop.
If the difference in arrival time between two consecutive stops was shorter than a defined time difference threshold and the average speed between the two stops was below a specified speed threshold, the two stops were also regarded as the same stop.
The first rule identifies truck stops that are both spatially and temporally close, suggesting that they should be merged together. The second rule, which is applied after the first, further refines the process by identifying temporally connected stops with very low-speed movement between stops. This could indicate trucks waiting in line but not fully stationary or exhibiting stop-and-go behavior. A combination of these thresholds was tested (
Section 4.1) to select the best set of parameters for accurate stop identification.
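The stop detection and the two merging rules can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the record layout `(t_seconds, x_m, y_m, speed_kmh)`, the `Stop` class, and all function names are assumptions, coordinates are assumed to be projected to metres, and the default thresholds are the values reported in Section 4.1. The "temporal difference" in rule 1 is read here as the gap between leaving one stop and arriving at the next, and the average speed in rule 2 is taken as distance divided by that gap; both readings are assumptions.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Stop:
    start: float  # arrival time (s)
    end: float    # departure time (s)
    x: float      # projected easting (m)
    y: float      # projected northing (m)


def detect_stops(points, speed_kmh=5.0):
    """Group consecutive GPS points (t, x, y, speed) below the speed threshold."""
    stops, run = [], []
    for p in points:
        if p[3] < speed_kmh:
            run.append(p)
        elif run:
            stops.append(_to_stop(run))
            run = []
    if run:
        stops.append(_to_stop(run))
    return stops


def _to_stop(run):
    n = len(run)
    return Stop(run[0][0], run[-1][0],
                sum(p[1] for p in run) / n, sum(p[2] for p in run) / n)


def _merge(stops, should_merge):
    """Single pass that folds a stop into its predecessor when the rule holds."""
    merged = []
    for s in stops:
        if merged and should_merge(merged[-1], s):
            merged[-1].end = s.end
        else:
            merged.append(Stop(s.start, s.end, s.x, s.y))
    return merged


def merge_stops(stops, dist_m=500.0, gap_s=150.0, arrival_gap_s=900.0, slow_kmh=6.0):
    """Apply rule 1 (close in space and time), then rule 2 (slow crawl between stops)."""
    def rule1(a, b):
        return hypot(b.x - a.x, b.y - a.y) < dist_m and (b.start - a.end) < gap_s

    def rule2(a, b):
        gap = b.start - a.end
        avg_kmh = hypot(b.x - a.x, b.y - a.y) / gap * 3.6 if gap > 0 else 0.0
        return (b.start - a.start) < arrival_gap_s and avg_kmh < slow_kmh

    return _merge(_merge(stops, rule1), rule2)
```

Running rule 2 as a second pass over the output of rule 1 mirrors the order described above; a merged stop keeps the position of its first component, which is a simplification.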
3.2. Truck Stop Clustering
Next, we aimed to identify the stop locations. Stop locations refer to places such as business establishments, company-owned truck parking lots, logistics depots, freight hubs, and ports where trucks stop for similar trip purposes. Accurate identification of these locations is essential for constructing vehicle trip chains. Identifying stop locations requires grouping vehicle stops in a meaningful way. Previous studies generally use one of two approaches: spatial clustering techniques, such as DBSCAN [34], or spatially constrained approaches [10,29]. Upon examining freight-related areas of interest (AOIs), truck stops, and the roadway network, we found that the majority of business establishments in China are located within regions delineated by higher-level roadways, and that high-level roadways rarely pass through business establishments (see Section 3.2.1 for details). Based on this observation, we extended the spatially constrained method and developed a roadway-constrained spatial clustering method, described in the following sections.
The roadway-constrained clustering algorithm consists of three main steps: (1) defining roadway-constrained zones by using OSM road network data to create zones based on major roads and assigning truck stops to these zones, (2) associating related zones by grouping adjacent roadC zones likely belonging to the same business cluster, and (3) identifying stop locations by applying the DBSCAN clustering algorithm within each roadC group to accurately refine truck stop locations. Detailed procedures for each step are provided in the following sections.
3.2.1. Roadway Constrained Area
In this study, we introduced roadway-constrained zones (referred to as roadC zones) to capture the spatial distribution of truck stops within the road network. These zones were generated by dividing the study area based on the road network data. Specifically, the roadC zones are polygonal regions that are delineated by high-level major roads, which serve as the boundaries of each zone. Each truck stop was assigned to a corresponding roadC zone based on its proximity to these road network boundaries. This approach ensured that the spatial units of analysis aligned with the structure of the urban road network and the distribution of truck stops within these high-traffic areas. The road network data was obtained from OpenStreetMap, which provides a classification system for differentiating roads based on their function and importance. These classifications are typically organized into nine categories, including motorways, primary roads, and secondary roads, among others. Using these classifications, higher-level roadways were selected to define the boundaries of roadway-constrained zones.
A detailed examination of roadway classifications and business establishment boundaries in Chongqing was conducted. A sample of 7402 freight-related AOIs, which included manufacturing factories and business establishments in Chongqing, was evaluated, and their boundaries were compared against various levels of roadways. The analysis revealed that six high-level roadway classes (motorway, trunk, primary, secondary, tertiary, and residential) rarely pass through the boundaries of establishments (see Table 1 for classification details).
Figure 3a provides a comparison between the boundaries of the AOIs and the OSM roadway network. The AOIs were consistently confined within the boundaries defined by the selected roadway network level, demonstrating alignment in the spatial context.
Figure 3b,c show a comparison.
Figure 3b illustrates six of the 7402 AOIs (depicted with green line patterns and labeled ① to ⑥) alongside the OSM roadway network. The gray lines represent roadways filtered to the six selected classes, while the red dashed lines indicate the remaining roadway classes. The figure demonstrates that the selected roadway classes effectively enclose establishments within distinct roadway-defined zones, while roads within the establishments are generally lower-level roadways (depicted in red dashed lines).
In contrast, Figure 3c shows a spatial division produced by the Voronoi method, which has been used to define stop locations in other studies [29]. The Voronoi method is a computational geometry technique that partitions a plane into regions based on proximity to POIs. As the figure shows, it performs poorly in detecting the boundaries of POIs. The city of Chongqing was therefore divided into roadway-constrained zones (roadC zones) using the filtered road network. A unique ID r (r∈R) was assigned to each zone.
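As a rough illustration of the zone assignment, the sketch below filters roads by the six selected OSM `highway` classes and assigns a stop to the zone polygon that contains it. It assumes the zone polygons have already been built from the filtered network and are given as simple rings of projected coordinates. The containment test (ray casting), the data layout, and the function names are assumptions made for illustration, not the authors’ procedure.

```python
# OSM `highway` values named in the text as zone boundaries.
BOUNDARY_CLASSES = {"motorway", "trunk", "primary", "secondary", "tertiary", "residential"}


def is_boundary_road(tags):
    """True if an OSM way's tags mark it as one of the six boundary classes."""
    return tags.get("highway") in BOUNDARY_CLASSES


def point_in_polygon(x, y, ring):
    """Ray-casting test: count crossings of a ray cast from (x, y) toward +x."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_zone(x, y, zones):
    """Return the ID r of the first zone containing the stop, or None."""
    for zone_id, ring in zones.items():
        if point_in_polygon(x, y, ring):
            return zone_id
    return None
```

A linear scan over all zones is enough for a sketch; with several thousand zones a spatial index would normally be used instead.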
3.2.2. RoadC Zone Clusters
While relatively uncommon, certain freight hubs, ports, and industrial parks extend across multiple roadC zones. We compared the AOIs with roadC zones and found that, of the 7402 observed AOIs, 440 spanned more than one roadC zone. Additionally, heavy trucks may park or rest in nearby areas outside company boundaries due to spatial constraints or road conditions. To resolve this issue, a roadC zone clustering algorithm was developed based on the functional connections between roadC zones. The algorithm uses DBSCAN, a density-based clustering method, to group truck stops based on their spatial density and functional connectivity. Two criteria guided the grouping of roadC zones. For a truck $t$, the cluster of its $i$-th stop is denoted $c_{t,i}$.
Criterion 1 Functional Connectivity: For all truck stops in the same cluster $c$, the roadC zones in which these stops are located are functionally connected (simplified as connected).
Criterion 2 Transitive Connectivity: If roadC zone $r_1$ is connected with roadC zone $r_2$, and $r_2$ is connected to roadC zone $r_3$, then $r_1$, $r_2$, and $r_3$ are grouped as one cluster.
Following the two criteria, the connected roadC zones can be identified and grouped together to form a roadC zone cluster.
Figure 4 provides an example of connected roadC zones using the trajectory of a truck with plate number ‘A6****’ from 1 September 2023 to 7 September 2023. Different trajectory colors represent different dates, and dotted points within the dashed circle indicate the truck’s stop points, identified as belonging to the same cluster. These stops span two separate roadC zones: 3107 and 3114. Following the criteria, these zones were determined to be connected and grouped into one roadC cluster.
The detailed workflow is as follows:
Step 1 Initialization: Each roadC zone is initially treated as a separate cluster.
Step 2 Cluster Truck Stops: For each truck $t$, the DBSCAN algorithm is applied to its stops to obtain the cluster of each stop.
Step 3 Group RoadC Zones: For all stops of truck $t$ within the same cluster $c$, the roadC zones in which these stops are located are grouped together.
Step 4 Iterate Across Trucks: Steps 2–3 are repeated until all trucks are processed.
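One way to realize this workflow is a union-find (disjoint-set) structure over zone IDs, which handles the transitive criterion automatically. The sketch below assumes the per-truck DBSCAN labels are already available and that noise stops carry the label -1. The input layout `(truck_id, cluster_label, zone_id)` and the function name are hypothetical, and zone IDs are assumed to be integers.

```python
def group_zones(stop_records):
    """Map each roadC zone ID to the representative ID of its zone cluster.

    stop_records: iterable of (truck_id, cluster_label, zone_id);
    cluster_label == -1 marks a noise stop, which links no zones.
    """
    parent = {}

    def find(z):
        parent.setdefault(z, z)            # Step 1: every zone starts alone
        while parent[z] != z:
            parent[z] = parent[parent[z]]  # path halving
            z = parent[z]
        return z

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smaller ID as root

    first_zone = {}  # (truck, cluster) -> first zone seen for that cluster
    for truck, label, zone in stop_records:
        find(zone)
        if label == -1:
            continue
        key = (truck, label)
        if key in first_zone:
            union(first_zone[key], zone)   # Step 3: zones sharing a cluster are connected
        else:
            first_zone[key] = zone

    return {z: find(z) for z in parent}
```

Because unions accumulate across trucks, zones linked by different vehicles end up in the same roadC cluster, as in the Figure 4 example where zones 3107 and 3114 are merged.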
3.2.3. Stops Clustering and Stop Location Identification
After defining roadway clusters, truck stops within each roadC cluster are grouped using the DBSCAN algorithm, which requires selecting two parameters: minPoints and eps (distance threshold). Various parameter combinations were tested, and after comparing clustering results against the 7402 AOIs (see
Section 4.1 for more details), the optimal parameters were determined to be minPoints = 3 and eps = 90 m. This combination resulted in an acceptable False Alarm Rate (FAR), effectively balancing clustering accuracy and false alarms.
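To make the role of the two parameters concrete, the following is a compact, quadratic-time DBSCAN written from the standard algorithm description, applied separately inside each roadC cluster. It is a sketch only: the input layout `(group_id, x_m, y_m)` and the function names are assumptions, and a production pipeline would more likely rely on an indexed library implementation.

```python
from math import hypot


def dbscan(points, eps=90.0, min_points=3):
    """Label each (x, y) point with a cluster index (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = -1              # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point: joins but is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_points:   # core point: keep expanding
                queue.extend(nb)
    return labels


def stop_locations(stops, eps=90.0, min_points=3):
    """stops: list of (group_id, x, y). Returns (group_id, label) per stop, None for noise."""
    by_group = {}
    for idx, (g, _, _) in enumerate(stops):
        by_group.setdefault(g, []).append(idx)
    result = [None] * len(stops)
    for g, idxs in by_group.items():
        labels = dbscan([(stops[i][1], stops[i][2]) for i in idxs], eps, min_points)
        for i, lab in zip(idxs, labels):
            if lab != -1:
                result[i] = (g, lab)
    return result
```

Running DBSCAN per roadC cluster means that two stops closer than `eps` but separated by a boundary roadway are never merged, which is the behavior the roadway constraint is meant to provide.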
3.3. Trip Chain Identification
According to the trip chain definition in
Section 2, the base location refers to a place where trucks return after completing a series of delivery or pickup activities. Two criteria were employed to define the base:
Most Frequently Visited Location: Trucks often visit multiple stops during different trip chains but consistently return to a specific location, designated as the base.
Long Stop Durations Location: A stop with a dwell time exceeding a certain threshold is considered the start of a new trip chain and thus defined as the base. Previous studies [
23,
35] suggested a threshold of 240 min, which demarcates operational stops (e.g., rest, loading, unloading) from non-operational stops (e.g., overnight stays).
We analyzed the empirical data in Chongqing and adopted the broken power law method to determine an appropriate dwell time threshold. This mathematical model identifies a “break point” where a power law relationship changes its behavior; it was previously adopted in a study to identify the time threshold for differentiating temporary stops from freight-related stops [31]. By fitting truck dwell times to a broken power law, two thresholds were identified (Figure 5): 300 min, beyond which stops are related to rest or non-operational activities and mark the base or the start of a new trip chain; and 1000 min, beyond which stays are long-term, with trucks parked for extended periods without daily use.
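For reference, a generic continuous broken power law with two break points can be written as below. The functional form fitted in this study is not stated in the text, so the expression, the exponents $\alpha_1, \alpha_2, \alpha_3$, and the symbol $\tau$ for dwell time are illustrative assumptions; only the break points $\tau_1 = 300$ min and $\tau_2 = 1000$ min come from the text.

```latex
p(\tau) \propto
\begin{cases}
\tau^{-\alpha_1}, & \tau < \tau_1,\\[4pt]
\tau_1^{\,\alpha_2-\alpha_1}\,\tau^{-\alpha_2}, & \tau_1 \le \tau < \tau_2,\\[4pt]
\tau_1^{\,\alpha_2-\alpha_1}\,\tau_2^{\,\alpha_3-\alpha_2}\,\tau^{-\alpha_3}, & \tau \ge \tau_2,
\end{cases}
\qquad \tau_1 = 300\ \text{min},\quad \tau_2 = 1000\ \text{min}.
```

The prefactors are chosen so that the segments join continuously at each break point; on a log-log plot the curve appears as three straight segments whose slopes change at $\tau_1$ and $\tau_2$.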
With the above definition, the trip chain for each truck was identified through the following steps:
Identify Base Location: For each truck $t$, identify the most frequently visited location $L$ and set it as the base $b_t$.
Sort Stops: Sort all stops of truck $t$ in chronological order and iterate over these stops.
Define Trip Chains: Assume the existing trip chain is $C = (s_1, \dots, s_k)$. For a truck stop $s$, if $s$ is located at the base $b_t$, or the stop dwell time at $s$ is longer than 300 min, start a new trip chain $C' = (s)$. Otherwise, append the stop to the existing trip chain, $C = (s_1, \dots, s_k, s)$.
Iterate: Repeat the Define Trip Chains step for each subsequent stop until all stops are processed.
Using these steps, the trip chains for each truck are identified, enabling a detailed analysis of truck travel behavior.
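The procedure can be sketched for a single truck as follows. The stop layout `(arrival_min, dwell_min, location_id)` and the function name are assumptions for illustration; the 300 min default is the threshold reported above.

```python
from collections import Counter


def build_trip_chains(stops, dwell_threshold_min=300):
    """Split one truck's stops into trip chains.

    stops: list of (arrival_min, dwell_min, location_id).
    Returns (base_location, list_of_chains); each chain is a list of location IDs.
    """
    if not stops:
        return None, []
    # Base = most frequently visited stop location.
    base = Counter(loc for _, _, loc in stops).most_common(1)[0][0]

    chains, current = [], []
    for _, dwell, loc in sorted(stops):   # chronological order by arrival time
        if loc == base or dwell > dwell_threshold_min:
            if current:
                chains.append(current)    # close the chain in progress
            current = [loc]               # this stop opens a new chain
        else:
            current.append(loc)
    if current:
        chains.append(current)
    return base, chains
```

In this reading a chain starts at the base (or at a long-dwell stop) and lists the intermediate stops that follow; stops recorded before the first base visit form a leading chain that does not start at the base.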
3.4. Truck Profiling
We aimed to profile trucks by analyzing trip chains. We incorporated trip chain characteristics into consideration and classified the trucks based on their traveling patterns. To identify typical truck travel patterns, clustering analysis was conducted using selected features derived from trip chain characteristics, including the following:
- Temporal Variables: Average dwell time, dwell time variation.
- Spatial Variables: Average stop distances, trip chain radius.
- Travel Attributes: Number of intermediate stops, trip chain frequency.
Note that most of the travel attributes were defined based on the trip chains of each truck because a trip chain could contain multiple sub-trips and reflect the hidden travel structure.
Table 1 lists these attributes by their categories.
We tested different clustering algorithms including K-Means [
36], Agglomerative Clustering [
37], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [
34], Gaussian Mixture Models (GMMs) [
38], Mean Shift [
39], and Spectral Clustering [
40] to provide a comprehensive comparison of clustering performance. Further details of this comparison and evaluation strategy are provided in
Section 4.4.1. We finally adopted the K-Means clustering algorithm to uncover distinct truck travel patterns and provide a foundation for improving freight logistics.
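A minimal sketch of the K-Means step is given below, using only the standard library. The feature rows, the z-score standardization, and the fixed initial centres (chosen so the example is deterministic) are illustrative assumptions; the text does not state how features were scaled or how the algorithm was initialized.

```python
from statistics import mean, pstdev


def standardise(rows):
    """Z-score each feature column so that no single unit dominates the distance."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mu, sd)] for r in rows]


def kmeans(rows, init_idx, n_iter=100):
    """Lloyd's algorithm; init_idx lists the row indices used as starting centres."""
    k = len(init_idx)
    centres = [list(rows[i]) for i in init_idx]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centres[c])))
            for r in rows
        ]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:                   # keep the old centre if a cluster empties
                centres[c] = [mean(col) for col in zip(*members)]
    return labels, centres


# Hypothetical trucks: (average dwell time in min, intermediate stops, chains per day)
trucks = [[20, 1, 7], [25, 1, 6], [22, 1, 8], [300, 3, 1], [320, 4, 1], [280, 3, 2]]
labels, _ = kmeans(standardise(trucks), init_idx=[0, 3])
```

With these made-up rows, the frequent short-haul trucks and the infrequent long-dwell trucks fall into separate groups.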
4. Results & Analysis
In this section, we apply the proposed method to the heavy-duty vehicle GPS data described in
Section 2.1.
4.1. Parameter Selection
Merging nearby stops required a set of parameters to determine when two stops should be considered as one. Consecutive stops that were spatially and temporally close were merged to account for stop-and-go patterns, such as those occurring at parking facilities. Four parameters needed to be tested separately: the distance threshold, temporal threshold, time difference threshold, and speed threshold. The distance threshold was tested within a range of 100 m to 1300 m, with increments of 200 m. The temporal threshold ranged from 0.5 min to 5.5 min, the time difference threshold from 5 min to 35 min, and the speed threshold from 3 km/h to 10 km/h.
Figure 6 illustrates the sensitivity analysis results, where the y-axis represents the share of truck stops after merging.
Figure 6a shows that the distance threshold followed an elbow pattern. When the distance was between 100 m and 500 m, the share of stops decreased sharply with an increase in the distance threshold, but this reduction slowed after the threshold exceeded 500 m.
Figure 6b shows the sensitivity to the time threshold. The elbow range was observed between 1.5 and 3.5 min, indicating that selecting a time threshold within this range was acceptable. The speed and time difference thresholds were tested after stops were initially merged based on the first rule. The proportion of stops further merged was then calculated using these parameters. As shown in
Figure 6c, the share of stops decreased clearly within the speed range of 2 km/h to 6 km/h, suggesting that 6 km/h may be the optimal speed threshold. Finally,
Figure 6d shows an elbow at 15 min, which was taken as the optimal time difference threshold. Based on these change points, the parameters were set to 500 m for the distance threshold, 2.5 min for the temporal threshold, 6 km/h for the speed threshold, and 15 min for the time difference threshold.
To assess the robustness of these parameters, we conducted a sensitivity analysis on key parameter thresholds (speed, time, and distance) using the elbow method for threshold selection. The speed and time thresholds had minimal impact on stop merging, with only 3–4% of stops affected, indicating stable clustering results. However, varying the distance and temporal thresholds (ranging from 300 to 500 m and 1.5 to 2.5 min) caused slight variations in merged stops. Therefore, we tested different combinations of these thresholds and found that they had minimal impact on the final results. Further inspection revealed that even if some stops were not merged initially, the majority of them were grouped into the same cluster in subsequent clustering stages. Overall, these findings demonstrate that the framework remains stable and reliable across different threshold settings.
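The elbow points in this section appear to have been read from the curves in Figure 6. One common way to locate an elbow automatically is to take the point farthest from the straight line joining the first and last points of the curve, as sketched below. This heuristic is an assumption and not necessarily the procedure used in the study, and the share values in the example are made up for illustration.

```python
def elbow_index(xs, ys):
    """Index of the point farthest from the chord joining the curve's end points."""
    def scale(vals):
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) for v in vals]

    nx, ny = scale(xs), scale(ys)         # put both axes on a 0-1 scale
    x0, y0, x1, y1 = nx[0], ny[0], nx[-1], ny[-1]

    def gap(x, y):
        # Proportional to the perpendicular distance from (x, y) to the chord.
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)

    return max(range(len(xs)), key=lambda i: gap(nx[i], ny[i]))


# Hypothetical sensitivity curve: share of stops remaining after merging.
distance_m = [100, 300, 500, 700, 900, 1100, 1300]
share = [1.00, 0.80, 0.62, 0.60, 0.585, 0.575, 0.57]
best = distance_m[elbow_index(distance_m, share)]
```

For this made-up curve the heuristic selects 500 m, where the steep decline flattens out.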
4.2. Clustering Results and Validation
As noted in
Section 2.1, the raw dataset contained missing records. Trucks with missing data exceeding one hour were excluded from the analysis. After data cleaning and preprocessing, 94,345 unique trucks remained in the dataset. Chongqing City was divided into 3748 roadC regions, which were then processed using the roadC clustering algorithm. The final result included 3271 roadC clusters. Then, the stops of each truck were processed with the proposed road constrained clustering algorithm.
Previous studies often relied on land use data and satellite imagery for method validation, manually comparing clustering results with satellite images. However, land use data provide only general geographic information for polygonal areas with similar land use patterns and cannot differentiate between individual establishments or factories, limiting their application for detailed validation. Similarly, satellite imagery lacks precise boundaries for distinguishing between different factories or establishments. In this study, we address these limitations by using both AOI (Area of Interest) data and satellite images to validate the accuracy of the proposed road-constrained clustering method. Specifically, we used 7402 freight-related AOIs to assess the accuracy of our identification results and employed satellite imagery to verify the results visually.
Figure 7 illustrates an example of the clustering results along with the AOIs. In this figure, each colored dot signifies a distinct cluster. The areas enclosed by bold white boxes denote individual AOIs.
We compared the clustering results with the AOI boundaries and introduced two metrics to evaluate the results:
\[ \mathrm{FAR}_C = \frac{n_c}{N_C}, \qquad \mathrm{FAR}_A = \frac{n_a}{N_A} \]
where $n_c$ denotes the number of AOI grouping events, in which different AOIs are grouped into one cluster; $N_C$ is the total number of clusters; $n_a$ is the number of AOI split events, in which an AOI is split into more than one cluster; and $N_A$ is the total number of AOIs. FAR_C and FAR_A represent misclassification rates from two different perspectives. FAR_C captures the rate at which distinct AOIs are incorrectly grouped into a single cluster, while FAR_A measures the rate at which AOIs are erroneously split into multiple parts. These two types of misclassification affect downstream trip chain analysis in different ways. When a grouping error (counted by FAR_C) occurs, vehicle stops at physically and functionally distinct locations are incorrectly merged into the same location, leading to the loss of important distinctions between activity types (e.g., a warehouse vs. a fuel stop). This misclassification distorts the topological structure of the trip chain, affecting key elements such as the number of visited locations, inter-location transitions, and inferred relationships between activities. As a result, grouping errors can significantly alter the overall flow and meaning of trip chains, making them more disruptive to downstream analysis. In contrast, a split error (counted by FAR_A) fragments stops that actually belong to the same location, creating multiple pseudo-locations. This misclassification increases the apparent number of nodes in the trip chain and may inflate trip chain complexity. While the effects of split errors are more localized and primarily influence node-level statistics (such as visit frequency and dwell time), they do not fundamentally alter the overall sequence of activities. Consequently, FAR_C typically has a more significant impact on the semantic interpretation and connectivity of trip chains, whereas FAR_A mainly results in overestimated complexity without distorting the broader structure.
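The two rates can be computed from a list of (cluster, AOI) pairs, one per stop that falls inside a reference AOI. The sketch below reads FAR_C as the share of clusters that contain stops from more than one AOI and FAR_A as the share of AOIs whose stops fall into more than one cluster; counting one event per cluster or per AOI is an assumption, as are the input layout and the function name.

```python
from collections import defaultdict


def false_alarm_rates(pairs):
    """Return (FAR_C, FAR_A) from (cluster_id, aoi_id) pairs, one pair per stop.

    Stops outside every AOI and noise stops are expected to be filtered out
    by the caller; the input must contain at least one pair.
    """
    aois_in_cluster = defaultdict(set)
    clusters_in_aoi = defaultdict(set)
    for cluster_id, aoi_id in pairs:
        aois_in_cluster[cluster_id].add(aoi_id)
        clusters_in_aoi[aoi_id].add(cluster_id)

    n_c = sum(len(a) > 1 for a in aois_in_cluster.values())  # grouping events
    n_a = sum(len(c) > 1 for c in clusters_in_aoi.values())  # split events
    return n_c / len(aois_in_cluster), n_a / len(clusters_in_aoi)
```

For example, if cluster 2 covers AOIs B and C while AOI D is split between clusters 3 and 4, one of four clusters is a grouping event and one of four AOIs is a split event, so both rates equal 0.25.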
To evaluate the performance of the proposed method, we compared it with the benchmark DBSCAN method. The comparison results are shown in
Figure 8. From the figure, we observe that as the distance threshold increased, the FAR_A value (dotted line) decreased, while the FAR_C value (solid line) increased. Our goal was to achieve low values for both FAR_A and FAR_C, ideally balancing both to minimize errors in classification.
When comparing the proposed method with DBSCAN, we see that the proposed method was particularly effective in reducing the FAR_C value. Specifically, the solid purple line (representing DBSCAN) increases more sharply as the distance threshold grows, while the solid green line (representing the proposed method) shows a more gradual increase. This suggests that the proposed method is more robust at preventing different AOIs (such as establishments and hubs) from being mistakenly grouped into the same cluster. In contrast, the FAR_A values between the two methods did not show significant differences.
AOI data were used in this study as a reference for validating the clustering results of truck stop locations. This choice offers an interpretable and operationally relevant benchmark; however, several limitations should be noted. First, AOIs are typically defined based on land-use or administrative criteria, which may not align with the actual operational boundaries of freight activities. Vehicles operating within the AOI may be performing different activities, but these activities will be grouped together as serving the same stop, leading to potential misclassification. Another issue is the incomplete nature of AOI data, particularly for smaller or non-central facilities that are not represented as discrete AOIs. The unclear boundaries of these smaller facilities can complicate the validation of stops and their associated activity patterns. This ambiguity could lead to misclassifications, where distinct stops are merged or different stops are grouped together. In contrast, larger facilities with clearer boundaries are more likely to be validated accurately, which could result in a somewhat overly optimistic view of the trip chain accuracy.
4.3. Trip Chain Identification and Truck Profile
Using the proposed methods, a comprehensive list of trip chains was identified.
Figure 9 summarizes the distribution of intermediate stops within these trip chains and their corresponding shares. Notably, over 60% of trip chains include only one intermediate stop, representing direct trips in which trucks move from origin to destination without additional stops. This observation aligns with findings from previous studies [31].
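As an illustration of how such a distribution can be tabulated, the sketch below counts intermediate stops per trip chain and converts the counts to shares. It assumes each trip chain is stored as a list of stop identifiers that starts and ends at a base location; this representation, the function name `stop_count_shares`, and the sample chains are hypothetical.

```python
from collections import Counter


def stop_count_shares(trip_chains):
    """Share of trip chains by number of intermediate stops.

    Each chain is assumed to be [base, stop_1, ..., stop_n, base], so the
    number of intermediate stops is len(chain) - 2.
    """
    counts = Counter(len(chain) - 2 for chain in trip_chains if len(chain) >= 3)
    total = sum(counts.values())
    return {n: counts[n] / total for n in sorted(counts)}


# Hypothetical trip chains for illustration.
toy_chains = [
    ["B", "s1", "B"],
    ["B", "s2", "B"],
    ["B", "s4", "B"],
    ["B", "s1", "s3", "B"],
    ["B", "s1", "s2", "s3", "B"],
]
shares = stop_count_shares(toy_chains)
print(shares)
```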
The clustering results obtained using the K-Means algorithm demonstrated better differentiation among vehicle classes. Based on these results, vehicles were categorized into six distinct classes, as outlined in Table 2. Each column details the statistics for each of the six categories. This analysis highlights variations in truck travel patterns.
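The clustering step can be sketched as follows. This is a minimal, standard-library implementation of Lloyd's K-Means on standardized features, not the code used in this study; the two features (service radius and daily trips), the sample values, and the choice of k = 2 for the toy data are illustrative assumptions (the study uses six classes and a richer feature set).

```python
import math
import random


def _dist(p, q):
    """Euclidean distance between two coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(rows):
    """Z-score each feature column so that no single unit dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: _dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical trucks: (service radius in km, trips per day).
toy_trucks = [(5, 7), (6, 8), (4, 7), (90, 2), (95, 1), (88, 2)]
labels, centers = kmeans(standardize(toy_trucks), k=2)
print(labels)
```

Standardizing first matters here because the service radius spans tens of kilometers while daily trips span single digits; without it the distance computation would be driven almost entirely by the radius.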
Category 1 (Short-Distance Trucks): These vehicles operate over short travel distances and have the highest trip chain similarity, indicating that they typically serve fixed destinations. They have the highest daily travel frequency, averaging over 7 trips per day, and minimal intermediate stops, averaging only one per trip, which indicates involvement in full truckload transportation serving a single freight stop. Their trips map to concentrated areas around vehicle manufacturing hubs. Analysis of origins and destinations relative to nearby POIs reveals that these trucks predominantly serve the vehicle manufacturing industry, a key sector in Chongqing City. Chongqing’s automobile manufacturing industry demonstrates industrial concentration, with clusters of auto parts factories and logistics centers around vehicle manufacturing bases. These short-distance trucks frequently perform multiple round trips between these bases, supporting the region’s industrial ecosystem.
Categories 2 and 3 (Medium-to-Short-Distance Trucks): Chongqing is a municipality with over 20 sub-cities/regions within its jurisdiction. These trucks operate primarily within individual cities/regions in Chongqing, with an average trip chain radius of about 33 km. The similarity of the trip chains is about 0.4, indicating both variability and similarity in destinations. Trips typically involve 1 to 2 intermediate stops. These trucks rarely perform long-distance trips; they mainly serve the target industry and surrounding areas. The key difference between Categories 2 and 3 is that Category 2 trucks have a higher trip frequency and are active throughout the week, while Category 3 trucks operate only on specific days and remain at their base for the rest of the week, indicating underutilization.
Categories 4 and 5 (Medium-to-Long-Distance Trucks): These trucks mainly connect different regions/cities in Chongqing, covering a broader service radius of around 90 km per trip chain. They have lower trip chain similarity than the shorter-distance categories, reflecting greater variability in served destinations.
The difference between C4 and C5 is that trucks in C4 have a longer average stop time at their intermediate stops, averaging 2 h, the highest among all groups. These trucks also typically make 2 stops per trip chain, balancing between frequent stops and longer durations. Trucks in C5, on the other hand, have an average stop time of 57 min. They also have relatively longer trip chains, with over 5 intermediate stops per trip chain. A closer look at the data reveals that these trucks typically have multiple bases, frequently visiting a secondary base in addition to their primary base.
Category 6 is named long-distance trucks because it is characterized by the longest travel distance among all the six categories. These trucks primarily serve intercity and cross-regional routes, starting from Chongqing and reaching major destinations such as Chengdu, Beijing-Tianjin, Shanghai, and Guangzhou. On average, these trucks have 3 intermediate stops per trip. These trucks have low trip chain similarity, reflecting diverse destinations and operational flexibility. These trucks usually originate from logistics hubs or transshipment centers.
Figure 10 presents density plots for several selected features of trucks, offering further insights into truck behavior patterns.
Figure 10a illustrates the arrival times (hour of the day) at intermediate stops. The data reveal that truck operations are minimal during nighttime hours (12:00 a.m. to 5:00 a.m.). Arrival times exhibit two peaks: one around 10:00 a.m. and another around 3:00 p.m., indicating distinct activity periods.
Figure 10b depicts the average number of daily trips made by each group of trucks, where a trip is defined as travel between two consecutive stops. Short-distance trucks (C1), medium-short distance trucks (C2), and medium-long distance trucks (C5) exhibit a higher number of daily trips compared to other groups. Notably, trucks in group C5 also demonstrate the highest average number of intermediate stops per trip chain. This suggests that these trucks may be engaged in more complex operations, possibly connecting multiple locations or handling intricate logistical tasks within a single trip chain.
Figure 10c highlights the average dwell time at intermediate stops, where the C4 truck group exhibits a unique pattern. For this group, dwell times have two peaks: one around 15 min and another at approximately 100 min. This distribution differs significantly from other groups, indicating distinct operational behaviors.
Figure 10d displays the average dwell time at base locations, showing two peaks across all truck groups. The first peak occurs around 50 min, likely reflecting typical loading or unloading activities at the base location during daytime. The second, significantly longer peak suggests activities such as overnight parking or extended rest periods.
4.4. Robustness Analysis
We conducted a robustness analysis to evaluate the performance of different clustering approaches for identifying truck groups based on travel attributes, using widely recognized indicators. We also acknowledge the potential impact of data loss on the clustering results. Sensitivity analyses were performed to assess how varying levels of missing data affect the results.
4.4.1. Clustering Approach Comparison
The performance of various clustering algorithms was assessed using three widely recognized evaluation metrics: Silhouette Score [41], Davies-Bouldin Index [42], and Calinski-Harabasz Index [43], as summarized in Table 3. Silhouette Score measures how similar each point is to its own cluster compared to other clusters, with values ranging from −1 to 1. Higher values indicate better clustering, with values close to 1 showing well-separated clusters, while values near 0 or negative suggest poor clustering. Davies-Bouldin Index assesses the average similarity between clusters, with lower values indicating better separation. Calinski-Harabasz Index calculates the ratio of between-cluster dispersion to within-cluster dispersion, where higher values reflect better-defined and more separated clusters. The algorithms compared include K-Means, Agglomerative Clustering, DBSCAN, Gaussian Mixture Model (GMM), Mean Shift, and Spectral Clustering.
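For readers who wish to see how the three indices behave, the sketch below implements their standard textbook definitions with the Python standard library and evaluates them on a toy example. It is illustrative only: the function names and sample points are assumptions, the code expects at least two clusters and non-duplicate points, and the values reported in Table 3 were not produced with it.

```python
import math
from collections import defaultdict


def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _groups(points, labels):
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    return groups


def _centroid(members):
    return tuple(sum(c) / len(members) for c in zip(*members))


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    groups = _groups(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(_dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(_dist(p, q) for q in other) / len(other)
                for other_lab, other in groups.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation."""
    groups = _groups(points, labels)
    cents = {lab: _centroid(m) for lab, m in groups.items()}
    scatter = {lab: sum(_dist(p, cents[lab]) for p in m) / len(m)
               for lab, m in groups.items()}
    ratios = [max((scatter[i] + scatter[j]) / _dist(cents[i], cents[j])
                  for j in groups if j != i) for i in groups]
    return sum(ratios) / len(ratios)


def calinski_harabasz_index(points, labels):
    """Between-cluster dispersion over within-cluster dispersion."""
    groups = _groups(points, labels)
    n, k = len(points), len(groups)
    overall = _centroid(points)
    between = sum(len(m) * _dist(_centroid(m), overall) ** 2
                  for m in groups.values())
    within = sum(_dist(p, _centroid(m)) ** 2
                 for m in groups.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))


# Two tight, well-separated pairs; "good" labels follow them, "bad" do not.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
for name, fn in [("silhouette", silhouette_score),
                 ("davies-bouldin", davies_bouldin_index),
                 ("calinski-harabasz", calinski_harabasz_index)]:
    print(name, fn(pts, good), fn(pts, bad))
```

On this toy data the labeling that follows the two pairs scores higher on Silhouette and Calinski-Harabasz and lower on Davies-Bouldin than the labeling that cuts across them, which is the direction of "better" used in the comparison above.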
The results show that K-Means, Agglomerative Clustering, GMM, and Spectral Clustering all performed equally well, each achieving a Silhouette Score of 0.79, indicating strong clustering with well-separated and compact clusters. They also achieved a Davies-Bouldin Index of 0.29, the lowest among all methods, signifying good separation, and a Calinski-Harabasz Index of 5742.04, the highest, indicating very good separation and compactness. The next best method was Mean Shift, with a lower Silhouette Score of 0.60, suggesting less distinct clusters but still adequate separation. Mean Shift also had a higher Davies-Bouldin Index and a lower Calinski-Harabasz Index than these four methods, indicating more overlap between clusters and weaker clustering performance. DBSCAN performed the worst on all three metrics.
Given these results, any of K-Means, Agglomerative Clustering, GMM, or Spectral Clustering could be used. However, K-Means was ultimately selected for its consistent performance, efficiency, and scalability. Compared to Agglomerative Clustering, GMM, and Spectral Clustering, K-Means is more computationally efficient, scales better with larger datasets, and is easier to implement and interpret.
4.4.2. Analysis of Clustering Stability Under Missing Data
To evaluate the impact of potential data missingness and ensure the robustness of our clustering results, we conducted a systematic stability analysis. This procedure assessed whether the identified vehicle behavior patterns remained consistent when only a subset of the data was available. We employed a sub-sampling approach in which random subsets ranging from 10% to 90% of the total population were drawn from the original dataset. For each sub-sample, the K-Means clustering algorithm was re-applied independently. To quantify the stability, we compared the cluster assignments of the sub-sampled data against the baseline assignments derived from the full dataset.
A critical challenge in clustering stability is the label switching problem, where the same behavioral group may be assigned different cluster IDs across different runs. To overcome this, we utilized the Adjusted Rand Index (ARI) as the primary evaluation metric [44]. The ARI measures the similarity between two groupings of the same data and is widely used in machine learning for stability analysis because it focuses on the clustering structure, not on arbitrary cluster labels. ARI compares every pair of data points, checking whether they are in the same cluster in both groupings. A pair counts as a match when its two points are in the same cluster in both groupings or in different clusters in both groupings. The adjusted part accounts for chance agreements, ensuring that the score reflects meaningful similarity. An ARI of 1 indicates a perfect match, values near 0 indicate agreement no better than random clustering, and negative values indicate agreement worse than chance.
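The pair-counting logic can be sketched with the standard contingency-table formula for the ARI. The implementation below uses only the Python standard library; the function name and the sample label vectors are illustrative, and the snippet is a sketch rather than the code used in this study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI from the contingency table of two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")
    # Pairs placed together in both groupings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # chance agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both groupings are trivial
    return (together_both - expected) / (maximum - expected)


baseline = [0, 0, 1, 1, 2, 2]
relabeled = [2, 2, 0, 0, 1, 1]   # same groups, different cluster IDs
shuffled = [0, 1, 0, 1, 0, 1]    # groups that cut across the baseline

print(adjusted_rand_index(baseline, relabeled))
print(adjusted_rand_index(baseline, shuffled))
```

The first comparison illustrates why label switching is harmless for this metric: the cluster IDs differ, but every pair of items is grouped the same way, so the score is 1. The second grouping agrees with the baseline less often than chance would, so its score is negative.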
Figure 11 summarizes the ARI scores across different levels of missing data. As shown, the ARI scores remained consistently high across all sampling levels, demonstrating the robustness of the clustering results. Even when only 10% of the data was available, the ARI score was still above 0.86, and for samples exceeding 50%, the scores consistently exceeded 0.95. These results suggest that the clustering method was resilient to the absence of data and that the truck behavior clusters were not overly sensitive to the specific data points included in the analysis. The stability of the ARI scores, even with progressively larger amounts of missing data, indicates that the variables selected for the clustering process captured strong, underlying patterns in truck travel behavior. This highlights the robustness of the identified truck categories, ensuring that they reflected generalizable, empirical groupings of truck travel behavior rather than being contingent on specific data points.
4.5. Trucks and Industries
In this section, we analyze the potential industries served by each group of trucks. Each truck was assigned an industry category by the data provider, covering a total of 19 distinct industry classifications.
Figure 12 illustrates the proportion of each industry served by trucks across different classes. Each cell in the figure indicates the share of trucks within a specific truck group that served the corresponding industry. This section presents a descriptive analysis of the industries served by different truck categories based on the proportions observed in the data, without making direct claims about their operational significance.
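The cell values described above amount to a row-normalized cross-tabulation of truck group against industry. A minimal sketch is given below, assuming one (truck group, industry) pair per truck; the function name `industry_shares` and the sample records are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def industry_shares(trucks):
    """Share of trucks in each group that serve each industry.

    trucks: iterable of (truck_group, industry) pairs, one per truck.
    Shares within each group sum to 1.
    """
    counts = defaultdict(Counter)
    for group, industry in trucks:
        counts[group][industry] += 1
    return {group: {ind: c / sum(cnt.values()) for ind, c in cnt.items()}
            for group, cnt in counts.items()}


# Hypothetical records for illustration only.
toy_records = [
    ("C1", "automobile"), ("C1", "automobile"),
    ("C1", "automobile"), ("C1", "logistics"),
    ("C6", "logistics"), ("C6", "logistics"),
    ("C6", "commerce"), ("C6", "food"),
]
table = industry_shares(toy_records)
print(table["C1"])
print(table["C6"])
```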
Short-Distance Trucks: A significant majority of short-distance trucks served the automobile manufacturing industry, a key sector in Chongqing characterized by a high level of industrial complementarity. This reflects the localized nature of automobile production and the proximity of related supply chain operations.
Medium-to-Short-Distance Trucks: A high proportion of trucks in this category transport building materials like cement and concrete, supporting construction projects. Other industries, such as agriculture, minerals, and machinery, also make up significant shares.
Medium-to-Long-Distance Trucks and Long-Distance Trucks: A significantly high proportion of trucks in these categories serve logistics companies, handling goods that require transportation over longer and even intercity distances. For example, over a quarter of intercity trucks are dedicated to logistics companies, underscoring the importance of freight and logistics hubs in facilitating intercity transportation. Conversely, very few long-distance trucks are utilized for building materials, such as cement or concrete, as these are typically transported shorter distances due to cost and practicality constraints.
Several industry-specific observations can be summarized as follows:
Logistics Sector: Trucks serving logistics companies constitute the largest proportion across all categories, highlighting the central role of logistics in truck utilization.
Commerce, Automobile Manufacturing, and Food Industries: The share of intercity trucks serving these industries is relatively high compared with other truck categories, reflecting their demand for longer-range transportation networks.
Other Industries: Certain industries, like automobile manufacturing, require both short-distance and long-distance truck services, indicating a complex supply chain with localized production and intercity distribution components.
4.6. Transferability Discussion
This section discusses the transferability of the proposed framework across diverse geographic and operational contexts. The methodology comprises three critical components, each designed for generalization to other cities, regions, or freight systems. While the overarching structural framework is broadly adaptable, certain parameters are intended to be calibrated to account for local geographic contingencies and specific industrial layouts. These three pillars are summarized as follows:
Stop Location Identification Component: The methodology used for stop identification, based on GPS data, is broadly applicable and can be generalized to other urban areas with similar data sources. While the Road-Constrained clustering technique is designed for regions with established road hierarchies, its direct implementation may face challenges in areas with inconsistent network structures. In such contexts, the framework maintains its robustness by cross-referencing Area of Interest data with OpenStreetMap attributes. This comparative approach facilitates the assessment of classification suitability and ensures that the most representative road hierarchy is selected for the local environment.
Trip Chain Identification Component: The methodologies for base location detection and power law-based dwell thresholding are designed for deployment with empirical data across diverse geographic contexts, ensuring the framework’s broad adaptability. Specifically, the trip chain identification logic offers high flexibility, allowing for seamless customization to align with the unique operational characteristics of different urban freight systems.
Vehicle Behavior Analysis Component: While the empirical findings reflect the local context of Chongqing, the underlying analytical framework—specifically the feature engineering and clustering techniques—is highly transferable. Sensitivity tests demonstrate that the method remains reliable despite data loss, ensuring its efficacy in capturing truck travel behaviors across different data qualities.
In summary, while the framework’s performance may fluctuate across different regions due to variations in road network structures and data quality, its core methodologies remain robust. Although specific numerical thresholds derived in this study may require localized calibration, the underlying threshold selection protocols are universally applicable. Successful implementation in alternative urban contexts necessitates high-fidelity GPS trajectories and accurate road network data, such as those from OpenStreetMap. Furthermore, integrating AOI data is essential for cross-regional validation. By accounting for these local contingencies and ensuring data integrity, this framework provides a scalable and adaptable solution for urban freight management.
5. Discussion and Conclusions
In recent years, the need for sustainable urban freight systems has become increasingly urgent as cities grapple with growing congestion, environmental challenges, and the need for more efficient logistics. However, data to support these goals is often lacking. To address these challenges, this study proposes a method for mining the trip chain patterns of heavy-duty vehicles (HDVs) using GPS data. This approach offers significant potential to enhance sustainable urban freight management by providing a clearer understanding of truck movement patterns.
A road-constrained clustering approach was developed to identify truck stop locations, addressing the limitations of traditional clustering methods. This technique ensures more accurate differentiation of stop locations, thereby enhancing the reconstruction of truck trip chains. Results from a comparison of over 7000 AOIs show that the proposed method is more effective at preventing different AOIs (such as establishments and hubs) from being mistakenly classified as the same stop location. Based on these identified stop locations, a procedure was designed to identify base locations for each truck and to extract trip chains, creating a robust database to support freight planning and modeling.
Key trip chain characteristics of HDVs were extracted and analyzed using GPS data from Chongqing. The results indicate that, on average, heavy trucks in Chongqing spend 63 min at intermediate stops, with an average service radius of approximately 76 km. Based on trip chain characteristics, HDVs were clustered into six classes, which fall into four main categories: short-distance, medium-short distance, medium-long distance, and long-distance trucks. The analysis revealed that the service range plays a significant role in classification, with trucks of similar service ranges exhibiting comparable trip chain patterns. Long-distance trucks primarily serve intercity and cross-regional routes. Further analysis of industry data showed that over one-quarter of long-distance trucks are engaged in logistics services. Medium-short and medium-long distance trucks are primarily involved in intra-city movements. Medium-short distance trucks typically serve nearby destinations, with construction and building materials among the main industries served. In contrast, medium-long distance trucks connect regions or cities within Chongqing, with a broader service radius of approximately 90 km per trip chain.
Understanding these patterns reveals important opportunities to optimize urban freight systems by improving the efficiency of truck movements, reducing unnecessary stops, and streamlining routes. For example, identifying long-distance and medium-short distance truck behaviors helps pinpoint the most congested areas and timing for freight transport. These insights can be leveraged to reduce traffic congestion and promote the use of more environmentally friendly routes, leading to a decrease in fuel consumption and emissions. By making urban freight operations more efficient, this study contributes to the development of sustainable transport systems that minimize environmental impact while maintaining the flow of goods essential to the economy.
However, this study has several limitations. First, the analysis was based on short-term data collected over just seven days, which does not capture seasonal variations or long-term evolutionary patterns in freight behavior. Additionally, the empirical study was conducted in Chongqing; comparing data from multiple cities could offer a more comprehensive understanding of freight behavior, highlighting both common trends and regional differences that may arise due to varying geographic, economic, or infrastructural factors.
Several future research directions can be explored. First, studies could expand on the trip chain data from this study by incorporating additional factors that influence the selection of trip chain patterns. Investigating the drivers behind these patterns (e.g., operational constraints and geographic or economic factors) could provide deeper insights into truck route planning. Identifying these underlying influences could improve the understanding of how trucks optimize their trips within complex logistics networks. Additionally, future research could explore how trip chain destinations are selected. Analyzing the impact of industries, supply chains, and logistics hubs on truck destinations would enhance understanding of freight flow dynamics, informing more accurate predictive models and improving sustainable urban freight management. Finally, while this study excludes non-freight-related stops, these stops can significantly influence long-distance freight operations by affecting travel schedules and delivery performance. Future research could integrate non-freight stops into trip chain analysis, exploring their interaction with freight activities. Classifying these stops may reveal their impact on operational efficiency and help refine models for predicting truck movement, ultimately optimizing logistics networks and improving freight flow management.