1. Introduction
The increasing need for efficiently moving large quantities of goods and people over substantial distances within reasonable timeframes has highlighted the importance of reliable traffic and transportation planning. As a result, there has been a significant focus on planning and implementing transportation initiatives in recent history. Today, transportation investment and effective planning have become an integral part of economic and development strategies. These plans often include developing transit systems, pedestrian and cyclist facilities, and measures to manage transportation demand.
Transportation planning, as the initial step in the implementation of transportation systems, constitutes a significant technical process that heavily relies on computer models and advanced tools to simulate the intricate interactions inherent in transportation system performance. Effective transport planning is vital for enhancing the livability and efficiency of cities, thereby ensuring their preparedness for the challenges of future eras. This intricate transportation planning process often necessitates engagement with a broad spectrum of stakeholders, primarily focusing on infrastructure users. The fundamental action in planning transport initiatives is the measurement and analysis of people’s movements. Comprehensively scrutinizing these movements is critical to developing transportation infrastructure and services that operate efficiently while minimizing the burden on end-users.
Typically, information on human travel activity or movements may be acquired using different approaches, such as survey-based, passive, activity-based, and device-based methods. Survey-based approaches gather data from people participating in a survey, using instruments or procedures that ask one or more questions that may or may not be answered. Researchers conduct statistical surveys to make inferences about the population being studied. Ordinary travel surveys collect human activity, preference, and behavioral data through questionnaires. The accuracy of survey-based approaches depends heavily on the memory of each participant. Additionally, conducting a survey is expensive and laborious. As a result, surveys are conducted infrequently, often with small sample sizes and an increased risk of sampling bias. A low response rate due to response burden is another common problem, although surveys can readily capture certain complex data types, such as preferences or opinions. Moreover, existing studies based on Call Detail Records (CDRs) generate only screen line flows without origin–destination (OD) details.
As discussed above, the characteristics of each approach, including data availability, accuracy, scalability, cost, privacy concerns, and information content, vary drastically. Therefore, careful consideration should be given when selecting the data type and acquisition method. In recent decades, researchers have increasingly turned to using time-stamped location data derived from mobile network providers, combining aspects of both passive and active approaches in modeling human behavior. This data type is commonly referred to as Mobile Network Big Data. Technological advancements, including improved data retrieval, processing, and storage facilities, have enhanced the usability of such big data, motivating researchers to incorporate these datasets into their studies.
This study presents a novel approach to integrating CDR data into the four-step travel demand modeling framework (Figure 1). Modeling travel demand typically begins with household surveys to collect transportation data, such as trip frequency, origins, destinations, and times. The initial phase of the traditional transportation model estimates the total number of trips originating from and destined for each zone within a region, classifying them based on their purpose, such as Home-Based Work, Home-Based Other, or Non-Home-Based trips. In the second stage, these trips are processed into an origin–destination (OD) trip matrix, disaggregated by travel modes, such as private vehicles, public transport, walking, and cycling. The final step is route assignment, where trips between each OD pair by mode are estimated and loaded into the transport network to determine the total number of trips for each route [1,2,3].
Within this study, we apply a stay-based methodology to extract significant user locations (e.g., Home, Work, and other) and define trips based on regularity and frequency of cell tower appearances. These trip patterns are then aggregated into origin–destination matrices and validated against a comprehensive household travel survey conducted in Sri Lanka. The final stage involves assigning inferred trips to a road network using a route choice model, which utilizes the sequence of cell tower connections made during commuting—referred to as en-route cell data—to approximate users’ likely travel paths.
By combining mobile network data with traditional modeling principles, this research demonstrates that CDR-based methods can yield comparable accuracy to survey-based models while offering significant advantages in scalability, cost, and temporal coverage.
2. Background and Related Work
Call Detail Records (CDRs) have been extensively used in mobility research to understand individual movement patterns and broader human mobility trends [4,5,6,7,8,9,10,11,12]. These records, collected passively from mobile phone activity, enable researchers to reconstruct individual trajectories across space and time. Such trajectories, shaped by daily routines and activity participation, define a person’s activity space [7,13]. This concept has been central to many mobility studies, where CDR data is used to infer trips, identify frequently visited locations, and analyze travel behaviors. The literature using CDR data to model human mobility broadly falls into three categories: OD matrix estimation, significant location identification, and route assignment.
2.1. Origin–Destination Matrices
A large body of research has explored how to derive OD matrices from raw CDR traces. Fekih et al. [14] proposed identifying stationary periods in user traces to determine activity zones, forming the basis for trip detection. Similarly, Mamei et al. [15] inferred OD pairs from sequences of CDR appearances, relying on temporal and spatial transitions. These approaches generally use a trip-based model, segmenting travel behavior based on location transitions within specific time windows.
Other studies, such as Bwambale et al. [16], introduced latent demographic modeling, weighting trip generation by inferred socio-demographic attributes derived from mobile phone usage characteristics. While such methods enhance behavioral realism, they often lack direct validation. Meanwhile, studies in Boston, San Francisco, and Dhaka [5,17] constructed tower-level transient OD matrices, converting them to node-to-node networks for routing, yet often without addressing accuracy at the individual level.
2.2. Mobility Pattern Recognition and Significant Location Identification
This category of research focuses on identifying anchor locations (e.g., Home, Work) around which users organize their travel. These locations are typically inferred by analyzing appearance regularity at specific cell towers. Luo et al. [18], for example, combined CDR and road network data to reduce tower-level noise and extract Home/Work locations using trajectory regularity. Other studies used clustering algorithms to group spatially adjacent towers [19] or time-weighted frequency scores to detect commonly visited places [20,21].
Leng et al. [22] proposed a geo-temporal matrix approach, using eigen decomposition to identify recurring spatiotemporal patterns. These methods often assume that trip behavior is governed by routines, but they may apply uniform thresholds across all users, potentially missing individual variation in location behavior.
2.3. Route Estimation from CDR Data
CDR-based route estimation involves mapping inferred OD pairs onto a transport network. The first step typically involves generating a set of feasible routes, often using OpenStreetMap data [23,24]. Then, a flow distribution algorithm assigns trips to these routes. One common method is the use of defined cell paths, where sequences of cell tower connections are used to approximate the chosen path [25].
2.4. Research Gap and Objectives
Despite the breadth of CDR-based mobility studies, key limitations persist:
OD estimation and route assignment methods often ignore individual-level behavioral variations, relying instead on aggregated transitions.
Route choice is typically modeled based on tower frequency or flow density, without considering parameters such as appearance regularity, call frequency, or user-specific trip patterns.
Few studies validate CDR-derived mobility outputs against traditional travel survey data at a granular (zonal or individual) level.
This study addresses these gaps by introducing three key innovations:
A user-specific regularity model that detects significant locations (e.g., Home, Work, and other) using temporal and frequency-based indicators.
An individual caller-based OD estimation approach, which constructs personalized travel demand profiles rather than relying on aggregated cell transitions.
A route assignment methodology using en-route cell sequences, which calculates route alignment probabilities for each user based on observed call paths—thereby modeling route choice with individual-level granularity.
Through this framework, we demonstrate that privacy-preserving CDR data can be repurposed as a scalable and behaviorally nuanced alternative to traditional survey-based methods for travel demand estimation.
3. Study Area and Data
This study uses Call Detail Records (CDRs) from a mobile operator in Sri Lanka, provided by LIRNEasia, a regional ICT policy think-tank. The data is pseudonymized to ensure user privacy. Each record includes a user ID, call time, cell tower connection, and call duration, enabling geographic location association. The dataset covers 400,000 callers over the month of June 2013, totaling 89 million call records, focusing on the Western Province of Sri Lanka.
The primary data source for validation is transportation-related data collected through the Household Visit Survey (HVS) conducted in 2013 as part of the CoMTrans study in the Western Province of Sri Lanka [26]. The HVS data includes transportation information, such as socio-demographic records, travel times, trip purposes, and travel modes, and is used to validate the results from the CDR trip analysis. The survey used a face-to-face interview method and covered 44,000 households, with a response rate of 4%.
To ensure comparability between the CDR-based outputs and the HVS data, CDR tower locations were mapped to corresponding administrative zones (districts and Divisional Secretariat Divisions: DSDs). Aggregation was performed to align the granularity of CDR data with the travel zones used in the household survey. Differences in temporal coverage and behavioral representation were addressed by validating trip distributions and route assignments at both the macro (district) and micro (DSD) levels. Statistical methods, including correlation analysis and outlier tests, were applied to evaluate the consistency between the two datasets.
4. Methodology
Figure 2 outlines the methodology, starting with a CDR dataset of 400,000 callers collected over a month. Geospatial data on cell tower locations and administrative boundaries are foundational inputs, and the Household Visit Survey (HVS) dataset is integrated for validation. Initially, user coordinates are combined with timestamps to reveal digital movements. Records are categorized into Home, Work, and other points based on location patterns, and trips and stays are delineated from the dataset. Results are validated against traditional data at administrative zonal levels, showcasing the refined CDR-based Travel Demand Estimation Model in four stages.
Before demand modeling, it is crucial to filter the noise created by tower-to-tower call balancing. This load-sharing effect, in which users appear to move because of operator-based call balancing rather than actual physical movement, creates false movements. To correct for it, this study applies a speed-based filtering technique. The method assumes that speeds exceeding 40 km/h between two consecutive cell tower connections within a short time interval are implausible under typical urban conditions and are likely artifacts of load balancing. By calculating speeds from tower centroid coordinates and elapsed time, the approach identifies and removes these anomalies. A range of speed thresholds (10–90 km/h) was evaluated at the Divisional Secretariat Division (DSD) level, using Kullback–Leibler divergence to compare CDR-inferred locations with household travel survey data. The threshold of 40 km/h minimized distributional error, supporting its use as a cutoff for filtering out spurious records. This ensures that only legitimate movement patterns are retained for downstream mobility analysis and model estimation [27,28].
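The speed-based filter and the divergence measure described above can be illustrated with a minimal Python sketch. The function names, the record layout (timestamp in seconds, tower latitude, tower longitude), and the choice to compare each record with the last retained one are assumptions made for illustration; the sketch also ignores the "short time interval" qualifier and is not the study's actual implementation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def filter_load_sharing(records, max_speed_kmh=40.0):
    """Drop records whose implied speed from the last kept record is implausible.

    records: one user's (timestamp_s, lat, lon) tuples, sorted by time, where
    lat/lon are tower centroid coordinates (hypothetical layout).
    """
    kept = []
    for rec in records:
        if not kept:
            kept.append(rec)
            continue
        t0, lat0, lon0 = kept[-1]
        t1, lat1, lon1 = rec
        dist_km = haversine_km(lat0, lon0, lat1, lon1)
        hours = (t1 - t0) / 3600.0
        if dist_km == 0.0:
            kept.append(rec)  # same tower: no movement implied
        elif hours > 0 and dist_km / hours <= max_speed_kmh:
            kept.append(rec)  # plausible movement
        # otherwise: treated as a load-sharing artifact and skipped
    return kept

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same zones."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

Under these assumptions, a threshold sweep would apply `filter_load_sharing` with each candidate speed, build the zonal distribution of CDR-inferred locations, and keep the threshold with the smallest `kl_divergence` against the survey distribution.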
4.1. Stage 1: Trip Attraction and Generation
With CDR data, trips are identified by first pinpointing anchor locations. Origins and destinations are determined based on consistent appearances at these points. Frequent locations, like Home and Work, alongside other significant spots, serve as trip generation and attraction points, and movement between these points signifies a trip. Identifying these locations is crucial as they generate more calls. Each individual’s behavior divides cell towers into regularly and irregularly visited ones, with regularity varying from person to person. Once identified, regularly visited cells, such as Home or Non-Home, are labeled as significant locations.
Identifying Home and Work Locations
The most frequently visited daytime location is referred to as “Work”, while the most frequent nighttime location is identified as “Home” [5,6]. To determine the Home location more accurately, we compare the top nighttime locations during weekdays and weekends. If these locations match closely, they are reliably labeled as Home. Specifically, the cell tower most often connected to between 8:00 PM and 4:00 AM is designated as the Home location. In contrast, Work locations are defined as places users regularly visit during weekday daytime hours but rarely during weekends. These locations are assumed to represent workplaces where users spend long periods. Therefore, the Work location is determined by identifying the cell tower most frequently accessed during typical office hours (10:00 AM–4:00 PM) on weekdays.
It is important to note that the “Work” classification includes both workplaces and educational institutions, such as schools and universities. Due to the anonymized nature of CDR data, it is not possible to distinguish students from working individuals. As a result, frequent daytime locations during office hours (10:00 AM–4:00 PM) are collectively treated as Work/Study locations.
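The Home and Work rules above can be sketched in a few lines of Python. The function names and the record layout (a `datetime` paired with a tower identifier) are hypothetical, and the weekday/weekend cross-check for Home is left out to keep the example short.

```python
from collections import Counter
from datetime import datetime

def most_common_tower(towers):
    """Return the most frequent tower ID, or None if there are no records."""
    counts = Counter(towers)
    return counts.most_common(1)[0][0] if counts else None

def infer_home_work(records):
    """records: one user's (datetime, tower_id) pairs (hypothetical layout).

    Home: tower seen most often between 8:00 PM and 4:00 AM.
    Work: tower seen most often on weekdays between 10:00 AM and 4:00 PM.
    """
    night, office = [], []
    for ts, tower in records:
        if ts.hour >= 20 or ts.hour < 4:
            night.append(tower)
        if ts.weekday() < 5 and 10 <= ts.hour < 16:  # Monday=0 ... Friday=4
            office.append(tower)
    return most_common_tower(night), most_common_tower(office)
```

A user with no calls in one of the windows gets `None` for that location, which a fuller implementation would need to handle explicitly.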
Once the Home and Work locations are identified, it is critical to specify significant locations that are neither Home nor Work. This study uses the appearance regularity of the cell locations to explore the above.
Identifying Significant Other Locations: Home–Work
This study assesses how consistently users visit each cell tower location using three factors: (a) consistency based on the time of day, (b) consistency based on the day of the week, and (c) consistency based on the day of the month. A day is divided into 10 time intervals (T1 to T10), such as T1 = 6–8 AM, T2 = 8–10 AM, and so on up to T10 = 4–6 AM. The regularity of a user (n) being present at a specific cell (x) during time slot Ti is proportional to the following:
(a) Total number of days user n visited cell x during Ti.
(b) Number of days per week the user visited cell x during Ti.
(c) Number of days per month the user visited cell x during Ti.
Hence, the user’s regularity at cell x during Ti is proportional to a × b × c.
After calculating regularity for all cells and time intervals, the cell with the highest regularity is identified as a significant location. This process is personalized for each user. To accomplish this, this study uses Mathematical Linkage Analysis (MLA), a graph theory-based technique that evaluates how closely a user’s movement pattern aligns with an ideal regular pattern across cells [29]. It calculates the R-squared value between actual and ideal patterns for different cell groupings. The configuration with the highest R-squared identifies the most significant locations, excluding Home and Work. These significant places form the first stage of a four-part model, where Home, Work, and other major locations act as points of origin and destination for user movement, defining their travel patterns.
4.2. Stage 2: OD Matrix Estimation
The goal of the second stage is to match the trip origins and destinations identified in the previous step to form origin–destination (OD) pairs. This results in a trip matrix (Tij) categorized by trip purpose. The classification of trip purpose depends on the nature of the origin and destination locations. If a trip starts at Home and ends at Work (or vice versa), it is labeled a Home-Based Work trip. If the trip starts or ends at Home but the other location is a significant place other than Work, it is classified as a Home-Based Other trip. Trips where neither end is Home are categorized as Non-Home-Based trips.
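The classification rules above translate directly into code. The following Python sketch is illustrative only: the location labels ("Home", "Work", "Other"), the tuple layout, and the function names are assumptions.

```python
from collections import Counter, defaultdict

def trip_purpose(origin_type, dest_type):
    """Classify a trip as HBW, HBO, or NHB from the labels of its two ends."""
    ends = {origin_type, dest_type}
    if "Home" in ends:
        # Home at one end: Work at the other gives HBW, anything else HBO.
        return "HBW" if "Work" in ends else "HBO"
    return "NHB"  # neither end is Home

def build_od_matrices(trips):
    """Count trips per purpose and OD pair.

    trips: iterable of (origin_zone, origin_type, dest_zone, dest_type).
    Returns {purpose: Counter({(origin_zone, dest_zone): trips})}.
    """
    matrices = defaultdict(Counter)
    for o_zone, o_type, d_zone, d_type in trips:
        matrices[trip_purpose(o_type, d_type)][(o_zone, d_zone)] += 1
    return matrices
```

Storing each matrix as a sparse counter keyed by zone pair keeps the sketch independent of the zoning level, so the same code could be applied at district or DSD resolution.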
Once the trip purposes are defined, they need to be aggregated into Traffic Analysis Zones, which comprise the study area. Traffic Analysis Zones can be selected to encompass the expected resolution level. Considering the coverage area and complexity of overlaying, this study matches the cell towers to the district and DSD levels, such that the OD matrices are developed at the intra-district and inter-DSD levels.
4.3. Stage 4: Network Assignment
This section analyzes the route choice behavior of commuter trips, particularly Home-Based Work trips, inferred from the stage 2 OD flows. Trips are further analyzed using defined statistics to generate route choice probabilities. “En-route cells” and “cell paths” are introduced to assess the nature of route choice. En-route cells represent cell connections made during commuting, extracted from the last call at Work and first call from Home, and vice versa. A distance filter is applied to qualify cells as en-route cells. The network is interpreted using cell paths overlaid with Voronoi shape files to demonstrate sequences of en-route cells as routes.
Table 1 presents the definitions of trip statistics and the method used to derive route choice based on those statistics.
Table 1.
Defining trip statistics.
Measure | Level | Definition |
---|---|---|
(a) Alignment of a trip to the defined route (r) | Disaggregate, per trip | The number of en-route cells that fall along the defined cell path (A) divided by the total number of cells along the cell path (B). This gives the probability of alignment to the defined route for each user’s trip: M = A/B. |
(b) Certainty in route alignment prediction of a trip | Disaggregate, per trip | Measured using the number of connections made to the cells along the route in each trip. The number of connections is used as a weighting factor (WC), such that a higher weight is assigned when the number of connections is high. |
(c) Contribution of a user to route choice for the route (r) | Aggregate, per individual | Two measures quantify the contribution made by each user to the route inference: the output probability of measures (a) and (b), which is the probability of user X selecting the defined route for commuting (n = no. of trips), and a weightage derived by dividing the trips with evidence (T) by the total trips (Y), which gives the contribution of user X for route (r). |
The final route choice model was derived by multiplying the probability along a specific route by the actual number of days the caller traveled to Work. The same process was carried out for each user; then, the proportion of trips via the considered route was calculated. It was assumed that the same route selection probability exists for trips in which the en-route cells are not available. Figure 3 presents the graphical representation of the route choice. The notation is as follows:
(m) = No. of users among a particular DSD pair.
(r) = No. of routes among a particular DSD pair.
P(X Final_R(r)) = Probability of user X selecting route (r) for commuting.
(Ti) = No. of days the user had traveled to Work.
The final route selection probability is expressed as the percentage of trips made via route (r).
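One possible reading of the Table 1 measures and the final route choice step is sketched below in Python. The connection-weighted average used for a user's route probability and the normalization across routes are assumptions about how the measures combine, the evidence weight T/Y is left out for brevity, and all names and data layouts are hypothetical.

```python
def alignment(en_route_cells, cell_path):
    """Measure (a): M = A / B, the share of the cell path covered by en-route cells."""
    path = set(cell_path)
    if not path:
        return 0.0
    return len(set(en_route_cells) & path) / len(path)

def user_route_probability(trips, cell_path):
    """Combine (a) and (b): average of M over trips, weighted by connections (WC).

    trips: list of (en_route_cells, n_connections) for one user.
    """
    total_weight = sum(wc for _, wc in trips)
    if total_weight == 0:
        return 0.0
    return sum(wc * alignment(cells, cell_path) for cells, wc in trips) / total_weight

def route_shares(users, routes):
    """Percentage of commuting trips via each route for one DSD pair.

    users: list of {"trips": [...], "work_days": T_i}.
    routes: {route_name: cell_path}.
    Each user's route probability is multiplied by the number of days
    travelled to Work, summed over users, then normalized across routes.
    """
    weighted = {
        name: sum(user_route_probability(u["trips"], path) * u["work_days"] for u in users)
        for name, path in routes.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return {name: 0.0 for name in routes}
    return {name: 100.0 * value / total for name, value in weighted.items()}
```

Because the probability is multiplied by all work days, trips without en-route cells implicitly inherit the route probability estimated from trips with en-route cells, in line with the assumption stated above.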
Figure 3.
Route choice representation.
The main output of this procedure is the percentage of commuting trips assigned to the road segments within the study area. The commuting trips thus generated are then expanded to actual values and assigned to the road network using STRADA software, following the standard process in the CoMTrans network assignment. The User-Equilibrium method is employed in assigning private vehicle trips, and the transit assignment method is used for public transit trips. It is crucial to note that the CDR trip assignment assumes that the mode composition observed in the model outputs prevails.
Route assignment with CDR initially uses trips with en-route cells and is then expanded to the total number of Work trips, assuming that trips without en-route cells behave like trips with en-route cells.
Accordingly, for all DSD pairs O_iD_j, where i ≠ j, the following notation is used:
Rn_CDR/Rn_STRADA: No. of trips via route n with CDR/Modeled.
m: No. of routes identified for the considered DSD pair (n = 1, 2, 3, …, m).
i: Origin DSD (1, 2, 3, …, 47).
j: Destination DSD (1, 2, 3, …, 47).
4.4. Summary of Methodology
Figure 4 below summarizes the overall methodology, distinguishing between the traditional and CDR methods.
CDR data offers significant insights into mobile phone users’ spatial and temporal patterns, aiding the estimation of trip origins and destinations.
Table 2 and Figure 4 highlight the complementary strengths of each approach and motivate the integration of CDRs into planning frameworks where traditional surveys may be logistically or financially constrained. These records, capturing call and text message details, facilitate the analysis of communication patterns, such as call/text frequency, duration, and timestamps. Such analysis contributes to inferring trip purposes and rates, enhancing the accuracy of trip generation models by estimating trips from various zones, and identifying travel behavior variations by time or day.
In the trip distribution phase, CDR data elucidates mobile phone users’ destinations, enriching the understanding of travel patterns and modeling trip flows between zones. This data aids in assessing the probability of trips between different origins and destinations, leveraging the spatial distribution of communication events to pinpoint prevalent travel patterns and flow dynamics. This approach helps estimate trip attractions and generations for specific areas, highlighting temporal variations like peak periods and congestion zones.
5. Experimental Details and Results
This section outlines the outputs generated by our CDR-based Travel Demand Estimation Model. It begins with the development of trip tables and a comparison of origin–destination (OD) matrices with those derived from conventional travel surveys. We then present road network assignment results and validate them against traditional traffic assignment approaches.
5.1. Trip Tables and Survey Comparison
To generate OD matrices, a processed sample of Call Detail Record (CDR) data was analyzed. The analysis included identifying user locations categorized as Home, Work, and other significant places. Using CDRs, OD pairs were determined between cell towers; however, to align with traditional datasets, these tower-level locations were mapped to administrative units. In the study area, 763 mobile towers were mapped to 48 Divisional Secretariat Divisions (DSDs) and 3 districts, which were defined as OD-generation nodes.
Trips were identified by changes in consecutive cell tower locations and assigned to OD matrices at both the district and DSD levels. When a cell tower was located near administrative boundaries, trip generation and attraction counts were proportionally distributed based on the overlapping area between zones.
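The proportional allocation for boundary towers can be illustrated with a short Python sketch. The function name and the input layout (a mapping from zone name to overlapping area) are assumptions; how the overlap areas are obtained, for example from tower Voronoi cells intersected with zone polygons, is outside the sketch.

```python
def apportion_by_overlap(trip_count, overlap_areas):
    """Split a tower's trip count across zones in proportion to overlap area.

    overlap_areas: {zone_name: area of the tower's coverage inside that zone},
    in any consistent unit (hypothetical input).
    """
    total_area = sum(overlap_areas.values())
    if total_area <= 0:
        raise ValueError("tower coverage does not overlap any zone")
    return {zone: trip_count * area / total_area for zone, area in overlap_areas.items()}
```

For example, a tower with 100 generated trips whose coverage lies three-quarters in one DSD and one-quarter in a neighbouring DSD would contribute 75 and 25 trips to them, respectively.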
The graphs (Figure 5 and Figure 6) compare CDR-derived trip types, namely Home-Based Work (HBW), Home-Based Other (HBO), and Non-Home-Based (NHB), with traditional Household Visit Survey (HVS) data. All trip types show a strong correlation with the HVS results at broader geographic levels, though accuracy declines slightly at more granular (DSD) levels. HBW trips, due to their more routine nature, demonstrated the highest agreement.
At the district level, we observed the following:
At the DSD level, we observed the following:
These decreases in accuracy at more disaggregated levels are mainly due to the difficulty of accurately aligning cell tower coverage areas with smaller administrative travel zones.
An outlier analysis using R identified Colombo and Thimbirigasyaya as anomalies in the HBO and NHB categories, likely due to proximity-related boundary errors. When these two DSDs were combined into a single unit, the correlation improved notably.
High-density urban zones like Colombo tend to have overlapping tower signals and frequent handoffs, contributing to errors in trip allocation. While these effects are diluted at the district level, they introduce significant variance at the DSD level. Adjusting for such anomalies is key to improving model precision in future applications.
To supplement Figure 5 and Figure 6, Table 3 presents the actual and expanded HBW trip counts at the district level. The CDR data was scaled to reflect the actual population, as per the model’s methodology. Despite differences in absolute values, the CDR-based OD patterns align well with the HVS data, particularly for high-volume OD pairs like Colombo–Colombo and Gampaha–Gampaha.
5.2. Road Network Analysis
Figure 7 illustrates the validation of route choices produced by the CDR-based model against traditional assignment methods. Figure 8 and Figure 9 display line charts highlighting movement trends between the DSDs.
The correlation between the STRADA and CDR assignments was 0.83, which was statistically acceptable. However, some outliers could be detected by visual inspection. When the data was tested using an outlier test, a few DSD pairs, namely, Bandaragama–Kelaniya, Dodangoda–Kaduwela, Gampaha–Dehiwala, Homagama–Panadura, Horana–Kelaniya, Kolonnawa–Dompe, Mirigama–Dehiwala, Negombo–Ratmalana, Padukka–Attanagalla, and Panadura–Kolonnawa, were detected at the initial iteration. These DSD pairs had a significantly lower number of Work trips compared to the other pairs.
A hypothesis test was carried out to determine whether the correlation is significant, with the null hypothesis that there is no linear relationship between the two variables. The resulting p-value was 0.00001. Since the p-value is smaller than the significance level (α = 0.05), we reject the null hypothesis in favor of the alternative and conclude that the correlation is statistically significant, i.e., that there is a linear relationship between the two variables in the population at the α level.
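For reference, the correlation and its test statistic can be computed as in the following Python sketch. It assumes the standard t-test for a Pearson coefficient, t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom, which may differ from the exact procedure used in the study; the function names are hypothetical, and the p-value lookup is left to a statistics library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_statistic(r, n):
    """t statistic for H0: no linear relationship, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

The null hypothesis is rejected at α = 0.05 when the absolute t value exceeds the two-sided critical value of the t distribution with n − 2 degrees of freedom.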
6. Conclusions
This study presents a comprehensive travel demand modeling framework that leverages mobile phone Call Detail Record (CDR) data to enhance and potentially replace traditional survey-based approaches. Compared to conventional household travel surveys, CDR data provides broader temporal coverage—capturing multiple days of user movement, including weekends—and offers scalable, cost-effective, and frequently updated mobility insights.
A novel methodology is introduced to construct origin–destination (OD) matrices by identifying significant user locations using an individual-based regularity metric. Additionally, this study proposes an innovative route assignment approach that integrates CDR-derived trip patterns with the STRADA network model through a user-equilibrium traffic assignment process.
Applied to data from Sri Lanka’s Western Province and validated against a large-scale Household Visit Survey (HVS), the model demonstrates strong alignment with traditional travel demand estimates. The results affirm the potential of repurposed mobile phone data—originally collected for billing purposes—as a reliable input for transport planning, matching or even surpassing traditional four-step modeling in terms of behavioral representation and scalability.
However, several limitations remain. The current methodology may misclassify users whose work and home locations fall within the same tower area because of spatial resolution constraints. Similarly, intra-cell movements are not captured, leading to potential underestimation of short trips. Moreover, the model assumes modal split proportions based on external data, as CDRs alone lack mode-specific indicators. Future research should explore the integration of high-frequency or multimodal datasets to enhance mode inference and capture within-cell trip variability, particularly for improving the modeling of public transport usage.
Nonetheless, it is important to note that the accuracy and generalizability of CDR-based models depend heavily on the penetration rate of mobile phone users and the density of tower infrastructure. In regions with limited mobile coverage or lower adoption rates—often rural or economically disadvantaged areas—the representativeness of the CDR sample may be compromised. This can lead to spatial data sparsity and underrepresentation of specific user groups, potentially limiting the scalability of this method in such contexts. Future extensions could incorporate multiple data sources or synthetic population weighting to mitigate these limitations.
This work represents a significant step toward the mainstream use of big mobile data in transport demand forecasting, offering a viable, scalable, and timely alternative to traditional methods. Additionally, while the CDR data used in this study is pseudonymized, we acknowledge the importance of ethical considerations in handling such data. Even anonymized records can carry re-identification risks if improperly processed. To address this, strict data access protocols were followed. Ethical data stewardship remains central to deploying big data solutions in transport modeling.