Rapidex: A Novel Tool to Estimate Origin–Destination Trips Using Pervasive Traffic Data

Abstract: A traffic assignment model is a critical tool for developing future transport systems and road policies and for evaluating future network upgrades. However, developing the network and demand data is often highly resource intensive, which limits the number of cities worldwide for which some form of the model is available. The obstacles include licensing restrictions, bureaucracy, privacy, data availability, data quality, costs, transparency, and transferability. This paper introduces Rapidex, a novel origin–destination (OD) demand estimation and visualisation tool. Firstly, Rapidex enables the user to download and visualise road networks for any city using a capacity-based modification of OpenStreetMap. Secondly, the tool creates traffic analysis zones and centroids, as per the user-specified inputs. Next, it enables the fetching of travel time data from pervasive traffic data providers, such as TomTom and Google. With Rapidex, we tailor the genetic-algorithm (GA)-based metaheuristic approach to derive the OD demand pattern. The tool produces critical outputs such as link volumes, link travel times, OD travel times, average trip length and duration, and congestion level, which can also be used for validation. Finally, Rapidex enables the user to perform scenario evaluation, where changes to the network and/or demand data can be made and the subsequent impacts on performance metrics can be identified. In this article, we demonstrate the applicability of Rapidex on the network of Sydney, which has 15,646 directional links, 8708 nodes, and 178 zones. Further, the model was validated against the Household Travel Survey data of Sydney using aggregated metrics and a novel project selection method. We observed that 88% of the time, the “estimated” and “observed” OD matrices identified the same project (i.e., the rapid process reproduced the outcome of the more intensive traditional approach in 88% of cases).
This tool would help practitioners in rapid decision making for strategic long-term planning. Further, the tool would provide an opportunity for developing countries to better manage traffic congestion, as cities in these countries are prone to severe congestion and rapid urbanisation while often lacking the traditional models entirely.


Introduction
Globalisation, population rise, and the exponential growth in communication and vehicle technology have led to the increased desire and ability to travel longer distances in shorter periods of time. Consequently, transport systems worldwide suffer from congestion, leading to a lack of travel time reliability, reducing safety, intensifying environmental degradation, and ultimately creating economic inefficiency. Traffic congestion is estimated to cost around $88 billion per year in the US, £6.9 billion in the UK, and €2.8 billion in Germany [1]. Transport practitioners have two broad approaches to alleviate these issues: travel demand management and supply management. Demand management strategies probe vehicle method lacks transferability. Furthermore, relying on survey data is time consuming and expensive, which limits its scalability for rapid deployment. Therefore, travel demand data for multiple cities are challenging to obtain as there is no consistent and reliable data source or tool. Most of the studies in the OD estimation space deal with either test networks or small real-world networks with limited network detail, i.e., fewer zones and links. Another key limitation of the existing methods is the reliance on a limited number of traffic count observations [23]. The number of unknowns (number of OD pairs, typically thousands or hundreds of thousands for large networks) significantly outnumbers the known observations (a few hundred or a few thousand link counts), causing under-determinacy.
The current study tries to overcome the problem by utilising easily accessible pervasive traffic data for travel demand estimation. In recent years, pervasive (some studies refer to it as crowdsourced) traffic data sources have become potential options for traffic data collection because of the widespread usage of smartphones [24,25]. People have quick access to features such as Bluetooth, WiFi, and GPS, and are better connected through social media (SM) applications, through which user location data, travel history, travel times, activity behaviour, incidents, and traffic speeds are collected. Commercial traffic and navigation data providers (Google, TomTom, HERE, etc.), social media platforms (Facebook, Twitter, Instagram, etc.), mobility-as-a-service aggregators (Uber, Didi Chuxing, Ola, etc.), and fitness service providers (Strava, Google Fit, etc.) make use of this data. Pervasive traffic data have a higher sample size that reflects the traffic conditions better than the probe vehicles [25]. Furthermore, they have comprehensive spatial coverage and fine temporal resolution and are cost-effective compared to traditional data sources [24][25][26]. Such data have been utilised in the past for incident duration prediction [27], designing adaptive traffic signals [26], and congestion estimation [24].
This paper demonstrates how we utilised pervasive traffic data and developed Rapidex, a novel OD demand estimation and visualisation tool. Firstly, the tool enables users to download and visualise road network data (roads, road length, number of lanes, speed limits, type of road, etc.) for any city using OpenStreetMap and an external tool called OSMnx [16]. Secondly, the tool creates traffic analysis zones and centroids, where the user can specify the zone size and the maximum/minimum number of centroids per zone. Thirdly, the tool enables the fetching of travel time data from the TomTom and/or Google Maps application programming interfaces (APIs). Fourthly, the tool estimates the OD demand patterns across the network using a customised genetic algorithm (GA) approach, which minimises the gap between the observed and estimated travel times within a bilevel optimisation framework. After this step, the tool produces critical outputs such as link volumes, link travel times, OD travel times, average trip length and duration, volume-to-capacity (V/C) ratio, and congestion level. These outputs can be used to validate the demand matrix where corresponding real-world data are available from other sources. Finally, the tool enables the user to perform scenario evaluation, where changes to the network and/or demand data can be made, and the subsequent impacts on performance metrics can be identified. Therefore, the main objective of Rapidex is to help practitioners and authorities in quick decision making for strategic long-term planning by utilising pervasive traffic data.
The rest of the article is organised as follows. First, a brief review of different data sources and methods for demand estimation is provided. It is followed by the description of the methodology and the experiment setup. The case study of Sydney is then presented, followed by concluding remarks.

Literature Review
The four-step travel model is a framework used to forecast how the transportation network behaves so that planners can better understand what is required to support current and future scenarios. The first step is to assign demand to the trips between all the OD pairs within the network, which yields an OD matrix. The OD matrix needs to serve as a solid foundation for the rest of the model to build upon. Traditionally, OD matrices were populated through large roadside or household travel surveys; however, these surveys are costly, prone to reporting errors, and become outdated only a couple of years after collection. Thus, researchers naturally turned to creating models and frameworks for estimating OD matrices from collected real-world data. In this paper, we present a brief review of a few widely used methods and datasets for OD demand estimation.

Review of Methods for OD Demand Estimation
Origin-destination estimation models have existed for decades and have used various data sets to fuel their predictions. The earlier models employed link traffic counts and were based on static formulations. In recent years, there has been a growing interest in developing dynamic [28] and quasi-dynamic formulations [29], which can capture the within-day demand variations. The researchers have proposed various mathematical formulations, but most of them treat the OD estimation as a bi-level optimisation problem, which minimises the gap between the observed and modelled link counts. The upper level is concerned with taking the target OD matrix and adjusting it to reproduce the observed traffic counts. The estimated traffic volume that the upper level uses is produced by the lower level, which applies the OD matrix onto the network and solves for user equilibrium. Furthermore, most of the studies tend to "update" the historical OD matrix based on the additional (or latest) traffic data. Due to this, the solution space for the new matrix could be biased or limited. A review of different solution algorithms has been covered in many past studies, most notably by [30,31].
Essentially, there are four broad approaches for OD matrix estimation, from which multiple generalisations have been proposed [23]. These main approaches are minimum information/maximum entropy, Bayesian inference, generalised least squares (GLS), and maximum likelihood. The generalisations/advancements include methodologies such as neural networks [32,33], Kalman filtering [34], Markov chain theory [35], genetic algorithm [36], random forest [37], and data-driven methods [38,39]. A brief overview of the four main approaches is presented below.
Under maximum uncertainty, the minimum information/maximum entropy approach determines the most probable OD matrix and produces this estimate without any bias. It is computationally advantageous as it is fast and convenient. However, it is unable to consider uncertainties in traffic counts and any prior matrices, which can often lead to errors as these uncertainties significantly impact the output. Further, the model requires removing inconsistencies from the data to produce a feasible solution [40]. Bayesian inference provides a framework that introduces extra information based on accumulated statistics and research through the specification of the prior. The variability information and probability intervals it provides for traffic flow estimates are an essential advantage over non-statistical methods. However, uncertainty can arise when estimating the probability choice distribution, and the likelihood distribution can become complicated when vehicle counts contain non-negligible measurement errors. Further, the computational costs and the extent of applicability to varying traffic conditions are limitations [41,42].
The generalised least squares method allows the combination of traffic survey and observed traffic counts. It also accounts for the relative accuracies of both data sources. Nevertheless, GLS estimators can be biased when the starting estimators are biased and/or the assignment model is misspecified [43,44]. The maximum likelihood estimation with expectation maximisation can be used to estimate an OD matrix when the proportions or percentages of the traffic demand on the links are known, but the traffic volumes are unknown [45]. The model still works even if the matrix has zero elements in the cells, whereas it would not be feasible in other models such as the maximum entropy model. Even with a small amount of data available, highly accurate results can still be achieved to estimate the OD matrix. However, it can be computationally challenging for large matrices when using the expectation maximisation function.

Loop Detector Data
Estimating OD matrices using traffic count data obtained through loop detectors has been going on for several decades [40,46]. However, the quality of OD matrices obtained using count data is questionable [47]. Loop detectors are electrically conducting loops that are installed under the pavement. When a vehicle moves over a loop, the metal of the vehicle reduces the inductance of the loop, which is registered by the detector unit. They are better than household travel surveys (HTS) in terms of labour and time savings. However, they depend on metering infrastructure that is expensive to install and maintain [48]. They are also prone to erroneous counts; for example, one study found that 31% of the loop detectors had biased counts [49]. Furthermore, not all roads in a network are equipped with loop detectors, so the OD estimation model may sometimes be working with only partial information.

Mobile Phone Data/Call Data Records (CDRs)
The surge of innovation in information and communications technology has increased the reliance on mobile phones as a necessary means of communication and as a life management tool. This has opened up the opportunity for transport planners to exploit the data already collected by cellular providers to deliver new insights into travel behaviours. CDRs are collected and stored by cellular providers for billing purposes and contain important spatial data generated by mobile phone users. CDRs are useful when the mobile phone user interacts with their phone across different locations, which makes it possible to monitor their physical displacements over time. Trips can then be tracked within a specific timeframe to generate tower-to-tower matrices [50,51]. A primary advantage of CDRs is the passive nature of their collection process. CDRs can be particularly useful in developing countries, where reliable HTS data may be limited. The continual inflow of mass data makes periodic updates convenient.
Despite their seeming advantages, CDRs suffer from various limitations. They require that users carry their phones and that the phones are in operation. This means that when phones are switched off, lost, or faulty, the recorded CDR data may be missing or discontinuous. Incomplete data may lead to misinterpretation of individuals' trip behaviour because true origins or destinations are absent. Additionally, CDRs are subject to false displacements induced by the unrelated operations of cellular providers. Cell phone operators control call traffic by redistributing calls to different towers based on the call activity level. This results in incorrect displacements being recorded for redirected calls. Finally, the mobile operators may not always share the CDR data because of privacy concerns.

Bluetooth Data
Bluetooth technology has been used in a few studies to estimate OD demand patterns [34,52,53]. Sensors installed on the roadside collect the unique Bluetooth ID of vehicles. When several sensors are installed in the network, the time-stamped vehicle locations, i.e., trajectories, can be collected. The costs for hardware, software, and installation are low compared to loop detectors. However, these data have some limitations. Some Bluetooth IDs are found to be cloned, i.e., the same ID is shared among multiple vehicles, which may lead to errors [53]. In addition, the sensors located close to each other can have overlapping detection zones. Furthermore, the penetration is only 1% to 5% because the Bluetooth functionality can be turned off in many mobile devices to conserve the battery [54]. In addition, most of the studies relied on Bluetooth data from only a handful of sensors. Therefore, demand patterns have been estimated only in smaller or highly aggregated networks (fewer links, nodes, and zones).

Social Media (SM) Data
Social media is generally easy to access via online platforms, giving rise to extensive user coverage. The emergence of location tagging and regular posts by users increasingly generates publicly registered information that is often linked to the user's real-time position. Several studies utilised geotagged data from Twitter and Foursquare to estimate OD matrices [48,55,56,57]. The main advantage of SM as a data source is its universal accessibility and broad coverage across users. In addition, the availability of geo-coded capabilities on platforms such as Twitter and Foursquare allows for differentiating location-based activity patterns that are both easier to obtain and more detailed in comparison with HTS results.
However, SM data are of limited help in estimating commuting travel demand due to the low reliability of identifying home and workplace [57]. A significant limitation of such data is that geotagged tweets account for only a small proportion (1-2%) of all tweets [58]. Another drawback is that the type of user correlates with the locations being checked into. For example, locations with a major landmark or historical value were biased towards tourist users [55]. Because the innate purpose of SM is to share unique personal experiences, it is not rare to observe a high frequency of activity geotagged to significant locations. However, the trips that tourists record from such a location and its surrounding areas reflect very different travel behaviours from those of the local community captured by HTS. This reduces the consistency of the recorded activity patterns, as tourists often visit seasonally, thereby affecting the accuracy of the OD matrices developed for frequently visited tourist locations. In addition, the datasets can cover many users but take significant time to accumulate enough samples for each individual.

Other Sources
Researchers have also used other sources, such as floating car techniques in the form of taxi trajectories [59,60] and license plate recognition through traffic monitoring cameras [61][62][63]. Floating cars may not represent the real traffic conditions due to limited penetration. In addition, the operating conditions of taxis are quite diverse and different from that of the typical commuter. On the other hand, traffic monitoring cameras are mainly intended for surveillance and violations, and their spatial coverage is limited because of the high installation and maintenance costs. However, one can identify vehicle classification, lane-wise traffic, and turning proportions with this method, which can be helpful in OD estimation.

Pervasive Data
In a way, data from pervasive navigation platforms is akin to floating cars but with much higher penetration rates. The main advantages of such data are the comprehensive spatial coverage and fine temporal resolution. The data are updated in real-time and account for fluctuations in traffic activity. Speed/travel time information can be obtained for every road link in the network, which is not feasible with other modes of data collection. However, traffic "count" data cannot be routinely obtained from these platforms owing to privacy issues. Nevertheless, speed patterns are observed to coevolve with traffic volume patterns [64,65], and the data has been used in several studies for traffic analysis and developing network fundamental/flow diagrams. However, to the best of the authors' knowledge, such data have not been used for OD demand estimation, despite being easily accessible.

Methodology
The Rapidex tool has five core modules, for which the detailed methodology is discussed in this section. The key nomenclature is presented in Table 1.

Road Network Extraction and Zoning
First, the coordinates of a bounding box encompassing the city are input into Rapidex. By default, we select motorway, trunk, primary, and secondary edges. The tertiary and residential type edges are not considered for modelling. Then we select the maximum and minimum grid sizes (5 km and 1.25 km, respectively, in this case study). Rapidex then downloads the network from OSM using a Python package called OSMnx [16] and divides the entire network into square grids of the specified maximum size. The grids with higher node density are further disaggregated into smaller zones (again, square in shape). Next, the tool identifies a specified number of nodes within each zone to act as centroids for that zone, which permit the entering and exiting of vehicles from the network. By default, a maximum of 4 and a minimum of 3 centroids are selected per zone. The capacity values for different road types are then set. All the parameters mentioned here are only the default settings in the tool, and the user has the flexibility to change them. Furthermore, the user has the option to import a pre-defined zoning structure (e.g., census levels or statistical areas) in place of the default zoning structure available in Rapidex. At the end of this step, the users can visualise the attributes of every link (speed limit, length, lanes, etc.) and zone (number of nodes, length of roads, etc., within the zone) in the network.
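The grid-based zoning described above can be illustrated with a minimal sketch. It is not the Rapidex implementation: it assumes planar coordinates in kilometres, drops empty cells, and uses a hypothetical `density_threshold` (nodes per square kilometre) to decide which squares are quartered; the function names are likewise introduced here for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Cell = Tuple[float, float, float]  # (x_min, y_min, side length in km)


def nodes_in_cell(cell: Cell, nodes: List[Point]) -> List[Point]:
    """Return the nodes inside a square cell (upper edges excluded)."""
    x0, y0, size = cell
    return [(x, y) for x, y in nodes
            if x0 <= x < x0 + size and y0 <= y < y0 + size]


def make_zones(nodes: List[Point], x_min: float, y_min: float,
               x_max: float, y_max: float, max_size: float = 5.0,
               min_size: float = 1.25,
               density_threshold: float = 4.0) -> List[Cell]:
    """Cover the bounding box with max_size squares, then quarter dense squares.

    density_threshold is a hypothetical parameter, not a Rapidex setting.
    """
    stack: List[Cell] = []
    x = x_min
    while x < x_max:
        y = y_min
        while y < y_max:
            stack.append((x, y, max_size))
            y += max_size
        x += max_size

    zones: List[Cell] = []
    while stack:
        x0, y0, size = stack.pop()
        inside = nodes_in_cell((x0, y0, size), nodes)
        if not inside:
            continue  # assumption: cells without any network node are dropped
        density = len(inside) / (size * size)
        half = size / 2
        if density > density_threshold and half >= min_size:
            # Dense cell: split into four equal squares and re-examine each.
            stack.extend([(x0, y0, half), (x0 + half, y0, half),
                          (x0, y0 + half, half), (x0 + half, y0 + half, half)])
        else:
            zones.append((x0, y0, size))
    return zones
```

With the default sizes, a 5 km square can be quartered at most twice (5 km, 2.5 km, 1.25 km) before the minimum grid size stops further splitting.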

Travel Time Extraction
Pervasive data aggregators such as Google and TomTom provide both real-time and typical travel times through their APIs. The real-time travel times are prone to fluctuations and may not represent the everyday traffic conditions. On the other hand, the "typical" travel times are calculated as average travel times across multiple days (on the same day of the week) at the same departure time. Rapidex allows the user to collect either of these travel times. The default approach in Rapidex is to extract the typical link travel times and "calculate" the OD travel times using the shortest path algorithm. These OD travel times are then used in the error function calculation. One can directly obtain the OD travel times from the pervasive data platforms; however, a change in any of the zoning parameters may render the previously collected data unusable. In addition, the number of OD pairs, and hence the number of queries, grows quadratically with the number of zones, so fetching OD travel times directly may prove costly for cities with many zones.
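As an illustration of the default approach, the sketch below derives OD travel times from link travel times with Dijkstra's shortest path algorithm. The graph layout, node labels, and function names are assumptions made for this example; travel times would in practice come from the link-level data fetched from the provider.

```python
import heapq
from typing import Dict, List, Tuple

# node -> list of (neighbour, link travel time in minutes)
Graph = Dict[str, List[Tuple[str, float]]]


def shortest_travel_times(graph: Graph, origin: str) -> Dict[str, float]:
    """Dijkstra's algorithm: least travel time from origin to each reachable node."""
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        time, node = heapq.heappop(queue)
        if time > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, link_time in graph.get(node, []):
            candidate = time + link_time
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best


def od_travel_times(graph: Graph,
                    centroids: List[str]) -> Dict[Tuple[str, str], float]:
    """Travel time for every ordered pair of distinct centroids."""
    matrix = {}
    for r in centroids:
        times = shortest_travel_times(graph, r)
        for s in centroids:
            if s != r and s in times:
                matrix[(r, s)] = times[s]
    return matrix


graph = {"A": [("B", 4.0), ("C", 9.0)], "B": [("C", 3.0)], "C": [("A", 5.0)]}
od_tt = od_travel_times(graph, ["A", "C"])  # A to C goes via B: 4 + 3 = 7
```

For comparison, querying OD travel times directly requires one request per ordered zone pair: n zones give n(n - 1) pairs, i.e., 178 × 177 = 31,506 for a 178-zone network.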

Bilevel Optimisation Approach
The bi-level programming approach is typically used for estimating the OD matrix for networks that are congested. It consists of upper and lower-level problems that are solved through iterations until both levels converge simultaneously. Usually, the upper-level problem involves estimating the OD matrix using information about the observed link counts and aims to reduce the errors between the observed and predicted values. The lower-level problem is the user-equilibrium assignment at the network level, which describes travellers' interaction with different traffic situations. At each iteration, the lower level returns the traffic properties, including flow and travel time, to the upper level, while the upper level provides the estimated OD demand as the input of the lower-level program. Demand and assignment are mutually dependent on each other in a bi-level approach, and therefore, they are better able to replicate congested traffic conditions [52,66]. Most of the previous studies specify the upper-level problem as the minimisation of an error function $E(f)$ that measures the gap between $f^{est}_{ij}$, the estimated vehicle count on link $ij$, and $f^{obs}_{ij}$, the observed vehicle count on link $ij$.
Varying from the existing literature, Rapidex offers multiple developed methods for the error function E, from which the user can choose one or a combination of the error functions. More details about the error function are discussed in the sub-section "Error Function". With the defined error function E, we propose the bi-level modelling framework for the OD estimation process as follows.
Upper level: $\min E$, where $E$ is the error function (refer to the sub-section "Error Function"). Lower level: subject to the flow conservation and non-negativity constraints as outlined below:
$$\sum_{k \in \pi_{rs}} f_k^{rs} = d_{rs} \quad \forall r, s \in N$$
$$f_k^{rs} \geq 0 \quad \forall r, s \in N,\ \forall k \in \pi_{rs}$$
where $t(x)$ is the function for flow-dependent link travel time.
Currently, Rapidex has two types of solution heuristic, i.e., method of successive averages and Frank-Wolfe to solve user equilibrium traffic assignment.

Link Performance Function
Link performance functions measure the level of service (LOS) associated with the links representing an urban network [4]. These can include travel time, safety, cost of travel, stability of flows, and others. However, travel time is typically used as the sole measure of LOS as other measurements are highly correlated with travel time. The Bureau of Public Roads (BPR) function is widely used in practice; in its standard form it reads
$$t(x) = t_0 \left[ 1 + \alpha \left( \frac{x}{c} \right)^{\beta} \right]$$
where $t_0$ is the free-flow travel time of the link, $x$ is the link flow, $c$ is the link capacity, and $\alpha$ and $\beta$ are calibration parameters (commonly 0.15 and 4). Rapidex uses the BPR function to calculate travel times based on the estimated volume.
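To make the lower-level problem concrete, the following sketch applies the method of successive averages (MSA) with BPR travel times to a single OD pair served by parallel routes, so that the shortest-path step reduces to picking the currently fastest route. The BPR parameters 0.15 and 4 are the commonly used defaults and are assumed here; the toy network and function names are illustrative and not taken from Rapidex.

```python
from typing import List, Tuple


def bpr(t0: float, flow: float, capacity: float,
        alpha: float = 0.15, beta: float = 4.0) -> float:
    """BPR link travel time: t0 * (1 + alpha * (flow / capacity) ** beta)."""
    return t0 * (1 + alpha * (flow / capacity) ** beta)


def msa_parallel_routes(demand: float, links: List[Tuple[float, float]],
                        iterations: int = 2000) -> List[float]:
    """Approximate user equilibrium for one OD pair with parallel routes.

    links holds (free-flow time, capacity) per route. At iteration n the
    flows move towards the all-or-nothing assignment with step size 1/n.
    """
    flows = [0.0] * len(links)
    for n in range(1, iterations + 1):
        times = [bpr(t0, x, c) for (t0, c), x in zip(links, flows)]
        fastest = times.index(min(times))
        # All-or-nothing: load the whole demand on the currently fastest route.
        target = [demand if k == fastest else 0.0 for k in range(len(links))]
        step = 1.0 / n
        flows = [x + step * (y - x) for x, y in zip(flows, target)]
    return flows


links = [(10.0, 2000.0), (12.0, 2000.0)]
flows = msa_parallel_routes(3000.0, links)
times = [bpr(t0, x, c) for (t0, c), x in zip(links, flows)]
```

At equilibrium, the travel times on both used routes are approximately equal, which is the condition a relative-gap criterion checks on a full network.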

Solution Approach: The Genetic-Algorithm-Based Metaheuristics
For real-world and large-scale networks, solving for a global optimum using a bi-level approach is difficult because it is usually non-convex [67]. Researchers have used different heuristic approaches to address the problem, but they rely on the target OD matrix and some may still result in a local optimum for the upper level [68]. Some researchers have used the genetic algorithm (GA) approach to overcome this limitation [36,69] as it can handle any kind of objective function and constraints. GA starts with a set of feasible solutions and produces an optimal solution by iterating sequential steps [36]. The typical GA process follows the procedure as shown in Figure 1. However, in Rapidex, the GA approach has been tailored significantly to estimate OD demand, and the process is outlined in the subsequent sections.

Initial Solutions
OD demand estimation is a highly underdetermined problem, particularly in large networks. Multiple solutions for the OD matrix can exist for the same set of travel times. Therefore, providing accurate initial solutions (matrices) is critical in achieving a reliable final solution. The better the initial solution in terms of closeness to the observable solution, the better the GA estimation will be and the faster it will converge. A list of different initial solutions currently offered by Rapidex is presented in Table 2. It can be noted that all these solutions take the form of a gravity model. All the initial solution methods listed in Table 2 are subject to
$$\sum_{r} \sum_{s} d_{rs} = D$$
where $D$ is the total demand of the network and $d_{rs}$ is the demand between a particular OD pair $r$ and $s$. The CGM is the most flexible of the initial solutions as the user can define the values of proportions G and A for each zone. For example, the user may decide to use the proportion of the population (if known) or the proportion of historical trip productions (from an a priori OD matrix) as a proxy for trip productions, and the proportion of historical trip attractions (again, from an a priori matrix) as an indicator for the attractiveness of a zone. If such data are not available, Rapidex provides proxies, such as the proportion of residential roads and the proportion of nodes, which can be used in the custom solutions.
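A gravity-type initial solution that respects the total demand constraint can be sketched as follows. The exact functional forms in Table 2 are not reproduced here; the power-law deterrence term `cost ** -gamma` and the function name are assumptions chosen for illustration, with G and A standing for the zone-level production and attraction proportions.

```python
from typing import Dict, List, Tuple


def gravity_initial_solution(total_demand: float, G: List[float],
                             A: List[float], cost: List[List[float]],
                             gamma: float = 1.0) -> Dict[Tuple[int, int], float]:
    """Gravity-type OD matrix whose cells sum to total_demand.

    G[r] and A[s] are production and attraction proportions; cost[r][s] is
    the travel time between zones. The deterrence form is an assumption.
    """
    zones = range(len(G))
    weights = {(r, s): G[r] * A[s] * cost[r][s] ** -gamma
               for r in zones for s in zones if r != s}
    scale = total_demand / sum(weights.values())
    return {pair: weight * scale for pair, weight in weights.items()}


G = [0.5, 0.3, 0.2]          # e.g., population shares
A = [0.2, 0.3, 0.5]          # e.g., historical attraction shares
cost = [[0, 10, 20], [10, 0, 15], [20, 15, 0]]
od = gravity_initial_solution(1000.0, G, A, cost)
```

Scaling by the sum of the weights guarantees that the cell demands add up to D regardless of the proxies used for G and A.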
The user can decide to use one or a combination of these initial solutions for the first generation. Each initial solution requires a total demand. The user has two options for generating the total demand of a solution. Firstly, the user can decide to have the initial solution's total demand as a random value from the demand range they specified for the network. Alternatively, the demand range can be equally divided for every type of initial solution method being used. These division points are the total demands that the initial solutions are assigned. For example, say that we have 10 solutions, and we elect to use 50% TFM and 50% TDM. For the TFM solutions, the demand range is divided into 5 different demand values, and each of these values is assigned to one of the TFM solutions. Similarly, the TDM solutions will be calculated in the same way.
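The second option can be sketched as below. Whether the division points include the two ends of the demand range is not stated in the text, so the inclusive spacing used here is an assumption, as are the function names.

```python
from typing import Dict, List


def split_demand_range(low: float, high: float, count: int) -> List[float]:
    """Return `count` equally spaced total demands across [low, high]."""
    if count == 1:
        return [(low + high) / 2]
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


def assign_total_demands(low: float, high: float, shares: Dict[str, float],
                         n_solutions: int) -> Dict[str, List[float]]:
    """Give every initial-solution method its own sweep of the demand range.

    shares maps a method name to its fraction of the first generation.
    """
    return {method: split_demand_range(low, high, round(share * n_solutions))
            for method, share in shares.items()}


# Ten solutions, half TFM and half TDM, over a 100,000 to 2,000,000 range.
plan = assign_total_demands(100_000, 2_000_000, {"TFM": 0.5, "TDM": 0.5}, 10)
```

Each method then receives five total demands (100,000, 575,000, 1,050,000, 1,525,000, and 2,000,000 under the inclusive-spacing assumption), one per solution.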

Error Function
Once each of the OD matrices within the generation has been solved to convergence, i.e., a predefined relative gap, each solution needs to be evaluated by an error function. The error function is the most crucial aspect of the GA as it ranks different solutions and defines their viability. Rapidex offers several error functions (see Table 3), from which the user has the flexibility to choose one or a combination of error functions. For example, if the link flow data at a few locations and the OD travel times in the network are known, then one can use the combined error sums of RMSE-ODTT and RMSE-LF. Weighting can also be incorporated if the user has a preference (or more confidence) for one dataset over another.
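The sketch below shows how RMSE- and MAPE-type errors can be computed and combined with weights. The full set of error functions in Table 3 is not reproduced; the dictionary-based data layout and the function names are assumptions for illustration.

```python
import math
from typing import Dict, Hashable, List, Tuple

Values = Dict[Hashable, float]


def rmse(estimated: Values, observed: Values) -> float:
    """Root mean square error over the keys present in `observed`."""
    squared = [(estimated[k] - observed[k]) ** 2 for k in observed]
    return math.sqrt(sum(squared) / len(squared))


def mape(estimated: Values, observed: Values) -> float:
    """Mean absolute percentage error (in %) over the keys in `observed`."""
    ratios = [abs(estimated[k] - observed[k]) / observed[k] for k in observed]
    return 100 * sum(ratios) / len(ratios)


def combined_error(terms: List[Tuple[float, float]]) -> float:
    """Weighted sum of error values; terms is a list of (weight, error)."""
    return sum(weight * error for weight, error in terms)


observed_tt = {("A", "B"): 10.0, ("B", "A"): 10.0}   # OD travel times (min)
estimated_tt = {("A", "B"): 12.0, ("B", "A"): 9.0}
observed_flow = {"link1": 1000.0}                     # link flows (veh/h)
estimated_flow = {"link1": 900.0}

error = combined_error([
    (0.7, rmse(estimated_tt, observed_tt)),      # RMSE-ODTT
    (0.3, rmse(estimated_flow, observed_flow)),  # RMSE-LF
])
```

Because travel times and link flows are measured in different units, the weights also act as a scaling between the two terms, which is worth keeping in mind when choosing them.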

Selection
To proceed to the subsequent iteration (generation) of GA, new OD matrices must be created from the previous generation. First, solutions from the previous generation are selected and then combined. These selected solutions are referred to as parents. Rapidex offers different methods for parent selection.
The simplest method is randomly selecting two different solutions from the previous generation. The second method, i.e., the tournament selection, is an improvement over the random selection while still maintaining simplicity. In tournament selection, solutions are randomly selected to the desired number (two in our case) and the solution with the best fitness (or least error) is picked [70]. This is repeated once more, resulting in two parents who are guaranteed not to be the worst of the previous generation. To further ensure that the subsequent generation does not degrade, the elitist selection is employed whereby the next generation will be given the best performers of the previous generation. The number of elites that continue onto the next generation is user defined.
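Tournament and elitist selection can be sketched as follows. Drawing the contenders without replacement is an assumption of this example; it is what makes the winner of a two-candidate tournament never the worst solution of the generation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(population: Sequence[T], errors: Sequence[float],
                      rng: random.Random, size: int = 2) -> T:
    """Draw `size` distinct candidates and return the one with the least error."""
    contenders = rng.sample(range(len(population)), size)
    winner = min(contenders, key=lambda i: errors[i])
    return population[winner]


def select_elites(population: Sequence[T], errors: Sequence[float],
                  n_elites: int) -> List[T]:
    """Return the n_elites solutions with the smallest errors, best first."""
    ranked = sorted(range(len(population)), key=lambda i: errors[i])
    return [population[i] for i in ranked[:n_elites]]


rng = random.Random(42)
population = ["a", "b", "c", "d"]      # placeholders for OD matrices
errors = [4.0, 1.0, 3.0, 2.0]
parents = (tournament_select(population, errors, rng),
           tournament_select(population, errors, rng))
elites = select_elites(population, errors, 2)
```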

Crossover
Once the parent solutions have been decided, they are then combined using a crossover method to produce a new child chromosome. Again, Rapidex provides multiple techniques for "creating" the offspring, i.e., a new solution from the parents.
In the uniform crossover method, either parent's OD demand for any pair is chosen. In the single-point crossover method, the OD demand values for pairs from one parent are copied until a particular OD pair, and then the rest, are copied from the other parent. In the arithmetic method, for each OD pair the demand is the weighted arithmetic mean of the two parent demands [71].
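The three general crossover methods can be sketched on OD matrices flattened into lists of equal length (one entry per OD pair). The flattening and the function names are assumptions of this example.

```python
import random
from typing import List


def uniform_crossover(p1: List[float], p2: List[float],
                      rng: random.Random) -> List[float]:
    """For every OD pair, copy the demand from a randomly chosen parent."""
    return [rng.choice((a, b)) for a, b in zip(p1, p2)]


def single_point_crossover(p1: List[float], p2: List[float],
                           rng: random.Random) -> List[float]:
    """Copy from the first parent up to a random OD pair, then from the second."""
    point = rng.randrange(1, len(p1))
    return p1[:point] + p2[point:]


def arithmetic_crossover(p1: List[float], p2: List[float],
                         weight: float = 0.5) -> List[float]:
    """Weighted arithmetic mean of the two parent demands for every OD pair."""
    return [weight * a + (1 - weight) * b for a, b in zip(p1, p2)]


rng = random.Random(7)
parent1 = [10, 20, 30, 40]
parent2 = [1, 2, 3, 4]
child_uniform = uniform_crossover(parent1, parent2, rng)
child_single = single_point_crossover(parent1, parent2, rng)
child_mean = arithmetic_crossover(parent1, parent2, weight=0.25)
```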
While these general processes will yield results, a tailored crossover approach will provide better performance as characteristics of your problem may be leveraged to crossover parents more efficiently. Thus, a new crossover method was created specifically for our problem. The process is shown in Figure 2 and is outlined below:

1. Calculate the absolute difference between the first parent's estimated travel time and the observed travel time for a given OD pair.
2. Repeat step 1 with the second parent.
3. Assign the OD demand of the parent with the lower absolute difference to the child.
4. Repeat for all OD pairs.

Crossover method execution is subject to a user-defined crossover probability. In the instances where the crossover method is skipped, one of the parents is chosen to become the child.
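The four steps and the crossover probability can be sketched as below. The parent data layout (a demand dictionary and an estimated travel time dictionary keyed by OD pair), the tie-breaking in favour of the first parent, and the random choice of the surviving parent are assumptions of this example.

```python
import random
from typing import Dict, Tuple

OD = Tuple[str, str]
Parent = Dict[str, Dict[OD, float]]  # {"demand": {...}, "tt": {...}}


def travel_time_guided_crossover(parent1: Parent, parent2: Parent,
                                 observed_tt: Dict[OD, float]) -> Dict[OD, float]:
    """Per OD pair, inherit the demand of the parent whose estimated
    travel time is closer to the observed travel time."""
    child = {}
    for od, observed in observed_tt.items():
        gap1 = abs(parent1["tt"][od] - observed)   # step 1
        gap2 = abs(parent2["tt"][od] - observed)   # step 2
        source = parent1 if gap1 <= gap2 else parent2
        child[od] = source["demand"][od]           # step 3
    return child                                   # step 4: all OD pairs done


def make_child(parent1: Parent, parent2: Parent, observed_tt: Dict[OD, float],
               crossover_probability: float, rng: random.Random) -> Dict[OD, float]:
    """Apply the crossover with the given probability, else copy one parent."""
    if rng.random() < crossover_probability:
        return travel_time_guided_crossover(parent1, parent2, observed_tt)
    return dict(rng.choice((parent1, parent2))["demand"])


observed = {("A", "B"): 10.0, ("B", "A"): 20.0}
p1 = {"demand": {("A", "B"): 100, ("B", "A"): 200},
      "tt": {("A", "B"): 11.0, ("B", "A"): 30.0}}
p2 = {"demand": {("A", "B"): 300, ("B", "A"): 400},
      "tt": {("A", "B"): 15.0, ("B", "A"): 22.0}}
child = make_child(p1, p2, observed, 1.0, random.Random(3))
```

In this example the child takes the A to B demand from the first parent (gap of 1 versus 5 minutes) and the B to A demand from the second parent (gap of 2 versus 10 minutes).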

Mutation
The final step of chromosome creation is a mutation, where the OD pair demand can be changed. The mutation is crucial to ensuring the GA does not concentrate only in a local search space but rather can explore the entire search space. However, too much mutation will cause the GA to be unstable and will stop it from converging. Thus, as with the crossover, there is a user-defined mutation probability that will limit mutations at the generation level. Along with this, there is also user-defined mutation variability which dictates how many of the traits of a child will be mutated. Once again, Rapidex provides multiple mutation methods from which the user has the flexibility to choose.
The swap method randomly chooses two OD pairs and swaps their demands. The bit method randomly chooses an OD pair and flips a bit of the binary number that represents the demand value. Finally, the random method selects an OD pair at random and assigns a value between a specified range.
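The three mutation methods can be sketched on a flattened list of OD demands. Storing demands as non-negative integers for the bit method, and the `n_bits` width, are assumptions of this example.

```python
import random
from typing import List


def swap_mutation(demands: List[int], rng: random.Random) -> List[int]:
    """Swap the demands of two randomly chosen OD pairs."""
    child = list(demands)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def bit_mutation(demands: List[int], rng: random.Random,
                 n_bits: int = 16) -> List[int]:
    """Flip one random bit in the binary representation of one demand."""
    child = list(demands)
    i = rng.randrange(len(child))
    child[i] = int(child[i]) ^ (1 << rng.randrange(n_bits))
    return child


def random_mutation(demands: List[int], rng: random.Random,
                    low: float, high: float) -> List[float]:
    """Replace the demand of one random OD pair by a value in [low, high]."""
    child = list(demands)
    i = rng.randrange(len(child))
    child[i] = rng.uniform(low, high)
    return child


rng = random.Random(1)
base = [100, 250, 0, 75]
swapped = swap_mutation(base, rng)
flipped = bit_mutation(base, rng)
randomised = random_mutation(base, rng, 0, 500)
```

Flipping a high-order bit changes a demand far more than flipping a low-order one, so the bit method mixes small and large perturbations.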
After mutation, the chromosome is ready to be solved to convergence and have its error calculated. This iterative process continues until the genetic algorithm has achieved enough iterations. The process can be prematurely stopped if the error reaches an adequate value.

Termination of GA
The OD estimation module terminates after attaining the pre-defined error value or the maximum number of generations, whichever it reaches first. At the end of this module, Rapidex provides all the major outputs, namely link volume, link TT, and estimated OD-TT, in addition to the estimated OD matrix in the form of spreadsheets. Rapidex also allows the spatial visualisation of all these metrics on a background map. Additionally, Rapidex produces networkwide average metrics such as trip length, travel time, congestion levels (ratio of travel time to free-flow travel time), vehicle kilometres travelled, etc., which can be used for quick validation.
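The two stopping conditions can be expressed compactly. The sketch below assumes that MAPE-ODTT is the mean absolute percentage error between observed and estimated OD travel times, and it uses a hypothetical callback, `evaluate_generation`, that returns the best error found in a given generation; neither name comes from Rapidex itself.

```python
def mape_odtt(observed_tt, estimated_tt):
    """Mean absolute percentage error (in %) over all OD pairs.

    Both arguments map an OD pair to a travel time. OD pairs with a
    non-positive observed time are skipped to avoid division by zero.
    """
    errors = [
        abs(estimated_tt[od] - obs) / obs
        for od, obs in observed_tt.items()
        if obs > 0
    ]
    return 100.0 * sum(errors) / len(errors)


def run_ga(evaluate_generation, target_error=10.0, max_generations=100):
    """Run generations until the target error or the generation cap is hit.

    evaluate_generation(g) is a hypothetical callback that builds and
    evaluates generation g and returns its lowest MAPE-ODTT value.
    Returns (generations_run, best_error).
    """
    best = float("inf")
    generation = 0
    for generation in range(1, max_generations + 1):
        best = min(best, evaluate_generation(generation))
        if best <= target_error:
            break  # pre-defined error reached first
    return generation, best


if __name__ == "__main__":
    per_generation_error = [30.0, 18.0, 12.0, 9.0, 8.0]
    print(run_ga(lambda g: per_generation_error[g - 1], target_error=10.0, max_generations=5))
```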
Due to the potential for having so many chromosomes and very large networks, completing the GA in a reasonable time required parallel computing. Solving user equilibrium is the most time intensive aspect due to the requirement of finding the shortest path tens, if not hundreds or thousands of times (depending on the relative gap) per chromosome. Thus, the Dijkstra algorithm has been implemented to leverage parallel computing to significantly speed up the process.
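One common way to parallelise the shortest-path work is to compute one Dijkstra tree per origin and dispatch the origins to a pool of workers. The text does not state how Rapidex distributes the work, so the per-origin split, the adjacency-dictionary graph format, and the function names below are assumptions. The sketch uses a thread pool so that it runs anywhere; in CPython, CPU-bound work of this kind would normally need a process pool (for example `concurrent.futures.ProcessPoolExecutor`) to obtain a real speed-up.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dijkstra(graph, source):
    """Shortest travel times from source to every reachable node.

    graph maps a node to {neighbour: link_cost} with non-negative costs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def all_origin_trees(graph, origins, max_workers=4):
    """Compute one shortest-path tree per origin using a worker pool."""
    origins = list(origins)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda o: dijkstra(graph, o), origins)
        return dict(zip(origins, results))


if __name__ == "__main__":
    net = {"a": {"b": 1.0, "c": 4.0}, "b": {"c": 2.0}, "c": {}}
    print(all_origin_trees(net, ["a", "b"]))
```

The origins are independent of one another within a single assignment iteration, which is what makes this step straightforward to distribute across cores.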

Scenario Testing
Here, the user can modify the demand or network input parameters and evaluate the impacts relative to the base case. For example, the user can select a few links (either in isolation or as a corridor) in the network and change their attributes such as speed limit, number of lanes, and capacities. The user can also add a new set of links and nodes to the base network.

Results: Sydney Case Study
Now, we demonstrate the application of Rapidex on the network of Sydney, Australia. The bounding box coordinates encompassing the city of Sydney are input into Rapidex. We then select the maximum and minimum grid sizes as 5 km and 1.25 km, respectively, in this case study. Overall, the network has 15,646 directional links, 8708 nodes, and 178 zones. Figure 3 shows the road network output. For this case study, we collected the link travel times (from Google Maps) by setting the departure time as 8 a.m. on Wednesday, 31 March 2021. At the time of data collection, there were no travel restrictions within Sydney due to COVID-19. For the demand estimation, we defined a large input total demand range (100,000 to 2,000,000) so that the total demand estimated by Rapidex is always within these bounds. Then, we set 40 chromosomes (solutions) per generation, and the GA was set to run until a MAPE-ODTT value of 10% was reached. That is, the objective was to minimise the gap between the observed and estimated OD travel times. We used a server with 40 cores, 3.1 GHz processing speed and a memory of 512 GB.
While initialising, we allocated the same percentage of solutions to each initial solution method discussed in Table 2. In the custom gravity initial solution, we used the proportion of nodes present within a zone as both G and A. In this case study, we considered a relative gap value of 0.01. The Frank-Wolfe method was used to solve user equilibrium.
The networkwide average trip length, travel time, congestion level, and vehicle kilometres travelled were estimated as 11.9 km, 20.73 min, 1.66, and 7,875,392 veh-km, respectively. The total demand estimated was 657,000 vehicles during the morning peak hour. Figure 4 shows the scatterplot between the observed and estimated travel times for 31,506 OD pairs. Figure 5 shows the convergence of the GA, which took 26 generations to reach the pre-defined MAPE-ODTT error of 10%. This means a total of 1040 solutions (26 generations × 40 chromosomes per generation) were evaluated. The total processing time was approximately 4.5 h, which included 0.5 h for network extraction from OSM and zoning and 4 h for demand estimation. This turnaround reflects the main objective of Rapidex, i.e., to serve as a tool for quick decision making by authorities. Nevertheless, the processing time could be further reduced with additional computing resources and better initial solutions. For example, if a census-based zoning structure is used instead of the default grid-based system, the demographic data could be used as initial solutions, which could further speed up the process.

Figures 6-9 show the zonal structure and the sample output produced by Rapidex. It can be noted that the areas closer to the central business district (CBD) and the ones with higher node density are automatically divided into smaller zones. Figure 6 shows the percentage of demand attracted to each zone in the network. The darker the green, the higher the trip attraction of the zone. Figure 7 indicates the percentage of the total demand generated by each zone. In this figure, the darker red signifies that the zone generates a relatively higher number of trips than the other zones. Figure 8 depicts the net contribution, i.e., the ratio of trip generations to attractions of the different zones. The dark red indicates that the zone produces more trips than it attracts during the morning peak hour. On the other hand, the dark green indicates that more trips end in these zones than originate from them. It is evident from this figure that the zones far away from the CBD generate more trips than they attract during the morning peak hour, which is intuitive.
Finally, Figure 9 shows the average level of congestion (the ratio of travel time to free-flow travel time) to reach each zone in the network from every other zone during the morning peak hour. The dark red colour indicates that it takes a significantly longer time to reach such zones than others. It can be seen here that the zones closer to the CBD are a darker red than the ones that are further out. In the morning peak hour, one can expect more trips to be going towards the CBD and hence more congestion.

Validation
The output validation was performed using three approaches, first using the GEH (named for Geoffrey E. Havers) statistic, then using the aggregated output metrics and finally using a novel project selection method. The details are outlined below.
The Household Travel Survey (HTS) data, obtained at Statistical Area Level 2 (SA2) for Sydney for the year 2016, were considered for validation. Therefore, the SA2 zoning system was used for validation purposes to align with the HTS data instead of the zoning structure of Rapidex. Figure 10 shows the SA2 zoning structure used for validation purposes. First, we solved the user equilibrium using the HTS data and recorded the OD travel times, which are treated as the "observed values". Using these values as input, we "estimated" the OD matrix using Rapidex. However, several default properties of Rapidex were altered in the "observed" case to avoid any bias and to mimic real-world conditions as much as possible. For example, in the "observed" case, the BPR parameters were set as random values (α in the range between 0.10 and 0.20 and β in the range between 3.5 and 4.5), whereas for the "estimated" case, the default parameters were used. Furthermore, the location and the number of centroids for each SA2 were altered. Finally, the capacity of each link in the "observed" case was set to its default capacity multiplied by a random number between 0.75 and 1.25. These changes were made because, in reality, one may not know the "true" link performance function, link capacities, and the centroid locations.
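The perturbation applied to the "observed" case can be sketched as follows. The standard BPR link performance function is t = t0 (1 + α (v/c)^β); the default values α = 0.15 and β = 4 used below are the commonly cited BPR values and are an assumption here, since the Rapidex defaults are not restated in this section. The link dictionary layout and function names are also hypothetical.

```python
import random


def bpr_travel_time(free_flow_time, volume, capacity, alpha=0.15, beta=4.0):
    """Standard BPR link performance function: t = t0 * (1 + alpha * (v / c) ** beta)."""
    return free_flow_time * (1.0 + alpha * (volume / capacity) ** beta)


def perturb_link(link, rng):
    """Return a copy of a link with randomised BPR parameters and capacity.

    Ranges follow the text: alpha in [0.10, 0.20], beta in [3.5, 4.5],
    and capacity scaled by a factor in [0.75, 1.25].
    """
    return {
        **link,
        "alpha": rng.uniform(0.10, 0.20),
        "beta": rng.uniform(3.5, 4.5),
        "capacity": link["capacity"] * rng.uniform(0.75, 1.25),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    default_link = {"free_flow_time": 10.0, "capacity": 1800.0, "alpha": 0.15, "beta": 4.0}
    observed_link = perturb_link(default_link, rng)
    for name, link in (("estimated", default_link), ("observed", observed_link)):
        t = bpr_travel_time(link["free_flow_time"], 1500.0, link["capacity"],
                            link["alpha"], link["beta"])
        print(name, round(t, 3))
```

With these perturbations, the same link volume yields different travel times in the two cases, so the estimation cannot succeed simply by reusing the network settings that generated the observations.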

Validation Using the GEH Statistic
The GEH statistic, outlined in the equation below, is typically used to evaluate the performance of traffic models. Generally, a GEH value of less than 10 is considered acceptable in large-scale models [5,72].

$$\mathrm{GEH} = \sqrt{\frac{2\,(M - C)^2}{M + C}}$$

In the current study, M and C can be treated as the modelled/estimated and observed traffic demand between an OD pair, respectively.
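For reference, the GEH statistic in its standard form, GEH = sqrt(2 (M − C)² / (M + C)), can be computed as in the short Python sketch below. The helper `share_below` and the convention of returning 0 when both values are zero are assumptions added for the example.

```python
import math


def geh(modelled, observed):
    """GEH statistic between a modelled (M) and an observed (C) value."""
    total = modelled + observed
    if total == 0:
        return 0.0  # convention assumed here: identical zero demands match exactly
    return math.sqrt(2.0 * (modelled - observed) ** 2 / total)


def share_below(pairs, threshold):
    """Fraction of (modelled, observed) pairs with GEH below the threshold."""
    pairs = list(pairs)
    return sum(geh(m, c) < threshold for m, c in pairs) / len(pairs)


if __name__ == "__main__":
    od_demands = [(120, 100), (400, 100), (0, 0)]
    print([round(geh(m, c), 2) for m, c in od_demands])
    print(share_below(od_demands, 5), share_below(od_demands, 10))
```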
We calculated the GEH statistic for every OD pair, i.e., 32,200 observations. Figure 11 shows the frequency distribution of the GEH statistic. It can be seen that approximately 80% of the OD pairs have a GEH value less than 5, and over 93% have a GEH value less than 10. A further granular analysis of the GEH statistic in terms of zonal trip productions and attractions was conducted. All the 180 zones have their trip productions and attractions estimated within a GEH value of 10. Only two zones have the trip production GEH value above 5, and 16 (i.e., 9%) zones have the trip attraction GEH above 5.

Validation Using Aggregated Metrics
Aggregated metrics, namely trip length (and its distribution), congestion levels, and travel times, are used to evaluate the performance of the Rapidex approach. These aggregated metrics, particularly the trip length distribution, are essential in calibrating and validating large-scale networks [73,74]. The "observed" values for total system travel time (TSTT), average trip length, average trip time, and congestion were 174,792 h, 11.6 km, 20 min, and 1.64, respectively. The corresponding values for the "estimated" case are 171,899 h, 11.4 km, 22 min, and 1.64, respectively. Furthermore, Figure 12a-d shows the comparison of trip length and travel time distributions of the observed and estimated cases. It can be seen that the "estimated" trip length distribution matches perfectly with that of the "observed" case. The "estimated" travel time distribution also aligns well with that of the "observed" case until the 70th percentile, after which Rapidex slightly underestimates the travel time. This could be due to the stopping criterion of the error function, i.e., MAPE-ODTT, which was set as 10%.

Validation Using the Project Selection Method
Major transportation projects are costly investments, which are aimed at improving the existing conditions. When multiple projects are in consideration, one must carefully prioritise the appropriate project, keeping in mind the budget constraints and the long-term project impacts [75]. The ultimate purpose of estimating OD matrices (and transportation planning models in general) for any network is to evaluate different transportation scenarios or/and infrastructure design projects [3,76].
In this third validation approach, we examine how well the "estimated" matrix performs in project selection when compared with the "observed" OD matrix. We consider two projects, A and B, which are outlined below.
• Project A - Increasing capacity: Select a random corridor Y (out of 69 major corridors, including key motorways and arterials in Sydney) and increase the capacity of all links along this corridor by a random number (set in the range of 100-1500 veh/h).
• Project A - Cost: $C_{YA} \times L_Y$, where $C_{YA}$ is the capacity increment of route Y for Project A and $L_Y$ is the length of route Y.
• Project B - Capacity increment: Select a different random corridor X. The capacity increment for Project B is calculated such that both projects cost an equal amount, i.e., $C_{XB} = C_{YA} \times L_Y / L_X$, where $C_{XB}$ is the capacity increment of route X for Project B and $L_X$ is the length of route X.
• Here we assume that the project's cost is purely in terms of the capacity increment.
The process to select a better project between the two is to solve user equilibrium and evaluate the benefits in terms of networkwide metrics, e.g., TSTT. If the "estimated" and the "observed" OD matrices identify the same project as the better one, then we can say that the "estimated" OD matrix is useful for practical purposes. In this paper, we repeat the project selection 1000 times, i.e., each time we solve UE once for Project A and once for Project B, for both the observed and estimated cases. Thus, we run 4000 UE solutions. We observed that 88% of the time, the "estimated" and "observed" OD matrices identified the same project as the better one.
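The bookkeeping behind this test can be sketched in Python. The sketch covers only the equal-cost capacity increment for Project B (capacity increment of A times the length ratio of the two corridors) and the agreement rate between the "observed" and "estimated" rankings; the TSTT values are assumed to come from separate user equilibrium runs that are not shown, and the function names and the rule that the lower TSTT wins (ties to Project A) are assumptions.

```python
def equal_cost_increment(cap_increment_a, length_y, length_x):
    """Capacity increment for Project B so that both projects cost the same.

    Cost is taken as capacity increment times corridor length, so
    C_XB = C_YA * L_Y / L_X.
    """
    return cap_increment_a * length_y / length_x


def better_project(tstt_a, tstt_b):
    """Pick the project with the lower total system travel time."""
    return "A" if tstt_a <= tstt_b else "B"


def agreement_rate(trials):
    """Share of trials where observed and estimated cases pick the same project.

    Each trial is (obs_tstt_a, obs_tstt_b, est_tstt_a, est_tstt_b), i.e.
    the four UE solutions run per repetition.
    """
    trials = list(trials)
    same = sum(
        better_project(oa, ob) == better_project(ea, eb)
        for oa, ob, ea, eb in trials
    )
    return same / len(trials)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(equal_cost_increment(1000.0, length_y=5.0, length_x=10.0))
    toy_trials = [
        (100.0, 90.0, 101.0, 95.0),
        (100.0, 110.0, 99.0, 120.0),
        (100.0, 101.0, 102.0, 101.0),
        (80.0, 90.0, 85.0, 95.0),
    ]
    print(agreement_rate(toy_trials))
```

In the toy data, the only disagreement occurs in the trial where the two projects differ by about 1% in TSTT, which mirrors the kind of near-tie in which a ranking mismatch is most likely.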
Further analysis (see Figure 13a) revealed that the projects were ranked incorrectly only when the difference between the benefits (in terms of TSTT) offered by Project A and Project B was very small. For example, when the percentage difference in benefits offered by the projects was greater than 0.1%, the projects were ranked correctly in 96% of the cases. Likewise, when the percentage difference increased to 0.25%, all the projects were ranked correctly. Furthermore, Figure 13b shows that the observed and estimated percentage differences in benefits between Project A and Project B align quite well.

Taken together, the three validation approaches indicate that the proposed Rapidex approach is appropriately validated.

Conclusions and Future Directions
This paper presented Rapidex, a new tool to estimate the trip table for any given city in the world. With the tool, users can download the road network, create zones, extract travel time data from pervasive data aggregators, estimate the OD matrix, produce critical outputs, and visualise them spatially on a map. Furthermore, the tool allows changes to be made to demand and network data to evaluate different scenarios. In the case study of Sydney, we saw some of the output produced by Rapidex, such as the maps of the road network, trip productions, attractions, and congestion levels. Furthermore, the model was validated using the HTS data of Sydney using the aggregated metrics and a project selection method.
Rapidex enables traffic authorities and practitioners to quickly make decisions regarding strategic long-term planning. It can help compare, contrast, and benchmark various policies by evaluating the trip tables across various networks. The tool can help in understanding day-to-day and within-day demand changes in a city and in analysing the impacts of network and demand changes on critical performance metrics. Importantly, the tool can be beneficial for developing countries, where urbanisation and population growth are rapid but adequate resources for capturing demand data are lacking.
A key limitation of most traditional demand estimation methods is their reliance on a limited number of traffic count observations, which are far fewer than the unknown values (the OD demands), causing the problem of under-determinacy. In Rapidex, however, the default objective function considers the travel times between every OD pair. There is also a provision to test the objective functions typically used in the literature.
Several directions for future research have been identified. For instance, the Rapidex tool is being updated and refined continuously. Currently, the project team is working towards including public transport and evaluating the impacts of the changes in demand and road capacity values on sustainability indicators, such as congestion, vehicle kilometres travelled, mode share, equity, etc., for multiple cities around the world. In addition, the sensitivity of different parameters such as centroid selection, zoning, GA parameters, etc., is being tested. Furthermore, different user equilibrium traffic assignment heuristics are also being tested to further speed up the demand estimation process. Critically, it is noted that the current purpose of the tool is quick decision making in long-term planning projects. Future research also involves developing a more dynamic version of the software that takes into account the fluctuations in within-day travel times so that it can also be used for operational purposes.
As with traditional aggregate traffic assignment approaches, a limitation of the overall demand value estimated by Rapidex is the reliance on link capacities (through the link performance function). The demand may be over- or underestimated if the link capacities are set too high or too low. Better solutions could be obtained if either the total demand or the link capacities are known. Finally, it is important to note that the level of detail of OSM data and its data quality could differ across different countries and continents. Although Rapidex has a provision to import a pre-existing network file (in the form of a shapefile or spreadsheet), enhanced OSM data, particularly in developing countries, would make it even more accurate.