Combining Telecom Data with Heterogeneous Data Sources for Traffic and Emission Assessments—An Agent-Based Approach

Abstract: To create quality decision-making tools that contribute to transport sustainability, we need to build models that rely on accurate, timely, and sufficiently disaggregated data. In spite of today's ubiquity of big data, practical applications are still limited and have not reached technology readiness. Among them, passively generated telecom data are promising for studying travel-pattern generation. The objective of this study is twofold: first, to demonstrate how telecom data can be fused with other data sources and used to feed a traffic model; second, to simulate traffic using an agent-based approach and assess the emissions produced by the model's scenario. Taking Novi Sad as a case study, we simulated the traffic composition at 1-s resolution using the GAMA platform and calculated its emissions at 1-h resolution. We used telecom data together with population and GIS data to calculate spatial-temporal movement and imported it into the ABM. Traffic flow was calibrated and validated with data from automatic vehicle counters, while air quality data were used to validate emissions. The results demonstrate the value of using diverse data sets for the creation of decision-making tools. We believe that this study is a positive endeavor toward combining big data and ABM in urban studies.


Introduction
Traffic congestion is one of the major problems in most cities worldwide, especially in developing regions, where it produces increased fuel wastage and time and monetary losses [1]. According to the World Health Organization, air pollution is responsible for approximately 4.2 million premature deaths every year [2]. Increased duration and severity of traffic congestion have a more negative impact on pollution than free traffic flow: a higher number of speedups, slowdowns, stops, and starts increases emissions, as do lower vehicle speeds [3]. To face these challenges, many diverse applications have been developed and policies tested, such as the effects of electric cars, telecommuting [4], and car-pooling [5] on CO2 emissions, the outcome of banning old diesel cars on NOx emissions [6], changing the speed limit on several emission types [7], congestion pricing [8], etc. However, all the studies mentioned rely on synthetic or static data that are usually biased, expensive, and time-consuming.

Methodology (Model Architecture)
In this section, we will first provide an overview of the complete methodology used, and then we will describe each part in detail.
The scheme of the proposed methodology is shown in Figure 1. It is based on three steps. The first step is spatial-temporal travel pattern generation, which serves as a basis for traffic and emission estimation. The second and third steps are the traffic ABM simulation and its emission calculation at street level. The traffic ABM, built in the GAMA platform [52], is based on the car-following model. Every agent represents a vehicle that has an origin and a destination, together with a scheduled commuting time. The model realistically simulates traffic conditions by including crossings, lane changes, other vehicles, and drivers' behavior at a time step of 1 s. The output of the model is the assessed number of vehicles and their speed on every traffic link at 1-h resolution, which is further used to assess their emissions with coefficients from the HBEFA handbook [53]. We calculated the CO, NOx, and PM pollutants.

To feed, calibrate, and validate the model, we used diverse data sets with heterogeneous spatial and temporal resolutions, listed in Table 1. Using the telecom data set, we calculated the probabilities of spatial-temporal movement in the case study area. The probabilities of spatial movement were calculated among local communities, while the temporal movement was computed at 1-h resolution and represents an agent's probability of starting to commute at a certain hour. We used that output together with official population data and GIS data at the local community level as input to our traffic ABM. We calibrated the traffic model by finding the combination of model parameters that best corresponded to a day selected from the automatic vehicle counters data set, and validated it by comparing the model's output with the selected parameters to a different day chosen from the same data set. Calibration and validation were performed at 1-h temporal resolution.
After that, the model output was used together with the aggregated coefficients from the HBEFA handbook to calculate the CO, NOx, and PM pollutants produced by traffic. The output was again compared with ground truth data, more precisely with emission data from the two available stations in the case study area. With that, we confirmed the proposed methodology. The remainder of the section is organized as follows. First, we present the mobile phone data processing and the generation of the origin-destination probability matrix, together with the probabilities of temporal movement. After that, the ABM design is presented, together with its emission calculation.

Mobile Phone Data Processing
Mobile phone data are rich in user behavior information and are passively collected by telecom providers for billing purposes. Whenever a user uses a service from a provider, one Call Detail Record (CDR) is created and stored in a telecom database. Apart from their primary function, telecom data are extremely valuable, as they enable near-real-time monitoring of the spatial-temporal dynamics of a large number of people. Due to privacy issues, telecom operators perform rigorous data anonymization procedures before they give data to third parties.
In this research, we utilized CDRs to assess the probabilities of spatial-temporal movement in the case study area. The data set consisted of a set of telecommunication records (SMS In/Out, Call In/Out, and Internet activity) performed by randomly selected anonymized users. Besides the type of telecommunication record, the data set included the approximate coordinates of the Radio Base Station (RBS) that registered the traffic, the time and duration of an activity, as well as the country code and the number of digits in the telecommunication number used for record creation. From those records, we reconstructed the user mobility paths.
To generate the origin-destination (OD) probability matrix, which further served us to generate the traffic, we relied on several methodologies presented in the literature [24,28,35,43], with appropriate modifications adjusted to our case. The steps are presented here in three subsections.

Data Preprocessing
The first step implied data cleaning:
• We eliminated RBSs that officially did not belong to the municipality region.
• We excluded landline numbers, as they are not representative of user mobility.
• Records made by numbers with four digits or fewer were eliminated, as they probably belong to public services (e.g., parking services).
• Foreign numbers were excluded, since they probably belonged to tourists attending a festival held during a few days of the period for which we had the data. We treated those records as anomalies, as these tourists did not contribute to the city's everyday traffic and emissions.
• Finally, we selected users that had records during both the day and night periods, since we wanted to estimate users' regular trips occurring over a whole day. Therefore, we removed users with only a few records.
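The cleaning rules above can be sketched in a few lines of pure Python. The record fields, RBS identifiers, and day/night boundaries below are illustrative stand-ins, not the provider's actual schema:

```python
# Minimal sketch of the CDR cleaning rules; field names and values are
# illustrative, not the telecom provider's real schema.
records = [
    {"user": "a", "rbs": 1,  "digits": 9, "country": "RS", "hour": 8},
    {"user": "a", "rbs": 2,  "digits": 9, "country": "RS", "hour": 22},
    {"user": "b", "rbs": 3,  "digits": 4, "country": "RS", "hour": 10},  # public service
    {"user": "c", "rbs": 99, "digits": 9, "country": "RS", "hour": 11},  # RBS outside area
    {"user": "d", "rbs": 2,  "digits": 9, "country": "DE", "hour": 9},   # foreign number
    {"user": "e", "rbs": 1,  "digits": 9, "country": "RS", "hour": 14},  # day-only user
]
municipality_rbs = {1, 2, 3}

def clean(records):
    # Rules 1-4: in-area RBS, more than 4 digits, domestic numbers only.
    kept = [r for r in records
            if r["rbs"] in municipality_rbs
            and r["digits"] > 4
            and r["country"] == "RS"]
    # Rule 5: keep only users active in both the day (8 a.m.-7 p.m.)
    # and night periods.
    day   = {r["user"] for r in kept if 8 <= r["hour"] < 19}
    night = {r["user"] for r in kept if not (8 <= r["hour"] < 19)}
    both = day & night
    return [r for r in kept if r["user"] in both]

cleaned = clean(records)
```

Only user "a" survives in this toy example, since the others each violate one of the rules.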

Stay Extraction & Activity Inference
The second step was to extract the users' trips from the CDRs by selecting each user individually and ordering their records over time. To estimate the time a user spent at each location, we calculated the time difference between two successively visited antennas, which we then split and assigned half to each RBS [24]. Following [35], we classified locations as "stay" or "pass-by" and kept only the stay locations, since we were only interested in the users' origin and destination points. A location was classified as a stay only if the user spent more than 10 min there. To estimate origin (home) and destination locations, we split the data set into two parts:
• A data set for estimating origin locations: it contained records obtained on weekdays between 7 p.m. and 8 a.m., and on weekends.
• A data set for estimating destination locations: it contained records made on weekdays between 8 a.m. and 7 p.m.
Unlike the authors in [35], we summed up the duration each user spent on each antenna in both data sets and used it to estimate the origins and destinations. The origin location was identified as the location where a user spent the majority of their time in the first data set, while the destination location was identified as the location with the highest d1*d2 value, where d1 is the distance from the estimated origin location and d2 is the duration a user spent in the range of an antenna. This assumption is adopted from [35] and is based on evidence from the literature that, in accordance with time spent at a location, destination locations such as work are more likely to be further away from the origin (home) location than closer locations [54,55].
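The origin/destination rule above can be illustrated with a small sketch for a single user; the antenna coordinates and stay durations are invented for the example, and the d1*d2 score follows the description in the text:

```python
import math

# Sketch of the origin/destination estimation for one user: origin is the
# antenna with the largest total stay time in the night/weekend data set;
# destination maximizes d1*d2 in the weekday-daytime data set.
# Antenna coordinates and durations (minutes) are made up for illustration.
night_stays = {(0.0, 0.0): 540, (1.0, 1.0): 60}
day_stays   = {(0.5, 0.1): 200, (3.0, 0.0): 180, (0.1, 0.0): 240}

# Origin: location with the most night/weekend time.
origin = max(night_stays, key=night_stays.get)

def score(antenna):
    d1 = math.dist(antenna, origin)  # distance from the estimated origin
    d2 = day_stays[antenna]          # time spent in the antenna's range
    return d1 * d2

# Destination: daytime location with the highest d1*d2.
destination = max(day_stays, key=score)
```

Here the distant antenna wins even though a closer one has a longer stay, reflecting the assumption that work locations tend to be further from home.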

Rescaling
As a final step, we rescaled the data to the local community level, for two reasons. First, previous studies showed a higher correlation between trips extracted from telecom data and traffic surveys when aggregating trip origins and destinations to areas larger than one square mile [35,56]. Second, the population distribution was also available at the local community level. We therefore grouped the RBS points and assessed movement among communities. To achieve that, we first approximated the domains of the RBSs with Voronoi polygons [57], as is common in previous research [18,21,24,28,36,39,41,43]. We assumed that a user has a uniform probability of being located at any point of a Voronoi polygon; therefore, each user's origin and destination locations were assigned points inside the corresponding polygons according to a uniform distribution. Next, we overlapped the Voronoi and local community polygons and, according to the users' point locations and the intersections between the polygons, assigned each user a local community for their origin and destination locations [28]. We further transformed the extracted number of trips among local communities into probabilities of movement among them, again for two reasons. First, the telecom provider that gave us the data set did not have a full market share, so we assumed its users were uniformly allocated in the case study area. Second, since we did not distinguish the users' transport modes (e.g., passengers, cyclists, vehicles, etc.), we assumed that the calculated probabilities depict the likelihood of vehicle movements within the municipality. The probability of telecom activity was also inferred from the telecom data by calculating the proportion of active users per hour on working days.
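The final transformation, from aggregated trip counts among communities to an OD probability matrix, is a row normalization. A minimal sketch with invented community names and counts:

```python
# Sketch: turning aggregated trip counts among local communities into an
# origin-destination probability matrix by normalizing each origin's row.
# Community names and counts are illustrative.
trip_counts = {
    "A": {"A": 30, "B": 50, "C": 20},   # includes the self-loop A -> A
    "B": {"A": 10, "B": 70, "C": 20},
}

od_prob = {
    origin: {dest: n / total for dest, n in dests.items()}
    for origin, dests in trip_counts.items()
    if (total := sum(dests.values())) > 0
}
```

Each row of `od_prob` sums to 1, so it can be sampled directly when assigning a destination community to an agent.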

Agent-Based Traffic Model
Urban systems in general are complex systems, and to achieve higher sustainability, we need to understand their complexity [49]. Such systems are strongly defined by decentralized, local interactions among sets of independent entities. They exhibit collective intelligence without a central authority, i.e., they tend to self-organize and adapt to changes in the environment, thus optimizing their behavior over time. In order to comprehend these systems, we need to study them at two levels of abstraction: the local component level and the system level. ABM is a modeling paradigm that is suitable for modeling complex systems using a bottom-up approach by programming entities, their behavior rules, and the environment [58]. As the benefits of a bottom-up approach have been recognized, many ABM frameworks for transportation modeling have been developed [59]. Among them, we chose the GAMA platform [52] to model the traffic, as it has the ability to model large spatial complex systems and provides built-in actions for traffic modeling.

Model implementation
The model implementation is based on a built-in plugin for traffic modeling [60]. The plugin consists of three built-in skills for the three types of agents that declare them; each skill provides its agents with a set of predefined attributes and actions. The types of agents defined by the plugin are:
• Road agents: Each road is a polyline composed of a set of road sections (segments). Road agents have a source and a target node and have information on all the input and output roads. They are directed; therefore, if a road segment is bidirectional, two road agents are created, one for each direction. Road agents can have several lanes, which allows vehicles to change lanes at any time. They adopt the road skill, which provides them with variables such as the maximum allowed speed on the road, linked roads, and connected nodes.
• Road node agents: They define the beginning and/or end of traffic links. They adopt the road node skill, which supplies them with attributes related to the linked road agents as well as attributes that define crossings, such as stop signs or the list of driver agents blocking a node.
• Driver agents: Each driver agent has a planned trajectory that consists of a succession of road links. A driver picks a lane according to the traffic density, favoring the rightmost lane. Driver agents adopt the advanced driving skill, which gives them attributes that characterize commuting (e.g., target, vehicle length, maximum acceleration, the distance kept from other drivers, etc.) and drivers' personal characteristics (probabilities of respecting stop signs and traffic rules, changing lanes, etc.). It also provides the actions that define vehicle movement, based on the car-following model.
In this paper, we used all three agent types with their skills and adjusted them to our specific case. The specifications of the proposed model are described below.
However, for further implementation details on the plug-in, please refer to [60].

Model rules
The model follows simple rules. Each agent represents a vehicle with an origin point, a destination point, and a departure time that defines its commuting. The model setup relies completely on data. The simulation starts at 00:00 and lasts until the next day, with a simulation step of 1 s. Even though the output is needed at 1-h resolution, we chose to use the built-in traffic plugin that works at 1-s resolution, as it can realistically and dynamically simulate vehicle speeds, which are needed for precise emission assessment. Before the simulation starts, each agent is assigned an origin point, a destination point, and a departure time. When the simulation starts, agents are at their origin or destination locations in the case study area, depending on the time window the agent spends at its destination (e.g., working hours), which is allocated to each agent during initialization. When an agent's departure time comes, it drives on the traffic network toward its destination. An agent enters the traffic network at the crossing closest to its origin/destination location and exits the network in the same manner. If an agent is at its destination point, its stay is defined by a model input parameter, time duration; after that duration passes, it returns to its origin point. Agents' commuting is completely defined by the built-in plugin for traffic modeling in the GAMA platform, which is based on the car-following model [60]. For commuting, agents use the shortest path, calculated by Dijkstra's algorithm. While commuting, they respect the maximum allowed speed on the road and adjust their speed according to the other agents around them, their attribute values, and the road network.
Agents have attributes that define the probabilities of changing lanes, accelerating, decelerating, respecting traffic rules (priorities, stop signs, maximum allowed speed), keeping a greater or smaller distance from other cars, and stopping for no reason. Furthermore, when an agent slows down due to a car in front of it, its likelihood of changing lanes or using a linked road increases. Traffic lights are not included in the model; instead, agents are programmed to stop for a second before entering an intersection and to respect the traffic rules at intersections (with a certain probability). As we need the output at 1-h resolution to match the observed pattern in the automatic vehicle counters data set, the output of the ABM is the estimated number of cars and their average speed per road link at 1-h resolution. However, since the model works at 1-s resolution, its output can easily be exported and used for studying traffic composition at a finer level.
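The speed adaptation itself is handled internally by the GAMA driving plugin. As a rough illustration only, and not the plugin's exact rule, a generic car-following update of the kind described above can be sketched as follows (all parameter values are arbitrary):

```python
# Generic car-following speed update for one follower at a 1-s step.
# This is an illustrative sketch, NOT the GAMA plugin's actual rule:
# the follower accelerates toward its desired speed but never exceeds
# the speed that preserves a security distance to the leader.
def follower_speed(v, v_desired, gap_m,
                   accel=1.5, decel=3.0, security_dist=2.0, dt=1.0):
    # Highest speed (m/s) that still keeps the security distance
    # within the available gap over one time step.
    v_safe = max(0.0, (gap_m - security_dist) / dt)
    v_target = min(v_desired, v_safe)
    if v_target > v:
        return min(v + accel * dt, v_target)   # accelerate, bounded
    return max(v - decel * dt, v_target)       # brake, bounded
```

With a large gap the follower simply accelerates toward its desired speed; with a small gap it brakes toward the safe speed.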
Model setup
For the model setup, we used diverse data sets, such as the traffic network, population data, and GIS data of local communities, as well as the probabilities of movement among local communities and over time assessed from telecom data. These data defined our model assumptions and had a significant impact on model dynamics. The model input parameters were the time duration that agents spend at their destination locations and the percent of simulated population. These were tuned to the real traffic scenario during the calibration process.
Traffic network
The traffic network was downloaded from OpenStreetMap (OSM). Aside from the spatial representation of roads, the OSM data contain other relevant information, such as the number of lanes, maximum allowed speed, road width, whether a street is one-way or two-way, and other traffic network features. We created a buffer of 1 km around the city in the municipality case study area and downloaded the drivable road types (e.g., we excluded pedestrian roads). For the surrounding villages that also belong to the municipality, we downloaded only the main roads. We did this because we wanted to model traffic pressure in the city through traffic inflow and outflow: commuters from the surrounding villages contribute to it significantly, so the roads at the city entrances suffer the most pressure. We preprocessed the data and adjusted them to our traffic model, creating a road link for each direction and lane, as well as road nodes for each intersection on the network. Moreover, missing road features (e.g., maximum allowed speed) were filled in.
Population distribution and trip generation
We obtained the population number at the local community level, while the probabilities of movement between communities (including a self-loop) were assessed from telecom data.
In the model, the population of agents was generated according to the population distribution per local community and the percent of simulated population input parameter. During the model setup, the number of agents created for each community was obtained by multiplying the community population by the percent of simulated population, and each agent was assigned a random point inside the community polygon, with the condition that it was located between 100 and 500 m from a traffic link. This point was marked as the agent's origin location. According to the movement probabilities extracted for each local community, each agent was assigned a second local community and a random point inside it (under the same condition) that represented its destination location. The exported probabilities of temporal telecom activity served to allot the hour of departure, while the exact minute and second were chosen uniformly for each agent.
Drivers' behavior
The variables listed in the first column of Table 2 were used to characterize the drivers' personal characteristics. We characterized them with probability distributions extracted from [61], listed in the second column of Table 2. Every driver has a personal probability of changing lanes, respecting priorities and stop signs, and blocking a crossing node for no reason. The security distance coefficient determines the minimal distance a driver keeps from another driver, while the speed coefficient represents the speed they opt to reach relative to the maximum allowed speed on the road. When a driver's speed falls below 25 km/h, its probability of using a linked road increases with every simulation step; when the driver again exceeds 25 km/h, the probability is reset to zero.
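The population generation and departure assignment described above can be sketched as follows; the community populations, OD probabilities, and hourly activity profile are invented, and spatial placement inside the community polygon is omitted for brevity:

```python
import random

random.seed(42)

# Sketch of agent generation: agent count = population * simulated share;
# destination community drawn from the telecom OD probabilities; departure
# hour drawn from the hourly activity profile. All values are illustrative.
population = {"A": 1000, "B": 500}
share = 0.1
od_prob = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.6, "B": 0.4}}
# 24 hourly probabilities summing to 1 (no departures before 6 a.m. here).
hour_prob = [0.0] * 6 + [0.05, 0.15, 0.20, 0.10] + [0.5 / 14] * 14

agents = []
for community, pop in population.items():
    for _ in range(int(pop * share)):
        dest = random.choices(list(od_prob[community]),
                              weights=od_prob[community].values())[0]
        hour = random.choices(range(24), weights=hour_prob)[0]
        # exact minute and second chosen uniformly, as in the model
        minute, second = random.randrange(60), random.randrange(60)
        agents.append({"origin": community, "dest": dest,
                       "departure": (hour, minute, second)})
```

With these numbers, 100 agents originate in community A and 50 in B, each carrying a sampled destination and departure time.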
Since we did not have information on the number of active drivers or the duration of time agents should spend at their destination locations, we assessed those parameters by comparing the model output with data from automatic vehicle counters, which included the number of vehicles passing and their average speed at 1-h resolution, together with the coordinates of the counters' locations. For every combination of model parameter values (every scenario), we produced a model output that was compared with the real situation captured by the automatic vehicle counters in the case study area.
Calibration
After preprocessing the automatic vehicle counters data set, we selected one working day and compared it with every scenario produced by the model. For every scenario, the number of vehicles and their speed were assessed at 1-h resolution and compared with the values in the automatic vehicle counters data set for the corresponding hour. We calculated correlation measures between the observed traffic circumstances and every scenario produced by the model, using the Pearson and Spearman correlation coefficients. The model parameters from the scenario that best fit the chosen day were selected. Through the calibration process, we assessed the global values that define the fleet composition during a day (the number of vehicles on the road network across one day).
Validation
To assess the reliability of our model, we compared the traffic volume of the scenario selected in the calibration step with the other selected day from the automatic vehicle counters data. Validation was performed in the same way as calibration, that is, by calculating correlation measures between the observed and predicted number of vehicles and their speed.
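The Pearson/Spearman scoring of a scenario against a counter can be sketched in pure Python; the hourly counts below are invented, and the rank step ignores ties for brevity:

```python
# Sketch of scenario scoring: Pearson and Spearman correlation between
# hourly counter readings and a model scenario. Counts are illustrative.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman = Pearson on ranks (no tie correction in this sketch).
    rank = lambda v: [sorted(v).index(a) for a in v]
    return pearson(rank(x), rank(y))

observed = [120, 340, 510, 480, 300]   # vehicles/h at one counter
scenario = [100, 360, 490, 500, 280]   # model output at the same link
```

In practice `scipy.stats.pearsonr` and `spearmanr` do the same job with proper tie handling.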

Emission Evaluation
To estimate vehicle emissions produced by traffic, we used publicly available coefficients from the HBEFA handbook [53]. The handbook contains diverse emission factors calculated for different types of vehicles (such as passenger cars, heavy-duty vehicles, light-duty vehicles, motorcycles, coaches, and urban buses) with different types of engines (such as diesel, petrol, electricity, and CNG) in Switzerland, Austria, Germany, Norway, Sweden, and France. The database contains factors calculated for the following pollutants: CO, HC, NOx, PM, several components of HC (CH4, NMHC, benzene, toluene, xylene), fuel consumption (gasoline, diesel), CO2, NH3, N2O, PN, and PM, in g/km. The pollutants are calculated for a wide range of traffic situations, such as cold start and warm emission events. Besides that, it includes aggregated values of pollutants per type of vehicle and country.
We simplified the emission assessment, since we did not have precise data on vehicle types or engines, or factors assessed for Serbia. We chose to use the factors estimated for Austria, as it is geographically the closest country. We used the aggregated values provided for passenger cars and calculated the emissions of the pollutants CO, NOx, and PM at 1-h resolution, applying Formula (1) to calculate the emissions per traffic link. As the literature reports that the coefficients are approximately twice as large when traffic is in a stop&go regime [62], we added the coefficient k to reflect that: for every road link on which the assessed speed was below 1/5 of the maximum allowed speed, k was set to 2, and otherwise to 1.

E_{h,p,l} = n_{h,l} * d_l * agg_p * k_{h,l}    (1)

where:
h - hour
l - traffic link
p - pollutant (CO, NOx, or PM)
E_{h,p,l} - calculated emission of pollutant p on traffic link l in hour h
n_{h,l} - number of vehicles on traffic link l in hour h
d_l - length of traffic link l in km
agg_p - aggregated coefficient from the HBEFA handbook for pollutant p, in g/veh-km
k_{h,l} - congestion coefficient for traffic link l in hour h; it takes the value 2 for congested links and 1 otherwise
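Formula (1) translates directly into code. The aggregated coefficients below are placeholder numbers for illustration, not the actual HBEFA values:

```python
# Per-link, per-hour emission following Formula (1). The aggregated
# coefficients (g/veh-km) are placeholders, NOT real HBEFA values.
AGG = {"CO": 0.5, "NOx": 0.35, "PM": 0.02}

def link_emission(pollutant, n_vehicles, link_km, mean_speed, max_speed):
    # k = 2 when the link is congested: assessed speed below 1/5 of the
    # maximum allowed speed; k = 1 otherwise.
    k = 2 if mean_speed < max_speed / 5 else 1
    # grams of pollutant emitted on this link during this hour
    return n_vehicles * link_km * AGG[pollutant] * k
```

For example, 100 vehicles on a free-flowing 0.8 km link yield half the CO of the same traffic in a congested (k = 2) regime.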

Emission Validation
To assess the reliability of the estimated emissions, we compared the model output with the available air quality data measured by stations located in the case study area. We calculated correlation measures between the values assessed on the road link closest to each station and the PM, CO, and NOx values obtained from the air quality stations. We selected the same day as the one used for the traffic model calibration, as we believe it is the most realistically simulated day.

Case Study
Novi Sad is the second biggest city in Serbia, with a positive growth rate (Figure 2) and more than 300,000 inhabitants [63]. The city is located on the border of the Backa and Srem geographical regions, which is defined by the Danube River, and faces the northern slope of Fruska Gora Mountain. Due to its geographical position, its road network, and its relatively small available area, Novi Sad has become very crowded, with considerable daily traffic congestion and, consequently, increasing pollution. Congestion is mainly present at the approaches to the bridges. It is estimated that the city needs two more car bridges and one more pedestrian bridge; unfortunately, for financial reasons, it will get only one car bridge by the end of 2030 [64], which is a long period for a city that is constantly growing. For these reasons, policy-makers need to find other ways to optimize traffic and minimize congestion. With the proposed model, policy-makers can explore the effects of various traffic regulations to find an optimal solution for the whole case study area.

Telecom Data
The data set was provided by the operator Telecom Serbia for the period 3-11 July 2017. It contained approximately a million records per day, generated by 197,950 individual users, with the spatial resolution defined by 80 RBSs. After data preprocessing, we retained 80,542 individual users, 77 antennas, and less than a million records per day, which were further used for the estimation of the OD probability matrix and the probability of telecom activity per hour. The number of estimated origin (home) locations together with official population data is depicted in Figure 3a, while the calculated probability of temporal telecom activity is depicted in Figure 3b. The origin-destination probability matrix is shown in Figure A1.

Automatic Vehicle Counters Data Set
The data set from automatic vehicle counters was available for the periods 1-17 November 2019 and 3-9 December 2019. It contained the number of cars at 1-h resolution from 26 counters, and the average speed from 16 counters, located at the main crossings in the city (Figure 4). However, as the counters were deployed as a pilot project in the case study area, there were some interruptions in data collection, so we first cleaned the data. Interruptions and inconsistent data were more frequent in the speed measurements than in the vehicle counts.
We eliminated counters with invalid records. We set the condition that a counter must have at least four records (4 h) of car counts or speed measurements in one day to be kept for calibration or validation; the number four was chosen to make the calculation of correlations possible. If most of the counters had an interruption in data collection during a day, we discarded that day. In addition, we eliminated days that we identified as extreme outliers, as we assumed there was an error in data collection. Finally, we ended up with 16 usable days and, depending on the day and the measure (number of cars or average speed), 7-17 counters per day. We calculated the mean number of cars and their average speed on a daily basis for the usable days and available stations and show their variability in Figure 5.
Previous literature reports four different types of days in urban areas: working days (Monday-Thursday), Fridays, weekends, and holidays [25]. As we wanted to simulate a regular working day, we only considered days falling into the first category. For the model calibration, we selected Wednesday, 6 November 2019, since it had the largest number of counters available. The model output was then validated against Thursday, 3 December 2019, the next day with the largest number of available counters.

Traffic Simulation
The traffic ABM is based on data and the car-following model. The traffic network was downloaded from OSM and imported into the model. For the city and its 1 km buffer, all drivable roads were imported, while for the peripheral parts of the city that belong to the municipality, only the main roads were included. In this way, the model is simplified while the traffic entering the city is still included. Agents were generated according to the population distribution per local community; their destination locations and movements were inferred from the probabilities calculated from telecom data. For further implementation details, please refer to Section 2.2.
The values of the two input parameters that define the number of drivers and the traffic dynamics were estimated during the calibration process: time duration, which defines an agent's stay at a destination location, and percent of simulated population, which determines the number of agents in the model. In the calibration process, we ran a series of experiments with the parameter values time duration = [4,5,6,7,8,9] and percent of simulated population = [0.1, 0.2], chosen in accordance with expert opinion in the traffic domain. Since the stochasticity in the model is not considerable at the 1-h output, we ran the model five times for every combination of parameter settings and averaged the results. Then, as part of the calibration process, we compared the outputs generated for each combination of parameter values with the chosen day in the automatic vehicle counters data set. We obtained the highest correlation between model output and counter data with the parameter values [9, 0.1]. Figures 6 and A2 show the comparison between the observed and predicted output for these parameter values. Figure 7 shows the congestion maps at 6 a.m., 10 a.m., 12 p.m., and 3 p.m., respectively. Congestion is calculated as the difference between the maximum allowed speed and the speed predicted by the model at a given hour for every traffic link. To validate the proposed model with the selected parameters, we compared the results with the second day selected from the automatic vehicle counters; the validation results are given in Appendix A and shown in Figures A3 and A4.
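The calibration grid described above can be sketched as a simple search loop. Here `run_model` and `score` are placeholders standing in for the GAMA simulation and the correlation fit to counter data, with toy formulas chosen only so the code runs:

```python
from itertools import product
from statistics import mean

# Sketch of the calibration grid: run the model for each parameter
# combination, average five stochastic runs per combination, and keep the
# combination with the best fit to the counter data.
durations = [4, 5, 6, 7, 8, 9]   # hours spent at the destination
shares = [0.1, 0.2]              # percent of simulated population

def run_model(duration, share, seed):
    # Placeholder: would launch a GAMA run and return 24 hourly counts.
    return [duration * share * (h + seed % 3) for h in range(24)]

def score(output):
    # Placeholder: would be the Pearson/Spearman fit to the counter data;
    # here, distance of the daily mean to an arbitrary target.
    return -abs(mean(output) - 11)

best = max(
    product(durations, shares),
    key=lambda p: score([mean(run_model(*p, seed=s)[h] for s in range(5))
                         for h in range(24)]),
)
```

The structure, a grid of parameter combinations, averaged repeated runs, and a single fit score, mirrors the calibration described in the text; only the two placeholder functions would need to be replaced.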

Emission Assessment
The model output is further used to calculate the traffic emissions, using the HBEFA aggregated coefficients and Formula (1). Emissions are calculated for every traffic link and every hour. To validate the calculated emissions, we used measurements from the two available Serbian Environmental Protection Agency (SEPA) [65] stations that measure traffic-related air quality parameters at 1-h resolution: station Novi Sad Rumenacka (45.262626, 19.819016), measuring CO, and station Novi Sad SPENS (45.24506, 19.84119), measuring CO, NOx, and PM. The validation of the emission output is shown in Figure 8, while the emission heat maps for NOx, CO, and PM are depicted in Figures 9, A5 and A6, respectively.

Discussion
The model setup and achieved accuracy were dictated by the available data. We achieved good correlation measures for most of the crossings, meaning that we managed to assess the traffic composition and its emission. However, for some locations, the number of cars or the emission was not predicted in the same range as observed by the automatic vehicle counters and emission stations. Moreover, while the Pearson correlation for most of the plots was around 0.5-0.6, the Spearman correlation showed higher variability, with values ranging from 0.1 to 0.9. The Pearson correlation uses the actual values, whereas the Spearman correlation is calculated on ranks. In terms of the predicted emission, we obtained good correlation measures for both coefficients, but we only had one station for the NOx and PM pollutants, and two stations for CO. Although the number of stations was not large enough to represent the overall pollution in the city, we did reproduce it to some extent and demonstrated the joint use of different data sets, the ways of processing them, and how to overcome their disparate data forms for the purpose of creating a decision-making tool, which was the main goal all along.
There are several possible reasons for not being able to predict the absolute numbers at some crossing points. First, the data set used to generate the OD probability matrix was from July 2017, while the rest of the data were from November and December 2019. In addition, the chosen days affected the model validation. It is reported in the literature that traffic patterns are distinct across days and cannot be generalized [66]. Therefore, to obtain a reliable model, a policy-maker should use data sets for assumptions, calibration, and validation from the same time period, preferably the same day. Moreover, the reported accuracy depended on the nature of the data used in the calibration and validation processes, and on its spatial and temporal density. Contrary to the papers that compared the overall pattern produced by a model with congestion or emission maps or with a single global number [4,6,42,67], we calculated the measures at a more precise spatial-temporal level and compared the actual numbers, not only the visual patterns. Although this approach is more precise, it is possible that the available data did not correctly capture the complete interactions of individuals. Presumably, the best option would be to combine both approaches. With respect to the emission calculation, we chose a simplified aggregated approach due to the lack of traffic fleet data and measured emission coefficients in Serbia. However, it remains an open question to what extent the aggregated coefficients for Austria from the HBEFA handbook agree with the traffic conditions in Serbia. As we only had two stations for result verification, this question remains open for further research. From a modeling perspective, the best way would be to dynamically simulate the vehicle emissions at every simulation step, including cold-start and warm emission activities for every type of vehicle and its engine.
Nonetheless, this increases the model's complexity and lengthens the simulation time, which potentially leads to other issues. Finally, we simulated the traffic and assessed its emission at the municipality level; in general, it is more difficult to tune a model to real-world scenarios when the modeled area is larger and more complex, compared to models that simulate the traffic in a smaller area, such as a single district.
The reported model accuracy depends, in addition to the data, on the way the model is validated. The literature contains numerous examples of various traffic modeling practices [68], and their validation methods differ. There are methods in which validation is based only on a visual comparison of a traffic heat map with the traffic map from services such as Google or Baidu maps [4,6,42,67]. Moreover, there are papers that use only one global number to validate the overall pattern produced by a model, such as comparing the overall CO or NOx emission produced by a model with the overall emission reported for cities [4,67]. The question is how accurate these models are and to what extent the reported accuracy is influenced by the data and validation methodology used. On the other hand, some papers do validate the produced pattern with actual numbers at a finer spatial-temporal scale, but their approaches also differ. Some examples are comparisons of hourly traffic to annual average daily traffic [32], of the total number of vehicles with a single number at several locations [43], of traffic volumes with 600 sensors [44], or of one model's output with another model's output [69]. All these models are distinguished by the level of detail, the assumptions, and the research questions they address. From this point of view, comparing the accuracy of different models is difficult.
Despite not being able to predict the absolute numbers, and despite the scarce data sets used for validation, we believe the presented methodology could serve for the creation of a decision-making tool and be used for gaining insights when comparing different policies in order to find the optimal outcome. Relying on evidence-based conclusions promises smoother and more intelligent decision-making. However, the inclusion of more data also contributes to the complexity of the model, leading to other problems such as long execution times. Therefore, it is necessary to find a compromise between the level of realism in the model and its goal. Although there are no perfect models, they can support decision-making processes to some extent and provide insights into the phenomena of interest; policy-makers just need to be aware of a model's limitations. In addition, computer resources, capacities, and algorithms are advancing, so we are moving toward more realistic models.
Additionally, this study gives a retrospective of the challenges of combining disparate data sources with different spatial-temporal resolutions. Using heterogeneous data sources brings many open questions and challenges [9,10,70]. On the one hand, we can argue that passively collected data are mostly free of bias and do not suffer from the selection of sampling methods. As they cover a larger sample size, it is usually possible to obtain a broad picture of behavioral patterns. Nevertheless, using data not primarily collected to support a given subject, as we did with telecom data in traffic studies, introduces gaps, and such data need to be fused with other data sources in order to unlock their value. Combining different data sets brings many uncertainties and challenges caused by mismatches in data resolution and by the multimodal and dynamic nature of the data, and it requires additional effort to overcome them [10]. To get from telecom data to traffic emission, we went through several steps of combining the data sources. To match the spatial resolution of the telecom data with the official population data, we upscaled the domain to the local community level. We populated the ABM at the local community level and connected the agents' trips to the traffic network. Furthermore, to compare the traffic or its emissions with the stations and counters in the case study area, we summed up the traffic or its emissions on the closest traffic link and calculated the correlations. In addition, different data sources have different error distributions, and modelers should keep this in mind when using them [9]. To overcome these issues, modelers should support their solutions with as much data as possible and with expert opinions. In terms of telecom data, we recognized that the main limitation was data availability. Consequently, replicating the methodology in other case studies would require similar data sources.
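The closest-link matching step can be sketched as a nearest-neighbor lookup. The link coordinates and emission series below are hypothetical, and plain Euclidean distance on latitude/longitude is used as a simplification (adequate for picking the nearest link at city scale; the paper does not specify the metric used).

```python
from math import dist

# Hypothetical link midpoints (lat, lon) and hourly CO emissions in grams.
links = {
    "link_a": ((45.2626, 19.8190), [400.0, 950.0, 880.0]),
    "link_b": ((45.2451, 19.8412), [300.0, 700.0, 650.0]),
}

def nearest_link_series(station_coord, links):
    """Pick the traffic link closest to a monitoring station and return
    its hourly emission series, to be correlated with the station's
    measured concentrations."""
    name = min(links, key=lambda k: dist(station_coord, links[k][0]))
    return name, links[name][1]

# Station Novi Sad Rumenacka coordinates from the text.
name, series = nearest_link_series((45.262626, 19.819016), links)
```

The returned hourly series would then be correlated against the station's 1-h measurements, as done for the counters and SEPA stations in the study.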
Telecom operators are usually reluctant to give data to third parties, as they need to protect user privacy. If they decide to share, they need to make an extra effort to anonymize and aggregate the data. Another approach was suggested by the Open Algorithms (OPAL) initiative, which advocates moving the algorithms to the data [71]: raw data are never revealed to outside parties, and only vetted algorithms run on the telecom companies' servers. Nevertheless, with this study, we believe that we promote the value of using telecom data beyond its primary use. Besides this, telecom data introduce other gaps and issues when it comes to mobility studies. They are defined by the resolution of the radio base station (RBS) domains, which are not homogeneous across the case study area. Regarding trip generation patterns, the spatial resolution is not fine enough to capture spatially detailed movements. Moreover, it is difficult to distinguish drivers from other traffic participants. It is likely that advanced techniques that analyze the speed of changing locations could distinguish pedestrians from others and overcome this problem. However, an issue remains in urban zones, where traffic intensity is high and speeds are very low. It is also a challenge to differentiate between passengers and drivers. To circumvent the issues mentioned above, we did not extract precise spatial-temporal patterns, but probabilities of movement for each local community. Given the results, we argue that this is sufficient for a spatially larger case study area, such as the municipality level we chose to model. In the context of transport studies, the results represent the estimated traffic flow and its emissions during different periods of the day. Traffic flow assessment is very useful in traffic planning as a measure of exposure.
In many traffic safety studies, exposure measures are the most important factor for modeling traffic accidents and determining the factors that contribute to them [72,73]. In addition, exposure measures play an important role in traffic planning and traffic infrastructure development, where they help transport professionals to better understand the mechanisms and factors for smooth traffic flow and the alleviation of its emission. To make quality decisions with respect to traffic regulation and its emission, data are needed at a precise spatial-temporal level and in near real-time, which is not always available. This study emphasizes the value of data and makes a step forward in using big data in transport studies.

Conclusions
In this study, an agent-based methodology for traffic simulation and emission estimation was presented. The methodology relies on the use of heterogeneous data sources. Telecom data were used to assess temporal and spatial movement probabilities between different local communities in the case study area. Using these probabilities together with the official census, GIS data, and road network data from OSM, we built the traffic agent-based model in the GAMA platform. Furthermore, we used the model output to estimate the traffic emissions. Using data from automatic vehicle counters and air quality data measured at several stations in the case study area, we showed that the model results are consistent with the actual traffic and emission conditions. However, the achieved accuracy was dictated by the available data. Nevertheless, this study represents a positive step towards using and combining heterogeneous big data in urban studies.
The proposed methodology has several constraints that need to be addressed. Travel demands for different modes (pedestrians, trucks, etc.) were not extracted from the telecom data; instead, we calculated the movement probabilities and incorporated them into the model. To simulate the traffic, we used the Advanced Traffic Plugin offered by the GAMA framework, which models precise traffic conditions by including lanes, street directionality, drivers' behavior, etc. The model needs to be run at a 1-s time resolution, and the processes within the model cannot be parallelized. With a larger number of agents and a larger traffic network to simulate, we could end up with a very complex and time-consuming model; for the proposed model with 10% of the population simulated, the simulation lasted up to 4 days, which is a relatively long period. Within the model, different types of vehicles cannot be distinguished, and traffic lights were excluded. To simplify the model, we only included the main roads in the peripheral parts of the municipality. We did not incorporate the emission calculus within the model but instead calculated it a posteriori, which is inconsistent with the resolution of the built traffic model; we did this because of the lack of data and to avoid additional model complexity. Finally, we did not include any external factors, such as weather conditions (wind, temperature, etc.) or chemical reactions.
In future work, we intend to incorporate more data into the model and thus enrich it. In addition, we plan to test different traffic policies and observe their impact on traffic flow and emissions during one day.

Acknowledgments:
The authors acknowledge all companies that participated in providing the data, especially Telecom Serbia, for sharing anonymized telecom data.

Conflicts of Interest:
The authors declare no conflict of interest.

Figure A1. Origin-Destination probability matrix between local communities in the case study area; red ticks indicate local communities in the city area, while blue ticks represent local communities located outside the city.

Figure A2. Calibration: Pearson (p) and Spearman (s) correlation coefficients between the predicted and observed speed in the automatic vehicle counters data set. The selected day from the automatic vehicle counters data set was 6 November 2019. Please note: the axes are not on the same scale, to emphasize the captured trend.

Figure A3. Validation: Pearson's (p) and Spearman's (s) correlation coefficients between the predicted and observed number of cars in the automatic vehicle counters data set. The selected day from the automatic vehicle counters data set was 3 December 2019. Please note: the axes are not on the same scale, to emphasize the captured trend.

Figure A4. Validation: Pearson's (p) and Spearman's (s) correlation coefficients between the predicted and observed speed in the automatic vehicle counters data set. The selected day from the automatic vehicle counters data set was 3 December 2019. Please note: the axes are not on the same scale, to emphasize the captured trend.