Review of Transit Data Sources: Potentials, Challenges and Complementarity

Public transport has become one of the major transport options, especially when it comes to reducing motorized individual transport and achieving sustainability while reducing emissions, noise and so on. The use of public transport data has evolved and rapidly improved over the past decades. Indeed, the availability of data from different sources, coupled with advances in analytical and predictive approaches, has contributed to increased attention being paid to the exploitation of available data to improve public transport service. In this paper, we review the current state of the art of public transport data sources. More precisely, we summarize and analyze the potential and challenges of the main data sources. In addition, we show the complementary aspects of these data sources and how to merge them to broaden their contributions and face their challenges. This is complemented by an information management framework to enhance the use of data sources. Specifically, we seek to bridge the gap between traditional data sources and recent ones, present a unified overview of them and show how they can all leverage recent advances in data-driven methods and how they can help achieve a balance between transit service and passenger behavior.


Introduction
Public transport provides an essential service whose relevance is increasingly recognized. Indeed, it helps to reduce road congestion, air pollution and energy as well as oil consumption. However, managing the public daily commute network is a difficult task, particularly today with rapid urbanization and the associated population increases, especially in developing countries. Therefore, public transportation systems need to adopt appropriate tools and take advantage of the available data to address these challenges. Indeed, it is shown in several studies (e.g., [1][2][3][4]) that the acquisition of reliable information and data is crucial for the proper functioning of public transport information systems.
We note, first, that, in general, a public transport (or transit) system is comprised of supply (presented through transit service) and demand (reflecting passenger behavior). The ultimate goal is then to achieve a balance in which transit services meet the needs of passengers, using the available information.
On the one hand, classic data sources on transit service, which are generally provided by agencies, are based on transit schedules, stations and route information. However, these static data are not informative in terms of disruptions (e.g., delays, interruptions) as they are based entirely on schedules, which are expectations rather than observations of services. Moreover, schedules, which may or may not be published, often leave certain details unspecified, as is the case, e.g., when it is merely said that a bus or train runs every ten minutes. Rather than using schedules, public transit services may also run on demand and other options may apply as well. On the other hand, traditional manual approaches to collect information on passenger demand [...] to our study the focus is on managing data from emerging transportation technologies to support decision making. Recent examples include [14,15].
We notice that these reviews aim to show different applications of big data in public transport rather than to focus on the data sources themselves. Where data sources are considered, the various data challenges are not covered extensively. Papers that do review data sources are tailored to a specific type of data and are referenced in the corresponding sections. Regarding applications, due to the variety of big data applications in public transport, the reviews described are also not exhaustive. For example, a number of the applications highlighted in [16] are not mentioned in these reviews. A possible three-layer information management framework could address the information itself, the related information systems and the necessary infrastructure; see Section 5, which uses earlier ideas from [17].
In this paper, we instead focus on the different opportunities and challenges in each type of data, as well as their integration. In practice, we aim to summarize and provide insights on the main applications and opportunities of different data, to highlight their challenges and to show how to fuse them. To the best of our knowledge, this is the first paper to review public transport data sources in this way and to include recent related papers and reviews.
More precisely, we are interested in the main data sources, which are automatic vehicle location (AVL), automatic fare collection (AFC) and automatic passenger counting (APC) systems. These data are endogenous as they can be accessed directly by the agencies, if the needed technologies are available. In addition, we are interested in exogenous sources, such as weather, traffic, social media, smartphone and survey data, which also contain valuable information. Based on a comprehensive discussion of the various types of data sources and related literature, we suggest an information management framework that can be used to streamline future work in this area.
The rest of the paper is organized as follows: In the next section, we explore endogenous data sources. Section 3 is dedicated to exogenous data sources. In Section 4, we attempt to put the different sources into perspective with a specific focus on data-driven implications. This is followed by the framework presentation. Section 6 is devoted to a conclusion and summarizing discussion of the current state of the art.
To ease the reading, the Table at the end of the paper provides abbreviations as they are used in this manuscript.

Endogenous Data Sources
In this section, we provide an overview of possibly relevant internal data for various stakeholders, as well as information systems in public transport. A crucial condition for general usage is that these data sources are openly accessible, correct, consistent and updated on a regular basis in order to ensure the functionality, reliable use and related output of the respective information systems. The purpose of these data may be interlinked with supporting operations, knowledge about the public transport system, as well as legal issues, among others. Data may be static or dynamic; the focus in the sequel is on the latter. Note that the term dynamic data mostly refers to data changing over time.
Our primary interest refers to AVL, AFC and APC as represented in the next few subsections. Beyond that, additional endogenous data may be utilized, as e.g., indicated in [18], who investigate the use of data on link flow, destination count and/or average travel distance. A case study is provided for the London Piccadilly underground line (United Kingdom).

Automatic Vehicle Location
AVL systems are computerized vehicle tracking systems that work by measuring the position of each vehicle in real time and relaying that information back to a central location. AVL systems generally collect the location of vehicles by broadcasting sensor values at very short intervals (most often between 10 and 30 s, depending on radio capability). Many technologies can be used for AVL (sometimes also called automatic vehicle monitoring, AVM). Examples of such technologies are GPS, signpost and odometer interpolation, ground-based radio and dead reckoning. For more information about the technological aspects that led to the introduction of AVL, the interested reader is referred to [19]; their detailed description is beyond the scope of this paper. We should mention that some technologies that were important for a while are of diminishing importance today. GPS currently seems to be the most used technology in the western hemisphere [20]. The GPS system works through a network of orbiting satellites that transmit signals to the ground. Special receivers on each vehicle read the available signals to determine their position [21]. Then, the geographic location (often measured by latitude and longitude), along with the date, time and other operational data, is distributed to various stakeholders, e.g., transport companies or even a transit agency. Note that most papers nowadays, in contrast to the 1990s, no longer distinguish between differential GPS (DGPS) and GPS (as a United States-based system). Formally, DGPS was introduced to achieve improved location accuracy by using not only satellites but also ground-based reference stations. More recent developments include further satellite-based systems, notably Glonass (a Russian system), Galileo (a European global navigation satellite system (GNSS)) and BeiDou (a Chinese system). A few references focusing on various transportation-related issues regarding these systems include [22][23][24][25][26].
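To make the structure of such location records concrete, the following minimal sketch assumes that each AVL ping is a tuple of timestamp, latitude and longitude (a hypothetical layout, not a format taken from any of the cited systems) and estimates the average speed between two consecutive pings using the haversine great-circle distance. The coordinates and the 20 s interval are illustrative values only.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speed_kmh(ping_a, ping_b):
    """Average speed between two pings; each ping is (timestamp, lat, lon)."""
    (t1, lat1, lon1), (t2, lat2, lon2) = ping_a, ping_b
    seconds = (t2 - t1).total_seconds()
    if seconds <= 0:
        raise ValueError("pings must be in strictly increasing time order")
    return haversine_m(lat1, lon1, lat2, lon2) / seconds * 3.6


# Two hypothetical pings, 20 s apart (within the 10-30 s range mentioned above)
p1 = (datetime(2021, 6, 13, 8, 0, 0), 53.5500, 9.9900)
p2 = (datetime(2021, 6, 13, 8, 0, 20), 53.5510, 9.9900)
print(round(speed_kmh(p1, p2), 1), "km/h")
```

In practice, such pairwise speeds are sensitive to positioning noise, which is one reason why the validation issues discussed later in this section matter.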
The main purpose of adopting AVL systems is to allow agencies to remotely track the location of their vehicle fleet, e.g., using the internet. In fact, these data are a potentially rich source of information on actual fleet operations and are commonly used, particularly for the evaluation of transit services. In the following, we summarize available applications.
First, we note that the importance of introducing real-time AVL has long been recognized. For instance, about a quarter-century ago, Ref. [27] illustrate how AVL could be integrated with static data and discuss fundamental aspects of passenger information systems integrating it. Furthermore, Ref. [28] illustrate the importance of tracking bus locations to enable real-time control of the timed transfer. Additionally, AVL is a necessary feature to maintain reliability. Ref. [29] presents a methodology to measure reliability via AVL. Ref. [30] emphasize that AVL, along with other data sources, could overcome on-board surveys that were the traditional way of assessing transit services. Over the last decade, more and more research in this area has been appearing. In [31], an approach is proposed that leverages data from an AVL system to improve transit on-time performance. Ref. [32] presents a methodology for identifying bus stops that do not meet performance standards for on-time performance and factors that cause under-performance. A more advanced approach is proposed in [33] for the same purpose, which aims to characterize bus stops for routes in which reliability is insufficient along with their causes to provide preventive strategies. This work has been extended in both [34] and [35], in which the concept of punctuality is introduced instead of reliability. The main difference, according to the authors, is that the former additionally takes into account the arrival time of passengers. In particular, in [35], the authors propose a web platform to support transit managers evaluating their service. In [36], the authors adopt another approach for delay analysis which aims to detect the stops most vulnerable to delays in order to propose delay reduction interventions.
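As a rough illustration of how on-time performance per stop can be derived from AVL observations, the sketch below compares observed with scheduled arrival times and flags stops whose on-time share falls below a threshold. The record layout, the tolerance window (up to 60 s early, up to 300 s late) and the 80% threshold are assumptions chosen for illustration; they are not taken from the methodologies in [31][32][33].

```python
from collections import defaultdict


def on_time_share(observations, early_s=60, late_s=300):
    """Share of on-time arrivals per stop.

    observations: iterable of (stop_id, scheduled_s, observed_s),
    with times given in seconds since midnight.
    An arrival counts as on time if its deviation from schedule
    lies within [-early_s, +late_s].
    """
    total = defaultdict(int)
    on_time = defaultdict(int)
    for stop_id, scheduled, observed in observations:
        deviation = observed - scheduled
        total[stop_id] += 1
        if -early_s <= deviation <= late_s:
            on_time[stop_id] += 1
    return {stop: on_time[stop] / total[stop] for stop in total}


def underperforming(shares, threshold=0.8):
    """Stops whose on-time share is below the (assumed) threshold."""
    return sorted(stop for stop, share in shares.items() if share < threshold)


# Hypothetical observations for two stops
obs = [
    ("A", 1000, 1030),  # 30 s late: on time
    ("A", 2000, 2100),  # 100 s late: on time
    ("B", 1100, 1500),  # 400 s late: not on time
    ("B", 2100, 2150),  # 50 s late: on time
]
shares = on_time_share(obs)
print(shares)
print(underperforming(shares))
```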
Ref. [37] propose an approach that leverages the bus vehicle location to improve the reliability of bus services by prioritizing their signals. Their approach consists of the use of connected vehicle technologies and the implementation of an adaptive optimization model of signal synchronization. The authors of [38] are also interested in the optimization of transit based on AVL data.
Regarding the validation of the AVL data, this issue could be addressed by improving and investing in the underlying technology (e.g., GPS) and extracting the most accurate information from it. Ref. [39] consider the problem of reconstructing vehicle trajectories from sparse sequences of GPS points. For more information on validating AVL from a technological perspective, the interested reader is referred to [40].
Other AVL validation solutions could also be effective. For example, in [41], Google researchers aim to match observations of the trajectories of transit vehicles with the routes they serve, in order to detect travel changes using a scoring method. The issue of AVL data validation could be studied more generically by detecting anomalies for the sensor data (e.g., [42]).
In Table 1, we summarize the current work on AVL. More precisely, for the main papers above we highlight the application aspect, the adopted methodology and their potential, which reflects the main contributions and findings. In Table 1, the papers are sorted in chronological order to give an insight into the evolution of work on this data source. Moreover, some aspects not highlighted above (e.g., the methodology adopted in each paper) are given.
At the end of this section, we notice that AVL is a widely adopted data source that allows tracking and improving the transit service. However, in general, it is not primarily aimed at analyzing passenger behavior. That is, even if AVL data could have an impact on passengers (e.g., they could change their travel decisions if they are informed of delays), the ability of AVL systems to directly and individually extract passenger behavior is limited. Additionally, AVL data could be improved by adopting other data sources. Furthermore, in some cases vehicle positions are not available and other types of data could be adopted to replace them (beyond a weak radio network connection in rural areas, this could even happen if tunnels are equipped with WLAN technology).

Automatic Fare Collection
First of all, we note that smart card technology is the core technology for AFC implementation. It has been adopted in significantly high numbers for transit systems since 1990 [43]. The referred paper highlights some advantages of using this technology over traditional payment options. In particular, one of its main benefits is that it presents a rich source of data (beyond benefits, one may also consider behavioral issues arising, e.g., if cash payment is diminishing or discarded). Indeed, when a passenger taps the card at a station, their information is recorded in the AFC system. In other words, AFC provides information on passengers paying with a smart card or other forms of electronic tickets. We should mention that a more recent review on this is available [44], though it does not attempt to put itself into perspective regarding [43].
On the one hand, with regard to the technology associated with these data, we note that the last years have seen considerable progress in the design and capabilities of fare payment, media and equipment and that this progress is continuing rapidly. In [45], the authors review and assess emerging trends and developments related to fare payment and collection technology (magnetic stripe and smart card technologies). A more recent overview and comparison of current technologies for AFC can be found in [46] and a recent review of the fare evasion literature is provided in [47]. The claim that such data may replace surveys or questionnaires is supported by [48], which estimates fare evasion for the case of Lyon, France, using fare collection data, fare inspection data and counting data.
On the other hand, when it comes to data, which is our main focus here, we note that acquiring travel information from smart card data is a growing trend. It becomes clear that a large part of the work on public transport data sources includes this type of data. Indeed, many researchers aim to obtain information at a very low cost and smart card data fulfills this need as it provides valuable information for the analysis of both the demand and the transit service. In fact, by analyzing the literature, we notice that AFC is mainly tailored to study the behavior of passengers and the characteristics of public transport demand. However, it could also be useful for other purposes, including the analysis and assessment of transit services, as emphasized in [43]. The referenced paper summarizes the different applications of AFC data on the strategic, tactical and operational level. As a matter of fact, AFC data could provide information on passenger demand and a behavioral passenger analysis at the strategic level, while it could provide information on the evaluation of transit service at the tactical and operational level. One of the most important applications relates to spatio-temporal data and implied mobility patterns of passengers. Given data security issues, it may be difficult to measure both the long-term mobility and stability of transit riders' travel patterns. To accomplish this, e.g., Ref. [49] investigate a metric for measuring the similarity of smart card data over time (providing evidence for distinguishing regularity and infrequency over a time period of five years for smart card data of Beijing). Furthermore, parts of the recent survey of [50] largely focus on smart card data use.
First, with respect to passenger behavior and demand for public transport, we note that they have traditionally been estimated by surveys. However, surveys can be unreliable and subject to respondent bias. Further, they are more difficult to combine with exogenous data sources (e.g., weather, traffic) unless appropriate processing tools are adopted (Section 3.5): most references using questionnaires in public transport neither provide weather data nor report when the survey was conducted, which restricts the lessons learned (some exceptions are mentioned below). Indeed, it has been pointed out in many papers that smart card data go beyond traditional survey approaches by providing more complete and comprehensive information for public transport (e.g., [51]). Although some survey data attempt to provide detailed and complete information (e.g., Hamburger Verkehrsverbund (HVV)), they have to be updated frequently and are resource-consuming. Smart card data, as an alternative, make it possible to deduce information on individual passengers (most often via their card numbers). Complementing this by means of the mystery shopping concept is mentioned in [52]. Mystery shopping is a marketing method intended to measure the quality of the service and to gather other related information. Note that various transport companies provide an annual customer satisfaction report; see, e.g., Ref. [53] for the HVV in the city of Hamburg, Germany or [54,55] for the city of Qingdao, China. Such reports may be available even where smart card data do not exist or are not used.
Regarding the literature on this topic, Ref. [56] use smart card data to measure the extent to which public transport users change their behavior over time. We note that different methods have been adopted to process AFC for the aforementioned reason. For instance, Ref. [57] propose a stochastic methodology that analyzes travel behaviors using real-time smart card data from an AFC system that reflects the characteristics of transit users. Another approach for grouping passengers according to their temporal habits is presented in [58]. In [59], a methodology is developed to relate disaggregated AFC trip data to published timetables for the purpose of studying passenger incidence behavior. In [60], an index is set to quantify the range of preferences of users who always choose to take the same route.
In general, the analysis could be carried out for different modes of transport (both rail and bus). However, some of the work could be tailored to a specific mode. In [61], the authors analyze transit demand in order to propose customized bus services. Ref. [62] focuses on the detection of home location and travel purpose for cardholder subway passengers. We should note in passing that other modes of transportation allow even more comprehensive analyses (especially in the sharing economy; see, e.g., Ref. [63] for bike sharing).
In particular, many papers use AFC data to derive the origin-destination (O-D) matrix. Such a matrix aims to build models of travel demand and to quantify transport demand between geographic regions of a city, which are also used in the analysis of travel behavior. Here again, AFC data represents a useful alternative to household and on-board passenger surveys as illustrated, for instance, in [64]. In fact, the authors argue that household surveys can significantly underestimate demand and that AFC data needs to be complemented. In addition, Ref. [65] carry out a comparison with a large O-D survey in the city of Santiago, Chile. The authors aim to validate the results of the survey and they identify certain errors by combining AFC and AVL data. Refs. [66,67] present two different methodologies for estimating the destination of passenger journeys from AFC data.
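As a simple illustration of what an O-D matrix is once both ends of each journey are known, the sketch below counts trips per origin-destination pair. It assumes a hypothetical list of (origin zone, destination zone) records, for instance from a tap-in/tap-out system; the estimation problems addressed in [64][65][66][67] arise precisely when such complete records are not available.

```python
from collections import Counter


def od_matrix(trips):
    """Count trips per (origin, destination) pair.

    trips: iterable of (origin_zone, destination_zone); records with a
    missing end (None) are skipped rather than guessed.
    """
    return Counter((o, d) for o, d in trips if o is not None and d is not None)


def as_table(matrix):
    """Render the counts as a nested dict {origin: {destination: count}}."""
    zones = sorted({z for pair in matrix for z in pair})
    return {o: {d: matrix.get((o, d), 0) for d in zones} for o in zones}


# Hypothetical trips between three zones; one record lacks a destination
trips = [("Z1", "Z2"), ("Z1", "Z2"), ("Z2", "Z1"), ("Z3", "Z1"), ("Z1", None)]
matrix = od_matrix(trips)
for origin, row in as_table(matrix).items():
    print(origin, row)
```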
In the event that the boarding location is available (e.g., in combination with AVL data), this information can be distributed spatially over a network. We note that this is among the advantages of AFC beyond surveys as illustrated, for example, in [51], which aims to understand the spatio-temporal dynamics of passenger travel behavior in the context of a public transport network. Ref. [68] propose a trip-chaining method which uses AFC and AVL data (provided in the General Transit Feed Specification (GTFS); see [69]) to infer the most likely trajectory of individual passengers in transit. The same problem is also addressed in [70]. In [71], a method is proposed to model mini-activities within the framework of the generated trips, by mixing the trip history and the recommendations of the trip planner.
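The idea of trip chaining can be sketched under a common simplifying assumption: the destination of a trip is the boarding stop of the same card's next trip on that day, and the last trip of the day returns to the first boarding stop. This is an illustrative simplification and not the specific method of [68] or [70]; the record layout (card, day, time, boarding stop) is likewise hypothetical.

```python
from collections import defaultdict


def infer_destinations(taps):
    """Infer alighting stops for tap-in-only AFC records by trip chaining.

    taps: iterable of (card_id, day, time_s, boarding_stop).
    Returns a list of (card_id, day, time_s, origin, destination), where
    destination is None if it cannot be inferred under the assumption above.
    """
    by_card_day = defaultdict(list)
    for card_id, day, time_s, stop in taps:
        by_card_day[(card_id, day)].append((time_s, stop))

    chained = []
    for (card_id, day), events in sorted(by_card_day.items()):
        events.sort()  # chronological order within the day
        n = len(events)
        for i, (time_s, stop) in enumerate(events):
            if n < 2:
                destination = None  # a single tap gives no chain to follow
            else:
                # next boarding stop; the last trip wraps to the first stop
                destination = events[(i + 1) % n][1]
                if destination == stop:
                    destination = None  # same stop: no movement can be inferred
            chained.append((card_id, day, time_s, stop, destination))
    return chained


# Hypothetical taps: card c1 commutes A -> B -> A, card c2 taps only once
taps = [
    ("c1", "2021-06-14", 8 * 3600, "A"),
    ("c1", "2021-06-14", 17 * 3600, "B"),
    ("c2", "2021-06-14", 9 * 3600, "C"),
]
for row in infer_destinations(taps):
    print(row)
```

Combining the inferred stops with AVL data would then allow the alighting time to be estimated as well, which this sketch does not attempt.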
Second, with respect to transit service, smart card systems can be used to calculate specific performance indicators on a transit network, like schedule adherence (it can be estimated by comparing the boarding times given by the AFC at given stops with the route schedule). An example of a paper using these data to assess service reliability can be found in [72].
More generally, we note that there is a strong connection between the different applications described above. In fact, the O-D matrix relies on the passenger behavior, which depends on the transit service. Hence, some studies are interested in investigating these different issues simultaneously. For example, Ref. [73], using AFC data, identify and process observations of travelers' route choices between the same O-Ds under different travel environment conditions. Moreover, in [74], smart card transactions are used to gain insight into the trade-offs between travel time, transfers, waiting time and congestion in the choice of public transport routes, based on revealed preference data. In general, we can state that AFC data can be used to infer trip purposes and to reveal travel patterns in an urban area. As an example among many others, [75] demonstrates the process of trip purpose inference based on smart card data in a case study in the United States. Case studies for the Nanjing (China) metro allow us to recognize congestion areas, as well as commuting characteristics of residents [76]. In a separate analysis, this also leads to a characterization of the jobs-housing ratio for various districts of the city [77]. Similarly, one may also detect areas of specific home and work places. A case study for London (United Kingdom) is given in [78]. Beyond classical methodology, current developments in data science and data mining allow more in-depth investigation and analysis. For instance, dynamic time warping can be used for comparing different classes of time series data. A case considering an application for Gatineau (Canada) is provided in [79]. The most recent and also most interesting study relates to a case for Taipei (Taiwan), provided in [80]. The authors utilize data from the local AFC smart card system EasyCard for obtaining spatio-temporal station-to-station metro trip patterns. Study results for a time period in 2019 are compared to the modified magnitudes of passenger travel within the same time period of the very early stage of the recent coronavirus pandemic, indicating implied spatial and temporal heterogeneity. A major benefit of this study is that all data used are freely available as open data.
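Since dynamic time warping is mentioned above as a tool for comparing time series, a minimal sketch of the classic dynamic-programming formulation may help. It uses the absolute difference as local cost and no warping window; the hourly boarding profiles are hypothetical and the code is not the procedure used in [79].

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the minimal accumulated cost of aligning the first
    i elements of a with the first j elements of b.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to an already used b element
                cost[i][j - 1],      # b[j-1] is matched to an already used a element
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Two hypothetical boarding profiles with the same peak, shifted by one hour
morning = [0, 5, 9, 5, 0, 0]
shifted = [0, 0, 5, 9, 5, 0]
pointwise = sum(abs(x - y) for x, y in zip(morning, shifted))
print(pointwise, dtw_distance(morning, shifted))
```

The pointwise difference penalizes the one-hour shift, whereas the warped alignment treats the two profiles as having the same shape; this is why the technique suits the comparison of passengers with similar but time-shifted habits.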
In particular, one of those issues depending on both passenger behavior and transit service is the waiting time. AFC could also be used to estimate passenger waiting times as in [81]. Moreover, Ref. [82] record the trips using AFC data and then aim to identify factors affecting passenger waiting times. Another work which aims at analyzing both travel demand and public transport service is [83]. The authors propose to cluster smart card data from passenger-oriented (travel demand) and station-oriented (transit service) perspectives.
Another important issue in public transport is how to deal with disturbances and disruptions [84,85]. AFC is also used in studying this problem and, in particular, the impact of the pandemic. In fact, smart card data could help understanding the impact of COVID-19 and mitigate its impact, as illustrated, for instance, in [80,86,87].
We note that AFC could also be adopted to estimate AVL data. The aim of [88] is to develop an approach that uses AFC data to estimate passenger boarding information and, hence, vehicle location. On the other hand, there are persistent issues regarding the use of smart cards for public transport.
First, the problem with AFC systems is that there is a lack of standards. In fact, the TRB is calling for the establishment of a standard format for card data so that each agency stores data in the same way [89]. The problem of lack of standardization is also emphasized in the World Bank reports (e.g., [90] for the case of Poland). Therefore, a number of authors are interested in this issue. For instance, Ref. [91] propose a common conceptual framework based on currently available technical standards and implementation procedures, which can be generally applied to share smart card data between different transit agencies. Second, data privacy and security is an issue that needs to be addressed. In Section 4.3.3, we delve into this issue. Third, the validation of AFC data is still an issue that could be further explored. Although AFC data are more trustworthy than traditional surveys [92], they are not free from errors. As pointed out in [93], they might frequently underestimate the number of passengers due to fare evaders not purchasing tickets or holding invalid ones. In addition, smart card data are criticized for not providing accurate knowledge of passenger volumes and not being able to track both the origin and destination of passengers (for fare systems in certain regions and countries). Some works have therefore been proposed in this regard. For instance, Ref. [94] focus on AFC cleaning and claim to be able to identify abnormal passenger behavior by tracking and comparing the frequency of occurrence of passenger cards, which reflects an AFC system error. However, to do this, the authors also underline the need to enrich these data with other sources in order to have a better analysis. Therefore, APC is considered to be a more representative data source, as highlighted, for example, in [2], and is the subject of the next section. In Table 2 we summarize, as in Section 2.1, the work on the AFC data source. In Table 2, we have additionally included the column "Data", which highlights the previously mentioned data source associated with each paper (if no other data source is involved in the study, the sign "-" is displayed in the column).

Automatic Passenger Counting
APC works with a device installed on transit vehicles that counts the number of passengers, most often those boarding and alighting at each stop. These data, along with location and time information, can provide useful information. APC data have an advantage over AFC data as they are ticket-independent, though they are also automatic. In fact, the importance of this data source is recognized in a number of studies. One of the earliest documents that highlighted APC's opportunities and challenges up to 2008 is [96]. Ref. [97] also finds that by using vehicles equipped with APC, agencies could improve their performance. In most cases, these data are combined with AVL to improve transit service. Ref. [98] extend their previous work in [28] and find that control strategies could be improved by combining passenger tracking and counting technologies (i.e., APC and AVL). Below, we underline influential work published after [96].
First, Ref. [99] propose a method combining AVL and APC to estimate the mean and variance of transit vehicle delays caused by signalized intersections. Second, we note that the use of these data is more challenging than AFC and AVL. That is, unlike both of them, which often consist of structured data, APC is noisy in many cases and hence extracting useful information is not straightforward. Thus, a number of papers have been proposed for this purpose. Indeed, it is crucial to ensure an accurate passenger count as inaccurate data in this regard can obstruct other information.
Before addressing the challenges, we note that there are a number of counting techniques that aim to calculate, for example, the number of passengers boarding and alighting at each station or the number of passengers in a vehicle at a specific time. With regard to the first case, it can sometimes be difficult to differentiate between passengers alighting from the inbound trip and passengers boarding to the outbound trip. Such data are important for determining the number of passengers at each station. Another problem with respect to the O-D estimation problem is to distinguish between inherited (from a previous trip) and left behind (to a new trip) passengers. In general, these issues depend on each case study. For instance, the method proposed in [100], which exploits weighing systems, is not practical in that case, but it is useful to control braking in rail systems, which is the purpose of their paper.
Another issue is to match APC data with AVL, especially for buses that do not have a specific stop location (one bus may stop at a slightly different area of the station if its position is occupied by another vehicle). Ref. [101] aim to provide a framework for solving the problem of APC data correspondence with the bus stop. The authors are also interested in data validation and anomaly resolution. They divide APC data anomalies into operation in service and technical problems. The former concerns service problems (e.g., unforeseen breakdowns, interrupted journeys) while the latter could correspond, for example, to nonlogical values (e.g., imbalances between boarding and alighting on an entire journey), which means that a technical problem has been encountered.
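The kind of non-logical values mentioned above can be checked with very little code. The sketch below assumes a hypothetical per-trip list of (stop, boardings, alightings) records, derives the on-board load after each stop and reports two symptoms of a technical problem: a negative load and an imbalance between total boardings and alightings over the entire journey. It is an illustration only and not the framework of [101]; the tolerance parameter is an assumption.

```python
def load_profile(counts):
    """On-board load after each stop for one vehicle trip.

    counts: list of (stop_id, boardings, alightings) in stop order.
    """
    load = 0
    profile = []
    for stop, boardings, alightings in counts:
        load += boardings - alightings
        profile.append((stop, load))
    return profile


def anomalies(counts, tolerance=0):
    """Describe non-logical values in the APC counts of one trip."""
    issues = []
    for stop, load in load_profile(counts):
        if load < 0:
            issues.append(f"negative load at {stop}")
    total_on = sum(on for _, on, _ in counts)
    total_off = sum(off for _, _, off in counts)
    if abs(total_on - total_off) > tolerance:
        issues.append(f"imbalance: {total_on} boardings vs {total_off} alightings")
    return issues


# Hypothetical trips: one consistent, one with a counting error
consistent = [("S1", 10, 0), ("S2", 5, 3), ("S3", 0, 12)]
faulty = [("S1", 4, 0), ("S2", 0, 6)]
print(load_profile(consistent), anomalies(consistent))
print(load_profile(faulty), anomalies(faulty))
```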
As the proposed methods need to be validated, benchmark data are very useful. Ref. [102] claims to present the first large-scale benchmark public data set for video-based approaches to passenger counting. This data set contains depth videos recorded with a specific camera that combines red, green and blue (RGB) color and depth sensors. (The data set is available at https://github.com/shijieS/people-countingdataset; accessed on 13 June 2021.) Additionally, the paper presents a method for counting people in crowded scenes in real time and evaluates its performance on the proposed data set. Other papers focus on improving people counting in a generic way and on proposing technologies for this purpose (e.g., [103]).
The methods highlighted above aim to exactly compute the passenger volumes. Other works intend to estimate them using machine learning (ML) and statistical methods. Regarding the former, Ref. [104] present a passenger counting system that combines a conventional neural network detection model and a spatio-temporal context model to address the counting problem in low-resolution scenes and with a varying illumination. Regarding the latter, in [105], the average passenger boarding and alighting time at stops (in addition to the bus dwell time) are explained using descriptive statistics. Other estimation and modeling approaches could also be used in this regard. For example, Ref. [106] use a mesoscopic assignment model for short-term predictions of transit on-board loads by considering predictive information about vehicle crowding.
Regarding statistical approaches, we note that statistical tests are also suitable for validating APC data. Ref. [107] adopt a revised and extended t-test to validate APC systems. We also note that the validation of the APC data could be performed simultaneously with the AVL data. Ref. [108] adopts a performance assurance methodology to identify unreliable archived AVL-APC data.
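As an illustration of the statistical validation idea, the sketch below computes a plain paired t statistic on the differences between manual reference counts and APC counts for the same trips. It is a textbook paired t-test, not the revised and extended test of [107], and the function name and input layout are assumptions.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(manual, apc):
    """Return (t, degrees_of_freedom) for the mean APC-minus-manual difference."""
    if len(manual) != len(apc) or len(manual) < 2:
        raise ValueError("need two equally long samples with at least 2 trips")
    diffs = [a - m for m, a in zip(manual, apc)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The resulting statistic is compared with the critical value of the t distribution for the given degrees of freedom; a small absolute value gives no evidence of a systematic counting bias.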
In Table 3, as in Section 2.2, we summarize the work on the APC. At the end of this section, we note that APC systems, despite their great advantages, do not enable obtaining information about the different passengers (e.g., senior vs. student) as in AFC. On the other hand, in addition to these internal data (AVL, AFC, APC), other external sources could provide a valuable source of data. In Section 3, we are interested in this type of data.

Exogenous Data Sources
In this section, we are interested in exogenous data sources, which are: weather, traffic, social media, smartphone and survey data.
While academic research often utilizes restricted field study data, more or less official data are available if one searches for them. Beyond the statistics accessible through GTFS, other sources are available on various levels. While many of them are commercially available (see, e.g., https://de.statista.com/statistik/kategorien/kategorie/16/themen/2368/branche/oeffentlicher-personennahverkehr/ as an example for Germany; accessed on 13 June 2021), others are freely obtainable (see, e.g., https://www.destatis.de/EN/Themes/Economic-Sectors-Enterprises/Transport/Passenger-Transport/_node.html as an example for Germany; accessed on 13 June 2021). Public transport service providers themselves also provide these data, often for free or for a nominal fee.

Weather
First, weather is another source of data that, while important, has not received much or the appropriate interest in recent reviews. In fact, it is widely recognized that adverse weather conditions have an impact on both the demand and the service of transit. As regards the former, passenger behavior could be affected by adverse weather conditions. For the latter, they can change the frequency of the service, or in the case of extreme events they can cause its cancellation. (Agencies could have a specific plan as, e.g., in New York: http://www.mta.info/press-release/mta-headquarters/mta-issues-updatepreparations-tropical-storm-isaias ; (accessed on 13 June 2021)).
From a transportation agency's perspective, weather conditions are seen as exogenous factors that must be monitored to react when needed. For this reason, it is of utmost importance for agencies to take advantage of the available technology, which has been developed in recent years, to extract weather conditions in real time. Indeed, real-time weather information is valuable in many fields and is beneficial to both individuals and businesses. Accordingly, many application programming interfaces (APIs) have been proposed. An API is a set of instructions that allows software programs to interact with each other ([109]); in this context, APIs are used to extract meteorological information in real time. Moreover, many technology companies are investing in developing technologies or algorithms to detect even small changes in weather conditions. In addition, a number of web scrapers have been developed in many languages, Python being one of them. A practical example can be found in https://medium.com/@dd93/collecting-weather-data-to-boostdata-science-models-with-selenium-390d9db88210 (accessed on 13 June 2021). Weather APIs could provide detailed information on different weather elements including temperature, precipitation, humidity and wind, to name a few. For a list of some of the most popular APIs, we refer to: https://medium.com/rakuten-rapidapi/if-youre-looking-to-build-anapplication-using-weather-data-then-you-ve-come-to-the-right-place-ae2115f2c61f and https://openweathermap.org/ (accessed on 13 June 2021).
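As a minimal sketch of how such an API could be queried, the following code builds a request for current weather and extracts a few elements from the JSON reply. The endpoint and field names follow the OpenWeatherMap current-weather API as we understand it and should be checked against its current documentation; the function names and the API key are placeholders introduced here for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the OpenWeatherMap current-weather service.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_url(city, api_key, units="metric"):
    """Compose the request URL; api_key is a placeholder for a real key."""
    return BASE_URL + "?" + urlencode({"q": city, "appid": api_key, "units": units})


def extract_conditions(payload):
    """Pick a few weather elements out of a decoded JSON reply."""
    main = payload.get("main", {})
    return {
        "temperature": main.get("temp"),
        "humidity": main.get("humidity"),
        "wind_speed": payload.get("wind", {}).get("speed"),
        # The rain block is only present when rain was observed (assumption).
        "rain_1h_mm": payload.get("rain", {}).get("1h", 0.0),
    }


def fetch_conditions(city, api_key):
    """Query the service over the network and return the extracted elements."""
    with urlopen(build_url(city, api_key), timeout=10) as response:
        return extract_conditions(json.load(response))
```

Separating the parsing step from the network call allows the extraction logic to be checked on stored replies without access to the service.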
From an academic perspective, there are a number of papers which have adopted weather data for public transport purposes. More precisely, in many papers the aim is to analyze the impact of weather on public transport and then come up with ideas to avoid or mitigate its impact. For example, Ref. [110] study the impact of weather conditions on transit ridership using time-based and station-based models. The authors claim that their proposed models could help reduce the impact of adverse weather conditions. It is notable that, although some papers (e.g., [110,111]) affirm that weather disturbances have a negative impact on transit ridership, such an impact varies across the different modes and with the initiatives of the transport associations and agencies towards their use under adverse weather conditions. For instance, Refs. [112,113] both suggest that the subway is less vulnerable to inclement weather and can replace other travel modes in this case. However, prevention measures are needed so that the subway system can cope with the threat of heavy rains. Regarding buses, Ref. [111] also claim that the installation of bus shelters can reduce this impact. For a survey on disturbances in public transport including weather-based ones, see [84].
We can conclude that the impact of weather depends on each zone, each city and its public transit service. Moreover, it depends on the meteorological nature of the region and the weather disturbances. Indeed, it is shown, for example in [114], that precipitation-related events contribute more to fluctuations in ridership than temperature-related events. In [115], regression models are developed to understand the impact of weather factors on daily bus transit ridership; the authors find that some of these factors are significant while others are not. Weather impact also depends on the nature of the population and economic activities, such as work and schooling. According to [116], weather conditions have a significant impact on students' commute mode choices. The difference also depends on the purpose of the trip (e.g., leisure, shopping and personal business) as stated, for instance, in [117].
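To illustrate the regression idea, the sketch below fits an ordinary least squares model of daily ridership on weather factors by solving the normal equations. It is a generic textbook formulation, not the model of [115]; the function name, the choice of predictors and the numbers in the usage example are made up for illustration.

```python
def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] minimizing squared error.

    rows: one sequence of predictor values per day; y: ridership per day.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented normal equations [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Synthetic example: (rain in mm, temperature in degrees C) per day.
weather = [(0, 10), (2, 12), (5, 8), (1, 20), (8, 15)]
ridership = [1000 - 30 * rain + 5 * temp for rain, temp in weather]
coefficients = fit_ols(weather, ridership)
```

In an actual study, the significance of each coefficient would additionally be assessed, which is what allows some weather factors to be retained and others discarded.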
More generally, we note that these data could be analyzed within the entire urban transport system. One of the main breakdowns of the impact of weather conditions on transport concerns the choice of travel mode. This problem is mainly related to public transport as it affects the demand for public transport and the behavior of passengers. In other words, weather conditions lead passengers to switch from private modes to public transport and vice versa. Examples of such work could be found in [118,119]. Additionally, Ref. [120] are interested in analyzing passenger behavior from an emotional perspective.
Other data sources could also be adopted and coupled with weather data to estimate their impact. The traditional way of doing so is through surveys. For instance, Ref. [119] analyze the impacts of weather and seasonality on commute mode choice using a survey in which the passengers indicate their preferences. A survey is also carried out in [112] to achieve the results described above. Ref. [121] reports on the collection of smart card data for public transit and weather records from Shenzhen, China. The data make it possible to establish an association between the use of public transport and weather conditions on an hourly basis and for each metro station, with certain limits. The integration of smart card and weather data has also been proposed in [122]. Regarding AVL, Ref. [123] use it to study the effect of weather conditions on the reliability of the travel time of road and rail transport, using a case study of the Melbourne tram network.
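The hourly, per-station association between smart card and weather data can be sketched as a simple join on the hour. The record layouts (station identifier with ISO timestamp, and rainfall keyed by hour) are assumptions for illustration and do not reproduce the data structure used in [121] or [122].

```python
from collections import Counter
from datetime import datetime


def hourly_boardings(taps):
    """Count smart card taps per (station, hour).

    taps: iterable of (station_id, ISO 8601 timestamp) pairs.
    """
    counts = Counter()
    for station, timestamp in taps:
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(station, hour)] += 1
    return counts


def attach_weather(counts, rain_by_hour):
    """Join hourly counts with hourly rainfall; missing hours yield None."""
    return [
        {"station": s, "hour": h, "boardings": n, "rain_mm": rain_by_hour.get(h)}
        for (s, h), n in sorted(counts.items())
    ]
```

The joined rows could then serve as input to a model such as a regression of boardings on rainfall.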
In Table 4, we summarize the work on the weather data source as in the previous sections. From all these sources, as indicated above, it becomes clear that weather conditions can be differentiated into various dimensions without being able to draw a unique conclusion. Especially the transport mode, different framework conditions due to the infrastructure (e.g., bus stations with or without shelter, public transport vehicles with or without air conditioning), customer type (e.g., blue-collar versus white-collar workers, students, elderly, handicapped, tourists) and behavior, the type of weather-based events (e.g., temperature, snow, rain and extreme events, such as storms), location and time can lead to very different outcomes. A clear-cut mode choice by customers under certain circumstances cannot be deduced in general but relates to the mix of these dimensions. For instance, the implied change of mode from passengers under certain circumstances is not always beneficial to the different stakeholders.

Traffic
Traffic is another exogenous data source impacting public transport. First, we note that the inclusion of traffic data in the design of public transit services started a while ago (e.g., [124]). However, in that paper, the objective was to model the traffic of a transit fleet (for the metro line). Such an application is no longer of interest with current technological advances regarding endogenous data sources. Today, interest has shifted to modeling and estimating external traffic information.
From a technological side, traffic data in general exploits data collected from fixed sensors placed in the road, such as circuit cameras, video recognition cameras, infrared sensors and radio frequency identification (RFID) sensors, but also floating car data (see below), etc. For a better understanding, we should note that RFID tags can be classified into two categories as being passive and active. This distinction depends on whether an internal power source is used to power the devices and to perform the broadcasting or data exchange. Similarly, as done for the weather data, we note that a number of APIs have been developed to extract traffic data. Examples of such APIs could be found in https://towardsdatascience.com/visualizing-real-time-traffic-patterns-usinghere-traffic-api-5f61528d563 and https://towardsdatascience.com/scraping-live-trafficdata-in-3-lines-of-code-step-by-step-9b2cc7ddf31f; accessed on 13 June 2021.
From an academic perspective, many initiatives have been proposed in recent years to advance research in this field. The main application of these data in public transport is the estimation of arrival times of buses. Indeed, it is well known that traffic is one of the main causes of bus delays and some papers aim to improve the estimate of arrival times based on traffic. For instance, Ref. [125] study the problem of estimating travel times on public transport buses with real-time traffic information. Google researchers are also interested in this topic, as their recent work ([126]) shows.
However, regarding these data, in many cases it is not straightforward to extract the necessary information. For instance, a challenging issue in traffic data is to distinguish an incident from a traffic congestion situation. Therefore, many approaches have been proposed to estimate traffic data in different case studies and a number of papers have been proposed in recent years to achieve this goal. By analyzing them, we note that today ML is one of the most-adopted approaches to estimate traffic, as there is a growing interest in leveraging big data sources and technologies to improve traffic estimation. For example, Ref. [127] propose a deep learning-based approach for traffic flow prediction. Other approaches, rather than ML, could be used. In [128], a spatio-temporal approach is proposed to detect traffic jams and incidents in real time by analysing GPS tracks belonging to moving vehicles. This is closely related to the idea of using timestamped geo-localization and speed data directly collected from moving vehicles, as it is known from floating car data (FCD), i.e., this aims at the provision of real-time traffic information services (see, e.g., [129]). Additionally, Ref. [130] propose a hybridization of deep learning and a spatio-temporal approach which, according to the authors, provide a more accurate prediction.
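To illustrate how timestamped positions from moving vehicles can be turned into traffic indicators, the sketch below derives segment speeds from a GPS track and reports the share of slow segments. The track layout, the function names and the 10 km/h threshold are assumptions for illustration; the sketch does not distinguish an incident from ordinary congestion and is not the method of [128] or [129].

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_speeds_kmh(track):
    """Speeds between consecutive fixes; track holds (unix_seconds, lat, lon)."""
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:  # skip duplicated or out-of-order fixes
            continue
        speeds.append(haversine_m(lat0, lon0, lat1, lon1) / dt * 3.6)
    return speeds


def slow_share(speeds, threshold_kmh=10.0):
    """Share of segments slower than the (illustrative) threshold."""
    if not speeds:
        return 0.0
    return sum(s < threshold_kmh for s in speeds) / len(speeds)
```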
One of the problems for traffic data is privacy. Ref. [131] focuses on the development of privacy mechanisms that would satisfy both privacy protection and data needs for urban traffic modeling applications using mobile sensors.
Conversely, the use of buses has an impact on traffic and it is important that its services help reduce traffic (by replacing taxis and cars). In [132], the aim is to assess the impact of bus operations on traffic congestion in Melbourne. The results indicate that Melbourne's bus network is helping to reduce the number of heavily congested road links.
In Table 5, we summarize the work on the traffic data source as in the previous sections. At the end, we note that other available data, mainly social media, could also be used to estimate traffic. This could be found in [127,133], which use deep learning with a different design. In the next section, we give some insights into the use of social media in public transport.

Social Media
Social media data (e.g., from Twitter and Facebook) consist of a collection of social interactions among a huge number of people and are a valuable resource for public transport analysis. In recent years, several attempts have been made to apply social media analysis in the field of public transport, and social media has shown promise in providing useful information about it. Ref. [134] shows some of the potential of using social media data to model traveler behavior and examines many opportunities and challenges in this regard; some of the issues included in the paper are mobility, trip planning, location prediction and privacy. In the following, we explore additional issues and other papers not included in this review.
First, social media can be used in public transport by deploying sentiment analysis to reveal public opinions regarding transit agencies. Ref. [135] propose a framework to evaluate the opinion of transit users on the quality of transit service using Twitter data. Another issue is to extract specific information, which captures public attention and has an impact on transit, such as accidents (which are traffic-related). As an example, Ref. [136] adopt social media data to detect traffic accidents using a deep learning approach. Ref. [137] evaluate how a social media platform is used in a case study to provide and share transportation information and respond to inquiries. In [138], the authors analyze travel behavior by modeling the relationship between characteristics of business clusters and check-in activities. In addition, social media could also be useful for exploiting other data besides traffic (which is highlighted above). For example concerning APC, Ref. [139] illustrate the existence of a moderate positive correlation between the flow of passengers and the rate of publications on social networks. In other words, social media could be adopted to exploit passenger count data.
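A correlation between passenger flow and publication rate, as reported in [139], can be quantified with the Pearson coefficient. The sketch below is a generic implementation; the input series (e.g., hourly passenger counts and hourly numbers of posts) are assumed to be aligned in time, and the function name is introduced here for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)
```

Values close to 1 or -1 indicate a strong linear relationship, whereas values around 0.5 are commonly read as moderate.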
However, social media data are very difficult to process compared to other data sources. Thereby, there are still several major challenges in handling social media data, which are unstructured, noisy, gigantic and contain a variety of information. We note that the authors, who are interested in using social media data, adopt advanced approaches, such as ML or hybridization of different methods, as can be seen in Table 6, which summarizes the work on this data source.
At the end of this section, we note that the GTFS real-time specification [140], which is managed by agencies, could include alerts in the 'service alerts'-type of information. However, to the best of our knowledge, such an initiative has not yet been explored.

Smartphone
The smartphone is a technology that offers a recent way to track the individuals' travel data, which could help in particular to track the mobility and passenger behaviors in transit systems.
From a technological perspective, as underlined in [7], we emphasize that GPS, Wi-Fi, accelerometers and Bluetooth are among the key ingredients of this data source. In [141], a random forest, which is an ML model, is adopted to leverage Wi-Fi and Bluetooth data in order to predict transport mode choices.
Regarding the applications, some are highlighted in [7]. There are also other applications available. For instance, Ref. [142] design a mobile crowd-sourcing approach to collect shared bus data, in order to optimize their routes. Another related approach, namely crowd-sensing, is used in [143]. The proposed approach consists of using the different mobile data to recommend the best cellular operator for each user.
In general, smartphone data could be an alternative to endogenous data sources (e.g., AVL and AFC) as they could enable the option of 'tracking and tracing' passengers ([144]), especially in cases where there are no GPS data available (which is the case in many developing countries, e.g., because of missing infrastructure). Moreover, as for AFC, they could enable the estimation of waiting times ([145]). However, as with social media, one of the issues that arises with such emerging data usage is privacy ([146]). This is an issue that could be studied more broadly in the internet of things (IoT) domain as in [147]. One approach to preserve privacy is to encrypt the data. Such an approach is included in [148], which is tailored to recommendation services.
We note that the study of the use of smartphones could be carried out in the field of transport in general, as these are common opportunities and challenges among different transport modes. For example, Ref. [149] insists on the natural promise of the use of smartphones in a travel behavior study (in our case, the behavior of passengers in transit). In addition, they could be used to inform passengers of relevant changes in transit service in the best way as discussed in [150].
These data could also be integrated with other data. For example, Ref. [151] aim to aggregate human activities deduced from mobile phone positioning and social media data, in order to analyze their impact on urban functions using a hidden Markov model-based approach. They can also be used to validate other data sources. For example, concerning AVL data, in [152], a real-time positioning method, which employs crowdsourced positioning data obtained from smartphone GPS, is developed with the aim of improving vehicle-positioning accuracy. The aim to integrate AVL and smartphone data to estimate the O-D matrix can also be found in [153]. In all these cases, a disclaimer regarding data security seems necessary; see also Section 4.3.3.
The studies considered on this data source are summarized in Table 7.
In addition, we note that [155] develop a framework for automated downloading and storage of GTFS data. They publish a curated collection of 25 cities' public transport networks. The proposed framework contains some interesting features (e.g., spatial and temporal filtering and technical validation). Other examples of extensively using GTFS data include [156,157]. Ref. [158] propose ideas for data analysis regarding the issue of eliminating bus stops and generating a revised bus network under some assumptions while maintaining certain levels of service and consistency, respectively.

Survey
Survey data are collected from a sample of a targeted audience that took a survey. As pointed out earlier (mainly in Section 2.2), surveys are the traditional approach to obtain information on the demand for public transport. This becomes visible, in addition to the papers referred to above, in the outcomes of conferences and workshops devoted to the use of surveys in transport, such as [5,159], which are also interested in big data sources highlighted above. We should note that our criticism on surveys does not hold for practical settings but for the way many of them are conducted and reported in academia. Moreover, in many cases survey data are also openly available from public transport service providers; see, e.g., Section 5.2.
However, surveys are still in use and could be combined with other data sources. Indeed, due to the variety of factors that influence passenger behavior and the fact that some of these factors (e.g., socio-demographic information) could not be fully represented mathematically and automatically, surveys are still of interest today. Transit accessibility is an example of an application, as some social and behavioral aspects still require further analysis. A survey to deal with this issue is adopted in [160]. Moreover, Ref. [161], like many others, use a survey to explore the potential of shifting from cars to public transport.
In particular, we note that surveys are main ingredients of the census data which are widely used in different studies including public transport. Census data are useful open data which help to include the socio-demographic factors in the analysis. In [162], the equity of transit accessibility of different cohorts is studied. The authors highlight the inequities using census data. In [86], census data, acquired through a survey, are adopted along with other data (e.g., GTFS) to examine the impact of COVID-19 on ridership based on socio-economic disparities. To do this, the authors examine the relationships between the impact of ridership and the explanatory socio-economic factors. The combination of census and GTFS data is also adopted in [156] to try to measure the gap between supply and demand (which is highlighted in the introduction).
Another reason for the persistence of surveys is that many agencies do not want to change their habits. However, to cope with the actual challenges, surveys must evolve and take advantage of the existence of new data sources and current developments in technology. As an example, Ref. [52] propose to adopt survey data for a primary study of mystery shopping in public transport and the authors stress the importance of incorporating other data and approaches to enhance the study. In fact, in the previous sections, we separately outlined several problems within these data, which correspond to the issues of reliability (i.e., surveys are people-biased), incorporation and consistency (except a few cases). Additionally, they are resource consuming and need to be regularly updated. As previously stated, these problems could be solved primarily by using other data or supplementing surveys with them. In particular, we observe that the issue of reliability is well studied and often the solutions proposed involve the incorporation of other data to validate survey results. Ref. [163] use smart card data to validate and correct a survey based on a computer-assisted telephone interview. Also [164] combine related data, with the purpose of "understanding" urban mobility. Regarding APC, Ref. [165] present a methodology that can combine APC data with on-board O-D survey data to mutually validate their accuracy. The concept of GPS-surveys, which consist of supporting survey practitioners and researchers with GPS data ( [166]), is gaining more attention today, for example in determining the purpose of the trip ( [167]). Such an approach is a form of merging between surveys and AVL data.
Other connected issues for survey data are the response rate and the sampling approach ([5]). Indeed, in general, surveys should have a high response rate and a representative sample of the population concerned. However, response rates using traditional tools (i.e., postal, face-to-face and telephone media) are declining. Therefore, the idea of investing enormous efforts in obtaining random samples has been questioned. As a result, new survey technologies are emerging. Indeed, web, GPS devices and smartphones are also used as they are generally less costly. In addition, there is a growing interest in mixed-mode surveys these days. Moreover, gamification is another concept rising in popularity for increasing response rates. In [168], the potential of gamification to make surveys more attractive and engaging is explored. A teaser beyond already mentioned works can be found in [169]. The transport association in Hamburg, Germany, used a modified version of the famous game 'Scotland Yard', called 'Fang den Fox', to let people learn about the public transport system. The aforementioned works on the survey data are summarized in Table 8.
On the other hand, big data sources could also benefit from surveys. In [8], it is stressed that they need to be supplemented or validated using conventional travel surveys and [170] focus in particular on travel behavior and provide insights that combine both household travel surveys (named small data) and big data.

Data-Driven Implications
In the previous sections, we separately highlighted several data issues that are crucial for data-driven decision making. The aim of this section is to further explore these data issues and provide a data-oriented unified view of the data sources, in combination with possible options for their data-driven implications. In other words, we highlight the main data issues that need to be addressed for the effective and efficient use of data sources while highlighting their potentials, challenges and merging issues with respect to each type of data. More specifically, we underline several issues with regard to the acquisition of data sources, their integration, their processing and their exploitation.
From a more technical perspective, we might need to specify some detailed issues, such as data formats. Examples include the following, without going too much into detail. Text files that use commas as delimiters are called comma-separated values (CSV) files. Each line of a CSV file is a data record consisting of one or more fields. As an example, many companies make public their efforts to document their punctuality; see, e.g., the Zurich (Switzerland) data available as CSV files under various webpages including https://data.stadt-zuerich.ch/dataset/vbz_fahrzeiten_ogd_2019 (accessed on 13 June 2021). Extensible markup language (XML) is a markup language defining a set of rules for encoding documents in a format that is both human- and machine-readable. JavaScript object notation (JSON) is an open standard file and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays. A common data environment (CDE) is an agreed source of information for collecting, managing and disseminating information containers through a managed process, especially in the context of digital twins.
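To make the difference between these formats tangible, the sketch below serializes one and the same record as CSV, JSON and XML using only the Python standard library. The record and its field names are made up for illustration and do not correspond to the schema of any of the data sets mentioned.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A hypothetical punctuality record (all values kept as text).
record = {"line": "10", "stop": "Central", "delay_s": "45"}


def to_csv(rec):
    """One header line plus one comma-delimited data record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()


def to_json(rec):
    """Attribute-value pairs in a single JSON object."""
    return json.dumps(rec)


def to_xml(rec):
    """One child element per field under a <record> root."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")
```

CSV keeps the field names only once in the header, whereas JSON and XML repeat them for every record, which makes the latter two more verbose but also self-describing.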

Acquisition
The first data issue is their acquisition. In this section, we look at the main data requirements and issues that need to be considered in acquiring, maintaining and updating relevant information from data sources.

Infrastructure
Regarding data acquisition, we note that some of the data sources, such as weather and traffic, could be openly available and exploitable using the available APIs. As previously stated, while these data can help to partially replace internal data sources, increasing data acquisition capacity will improve the data handling process and help address various data challenges. In particular, extracting information on the location of vehicles and the number of passengers (AVL and APC) requires appropriate infrastructure (e.g., GPS). For other data sources, interesting options regarding data acquisition relate to the available infrastructure allowing the use of smart cards, etc. Mobile phones of customers may be seen as an important infrastructure, as pointed out above. They may be applied, for instance, to produce heat maps. A heat map is a data visualization technique showing the magnitude of a phenomenon, such as passenger density in public transport vehicles; see, e.g., [171]. Technically, this may be achieved in different ways, including GPS-based components or, as mentioned, e.g., in [172], the MAC addresses of the mobile phones. Still, data security needs to be kept in mind and legal constraints observed. As indicated above, Ref. [102] utilize a large-scale data set for video-based approaches to passenger counting. Therefore, we can conclude from this part that investment in infrastructure is necessary for data-driven decision making. However, the capacity of the infrastructure varies depending on the budget and the expertise of the agencies. Agencies must then set their priorities to secure the data necessary for a satisfactory transit service.
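The data preparation behind a heat map can be sketched as a simple binning of observed positions into grid cells, whose counts are then mapped to colors by a plotting tool. The function name, the input layout and the cell size of 0.001 degrees are assumptions for illustration; positions are assumed to be already anonymized.

```python
import math
from collections import Counter


def density_grid(points, cell_deg=0.001):
    """Count observations per grid cell.

    points: iterable of (latitude, longitude) pairs in degrees.
    Returns a Counter keyed by (row, column) cell indices.
    """
    grid = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        grid[cell] += 1
    return grid
```

Aggregating to cells before any visualization also limits the exposure of individual positions, which is in line with the data security concerns mentioned above.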

Storage
Another issue with respect to the budget is data storage. In fact, a challenging issue for agencies is to find a way to store the data efficiently and cost-effectively. Such an issue is rather complicated. As a matter of fact, for real-time traffic data (e.g., AVL) from large developing cities, growth in data storage capacity lags behind data growth, as transportation systems produce a huge amount of real-time data from the various sensors.
One option that could be exploited is cloud computing. Cloud computing technologies help agencies manage large amounts of storage. In the literature, work on this topic can be considered more broadly in the urban transport system as studied, for example, in [173]. We refer to [174] for a review of several developments in cloud computing. Another option is to use the open data tools available, such as GTFS, which could be a very useful way to manage the storage budget. Although agency involvement is required to provide these data and invest in the underlying technology, research continues to facilitate the use of these tools (e.g., [126]).

Digital Twin
At the end of this part, we note that to improve data acquisition and collection, it is crucial to take advantage of recent developments in data acquisition technologies. The digital twin is an example of an emerging concept that could provide a comprehensive view of the different kinds of necessary information. A digital twin consists of a digital representation of a physical process, person, place, system, or device. It is one of the most promising enabling technologies for digital transformation and it also allows for the merging of different data sources. In particular, the concept can be associated with smart cities, as illustrated in [175], due to the importance of having an increasingly large and accurate building information model to maintain their sustainability. In particular, Ref. [176] are interested in proposing a digital model transformation of rail station buildings. However, despite the fact that practitioners stress the importance of incorporating this concept to improve urban mobility (an example can be found in this post: https://blog.ptvgroup.com/en/city-and-mobility/digital-twins-urban-mobility/; accessed on 13 June 2021), no research has focused on leveraging data sources to our knowledge. Examples of information categories that can be exploited concern vehicles (e.g., AVL), passengers (AFC or smartphones), their integration (APC) and traffic.

Integration
The aim of this part is to underline the main integration aspects involved when merging different data sources. Indeed, it is not enough to obtain and acquire separate data, which could be heterogeneous and inconsistent, from different sources. This is why, today, data integration is one of the main challenging issues when merging data from different sources. In fact, the data are in different forms and in order to be able to combine them, it is necessary to adopt the needed pre-processing tools. In this part, we focus on three issues, namely standardization, validation and matching.

Standardization
First, to avoid an additional computational task for data pre-processing, standardization is a functional requirement that should attract more attention in order to get integrated data. Note that the problem of standardization can arise when it comes to a specific data source as its presentation differs according to agencies (e.g., AFC). Some ideas for dealing with this problem are outlined above (e.g., in Section 2.2). For standardizing different data, a currently growing trend is to exploit and extend the available data formats, such as GTFS and NeTEx (see below). The adoption of these tools is an emerging trend that is evolving quickly and the concept of open data is increasingly embraced by agencies. Nevertheless, currently, these common formats lack simple methods to incorporate other non-standard data to improve analysis. Another issue in this regard is to standardize the various data that are frequently provided by statistical services or public transport agencies, which must take this issue into account when publishing their data. Extending the earlier elaboration, examples can be found in https://www.vdv.de/vdv-statistik-2019.pdfx and https://ec.europa.eu/transport/facts-fundings/statistics/pocketbook-2020_en; accessed on 13 June 2021. We refer to [177] as an example of work in this regard which could be enhanced further.
To go into detail, GTFS can be seen as a de facto standard for a common format for public transportation schedules and associated geographic information [69]. One may distinguish between static and dynamic data and, regarding the former, a feed is composed of a zipped series of text files. Specific entities of a public transport system make up separate text files, including trips, routes, stops and schedule data, among others. The format has been under development since 2006 and is still being improved and extended. Google Maps [178] provides a route planner that helps its users find possible connections for specified O-D pairs. Public transport is among the usable modes. The system uses data provided by a large number of public transport providers who make their data available through GTFS.
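To make the structure of a static feed more concrete, the following minimal Python sketch opens a zipped feed and reads two of its text files. The file names stops.txt and trips.txt and the field route_id follow the GTFS specification; the helper names read_gtfs_table and summarize_feed are hypothetical and only serve this illustration.

```python
import csv
import io
import zipfile


def read_gtfs_table(feed, table_name):
    """Return the rows of one GTFS text file (e.g. 'stops.txt') as dicts."""
    with feed.open(table_name) as raw:
        # GTFS files are CSV text; 'utf-8-sig' also tolerates a leading BOM.
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        return list(csv.DictReader(text))


def summarize_feed(path):
    """Count the stops and the trips per route in a static GTFS feed.

    path: file name or file-like object of the zipped feed.
    (Hypothetical helper, not part of any GTFS tooling.)
    """
    with zipfile.ZipFile(path) as feed:
        stops = read_gtfs_table(feed, "stops.txt")
        trips = read_gtfs_table(feed, "trips.txt")
    trips_per_route = {}
    for trip in trips:
        route_id = trip["route_id"]
        trips_per_route[route_id] = trips_per_route.get(route_id, 0) + 1
    return len(stops), trips_per_route
```

Because every entity type lives in its own text file, such a feed can be inspected with standard CSV tooling before any dedicated GTFS library is involved.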
NeTEx is a technical standard of the European Committee for Standardization (CEN) for exchanging public transport schedules and related data [179]. It is divided into several parts, each describing a specific functional subset needed for proper passenger information. Starting with the public transport network topology, we find scheduled timetables, fare information, as well as more general passenger information. Work in progress includes a part on technical specifications. The standard is intended to be a general purpose XML format allowing an efficient exchange of transport data among distributed systems. In that respect, it shares quite a few data and concepts with the German VDV core application; see, e.g., [180,181]. A core interest is the interoperability of passenger information systems and related data, with the aim of obtaining seamless passenger information bridging different modes, regions, etc. Various systems and projects have been developed over time including, e.g., the German/European DELFI system ("Durchgängige elektronische Fahrgastinformation", seamless electronic passenger information; https://www.delfi.de/; accessed on 13 June 2021) and many others (see, e.g., [27,177,182]).

Validation
Data validation is one of the attractive potentials of integrating different data sources. Indeed, by integrating data sources that provide similar information (e.g., AFC, smartphones and surveys), their mutual information could be validated. In the previous sections, several examples that provide effective results are presented. For example, in [65], the authors aim to validate the results of a survey and they identify certain errors by combining AFC and AVL data. Ref. [66] presents a methodology for estimating the destination of passenger journeys from AFC data. Concerning AVL data, we have shown in Section 2.1 how to improve vehicle positioning accuracy using smartphone GPS [152]. Ref. [183] gives insights on the validation of both AVL and AFC data.
The need for data validation arises because the massive increase in data availability poses growing challenges for making the most of transit data. Indeed, different data could be adopted to extract the same information and then used depending on the capacities of the transit agencies, or combined to obtain more reliable information. As a specific example for future research, we point towards ML-based (compare, e.g., Section 4.3.2) detection of infrastructure failure. For instance, an erroneous APC system on a bus may be detected using smart card data from an AFC system.
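The idea of cross-checking APC against AFC can be sketched with a simple rule-based check, used here as a stand-in for the ML-based detection suggested above. The sketch assumes that daily boardings per vehicle are available from both systems and that the ratio between them is roughly stable across the fleet; the function name, the input layout and the tolerance value are hypothetical.

```python
from statistics import median


def flag_suspect_apc(apc_boardings, afc_boardings, tolerance=0.5):
    """Flag vehicles whose APC/AFC boarding ratio deviates from the fleet median.

    apc_boardings, afc_boardings: dicts mapping vehicle id -> daily boardings.
    tolerance: allowed relative deviation from the median ratio (assumed value).
    """
    ratios = {
        vehicle: apc_boardings[vehicle] / afc_boardings[vehicle]
        for vehicle in apc_boardings
        if afc_boardings.get(vehicle, 0) > 0
    }
    if not ratios:
        return []
    typical = median(ratios.values())
    # A vehicle is suspect if its ratio is far from what is typical for the fleet.
    return sorted(
        vehicle
        for vehicle, ratio in ratios.items()
        if abs(ratio - typical) > tolerance * typical
    )
```

Using the fleet median rather than a fixed expected ratio keeps the check usable when a constant share of passengers does not tap in, although it would miss a fault that affects most vehicles at once.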

Matching
In this part, we look at another problem that arises when integrating different data sources on linked information, namely data matching. This could happen, for example, when combining APC bus data with static stop data, as the buses could stop in an area slightly different from their estimated or intended stop. A working example attempting to resolve this problem can be found in [101]. Another issue is to match APC and AVL data as the former could be noisy and unstructured. In Section 2.3, some options on how to deal with this issue are shown.
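A common baseline for the stop-matching problem described above is to assign each recorded position to the nearest static stop within a distance threshold. The sketch below illustrates this baseline only; it is not the method of [101], and the function names and the 50 m threshold are assumptions made for illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def match_to_stop(lat, lon, stops, max_distance_m=50.0):
    """Return the id of the nearest stop within max_distance_m, else None.

    stops: iterable of (stop_id, stop_lat, stop_lon) tuples.
    max_distance_m: matching threshold (assumed value).
    """
    best_id, best_dist = None, float("inf")
    for stop_id, stop_lat, stop_lon in stops:
        dist = haversine_m(lat, lon, stop_lat, stop_lon)
        if dist < best_dist:
            best_id, best_dist = stop_id, dist
    return best_id if best_dist <= max_distance_m else None
```

Returning None for positions that are far from every stop keeps unmatched records visible, so that they can be inspected rather than silently assigned to a wrong stop.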

Processing
After acquiring the data and integrating them, the next step is to process them. In this part, we consider the processing of data through their analysis as well as the issue of privacy, which is related to both storing and processing data.

Data Analytics
Over the past decades, approaches to data analysis (or data analytics) have evolved considerably. Thus, research on the analysis of public transport data sources should continually benefit from the rapid development of data analysis approaches, especially big data techniques. Indeed, as the amount of public transport data continues to grow, the study and appropriate use of big data is imperative for researchers to make the most of this newly available information and these techniques. In the previous sections, we have highlighted in the summary tables several data analytics approaches that could be used for data-driven analysis. Nevertheless, the methods used should be continuously updated to leverage advances, especially in the fields of optimization and ML.

Machine Learning
In particular, as noted earlier, ML is gaining a lot of attention today for dealing with big data, as traditional statistical and analytical methods often fail to process large-scale, unstructured and noisy data. While ML approaches are applicable to different kinds of data, they are particularly suited for social media, traffic and APC data, which are mostly unstructured and noisy. Over the past decades, countless ML techniques have been proposed to deal with different types and structures of data. However, an important issue in this regard is the choice of a suitable approach for each problem and case study. For example, Ref. [184] compared a number of ML techniques and found that random forest performed best in their case study. In particular, there are approaches which are suitable for pattern recognition problems (e.g., [185]) and which are applicable, for instance, to APC data. Another issue is feature engineering, which consists of extracting the most relevant features and for which neural networks are the most widely adopted approach today (e.g., [186]). For feature engineering, it is also useful to include best practices, such as feature selection and model selection [187]. Moreover, it is important to integrate the different data sources, which enables us to extract the different relevant features, as shown on several occasions in the previous sections. In this context, it is necessary to differentiate between the problems which tolerate static predictions and those which require real-time information, such as [188].
More practically, to use these advances, several frameworks are available such as Hadoop MapReduce or Spark. Some papers focus on using these frameworks to improve transportation management and operations. We refer to [7] for insights on this topic. A more recent paper that exploits TensorFlow (a recent and well-adopted deep learning framework) to process large-scale traffic data can be found in [189].
A major issue, once appropriate data availability is ensured, relates to prediction and forecasting. This can be demand-oriented, load-oriented, travel time-oriented, delay-oriented, etc. Examples of machine learning approaches, including neural networks, are [190] for O-D matrix estimation and [191,192] for the prediction of bus travel times and speeds. APC data together with an appropriate mobile application can be used to crowd-source seat availability on buses; see, e.g., [193].
Delays of transit services are a major concern for agencies due to their impact on passengers, who can be sensitive to unexpected waiting time during their trips [84]. Therefore, several studies propose improved analysis and prediction of delays with the aim of avoiding their causes and providing better on-time performance of the service. The advantage of developing accurate delay prediction systems is twofold. First, in the short term, it enables riders to be informed in real time about delays and then update their plans. Second, in the long term, an accurate prediction could enhance the reliability and accessibility of public transit by determining the main factors that cause delays and then updating the schedule based on that information. Ref. [194] integrates the weather variability when predicting bus arrival times using APC data. The problem with AVL data is that they are usually not yet openly available for the majority of transit systems. In [126], the authors integrate (predicted) traffic data as a replacement for GTFS data where the latter are unavailable. In [195], the authors investigate the effects of vehicle delays on passenger waiting time together with the effects of transfer status, boarding location, time of day and rider travel frequency. The data used include AFC and AVL data, while a trip-chaining algorithm is used to infer the trajectories of all passengers; a case study in the United States is reported. We can conclude that the integration or merging of data sources is crucial for an enhanced evaluation of the reasons for delays.
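Before any prediction model can be trained, delays have to be derived by merging scheduled and observed arrival times. The following minimal sketch shows this step under the assumption that observed times (e.g., from AVL) are expressed in the same service-day format as GTFS schedule times; the function names and the on-time window of 60 s early to 300 s late are hypothetical choices.

```python
def gtfs_time_to_seconds(hhmmss):
    """Convert a GTFS 'HH:MM:SS' time to seconds after the start of the service day.

    GTFS allows hours of 24 or more for trips that run past midnight.
    """
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def delays_seconds(scheduled, observed):
    """Delay per (trip_id, stop_id) key; positive values mean late arrivals."""
    return {
        key: gtfs_time_to_seconds(observed[key]) - gtfs_time_to_seconds(scheduled[key])
        for key in scheduled
        if key in observed
    }


def on_time_share(delays, early_s=60, late_s=300):
    """Share of arrivals at most early_s early and late_s late (assumed window)."""
    if not delays:
        return 0.0
    on_time = sum(1 for d in delays.values() if -early_s <= d <= late_s)
    return on_time / len(delays)
```

The resulting delays per trip and stop can then serve as the target variable of a prediction model, with weather or traffic data joined as additional features.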

Privacy and Security
Data privacy and security are essential issues that have to be taken into consideration when storing and processing data sources, especially AFC and smartphone data. For AFC, despite the belief that privacy should not be a major concern with such data, as they often do not include personal information, it is shown, for example in [196], that users can frequently be identified. The authors propose a privacy allocation mechanism to better address the data sanitization issue (which aims to make data unrecoverable). We can see that the issues of privacy and security are interconnected and there are a number of applications that aim to address both (e.g., [197]). For smartphone data, one idea to preserve privacy is to encrypt the data. Such an approach is included in [148], which is tailored to recommendation services. The advent of blockchain may be among the solutions once functionality requirements are properly defined; see, e.g., [198] for a related discussion regarding identity management in public transport.
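As a small illustration of a first protective step when storing AFC records, the sketch below replaces card identifiers with keyed hashes. This is a generic pseudonymization technique and not the mechanism proposed in [196] or [148]; the function name is hypothetical, and, as noted above, pseudonymized travel patterns may still allow users to be re-identified.

```python
import hashlib
import hmac


def pseudonymize_card_id(card_id, secret_key):
    """Replace a card id with a keyed SHA-256 digest (HMAC), hex-encoded.

    card_id: the original identifier as a string.
    secret_key: bytes known only to the data holder (assumed to be stored safely).
    """
    return hmac.new(secret_key, card_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same card always maps to the same pseudonym under a given key, so trips can still be linked for analysis, while the original identifier cannot be recomputed without the key.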

Exploitation
After processing the data and getting the needed information, it is crucial to transform it in a manner that is beneficial to the transit service and to the passengers. This also relates to data and business understanding. In general, many of the cited papers above outline the practical contribution of their approaches. We also note that passenger information can be judged from different perspectives including visualization and service optimization; see, e.g., [2,27,199]. Below, we highlight how the information can be exploited through visualization or by optimizing the service.

Visualization
The visualization of data is already considered as part of the previous sections; see, e.g., Table 2. Often it is also a matter of comprehension [17]. Ref. [157] highlights some of the potentials and challenges in processing data for individual visualization methods. In fact, the visualization of AFC data can help identify passenger flow characteristics and evaluate travel time reliability, as investigated, for example, in [72] for the case of the Shanghai Metro. Moreover, the opportunities for proper visualization can be further illustrated by merging different types of data. For example, Ref. [51] combines AFC data with AVL data to reconstruct travel trajectories of bus passengers at the bus stop level. To do so, AVL data often have to be published in the GTFS format, which is now the most common format to standardize these data. In that paper, the authors' ultimate goal is to visually unveil the spatio-temporal travel behavior dynamics of the passengers. Ref. [200] develops a tool named PubtraVis that uses GTFS data carrying schedule information to measure and display the operation of the public transit system from different perspectives through six visualization modules: mobility, speed, flow, density, headway and analysis. The user can observe the information on vehicles (e.g., speed) statistically, temporally and geographically. Moreover, Ref. [138] adopts a visualization approach to leverage social media (Twitter) data in order to identify business clusters. To our knowledge, there is no extensive work on the visualization of the other sources. Nevertheless, traffic data are studied in a broader manner and several visualization approaches are proposed. For example, Ref. [201] proposes an approach that aims to visualize the evolution of traffic congestion in large cities.
Thereby, a prospective project worth studying is to integrate external traffic information along with other data (e.g., weather) into the visualization mechanism, so as to obtain a complete and user-friendly platform containing all the information needed and available. The above-mentioned concepts of digital twin and heat maps (see, e.g., Section 4.1.3) can equally well be incorporated here.

Service Optimization
In the previous sections, especially in Sections 2.1 and 2.2, we have shown several examples of how these data can be used to measure the reliability and punctuality of the service, even taking into consideration passenger behavior. This information can often be exploited by sharing it in real time. In other words, if passengers are informed about delays within a reasonable time, they can update their schedule and opt for alternatives. A selective literature review of the passenger benefits of real-time transit information can be found in [202]. Moreover, data sources could be used to dynamically optimize and adapt the transit service. As presented before, examples of work that leverages these data for this purpose can be found in [37,57]. Moreover, Ref. [203] addresses both the analysis and optimization of transport line services. Additionally, ML has become an emerging trend in optimization problems. An idea to improve the service in this regard can be found in [204].

An Information Management Framework
Information management (IM) is the purpose-oriented provision, processing and distribution of the resource information for decision support, as well as the provision of respective infrastructure [17]. (The adoption of this definition in public transport is already exemplified, e.g., in [27]).
IM is understood, among other things, to be an instrument for making information distribution operable. In that respect, it becomes an enabler for efficient innovation management including digital transformation and digital innovation. However, with recent advances in information and communication technologies (IT) and big data, we observe a lack of putting data into perspective in the sense of this definition. Therefore, we focus on the different opportunities and challenges in the wealth of available data. Above, we have summarized and provided insights into the main focus and opportunities of different data to highlight their challenges and how to fuse them. In this section, we propose a unified framework for possible data usage in this area. From a methodological standpoint, our procedure leading to the framework may be characterized as a narrative argument balance.
Next, we describe the framework. After that some examples for applications and use cases are provided.

Three-Layer Model
A possible foundation for developing the intended framework may be found in a basic three-layer model from IM; see, e.g., [17,205]. The framework is depicted in Figure 1. The basic, but often neglected, issue is that the available data should not merely be explored; rather, the definition of appropriate functionality requirements should take priority. These requirements set the pace for the needed data and information (information deployment, respectively). To gain access to these data, the requirements may be propagated towards other levels of the framework envisaging information systems and infrastructure. Based on those, services are provided to support fulfilling the requirements. To exemplify the initial steps of our framework development, we emphasize the distinction of static and dynamic data in an IM specification differentiating internal and external IM, depending on who is to be addressed by means of the different types of data and information; see Figure 2. Here, dynamic data refers to continuous changes up to real-time data (see also the above distinction between static GTFS data and GTFS Realtime, i.e., the feed specification allowing public transportation companies to provide real-time updates about their fleets, schedules, etc.). Beyond the above classification criteria, we characterize different parties (stakeholders) which are involved in public transport, namely transit operators (including transport associations asking for data regarding the share of revenues and subsidies, if at all), policy makers and passengers. A further dimension of differentiation refers to individual versus collective information. Again, in Figure 2 one may think of functionality requirements defined upfront before propagating these requirements through the different layers to obtain appropriate support.
An overview of the above categories of available data is provided in Figure 3, emphasizing the specific sections where they can be found in this paper (with most important connections given by arrows).

Use Cases
In this section, we sketch a small fraction of available use cases. Readers may explore the options available through [69,140] themselves.
A use case for measuring daily walking to public transport based on data from Montreal, Canada, can be found in [206]. An earlier reference utilizing the potential of GTFS data is [207]. The paper analyzes networks and connectivity indicators for Auckland (New Zealand), Vancouver (Canada) and Portland (Oregon/USA).
Use cases may incorporate the appropriate use of available exogenous data, e.g., regarding sports events (e.g., [208]) or cruise ship arrivals (e.g., [209]). For the latter, very detailed questionnaire data from public transport companies are available (e.g., satisfaction surveys made by local public transit authorities in Hamburg (Germany) and Qingdao (China); [53][54][55]). Furthermore, the data-driven prediction of delays, occupancy rates of public transport means, or usage rates of public transport for special user groups (like students) may be formulated as functionality requirements to allow for appropriate support.
Note that AFC data represent a useful alternative to household and on-board passenger surveys as illustrated, e.g., in [64]. In particular, the authors show that household surveys may significantly underestimate the demand. In addition, Ref. [65] carries out a comparison with a large O-D survey in the city of Santiago (Chile). The authors aim to validate the results of the survey and they identify certain errors by combining AFC and AVL data. Nevertheless, household surveys can still help to change the infrastructure, e.g., in building new public transport lines. In [210], we find details regarding the building of a new subway line in Sydney (Australia). Ref. [66] presents a methodology for estimating the destination of passenger journeys from AFC data.
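To give an impression of how destinations can be estimated from entry-only AFC data, the sketch below implements the basic trip-chaining assumption: a passenger alights close to where he or she boards next, and the last trip of the day ends at the first boarding stop. It is a simplified illustration and not the methodology of [66]; the function name and the record layout are hypothetical, and practical methods additionally restrict candidates to stops on the boarded route.

```python
from collections import defaultdict


def infer_destinations(taps):
    """Infer alighting stops from tap-in records via simple trip chaining.

    taps: iterable of (card_id, boarding_time, boarding_stop) for one day;
    boarding_time must be sortable (e.g. seconds after midnight).
    Returns a list of (card_id, boarding_stop, inferred_alighting_stop).
    """
    by_card = defaultdict(list)
    for card_id, time, stop in taps:
        by_card[card_id].append((time, stop))

    trips = []
    for card_id, records in by_card.items():
        records.sort()
        if len(records) < 2:
            continue  # a single tap gives no next boarding to chain to
        for index, (_, origin) in enumerate(records):
            # Next boarding of the day; the last trip wraps to the first boarding.
            _, destination = records[(index + 1) % len(records)]
            trips.append((card_id, origin, destination))
    return trips
```

Cards with a single tap are skipped here, which is one reason why destination estimates typically cover only a share of all recorded trips and benefit from validation against surveys.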

Conclusions and Perspectives
Interest in public transport data sources is growing rapidly as more agencies and researchers see the potential for new insights. We have shown that data are readily available at our fingertips. We conducted an unprecedented review of data sources, which includes the most frequently used sources today. After dividing them into various types of data sources, we summarized the main opportunities, challenges and associated data-driven methods. In terms of challenges, transit data most often need to be processed to derive meaningful information. When it comes to potentials, each data source has specific applications and could provide information not captured by other sources. Indeed, each could provide unique information concerning or influencing either passenger behavior or the transit service or both. For the methods, the research reviewed looked at different approaches, most of which adopted conventional or advanced data analysis methods. Moreover, we underlined the complementary nature of these data sources, whether they are endogenous or exogenous, advanced or conventional. Indeed, by fusing different data sources, the information from one data source can be validated by another and new knowledge can be mutually derived or even speculated upon. Additionally, we presented a unified view of the data sources in which we show how to acquire, integrate, process and exploit them.
To better position our paper with respect to recent reviews, we first note that, as indicated in the introduction, most of them focus on big data sources. Our paper also incorporates other approaches (e.g., surveys) and shows how they can support big data sources. Indeed, it is suggested in this paper that analyses derived from emerging big data approaches could still be complemented or validated using conventional approaches. Second, we note that data sources can be categorized in different manners. That is, the division proposed in [7], into traditional data collecting technologies and advanced data collecting technologies, could be indirectly incorporated into our paper. For instance, APC could incorporate the technological advances in biometric face recognition (again with an important caveat regarding data security issues). Moreover, real-time GTFS data mainly draw on AVL data. Recent advances in GTFS, which aims to define a common format for transit data, attempt to incorporate data from social media and smartphones. Some ideas are discussed, for example, in the 2020 MobilityData (https://mobilitydata.org/; accessed on 13 June 2021) European Public Transit Training.
We can see that there is a significant overlap between these data sources. Indeed, different data can be adopted to extract the same information and then used depending on the capacities of the transit agencies, or combined to obtain more reliable information.
In addition, some of them are more closely associated, in terms of application, with others. For example, smartphone data can often provide passenger information similar to that produced by smart cards. We can observe from this review that the smart card is today the most widely adopted data source. However, smartphone data usage is growing rapidly and could present an alternative if the corresponding challenges (e.g., privacy) are resolved.
Finally, we highlight some other data issues worth studying further. First, although data are crucial ingredients of information systems, it is necessary to define the appropriate functionality requirements in order to take advantage of them. These requirements pave the way for information deployment by agencies and could be designed according to their specifications. Very often, support and studies are based on new technology and infrastructure being available. However, we claim that the functionality requirements should come first (see [3,17]), an issue that may be seen as one of the most important outcomes of this paper, although it still needs further elaboration. Second, for an in-depth analysis of public transport data, it is important to include the notion of multi-modality in public transport [211], as passenger decision making could be influenced by the availability of several modes of transport, including cars and bicycles. In fact, transit data sources only record a part of the urban mobility system and it is important to consider the impact of other transport modes, including emerging ones (e.g., bike sharing; see, e.g., [63]). As shown in Section 3, exogenous sources could be analyzed in a similar manner but the issue needs further study. This will pave the way to another division of data sources, from the perspective of urban transport authorities, in which endogenous data sources concern transit and exogenous ones relate to other modes of transport. The ultimate goal is to design a unified framework that integrates the different data sources, both inside and outside the realm of public transport. Once the different data sources are merged into a unified information system, it is possible to obtain a broader view of passengers and services, which makes it possible to achieve the balance between supply and demand outlined in the introduction.
In summary, this is the first paper, to the best of our knowledge, to review public transport data sources in this way leading towards a framework like the one presented. (Although many papers provide conceptual ideas in this respect, the knowledge about this seems to be limited).
Further elaboration of the framework is part of future research. A final issue worth further research, once issues such as privacy are resolved, is to classify and utilize collective and individual information in a more comprehensive way.

Malek Sarhani is supported by the Alexander von Humboldt Foundation.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: