A Systematic Review of Artificial Intelligence Public Datasets for Railway Applications

Abstract: The aim of this paper is to review existing publicly available and open artificial intelligence (AI) oriented datasets in different domains and subdomains of the railway sector. The contribution of this paper is an overview of AI-oriented railway data published under Creative Commons (CC) or any other copyright type that entails public availability and freedom of use. These data are of great value for open research and publications related to the application of AI in the railway sector. This paper includes insights on the public railway data: we distinguish different subdomains, including maintenance and inspection, traffic planning and management, and safety and security, and different types of data, including numerical, string, image and other. The datasets reviewed cover the last three decades, from January 1990 to January 2021. The study revealed that the number of open datasets is very small in comparison with the available literature related to AI applications in the railway industry. Another shortcoming is the lack of documentation and metadata on public datasets, including information related to missing data, collection schemes and other limitations. This study also presents quantitative data, such as the number of available open datasets divided by railway application, type of data and year of publication. This review also reveals that there are openly available APIs, maintained by government organizations and train operating companies (TOCs), that can be of great use for data harvesting and can facilitate the creation of large public datasets. These data are usually well-curated real-time data that can greatly contribute to the accuracy of AI models. Furthermore, we conclude that the extension of AI applications in the railway sector merits a centralized hub for publicly available datasets and open APIs.

response, situational awareness, risk management, system monitoring and abnormality detection.
Passenger Mobility: This subdomain is dedicated to datasets used to implement AI solutions from the perspective of the mobility of train commuters. Passenger mobility is the ability to move passengers safely and affordably between where they live, work, and spend their leisure time. Keywords include passenger flow, mobility, commuters, and passenger trends.
Autonomous Train Driving and Train Control: This subdomain includes datasets used to implement AI methodologies that transfer operational responsibilities from manual operators to the train control system and automatic operators. Keywords include automatic train control (ATC), automatic train regulation (ATR), automatic train operation (ATO), automatic train protection (ATP), advanced driver assistance systems (ADAS) and obstacle detection.
Revenue Management: This subdomain covers datasets intended for applications of disciplined analytics that predict consumer behavior at the different market levels, optimize product availability and design prices to maximize revenue growth. Keywords include equity.


Introduction
The automation of traditional manufacturing and industrial practices, using modern technology in conjunction with the massive collection of data and powerful algorithms, has set the course of the fourth industrial revolution. Experts agree that AI is becoming the central player in Industry 4.0, and the railway industry is no exception. There are many applications within the railway sector in which AI can have a large impact [1]. The increasing number of IoT devices, the growing volume of available data and computing power, and the decreasing manufacturing costs of technology create the conditions for applying modern AI techniques in the railway sector [2].
Applications for AI models are diverse and can be implemented directly in vehicles, infrastructure and services related to transportation [3]. The collection of data and the use of AI algorithms can aid the analysis of travel routes and of the behavior of pedestrians and commuters, the mitigation of energy use, pollution and traffic congestion, and the improvement of the overall security and safety of passengers. The European Union assigned over €2.3 billion in funding for the development of smart, green and integrated transport for the period 2014-2020 within the Horizon 2020 initiative [4]. This initiative comprises different research projects related to the application of AI in transport systems, including the Shift2Rail program, for the development and validation of sustainable, cost-efficient, high-performing, time-driven, digital and competitive train operation standards through railway research and innovation [5].
We recognize several literature reviews on specific AI applications in different railway subdomains. For instance, authors assessed the literature on applying machine learning to track maintenance [6] and wheel defects [7], and [8] investigated image processing approaches for track inspection. Also, [9] addressed urban flow prediction using machine learning. For traffic management, [10] reviewed data-driven approaches. Similarly, [11] and [12] delved into big data for intelligent transport systems and railway systems, respectively, and [13] looked into the potential of swarm optimization for railways. In contrast, [14] reviewed current AI developments across the railway sector holistically, covering all the above-mentioned topics as well as safety and security, mobility, and autonomous train driving. They found that the majority of research belongs to maintenance and inspection (57%) and traffic planning and management (25%). With the exception of [14], the existing literature reviews typically covered a limited scope, regarding either specific railway subdomains or certain aspects of AI. Also, all review papers predominantly addressed AI applications, with little or no focus on the data used. Here, we want to stress that the availability of data is one of the critical aspects of AI applications. However, authors of AI-focused railway applications rarely publish the related data. In addition, railway companies still tend to be rather conservative and are not open to sharing their data with third parties such as research and academic institutions. Therefore, relevant data are still rather scarce, and obtaining them is often one of the largest challenges at the beginning of any research effort; according to the findings of a recent survey [15], railway stakeholders ranked the lack of suitable datasets for training ML models among the top three obstacles to the adoption of AI in the rail sector. Altogether, this situation may delay the development of AI applications or even prevent researchers from starting to investigate specific topics.
The aim of this paper is to review existing publicly available and open datasets across the whole railway sector for diverse AI applications. We cover subdomains such as maintenance and inspection, traffic planning and management, automated train driving, safety and security, passenger mobility and transport policy. Also, datasets are classified by type into numerical, image, label and other. The review covers the period until January 2021. We also present a short overview of supporting public datasets and APIs (an application programming interface (API) is a set of definitions and protocols for building and integrating application software, which allows a product or service to communicate with other products and services). All these AI-oriented datasets are of great value for open research and publications related to the application of AI in the railway sector. With this study, we hope to benefit researchers in computer science and the transport industry by providing insight into these valuable data and information on how they can be accessed. In addition, we hope that companies will recognize the added benefit and be encouraged to share and publish their data more willingly.
The remainder of the paper is organized as follows. Section 2 describes the methodology utilized for the dataset review. Section 3 presents an overview of selected datasets. Section 4 contains the detailed review of the selected datasets. Section 5 includes a short overview of publicly available supporting datasets and APIs. Section 6 provides the discussion. Section 7 presents challenges and opportunities. Finally, Section 8 presents our conclusions and future research directions. An appendix contains the complete list of reviewed datasets.

Methodology

Search Criteria and Dataset Selection
For the dataset review, a wide variety of publicly available datasets related to railways were searched. Since no railway-specific databases exist, we used generally available and well-known databases, including Kaggle, Google Dataset Search, European Data Portal, Data World, Data.gov, Data.gov.uk, Humanitarian Data Exchange (HDX), IEEE DataPort, Zenodo, ScienceDirect's Data in Brief journal and FigShare. These databases are widely used for academic research, and some of them implement quality verification of the data presented, including peer review and accompanying publications (e.g., IEEE DataPort, ScienceDirect's Data in Brief). These sources are briefly described in Table 1.
Numerical data, including track geometry data, train speed data, dynamic data measured from sensors and any other numerical data.
Image data, including photos of railway assets, video footage, drone footage and more.
Label data, including string data used for classification and cluster-based models.
Other data, any other data that falls outside the aforementioned categories, such as simulation instances, 3D scanner data and more.

Overview of Selected Datasets
For a clear overview, the datasets are divided according to data type, as shown in Table 2. Note that some datasets present multiple types of data; dataset (30), for instance, presents both numerical and label data. Table 3 presents the datasets classified according to both railway subdomain and data type. Once again, some datasets correspond to more than one railway subdomain: dataset (34), for example, falls into two classifications, Traffic Planning and Management and Maintenance and Inspection. This classification shows that Traffic Planning and Management and Maintenance and Inspection are the most prevalent railway domains in AI-oriented studies. Also, numerical data prevail as the most available data type. Finally, there is a gradual growth of openly available datasets from 2016 onwards, with 2020 being the year with the most AI-oriented railway datasets published.
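Because a dataset can carry several data types and several subdomains at once, a cross-tabulation of this kind counts each dataset once per (subdomain, data type) pair. The following is a minimal sketch of that bookkeeping, not the procedure used to build the tables; the class `DatasetEntry`, the function `cross_tabulate` and the three catalogue entries are hypothetical and invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One reviewed dataset; it may belong to several subdomains and data types."""
    name: str
    subdomains: frozenset
    data_types: frozenset


def cross_tabulate(entries):
    """Count datasets per (subdomain, data type) pair.

    A multi-label dataset contributes one count to every pair it belongs to.
    """
    counts = Counter()
    for entry in entries:
        for subdomain in entry.subdomains:
            for data_type in entry.data_types:
                counts[(subdomain, data_type)] += 1
    return counts


# Hypothetical catalogue entries (assumption): they do not reproduce Table 2 or Table 3.
catalogue = [
    DatasetEntry(
        "example-a",
        frozenset({"Traffic Planning and Management"}),
        frozenset({"numerical", "label"}),
    ),
    DatasetEntry(
        "example-b",
        frozenset({"Traffic Planning and Management", "Maintenance and Inspection"}),
        frozenset({"numerical"}),
    ),
    DatasetEntry("example-c", frozenset({"Safety and Security"}), frozenset({"image"})),
]

table = cross_tabulate(catalogue)
for (subdomain, data_type), count in sorted(table.items()):
    print(f"{subdomain:35s} {data_type:10s} {count}")
```

With this counting rule, the cell totals can exceed the number of datasets, which is worth keeping in mind when reading per-type and per-subdomain counts side by side.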

Review of Selected Datasets
The following sections present reviews of the selected datasets according to the defined railway subdomains.

Traffic Planning and Management
Within traffic planning and management, we group datasets related to passenger transport, rolling stock and freight transport, and passenger experience.

Passenger Transport
A number of datasets consulted for the review correspond to official government data and datasets that have been made available by railway companies. Dataset [16] provides on-time performance (OTP) data on regional trains from the Southeastern Pennsylvania Transportation Authority (SEPTA) of the United States. The data include train ID number, trip direction, origin, date, timestamp and more. GPS data are also provided. Dataset [17] contains data on train delays for the Italian railway line connecting the cities of Bologna and Milan. These data can be used for performance review, timetable rescheduling and the prediction of train delays. Dataset [18] contains timetables of Transport express régional (TER), the rail services run by the regional councils of France, with stops and timetables from French railways. This dataset comes from a certified public service, complies with the General Transit Feed Specification (GTFS) format and is created using data collected by the Department of Regional Trains and Intermodality (Direction des Trains Régionaux et de l'Intermodalité). This dataset can be used for timetable scheduling purposes. Dataset [19] contains real data for train skip-stopping pattern optimization. These data were collected from the Batong line in Beijing's railway network in China. The data include the number of stations, number of trains, capacity of each train, number of passengers, minimum/maximum headway allowed, headway in the original train timetable and travel time between adjacent stations. This dataset can be used for timetable scheduling purposes, particularly focusing on stop skipping.
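As an illustration of how a GTFS-formatted timetable can feed a scheduling analysis, the sketch below derives scheduled headways at one stop from rows laid out like a GTFS `stop_times.txt` file. It is a simplified example under stated assumptions: the sample rows are made up, the function names are hypothetical, and service calendars are ignored, so all trips are treated as running on the same day.

```python
import csv
import io


def gtfs_time_to_seconds(value):
    """Convert a GTFS HH:MM:SS time to seconds.

    GTFS allows hours above 23 for trips that run past midnight.
    """
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def headways_at_stop(stop_times_csv, stop_id):
    """Return the gaps in seconds between consecutive departures at one stop."""
    reader = csv.DictReader(io.StringIO(stop_times_csv))
    departures = sorted(
        gtfs_time_to_seconds(row["departure_time"])
        for row in reader
        if row["stop_id"] == stop_id
    )
    return [later - earlier for earlier, later in zip(departures, departures[1:])]


# Made-up rows (assumption) using the standard stop_times.txt column names.
SAMPLE = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:01:00,S1,1
T1,08:10:00,08:11:00,S2,2
T2,08:15:00,08:16:00,S1,1
T3,24:05:00,24:06:00,S1,1
"""

headways = headways_at_stop(SAMPLE, "S1")
print(headways)
```

A real feed would additionally need `trips.txt` and the calendar files to select the trips that run on a given service day before headways are compared with minimum or maximum allowed values.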
Alternatively, some of the datasets have been generated by third parties, utilizing official APIs that are available to the public. Dataset [20] contains granular trip-level performance data on train trips on the NJ Transit and Amtrak railway networks in the Northeastern United States. This dataset includes stop-level, highly detailed data on more than 287,000 train trips, including those extending from the states of New Jersey and New York, and covers monthly updates from March 2018 to April 2019. Missing or invalid data from trips are properly reported. Data were obtained using web scraping techniques on available sources from NJ Transit and Amtrak. These data can be used for the prediction of train delays and cancellations, analysis of passenger flow and more. Dataset [21] contains data from the commuter train service in the city of Stockholm, Sweden, during 2012. This dataset contains timetables and passenger flow data, and it was obtained from the Samtrafiken-Trafiklab API. This dataset can be used for timetable scheduling purposes and passenger flow analysis. Dataset [22] contains trip information for the analysis and visualization of Indian Railways. The data include source and destination stations, departure and arrival times, train names and station codes, among other information. Similarly to [22], dataset [23] contains timetable data from Indian Railways (the maintainers state that the data are collected from the official Indian government website), and dataset [24] contains timetable data obtained from the Indian Railway Catering and Tourism Corporation (IRCTC). Other data include details of trains and the names of departure and destination stations. These datasets can be used for timetable scheduling purposes.
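Stop-level performance records of this kind are typically turned into delay values before any prediction model is trained. The sketch below shows one way to do that while skipping unreported observations. The field names `scheduled` and `actual`, the timestamp format and the 5-minute on-time threshold are assumptions for illustration and are not taken from any of the datasets above.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def delay_minutes(scheduled, actual):
    """Signed delay in minutes; negative values mean an early arrival."""
    delta = datetime.strptime(actual, TIME_FORMAT) - datetime.strptime(scheduled, TIME_FORMAT)
    return delta.total_seconds() / 60.0


def on_time_share(records, threshold_minutes=5.0):
    """Fraction of observed stops whose delay is within the threshold.

    Records without an actual time (missing data) are excluded from the ratio.
    """
    delays = [delay_minutes(r["scheduled"], r["actual"]) for r in records if r["actual"]]
    if not delays:
        return None
    return sum(d <= threshold_minutes for d in delays) / len(delays)


# Made-up stop records (assumption); the third one has no reported actual time.
records = [
    {"scheduled": "2018-03-01 08:00:00", "actual": "2018-03-01 08:03:00"},
    {"scheduled": "2018-03-01 09:00:00", "actual": "2018-03-01 09:12:00"},
    {"scheduled": "2018-03-01 10:00:00", "actual": None},
    {"scheduled": "2018-03-01 23:55:00", "actual": "2018-03-02 00:01:00"},
]

share = on_time_share(records)
print(f"on-time share: {share:.2f}")
```

Using full timestamps rather than clock times keeps the delay correct for stops that are scheduled before midnight and reached after it.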
Other datasets have been created differently, using data collected from computer simulations or hypothetical cases. Dataset [25] contains data on train deviations from planned schedules for resolving the rescheduling problem, and dataset [26] contains operational data for resolving the schedule optimization and maintenance task scheduling problem. Both datasets were made public by the Institute for Operations Research and the Management Sciences (INFORMS) Railway Application Section (RAS) as part of annual competitions for the application of AI solutions in the railway sector in 2018 and 2016, respectively. These datasets can be used for timetable scheduling and maintenance scheduling purposes. Dataset [27] contains data on four small case scenarios for the timetable scheduling fairness problem. Dataset [28] contains simulation data for the evaluation of performance and delays of a suburban railway line and was published with a delay propagation study [29]. The OpenTrack model simulates train departures according to a timetable and calculates the speed and position of trains while taking into consideration safety rules, signaling and railway line constraints. The characteristics of these simulation data are comparable to those of other railway networks. This dataset can be utilized for evaluating timetable robustness and performance.

Rolling Stock and Freight Transport
Some of the datasets correspond to operational data and observations from railway traffic. Dataset [30] contains operational data on the daily trips of Australian freight trains to different locations, loading and unloading vehicles in sidings and delivery to final destinations. This dataset can be used for timetable scheduling solutions and route optimization models and also for rescheduling and the prediction of train delays. Dataset [31] contains the weekly average number of freight trains held short per day from the Surface Transportation Board's (STB) rail service metrics. The data take into account train type (intermodal, grain unit, coal unit, automotive unit, etc.) and cause (crew, locomotive, power or other). Data are collected from railroads' daily snapshots and computed as weekly averages. This dataset is maintained by the United States Department of Agriculture (USDA) and can be used for freight traffic planning and fleet and crew management.
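The aggregation from daily snapshots to weekly averages can be sketched as follows. This is an illustration only: the function name and the counts are made up, and grouping by ISO calendar week is an assumption, since the reporting week used for the published metrics is not specified here.

```python
from collections import defaultdict
from datetime import date


def weekly_averages(daily_counts):
    """Average daily snapshot counts per ISO (year, week) pair."""
    buckets = defaultdict(list)
    for day, count in daily_counts.items():
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(count)
    return {week: sum(values) / len(values) for week, values in sorted(buckets.items())}


# Made-up daily counts of trains held short (assumption).
daily_counts = {
    date(2020, 1, 6): 10,
    date(2020, 1, 7): 14,
    date(2020, 1, 8): 12,
    date(2020, 1, 13): 20,
    date(2020, 1, 14): 22,
}

averages = weekly_averages(daily_counts)
print(averages)
```

In practice the same grouping would be applied separately for each train type and cause.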
A number of datasets reviewed were generated using data collected from computer simulations or hypothetical cases. Some of these hypothetical datasets were made public by INFORMS RAS as part of annual competitions for the application of AI solutions in the railway sector. Dataset [32] contains operational data for resolving the train blocking and shipment path (TBSP) optimization problem; similarly, dataset [33] contains operational data for resolving the hump yard classification problem; dataset [34] contains supporting files for modeling railroad yard capacity; dataset [35] contains supporting files for resolving the multi-track territories dispatching problem; dataset [36] contains supporting files for resolving the block-to-train assignment (BTA) problem; and dataset [37] contains operational data for resolving the locomotive refueling problem. These datasets correspond to the annual competitions held in 2019, 2014, 2013, 2012, 2011 and 2010, respectively. These hypothetical data can be used for fleet management and yard management, train forming and scheduling/dispatching.

Passenger Experience
Several datasets are related to passenger experience and ride comfort. Measurements of different byproducts of railway operation, such as sound and vibrations, can be used to assess passenger experience. Dataset [38] illustrates the effects of train speed and track geometry on passengers' ride comfort. Train operational speeds modify the vibration of the train wagons and reduce ride comfort for railway passengers. This dataset reflects the combined effect of speed and track geometry on vibration discomfort in high-speed trains, in support of [39]. Dataset [40] contains data measured for the assessment of the influence of vibrations on the noise levels of railway vehicles. Measurements include interior noise, given as frequency (Hz) and level (dB) values, for two different fasteners at different train speeds over the same non-ballasted track section. Data on exterior noise and vibration spectra for the axle box, train floor, vertical track and rail are also provided. This dataset supports the publication on the influence of fastener stiffness on the train's interior noise [41]. A similar dataset [42] contains noise and vibroacoustic measurements for the prediction of rail and bridge noise levels on concrete viaducts. These data are part of the study on rail and bridge noise prediction using multi-layer fastener models, among other methods [43]. These datasets could be used for improving passenger experience and comfort as well as for fault detection and maintenance purposes.

Maintenance and Inspection
This section includes datasets on maintenance and inspection, and it clusters datasets related to rolling stock, railway track, ballast, catenary and electrical equipment, communication systems and construction works.

Rolling Stock
Using image data can greatly benefit asset identification for the automation of maintenance tasks. Dataset [44] contains induced failure test data on rolling elements of a spherical roller bearing. The data include vibration records of a bearing, in normal conditions and under rolling-element-induced defects, from a test bench experiment. The data were collected within the Railway Technology Research Group (CITEF) of the Polytechnic University of Madrid in Spain. These data can be used for predictive maintenance and related fault detection models. Dataset [45] contains images of pantograph slide plates from various rolling stock in the Swiss Federal Railways' fleet. These images were taken from a rooftop using two high-resolution cameras, with the special aim of capturing the condition of the catenary system.
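Vibration records such as these are commonly summarized into scalar features before a fault detection model is trained. The sketch below computes two widely used indicators, root mean square (RMS) amplitude and kurtosis, on synthetic signals. It does not reproduce the processing of any dataset above: the signals, the sampling parameters and the function names are assumptions for illustration.

```python
import math


def rms(signal):
    """Root mean square amplitude of a vibration record."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def kurtosis(signal):
    """Non-excess population kurtosis: fourth central moment over squared variance."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal) / n
    if variance == 0.0:
        return float("nan")
    fourth_moment = sum((x - mean) ** 4 for x in signal) / n
    return fourth_moment / variance ** 2


# Synthetic signals (assumption): a pure sine stands in for a healthy bearing,
# and periodic impulses stand in for a rolling-element defect.
N_SAMPLES, CYCLES = 1000, 5
healthy = [math.sin(2 * math.pi * CYCLES * i / N_SAMPLES) for i in range(N_SAMPLES)]
faulty = [x + (5.0 if i % 100 == 0 else 0.0) for i, x in enumerate(healthy)]

features = {
    "healthy": (rms(healthy), kurtosis(healthy)),
    "faulty": (rms(faulty), kurtosis(faulty)),
}
for label, (rms_value, kurtosis_value) in features.items():
    print(f"{label}: rms={rms_value:.3f} kurtosis={kurtosis_value:.2f}")
```

Kurtosis rises sharply when short impulses are present, which is why it is often paired with RMS as a first screening feature for bearing defects.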

Railway Tracks
Several datasets are related to the maintenance of railway tracks. Image data can be utilized for predicting faults in railway tracks and for the identification of parts. Dataset [46] contains images of fasteners from the Dutch railway network. The data were collected by camera-equipped rolling stock that captured the entirety of the Dutch railway network. The dataset was created by the Dutch infrastructure manager ProRail.
Data from the analysis of sleepers and of physical phenomena of rolling stock are also relevant. Dataset [47] contains data for the calculation of the remaining fatigue life of concrete sleepers on railway systems based on field conditions. A paper published by the authors further explains the methodology used to obtain the data for the prediction of the fatigue life of the asset [48]. This dataset can be used for predictive maintenance models and system monitoring. Dataset [49], also created within ProRail and collected with the same methodology as [46], contains images of insulation joints from the railway network. ProRail has also made available a third dataset [50], containing images of insulation joints along with color masks, with the aim of detecting spark erosion in the railway tracks. These datasets can be used to train AI models, such as convolutional neural networks (CNNs), for the identification and localization of railway assets and the prediction of possible failures.
Dataset [51] contains railway track deflection signals obtained from velocity and acceleration measurements. This dataset contains both modeled and measured data for train passages to classify the range of total and downward deflection from train pass-by records. It includes data for a voided sleeper and a good sleeper, and a model for a good sleeper. The records were obtained using inertial sensors; the modeled data were obtained using an equation. This dataset was published in a study on automated processing of track deflection signals using velocity and acceleration measurements [52]. This dataset can be used for maintenance purposes, including predictive maintenance, fault detection and system monitoring. Dataset [53] contains data obtained from the results of different sensitivity analyses performed with 3D models. The scope of the analysis includes vehicle speed, vehicle load, number of auxiliary rails and rail pad stiffness. These data aim to identify asymmetric deformations and damaged track components. This dataset supports a study on the use of 3D models for evaluating the use of auxiliary rails in railway transition zones [54]. These data can be used for maintenance purposes, including predictive maintenance and fault detection. Dataset [55] contains data for a study published in the Canadian Geotechnical Journal on the evaluation of track stiffness without wheel load data [56]. The data include deflection data, trackside measurements, geometrical data and speed data. These data can be used for system monitoring and for predictive maintenance purposes to quantify and detect changes in the track support stiffness and to determine track deflections. Nonetheless, the dataset maintainers clarify that assumptions need to be made concerning train loading and track behavior due to the unmeasurable variation in train loads. Dataset [57] is part of the study published on the analysis of railway track behavior [58] and contains data on distributed acoustic sensing (DAS) strain and conventional lineside monitoring from strain gauges and digital image correlation (DIC) deflection, as well as other statistical data from the study. This dataset can be used for system monitoring and predictive maintenance purposes. Dataset [59] contains historical detection readings for three types of track defects (surface, cross level and dip), and it was made public by INFORMS RAS as part of a collection of open datasets for the application of AI solutions in the railway sector. This dataset can be used for the analysis of track geometry defects for maintenance purposes, including predictive maintenance, fault detection and system monitoring.
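To make the link between velocity measurements and deflection concrete, the sketch below integrates a sampled velocity signal with the trapezoidal rule and reports the deflection range. It is a simplified illustration, not the automated processing described in [52]: the velocity samples are synthetic, the function names are hypothetical, and the filtering and drift correction that real inertial data require are omitted.

```python
def cumulative_trapezoid(samples, dt):
    """Cumulative trapezoidal integral of uniformly sampled data, starting at zero."""
    integral = [0.0]
    for previous, current in zip(samples, samples[1:]):
        integral.append(integral[-1] + 0.5 * (previous + current) * dt)
    return integral


def deflection_range(displacement):
    """Total deflection range: largest minus smallest displacement in the record."""
    return max(displacement) - min(displacement)


# Synthetic sleeper velocity in mm/s during one axle passage (assumption).
velocity = [0.0, -1.0, -2.0, -1.0, 0.0, 1.0, 2.0, 1.0, 0.0]
dt = 0.1  # assumed sampling interval in seconds

displacement = cumulative_trapezoid(velocity, dt)  # mm
print(f"maximum downward deflection: {-min(displacement):.2f} mm")
print(f"total deflection range: {deflection_range(displacement):.2f} mm")
```

Acceleration records can be handled by applying `cumulative_trapezoid` twice, although integration drift then grows faster and makes high-pass filtering more important.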
The data collected from railway operations can be utilized for maintenance purposes. Dataset [60] contains dynamic responses, GPS positions and environmental conditions of two light rail vehicles in the city of Pittsburgh in the United States, and it is further analyzed in the publication [61]. The data also include acceleration measurements and track maintenance schedules. This dataset can be utilized for system monitoring, location of assets, predictive maintenance, and scheduling purposes.

Railway Ballast
Data related to the railway ballast can be useful for the maintenance of railway networks. Dataset [62] contains 3D scanner data of two types of railway ballast, calcite and Kieselkalk, including shape analysis information. More information on this dataset is available in its two related publications, a study on shape analysis and ballast stone [63] and a study on discrete element method (DEM) simulations of railway ballast [64]. Another dataset [65] contains measurement data from uniaxial compression tests and direct shear tests conducted on the same two types of railway ballast. For this experiment, direct shear tests were conducted at three different levels of normal load (10 kN, 20 kN and 30 kN), with three repetition tests for each ballast type and load level. Similarly, dataset [66] is based on cyclic friction tests of ballast stone interfaces with different vertical loads. The data result from a study on cyclic ballast-ballast friction tests [67] under varying loads, are expressed as coefficient of friction (CoF) values, and include 3D scanner data of angular-tip stones before and after the tests were conducted. These datasets can be utilized to assist computer-based simulations and AI models intended to maintain railway ballast, for predictive maintenance purposes and to ensure the safety of railway operations.

Catenary System and Electrical Equipment
Dataset [68] contains data published in a Chinese study on wind-induced responses of the catenary for a high-speed railway [69]. This dataset illustrates the dynamic behavior of the catenary system in different flow conditions (turbulent and uniform) in four tension combinations by measuring the displacement (D2×) and acceleration (A1×, y) at the hanging point of the steady arm. The data were collected using a high-sensitivity microacceleration sensor and a laser displacement meter. This dataset can support different maintenance tasks, including predictive maintenance, fault detection and system monitoring.
Data collected from electrical measurements can be utilized for the maintenance of railway electrical equipment. The dataset of measured and commented pantograph electric arcs in DC (direct current) voltage railways [70] presents digitized sampled data acquired with a data acquisition system located onboard and connected to voltage and current sensors. The data were collected from Trenitalia (Italy) and Metro de Madrid (Spain) in late 2019 for the international research project MyRailS [71]. The data include recordings of pantograph electrical quantities, which are difficult to obtain from real-world operations. A similar dataset [72] contains measured pantograph voltages and currents of European AC (alternating current) railways. The set contains digitized sampled data acquired with various data recorders located onboard and connected to voltage and current sensors (voltage dividers and Rogowski coils). The data were collected from four major European AC railways: the Zurich-Brig railway line (Switzerland), the Rome-Naples railway line (Italy), the Hamburg-Dortmund-Frankfurt railway line (Germany) and the Paris-Lyon railway line (France). Both datasets can be used for different maintenance purposes, including power quality studies, the interaction of rolling stock with the overhead contact line, the analysis of electrical phenomena of the systems, the optimization of traction power supply in DC/AC rolling stock and the assessment of other energy-related areas (such as signaling). The 2 × 25 kV Railway Feeding System Simulation Database [73] contains electricity data measurements from computer simulations. The simulation is based on a simplified model of a 2 × 25 kV bi-level traction power system for feeding high-speed trains; this model is defined by a study that presents traction system models and solvers for extensive network simulations [74]. This dataset can be utilized for the assessment of electrical equipment in high-speed railways.

Communication System
Dataset [75] contains performance data of transmission control protocol (TCP) congestion control algorithms in high-speed railway scenarios and in mobile and static scenarios from computer simulations. The algorithms measured are Hd-TCP and TCP NewReno. These data can be utilized for the assessment, monitoring and fault prevention of long-term evolution (LTE) communication networks in railway systems.

Construction Works
Data collected during railway restoration and construction work can be utilized for maintenance purposes. Dataset [76] includes monitoring data for a railway bridge before, during and after a retrofitting process. The bridge, known as KW51, is located in Leuven, Belgium. These data include different measurements of the structure during the process, such as acceleration on the bridge deck and the arches, strain on the bridge deck and on the diagonals connecting the bridge deck with the arches, strain on the rails and displacement at the bearings, as well as measurements of the temperature and relative humidity of the area. Similarly, dataset [77] contains data collected during the rehabilitation of the Buna railway bridge in Croatia. The dataset contains acceleration data corresponding to a roving test, with specified positions and orientations, from an experiment conducted during work on the bridge to apply ultra-high-performance fiber-reinforced concrete (UHPFRC). This dataset supports the study published on the rehabilitation of the Buna bridge [78].

Safety and Security
This section presents datasets related to railway safety and security, grouped into situational awareness, surveillance, and accident prevention and risk assessment.

Situational Awareness
Image data can be utilized to better understand the railway environment and further promote safety and security. The Cityscapes dataset [79] contains a set of diverse stereo video sequences recorded in public street scenes from 50 different cities at different times and under different weather conditions. This dataset includes high-quality pixel-level annotations of 5,000 frames and a complementary 20,000 weakly annotated frames. Other features include polygonal annotations with dense semantic segmentation for vehicles and people. The level of complexity of this dataset is high, comprising 30 different classes clustered in eight different groups, including a class for railway vehicles. The labeling policy and other rich metadata are available to the public. More information on this dataset is available in the dataset publication paper [80]. The RailSem19 dataset [81] contains a set of 8,500 images taken from a rail vehicle perspective for semantic rail scene understanding. The dataset includes extensive semantic annotations, based on geometric shapes (polygons and polylines) and dense label maps. A good number of image frames show intersection zones of road and rail vehicles as well as other complex railway scenarios. Difficult weather and lighting conditions are also taken into consideration. Some of the labels presented in RailSem19 are compatible with the Cityscapes dataset [79]. More information on this dataset is published in the supporting paper for RailSem19 [82]. Possible applications of these datasets in the railway sector include safety and security, surveillance, situational awareness and any computer-vision-based models that interact with railway surroundings and city environments.

Surveillance
Some of the datasets were constructed with surveillance as their main purpose. The PETS 2017 dataset [83] contains data from on-board surveillance systems intended to protect critical assets. PETS stands for performance evaluation of tracking and surveillance, and the dataset is intended for evaluating performance in detecting various surveillance events. It contains two different sets of data: the ARENA dataset and the IPATCH dataset. ARENA includes 22 scenarios captured on multicamera RGB video recordings for the detection and understanding of human behavior around a static vehicle. The focus of these acted scenarios is the classification of normal, abnormal/rare and threatening behavior. The IPATCH dataset presents piracy-inspired scenarios and includes not only image data but also data from various sensors; however, it is aimed at waterborne vessels. Further details on the dataset can be found in its publication [84]. PETS 2017 can be implemented for surveillance purposes in railway stations and platforms. Dataset [85] contains a 3D point cloud of a railway trench in Lavancia-Épercy in France. The 3D point cloud is composed of 110,356,682 points containing X, Y, Z and RGB information. This dataset was conceived within the multi-scale observation and monitoring of railway infrastructure threats (MOMIT) initiative backed by the European Union [86]. The dataset is intended to observe and monitor railway infrastructure threats and can be used for both security and monitoring purposes.

Accident Prevention and Risk Assessment
The observation of physical and geological phenomena surrounding railway environments is crucial for maintaining operational safety. Dataset [87] supports a monitoring and early warning method for rockfall along railways, based on the characteristics of vibration signals. The dataset was generated from the results of a rockfall test in which the vibration signals of rocks falling onto a flexible safety protection net and onto different areas of the rails were obtained [88]. This dataset can be implemented for the development of an early warning system for the safety and security of railway network operations. Similarly, dataset [89] contains data from experiments on granular flow behavior and deposit characteristics. The data include relationships between mean grain size and global shear as well as other numerical data from physical and geological phenomena. These data can be used to infer implications for rock avalanche kinematics, which is relevant to the railway sector. Another example is dataset [90], which includes geomagnetic and geoelectric field values generated by analytic calculations. In this dataset, two sets of data are provided: the first set contains data sampled once a second and containing seven frequency components, and the second set contains data sampled once a minute and containing six frequency components. Geoelectric fields generated from the variation in geomagnetic fields can affect the operation of railway circuits. These datasets can be implemented for risk assessment, safety purposes, situational intelligence and predictive maintenance.
Historical data and social engineering data can be used for the prediction and early detection of accidents. Dataset [91] contains data on traffic accidents that occurred in France from 2005 to 2016, including railway accidents. This dataset contains data on type of collision, atmospheric conditions, surface condition, people involved, vehicle information and many other factors. Dataset [92] contains data on over 3,500 animal-train collisions and over 10,000 locations provided by Polish State Railways (Polskie Koleje Państwowe, PKP). The data range from 2012 to 2015. Some of the traffic characteristics and factors considered in the dataset include traffic intensity, speed, rail curvature, land use and animal habitat characteristics. These data support a study on the relationship between animal habitat composition and population and ungulate-train collisions across the country [93]. Dataset [94] contains data on the work schedules and sleep patterns of railroad employees from a study sponsored by the Federal Railroad Administration (FRA) in the United States. The aim of this study is the analysis of work-schedule-related fatigue in railway employees via the documented work/rest schedules and sleep patterns of a test group. This dataset includes work schedule and sleep pattern data of signalmen, maintenance of way (MOW) workers, dispatchers, and train/engine service workers in both freight and passenger trains. These datasets can be implemented for safety and risk assessment purposes.

Passenger Mobility
This section contains datasets related to passenger mobility, mostly on passenger flow estimations.

Passenger Flow Estimation and Trends
Datasets that reflect passenger trends and occupancy levels can be implemented for estimating passenger mobility in order to optimize train operations. Some of the datasets consulted contain data on passenger trends from rail operating companies (ROCs). For instance, dataset [95] contains ridership data on the Bay Area Rapid Transit (BART) railway network in the San Francisco Bay Area in northern California (United States). This dataset includes hourly ridership divided by year, starting from 2011, and was generated using ridership reports provided by the government-owned company. These data can be implemented for the prediction of train occupancy, passenger flow and infrastructure capacity. Dataset [96] contains passenger frequency data from Swiss Federal Railways (SBB-CFF-FFS). The data were collected during operations in 2014. These data can be used for passenger flow estimation. Dataset [97] contains data on Deutsche Bahn (DB) trains and journeys at different stations in Germany. These data were compiled from online information available from the DB service, and can be used for timetable scheduling, passenger flow analysis and traffic planning purposes. Dataset [98] contains data from metro lines in India for the prediction of traffic and passenger flow. The dataset contains training and test sets for the purpose of predicting traffic volume according to weather conditions. Some of the features included in this dataset are the date, pollution index, humidity, wind speed and direction, visibility, snow, clouds, weather description and corresponding traffic volume in the metro. These data can be implemented for the prediction of train occupancy, passenger flow and infrastructure capacity.
Other datasets contain data that are not connected to a particular railway company. Dataset [99] contains monthly data on train occupancy from 1999 to 2011. These data can be implemented for the prediction of train occupancy, passenger flow and transport capacity. Dataset [100] presents records of crowd density on several trains over a span of months. Some of the values in this dataset have been altered or obscured for security reasons. This dataset can be implemented for the analysis and understanding of patterns in passenger flow.
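As an illustration of how hourly ridership records of the kind described above might be prepared for passenger flow estimation, the following minimal Python sketch aggregates hourly counts into daily totals and finds the busiest hour of a day. The record layout (date, hour, origin, destination, trips) and all values are assumptions made for this example and are not taken from any of the reviewed datasets.

```python
from collections import defaultdict

# Hypothetical hourly ridership records: (date, hour, origin, destination, trips).
# The values are invented for illustration only.
records = [
    ("2019-03-04", 7, "A", "B", 120),
    ("2019-03-04", 8, "A", "B", 310),
    ("2019-03-04", 8, "B", "C", 95),
    ("2019-03-05", 8, "A", "B", 290),
]


def daily_totals(rows):
    """Sum the trips of all origin-destination pairs for each date."""
    totals = defaultdict(int)
    for date, _hour, _origin, _destination, trips in rows:
        totals[date] += trips
    return dict(totals)


def peak_hour(rows, date):
    """Return the hour of the given date with the highest total number of trips."""
    per_hour = defaultdict(int)
    for row_date, hour, _origin, _destination, trips in rows:
        if row_date == date:
            per_hour[hour] += trips
    return max(per_hour, key=per_hour.get)


print(daily_totals(records))
print(peak_hour(records, "2019-03-04"))
```

Aggregates of this kind are a common starting point for occupancy prediction, since they can be joined with calendar or weather features before a model is trained.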

Supporting Datasets and Public APIs
A number of general datasets and APIs that are publicly available may not be directly applicable to intelligent railway models as standalone data. Nonetheless, they can be very useful as complementary data. This section presents an overview of some of the supporting datasets and public APIs that were found. Table 4 presents the list of publicly available supporting datasets and public APIs. The number of datasets listed was limited to sixteen for illustrative purposes. A number of government websites provide information concerning railway assets and configurations. In North America, for example, the Rail Network dataset contains data on Canadian railways, including a linear network that represents railway tracks and other rail data such as geometry, operator's name, owner's name, track condition and subdivision name. These data can be implemented for traffic management, maintenance and other purposes. The National Railway Network (NRWN) GeoBase Series dataset contains geometric descriptions and basic attributes of Canadian railway systems. The data include tracks, junctions, crossings, marker posts, stations and structures that are associated with descriptive attributes (e.g., track classification, operator, gauge and others). This dataset can be implemented in different areas including maintenance, security, system monitoring and more. Moreover, the Grade Crossings Inventory dataset contains an inventory of the locations and characteristics of railway crossings in Canada. Some of the data provided include location (latitude and longitude), responsible road authority, previous number of accidents, fatalities, injuries, number of daily trains and vehicles, and other data such as maximum train speed (mph), road speed (km/h) and more. This dataset is maintained by the Government of Canada, and it can be implemented for risk assessment and safety purposes.
The datasets on Railroad Crossings and Railroad Bridges available on the Homeland Infrastructure Foundation-Level Data (HIFLD) website contain the location of and other detailed data on train assets around the United States. The Railroad Bridges dataset contains data on more than 86,000 railroad bridges, including a large number of attributes depicting location, classification, geometrical data, description and more. The Railroad Crossings dataset contains detailed information on more than 245,000 railway crossings, including location, railway line, city and security reports among other data. The US Department of Transportation created the Freight Analysis Framework (FAF) dataset. This dataset represents flows of goods among regions in the United States for all modes of transportation, including railways. It is intended to map the flow of freight transportation across the country by combining data from a variety of sources, including the 2012 Commodity Flow Survey (CFS), international trade data from the Census Bureau and data from other industry sectors including agriculture, resource extraction, utility, construction, services and more. This sort of dataset can be implemented for the recognition of transport trends and patterns.
European governments also provide extensive supporting datasets. The HARmonized grids of Critical Infrastructures in EUrope (HARCI-EU) dataset contains data on harmonized grids of critical infrastructures (CIs) within the European Union [101]. CIs are defined as infrastructures essential to the safety and well-being of people. This dataset contains geospatial data of CIs, including railway transport infrastructure. Specifically, the dataset consists of 22 grids in GeoTIFF format with a resolution of 1 km². HARCI-EU uses the ETRS89 coordinate system and a map that implements a Lambert azimuthal equal-area projection scheme. The spatial distribution of railway infrastructures and their economic value can be implemented for risk assessment, safety, and security purposes. The European Railway Accident Information Links (ERAIL) database presents updated information on railway accidents and incident reports across the member states. The data can be filtered by country and include information such as date, location, number of fatalities and injuries, and investigation details of more than 3,000 events. In an effort to map the extent of European railways, another dataset contains the names, coordinates and basic properties of more than 36,000 train stations located in or adjacent to European territory. This dataset includes location data (latitude-longitude), country and city names, UIC code and other data.
On a worldwide scale, the United Nations World Food Programme (WFP) compiled a dataset depicting all railway systems worldwide. This dataset contains more than 110,000 entries with location coordinates, geometrical data and characteristics of railways worldwide. The Citylines dataset contains a complete description of railway networks from more than 350 cities around the world, including the historical data of railway line development. The data presented include geospatial data, systems, lines, sections, stations and more.
In addition, there is a large number of public APIs that provide data on railway operations. These APIs are often maintained by government organizations, transport agencies and railway operators, and can be utilized as complementary data for intelligent models. The British National Rail Enquiries (NRE) API provides an open data feed on TOCs across England, Scotland and Wales. It utilizes three engines: Darwin, Knowledgebase (KB) and Online Journey Planner (OJP). Darwin provides timetable information, including departure predictions, timetable rescheduling, service cancellations, predictions of delays and historical data for the previous twelve months. KB provides data on the UK railway network, including static data, such as information on station facilities, and real-time information, such as service disruptions and engineering work. Lastly, the OJP engine provides data on routes, fares and availability for planning purposes. The Sydney Trains Service Interruptions RSS Feed contains a real-time machine-readable feed of information concerning service interruptions on trains in Sydney (Australia). The alerts apply at the stop, trip or service line level and are provided in GTFS format. The Dutch TOC Netherlands Railways (Nederlandse Spoorwegen, NS) provides a REST API that handles a large amount of data on timetables, prices, live departure times, service disruptions and engineering work, and it also includes geodata on all stations in the Netherlands.

Other available APIs have been constructed by combining different publicly available APIs and RSS feeds that are related to public transportation. Trafiklab comprises a collection of APIs for public transport services in Sweden, including regional real-time GTFS data, vehicle positioning of transport vehicles, disturbance and interruption information, traffic information and data on stops and assets from Swedish railway companies. In general, the General Transit Feed Specification (GTFS) (https://gtfs.org/, accessed on 15 August 2021), originally known as the Google Transit Feed Specification, provides a common format for transit companies to share static and real-time data, including routes, stops, trips, service alerts and schedules. It comprises two parts, GTFS Static and GTFS Realtime, for static and real-time data, respectively.
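To make the static part of GTFS more concrete, the following minimal Python sketch parses the content of a stops.txt file, one of the comma-separated text files that make up a GTFS Static feed, and computes the bounding box of the stops. The column names stop_id, stop_name, stop_lat and stop_lon follow the GTFS specification; the three stops themselves are invented for illustration and do not come from any feed reviewed here.

```python
import csv
import io

# Hypothetical excerpt of a GTFS Static stops.txt file; the rows are invented.
SAMPLE_STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,52.3791,4.9003
S2,North Junction,52.4012,4.9321
S3,Harbour Halt,52.3650,4.8712
"""


def load_stops(text):
    """Parse stops.txt content into a list of dicts with numeric coordinates."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "stop_id": row["stop_id"],
            "stop_name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
        for row in reader
    ]


def bounding_box(stops):
    """Return (min_lat, min_lon, max_lat, max_lon) over all stops."""
    lats = [stop["lat"] for stop in stops]
    lons = [stop["lon"] for stop in stops]
    return (min(lats), min(lons), max(lats), max(lons))


stops = load_stops(SAMPLE_STOPS_TXT)
print(len(stops), bounding_box(stops))
```

In practice a feed is distributed as a zip archive, so the text would first be read from the archive before being passed to a parser such as the one sketched here.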

Discussion
As we have previously established, the possible applications of AI in the railway sector are vast, and there are many railway challenges and tasks that can greatly benefit from AI-based models. At the heart of most AI models are the data used to feed their logic. Looking at publicly available datasets, we recognized some promising sources that can be used for research without any constraints regarding the publication of results.
Most of the publicly available datasets were related to Traffic Planning and Management. The most common types of data found in publicly available datasets are numerical and string data. For planning and rescheduling, additionally, some datasets have been generated by third parties using official APIs as the main source of data. Other datasets have been generated from computer simulations and hypothetical cases. These datasets contain timetable information, passenger flow data, trends and occupancy levels, information on delays and timetable performances. There are also other less common data such as operational measurements on speed, vibration and sound that can be used for the evaluation of passenger comfort onboard the trains.
When it comes to Maintenance and Inspection, there were more publicly available datasets than originally expected. There are open data on different railway maintenance applications. Some of the data were collected during train operations using IoT devices and smart sensors. The data reviewed can be clustered according to the physical components of the train and railway, including railway tracks, sleepers and ballast, and electrical and communication components. Other data were collected from the vibration generated during railway restoration and construction work.
There are few openly available datasets related to the area of Safety and Security. Some public image datasets on traffic can be adapted for railway applications related to situational awareness. These datasets are complete and well documented. The same is true in the case of historical accident data. Nonetheless, there were no public datasets specific to railway security that were openly available under CC copyright or similar. Even though they were not specifically designed for railways, some datasets containing measurements on geological observations are important for both the safety and maintenance of railway networks. These data are of great importance for risk assessment and predictive maintenance purposes.
Comparing the existing AI applications reviewed in [14] with the public datasets reveals several mismatches. On the one hand, although the majority of applications (57%) focus on maintenance and inspection, this is not reflected in the corresponding share of datasets (only 25%). On the other hand, a significantly higher proportion of datasets is available for traffic planning and management: 25% of applications versus 50% of datasets. Also, [14] found that 8% of papers relate to Autonomous Train Driving, but no public datasets were found for this subdomain; a similar situation holds for Passenger Mobility. Finally, in line with the results obtained in [14], no public datasets were found related to Revenue Management and Transport Policy.
Supporting railway datasets and APIs that are openly available to the public are vast and easily found. For the purpose of this review, we have only mentioned some illustrative examples of these data. Nonetheless, these supporting sources could be used to aid AI-based models by reading real-time data related to many railway application sectors. These data can be implemented in AI models for complementary purposes. Furthermore, real-time data can be harvested from these publicly available sources in order to create new datasets that can later be used to feed AI models. Such complementary datasets would provide more system-specific characteristics and lead to more accurate AI-based solutions.
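The idea of harvesting real-time data into a new dataset can be sketched as follows. This is a minimal Python illustration under stated assumptions: fake_fetch is a stand-in for a call to some real-time feed and returns invented records, and the field names (service_id, station, delay_min) are hypothetical rather than taken from any of the APIs mentioned above. Each harvested snapshot is timestamped and appended to a CSV stream, so that repeated calls accumulate a dataset over time.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column layout of the harvested dataset.
FIELDS = ["harvested_at", "service_id", "station", "delay_min"]


def fake_fetch():
    """Stand-in for a request to a real-time feed; returns invented records."""
    return [
        {"service_id": "T100", "station": "X", "delay_min": 3},
        {"service_id": "T200", "station": "Y", "delay_min": 0},
    ]


def harvest_once(fetch, out, now=None, write_header=False):
    """Fetch one snapshot and append it to `out` as timestamped CSV rows."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    rows = fetch()
    for record in rows:
        writer.writerow({"harvested_at": stamp, **record})
    return len(rows)


# Two consecutive harvests written to an in-memory buffer instead of a file.
buffer = io.StringIO()
harvest_once(fake_fetch, buffer, write_header=True)
harvest_once(fake_fetch, buffer)
print(buffer.getvalue())
```

Recording the harvest time with every row keeps the collection scheme documented inside the data itself, which addresses part of the metadata gap discussed in this review.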

Challenges and Opportunities
This section discusses the shortcomings and the possibilities of the reviewed datasets, along with some observations from the authors' perspective.
First, the main challenge for this review was the low number of publicly available datasets and their uneven distribution over subdomains. Considering the amount of literature regarding the application of AI in the railway industry as reviewed in, e.g., [6,10,12,14], the number of publicly available datasets is very small in comparison. Additionally, it was not possible to obtain any of the datasets used in the previous literature review [14]. Unfortunately, most datasets are still private or not licensed as CC or open data, which renders their utilization in public research limited or even impossible. Second, the lack of suitable image datasets is evident. Out of 62 datasets reviewed, only six image datasets could be found that were related to the railway industry and openly available to the public. This is a big gap considering the amount of research done involving computer vision models in the railway sector. As seen in the previous literature reviews, the collection of these data takes time and effort, which could explain why these datasets are not made publicly available by researchers. Data privacy concerns may also discourage the release of image datasets. Third, another big challenge encountered was the lack of proper documentation and metadata information on publicly available datasets. Only a very small number of the datasets contained detailed documentation, and only a small fraction had published papers that described them thoroughly (all of which are referenced in this report). For the most part, however, the datasets presented little to no information on data collection schemes. There is also a lack of information on missing data, errors and other limitations. This lack of proper documentation can become a problem for the assessment of data transparency and reliability.
On the positive side, we believe that valuable data have been found while conducting this review. This raises opportunities for new research in the area of AI applied to the railway sector. First, when using an open CC dataset, the data can be published together with the research, enhancing its impact and credibility. Data could also be complemented or enhanced using different techniques and made available to the public with the corresponding documentation. At the same time, this would allow different models to be freely compared on the same datasets. Second, we observe that a large amount of government data is available from most Western countries. Even though most government data are mere statistical information, they can be collected over time to create large datasets using data fusion algorithms. These could be complemented with statistical data and reports made available by governments and rail operating companies.

Conclusions
Based on previous literature reviews, we have analyzed publicly available datasets across seven railway subdomains for AI applications. We have used multiple portals to collect the public datasets, including general databases like the European Data Portal and Data.gov, AI-focused databases like Zenodo and FigShare and recent data-focused journals like Data in Brief. The data types have been classified as numeric, image, label and other.
We believe that with the public data available today, some railway problems could already be approached with AI-oriented solutions. For instance, the domain of Traffic Planning and Management has a good number of public datasets (28 in total) and a wide availability of data harvesting sources, such as APIs, GTFS and RSS feeds, for data collection. These data can be used for developing traffic predictions as well as timetabling and real-time rescheduling models and approaches. There are also rather complete and readily available datasets that can be used to research problems in the domain of Maintenance and Inspection (28 datasets in total), which are mainly related to the maintenance and inspection of railway tracks. Maintenance data for rolling stock, railway ballast, the catenary system and electrical equipment are also present but to a lesser extent. These data could be used for better health monitoring and failure prediction of infrastructure and rolling stock to minimize the impact of disruptions on railway traffic. Other railway domains do not present as many openly available datasets: Safety and Security includes 10 public datasets and Passenger Mobility 8 public datasets. Finally, no datasets could be found for Autonomous Train Driving, Transport Policy or Revenue Management. Thus, generating the first related public datasets could contribute greatly to boosting research in these domains. In all railway domains analyzed in this paper, the most common type of data is numerical data. We believe this is the easiest type of data to obtain; we have observed different data collection methods, mostly onboard sensors and other IoT devices such as wireless sensor networks. However, only six image datasets could be found. This is disproportionately low considering the amount of research done involving computer vision models in the railway sector. Also, the quality of the datasets is often unknown, and documentation is frequently limited or missing.
On the positive side, the study revealed that there are publicly available data maintained by government organizations and TOCs that can be of great use to support AI-based models, such as infrastructure network and statistical data, and certified APIs. In particular, these data come from official public services and are usually well-curated real-time data that can greatly contribute to the accuracy of AI models. Moreover, the increasing number and sophistication of IoT devices, along with decreasing manufacturing costs, present a promising outlook for the collection of raw data in the railway sector. There are different AI techniques related to Natural Language Processing (NLP) that could be utilized to process any unstructured data collected.
We recognize several promising research directions. First, the formation of a unified database of AI-oriented railway data would be beneficial; we believe that a centralized database of railway-specific datasets would greatly contribute to research in this area. Second, further investigation of the existing datasets is required to understand their size, quality and applicability in greater detail. Third, new high-quality data should be collected and made available for public use and research by active researchers and by data and problem owners. Lastly, to guarantee the quality of new datasets, publications in peer-reviewed journals like Data in Brief and IEEE DataPort, and also in online databases like Zenodo and FigShare, should be encouraged. We believe that these new developments would lead towards faster uptake and more diverse developments of AI applications in railway systems.

Figures 1-3 present the classification of the reviewed datasets per subdomain, data type and year, respectively. We can observe that the railway applications with the most available datasets are Traffic Planning and Management and Maintenance and Inspection, 28 each; 10 are related to Safety and Security, 8 to Passenger Mobility and none to Autonomous Train Driving and Train Control, Transport Policy and Revenue Management. This shows that Traffic Planning and Management and Maintenance and Inspection are the most prevalent railway domains in AI-oriented studies. Also, numerical data prevails as the most available data type. Finally, there is a gradual growth of openly available datasets from 2016 onwards, with 2020 being the year with the most AI-oriented railway datasets published.

Table 1.
A short description of all search databases. The databases were accessed on a regular basis between 15 February 2020 and 15 January 2021.

Table 3.
Cont.

Table 4.
Selected supporting datasets and public APIs. The links were accessed between 15 September 2020 and 15 January 2021.

Table A1.
List of all 62 datasets consulted for the review.