An Automated Method for Extracting and Analyzing Railway Infrastructure Cost Data

Dopazo, Daniel Adanza; Mahdjoubi, Lamine; Gething, Bill

doi:10.3390/buildings13102405


by Daniel Adanza Dopazo *, Lamine Mahdjoubi and Bill Gething

Coldharbour Ln, Bristol BS16 1QY, UK

* Author to whom correspondence should be addressed.

Buildings 2023, 13(10), 2405; https://doi.org/10.3390/buildings13102405

Submission received: 16 August 2023 / Revised: 6 September 2023 / Accepted: 20 September 2023 / Published: 22 September 2023

(This article belongs to the Special Issue Data Analytics Applications for Architecture and Construction)


Abstract: The ability to extract information and bring it into a common format is essential for making predictions, comparing projects through cost benchmarking, and gaining a deeper understanding of project costs. However, the lack of standardization and the manual entry of data make this process time-consuming, unreliable, and inefficient. To tackle this problem, a novel approach is presented that combines data mining, statistics, and machine learning to extract and analyze railway infrastructure cost data. To validate the suggested approach, data from 23 real historical projects from the client, Network Rail, were extracted and converted so that their costs are comparable. Finally, machine learning and data analytics methods were implemented to identify the most relevant factors, allowing cost benchmarking to be performed. The presented method demonstrates the benefits of data extraction for gathering, analyzing, and benchmarking each project efficiently, and for developing a deeper understanding of the relationships and factors that matter in infrastructure costs.

Keywords: data extraction; data mining; railway infrastructure costs; infrastructure cost data analysis; cost analysis

1. Introduction

The lack of standardization in railway infrastructure cost projects leads to data being stored in different formats and structures. This is greatly amplified when comparing data from different organizations [1]. This issue makes the decision-making process more difficult and unreliable since it leaves little room for comparison and analysis [2].

Data extraction and data mining are technologies that offer great potential in this field since they can provide many benefits. They allow a better comparison and analysis, they are an efficient and cost-effective solution, and they help companies to gather reliable information.

However, real projects demand that large amounts of information are extracted systematically. When the process of data gathering happens manually, many limitations appear such as time inefficiencies and subjectivity as the classifications are based on human judgment, and the process becomes error prone [3].

Automating data mining offers great potential for overcoming all of these pitfalls. It allows a robust, unbiased, and efficient system to be produced in which the more data that are handled, the greater the benefits in terms of efficiency and data analysis [4].

Different approaches have been proposed in this field. Unfortunately, none of them has targeted railway infrastructure cost data, which highlights the novelty of the presented method. The approach of [1] demonstrates the lack of standardization by analyzing a wide variety of data structures, examining infrastructure documents, and performing deep text analysis. As constructive criticism, it could be said that the scope of the method is too wide to allow researchers to delve deeply into each aspect. Alternatively, ref. [5] presents a framework combining the capabilities of data mining and machine learning to predict different parameters for non-residential buildings. Other approaches, such as that of [6], demonstrate that data analysis techniques and machine learning algorithms can be helpful in other scenarios such as the transport network.

To contribute to the field, a novel method is presented mainly composed of three sequential processes: Data extraction, where the documentation is analyzed and the existing information is parsed to fit a common standard; data merging, where the different types of information are combined; and data analytics, where the main factors that matter in infrastructure costs are identified.

Thus, a novel method is presented based on the application of data mining and machine learning to extract and analyze the existent information from 23 different CAF (Cost Analysis Forms) input files, which contain a wide variety of information and structures. The suggested approach was able to increase the efficiency in the processes of data gathering and analysis and demonstrated the benefits of automatic data extraction and analysis with practical implementation.

2. Related Work

Data mining and machine learning offer strong capabilities to cope with the lack of standardization and the manual entry of data in railway infrastructure cost projects. To our knowledge, there are no other methods for performing automatic data extraction on railway infrastructure costs, which underlines the novelty of this paper. However, some related papers offer useful lessons.

The related literature is classified here into three categories: Firstly, the most relevant studies that implement data science and machine learning in infrastructure costs are mentioned. Secondly, the closest related studies on railway infrastructure are assessed. Finally, strongly related studies that implemented data mining in similar scenarios are commented on.

2.1. Data Science and Machine Learning on Infrastructure Costs

To summarize the most relevant literature in this category, Table 1 is presented showing the reference, the main aim, and the way of approaching it for all relevant studies, demonstrating whether they implemented data science or machine learning on infrastructure costs with similar approaches.

Firstly, the paper by [7] consists of a method for improving data classification inside construction projects using variable correlation and machine learning. For its validation, the study presents a practical scenario with a clear scope. As a main limitation, the correlations found between variables may not hold when the same solution is applied to a different dataset.

Alternatively, the approach presented by [8] encompasses a method for optimizing risks and evaluating the model through rough set theory. Finally, a decision tree is implemented for optimization purposes. The approach is easily reproducible with strong results that demonstrate the efficacy of the approach. The main strength that can be highlighted in the study is the main focus of the efficiency and optimization with a great impact on its performance.

Additionally, other studies focused on infrastructure costs data analysis, such as that of [1], where a wide range of data structures were assessed, which is the main benefit, and the method avoids common mistakes through the usage of data mining and construction knowledge.

Finally, ref. [9] presented a method for automating the cost analysis, benchmarking, and prediction with the usage of machine learning and the surveyor’s knowledge. Their robust results explained in detail prove the efficacy of the suggested method. However, it is important to take into account the limitations of the different criteria relying on the surveyor’s knowledge.

2.2. Railway Infrastructure Studies

The studies that are most related to the field of railway infrastructure are gathered in Table 2 where it is possible to see the reference, the main goal, and the way of approaching it for each of them.

Firstly, in ref. [10], a new method is suggested for a better assessment of costs in railway infrastructure considering the type of train by promoting efficiency and creating a framework for allocating costs in an automated and systematic manner. The study reports robust results when implemented in that specific scenario. However, it is also important to highlight that it focuses only on a very specific case, namely a specific type of train.

Secondly, ref. [11] built a simulation model with a previously given set of objectives and restrictions to support the execution of thousands of scenarios in a scalable, efficient, and fault-tolerant manner; the model is deployed in a cloud computing environment for time efficiency. Although the results are difficult to validate since they come from simulations, the model shows potential for saving 88.20% of all the costs presented in the simulation.

Thirdly, ref. [12] built a proof of concept for maintenance scheduling merging data from the railway’s condition, planning, and costs with optimized intervention plans that provide an added value and have a great impact on costs. As a constructive criticism, it could be mentioned that the validation is a bit weak since it relies on the subjective consideration of twenty-five individuals.

Fourthly, the paper by [13] presents an econometric analysis of costs, traffic, and infrastructure for the Swedish railway system between the years 1999 and 2002. The missing recovery data techniques seem to have a great impact on the understanding of traffic data, which is, however, their main weakness due to the unreliability of the generated information.

Finally, ref. [14] suggest a framework to perform a whole system lifecycle cost analysis for asset management, which is based on railway network performance and cost analysis. The framework seems to be able to predict not only an individual asset but also the whole infrastructure, allowing for a better evaluation of the railway system.

2.3. Strongly Related Studies

To conclude, the studies that are most related to the presented approach are commented upon and assessed, highlighting those that implemented data mining techniques in a similar manner. A summary list of those studies is gathered in Table 3.

Firstly, in ref. [15], an information retrieval algorithm is presented that enhances the previously established Apriori algorithm. Their approach works as a search engine, borrowing from information retrieval, and is specifically implemented to make recommendations to customers. It has been well tested, assessing not only the results but also its efficiency with synthetic and real data. As a constructive criticism, it could be said that the previously established Apriori algorithm was already suitable for some data categories.

Secondly, ref. [16] suggested an event-driven data service method that is demonstrated with a prototype. Their approach first selects a subset of the observed properties using event-filtering technologies and then pushes the data that meet the subscription requirements on time. The results show that the proposed method can achieve active pushing of the desired data to subscribers in the shortest possible time.

Thirdly, ref. [17] presented a review of different content-based image retrieval approaches. The results show that the main challenges that remain are the image segmentation and finding semantic meanings of an image.

Finally, ref. [5] presented a two-step framework that is able to identify the statistics and the inner pattern of the analyzed data harnessing machine learning capabilities. Their main aim was to reduce the expert’s intervention when utilizing measured raw data to infer different types of information regarding non-residential buildings such as performance class, operational behavior, or building use type. Strong results validate the method, especially in the case of building operations, which are 63.6% more accurate compared to the baselines.

2.4. The Contribution of the Presented Method

The main novelty of the presented method lies in its approach to addressing the problem of standardization and manual data entry in the analysis of railway infrastructure costs. The key contributions and novel aspects of the paper are summarized in the following points:

Integration of Data Mining, Statistics, and Machine Learning: The paper introduces an approach that combines techniques from data mining, statistics, and machine learning and applies them to a very specific scenario, namely railway infrastructure costs.
Automation and Efficiency: By leveraging data mining and machine learning, the paper aims to automate the process of extracting and analyzing data. This automation reduces the manual effort required for data inclusion and analysis, making the process more efficient.
Cost Benchmarking: The paper focuses on the important task of cost benchmarking in the context of railway infrastructure projects. It employs machine learning and data analytics methods to identify the most relevant factors affecting project costs. This is valuable for comparing projects and understanding the drivers of cost differences.
Real Historical Data: The paper validates its approach using real historical data from 23 projects from the client, Network Rail. The usage of real data enhances the credibility and applicability of the proposed method.
Deep Understanding of Cost Relationships: The approach presented in this paper not only enables cost benchmarking but also facilitates a deeper understanding of the relationships and relevant factors that impact infrastructure costs. This insight can be crucial for making informed decisions and optimizing future projects.

In summary, the main novelty of this paper is the introduction of an integrated approach that harnesses the power of data mining, statistics, and machine learning to automate and streamline the extraction, analysis, and benchmarking of railway infrastructure cost data. This approach not only saves time but also enhances the reliability and efficiency of the analysis process, ultimately leading to a better understanding of project costs and their underlying factors.

3. Materials and Methods

The suggested method takes as input the disseminated information about railway infrastructure costs from 299 historical projects. The different steps of the presented approach are summarized in Figure 1. First, the system performs data extraction from different input files, mostly focusing on four data categories: project details, cost details, stage details, and possession strategy. As a second step, the method performs tasks about reclassification and data merging. Finally, the resulting information is used to perform some data analysis and to make some useful inferences based on the given data.

3.1. Materials

For development of the suggested method, the following technologies were used:

Anaconda Navigator version 2.2.
The Jupyter Notebook IDE (integrated development environment), version 6.4.5.
The Python language.
Different open-source libraries, among which we highlight ‘xlrd’ for reading Excel files and ‘os’ for accessing some operating system capabilities.

3.2. The Scenario

The input data are composed of 23 different CAFs coming from real projects from the client, Network Rail. Each input file is developed according to the Rail Method of Measurement (RMM). The files are used for breaking down the expenses of each project into a large number of assets classified among different cost categories.

The information regarding each asset is structured differently, and the attributes are classified in a distinct manner depending on the version of each file. Five different versions can be distinguished: 1.5, 1.7, 2.0, 2.1, and 2.2. It is important to remark that there is a big gap between versions 1.7 and 2.0, whereas the rest of them only include small modifications.

3.2.1. Versions 1.5 and 1.7

The assets located on the CAF files coming from these versions are described with the following attributes:

Tier 1: Describes costs at project or subproject level for either ‘buildings’ or ‘Civil engineering’. The following categories can be found for this attribute: Buildings and property, civil engineering, electrical power plant, operational telecommunications, permanent way, railway control systems, and train power systems.
Tier 2: Describes broad ‘cost categories’ such as Acquisition Costs; Construction Costs; Renewal Costs; Operation Costs; Maintenance Costs; End of Life Costs; and Lifecycle Cost. It takes a wider range of values: AC (OLE), AC Traction Power System, Buildings, Businesses, Canopies, Car parks and roads, DC, DC Traction Power System, Depot plant, Drainage, Earthworks, Electrical, Fencing, Level crossing, Lifts and escalators, Mechanical, Network, Operational telecoms, Plain line, Platforms, Signaling, Signaling power supplies, Station Information and Security Systems, Structures, Switches and crossings, Train sheds.
Tier 3: Describes ‘cost groups’ covering the subdivision of cost category totals into a more detailed breakdown in each case. For instance, in the construction costs category, this includes key elements such as Substructure, Structure, Preliminaries, Services, and Equipment, and it can take the following values: Approaches, Auto (MSL), Auto (RTL), Auxiliary Transformer, Ballast, Business Voice, Cables, Cabling and Containments, Clocks, Closed Circuit Television, Coastal and Estuarine Defenses, Concentrator, Conductor Rail System, Control, Control System Only, Controls and Interlocking, Culverts, Customer Information Systems, Disconnectors, DNO Supply, Driver Only Operation System Components, Embankments, Footbridges, FSP Auto Reconfigurable, FSP Manual Reconfigurable, FSP Radial Feed, Generator, GSM-R, HV Cables, HV Switchgear, HV Transformers, Interlocking Only, Level Crossing Refurbishment Treatments, Lineside Telephone, LV dc Cables, LV Switchgear, Negative Short Circuit Device, Neutral Section, OLE system, Operational Voice, Over bridges, Phones only, Power, Principal Supply Point, Protection Relays, Protection System Upgrade, Public Address, Public Address/Voice Alarm, Public Emergency Telephone System, Radio, Rail, Rail Ballast, Rail Sleepers, Rail sleepers ballast, Retaining Walls, Rock Cuttings, RTU (SCADA), Signaling System, Sleepers, Soil cuttings, Station Help Points, Structures, TNO/DNO HV Supply, Trackside Equipment Only, Transformers/Rectifiers, Transmission FTN, Transmission IP, Transmission Legacy, Tunnels, Under bridges, Uninterruptable Power Supply, User Operated, Voice Recorders, Wire Run.
Work Type: A label that describes the work that has been carried out such as refurbishment, replace full, or replace partial.
Work Type code: The unique identifier code linking the work type that has been carried out.

3.2.2. Versions 2.0, 2.1, and 2.2

Alternatively, the assets located on the CAF files coming from these versions are described with the following attributes:

Primary reference: A group of eight numbers and letters uniquely identifying each asset of each project.
Asset: A generic classification attribute that replaces the older attribute Tier 1. It also describes the costs at a subproject level, distinguishing between ‘buildings’ and ‘civil engineering’. However, this attribute is somewhat more specific, indicating different subcategories inside civils such as earthworks, different types of drainage, and assets regarding the building of structures. This attribute can take the following values: Buildings and property, civils (drainage—resilience), civils (drainage—earthworks), civils (drainage—track), civils (earthworks), civils (structures), electric power and plant, permanent way, railway control systems, telecommunications, train power systems.
Structures: A more specific classification attribute, broadly similar to the previous Tier 3 categories, where a wider range of values can be distinguished: AC HV cables, AC HV switchgear, AC HV transformer, AC overhead line equipment (OLE), AC protection relay, AC remote terminal unit, AC transmission or distribution network operator HV supply, auxiliary transformer, bespoke color light signaling, buildings, canopies, car parks and roads, chamber, channel, coastal defenses, conductor rail heating, control system, controls and interlocking, culvert, DC conductor rail system, DC disconnectors, DC HV cables, DC HV switchgear, DC HV transformer, DC LV cables, DC LV switchgear, DC negative short circuit device, DC protection relay, DC remote terminal unit, depot plant, distribution network operator (DNO), electrical wiring and lighting system, embankment, European train control system (ETCS), fencing, footbridges, FSP auto reconfiguration, FSP manual reconfiguration, FSP radial feed, generator, gravel drain, hot axle box detector (HABD), interlocking, level crossing, lifts and escalators, lighting, mechanical heating, mineworkings–deep, mineworkings–shallow, mineworkings–surface, moving bridges, network, operational communications, over bridge, pantograph measuring system (PMS), pipe, plain line, platforms, points heating, principal supply point (PSP), pumps, ramp, remote condition monitoring (RCM), retaining wall, rock cutting, signaling cables, simple modular color light signaling, soil cutting, station information and surveillance system, switch and crossings, trackside equipment, train sheds, tunnel, under bridge, uninterruptible power supply, water tanking, wheel force measuring system.
Work type: A label that uniquely identifies the work that has been carried out such as refurbishment or new building.
Work solution: An attribute that briefly describes the work that has been carried out to accomplish the task.
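The two attribute layouts described above can be brought into one record type. The following is a minimal sketch, assuming each asset row has already been read into a dictionary keyed by the attribute names listed in Sections 3.2.1 and 3.2.2. The class `AssetRecord`, the function `normalize`, and the common field names are hypothetical and are not taken from the paper; pairing Tier 3 with Structures follows the text's remark that the two are only broadly similar.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssetRecord:
    """Hypothetical common-format record; field names are illustrative only."""
    caf_version: str
    asset_class: str                          # Tier 1 (v1.x) or Asset (v2.x)
    detail_class: str                         # Tier 3 (v1.x) or Structures (v2.x)
    work_type: str
    primary_reference: Optional[str] = None   # only present in v2.x files
    tier2: Optional[str] = None               # only present in v1.x files
    work_type_code: Optional[str] = None      # only present in v1.x files
    work_solution: Optional[str] = None       # only present in v2.x files


LEGACY_VERSIONS = {"1.5", "1.7"}
CURRENT_VERSIONS = {"2.0", "2.1", "2.2"}


def normalize(row: dict, version: str) -> AssetRecord:
    """Map one asset row to the common record, depending on the CAF version."""
    if version in LEGACY_VERSIONS:
        return AssetRecord(
            caf_version=version,
            asset_class=row["Tier 1"],
            detail_class=row["Tier 3"],
            work_type=row["Work Type"],
            tier2=row.get("Tier 2"),
            work_type_code=row.get("Work Type code"),
        )
    if version in CURRENT_VERSIONS:
        return AssetRecord(
            caf_version=version,
            asset_class=row["Asset"],
            detail_class=row["Structures"],
            work_type=row["Work type"],
            primary_reference=row.get("Primary reference"),
            work_solution=row.get("Work solution"),
        )
    raise ValueError(f"Unsupported CAF version: {version}")
```

Attributes that exist in only one family of versions are kept as optional fields, so no information is dropped when records from different versions are compared.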

3.3. The Output Structure

Once the suggested method has been applied, the input information is gathered and processed into four blocks of information for each project: project details, cost details, GRIP (Governance for Railway Investment Projects) stage details, and possession strategy. Their data structures are described as follows.

3.3.1. Project Details

The first chunk of data describes different attributes of each project including information such as the geographical region, the topography, or the project strategy for designing it and for managing it. For better clarification, Figure 2 is included showing sample values for the first registered items in the dataset.

3.3.2. Cost Details

Secondly, the algorithm extracts data describing costs divided into different categories such as management, design, and other costs. For clarification, Figure 3 is included showing the values of cost details for some of the registered items as examples.

3.3.3. GRIP Stage Details

GRIP stage details contain information regarding the project at the beginning and at the end of each GRIP stage that the different projects have gone through. For confidentiality reasons, sample data about GRIP stage details are not provided. However, the list of attributes composing the data structure is shown: ‘CAF Title’, ‘1-Output Definition-Start’, ‘1-Output Definition-Finish’, ‘2-Pre-Feasibility-Start’, ‘2-Pre-Feasibility-Finish’, ‘3-Optioneering-Start’, ‘3-Optioneering-Finish’, ‘4-Single Option Development-Start’, ‘4-Single Option Development-Finish’, ‘5-Detailed Design-Start’, ‘5-Detailed Design-Finish’, ‘6-Construct, Test & Commission-Start’, ‘6-Construct, Test & Commission-Finish’, ‘7-Scheme Handover/Handback-Start’, ‘7-Scheme Handover/Handback-Finish’, ‘8-Project Close Out-Start’, ‘8-Project Close Out-Finish’.
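The attribute names listed above follow a regular pattern of stage number, stage name, and milestone, so they can be generated rather than typed by hand. The sketch below illustrates this; the names `GRIP_STAGES` and `grip_columns` are hypothetical, while the stage names themselves are taken from the list above.

```python
# The eight GRIP stage names, in order, as they appear in the attribute list.
GRIP_STAGES = [
    "Output Definition",
    "Pre-Feasibility",
    "Optioneering",
    "Single Option Development",
    "Detailed Design",
    "Construct, Test & Commission",
    "Scheme Handover/Handback",
    "Project Close Out",
]


def grip_columns() -> list:
    """Build the column names: 'CAF Title' plus a Start and Finish per stage."""
    columns = ["CAF Title"]
    for number, stage in enumerate(GRIP_STAGES, start=1):
        for milestone in ("Start", "Finish"):
            columns.append(f"{number}-{stage}-{milestone}")
    return columns
```

Calling `grip_columns()` yields 17 names: the title plus two milestones for each of the eight stages.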

3.3.4. Possession Strategy

Finally, some possession strategy data are gathered from the CAF files, giving a summary of the number of works that have been carried out, classified by the number of hours that each work required. For better clarification, Figure 4 is provided, showing some sample data for the first projects registered in the dataset.

3.4. The Method Step by Step

As stated before, the suggested method can be divided into three main steps that run sequentially. Firstly, all the CAF files are analyzed and a data extraction process is carried out. Secondly, merging and data parsing processes are executed. Finally, an analysis is provided and some tests are performed to demonstrate the benefits of data mining in the current scenario.

3.4.1. Step 1: Data Extraction

Description: During this step, the suggested method iteratively loads each of the CAF files taken as inputs to extract from them four different types of information: project details, cost details, GRIP stage details, and possession strategy.
To perform this step, two data mining techniques are implemented. First, an association rule mining technique is designed to support automatic categorization of the raw data; it is based on the inner patterns of the input data and does not use machine learning algorithms. Additionally, an anomaly detection system is implemented, which is likewise based on manual analysis of the input data rather than on machine learning algorithms.
Input: The input of this step consists of the information distributed into 23 CAF files coming from real historical projects with different structures depending on their version, which ranges from 1.5 to 2.2.
Output: As a main result for this step, four different folders are created: one for project details, a second for cost details, and the last two for the GRIP stage details and possession strategy, respectively. Each folder contains 299 different Excel files with information extracted from the initial CAF files.
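The paper does not publish its categorization rules or anomaly checks, so the following is only a sketch of what hand-written, non-machine-learning rules of this kind might look like. It uses a plain keyword table as a stand-in for the rule-based categorization, which is simpler than association rule mining proper. The rule table `CATEGORY_RULES`, the functions `categorize` and `find_anomalies`, and the specific checks are all assumptions made for illustration.

```python
# Hypothetical hand-written rules: keyword found in a row label -> output category.
# Rules are checked in order, so the first matching keyword wins.
CATEGORY_RULES = [
    ("possession", "possession strategy"),
    ("grip", "stage details"),
    ("cost", "cost details"),
]
DEFAULT_CATEGORY = "project details"


def categorize(label: str) -> str:
    """Assign a row label to one of the four output categories."""
    text = label.lower()
    for keyword, category in CATEGORY_RULES:
        if keyword in text:
            return category
    return DEFAULT_CATEGORY


def find_anomalies(rows):
    """Return (row_index, reason) pairs for values failing simple sanity checks.

    `rows` is a list of (label, value) pairs read from one input sheet.
    """
    problems = []
    for index, (label, value) in enumerate(rows):
        if value is None or value == "":
            problems.append((index, "missing value"))
        elif categorize(label) == "cost details":
            try:
                if float(value) < 0:
                    problems.append((index, "negative cost"))
            except (TypeError, ValueError):
                problems.append((index, "non-numeric cost"))
    return problems
```

Keeping the rules in a plain table makes them easy to review and extend when a new CAF version introduces different labels.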

3.4.2. Step 2: Data Merging

Description: During the process of data merging, the data generated in the previous step are gathered and combined, considering not only that there are four types of information to be merged into one file but also that different versions of CAF files contain different attributes.
Input: The input for this step is the same as the output for the previous step consisting of four different folders, each of them with 299 different files with their information extracted from each CAF.
Output: There are two main outputs that can be distinguished for this step: On the one hand, a new folder is generated with 299 different files combining the four types of information. Alternatively, five breakdown documents are created summarizing all projects depending on the existing CAF version (1.5, 1.7, 2.0, 2.1, and 2.2).
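The merging step can be sketched as follows, assuming the four types of information for a project are already available as dictionaries. The functions `merge_project` and `breakdown_by_version`, the key prefixes, and the `caf_version` field are hypothetical names chosen for this example; they are not the paper's implementation.

```python
from collections import defaultdict


def merge_project(project_details, cost_details, stage_details, possession):
    """Combine the four per-project dictionaries into one flat record.

    Keys are prefixed with their section so that equal attribute names
    in different sections cannot overwrite each other.
    """
    merged = {}
    sections = {
        "project": project_details,
        "cost": cost_details,
        "stage": stage_details,
        "possession": possession,
    }
    for prefix, section in sections.items():
        for key, value in section.items():
            merged[f"{prefix}.{key}"] = value
    return merged


def breakdown_by_version(records):
    """Group merged records by CAF version and align their columns.

    Attributes that a record lacks are filled with None, so every row in a
    version's breakdown has the same set of columns.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["project.caf_version"]].append(record)
    aligned = {}
    for version, rows in groups.items():
        columns = sorted({key for row in rows for key in row})
        aligned[version] = [{col: row.get(col) for col in columns} for row in rows]
    return aligned
```

Filling absent attributes with None rather than dropping them keeps the per-version breakdown rectangular, which is convenient when writing it to a spreadsheet.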

3.4.3. Step 3: Data Analysis

Description: As a final step, some analysis techniques are implemented to demonstrate that converting data to a common format allows the whole picture to be seen and for the relationships between the different attributes to be found. Additionally, three different machine learning algorithms are implemented to predict future project costs: linear regression, lasso regression, and random forest. To perform an unbiased validation of the algorithms, two iterations are implemented, taking as the training set a random sample consisting of 85% of the total dataset and leaving as the test set the remaining 15%.
Regarding the hyperparameter tuning, it is worth commenting that the random forest is configured using 800 trees in the forest and with a maximum depth of 6. Alternatively, the lasso regression algorithm is configured with an alpha value of 0.5, and finally, the linear regression is left with the default configuration.
Input: The main input for this step is all the attributes extracted in the previous step coming from 23 CAF files that are combined for analysis and comparison.
Output: As the main result, some inferences are made, and some knowledge of the current data is extracted to validate the suggested method.
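The validation scheme described in this step (two random 85%/15% splits, scored on the held-out part) can be sketched with the standard library alone. This example does not reproduce the paper's three models or their hyperparameters. A single-feature least-squares line stands in for them so that the sketch stays self-contained, and all function names are hypothetical.

```python
import random


def split_85_15(rows, seed):
    """Shuffle a copy of rows and return (train, test) with about 15% held out."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    # At least two test points are kept so that R-squared is well defined.
    n_test = max(2, round(0.15 * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(points, intercept, slope):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1.0 - ss_res / ss_tot


def validate(rows, seeds=(0, 1)):
    """Fit on each training split and score on the matching test split."""
    scores = []
    for seed in seeds:
        train, test = split_85_15(rows, seed)
        intercept, slope = fit_line(train)
        scores.append(r_squared(test, intercept, slope))
    return scores
```

Scoring only on the held-out rows is what makes the comparison between models unbiased; with a dataset this small, the score can still vary noticeably between splits, which is one reason to report more than one iteration.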

4. Results

To validate the suggested method, the data coming from 23 different CAF files from real historical projects are gathered and parsed into a common data format. The results are formed into a set of 88 different attributes coming from four different categories for each project. This allows the client to perform data analysis, to have a deeper knowledge of their projects, and to implement machine learning to predict the future project costs.

For better clarification of the dataset being handled, the average project costs classified by different categories are shown. Firstly, Figure 5 shows the average project costs depending on the region, where it is possible to appreciate that those projects carried out in the midlands and in the south were meaningfully cheaper than those in the other three categories.

Secondly, Figure 6 shows a summary of the average costs classified by the route, where the most expensive with a great difference is Anglia followed by Western and Wales, respectively.

Finally, Figure 7 shows the average project costs classified by their main work type, where there is a big gap between the three most expensive categories (replace-full, refurbishment, and new build) and the other work types.
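The averages reported in Figures 5 to 7 are group means of project cost over one categorical attribute. A minimal sketch of that computation follows, assuming each project is a dictionary in the common format; the function `average_cost_by` and the key names `total_cost`, `region`, `route`, and `work_type` are hypothetical.

```python
from collections import defaultdict


def average_cost_by(records, attribute, cost_key="total_cost"):
    """Return the mean of `cost_key` for each distinct value of `attribute`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = record[attribute]
        totals[group] += record[cost_key]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}
```

The same function serves all three figures by changing the `attribute` argument, for example `"region"`, `"route"`, or `"work_type"`, which is only possible because the projects share one format.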

The process of data gathering and reclassification into a common data format is useful not only to provide a better understanding of the data but also for estimating future project costs based on the already registered ones. As a proof of this, three different machine learning algorithms were implemented: linear regression, lasso regression, and random forest.

To assess the accuracy of each algorithm, two folds were randomly generated, each using 15% of the available data as the test set and the remaining 85% as the training set. The R-squared results of each algorithm are gathered in Table 4, which shows that linear regression and lasso regression each obtained a score of only 0.83, whereas the random forest was the most accurate, with an average of 0.934 across both folds.

5. Conclusions

The presented paper describes and validates a method for extracting and parsing railway infrastructure cost data. The suggested method takes as inputs 23 CAF files from historically registered projects. Among the main challenges addressed in the paper, it is possible to highlight the high volume of data found in each CAF and the large variety of structures found in the input data across five different versions of the input files.

The results shown in the last step of data analysis demonstrate the benefits of data mining, allowing the ability to compare projects, perform cost predictions, and establish cost benchmarking. These capabilities can potentially be used not only for the current dataset but also for future projects.

It is also worth mentioning the potential for increasing the efficiency; since the presented approach is completely automatized, it can potentially replace the manual inclusion of data, bringing an enormous benefit in terms of saving costs and time.

For comparison with other studies, it is worth highlighting that, to our knowledge, there is no other automatic data mining system applied to railway infrastructure cost data. There have been, however, other approaches that used data mining, such as that of Kouris et al. (2005) [15], where a two-step approach was presented, with the main difference being that, in that case, they sought the relationship between the assets instead of gathering the information and analyzing it.

Alternatively, Miller and Meggers (2017) [5] implemented data retrieval techniques and machine learning for infrastructure costs. However, their approach focused on predicting the future costs of the infrastructure based on the prices of the already registered data. Additionally, their approach was not based on railway infrastructure but on building data.

The main conclusion we draw from this paper is that data mining, machine learning, and data science are powerful tools that, when applied to railway infrastructure cost data, can overcome the issues caused by the manual inclusion of data and the lack of standardization, making comparison and cost benchmarking possible.

Author Contributions

Validation, D.A.D.; Formal analysis, D.A.D.; Investigation, D.A.D.; Writing—original draft, D.A.D.; Writing—review & editing, D.A.D., L.M. and B.G.; Supervision, L.M. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the UK Department for Transport and received funding under the Innovate UK project ‘Transport Infrastructure Efficiency Strategy Living Lab (45382)’.

Data Availability Statement

Data is unavailable for confidentiality reasons. Access is restricted to protect sensitive information and the needs of the company that owns the restricted information. Inquiries for collaboration can be made to the authors for potential data access considerations.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Soibelman, L.; Wu, J.; Caldas, C.; Brilakis, I.; Lin, K.Y. Management and analysis of unstructured construction data types. Adv. Eng. Inform. 2008, 22, 15–27.
2. Fereshtehnejad, E.; Shafieezadeh, A. A multi-type multi-occurrence hazard lifecycle cost analysis framework for infrastructure management decision making. Eng. Struct. 2018, 167, 504–517.
3. Schonlau, M.; Gweon, H.; Wenemark, M. Automatic Classification of Open-Ended Questions: Check-All-That-Apply Questions. Soc. Sci. Comput. Rev. 2019, 39, 562–572.
4. Wang, Y.; Kung, L.A.; Byrd, T.A. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technol. Forecast. Soc. Chang. 2018, 126, 3–13.
5. Miller, C.; Meggers, F. Mining electrical meter data to predict principal building use, performance class, and operations strategy for hundreds of non-residential buildings. Energy Build. 2017, 156, 360–373.
6. Cerquitelli, T.; Meo, M.; Curado, M.; Skorin-Kapov, L.; Tsiropoulou, E. Machine learning empowered computer networks. Comput. Netw. 2023, 230, 109807.
7. Desai, V.S. Improved Decision Tree Methodology for the Attributes of Unknown or Uncertain Characteristics-Construction Project Prospective. Int. J. Appl. Manag. Technol. 2008, 6, 201.
8. Zhong, Y. Research on Construction Engineering Project Management Optimization Based on C4.5 Improved Algorithm. IOP Conf. Ser. Mater. Sci. Eng. 2019, 688, 055036.
9. Chen, D.; Hajderanj, L.; Fiske, J. Towards automated cost analysis, benchmarking and estimating in construction: A machine learning approach. In Proceedings of the Multi Conference on Computer Science and Information Systems, MCCSIS 2019, Porto, Portugal, 16–18 July 2019; pp. 85–91.
10. Ji, C.; Xu, C. New method for allocating high-speed railway infrastructure costs among train types. E3S Web Conf. 2021, 233, 01137.
11. Caíno-Lores, S.; García, A.; García-Carballeira, F.; Carretero, J. Efficient design assessment in the railway electric infrastructure domain using cloud computing. Integr. Comput.-Aided Eng. 2017, 24, 57–72.
12. Durazo-Cardenas, I.; Starr, A.; Turner, C.J.; Tiwari, A.; Kirkwood, L.; Bevilacqua, M.; Tsourdos, A.; Shehab, E.; Baguley, P.; Xu, Y.; et al. An autonomous system for maintenance scheduling data-rich complex infrastructure: Fusing the railways’ condition, planning and cost. Transp. Res. Part C Emerg. Technol. 2018, 89, 234–253.
13. Andersson, M. Swedish Data for Railway Infrastructure Maintenance and Renewal Cost Modelling, 9th ed.; Allan, J.J., Brebbia, C.A., Hill, R.J., Sciutto, G., Sone, S., Eds.; WIT Transactions on The Built Environment; WIT Press: Billerica, MA, USA, 2004; Volume 74, p. 1015.
14. Rama, D.; Andrews, J.D. Railway infrastructure asset management: The whole-system life cost analysis. IET Intell. Transp. Syst. 2016, 10, 58–64.
15. Kouris, I.N.; Makris, C.H.; Tsakalidis, A.K. Using Information Retrieval techniques for supporting data mining. Data Knowl. Eng. 2005, 52, 353–383.
16. Fan, M.; Fan, H.; Chen, N.; Chen, Z.; Du, W. Active on-demand service method based on event-driven architecture for geospatial data retrieval. Comput. Geosci. 2013, 56, 1–11.
17. Deb, S.; Zhang, Y. An overview of content-based image retrieval techniques. In Proceedings of the International Conference on Advanced Information Networking and Application (AINA), Fukuoka, Japan, 29–31 March 2004; Volume 1, pp. 59–64.

Figure 1. Sketch of the different iterative processes that composed the suggested method.

Figure 2. Values showing the first rows containing project details data.

Figure 3. Values showing the first rows containing cost details data.

Figure 4. Values showing the first rows containing possession strategy data.

Figure 5. Summary of the average project cost classified by region.

Figure 6. Summary of the average cost classified by route.

Figure 7. Summary of the average cost classified by their primary work type.

Table 1. Summary of the aims and approaches for the related studies using machine learning and data science on infrastructure costs.

Reference	Main Aim	Approach
[7]	To enhance the data classification in construction projects	The creation of a method implementing machine learning and the knowledge for variable correlation
[8]	To optimize the management in construction engineering projects	The creation of a method that performs a risk assessment, an evaluation using rough set theory and the implementation of machine learning for optimization
[1]	To identify and analyze a large variety of data structures in construction projects	A study that encompasses the search and extraction of different data structures used in a wide-ranging project.
[9]	To analyze and estimate costs in construction projects	The development of a method that combines surveyors’ knowledge with machine learning to effectively assess and predict costs

Table 2. Summary of the aims and approaches of related studies from railway infrastructure projects.

Reference	Main Aim	Approach
[10]	To perform a deeper analysis of high-speed railway infrastructure costs	To develop a framework considering the type of train to perform a better cost estimation
[11]	To perform a massive number of simulations to produce an efficient design in railway electric infrastructures	A simulation model to perform a massive number of simulations efficiently in a cloud environment
[12]	Automatic and efficient job scheduling for maintenance on railway infrastructure	The fusion of technical and business drivers scheduling and optimizing the intervention plans that impact on costs.
[13]	To perform an analysis of infrastructure, costs, and traffic on Swedish railway infrastructure	The study incorporates data gathering and data recovering techniques to conclude with some data analysis
[14]	Railway infrastructure asset management	A proposed framework to assess the lifecycle cost analysis

Table 3. Summary of the aims and approaches of the strongly related studies.

Reference	Main Aim	Approach
[15]	The usage of information retrieval techniques to support data mining	To develop a two-step algorithm acting as a search engine for making recommendations to customers using data mining.
[16]	A service for geospatial data retrieval on-demand	The development of a prototype based on sensor web technologies
[17]	To review the extraction of information using content-based image retrieval techniques	A systematic review analyzing a group of selected papers with content-based image retrieval systems.
[5]	To predict the building use, performance, and operations strategies of non-residential buildings	To use data mining and machine learning for analyzing and predicting data.

Table 4. R square results of the three main algorithms.

	First Fold	Second Fold	Average
Linear regression	0.845	0.832	0.839
Lasso regression	0.844	0.833	0.838
Random forest	0.939	0.928	0.934


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
