A Century of French Railways: The Value of Remote Sensing and VGI in the Fusion of Historical Data

Providing long-term data about the evolution of railway networks in Europe may help us understand how European Union (EU) member states behave in the long term, and how they can comply with present EU recommendations. This paper proposes a methodology for collecting data about railway stations at the maximal extent of the French railway network, a century ago. The expected outcome is a geocoded dataset of French railway stations (gares), which: (a) links gares to each other, and (b) links gares with French communes, the basic administrative level for statistical information. Present stations are well documented in public data, but thousands of past stations are sparsely recorded, not geocoded, and often ignored, except in volunteered geographic information (VGI), either collaboratively through Wikipedia or individually. VGI is very valuable in keeping track of that heritage, and remote sensing, including aerial photography, is often the last chance to obtain precise locations. The approach is a series of steps: (1) meta-analysis of the public datasets; (2) three-step fusion (measure, decision, combination) between public datasets; (3) computer-assisted geocoding for gares where fusion fails; (4) integration of additional gares gathered from VGI; (5) automated quality control, indicating where quality is questionable. These five families of methods form a comprehensive computer-assisted reconstruction process (CARP), which constitutes the core of this paper. The outcome is a reliable dataset (in geojson format, under open license) encompassing (by January 2021) more than 10,700 items linked to about 7500 of the 35,500 communes of France: that is 60% more than recorded before.
This work demonstrates: (a) it is possible to reconstruct transport data from the past at a national scale; (b) the value of remote sensing and of VGI is considerable in completing public sources from an historical perspective; (c) data quality can be monitored all along the process; and (d) the geocoded outcome is ready for a large variety of further studies with statistical data (demography, density, space coverage, CO2 simulation, environmental policies, etc.).


Introduction
Transport infrastructure for people and goods is as old as urban civilizations, e.g., the Via Appia, Aemilia, Aurelia, etc. have helped in structuring the European landscape for centuries. From 1920 to 2020, the evolution of road and rail networks is a well-known element of their competition [1][2][3]. Today, climate challenges reignite that century-old debate: for instance, the European Union (EU) transportation white paper [4] notes that in 2010 rail yielded 11 million tons of CO2, versus 191 million tons for road, but regrets "… that only limited data are available …" for measuring the impact of measures by member states. Linking railway data with socio-eco-demographic data in historical geographic information systems (GIS) would contribute to understanding long-term trends. This approach has been developed by Siebert [5], and, in Europe, by Gregory et al. [6], and Morillas-Torné [7] with a very similar goal, and uses several European public datasets, which record main rail lines (no secondary rail lines). This paper describes a method for building a digital representation of all the 'stations' (we use the French term 'gare') that have existed between the maximum extent of the French railway network (1920s) and now, a century later. The goal is to deliver a dataset to answer questions such as: Does a commune possess a gare? How far is the closest gare? The aim is also to allow comparison at different dates. For that we need: (a) to link gares to each other along their 'rail line' (we use the French term 'ligne'), and (b) to link every gare to a 'city or town' (we use the French term 'commune'), the smallest French administrative level at which most basic statistical data are collected.
There is no digital dataset providing all French gares at the maximum extent of the network, and even their number is unknown. There are two kinds of source: public records (open data from French public institutions: SNCF, Insee, IGN over 2014-2020), and a lot of volunteered geographic information (VGI), either collective through Wikipedia, or individual, providing sparse data about old gares (even simple stops) and lignes.
Sections 2 and 3 present materials and methods, respectively for public and VGI sources, with the goal of geocoding all the gares in a reliable, controllable way; a comparable multi-source matching of "points-of-interest" is presented in Li et al. [8].
Multi-source datasets are expected to show discrepancies between names (toponym changes), as investigated by Tian-Lan and Longley [9], and locations (existence and geocoding) of each individual gare. In order to resolve discrepancies, target constraints are formulated, and a schema is proposed. Each source is scrutinized by a "meta-analysis" [10,11], in order to identify what may conflict with the constraints. Then public datasets are fed into a fusion process, following Bloch [12], which more precisely is an asymmetric revision, as in Benferhat et al. [13], in a background similar to Reichgelt [14].
Issues raised by VGI have been addressed, e.g., by Johnson and Iizuka [15], successfully combining OpenStreetMap features and Landsat image classification; by Younghoon et al. [16], integrating several crowdsourcing sets in a single graph; or by developing a collaborative platform such as FeatureHub [17]; or multiple platforms, as in Juhász et al. [18]. The specific goal assigned to VGI in this paper is to gather data for revising some already geocoded gares (whose quality is questioned), and mostly for integrating more gares, not recorded in public datasets. Chen et al. [19] have developed CrowdFusion, a framework aiming at data refinement (a hard non-polynomial task), while Gouvëa et al. [20], and Hastings [21], have focused on toponym conflation and gazetteer-based geo-referencing. Section 3 describes routines querying Wikipedia for coordinates of gares, and mostly for parsing railway-oriented "Infobox" panels [22]. There are some inspiring papers about how to populate them [23], or to extract information based on common-sense rules [24]. An Infobox is a very valuable means to obtain at once the information about a ligne, the other lignes it connects with, and the gares it encompasses. The parsing of the information that complies with target constraints is one important contribution of the present work, because, beyond French peculiarities, railway semantics is shared by most countries. Next, Section 3 describes several ways of querying various VGI sources, mostly about gares on secondary lignes (e.g., the popular rural tramways in France, between WWI and WWII). The last opportunity to resolve failed geocoding issues is to exploit remote-sensing images, old aerial photographs, and contemporary or 20th-century maps.
Issues of enacting data quality within a data collection process have been addressed by Vasseur et al. [25], who evaluate the quality in the context of a target use, and by Daniel et al. and Senaratne et al. [26,27], two reviews about quality control in crowdsourcing. The mix of manual and software steps presented above combines into what we name the computer-assisted reconstruction procedure (CARP). Finally, the resulting CARP dataset is cleaned up by an assisted data quality control.

Target Goal, Constraints and Initial Schema
The target should allow us to answer questions such as:
- For each of the French communes, how far from its center is the closest gare?
- Between the closest gares of communes A and B, what is the shortest path (in kilometers)?
- How do today's answers compare with those of a century ago?
For that purpose, we need the minimal information to link gares with communes, and to link gares with lignes. That minimal information can be aligned with some components of the standards for transportation networks developed by the European infrastructure INSPIRE [28] or the Open Geospatial Consortium (OGC) [29]. Four classes seem the minimum required: gare, commune, ligne, and a node class that allows one gare to be attached to several lignes (a "junction" gare).
The UML class diagram (Figure 1) illustrates the relationships between these classes, their cardinalities, and some additional constraints. The "belongs_to" relationships of Figure 1 are very important:


- The gare-commune relationship is straightforward, once all gares are geocoded;
- The gare-ligne relationship is the challenge: how to find old lignes and all their gares.
An additional spatial constraint (contains) controls that a gare 'belongs_to' a commune, and (inherits_one) states that at least one node gives its coordinates to the gare (what about the other ones, if any? This is an issue studied in the sequel).
Some target concepts match INSPIRE ones (Figure 2). Additional attributes (useful for display purposes and quality control) are: label = name given to a gare (should be in bijection with uic) or a node; nom = toponym for a commune (when used for a gare, may differ from the label); capa = use status, e.g., 'cargo only', 'border point', etc., used for display purposes. Summing up the constraints that must be satisfied:
(1) Every gare belongs to one existing ligne: num is mandatory (see Section 3);
(2) A couple (num, km) must be unique;
(3) A couple (num, label) must be unique (see Section 2.3.1 and Appendix B);
(4) The subset of all nodes of a same ligne (= num) must be strictly ordered by km;
(5) Every gare belongs to one commune, whose polygon contains the gare point. Sometimes, purposely, a gare has been built at the border of two communes (only one is recorded);
(6) Every ligne has two ends: an origin node and a terminus node. A ligne either is "active", or has a "validTo" (alias "end") date;
(7) Every gare has one or more node(s), but a node may have no gare (e.g., a fork).
Constraints (1-4) are checked a priori (go/no-go), and also a posteriori for quality control. Constraint (5) can be checked a posteriori, setting or controlling the insee value. Constraint (6): a failure must trigger a search for additional information (Section 3). Constraint (7): several nodes attached to the same gare may differ in geometry, which is a major cause of indecision in the fusion process.
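Constraint (5) lends itself to an automated a posteriori check. The following is a minimal sketch in JavaScript (the language of the routines in Appendix A), assuming simple polygon rings and [lon, lat] geometries; the names pointInPolygon and controlInsee are illustrative, not the paper's actual routines.

```javascript
// Sketch: checking target constraint (5) with a ray-casting
// point-in-polygon test. Coordinates are [lon, lat] pairs, as in the
// geojson output; the commune polygon is assumed to be a simple ring.
function pointInPolygon(point, ring) {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses = (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Set (or control) the insee value of a gare a posteriori:
function controlInsee(gare, communes) {
  const c = communes.find(c => pointInPolygon(gare.geometry, c.ring));
  return c ? c.insee : null; // null: constraint (5) fails, flag for review
}
```

A null result does not always mean a geocoding error: as noted above, some gares were purposely built at the border of two communes.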

Target Notation and Generic Step-by-Step Approach for Collecting Values
The overall objective is to gather as many gares as possible that have ever existed. Therefore, the process starts with an initial dataset, then grows it, in a logically consistent way, by adding complementary or supplementary information from further available data.
To add a gare means to create a node and to fill in attribute data step by step (Table 1), starting with the knowledge that a gare named label belongs to the ligne identified by num. In general, information about a ligne contains a list of the gares between origin and terminus, which allows setting km values in correct order, although approximate in kilometers. Table 1. Step-by-step generic approach for adding a gare into the reconstructed network.
(a) LABEL: the gare with that name belongs to the ligne identified by num, which has at least an origin and a terminus; if working from a given list of gares: start at the origin.
(b) KM: if unknown, can be interpolated between adjacent gares, or between origin and terminus, using the length of the ligne. The hard assumption is the strict order of the list.
(c) Point[LONGITUDE, LATITUDE]: coordinates to be obtained by software or visual inspection. The geocoding, and its quality, is extensively addressed in the sequel.
(d) If working from a list: iterate with the next gare in the list, until reaching the terminus.
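Step (b) can be sketched as follows; a minimal JavaScript illustration (interpolateKm is a hypothetical name, not a routine of Appendix A), assuming a strictly ordered list in which the origin and at least one later gare have known km values.

```javascript
// Sketch of step (b): linear interpolation of missing km values along a
// ligne, assuming the list of gares is strictly ordered from origin to
// terminus (constraint (4)). Shapes and names are illustrative.
function interpolateKm(gares) {
  const out = gares.map(g => ({ ...g }));
  let prev = 0; // index of the last gare with a known km (origin)
  for (let i = 1; i < out.length; i++) {
    if (out[i].km == null) continue;      // still unknown, keep scanning
    const span = out[i].km - out[prev].km;
    const steps = i - prev;
    for (let j = prev + 1; j < i; j++) {
      // evenly spread the unknown gares between the two known km values
      out[j].km = +(out[prev].km + (span * (j - prev)) / steps).toFixed(1);
    }
    prev = i;
  }
  return out;
}
```

The result is only approximate, but it preserves the strict km ordering required by constraint (4).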
Attribute uic is either given (and the node is linked to this gare) or derived from (num, km).
Attribute insee is either given or derived from the geometries of gare and commune. Attributes capa and info are merely informative (voidable). The baseline goal is to build a dataset compliant with the above target constraints. Constraints are flexible enough for semi-structured VGI (Section 3), while allowing consistency checking at every stage, for every source, and for the result as well.
Connectivity is not explicit in that model, but can be retrieved from the existence of several couples (num, uic) for a same uic, which denotes a "junction-gare", or by the explicit identification of a fork between two lignes (see Discussion and Appendix B).

Public Open Data Sources, and the Information Revision Problem
In a project, it is customary to define a baseline, which encompasses: the goal, constraints (listed above), resources (listed below), and estimated cost (time to reach the target goal).
The baseline cost is the cost per gare times the number of expected gares. The cost for one gare depends on the steps outlined in Table 1. The number of expected gares has not been found in the literature: for instance, Auphan [31] estimates the maximum length of the network at 70,000 km in 1920, but does not mention the number of gares or communes impacted at that time. An estimate can use the length and density of the network: e.g., 24,000 km today, serving 3029 "gares du réseau français" [32], gives a gare every 7.92 km. That inter-gare distance was lower in 1920, with stops in every crossed commune: a sampling on already reconstructed data gives about 6 km. Therefore, the number can be up to 12,000 gares.
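The estimate can be reproduced directly from the figures quoted above:

```javascript
// Reproducing the baseline estimate (all numbers are from the text).
const todayKm = 24000, todayGares = 3029;
const interGareToday = todayKm / todayGares;      // ≈ 7.92 km between gares
const maxKm1920 = 70000;                          // Auphan's network length
const interGare1920 = 6;                          // sampled on reconstructed data
const expectedGares = maxKm1920 / interGare1920;  // ≈ 11,667, i.e. up to 12,000
```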

Description of the Available Public and Administrative Sources
The public datasets that provide gare and node information (with no explicit difference between the two) are listed in Table 2, with data schema, version, and availability. The public datasets about communes and lignes, listed in Table 3, are used for joining information (insee, num) and for display purposes.
- Target.insee: recorded for communes, and must be retrieved for a gare by identifying which commune verifies Point_in_Polygon(geometry(gare), geometry(commune));
- Target.nom: matches SNCF2:commune, but would rather be retrieved via Target.insee.

Meta-Analysis of Public Sources, Building a Similarity Measure, and Corrections
The various Sncf versions provide scarce metadata [28], only about their lineage: how to check whether a dataset complies with the target constraints? The only way is to perform a "meta-analysis", a technique using "individual participant data" (IPD-MA), mainly developed in medicine [10,11].
The mathematics behind meta-analysis uses variance-covariance matrices. In this application, the "individual participants" are the nodes of gares, and the data are qualitative: hence, only equality can be checked, leading to true/false values. Therefore, the mathematics boils down to simple counts of items sharing equal attribute values. The meta-analysis is applied directly on the individual items of each dataset, and proves useful in the subsequent fusion.
Counting single-attribute occurrences performs a univariate meta-analysis; counting joint occurrences of a couple of attributes performs a multivariate meta-analysis.
This subsection details the issues revealed by the meta-analysis, which help design the steps of a fusion approach. For a quick read, skip directly to Section 2.4 to discover the fusion, and return later to understand the reasons why.

Meta-Analysis of Each Dataset, and Comparison
This is performed by the metaAnalysis routine (Appendix A).


Step 1: univariate meta-analysis with num and label values. Result: each node has a num and a label value, in both 2014 and 2017, which fulfills the first target requirement. All 6442 nodes have a geometry in 2014, whereas only 6812 nodes have a geometry in 2017.
There are 6813 different label values, but 6817 different (uic, label) couples, which means that 4 couples (=label, ≠uic) denote a toponym ambiguity, e.g., "Cernay". Ambiguity is inherent to toponymy (in most countries): this is a well-known issue [20,21]. Several mismatching examples and an explanation are proposed in Appendix B for what is named the "toponym-pattern".
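The counting behind these steps can be sketched in a few lines; the function ambiguousLabels is illustrative, not the metaAnalysis routine of Appendix A.

```javascript
// Sketch of the counting behind the meta-analysis: comparing the number
// of distinct label values (univariate) with the number of distinct
// (uic, label) couples (multivariate) reveals toponym ambiguities,
// i.e., one label shared by several uic codes.
function ambiguousLabels(nodes) {
  const byLabel = new Map();
  for (const n of nodes) {
    if (!byLabel.has(n.label)) byLabel.set(n.label, new Set());
    byLabel.get(n.label).add(n.uic);
  }
  // keep only the labels attached to more than one uic
  return [...byLabel].filter(([, uics]) => uics.size > 1).map(([l]) => l);
}
```

Applied to the 2017 dataset (6813 labels versus 6817 couples), such a count would surface the 4 ambiguous labels, "Cernay" among them.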


Step 4: inspect the geometry of the Step 3 couples: if a same label or uic occurs in several nodes, it denotes a junction-gare (= uic, ≠ num). The associated geometries are expected to be equal: the computed deviation is broken down into five distance intervals (Table 4). In 2014, among 642 nodes, 93% are "close enough", and 1% "rather far". For instance, above 2 km, the "Bauvin-Provin" case is named a "fork-pattern", for the closure of a ligne has "moved" the node to a remote location at the last fork on that ligne. An analysis of the "fork-pattern" is proposed in Appendix B.
In 2017, among 1360 nodes, 53% are "close enough", but 635 are not (47%). The cause of so many discrepancies is that more "fork-patterns" are taken into account in 2017.
It is impossible to choose the right geometry without extra information; moreover, the wrong location is not an error, but a meaningful node: the solution is to mark it (Section 2.4).


Step 5: inspect the cardinality of the Step 3 couples, for every num value: this gives the number of nodes per ligne (Table 5). A ligne with one gare denotes an "isolated" gare, which conflicts with the constraints. Most of these errors correspond to lignes now closed to traffic (fork-pattern again), which requires extra information, not available at that stage.

Conclusion of the Meta-Analysis
The meta-analysis has been applied to the two main public datasets providing gare information: the result is a better understanding of the data to be handled, which helps in designing a fusion approach to merging data from 2014 and 2017. In particular, it helps in understanding that the "toponym-pattern" and the "fork-pattern" constitute hurdles in the fusion process, and will require additional information from external VGI sources.
The meta-analysis can be used as a quality control tool at any time at any stage of the reconstruction process, and will be used to control the result (Section 4).
Concerning Sncf2014 and Sncf2017, some metadata can be summarized (Table 6). Table 6. Initial metadata after meta-analysis of the individual datasets.
The outcome of the meta-analysis of the initial datasets is that 890 nodes are not geocoded, and 635 others have a poor geometry (Table 4); both would rather be avoided in the fusion process.

Combining the Available Sources: The Fusion/Revision Problem
Versions Sncf2014 and Sncf2017 probably have a same origin, then evolved separately. Sncf2017 has more attributes, notably km and uic, and a few more geocoded features (6812 versus 6442), although we do not know which version provides the best quality.
The expectation is that Sncf2014 and Sncf2017, either confirm each other on some nodes, or add up unique nodes ( Figure 3).
The relational approach, although natural with tabular-like datasets, has been refuted by the meta-analysis, which revealed that 1525 nodes have poor or no geometry, and that label is not a reliable attribute. Let us consider instead the fusion approach, following Bloch [12]:

DEFINITION. Fusion of information consists of combining information originating from several sources in order to improve decision making.
More precisely, let us consider the context of Revision, or reverse-Updating [13], for we aim at reconstructing the past. Revising Sncf2017 by Sncf2014 means computing the similarity μ(g, f) in order to undertake "pairing" a g (2014) with an f (2017), then merging its attributes with those of f, or "impairing" g, if no f fits, then adding g into Sncf2017, after some conversion. All "impaired" f of Sncf2017 remain unchanged.
Following Bloch, three steps are required: measure, decision, combination.

Knowledge Representation and Measure of Similarity
Consider a declarative approach based on the notion of neighborhood V(x) of a node x, and a measure of membership to that neighborhood: y ∈ V(x), the degree of similarity of another node y to x.
The meta-analysis tells us that, once duplicates and ambiguities are solved, a dataset complies with the target constraints, and has an attribute space in which to define "neighborhoods" and to build a "similarity" measure (Table 7), which is a function μ: (x, y) → [0, 1], where 0 can be interpreted as false and 1 as true.
The initial step is to seek, for each 2017 node y, the closest 2014 node x, to form a pair of "closest nodes", using the GreatCircle algorithm, whose precision is sufficient in this application, in the range [0, 100 km]. This routine is part of Appendix A. Table 8 gives the distance breakdown for "direct pairs" (pairs of closest nodes with the same label) and "reverse pairs" (with the closest 2017 counterpart of each 2014 node). The histogram of the distances, grouped by 50 m intervals (Figure 4), provides a clue about choosing relevant thresholds: the choice is 150-300 m for D1-D2, and 1-2 km for D3-D4.
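The great-circle distance and the closest-counterpart search can be sketched as follows (the actual routine is part of Appendix A; greatCircleKm and closestPair are illustrative names, and the haversine formulation is one common way to compute a great-circle distance).

```javascript
// Haversine great-circle distance, precise enough in the [0, 100 km]
// range used here. Geometry is [lon, lat], as in the geojson output.
function greatCircleKm([lon1, lat1], [lon2, lat2]) {
  const r = Math.PI / 180, R = 6371; // degrees->radians, Earth radius (km)
  const dLat = (lat2 - lat1) * r, dLon = (lon2 - lon1) * r;
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(lat1 * r) * Math.cos(lat2 * r) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// For each 2017 node y, find the closest 2014 node x. A brute-force
// O(n*m) scan is acceptable at this scale (a few thousand nodes);
// a spatial index would speed it up.
function closestPair(y, nodes2014) {
  let best = null, bestKm = Infinity;
  for (const x of nodes2014) {
    const d = greatCircleKm(y.geometry, x.geometry);
    if (d < bestKm) { bestKm = d; best = x; }
  }
  return { closest: best, distKm: bestKm };
}
```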

Decision to Include a Node in the Similarity Neighborhood
To include x in class Ci according to the absolute majority (or threshold) rule: x ∈ Ci if its membership degree exceeds the chosen threshold. The inferential aspect of the declarative approach [14] ensures that: (a) rules are independent of the actual knowledge in the knowledge base; (b) rules are not restricted to the conceptual schema representation.
The choice to "pair" x and y is made by comparing y ∈ V(x) to some threshold, leading to keep x and y whose distance is between D1 and D2: x and y are "positive" candidates to fusion. The uncertainty is that some of them can be "false positives". Similar nodes whose dist ≥ D4 are tagged "supposed negative" (SN). The choice to "impair" an x is to consider, using D3 and D4, that x cannot be compared to any y, and is a candidate to be "integrated" into the final result. However, we may get "false negatives". Dissimilar x whose dist ≤ D1 are tagged "supposed positive" (SP).
About "undecided" (und) nodes, the choice is between: (1) reducing the D2-D3 gap, even to D3 = D2, which reduces the number of und, but increases the number of both false negatives and false positives; or (2) obtaining only a few FNs and FPs, and many undecided.
The overall tagging decision is sketched in Figure 5, with different possible tags.
Corresponding results for direct and reverse revision are provided in Table 9 (a,b), where the signs + and ¬ denote respective use of the similarity or dissimilarity measure. Undecided cases (und) are grey cells. Counts for the direct revision do not include the nodes with no geometry. The role of the fusionTags routine (Appendix A) is to set the tag value (AP, AN, PP, PN, SP, SN or und) of every node, and to store it together with the closest counterpart node and the distance of that pair, resulting in:
2017-node: f + {"tag": tag(f), "dist": dist(f,g′), "closest": link_to-2014(g′)}
2014-node: g + {"tag": tag(g), "dist": dist(g,f′), "closest": link_to-2017(f′)}
where dist(a,b) measures the distance between a node from A and its closest B counterpart.
Then a fusion sort is made into three classes: fusionable, integrable, or undecided:
- fusionable: the node and its counterpart (closest is ok) are supposed to inform the same real gare, and we can proceed with the fusion of their data;
- integrable: the node has no counterpart (closest is irrelevant) and can be kept as is (if in the right set), or adapted for integration (if from the other set).
Using f for 2017 and g for 2014, the decision follows the rules of reverse revision (for integration of "forgotten nodes"). Among the tags, only AP, PP, AN, and PN are reliable enough to be used in the decision. Indeed, we must question the SP-pairs (dist < D1, ≠label), for most of the uncertainty is due to the toponym-pattern (Appendix B.1). We must also question the SN-pairs (dist > D4, =label): see the fork-pattern (Appendix B.2).
These tags and distances are helpful when confronting extra open data, for instance with a specific lexicon, e.g., the Insee-new list of toponym changes, or the SNCF lists of gares at different dates (see Section 3). Also helpful for further processing is the quality of the fusion decision, which can be a fuzzy measure over [D1, D2] or [D3, D4], respectively.
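The tagging and the fuzzy quality can be sketched as follows. This is a hedged approximation, not the fusionTags routine of Appendix A: the exact decision rules are only partially stated in the text, so the mapping below merely illustrates how dist and label equality could combine, with D1-D4 set to the thresholds chosen above (150 m, 300 m, 1 km, 2 km).

```javascript
// Hedged sketch of the tagging rules (the authoritative version is the
// fusionTags routine of Appendix A). Thresholds follow the histogram-based
// choice: D1-D2 = 150-300 m, D3-D4 = 1-2 km.
const D1 = 0.15, D2 = 0.3, D3 = 1, D4 = 2; // km

function tag(distKm, sameLabel) {
  if (sameLabel) {
    if (distKm <= D1) return "AP"; // absolute positive: pair them
    if (distKm <= D2) return "PP"; // probable positive
    if (distKm >= D4) return "SN"; // supposed negative (fork-pattern?)
  } else {
    if (distKm <= D1) return "SP"; // supposed positive (toponym-pattern?)
    if (distKm >= D4) return "AN"; // absolute negative: impair
    if (distKm >= D3) return "PN"; // probable negative
  }
  return "und"; // undecided: falls in the gap between thresholds
}

// Fuzzy quality of a positive decision: 1 below D1, 0 above D2, linear
// ramp in between (the same shape applies over [D3, D4] for negatives).
function fuzzyQuality(distKm) {
  if (distKm <= D1) return 1;
  if (distKm >= D2) return 0;
  return (D2 - distKm) / (D2 - D1);
}
```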

Materials and Methods Part 2: Volunteered Geographic Information (VGI) Data Integration and Control
So far, the revision process has yielded an important and consistent dataset, complying with the target constraints (unique code, revised label, correct km, …), plus some quality information. However, the target horizon amounts to twice as many nodes. Doubling up the total, starting by investigating the 1532 nodes left behind, then discovering thousands more "forgotten" gares, and further improving the overall quality, are the challenges that we ask the VGI to help us overcome.
This section investigates Wikipedia, focusing on its semantic aspect, which takes advantage of the UIC ontology (Section 3.1); it then investigates various VGI sources, developing some helpers to integrate sparse data from less structured and less semantic web data, and exploits the aerial picture archives made publicly available by the French IGN.

Gathering Information from Wikipedia Railway Specific Pages
There are many pages devoted to railways. Luckily, the standardization efforts of the UIC (Union Internationale des Chemins de fer, the Worldwide Railway Organization, since 1922) propose an international ontology (thousands of terms in the UIC Railway Dictionary), allowing the design of semantic Wikipedia "templates" for gares, lignes and communes.

Extracting Coordinates from Wikipedia
Let us start with a simple use: the extraction of the coordinates of a gare named label. The Wikipedia API (application program interface) gives access to the coordinates at the highest level. Querying it with "Gare de label" returns the coordinates (if any), or "missing".
The wikiCoordinates routine (Appendix A) is asynchronous code that must be used in a "Promise" (a technology for awaiting an answer, provided by most computing languages).
In case of a "missing" result, we can query the API with label as a commune toponym, for quite often a gare is labeled by its commune name. However, the gare can be somewhat far from the center: for instance, with Martigues (Figure 6), the distance is 5.3 km (great-circle distance). Therefore, coordinates obtained by that means provide only a rough positioning, which may help when using remote-sensing imagery. The wikiCoordinates routine is used for improving the "revised" dataset, when several locations compete for one label: for instance, in the case of Saintes (Section 2.3), the distance between the duplicates was 348 m, and the answer of wikiCoordinates is:
Query: "Gare de Saintes" → geometry: [-0.617596, 45.748672], Quality: "good"
That result removes the uncertainty.
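The query behind such a routine can be sketched as follows. The MediaWiki API does expose page coordinates through prop=coordinates (the GeoData extension); wikiCoordinatesUrl is an illustrative helper, not the Appendix A routine itself, and the response handling shown in the comment is a simplification.

```javascript
// Sketch of the query behind a wikiCoordinates-style routine: build the
// MediaWiki API URL asking for the coordinates of the page "Gare de label".
function wikiCoordinatesUrl(label, lang = "fr") {
  const params = new URLSearchParams({
    action: "query",
    prop: "coordinates",        // GeoData extension
    titles: `Gare de ${label}`,
    format: "json",
    origin: "*",                // CORS, when called from a browser
  });
  return `https://${lang}.wikipedia.org/w/api.php?${params}`;
}

// Usage (asynchronous, hence the Promise):
// fetch(wikiCoordinatesUrl("Saintes"))
//   .then(r => r.json())
//   .then(data => {
//     // pages[id].coordinates holds {lat, lon}, or is absent ("missing")
//   });
```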

Extracting Semantic Information Using Wikipedia's Infobox
The Wikipedia API-v2 provides much richer information from an "ontology-driven" Infobox [22].
An Infobox is a panel, usually next to the lead section, which uses a specific ontology. The routine wikiInfobox (Appendix A) can parse both types of Infobox.

Extracting Data for all the Gares of a Same Ligne: The Routemap Semantics
Given a ligne name, e.g., Ligne_d'Audun-le-Tiche_à_Hussigny-Godbrange, we can query the Wikipedia API, and get an {{Infobox Ligne ferroviaire …}} description.
That Infobox (French template) contains simple items, such as length (longueur), gauge (écartement), the official num (numéro), closing date (fermeture), …, and a much more complex item: a table whose syntax is intended to display a "routemap" (schéma) of that ligne.
The routemap semantics (French sites) refers to Modèles/BS ("BS" stands for Bahn-Strecke, for the model was initiated by Germans). English sites refer to the general Route diagram template description [33,34].
Wikipedia parses the BS-table to draw a graphic representation of the ligne (Figure 9). There are 1000 such Wikipedia pages for French lignes, each ligne being linked to one or several other lignes. Therefore, it is worth developing a specific routine for the parsing of an Infobox and its BS-table.
However, the hard part is twofold:
- the Infobox BS-table complies with Modèles/BS, whose syntax is somewhat tricky, but was parsed successfully in most cases: only a few code failures, hard to overcome, have been met. Figure 10 illustrates a simple case of {{BS-table}}, code and result;
- the other difficulty is to deal with the cascade of asynchronous Internet requests for processing a single ligne.
Routine wikiBSTable (Appendix A) shows how the parsing of one table can trigger a cascade of asynchronous requests: initial fetch, possible second fetch if there is an indirection on the BS-table, then a fetch trying to get the geometry of each individual gare found in the BS-table. The full code is much longer, and out of the scope of that paper.
Moreover, when parsing a ligne, one or several connected lignes can be found, and for every such ligne we can trigger a new request, and so on. Piling up these requests should in theory allow us to browse the entire network, but in practice it rapidly saturates memory and computing resources, at least on a personal computer.
Developing a full-scale "big-data" procedure would greatly improve the time spent linking data, from lignes to gares, and from gares to their geocoded locations. Instead, we query Wikipedia with only a few (2 to 4) ligne requests at a time.
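The throttling can be sketched as follows; chunk and processLignes are illustrative helpers (not the Appendix A code), and fetchLigne stands for the whole wikiBSTable cascade.

```javascript
// Sketch of the throttling used to avoid saturating memory: lignes are
// queried in small batches (2 to 4 at a time) rather than piling up all
// requests at once.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function processLignes(lignes, fetchLigne, batchSize = 3) {
  const results = [];
  for (const batch of chunk(lignes, batchSize)) {
    // within a batch: parallel requests; between batches: sequential
    results.push(...await Promise.all(batch.map(fetchLigne)));
  }
  return results;
}
```

Connected lignes discovered while parsing can simply be appended to the queue of lignes still to be processed, instead of being fetched recursively.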

Gathering Information from VGI Pages
Wikipedia is a great, irreplaceable source of collaborative VGI, covering almost the full range of the lignes that have been included in the main French railway network. However, "secondary" lignes (rural lines and tramways), which were never embraced in the nationalization of 1937 into what became the SNCF [27], have not yet been treated so extensively: a dozen or two are fully described (with a BS-table). About 100 are qualitatively described in the page of their owning operator, e.g., "Voies_ferrées_des_Landes" lists 12 lignes, only two having their own specific page (without a BS-table).
However, in France as in most countries, railways arouse vocations for "volunteers", and a lot of French websites (Table 13) are devoted to keeping track of past gares, lignes, viaducts and tunnels. Even the collection of street signs ("place de la gare", "rue du petit-train") is valuable: it is the toponymic memory of the location of a past gare. The variety of information, and its unstructured representation, make it difficult to code any generic software. But some routines may help, which are proposed in this section.

HTML-Style Tables
As opposed to a Wikipedia Infobox, the HTML <table> tag is not a semantic tag, and it is not easy to determine its content, other than by direct reading (no simple artificial intelligence (AI) routine is at hand, although research is becoming active). For decision making, the most important question is: is there a relevant key in that table (e.g., num, insee, label)?
A few cases, from Routes or RDrail, were processed this way, giving information only about origin-terminus and some ligne characteristics (closing date, gauge, length):
- visual: make a decision about relevance;
- manual: copy and paste the table text into a spreadsheet; convert into CSV format;
- software: fetch the CSV file; join objects with the routine mergeLigneInfo (Appendix A).

Coordinates Lists (Polyline for a Ligne) in Google KML or GPX Format
Some contributors have digitalized lignes using the GoogleMyMaps facility, and sometimes gares are individualized in such data. The procedure is:
- manual: export the ligne data into KML format (from the Google page);
- software: convert into geojson, and if the gare is individualized, add it directly; otherwise:
- manual: copy and paste the relevant geojson of the ligne;
- visual: (see Section 3.3) locate each expected gare along that ligne.
This has been used with source Traver and a few lignes from cdfFR (Table 13).
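The KML-to-geojson conversion step can be sketched in a few lines (kmlCoordsToGeojsonLine is illustrative, not the paper's code): a KML <coordinates> element lists "lon,lat[,alt]" tuples separated by whitespace, while geojson expects [lon, lat] arrays.

```javascript
// Sketch of the KML-to-geojson conversion: parse the content of a KML
// <coordinates> element into a geojson LineString geometry.
function kmlCoordsToGeojsonLine(kmlCoordinates) {
  const coords = kmlCoordinates.trim().split(/\s+/)
    .map(t => t.split(",").slice(0, 2).map(Number)); // drop the altitude
  return { type: "LineString", coordinates: coords };
}
```

An individualized gare (a KML Point) converts the same way, yielding a single [lon, lat] pair instead of a list.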

Simple Lists of gares in Plain Text
It is not easy to automate detection, but the result proves useful in obtaining a list of gares along a same ligne:
- manual: copy and paste the text into a string constant;
- software: check if that ligne is already documented, or add a new ligne number num.
The routine stationsAlongLine (Appendix A) is used to parse the example in Figure 11. Parsing the text extracted from one page of Archef provides an ordered list of gares, from origin to terminus, and a length; it creates the ligne num, and two dates for its lifetime.
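The core of such a parse can be sketched as below. This is an illustrative sketch only: the dash separator is an assumption (source pages vary), and the actual routine also derives the length and the two lifetime dates:

```javascript
// Illustrative sketch of the core of stationsAlongLine: turn a pasted plain-text
// list into an ordered array of gare records for a new ligne num. The " - "
// separator is an assumption; splitting only on spaced dashes preserves
// hyphenated toponyms such as "Jaligny-sur-Besbre".
function stationsAlongLine(text, num) {
  const names = text.split(/\s+[-–]\s+/).map(s => s.trim()).filter(Boolean);
  return names.map((label, i) => ({
    label, num,
    role: i === 0 ? "origin" : i === names.length - 1 ? "terminus" : "stop"
  }));
}
```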

Remote-Sensing Information for the Assisted Visual Geocoding of Gares
If the direct geocoding of a gare (wikiCoordinates) has failed, the last opportunity is to visually inspect maps or aerial pictures to locate that gare, in a way similar to [35] with land cover.
For geocoding a gare, there is a two-step procedure:
(1) software: use a geocoder API (e.g., Google), querying coordinates for: the French word "gare" + label. For instance, with Cardet (Gard), we can try a few query variants (Table 14). It is impossible to know in advance which query will give the best answer (the distance data in Table 14 were added only after visual inspection);
(2) manual/visual: use any approximation, or skip the first step, and input the toponym into an online map facility providing, preferably, ancient maps and aerial photos.
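Step (1) can be sketched as a query-variant generator. The exact variants below are assumptions for illustration (Table 14 shows those actually tried for Cardet); only the later visual inspection decides which answer was best:

```javascript
// Sketch of step (1): generate the query variants submitted to the geocoder
// for one gare. The variant list is an assumption; distances to the true
// location are only known after visual inspection.
function geocodeQueries(label, dept) {
  return [
    `gare ${label}`,           // French word "gare" + label
    `gare de ${label}`,
    `${label} gare`,
    `gare ${label} ${dept}`    // disambiguate with the département
  ];
}
```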
In France, the best data source for historic (mid-20th century) aerial images and maps is IGN, the French national geographic organization, which provides the website remonterLeTemps.

Dual Use of the Website RemonterLeTemps, (a) Direct Geocoding of Toponyms
Routine checkToponymAt (Appendix A) encodes the identifier URI pointing to the relevant remonterLeTemps data for a given nom, as we did above with Chantelle (Figure 12). The next step, detecting the gare, is visual: quite often rather easy, sometimes more difficult (cf. examples in Figure 13a-c). A gare is always at the junction of several linear elements (rail + road), easier to detect when the appropriate zoom level is found.
(a) Jaligny-sur-Besbre (Allier): a small road is visible, but where is the gare? The toponym "la Gare" is still present on the recent IGN map (toponymy memory!), and now we can "see" the ancient railway bending NE (a line of trees).
(b) Cardet (Gard): the railway line is clearly visible (E-W), as are two roads converging towards the village center, crossing the railway in two places. Where is the gare? At the intersection with the biggest road (east?), or closer to the center (west?). The answer is given later in the text.
(c) Le Cateau-Cambrésis (Nord): the railway is very clearly visible, half-circling the city. But where is the stop "Le Cateau-Halte"? Comparing 'km' with the previous and next stops, by rough interpolation, the stop should be at the bridge near "Faubourg de Landrecies" in the middle, distinct from another gare, just south of the picture, where two lines meet. Therefore, the different steps of the procedure can be summarized as:
- software: generate the URI using routine checkToponymAt;
- visual: start with an intermediate zoom (e.g., z=16) and focus on the section where the line is expected to reach the commune (e.g., the SW corner);
- visual: inspect map-aerial photo combinations, focusing on road-rail intersections;
- manual: input the coordinates back into the geometry of the gare.
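The URI-generation step can be sketched as below. This is a hedged sketch of checkToponymAt (Appendix A), taking coordinates obtained for the toponym; the query parameters and layer identifiers are assumptions modeled on the site's comparer page and may change:

```javascript
// Hedged sketch of checkToponymAt (Appendix A): once the toponym has been
// given approximate coordinates, build a remonterLeTemps URI for visual
// inspection. Parameter and layer names are assumptions, not a stable API.
function checkToponymAt(lon, lat, zoom = 16) {
  return "https://remonterletemps.ign.fr/comparer/basic?" +
    `x=${lon.toFixed(6)}&y=${lat.toFixed(6)}&z=${zoom}` +
    "&layer1=GEOGRAPHICALGRIDSYSTEMS.MAPS.SCAN-EXPRESS.STANDARD" +
    "&layer2=ORTHOIMAGERY.ORTHOPHOTOS.1950-1965&mode=vSlider";
}
```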

Adding more automation is beyond the scope of this study. The care brought to this task must ensure a high quality, for it will be very difficult to detect errors later.

Dual Use of the Website RemonterLeTemps: (b) Visual Reverse-Geocoding
To show where a geometry points to in an aerial or map representation at a given date:
- software: generate the URI using routine checkGeocodeAt (Appendix A); cross-check with a geocoder to get the commune name;
- visual: start with a high zoom (e.g., 19), inspect map-aerial photo combinations, and zoom out to get neighbor toponyms (railway terminology may appear);
- manual: confirm, or invalidate, the quality of the information for that node.

Hints for Alternatives
Cartography is not an exact science, even at "the scale of a mile to the mile!" (Lewis Carroll): different maps may choose to display different toponyms. Sometimes Bingmaps or Googlemaps may help to resolve an indecision, sometimes the "carte IGN" does, as is the case for the Cardet indecision (from Figure 13b): the gare is at the east intersection, "Chemin de la Gare" (Figure 14, right), not at the west intersection, although the latter is closer to the center of the village.
The problem is to find the right zoom level, which displays the street name. Note: to the same query, "Chemin de la Gare, Cardet", the Google geocoder answers with a location 563 m away (further south), versus 50 m with the IGN geocoder. In that particular case, the visual result is exactly at the crossing with the road.

Results: The Overall Procedure, the Reconstructed Network, and Its Quality Control
There are three main results from this study:
1. the dataset "CARP" of French gares, over the 1920-2020 time span;
2. the developed routines, combined in a computer-assisted reconstruction procedure;
3. quality controls that increase confidence, or point out odd/missing data to be fixed later.

The Reconstructed Network (CARP)
The Computer-Assisted Reconstruction Procedure (CARP) data result from the fusion (Section 4.2.1) and from the VGI integration (Section 4.2.2). So far, as of January 2021, the CARP dataset contains about 10,900 nodes, accounting for about 9100 different gares (about 100 in neighbor countries), 450 fork nodes, and 68 border-crossing nodes.
Displaying the CARP Dataset
Figure 15 illustrates a part of the CARP on the Region Bretagne; active gares only are on Figure 16, and disused gares on Figure 17. All data are plotted on top of a classical tiled OpenStreetMap, and green lignes are from the LignesSNCF dataset. All gares are sorted by ligne, and by increasing km order: this allows a simple routine to convert these sequences into a line-string, plotted as dotted red lines that approximate the disused lignes: no extra information is added.
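The conversion of these km-sorted sequences into a line-string can be sketched as below. This is a minimal sketch assuming GeoJSON-style gare records; the property names are illustrative:

```javascript
// Sketch of the plotting helper: sort the gares of one ligne by increasing km
// and emit a GeoJSON LineString approximating the disused ligne (plotted as
// dotted red lines); no extra information is added beyond the gare points.
function ligneToLineString(gares, num) {
  const pts = gares
    .filter(g => g.num === num && g.geometry)   // keep geocoded gares of the ligne
    .sort((a, b) => a.km - b.km)                // origin -> terminus order
    .map(g => g.geometry.coordinates);
  return { type: "Feature", properties: { num: num, approx: true },
           geometry: { type: "LineString", coordinates: pts } };
}
```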
The Region Bretagne has been thoroughly processed: the CARP data represent almost 100% of what has ever existed. Region Provence is almost complete as well.
The CARP dataset is deposited, under open license, on the data.gouv.fr website, in geojson format. Periodic revisions will bring additional data until full completion. The meta-analysis applied to CARP has helped in understanding more discrepancies (systematic bias rather than arbitrary errors):
1. label disparity due to change between dates: the toponym-pattern (Appendix B.1);
2. geometry deviation, due to network evolution and the "fork-pattern" (Appendix B.2).
The CARP tries to restore the evolution of the data as best as possible, and to bring additional corrections wherever identified. Only the last step (Figure 21) is "manual/visual", although assisted by some direct or reverse geocoding, depending on what is available first (toponym, or closest geometry).

The Computer-Assisted Reconstruction Procedure for Past Railway Stations
The dataset built at this step is named "CARP-main".

Addition of New Lignes and Gares, from VGI
A gare exists only if there exists a ligne to which that gare belongs. Most of the 'secondary' lignes are not recorded in public datasets, and have no official num code. VGI has recorded many of them. Depending on what can be found, various routines may apply. Figures 22 and 23 sketch the steps of a possible full software chain:
(a) from a ligne: a list of nodes (routines executed on the server): f[i] = {"label":"l", "num":"n", "km":"k", "info":"…"}, "geometry":null;
(b) for each node: ask a geocoder, as in Figure 20, or use visual inspection, as in Figure 21.
The dataset built at this step is named "CARP-VGI".
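Step (a) can be sketched as below: build the node records with a null geometry, to be filled later by a geocoder answer or visual inspection. The info tag and function name are illustrative assumptions:

```javascript
// Sketch of step (a) of the VGI chain: build node records f[i] for a new
// ligne num, with geometry left null until step (b) (geocoder or visual
// inspection) fills it in. Names are assumptions for illustration.
function nodesFromVgiList(labels, num) {
  return labels.map(label => ({
    label: label, num: num, km: null, info: "VGI", geometry: null
  }));
}
```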

Enacting Quality through Constraint Checking
The automated constraint checking is performed a posteriori, checking that:
1. a gare has a label, a num, and a km, identifying it and the ligne to which it belongs;
2. all gares sharing the same num have their km in strictly increasing order, from the gare tagged origin to the gare tagged terminus, which implies at least two gares per ligne;
3. a gare with an insee commune code has its geometry-point inside the geometry-polygon of that commune; every gare located in France can receive an insee commune code;
4. a gare label is consistent with its commune nom and insee. Quite often a label is the name of an existing commune (sometimes of 2 commune names).
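Rule 2, for instance, can be sketched as a simple sequence check. This is a minimal sketch assuming the gares of one ligne are supplied in stored order, origin first:

```javascript
// Sketch of rule 2: given the stored sequence of one ligne (origin first),
// verify that km is strictly increasing and that at least two gares exist.
// Returns a list of error messages (empty when the ligne passes the check).
function checkKmOrder(gares) {
  if (gares.length < 2) return ["less than two gares on this ligne"];
  const errors = [];
  for (let i = 1; i < gares.length; i++) {
    if (!(gares[i].km > gares[i - 1].km))
      errors.push(`km not strictly increasing at ${gares[i].label}`);
  }
  return errors;
}
```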
Rules 1-4 were implemented. This helped in reporting a few dozen inconsistencies: about 60 ambiguities and a dozen duplicates were identified and corrected. Only a few required a manual correction, and only one was interpreted as an arbitrary error so far. Some cases revealed a 'fork-pattern' again, such as this example: {"label":"Lézan", "nom":"Lézan", …"coordinates": [4.11488, 44.04241]}. Checking the coordinates, rule 4 answers: nom = Vézénobres, about 7 km east of Lézan. The Wikipedia Infobox:schéma (Figure 24) indicates the fork just before the viaduct. Result: a fork-pattern again, discovered by the final computer-assisted quality control.

A Posteriori Control of the Reconstructed Network
Meta-analysis of the CARP dataset is compared (Tables 15 and 16) with previous data. The result for CARP is everywhere better than Sncf2017; in particular, the lignes are much better represented with CARP.

Discussion
This paper has focused on designing a computer-assisted reconstruction procedure with the goal of gathering as many gares as possible from any kind of available source of information. Several software solutions were considered and then abandoned for requiring too much development (concurrent cascading queries), or for having too little expected benefit (deep machine-learning gare detection, based on already learned gare locations).
Because the spectrum is rather large, we need some guidelines to assess the overall method from an organizational point of view (Section 5.1). Then, to open some perspectives, one example of possible applications is given in Section 5.2, together with perspectives in European cooperation, because that was the initial motivation for starting this work: to understand how a century of French transportation policies has shrunk the national coverage so much, a shrinkage that shifts cargo transportation to roads, and which scientific arguments matter for reversing that policy trend. This can be grounded on historical analysis, in comparison with other European countries that have, or have not, followed the same trend in rail transportation policies (the author's feeling is that important differences can be revealed).

Theoretical Issues in Building a Comprehensive Dataset from Internet Sources
The problem of building a dataset from the vast amount of information available on the Internet has been theorized, notably in the medical sector, for instance by Sreenivasaiah et al. [36], who list four categories totaling 11 issues for integrating information. The present work is confronted with these categories.

Data Representation and Standards; Interoperability; Data Visualization; Issues in the Development of Tools
These four issues are very critical in geographic information in general, because geography by itself is a reservoir of very strong constraints, which must be consistent with all other constraints (e.g., linear referencing, digraph connectivity, etc.). This has been studied extensively for decades, leading to the definition of standards by INSPIRE or OGC [28,29], or to implementations as in OSM [37]. The CARP dataset's compliance with such standards should be more extensively checked, which would help interoperability [38], in particular when considering other European past railway datasets. For instance, linear referencing (the km attribute) and network interconnections (the ≠gares-same node "tolerance") are explicit terms of the INSPIRE Transport Networks. Moreover, time constraints, as arise with a historical railway network, should also be treated more carefully: so far, only an "end" (validTo) date is provided, when a gare is no longer active. We indirectly use the standardization efforts of the UIC (Section 2.4): today the railway ontology in Wikipedia benefits from it, keeping homogeneity in representation, visualization and associated tools (e.g., Infobox/BS-tables). However, toponymy is much less standardized: changes in the names and boundaries of administrative objects are accelerating in France, without real retrospective help for data reconstruction (e.g., a simple list of commune mergers, see Section 2.3). Overall, interoperability should be improved through better compliance with INSPIRE, especially if a European collaboration is envisaged.
Data visualization: has been limited to the use of geojson, which can be displayed online by many websites, and which is easily read as open source. Examples are displayed with OpenLayers, on top of OpenStreetMap tiles.
Issues in the development of tools: the code is in JavaScript, well suited to direct interaction with the web, with web-workers (which allow some parallelism if the computer supports it), but the maximum number of actually possible Internet requests is rapidly reached on a personal computer. Any perspective of development would imply working on a stronger architecture, more adapted to Big Data. We have not yet started research in that direction, although this is a bottleneck issue.

Data Quantity; Computation Intensivity
These two issues are met all along this work: we are working with about 1000 lignes, 10,000 gares, and 35,000 communes, and a gare-commune join requires 350 M possibly complex tests. For instance, when seeking the insee codes of the gares, the process (point-in-polygon) should, and can, be parallelized for 90% of the data, because most gares of a same ligne are in the same area (5 or 6 large parts of France), except for a few very long lignes (e.g., Paris-Nice). The technology of web-workers has been used successfully on a personal computer with 4 processors. In the Wikipedia realm, it is rather easy to obtain the gares of a ligne, then to post asynchronous queries about the coordinates of these gares, which is computationally difficult beyond a dozen concurrent gares: see the above comment about better Big Data access (Section 5.1.1). In the crowd of VGI, where the most sophisticated structure is often a list of names, the rule is to work "schemaless", as proposed in [39], possibly after a text analysis looking for the words ligne and gare (in French, or another language of course), and a "text pattern" identifying a list, then deriving, for instance, a ligne-gare relationship. Important work is yet to be done in the future, in particular if aiming at handling more European countries.
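The elementary point-in-polygon test behind the gare-commune join can be sketched with the classic ray-casting algorithm. This is a minimal sketch for a single outer ring of [lon, lat] pairs; real commune geometries (multi-polygons, holes) need more care:

```javascript
// Sketch of the point_in_polygon test used in the gare-commune join:
// ray casting on the outer ring of a commune polygon ([lon, lat] pairs).
// Toggles "inside" each time a horizontal ray crosses an edge of the ring.
function pointInPolygon([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i], [xj, yj] = ring[j];
    if ((yi > y) !== (yj > y) &&
        x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) inside = !inside;
  }
  return inside;
}
```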

Data Quality; Version Control
Undoubtedly big issues. Data quality issues are addressed in almost every stage of this work, for different quality components:
- Attribute accuracy: label (cf. both toponym-pattern and fork-pattern);
- Logical consistency: e.g., label, uic and (num, km) consistency (cf. ambiguities, duplicates);
- Completeness: e.g., creating missing km by interpolation;
- Positional accuracy: cf. fork-pattern, computer-assisted visual geocoding;
- Time period and contemporaneousness: e.g., comparing capa/gare with mnemo and validTo/ligne.
Most standard quality metadata components are concerned and have been used, and a posteriori quality control is largely automated by the systematic constraint checking.
Version control should be better taken into account, in order to track:
- the evolution of the CARP dataset: a Git-type solution would improve it;
- the lineage of data used in setting attribute values. Some retrospective information could be retrieved, but only partially, because that lineage mixes software and non-reproducible manual steps.

Data Availability; Data Access; Security
These issues are not really addressed in this work. We consider only sources from public datasets (except an author's copy of a dataset that is no longer online) or Internet-accessible open data, and the outcome is archived on a public portal. Together with version control, these aspects should be given more attention.

A Dataset for Further Social Sciences Studies
The resulting CARP dataset is open, accessible on a public website, and is intended as a tool for geographers and transportation specialists to further investigate how long-term trends form and evolve. The present version (January 2021) is about 80% complete (?), but compared to the previous public datasets, it already shows an increase in all total counts: nodes: +56%, lignes (≠num): +30%, communes (≠insee): +39%.
One typical question, used as an example of the opportunity offered by reconstructing this past network, was: does a commune possess a gare? Today? A century ago? The CARP dataset can list and count such gares (Table 17). To date, 20% of communes have been verified as having been connected to the railway network, at least by a simple stop, for some time between 1920 and now. This is already significant material for demographic, socio-economic and even electoral studies. Almost 4000 additional nodes have been geocoded and quality checked. Also in CARP: every national border crossing with neighbor countries (68 such places), showing how European rail liaisons from or to France have evolved since WWII (Figure 25).
In particular, surprisingly, some French gares are still in operation only because they are served by foreign trains from Luxembourg (Audun-le-Tiche, Volmerange), or Germany (SaarBahn to Sarreguemines) or Switzerland (Line 10 of the Basel tramway).
This CARP dataset can also broadly confirm some local patterns mentioned in the literature [2,3,29,40], for instance the lignes that were closed just before WWII in Region Bretagne, or in the late 1990s.
These observations are intended to be studied further in collaboration with transport geographers, including those at a European level. Already cited papers [2,7] demonstrate that there is an interest in developing such research in Europe.

Similar Past Railway Network Reconstruction for European Countries
Though the present work has been specifically developed for France, several aspects demonstrate that similar work can be undertaken in other countries.
In Germany, routemap tables are used instead of BS-tables; e.g., https://de.wikipedia.org/wiki/Stuttgart_Hauptbahnhof uses Infobox_Bahnhof, a similar template, whose Strecken entry (= list of lignes) has a different syntax. Other templates have been developed for most European countries, and also Russia, China, etc. (22 different in total), which means that analogs to the wikiInfobox routine (Section 3.1.2), tailored for the French template, can be developed at the cost of some adaptations.

Conclusions
Making observations about the long evolution of the network is important, in particular for environmental policies.
The deviation between the EU White Paper objectives [4] and decisions made by member states cannot be understood without understanding, in turn, the combination of multiple factors, by multiple actors, over several decades. In France, rail carried 9.9% of inland freight transport in 2018, versus an EU average of 18.7% [41], while the EU report observes that road freight is about 100 times more polluting than rail, and that between 2008 and 2013 road freight increased from 70% to 72%, which was opposite to the goal of shifting 30% of freight from road to rail/maritime by 2030.
A recent EU update [42] states: "National projections compiled by the European Economic Area (EEA) suggest that transport emissions in 2030 will remain above 1990 levels, even with measures currently planned in Member States. Further action is needed particularly in road transport, the highest contributor to transport emissions, as well as aviation and shipping …", which we can read as: rail is the only transportation mode able to significantly decrease CO2 and greenhouse gas (GHG) emissions.
The first sentence of the EU report's executive summary states: "The fact that only limited data are available and that the impacts of most of the initiatives cannot yet be observed do not allow the proper assessment of the effectiveness of the measures adopted so far and their contribution to reaching the goals". Therefore, better long-term data could help the EU to better measure the adoption of its recommendations by the member states, and to better understand what hurdles exist on the route.
Acknowledgments: work initiated with the help of the sophomores of 2018 at IUT-Marne-la-Vallée, and pursued with encouraging comments from IGN (seminar on patrimonial data), SNCF-Réseau and several VGI contributors. The first version of the manuscript was submitted in August 2020 and extensively rewritten thanks to the acute, thorough and highly relevant comments of two rounds of reviewers, who deserve heartfelt thanks.
Funding: This research received no specific external funding.
Institutional Review Board Statement: Not applicable: no humans or animals involved in this study.
Informed Consent Statement: Not applicable: no humans involved in this study.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Code Description for Routines Mentioned in the Text
Routines A1-A7 are components of the meta-analysis and fusion methods applied to the public datasets, and of the meta-analysis applied to the CARP result.
We now have an identified list: accents, Saint.e abbreviations, spaced hyphens, and three kinds of basename-specifier association (department, new name, gare-type), which may explain toponym changes. We name that recording pattern the toponym-pattern.
To distinguish cases and resolve pairs as either "fusionable" or "integrable" is difficult. The solution is to undertake manual editing, or to keep both nodes as if they were really different.

Appendix B.2. Revealing "Fork-Patterns" from Suspect Fusion Examples
Let us name fork-pattern the recording pattern that consists in attributing the label of some gare to the last (active) node located where the ligne forks towards that gare. This is the case with the INSPIRE illustration in Figure 2: a RailwayNode just before a RailwayStationNode.
The case of Bauvin-Provin is also a fork-pattern: a former, partly dismantled ligne going to that gare ends (at the recording date) at a fork node, which is therefore the last fork before Bauvin-Provin (see Figure A1). The case of Saintes, mentioned with the duplicates, is also a fork-pattern. The gare location is at the top of the image, between the toponyms Bauvin and Provin; the same label in a pop-up print, near the Harnes toponym, is explained by the history of the ligne (small blue dots passing through Harnes, Courrières and Carvin).
Let us use some examples from Table A17 of the suspected negatives (SN) that were mentioned in Section 2.4.2, such as Fougères (4964 m). The question is: why the same gare name so far away?
In the example of Reims, there are 4 nodes, two being very close (Reims in the picture of Table A16), one for the North ligne, and the farthest fork (N-E) being 2.273 km away.
In the example of Fougères (Figure A3, next section), two locations are 4.964 km apart: one is the now-closed gare, the other is the last fork of the former ligne, still located on an active ligne.
All cases in the list (Table A17) result from the fork-pattern. It was right to tag them "suspected negatives", and not to fuse them with their respective gares.

Appendix B.3. The Target Schema and the Graph Representation of the Network
The issue with the initial UML class diagram is that it does not correctly take into account the difference between a gare-node and a non-gare node, such as a fork. This is not an issue, for instance, when confronting gares and communes, which requires a simple "point_in_polygon" algorithm. But a simple example (Figure A2) illustrates the problem: one gare (yellow double circle) is at the junction of two lignes coming from two simple gares (yellow simple circles), and a fork (white circle) gives the exact location of where the two lignes meet (Figure A2, case a). Case (b): the fork is ignored, and a simple "digraph" representation will work correctly with algorithms such as shortest-path (the well-known algorithm named after Dijkstra) or path-contraction, an algorithm that builds the minor graph of the original graph, ignoring intermediate [simple] gares and nodes (not terminus, not junction).
Case (c) is more complex: it preserves the information of the fork, which leads to a better visual result. From an historical viewpoint, archiving more geographical information about the past network is much better.
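The node-selection step of path-contraction (case b) can be sketched as below. This is a minimal sketch on an undirected edge list; the surviving nodes are termini and junctions, while degree-2 intermediate nodes are contracted away:

```javascript
// Sketch of the node-selection step of path-contraction (case b): from an
// undirected edge list, keep only the nodes that survive in the minor graph,
// i.e., termini (degree 1) and junctions (degree >= 3); intermediate simple
// gares and nodes (degree 2) are contracted away.
function keptNodes(edges) {
  const deg = {};
  for (const [a, b] of edges) {
    deg[a] = (deg[a] || 0) + 1;
    deg[b] = (deg[b] || 0) + 1;
  }
  return Object.keys(deg).filter(n => deg[n] !== 2);
}
```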
The fork-pattern example of Fougères (Figure A3) involves four lignes (N, S-E, S-W, N-W). In the original data, the fork North was the last node in the list, hence plotted as the terminus: this is solution (b), illustrated on the left. Solution (c) is on the right.