1. Introduction
Railways are a reliable, safe, and eco-friendly mode of transportation whose importance continues to grow as transportation needs increase worldwide. Autonomous train operation is seen as a means to increase traffic density in existing railway networks, and the development of technologies enabling autonomous train operations has become a central research topic.
One of the core technical challenges in this context is reliable and safe environment perception. Autonomous trains must establish and continuously update situational awareness, including their exact position, the route they will travel (their ego-track), and potential obstacles on that ego-track. Since the processing of sensing data from cameras and other sensors is complex and difficult to master using traditional algorithmic approaches, most on-board perception systems rely on machine learning (ML) components to establish this understanding. As the detection of ego-tracks and potential obstacles is safety-critical, the performance of the ML-enabled perception components is of utmost importance.
The performance of ML components is highly dependent on the adequacy and quality of the data used in their training. Unlike in the automotive sector, where large training datasets for automated driving have been published over the years, only a small number of datasets are publicly available for autonomous train operations, and most of these are limited in scope and size (see Section 2.1). This poses a significant problem for many actors in the field, as most of them do not have direct access to the equipment and infrastructure that enable them to create their own datasets.
In order to fill that gap, publicly available training datasets are required that are sufficient in size, with content variety matching the targeted operational design domains, and with high-quality annotations covering the required informational aspects. To be used as training datasets for highly automated driving in the railway sector, the data annotations must include information on tracks, switches, signals, and other track-side infrastructure, as well as potential obstacles. Furthermore, annotations for soft categories and image property tags are required to support experimentation and performance analysis.
Creating such dataset annotations with existing multipurpose annotation tools is time-consuming and sometimes error-prone. Advanced annotation tools specialized for railway image annotation would enable the efficient and accurate labeling of large image datasets. Currently, however, no labeling tool is available that is optimized for creating railway image annotations and supports efficient labeling in this application area.
In this paper, we introduce Labels4Rails, an image annotation tool that is specifically tailored to support the efficient and accurate labeling of large training datasets for the railway domain. We defined the features of the tool and the supported annotation types based on requirements derived from our prior research in the railway field and the experience gained in creating and applying training datasets. The tool and its documentation are now available for public use at GitHub [1].
We used this tool to create a new training dataset, L4R_NLB (Labels4Rails Nordlandsbanen), which is based on footage from a Norwegian TV format that documents the train ride between Trondheim and Bodø in four different seasons. The dataset is publicly available at Zenodo [2], the EU open research repository, and will be continuously extended.
In this paper, Section 2 analyzes existing training datasets for highly automated driving in railways, as well as tools and formats for the annotation of railway image datasets. Section 3 describes the Labels4Rails image annotation tool and its features for manual and automatic image annotation. Section 4 explains the data source used for creating the L4R_NLB dataset, the structure of the dataset and the annotation format, and provides details on the dataset statistics. Finally, Section 5 provides insight into how we plan to address known limitations of the tool and dataset and gives an outlook on future work.
2. Related Work
2.1. Datasets for Railway Highly Automated Driving
The performance of ML-enabled perception systems depends largely on the amount, representativeness, and quality of the data available for their training. Whilst in the automotive sector, many datasets with several hundred thousand entries are publicly available, only a few comparatively small training datasets have been published for rail-bound traffic (see Table 1), some of which focus on specific applications within the autonomous driving domain.
The following datasets primarily support the training of ML components for track detection applications:
RailSem19 [3]: This dataset focuses on the semantic segmentation of rail scenes, containing 8500 annotated frames captured from very diverse railway environments. It includes images from 530 video sequences covering rails from the ego-vehicle’s point of view in 38 different countries. Of all datasets surveyed, it has the highest variance in weather and lighting conditions, surroundings, camera models, and mounting positions.
The annotations in RailSem19 provide 22 classes relevant to railways, including tracks, switches, trains, and obstacles, aiming to enhance semantic understanding in complex rail environments. They cover both pixel-level segmentation and object detection. Although multiple rounds of quality assurance were performed during the labeling process, some labels are erroneous, usually due to the incorrect pairing of the rails in complex scenes with multiple switches.
OSDaR23 [11]: This dataset includes data from multiple sensors, including cameras, LiDAR, and radar, covering various environmental conditions and lighting scenarios. The annotated dataset comprises 21 sequences, divided into 45 subsequences, with a total of 1534 annotated multi-modal frames, each consisting of one frame from each sensor. As the multi-modal frames were recorded during a short time interval or at low train speed, the differences between the frames of a sequence are small, reducing the variance of the dataset.
The annotations comprise tracks but also contain nine object classes, including switches (although the direction of the switches is not specified). Annotations are provided as poly-lines or bounding boxes.
Rail-DB [10]: This dataset comprises 7432 annotated images with poly-lines marking railway tracks, intended for rail detection in varied environments. A small set of environmental tags, such as ’sun’, ’rain’, ’night’, ’curve’, ’cross’, and ’slope’, categorizes images by conditions. Additionally, during the capture of the images, the train passed several construction sites with workers on the tracks, which makes this dataset useful for obstacle detection tasks. There are no annotations for switches, but images containing switches are tagged with ’cross’. Two camera views, ’near’ and ’far’, capture diverse perspectives. However, the ’near’ images, despite their high resolution, suffer from heavy compression and visible artifacts. Some images are slightly distorted, and there are duplicated frames when the train is at a standstill, providing consistency but adding redundancy.
RailSet [7]: This dataset focuses on anomalies and track defects. It consists of 7700 images, 1100 of which contain artificially generated anomalies, such as holes in the tracks or rail discontinuities. Annotations contain masks for rails and tracks. Additionally, there are centerline representations of the track annotations, enabling the training and usage of different kinds of models. Although the dataset is said to be publicly available, no working download link is provided in the paper.
MRSI [8]: The dataset contains approximately 23,000 images in the visible spectrum and 4000 infrared images, including 1500 multi-modal frames where the infrared and RGB images show the same scene at the same time. While there are no annotations for the multi-modal frames, there are 780 frames with pixel masks comprising eleven classes, including background, 4370 images with pixel mask annotations for tracks, and 3589 separate images with bounding boxes for obstacles such as persons, cars, and various geometric shapes. The dataset comprises several lighting and weather conditions, including rain, strong and weak light, as well as night and underground scenes.
Multisensor Dataset [15]: This dataset is a recently announced multi-modal dataset with a total duration of 5292 s, containing annotated frames with a total of 7,052,055 individual annotations covering 23 object classes. The dataset is said to be publicly available on demand for industry and research.
Other datasets solely focus on selected railway perception tasks that differ from track detection, such as switch and signal recognition and obstacle detection.
FRSign [5] and GERALD [12] provide bounding boxes for railway signals on 105,352 and 12,205 images, respectively.
RailGoerl24 [14] provides a multi-modal dataset with annotations for persons on tracks.
RAILOD [9] contains 4651 images with bounding boxes for persons and cars in a railway environment. The dataset is not publicly available.
A dataset curated by the German Aerospace Center (DLR) [6] contains 2500 images with annotated switches. The dataset is not publicly available.
Railway Obstacle Detection (ROD) [13,16] is a public dataset of 300 images with bounding boxes for obstacles on or near the tracks.
2.2. Image Annotation Tools
In machine learning, image annotation provides the ground truth information required for supervised learning. Annotation types include image-level labels and tags (for object classification and/or to describe scene properties), bounding boxes (for object detection and localization), segmentation masks as polygons or pixel-wise labels (for segmentation), key-point and landmark annotation (e.g., for object pose estimation), and line annotation (e.g., for lane and track recognition).
Railway image annotation presents domain-specific challenges beyond traditional annotation tasks. Railway scenes contain complex infrastructure components, such as tracks, switches, signals, level crossings, bridges, and station platforms. To train ML components for autonomous train operation using annotated railway scene images, it is not sufficient to simply distinguish objects according to object classes and boundaries. Switches, for example, determine track topology through their states, and signals require track associations. These topological relationships must also be captured in annotations.
In addition, image tags are required to capture scene-related information on weather, lighting conditions, and specific scene content such as bridges, level crossings, third rails, and the like. These metadata support the experimentation and evaluation of the performance, reliability, and safety of trained ML models in challenging real-world scenarios.
There are many free and commercial image annotation tools available [17]. Most of them support a range of annotation types and annotation formats. Some of these tools also provide advanced features, such as automated object detection and segmentation by means of integrated AI models and temporal propagation of annotations across several video frames. Examples of such tools are given below:
CVAT (Computer Vision Annotation Tool) [18] is a self-hosted or web-based annotation tool that provides a wide variety of annotation features. It supports the annotation of images and video sequences with different annotation types, such as point, polyline, rectangle, polygon, cuboid, skeleton, ellipse, and mask, and is able to export data in a large variety of formats. The tool provides AI-assisted automated annotation features using several different pretrained AI models as detectors or interactors.
makesense.ai [19] is a web-based annotation tool with AI-supported capabilities. The tool offers basic annotation types such as point, line, rectangle, and polygon, and supports multiple export formats, including CSV, YOLO, VOC XML, VGG JSON, COCO JSON, and pixel mask, depending on the annotation type. It also integrates pretrained models (YOLO v5, COCO SSD, and Pose-Net) for automatic object detection and pose estimation.
SAM2 (Segment Anything Model 2) [20] is a general-purpose foundation model for video and image segmentation. Its memory-based architecture enables temporal consistency across frames. The model accepts points, boxes, and masks as input prompts.
For rail tracks, it performs sufficiently well at close range but requires increasingly detailed prompts for distant tracks, horizons, and complex intersections. Its key advantage for railway applications lies in the temporal propagation of annotations across video frames. This feature may significantly reduce the labeling effort for continuous track sequences.
Label-Studio [21] is a multipurpose labeling tool that supports the annotation of images, videos, text, and speech data. For image annotation, it supports points, lines, polygons, masks, and other annotation types. The tool can export annotations into various formats, including COCO, CoNLL2003, CSV, JSON, and VOC XML. The tool integrates features for automated annotation, including object detection and segmentation with pretrained AI models.
LabelMe [22] is an annotation tool for images and videos, supporting line, rectangle, circle, and polygon annotation types and various export formats. This tool also has—as part of its commercialized version—a feature for AI-assisted automated annotation.
However, these general-purpose tools fall short of providing the specific functionality that is desirable in the context of training data annotation for railway autonomous train operations. They cannot capture topological relationships between infrastructure components; thus, the implicit relationships between tracks, switches, and signals cannot be exported for further use and cannot be utilized for annotation automation purposes. These tools also do not leverage the consistent perspective information and camera extrinsics that are common in railway footage, nor the fact that consecutive frames show only slight variance in object position within the image. With their domain-agnostic approach, they do not sufficiently support the efficient creation of large, high-quality datasets for the development of ML-enabled railway perception systems.
2.3. Image Annotation Formats
The choice of suitable annotation formats is crucial for implementing robust and efficient machine learning processes. Well-structured and flexible formats not only facilitate interoperability between various tools but also help ensure the quality of training data and streamline the processing and analysis of data in ML pipelines.
The following formats are widely used for annotating machine learning training image datasets and have established themselves as standard formats in the field:
The COCO annotation format [23] is widely used for object detection and image segmentation tasks. It uses JSON files to store data such as bounding boxes, key-points, polygons, and image captions. COCO is supported by ML frameworks such as TensorFlow and PyTorch, and it is well-suited for bounding-box annotations and categorical analysis.
The YOLO annotation format [24] is another popular annotation format for object detection. It uses compact TXT files in which each object is described by its class and its bounding box in normalized (relative) image coordinates. Starting with YOLOv8 [25], YOLO comprises an additional format for polygon annotations suitable for instance segmentation.
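As an illustration, the following minimal Python sketch (with hypothetical values) converts an axis-aligned pixel bounding box into a single YOLO detection label line; all coordinates are normalized to the range [0, 1]:

```python
def to_yolo_line(class_id: int, box, img_w: int, img_h: int) -> str:
    """Convert an axis-aligned pixel box (x_min, y_min, x_max, y_max) into one
    YOLO detection label line: 'class x_center y_center width height',
    with all values normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# Hypothetical switch bounding box in a 1920x1080 frame:
print(to_yolo_line(0, (850, 600, 1010, 700), 1920, 1080))
```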
The ASAM OpenLABEL annotation format [26] was initially developed for the automotive domain but has also been endorsed for use in the railway domain.
3. Annotation Tool
After evaluating several existing annotation tools, we decided to develop our own annotation tool, Labels4Rails, to support the efficient labeling of railway image material and to easily incorporate new features as needed (Figure 1).
3.1. Manual Annotations
Labels4Rails enables efficient track annotation by placing markers along the rails. The points between markers are automatically approximated using Catmull–Rom splines [27], resulting in a faster labeling process as fewer markers are needed, especially on curved tracks. The differentiation between the ego-track, the left track, and the right track must be made by the user.
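The sketch below is a minimal, illustrative Python reconstruction of this idea (not the tool's actual implementation): sparse rail markers are densified by sampling uniform Catmull–Rom segments, so even curved rails require only a few anchor points.

```python
import numpy as np

def catmull_rom_segment(p0, p1, p2, p3, n=20):
    """Sample n points on the uniform Catmull-Rom segment between p1 and p2.
    p0..p3 are 2D marker points (e.g., pixel coordinates placed on a rail)."""
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    t = np.linspace(0.0, 1.0, n)[:, None]
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

def interpolate_rail(markers, n=20):
    """Build a dense rail polyline from sparse user markers by chaining
    Catmull-Rom segments; the end points are duplicated to anchor the curve."""
    pts = [markers[0]] + list(markers) + [markers[-1]]
    segments = [catmull_rom_segment(pts[i], pts[i + 1], pts[i + 2], pts[i + 3], n)
                for i in range(len(pts) - 3)]
    return np.vstack(segments)

# Four sparse markers on a curved rail (illustrative pixel coordinates):
rail = interpolate_rail([(960, 1080), (900, 800), (870, 600), (865, 450)])
```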
In addition to the tracks, switches are another key element of image content that must be labeled with the tool. Switches consist of two key attributes: position, which determines whether the switch is set to fork or merge, and direction, which indicates the specific path the train will take (left, right, or unknown). Switches are annotated by setting the two points that represent the corners of a bounding box enclosing the switch.
The Labels4Rails tool also supports the addition of tags to the images, comprising the categories defined in Section 4.3: Track Layout, Weather, Light, Time of Day, Environment, and Additional Attributes. This enables the filtering of images to fine-tune machine learning models according to specific attributes, such as snowy landscapes or dark settings, but also to identify weaknesses in the perception capabilities of models for certain types of images.
Figure 1. Labels4Rails user interface with annotated ego track (yellow) and right neighbor track (green).
3.2. Automated Annotations
Additionally, the Labels4Rails tool is able to detect tracks, switches, and scenery tags in an automated manner. Using this labeling approach frees the user from manual input and speeds up the annotation process. This capability is currently implemented as an experimental feature and requires further optimization.
Tracks are determined using an existing train ego-path detection network [28], which utilizes a regression-based approach to detect the positions of both the left and right rails, as well as the horizon position.
Switches are determined by the overlap of train tracks in the scene and their relationships to each other. The switch type is derived by comparing the count of tracks above and below the respective track overlap. More tracks above the track overlap than below indicate a fork-type switch; conversely, more tracks below than above indicate a merge-type switch. When no difference in the respective track counts can be determined, the switch is regarded as unknown.
The switch direction is derived from the relative positions of the switch’s tracks to each other. The tool determines which of the two tracks is the persisting one through the switch junction and regards it as the reference track for the other. Based on that relationship and the present switch type, a simple coordinate comparison yields the switch direction.
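The following Python sketch is a hedged reconstruction of these heuristics with hypothetical helper names; the tool's actual rules may differ in detail:

```python
def classify_switch(tracks_above: int, tracks_below: int) -> str:
    """Derive the switch type from the number of annotated tracks above and
    below a detected track overlap (illustrative reconstruction only)."""
    if tracks_above > tracks_below:
        return "fork"      # the path splits ahead of the train
    if tracks_above < tracks_below:
        return "merge"     # two paths join ahead of the train
    return "unknown"

def switch_direction(switch_type: str, branch_x: float, reference_x: float) -> str:
    """Derive the active direction by comparing the x-coordinate of the
    branching track with the persisting (reference) track near the overlap."""
    if switch_type == "unknown":
        return "unknown"
    return "left" if branch_x < reference_x else "right"
```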
Tags are determined using two distinct approaches, as their real-world detectability can vary significantly. For easily quantifiable attributes such as track geometry (e.g., curve/straight), classical rule-based logic is used. More abstract or context-dependent attributes, such as weather, lighting, or time of day, are inferred using OpenAI’s CLIP network [29], which excels at zero-shot scene interpretation by matching image content to natural language descriptions without task-specific training.
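As a hedged illustration of such zero-shot tagging (the model variant and prompts actually used in Labels4Rails are not specified here), a weather tag could be inferred with the Hugging Face CLIP interface as follows:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Candidate descriptions for one tag category (weather); illustrative prompts only.
WEATHER_PROMPTS = {
    "sunny":  "a photo of railway tracks in bright sunshine",
    "cloudy": "a photo of railway tracks under an overcast sky",
    "rainy":  "a photo of railway tracks in the rain with a wet ground",
    "snow":   "a photo of railway tracks covered in snow",
    "foggy":  "a photo of railway tracks in dense fog",
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def infer_weather_tag(image_path: str) -> str:
    """Zero-shot weather tagging: return the tag whose prompt best matches the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=list(WEATHER_PROMPTS.values()),
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return list(WEATHER_PROMPTS.keys())[int(probs.argmax())]
```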
3.3. Data Export
Annotation data are stored by the Labels4Rails annotation tool in a custom JSON file format. This is primarily due to the fact that the Catmull–Rom splines, which are at the heart of track annotation in Labels4Rails and a major element enabling the efficient annotation of tracks, are incompatible with other annotation formats, such as ASAM OpenLABEL. Yet, Labels4Rails supports the export of data into standard formats, thereby integrating well with established training pipelines and tools.
Bounding boxes can be exported according to the YOLO format. Segmentation masks for tracks are exported as PNG files. By selecting multiple sections of the input directory, several datasets can also be combined. Furthermore, the tool allows users to include or exclude images utilizing the image tags that describe weather conditions (e.g., cloudy, rainy), lighting types (e.g., natural, artificial), track layouts (e.g., curve, straight), time of day, environment (e.g., rural, urban, station), and others. Users can make informed filtering choices because each tag selection displays the number of matching images in the dataset. Users can apply logical conditions (AND/OR) to determine how selected tags should be interpreted, which gives them precise control over the images to be included or excluded. The data selection process becomes more detailed through additional filters that allow users to choose images based on track numbers, switch types, and track positions (ego, left, or right). Users can also select pixel-level color encoding (grayscale values from 0 to 255) for rail components, including left and right rails and track-beds for ego and neighboring tracks. Users have the option to duplicate both original images and annotations or use symbolic links to reduce storage requirements. The final image count shows the total number of images that satisfy the chosen criteria. The modular interface produces segmentation outputs that align with the established filtering strategy while maintaining focused results.
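The core of the tag-based filtering can be illustrated with a minimal sketch; the data model and function names below are hypothetical and do not reflect the tool's internal implementation:

```python
def select_images(annotations, include_tags, mode="AND"):
    """Filter annotation records by image tags.
    annotations: list of dicts, each with a 'tags' field holding that image's tags.
    mode: 'AND' keeps images carrying all selected tags, 'OR' keeps images with any of them.
    (Illustrative sketch only; not the tool's actual data model.)"""
    wanted = set(include_tags)
    if mode == "AND":
        keep = [a for a in annotations if wanted <= set(a["tags"])]
    else:
        keep = [a for a in annotations if wanted & set(a["tags"])]
    print(f"{len(keep)} of {len(annotations)} images match {mode} filter {sorted(wanted)}")
    return keep

# e.g., selecting all snowy night scenes from a hypothetical dataset index:
# subset = select_images(dataset_index, ["snow", "night"], mode="AND")
```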
3.4. Experimental Evaluation
We evaluated the Labels4Rails tool by applying it to annotate the L4R_NLB dataset, which is detailed in the second part of this paper. In this effort, the tool proved easy to use, as we involved a large number of students in the annotation work, including some with a non-technical background.
To prove the assumed greater efficiency of annotation work with Labels4Rails, we conducted an experiment comparing the time required to annotate the same images by the same individuals using the Labels4Rails tool vs. using other, more generic tools. As an example of the group of generic tools, we chose CVAT [18], as it appears to be the most advanced and versatile generic tool for our use case. We compared the manual annotation of images of varying complexity in Labels4Rails and CVAT, distinguishing the effort required for track annotation, switch labeling, and tag assignment. We also investigated the usability of automated, AI-based annotation features. We made the following observations in this experiment (see Figure 2, Table 2, and Table 3):
3.4.1. Manual Annotation
Using standard manual annotation, working with Labels4Rails is approximately twice as efficient as working with CVAT when the same level of annotation quality is required. For track annotation, the efficiency gain primarily relates to the use of Catmull–Rom splines in Labels4Rails, which require less than 50% of the number of anchor points for the approximation of track curvature compared to the polygons used in CVAT. For tag assignment, the efficiency gain is due to the Labels4Rails user interface design, which ensures the direct accessibility of the tags, whereas in CVAT, several user actions are required to select and set a tag. For object annotation, the efficiency is at a similar level in both tools, as the user activities are very similar.
3.4.2. Automated Annotation
The automation features in CVAT cannot be meaningfully applied to the use case of track and switch annotation. Automated segmentation in CVAT, which is based on the Segment Anything network, does not recognize tracks as integral objects; therefore, it does not yield meaningful masks or polygons. However, it works well with objects that are on or close to the track, such as trains. Similarly, switches (track forks and merges) are not identified as objects by the YOLO-based automatic object detection feature in CVAT, likely because the network has never been trained on such objects.
Table 2. Time required for annotation (maximum, minimum and weighted average), comparing CVAT and Labels4Rails (manual annotation).
| | CVAT max | CVAT min | CVAT avg | Labels4Rails max | Labels4Rails min | Labels4Rails avg | Gain |
|---|---|---|---|---|---|---|---|
| Tracks | 01:45.5 | 00:20.7 | 01:05.0 | 00:44.3 | 00:18.7 | 00:36.6 | −44% |
| Switches | 00:19.5 | 00:12.3 | 00:15.4 | 00:18.4 | 00:10.5 | 00:14.9 | −4% |
| Tags | 00:07.4 | 00:04.6 | 00:05.6 | 00:02.9 | 00:01.4 | 00:02.1 | −62% |
| Image | 11:21.7 | 00:48.4 | 05:27.1 | 06:40.4 | 00:28.7 | 03:10.8 | −42% |
Table 3. Time required for annotation (maximum, minimum and weighted average), comparing CVAT and Labels4Rails (automated annotation and manual amendment).
| | CVAT max | CVAT min | CVAT avg | Labels4Rails max | Labels4Rails min | Labels4Rails avg | Gain |
|---|---|---|---|---|---|---|---|
| Tracks | 01:45.5 | 00:20.7 | 01:05.0 | 01:00.4 | 00:07.4 | 00:30.3 | −53% |
| Switches | 00:19.5 | 00:12.3 | 00:15.4 | 00:11.5 | 00:01.0 | 00:06.2 | −60% |
| Tags | 00:07.4 | 00:04.6 | 00:05.6 | n/a | n/a | n/a | n/a |
| Image | 11:21.7 | 00:48.4 | 05:27.1 | 04:51.8 | 00:07.4 | 02:17.0 | −58% |
In contrast, the automation features in Labels4Rails are specifically tailored for annotating railway images. Annotation of switch objects works very well, provided that the prior annotation of tracks has been done properly (because switch positions and types are automatically derived from the topology of tracks). Manual correction of detected switch objects is necessary only in very few cases. Automated track annotation also works well but is restricted—in line with the capabilities of the underlying AI model—to the annotation of the ego track. As a result, the efficiency gain depends strongly on the complexity of the track layout, as highlighted by the considerable spread between the minimum and maximum times required to annotate an image using this feature. The spread is primarily due to an increase in the time required to correct auto-generated annotations for complex scenes—an effect also reported in [30]. For both aspects, automated track annotation and automated switch annotation, the numbers presented here cover the full duration, including the time to perform the automated annotation, the time required for correcting the automatically generated annotations, and the time required for creating additional annotations as needed. The automated assignment of tags has not been part of our experiment.
3.5. Current Tool Limitations
The Labels4Rails tool currently does not support annotating railway signals and other relevant objects within the scene that are required to facilitate the full spectrum of use cases in the context of ML-enabled perception systems for autonomous trains.
As a minor drawback, the tool does not support the annotation of railway tracks when only a single rail is visible, which impairs accurate annotation in scenarios involving partial visibility.
Lastly, annotations are saved in a custom format (as is the case for many other annotation tools). More versatile annotation export features are required to ensure compatibility with established training pipelines and tools.
4. The L4R_NLB Dataset
Using the Labels4Rails annotation tool, we have created the L4R_NLB dataset (Labels4Rails Nordlandsbanen) as a first reference dataset.
4.1. Data Source
The L4R_NLB dataset utilizes data sourced from the Minute by Minute documentary provided by the Norwegian Broadcasting Corporation (NRK) [31]. It consists of videos from the driver’s perspective of the railway between Trondheim and Bodø in Norway. The train route was recorded four times during different seasons.
The dataset was created from existing MP4 footage using the FFmpeg tool. Only intra-coded frames (I-frames) were extracted to minimize redundancy. The resulting time intervals between frames are approximately five seconds in most cases, maintaining regularity throughout the data. To ensure the chronological order of the extracted frames, single-thread mode was employed during processing, avoiding any misalignment that might arise from multi-threaded processing.
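A minimal sketch of such an extraction step is shown below; the exact FFmpeg options used for L4R_NLB are not documented here, so the command is illustrative only:

```python
import subprocess

def extract_iframes(video_path: str, out_pattern: str = "frame_%05d.png") -> None:
    """Extract only intra-coded frames (I-frames) from an MP4 file with FFmpeg,
    forcing single-threaded processing to preserve frame order.
    (Illustrative command; the options actually used for L4R_NLB may differ.)"""
    subprocess.run(
        [
            "ffmpeg",
            "-threads", "1",                     # single-threaded processing
            "-i", video_path,
            "-vf", "select=eq(pict_type\\,I)",   # keep I-frames only
            "-vsync", "vfr",                     # one output image per selected frame
            out_pattern,
        ],
        check=True,
    )

# extract_iframes("nordlandsbanen_winter.mp4")
```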
In total, 7155 frames per season were extracted, resulting in 28,620 frames across all four seasons. Out of these, 10,253 frames were labeled according to a predefined labeling policy, which is detailed in the following section of this paper. A statistical analysis based on several properties of the images was conducted to evaluate the quality of the dataset.
4.2. Dataset Structure
The dataset is split into four parts according to the seasons. Each season is structured as follows:
4.3. Annotation Specification
The image annotations are provided in JSON files and are divided into three lists: ’Tracks’, ’Switches’, and ’Tags’. Tracks consist of their position in the image and their left and right rails, where each rail is described by a list of points. Switches consist of two points defining their bounding box, the switch type, and the active direction. Tags are divided into categories. Each category only contains tags that apply to the image. All points are provided in pixel coordinates, and bounding boxes are axis-aligned, with top-left and bottom-right coordinates.
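To make this structure concrete, the following illustrative record (written as a Python literal) mirrors the organization described above; the field names and values are hypothetical, not the tool's exact schema:

```python
# Hypothetical field names; the authoritative schema is defined by the Labels4Rails tool.
example_annotation = {
    "Tracks": [
        {
            "position": "ego",                                      # 'ego', 'left', or 'right'
            "left_rail":  [[742, 1079], [768, 860], [790, 700]],    # pixel coordinates
            "right_rail": [[1180, 1079], [1110, 860], [1050, 700]],
        }
    ],
    "Switches": [
        {
            "bbox": [[880, 640], [1040, 720]],   # top-left and bottom-right corners
            "type": "fork",                      # 'fork' or 'merge'
            "direction": "left",                 # 'left', 'right', or 'unknown'
        }
    ],
    "Tags": {
        "Track Layout": ["curve"],
        "Weather": ["cloudy", "snow"],
        "Light": ["natural", "uniform"],
        "Time of Day": ["day"],
        "Environment": ["rural"],
    },
}
```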
4.3.1. Track Annotation
All drivable tracks that are visible in the image are labeled if both rails belonging to the track are visible. They are labeled from the bottom of the image to the background, as far as they are visible to the human eye. The anchor markers are placed in the center of the rails.
There are three different track positions: ’ego’, ’left’, and ’right’. The position ’ego’ is used for the track on which the train is actually driving. The positions ’left’ and ’right’ are used for neighboring tracks and are set according to the relative position of the respective track to the ego-track.
Tracks that traverse switches are labeled according to the active direction of the switch. If the switch state is not distinguishable, the most likely direction is assumed, and the tracks are annotated according to the assumed direction. Tracks starting or ending at a switch are annotated across the entire switch. Consequently, the active direction of the switch may also be derived from the track annotations.
Occlusions are handled in the following way: As long as only a minor part of the track is hidden, the track is marked across the object. However, if most of the track is hidden behind other objects, such as platforms, buildings, trains, or vegetation, a separate track is created for the portion of the track that begins after the object.
An example of an annotated image with left and right neighbor tracks is shown in Figure 3.
Points describing rails are connected with Catmull–Rom splines [27]. This approach allows for a significant reduction in the number of support points for the rails, especially on curved tracks, without adversely affecting their adherence to the actual curvature of the rails.
4.3.2. Switch Annotation
All switches that are visible in the image are labeled. The positions of switches are marked with bounding boxes that enclose all characteristic parts of the switch: frog, point blades, and stock rails (see Figure 4 for an illustration of these switch parts).
The type of switch, ’fork’ or ’merge’, and its active direction, ’left’, ’right’, or ’unknown’, are annotated. The switch type is always selected with respect to the train’s driving direction. If the switch status is visible, the active switch direction, ’left’ or ’right’, is set. Otherwise, if the switch status is not visible due to image quality or distance, the direction is set to ’unknown’. Examples of switches are displayed in Figure 5.
4.3.3. Tags Annotation
Track Layout tags are set according to the shape of all labeled tracks and structures that interfere with tracks, such as bridges or level crossings:
’straight’: At least one track is straight.
’curve’: At least one track is curved.
’parallel structures’: Support rails or similar structures, railings of bridges close to the track (not platforms or sidewalks).
’orthogonal structures’: Bridges over rails, level crossings.
’unknown’: Tracks are not visible due to lighting (dark tunnels) or weather conditions (fog).
The following rules apply to the Track Layout tags: At least one, either ’curve’, ’straight’ or ’unknown’, must be set. Both ’curve’ and ’straight’ are set if both curved and straight tracks are visible in the image.
Weather tags are set according to the impact they have on the tracks:
’cloudy’: Tracks are not illuminated by the sun. This can be the case even if the blue sky is visible.
’sunny’: Tracks are illuminated by the sun. This can be the case even if the visible sky is clouded.
’rainy’: The ground is wet, or there are raindrops on the camera lens.
’snow’: The ground is covered with snow. If it is snowing, the ’rainy’ tag must also be set.
’foggy’: The visibility of the tracks is limited due to fog. Fog obscuring other things is irrelevant.
’unknown’: Weather conditions cannot be determined due to tunnel or nighttime situations, etc.
The following rules apply to the Weather tags: Either ’cloudy’, ’sunny’, or ’unknown’ must be set. The other tags are optional. In some cases, up to four tags must be set, e.g., ’cloudy’, ’rainy’, ’snow’, and ’foggy’.
Light tags describe the lighting situation of the scene:
’natural’: The scene is illuminated by the sun.
’artificial’: The scene is illuminated by artificial light.
’uniform’: The tracks are evenly illuminated.
’hard shadows’: At least one rail is crossed by a shadow. Surroundings and shadows cast by the rails themselves are not considered.
’dark’: Scenes that are dark due to daytime, weather, or environment. This tag is initially set automatically (see Section 3.2), but it can be adapted manually.
’bright’: Scenes that are bright due to the daytime or weather. Like the ’dark’ tag, this tag is initially generated automatically.
’unknown’: Scenes without any light, e.g., tunnels or nighttime situations.
The following rules apply to Light tags: At least one tag, ’natural’, ’artificial’ or ’unknown’, must be set. In some cases, both ’natural’ and ’artificial’ must be set. Furthermore, either ’uniform’ or ’hard shadows’ must be set.
Time of Day tags indicate—as far as detectable—the time of day when an image was taken:
’day’: The scene was recorded during the daytime.
’twilight’: The scene was recorded in the morning or evening.
’night’: The scene was recorded during the night.
’unknown’: The time of day cannot be determined. This can be the case when the train is in a tunnel.
The following rules apply to Time of Day tags: Exactly one of the tags must be set. If the time of day cannot be determined from the image data, it must be estimated using the environment, weather, and lighting conditions.
As the images of the L4R_NLB dataset are in chronological order, the Time of Day tag was set as listed in Table 4:
Environment tags describe the scene environment depicted in the image:
’rural’: There are natural surfaces in the immediate vicinity of the tracks, or the majority of the surroundings is natural.
’urban’: There are artificial or man-made objects in the immediate vicinity of the rails, or the majority of the surroundings is artificial. Tunnels in rural areas are not considered artificial objects in that sense.
’station’: A station is visible in the image, even if the train has not yet entered the station. If the train is in a station but there is no platform or other parts of the station visible in the image, the ’station’ tag is not set (so no prior knowledge of the train’s position is used, only the visible image content).
’underground’: Tracks or parts thereof are located underground, for example, at a tunnel entrance or exit.
’unknown’: The environment is not visible, for example, in nighttime scenes.
The following rules apply to Environment tags: At least one tag, ’rural’, ’urban’, ’underground’, or ’unknown’ must be set. If ’unknown’ is set, no other Environment tag may be set. On the other hand, all three tags, ’rural’, ’urban’, and ’underground’ can be set concurrently if the environment has mixed rural and urban features, and parts of the tracks are located underground. The ’station’ tag is always optional.
Additional Attributes tags are used to denote specific characteristics of images:
’duplicate’: Images that differ only minimally from their predecessors, typically because the train stopped at a station or in front of a signal, are tagged with the ’duplicate’ tag.
’obstruction’: Windscreen wipers, large raindrops, and other items belonging to the train or obstructing the camera that blur or hide a significant portion of the tracks are considered obstructions. Objects belonging to the environment are not regarded as obstructions, even if parts of the tracks are hidden by them.
The following rules apply to Additional Attributes tags: All these tags are optional.
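For illustration, the rules above for the individual tag categories can be condensed into a small consistency check such as the following sketch, which is not part of the Labels4Rails tool itself:

```python
def check_tags(tags: dict) -> list[str]:
    """Minimal consistency check for the labeling rules summarized above
    (illustrative sketch only)."""
    errors = []
    if not set(tags.get("Track Layout", [])) & {"straight", "curve", "unknown"}:
        errors.append("Track Layout: need 'straight', 'curve', or 'unknown'")
    if not set(tags.get("Weather", [])) & {"cloudy", "sunny", "unknown"}:
        errors.append("Weather: need 'cloudy', 'sunny', or 'unknown'")
    light = set(tags.get("Light", []))
    if not light & {"natural", "artificial", "unknown"}:
        errors.append("Light: need 'natural', 'artificial', or 'unknown'")
    if not light & {"uniform", "hard shadows"}:
        errors.append("Light: need 'uniform' or 'hard shadows'")
    if len(tags.get("Time of Day", [])) != 1:
        errors.append("Time of Day: exactly one tag required")
    env = set(tags.get("Environment", []))
    if not env & {"rural", "urban", "underground", "unknown"}:
        errors.append("Environment: need 'rural', 'urban', 'underground', or 'unknown'")
    if "unknown" in env and len(env) > 1:
        errors.append("Environment: 'unknown' excludes all other Environment tags")
    return errors
```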
4.4. Dataset Statistics
Using the image tags and other information, we conducted a detailed statistical analysis that helps to better understand the characteristics of the dataset.
The dataset consists of images taken in diverse environments, as demonstrated by the distribution of the different combinations of Track Layout and Light tag types. The dataset contains 2991 straight-track images and 7734 curve images, as well as 735 images showing both straight and curved tracks. Regarding lighting conditions, the majority of images were taken under natural or uniform lighting, while other lighting variations also exist in the dataset. The dataset provides extensive coverage of complex railway infrastructure through its 1793 images labeled with the ’multiple tracks’ tag, which represent dense railway layouts such as junctions, sidings, and yards. Operationally rich scenarios that require routing decisions occur in 937 images that show both tracks and switches. These images are particularly valuable for infrastructure detection, switch classification, and track planning because they present complex contextual information about the railway tracks. The statistics are summarized in Table 5 and Table 6, listing the distributions of the tags, the types of switches, and the complexity of the infrastructure. The substantial presence of multiple tracks and track-switch combinations makes the dataset suitable for developing robust railway scene perception models.
4.5. Evaluation
We used the L4R_NLB dataset to conduct a large set of experiments in the context of our research on N-version machine learning, an approach to detecting and masking erroneous outputs from an ML-enabled system by using an architecture of multiple dissimilar machine learning modules [32,33,34]. The dataset has been instrumental for this research due to its annotation quality and the possibility of performing detailed analyses regarding ML model vulnerabilities and error densities using image tags as filters.
The application in this research also showed that the L4R_NLB dataset alone is not large and diverse enough to prevent overfitting, and that training a neural network solely with the images from the L4R_NLB dataset can lead to poor detection results, especially for neighboring tracks. To avoid these problems, we have used it in combination with additional datasets, such as RailSem19.
4.6. Current Dataset Limitations
The railway line between Trondheim and Bodø runs mainly through rural areas characterized by expansive, unspoiled terrain. The data collected from this area comprises predominantly images containing only the ego-track, with a smaller portion of the scenes also including neighbor tracks.
While the annotations of the dataset contain switches and their position, type, and state, they currently lack annotations for signals. Signals are important for determining the relevance of a track section as an ego-track and are therefore essential for developing complete solutions for autonomous trains. In addition, switch-related signals could also provide additional information about the state of switches, as they may be perceived from a greater distance than the position of the point blades.
Furthermore, the dataset does not contain annotations of objects along the railway track, such as humans, vehicles, or obstacles. Detection of such objects is also important for autonomous train operations.
5. Conclusions and Outlook
The Labels4Rails annotation tool represents a significant step forward in providing an efficient image annotation solution tailored for railway automated driving. As such, it considerably reduces the effort required to annotate image data and allows for the creation of meaningful machine learning training datasets with acceptable effort.
The first dataset created with this tool—the L4R_NLB dataset—is an important contribution to the set of publicly available machine learning training datasets for railway automated driving. Despite its limitations, particularly the dataset’s focus on rural environments and the absence of signal and object annotations, it can be effectively used for many training tasks in the context of ML-enabled perception systems for autonomous trains. For some tasks, however, the L4R_NLB dataset must be combined with other datasets to achieve the necessary variety in scene and object types.
Future work will, on one hand, concentrate on overcoming the tool-related limitations discussed in Section 3.5. We aim to implement new tool features that will allow the annotation of signals and additional relevant object types on or near the track. In this context, an evolution of the object annotation approach towards 3D annotations will be investigated. In addition, we contemplate support for OpenLABEL as an open annotation format that is increasingly endorsed by the railway community.
In other tool-related work, we will continue to develop and strengthen the automated annotation features into a core tool feature that will further reduce the annotation effort.
On the other hand, we plan to enhance the existing L4R_NLB dataset, especially with regard to the currently unavailable signal and object annotations.
In general, it is strongly desirable to have more machine learning training datasets available, especially datasets containing complex scenes with many tracks, switches, signals, and other objects. Whilst we will certainly work in this direction and plan to eventually provide the results of such work as additional publicly available datasets to the community, we would like to encourage the community to use the Labels4Rails tool in creating their own datasets, and—ideally—to make these datasets also available to the public.
New versions of the Labels4Rails tool and the L4R_NLB dataset will be made publicly available in the same way as the current versions [1,2].
Looking ahead, a promising direction for further research and development is the potential extension of the Labels4Rails tool’s capabilities regarding the annotation of videos and the creation of respective machine learning training datasets. This transition could enable the use of recurrent neural networks (RNNs), which are particularly well-suited for capturing temporal dependencies in sequential data. By using video data, models could be trained to recognize and track changes over time, such as the detection of moving objects, changes in track conditions, or the dynamic state of switches. This would open new possibilities for more robust, real-time applications of autonomous train systems, ultimately contributing to the safe and efficient autonomous operation of trains. Thus, the Labels4Rails annotation tool and the L4R_NLB dataset may serve as cornerstones for future innovations in the field of autonomous railway technology.
Author Contributions
Conceptualization, T.H., F.H. and C.T.; methodology, T.H. and F.H.; software, T.H., F.H., S.M., E.K. and B.P.; validation, T.H. and C.T.; data curation, T.H.; writing—original draft preparation, T.H., F.H., C.T., S.M., E.K. and B.P.; writing—review and editing, C.T.; supervision, C.T.; project administration, C.T.; funding acquisition, C.T. All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially funded by the German Federal Ministry of Research, Technology and Space (BMFTR) under Grant 16IS22029C.
Data Availability Statement
The original code and data presented in this paper are openly available in GitHub [1] and Zenodo [2], respectively.
Conflicts of Interest
Florian Hofstetter is currently employed by the company Knorr-Bremse. However, he contributed to this paper when he was with HTW Berlin. The authors declare that the company had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Hiebert, T.; Thomas, C.; Jaß, P.F. Label4Rails Annotation Tool. 2025. Available online: https://github.com/railway-perception-htw-berlin/Labels4Rails (accessed on 20 August 2025).
- Thomas, C.; Hiebert, T.; Jaß, P.F. L4R_NLB Dataset. Zenodo. 1 July 2025. Available online: https://zenodo.org/records/14260575 (accessed on 20 August 2025).
- Zendel, O.; Murschitz, M.; Zeilinger, M.; Steininger, D.; Abbasi, S.; Beleznai, C. RailSem19: A Dataset for Semantic Rail Scene Understanding. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1221–1229.
- Wang, Y.; Wang, L.; Hu, Y.H.; Qiu, J. RailNet: A Segmentation Network for Railroad Detection. IEEE Access 2019, 7, 143772–143779.
- Harb, J.; Rébéna, N.; Chosidow, R.; Roblin, G.; Potarusov, R.; Hajri, H. FRSign: A Large-Scale Traffic Light Dataset for Autonomous Trains. arXiv 2020, arXiv:2002.05665.
- Jahan, K.; Niemeijer, J.; Kornfeld, N.; Roth, M. Deep Neural Networks for Railway Switch Detection and Classification Using Onboard Camera Images. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–7.
- Zouaoui, A.; Mahtani, A.; Hadded, M.A.; Ambellouis, S.; Boonaert, J.; Wannous, H. RailSet: A Unique Dataset for Railway Anomaly Detection. In Proceedings of the 2022 IEEE 5th International Conference on Image Processing Applications and Systems (IPAS), Genova, Italy, 5–7 December 2022; Volume 5, pp. 1–6.
- Chen, Y.; Zhu, N.; Wu, Q.; Wu, C.; Niu, W.; Wang, Y. MRSI: A multimodal proximity remote sensing data set for environment perception in rail transit. Int. J. Intell. Syst. 2022, 37, 5530–5556.
- Guan, L.; Jia, L.; Xie, Z.; Yin, C. A Lightweight Framework for Obstacle Detection in the Railway Image Based on Fast Region Proposal and Improved YOLO-Tiny Network. IEEE Trans. Instrum. Meas. 2022, 71, 1–16.
- Li, X.; Peng, X. Rail Detection: An Efficient Row-based Network and a New Benchmark. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), Lisboa, Portugal, 10–14 October 2022; pp. 6455–6463.
- Tagiew, R.; Klasek, P.; Tilly, R.; Köppel, M.; Denzler, P.; Neumaier, P.; Klockau, T.; Boekhoff, M.; Schwalbe, K. OSDaR23: Open Sensor Data for Rail 2023. In Proceedings of the 2023 8th International Conference on Robotics and Automation Engineering (ICRAE), Singapore, 17–19 November 2023; pp. 270–276.
- Leibner, P.; Hampel, F.; Schindler, C. GERALD: A Novel Dataset for the Detection of German Mainline Railway Signals. Proc. Inst. Mech. Eng. Part F 2023, 237, 1332–1342.
- TARP Project. Railway Obstacle Detection (ROD) Dataset. Roboflow, 2023. Available online: https://universe.roboflow.com/tarp-proj/obstacle-detection-t5lua (accessed on 24 October 2025).
- Tagiew, R.; Wunderlich, I.; Zanitzer, P.; Sastuba, M.; Knoll, C.; Göller, K.; Amjad, H.; Seitz, S. Görlitz Rail Test Center CV Dataset 2024 (RailGoerl24). 2025. Available online: https://data.fid-move.de/dataset/railgoerl24 (accessed on 20 August 2025).
- Diotallevi, C.; Gudiño, R.; Pachalieva, Z.; Neumaier, P.; Denzler, P.; Köppel, M. Multisensordatensatz für die Umfeldüberwachung von Schienenfahrzeugen. In Deine Bahn; Bahn Fachverlag: Berlin, Germany, 2025; Volume 2025, pp. 22–25.
- Chen, C.; Qin, H.; Qin, Y.; Bai, Y. Real-Time Railway Obstacle Detection Based on Multitask Perception Learning. IEEE Trans. Intell. Transp. Syst. 2025, 26, 7142–7155.
- Moschidis, C.; Vrochidou, E.; Papakostas, G.A. Annotation Tools for Computer Vision Tasks. In Proceedings of the Seventeenth International Conference on Machine Vision (ICMV 2024), Edinburg, UK, 10–13 October 2024; Volume 13517, p. 135171A.
- Sekatchev, B. CVAT—Computer Vision Annotation Tool. 2018. Available online: https://github.com/cvat-ai/cvat (accessed on 24 October 2025).
- Skalski, P. Make Sense. 2019. Available online: https://github.com/SkalskiP/make-sense/ (accessed on 24 October 2025).
- Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714.
- Human Signal. Label-Studio. 2025. Available online: https://labelstud.io/ (accessed on 20 August 2025).
- Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A Database and Web-Based Tool for Image Annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 3 February 2025).
- ASAM e.V. ASAM OpenLABEL. 2021. Available online: https://www.asam.net/standards/detail/openlabel/ (accessed on 24 October 2025).
- Catmull, E.; Rom, R. A Class of Local Interpolating Splines. In Computer Aided Geometric Design; Barnhill, R.E., Riesenfeld, R.F., Eds.; Academic Press: Cambridge, MA, USA, 1974; pp. 317–326.
- Laurent, T. Train Ego-Path Detection on Railway Tracks Using End-to-End Deep Learning. arXiv 2024, arXiv:2403.13094.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
- Sagastiberri, I.; Larrazabal, M.; Sánchez, M.; Aranjuelo, N.; Nieto, M.; De Eribe, D.O. Annotation Pipeline for Railway Track Segmentation. In Proceedings of the 2024 IEEE 12th International Conference on Intelligent Systems (IS), Varna, Bulgaria, 29–31 August 2024.
- NRK. Nordlandsbanen: Minute by Minute, Season by Season. 2013. Available online: https://nrkbeta.no/2013/01/15/nordlandsbanen-minute-by-minute-season-by-season/ (accessed on 24 October 2025).
- Jaß, P.; Abukhashab, H.; Thomas, C.; Woltersdorf, P.; Weber, M.; Conrad, M.; Fey, I.; Schülzke, H. CertML: Initial Steps Towards Using N-Version Neural Networks for Improving AI Safety. Datenschutz Datensicherheit—DuD 2023, 47, 483–486.
- Jaß, P.; Thomas, C.; Hiebert, T.; Plettig, G. Towards safe obstacle detection for autonomous train operation: Combining track and switch detection neural networks for robust railway ego track detection. In Proceedings of the ERTS 2024, Toulouse, France, 11–12 June 2024; pp. 1–10.
- Jaß, P.; Thomas, C. Using N-Version Architectures for Railway Segmentation with Deep Neural Networks. Mach. Learn. Knowl. Extr. 2025, 7, 49.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).