Virtual Axle Detector Based on Analysis of Bridge Acceleration Measurements by Fully Convolutional Network

In the practical application of Bridge Weigh-In-Motion (BWIM) methods, the position of the wheels or axles during the passage of a vehicle is a prerequisite in most cases. To avoid the use of conventional axle detectors and bridge type-specific methods, we propose a novel method for axle detection using accelerometers placed arbitrarily on a bridge. In order to develop a model that is as simple and comprehensible as possible, the axle detection task is implemented as a binary classification problem instead of a regression problem. The model is implemented as a Fully Convolutional Network to process signals in the form of Continuous Wavelet Transforms. This allows passages of any length to be processed in a single step with maximum efficiency while utilising multiple scales in a single evaluation. It also enables our method to use acceleration signals from any location on the bridge structure, so that the accelerometers act as Virtual Axle Detectors (VADs) without being limited to specific structural types of bridges. To test the proposed method, we analysed 3787 train passages recorded on a steel trough railway bridge of a long-distance traffic line. Results of the measurement data show that our model detects 95% of the axles, which means that 128,599 out of 134,800 previously unseen axles were correctly detected. In total, 90% of the axles were detected with a maximum spatial error of 20 cm, at a maximum velocity of $v_{max} = 56.3$ m/s. The analysis shows that our developed model can use accelerometers as VADs even under real operating conditions.


Introduction
All over the world, ageing bridge infrastructure faces the challenge of increasing traffic loads. For example, in the United States, there are more than 617,000 bridges, 42% of which are at least 50 years old and 7.5% of them are considered structurally deficient [2]. In Germany, more than 40% of the 25,710 railway bridges are older than 80 years, while the average lifespan is about 122 years [10,18]. The application of structural health monitoring (SHM) makes it possible to increase the operational availability and safety of these structures. As the knowledge of the actual operational loads is of high importance for the condition assessment of the structures, especially when it comes down to the assessment of fatigue failure and the evaluation of the remaining service life, the determination of the loads is a key aspect in the field of SHM. Since direct measurement of the loads is often technically difficult and usually requires a significant financial effort [5,19,9], different methods for load identification based on measured structural responses have been developed [17,14,24,8,9]. In the case of bridges, these techniques are referred to as Bridge Weigh-In-Motion (BWIM) [25,32,33,13].
For the majority of BWIM systems, the information on vehicle configuration (axle number and axle spacing) and velocity are prerequisites [13]. For this purpose, conventional axle detectors are used [30,35,3,19]. However, due to the impact loads of the wheels, the axle detectors have a limited durability [34]. In addition, the installation of the axle detectors always implies road or railway track closures. Especially the latter case requires a considerable bureaucratic, logistic and financial effort. To avoid these issues, modern BWIM systems use axle detection concepts that only use sensors installed under the bridge. These concepts are referred to as nothing-on-road (NOR) or free-of-axle-detector (FAD) [27,36].
In the FAD technology, two additional strain sensors at different positions of the bridge are used to identify the vehicle configuration and velocity [36]. Since FAD is only suitable for specific types of bridges [13], researchers attempted to identify axle velocity and spacing from global flexural strain or shear strain measurements [13,16]. For the method proposed in [16], the use of shear strains requires the application of strain gauges at the level of the neutral axis. This is a challenge for complex structures, especially for railway bridges with ballasted tracks, as in such cases the position of the neutral axis cannot be easily determined. The method proposed in [13], in turn, is only suitable for structures where the structural response is dominated by the quasi-static response of the bridge, e.g. where the dynamic amplification is low. Furthermore, in [13], the second time derivative of the strain signals is used. This makes the method sensitive to measurement noise, so that a suitable noise filter is needed, depending on the specific application.
In [13] the method of virtual axles is proposed, in which a vehicle with many virtual axles is assumed. All axles except the real ones are weightless. The true axles and their weights are then determined by solving a constrained least-squares problem. As the authors state, the method fails if there is significant noise in the signals. Since a significant amount of noise is present in field measurements and further practical applications, the method cannot be practically applied without a sophisticated regularisation method. Furthermore, the method uses experimentally determined influence lines, so it is not applicable to cases with significant dynamic amplification in their structural response.
To the best of our knowledge, only Zhu et al. [37] have so far published an accelerometer-based axle detection method. Here, a shallow Convolutional Neural Network (CNN) is used to detect potential axle sequences, which are then transformed with a continuous wavelet transform. Afterwards, the axles are detected in the transformed signals by peak-finding methods. The method of Zhu et al. [37] requires accelerometers close to the supports. The acceleration signals of these sensors are dominated by the vehicle induced impulses when entering and leaving the bridge, leading to the clear axle recognition in the time domain.
An axle detection method based on acceleration measurements is desirable, as the installation of acceleration sensors is much easier and less laborious compared to strain gauges. However, accelerometers are often already installed on the structure for the determination of the modal parameters, and they are not necessarily located close to the supports. Therefore, we propose a method that enables the use of accelerometers attached at any position on a bridge as a VAD. In this way, the same acceleration sensors used for analysing the global structural behaviour (e.g. at midspan or quarter span of beam-like bridges) can also be employed for the axle detection, without having to install additional sensors in the proximity of the supports. Continuous Wavelet Transformations (CWTs) were used in the present work as they are an effective tool for the analysis of acoustic and visual signals in general [7]. Furthermore, previous works [6,16,34,36,37] have shown that CWTs are an effective tool for axle identification. The wavelet-transformed signals are subsequently analysed using a Fully Convolutional Network (FCN) that is trained in a supervised manner to perform a binary classification task (axle/no axle). This enables signals of any length to be processed without having to be divided into time windows. Furthermore, the analysis in this way is not limited to certain mother wavelets or certain scales, as in the previously mentioned works that used the CWT [6,16,34,36,37]. To validate our method, we recorded a data set on a railway bridge with sensors distributed across the free span of the bridge on the main girders. The impulses of the wheel sets are superimposed with the vibrations of the bridge, which does not allow their clear visual identification in the time domain. For many bridges and for common sensor setups for monitoring purposes, similar to the ones used for this study, the method of Zhu et al. [37] would not be applicable.
Since we use a supervised learning approach for the VAD, a set of train passages with known axle distances and velocities is required. In the current research, this information is obtained by means of strain measurements at the rail level. For future practical applications, the information could be obtained from vehicles with known axle configuration and the use of a Differential Global Positioning System (DGPS). If such information is not available, a transfer learning approach based on simulated data could also be an option.
The paper is structured as follows: In chapter two, the methods are presented. The first section of the chapter describes the data acquisition in the field experiment and the subsequent data processing. The second section of chapter two contains the model definition. In the last section of the chapter, details are given on the training of the model. Chapter three presents and discusses the results. The paper ends with chapter four, in which the conclusions of the present study are drawn.

Data Acquisition
We recorded the measurement data used in the present study on a single-span steel trough railway bridge. The measurement set-up is shown in fig. 2. A total of ten seismic uniaxial accelerometers of the type PCB-39B04 (PCB Synotech) with a sensitivity of 1000 mV/g (±10%), a broadband resolution of 0.000003 g RMS, a measurement range of ±5 g pk and a frequency range of 0.06 to 450 Hz (±5%) were installed.
The measurements are triggered via the rising slope of the signal at the wheel load measuring point G1 (fig. 2). The data held in the ring buffer for the 10 seconds before the trigger are stored together with the 50-second measurement after the trigger. The recorded signals thus all have a length of 60 seconds. All sensor signals were recorded with a sampling frequency of $f_s = 600$ Hz using the catmanAP software and the CX22 data recorder connected to an MX1601B universal amplifier and an MX1616B strain gauge amplifier (all products are from HBK).
By means of two wheel load measuring points, the average velocity of each axle was determined and from this the actual position of the axles during the passage was deduced. Every measuring point involves the installation of at least one pair of rosette strain gauges (HBM 1-CXY41-6/350HE) on the rails. Each pair of strain gauges is placed at the level of the neutral axis of the UIC 60 rail profiles with a distance of 20 cm and allows the recording of bi-axial strains at an angle of 45° with respect to the neutral axis (fig. 3). Thus shear strains are obtained. The difference of the shear strains allows the determination of the acting wheel loads. Since only the difference of the shear strains is of interest, the strain gauges can be combined into a single signal in a full bridge circuit. For further details, please refer to e.g. [19]. To compensate for the influence of the lateral wheel loads, a pair of strain gauges was placed on each side of the rail, so that one wheel load measuring point delivers two signals.
The peaks of the wheel load measurement signals are automatically identified (fig. 4a). All passages where the two wheel load measuring points have not detected the same number of peaks are discarded. This leads to 3745 usable out of a total of 3787 recorded passages, i.e. about 98.9%. Using the temporal differences of the peaks at the two measuring points and the known distance between the wheel load measuring points of 14.40 m, the mean velocity can be determined for each axle. The trains reach a maximum velocity of about 57 m/s (fig. 4b). In the next step, using the known distances from the first wheel load measuring point to each of the ten accelerometers and the mean velocity of each axle, the time at which the axle is at the same x-ordinate as the respective sensor can be calculated. Since the two strain gauges of one wheel load measuring point have a distance of 20 cm between them, the uncertainty with respect to the distance between the two wheel load measuring points $s_{WLM} = 14.40$ m is assumed to be $\Delta s_{WLM} = 0.2$ m. This propagates through the velocity determination. Together with an uncertainty in time of $\Delta t = 1/f_s = 1/600$ s, the absolute spatial error $\Delta x$ results from linear error propagation for each sensor (fig. 5). The absolute position error increases with increasing velocity and with an increasing distance of the sensor from the first wheel load measuring point (G1/G2).
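The labelling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published processing code: the function names and the example values are hypothetical, and the error bound is one plausible first-order form (position x = v·t, with the velocity taken from the two measuring points), which is not necessarily identical to the expression used for fig. 5.

```python
F_S = 600.0          # sampling frequency in Hz (from the text)
S_WLM = 14.40        # distance between the wheel load measuring points in m
DELTA_S = 0.2        # assumed uncertainty of S_WLM in m (from the text)
DELTA_T = 1.0 / F_S  # assumed timing uncertainty in s (one sample)


def axle_velocity(t_first, t_second):
    """Mean axle velocity (m/s) from the peak times (s) at both measuring points."""
    return S_WLM / (t_second - t_first)


def sample_index_at_sensor(t_first, velocity, x_sensor):
    """Sample index at which the axle is at the sensor's x-ordinate.

    x_sensor is the distance (m) from the first measuring point to the sensor.
    """
    return round((t_first + x_sensor / velocity) * F_S)


def position_error(velocity, x_sensor):
    """First-order (linear) bound on the position error (m) at a sensor.

    Assumption: x = v * t with v = S_WLM / dt, so the velocity error is
    v * (DELTA_S / S_WLM + DELTA_T / dt) and the timing error adds v * DELTA_T.
    """
    dt = S_WLM / velocity
    delta_v = velocity * (DELTA_S / S_WLM + DELTA_T / dt)
    travel_time = x_sensor / velocity
    return travel_time * delta_v + velocity * DELTA_T
```

Under these assumptions the bound grows with both the velocity and the sensor distance, which matches the qualitative behaviour described in the text.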
The acceleration signals were combined into one data matrix $A_L^{36,000 \times 5}$ for the sensors L1 to L5 and one data matrix $A_R^{36,000 \times 5}$ for the sensors R1 to R5 (fig. 2b) for each passage, without any further signal processing steps. Additionally, two data matrices $L_L^{n_a \times 5}$ and $L_R^{n_a \times 5}$ ($n_a$: number of axles), containing the calculated indices at which an axle is at the respective sensor, are created.
The complete data set as well as the processing code is available online [23].

Data Transformation
Transforming a signal into the frequency-time domain enables the localisation of frequency content in time [4]. In our case, low-frequency effects such as the bridge natural vibration are separated from high-frequency effects such as measurement noise in the frequency domain, while the time domain is preserved. Therefore, the model can learn frequency-specific information, which should lead to faster training and more reliable results.
The most common choices for a frequency-time domain transformation are Short Time Fourier Transformation (STFT) and Continuous Wavelet Transformation (CWT). The multi-resolution approach of the CWT is particularly useful for complex signals, since it adapts the window size to the frequency [26]. The STFT has a fixed resolution, which means that there is always a trade-off between a good time resolution and a good frequency resolution, depending on the window size [4]. As a result, we have chosen the CWT because it is more suitable for the analysis of acoustic and visual signals than the windowed Fourier transform [7]. The CWT has also been shown in previous work to be an effective tool for axle detection [6,16,34,36,37].
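To illustrate how a CWT separates a low-frequency oscillation from a short high-frequency event while keeping the time axis, the following sketch computes wavelet power by direct convolution. It is not the PyWavelets-based pipeline used in this work: the complex Morlet wavelet, the parameter `w0 = 6`, the truncation at four scales and the synthetic test signal are all assumptions chosen for illustration.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (small zero-mean correction term omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_power(signal, fs, freqs, w0=6.0):
    """Return power[i][n] = |W(s_i, n)|^2 for the centre frequencies in freqs (Hz)."""
    n_samples = len(signal)
    power = []
    for f in freqs:
        scale = w0 / (2.0 * math.pi * f)    # scale in seconds for centre frequency f
        half = math.ceil(4.0 * scale * fs)  # truncate the wavelet at +-4 scales
        kernel = [morlet(k / (fs * scale), w0).conjugate() / math.sqrt(scale)
                  for k in range(-half, half + 1)]
        row = []
        for n in range(n_samples):
            acc = 0j
            for j, weight in enumerate(kernel):
                idx = n + j - half
                if 0 <= idx < n_samples:
                    acc += signal[idx] * weight
            row.append(abs(acc / fs) ** 2)
        power.append(row)
    return power


def demo_signal(fs=600.0, duration=1.0):
    """7 Hz oscillation over the whole record plus a 64 Hz burst from 0.4 s to 0.6 s."""
    out = []
    for n in range(int(fs * duration)):
        t = n / fs
        x = math.sin(2.0 * math.pi * 7.0 * t)
        if 0.4 <= t <= 0.6:
            x += math.sin(2.0 * math.pi * 64.0 * t)
        out.append(x)
    return out
```

Calling `cwt_power(demo_signal(), 600.0, [7.0, 64.0])` yields one row per frequency: the 64 Hz row is large only inside the burst, whereas the 7 Hz row responds to the slow oscillation. The fixed window of an STFT could not adapt its length to both frequencies in this way.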
From the signals, a section from 150 samples before the first axle to 500 samples after the last axle was further processed and transformed with the PyWavelets package [20] with empirically determined settings (tab. 1). To determine the transformation settings, the CWTs were visualised and analysed for correlations between the axle positions (cyan dotted line) and the power of the transformed signal (fig. 6). We find that in the range of the bridge's natural frequency of about 6.9 Hz for the first bending mode (fig. 6 left column), mainly the influence of the bridge on the vibration is visible, while a correlation between the train axles (dashed cyan lines) and the signal does not seem to be present. In the higher frequency range, a correlation becomes clearer, indicating that the influence of the axles is mainly located in the 64 Hz range (fig. 6 right column).
We assume, however, that it is nevertheless advantageous for the model to receive both pieces of information (influence of the bridge and of the axles) in order to be able to distinguish them better.
As a result, all 6 transformations were used in combination (fig. 6b-6g). To create the final model inputs, each signal (per passage and per sensor) was transformed according to our 6 settings. Afterwards, the transformations were normalised independently and stacked into a three-dimensional array $T^{n_s \times n_f \times n_t}$ ($n_s$: number of samples, $n_f$: number of frequencies/scales and $n_t$: number of transformations).

Model Definition
For a Virtual Axle Detector (VAD) to be efficient, a model with a flexible input length (in the time domain) is essential to account for the large differences in velocities and train lengths. Therefore, we have developed a Fully Convolutional Network (FCN) [22], which only uses layers that are independent of the input size, such as convolution, pooling or batch normalization. Our model has been developed to output a single value between 0 and 1 for each sample of the input. These output values represent the model's certainty that an axle is at the x-ordinate of the respective sensor.
Our developed VAD model is based on the U-Net architecture originally proposed by Ronneberger et al. [29], which was developed for semantic segmentation tasks. Here, the goal is to classify each pixel of the input image individually to preserve the resolution from the input. For the U-Net, the resolution of the input is halved 4 times (via max pooling) in the encoder path and then doubled again 4 times (via transposed convolution) in the decoder path.  In addition, the intermediate results before each pooling layer are appended to the intermediate results after the transposed convolution layer with the same resolution and then processed together.
In our case, not each pixel but each sample is to be classified, thus reducing the resolution in the frequency domain to 1. We achieve this by increasing the resolution in the decoder path only in the time dimension ( fig. 7) by using a transposed convolution layer with a kernel size of 3 × 1. Before the intermediate results from the encoder path can be appended to the intermediate results from the decoder path, its resolution and number of feature maps are adapted. Each purple arrow in fig. 7 consists of a reshape layer to reduce the frequency domain to 1 value, and a convolution layer with 1 × 1 kernel size to adapt the number of feature maps.
The convolution blocks (CBs) consist of a batch norm layer and a convolution layer with Rectified Linear Unit (ReLU) activation [11]. The CBs in fig. 7 have a 3 × 3 kernel size. The residual blocks originally proposed by He et al. [12] were implemented consisting of 3 CBs in the filtering path and 1 CB in the skip connection. Here, the second CB in the filtering path has a 3 × 3 kernel size, while the other CBs have a 1 × 1 kernel size. The results of the filtering path and the skip connection are added element-wise before they are further processed. Like the U-Net [29], our model has 4 pooling steps. We can therefore input transformed signals of any length (in the time domain) as long as the length is divisible by 16, since the resolution must remain an integer after being halved 4 times. For lengths that are not multiples of 16, the signal is padded with zeros and thus extended by a maximum of 15 samples.
The last layer is a convolution layer with a single kernel of size 3 × 3 with sigmoid activation. Therefore, the resulting outputs can be interpreted as independent pseudo probabilities p, which indicate the predicted likelihood of a certain class per sample. The resulting model accepts an arbitrary number of samples (padded to a multiple of 16), an arbitrary number of signal transformations and 16 frequencies, evenly spaced from minimum to maximum scale. The TensorFlow library [1] was used for the implementation of the model and PlotNeuralNet [15] for visualising it.
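The length constraint can be made concrete with a small helper. This is a minimal sketch; the function names are hypothetical, and appending the zeros at the end of the signal is an assumption, since the text does not state on which side the padding is applied.

```python
def padded_length(n_samples, n_pooling=4):
    """Smallest length >= n_samples that is divisible by 2**n_pooling."""
    factor = 2 ** n_pooling
    return -(-n_samples // factor) * factor  # ceiling division, then scale back


def pad_with_zeros(signal, n_pooling=4):
    """Append zeros so that the length stays an integer after n_pooling halvings."""
    n_missing = padded_length(len(signal), n_pooling) - len(signal)
    return list(signal) + [0.0] * n_missing
```

For example, a signal of 1009 samples is extended to 1024 samples, i.e. by the maximum of 15 samples, while a signal of 1008 samples is left unchanged.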

Loss Function
We have defined the localisation task as a supervised classification problem instead of a regression problem in order to minimise complexity and maximise comprehensibility. We have labeled each sample with one of the following classes: Axle at the same x-ordinate as the sensor (1) or not (0).
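The labelling scheme and the resulting class imbalance can be illustrated as follows. The sketch assumes that exactly one sample per axle is labelled 1; the function names and the example indices are hypothetical.

```python
def make_labels(n_samples, axle_indices):
    """Per-sample binary labels: 1 where an axle is at the sensor, else 0."""
    labels = [0] * n_samples
    for idx in axle_indices:
        if 0 <= idx < n_samples:
            labels[idx] = 1
    return labels


def all_zero_accuracy(labels):
    """Accuracy of a trivial model that never predicts an axle."""
    return labels.count(0) / len(labels)
```

With 4 axles in a section of 1000 samples, `all_zero_accuracy` already returns 0.996, which shows why neither accuracy nor an unweighted loss is informative for this task.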
A common loss function for a binary classification task is Cross Entropy (CE), but for imbalanced datasets Focal Loss (FL) has been shown to be more effective [21]. In our case, the total number of axles of a train is almost negligible compared to the total number of samples of a passage. So if the model predicted all values to be 0 (and thus did not locate any axle), it would already achieve an almost perfect CE loss and would learn to ignore the axles. This brings us to the hypothesis that FL is necessary to achieve good results. The FL is defined as follows [21]:
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
where $p_t$ is defined as follows:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$
In the above, $p \in [0, 1]$ is the model's estimated probability for the class 1, $y$ is the ground-truth class and $\gamma$ is the focusing parameter. The equation of FL consists of $-\log(p_t)$, which is equal to the CE, and $(1 - p_t)^{\gamma}$, which is a newly introduced modulating factor whose strength is controlled by the focusing parameter $\gamma$. The larger $\gamma$, the more significant is the effect of the modulating factor, and with $\gamma = 0$ the FL corresponds to the CE [21].
Due to the gamma value, the modulating factor is included exponentially in the equation. As a result, the loss becomes exponentially smaller the better the prediction.
For misclassified examples, the loss is nearly unaffected compared to CE, so that misclassifications are weighted much more heavily relative to well-classified examples (a factor of 1000 and more is possible [21]).
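The behaviour of the loss can be checked numerically with a per-sample implementation of the FL definition from [21]. This is a plain-Python sketch for illustration and not the TensorFlow loss used for training; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math


def focal_loss(p, y, gamma=2.5, eps=1e-12):
    """Binary focal loss for one sample; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # clip to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(p, y, gamma=0.0, eps=eps)
```

For a well-classified sample with p_t = 0.99 and γ = 2, the modulating factor is 0.01² = 10⁻⁴, so the FL is 10,000 times smaller than the CE. For a misclassified sample with p_t = 0.1 and γ = 2.5, the factor is about 0.77, so the loss stays close to the CE.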

Evaluation Metrics
The loss function itself does not contain information about the number of correctly detected axles. Other metrics are needed to assess the overall performance of the VAD. Accuracy as a metric is also insufficient to draw a conclusion about the model's performance, due to the imbalance of our dataset. A prediction containing no axles at all would reach an accuracy of about 99% and would therefore not contain useful information. Precision and recall are suitable metrics for imbalanced data sets [11], but they only take into account binary results and not the distance between prediction and ground truth. Due to the high sampling rate and the uncertainty of the labels described in section 2, however, we want to recognise axle predictions within a few samples of the ground truth as correct and measure the temporal error.
Since the model output does not always give clear results, but sometimes also a number of smaller peaks, the prediction must be processed further (fig. 8). In order to ignore small values and only continue with plausible predictions, peaks are extracted in post-processing with the find_peaks function from SciPy [31]. In order to get consistent and satisfying results, we have fine-tuned the following parameters of the function: minimum height of the peak (0.25), minimum distance between two peaks (20 samples) and prominence of the peak compared to the surrounding points (0.15). We have calculated the minimum distance $d$ between two peaks with an assumed minimum wheel distance $\Delta w_{min} = 2$ m and the maximum velocity $v_{max} = 220$ km/h as follows:
$$d = \frac{\Delta w_{min}}{v_{max}} \cdot f_s = \frac{2\ \text{m}}{61.1\ \text{m/s}} \cdot 600\ \text{Hz} \approx 20\ \text{samples} \quad (4)$$
A threshold is used to ensure that only predictions within a certain temporal error compared to the ground truth are considered correct. For example, the threshold could classify predicted axles as correct with a maximum temporal error of 30 milliseconds compared to the ground truth. Depending on the application, its requirements may be decisive for the determination of the threshold. In general, it should be taken into account that good results cannot be expected with thresholds that are lower than the label and measurement accuracy. To avoid making too strict assumptions, we have chosen the largest reasonable threshold of 20 samples (eq. 4) for the first evaluations. After the peaks found have been classified as correct or incorrect, they are further evaluated using the following metrics: precision, recall and $F_1$ score, which is the harmonic mean of precision and recall.
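The evaluation described above can be sketched as follows. The peak extraction itself (SciPy's find_peaks) is not repeated here; the sketch starts from already extracted peak indices. The greedy nearest-neighbour matching and all function names are assumptions for illustration, as the text does not specify how predictions are assigned to ground-truth axles.

```python
def min_peak_distance(delta_w_min=2.0, v_max_kmh=220.0, f_s=600.0):
    """Minimum distance between two axle peaks in samples."""
    return round(delta_w_min / (v_max_kmh / 3.6) * f_s)


def match_peaks(predicted, truth, threshold=20):
    """Greedily match each ground-truth index to the nearest unused prediction.

    Returns (true_positives, false_positives, false_negatives, errors), where
    errors holds the absolute temporal error in samples of every match.
    """
    unused = sorted(predicted)
    errors = []
    for t in sorted(truth):
        if not unused:
            break
        best = min(unused, key=lambda p: abs(p - t))
        if abs(best - t) <= threshold:
            errors.append(abs(best - t))
            unused.remove(best)
    tp = len(errors)
    return tp, len(predicted) - tp, len(truth) - tp, errors


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1 score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With predictions at samples 100, 131 and 500 and ground-truth axles at 102, 130 and 300, two axles are matched with errors of 2 and 1 samples, giving precision, recall and F1 score of 2/3 each.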

Optimization of γ
In order to find an optimal γ value for the FL, we performed a parametric study with 150 epochs per run, 150 steps per epoch and 16 samples per batch. We have split the dataset randomly with 70% for training, 20% for validation and 10% for testing. To ensure comparability, the same random state was used for every run. The selection criterion for γ is the $F_1$ score, because a high $F_1$ score indicates a high value for both recall and precision.
We confirmed our hypothesis that our dataset is too unbalanced for standard loss functions like Cross Entropy. Model training with small γ values of 0 and 0.5 ended in dead ReLUs after 8 or 9 epochs, so the resulting models are unusable. However, the modulating factor should also not be weighted too high to achieve the best performance. The relationship between γ, precision and recall can be described as a trade-off between detecting too many axles and detecting too few axles (fig. 9). The γ values of 2, 2.5 and 3 achieved the highest $F_1$ score on the validation set. In order to decide which γ value to use for the final evaluation, we have trained the model with these γ values in a second run for 300 epochs. In the second run, the γ value of 2.5 achieved the highest $F_1$ score (tab. 2) and is therefore kept for testing. Since the results of the γ values are close to each other and the middle γ value performed best, we assume that the optimal value has been found.
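A reproducible split of this kind can be sketched with the standard library. The function name, the seed value and the rounding rule (remainder assigned to the test set) are assumptions; the text only states the proportions and that the random state was fixed.

```python
import random


def split_passages(n_passages, seed=0, train_pct=70, val_pct=20):
    """Shuffle passage indices reproducibly and split them into train/val/test."""
    indices = list(range(n_passages))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_train = n_passages * train_pct // 100
    n_val = n_passages * val_pct // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])
```

For the 3745 usable passages, this rounding yields 2621 training, 749 validation and 375 test passages.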

Results and Discussion
The test set consists of 375 train passages with 13,480 axles in total. There are 10 acceleration sensors for which the individual crossing times are to be determined, resulting in 134,800 times to be localised. On the test set, for a threshold of 20 samples, the VAD with a γ value of 2.5 achieved an $F_1$ score of 0.938, a recall of 0.946 and a precision of 0.941. Thus, 126,449 of 134,800 crossing times were localised correctly with a maximum error of 0.033 seconds. On average, the predicted axle times had a temporal error of 1.16 samples (0.002 s) compared to the ground truth, with a standard deviation of 3.06 samples (0.005 s).
Based on the distances between the sensors, we are able to convert the error from samples (temporal) to meters (spatial). In order to examine the spatial error more closely, we have chosen 3 threshold values: 20 cm, 37 cm and 200 cm, the latter being the minimum wheel distance. We have calculated the precision and recall per passage and sensor for each threshold to examine the distribution of the metrics in more detail, resulting in 3750 values per threshold and metric (fig. 11). The differences between the results with thresholds of 20 cm and 37 cm are small, as even the 25% quantile stays above 85% for both metrics (fig. 11). Precision and recall for a threshold of 200 cm are much better, with even the 25% quantile above 96%, while the mean spatial error has worsened greatly to more than double the value of the other thresholds (tab. 3). Therefore, we conclude that 37 cm is the optimal threshold value to correctly evaluate the model's performance. Thus, we consider predictions with a spatial error above 37 cm as outliers. It should be possible to sort out such outliers in post-processing by comparison with known train configurations.
The evaluation of the test data took 335 seconds with an NVIDIA RTX 3090 for 375 passages and ten sensors. The model therefore needs 0.089 seconds per signal and, for our entire measurement setup, 0.89 seconds per passage. This allows a real-time application of the VAD and a flexible trade-off between accuracy and computing speed via the number of sensors used.
Compared to the work of Chatterjee et al. [6], who used FAD sensors and wavelets to detect more axles with the FAD, our model shows a comparable success rate in detecting axles. They were able to successfully evaluate 42/47 (about 89.4%) passages. Their mean absolute spatial error is about 10.6 cm, which is about three times as much as in our study. The achieved spatial accuracy in our study is still 1.4 times better compared to a study with FAD sensors in combination with an optimised mother wavelet and wavelet scale for the identification of axles [36]. Taking into account that we did not use FAD in our method and the velocities are about twice as high, this is a confirmation of our hypothesis that it is advantageous not to limit the analysis to certain mother wavelets and certain scales. In contrast to the method of Zhu et al. [37], due to our model architecture the VAD can be applied at any point of the bridge. This allows common SHM measurement setups to be used for axle detection without the need to attach additional sensors. The accuracy of the methods is similar. It should be noted that in all cases the detection of car axles is compared with that of train axles.
Table 3: Influence of the threshold on mean spatial error, $F_1$, precision and recall.

Conclusion
We were able to show that using our proposed method, no additional FADs or strain gauges are required on the main girders to realise a NOR-BWIM system. Instead, our method allows accelerometers at any point of the structure to be used as VADs.
We have shown that FCNs are able to detect axles using only acceleration measurements within a spatial accuracy of 37 cm with a precision of 93% and a recall of 91%. The mean of the absolute values of the spatial errors compared to the ground truth is about 3.9 cm. The results show that the method is able to detect the axles with a spatial error similar to that of the data used for labelling.
Even though the results show a higher accuracy compared to other studies with a different methodology, we assume that the accuracy for the determination of the vehicle configuration and velocity could be increased by the joint evaluation of several sensors, an increased model complexity, improved signal transformation or different measured variables such as strain and displacement. Enabling the method for other measured variables would also increase the number of use cases.
Finally, the most important issue is the generalisability of the model. Whether the model needs to be re-trained for each new application of the method, and if so whether with real or simulated data, will determine how efficiently it can be used. If retraining with real data is necessary, we suggest determining the axle positions during passages with the help of vehicles with known axle configuration and a DGPS.