Abstract
During flight, unmanned aerial vehicles (UAVs) rely on several sensors to follow a predefined path and reach a specific destination. To this aim, they generally exploit an inertial measurement unit (IMU) for pose estimation. In the UAV context, an IMU usually comprises a three-axis accelerometer and a three-axis gyroscope. However, as happens with many physical devices, these sensors can present a misalignment between the real value and the registered one. Such systematic or occasional errors can derive from different sources and may be related to the sensor itself or to external noise due to the place where it is located. Hardware calibration requires special equipment, which is not always available; moreover, it can only address the physical problem and sometimes requires removing the sensor from its location, which is not always feasible. Compensating for external noise, on the other hand, usually requires software procedures. In addition, as reported in the literature, even two IMUs of the same brand and from the same production chain can produce different measurements under identical conditions. This paper proposes a soft calibration procedure to reduce the misalignment created by systematic errors and noise, based on the grayscale or RGB camera built into the drone. Since the strategy relies on a transformer neural network architecture, trained in a supervised fashion on pairs of short videos shot by the UAV's camera and the corresponding IMU measurements, it does not require any special equipment. It is easily reproducible and can be used to increase the trajectory accuracy of the UAV during flight.
1. Introduction
Nowadays, navigation systems play a key role in different scientific and industrial application contexts, including robotics [1] and autonomous vehicles [2]. The results recently obtained in this field show remarkable improvements and precision levels. The localization and the pose estimation of a device can be achieved with multiple strategies and techniques [3,4,5]; in any case, the choice of the involved sensors remains a crucial aspect. For instance, the global positioning system (GPS) [6] is one of the most commonly used systems for localizing and tracking devices. Concurrently, the inertial measurement unit (IMU) [7] can also be exploited due to its adaptive nature: it is accurate enough to identify a person [8] and can even detect actions [9]. Moreover, the combination of multiple sensors [10] can often offer a noticeable improvement in terms of accuracy and precision. In the specific case of unmanned aerial vehicles (UAVs) [11], numerous tasks can be accomplished thanks to technological advancement, e.g., change detection [12,13], anomaly detection [14,15], and tracking [16,17]. However, there are still some open issues. Some of these concern incorrect values obtained from the UAV sensors, especially when the device is inexpensive. According to [18], for short paths, it is possible to ignore the drift caused by a low-cost IMU, meaning that the error is tolerable in real-life applications. However, this only holds when the distance to cover is short; for long distances, the error propagation can be highly relevant, implying an unmanageable drift in applications where precision is critical. In this context, machine learning (ML) and deep learning (DL) approaches are also feeding the research [19,20]. In the past years, transformer-based neural networks [21] caused a shockwave in the DL community due to their simplicity and state-of-the-art performance.
Transformers are employed in a wide variety of domains: in natural language processing [22,23,24], computer vision [25,26,27], audio processing [28], and UAV applications [29,30,31]. Although the transformer-based networks are currently the standard in many tasks, their downside lies in the energy drain [32], the required computational power, and the needed primary memory, which is quadratic in the number of inputs for attention-based transformers [21].
2. Related Work
To the best of our knowledge, no work in the literature addresses the analyzed task with the same approach proposed in this paper. However, numerous similar applications inspired the presented strategy for camera-based IMU self-calibration for UAVs.
One of the most relevant works considered in our analysis is [33], where the authors proposed an online IMU self-calibration method for visual–inertial systems equipped with a low-cost inertial sensor. It is specifically designed for unmanned ground vehicles (UGVs), but it involves both IMU and camera sensors, and it aims to locate the device as accurately as possible. In the first step, the IMU measurements and the camera frames are mapped. Since the image frequency is lower than that of the IMU updates, the IMU measurements between two frames are pre-integrated into a single compound measurement. An optimization of the IMU intrinsic parameters is proposed to minimize the errors constrained by the pre-integration measurements. This is achieved by computing a Jacobian matrix [34] for each pre-integration part and estimating the uncertainty of the pre-integration measurements. The final step consists in defining the residual of the IMU intrinsic parameters, which are assumed to be uncertain but constant within the time span of the sliding window. The proposed system can be classified as a monocular visual–inertial system (VINS-Mono) [35]. The evaluations on several datasets [36,37,38,39,40] show outstanding performance in terms of both tracking accuracy and computational times.
IMU/camera calibration strategies can also be applied in application areas [41] other than those entailing UGVs or UAVs, suggesting that this approach can be considered a valid procedure, even if it seems uncommon. In the specific field of UAVs, it is possible to find different techniques in the recent literature, from the fusion of sensors [42] to self-calibration [43,44]. For instance, in [45], the authors propose a novel method for monitoring a UAV with a custom-made positioning module that exploits a fusion algorithm. The work explains how the module should be calibrated to reduce the influence of deterministic errors on the IMU. The authors highlight five variables that are correlated to the noise factors on accelerometers and gyroscopes: direct measures, reference excitation acceleration, static bias, scale factor, and non-orthogonality. Based on these assumptions, the proposed method exploits a turntable platform to make a specific accelerometer axis point towards gravity to obtain an output excitation, and it also rotates at a constant angular rate around a specific gyroscope axis for the same aim. After the measured output and reference excitation have been acquired, a least-squares method calculates the parameters of the deterministic errors. Similarly, the deterministic errors of the magnetometer are influenced by multiple factors. The correlated variables are the measured output, the matrix that contains errors like soft-iron coefficients and non-orthogonality, the vector of static bias and hard-iron coefficients, and the environmental magnetic field distribution. The authors noticed that those parameters can be arranged into the classical vector form of an ellipsoid equation; through ellipsoid fitting, the geometry parameters can be acquired to calibrate the sensor. Concerning stochastic error identification, three noises have been highlighted: white noise, exponentially correlated noise, and rate random walk.
The authors treated them by exploiting the Allan variance method and its accuracy approximation. Then, with the system equations, the sixteen states, including attitude, velocity, position, and angle/velocity increment biases, are managed. A Kalman filter-based technique is introduced, in which measures and predictions support each other. The described steps lead to a fusion algorithm that combines all the sensors' data with the extended Kalman filter approach and provides a more accurate real-time attitude, velocity, and position of the UAV.
Another example of multi-sensor calibration for UAVs is provided in [46], where the authors propose a correction method that involves different GPS sensors, a LIDAR, and an IMU. The pipeline comprises three branches: boresight, time delay, and lever arm calibrations. The lever arm calibration aims to determine the exact relative position of the sensors (multiple GPS, IMU, and LIDAR) with respect to each other; the authors provide a method for generating the coordinate transform of these sensors with respect to the center of the drone. The boresight calibration, instead, is used to determine the differences in the rotations of each sensor. The proposed algorithm aligns the extracted and pre-processed point cloud data from the LIDAR and compares them with the GPS signals (which are considered the ground truth). It is also specified that the calibration could be even more precise if the IMU were also involved. The time delay calibration deals with the calculation of error-corrected signals; in particular, it refers to the delay in the GPS sending and receiving messages. The solution consists in always adding the retrieved error to the location reported by the GPS receiver.
There are also works [47,48,49,50] in which the IMU calibration does not require any correlation with other sensors. As expected, most of the recent studies involve machine and deep learning-based techniques. For instance, in [51], the authors propose a calibration method based on deep learning for micro-electro-mechanical system (MEMS) IMU gyroscopes. The network calibrates the error from the raw data of the MEMS IMU and regresses the gyroscope data after the error compensation. During the training, the network output is used to estimate the quaternion and calculate the loss function. During the test phase, the obtained output is exploited for navigation to obtain the attitude and position of the carrier. The results show that the position and the attitude are significantly adjusted by the proposed calibration method, lending scientific soundness to self-calibration approaches in this application field. Another example is provided by [52], where an adaptive neuro-fuzzy inference system (ANFIS) is introduced to improve the effectiveness of low-grade IMUs by estimating and compensating for the measurement errors. The method combines an artificial neural network (ANN) with a fuzzy inference system (FIS) and is trained by adding noise to reference IMU data. The proposed solution shows a high correction impact on the considered dataset, highlighting the quality of the strategy.
Concerning the specific deep learning technique of transformers, there are only a few works in the literature featuring this architecture in a UAV-based calibration task. The networks proposed in [53], one based on a long short-term memory (LSTM) architecture [54] and another on a transformer [21], estimate the bias of the sensors involved in visual–inertial odometry in the complete absence of visual information. Moreover, ref. [55] proposes a transformer-based architecture to pre-process and clean the IMU measurements before they are even used in a flight. These cases suggest that transformers are promising in this application area and could provide noticeable improvements when correctly exploited.
3. Proposed Method
The model predicts the offset in the IMU of the drone given an input sequence of n frames with height h and width w, and a sequence of aligned IMU measurements containing the records of the accelerometer and the gyroscope, each along the three axes. The first issue involves synchronization, which can be achieved by exploiting the IMU and camera timestamps. However, it is worth considering that IMUs have a much higher sampling rate than cameras and that these timestamps may not be perfectly matched. To address this issue, the practical solution is to match each camera frame with the closest corresponding IMU measurement in time. This choice has two beneficial side effects: on the one hand, it reduces the number of inputs to the model, thereby reducing the computational demand; on the other hand, it ensures that IMU measurements are sampled at the same rate as the camera frames, which is a practical advantage since it makes the system scalable with the camera quality. To handle the multi-modal nature of the task, the model entails three blocks: a video reducer block (VRB) (Section 3.1) that reduces the input sequence of consecutive frames into an equally long sequence of token vectors; an IMU reducer block (IRB) (Section 3.2) that projects each IMU vector, containing the records of the accelerometer and gyroscope at each instant of time, into a sequence of token vectors; and a noise predictor block (NPB) (Section 3.3), which merges the information from the video and the IMU records and predicts the amount of offset in the latter. Figure 1 depicts the model's architecture. The source code of the presented strategy can be found at https://github.com/rom42pla/rgb2imu_error_estimation.
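The nearest-timestamp matching described above can be sketched as follows; the sampling rates, array names, and toy values are hypothetical, chosen only to illustrate the alignment step:

```python
import numpy as np

def match_imu_to_frames(frame_ts, imu_ts, imu_data):
    """For each camera frame, pick the IMU record whose timestamp is closest.

    frame_ts: (n,) frame timestamps in seconds
    imu_ts:   (m,) IMU timestamps in seconds (m >> n, higher sampling rate)
    imu_data: (m, 6) accelerometer + gyroscope records
    """
    # index of the closest IMU timestamp for every frame timestamp
    idx = np.abs(imu_ts[None, :] - frame_ts[:, None]).argmin(axis=1)
    return imu_data[idx]  # (n, 6): one IMU record per frame

# toy example: 3 frames at ~30 Hz, IMU at ~200 Hz
frames = np.array([0.000, 0.033, 0.066])
imu_t = np.arange(0.0, 0.08, 0.005)                    # 16 IMU samples
imu_x = np.arange(16 * 6, dtype=float).reshape(16, 6)  # fake records
aligned = match_imu_to_frames(frames, imu_t, imu_x)    # shape (3, 6)
```

After this step, both modalities have exactly n time steps, which is what the VRB and IRB expect.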

Figure 1.
The architecture of the model. A dashed line between two modules means weight sharing between them. (a) Overview of the full model: the outputs of the video reducer block (VRB) and of the IMU reducer block (IRB) are summed and passed to the noise predictor block, which outputs the final offset estimation. (b) Overview of the video reducer block (VRB): the input frames are unfolded into patches that are projected into token vectors and processed by two transformer encoders to output a token for each frame. (c) Overview of the IMU reducer block (IRB): the input IMU measurements are projected into a d-dimensional space and then processed by a transformer encoder to output a representation of the measurements. (d) Overview of the noise predictor block (NPB): the tokens obtained from the sum of the outputs of the VRB and the IRB are passed through a transformer encoder, whose results are fed to a transformer decoder along with a query matrix. The output is the predicted offsets.
3.1. Video Reducer Block
This module, whose architecture is shown in Figure 1 and is partially inspired by [26,27], takes as input a sequence of n frames from a video and transforms each frame into a d-dimensional vector, called a token, that can be merged with those returned by the IRB for the final prediction. In this context, the video can be grayscale (one channel) or RGB (three channels). The model is compatible with any combination of n, h, and w.
If a frame F is RGB, it is cast to a single channel via a small convolutional neural network (CNN), including a three-dimensional convolution, an activation function, and a batch normalization layer [56]:
To ensure the preservation of the original information of each image, the output of the CNN is summed with the grayscale version of F, as is usually done in residual networks [57]:
Consequently, each frame is unfolded into a series of p smaller patches. If h or w is not divisible by the corresponding patch size, the frames are 0-padded (in this context, image padding consists in adding extra pixels to reach a shape required by an operation; in a 0-padded image, the added pixels have an intensity of 0 on each channel) until they fit a legal size for the unfolding operation. Each patch is then flattened into a vector and projected by a feed-forward network (a feed-forward network typically consists of a sequence of one or more linear layers and activation functions; other layer types, such as dropout [58] and normalization [56,59], can also be added, depending on the context) (FFN) into a d-dimensional token. This FFN entails a linear layer, also called fully connected, parameterized by a weight matrix and a bias vector, followed by an activation function and a dropout layer [58] for regularization purposes:
After applying this procedure individually to each of the n frames, the result is a sequence of token matrices, one per frame.
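The unfolding and flattening steps above can be sketched as follows, using NumPy in place of the actual deep learning framework; the frame shape and patch size are arbitrary examples:

```python
import numpy as np

def unfold_into_patches(frame, ph, pw):
    """Split a single-channel frame (h, w) into flattened (ph*pw) patches,
    zero-padding the borders when h or w is not divisible by the patch size."""
    h, w = frame.shape
    pad_h = (-h) % ph  # extra rows needed to reach a multiple of ph
    pad_w = (-w) % pw  # extra cols needed to reach a multiple of pw
    padded = np.pad(frame, ((0, pad_h), (0, pad_w)))  # pads with zeros
    H, W = padded.shape
    # reshape into a grid of patches, then flatten each patch to a vector
    patches = (padded.reshape(H // ph, ph, W // pw, pw)
                     .transpose(0, 2, 1, 3)
                     .reshape(-1, ph * pw))
    return patches  # (p, ph*pw)

frame = np.arange(5 * 7, dtype=float).reshape(5, 7)  # toy 5x7 frame
patches = unfold_into_patches(frame, 4, 4)           # pads to 8x8 -> 4 patches
```

In the real model, each flattened patch would then pass through the projection FFN to become a d-dimensional token.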
The token of each patch is then provided to a transformer encoder to exploit the spatial dependencies between the patches. For each frame, it outputs a single token bearing the meaningful features learned towards the task. The transformer encoder architecture is analogous to the one originally presented in [21]. This module is made of a stack of one or more transformer encoder layers, each containing a sequence of a multi-head attention (MHA) and an FFN with skip-connections. A skip-connection consists in using the output of a layer as input to another one placed later in the architecture; this method is commonly exploited in very deep networks to mitigate the vanishing gradient phenomenon [60]. The output of both the MHA and the FFN is further processed by a normalization layer [56,59]. A dot-product attention function takes three inputs, called query, key, and value, and is parameterized by three weight matrices:
where the three weight matrices are used to transform the inputs Q, K, and V into a single output. Therefore, the output of the attention function is a weighted sum of the values V based on the similarity between the query Q and the keys K. An MHA is a projection of the concatenation of the outputs of k dot-product attention operations, each parameterized by different weights. The concatenation of the attention outputs is projected into a smaller space by a learned weight matrix:
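A minimal NumPy sketch of the scaled dot-product attention described above; the learned projection matrices are random placeholders, and the head concatenation of the full MHA is omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, Wq, Wk, Wv):
    """Scaled dot-product attention: softmax(QWq (KWk)^T / sqrt(dk)) (VWv)."""
    q, k, v = Q @ Wq, K @ Wk, V @ Wv
    dk = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(dk))  # query/key similarities
    return weights @ v                        # weighted sum of values

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 tokens, model size 8
Wq = rng.standard_normal((8, 8))
Wk = rng.standard_normal((8, 8))
Wv = rng.standard_normal((8, 8))
out = attention(X, X, X, Wq, Wk, Wv)  # self-attention: Q = K = V = X
```

Calling `attention(X, X, X, ...)` mirrors the self-attention setting used inside the transformer encoder, where all three inputs are the same matrix.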
In a transformer encoder, the inputs of the attention functions are all equal to the input matrix (Q = K = V). In the VRB, there are multiple instances of the same transformer encoder. Each one is fed with the patches of a single frame, summed with a matrix of learned positional embeddings. This is required since the attention Formula (4) does not model positional information [61] and cannot distinguish between patches in different positions. The output of a transformer encoder is a matrix of the same shape as the input, from which the model extracts the first row, where all the meaningful information of the frame is squeezed, leading to a sequence of tokens:
This sequence is further summed with the learned temporal embeddings and fed to a transformer encoder to produce the output of the VRB. This allows analyzing the temporal relationships between the tokens:
3.2. IMU Reducer Block
This module, whose architecture is shown in Figure 1, takes as input a sequence of n IMU measurements and transforms each of them into a d-dimensional vector, called a token. Each measurement is time-aligned with the corresponding video frame. Although this paper considers six types of values in the formulations, the model is compatible with any combination of standard IMU measurements to accommodate any device. For instance, if records from both the accelerometer and the gyroscope are present in I, then its last dimension will be 6, since three values are for the axes of the accelerometer and three for the gyroscope. Meanwhile, as in the case of the experiments on the Zurich MAV dataset [39], if only the measurements of the axes of one of the two sensors are available, I will contain just the three values of that sensor.
The measurements I are processed by an FFN entailing two linear layers, which project these vectors into the same space as the remaining tokens in the model:
Similarly to the VRB, the projected tokens are summed with the learned positional embeddings and fed to a transformer encoder [21] to inject the temporal information into the inputs:
The module's output is a sequence of tokens that bear the learned features from the IMU measurements of the analyzed video.
3.3. Noise Predictor Block
This module, whose architecture is shown in Figure 1, is used to merge the tokens of both the VRB and the IRB. To this aim, a transformer encoder is employed:
As for the other transformer encoders in the model, learned positional embeddings are used to inform the module about the temporal relationships between the input tokens. Then, a sequence of a transformer decoder and an FFN is used to make the final predictions. In this context, a transformer decoder [21] is similar to a transformer encoder, entailing a stack of one or more transformer decoder layers. Differently from transformer encoders, it takes two matrices as input, called target and memory. A transformer decoder layer is made of a sequence of two multi-head attentions and an FFN with skip connections. While the first MHA takes the target matrix as input for the Q, K, and V parameters, the second one takes the target as Q and the memory as K and V, effectively injecting the target matrix with information from the memory. In this model, the memory is equal to the output of the previous encoder, representing information about the input frames and the IMU measurements. The target, instead, is a learned matrix representing the embedding of the query of the final offsets vector in a latent d-dimensional space. This transformer decoder outputs the embedding of the offsets vector, which is then projected into a 6-dimensional space by an FFN containing a single linear layer:
where the output vector contains the predicted offsets of the drone's IMU.
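A simplified, unbatched sketch of this last step may help: a single-row learned query attends over the fused video/IMU memory via cross-attention, and the resulting embedding is projected to six offset values. All weights here are random placeholders; the real model uses multi-head attention and trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 10
memory = rng.standard_normal((n, d))  # fused video+IMU tokens from the encoder
query = rng.standard_normal((1, d))   # learned target: one query for the offsets

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# simplified cross-attention: the target attends over the memory
weights = softmax(query @ memory.T / np.sqrt(d))  # (1, n), rows sum to 1
embedding = weights @ memory                      # (1, d) offsets embedding

# final linear head projecting to the 6 IMU offsets (placeholder weights)
W, b = rng.standard_normal((d, 6)), np.zeros(6)
offsets = embedding @ W + b                       # (1, 6) predicted offsets
```

The single-row query is what allows the decoder to compress an arbitrary-length token sequence into one fixed-size offsets vector.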
3.4. Training Strategy
The calibration problem is solved using a supervised learning paradigm. In this context, the system is trained to approximate a function given a training dataset composed of labeled samples. The learned function estimates the label of unknown samples, minimizing a prediction error called loss [62]. In order to be processed by the model, the inputs are segmented into fixed-duration windows, each representing a sample of the training or the validation set. Each sample is a tuple made of a sequence of n grayscale or RGB frames and their corresponding time-aligned IMU measurements. The goal of the model is to estimate the amount of noise in the input signal to correct the misalignment of the sensor. To this aim, we combine the input signal with a generated noise that must be plausible for a real-world IMU. Let the noise vector represent, in this context, the offsets of the IMU. This vector is randomly sampled from a Gaussian distribution whose parameters are derived, for each of its elements, from the mean and standard deviation of the corresponding elements of all the IMU measurements in the dataset, with the standard deviation scaled by a noise multiplier that has been set to 2 in the experiments. In other words, each position of the vector is randomly filled with a value taken from a symmetric interval around the corresponding mean. At training time, the noisy IMU measurements are computed as follows:
where n is the number of frames and IMU measurements in the sample. During training, a different noise vector is generated for each sample to increase data variability and, consequently, robustness. During validation, instead, the noise is generated once and applied to all the samples in the test set. This choice guarantees the fairness of the results since, in real-world scenarios, when an IMU is miscalibrated, this displacement is constant.
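Under one plausible reading of the noise-generation procedure above (the exact distribution parameterization is an assumption for illustration, and the data are synthetic), the offset sampling and injection can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(42)
imu = rng.standard_normal((1000, 6))  # toy dataset of IMU records (accel+gyro)

mu = imu.mean(axis=0)     # per-channel mean over the whole dataset
sigma = imu.std(axis=0)   # per-channel standard deviation
m = 2.0                   # noise multiplier, as in the paper

# one offset per channel, drawn from a Gaussian centered on the channel mean
# with the standard deviation scaled by the multiplier (assumed formulation)
epsilon = rng.normal(mu, m * sigma)  # (6,)

window = imu[:10]                    # a sample of n = 10 measurements
noisy_window = window + epsilon      # the same offset is added to every record
```

Note that the offset is constant within a window, matching the paper's assumption that a miscalibrated IMU exhibits a fixed displacement.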
The model M, described in Section 3, is then fed with the frames and the noisy IMU values and aims to predict an estimation of the offsets. It is trained to minimize a loss function between the predicted and the ground-truth offsets:
where the predicted offsets vector is a shorthand for the output of the model applied to the frames and the noisy IMU values. The optimizer of the proposed model is AdamW [63], a modified version of Adam [64] including decoupled weight decay. It iteratively adjusts the model's weights via gradient descent to minimize Equation (13). The loss function used to train the network is the sum of three terms based on the mean squared error (MSE) between the predicted offsets and the ground-truth offsets. The mean squared error function is defined in Equation (14) and returns a real number given two input vectors of the same shape:
The objective is to output an approximation of the noise that minimizes the dissimilarity between the prediction and the ground truth; thus, the first term is the MSE between them:
However, the use of this loss alone made the model learn values near the mean of the distribution of the offsets, which is around 0. Two auxiliary losses have therefore been added to enforce the prediction of more meaningful values. The first enforces the modulus, also called magnitude, of each element of the prediction to be as near as possible to the magnitude of the corresponding element of the ground truth:
This penalizes the model's stagnation around low values when the ground truth is high, solving one of the problems. To enforce the correctness of the sign of each element of the prediction, a sign term is used:
Usually, if the sign of a value is wrongly predicted, the error is directly proportional to the expected magnitude. Thanks to the sign term, which can be seen as an uncertainty indicator, the model can automatically regulate itself in case of uncertainty and be less bold in predicting high magnitudes. The final loss function is defined as follows:
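The three-term objective can be sketched as follows; note that the exact formulation of the sign term is not fully reproduced here, so the version below, which weights sign mismatches by the squared ground-truth magnitude, is an illustrative assumption:

```python
import numpy as np

def mse(a, b):
    return ((a - b) ** 2).mean()

def total_loss(eps_hat, eps):
    """Sketch of the three-term objective: value, magnitude, and sign terms."""
    l_value = mse(eps_hat, eps)                # MSE on the raw offsets
    l_mag = mse(np.abs(eps_hat), np.abs(eps))  # MSE on the magnitudes
    # assumed sign term: penalty proportional to the expected magnitude
    # whenever the predicted and true signs disagree
    sign_mismatch = (np.sign(eps_hat) != np.sign(eps)).astype(float)
    l_sign = (sign_mismatch * eps ** 2).mean()
    return l_value + l_mag + l_sign

eps = np.array([0.5, -1.0, 0.2, -0.3, 0.8, -0.1])
perfect = total_loss(eps, eps)  # a perfect prediction yields zero loss
```

A prediction stuck at zero would still incur the value, magnitude, and sign penalties, which is exactly the degenerate behavior the auxiliary terms are meant to discourage.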
4. Experiments and Discussion
4.1. Datasets
The proposed strategy was tested on three datasets to demonstrate its effectiveness. This section highlights their main characteristics.
The EuRoC Micro Aerial Vehicle dataset (https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets), here referred to as EuRoC MAV [37], contains data from 11 indoor experiments in an industrial environment. The videos are available at a resolution of pixels with a single channel. Besides the onboard sensors’ records, the dataset comes with a point-cloud representation of the scenarios obtained from external LIDAR sensors.
The University of Zurich First-Person View dataset (https://fpv.ifi.uzh.ch/datasets/), here referred to as UZH-FPV [65], is a collection of 28 flight experiments recorded with a grayscale camera in different environments and with different setups for the drone. The scenarios are indoor and outdoor, and the camera is pointed either forward or at a 45° angle downwards. Each video has been recorded at a resolution of pixels, while the IMU data are composed of accelerometer and gyroscope records.
The Zurich Urban Micro Aerial Vehicle dataset (https://rpg.ifi.uzh.ch/zurichmavdataset.html), here referred to as Zurich MAV [39], is composed of a single, long record of the states of a drone flying in an urban scenario. The flight covers around 2 km at an altitude varying from 5 to 15 m. The provided data include records from the on-board IMU, Google Street View snapshots of the path, and the frontal camera acquisitions. In particular, the video has been shot in RGB with three channels.
4.2. Metrics
This section introduces the metrics used in the result evaluation provided in Section 4.3 to quantify the model's error when estimating the offsets in the IMU. The root mean squared error (RMSE) is a common metric to quantify the error in regression tasks. Given a vector of predicted values and a vector of ground-truth values of the same shape, this error is computed as the square root of the mean squared error (described in Equation (14)):
RMSE is usually preferred to the MSE since its scale is the same as that of the input vectors instead of its square, which also makes the value easier to interpret. The F1 score is a standard metric to quantify the error in classification tasks. This measure is a particular case of the more general Fβ measure, which expresses the classification error in terms of two other metrics called precision and recall:
where TP, FP, and FN are the true positives, false positives, and false negatives, respectively. They are calculated by comparing a vector of predicted values and a vector of correct values. The classification task in this paper is binary, meaning that there are two possible classes for each element: positive and negative signs. In this setting, a TP represents an element that is correctly classified; an FP, also called a Type 1 error, represents a negative element wrongly classified as positive; and an FN, also called a Type 2 error, represents a positive element wrongly classified as negative. However, precision does not take into account the false negatives, while recall does not consider the false positives. Given that the two measures are complementary, Fβ is a way to consider them both and is defined as:
with β denoting the impact of the recall with respect to the precision. In this paper, the β = 1 version of Fβ is used, called the F1 score. This value of β is a common choice in many classification tasks and leads to equal consideration of the two measures:
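The two metrics can be computed as in the following sketch, where the sign-classification F1 follows the binary definition given above and the toy vectors are illustrative:

```python
import numpy as np

def rmse(pred, true):
    """Root mean squared error between two same-shape vectors."""
    return np.sqrt(((pred - true) ** 2).mean())

def f1_score(pred_signs, true_signs):
    """Binary F1 over predicted offset signs (positive vs. non-positive)."""
    tp = np.sum((pred_signs > 0) & (true_signs > 0))
    fp = np.sum((pred_signs > 0) & (true_signs <= 0))  # Type 1 errors
    fn = np.sum((pred_signs <= 0) & (true_signs > 0))  # Type 2 errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall (beta = 1)
    return 2 * precision * recall / (precision + recall)

pred = np.array([0.4, -0.2, 0.1, -0.5])
true = np.array([0.5, -0.1, -0.2, -0.4])
err = rmse(pred, true)
f1 = f1_score(np.sign(pred), np.sign(true))
```

In practice, the RMSE is applied to the regressed offset magnitudes while the F1 evaluates the predicted signs, mirroring the split evaluation used in the results tables.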
4.3. Results and Discussion
In the following, a trial denotes a full training and validation procedure. Two kinds of validation schemes [66] have been adopted for the experiments. The k-fold cross-validation scheme has been used in the trials on the Zurich MAV dataset, since it is composed of a single video/experiment. In this scheme, the windows/samples are partitioned into k groups called folds, and k different trials are performed, using a different fold each time as the validation set and the rest as the training set, averaging the results over the trainings. The Leave-One-Out (LOO) cross-validation scheme, used in UZH-FPV and EuRoC MAV, is a particular case of k-fold cross-validation: for each trial, the validation fold is made of the samples from a single video/experiment, and the rest of the dataset forms the training set. The total number of reported trials is 49: 11 for EuRoC MAV, 28 for UZH-FPV, and 10 for Zurich MAV. Concerning the tables, it is worth pointing out that the most relevant parameter is the offset, since it represents the overall displacement between the IMU's measurements and the real values in a real-world scenario. The chosen size for the windows/samples is 10 since, as shown in Table 1, longer sizes yield better results. Increasing the amount of data provided to the architecture probably allows extracting more meaningful information from each sample. However, increasing the window size too much could negatively impact the results, since it reduces the number of available samples; at the same time, it significantly increases the amount of memory required for training, limiting the proposal's applicability.
Table 1.
Results of the experiments at various window sizes. The experiments have been performed on the UZH-FPV [65] dataset, the most challenging and varied among those analyzed in this work. Train and validation sets follow a fixed 80/20% split, chosen as described in Section 4.4. The stride of the windows is kept fixed to 1, alongside the seed. The best value in each result column is in bold.
The final configuration of the model has been inferred from the results of the ablation study (Section 4.4) and is the same for each trial and dataset. Table 2 presents a complete overview of the parameters. It is worth noticing that some parameters described in the table are not optimal according to the ablation study in Section 4.4, such as the number of attention heads and the size of the patches. The presented configuration is the one providing the best trade-off between regression and classification performance, as well as computation speed. The following lines describe the most meaningful parameters in more detail. The dropout amount has been set empirically, since bigger dropouts were found to be one of the major causes of performance drops among the model's parameters. The hidden size (referred to as d in Section 3) has been set to 256 since it provides the best trade-off between the F1 on sign prediction and the RMSE on magnitudes and offsets. The number of encoder and decoder layers in each transformer encoder and decoder has been set to 2, while the attention heads are 8. The noise multiplier, as already mentioned in Section 3.4, is the scalar factor applied to the standard deviation of the noise distribution when generating the offsets; it has been empirically set to 2, since it covers reasonable IMU offsets. The size of the patches (introduced in Section 3) has been chosen as a good compromise between smaller and bigger sizes in terms of performance, since bigger sizes require more epochs to converge, as shown in Section 4.4.
Table 2.
The configuration of the model used in the final evaluations.
The starting learning rate of the optimizer has been set as in other state-of-the-art works, although AdamW is adaptive and this parameter therefore has little impact on the final results. Additionally, videos are first resized to a fixed resolution; this operation provides a fixed frame height and width, and it also enables stacking multiple videos into batches. The experiment's timings and the hardware configuration are shown in Table 3.
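The text does not specify the resizing method, so the following is only an illustrative sketch of why a fixed resolution enables batching: a hypothetical nearest-neighbour `stack_videos` helper resizes every frame to a common (H, W) so that variable-resolution videos can be stacked into a single batch array:

```python
import numpy as np

def stack_videos(videos, target_hw):
    """Nearest-neighbour resize each frame to a fixed (H, W), then stack.
    Hypothetical helper: the paper's actual resizing method is not stated."""
    h, w = target_hw
    batch = []
    for vid in videos:                      # vid: (T, H_i, W_i) array
        _, hi, wi = vid.shape
        rows = np.arange(h) * hi // h       # source row index per target row
        cols = np.arange(w) * wi // w       # source col index per target col
        batch.append(vid[:, rows[:, None], cols[None, :]])
    return np.stack(batch)                  # (B, T, H, W)
```

Without the common (H, W), `np.stack` would fail on videos of differing resolution, which is why the resize is done before batching.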
Table 3.
Times and stats of the experiments. All the experiments have been run sequentially on a cluster node with 64 GB of DDR4 RAM, an AMD EPYC 7301 CPU, and an NVIDIA Quadro 6000 GPU with 24 GB of memory.
The model validation results can be seen in Table 4 and graphically in Figure 2, where sample predictions are shown for six samples of each dataset.
Table 4.
Results on the three datasets. Each numerical cell is composed of the mean and standard deviation between the best values of each validation split.
Figure 2.
Sample predictions of the model on the datasets. For each sample, the gyroscope measurements along the three axes are shown together with the corresponding accelerometer ones. (a) EuRoC MAV [37]; (b) UZH-FPV [65]; (c) Zurich MAV [39].
It is interesting to notice that the model is more confident when predicting the offsets on samples from EuRoC MAV and Zurich MAV, whose videos are taken in similar scenarios, while it is more uncertain when dealing with samples from UZH-FPV. In fact, the latter is the only dataset that contains both indoor and outdoor environments, as well as different camera angles (forward-facing and 45° downward-facing). Another possible reason may be the lower resolution of its images compared with those of EuRoC MAV and Zurich MAV. Although the proposed strategy re-scales all the frames to a common resolution as a pre-processing step, low-quality starting images may hinder the results.
4.4. Ablation Study
The ablation study consisted of a grid search over the optimal values of several hyper-parameters. It comprises a series of trials on the UZH-FPV dataset, each performed using the base configuration of the model, described in Section 4.3 and Table 2, with a single parameter varied at a time. LOO and k-fold cross-validation schemes would require training several instances of the same model on different splits of the dataset and averaging the results; however, from a computational point of view, this would be infeasible. We therefore compromised by keeping the training and validation sets fixed, training a single model once for each combination of parameters. The training and validation sets are, respectively, random 80% and 20% subsets of the windows in the dataset. This is a common proportion in the related works and is inspired by the Pareto principle [67]. The results of the ablation study are shown in Table 5 and graphically in Figure 3. Each row in Table 5 reports the results of a trial using a value different from the base configuration for a specific parameter. An early stopping mechanism has been implemented to stop a trial whenever the training loss does not improve for three consecutive epochs. The metrics used are the same as in Section 4.3: the sign prediction metric and the RMSE on offsets and magnitudes. The parameters that have been tested are the amount of dropout, the hidden size d, the number of attention heads in the MHA, the number of encoders or decoders in each transformer encoder and transformer decoder, the multiplier for the injected noise, and the dimension of the patches in the VRB.
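The trial procedure described above (one parameter varied at a time, a fixed split, and early stopping after three non-improving epochs) can be sketched as follows; `train_eval` and the grid contents are hypothetical placeholders, not the paper's actual training code:

```python
def ablation_grid(train_eval, grid, base, patience=3):
    """One-variation-at-a-time ablation: each trial changes a single
    parameter of the base configuration. `train_eval(cfg, epoch)` is a
    hypothetical callback returning the training loss for that epoch."""
    results = {}
    for name, values in grid.items():
        for v in values:
            cfg = dict(base, **{name: v})
            best, bad_epochs, epoch = float("inf"), 0, 0
            while bad_epochs < patience:     # early stopping on training loss
                loss = train_eval(cfg, epoch)
                if loss < best:
                    best, bad_epochs = loss, 0
                else:
                    bad_epochs += 1
                epoch += 1
            results[(name, v)] = best
    return results

# Hypothetical usage with a dummy objective that plateaus after epoch 5.
def _dummy(cfg, epoch):
    return max(0, 5 - epoch) + (0.1 if cfg["d"] == 128 else 0.0)

results = ablation_grid(_dummy, {"d": [128, 256]}, {"d": 256, "heads": 8})
```

Because the split and seed stay fixed, each trial's score differs from the base run only through the single parameter it changes.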
Table 5.
Results of the ablation study. Variations of several parameters of the model have been tested, and this table shows the results of the best score on the validation set of each trial. The best value in each result column for a particular parameter is in bold.

Figure 3.
Plots of the results of the ablation study on four parameters of the model. (a) Hidden size (d). (b) Number of layers in the encoders and the decoders. (c) Injected noise multiplier. (d) Size of the patches.
One of the most impactful parameters is the amount of dropout: as can be noticed from the table, for this kind of task, increasing the percentage of dropped connections markedly worsens the results. A smaller hidden size d reduces the overfitting but considerably stretches the training time, as can be seen in Figure 3. A hidden size equal to 512, the highest one tested, performs better on the sign metric than a hidden size of 256; however, the latter value improves the RMSE of both offsets and magnitudes. Models with 4 and 8 attention heads have been tested; the lower value outperforms the other on the sign metric and on the RMSE of offsets and magnitudes, and fewer attention heads also imply a lower computational demand. However, as happens for the hidden size and the number of layers, a higher number of attention heads leads to a shorter convergence time. Increasing the number of layers also decreases the convergence time, as shown in Figure 3. On average, a model with 4 layers improves the RMSE of both the offsets and the magnitudes. Considering the sign metric, 4 is not the best value; however, on average it improves the score and is only slightly worse than the best result, achieved by a model with 2 layers. A higher noise multiplier hinders the capability of the model to accurately distinguish between real signal and noise, as shown in Figure 3: the model performs slightly better on the two RMSEs with smaller multipliers, since these values represent more plausible noises. At the same time, since the estimation of the sign is a binary classification task, increasing the multiplier seems to help distinguish the two classes, as reflected by the gap on the sign metric.
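For reference, the two kinds of metrics discussed here could be computed as below; treating the sign metric as a standard F1 score is our assumption (the abbreviation list includes TP, FP, and FN), not something this excerpt states explicitly:

```python
import math

def rmse(pred, true):
    """Root mean squared error over paired offset predictions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def sign_f1(pred, true):
    """F1 on predicted offset signs (positive class = positive sign).
    Assumed formulation: the paper lists TP/FP/FN among its abbreviations
    but does not spell out the sign metric in this excerpt."""
    tp = sum(p > 0 and t > 0 for p, t in zip(pred, true))
    fp = sum(p > 0 and t <= 0 for p, t in zip(pred, true))
    fn = sum(p <= 0 and t > 0 for p, t in zip(pred, true))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

Separating the classification metric (signs) from the regression metric (offset/magnitude RMSE) matches how the ablation results are reported.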
The size of the patches, inversely proportional to the number of tokens in the first transformer encoder in the VRB, seems to work better when set to higher values, as can be seen in Figure 3. The smallest tested size performed worst on the sign metric and on the RMSE of both offsets and magnitudes. However, larger patches considerably increase the computational time and memory requirements; for this reason, we chose an intermediate size, as it represents a good compromise for a real application on a drone.
5. Conclusions
This work presents a novel neural network architecture for camera-based IMU calibration for UAVs. Such procedures are crucial in contexts where UAV navigation may rely only on a predetermined trajectory to reach the final destination (e.g., in war zones, where GPS signals are often jammed or spoofed) and, in particular, when the route to cover is long. The contribution of this proposal is three-fold. First, it does not require any special equipment or impractical hardware procedures to correct the offsets in the sensors, as may be necessary for other kinds of calibration strategies. Second, the proposed pre-trained model can be used directly on the UAV, making it available in a wider range of scenarios. Third, the algorithm is compatible with both grayscale and RGB images, accommodating the different cameras mounted on or integrated into diverse UAV models. Moreover, it does not require normalizing the input IMU measurements, minimizing the computational effort. More in detail, the algorithm is a transformer-based neural network that processes both a window of sequential frames recorded from the on-board camera and the IMU measurements to estimate any offsets of the latter. The model comprises three modules, called the video reducer block, the IMU reducer block, and the noise predictor block. The first two reduce the video and the sequence of IMU measurements to token sequences of the same shape; these are then summed and given to the third block to make the final offset prediction. The approach has been tested on three datasets that are commonly used in the literature for other tasks. Two of them (EuRoC MAV, UZH-FPV) have low-resolution, grayscale-only videos, while the other (Zurich MAV) has full-HD RGB images. Interestingly, the method does not behave much differently on grayscale or RGB images, making it almost camera-agnostic.
As expected, the uncertainties are lower in the two more static datasets and slightly higher in UZH-FPV, which features different environments and camera angles.
One of the open issues lies in the fact that the token-mixing parts of the transformer encoders and decoders are based on the standard attention mechanism, whose complexity scales quadratically with the number of inputs (e.g., the number of frames in the video and the number of patches in each image). Although this strategy can be considered the standard in transformer-based works, we plan to replace the token-mixing parts with attention-free mechanisms to alleviate the time and space requirements of the neural network. It is worth remembering that, since the offsets are usually small, an IMU calibration has a tangible impact mostly on long traversals. For this reason, one effective way to evaluate the proposed approach would be to analyze data collected from a UAV flying along a straight path for a considerable distance (e.g., 1 km) and compare the displacement between the predicted and actual final positions. However, to the best of our knowledge, no existing dataset contains data from UAV flights that meet these criteria. We plan to validate the proposed procedure in real-flight test scenarios, in which the UAV is required to follow a predetermined trajectory toward a destination.
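To illustrate the quadratic cost mentioned above: plain scaled dot-product attention materializes an n × n weight matrix, so doubling the token count quadruples its size. This NumPy sketch is generic and is not the paper's implementation:

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention. The (n, n) score matrix is
    what makes time and memory scale quadratically with sequence length."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ v, weights

n, d = 64, 16                                      # e.g., 64 patch tokens
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
out, w = attention(x, x, x)                        # w has n * n entries
```

Attention-free token mixers avoid building this n × n matrix, which is the motivation behind the planned replacement.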
Author Contributions
Conceptualization, D.A., R.L., M.R.M. and A.M.; methodology, D.A., L.C., G.L.F., R.L., M.R.M., A.M. and F.S.; software, R.L., M.R.M. and A.M.; validation, L.C., G.L.F. and F.S.; writing—original draft preparation, R.L., M.R.M. and A.M.; writing—review and editing, D.A., L.C., G.L.F. and F.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
This work was supported by the MIUR under grant “Departments of Excellence 2018–2022” of the Department of Computer Science of Sapienza University, the “Smart unmannEd AeRial vehiCles for Human likE monitoRing (SEARCHER)” project of the Italian Ministry of Defence (CIG: Z84333EA0D), the Departmental Strategic Plan (DSP) of the University of Udine—Interdepartmental Project on Artificial Intelligence (2020-25), the ERC Starting Grant no. 802554 (SPECGEO), and the PNRR project FAIR—Future AI Research (PE00000013) under the NRRP MUR program funded by NextGenerationEU.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| LOO | Leave One Out validation |
| UAV | Unmanned Aerial Vehicle |
| MAV | Micro Aerial Vehicle |
| LIDAR | Light Detection And Ranging |
| FPV | First Person View |
| GPS | Global Positioning System |
| RGB | Red, Green, and Blue channels |
| HD | High Definition |
| IMU | Inertial Measurement Unit |
| TP | True Positive |
| FP | False Positive |
| FN | False Negative |
| MSE | Mean Squared Error |
| RMSE | Root Mean Squared Error |
| FFN | Feed-Forward Network |
| CNN | Convolutional Neural Network |
| MHA | Multi-Head Attention |
| TE | Transformer Encoder |
| TD | Transformer Decoder |
| VRB | Video Reducer Block |
| IRB | IMU Reducer Block |
| NPB | Noise Predictor Block |
References
- Bonin-Font, F.; Ortiz, A.; Oliver, G. Visual Navigation for Mobile Robots: A Survey. J. Intell. Robot. Syst. 2008, 53, 263. [Google Scholar] [CrossRef]
- Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef] [PubMed]
- de Ponte Müller, F. Survey on Ranging Sensors and Cooperative Techniques for Relative Positioning of Vehicles. Sensors 2017, 17, 271. [Google Scholar] [CrossRef] [PubMed]
- Wu, Y.; Ta, X.; Xiao, R.; Wei, Y.; An, D.; Li, D. Survey of underwater robot positioning navigation. Appl. Ocean. Res. 2019, 90, 101845. [Google Scholar] [CrossRef]
- Tariq, Z.B.; Cheema, D.M.; Kamran, M.Z.; Naqvi, I.H. Non-GPS Positioning Systems: A Survey. ACM Comput. Surv. 2017, 50, 1–34. [Google Scholar] [CrossRef]
- Bajaj, R.; Ranaweera, S.; Agrawal, D. GPS: Location-tracking technology. Computer 2002, 35, 92–94. [Google Scholar] [CrossRef]
- Yuan, Q.; Chen, I.M. Localization and velocity tracking of human via 3 IMU sensors. Sens. Actuators A Phys. 2014, 212, 25–33. [Google Scholar] [CrossRef]
- Marsico, M.D.; Mecca, A. Biometric walk recognizer. Multimed. Tools Appl. 2017, 76, 4713–4745. [Google Scholar] [CrossRef]
- Steven Eyobu, O.; Han, D. Feature Representation and Data Augmentation for Human Activity Classification Based on Wearable IMU Sensor Data Using a Deep LSTM Neural Network. Sensors 2018, 18, 2892. [Google Scholar] [CrossRef]
- Avola, D.; Cinque, L.; Del Bimbo, A.; Marini, M.R. MIFTel: A multimodal interactive framework based on temporal logic rules. Multimed. Tools Appl. 2020, 79, 13533–13558. [Google Scholar] [CrossRef]
- Avola, D.; Cinque, L.; Fagioli, A.; Foresti, G.L.; Pannone, D.; Piciarelli, C. Automatic estimation of optimal UAV flight parameters for real-time wide areas monitoring. Multimed. Tools Appl. 2021, 80, 25009–25031. [Google Scholar] [CrossRef]
- Avola, D.; Cinque, L.; Foresti, G.L.; Martinel, N.; Pannone, D.; Piciarelli, C. A UAV Video Dataset for Mosaicking and Change Detection from Low-Altitude Flights. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 2139–2149. [Google Scholar] [CrossRef]
- Conforti, M.; Mercuri, M.; Borrelli, L. Morphological changes detection of a large earthflow using archived images, lidar-derived dtm, and uav-based remote sensing. Remote Sens. 2020, 13, 120. [Google Scholar] [CrossRef]
- Avola, D.; Cannistraci, I.; Cascio, M.; Cinque, L.; Diko, A.; Fagioli, A.; Foresti, G.L.; Lanzino, R.; Mancini, M.; Mecca, A.; et al. A Novel GAN-Based Anomaly Detection and Localization Method for Aerial Video Surveillance at Low Altitude. Remote Sens. 2022, 14, 4110. [Google Scholar] [CrossRef]
- Hamdi, S.; Bouindour, S.; Snoussi, H.; Wang, T.; Abid, M. End-to-end deep one-class learning for anomaly detection in uav video stream. J. Imaging 2021, 7, 90. [Google Scholar] [CrossRef] [PubMed]
- Avola, D.; Cinque, L.; Di Mambro, A.; Diko, A.; Fagioli, A.; Foresti, G.L.; Marini, M.R.; Mecca, A.; Pannone, D. Low-Altitude Aerial Video Surveillance via One-Class SVM Anomaly Detection from Textural Features in UAV Images. Information 2022, 13, 2. [Google Scholar] [CrossRef]
- Avola, D.; Cinque, L.; Diko, A.; Fagioli, A.; Foresti, G.L.; Mecca, A.; Pannone, D.; Piciarelli, C. MS-Faster R-CNN: Multi-Stream Backbone for Improved Faster R-CNN Object Detection and Aerial Tracking from UAV Images. Remote Sens. 2021, 13, 1670. [Google Scholar] [CrossRef]
- Örnhag, M.V.; Persson, P.; Wadenbäck, M.; Åström, K.; Heyden, A. Trust Your IMU: Consequences of Ignoring the IMU Drift. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 19–20 June 2022; pp. 4468–4477. [Google Scholar] [CrossRef]
- Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
- Munaye, Y.Y.; Lin, H.P.; Adege, A.B.; Tarekegn, G.B. UAV Positioning for Throughput Maximization Using Deep Learning Approaches. Sensors 2019, 19, 2775. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
- Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big Bird: Transformers for Longer Sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
- Zhang, Y.; Li, X.; Liu, C.; Shuai, B.; Zhu, Y.; Brattoli, B.; Chen, H.; Marsic, I.; Tighe, J. VidTr: Video Transformer without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 13557–13567. [Google Scholar] [CrossRef]
- Kolesnikov, A.; Dosovitskiy, A.; Weissenborn, D.; Heigold, G.; Uszkoreit, J.; Beyer, L.; Minderer, M.; Dehghani, M.; Houlsby, N.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Association for Computing Machinery: New York, NY, USA, 2020; Volume 33, pp. 12449–12460. [Google Scholar] [CrossRef]
- Xing, D.; Evangeliou, N.; Tsoukalas, A.; Tzes, A. Siamese Transformer Pyramid Networks for Real-Time UAV Tracking. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–6 July 2022; pp. 1898–1907. [Google Scholar] [CrossRef]
- Ye, J.; Fu, C.; Cao, Z.; An, S.; Zheng, G.; Li, B. Tracker Meets Night: A Transformer Enhancer for UAV Tracking. IEEE Robot. Autom. Lett. 2022, 7, 3866–3873. [Google Scholar] [CrossRef]
- Ghali, R.; Akhloufi, M.A.; Mseddi, W.S. Deep Learning and Transformer Approaches for UAV-Based Wildfire Detection and Segmentation. Sensors 2022, 22, 1977. [Google Scholar] [CrossRef] [PubMed]
- Parcollet, T.; Ravanelli, M. The Energy and Carbon Footprint of Training End-to-End Speech Recognizers. 2021. Available online: https://hal.science/hal-03190119/ (accessed on 22 December 2022).
- Xiao, Y.; Ruan, X.; Chai, J.; Zhang, X.; Zhu, X. Online IMU Self-Calibration for Visual-Inertial Systems. Sensors 2019, 19, 1624. [Google Scholar] [CrossRef]
- Henawy, J.; Li, Z.; Yau, W.Y.; Seet, G. Accurate IMU Factor Using Switched Linear Systems for VIO. IEEE Trans. Ind. Electron. 2021, 68, 7199–7208. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Carlevaris-Bianco, N.; Ushani, A.K.; Eustice, R.M. University of Michigan North Campus long-term vision and lidar dataset. Int. J. Robot. Res. 2016, 35, 1023–1035. [Google Scholar] [CrossRef]
- Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
- Pfrommer, B.; Sanket, N.; Daniilidis, K.; Cleveland, J. PennCOSYVIO: A challenging Visual Inertial Odometry benchmark. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3847–3854. [Google Scholar] [CrossRef]
- Majdik, A.L.; Till, C.; Scaramuzza, D. The Zurich urban micro aerial vehicle dataset. Int. J. Robot. Res. 2017, 36, 269–273. [Google Scholar] [CrossRef]
- Schubert, D.; Goll, T.; Demmel, N.; Usenko, V.; Stückler, J.; Cremers, D. The TUM VI Benchmark for Evaluating Visual-Inertial Odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1680–1687. [Google Scholar] [CrossRef]
- Qiu, D.; Li, S.; Wang, T.; Ye, Q.; Li, R.; Ding, K.; Xu, H. A high-precision calibration approach for Camera-IMU pose parameters with adaptive constraints of multiple error equations. Measurement 2020, 153, 107402. [Google Scholar] [CrossRef]
- Lee, Y.; Yoon, J.; Yang, H.; Kim, C.; Lee, D. Camera-GPS-IMU sensor fusion for autonomous flying. In Proceedings of the 2016 Eighth International Conference on Ubiquitous and Future Networks (ICUFN), Vienna, Austria, 5–8 July 2016; pp. 85–88. [Google Scholar] [CrossRef]
- Ren, C.; Liu, Q.; Fu, T. A Novel Self-Calibration Method for MIMU. IEEE Sens. J. 2015, 15, 5416–5422. [Google Scholar] [CrossRef]
- Hausman, K.; Weiss, S.; Brockers, R.; Matthies, L.; Sukhatme, G.S. Self-calibrating multi-sensor fusion with probabilistic measurement validation for seamless sensor switching on a UAV. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 4289–4296. [Google Scholar] [CrossRef]
- Huang, B.; Feng, P.; Zhang, J.; Yu, D.; Wu, Z. A Novel Positioning Module and Fusion Algorithm for Unmanned Aerial Vehicle Monitoring. IEEE Sens. J. 2021, 21, 23006–23023. [Google Scholar] [CrossRef]
- Sanjukumar, N.; Koundinya, P.N.; Rajalakshmi, P. Novel technique for Multi Sensor Calibration of a UAV. In Proceedings of the 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 2–4 October 2020; pp. 778–782. [Google Scholar] [CrossRef]
- Li, M.; Yu, H.; Zheng, X.; Mourikis, A.I. High-fidelity sensor modeling and self-calibration in vision-aided inertial navigation. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–5 June 2014; pp. 409–416. [Google Scholar] [CrossRef]
- Hwangbo, M.; Kim, J.S.; Kanade, T. IMU Self-Calibration Using Factorization. IEEE Trans. Robot. 2013, 29, 493–507. [Google Scholar] [CrossRef]
- Wu, Y.; Goodall, C.; El-Sheimy, N. Self-calibration for IMU/odometer land navigation: Simulation and test results. In Proceedings of the 2010 International Technical Meeting of The Institute of Navigation, Portland, OR, USA, 21–24 September 2010; pp. 839–849. [Google Scholar]
- Yang, Y.; Geneva, P.; Zuo, X.; Huang, G. Online Self-Calibration for Visual-Inertial Navigation Systems: Models, Analysis and Degeneracy. arXiv 2022, arXiv:2201.09170. [Google Scholar]
- Huang, F.; Wang, Z.; Xing, L.; Gao, C. A MEMS IMU Gyroscope Calibration Method Based on Deep Learning. IEEE Trans. Instrum. Meas. 2022, 71, 1–9. [Google Scholar] [CrossRef]
- Mahdi, A.E.; Azouz, A.; Abdalla, A.; Abosekeen, A. IMU-Error Estimation and Cancellation Using ANFIS for Improved UAV Navigation. In Proceedings of the 2022 13th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 29–31 March 2022; pp. 120–124. [Google Scholar] [CrossRef]
- Buchanan, R.; Agrawal, V.; Camurri, M.; Dellaert, F.; Fallon, M. Deep IMU Bias Inference for Robust Visual-Inertial Odometry with Factor Graphs. IEEE Robot. Autom. Lett. 2023, 8, 41–48. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Steinbrener, J.; Brommer, C.; Jantos, T.; Fornasier, A.; Weiss, S. Improved State Propagation through AI-based Pre-processing and Down-sampling of High-Speed Inertial Data. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 6084–6090. [Google Scholar] [CrossRef]
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 448–456. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training Very Deep Networks. Adv. Neural Inf. Process. Syst. 2015, 9, 2377–2385. [Google Scholar]
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 2–4 June 2018; Volume 2 (Short Papers), pp. 464–468. [Google Scholar] [CrossRef]
- Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef] [PubMed]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR (Poster), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Delmerico, J.; Cieslewski, T.; Rebecq, H.; Faessler, M.; Scaramuzza, D. Are We Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6713–6719. [Google Scholar] [CrossRef]
- Wong, T.T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 2015, 48, 2839–2846. [Google Scholar] [CrossRef]
- Jana, P.; Tiwari, M. 2—Lean terms in apparel manufacturing. In Lean Tools in Apparel Manufacturing; Jana, P., Tiwari, M., Eds.; The Textile Institute Book Series; Woodhead Publishing: Sawston, UK, 2021; pp. 17–45. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).




