1. Introduction
The increased capability of modern devices gave way to a resurgence of data-driven approaches. As a subset of machine learning, deep learning models have become the leading choice for a vast range of data-driven signal processing applications, and these models tend to perform better as the amount of training data increases [1]. To increase the amount of data available to data-driven approaches, various methods have been proposed for image-, audio-, and text-based tasks [2]. These methods are called data augmentation methods.
Data augmentation methods were introduced to produce additional training data [3]. The use of augmented data enlarges the feature space while preserving the labels; therefore, a classifier is expected to overfit less to the training data and produce better evaluation results [4]. For example, artificial data produced by data augmentation methods have been used in speech recognition problems [5,6].
Data augmentation methods can be applied to an audio waveform in the time domain or to an audio spectrogram. The most commonly used methods have been noise adding, time stretching, time shifting, and pitch shifting [7]. In addition, methods that obtain new data by warping the linear frequency scale while computing the spectrogram representation have also been used in the literature [8]. Noise adding is performed by adding Gaussian noise to the original signal, with the noise amplitude multiplied by a randomly selected noise factor [9]. Time stretching changes the tempo and length of the original audio; the important point of this method is to keep the pitch of the signal unchanged while its length is modified by a randomly selected stretching factor [10]. Pitch shifting is another extensively used audio augmentation method, in which the pitch is shifted by a randomly selected factor. Especially in Music Information Retrieval applications, this method should be employed with great care because it can easily change the label of the original data [11]: a shifted pitch can change the expected frequency characteristics of a musical instrument. Time shifting either delays or advances an audio waveform [12]; with this method, the aim is to obtain a model that is agnostic to the beginning or end of the signal.
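For illustration, a minimal sketch of these conventional time-domain augmentations using NumPy and librosa; the parameter ranges below are illustrative assumptions rather than values used in this paper.

```python
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    # Add Gaussian noise scaled by a (typically random) noise factor.
    return y + noise_factor * np.random.randn(len(y))

def time_stretch(y, rate=1.1):
    # Change tempo/length while preserving pitch.
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr=22050, n_steps=2):
    # Shift the pitch by n_steps semitones without changing the length.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def time_shift(y, sr=22050, max_shift_s=0.2):
    # Delay or advance the waveform by a random offset (circular shift).
    shift = np.random.randint(-int(max_shift_s * sr), int(max_shift_s * sr))
    return np.roll(y, shift)
```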
One of the most used warping-based data augmentation methods has been Vocal Tract Length Perturbation (VTLP). VTLP turns the Vocal Tract Length Normalization (VTLN) approach into a data augmentation method. Vocal tract length differences between speakers result in speaker variability, which should be removed for speech recognition tasks. Normalization is achieved by warping the linear frequency axis of a spectrogram with a warp factor computed from the statistics of an audio sample. When this idea is used as a data augmentation tool, rather than calculating a warp factor, the linear frequency scale of an audio sample is warped by a randomly selected value [13]. VTLP has applications not only in speaker recognition tasks but also in animal audio classification [14] and environmental sound classification tasks [15]. Applying different weights to frequency bands can be seen as a procedure similar to VTLP; this approach has demonstrated promising results in acoustic event detection problems [16].
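As a concrete illustration, the following sketch warps a set of center frequencies with the piecewise-linear mapping commonly used for VTLP; the boundary frequency and the warp-factor range are illustrative assumptions.

```python
import numpy as np

def vtlp_warp(freqs, alpha, f_hi=4800.0, sr=16000):
    # Piecewise-linear warp of the frequency axis by factor alpha:
    # frequencies below a boundary are scaled by alpha; above it, the
    # mapping is linear so that the Nyquist frequency maps onto itself.
    f_max = sr / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    return np.where(
        freqs <= boundary,
        freqs * alpha,
        f_max - (f_max - f_hi * min(alpha, 1.0)) * (f_max - freqs) / (f_max - boundary),
    )

# Warp the center frequencies of a filter bank by a random factor.
alpha = np.random.uniform(0.9, 1.1)
warped = vtlp_warp(np.linspace(0.0, 8000.0, 128), alpha)
```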
Recent data augmentation approaches deviate from the conventional methods. One possibility is to employ a neural-network-based approach [17]; for example, convolutional neural networks have been applied as data augmentation tools for speech data. Another neural-network-based procedure employs Generative Adversarial Networks (GANs), in which the generator part of a GAN produces synthetic data for augmenting the original dataset [18]. A Deep Convolutional Generative Adversarial Network (DCGAN)-based augmentation approach has also been used for environmental sound classification, where recurrent neural network (RNN)- and CNN-based classifiers achieved state-of-the-art results [19].
In a substantial number of deep learning applications for audio signals, the Log-Mel spectrogram transformation of an audio sample is treated as an image input to a neural network. Therefore, data augmentation methods for image signals have also been considered for audio applications. For example, the SpecAugment procedure [8] applies sparse image warping, frequency masking, and time masking to the Log-Mel spectrogram representation of an audio sample.
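As an illustration of the masking steps (not the full SpecAugment procedure), here is a minimal sketch that zeroes one random band of Mel bins and one random band of time frames in a (mel_bins, frames) spectrogram; the maximum mask widths are illustrative assumptions.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=16, max_time_width=16):
    # Zero out one random frequency band and one random time band.
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = np.random.randint(1, max_freq_width + 1)
    f0 = np.random.randint(0, n_mels - f)
    spec[f0:f0 + f, :] = 0.0
    t = np.random.randint(1, max_time_width + 1)
    t0 = np.random.randint(0, n_frames - t)
    spec[:, t0:t0 + t] = 0.0
    return spec
```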
Since deep learning applications for computer vision have been a trending research topic, a vast set of data augmentation methods has been applied to image signals. Some well-known image data augmentation methods include flipping, rotating, or cropping the image and applying color jittering and edge enhancement [2]. It has been shown that Sobel operator-based edge enhancement can be successfully applied as a data augmentation method for Convolutional Neural Network (CNN)-based image classification models [4].
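A minimal sketch of such a Sobel-based enhancement, where the edge map is added back onto the image; the weight and the exact combination rule are illustrative assumptions, not the referenced method.

```python
import numpy as np
from scipy import ndimage

def sobel_enhance(img, weight=0.5):
    # Compute Sobel gradients along both axes and add the edge
    # magnitude back onto the image to emphasize its edges.
    gx = ndimage.sobel(img, axis=0)
    gy = ndimage.sobel(img, axis=1)
    edges = np.hypot(gx, gy)
    return img + weight * edges
```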
The main objective of this research is to show that the fractional-order calculus framework can produce beneficial methods for novel application domains. It has been argued that the lack of a universally agreed upon definition discourages the use of fractional-order calculus-based methods; by applying two different definitions of the fractional derivative to data augmentation, we aim to counter this argument. The fractional-order calculus framework has been applied to both audio and image signals in multiple forms. In the following, we provide some examples of these applications, first for audio signals and then for images.
Fractional-order calculus is a generalization of differentiation and integration to non-integer orders. The concept of the fractional derivative dates back to the discussions of Leibniz and L'Hospital [20]. This mathematical framework has been shown to describe real objects more accurately than classical, integer-order calculus. The non-integer derivation order provides an additional degree of freedom in a notable number of cases, such as modeling objects, optimizing performance, and describing natural dynamical behavior with memory [21]. Additionally, fractional calculus has been shown to be capable of providing tools for signal processing [22].
The capabilities of the fractional-order calculus framework connect it to the theory of fractals. Since being introduced by Mandelbrot, fractal theory has been a mathematical framework for explaining self-similar structures in nature [23]. The textural information of a signal is an important aspect of understanding that signal, and this information can be modeled with the tools of fractal geometry. For example, under the assumption that a stochastic signal obeys a well-defined fractal model, fractional-order calculus-based methods and models can be derived to estimate the frequency characteristics of the signal. Additionally, the model parameters derived in the fractal framework can be beneficial in solving problems such as the textural segmentation of a signal [24]. Fractal theory helps in explaining the local properties of a signal and in simplifying the geometrical or statistical description of its properties, regardless of whether the signal is fractal or not [25].
Fractional-order calculus-based models can be used to reduce the number of linear prediction parameters of a signal, because by differintegrating a signal with an appropriate order, the autocorrelation function of the signal can be manipulated so that fewer linear prediction parameters are needed. The increased prediction performance of fractional linear prediction, an approach based on a weighted sum of fractional derivatives of a signal, has been documented for speech signal prediction problems [26]. Fractional calculus is a nonlocal approach, which means that the fractional derivative of a signal depends on all previous values of the signal. This aspect of the framework makes it suitable for dealing with signals with memory. For optimal fractional linear prediction, approaches with limited memory have been proposed; they are shown to achieve both good prediction accuracy and a reduction in the number of linear prediction coefficients needed to encode an audio signal [27,28,29]. Moreover, fractional-order derivatives are used in audio processing applications as a metric for fractal analysis. Fractional derivatives of Gaussian noise can be assumed as the excitation in an autoregressive model of speech [30]. There are successful applications of fractal features to speech recognition, voiced–unvoiced speech separation [31], and speaker emotion classification problems [32]. For example, combining fractal-geometry-based features produces results comparable to Mel-frequency Cepstral Coefficients in speech classification problems [33].
Fractional calculus-based approaches have also been applied to image-processing tasks [34]. The main application form of fractional derivatives in image processing is the construction of fractional differential masks, which are used as part of edge detection algorithms. The fractional derivative order, which adds a degree of freedom, can be tuned to create fractional-order derivative-based filters or masks with increased edge detection or segmentation performance. This approach has found use in areas ranging from satellite image segmentation [35] to biomedical applications such as brain tomography segmentation [36].
Some recent studies of fractional-order differential equations include bifurcation analysis of fractional-order biological models. It was shown that for fractional-order prey–predator models, the stability domain can be extended under fractional order [37], and it was proven both analytically and numerically that fractional-order prey and predator systems present less chaotic behavior [38]. Fractional-order calculus has also found use in Genetic Regulatory Networks, which are complex models describing the relationship between the transcription of genes and the translation of mRNAs in biological cells; fractional-order models are powerful tools for controlling such networks [39]. Since, in fractional-order differential equations, time delay is a very important factor affecting the dynamical behavior of systems, exploring the impact of time delay on fractional-order neural networks is of great importance for optimizing and controlling neural networks [40]. Understanding bifurcations in fractional-order neural networks is, therefore, a crucial and active study area for understanding the dynamical properties of neural networks [41,42].
In this paper, we propose two fractional-order calculus-based data augmentation methods for audio signals. The first approach employs fractional differentiation of the Mel scale; as in VTLP, it represents the audio signal on a warped time-frequency scale. The Mel scale is warped using a randomly selected fractional derivation order, and in this way we augment Mel scale-based time-frequency representations of the audio data. The second approach builds on previous fractional-order image edge enhancement methods. Since many deep learning approaches treat Mel spectrogram representations like images, we employ a fractional-order differential-based mask whose parameters are produced from randomly selected fractional-order derivative parameters. The two methods are applied to the environmental sound classification task.
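To give a flavor of the second approach, the following is only a schematic sketch, not the exact formulation presented in Section 2: a small mask is built from truncated Grünwald–Letnikov coefficients and convolved with a Log-Mel spectrogram. The derivative-order range, the mask size, and the separable construction are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.special import gamma

def gl_coefficients(alpha, n_terms=3):
    # Truncated Grünwald–Letnikov coefficients (-1)^k * C(alpha, k).
    return np.array([
        (-1) ** k * gamma(alpha + 1) / (gamma(k + 1) * gamma(alpha - k + 1))
        for k in range(n_terms)
    ])

def fractional_mask_augment(log_mel, alpha=None):
    # Build a small 2D mask from the coefficients and convolve it with
    # the Log-Mel spectrogram to emphasize spectro-temporal variations.
    if alpha is None:
        alpha = np.random.uniform(0.1, 0.5)   # illustrative order range
    c = gl_coefficients(alpha)
    mask = np.outer(c, c)
    return convolve2d(log_mel, mask, mode="same", boundary="symm")
```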
This paper is organized in the following manner. In Section 2, the Mel spectrogram representation of an audio signal and the methodologies of both data augmentation methods are presented. In Section 3, the experimental setup, the environmental sound classification task, and the experiment results are presented. In Section 4, the results are discussed and the conclusions drawn from them are explained.
3. Results
For the experiments, we use the UrbanSound8k dataset, which contains 8732 labeled urban sound excerpts from 10 classes. The producers of the dataset advise applying 10-fold cross-validation [49]. Sound samples extracted from the dataset are 3 s long and are zero-padded when necessary. The FFT size for the time-frequency transformations is 1024 with 75% overlap, and the sample rate for all samples is 22,050 Hz. The number of Mel frequency bins is 128, so the input for every experiment is a time-frequency representation with the shape (128, 128).
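A minimal sketch of this feature extraction pipeline using librosa; since the exact way the time axis is reduced to 128 frames is not detailed here, the final cropping/padding step is an assumption.

```python
import librosa

SR = 22050
CLIP_SECONDS = 3
N_FFT = 1024
HOP = N_FFT // 4      # 75% overlap
N_MELS = 128
N_FRAMES = 128        # target width; fixing the time axis to 128 frames is an assumption

def extract_log_mel(path):
    y, _ = librosa.load(path, sr=SR, duration=CLIP_SECONDS)
    y = librosa.util.fix_length(y, size=CLIP_SECONDS * SR)   # zero-pad short clips
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)
    return librosa.util.fix_length(log_mel, size=N_FRAMES, axis=-1)
```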
In this paper, an offline data augmentation procedure is applied; that is, the augmentation is performed before model training. In addition to the original dataset, an augmented dataset of the same size is generated with each procedure. Furthermore, datasets enlarged to three times the original size are produced with each procedure. For Log-Mel spectrogram features, the fractional-order mask procedure is applied after the logarithm operation. The experiments are conducted for both Mel spectrogram features and Log-Mel spectrogram features. The dataset sizes for the experiments can be seen in Table 2.
The environmental sound classification task has been studied extensively. A baseline accuracy of 68% on the UrbanSound8k dataset was obtained with conventional machine learning algorithms, and the publishers of the dataset also proposed a deep CNN with standard audio data augmentation techniques that achieved 79% accuracy [7]. Piczak proposed a CNN architecture with 73.7% accuracy [50]. A dilated CNN approach resulted in 78% accuracy [51]. A recent study with a deep CNN, data augmentation, and network regularization reported 95.37% accuracy on Log-Mel spectrogram features [52]. In some cases, the data preparation procedure of the experiments remained vague, which makes the experiments difficult to reproduce. Rather than trying to surpass benchmark environmental sound classification accuracy scores, our focus was to inspect the capabilities of fractional-order calculus-based methods for data augmentation. Therefore, we designed a simple CNN model of our own.
Our network, which can be seen in Figure 1, contains two convolutional layers with max pooling and an additional convolutional layer that precedes a global average pooling layer. We use global average pooling to reduce the number of parameters of the dense layer that follows the convolutional layers, for two reasons: firstly, our hardware constraints ruled out larger network models, and secondly, in our research case, implementing such a model does not weaken our point of showing the capabilities of the fractional-order calculus framework.
Essentially, a deep neural network is a nonlinear mapping function with learnable weights W. Given an input representation X, the neural network maps the input to the output as in (12):

Y = F(X; W)  (12)

In classification tasks, the output vector Y is a vector of class probabilities. In this work, the implemented network has five layers.
The first three layers are 2D convolutional layers. A convolutional layer can be expressed as in (13):

Y = σ(W ∗ X + B)  (13)

In this expression, σ is a pointwise activation function. In our implementation, we use the Rectified Linear Unit (ReLU) activation function for the convolutional layers, given in (14):

ReLU(x) = max(0, x)  (14)

Additionally, ∗ represents the convolution operation, and W and B are learnable tensors. The bias tensor B can be excluded in implementations. A 2D convolutional layer has N input channels and M output channels. Notationally, when B is excluded, a 2D convolutional layer can be defined by the shape of the tensor W, which is (M, N, K1, K2), where (K1, K2) is the shape of the convolutional kernel.
Layers 4 and 5 are fully connected layers. As seen in (15), such a layer employs a matrix product:

Y = σ(WX + B)  (15)

If X is a vector of size N and the output size is M, then, excluding the bias term B, the tensor W has the shape (M, N).
For the experiments conducted in this research, the implemented deep neural network resembles the network in [7].
2D Convolutional Layer 1: The number of output channels in this layer is 16. The convolutional kernel has the shape (3,3). The shape of W in this layer is (16,1,3,3). This convolutional layer is followed by a (3,3) strided max pooling layer and ReLU activation.
2D Convolutional Layer 2: The number of input channels in this layer is 16 and the number of output channels is 64. The convolutional kernel has the shape (3,3). The shape of W in this layer is (64,16,3,3). A (3,3) strided max pooling layer and ReLU activation are applied to the outputs of this convolutional layer.
2D Convolutional Layer 3: The number of input channels in this layer is 64 and the number of output channels is 128. The convolutional kernel has the shape (3,3). The shape of W in this layer is (128,64,3,3). ReLU activation is applied to the outputs of this convolutional layer. Global average pooling is applied after the activation and the outputs are flattened, resulting in a vector of size 128. This operation further reduces the parameter count and the training time of our network, given our hardware constraints.
Fully Connected Layer 1: This layer has 64 output units. The shape of W in this layer is (128,64). This layer is followed by ReLU activation.
Fully Connected Layer 2: This layer has 10 output units, which is the same as the number of classes. The shape of W in this layer is (64,10). This layer is followed by Softmax activation to map the layer outputs to class probabilities.
Dropout is applied after these layers, with probability 0.3 for the first three layers and 0.5 for fully connected layer 1. The loss function for model optimization is the cross-entropy loss. For optimization, the ADAM optimizer is employed with a learning rate of 0.001, together with an epsilon parameter and a weight decay parameter. Weight decay penalizes the squared magnitude of the weight values and is also known as L2 regularization; regularization is used in deep learning to prevent weight bias and overfitting.
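A minimal PyTorch sketch of the network described above; the class and variable names are illustrative, kernel padding is left at its default since it is not specified, and the final softmax is left to the loss function, since CrossEntropyLoss in PyTorch operates on logits.

```python
import torch
import torch.nn as nn

class TestCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3)
        self.conv2 = nn.Conv2d(16, 64, kernel_size=3)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=3)
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, n_classes)
        self.drop_conv = nn.Dropout(0.3)
        self.drop_fc = nn.Dropout(0.5)
        self.relu = nn.ReLU()

    def forward(self, x):                           # x: (batch, 1, 128, 128)
        x = self.drop_conv(self.relu(self.pool(self.conv1(x))))
        x = self.drop_conv(self.relu(self.pool(self.conv2(x))))
        x = self.drop_conv(self.relu(self.conv3(x)))
        x = self.gap(x).flatten(1)                  # (batch, 128)
        x = self.drop_fc(self.relu(self.fc1(x)))
        return self.fc2(x)                          # logits; softmax is applied by the loss

model = TestCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # eps/weight_decay as in the paper
```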
For all experiments, the number of training epochs is set to 100. While evaluating the classification performance, a 10-fold cross-validation procedure is applied. The UrbanSound8k dataset was published by its creators already divided into 10 folds. In the 10-fold cross-validation procedure, a fold is first selected and the network is trained on the remaining folds; the classification accuracy is then evaluated on the selected fold. After that, all network weights and the optimizer are initialized again. These steps are repeated until the classification accuracies for all folds have been calculated.
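A minimal sketch of this cross-validation loop, reusing the TestCNN sketch above; train_one_epoch and evaluate_fold are hypothetical helpers standing in for the training and evaluation code.

```python
import numpy as np
import torch

fold_accuracies = []
for test_fold in range(10):
    train_folds = [f for f in range(10) if f != test_fold]
    model = TestCNN()                                           # reinitialize the weights
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # reinitialize the optimizer
    for epoch in range(100):
        train_one_epoch(model, optimizer, train_folds)          # hypothetical helper
    fold_accuracies.append(evaluate_fold(model, test_fold))     # hypothetical helper
print("mean accuracy:", np.mean(fold_accuracies))
```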
The data augmentation method performances are presented with respect to the mean accuracy. In Figure 2a,b, the results for both Mel spectrogram features and Log-Mel spectrogram features can be seen.
For non-augmented Mel spectrogram features, the accuracy is 58.6%. When we augment the dataset with the Fractional-Order Mask and add the augmented samples to the original dataset, we achieve 60.8% accuracy; the same procedure with the Fractional-Order Mel-Scale augmentation method yields 61.3% accuracy. When the dataset size is increased to six times its original size with both augmentation methods, the mean 10-fold cross-validation accuracy becomes 62.5%.
When we change the input features to Log-Mel spectrogram representations, the accuracy for the non-augmented case is 64.8%. Augmenting the dataset with the Fractional-Order Mask and combining the augmented samples with the non-augmented Log-Mel spectrogram features results in 65.8% accuracy, whereas the same procedure with the Fractional-Order Mel-Scale method gives 69.4% accuracy. For the last experiment, the dataset is augmented with both methods separately to three times its size, and the two augmented datasets are then combined; this procedure results in 70.3% accuracy. For Log-Mel spectrogram features, augmenting the dataset to six times its original size with the fractional-order calculus-based data augmentation methods produces an 8.48% relative increase in accuracy in our experiments.
In Figure 2c–f, the loss and accuracy graphs with respect to the epochs are provided. In Figure 2c,d, it is clear that data augmentation reduces the loss for both training and validation data, and the augmented datasets need fewer epochs to reach saturation. In Figure 2e,f, the augmented datasets achieve higher accuracy for both training and validation. In Figure 2f, augmenting the dataset to two times its size with the Fractional-Order Mel-Scale method produces the higher training accuracy, whereas augmenting the dataset to six times its size with both of the proposed methods achieves the highest validation accuracy.
To inspect the results with respect to the increased complexity of the network model, the output unit size of the fully connected layer is increased from 64 to 256. This experiment produces a 66.3% accuracy score for non-augmented Log-Mel spectrogram features. Repeating the procedure with the fully augmented dataset results in 71.4% accuracy.
To further understand the performance of the data augmentation methods on deeper neural networks, an 18-layer ResNet-based model is implemented. This implementation differs from the original model in two places: the input layer and the output fully connected layer. The network parameters are randomly initialized. The original ResNet model is designed to take three-channel image data as input; therefore, a 2D convolutional layer with the weight shape (3,1,3,3) is added before the network. The original ResNet-18 implementation in PyTorch has 1000 output classes, so the last fully connected layer is changed to have 10 output units. This network has roughly 100 times as many parameters as the test network used in the previous experiments. In deep learning, an increased parameter count generally leads a network to overfit, and this experiment showed that, in terms of loss, the proposed data augmentation methods were unable to overcome the overfitting characteristic of a deeper network such as ResNet-18. The data augmentation performance on ResNet is presented with respect to the mean accuracy. In Figure 3, the results for the Log-Mel spectrogram features of the non-augmented and six-times-augmented datasets are shown.
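A minimal sketch of this ResNet-18 adaptation using torchvision; the module composition is an illustrative way to realize the described changes, not necessarily the exact implementation used here.

```python
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None)                       # randomly initialized parameters
backbone.fc = nn.Linear(backbone.fc.in_features, 10)    # 10 output classes
# Map the single-channel spectrogram onto the 3 input channels ResNet expects;
# this added convolution has a weight tensor of shape (3, 1, 3, 3).
model = nn.Sequential(nn.Conv2d(1, 3, kernel_size=3, padding=1), backbone)
```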
In terms of accuracy, the augmented dataset achieved a nearly 10% relative increase in validation accuracy after 25 epochs of training. The non-augmented dataset resulted in 62% validation accuracy, whereas, after the dataset was augmented to six times its size using the fractional-order calculus-based data augmentation methods, the resulting accuracy became 68.1%.
4. Discussion
Fractional-order calculus provides a capable framework whose range of application domains keeps growing. In this research, we show that fractional calculus-based approaches can be successfully applied to yet another domain: to our knowledge, fractional-order calculus-based methods had not previously been used for audio data augmentation. Implementing the Fractional-Order Mask and Fractional-Order Mel-Scale augmentation methods has proven to be beneficial.
The Fractional-Order Mel-Scale approach employs fractional differentiation of the Mel scale and extends the group of frequency warping-based data augmentation methods. The Fractional-Order Mask approach adapts similar image edge enhancement methods to the audio data augmentation domain.
Using the Log-Mel spectrogram has been the preferred choice in deep learning because log scaling the Mel spectrogram output compresses its dynamic range, which results in a more detailed spectrogram image for the deep learning model to learn from. The proposed data augmentation methods work better on the Log-Mel spectrogram representation of audio data: as seen from the experiments, the augmentation procedure achieves up to a 6.6% relative increase for Mel spectrogram features, whereas for Log-Mel spectrogram features the relative increase reaches 8.48%. This is an expected result. The Fractional-Order Mel-Scale approach produces amplitude deviations with respect to the original sample, and due to our conservative approach, small derivation orders are selected. Without the range compression, the amplitude differences between the augmented data and the original data become less significant, so the model sees the augmented training data as too similar to the original data. A more detailed analysis showed that augmenting the dataset multiple times yields smaller gains in accuracy and, in some cases, leads the model to overfit. For similar reasons, the Fractional-Order Mask method creates greater variations from the original data when applied to the Log-Mel spectrogram features of the audio sample.
The aim of this work is not to surpass the benchmark accuracy results of the deep learning models proposed for environmental sound classification. A deep CNN model similar to those in the literature is chosen, but implemented with an additional global average pooling operation before the fully connected layers. Because of our hardware constraints, a network with a smaller number of learnable parameters is preferred. Since the main objective is to inspect the performance of the proposed data augmentation methods, choosing such a model does not weaken our points. Nevertheless, experiments with an increased number of output units in the first fully connected layer are conducted: to inspect the results with respect to the increased complexity of the network model, the output unit size of the fully connected layer is increased from 64 to 256. This experiment produces a 66.3% accuracy score for non-augmented Log-Mel spectrogram features, and repeating the procedure with the fully augmented dataset results in 71.4% accuracy. The resulting 7.7% relative increase shows that our points regarding the application of fractional-order calculus-based methods remain valid. In addition to the test networks, the ResNet-18 model is implemented to understand the performance of the proposed methods on deeper networks. In this experiment, the fully augmented dataset achieved a more than 9% relative increase in validation accuracy. On the other hand, inspecting the loss and accuracy curves showed that the network performance with respect to overfitting is not substantially improved. This result leads us to believe that, to achieve a substantial improvement in terms of overfitting, the current parameter selection procedure for the proposed data augmentation methods is not enough and needs improvement. It must be noted that recent studies on deep learning employ more complex optimization policies to overcome overfitting; in our case, it is a conscious choice to keep the training procedure simpler in order to see the effects of the augmentation methods more clearly.
As future work, the proposed methods can be studied further for better selection of the derivation parameter range. It must be noted that such experiments should take differences in data types and neural network models into account. Additionally, further experiments can be conducted to understand the effects of fractional-order calculus-based data augmentation methods on speech data. Since automatic speech recognition tasks are of great importance, proposing a parameter selection procedure for the proposed methods based on the human auditory system could be valuable. Furthermore, the performance of the proposed data augmentation methods should be evaluated against conventional data augmentation methods. Lastly, combining the proposed methods with conventional data augmentation methods remains an interesting future research topic for us.