3.1. Generation of the Sound Dataset
To generate the sound dataset, an FDTD implementation was used. As the FDTD method provides a grid-based numerical solution, in this case a solution of Equation (1), it was possible to account for paste distribution by increasing the mass of the membrane at specific grid cells that were determined by each distribution pattern.
3.1.1. FDTD Model
FDTD models provide an efficient numerical method for physical modelling sound synthesis and have previously been used for complete geometries of a guitar, a violin, and several other instruments [18]. In this study, FDTD is implemented for Equation (1): the circular membrane is modeled as a finite number of grid cells, and time is modeled as a finite number of time instants, as depicted in Figure 4. The fundamental time step is defined as Δt, and the fundamental length of a cell is defined as Δx = Δy. To maintain stability, the constants Δt and Δx are chosen such that no artificial energy is introduced (Courant–Friedrichs–Lewy condition) [18,19], as shown in Equation (5).
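As a rough illustration of the stability requirement, the sketch below checks the standard CFL bound for the explicit two-dimensional wave equation on a square grid, c·Δt/Δx ≤ 1/√2. The wave speed c = √(T/σ), the reading of the tension as a force per unit length, and the choice Δt = 1/96,000 s (one FDTD step per audio sample) are assumptions made for illustration; the exact form of Equation (5) may differ.

```python
import math

def courant_number(tension, surface_density, dt, dx):
    """Return c*dt/dx for a membrane with wave speed c = sqrt(T/sigma)."""
    wave_speed = math.sqrt(tension / surface_density)
    return wave_speed * dt / dx

def is_stable_2d(courant):
    """Standard CFL bound for the explicit 2D wave equation on a square grid."""
    return courant <= 1.0 / math.sqrt(2.0)

# Membrane values reported in Section 3.1.2; treating T as N/m is an assumption.
SIGMA = 300.0 * 0.003      # surface density = volume density * thickness (kg/m^2)
DX = 0.5 / 105             # membrane diameter divided by the grid size (m)
DT = 1.0 / 96_000          # assumed: one FDTD step per 96 kHz audio sample

courant = courant_number(800.0, SIGMA, DT, DX)
print(f"Courant number: {courant:.4f}, stable: {is_stable_2d(courant)}")
```

Under these assumptions the Courant number lies well below the bound, so the time step is far from the stability limit.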
The membrane assumes boundary conditions u(xb, yb, t) = 0 for all points xb, yb around the circumference of the membrane and initial conditions u(x, y, 0) = 0 for all points except for the striking point of the membrane for which u(xp, yp, 0) = 1.
To compute u(x, y, t) at every cell, the Newton–Störmer–Verlet algorithm, also known as the leapfrog algorithm, is applied [20]. Specifically, the acceleration of each cell is defined as:
The second-order partial derivatives are approximated by the following discrete form:
For each cell and time instant, the leapfrog algorithm follows simple formulas of kinematics and proceeds according to the following steps:
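A minimal sketch of one leapfrog update on a small grid is given here for orientation. It assumes the usual five-point Laplacian, a kick-then-drift ordering (velocity first, displacement second), and a damping factor D that multiplies the velocity at every step; the function names, the mask construction, and the placement of the damping are illustrative choices, not necessarily those of the original implementation.

```python
def make_mask(n):
    """True for cells strictly inside a circle inscribed in the n x n grid."""
    centre = (n - 1) / 2.0
    radius = n / 2.0 - 1.0          # keeps the outermost cells fixed at zero
    return [[(i - centre) ** 2 + (j - centre) ** 2 < radius ** 2
             for j in range(n)] for i in range(n)]

def leapfrog_step(u, v, mask, c, dx, dt, damping):
    """Advance displacement u and velocity v in place by one time step."""
    n = len(u)
    coeff = (c / dx) ** 2
    # Kick: acceleration from the discrete Laplacian of the current displacement.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if mask[i][j]:
                lap = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                       - 4.0 * u[i][j])
                v[i][j] = damping * (v[i][j] + coeff * lap * dt)
    # Drift: move every membrane cell with its updated velocity.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if mask[i][j]:
                u[i][j] += v[i][j] * dt

# Demo: 21 x 21 grid, unit displacement at the striking point, zero elsewhere.
size = 21
grid_mask = make_mask(size)
disp = [[0.0] * size for _ in range(size)]
vel = [[0.0] * size for _ in range(size)]
disp[6][8] = 1.0
for _ in range(100):
    leapfrog_step(disp, vel, grid_mask, c=1.0, dx=1.0, dt=0.5, damping=0.9999)
```

Cells outside the mask are never updated, which enforces the boundary condition u = 0 on the rim.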
Following these steps, the displacement of every cell on the grid is computed for every time instant. Finally, a sound signal is computed assuming a virtual microphone at distance d above the center of the membrane. The signal integrates the displacements of all cells as received at the microphone position, each with a time delay and an attenuation determined by its distance from the microphone. Specifically, the contribution of the cell at (x, y) is attenuated by d/r(x, y) and delayed by r(x, y)/c, where r(x, y) is the distance between that point on the membrane and the microphone and c is the speed of sound in air.
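The microphone integration can be sketched as a delay-and-sum over all cells. The version below assumes one displacement frame per audio sample, delays rounded to the nearest sample, and c = 343 m/s; the function name and these choices are illustrative assumptions.

```python
import math

def microphone_signal(frames, dx, mic_height, sample_rate, speed_of_sound=343.0):
    """Delay-and-sum of cell displacements at a microphone above the grid centre.

    frames[t][i][j] is the displacement of cell (i, j) at time step t.
    """
    n = len(frames[0])
    centre = (n - 1) / 2.0
    taps = []
    for i in range(n):
        for j in range(n):
            horizontal = math.hypot((i - centre) * dx, (j - centre) * dx)
            r = math.hypot(horizontal, mic_height)       # cell-to-microphone distance
            delay = int(round(r / speed_of_sound * sample_rate))
            taps.append((i, j, mic_height / r, delay))   # attenuation d / r
    max_delay = max(tap[3] for tap in taps)
    signal = [0.0] * (len(frames) + max_delay)
    for t, frame in enumerate(frames):
        for i, j, gain, delay in taps:
            value = frame[i][j]
            if value != 0.0:
                signal[t + delay] += gain * value
    return signal
```

Cells outside the membrane stay at zero displacement and therefore contribute nothing to the sum.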
3.1.2. Distribution of Paste
To account for the pattern and amount of paste applied on the drumhead, Equation (1) was reformulated as:
The synthesized sounds correspond to a reference membrane of predefined geometrical characteristics, i.e., radius r = 0.25 m, thickness 3 mm, volume density 300 kg/m³, and a constant uniformly distributed tension of T = 800 Nt. These values correspond to a membrane having a total mass m = 177.24 g. A uniform, square grid of 105 × 105 = 11,025 nodal points was used for the FDTD model. In this grid, only 8685 cells correspond to the area of the membrane (please refer to Figure 5). Therefore, Δx = Δy = 4.76 mm and the mass of each cell was Δm = 0.0204 g.
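The grid quantities follow from the stated geometry. The short calculation below reproduces them, assuming that the cell size is the membrane diameter divided by the grid size and that the cell mass is the volume density times the cell volume.

```python
import math

RADIUS_M = 0.25
THICKNESS_M = 0.003
DENSITY_KG_M3 = 300.0
GRID_SIZE = 105
MEMBRANE_CELLS = 8685          # reported number of cells inside the membrane

dx_m = 2 * RADIUS_M / GRID_SIZE
cell_mass_g = DENSITY_KG_M3 * THICKNESS_M * dx_m ** 2 * 1000.0
grid_mass_g = MEMBRANE_CELLS * cell_mass_g
# Mass of the ideal continuous disc, for comparison with the gridded value.
disc_mass_g = DENSITY_KG_M3 * THICKNESS_M * math.pi * RADIUS_M ** 2 * 1000.0

print(f"dx = {dx_m * 1000:.2f} mm, cell mass = {cell_mass_g:.4f} g")
print(f"gridded mass = {grid_mass_g:.2f} g, continuous disc = {disc_mass_g:.2f} g")
```

The reported total mass matches the sum over the 8685 grid cells, which is slightly larger than the mass of the continuous disc because of the staircase approximation of the rim.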
The damping parameter was kept constant at D = 0.9999 throughout the sound computation. Real drums are damped more strongly than in the present simulation. To better simulate real frame drums, the model should account for viscoelastic damping, as presented in [17]. Viscoelasticity does not change the relationships of the spectral overtones. Instead, it introduces frequency-dependent damping, resulting in time-varying sound spectra that do not preserve the fine structure of the different partials throughout the duration of the sound signals. There are two reasons for not including viscoelasticity in the FDTD model. Firstly, different damping materials have different viscoelastic properties. Secondly, because different overtones would have different durations, it would be more difficult to draw conclusions on how the damping material affects the overtone spectra.
Paste was applied according to the six patterns shown in Figure 5. These patterns were inspired by common tuning practices used by percussionists (Section 2.3), i.e., circular disks, ring dampeners, gaffer tapes, and adhesive pads. Besides the patterns 1-diameter, 2-radius, 3-cross, 4-disc, 5-ring, and 6-point, the 0-pattern was included to represent the bare membrane (i.e., without any paste applied); it is not shown in Figure 5.
Depending on the pattern, different parameters were varied to produce different sounds, as shown in Table 1. Certain parameter combinations were used to generate three sounds, corresponding to striking the membrane at three different impact points. As shown in Figure 6, the (x, y) coordinates of the impact points were (18, 20), (25, 30), and (35, 45), i.e., one point close to the rim, one midway between the rim and the center, and one close to the center. The center of the membrane was not used as an impact point, as this would suppress all normal modes having a node at the center (please refer to Figure 2). Three impact points were considered to account for the variation in the sound of the same membrane that is caused by the excitation position, regardless of the use of damping material.
For example, for the 0-pattern, different sounds were generated by varying the thickness of the membrane. Specifically, 529 thickness values were chosen in the range of 0.003–0.0067 m. For each thickness value, three sounds were generated by shifting the impact point of the membrane excitation to the three predefined positions of Figure 6. This yields a total of 1587 sounds. As another example, for pattern 5-ring, different sounds were produced by varying three parameters: the amount of added paste, expressed as the percentage of mass increase per grid cell (13 values in the range of 20 to 220.1%); the outer radius of the ring (14 values between 33.6 and 95% of the actual radius of the membrane); and the width of the ring (8 values between 2 and 8 cm).
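To make the ring parameters concrete, the sketch below marks the grid cells covered by a ring and computes the resulting paste mass. The centre convention, the inclusive radius test, and the helper names are assumptions, so the number of covered cells may differ slightly from that of the original implementation.

```python
import math

def ring_mask(n, outer_fraction, width_m, radius_m=0.25):
    """True for cells whose centre lies inside the ring of paste."""
    dx = 2 * radius_m / n
    centre = (n - 1) / 2.0
    r_outer = outer_fraction * radius_m
    r_inner = max(r_outer - width_m, 0.0)
    return [[r_inner <= math.hypot((i - centre) * dx, (j - centre) * dx) <= r_outer
             for j in range(n)] for i in range(n)]

def cell_masses(mask, base_mass_g, paste_percent):
    """Mass of every grid cell after increasing the covered cells by paste_percent."""
    factor = 1.0 + paste_percent / 100.0
    return [[base_mass_g * (factor if covered else 1.0) for covered in row]
            for row in mask]

def paste_mass_g(mask, base_mass_g, paste_percent):
    """Total added mass: covered cells x cell mass x percentage increase."""
    covered = sum(cell for row in mask for cell in row)
    return covered * base_mass_g * paste_percent / 100.0

ring = ring_mask(105, outer_fraction=0.5, width_m=0.02)
print(f"paste mass: {paste_mass_g(ring, 0.0204, 100.0):.2f} g")
```

Because the added mass is the product of the per-cell percentage and the number of covered cells, the same paste mass can result from quite different parameter combinations.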
It is important to note that although the dataset is balanced with respect to the paste patterns, i.e., an approximately equal number of sounds was generated for each pattern, it was not possible to achieve a corresponding balance for the increase in the membrane mass due to paste. The reason is that the extra mass is a combination of the percentage of mass increase per cell, denoted as Paste (%) in Table 1, and the number of cells covered with paste. Therefore, instead of using all possible combinations, selected combinations were used. These combinations were chosen as a compromise between considering realistic values of mass increase and obtaining a roughly even distribution of paste mass per pattern. This is why the total number of sounds per pattern does not equal the total number of parameter combinations multiplied by the three hitting points.
Figure 7 shows histograms of the number of sounds produced for each bin, where each bin corresponds to a range of values of the paste mass. Paste mass is depicted in 50 bins for all paste patterns (Figure 7a) as well as separately for each pattern (Figure 7b–h). Except for the 0-no_paste pattern, all patterns are represented by a higher number of sounds in the low range of paste mass (0–50 g) than in the higher range (50–200 g).
3.1.3. Implementation Details and Dataset Availability
To generate the audio files, the FDTD model was implemented in CUDA and run on a computer system with an Intel i7-12700 (2.1/4.9 GHz) CPU, 64 GB of RAM, and an NVIDIA GeForce GTX 970 GPU (4 GB) under Windows 10. The computation time was approximately 3–4 s for generating 1 s of audio at a sampling rate of 96 kHz.
A total of 11,114 sounds were created using combinations of the values presented in Table 1. These sounds may be accessed via a public link (http://bit.ly/drumheads-sounddataset, accessed on 16 August 2023). The parameter values used to generate each file are provided in the filenames, as explained in the accompanying document.
3.2. Data Investigation
One of the primary concerns of this research is to investigate whether each damping pattern yields perceptually similar sounds. Perceptual similarity was assessed by the frequency and amplitude ratios of the spectral overtones to those of the fundamental peak. An example of the spectrum transformation due to paste is presented in Figure 8. At the top, the FFT spectrum and the grid diagram of the reference membrane are depicted. The red mark on the grid diagram represents the impact point (18, 20). As shown in the spectrum legend, the fundamental frequency is f0 = 45 Hz, which is in agreement with Equation (4). The frequency ratios of the first sixteen partials to the frequency of the fundamental are (1.00, 1.60, 2.31, 2.93, 3.62, 4.24, 4.93, 5.42, 5.56, 6.22, 6.76, 6.87, 7.36, 7.53, 7.89, 8.09). The first five frequency ratios approximate those of modes (0, 1), (1, 1), (0, 2), (1, 2), (3, 2) of Figure 2. At the bottom of Figure 8, the FFT spectrum of the same membrane damped according to the ring pattern is presented, with the corresponding grid diagram shown at the bottom right. This sound corresponds to increasing the mass of the cells covered with paste by 157.372%. The coverage of paste is defined by an outer radius that is 76.008% of the membrane radius and a width of 6.48 cm. This corresponds to a total coverage of 2668 cells, a paste mass of 85.65 g, and a total mass of the damped membrane of 262.89 g, which according to Equation (4) results in a fundamental frequency of f0 = 37.47 Hz, again in agreement with the bottom spectrum (37 Hz). In this case, the frequency ratios of the peaks are (1.00, 1.49, 2.35, 3.14, 2.76, 4.41, 4.51, 5.32, 5.97, 6.11, 6.24, 6.46, 7.16, 7.59, 7.70, 7.89), which are significantly different from the ratios of the normal modes in Figure 2. This demonstrates that covering the specific cells with paste alters the eigenmodes in terms of their amplitude and frequency relationships.
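The numbers of this example can be retraced with a few lines of arithmetic. The sketch assumes that Equation (4) is the fundamental of an ideal circular membrane, f0 = j01/(2πr)·√(T/σ) with the tension taken as a force per unit length, and that the added paste acts as a uniform increase of the total mass, so that f0 scales with the inverse square root of the mass. Both are simplifying assumptions rather than the exact model.

```python
import math

CELL_MASS_G = 0.0204           # reported mass of one bare cell
REFERENCE_MASS_G = 177.24      # reported mass of the bare membrane
covered_cells = 2668
paste_percent = 157.372

paste_mass_g = covered_cells * CELL_MASS_G * paste_percent / 100.0
total_mass_g = REFERENCE_MASS_G + paste_mass_g

J01 = 2.404826                 # first zero of the Bessel function J0
radius_m, tension, sigma = 0.25, 800.0, 300.0 * 0.003
f0_reference = J01 / (2 * math.pi * radius_m) * math.sqrt(tension / sigma)
# Assumed scaling: the fundamental drops with the square root of the mass ratio.
f0_damped = f0_reference * math.sqrt(REFERENCE_MASS_G / total_mass_g)

print(f"paste mass = {paste_mass_g:.2f} g, total mass = {total_mass_g:.2f} g")
print(f"f0 reference = {f0_reference:.2f} Hz, f0 damped = {f0_damped:.2f} Hz")
```

The uniform-mass scaling only predicts the shift of the fundamental; the change in the partial ratios depends on where the mass is placed and requires the full FDTD simulation.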
A first attempt to investigate whether spectral similarities exist within each of the damping patterns was made in one of our previous publications [21], using a smaller dataset of 2331 synthesized sounds generated with the paste distribution patterns of Figure 5 and an impact point at the center of the membrane. In that work, the FFT sound spectra were computed and the spectral partials were extracted using a peak-picking algorithm. The frequency ratios of the partials to the fundamental were used to train a Self-Organizing Map (SOM) that revealed similarities within the sound dataset. The sounds of each pattern were clustered in regions of high similarity, and a high dissimilarity was exhibited between circular patterns (disc, ring, and the no-paste pattern) and line patterns (i.e., radius, diameter, and cross). Moreover, the disc pattern appeared to belong to the same cluster as the no-paste pattern with respect to partial relationships, confirming that circular discs alter the fundamental frequency and damp partial amplitudes without introducing significant changes in the frequency relationships.
The present dataset of 11,114 sounds was investigated using several dimensionality reduction algorithms, including Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Linear Discriminant Analysis (LDA). Two of these efforts were highly informative and are presented in the rest of this section.
Figure 9 shows the map obtained by LDA trained on the first sixteen partials of the FFT spectra. Specifically, the FFT spectrum of each 96 kHz sound was computed, and a peak-picking algorithm estimated the frequency and amplitude ratios of the first sixteen partials to the fundamental peak. This provided a data matrix of 11,114 × 32 values, i.e., 11,114 datapoints in a 32-dimensional space. Then, LDA was employed to reduce the dimensionality of this space to 2D. LDA is a dimensionality reduction technique commonly used as a preprocessing step for pattern classification. It is ‘supervised’, i.e., the classes are known before training, and it computes the directions (‘linear discriminants’) that maximize the separation between multiple classes. Diagrams (a) to (h) show the overlap and the separation of the different patterns, corresponding to similarities and dissimilarities of the spectral envelopes. As depicted in (b), the point pattern is highly dissimilar to the ‘bare’ membrane (i.e., no-paste pattern) as well as to any pattern that is symmetric with respect to the membrane center (d). To percussionists, this suggests that applying an adhesive pad will introduce a remarkable change in the sound texture of the drumhead. In contrast, applying angular dampeners, i.e., circular pads or muffle rings, introduces a decrease in the fundamental frequency without significantly altering the partial relationships and hence the perceived timbre. Furthermore, applying damping material using adhesive tapes, i.e., the diameter, radius, and cross patterns, will produce sound textures that are highly dissimilar to those of any membrane of radial symmetry, as well as to those obtained with an adhesive pad.
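The feature extraction step can be illustrated with a small, dependency-free sketch. It uses a naive DFT (adequate only for short demonstration signals), treats local maxima above a relative threshold as partials, and takes the lowest-frequency peak as the fundamental. The threshold, the function names, and the peak criterion are assumptions; the peak-picking algorithm actually used is not specified here.

```python
import cmath
import math

def magnitude_spectrum(signal):
    """Naive DFT magnitudes for bins 0..N/2 (fine for short demo signals)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2 + 1)]

def partial_features(signal, sample_rate, n_partials=16, rel_threshold=0.01):
    """Frequency and amplitude ratios of the first partials to the fundamental."""
    mags = magnitude_spectrum(signal)
    floor = rel_threshold * max(mags)
    peaks = [k for k in range(1, len(mags) - 1)
             if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks = peaks[:n_partials]              # may hold fewer than n_partials peaks
    bin_hz = sample_rate / len(signal)
    f0, a0 = peaks[0] * bin_hz, mags[peaks[0]]
    freq_ratios = [k * bin_hz / f0 for k in peaks]
    amp_ratios = [mags[k] / a0 for k in peaks]
    return freq_ratios, amp_ratios

# Synthetic tone with three partials at 20, 32, and 46 Hz (1 Hz bin spacing).
fs = 400
tone = [math.sin(2 * math.pi * 20 * t / fs)
        + 0.5 * math.sin(2 * math.pi * 32 * t / fs)
        + 0.25 * math.sin(2 * math.pi * 46 * t / fs) for t in range(fs)]
freqs, amps = partial_features(tone, fs)
```

Concatenating sixteen frequency ratios and sixteen amplitude ratios per sound gives the 32 features per datapoint; sounds with fewer detected peaks would need padding to keep the vectors at a fixed length.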
A further interesting visualization of the dataset is depicted in Figure 10. As an alternative to training with spectral overtones, PCA was employed to investigate the raw sounds after subsampling them to 22,050 Hz. This provided a data matrix of 11,114 × 22,050 values, i.e., 11,114 datapoints in a 22,050-dimensional space. PCA is an unsupervised data reduction technique that projects the data onto the hyperplane that is closest to the datapoints, while preserving most of their variance.
Several interesting observations were made by hovering over the datapoints of this visualization. Mouse-hovering provided information about the paste mass and the impact point of the membrane excitation. It was surprising to observe that the three black trajectories of the 0-pattern (Figure 10b) correspond to the three hitting points of the membrane. The outer trajectory corresponds to impact point (35, 45), which is the closest to the center of the membrane (please refer to Figure 6). The neighboring middle trajectory corresponds to the point (25, 30), and the inner trajectory corresponds to the impact point that is closest to the rim, namely (18, 20). The remaining patterns are aligned around these three black trajectories, although they appear sparser. This is because, as discussed in Section 3.1.2, the thickness values of the 0-pattern are uniformly distributed within their range, while for the remaining patterns specific parameter values were selected in an attempt to approximate a uniform distribution of paste mass. A second interesting observation is related to the distribution of paste mass: for every pattern and every trajectory, the paste mass increases along the trajectory from bottom right to bottom left.
A valuable conclusion drawn from the exploration of the produced dataset is that, as different patterns appear in different areas of the map of Figure 9 and symmetric patterns are dissimilar to non-symmetric patterns, the distribution pattern has an influence on the spectral envelope of the computed sounds, which may be computationally detectable. On the other hand, the trajectories of Figure 10 suggest that, although the paste distribution pattern is not detectable from the raw waveform, parameters such as the impact point of the initial excitation or the mass of the damping material may be effectively estimated by reducing the dimensionality of the raw waveform. An interactive web application of this visualization is currently being developed to allow percussionists to explore the sound dataset by listening to the corresponding sound samples and easily locating their preferred timbre. Annotated information will provide suggestions on how to physically manipulate their instruments to produce their favorite sounds.
3.3. Damping Inference
A deep neural network was implemented to identify the damping strategy for deriving a given sound texture. Dataset sounds were resampled at a sampling rate of 22,050 Hz, and the raw waveforms represented the input of the network, which was trained to recognize the paste pattern and to estimate the total mass of the applied paste, thus accounting for a classification and a regression task, respectively.
A multi-output CNN [22] was implemented to drive the training process towards making a combined inference of paste pattern and paste mass. The dimensionality reduction techniques (Section 3.2) demonstrated that each paste pattern spanned a considerable area on the 2D maps and that the patterns were not isolated, which indicates that the pattern and the mass increase due to paste have a combined effect on the resulting sound texture. For this reason, a multi-output network was chosen instead of training separate classification and regression networks.
3.3.1. CNN Architecture
The CNN model was implemented in Python using TensorFlow and Keras on the Google Colaboratory environment, which made use of a Tesla T4 GPU. The RandomSearch algorithm of the KerasTuner framework was used to derive an optimized model and tune the hyperparameter space. Through iterative cycles of training and validation, various architectural configurations, parameter combinations, and evaluation metrics were explored. The resulting optimal architecture consisted of a relatively shallow model with two narrow convolutional layers, followed by maxpooling layers, and a wide dense layer before generating the final output.
The architecture of the final multi-output CNN is shown in Figure 11. It comprises several layers, including convolutional, pooling, dense, dropout, and flatten layers. The input layer accepts an audio signal of 1 s sampled at a rate of 22,050 Hz. The input is fed to a 1D convolutional layer with 64 filters and a kernel size of three, followed by a max-pooling layer with a pool size of two, then by a second convolutional layer with 128 filters and a second pooling layer with a pool size of two. The output of the second pooling layer is passed through a dropout layer with a rate of 30%, and then it is flattened and fed to a dense layer of 512 units, which uses ReLU as the activation function. The network then splits into two separate outputs, one for classification and one for regression. The classification output is a dense layer with seven units that uses Softmax as the activation function. It outputs the class corresponding to the paste pattern, with values ranging from zero to six. The regression output is a dense layer with one unit and a linear activation function. It outputs the predicted mass of the paste applied on the membrane in kg.
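The layer sequence described above can be written down as a Keras functional model. The sketch below is an approximation under several assumptions that are not stated in the text: ReLU activations and 'valid' padding in the convolutional layers, a kernel size of three in the second convolution, a max-norm value of 3, and equally weighted losses. The small helper functions only compute the resulting tensor sizes and run without TensorFlow.

```python
def conv_pool_length(length, kernel_size=3, pool_size=2):
    """Output length after a 'valid' 1D convolution followed by max-pooling."""
    return (length - kernel_size + 1) // pool_size

def flattened_size(input_length=22_050, filters=(64, 128)):
    """Number of values entering the dense layer after two conv/pool stages."""
    length = input_length
    for _ in filters:
        length = conv_pool_length(length)
    return length * filters[-1]

def build_model(input_length=22_050, n_classes=7):
    """Keras sketch of the two-headed network (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helpers above run without it
    layers = tf.keras.layers
    max_norm = tf.keras.constraints.MaxNorm(3)   # assumed constraint value

    inputs = layers.Input(shape=(input_length, 1))
    x = layers.Conv1D(64, 3, activation="relu", kernel_constraint=max_norm)(inputs)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 3, activation="relu", kernel_constraint=max_norm)(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    pattern = layers.Dense(n_classes, activation="softmax", name="pattern")(x)
    mass = layers.Dense(1, activation="linear", name="mass")(x)

    model = tf.keras.Model(inputs=inputs, outputs=[pattern, mass])
    model.compile(optimizer="adam",
                  loss={"pattern": "categorical_crossentropy", "mass": "mse"})
    return model

print(f"values entering the dense layer: {flattened_size():,}")
```

With these assumptions the flattened feature map is large, so most of the trainable parameters sit in the 512-unit dense layer, consistent with the description of a shallow network with a wide dense layer.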
Besides the dropout layer, both convolutional layers used a max-norm constraint as a regularization method to improve generalization [23]. The model used the Adaptive Moment Estimation (ADAM) algorithm for optimization, and the cost functions were based on Categorical Cross-Entropy (CCE) loss for multiclass classification and Mean Squared Error (MSE) for regression.
3.3.2. CNN Training
The sound dataset was split into training and test sets in a proportion of 75% (8335 sounds) to 25% (2779 sounds), respectively. Three-fold cross-validation, specifically StratifiedKFold, was used to preserve the class balance across folds and to reduce potential biases in the validation process.
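The split sizes and the idea behind stratified folds can be sketched without any library. The fold assignment below is a simplified stand-in for StratifiedKFold, not its actual implementation: the samples of each class are shuffled and dealt to the folds in turn, so that every fold receives nearly the same share of each pattern. Rounding the test size up is an assumption that happens to reproduce the reported counts.

```python
import math
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=3, seed=0):
    """Assign sample indices to folds so that every class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    offset = 0                      # keeps the overall fold sizes balanced
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds

n_sounds = 11_114
n_test = math.ceil(0.25 * n_sounds)     # assumed rounding of the 25% share
n_train = n_sounds - n_test
print(f"train: {n_train}, test: {n_test}")
```

In each of the three rounds, one fold serves as the validation set and the remaining two are used for training.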
The training history is depicted in Figure 12. Training used a batch size of 20 samples. Early stopping and learning-rate reduction were used to increase the efficiency of the training process. As shown in Figure 12, training was completed in 200 iterations (epochs), which required almost 3 h.