A Review of Neural Network-Based Emulation of Guitar Amplifiers

: Vacuum tube ampliﬁers present sonic characteristics frequently coveted by musicians, that are often due to the distinct nonlinearities of their circuits, and accurately modelling such effects can be a challenging task. A recent rise in machine learning methods has lead to the ubiquity of neural networks in all ﬁelds of study including virtual analog modelling. This has lead to the appearance of a variety of architectures tailored to this task. This article aims to provide an overview of the current state of the research in neural emulation of analog distortion circuits by ﬁrst presenting preceding methods in the ﬁeld and then focusing on a complete review of the deep learning landscape that has appeared in recent years, detailing each subclass of available architectures. This is done in order to bring to light future possible avenues of work in this ﬁeld.


Introduction
The use of vacuum tubes in electronics has largely diminished since the advent of semiconductor technologies, leading to their replacement in almost all fields except that of music technology, where a return to and rise of vacuum tubes can be observed [1]. Musicians tend to prefer the sonic characteristics of vacuum tube amplifiers over those of solid-state ones. Nevertheless, the shortcomings of vacuum tubes in the realm of guitar amplification are numerous and include elevated cost and weight, poor durability, and high power consumption. Their solid-state counterparts remedy some of the disadvantages of vacuum tube amplifiers but remain less popular among musicians who often seek the particular tone of the tube amplifiers used in iconic albums and guitar rigs.
Researchers have turned to Digital Signal Processing (DSP) methods for the emulation of vacuum tube amplifiers in order to circumvent some of the downsides of this technology. These DSP methods are not limited to guitar amplification or distortion and can consist of a wide range of audio effects such as reverberation, delay, and pitch-shifting [2]. The existing DSP methods for tube amplifier emulation can be divided into three categories depending on the degree of prior knowledge of the target device used. The "white-box" approach is historically the first and emulates each electronic component. The "black-box" approach tries to match the output of the target device given a specific input using mathematical functions uncorrelated with the internal components. Gray-box methods are similar to black-box but utilize some information about the target device in order to fine-tune the model. They comprise a block-oriented structure inspired by internal information of the device but the simulation disregards the behaviour of each of the individual components. These three modelling approaches will be discussed in further detail in Section 2.
These three "traditional" methods do, however, present some drawbacks. On one end of the spectrum, obtaining the internal knowledge of the target device needed for white-box approaches is both a costly and labour-intensive task and can provide inaccuracies in the emulation. They can also be computationally expensive. On the other side of the spectrum, black-box methods can also be computationally costly or cannot accurately approximate the nonlinear mapping of the amplifier. These methods will be discussed further in a future section. The computational cost of these methods is also prohibitive for real-time use, which is a fundamental factor to be considered in Virtual Analog (VA) modelling. Amplifier emulations need to be fast enough to run in real-time in order to be used by musicians. If the latency of an emulation exceeds around 10 ms, it can be perceived by the musician and hinder their playing [3].
The use of machine learning, in particular neural networks, has seen a significant increase in all data-related fields in recent years due to, in part, the increased computational capabilities of modern-day computers and the possibility of parallelizing computations on their GPU. The field of audio processing is no exception and the application of neural networks to the task of tube amplifier emulation seems to be appropriate. Indeed, neural networks are made up of the basic building blocks necessary for this type of problem including the presence of nonlinear activation functions which are well suited to model the nonlinearities created by vacuum tubes. Moreover, certain neural network architectures are able to capture and replicate the dynamic nature of a system. For example, autoregressive models retain information from previous time steps for their computations. Thus, the memory effect of amplifiers can be modeled using the appropriate type of neural network. Additionally, neural networks reduce the time necessary to create the emulation of a given analog device. Whereas for white-and gray-box methods it is necessary to repeat the modelling process for each circuit, only the training data needs to be modified for a new emulation via neural networks. The architecture remains unchanged. Parallels have also been made between typical DSP filters and certain neural networks [4]. Therefore, neural networks appear to be an appropriate candidate for the task of vacuum tube amplifier emulation.
The goal of this paper is to present an overview of the current state-of-the-art methods in vacuum tube amplifier emulation, with a focus on the newly emerging deep learning (DL) approaches. The structure of this paper is as follows: Firstly, a brief overview of the traditional white-, gray-, and black-box approaches mentioned above will be presented in Section 2; Then, an extensive review of the current neural network architectures used for amplifier emulation will be presented along with an evaluation of the approaches as well as their limitations in Section 3. Finally, a summary and discussion of the architectures will be given with the intent of providing a global view of the state-of-the-art in Section 4.

White-Box
The first class of methods used for the emulation of vacuum-tube distortion aims to create a model of the physical device using internal information of the amplifier to establish a system of differential equations. The main methods of this approach are state-space models, Modified Nodal Analysis (MNA), Port-Hamiltonian formalism, and the use of Wave Digital Filters (WDF). Here we present a non-exhaustive list of white-box methods in order to illustrate the limitations that can occur. More detail on the landscape of white-box methods can be found, for example, in the works of Pakarinen and Yeh [5,6].
Nodal analysis is a method used in circuit simulation techniques in order to establish the system of equations in matrix form to be solved. This is accomplished using the Kirchhoff Circuit Laws (KCL) at each node. Modified Nodal Analysis extends this to incorporate auxiliary equations into the system using the circuit's Branch Constitutive Equations (BCE) in order to solve the resulting system, which has the following form: MV = I where M is the matrix of conductance coefficients for KCL, I is a vector of independent current sources in the circuit, and V is the vector of unknowns (which are typically voltages). The most well-known electronic circuit simulator is SPICE (Simulation Program with Integrated Circuit Emphasis), which uses a combination of component-wise discretization of the circuit and MNA to create the system to be solved. This approach, while accurate, is computationally heavy and therefore poses a problem for real-time use [6]. Moreover, if an element within the circuit is changed, the MNA process needs to be repeated to take into account this change.
The WDF approach constructs digital filters based on the traveling-wave formulation of the physical elements of the device, obtained via a change of variables from the Kirchhoff to the wave domain which describes circuit dynamics in terms of incident waves a = v + Ri and reflected waves b = v − Ri. Unlike nodal analysis, this formalism allows for modularity without delay-free loops (i.e., loops that cannot be computed as the input and resulting output are needed at the same time step in order to accomplish this). Discrete-time models of each of the individual components (resistors, capacitors, and inductors) are connected to construct the filter. An interesting benefit of this approach is that the finite difference scheme resulting from the use of the bilinear transform for the discretization remains explicit for any number of models. WDF was used to model a number of distortion effects including a triode amplifier in [7].
State-space models rely on the principle that the equations of motion for any physical system may be formulated in terms of the state of the system: is the state of the system at time t, u(t) is a vector of external inputs, and the general vector function f t specifies how the current state x(t) and inputs u(t) cause a change in the state at time t. f t may be time-varying itself. All past states and the entire input history are "summarized" by the current state x(t). Thus, x(t) must include all the "memory" of the system.
To solve the nonlinear state space system, two notable approaches are the K-method and the Discrete K-method (DK-method). The K-method was originally proposed to avoid noncomputable delay-free loops in the system. It does this by transforming the system to highlight all loops involving nonlinear maps. The delay-free paths are isolated in order to be eliminated via geometrical transformations [8].
The DK-method establishes a system of Ordinary Differential Equations (ODE) based on MNA which is then solved to find the state transition matrix and the nonlinear relationships between variables in order to produce a nonlinear filter similar to that of the K-method. However, in the K-method, the ODE are discretized after derivation whereas in the DK-method, discretization is applied first [9]. David Yeh applies the DK-method to model a Bipolar-Junction Transistor (BJT) and a triode amplifier in a real-time plugin and evaluates this against a reference SPICE model [9].
The final method that will be described consists of a state-space representation that is structured based on the various energies of a system and their dynamics. This representation is structured into its energy-storing parts, dissipative parts, and then the external sources of energy [10]. This structure falls under the Port-Hamiltonian (PH) formalism. Like for the K-method family of approaches and WDF, the PH structures are derived from the circuit schematic by use of its reigning Kirchhoff laws. However, unlike the other two methods, the passivity property is conserved which ensures stability of the model [10]. This approach has been successfully used for the simulation of such analog effects as a BJT amplifier and a wah pedal in [10].
Each of these methods translates the circuit diagram of a given amplifier into a set of equations that completely describes the system which is then discretized, usually using a finite-difference or integration scheme, to then be solved iteratively. In order to establish these equations, access to the circuit diagram or study of the internal structure of the amplifier is necessary and results in a labour-intensive task. Moreover, the resolution of these differential equations relies on computationally expensive iterative methods or storing lookup tables and the component values measured can also introduce inaccuracies into the emulation or fail to faithfully emulate the particularities of a given amplifier.

Gray-Box
Gray-box methods seek to alleviate some of the labour required for the modelling task using white-box approaches. They do this by incorporating internal knowledge of the device in the form of blocks in order to simplify the model. Indeed, modelling each of the individual circuit components is not necessary in gray-box; only the behaviour of the device as a whole needs to be modelled.
Block-Oriented structures are a subclass of Volterra series [11] that aim to separate the Device-Under-Test's (DUT) effect on the input signal into "blocks": one set dealing with the dynamic, linear behaviour of the system, and one to introduce the nonlinearity. The common configurations of these structures are the following: • Hammerstein (and their subclass cascade Hammerstein) models: Static nonlinearity followed by linear filter. • Wiener models: Linear filter followed by static nonlinearity. • Wiener-Hammerstein models: Static nonlinearity between two linear filters.
Block-Oriented structures can be considered as gray-box methods as they can take into account the high-level structure of the DUT for their configuration and most examples of block-oriented structures applied to distortion approximation fall under the realm of gray-box modelling.
One of these configurations (Wiener-Hammerstein) has even been described as the "fundamental paradigm of electric guitar tone" in [12] and this family of methods is used in commercial products and patents such as Fractal Audio's Axe-Fx [12] and a 2008 patent from Kemper [13]. This patent describes a Wiener-Hammerstein topology and Fractal's white paper extends this model to include extra linear and nonlinear blocks in their MIMIC technology. Most notable in the field of gray-box guitar amplifier modelling using block-oriented structures are the works of Felix Eichas and Udo Zölzer, in which system identification using input-output signals and optimization of the block topology using iterative methods to minimize the error between model output and the measured target are used. These approaches will be detailed in this section.
In [14], a parametric Wiener-Hammerstein model for distortion effects is presented along with the iterative algorithm used for the optimization portion of the problem. The Levenberg-Marquardt algorithm was chosen for the minimization and the DUT was a Hughes and Kettner Tube Factor amplifier. This method produced satisfying results that were nearly indistinguishable from the target, but the model becomes more inaccurate for frequencies outside of the range [60 Hz, 18 kHz].
In a following work, an extended Wiener model was tested for the emulation of three different distortion devices: a diode clipper with pre-amplification of the input signal, a BJT distortion stage of an Electro-Harmonix Big Muff Pi, and the op-amp stage of an Ibanez Tube Screamer (all simulated using a SPICE circuit simulator). This simple topology had a harder time modelling the desired effects and only the diode clipper circuit was modeled successfully.
Eichas and Zölzer [15] presented a gray-box approach in which the modelling process is automatic using input-output measurements and iterative optimization. Its structure is the following: three linear filters inter-spaced with nonlinear blocks, the first consisting of a polynomial waveshaping function and the second consisting of a concatenation of hyperbolic tangents. An Exponential Sine Sweep (ESS) is used for the system identification process along with the Levenberg-Marquardt algorithm for the optimization problem. This method was used to model a Fender Bassman 100 and an Ampeg VT-22 but struggled at modelling distortion.
Finally, in [16], analog amplifiers including a Bassman 100 and a JCM 900 were modeled with an automated procedure and an extended Wiener-Hammerstein configuration consisting of a nonlinear mapping between two Linear Time-Invariant (LTI) blocks connected in series. Again the Levenberg-Marquardt was used for the optimization and the model performed well even for strong nonlinearities and has a low computational load compared to white-box methods but is not able to incorporate certain subtle effects such as power sagging or crossover distortion.
Block-oriented, gray-box methods present a lower computational cost than whitebox methods but struggle with emulating the target devices more so than the latter class of methods.
In addition to the block-oriented structures presented here, the use of machine learning and Artificial Neural Networks (ANN) has started to see a rise in the gray-box approaches and will be detailed further in a later section.

Black-Box
Black-box approaches are system identification methods that seek to remedy some of the problems of gray-and white-box approaches by relying on only input-output measurements to describe the overall behaviour of the amplifier. These methods enable emulation without the need for prior knowledge of the internal circuitry of the amplifier. They also tend to be more computationally efficient than white-box models and allow for the replication of idiosyncrasies that can exist in the analogue devices (due to imperfect electrical components for example). The main methods in this category are Volterra series, dynamic convolution, block-oriented structures (which are a subclass of Volterra series), and kernel regression.
Dynamic convolution, proposed in [17], is a variant of the system identification method used for linear systems in which instead of a single impulse response used to derive the transfer function of the DUT, multiple impulses are used as input at different amplitudes in order to obtain an approximation of the nonlinear behaviour. In this work, the standard convolution operation y(n) = ∑ L−1 k=0 x(n − k)·h(k) where h is the impulse response of length L, y is the output signal, and x the input, is replaced with so-called dynamic convolution operation defined as follows: where S(x(n)) is a selector function that determines which impulse response should be used for the input sample x(n). This method was used in the Sintefex FX8000 Audio Effects Replicator. Certain Machine Learning (ML) methods have also been proposed for the emulation of guitar distortion circuits. A Support Vector Machine (SVM), which is a class of linear ML algorithm, was used to emulate a common-cathode tube amplifier via kernel regression [18]. In the article, the authors linearize the nonlinear system in order to be able to apply the SVM to the regression task and suggest methods for choosing the proper kernel to linearize the system. The linearization is accomplished by mapping the data into a higherdimensional vector space in which the nonlinear functions of the system can be replaced by linear operators. Kernel methods are used to avoid the computationally intensive task of representing these functions in the higher-dimension space, requiring instead only the computation of the inner products between the vectors in this new space. While this method is theoretically solid, the choice of the necessary mappings and kernel function can be a difficult task.
ANN are a class of machine learning, black-box methods that use neurons and nonlinear activation functions such as sigmoid or hyperbolic tangent, functions frequently used in VA modelling to model various nonlinearities. This enables ANN to approximate complex mappings and recent technological improvements have enabled their wide-spread, leading to a rise of artificial intelligence in almost all fields in recent years. Similarities between a number of ANN architectures and certain black-box modelling methods can be drawn and suggest that they would appear to be well suited to the task of vacuum tube amplifier emulation as we shall discuss in further detail in a future section.
The Volterra series [19] is a functional expansion of multidimensional convolution kernels that enables the emulation of the "memory" effect of an amplifier as the output of the nonlinear system is dependent on the input at previous time steps. It is an extension of the common model of a linear system: where H is the transfer function of the system and x(t) the input. It expands this model to include nonlinearities by adding convolution integrals that take into account the interactions between the previous inputs thereby retaining the "memory" effect of the causal system: The functionals of this expansion, called Volterra kernels, can be estimated using various methods (e.g., cross-correlation or least-squares) but is a computationally intensive task. Moreover, Volterra series work well for relatively linear systems but struggle when the system being modeled exhibits very strong nonlinear behaviour. An example of Volterra series applied to the simulation of analog audio devices is presented in [20] where this method is used to emulate the weakly nonlinear behaviour of the Moog ladder filter. This filter is a key contributing factor to the Moog synthesizer's sound and is named so because of its schematic that resembles a ladder.
The block-oriented structures presented previously can also fall under the scope of black-box methods. Indeed, Eichas and Zölzer [21] used a Wiener model to emulate three distortion effects including a diode clipper with pre-amplification of the input signal, a BJT distortion stage of an Electro-Harmonix Big Muff Pi, and the operational amplifier (op-amp) based distortion stage of an Ibanez Tube Screamer.
Overall, the black-box methods presented here remain computationally expensive and/or tend to struggle when simulating the complex behaviour of vacuum tube amplifiers, particularly for very high levels of nonlinearity. Neural networks are another class of blackbox approach that have started to be applied to VA modelling and could remedy some of the downsides of the other methods detailed here.

Neural Network-Based Methods
The use of neural networks has seen a rise in practically all scientific fields since the start of the 21st century thanks to increased computational resources and the availability of larger data sets. These ANN have subsequently been able to achieve a level of accuracy that far surpasses that of other methods used previously. This is the case in a number of fields, particularly in computer vision with the use of Convolutional Neural Networks (CNN) and in speech recognition and machine translation with the use of Recurrent Neural Networks (RNN). CNN are networks specialized in processing grids of values such as images through convolution operations that serve to extract salient features in the pixels. RNN are tailored to processing sequences of values and are particularly suited for timedependent or sequence-based tasks such as language translation or natural language processing [22].
A neural network is a class of machine learning algorithm whose basic structure is made up of inter-linked neurons and often nonlinear activation functions. These neurons are organized into layers and are made up of a set of mathematical operations wherein the input to the layer is multiplied by a set of learnt weights, bias is optionally added, and the activation function is applied to the output which is sent to the next layer. The value of the weights are learnt by the network via an optimization task in which a given distance between the target and the network output is minimized. This enables the network to learn complex nonlinear mappings between the inputs and outputs which would make this method suited to the task of distortion modelling.
We can see similarities between this process and certain methods described previously in the black-and gray-box categories of methods. For example, an architecture comprising a 1D convolutional layer followed by a nonlinear activation can be considered as a Wiener model (linear filter followed by a nonlinearity) [23]. Certain nonlinear activation functions present in neural networks, such as the sigmoid or hyperbolic tangent, appear frequently in the literature of VA modelling [6]. Additionally, certain architectures that exist, designed for times series and sequence modelling, are well suited to the amplifier emulation task and also bear similarities with classic DSP elements. For example, Kuznetsov et al. [4] show an equivalence between infinite impulse response (IIR) filters and RNN. Taking into account such similarities between traditional VA modelling and neural networks, it seems natural to extend the use of ANN to this simulation task.
However, amplifier modelling via neural networks is not a straightforward task. The deep, convolutional, and data-hungry methods used in imaging for example can pose a problem for real-time use. In addition, processing raw waveforms in the time-domain means that we have to deal with extremely high temporal dimensionality because of the sampling rates needed in order to achieve high-quality audio. Indeed, a minimum sampling rate of 44.1 kHz is required [24]. Moreover, the sampling rate used in this type of application is often a constraint imposed by the hardware. This is a less detrimental factor in speech processing where sampling rates of around 16 kHz, or even 8 kHz, can be sufficient. These factors increase both the computational cost and the complexity of the task and render music deep learning challenging.
A number of DL architectures have appeared in the state-of-the-art of distortion circuit modelling techniques in recent years including various configurations of both convolutional and recurrent layers. Here, we study the recent works that have appeared in this state-ofthe-art. Each aspect of the neural modelling problem will be discussed with recapitulative tables that aim to summarize and clarify the situation. The different aspects that will be detailed here are the architectures used, the different modelling approaches, the various data sets, the loss functions, the evaluation techniques, and finally the real-time capabilities of each body of work.

Architectures
We detail here the three main categories of architectures that have populated the state-of-the-art: convolutional networks, recurrent networks, and hybrid configurations.
The first instance of CNN being used for the task of amplifier and distortion effects modelling was presented in [25] for the emulation of tube amplifiers and their follow-up article [26] for the emulation of distortion pedals, where the authors use a feed-forward variant of the WaveNet architecture from [27]. This autoregressive architecture from Deep Mind was originally designed for the generation of raw audio waveforms for speech synthesis. The feedforward variant presented by Damskägg et al. [25] contains a stack of dilated causal convolution layers, which enable for a large field of view without increasing the computational cost of the processing. The dilation is done by only using certain outputs from the previous layers for the convolution operation as illustrated in Figure 1. The larger field of view means that the network can be exposed to more past information and is thus able to better model long-term dependencies of the signal.
In the first article from Damskägg et al. [25], two models of different sizes are presented and compared, both with 10 convolutional layers with a filter size of 3 and a dilation pattern of {1, 2, 4, . . . , 512}. Both networks have a post-processing block comprising a three-layer fully connected network and the smallest network has two channels in the convolutional and post-processing layers for a total of around 600 parameters, while the largest one has 16 channels in both the convolutional and post-processing layers for around 30,000 parameters. The network was used to model a Fender Bassman 56F-A preamplifier. The WaveNet models were compared to a block-oriented model from [21] and a Multi-Layer Perceptron (MLP). The larger WaveNet model outperformed the others in both the objective and subjective evaluation. In their follow-up article [26], Damskägg et al. focus more on the real-time possibilities of their WaveNet architecture applied to distortion pedal emulation. The feedforward WaveNet is slightly modified in this work, for instance the three-layer post-processing block from their previous work is replaced with a single 1x1 convolutional layer. They aim to find a trade-off between accuracy of their method and computational load as well as the minimum amount of data required for training. For this, three different configurations were studied, see Table 1. Table 1. Feedforward WaveNet configuration summary. The nomenclature for these configurations is not generalized, and the names used here are taken from [26]. This new WaveNet variant was tested on three effects pedals: Ibanez Tube Screamer, Boss DS-1, and an Electro-Harmonix Big Muff Pi.

Model
A modified version of the feedforward WaveNet architecture, also known as Temporal Convolutional Network (TCN), was used in [28] in order to achieve more efficient computation for real-time use of models with more complex nonlinear behaviour. The authors show that by using shallower networks with very large dilation factors (in order to retain a large enough receptive field with this shallower configuration), they were able to achieve comparable performance with greater efficiency. In order to have a large enough receptive field for the input of the shallow network, large dilation factors are necessary. In [26], a dilation factor of d l = 2 l−1 where l is the lth layer of the network was used, whereas, here, 10 l−1 was used for the dilation growth. Batch normalization was also used in this work and the original gated activation from [27] is instead replaced with the Feature-wise Linear Modulation (FiLM) activation from [29]. This activation consists of a feature-wise affine transformation conditioned on scaling and bias parameters obtained via an embedding of the device's control parameters. Both causal and non-causal versions of this modified architecture were tested and compared for the emulation of a LA-2A dynamic range compressor. They achieved comparable performance but the noncausal variants performed slightly better in the time-domain. The results of listening tests indicated that a small difference was perceived among the models in comparison to the reference.
The first article presenting the use of recurrent neural networks for vacuum tube amplifier modelling dates back to 2013 where a Nonlinear AutoRegressive eXogenous (NARX) network was applied to the task in [30]. A NARX network is similar to a traditional RNN (i.e., a fully-connected network incorporating recurrence) but with limited connectivity to remedy the training problems usually present in RNN which are linked to vanishing and exploding gradients due to their recursive nature. The architecture used was a two-layer, feedforward network with sigmoid activation functions in the hidden layer and a single linear neuron in the output layer. The audio quality of this method was reported to be low when modelling a 4W Vox AC4TV tube amplifier either due to insufficient training or limited model capacity.
Zhang et al. [31] follow up the work on the NARX network with a study using Long Short Term Memory (LSTM), first proposed in [32], for the task of amplifier modelling. Again, a 4W Vox AC4TV was chosen for this work. LSTM is a RNN variant that incorporates the use of various gates to control information flow through each recurrent layer in order to avoid the vanishing and exploding gradients of regular RNN. These gates are the following: Forget gate: Input node: Output gate: Cell state: Hidden state: where σ is the sigmoid activation function, the Ws represent the weights of the layer, the bs represent the biases, and is the Hadamard product. The x and the h refer to the input of the cell and the hidden state, respectively. The models used were configured as three or four-layer networks with a structure comprising a variable length input, followed by LSTM with a variable number of units (including either linear or tanh activation) and a linear output unit. The number of units used in the LSTM layers varied from 5 to 20, with a number of hidden units varying in the same range and a sequence length in the range of [1,5]. During subjective listening tests, the audio quality of the network output, when compared to the target device, was not deemed satisfactory by semiprofessional guitarists.
Wright et al. tested a new LSTM architecture along with another variant on the traditional RNN, the Gated Recurrent Unit (GRU) [33] and both were compared to the WaveNet architecture from [26]. The GRU, like LSTM, aims to remedy exploding and vanishing gradient problems by controlling information flow through using an update gate instead of the four gates of LSTM, leading to improved computational efficiency. The computations carried out in a GRU are the following [34]: Update gate: New gate: Hidden gate: The architecture used is comprised of a single recurrent layer followed by a fully connected one. Preliminary experiments showed that adding extra recurrent layers had little effect on the audio quality of the output. This method was used to model a pedal (Big Muff) and a combo amplifier (Blackstar HT-1) in [33]. In terms of objective quality, the most accurate RNN outperformed the WaveNet for the pedal and the most accurate WaveNet outperformed the RNN for the amplifier, with LSTM outperforming the GRU in terms of accuracy with roughly the same processing time.
In a follow-up article [23], further comparisons between the recurrent networks from [33] and various WaveNet configurations were carried out for the modelling of two vacuum tube amplifiers: a Blackstar HT-5 Metal and a Mesa Boogie 5:50 Plus. This study includes the results from previous articles on the feedforward WaveNet and the LSTM network for distortion pedal emulation, and extends this with the inclusion of the vacuum tube amplifier models.
Various configurations of each architecture were trained and compared in order to gauge both the real-time capabilities (using C++ implementations) and the audio quality. The largest configurations were WaveNet3 with gated activation from [27], 18 layers, and 16 channels, and LSTM96 with 96 recurrent units. The smallest configurations were WaveNet1 with gated activation, 10 layers, and 16 channels, and LSTM32 with 32 recurrent units. The LSTM provides better processing speeds than the WaveNet models; however, the largest WaveNet can better model the highly nonlinear HT5M amplifier.
Analog distortion effects can be emulated by various configurations of neural network architectures [35]. In this work, eight architectures are tested and compared, one of which is the subject of a 2018 article [36]. These architectures include four LSTM networks, one hybrid convolutional LSTM, one plain Deep Neural Network (DNN) in the form of a MLP, one CNN, and one hybrid convolutional RNN. The most notable architectures of this work are: • The parametric LSTM in which the input dimensions are extended in order to taken into account the amplifier parameters. • The convolutional LSTM in which two stacked 2D convolution layers are used to redimension the input signals in order to parallelize the computations on GPU for accelerated processing. • The sequence-to-sequence LSTM which outputs a buffer instead of a single sample, again to accelerate processing.
All architectures presented vary in terms of accuracy and real-time performance but the hybrid convolutional LSTM presented the best Computation Time (CT) and accuracy trade-off.
Another category of hybrid method includes autoencoder architectures such as those presented in the works of Martinez-Ramirez et al. [37][38][39]. The general structure of an autoencoder comprises three stages, namely, an encoding front-end, a latent space containing the new representation of the input data, and a decoding back-end. This structure enables the model to learn an approximate copy of the input, forcing it to prioritize useful properties of the data [22].
In their preliminary article focused on modelling nonlinear audio effects [39], a convolutional autoencoder with fully connected latent space (dubbed CAFx) was proposed with the following structure: an adaptive front-end consisting of two 1D convolutional layers (the first with 128 filters of size 64 and the second with 128 filters of size 128), a max pooling layer, batch normalization before this pooling and a residual connection; as well as a latent-space of two dense layers with 64 units and a decoder consisting of four fully connected layers (of sizes 128, 64, 64, 128) whose last layer includes a Smooth Adaptive Activation Function (SAAF). This architecture was used to model three of the audio effects in the IDMT-SMT-Audio-Effects data set [40]: distortion, overdrive, and equalization (EQ).
Their follow-up article [37] modifies the latent space of this structure by replacing the fully connected layers with Bidirectional LSTM (Bi-LSTM) which are LSTM containing forward and backward information at every time step to create a convolutional and recurrent autoencoder (CRAFx) to model more complex, time-varying audio effects, again from the IDMT-SMT-Audio-Effects data set.
Another variant of this architecture that uses the feedforward WaveNet in the latent space (CWAFx) is introduced in [37] and the three autoencoders are compared with the original feedforward WaveNet architecure introduced in [26] on various modelling tasks, including that of a vacuum tube amplifier, sampled from a 6176 Vintage Channel Strip unit. The results of this comparison showed that both the feedforward WaveNet and CAFx, the original DNN autoencoder, performed similarly and that they are both outperformed by CRAFx and CWAFx with CRAFx performing slightly better than CWAFx. It was reported that the vacuum tube preamplifier was able to be successfully modeled on the two-second samples.
The last two architectures presented in Table 2 fall under the scope of gray-box neural methods which, along with newly emerging white-box methods, aim to improve interpretability of neural-network based approaches.

Approaches
Neural networks are black-box by nature but in recent years they have started to be integrated into both gray-and even white-box modelling methods. Indeed, the recent works of Parker et al. [41], Nercessian et al. [43], Kuznetsov et al. [4], and Aleksi Peussa [42] fall into the category of gray-box approaches.
Parker et al. present the State Trajectory Network (STN) in their gray-box approach [41], which is a method of integrating neural networks, namely, a MLP here, into a State-Space model. This method aims to augment the black-box neural network approach by integrating the internal values into the training data for a more accurate simulation. A number of distortion circuits are tested, namely a simplified version of the main distortion stage in the Boss DS-1 pedal was used in the form of a second-order diode clipper. Results show that this method is viable as all the circuits modeled were said to be indistinguishable from the targets in informal listening tests however the network training can be unstable [41,42].
Aleksi Peussa augments the STN in his Masters thesis [42] to include recurrence using a GRU and compares this to both the original STN and a black-box network using only the GRU trained on input-output recordings. This work confirms the instability in training for the STN as it was unable to model a Boss SD-1 pedal. Although the State-Space GRU was able to emulate this pedal, it was outperformed by its black-box equivalent. However the state-space model managed to outperform the black-box one when applied to a Moog ladder filter due to its self-oscillatory nature.
Some of the gray-and white-box approaches making an appearance in the neural network landscape result from the introduction of the Differentiable Digital Signal Processing (DDSP) library from Magenta [44]. DDSP enables the integration of classic signal processing elements, such as filters or oscillators which are typically non-differentiable and thus cannot be dealt with via gradient-based optimization techniques, into the deep learning pipeline in Tensorflow.
Kuznetsov et al. [4] explore the idea of differentiable IIR filters using the DDSP library. The authors present the link between IIR filters and RNN and present a Wiener-Hammerstein model using differentiable IIR filters. This model is used to emulate a Boss DS-1 distortion pedal and compared with a simple convolutional layer as a baseline. None of the models were able to fit the target data perfectly using this method.
Differentiable IIR filters are explored further in [43] by proposing a cascade of differentiable biquads to model a distortion effect. A digital biquad filter is a second order recursive linear filter containing two poles and two zeros, and higher order IIR filters can be created by cascading biquads in series [45]. The proposed model is said to have significantly fewer parameters and reduced complexity when compared to more traditional black-box architectures. This method was used to model a Boss MT-2 distortion pedal and comparison with WaveNet showed that the parametric EQ representation of the cascaded biquad outperformed the other three representations as well as the WaveNet.
Finally, white-box approaches have started to appear also in the landscape of neural methods. Esqueda et al. [46] implement a white-box model in differentiable form which allows approximate component values to be learned, thus remedying the accuracy problems that can arise in white-box modelling due to lack of access of the exact component values of the DUT's circuit. This method was tested on a Fender, Marshall, Vox (FMV) tone stack as well as on an Ibanez TS-808 Overdrive stage in order to validate the proposed model. The advantages and downsides of each approach is presented in Table 3.

Data Sets
The performance of any of the networks, no matter what category of approach they may fall under, is directly determined by a number of choices made regarding the training process. Notably, the choice of data set is critical.
The performance of neural networks for any given task depends heavily on the data set they are given during the training phase. Indeed, the network needs to be exposed to a wide range input-output pairings in order to learn an accurate mapping for the majority of cases it will encounter during its use.
There exists a number of data sets used for the task of amplifier modelling, all bearing certain similarities. Almost all of the data used throughout the state-of-the-art comprises clean guitar Direct Input (DI) sent through either the analog device or a SPICE simulation of the device. However, the data used depends on the approach.
In black-box approaches, only input-output guitar recordings are necessary for the neural network training, whereas for gray-or white-box different or additional data is required. In the gray-box approaches presented in [4,[41][42][43], component values of the internal circuit are also used in the training data. In the white-box approach of Esqueda et al. [46], only the circuit component values are used.
In the black-and gray-box approaches, certain aspects of the training data have a significant impact on the resulting model. These aspects are, namely, the sampling rate, the length, and type of the data. The sampling rate dictates the audio quality of the simulation and impacts its real-time processing capabilities.
The data used to train the WaveNet from [25] was obtained from a SPICE simulation of a Fender Bassman 56F-A preamplifier applied to DI from a Freesound data set for audio tagging [47]. A total of 4 h of data was used for the train set and 20 min for validation split into 100 ms segments and a random gain value in the range of [−15 dB, 15 dB] was applied to the inputs for more dynamic range. All data were recorded with a sampling rate of 44.1 kHz.
In their follow-up article [26], Damskägg et al. show that as little as three minutes of data is sufficient for the training of the convolutional networks, although final results presented were obtained with five minutes of data (50% guitar and 50% bass) from the IDMT-SMT-Guitar/Bass data sets from [48,49]. The sampling rate for all recordings was 44.1 kHz. These data sets contain a variety of single note recordings of various different playing styles with varying pickups. The raw inputs from this data set were sent through three effects pedals: Ibanez Tube Screamer, Boss DS-1, and an Electro-Harmonix Big Muff Pi. This data set was also used to train the recurrent networks from [33].
The amplifier models of [23] used a different training set than the pedal emulation, taken from a pre-existing data set. This data set was tailor-made for this modelling task and was published in [50]. It includes five different styles of guitar sounds sent through various guitar amplifiers with their gain parameters set to ten different levels. The audio used in [23] consists of around three minutes of guitar audio recorded at 44.1 kHz with the training set consisting of 2 min 43 s of audio. This data was used to train both the WaveNet style model and the LSTM.
The SignalTrain data set from [51] was used for training, testing, and validation of the shallower TCN architectures. This data set contains input-output recordings (at 44.1 kHz) of various instruments from a LA-2A dynamic range compressor.
The training data used for the NARX network of [30] was comprised of both signals from a function generator (with frequencies in the range [100 Hz, 500 Hz]) and an electric guitar fed to a vacuum tube amplifier, a 4W Vox AC4TV. All training data was recorded at a sampling frequency of 96 kHz and saved to 24-bit stereo wav files with one channel containing the raw input signal and the other containing the tube amplifier signal. The guitar recordings from this data set were used for the LSTM training in [31].
A summary of these data sets is presented in Table 4.

Data Set Details
Data set from [50] Five different playing styles of guitar sounds passing through different amplifiers Used in [23,35,36] at 44.1 kHz IDMT-SMT-Audio Effects [40] 44.1 kHz used in [28] and downsampled to 16 kHz in [37][38][39] Freesound [47] Input signal sent through SPICE simulations at 44.1 kHz. Used in [25] IDMT-SMT-Guitar/Bass [48,49] Used in [23,26,33] Custom 10-s logarithmic sine-sweeps combined with low-level white noise. This combined signal was increased in amplitude linearly from 0 to 1 over 1/2 the length of the signal. Min freq = 20 Hz and max = 10 kHz at 192 kHz [41] As the choice of training data has a decisive impact on the performance of a neural network, so does the choice of cost function used in the optimization process.

Loss Functions
Training neural networks is an optimization problem in which we often aim to minimize a given loss function. This loss represents the distance between the prediction and the target and must therefore accurately depict the perceptual difference between signals. This is often not the case with objective losses such as the Mean-Squared Error (MSE) defined as: The models used for amplifier modelling process raw audio in the time domain and the losses used are computed directly with the signal's waveform whose shape does not perfectly correspond to human perception. For example, different frequencies are not perceived in the same way to the human ear, with certain frequencies appearing louder than others [52]. This difference in loudness is represented by equal-loudness contours.
To improve the accuracy of a given network, spectral information can also be included into the loss computations but this approach presents its own problems and most of the loss functions used remain time domain-based.
The most widely known loss for training neural networks is the MSE described previously. This loss is one of the most used for training networks in distortion circuit simulation. It was used in one of the first articles presenting recurrent networks for amplifier simulation [31] as well as in all of the gray-box methods presented previously, including a normalized variant used in [41] in order to stabilize the initial training of the network. [30]: 14) and the Normalized Root Mean Squared Error (NRMSE) used in [35]:

Similar losses include the Root Mean Squared Error (RMSE) used in
Despite having relatively widespread use, the MSE-based losses lack perceptual accuracy. Another loss function frequently encountered in the state-of-the-art of amplifier simulation methods using neural networks is the Error-to-Signal Ratio (ESR) defined as: Variants of this loss have been used in the works of Damskägg et al. [25,26] and Wright et al. [23,33] with pre-emphasis filtering for better perceptual accuracy. In both Damskägg et al. papers, the high-pass pre-emphasis filter with the transfer function H(z) = 1 − 0.95z −1 was used to train their WaveNet architecture as without this filtering it was found that the model struggled at higher frequencies.
In order to train the recurrent networks presented in the works of Wright et al., a different pre-emphasis filter was applied to the ESR with the transfer function H(z) = 1 − 0.85z −1 along with a term to compensate for a DC offset present in the prediction: In a further work [53], various pre-emphasis filters are studied and compared, along with weighting functions in order to gauge which combination better reflects perceptual accuracy. The filters with the following transfer functions were tested: First-Order High-Pass: H HP (z) = 1 − 0.85z −1 (18) Folded Differentiator: First-Order Low-Pass: The low-pass filter is preceded by A-weighting in order to decrease emphasis in regions where little energy is present as this weighting aims to mimic the equal loudness curves of the human ear. Listening tests carried out in this study showed that pre-emphasis filtering enabled better accuracy during the modelling task, with the A-weighted low-pass filtering achieving the best performance.
The Mean Absolute Error (MAE) is also present in the state-of-the-art in both a purely time-domain formulation and a variant in which spectral content is also taken into account. The MAE is defined as and was used to train all of the autoencoders presented in a previous section [37][38][39]. Steinmetz and Reiss [28] use a combination loss comprising both time-domain and spectral features. For the time-domain, the MAE was used, and for the spectral magnitude, the Short-Term Fourier Transform (STFT) loss from [54] is used, leading to the following cost function: MAE + STFT where: . F is the Frobenius norm. An overview of the loss functions used in this field is presented in Table 5. Table 5. Summary of the loss functions and their perceptual relevance.

ESR + H HP (z)
Studied for perceptual relevance in [53]. Includes DC offset.

ESR + H LP (z)
Studied for perceptual relevance in [53]. Includes DC offset and A-weighting.

MAE
Non-perceptual, used in [37][38][39] STFT + MAE Time-Frequency content, used in [28] The loss functions used for training can also be applied to the evaluation process for an objective measure of performance during testing. However, other methods also exist to provide a more comprehensive assessment of the overall quality.

Evaluation Methods
The metrics presented in the previous section must be differentiable in order to be used within the optimization problem of the training phase. These metrics can also be used for the evaluation of the network after training and validation, as well as non-differentiable functions and subjective listening tests.
While objective metrics struggle to properly reflect the perceptual aspects of the output audio, listening tests take time to implement and hinder continuous integration of ML systems. Therefore, there is a real need for objective evaluation metrics that do not rely on human participation.
Most of the evaluation methods used for this modelling task rely on reuse of the loss functions used during training or a variation thereof. For example, Damskägg et al. [26] used pre-emphasized ESR for training and plain ESR for evaluation. While simple to implement, a single objective metric cannot replace subjective listening tests.
The listening tests that are mainly used for this task rely on MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) testing. In a MUSHRA test, the participants are presented with a labeled reference, various test samples, an unlabeled reference, and an anchor. This method is detailled in the Recommendation ITU-R BS.1534-3 [55].
A similar framework used to carried out listening tests with human participants is the Web Audio Evaluation Tool [56] which is based on the HTML5 Web Audio API for perceptual audio evaluation.
For the objective evaluation of the models compared in [28], three metrics were used: the MAE and the STFT loss described above and a perceptually informed loudness metric that uses the loudness algorithm from the ITU-R BS. 1770 Recommendation [57]. Both the causal and noncausal versions of the network achieved comparable performance but the noncausal variants performed slightly better in the time-domain. A listening test similar to MUSHRA was also carried out using the WebMUSHRA interface [58] to further validate the accuracy of the models. The results of this test indicated that a small difference was perceived among the models in comparison to the reference.
Subjective evaluation was provided in [23] in the form of a MUSHRA test, carried out using the WebMUSHRA interface [58]. These tests showed the WaveNet3 to be the most perceptually accurate model of the HT5M amplifier, although this prediction could still be distinguished from the original amplifier, whereas the LSTM96 proved to be closest to the Mesa 5:50 in terms of subjective quality and most people could not tell the difference between the model and the target.
For the work established in Thomas Schmitz's PhD thesis [35], a number of objective metrics as well as listening tests were presented. One listening test studies the number of parameters that can be reduced without loss of accuracy and the other is used to determine the threshold of a given metric above which the accuracy is no longer improved. An overview of the evaluation methods is presented in Table 6. Table 6. Summary of the evaluation methods for network validation after training.
A major factor to take into account when modelling analog audio devices is the realtime constraint. If the solution implemented cannot be used in real-time then it is of little use to the end-users. The real-time constraint for this type of application is approximately 10 ms [3]. Any latency above this value is likely to be perceived by the musician and hinder their playing.
The latency produced in digital audio processing is known as round-trip latency that is not limited to only the computation time (CT) of the DSP algorithm used [3]. Analog-to-Digital and then Digital-to-Analog conversion can add up to a 1 ms to the round-trip latency. Additionally, there is a latency inherent to the processing in buffers used in audio interfaces and Digital Audio Workstations (DAW). In order to process a buffer of audio, a delay of that buffer size is added to the round-trip latency which can quickly exceed the real-time constraint. For example, when processing buffers of 256 samples at 44.1 kHz, the time necessary to fill this buffer is around 5.8 ms, which significantly decreases the time left for the DSP computations. This means that using DL architectures with a lot of computations and layers is complicated as it either entails too high a latency or excessive CPU usage which prohibits its use. A number of the architectures present in the state-of-the-art are capable of being used in real-time to varying degrees but caveats also exist.
The architectures presented here although often capable of real-time processing, some of which are light-weight enough to work in real-time on CPU, are still computationally heavy and have not been demonstrated to be able to achieve real-time speeds for sampling rates over 44.1 kHz. Indeed, the hybrid convolutional and recurrent architecture from [35] utilizes parallelization on GPU in order to process the data in real-time which poses a problem for use in a practical setting in which CPU processing is required. The autoencoder architectures from [38] have been demonstrated using a sampling rate of 16 kHz, which is insufficient for high-quality audio applications. The shallow TCN from [28] are capable of real-time use only for input buffer sizes over 1024 samples, which, at 44.1 kHz, incurs a latency of approximately 23.2 ms, which is double the latency required for real-time use of around 10 ms. Finally, the architectures presented in [23], although capable of real-time processing, remain computationally heavy, even for a sampling rate of 44.1 kHz, which is the lower bound for high quality audio in music applications.
Moreover, digital implementations of analog audio effects usually introduce aliasing into the signal, and to remedy this, anti-aliasing techniques are used which often require upsampling the signals by a factor of eight [6]. This means that eight times the amount of samples need to be processed in the same amount of time required for a real-time (RT) implementation, further restraining the allowed CT of the chosen DSP algorithm. This also applies to neural networks, although formal study of the aliasing introduced by neural networks in this field is lacking.
A number of architectures are capable of real-time use, even on CPU. However, the RT measures presented in Table 7 vary in a number of ways including: • The sample rate used for processing; • The processing unit; • The implementation language; • The method of measuring the RTF (number of operations, timing the inference, etc.).
This makes formal comparison challenging. We provide the RTF of a number of illustrative models of the state-of-the-art, from each of the architecture classes of Section 3.1 in the following section.  State-Space GRU [42] No mention 48 Nercessian et al. [43] No mention 44.1 Kuznetsov et al. [4] Computed in terms of floating point operations 48 Esqueda et al. [46] No mention but same RT capabilities as traditional White-Box 96 4. Discussion

Architecture Comparison
Overall, the wide range of parameters used in the state-of-the-art in neural modelling of distortion effects make a formal comparison of the architectures complicated. To remedy this, we chose at least one model from each class of networks and trained them using the same data and loss function. This allows for a clearer comparison of each class of architecture. The data set used for training is the data from [50], using 80% for the training data and 20% for testing. The sampling rate of the data used is 44.1 kHz and the comparison is limited to black-box approaches. The loss function used is the ESR without pre-emphasis filtering. We chose to omit the pre-emphasis as the predictions diverge from the target in different frequency ranges depending on the network architecture. A DC term was added to the ESR when training the recurrent network as the predictions from this type of model are known to have an amplitude offset.
The training parameters for each of the networks are the following: • LSTM: Sequence-to-sequence LSTM with 32 recurrent units. The Shallow TCN network was trained for 150 epochs as opposed to the original maximum number of epochs, which was equal to 60 to account for the difference in training data and loss function, and all the other networks were trained until an early stopping condition was met. For each model tested, we present a number of objective evaluation metrics as well as the inference speeds of the Python implementations on an AMD Ryzen 7 3750H CPU at 2.3GHz presented in Table 8. The objective ESR and STFT metrics are plain ESR and the Aggregate STFT from [54] computed using the Auraloss PyTorch library [59].  Table 8 shows that the Shallow TCN outperforms the other architectures in terms of processing speed but for significantly lower objective quality. The lower audio quality could be due to insufficient training as this network was trained for a fixed number of epochs. The LSTM with 32 recurrent units outperforms all other models in terms of objective quality measures but is outperformed by the fully convolutional networks in terms of processing speeds. However, this is contradictory to the results presented in Wright et al.'s WaveNet and RNN comparison [23] which showed that most LSTM models were able to outperform the WaveNet models in terms of computation time. This difference could be due to the evaluation method, as Wright et al. report CT of an optimized C++ implementation of both network types and the results presented here were all obtained using Python. The hybrid conv-LSTM network from Schmitz [35] ranks highly in terms of objective quality but is incapable of real-time use without GPU parallelization. Overall, all models except for the Shallow-TCN were reported to have acceptable subjective quality, judged through informal listening tests, even though their objective time-domain metrics vary. This highlights the divergence between the objective values and actual perceptual quality. The STFT loss produces values closer to the perceptual quality of the outputs, showing that taking into account spectral features can improve the perceptual accuracy. However time-frequency metrics such as these can cause spectral leakage [44], which can affect training when being used as loss functions.

Proposed Further Explorations
The architectures presented here vary from convolutional methods to recurrent structures, including hybrid methods and autoencoders. Despite being a black-box approach, researchers have started to utilize this technology in both white-and gray-box modelling as well, further demonstrating the wide range of ways in which deep learning can be applied to this field. However, some aspects warrant further study.
These architectures are often capable of real-time processing, some of which are lightweight enough to work in real-time on CPU. However, they remain computationally heavy and have not been demonstrated to be able to achieve real-time speeds for sampling rates over 44.1 kHz. Indeed, some models either require parallelization of their operations on GPU, low sampling rates or large buffer sizes in order to achieve close to real-time performance. Furthermore, the architectures currently present in the state-of-the-art that are capable of real-time use have only been shown to work with sampling rates of 44.1 kHz which is the lower bound for high quality audio in music applications.
Moreover, digital implementations of analog audio effects usually introduce aliasing into the signal, and to remedy this, anti-aliasing techniques are used which often require upsampling of the signals by a factor of 8 [6]. This also applies to neural networks although formal study of the aliasing introduced by neural networks in this field is lacking. In [26], Damskägg et al. studied the effect of aliasing in the prediction of their feedforward WaveNet and claimed that aliasing was indeed present in the output even though the models were trained on non-aliased data but that this aliasing could not clearly be heard in the predictions. Therefore, anti-aliasing techniques might be required for neural models, and further work on both potentially lighter models and the possible impact of aliasing should be explored.
A variety of methods presented here allow for input parameters to be taken into account in the network for a parametric model. LSTM and other similar recurrent models allow for these parameters in the form of input features (i.e., an extra dimension of data pertaining to these parameters). Similarly, convolutional networks can allow for a parametric approach by use of extra input channels for the conditioning of the network. However, these methods could slow down training significantly and greatly increase the amount of necessary data. Moreover, all mentions of these parametric approaches in the literature have been hypothetical or implemented with marginal success and no clear demonstration of the methods have been presented that we know of.
The cost functions used for training and evaluation of the networks have been studied in recent years. Wright & Välimäki [53] present a study on various functions for preemphasis filtering of the signal before loss computations in order to better capture the perceptual features. In this work, three pre-emphasis filters were tested and compared: a high-pass filter, a low-pass filter and a folded differentiator, all applied to the ESR of the signals. Different weightings were also compared. It was shown that the loss that best improved audio results was the ESR with A-weighting pre-emphasis, a weighting that seeks to describe the frequency response of the human ear. A number of losses are also presented in [59], but a more formal comparison of these various cost functions for amplifier modelling would be desirable.
Another aspect to take into account is the interpretability of such models. Neural networks are notoriously lacking in interpretability. Although methods of increasing interpretability and explainability in DL are being studied, a range of methods are presented in [60] for example, and the advent of both gray-and white-box methods also bring along more knowledge of the model. However, this lack of explainability is undesirable, and further work on increasing interpretability of the models should be explored.

Conclusions
In this work, we present an overview of the current state-of-the-art methods in neural network-based vacuum tube amplifier modelling, covering the preceding methods in the state-of-the-art and the recent advances in deep learning in this field under black-box, graybox, and white-box approaches. We highlight the results of each method, including the audio quality and real-time capabilities. Moreover, we include the evaluation methods used and the limitations of each method, exploring avenues for further work. Notably the realtime capabilities of such approaches warrant further investigation as the current methods, although some of which are capable of real-time use, are limited to a sampling rate of 44.1 kHz which is the lower bound for acceptable audio quality in music and is insufficient for anti-aliasing techniques based on upsampling the signals. Moreover, the presence of aliasing produced by the networks requires further study. Evaluation methods with perceptual relevance often rely on listening tests with user input and are time-intensive to carry out. Therefore, there is a need for perceptually relevant, objective metrics to facilitate and automate evaluation. Additionally, differentiable objective metrics are also required to improve training towards a minimum that better reflects human audition. Two remaining points of focus in this field of study are the exploration into parametric models, that have only been alluded to in the current state-of-the-art without explicit demonstration, and the interpretability of the models used, as this is an issue with neural networks in general.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: AME