Sinusoidal models are widely used in the analysis [1], synthesis [2], and transformation [4] of musical instrument sounds. The musical instrument sound is modeled by a waveform consisting of a sum of time-varying sinusoids parameterized by their amplitudes, frequencies, and phases [1]. Sinusoidal analysis consists of the estimation of the parameters, synthesis comprises techniques to retrieve a waveform from the analysis parameters, and transformations are performed as changes of the parameter values. The time-varying sinusoids, called partials, represent how the oscillatory modes of the musical instrument change over time, resulting in a flexible representation with perceptually meaningful parameters. The parameters completely describe each partial, which can be manipulated independently.
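As a concrete illustration of this parameterization, the following Python sketch synthesizes a waveform as a sum of time-varying sinusoids. It is a minimal sketch, not the implementation used here; the amplitude and frequency trajectories in the example are made up.

```python
import numpy as np

def additive_synthesis(amps, freqs, fs, phi0=None):
    """Synthesize a sum of time-varying sinusoids (partials).

    amps, freqs : arrays of shape (K, N) with per-sample amplitude (linear)
                  and frequency (Hz) trajectories for K partials.
    fs          : sampling rate in Hz.
    phi0        : optional initial phase per partial (defaults to zero).
    """
    K, N = freqs.shape
    if phi0 is None:
        phi0 = np.zeros(K)
    # Integrate instantaneous frequency to obtain each partial's phase.
    phase = 2 * np.pi * np.cumsum(freqs, axis=1) / fs + phi0[:, None]
    return np.sum(amps * np.cos(phase), axis=0)

# Example: three decaying, slightly inharmonic partials of a 220 Hz tone.
fs, dur = 44100, 1.0
t = np.arange(int(fs * dur)) / fs
freqs = np.array([220.0, 441.0, 663.5])[:, None] * np.ones_like(t)
amps = np.array([1.0, 0.5, 0.25])[:, None] * np.exp(-3.0 * t)
x = additive_synthesis(amps, freqs, fs)
```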
Several important features can be directly estimated from the analysis parameters, such as fundamental frequency, spectral centroid, inharmonicity, spectral flux, and onset asynchrony, among many others [2]. The model parameters can also be used in musical instrument classification, recognition, and identification [6], vibrato detection [7], onset detection [8], source separation [9], audio restoration [10], and audio coding [11]. Typical transformations are pitch shifting, time scaling [12], and musical instrument sound morphing [5]. Additionally, the parameters from sinusoidal models can be used to estimate alternative representations of musical instrument sounds, such as spectral envelopes [13] and the source-filter model [14].
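As an example of such a feature, the spectral centroid can be computed directly from the partial amplitudes and frequencies of one frame. This is a minimal sketch with hypothetical inputs, not the feature extraction of the cited works:

```python
import numpy as np

def spectral_centroid(amps, freqs):
    """Amplitude-weighted mean frequency of the partials in one frame.

    amps, freqs : arrays of shape (K,) with linear amplitudes and
                  frequencies (Hz) of the K partials in the frame.
    """
    return np.sum(amps * freqs) / np.sum(amps)

# Example: a tone whose upper partials are strong sounds "brighter".
print(spectral_centroid(np.array([1.0, 0.8, 0.6]),
                        np.array([220.0, 440.0, 660.0])))
```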
The quality of the representation is critical and can impact the results of the above applications. In general, sinusoidal models render a close representation of musical instrument sounds because most pitched musical instruments are designed to present very clear modes of vibration [16]. However, sinusoidal models do not result in perfect reconstruction upon resynthesis, leaving a modeling residual that contains whatever was not captured by the sinusoids [17]. Musical instrument sounds have features that are particularly challenging to represent with sinusoids, such as sharp attacks, transients, inharmonicity, and instrumental noise [16]. Percussive sounds produced by plucking strings (as in harpsichords, harps, and the pizzicato playing technique) or striking percussion instruments (such as drums, idiophones, or the piano) feature sharp onsets with highly nonstationary oscillations that die out very quickly, called transients [18]. Flute sounds characteristically comprise partials on top of breathing noise [16]. The reed in woodwind instruments presents a highly nonlinear behavior that also results in attack transients [19], while the stiffness of piano strings results in a slightly inharmonic spectrum [18]. The residual from most sinusoidal representations of musical instrument sounds contains perceptually important information [17]. However, the extent of this information ultimately depends on what the sinusoids are able to capture.
The standard sinusoidal model (SM) [1] was developed as a parametric extension of the short-time Fourier transform (STFT), so both analysis and synthesis present the same time-frequency limitations as the discrete Fourier transform (DFT) [21]. The parameters are estimated with well-known techniques, such as peak-picking and parabolic interpolation [20], and then connected across overlapping frames (partial tracking [23]). Peak-picking is known to bias the estimation of the parameters because errors in the estimation of frequencies can bias the estimation of amplitudes [22]. Additionally, the inherent time-frequency uncertainty of the DFT further limits the estimation because long analysis windows blur the temporal resolution to improve the frequency resolution and vice versa. The SM uses quasi-stationary sinusoids (QSS) under the assumption that the partials are relatively stable inside each frame. QSS can accurately capture the lower frequencies because these have fewer periods inside each frame and thus less temporal variation. However, higher frequencies have more periods inside each frame, with potentially more temporal variation lost by QSS. Additionally, the parameters of QSS are estimated using the center of the frame as the reference, and the values are less accurate towards the edges because the DFT has a stationary basis [25]. This results in the loss of sharpness of attack known as pre-echo.
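A minimal sketch of the peak-picking and parabolic-interpolation step described above, assuming a Hann-windowed, zero-padded DFT frame (the threshold and FFT size are hypothetical):

```python
import numpy as np

def pick_peaks(x_frame, fs, n_fft=4096, threshold_db=-80.0):
    """Peak-picking with parabolic interpolation on a log-magnitude spectrum.

    Returns (freqs, amps_db) of local maxima, refined by fitting a parabola
    through each peak bin and its two neighbours [20].
    """
    w = np.hanning(len(x_frame))
    spec = np.fft.rfft(x_frame * w, n_fft)
    mag_db = 20 * np.log10(np.abs(spec) + 1e-12)
    freqs, amps = [], []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b > a and b > c and b > threshold_db:
            # Vertex of the parabola through the three log-magnitude samples.
            p = 0.5 * (a - c) / (a - 2 * b + c)   # fractional bin offset
            freqs.append((k + p) * fs / n_fft)
            amps.append(b - 0.25 * (a - c) * p)   # interpolated peak height
    # Partial tracking would then connect peaks across frames (omitted here).
    return np.array(freqs), np.array(amps)
```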
The lack of transients and noise is perceptually noticeable in musical instrument sounds represented with QSS [17]. Serra and Smith [1] proposed to decompose the musical instrument sound into a sinusoidal component represented with QSS and a residual component obtained by subtracting the sinusoidal component from the original recording. This residual is assumed to be noise not captured by the sinusoids and is commonly modeled by filtering white noise with a time-varying filter that emulates the spectral characteristics of the residual component [1]. However, the residual contains both errors in parameter estimation and transients plus noise missed by the QSS [27].
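The decomposition itself is a time-domain subtraction. The sketch below computes the residual and colors white noise with the residual's overall magnitude spectrum; this is a whole-signal simplification of the frame-wise, time-varying filter described in [1]:

```python
import numpy as np

def residual_and_noise_model(x, x_sin):
    """Subtract the sinusoidal component and model the residual as
    spectrally shaped white noise (whole-signal simplification)."""
    residual = x - x_sin                    # everything the sinusoids missed
    mag = np.abs(np.fft.rfft(residual))     # spectral shape of the residual
    # White noise with the residual's magnitude spectrum and random phase.
    phase = np.exp(1j * 2 * np.pi * np.random.rand(len(mag)))
    noise = np.fft.irfft(mag * phase, n=len(residual))
    return residual, noise
```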
The time-frequency resolution trade-off imposes severe limits on the detection of transients with the DFT. Transients are essentially localized in time and usually require shorter frames, which blur the peaks in the spectrum. Daudet [28] reviews several techniques to detect and extract transients with sinusoidal models. Multi-resolution techniques [29] use multiple frame sizes to circumvent the time-frequency uncertainty and to detect modulations at different time scales. Transient modeling synthesis (TMS) [26] decomposes sounds into sinusoids plus transients plus noise and models each separately. TMS performs the sinusoidal plus residual decomposition with QSS and then extracts the transients from the residual.
An alternative to multi-resolution techniques is the use of high-resolution techniques based on total least squares [32], such as ESPRIT [33], MUSIC [34], and RELAX [35], to fit exponentially damped sinusoids (EDS). EDS are widely used to represent musical instrument sounds [11]. EDS are sinusoids with stationary (i.e., constant) frequencies modulated in amplitude by an exponential function. The exponentially decaying amplitude envelope of EDS is considered suitable to represent percussive sounds when the beginning of the frame is synchronized with the onsets [38]. However, EDS require additional partials when there is no synchronization, which increases the complexity of the representation. ESPRIT decomposes the signal space into sinusoidal and residual subspaces, further ranking the sinusoids by decreasing magnitude of eigenvalue (i.e., spectral energy). Therefore, the first K sinusoids maximize the energy upon resynthesis regardless of their frequencies.
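For reference, an EDS partial has a constant frequency and an exponentially decaying envelope. The synthesis sketch below uses hypothetical parameters; it is not the ESPRIT estimation itself, which fits these parameters to the analyzed frame:

```python
import numpy as np

def eds_partial(amp, damping, freq, phase, fs, n):
    """One exponentially damped sinusoid (EDS): constant frequency,
    exponentially decaying amplitude envelope."""
    t = np.arange(n) / fs
    return amp * np.exp(-damping * t) * np.cos(2 * np.pi * freq * t + phase)

# Example: a percussive-like tone as a sum of damped harmonics. The envelope
# decays from the start of the frame, which is why EDS fit percussive sounds
# best when the frame is synchronized with the onset.
fs, n = 44100, 44100
x = sum(eds_partial(1.0 / k, 4.0 * k, 110.0 * k, 0.0, fs, n)
        for k in range(1, 6))
```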
Both the SM and EDS rely on sinusoids with stationary frequencies, which are not appropriate to represent nonstationary oscillations [21]. Time-frequency reassignment [39] was developed to estimate nonstationary sinusoids. Polynomial phase signals [20], such as splines [21], are commonly used as an alternative to stationary sinusoids. McAulay and Quatieri [20] were among the first to interpolate the phase values estimated at the center of the analysis window across frames with cubic polynomials to obtain nonstationary sinusoids inside each frame. Girin et al. investigated the impact of the order of the polynomial used to represent the phase and concluded that order five does not improve the modeling performance enough to justify the increased complexity. However, even nonstationary sinusoids leave a residual with perceptually important information that requires further modeling [25].
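The McAulay-Quatieri cubic phase interpolation mentioned above admits a compact closed form. A sketch under the standard "maximally smooth" formulation of [20], with frequencies in radians per sample:

```python
import numpy as np

def cubic_phase(theta1, omega1, theta2, omega2, T, n):
    """McAulay-Quatieri cubic phase interpolation between two frames [20].

    theta1, omega1 : phase (rad) and frequency (rad/sample) at frame start
    theta2, omega2 : phase and frequency at frame end, T samples later
    n              : sample indices in [0, T) at which to evaluate the phase
    """
    # Unwrapping integer M chosen for the maximally smooth trajectory.
    M = np.round(((theta1 + omega1 * T - theta2)
                  + (omega2 - omega1) * T / 2) / (2 * np.pi))
    d = theta2 - theta1 - omega1 * T + 2 * np.pi * M
    alpha = 3 * d / T**2 - (omega2 - omega1) / T
    beta = -2 * d / T**3 + (omega2 - omega1) / T**2
    return theta1 + omega1 * n + alpha * n**2 + beta * n**3
```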
Sinusoidal models rely on spectral decomposition under the assumption that the lower end of the spectrum can be modeled with sinusoids while the higher end essentially contains noise. The estimation of the separation between the sinusoidal and residual components has proved difficult [27]. Ultimately, spectral decomposition misses partials at the higher end of the spectrum because the separation is artificial, depending on the spectrum estimation method rather than on the spectral characteristics of musical instrument sounds. We consider spectral decomposition to be a consequence of artifacts from previous sinusoidal models rather than an acoustic property of musical instruments. Therefore, we propose the full-band modeling of musical instrument sounds with adaptive sinusoids as an alternative to spectral decomposition.
Adaptive sinusoids (AS) are nonstationary sinusoids estimated to fit the signal being analyzed, usually via an iterative parameter re-estimation process. AS have been used to model speech [43] and musical instrument sounds [25]. Pantazis [45] developed the adaptive Quasi-Harmonic Model (aQHM), which iteratively adapts the frequency trajectories of all sinusoids at the same time based on the Quasi-Harmonic Model (QHM). Adaptation improves the fit of a spectral template via an iterative least-squares (LS) parameter estimation followed by a frequency correction. Later, Kafentzis [43] devised the extended adaptive Quasi-Harmonic Model (eaQHM), capable of adapting both the amplitude and the frequency trajectories of all sinusoids iteratively. In eaQHM, adaptation is equivalent to the iterative projection of the original waveform onto nonstationary basis functions that are locally adapted to the time-varying characteristics of the sound, making it capable of modeling sudden changes such as sharp attacks, transients, and instrumental noise. In a previous work [47], we showed that eaQHM is capable of retaining the sharpness of the attack of percussive sounds.
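A single QHM iteration, the building block that aQHM and eaQHM iterate, can be sketched as a least-squares fit followed by the frequency-correction step. The version below is simplified to a complex (analytic) frame and is not the eaQHM implementation, which additionally uses locally adapted nonstationary basis functions:

```python
import numpy as np

def qhm_iteration(x, freqs, fs):
    """One Quasi-Harmonic Model (QHM) iteration: least-squares fit of
    (a_k + t*b_k) * exp(j*2*pi*f_k*t) basis functions to a complex
    (analytic) frame x, followed by the QHM frequency correction [45]."""
    n = len(x)
    t = (np.arange(n) - n // 2) / fs             # time centered on the frame
    E = np.exp(2j * np.pi * np.outer(t, freqs))  # stationary complex exponentials
    A = np.hstack([E, t[:, None] * E])           # columns for a_k and b_k
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    a, b = coef[:len(freqs)], coef[len(freqs):]
    # Frequency mismatch estimated from the LS coefficients (QHM correction).
    df = (a.real * b.imag - a.imag * b.real) / (2 * np.pi * (np.abs(a)**2 + 1e-12))
    return a, freqs + df
```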
In this work, we propose full-band modeling with eaQHM for high-quality analysis and synthesis of isolated musical instrument sounds with a single component. We compare our method to QSS estimated with the standard SM [20] and EDS estimated with ESPRIT [36]. In the next section, we discuss the differences between full-band spectral modeling and traditional decomposition for musical instrument sounds. Next, we describe the full-band quasi-harmonic adaptive sinusoidal modeling behind eaQHM. Then, we present the experimental setup, describing the musical instrument sound database used in this work and the analysis parameters. We proceed to the experiments, present the results, and evaluate the performance of QSS, EDS, and eaQHM in modeling musical instrument sounds. Finally, we discuss the results and present conclusions and perspectives for future work.
2. Full-Band Modeling
Spectral decomposition splits the spectrum of musical instrument sounds into a sinusoidal component and a residual, as illustrated in Figure 1a. Spectral decomposition assumes that there are partials only up to a certain cutoff frequency, above which there is only noise. Figure 1a represents the spectral peaks as spikes on top of colored noise (wide light grey frequency bands) and the cutoff frequency as the separation between the sinusoidal and residual components. Therefore, the cutoff frequency determines the number of sinusoids because only the peaks at the lower frequency end of the spectrum are represented with sinusoids (narrow dark grey bars), while the rest is considered wide-band stochastic noise existing across the whole range of the spectrum. There is noise between the spectral peaks and at the higher end of the spectrum. In a previous study [17], we showed that the residual from the SM is perceptually different from filtered (colored) white noise. Figure 1a shows that spectral peaks are left in the residual because the spectral peaks above the cutoff frequency are buried under the estimation noise floor (and sidelobes). Consequently, the residual from sinusoidal models that rely on spectral decomposition, such as the SM, is perceptually different from filtered white noise.
From an acoustic point of view, the physical behavior of musical instruments can be modeled as the interaction between an excitation and a resonator (the body of the instrument) [16]. The excitation is responsible for the oscillatory modes, whose amplitudes are shaped by the frequency response of the resonator. The excitation signal commonly contains discontinuities, resulting in wide-band spectra. For instance, the vibration of the reed in woodwinds can be approximated by a square wave [49], the friction between the bow and the strings results in an excitation similar to a sawtooth wave [16], the strike in percussion instruments can be approximated by a pulse [2], while the vibration of the lips in brass instruments results in a sequence of pulses [50] (somewhat similar to the glottal excitation, which is also wide band [46]). Figure 1b illustrates a full-band harmonic template spanning the entire frequency range, fitting sinusoids to spectral peaks in the vicinity of harmonics of the fundamental frequency $f_0$. The spectrum of musical instruments is known to present deviations from perfect harmonicity [16], but quasi-harmonicity is supported by previous studies [51] that found deviations as small as 1%. In this work, the full-band harmonic template becomes quasi-harmonic after the estimation of parameters via least squares followed by a frequency correction mechanism (see details in Section 3.1). Therefore, full-band spectral modeling assumes that both the excitation and the instrumental noise are wide band.
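A full-band harmonic template is then simply one candidate frequency per harmonic of $f_0$ up to the Nyquist frequency. A minimal sketch:

```python
import numpy as np

def harmonic_template(f0, fs):
    """Full-band harmonic template: one candidate sinusoid per harmonic of
    f0 up to the Nyquist frequency fs/2 (no cutoff below Nyquist)."""
    K = int(np.floor(fs / (2 * f0)))   # highest harmonic below Nyquist
    return f0 * np.arange(1, K + 1)

# Example: a C2 tone (~65.4 Hz) at 44.1 kHz yields 337 candidate partials.
print(len(harmonic_template(65.41, 44100)))
```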
4. Experimental Setup
We now investigate the full-band representation of musical instrument sounds and the nonstationarity of the adaptive AM–FM sinusoids from eaQHM. We aim to show that spectral decomposition fails to capture partials at the higher end of the spectrum, so full-band quasi-harmonic modeling increases the quality of analysis and resynthesis by capturing sinusoids across the full range of the spectrum. Additionally, we aim to show that the adaptive AM–FM sinusoids from eaQHM capture nonstationary partials inside the frame. We compare full-band modeling with eaQHM against the SM [1] and EDS estimated with ESPRIT [36] using the same number of partials $K$. We assume that the musical instrument sounds under investigation can be well represented as quasi-harmonic. Thus, we set $K$ to the highest harmonic number $k$ whose frequency lies below the Nyquist frequency, or equivalently the highest integer $K \leq f_s / (2 f_0)$. The fundamental frequency $f_0$ of all sounds was estimated using the sawtooth waveform inspired pitch estimator (SWIPE) [53] because, in the experiments, the frame size $L$, the maximum number of partials $K$, and the full-band harmonic template depend on $f_0$. In the SM, $K$ is the number of spectral peaks modeled by sinusoids. For EDS, ESPRIT uses $K$ to determine the separation between the dimension of the signal space (sinusoidal component) and that of the residual.
The SM is considered the baseline for comparison due to the quasi-stationary nature of the sinusoids and the need for spectral decomposition. EDS estimated with ESPRIT is considered the state of the art due to its accurate analysis and synthesis and the constant frequency of EDS inside each frame. We present a comparison of the local and global SRER as a function of $K$ and $L$ for the SM and EDS against eaQHM in two experiments. In experiment 1, we vary $K$ from 2 up to the maximum number of partials and record the SRER. In experiment 2, we vary $L$ in multiples of the fundamental period $T_0$ and record the SRER. The local SRER is calculated within the first frame, where we expect the attack transients to be. The first frame is centered at the onset (and its first half is zero-padded), so artifacts such as pre-echo (in the first half of the frame) are also expected to be captured by the local SRER. The global SRER is calculated across all frames, thus considering the whole sound signal. Next, we describe the musical instrument sounds modeled and the selection of parameter values for the algorithms.
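The SRER formula is not spelled out in this excerpt; the sketch below assumes the usual RMS-based definition, SRER = 20 log10(RMS of the original / RMS of the reconstruction error), and the function names are hypothetical:

```python
import numpy as np

def srer(x, x_hat):
    """Signal-to-Reconstruction-Error Ratio in dB: RMS of the original over
    RMS of the modeling error (higher is better)."""
    err = x - x_hat
    return 20 * np.log10(np.linalg.norm(x) / (np.linalg.norm(err) + 1e-12))

def local_and_global_srer(x, x_hat, frame_len):
    """Local SRER over the first (attack) frame and global SRER over all samples."""
    return srer(x[:frame_len], x_hat[:frame_len]), srer(x, x_hat)
```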
4.1. The Musical Instrument Sound Dataset
In total, 92 musical instrument sounds were selected. The “Popular” and “Keyboard” musical instruments are from the RWC Music Database: Musical Instrument Sound [54]. All other sounds are from the Vienna Symphonic Library [55] database of musical instrument samples. Table 1 lists the musical instrument sounds used. The recordings were chosen to represent the range of musical instruments commonly found in traditional Western orchestras and in popular recordings. Some instruments feature different registers (alto, baritone, bass, etc.). All sounds used belong to the same pitch class (C), ranging in pitch height from C2 (about 65 Hz) to C6 (about 1046 Hz). The dynamics is indicated as forte (“f”) or fortissimo (“ff”), and the duration of most sounds is less than 2 s. Normal attack (“na”) and no vibrato (“nv”) were chosen whenever available. Presence of vibrato (“vib”), progressive attack (“pa”), and slow attack (“sa”) are indicated, as well as different playing modes such as sforzato (“sforz”) and pizzicato (“pz”), achieved by plucking string instruments. Extended techniques were also included, such as tongue ram (“tr”) for the flute, près de la table (“pdlt”) for the harp, muted (“mu”) strings, and bowed idiophones (vibraphone, xylophone, etc.) for short (“sh”) and long (“lg”) sounds. Different mallet materials, such as metal (“met”), plastic (“pl”), and wood (“wo”), and hardness, such as soft (“so”), medium (“med”), and hard (“ha”), are indicated.
In what follows, we will present the results for 89 sounds because QHM failed to adapt for the three sounds marked with * in Table 1. The estimation of parameters for QHM uses LS [45]. The matrix inversion fails numerically when the matrix is close to singular (see [44]). The low fundamental frequency of these sounds (C2, about 65 Hz) determines a full-band harmonic spectral template whose frequencies are separated by only a C2 interval, which results in singular matrices.
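This effect can be reproduced by inspecting the conditioning of a QHM-style least-squares basis built from a dense, low-$f_0$ full-band template. A sketch, with a hypothetical frame length:

```python
import numpy as np

def basis_condition(f0, fs, frame_len):
    """Condition number of a QHM-style least-squares basis for a full-band
    harmonic template; very large values signal a near-singular system."""
    t = np.arange(frame_len) / fs
    freqs = f0 * np.arange(1, int(fs / (2 * f0)) + 1)
    E = np.exp(2j * np.pi * np.outer(t, freqs))
    A = np.hstack([E, t[:, None] * E])   # columns for a_k and b_k, as in QHM
    return np.linalg.cond(A)

# Conditioning of a C2 (~65.4 Hz) template at 44.1 kHz, 1024-sample frame.
print(basis_condition(65.41, 44100, 1024))
```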
4.2. Analysis Parameters
The parameter estimation for the SM follows [20], with a Hann window for analysis and phase interpolation across frames via cubic splines, followed by additive resynthesis. The estimation of parameters for EDS uses ESPRIT with a rectangular window for analysis and OLA resynthesis [36]. Parameter estimation in eaQHM used a Hann window for analysis and additive resynthesis following Equation (11). In all experiments, ε in Equation (12) is set to a fixed value (in kHz) for all sounds. The step size for analysis (and OLA synthesis) corresponds to 1 ms for all algorithms. The frame size is $L = qT_0$ samples, with $q$ an integer and $T_0$ the fundamental period. The size of the FFT for the SM is kept constant, with zero padding.
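For concreteness, the Hann-windowed, zero-padded frame analysis used for the SM can be sketched as follows (the frame size, hop, and FFT size below are hypothetical examples):

```python
import numpy as np

def analysis_frames(x, frame_len, hop, n_fft=4096):
    """Hann-windowed analysis frames with a constant zero-padded FFT size."""
    w = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.stack([np.fft.rfft(x[s:s + frame_len] * w, n_fft)
                     for s in starts])

# Example at 44.1 kHz: a 1 ms hop corresponds to 44 samples.
# frames = analysis_frames(x, frame_len=2048, hop=44)
```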