Audio Enhancement of Physical Models of Musical Instruments Using Optimal Correction Factors: The Recorder Case

A simulation of a musical instrument is considered successful when the model's synthesized sound closely resembles the sound of the real instrument. In this work, we propose the integration of physical modeling (PM) methods with an optimization process to regulate a generated digital signal. Its goal is to find a new set of values for the PM's parameters that leads to a synthesized signal matching, as closely as possible, reference signals corresponding to the physical musical instrument. The reference signals can be (a) described by their acoustic characteristics (e.g., fundamental frequencies, inharmonicity, etc.) and/or (b) the signals themselves (e.g., impedances, recordings, etc.). We put this method into practice for a commercial recorder, simulated using the digital waveguides PM technique. The reference signals, in our case, are the recorded signals of the physical instrument. The degree of similarity between the synthesized (PM) and the recorded (musical instrument) signal is calculated by the signals' linear cross-correlation. Our results show that adopting the optimization process resulted in more realistic synthesized signals by (a) enhancing the degree of similarity between the synthesized and the recorded signal (the average absolute Pearson Correlation Coefficient increased from 0.13 to 0.67), (b) resolving mistuning issues (the average absolute deviation of the synthesized from the recorded signals' pitches reduced from 40 cents to the non-noticeable level of 2 cents), and (c) producing similar sound color characteristics and matched overtones (the average absolute deviation of the synthesized from the recorded signals' first five partials reduced from 41 cents to 2 cents).


Introduction
The acoustic simulation of musical instruments using computer models attracts scientists from multiple disciplines (e.g., physics, informatics, musicology) [1][2][3][4][5][6][7][8]. In the last decades, several digital sound synthesis techniques have been developed (e.g., sampling, spectral modeling, and physical modeling) [9]. Physical Modeling (PM) is the technique that generates an instrument's sound by simulating its physical phenomena. The audible result of this technique depends purely on the level of detail of the model. Describing all the phenomena in detail is not trivial (e.g., the non-linearity of the vibrating reed, complex geometries, etc.). Thus, in the process of simulation, assumptions are made in order to simplify the model, which inevitably affects the final result.
Applying correction factors is a technique that can enhance the accuracy of a signal produced by a digital generator. Its goal is to find the values of the model's parameters that lead to a generated signal matching a reference signal as closely as possible. In this work, we propose a minimum error method to choose the optimal parameters given the predetermined criteria. The integration of musical instruments' PM methods with our framework can tune the model, given the inherent PM limitations.
In the next section, we present the integration of PM of musical instruments with the optimization framework (Section 2). Next, we put the proposed method into practice for the case of a commercial Hohner recorder (Section 3). The physical model of the recorder, based on the Digital Waveguides technique, is presented in Section 3.1, followed by a detailed description of how the optimization framework is adopted to enhance the model's audio (Section 3.2). Finally, we compare the synthesized signals generated by the PM (without adopting the optimization framework) and by the PM-OPT (adopting the optimization framework) with the corresponding recordings of the real musical instrument, and present our results in terms of the degree of similarity, the tuning accuracy, and the sound color characteristics (Section 3.3), ending with a discussion of the current work (Section 4).

Method
In this work, we present the integration of the optimization framework with the PM of musical instruments (Figure 1). In brief, this method enables the modification of the PM parameters through correction factors. It is an iterative process that solves an optimization problem. In every iteration, the optimizer tries a new set of values for the correction factors ($CF_{new}$, Figure 1), resulting in new synthesized signals. Next, the signals are evaluated according to the predetermined criteria. The goal is to determine the optimal set of correction factors, obtained as the solution of the optimization problem. The optimal set of correction factors is the one that, when applied to the corresponding parameters of the PM, results in the synthesized signal with the highest evaluation score (Output, Figure 1). We note that in this work we used known PM techniques for the simulation of the musical instrument; thus, the treatment of any modeling challenges concerning the PM (e.g., nonlinearities) was taken from the relevant literature (see Section 3.1). Although nonlinearities concern the PM and not the adopted optimization framework, they affected our approach by imposing the need to use a different set of correction factors for every note produced, rather than a single set for all notes.
The first step is to define the details of the synthesis part (Input 1, Figure 1). It includes the determination of (a) the modifiable parameters, (b) the parameters' limits, and (c) the specific modification type of every parameter. In this step, the designer should ensure that unwanted alterations of the core elements in the PM algorithm are avoided. Thus, setting the modifiable parameters (determination point a) and their limits (determination point b) is essential to avoid results with no physical meaning, such as placing a tonehole outside the body of a wind instrument. In principle, all the parameters of the algorithm can be considered tunable. However, if parameter A depends on parameter B (i.e., A = f(B)), and B only affects A (i.e., there is no parameter C for which C = g(B)), then modifying both A and B is unnecessary. Based on the above, we chose the tunable and locked parameters. The type of modification (determination point c) in the proposed method is the mathematical expression which, with the use of a correction factor (CF), tweaks a parameter P (e.g., $P_{modified} = P_{initial} + CF$, $P_{modified} = P_{initial} \cdot CF$, $P_{modified} = P_{initial}^{CF}$, etc.).
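The three determination points above can be illustrated with a minimal Python sketch. The function and dictionary names are hypothetical, the "power" entry assumes the third expression is an exponentiation, and the default limits of 0.5 and 2.0 are only illustrative values for a multiplicative correction factor.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry of modification types (determination point c).
# "power" assumes the third listed expression means P_initial ** CF.
MODIFICATIONS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda p, cf: p + cf,
    "multiply": lambda p, cf: p * cf,
    "power": lambda p, cf: p ** cf,
}


def apply_correction(
    p_initial: float,
    cf: float,
    kind: str = "multiply",
    cf_bounds: Tuple[float, float] = (0.5, 2.0),  # illustrative limits (point b)
) -> float:
    """Return the modified parameter after clamping CF to its allowed range."""
    low, high = cf_bounds
    cf_clamped = min(max(cf, low), high)  # keep the factor physically meaningful
    return MODIFICATIONS[kind](p_initial, cf_clamped)


# A correction factor of 1 leaves a multiplicatively tuned parameter unchanged.
unchanged = apply_correction(0.3, 1.0)
# An out-of-range factor is clamped to the upper limit before being applied.
clamped = apply_correction(100.0, 3.0)
```

Clamping inside the helper is one simple way to enforce the limits; an optimizer with native bound support would make the clamp unnecessary.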
The second step includes the evaluation of the generated signal. The goal of our approach is for the synthesized signal, generated by a PM according to the building details (Input 2, Figure 1), to be as similar as possible to a reference. In our case, the reference is derived from a recorded signal generated by a musical instrument (Input 3, Figure 1). Apart from recordings, the proposed model works with other reference signals as well. For example, the goal signals could be generated by digital synthesis techniques (e.g., additive synthesis) or could even consist of raw numbers describing sound parameters (e.g., inharmonicity, deviation, durations, and other acoustic features found, for example, in [10]). In every iteration, the optimizer assigns new values to the set of correction factors ($CF_{new}$, Figure 1), which modifies the physical model accordingly and results in a new synthesized signal (Figure 1). The synthesized signal is then compared with the recording. An objective function calculates the resemblance of these two signals (in our case, the function is cross-correlation, see Section 3.2). It enables the quantification of their degree of similarity (DS, Figure 1). The optimizer determines which specific set of correction factors ($CF_{k}$, Figure 1), when applied to the PM, resulted in the maximum degree of similarity ($DS_{k}$, Figure 1) between the synthesized and the recorded signal (Figure 1). This particular set is the optimizer's output (Figure 1).

The Physical Model
In order to demonstrate the integration of the optimization framework with PM of musical instruments, we present the case of a recorder. The recorder constitutes a wind instrument with a flute-like (air-jet) excitation mechanism and a cylindrical resonator. The player produces various pitches by changing the fingering (i.e., arrangements of closed or open toneholes), or by overblowing. The instrument in our case study is a typical commercial recorder: Hohner's melody recorder with baroque fingering (type 1-095.143-1011) and eight toneholes (seven regular toneholes in the upper part of the acoustic pipe and one fingerhole in the bottom part).
The method used for the physical modeling of the instrument is Digital Waveguides (DWGs), a technique introduced by J. Smith [11] that simulates traveling waves with digital delay lines [12]. In this work, we chose to demonstrate our framework on a computationally cheap PM that allows a large number of iterations to be computed quickly during the optimization process (here 10k, see Section 3.3). However, every PM technique (e.g., FEM) is built upon parameters that can potentially be tuned with correction factors; hence, it can be integrated with the optimization framework. The only prerequisite is sufficient computational power for the optimizer to perform the required iterations. Figure 2 shows the block diagram of our recorder's PM, which is based on established approaches [13][14][15].
The recorder's excitation is based on an air jet that strikes a sharp edge (the labium) [16]. In Figure 2, we simulate the air jet traveling from the player's lips to the labium by a delay line (jet-delay). The mouth pressure (forming the air jet) is simulated as a constant pressure enriched with vibrato and noise content. In the real world, this constant mouth pressure is not reached and released instantly; thus, in our model, a dynamic envelope is applied to provide the duration of the attack, the sustain, and the release. The air jet is modeled by a static non-linear element using the sigmoid function $y = x - x^{3}$, as proposed in [17]. When the air blown by the player enters the instrument's bore, the air particles inside the resonator's cavity start to vibrate. The bore effect is simulated as a one-dimensional DWG by using delay lines (one delay line for the right-going and one for the left-going part of the wave, noted as $z^{-M}$ in Figure 2) [12]. The length of the digital delay lines (in samples), which depends on the speed of the acoustic waves, corresponds to the bore's physical length (in meters). A more accurate model of a wind instrument should also take the end corrections into consideration [18,19] in order to tune the generated pitch. However, our initial model neglects end corrections (and therefore yields a lower correlation than a more accurate model would, owing to the frequency mismatch; see Section 3.3 and Figure 3), because the proposed optimization framework itself determines the optimal correction. The pressure waves travel from the mouthpiece along the tube towards the other end (taken here as right-going). On reaching the far end, the so-called bell, a portion of the wave is reflected towards the mouthpiece (taken here as left-going) and the other portion is transmitted outside the instrument. The superposition of the right-going and left-going pressure waves forms a standing wave inside the resonator's cavity.
In particular, the effect of the bell is to radiate the high frequencies out of the instrument and to reflect the low frequencies back. This reflection is simulated as a lowpass filter, RL(z), which is, in our case, a first-order averaging filter. Further, to simplify the simulation, we assume that the first open tonehole defines the effective length of the bore [20] and, consequently, the length of the digital delay line.
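The building blocks described above can be sketched in a few lines of Python. This is a minimal illustration, not the model of Figure 2: the function names are hypothetical, the speed of sound is an assumed room-temperature value, and the fundamental estimate is the idealized one for a tube open at both ends, ignoring filter delays and end corrections.

```python
from typing import List

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def delay_length_samples(bore_length_m: float, sample_rate: int = 44100,
                         c: float = SPEED_OF_SOUND) -> int:
    """Delay-line length M (in samples) for one trip along the effective bore."""
    return round(sample_rate * bore_length_m / c)


def jet_nonlinearity(x: float) -> float:
    """Static non-linear element y = x - x^3 used for the air jet."""
    return x - x ** 3


def bell_reflection(samples: List[float]) -> List[float]:
    """First-order averaging lowpass: y[n] = 0.5 * (x[n] + x[n-1])."""
    out = []
    previous = 0.0
    for s in samples:
        out.append(0.5 * (s + previous))
        previous = s
    return out


def loop_fundamental_hz(m: int, sample_rate: int = 44100) -> float:
    """Idealized fundamental of an open-open tube: round trip of 2*M samples."""
    return sample_rate / (2 * m)


m = delay_length_samples(0.343)   # effective bore length of 0.343 m (illustrative)
f0 = loop_fundamental_hz(m)
```

Because M is an integer and end corrections are ignored, the idealized fundamental generally differs from the measured pitch, which is the kind of mismatch the correction factors are meant to absorb.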

Analysis by Synthesis Model
We identified eleven internal parameters that are spread across all the components of the block diagram in Figure 1 and affect the synthesized signal in both the time and the frequency domain. More specifically, three parameters affect the dynamic envelope of the mouth pressure (the duration of the attack, the sustain, and the release), three parameters affect the properties of the input (the frequency, the content, and the noise of the vibrato), one parameter affects the length of the delay lines, one parameter affects the interpolation used to achieve accurate tuning, and three parameters affect the filters' coefficients (the transmission of the tube into the mouth, and the reflections at both of its open ends). The number of correction factors is thus set to eleven. The chosen modification type is multiplication ($P_{modified} = P_{initial} \cdot CF$), which, after several trials, proved to give the best results and enabled the setting of initial generic logical boundaries (i.e., the range of values the optimizer is permitted to assign to the parameters in every iteration). The initial value of all the correction factors is set to one, which corresponds to the synthesized signal generated by the unmodified PM before the integration of the optimization framework. Determining the logical boundaries of eleven parameters is not a straightforward task. Choosing a single modification type (in our case, multiplication) makes this task easier by allowing generic logical boundaries to be set as a starting point before they are specified individually. These generic logical boundaries were set to half and double the initial values for the lower and upper boundary, respectively. After several trials to ensure that all the extreme values are within logical limits and that unwanted alterations of the PM algorithm's core elements are avoided, the boundaries were then set individually for every correction factor.
The core part of the integration of the optimization framework with the PM is the comparison between the synthesized sound and the real sound of the corresponding musical instrument. To make this comparison possible, we recorded samples of the commercial Hohner recorder mentioned above. The reference signals (nine signals for nine fingerings) are the recordings of the Hohner recorder. The recordings took place at the audio recording studio of the National and Kapodistrian University of Athens, Department of Informatics and Telecommunications, using an electroacoustic chain with a flat frequency response (microphone: SD Systems LCM 85 MK II with "LP" Preamp Power Supply, soundcard: Apogee Duet, computer: MacBook Air 2019). The distance between the recorder and the microphone was approximately 1 m, and the microphone was placed off-axis from the instrument's bell. In order to cross-check our method's performance, we recorded each note of the recorder 15 times and calculated all the possible Pearson correlation coefficients between the 15 recordings of the same note. As the reference signal for evaluating our model, we chose for each note the recording that had the highest average Pearson correlation coefficient with the other 14 recordings of the same note. The total average Pearson correlation coefficient was found to be 0.7, with a standard deviation of 0.16.
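The reference selection described above can be sketched as follows. This is a minimal illustration that assumes the takes of a note are already time-aligned and of equal length, which the text does not specify; the function names are hypothetical and the three short lists stand in for real recordings.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pick_reference(takes: List[Sequence[float]]) -> Tuple[int, float]:
    """Return the index of the take with the highest mean correlation
    with all the other takes, together with that mean value."""
    best_index, best_mean = -1, -math.inf
    for i, take in enumerate(takes):
        others = [pearson(take, t) for j, t in enumerate(takes) if j != i]
        mean_r = sum(others) / len(others)
        if mean_r > best_mean:
            best_index, best_mean = i, mean_r
    return best_index, best_mean


# Toy data: take_a is the "typical" take, take_b and take_c deviate from it
# in different places, so take_a correlates best with the rest on average.
take_a = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
take_b = [0.3, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
take_c = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.3, -1.0]
index, mean_r = pick_reference([take_b, take_a, take_c])
```

With 15 takes per note, the same function would be called once per note on a list of 15 signals.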
The degree of similarity between the synthesized and recorded signals is defined by the Pearson correlation coefficient (Equation (1)), which measures the linear correlation between two variables [21] and takes values between −1 and +1 (+1 corresponds to total positive linear correlation, 0 to no linear correlation, and −1 to total negative linear correlation). As it is irrelevant here whether the correlation is positive or negative, we take the absolute value of this coefficient to define the objective (Equation (2)). Moreover, considering that computational optimizers deal with minimization problems more efficiently, we set our objective to output the negative of the absolute coefficient (Equation (2)). In that way, the objective is passed to the optimizer (in this work we use Nelder-Mead, see Section 3.3), which searches for a set of variables that minimizes the objective and thus maximizes the correlation coefficient. In this work, the objective function leads to a non-convex optimization problem in which the optimizer is looking for a global minimum. Thus, the number of iterations needs to be fairly large to ensure good results. The objective function takes two inputs: (i) the synthesized digital signal generated by the PM and (ii) the recorded reference signal generated by the recorder. Our goal is to obtain the best match between the signals in terms of physical properties. The reason for this choice is that maximizing the resemblance of the reference and the synthesized signals in terms of physical properties would, consequently, maximize their resemblance in terms of perceptual properties.
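Equations (1) and (2) are referenced above but are not reproduced in this text. As a hedged reconstruction, assuming the standard sample definition of the Pearson coefficient for a synthesized signal $x$ and a recorded signal $y$ of $N$ samples each (the symbols $r_{xy}$, $N$, and $J$ are notation assumed here and may differ from the original):

```latex
r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{1}

J = -\left| r_{xy} \right| \tag{2}
```

Under this form, $J$ ranges from −1 (perfect linear correlation of either sign) to 0 (no linear correlation), so minimizing $J$ maximizes $|r_{xy}|$.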
After determining the set of correction factors, the boundaries, and the objective, the next step is to put the optimizer into practice. In this work, our focus is to find the optimal correction factors that tweak the algorithm's parameters so that the synthesized signal is as close as possible to the corresponding instrument's signal. At every iteration, the optimizer tries a new set of values for the modifiable parameters of the recorder's PM, which generates a signal (synthesized signal) to be compared (correlation coefficient) with the recorded signal (goal signal). The optimizer minimizes the objective function for every note (fingering) and outputs an optimal set of correction factors per note.
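The optimization step can be sketched with SciPy's Nelder-Mead implementation. This is a toy illustration under loud assumptions: a plain sine generator (`toy_synth`, a hypothetical stand-in) replaces the digital waveguide model, a single correction factor scales its frequency instead of the eleven factors used in the case study, and the 511 Hz and 522 Hz values are borrowed from the first partials of note 1 in the tables of Section 3.3. The limits are enforced by clipping inside the objective.

```python
import numpy as np
from scipy.optimize import minimize

FS = 44100        # sample rate in Hz (illustrative)
DURATION = 0.02   # seconds of signal compared (illustrative)
T = np.arange(int(FS * DURATION)) / FS


def toy_synth(f0_initial: float, cf: float) -> np.ndarray:
    """Stand-in for the physical model: a sine at f0_initial * cf."""
    return np.sin(2.0 * np.pi * f0_initial * cf * T)


def objective(cfs: np.ndarray, f0_initial: float, reference: np.ndarray,
              bounds=(0.5, 2.0)) -> float:
    """Negative absolute Pearson coefficient between synthesis and reference."""
    cf = float(np.clip(cfs[0], *bounds))  # keep the factor within its limits
    r = np.corrcoef(toy_synth(f0_initial, cf), reference)[0, 1]
    return -abs(r)


reference = np.sin(2.0 * np.pi * 522.0 * T)   # plays the role of the recording
result = minimize(
    objective,
    x0=[1.0],                       # CF = 1 corresponds to the unmodified model
    args=(511.0, reference),
    method="Nelder-Mead",
    options={"maxiter": 10_000},
)
best_cf = float(result.x[0])
best_similarity = -result.fun
```

In this toy setting the optimizer should move the factor from 1 towards 522/511, raising the similarity close to 1. With a real model the objective is non-convex in many dimensions, so the starting point and the limits matter far more than they do here.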

Results and Discussion
In this work, we studied the enhancement of the generated signal of the PM of a Hohner melody recorder with baroque fingering using the optimization framework. We studied the fingering system that results from the sequential opening of the toneholes (i.e., the one that starts with all toneholes closed and lifts the fingers one by one, beginning with the one closest to the bell end). The recorder's eight toneholes result in nine notes, which correspond to the sequence from all toneholes closed (note 1) to all open (note 9). The proposed model's inputs are (a) the building information, i.e., the geometrical details needed to synthesize nine audio files (9 notes), (b) the nine corresponding recordings of the real instrument, and (c) the initial values for the correction factors along with their upper and lower boundaries. Its output is nine sets of correction factors, one individual set per note.
In order to calculate the optimal set of correction factors, we put two optimization techniques into practice. We compared their efficiency and embedded the better one in our model. The mathematical optimizers tested here are Nelder-Mead (NM) [22,23] and Simulated Annealing (SA) [24,25], both of which have been used in acoustic-related studies [26,27]. To benchmark their performance, we ran the relevant algorithms ten times for 10k iterations per run. After several trials, this maximum number of iterations per run proved to be adequately high to satisfy the need for accurate tuning (a deviation between the recorded and the synthesized signals' pitch of less than 10 cents). Both techniques achieved best costs (maximum correlation factors) of similar value (±5% maximum deviation); however, NM was found to be more efficient than SA since it reached the best cost value much faster (NM: 100-400 iterations, SA: 1k-5k iterations).
Our framework significantly enhanced the similarity between the synthesized and the recorded signals (Figure 3). In the case of the PM synthesized signals (i.e., prior to the optimization integration, with all the correction factors at their initial value of one), the average Pearson correlation factor was 0.13 (the minimum and maximum are 0.03 and 0.48, respectively), and in the case of the PM-OPT synthesized signals (PM integrated with the optimization framework), the correlation factor reached an average value of 0.67 (the minimum and maximum are 0.59 and 0.76, respectively). The model resulted in a significant increase in the degree of similarity (Pearson correlation factor > 0.59) for all the notes, even for the ones with a low initial value (Pearson correlation factor < 0.1, notes 1, 3, 5, 7-9).
The improvement in the degree of similarity resulted in the synthesis of signals that are more accurately tuned to the corresponding recorded reference signals (Figure 4). The average absolute deviation of the fundamental frequency of the synthesized from the recorded signals was reduced from 40 cents in the case of the PM signals, which corresponds to an interval of almost half a semitone (a half-semitone deviation is 50 cents), to only 2 cents in the case of the PM-OPT signals (which is a non-noticeable difference [28]). In 5 out of 9 notes, the PM-OPT model led to the synthesis of signals perfectly tuned to the corresponding recordings (0 cents deviation; Figure 4, notes 2-5 and 7).
Moreover, a significant improvement in sound color resemblance was observed. The partials of the synthesized and the recorded signals initially deviated (e.g., PM vs. Recording case in Figure 5), and now they match (e.g., PM-OPT vs. Recording case in Figure 5). In order to measure the sound color resemblance, we studied the matching of the recorded and synthesized signals' frequency content by taking into consideration the first five partials (the fundamental and the first four overtones, Tables 1-3). The sound color resemblance per note between the recorded and the synthesized signals is determined by the average absolute deviation of their first five partials (the two columns on the right of Table 3). This value is significantly lower for all the PM-OPT deviation from Recording cases than for the corresponding PM deviation from Recording cases. The average value for all nine notes prior to the integration of the optimization framework (PM deviation from Recording) is 41 cents, whereas after the integration (PM-OPT deviation from Recording) it diminishes to only 2 cents. For eight out of nine notes, the PM-OPT and the recording have almost identical spectral content (partials' average absolute deviation ≤2 cents). This improvement is a byproduct of the precise tuning of the fundamental frequency.
PM produces partials that resemble a harmonic series, which can be found in the recordings as well. Therefore, tuning the fundamental frequency tunes, correspondingly, the overtones.

Table 1. The first five partials of the recorded signals per 9 notes, in Hertz.

Note   1st    2nd    3rd    4th    5th
1      522    1046   1568   2090   2614
2      591    1183   1774   2365   2957
3      661    1322   1982   2644   3305
4      721    1439   2160   2883   3604
5      781    1560   2347   3131   3901
6      883    1767   2659   3536   4414
7      1003   2005   3011   4013   5017
8      1117   2334   3352   4469   5586
9      1213   2428   3642   4855   6070

Table 2. The first five partials of the synthesized signals without adopting the optimal correction factors (PM) and after adopting the optimal correction factors (PM-OPT) per 9 notes, in Hertz.

       1st partial     2nd partial     3rd partial     4th partial     5th partial
Note   PM     PM-OPT   PM     PM-OPT   PM     PM-OPT   PM     PM-OPT   PM     PM-OPT
1      511    523      1022   1045     1533   1568     2044   2091     2555   2613
2      594    591      1188   1183     1782   1774     2376   2365     2970   2957
3      668    661      1337   1322     2005   1982     2673   2643     3342   3304
4      722    721      1444   1441     2165   2162     2887   2883     3609   3603
5      824    781      1649   1563     2473   2344     3297   3125     4122   3906
6      912    884      1823   1767     2735   2651     3647   3538     4559   4415
7      1013   1003     2027   2007     3040   3010     4054   4013     5068   5017
8      1197   1111     2394   2222     3592   3333     4789   4443     5986   5554
9      1222   1214     2444   2428     3666   3622     4888   4856     6110   6073

Table 3. The absolute deviation (in cents) of the synthesized signals, without adopting the optimal correction factors (PM) and after adopting the optimal correction factors (PM-OPT), from the recording signals for the 9 notes, the average absolute deviation of all the partials between the recorded and the synthesized signals for the 9 notes and the total average absolute deviation for all the partials and notes.
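The cent deviations discussed in this section can be recomputed from the tabulated partials. The sketch below assumes the usual definition of an interval in cents, 1200·log2(f_synth/f_ref), and uses the note 1 values listed above; the function names are hypothetical. Because the tabulated frequencies are rounded to whole Hertz, the results may differ slightly from the values reported in the text.

```python
import math
from typing import Sequence


def cents(f_synth: float, f_ref: float) -> float:
    """Signed interval between two frequencies, in cents."""
    return 1200.0 * math.log2(f_synth / f_ref)


def mean_abs_deviation(synth: Sequence[float], ref: Sequence[float]) -> float:
    """Average absolute deviation (in cents) over corresponding partials."""
    return sum(abs(cents(s, r)) for s, r in zip(synth, ref)) / len(ref)


# First five partials of note 1, in Hertz, as listed in the tables above.
recorded = [522, 1046, 1568, 2090, 2614]
pm = [511, 1022, 1533, 2044, 2555]
pm_opt = [523, 1045, 1568, 2091, 2613]

pm_deviation = mean_abs_deviation(pm, recorded)          # before optimization
pm_opt_deviation = mean_abs_deviation(pm_opt, recorded)  # after optimization
```

For this note the unoptimized model comes out several tens of cents away from the recording, while the optimized model stays within a couple of cents.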

Conclusions
In this work, we proposed a method that maximizes the efficiency of the physical modeling (PM) of musical instruments by applying optimal correction factors, and we presented a case study of a specific commercial recorder. PMs of musical instruments simulate the sound production mechanism of the corresponding physical instruments. However, the detailed analytical description of the phenomena governing the sound generation mechanism, which is needed to design an accurate PM of a musical instrument, is not trivial. The proposed use of the optimization framework to enhance the generated audio signal of the PM of musical instruments helps, in practice, to produce more realistic PM-generated signals. The results for the musical instrument used in our study indicate that the proposed model enhances the degree of similarity between the synthesized and the recorded signal (the average absolute Pearson Correlation Coefficient increased from 0.13 to 0.67), resolves mistuning issues (the absolute deviation of the synthesized from the recorded signals' pitches reduced from 40 cents to the non-noticeable level of 2 cents), and results in similar sound color characteristics (matching overtones).
We expect that this work will motivate researchers to create more complex optimization techniques using multiple objectives that account in parallel for both the physical properties (e.g., inharmonicity, amplitude deviation, spectrum entropy) and the perceptual properties (e.g., pitch, loudness, roughness), as well as further validation schemes based on listening tests. We also expect that the approach proposed in this work will improve the efficiency of both existing and future PMs of musical instruments.

Conflicts of Interest:
The authors declare no conflict of interest.