Figure 1.
Example of source number and octave ambiguity. The spectrum of a single piano note C4, Musical Instrument Digital Interface (MIDI) note number 60, is almost identical to the spectrum of the two-key combination of C4 and C5, where C5 is one octave above C4. (a) Spectrum of C4. (b) Spectrum of the mixture of C4 and C5. (c) Difference between the signals in (a) and (b).
Figure 2.
Block diagram of our multi-pitch estimation (MPE) system with multiple classifiers: first, an onset detector is applied to the input audio signal to infer where the musical notes occur; then, an audio frame of 4096 samples (93 milliseconds) is extracted, starting at the onset time; the extracted audio fragment is transformed into 5 different frequency-domain representations, resulting in 5 inputs (${I}_{1}\cdots {I}_{5}$); these 5 inputs are fed into the 61 evolved classifiers, which identify whether the corresponding musical notes are present in the extracted audio frame. Each of the 61 classifiers has a binary output.
Figure 3.
Onset detection process—(top) input signal in the time domain $x\left[n\right]$ and its Hilbert envelope $EH\left[n\right]$; (bottom) the spectral flux $SF\left[n\right]$, where n is the frame number, together with its smoothed version; the detected onsets are marked in black.
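The onset detector in Figure 3 combines the Hilbert envelope with a spectral flux function. A minimal sketch of half-wave-rectified spectral flux with simple peak picking is shown below; this is illustrative only, and the exact smoothing and thresholding used in the paper are not reproduced (`pick_onsets` and its threshold rule are assumptions).

```python
import numpy as np

def spectral_flux(frames):
    """Half-wave rectified spectral flux between consecutive STFT frames.

    frames: 2-D array (num_frames, num_bins) of magnitude spectra.
    Returns SF[n] for n = 1 .. num_frames - 1.
    """
    diff = np.diff(frames, axis=0)    # spectral change per bin
    diff = np.maximum(diff, 0.0)      # keep only energy increases
    return diff.sum(axis=1)

def pick_onsets(sf, threshold):
    """Mark frames whose flux exceeds the threshold and is a local peak."""
    onsets = []
    for n in range(1, len(sf) - 1):
        if sf[n] > threshold and sf[n] >= sf[n - 1] and sf[n] >= sf[n + 1]:
            onsets.append(n)
    return onsets
```

A sudden rise in spectral energy between consecutive frames then shows up as a flux peak, which is taken as a candidate onset.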
Figure 4.
A Cartesian genetic programming (CGP) generic example of a genotype and its corresponding phenotype. There is a grid of nodes connected as a graph, in which the functions are chosen from a set of primitive functions (the function set). There are 2 inputs (${x}_{0}$ and ${x}_{1}$) and 4 outputs (${O}_{A}$, ${O}_{B}$, ${O}_{C}$, and ${O}_{D}$). The grid has ${n}_{c}=3$ columns and ${n}_{r}=2$ rows. Each node has 3 genes: the underlined one is the function gene, which corresponds to the mathematical function being used, and the other 2 genes are the connection genes, or node inputs. These inputs can refer to other nodes or to system inputs.
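The genotype-to-phenotype decoding illustrated in Figure 4 can be sketched as a feed-forward evaluation: each node holds a function gene and two connection genes, and connections index either a system input or an earlier node. The code below is a minimal assumed decoder for two-argument functions, not the CGP toolbox used in the paper.

```python
def evaluate_cgp(inputs, nodes, outputs, functions):
    """Decode and evaluate a simple CGP genotype.

    inputs:    list of system input values (indices 0 .. len(inputs)-1)
    nodes:     list of (function_gene, conn_a, conn_b), in node-index order
    outputs:   list of indices pointing at an input or a node
    functions: the function set, a list of 2-argument callables
    """
    values = list(inputs)                 # inputs occupy the first indices
    for f, a, b in nodes:                 # each node reads earlier values only
        values.append(functions[f](values[a], values[b]))
    return [values[o] for o in outputs]
```

With the function set `[add, sub, mul]`, inputs ${x}_{0}=2$, ${x}_{1}=3$, and nodes `[(0, 0, 1), (2, 2, 0)]`, node 2 computes $2+3=5$ and node 3 computes $5\times 2=10$; nodes never referenced by an output are simply skipped, which is how CGP gets its inactive "junk" nodes.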
Figure 5.
Our CGP system block diagram: the training stage starts with a data augmentation process that uses onset detection to generate the inputs; the system engine is the CGP toolbox, which produces the output vector; binarization and fitness evaluation then score the individuals, and the result is fed back into the CGP toolbox to proceed to the next generation of the evolutionary process.
Figure 6.
Data augmentation process using time translation—(a) in blue is the original piano signal in the time domain (≈0.2 s), in black is the inferred onset location ${I}_{os}$, and in red is the extracted frame, starting at ${I}_{os}$; (b) audio frame extracted from the original signal, starting at instant ${I}_{os}-512$; and (c) audio frame extracted from the original signal, starting at ${I}_{os}+512$.
Figure 7.
Preprocessing—(a) acquired audio frame in the time domain, (b) Hanning window representation, (c) windowed audio frame, and (d) windowed signal represented in the frequency domain after applying the DFT.
Figure 8.
System inputs for a chord with 3 pitches (70 + 75 + 94), from bottom to top—real part of the DFT, imaginary part of the DFT, radius (magnitude) of the DFT, angle (phase) of the DFT, and the real cepstrum calculated from the time-domain signal. Only the first 500 of the 2048 bins of each of the 5 vectors are shown.
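The five representations listed in the Figure 8 caption can be derived from a single frame as sketched below. Keeping only the first half of the DFT and the small `eps` guard inside the log are assumptions for illustration.

```python
import numpy as np

def make_inputs(frame):
    """Build the five representations: real and imaginary parts of the DFT,
    its radius (magnitude) and angle (phase), and the real cepstrum of the
    time-domain frame. Only the first half of each vector is kept, since
    the spectrum of a real signal is conjugate-symmetric."""
    n = len(frame)
    spec = np.fft.fft(frame)[: n // 2]
    eps = 1e-12                           # avoid log(0) in silent bins
    cepstrum = np.fft.ifft(
        np.log(np.abs(np.fft.fft(frame)) + eps)
    ).real[: n // 2]
    return spec.real, spec.imag, np.abs(spec), np.angle(spec), cepstrum
```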
Figure 9.
Node genotype and its graphical representation. Node 6 has five genes: the first gene (black) codifies the function from the function set table (value 4), the second and third genes (red) are connection genes (values 3 and 1: node 3 output and node 1 output), and the fourth and fifth genes (blue) are real parameters for function use (values $0.4$ and $7.2$).
Figure 10.
(a) CGP output signal, (b) harmonic mask, and (c) computing intersection.
Figure 11.
Results of 5-fold cross-validation for ten classifiers, using a 500-case dataset (half positive, half negative). Precision ($\mu =0.91,\sigma =0.027$), recall ($\mu =0.95,\sigma =0.025$), and F-measure ($\mu =0.93$, $\sigma =0.025$).
Figure 12.
Results of the overall training stage for 61 classifiers. Precision ($\mu =0.92,\sigma =0.029$), recall ($\mu =0.97,\sigma =0.026$), and F-measure ($\mu =0.93,\sigma =0.024$).
Figure 13.
Results of the overall testing stage for 61 classifiers. Precision ($\mu =0.72,\sigma =0.08$), recall ($\mu =0.82,\sigma =0.036$), and F-measure ($\mu =0.764,\sigma =0.06$).
Figure 14.
Polyphony test results from polyphony 1 (monophonic) to polyphony 6 for precision, recall, and F-measure.
Figure 15.
Our CGP system versions: from CGP-1, our first approach, to CGP-HM-DA, the final version, which is the approach described here with the harmonic mask (HM) and data augmentation (DA). F-measure results range from $62.7\%$ to $76.6\%$.
Figure 16.
Resulting graph after decoding the genotype of classifier 55. The rectangles are functions from the function set, the circles are inputs, and the bold circle is the output. In grey are the nodes that are not used to calculate the output, due to the arity of some functions.
Figure 17.
F-measure results comparison (%) with state-of-the-art algorithms: Tolonen, Tolonen-500, Emiya, and Klapuri. Our proposal is referred to as CGP and is the last version of our algorithm, CGP-HM-DA.
Figure 18.
F-measure results for MPE on 2 different instruments: for piano, the F-measure is $76\%$, and for guitar, it rises to $83\%$.
Table 1.
The 5-fold cross-validation parameters trained/tested.
Parameter | Value |
---|---|
K-folds | 5 |
Data Augmentation | 3 |
Positive Test Cases | 250 |
Negative Test Cases | 250 |
Frame Size | 4096 |
Fitness Initial Threshold | 1.5 |
Outputs | 1 |
Rows | 1 |
Columns | 100 |
Levels Back | 100 |
$\lambda $ (E.S. 1 + λ) | 4 |
Mutation Probability | 5% |
Threshold Mutation Probability | 6% |
Harmonic Mask | 2 |
Runs | 10 |
Generations | 10,000 |
Table 2.
The algorithm’s main features.
Algorithm | Real-Time | Other Instruments | White Box | Piano FM (%) | Guitar FM (%) |
---|---|---|---|---|---|
Tolonen | √ | x | x | 47 | - |
Tolonen-500 | √ | x | x | 61 | - |
CGP | √ | √ | √ | 76 | 83 |
Emiya | x | √ | x | 80 | - |
Klapuri | x | √ | x | 82 | - |