# Flow Synthesizer: Universal Audio Synthesizer Control with Normalizing Flows


## Abstract


## 1. Introduction

## 2. State-Of-Art

#### 2.1. Generative Models and Variational Auto-Encoders

#### 2.2. Normalizing Flows

#### Normalizing Flows in VAEs

#### 2.3. Synthesizer Parameters Optimization

## 3. Our Proposal

#### 3.1. Formalizing Synthesizer Control

#### 3.2. Mapping Latent Spaces with Regression Flows

#### 3.2.1. Posterior Parameterization

#### 3.2.2. Conditional Amortization

#### 3.3. Disentangling Flows for Semantic Dimensions

## 4. Experiments

#### 4.1. Dataset

#### 4.1.1. Synthesizer Sounds Dataset

#### 4.1.2. Audio Processing

#### 4.1.3. Metadata

#### 4.2. Models

#### 4.2.1. Baseline Models

#### 4.2.2. Our Proposal

#### 4.2.3. Optimization Aspects

## 5. Results

#### 5.1. Parameters Inference

#### 5.2. Increasing Parameters Complexity

#### 5.3. Reconstructions and Latent Space

#### 5.4. Out-Of-Domain Generalization

#### 5.5. Macro-Parameters Learning

#### 5.6. Semantic Parameter Discovery

#### 5.7. Creative Applications

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest


**Figure 1.** Universal synthesizer control. (**a**) Previous methods perform direct parameter inference from the audio, which is inherently limited by the non-differentiable synthesis operation and provides no higher-level form of control. (**b**) Our novel formulation states that we should first learn an organized and compressed latent space $\mathbf{z}$ of the synthesizer's audio capabilities, while mapping it to the space $\mathbf{v}$ of its synthesis parameters. This provides a deeper understanding of the principal dimensions of audio variation in the synthesizer, and access to higher-level interactions.

**Figure 2.** Universal synthesizer control. We learn an organized latent audio space $\mathbf{z}$ of a synthesizer's capabilities with a Variational Auto-Encoder (VAE) parameterized with Normalizing Flows (NF). This space maps to the parameter space $\mathbf{v}$ through our proposed regression flow and can be further organized with metadata targets $\mathbf{t}$. This provides sampling and invertible mapping between the different spaces.
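To make the role of a normalizing flow in Figure 2 concrete, the following is a minimal NumPy sketch of a single planar flow step in the style of Rezende and Mohamed, together with the change-of-variables update of the log-density. It is an illustration only: the flow family, dimensions, and parameter values are assumptions, and the helper names `constrain_u`, `planar_flow`, and `standard_normal_log_prob` are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

def constrain_u(u, w):
    """Adjust u so that w.u > -1, which keeps the planar flow invertible."""
    wu = w @ u
    m = -1.0 + np.logaddexp(0.0, wu)  # -1 + softplus(w.u), always > -1
    return u + (m - wu) * w / (w @ w)

def planar_flow(z, u, w, b):
    """Apply z' = z + u * tanh(w.z + b) to a batch of points.

    Returns the transformed batch and log|det Jacobian| for each point.
    """
    lin = z @ w + b                                # shape (batch,)
    z_new = z + np.outer(np.tanh(lin), u)          # shape (batch, dim)
    psi = (1.0 - np.tanh(lin) ** 2)[:, None] * w   # derivative of tanh times w
    log_det = np.log(np.abs(1.0 + psi @ u))        # matrix determinant lemma
    return z_new, log_det

def standard_normal_log_prob(z):
    """Log-density of an isotropic unit Gaussian, evaluated per row."""
    dim = z.shape[1]
    return -0.5 * np.sum(z ** 2, axis=1) - 0.5 * dim * np.log(2.0 * np.pi)

# Assumed toy setting: a 4-dimensional latent space and random flow parameters.
rng = np.random.default_rng(0)
dim = 4
w = rng.normal(size=dim)
u = constrain_u(rng.normal(size=dim), w)
b = 0.1

z0 = rng.normal(size=(8, dim))        # samples from the base density
zk, log_det = planar_flow(z0, u, w, b)

# Change of variables: log q(z_k) = log q(z_0) - log|det dz_k/dz_0|
log_q_zk = standard_normal_log_prob(z0) - log_det
```

Stacking several such steps, each with its own parameters, accumulates the log-determinant terms and yields a more expressive density than the initial Gaussian.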

**Figure 3.** Reconstruction analysis. Comparison of parameter inference and the resulting audio on the test set with 16 (**a**) or 32 (**b**) parameters, and on the out-of-domain sets (**c**), composed either of sounds from other synthesizers (**left**) or of vocal imitations (**right**).

**Figure 4.** Latent neighborhoods. We select two examples from the test set that map to distant locations in the latent space $\mathbf{z}$ and perform random sampling in their local neighborhood to observe the parameters and audio. We also display the latent interpolation between those points.

**Figure 5.** Macro-parameters learning. We show two of the learned latent dimensions $\mathbf{z}$ and compute the mapping $p(\mathbf{v}\mid\mathbf{z})$ when traversing these dimensions, while keeping all others fixed at $\mathbf{0}$, to see how $\mathbf{z}$ defines smooth macro-parameters. We plot the evolution of the 5 parameters with highest variance (**top**), the corresponding synthesis (**middle**), and audio descriptors (**bottom**). (**Left**) $\mathbf{z}_{3}$ seems to relate to a percussivity parameter. (**Right**) $\mathbf{z}_{7}$ defines a form of harmonic densification parameter.
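The traversal described in the Figure 5 caption can be sketched as follows: one latent dimension is swept over a range while all others stay at zero, the resulting points are mapped to synthesis parameters, and the parameters with the highest variance along the sweep are selected for plotting. The mapping here is a hypothetical stand-in (a random linear layer followed by a sigmoid), not the trained model, and the sizes and function names are assumptions chosen for illustration.

```python
import numpy as np

def traverse_latent_dimension(mapping, dim, latent_size, values):
    """Sweep one latent dimension over `values`, keeping all others at 0."""
    z = np.zeros((len(values), latent_size))
    z[:, dim] = values
    return mapping(z)  # shape (len(values), n_params)

def top_variance_parameters(params, k=5):
    """Indices of the k parameters that vary most along the traversal."""
    variances = params.var(axis=0)
    return np.argsort(variances)[::-1][:k]

# Hypothetical stand-in for the learned latent-to-parameter mapping.
rng = np.random.default_rng(0)
latent_size, n_params = 8, 16
weights = rng.normal(size=(latent_size, n_params))

def toy_mapping(z):
    # Sigmoid keeps the outputs in [0, 1], like normalized synth parameters.
    return 1.0 / (1.0 + np.exp(-(z @ weights)))

values = np.linspace(-3.0, 3.0, 50)
params = traverse_latent_dimension(toy_mapping, dim=3,
                                   latent_size=latent_size, values=values)
top5 = top_variance_parameters(params, k=5)
```

With a trained model, `toy_mapping` would be replaced by the mean of $p(\mathbf{v}\mid\mathbf{z})$, and each row of `params` would be sent to the synthesizer to render the corresponding audio.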

**Figure 6.** Semantic macro-parameters. Two latent dimensions $\mathbf{z}$ learned through disentangling flows for different pairs. We show the effect on the latent space (**left**) and on the parameter mapping $p(\mathbf{v}\mid\mathbf{z})$ when traversing these dimensions, which define smooth macro-parameters. We plot the evolution of the 6 parameters with highest variance and the resulting synthesized audio (**right**).

**Figure 7.** FlowSynth interface for audio synthesizer control in Ableton Live. The interface wraps a given VST and supports direct parameter inference, audio-based preset exploration, and control through both the semantic and unsupervised macro-controls learned by our model.

**Table 1.** Comparison between baselines, \*AEs, and our flows on the test set with 16, 32, and 64 parameters. We report across-fold mean and variance for parameter (Mean-Squared Error [$MSE_n$]) and audio (Spectral Convergence [$SC$] and $MSE_n$) errors. The best results are indicated in bold.

Model | 16 p. Params $MSE_n$ | 16 p. Audio $SC$ | 16 p. Audio $MSE_n$ | 32 p. Params $MSE_n$ | 32 p. Audio $SC$ | 32 p. Audio $MSE_n$ | 64 p. Params $MSE_n$ | 64 p. Audio $SC$ | 64 p. Audio $MSE_n$
---|---|---|---|---|---|---|---|---|---
MLP | 0.236 ± 0.44 | 6.226 ± 0.13 | 9.548 ± 3.1 | 0.218 ± 0.46 | 13.51 ± 3.1 | 36.48 ± 11.9 | 0.185 ± 0.41 | 39.59 ± 6.7 | 49.58 ± 2.7
CNN | 0.171 ± 0.45 | 1.372 ± 0.29 | 6.329 ± 1.9 | 0.159 ± 0.46 | 19.18 ± 4.7 | 33.40 ± 9.4 | 0.202 ± 0.37 | 52.48 ± 7.2 | 76.13 ± 8.9
ResNet | 0.191 ± 0.43 | 1.004 ± 0.35 | 6.422 ± 1.9 | 0.196 ± 0.49 | 10.37 ± 1.8 | 31.13 ± 9.8 | 0.248 ± 0.43 | 29.18 ± 3.8 | 78.15 ± 9.8
AE | 0.181 ± 0.40 | 0.893 ± 0.13 | 5.557 ± 1.7 | 0.169 ± 0.40 | 5.566 ± 1.2 | 17.71 ± 6.9 | 0.189 ± 0.37 | 8.123 ± 2.4 | 34.07 ± 2.4
VAE | 0.182 ± 0.32 | 0.810 ± 0.03 | 4.901 ± 1.4 | 0.153 ± 0.34 | 5.519 ± 1.4 | 16.85 ± 6.1 | 0.171 ± 0.37 | 5.152 ± 1.1 | 33.10 ± 2.4
WAE | **0.159 ± 0.37** | 0.787 ± 0.05 | 4.979 ± 1.5 | **0.147 ± 0.33** | 3.967 ± 0.88 | 16.64 ± 6.2 | **0.167 ± 0.36** | 8.960 ± 1.8 | **32.59 ± 2.1**
$\mathrm{VAE}_{flow}$ | 0.199 ± 0.32 | 0.838 ± 0.02 | 4.975 ± 1.4 | 0.164 ± 0.34 | 1.418 ± 0.23 | 17.74 ± 6.8 | 0.174 ± 0.36 | 6.721 ± 1.4 | 33.81 ± 2.3
$\mathrm{Flow}_{reg}$ | 0.197 ± 0.31 | **0.752 ± 0.05** | **4.409 ± 1.6** | 0.193 ± 0.32 | **0.911 ± 1.4** | **16.61 ± 7.4** | 0.178 ± 0.37 | **4.794 ± 1.8** | 34.49 ± 2.2
$\mathrm{Flow}_{dis.}$ | 0.199 ± 0.31 | 0.831 ± 0.04 | 5.103 ± 2.1 | 0.197 ± 0.42 | 1.481 ± 1.8 | 17.12 ± 7.9 | 0.182 ± 0.38 | 8.122 ± 1.8 | 34.97 ± 2.3
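For readers who want to reproduce this style of evaluation, here is a minimal sketch of the two audio error measures and of the across-fold summary. It assumes the commonly used definition of spectral convergence as the Frobenius norm of the spectrogram difference divided by the Frobenius norm of the target; the exact definition used for the table, and the normalization implied by the subscript in $MSE_n$, are not given in this section, so the code computes a plain mean-squared error. All data below is synthetic and the function names are hypothetical.

```python
import numpy as np

def spectral_convergence(target, estimate):
    """Assumed definition: ||target - estimate||_F / ||target||_F."""
    return np.linalg.norm(target - estimate) / np.linalg.norm(target)

def mean_squared_error(target, estimate):
    """Plain (unnormalized) mean-squared error over all entries."""
    return np.mean((target - estimate) ** 2)

def across_fold_summary(scores):
    """Mean and (population) variance of one metric across folds."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.var()

# Synthetic magnitude "spectrogram" and noisy reconstructions, one per fold.
rng = np.random.default_rng(0)
target = np.abs(rng.normal(size=(64, 100)))
fold_scores = []
for fold in range(5):
    estimate = target + 0.1 * rng.normal(size=target.shape)
    fold_scores.append(spectral_convergence(target, estimate))

mean_sc, var_sc = across_fold_summary(fold_scores)
print(f"SC: {mean_sc:.3f} ± {var_sc:.3f}")
```

Both measures are errors, so lower values indicate a closer match between the reconstructed and target audio.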

**Table 2.** Comparison between baselines, \*AEs, and our flows on the out-of-domain parameter inference task. We report across-fold mean and variance for parameter (MSE) and audio (SC and MSE) errors.

Model | Out-of-Domain (32 p.) SC | Out-of-Domain (32 p.) MSE | Out-of-Domain (64 p.) SC | Out-of-Domain (64 p.) MSE
---|---|---|---|---
MLP | 2.348 ± 2.1 | 37.99 ± 7.8 | 4.534 ± 5.1 | 40.42 ± 3.7
CNN | 2.311 ± 2.2 | 29.22 ± 8.2 | 6.329 ± 1.9 | 36.93 ± 2.3
ResNet | 2.322 ± 1.6 | 31.07 ± 9.5 | 4.645 ± 3.1 | 27.46 ± 2.3
AE | 1.225 ± 2.2 | 27.37 ± 7.2 | 2.557 ± 1.7 | 27.16 ± 1.4
VAE | 1.237 ± 1.3 | 27.06 ± 7.1 | 1.141 ± 1.2 | 27.15 ± 1.3
WAE | 1.194 ± 1.5 | 26.10 ± 6.4 | 0.999 ± 0.9 | 25.13 ± 1.3
$\mathrm{VAE}_{flow}$ | 1.193 ± 1.8 | 27.03 ± 6.4 | 1.022 ± 1.7 | 26.49 ± 1.3
$\mathrm{Flow}_{reg}$ | 1.201 ± 1.2 | 26.07 ± 7.7 | 1.132 ± 1.6 | 24.74 ± 1.3
$\mathrm{Flow}_{dis.}$ | 1.209 ± 1.4 | 26.77 ± 7.3 | 1.532 ± 1.8 | 27.89 ± 1.7

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Esling, P.; Masuda, N.; Bardet, A.; Despres, R.; Chemla-Romeu-Santos, A.
Flow Synthesizer: Universal Audio Synthesizer Control with Normalizing Flows. *Appl. Sci.* **2020**, *10*, 302.
https://doi.org/10.3390/app10010302
