# Populating the Mix Space: Parametric Methods for Generating Multitrack Audio Mixtures

Acoustics Research Centre, School of Computing, Science and Engineering, University of Salford, Greater Manchester, Salford M5 4WT, UK

Author to whom correspondence should be addressed.

Academic Editor: Tapio Lokki

Received: 31 October 2017 / Revised: 24 November 2017 / Accepted: 4 December 2017 / Published: 20 December 2017

(This article belongs to the Special Issue Sound and Music Computing)

The creation of multitrack mixes by audio engineers is a time-consuming activity and creating high-quality mixes requires a great deal of knowledge and experience. Previous studies on the perception of music mixes have been limited by the relatively small number of human-made mixes analysed. This paper describes a novel “mix-space”, a parameter space which contains all possible mixes using a finite set of tools, as well as methods for the parametric generation of artificial mixes in this space. Mixes that use track gain, panning and equalisation are considered. This allows statistical methods to be used in the study of music mixing practice, such as Monte Carlo simulations or population-based optimisation methods. Two applications are described: an investigation into the robustness and accuracy of tempo-estimation algorithms and an experiment to estimate distributions of spectral centroid values within sets of mixes. The potential for further work is also described.

The mixing of audio signals is a complicated optimisation problem, in which an audio engineer must balance a vast number of technical and aesthetic considerations in order to achieve the desired result. Traditionally, many tasks in audio mixing are performed on a mixing console. Typically, such a device consists of a series of channel strips, one representing each audio track, on which various operations can be performed such as adjustments in equalisation, panning and overall level. While this format is useful for allowing a hands-on interaction with the audio content, it is not the most direct or efficient way of exploring these parameters and discovering mixes in the process.

One legacy of this console design philosophy is that, in the literature, it has become commonplace to define a mix as the sum of the input tracks, subject to control vectors for gain, panning, equalisation, etc. [1,2,3]. Subsequently, a number of publications [4,5,6] have referred to a mix of n tracks as a point in an n-dimensional vector space, with each axis as the gain of a given track. While effective in certain cases, and certainly straightforward to visualise, this definition produces a solution space which is sub-optimal when searching for mixes.

The following are equations used to define a mix, according to various previous works. Note that the nomenclature has not been changed from the original texts. Equation (1) was used by [1], stating simply that a mix is the sum of all individual channels.

$$\mathrm{mix}=\sum _{n=1}^{N}{\mathrm{Ch}}_{n}\left[t\right]$$

This definition seems logical, even trivial, if inspired by a summing mixer, and has become the foundation for a series of more elaborate definitions, such as Equation (2), which adds a gain vector $a$ to each track, allowing for time-dependent changes to the track gains that simulate the movement of individual faders [2].

$$y\left[n\right]=\sum _{k=1}^{K}{a}_{k}\left[n\right]\times {x}_{k}\left[n\right]$$

In a review paper from 2011 [3], Equation (3) was used, adding generic control vectors c which modulate the input signals x. These control vectors allow for a variety of results, such as polarity correction, delay correction, panning and source separation, depending on their implementation.

$${\mathrm{mix}}_{l}\left(n\right)=\sum _{m=0}^{M-1}\sum _{k=0}^{K-1}{c}_{k,m,l}\left(n\right)\times {x}_{m}\left(n\right)$$

Each of these equations considers the mix as the sum of the input tracks, although there is little agreement on terminology or nomenclature in this general definition. What is important to realise here is that these expressions characterise not strictly the mix itself but the output of a summing mixer, or conventional fader-based mixing console. As will be shown in Section 2, the set of unique mixes is a subset of this set, as illustrated by Equation (4). We refer to this subset as the mix-space, introduced in [7]. It is this space that a mixing console should directly explore, rather than the gain-space. Section 2 presents an updated definition of the term mix, which produces concise solution spaces by exploring only the parameter space $\varphi $, avoiding the redundancies in g, which represents the gain vector of the system.

$$\underbrace{\left({g}_{1},{g}_{2},{g}_{3},\cdots ,{g}_{n}\right)}_{\text{gain-space}}=(\underbrace{r}_{\text{master volume}},\underbrace{{\varphi}_{1},{\varphi}_{2},\cdots ,{\varphi}_{n-1}}_{\text{mix-space}})$$

The primary contributions of this work are as follows: (a) the mix-space as a theoretical framework in which existing audio mixes can be examined, in contrast to the gain-space, and (b) methods for the generation of audio mixes in the mix-space. These contributions are described in Section 2.

The creation of artificial datasets relating to music mixing practice helps to overcome one of the main obstacles in the field of mix analysis, which is the lack of available data and the cost associated with gathering new data from mix engineers. Thus far, it has been difficult to make statistical inference about music mixing practice as available studies have only had access to small datasets of user-generated audio mixes, with few exceptions [8].

Adjustment of track level, pan position and equalisation are common in audio processing. While level and pan are fundamental operations in multichannel mixing, equalisation is one of the most commonly used processors. Together, these three operations form a basic channel strip. As such, the scope of this paper considers these three operations.

Consider the trivial case where two audio signals are to be mixed, where only the absolute levels of each signal can be adjusted. In Figure 1, the gains of two signals are represented by x and y, where both are positive-bound. Consider the point p as a configuration of the signal gains, i.e., $({p}_{x},{p}_{y})$. From this point, the values of x and y are both increased in equal proportion, arriving at the point ${p}^{\prime}$. The magnitude of p is less than that of ${p}^{\prime}$ ($\parallel p\parallel <\parallel {p}^{\prime}\parallel $) yet since the ratio of x to y is identical, the angles subtended by the vectors from the y-axis are equal ($\angle p=\angle {p}^{\prime}$). In the context of a mix of two tracks, what this means is that the volume of ${p}^{\prime}$ is greater than p, yet the blend of input tracks is the same.

As an alternative to Equation (1), a mix can be thought of as the relative balance of audio signals. From this definition, the points p and ${p}^{\prime}$ are the same mix, only ${p}^{\prime}$ is being presented at a greater volume. If the listener has control over the master volume of the system, then any difference between p and ${p}^{\prime}$ becomes ambiguous.

Mix: an audio stream constructed by the superposition of others in accordance with a specific blend, balance or ratio.

From p, the level of fader y can be increased by ${\Delta}_{y}$, arriving at q. In this particular example, the value of ${\Delta}_{y}$ was chosen such that $\parallel q\parallel =\parallel {p}^{\prime}\parallel $. However, for any $|{\Delta}_{y}|>0$, $\angle q\ne \angle {p}^{\prime}$. Therefore, q clearly represents a different mix to either p or ${p}^{\prime}$. Consequently, the definition of a mix is clarified by what it is not: when two audio streams contain the same blend of input tracks but at different overall amplitude levels, they can be considered the same mix. For this mixing example, where there are $n=2$ signals, represented by n gain values, the mix is dependent on $n-1$ variables; in this case, the angle to the vector. The ${\ell}_{2}$ norm of the vector is simply proportional to the overall loudness of the mix.
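The relationship between p, ${p}^{\prime}$ and q can be checked numerically. A minimal sketch (the gain values here are illustrative, not taken from the figure):

```python
import math

def mix_angle(x, y):
    """Angle from the y-axis to the gain vector (x, y); equal angles
    mean the same blend of the two tracks, regardless of level."""
    return math.atan2(x, y)

p = (0.4, 0.6)          # a configuration of the two track gains
p_prime = (0.8, 1.2)    # both gains doubled: louder, but the same mix
q = (0.4, 0.9)          # only fader y raised: a different mix

assert math.isclose(mix_angle(*p), mix_angle(*p_prime))
assert not math.isclose(mix_angle(*p), mix_angle(*q))
```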

Figure 2a shows a similar structure, with $n=3$. Here, the point ${p}^{\prime}$ is also an extension of p. As in Figure 1, q is located by increasing the value of y from the point p and $\parallel q\parallel =\parallel {p}^{\prime}\parallel $. Here, the values of each angle are explicitly determined and displayed. All three vectors share the equatorial angle of 60°. The polar angle of p and ${p}^{\prime}$ is 50°, while the polar angle of q is less than this, at ≈37°. As in the two-dimensional case, it is the angles which determine the parameters of the mix and the norm of the vector is related to the overall loudness.

While Figure 1 and Figure 2a show a space of track gains, there is clearly a redundancy of mixes in this space. What is ultimately desired is a space of mixes.

Mix-space: a parameter space containing all the possible audio mixes that can be achieved using a defined set of processes.

It becomes apparent that a Euclidean space with track gains as basis vectors is not an efficient way to represent a space of mixes, according to Definition 2. This explains why Equation (1) would not be appropriate when searching for mixes. If, in Figure 2a, a set of m points were selected at random in ${\mathbb{R}}^{3}$, the number of distinct mixes could be less than m, as the same mix could be chosen multiple times at different overall volumes. A set of m randomly selected points on a sphere of any radius (${\mathbb{S}}^{2}$) would result in a number of mixes equal to m. This surface is represented in Figure 2b, which shows the portion of a unit-sphere in positively-unbounded ${\mathbb{R}}^{3}$, upon which exist all possible mixes of three tracks.

While both the 2-content of ${\mathbb{S}}^{2}$ (surface area) and the 3-content of the enclosing ${\mathbb{R}}^{3}$ (volume) strictly contain an infinite number of points, the reduced dimensionality of ${\mathbb{S}}^{2}$ makes it a more attractive content to use in optimisation, as ${\mathbb{S}}^{2}$ is a subset of ${\mathbb{R}}^{3}$ (in this context, content can be considered as “hypervolume”; see http://mathworld.wolfram.com/Content.html). As a consequence, the mix-space, $\varphi $, is a more compact representation of audio mixes than the gain-space, g.

While the examples so far have used polar and spherical coordinates, for $n=2$ and $n=3$ respectively, to extend the concept to any n dimensions, hyperspherical coordinates are used. The conversion from Cartesian to hyperspherical coordinates is given below in Equation (5). The inverse operation, from hyperspherical to Cartesian, is provided in Equation (6), based on [11]. Here, ${g}_{j}$ is the gain of the jth track out of a total of n tracks. The angles are represented by ${\varphi}_{i}$. By convention, ${\varphi}_{n-1}$ is the equatorial angle, over the range $[0,2\pi )$ radians, while all other angles range over $[0,\pi ]\phantom{\rule{4pt}{0ex}}$ radians.

$$\begin{array}{cc}\hfill r=& \sqrt{{g}_{n}^{2}+{g}_{n-1}^{2}+\cdots +{g}_{2}^{2}+{g}_{1}^{2}}\hfill \\ \hfill {\varphi}_{i}=& \arccos\frac{{g}_{i}}{\sqrt{{g}_{n}^{2}+{g}_{n-1}^{2}+\cdots +{g}_{i}^{2}}}\phantom{\rule{1.em}{0ex}},\phantom{\rule{4.pt}{0ex}}\mathrm{where}\phantom{\rule{4.pt}{0ex}}i=1,2,\dots ,n-2\hfill \\ \hfill {\varphi}_{n-1}=& \left\{\begin{array}{cc}\arccos\frac{{g}_{n-1}}{\sqrt{{g}_{n}^{2}+{g}_{n-1}^{2}}}\hfill & {g}_{n}\ge 0\hfill \\ 2\pi -\arccos\frac{{g}_{n-1}}{\sqrt{{g}_{n}^{2}+{g}_{n-1}^{2}}}\hfill & {g}_{n}<0\hfill \end{array}\right.\hfill \end{array}$$

$$\begin{array}{ll}{g}_{1}=& r\cos{\varphi}_{1}\\ {g}_{j}=& r\cos{\varphi}_{j}{\displaystyle \prod _{i=1}^{j-1}}\sin{\varphi}_{i}\phantom{\rule{4.pt}{0ex}},\phantom{\rule{4.pt}{0ex}}\mathrm{where}\phantom{\rule{4.pt}{0ex}}j=2,3,\dots ,n-1\\ {g}_{n}=& r{\displaystyle \prod _{i=1}^{n-1}}\sin{\varphi}_{i}\end{array}$$
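The published companion code for this work is in Matlab; the two conversions can be transcribed directly. A Python sketch (zero-based indexing, so `phi[i]` stores ${\varphi}_{i+1}$):

```python
import numpy as np

def gains_to_mix(g):
    """Gain vector -> hyperspherical (r, phi), per Equation (5)."""
    g = np.asarray(g, dtype=float)
    n = g.size
    r = np.linalg.norm(g)
    phi = np.empty(n - 1)
    for i in range(n - 1):
        # clip guards against floating-point values fractionally outside [-1, 1]
        phi[i] = np.arccos(np.clip(g[i] / np.linalg.norm(g[i:]), -1.0, 1.0))
    if g[-1] < 0:  # the equatorial angle ranges over [0, 2*pi)
        phi[-1] = 2.0 * np.pi - phi[-1]
    return r, phi

def mix_to_gains(r, phi):
    """Hyperspherical (r, phi) -> gain vector, per Equation (6)."""
    phi = np.asarray(phi, dtype=float)
    n = phi.size + 1
    g = np.empty(n)
    sin_prod = 1.0
    for j in range(n - 1):
        g[j] = r * sin_prod * np.cos(phi[j])
        sin_prod *= np.sin(phi[j])
    g[-1] = r * sin_prod
    return g
```

Scaling all gains by a common factor changes only r, not the angles, which is precisely the redundancy the mix-space removes.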

Figure 3 represents a comparable 4-track mixing exercise, as described in [7]. The four audio sources were specifically chosen for this example (vocals, guitar, bass and drums) and assigned to ${g}_{1}$, ${g}_{2}$, ${g}_{3}$ and ${g}_{4}$ respectively. Consequently, the set of mixes is represented by a 3-sphere of radius r. Due to the deliberate assignment of tracks in this example, the parameters ${\varphi}_{1},{\varphi}_{2}$ and ${\varphi}_{3}$ represent a set of inter-channel balances which, due to the specific relationships of instruments, have importance to musicians and audio engineers: ${\varphi}_{3}$ determines the balance of bass to drums, the rhythm section in this case; ${\varphi}_{2}$ describes the projection of this balance onto the ${g}_{2}$ axis, i.e., the blend of guitar to rhythm section, and finally, ${\varphi}_{1}$ describes the balance of the vocal to this backing track.

From here, the parameter space comprising the $n-1$ angular components of the hyperspherical coordinates of an ($n-1$)-sphere in an n-dimensional gain-space is referred to as an ($n-1$)-dimensional mix-space. More simply, the mix-space is the surface of a hypersphere in gain-space. In the case of music mixing, only the positive values of g are of interest. Subsequently, the region of interest in the mix-space is only a small proportion of the total hypersurface; this fraction is $1/{2}^{n}$.

As each point in $\varphi $ represents a unique mix, the process of mixing can be represented as a path through the space. In Figure 4a, a random walk begins at the point marked ‘∘’ in the 2D mix-space (the origin [0,0], which corresponds to a gain vector of [1,0,0]). The model for the walk is a simple Brownian motion (http://people.sc.fsu.edu/~jburkardt/m_src/brownian_motion_simulation/brownian_motion_simulation.html). After 30 s, the walk is stopped and the final point reached is marked ‘×’. The gain values for each of the three tracks are shown in Figure 4b and it is clear that the random walk is on a 2-sphere, as anticipated. The time-series of gain values is shown in Figure 4c. Note that $g\in [-1,1]$, so for positive g the region explored is as represented in Figure 2b.

When presented in isolation, such a random mix, whether static or time-varying, may be unrealistic. It is hypothesised that real mix engineers do not carry out a random walk but a guided and informed walk, from some starting point (“source”) to their ideal final mix (“sink”). For further discussion of these terms, see [7], which uses the mix-space as a framework for the analysis of a simple 4-track mixing experiment. The power in these methods comes from generating a large number of mixes, more so than realistically could be obtained from real-world examples, and estimating parameters using statistical methods. Further generation and statistical analysis of time-varying mixes is left to further work.

A set of mixes can be generated by choosing points in the mix-space. In selecting a suitable parametric distribution, it is important to note that linear distributions, such as the normal distribution, are not appropriate, as the domain in question is not linear but a spherical surface. The statistics of such distributions are described by a number of equivalent terms in the literature, such as circular, spherical or directional statistics. In order to generate points close to a desired position on the $(n-1)$-sphere, points are generated from a von Mises–Fisher (vMF) distribution. The probability density function of the vMF distribution for a random n-dimensional unit vector $\mathbf{x}$ is given by

$${f}_{n}(\mathbf{x};\mu ,\kappa )={C}_{n}\left(\kappa \right){e}^{\kappa {\mu}^{T}\mathbf{x}}$$

where $\kappa \ge 0$, $\left|\right|\mu \left|\right|=1$, $n\ge 2$ and the normalisation constant ${C}_{n}\left(\kappa \right)$ is given by

$${C}_{n}\left(\kappa \right)=\frac{{\kappa}^{n/2-1}}{{\left(2\pi \right)}^{n/2}{I}_{n/2-1}\left(\kappa \right)}.$$

Here, ${I}_{v}$ is the modified Bessel function of the first kind at order v. The parameters $\mu $ and $\kappa $ are called the mean direction and concentration parameter, respectively. The greater the value of $\kappa $, the higher the concentration of the distribution around the mean direction $\mu $, resulting in lower variance. The distribution is unimodal for $\kappa >0$ and is uniform on ${\mathbb{S}}^{n-1}$ for $\kappa =0$. Further details can be found in [12,13]. The `SphericalDistributionsRand` (https://github.com/yuhuichen1015/SphericalDistributionsRand) code, based on the work of [14], was used to generate points according to a vMF distribution. In the context of audio mixes, $\mu $ (where $\left|\mu \right|=1$) represents the mix about which others are distributed, akin to the mean in a normal distribution. The $\kappa $ term represents the diversity of mixes generated, analogous (but inversely proportional) to variance. An example is shown in Figure 5, where three distributions are drawn from a 2-sphere.
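The cited `SphericalDistributionsRand` code is in Matlab. For illustration, vMF samples can also be drawn with a standard rejection sampler for the component of each sample along $\mu$, with the remaining direction uniform in the orthogonal complement. A Python sketch (`sample_vmf` is an illustrative name, not part of any cited code):

```python
import numpy as np

def sample_vmf(mu, kappa, size, seed=None):
    """Draw `size` unit vectors from a von Mises-Fisher distribution.

    Rejection-samples w = cos(angle to mu), then adds a uniform
    direction in the subspace orthogonal to mu.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    mu = mu / np.linalg.norm(mu)
    n = mu.size
    # envelope constants for the rejection step
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa ** 2 + (n - 1) ** 2)) / (n - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (n - 1) * np.log(1.0 - x0 ** 2)
    samples = np.empty((size, n))
    for i in range(size):
        while True:
            z = rng.beta((n - 1) / 2.0, (n - 1) / 2.0)
            w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
            if kappa * w + (n - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
                break
        v = rng.normal(size=n)      # random direction...
        v -= (v @ mu) * mu          # ...projected orthogonal to mu
        v /= np.linalg.norm(v)
        samples[i] = w * mu + np.sqrt(1.0 - w ** 2) * v
    return samples

# e.g., 500 8-track mixes concentrated around the equal-loudness mix
mixes = sample_vmf(np.full(8, 8 ** -0.5), kappa=200.0, size=500, seed=1)
```

Every sample lies on the unit 7-sphere, and with a large $\kappa$ the sample mean direction stays close to $\mu$.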

From here, the example mixing session described is an 8-track session, containing vocals, guitars, bass and drums [15]. For $n=8$ tracks, the gains required for the equal-loudness mix (once all audio tracks have been normalised in perceived loudness) are distributed around the following $\mu $: each track gain is equal to ${n}^{-1/2}$, such that $\left|\mu \right|=1$.

$$\mu =\left[0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\right]$$

Previous studies have indicated that, while a good initial guess, presenting each track at equal loudness is not an ideal final mix. As suggested by three recent PhD theses on the topic [15,16,17], vocals are often the loudest element in a mix. To this equal-loudness configuration, a vocal boost of 6.54 dB is added, following p. 157 of [16]. Applying this boost to the vocal track produces the following vector, where track 8 is vocals.

$$\mu =\left[0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.3536\phantom{\rule{8.0pt}{0ex}}0.7507\right]$$

The norm of this vector is now greater than 1, so the point is no longer on the unit 7-sphere. To project the point back onto the unit 7-sphere, the vector is normalised by dividing by its ${\ell}_{2}$ (Euclidean) norm, resulting in the following.

$$\mu =\left[0.2948\phantom{\rule{8.0pt}{0ex}}0.2948\phantom{\rule{8.0pt}{0ex}}0.2948\phantom{\rule{8.0pt}{0ex}}0.2948\phantom{\rule{8.0pt}{0ex}}0.2948\phantom{\rule{8.0pt}{0ex}}0.2948\phantom{\rule{8.0pt}{0ex}}0.2948\phantom{\rule{8.0pt}{0ex}}0.6259\right]$$

This vector is the new $\mu $ on the unit 7-sphere about which a set of mixes will be generated. The result is shown in Figure 6a. Each mix generated draws a gain value for each track such that the ${\ell}_{2}$ norm is equal to 1. Note that the median values closely match the vector $\mu $, as expected; of course, there may not exist a mix which has exactly these median values. The specific value of $\kappa $ was chosen, by trial and error, to avoid generating negative gains. For a distribution which produces negative gains, the absolute value could be taken to avoid inverting the phase of the tracks. Ignoring phase, a gain of g is perceptually equal to $-g$, meaning that the shape of the distribution would be altered if negative gains were included.
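The construction of this target vector is a short calculation:

```python
import numpy as np

n = 8
mu = np.full(n, n ** -0.5)        # equal-loudness mix: ||mu|| = 1
mu[7] *= 10 ** (6.54 / 20)        # +6.54 dB boost on the vocal track (track 8)
mu /= np.linalg.norm(mu)          # project back onto the unit 7-sphere
print(np.round(mu, 4))            # seven entries of 0.2948, vocals at 0.6259
```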

Rather than a simple vocal boost, what is required is a more informed choice of instrument levels. In [7], a simple 4-track mixing exercise was reported, where participants created mixes of vocals, guitars, bass and drums using only volume faders. This experiment was expanded to an 8-track format, as in this paper, and is reported in [15]. Participants were given the same task, only this time stereo-panning and a basic 3-band EQ were added. The median instrument levels obtained from this experiment are shown in Equation (8). Since participants had the ability to pan sources, the median levels were available for the left and right channels separately; these are shown in Equations (10) and (11). Figure 6b shows the mixes obtained when the target vector is based on these median track levels, known as ${\mu}_{\mathrm{informed}}$. It can be seen that the levels of bass guitar and kick drum are higher than average, while the drum overheads have been attenuated. Vocals are set high in the mix, as seen in the mono experiment [7,15] and other previous studies [16,17]. Matlab code for generating sets of mixes, as in Figure 6, is available for download (https://github.com/alexwilson101/PopulateMixSpace).

$${\mu}_{\mathrm{informed}}=\left[0.2254\phantom{\rule{8.0pt}{0ex}}0.2282\phantom{\rule{8.0pt}{0ex}}0.3221\phantom{\rule{8.0pt}{0ex}}0.2679\phantom{\rule{8.0pt}{0ex}}0.4437\phantom{\rule{8.0pt}{0ex}}0.3616\phantom{\rule{8.0pt}{0ex}}0.3221\phantom{\rule{8.0pt}{0ex}}0.5387\right]$$

Thus far, only mono mixes have been considered, where all audio tracks are summed to one channel. In creative music production, it is rare that mono mixes are encountered. The same mathematical formulations of the mix-space can be used to represent panning. Consider Figure 4, which shows track gains in the range $[-1,1]$. Should these be replaced with track pan positions ${p}_{n}$ (with $-1$ and 1 corresponding to extreme left and right pan positions, for example) then the mix-space (or “pan-space”) can be used to generate a position for each track in the stereo field. To avoid confusion with the earlier use of $\varphi $, the pan-space is denoted by $\theta $, although the formalism is identical.

$$\underbrace{\left({p}_{1},{p}_{2},{p}_{3},\cdots ,{p}_{n}\right)}_{\text{absolute panning}}=(\underbrace{{r}_{\mathrm{pan}}}_{\text{width-scaling}},\underbrace{{\theta}_{1},{\theta}_{2},\cdots ,{\theta}_{n-1}}_{\text{pan-space}})$$

However, the mix-space for gains ($\varphi $) takes advantage of the fact that a mix (in terms of track gains only) is comprised of a series of inter-channel gain ratios, meaning that the radius r is arbitrary and represents a master volume. In terms of track panning, one obtains a series of inter-channel panning ratios, the precise meaning of which is not intuitive. Additionally, the radius ${r}_{\mathrm{pan}}$ would still be required to determine the exact pan position of the individual tracks. Therefore, the pan-space describes the relative pan positions of audio tracks to one another.

For a simple example with only two tracks, the meaning of ${r}_{\mathrm{pan}}$ and $\theta $ is relatively simple to understand. Consider the unit circle in a plane where the Cartesian coordinates $(x,y)$ represent the pan positions of two tracks, as shown in Figure 7. Mix A is at the point $(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})$: both tracks are panned at the same position. As this is a circle with arbitrary radius ${r}_{\mathrm{pan}}$, the radius controls how far to the right (positive) the two tracks are panned, from 0 (centre) to $+1$ (far right). Mix B does the same but towards the left channel. One may ask whether A and B are identical “panning-mixes”, as p and ${p}^{\prime}$ in Figure 1 were identical “level-mixes”?

Now consider mix C, where one track is panned left and the other right. Mix D is simply the mirror image of this. Are these to be considered as the same mix, or as different mixes? Here, ${r}_{\mathrm{pan}}$ adjusts the distance between the two tracks, from both centre when ${r}_{\mathrm{pan}}=0$, to $(-1,1)$ when ${r}_{\mathrm{pan}}=\sqrt{2}$ (as indicated by mix ${C}^{\prime}$). Does a change in ${r}_{\mathrm{pan}}$ change the mix, or is it simply the same mix only wider/narrower? Overall, the angle $\theta $ adjusts the panning mix and ${r}_{\mathrm{pan}}$ is used to obtain absolute positions in the stereo field, at a particular width-scale (i.e., to zoom in or zoom out).

The method for random gains (see Section 2.2) was used to create separate mixes for the left and right channels of a stereo mix. In absolute terms, hard-panning only exists when the gain in one channel is 0 (perceptually, the impression of hard-panning can be achieved when the difference between one channel and the other is sufficiently large [18]). Since the vocal boost prevents any vocal gain of zero, the panning of the vocals is much less wide than the other tracks. Additionally, since $\kappa =200$ was chosen to prevent any negative gains, there are few zero-gain instances; therefore, there is a lack of hard-panning. Figure 8a,b show the gain settings produced and a boxplot of the resulting pan positions is shown in Figure 8c, where the inter-quartile range extends to ±0.4 for the seven instrument tracks and about ±0.2 for the vocals. The estimated density of pan positions for each track is shown, illustrating the relatively narrow vocal panning. As expected, these estimated density functions are Gaussian, to a good approximation.
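For reference, a pan position can be estimated from the per-channel gains. The normalised level difference used below is one plausible convention, labelled here as an assumption rather than the exact metric behind Figure 8c:

```python
def pan_position(g_left, g_right):
    """Pan estimate in [-1, 1]: 0 is centre, +/-1 means the track is
    present in only one channel (hard-panned). Assumed convention:
    normalised difference of the absolute channel gains."""
    g_left, g_right = abs(g_left), abs(g_right)
    return (g_right - g_left) / (g_left + g_right)

assert pan_position(0.3536, 0.3536) == 0.0   # equal channel gains: centred
assert pan_position(0.0, 0.5) == 1.0         # zero gain in the left channel: hard right
```

Under this convention, the vocal's non-zero gain in both channels bounds its pan position away from the extremes, matching the narrow vocal panning described above.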

Rather than using the same $\mu $ for both left and right channels, a unique choice of ${\mu}_{\mathrm{L}}$ and ${\mu}_{\mathrm{R}}$ can be made, as described in Section 2.2.2. The vectors used are shown in Equations (10) and (11). When summed to mono, this is equivalent to Equation (8).

$${\mu}_{\mathrm{L}}=\left[0.2741\phantom{\rule{8.0pt}{0ex}}0.1354\phantom{\rule{8.0pt}{0ex}}0.3361\phantom{\rule{8.0pt}{0ex}}0.2657\phantom{\rule{8.0pt}{0ex}}0.4401\phantom{\rule{8.0pt}{0ex}}0.3796\phantom{\rule{8.0pt}{0ex}}0.2566\phantom{\rule{8.0pt}{0ex}}0.5651\right]$$

$${\mu}_{\mathrm{R}}=\left[0.1189\phantom{\rule{8.0pt}{0ex}}0.2597\phantom{\rule{8.0pt}{0ex}}0.3162\phantom{\rule{8.0pt}{0ex}}0.2612\phantom{\rule{8.0pt}{0ex}}0.4683\phantom{\rule{8.0pt}{0ex}}0.2935\phantom{\rule{8.0pt}{0ex}}0.3727\phantom{\rule{8.0pt}{0ex}}0.5531\right]$$

Figure 9 shows the difference in gains produced for left and right channels. There were some negative track gains produced: when generating audio mixes, the absolute magnitude of the gain was used to avoid phase inversions which would alter spatial perception of the stereo overhead pair. It is clear that the similarity of vocals gains in left and right channels produces a limited variety of pan positions close to the central position, as shown in Figure 9c,d. Other instruments are panned with mean position and variance in accordance with the experimental results [15].

This method involved generating random mono mixes as in Section 2.2 (using Equation (7)) and then generating pan positions separately. A ${\mu}_{\mathrm{pan}}$ was created for a vMF distribution. This vector was based on experimental results reported in [15], which showed that, generally, overheads and guitars were widely panned while kick, snare, bass and vocals were positioned centrally.

$${\mu}_{\mathrm{pan}}=[-0.5\phantom{\rule{8.0pt}{0ex}}0.5\phantom{\rule{8.0pt}{0ex}}0\phantom{\rule{8.0pt}{0ex}}0\phantom{\rule{8.0pt}{0ex}}0\phantom{\rule{8.0pt}{0ex}}-0.4\phantom{\rule{8.0pt}{0ex}}0.4\phantom{\rule{8.0pt}{0ex}}0]$$

This then needs to be a unit vector for it to be used in creating vMF-distributed points. Consequently, the precise values are not critically important, as it is the relative pan positions that are reflected in the normalised vector and ${r}_{\mathrm{pan}}$ which would be used to adjust the scaling of these relative positions.

$${\mu}_{\mathrm{pan}}=[-0.5522\phantom{\rule{8.0pt}{0ex}}0.5522\phantom{\rule{8.0pt}{0ex}}0\phantom{\rule{8.0pt}{0ex}}0\phantom{\rule{8.0pt}{0ex}}0\phantom{\rule{8.0pt}{0ex}}-0.4417\phantom{\rule{8.0pt}{0ex}}0.4417\phantom{\rule{8.0pt}{0ex}}0]$$
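The normalisation is simply division by the $\ell_2$ norm:

```python
import numpy as np

# relative pan targets: overheads and guitars wide; kick, snare, bass, vocals central
mu_pan = np.array([-0.5, 0.5, 0.0, 0.0, 0.0, -0.4, 0.4, 0.0])
mu_pan /= np.linalg.norm(mu_pan)   # unit vector, as the vMF mean direction requires
print(np.round(mu_pan, 4))         # [-0.5522  0.5522  0.  0.  0.  -0.4417  0.4417  0.]
```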

Three different values for $\kappa $ were used, which illustrates how this parameter controls the distribution of panning. The results are shown in Figure 10, where the influence of $\kappa $ is clear. When $\kappa \to 0$, the distribution of pan positions approaches uniform over the sphere, and so the median pan positions are close to 0 (central position in the stereo field) for all tracks, regardless of ${\mu}_{\mathrm{pan}}$. As $\kappa $ increases, the distribution of pan positions is narrower, more concentrated on the specific pan positions specified in ${\mu}_{\mathrm{pan}}$.

Figure 11 shows an example of two mixes created using this method. The gains and pan positions of each track are displayed. It is clear that the instruments are typically panned close to the positions specified in the pan vector (Equation (13)). In this example, ${r}_{\mathrm{pan}}=1$; increasing this parameter would produce wider mixes, while decreasing it would produce narrower mixes.

Similarly to how the mix can be considered as a series of inter-channel gain ratios, when the frequency response of a single audio track is split into a fixed number of bands, the inter-band gain ratios can be used to construct a tone-space using the same formulae. For three bands, with the gains of the low, middle and high bands of the filter being ${g}_{\mathrm{low}}$, ${g}_{\mathrm{mid}}$ and ${g}_{\mathrm{high}}$ respectively, the problem is comparable to the 3-track mixing problem shown in Figure 2a. Again, one can convert to spherical coordinates (by Equation (5)) and obtain $[{r}_{\mathrm{EQ}},{\psi}_{1},{\psi}_{2}]$; in this case, the values of ${\psi}_{n}$ control the EQ filter applied, and ${r}_{\mathrm{EQ}}$ is the total amplitude change produced by equalisation (to avoid confusion, $\psi $ is used in place of $\varphi $ when referring to equalisation). As before, if all three bands are increased or decreased in the same proportion, the tone of the instrument does not change, apart from an overall change in presented amplitude, ${r}_{\mathrm{EQ}}$. Analogous to its use in track gains, the value of ${\psi}_{2}$ adjusts the balance between ${g}_{\mathrm{mid}}$ and ${g}_{\mathrm{high}}$, while ${\psi}_{1}$ adjusts the balance of ${g}_{\mathrm{low}}$ against this combination.

$$\underbrace{\left({g}_{1},{g}_{2},{g}_{3},\cdots ,{g}_{{n}_{\mathrm{bands}}}\right)}_{\text{gains of filter bands}}=(\underbrace{{r}_{\mathrm{EQ}}}_{\text{scaling}},\underbrace{{\psi}_{1},{\psi}_{2},\cdots ,{\psi}_{n-1}}_{\text{tone-space}})$$

In Figure 12, five points are randomly chosen in the tone-space. These co-ordinates are converted to three band gains as before, except that, in order to centre on a gain vector of $[1,1,1]$, ${r}_{EQ}=\sqrt{{n}_{\mathrm{bands}}}$, which is $\sqrt{3}$ in this example. Of course, this method can be used for any number of bands.
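A sketch of this band-gain construction for three bands, confirming that with ${r}_{\mathrm{EQ}}=\sqrt{3}$ the centre of the tone-space corresponds to unit gain in every band (`tone_to_gains` is an illustrative name):

```python
import numpy as np

n_bands = 3
r_eq = np.sqrt(n_bands)   # chosen so the centre of the tone-space gives unit band gains

def tone_to_gains(psi1, psi2, r=r_eq):
    """Tone-space angles -> (g_low, g_mid, g_high), via the inverse
    transform of Equation (6) in three dimensions."""
    return np.array([
        r * np.cos(psi1),
        r * np.sin(psi1) * np.cos(psi2),
        r * np.sin(psi1) * np.sin(psi2),
    ])

# the centre of the tone-space corresponds to a flat response
g_flat = tone_to_gains(np.arccos(1 / np.sqrt(3)), np.pi / 4)
```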

This method assumes that an audio track has equal amplitude in each band, which is rarely the case. When ${g}_{\mathrm{low}}$ is increased on a hi-hat track, there may be little effect, compared to a bass guitar. Therefore, the loudness change is a function of ${r}_{\mathrm{EQ}}$ and the spectral envelope of the track prior to equalisation. This is not considered here and is left to further work.

Being able to generate artificial datasets of audio mixtures in the mix-space has a variety of applications. Two such applications are described here. The procedure is similar for both experiments: an audio mix is created using a generated gain vector and raw multitrack audio, resulting in a generated mix from which audio signal features may be determined. Feature extraction used the MIRtoolbox [19], version 1.6.1. Equations (7) and (8) were used to create two sets of mixes. These experiments use sets of 500 mixes, rather than 1000 as outlined in earlier sections. It can be shown that the distributions of audio signal features do not change much beyond 500 mixes [15]. The reduced computation time is advantageous in these examples.
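The mix-rendering step described above reduces to a weighted sum of the loudness-normalised stems. A minimal sketch with synthetic stand-in audio (feature extraction with the MIRtoolbox is Matlab-side and not reproduced here):

```python
import numpy as np

def render_mix(tracks, gains):
    """Weighted sum of loudness-normalised stems: `tracks` is an
    (n_tracks, n_samples) array, `gains` a gain vector drawn from
    the mix-space. Returns the mono mixture signal."""
    return np.asarray(gains, dtype=float) @ np.asarray(tracks, dtype=float)

# illustrative stems: one second of sine tones stands in for real multitrack audio
sr = 8000
t = np.arange(sr) / sr
stems = np.stack([np.sin(2 * np.pi * f * t) for f in (110.0, 220.0, 440.0)])

gains = np.array([0.5, 0.5, 1.0])
gains /= np.linalg.norm(gains)    # a point on the unit 2-sphere
mix = render_mix(stems, gains)
```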

The test audio in these experiments consists of 30-second segments of the songs “Burning Bridges”, “I’m Alright” and “What I Want”, as used in previous studies [8,15], available from the Mixing Secrets free multitrack download library (http://www.cambridge-mt.com/ms-mtk.htm). The raw multitrack audio was reduced to the required eight tracks and each track was normalised in perceived loudness according to a modified form of ITU-R BS.1770 [20]. The songs “I’m Alright” and “What I Want” feature a track of piano as track #7, in place of ‘Gtr 2’.

In the absence of any time-stretching processes, the tempo of each mix of a given song should be identical. Consequently, if the tempo of alternate mixes is estimated and any disagreement is found, this suggests limitations in the tempo-estimation algorithm. In this section, estimating tempo across a large set of artificial mixes is presented as a means of assessing the performance of tempo-estimation algorithms. Two such algorithms are tested herein: the classic and metre-based [21] implementations of `mirtempo` in the MIRtoolbox. In short, the classic algorithm performs onset detection based on the amplitude envelope of the audio, and periodicities in the detected onsets are determined by finding peaks in the autocorrelation function. The metre-based method additionally takes the metrical hierarchy of the audio into account, allowing more consistent tempo-tracking. Whichever method is used, the resultant tempo is the mean value over the 30-second audio segment. Panning and equalisation were not considered here, as tempo was estimated from a mono signal.
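The evaluation itself reduces to summary statistics over the per-mix estimates. A sketch, using synthetic stand-in values rather than the paper's data:

```python
import numpy as np

def tempo_robustness(estimates_bpm, true_bpm, tol=1.0):
    """Summarise a tempo estimator's behaviour over a set of mixes of
    one song: mean squared error against the known tempo, and the
    fraction of mixes estimated within `tol` bpm of it."""
    est = np.asarray(estimates_bpm, dtype=float)
    mse = float(np.mean((est - true_bpm) ** 2))
    hit_rate = float(np.mean(np.abs(est - true_bpm) <= tol))
    return mse, hit_rate

# Synthetic stand-in estimates illustrating a classic-method failure
# mode: most mixes reported near 133 bpm instead of the true 100 bpm.
estimates = [100.0] * 50 + [133.3] * 450
mse, hits = tempo_robustness(estimates, true_bpm=100.0)
assert np.isclose(hits, 0.1) and mse > 900
```

A robust algorithm would show a high hit rate and low MSE across the whole set of mixes, not merely on one reference mix.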

Figure 13a and Figure 14a show the results for “Burning Bridges”, where it is clear that the classic method performs poorly. The correct tempo of 100 bpm is estimated for only a small percentage of the mixes, while all others are estimated close to 133 bpm (see Figure 14a). This leads to a high mean squared error (MSE), as shown in Table 1. A similar flaw is evident for “I’m Alright”, where the tempo is again overestimated by roughly 33% for both mix distributions (see Figure 13b and Figure 14b). This indicates a consistent error in the tempo-estimation routine, which is revealed by these mix distributions. The metre-based method performs much better, estimating the correct tempo in almost all cases and exhibiting a lower MSE, with only a small amount of absolute error (0.1–0.2 bpm). The performance of the classic tempo-estimation method is improved for “What I Want”, where both methods are found to have a high level of accuracy, as shown in Figure 13c and Figure 14c. For both distributions, the metre-based version produces clusters of solutions for “What I Want”, although the tempo represented by the largest cluster is consistent.

It is conceivable that no tempo-estimation algorithm can obtain the correct result in all cases. What this experiment reveals is that there is also variation within the mixes of a given song, with some mixes yielding the correct tempo and others yielding errors, and with different estimation methods showing varying levels of robustness to mixing practice.

It is common to use the spectral centroid as a feature to describe the timbre of an audio signal, specifically as an approximation to perceptual brightness [22,23,24]. However, where the spectral centroid of a mixed recording is evaluated, it is not clear whether the value obtained is typical of the recording as a whole or merely of that specific mix. This is especially problematic in object-based audio broadcast, where no reference mix exists. The issue applies to any signal feature, not just the spectral centroid. As studies of features across multiple alternate mixes are still rare in the literature [8,25,26], it has not been adequately investigated.

Previous work by the authors [8] reports the spectral centroid of 1501 user-generated mixes of 10 songs, with between 97 and 373 mixes per song. The estimated probability distributions of spectral centroid are shown (among other signal features relating to amplitude, timbre and spatial properties), indicating that the median spectral centroid can vary by song, although significant overlap between distributions is still possible.

The work in this section investigates the distributions of the spectral centroid that occur for artificial mixes drawn from different mix-space distributions. Equation (7) describes a simple model for mixes while Equation (8) shows the result of a perceptual level-balancing experiment. What is it about the mix that changes when these levels are adjusted? In this section, an estimation of the median spectral centroid produced by these two sets of mixes is made using Monte Carlo methods.

The experiment was conducted as follows. Using the $\mu $ vector of Equation (7) and $\kappa =200$, a set of 500 gain vectors was generated. For each of these vectors, a mix was created and its spectral centroid measured, yielding 500 measurements whose density was estimated using Kernel Density Estimation (KDE). This procedure was repeated for a second set of 500 mixes, generated using the $\mu $ vector of Equation (8) and $\kappa =200$. The estimated density distributions of both sets are plotted in Figure 15. These distributions were compared using a Wilcoxon rank-sum test, which tests the null hypothesis that the distributions of both samples are equal. This null hypothesis was rejected in each case, as shown by the p-values in each subplot of Figure 15 ($p<0.05$ in each case).
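The analysis pipeline (KDE followed by the rank-sum test) can be sketched with SciPy. The centroid samples below are synthetic stand-ins for the 500 rendered mixes per $\mu $ vector; the locations are loosely inspired by the figures discussed later, not taken from the paper's data:

```python
import numpy as np
from scipy.stats import gaussian_kde, ranksums

rng = np.random.default_rng(1)
# Stand-in spectral-centroid samples (Hz) for two mix distributions.
centroids_a = rng.normal(3800.0, 150.0, size=500)
centroids_b = rng.normal(4200.0, 150.0, size=500)

# Kernel Density Estimation, as used to draw each subplot of Figure 15.
grid = np.linspace(3000.0, 5000.0, 200)
density_a = gaussian_kde(centroids_a)(grid)
density_b = gaussian_kde(centroids_b)(grid)

# Wilcoxon rank-sum test of the null hypothesis that both samples are
# drawn from the same distribution.
stat, p = ranksums(centroids_a, centroids_b)
assert p < 0.05  # null hypothesis rejected for well-separated sets
```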

The significant difference between the medians of the two groups illustrates that, in general, there is a coarse perceptual difference in timbre between mixes drawn from the two distributions. This is true for all three songs considered. Naturally, whether such a difference arises depends on the chosen parameters: the $\mu $ vectors must be perceptually different, and if $\kappa $ is low enough, the distributions will overlap regardless of the choice of $\mu $ (recall that as $\kappa \to 0$, the distribution approaches uniformity). The choice of $\kappa $ depends on the application.

The higher spectral centroid in the simple equal-gain-with-vocal-boost approach (Equation (7)) is caused by an overestimation of the level of the drum overheads and vocal, and an underestimation of the level of bass and kick drum, relative to the results of the perceptual test (Equation (8)). The distributions of the spectral centroid for these artificially-generated mixes were compared to those of the user-generated mixes reported in previous work by the authors [8]. For “Burning Bridges” and “What I Want”, the peak of the ${\mu}_{\mathrm{informed}}$ distribution compares well to the user-generated mixes (approximately 3.8 kHz and 3.2 kHz, respectively). In the case of “I’m Alright”, the $\mu $ vector of Equation (7) yields a better match to real mixes (approximately 4.2 kHz); however, the 373 user-generated mixes of this song from [8] did contain a large proportion of amateur, potentially low-quality, mixes. For further comparison of artificial and user-generated mixes, see [15].

This experiment shows that a set of mixes can be obtained by sampling the mix-space but that perceptually-relevant mixes are more likely to be obtained if some level of human guidance is fed into the system. The parametric mixing model for this experiment did not feature panning or equalisation. It has been shown that the addition of equalisation broadens the distribution of spectral centroid values, as would be expected given the wider variety of instrument tone [15].

The theoretical framework presented in this paper provides for a space of mixes that can be explored, using evolutionary computing, machine learning or similar computational methods. Applications of this include the creation of an initial population of solutions to be used in the search of balance-mixes [9] and electric guitar tones [10], both using interactive genetic algorithms. These approaches have yielded positive results, as the user is able to search the space effectively and find the desired solution.

For subjective testing, the methods presented in this paper have the advantage that each mix is generated at a constant perceived loudness, as the magnitude of the gain vector can be set to a constant (such as $r=1$ in Equation (4)). In both [9,10], which used an interactive genetic algorithm, test participants were asked to subjectively rate the presented solutions. Generating all stimuli at a consistent loudness level allowed fair evaluations while avoiding the additional computational time required to apply specific loudness-normalisation to each generated mix. This allows a freer exploration of the solution space, since audio stimuli can be generated in real time.
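A sketch of this constant-magnitude sampling follows. Uniform directions are folded into the positive orthant, since negative track gains are not meaningful here; note that constant $r$ approximates constant loudness only for loudness-normalised, largely uncorrelated tracks:

```python
import numpy as np

def random_gain_direction(n_tracks, rng):
    """A uniformly random gain direction on the unit hypersphere,
    folded into the positive orthant. Every vector has magnitude
    r = 1, so no per-mix loudness-normalisation pass is needed
    before stimuli are presented."""
    v = np.abs(rng.standard_normal(n_tracks))
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
gains = np.stack([random_gain_direction(8, rng) for _ in range(100)])

# All 100 generated mixes share the same gain-vector magnitude.
assert np.allclose(np.linalg.norm(gains, axis=1), 1.0)
```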

Currently, newly-developed algorithms for tempo estimation, key estimation etc., are evaluated during specific challenges, such as the MIREX audio tempo estimation challenge (http://www.music-ir.org/mirex/wiki/2017:Audio_Tempo_Estimation), using standard datasets of audio recordings. We propose that sets of artificially generated mixes be considered as a standard test, in order to examine the level of robustness to mixing practice, as in Section 3.1.

Of course, more conventional experiments can also be analysed in this framework. In a level-balancing task, where participants set track gains to their desired levels, the resulting gains can be converted to the mix-space and analysed therein [7]. This allows differences between cohorts to be investigated: thus far, the different mixes produced by headphone and loudspeaker users have been examined [15], in addition to whether changing the initially-presented rough mix influences mixing decisions [7], a hypothesis also supported by later work [6].

A recent work analysed the audio mixes of broadcast audio stems (dialogue, foreground sound effects, background sound effects and music) as produced by hearing-impaired listeners [27]. This 4-track mixing scenario is equivalent to that represented by Figure 3. The changes in level made to each mix stem were reported in a bar chart, showing an increase in dialogue level and a decrease in the level of the other three stems. From a mix-space perspective, we know that these two strategies are equivalent. The mixes created in such an experiment can be more effectively analysed in a 3-dimensional mix-space, in which it becomes clearer how different cohorts (such as hearing-impaired listeners) balance the four tracks in different ways. If the needs of the user demand a change to the audio mix, as in the case of increasing speech intelligibility, then the path from the current mix to the desired mix may be more easily determined in the mix-space.
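The equivalence of the two strategies can be checked directly by converting gain vectors to mix-space coordinates. This sketch uses one plausible hyperspherical convention (the paper's Equation (5) fixes the exact one); the key property, that the angles are invariant to overall scaling, holds for any such convention:

```python
import numpy as np

def gains_to_mix_space(g):
    """Convert an n-track gain vector to [r, phi_1, ..., phi_{n-1}]
    under one plausible hyperspherical convention: r carries the
    overall level, the angles carry the mix itself."""
    g = np.asarray(g, dtype=float)
    r = float(np.linalg.norm(g))
    phis = []
    for i in range(len(g) - 1):
        rem = np.linalg.norm(g[i:])
        phis.append(float(np.arccos(g[i] / rem)) if rem > 0 else 0.0)
    return r, np.array(phis)

# Two 4-stem strategies: doubling the dialogue level, versus halving
# the other three stems. Both give the same mix-space angles (the
# same mix), differing only in overall level r.
boost = gains_to_mix_space([2.0, 1.0, 1.0, 1.0])
cut = gains_to_mix_space([1.0, 0.5, 0.5, 0.5])
assert np.allclose(boost[1], cut[1])      # identical mix
assert not np.isclose(boost[0], cut[0])   # different overall level
```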

As object-based audio broadcast becomes commonplace, audio signal feature extraction algorithms will need to be robust to changes in the audio object, be it changes in amplitude, panning, equalisation, or other parameters. It has been shown that the measured value of pulse clarity (a measure of how easy it is to pick out the underlying rhythm of a mix [28]) varies with object loudness, typically decreasing as the mix moves into regions of the mix-space where the relative level of vocals is increased [15].

A method for the creation of artificial audio mixes has been presented. This has been achieved by the parametric generation of points in a novel “mix-space”, a concise representation of three audio processing activities: level-balancing, stereo-panning and equalisation.

This method has been used for a number of applications thus far: creating an initial population for evolutionary algorithms [9,10], and two simple experiments estimating the values of audio signal features using Monte Carlo techniques, which revealed limitations in tempo-estimation algorithms. This paper suggests that, in future, such algorithms need to be robust to changes in instrument level and other mixing practices. This will allow such routines to be applied to an object-based paradigm of audio broadcast, where no reference mix may exist on which to determine the value of a feature.

Future work is required to further generalise the presented models to audio mixing practices, such as dynamic range processing, as well as implementing a fully-parametric model of time-varying mixes and the related statistical analysis.

Portions of the work described in this paper are from the PhD thesis of A.W., under the supervision of B.M.F. A.W. conceived and designed the experiments; A.W. performed the experiments; A.W. and B.M.F. analyzed the data; A.W. contributed materials/analysis tools; A.W. and B.M.F. wrote the paper.

The authors declare no conflict of interest. No funding sponsors had a role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

- Gonzalez, E.; Reiss, J. Improved control for selective minimization of masking using Inter-Channel dependancy effects. In Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 1–4 September 2008. [Google Scholar]
- Tsilfidis, A.; Papadakos, C.; Mourjopoulos, J. Hierarchical perceptual mixing. In Proceedings of the 126th AES Convention, Munich, Germany, 7–10 May 2009. [Google Scholar]
- Reiss, J.D. Intelligent systems for mixing multichannel audio. In Proceedings of the IEEE 17th International Conference on Digital Signal Processing, Corfu, Greece, 6–8 July 2011. [Google Scholar]
- Cartwright, M.; Pardo, B.; Reiss, J. Mixploration: Rethinking the audio mixer interface. In Proceedings of the ACM 19th International Conference on Intelligent User Interfaces, Haifa, Israel, 24–27 February 2014. [Google Scholar]
- Terrell, M.; Simpson, A.; Sandler, M. The mathematics of mixing. J. Audio Eng. Soc. **2014**, 62, 4–13. [Google Scholar] [CrossRef]
- Jillings, N.; Stables, R. A semantically powered digital audio workstation in the browser. In Proceedings of the Audio Engineering Society International Conference on Semantic Audio, Erlangen, Germany, 22–24 June 2017. [Google Scholar]
- Wilson, A.; Fazenda, B.M. Navigating the Mix-Space: Theoretical and practical level-balancing technique in multitrack music mixtures. In Proceedings of the 12th Sound and Music Computing Conference, Maynooth, Ireland, 24–26 October 2015. [Google Scholar]
- Wilson, A.; Fazenda, B. Variation in Multitrack Mixes: Analysis of Low-level Audio Signal Features. J. Audio Eng. Soc. **2016**, 64, 466–473. [Google Scholar] [CrossRef]
- Wilson, A.; Fazenda, B. An evolutionary computation approach to intelligent music production, informed by experimentally gathered domain knowledge. In Proceedings of the 2nd AES Workshop on Intelligent Music Production, London, UK, 13 September 2016. [Google Scholar]
- Wilson, A. Perceptually-motivated generation of electric guitar timbres using an interactive genetic algorithm. In Proceedings of the 3rd Workshop on Intelligent Music Production, Salford, UK, 14 September 2017. [Google Scholar]
- Blumenson, L.E. A Derivation of n-Dimensional Spherical Coordinates. Am. Math. Mon. **1960**, 67, 63–66. [Google Scholar] [CrossRef]
- Fisher, N.I. Statistical Analysis of Circular Data; Cambridge University Press: Cambridge, UK, 1995. [Google Scholar]
- Mardia, K.V.; Jupp, P.E. Directional Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 494. [Google Scholar]
- Chen, Y.H.; Wei, D.; Newstadt, G.; DeGraef, M.; Simmons, J.; Hero, A. Statistical estimation and clustering of group-invariant orientation parameters. In Proceedings of the IEEE 18th International Conference on Information Fusion, Washington, DC, USA, 6–9 July 2015. [Google Scholar]
- Wilson, A. Evaluation and Modelling of Perceived Audio Quality in Popular Music, towards Intelligent Music Production. Ph.D. Thesis, University of Salford, Salford, UK, 2017. [Google Scholar]
- Pestana, P. Automatic Mixing Systems Using Adaptive Audio Effects. Ph.D. Thesis, Universidade Catolica Portuguesa, Lisbon, Portugal, 2013. [Google Scholar]
- De Man, B. Towards a Better Understanding of Mix Engineering. Ph.D. Thesis, Queen Mary, University of London, London, UK, 2017. [Google Scholar]
- Lee, H.; Rumsey, F. Level and time panning of phantom images for musical sources. J. Audio Eng. Soc. **2013**, 61, 978–988. [Google Scholar]
- Lartillot, O.; Toiviainen, P. A Matlab toolbox for musical feature extraction from audio. In Proceedings of the 10th International Conference on Digital Audio Effects (DAFx-07), Bordeaux, France, 10–15 September 2007. [Google Scholar]
- Pestana, P.D.; Reiss, J.D.; Barbosa, A. Loudness measurement of multitrack audio content using modifications of ITU-R BS.1770. In Proceedings of the 134th AES Convention, Rome, Italy, 4–7 May 2013. [Google Scholar]
- Lartillot, O.; Cereghetti, D.; Eliard, K.; Trost, W.J.; Rappaz, M.A.; Grandjean, D. Estimating Tempo and metrical features by tracking the whole metrical hierarchy. In Proceedings of the 3rd International Conference on Music & Emotion (ICME3), Jyväskylä, Finland, 11–15 June 2013. [Google Scholar]
- Von Bismarck, G. Timbre of steady sounds: A factorial investigation of its verbal attributes. Acta Acust. United Acust. **1974**, 30, 146–159. [Google Scholar]
- Grey, J.M.; Gordon, J.W. Perceptual effects of spectral modifications on musical timbres. J. Acoust. Soc. Am. **1978**, 63, 1493–1500. [Google Scholar] [CrossRef]
- McAdams, S.; Winsberg, S.; Donnadieu, S.; De Soete, G.; Krimphoff, J. Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychol. Res. **1995**, 58, 177–192. [Google Scholar] [CrossRef] [PubMed]
- De Man, B.; Leonard, B.; King, R.; Reiss, J. An analysis and evaluation of audio features for multitrack music mixtures. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014. [Google Scholar]
- Wilson, A.; Fazenda, B.M. 101 Mixes: A statistical analysis of mix-variation in a dataset of multitrack music mixes. In Proceedings of the 139th AES Convention, Audio Engineering Society, New York, NY, USA, 29 October–1 November 2015. [Google Scholar]
- Shirley, B.G.; Meadows, M.; Malak, F.; Woodcock, J.S.; Tidball, A. Personalized object-based audio for hearing impaired TV viewers. J. Audio Eng. Soc. **2017**, 65, 293–303. [Google Scholar] [CrossRef]
- Lartillot, O.; Eerola, T.; Toiviainen, P.; Fornari, J. Multi-feature modeling of pulse clarity: Design, validation and optimization. In Proceedings of the 9th International Society for Music Information Retrieval Conference, Philadelphia, PA, USA, 14–18 September 2008. [Google Scholar]

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).