Solvable Model for the Linear Separability of Structured Data

Linear separability, a core concept in supervised machine learning, refers to whether the labels of a data set can be captured by the simplest possible machine: a linear classifier. In order to quantify linear separability beyond this single bit of information, one needs models of data structure parameterized by interpretable quantities, and tractable analytically. Here, I address one class of models with these properties, and show how a combinatorial method allows for the computation, in a mean field approximation, of two useful descriptors of linear separability, one of which is closely related to the popular concept of storage capacity. I motivate the need for multiple metrics by quantifying linear separability in a simple synthetic data set with controlled correlations between the points and their labels, as well as in the benchmark data set MNIST, where the capacity alone paints an incomplete picture. The analytical results indicate a high degree of “universality”, or robustness with respect to the microscopic parameters controlling data structure.


Introduction
Linear classifiers are quintessential models of supervised machine learning. Despite their simplicity, or possibly because of it, they are ubiquitous: they are building blocks of more complex architectures, for instance, in deep learning and support vector machines, and they provide testing grounds for new tools and ideas in learning theory and statistical mechanics, in the study of both artificial neural networks and neuroscience [1][2][3][4][5][6][7][8][9]. Recently, interest in linear classifiers was rekindled by two outstanding results. First, deep neural networks with wide layers can be well approximated by linear models acting on a well-defined feature space, given by the so-called "neural tangent kernel" [10,11]. Second, it was discovered that deep linear networks, albeit identical to linear classifiers as far as the class of realizable functions is concerned, make it possible to reproduce and explain complex features of nonlinear learning and gradient flow [12].
In spite of the central role that linear separability plays in our understanding of machine learning, fundamental questions still remain open, notably regarding the predictors of separability in real data sets [13]. How does data complexity affect the performance of linear classifiers? Data sets in supervised machine learning are usually not linearly separable: the relations between the data points and their labels cannot be expressed as linear constraints. The first layers in deep learning architectures learn to perform transformations that enhance the linear separability of the data, thus providing downstream fully-connected layers with data points that are more adapted for linear readout [14,15]. The role of "data structure" in machine learning is a hot topic, involving computer scientists and statistical physicists, and impacting both applications and fundamental research in the field [16][17][18][19][20][21][22].
Before attempting to assess the effects of data specificities on models and algorithms of machine learning, and, in particular, on the simple case of linear classification, one should have available (i) a quantitative notion of linear separability and (ii) interpretable parameterized models of data structure. Recent advances, especially within statistical mechanics, have mainly focused on point (ii). Different models of structured data have been introduced to express different properties that are deemed to be relevant. For example, the organization of data as the superposition of elementary features (a well-studied trait of empirical data across different disciplines [23][24][25]) leads to the emergence of a hierarchy in the architecture of Hopfield models [26]. Another example is the "hidden manifold model", whereby a latent low-dimensional representation of the data is used to generate both the data points and their labels in a way that introduces nontrivial dependence between them [19]. An important class of models assumes that data points are samples of probability distributions that are supported on extended object manifolds, which represent all possible variations of an input that should have no effect on its classification (e.g., differences in brightness of a photo, differences in aspect ratio of a handwritten digit) [27]. Recently, a useful parameterization of object manifolds was introduced that is amenable to analytical computations [28]; it will be described in detail below. From a data science perspective, these approaches are motivated by the empirical observation that data sets usually lie on low-dimensional manifolds, whose "intrinsic dimension" is a measure of the number of latent degrees of freedom [29][30][31].
This article has two main aims: (i) to discuss a quantitative measure of linear separability that can be applied to empirical data and generative models alike; and (ii) to define useful models expressing nontrivial data structure and to compute analytically, within these models, compact metrics of linear separability. Most works concerned with data structure and object manifolds (in particular, Refs. [8,27,28]) focus on a single descriptor of linear separability, namely the storage capacity $\alpha_c$. Informally, the storage capacity measures the maximum number of points that a classifier can reliably classify; in statistical mechanics, it signals the transition, in the thermodynamic limit, between the SAT and UNSAT phases of the random satisfiability problem related to the linear separability of random data [32]. Here, I will present a more complete description of separability than the storage capacity alone (a further motivation is the discovery, within the same model of data structure, of other phenomena lying "beyond the storage capacity" [33]).

Linear Classification of Data
Let us first review the standard definition of linear separability for a given data set. In supervised learning, data are given in the form of pairs $(\xi^\mu, \sigma^\mu)$, where $\xi^\mu \in \mathbb{R}^n$ is a data point and $\sigma^\mu = \pm 1$ is a binary label. We focus on dichotomies, i.e., classifications of the data into two subsets (hence, the binary labels); of course, this choice does not exclude data sets with multiple classes of objects, as one can always consider the classification of one particular class versus all the other classes. Given a set of points $X = \{\xi^\mu\}_{\mu=1,\dots,m}$, a dichotomy is a function $\varphi : X \to \{-1,+1\}$. A data set $\{(\xi^\mu, \sigma^\mu)\}_{\mu=1,\dots,m}$ is linearly separable (or equivalently the dichotomy $\varphi(\xi^\mu) = \sigma^\mu$, $\mu = 1,\dots,m$, is linearly realizable) if there exists a vector $w \in \mathbb{R}^n$ such that

$$\operatorname{sgn}\Big(\sum_{i=1}^{n} w_i\,(\xi^\mu)_i\Big) = \sigma^\mu, \qquad \mu = 1,\dots,m, \tag{1}$$

where $(\xi^\mu)_i$ is the $i$th component of the $\mu$th element of the set. In the following, I will simply write $w \cdot \xi^\mu$ for the scalar product appearing in the sgn function when it is obvious that $w$ and $\xi^\mu$ are vectors. In machine learning, the left hand side of Equation (1) is the definition of a linear classifier, or perceptron. The points $x$ such that $w \cdot x = 0$ define a hyperplane, which is the separating surface, i.e., the boundary between points that are assigned different labels by the perceptron. By viewing the perceptron as a neural network, the vector $w$ is the collection of the synaptic weights. "Learning" in this context refers to the process of adjusting the weight vector $w$ so as to satisfy the $m$ constraints in Equation (1). Because the sgn function is invariant under multiplication of its argument by a positive constant, I will always consider normalized vectors, i.e., both the weight vector $w$ and the data points $\xi$ will lie on the unit sphere.
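The definition above can be illustrated with a minimal Python sketch of the classic perceptron learning rule, which searches for a weight vector satisfying all the constraints of Equation (1). The function names (`sgn`, `dot`, `perceptron`) and the epoch budget `max_epochs` are illustrative choices, not taken from the text. Returning `None` only means that no separating vector was found within the budget; it is not a proof of non-separability.

```python
def sgn(x):
    """Sign function with values in {-1, +1} (zero is mapped to -1)."""
    return 1 if x > 0 else -1


def dot(u, v):
    """Scalar product w . xi of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def perceptron(points, labels, max_epochs=1000):
    """Look for w such that sgn(w . xi^mu) == sigma^mu for every mu.

    Returns the weight vector if one is found within max_epochs sweeps
    over the data, otherwise None (hypothetical budget, see lead-in).
    """
    n = len(points[0])
    w = [0.0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for xi, sigma in zip(points, labels):
            if sigma * dot(w, xi) <= 0:  # constraint violated
                # Perceptron update: move w towards sigma * xi.
                w = [wi + sigma * x for wi, x in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:
            return w
    return None


# Toy usage: four points in R^2 with a realizable dichotomy.
points = [(1.0, 0.2), (0.8, -0.1), (-0.9, 0.3), (-1.0, -0.4)]
labels = [1, 1, -1, -1]
w = perceptron(points, labels)
```

Because the classifier has no bias term, the separating hyperplane always passes through the origin; a point and its antipode carrying the same label can therefore never be separated.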
A major motivation behind the introduction of the concept of data structure and the combinatorial theory that is related to it (reviewed in Sections 5 and 6 below) is the fact that the definition of linear separability above is not very powerful per se. Empirically relevant data sets are usually not linearly separable. Knowing whether a data set is linearly separable does not convey much information on its structure: crucially, it does not allow quantifying "how close" to being separable or nonseparable the data set really is. To fix the ideas, let us consider a concrete case: the data set MNIST [34]. MNIST is a collection of handwritten digits, digitized as 28 × 28 greyscale images, each labelled by the corresponding digit ("0" to "9"). I will use the "training" subset of MNIST, containing 6000 images per digit. To simplify the discussion, I will mainly focus on a single dichotomy within MNIST: that expressed by the labels "3" and "7". The particular choice of digits is unimportant for this discussion; I will give an example of another dichotomy below, when subtle differences between the digits can be observed.
One may ask the question as to whether the MNIST training set, as a whole, is linearly separable. However, the answer is not particularly informative: the MNIST training set is not linearly separable [34]. But how unexpected is this answer? Can we measure the surprise of finding out a given training set is or is not linearly separable? Intuitively, there are three different properties of a data set that facilitate or hinder its linear separability: size, dimensionality, and structure.

• Size. The number of elements $m$ of a data set is a simple indication of its complexity. While a few data points are likely linearly separable, they convey little information on the "ground truth", the underlying process that generated the data set. On the contrary, larger data sets are more difficult to classify, but the information that is stored in the weights after learning is expected to be more faithful to the ground truth (this is related to the concept of "sample complexity" in machine learning [35]).
• Dimensionality. There are two complementary aspects when considering dimensionality in a data oriented framework. First, the embedding dimension is the number of variables that a single data point comprises. For instance, MNIST points are embedded in $\mathbb{R}^{784}$, i.e., each of them is represented by 784 real numbers. The embedding dimension is $n$ in Equation (1); therefore, $n$ is also the number of degrees of freedom that a linear classifier can adjust to find a separating hyperplane. Hence, one expects that a large embedding dimension promotes linear separability. Second, the data set itself does not usually occupy the embedding space uniformly. Rather, points lie on a lower-dimensional manifold, whose dimension $d$ is called the intrinsic dimension of the data set. The concept of general position discussed below is related to the intrinsic dimension; however, beyond that, I will not explicitly consider this type of data complexity in this article (for analytical results on the linear separability of manifolds of varying intrinsic dimension, see [27]).
• Structure. As I will show in a moment, the effects of size and dimensionality on linear separability are easily quantified in a simple null model. Data structure, on the other hand, has proved more challenging, and it is the main focus of the theory described here. There is no single definition of data structure; different definitions are useful in different contexts. A common characterization can be given like this: data have structure whenever the data points $\xi^\mu$ and their labels $\sigma^\mu$ are not independent variables. I will specify a more precise definition in Section 5. Intuitively, data structure can either promote or preclude linear separability. If points that are close to one another tend to have the same label, then linear separability is improved; if, instead, there are many differently labeled points in a small region of space, then linear separability is obstructed.
Let us get back to the question "how surprising is it that MNIST is not linearly separable?". This question should be answered by at least taking into account the first two properties described above, the size of the data set and its dimensionality, which are readily computed from the raw data. In fact, the surprise, i.e., the divergence from what is expected based on size and dimensionality, may be interpreted as a beacon of the third property: data structure. I will show in the next section that the answer to our question is "exceedingly unsurprising". Yet, a slightly modified question will reveal that MNIST, albeit unremarkable in it not being linearly separable, is exceptionally structured.

Null Model of Linear Separability
Let us consider a null model of data that fixes the dimension $n$ and the size $p$. I use a different letter ($p$ instead of $m$) because it will be useful below to have two different symbols for the size of the whole data set ($m$) and for the size of its subsets. Consider a data set $Z_p = \{(\xi^\mu, \sigma^\mu)\}_{\mu=1,\dots,p}$, where the vectors $\xi^\mu$ are independent random variables uniformly distributed on the unit sphere, and the labels $\sigma^\mu$ are independent Bernoulli random variables (also independent of every $\xi^\mu$). These choices are suggested by a maximum entropy principle, when only the parameters $p$ and $n$ are fixed. What is the probability that a data set generated by this model is linearly separable? This problem was addressed and solved more than half a century ago [36][37][38]; in Section 6, I will describe an analytical technique that allows this computation. The fraction of dichotomies of a random data set that are linearly realizable is

$$c_{n,p} = 2^{1-p} \sum_{k=0}^{n-1} \binom{p-1}{k}, \tag{2}$$

where $\binom{\cdot}{\cdot}$ is the binomial coefficient. Thus, a random (uniform) dichotomy has probability $c_{n,p}$ of being linearly realizable. In this article, I will refer to the probability $c_{n,p}$ as the separability, or probability of separation. A related quantity is the number of linearly realizable dichotomies $C_{n,p} = 2^p c_{n,p}$ (here, $2^p$ is the total number of dichotomies of $p$ points). Figure 1 shows the sigmoidal shape of $c_{n,p}$ as a function of $p$ at fixed $n$. The separability is exactly equal to 1 up to $p = n$ (which pinpoints what is known as the Vapnik-Chervonenkis dimension in statistical learning theory [35]), and it stays close to 1 up to a critical value $p_c$, which increases with $n$. At $p_c$, the curve steeply drops to asymptotically vanishing values, the more abruptly the larger $n$ is. Rescaling the number of points $p$ with the dimension $n$ yields the load $\alpha = p/n$. As a function of $\alpha$, the probability of separation has the remarkable property of being equal to 1/2 at the critical value (known as the storage capacity) $\alpha_c = p_c/n = 2$, independently of $n$. Such an absence of finite size corrections to the location of the critical point is an unusual feature, which will be lost when we consider structured data below. In the large-$n$ limit, $c_{n,\alpha n}$ converges to a step function that transitions from 1 to 0 at $\alpha_c$.
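The closed form for the separability of the null model, $c_{n,p} = 2^{1-p}\sum_{k=0}^{n-1}\binom{p-1}{k}$ (the classic function-counting result of [38]), is straightforward to evaluate exactly. The following Python sketch does so with rational arithmetic and also returns the base-10 logarithm for sizes where the value underflows a float. The function names are illustrative and not taken from the text.

```python
from fractions import Fraction
from math import comb, log10


def separability(n, p):
    """Exact c_{n,p} = 2^(1-p) * sum_{k=0}^{n-1} C(p-1, k), as a Fraction."""
    if n < 1 or p < 1:
        raise ValueError("n and p must be positive integers")
    # Terms with k > p-1 vanish, so the sum can stop at min(n, p) - 1.
    total = sum(comb(p - 1, k) for k in range(min(n, p)))
    return Fraction(total, 2 ** (p - 1))


def log10_separability(n, p):
    """log10 of c_{n,p}; usable when c_{n,p} is far below float range."""
    total = sum(comb(p - 1, k) for k in range(min(n, p)))
    # math.log10 accepts arbitrarily large Python integers.
    return log10(total) - (p - 1) * log10(2)


# The separability is 1 up to p = n and exactly 1/2 at the load alpha = 2.
c_at_vc_dimension = separability(5, 5)
c_at_capacity = separability(7, 14)
```

Exact fractions make the property $c_{n,2n} = 1/2$ visible without rounding, while the logarithmic variant is the one to use for sizes such as $n = 784$, $p = 12{,}000$.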
How large is the probability of separation $c_{n,m}$ that is given by Equation (2) when one substitutes the sample size $m = 12{,}000$ and the dimensionality $n = 784$, i.e., those of the dichotomy "3"/"7" in the data set MNIST? The probability, as anticipated, is utterly small, less than $10^{-2000}$: it should be no surprise that MNIST is not linearly separable. This comparison is not completely fair, because of the assumption, underlying Equation (2), of general position. The concept of general position is an extension of that of linear independence, which is useful for sets larger than the dimension of the vector space. A set $X$ of vectors in $\mathbb{R}^n$ is in general position if there is no linearly dependent subset $X' \subseteq X$ of cardinality less than or equal to $n$. MNIST is quite possibly not in general position. To make sure that it is, I downscaled each image to 10 × 10 pixels and only considered 1000 images per class (to allow for faster numerical computations), and applied mild multiplicative random noise, by flipping 5% of the pixels around the middle grey value (see Figure 2); I will refer to this modified data set as "rescaled MNIST". Running the standard perceptron algorithm on rescaled MNIST did not show signs of convergence after $10^5$ iterations, which indicated that the data set is likely not linearly separable. For $m = 2000$ and $n = 100$, the separability $c_{n,m}$ is less than $10^{-400}$.

The null model provides a simple and concise interpretation of the linear separability of a given data set, given its size $m$ and dimensionality $n$, in terms of 5 possible outcomes (see Figure 1, bottom panel):

1. The set is linearly separable and it lies in the region where $c_{n,m} \approx 1$. Separability here is trivial: almost all data sets are separable in this region, provided that the points are in general position.
2. The set is not linearly separable and it lies in the region where $c_{n,m} \approx 1$. The only way this can happen for $m \le n$ is if the points are not in general position. For $m > n$, but still in this region, the lack of separability could also be attributed to a non-trivial data structure.
3. The set is not linearly separable and it lies in the region where $c_{n,m} \approx 0$. Almost no dichotomy is linearly realizable in this region; therefore, the lack of separability is trivial here.
4. The set is linearly separable and it lies in the region where $c_{n,m} \approx 0$. This situation is the hallmark of data structure. The fact that the data set happens to represent one of the few dichotomies that are linearly realizable in this region indicates a non-null dependence between the labels and the points in the data set.
5. The set lies in the region where $c_{n,m}$ is significantly different from 0 and 1. Here, knowing that a data set is linearly separable or not is unsurprising either way. The location and the width of this "transition region" are the two main parameters that summarize the shape of the separability curve. In Section 6 I will show how to compute these quantities within a more general model that includes data structure.

[Figure 2 caption:] The separabilities of two representative dichotomies in the data set (digits "4" versus "9", and digits "3" versus "7") are far removed from the null model, as is apparent from the location (and the width) of their transition regions (green areas). The shaded areas denote the 95% variability intervals. (Right panel) By increasing the distance $\delta$ between the means of the two Gaussian distributions that define the synthetic data set (here in $n = 20$ dimensions), the separability increases. For $\delta = 0$ (squares), one recovers the prediction of the null model (blue line). Error bars (not shown) are approximately the same size as the symbols.

Quantifying Linear Separability via Relative Entropy
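In order to make a step further in the characterization of the linear separability of (rescaled) MNIST, we can consider its subsets. While there is only one subset with $m = 2000$ points (focusing on the dichotomy "3"/"7"), and only one yes/no answer to the question of its linear separability, there are many subsets of size $p < m$, which can provide more detailed information. To quantify such information, let us formulate a more precise notion of surprise with respect to a model expressing prior expectation [39]. Let us again fix an empirical data set $Z_m = \{(\xi^\mu, \sigma^\mu)\}_{\mu=1,\dots,m}$ and fix $p \le m$. Now, consider the set $N_p$ of all subsets $\nu = \{\nu_1, \dots, \nu_p\}$ of $p$ indices $\nu_i \in \{1, \dots, m\}$, with $\nu_i \neq \nu_j$ for $i \neq j$. Additionally, consider the set $\Sigma_p = \{-1,+1\}^p$ of all dichotomies $\hat\sigma = \{\hat\sigma_1, \dots, \hat\sigma_p\}$ of $p$ elements. (I use curly braces for both sets and indexed families.) For each pair $\nu \in N_p$, $\hat\sigma \in \Sigma_p$, we can construct the corresponding synthetic data set

$$Z(\nu, \hat\sigma) = \{(\xi^{\nu_i}, \hat\sigma_i)\}_{i=1,\dots,p}; \tag{3}$$

similarly, for each $\nu \in N_p$, we can construct the corresponding subset $Z^{\mathrm{emp}}(\nu)$ of the empirical data set $Z_m$:

$$Z^{\mathrm{emp}}(\nu) = \{(\xi^{\nu_i}, \sigma^{\nu_i})\}_{i=1,\dots,p}. \tag{4}$$

The main tool for defining the surprise will be probability distributions on a space $\Omega_p$, which is defined as the union of all synthetic data sets:

$$\Omega_p = \bigcup_{\nu \in N_p} \bigcup_{\hat\sigma \in \Sigma_p} \{Z(\nu, \hat\sigma)\}. \tag{5}$$

The empirical space $\Omega_p^{\mathrm{emp}} \subseteq \Omega_p$ can be defined similarly:

$$\Omega_p^{\mathrm{emp}} = \bigcup_{\nu \in N_p} \{Z^{\mathrm{emp}}(\nu)\}. \tag{6}$$

Essentially, $|\Omega_p^{\mathrm{emp}}|$ is the number of subsets of size $p$ in the data set. Interpreted as a probability distribution on $\Omega_p$, the empirical data are uniformly distributed on $\Omega_p^{\mathrm{emp}}$; likewise, the null model defined above induces, by conditioning on the points $\{\xi^\mu\}$, the uniform distribution on the whole $\Omega_p$. In general, not every data set in $\Omega_p$ (nor in $\Omega_p^{\mathrm{emp}}$) is linearly separable. Let us define the subsets for which this property holds:

$$\bar\Omega_p = \{z \in \Omega_p : z \text{ is linearly separable}\}, \qquad \bar\Omega_p^{\mathrm{emp}} = \{z \in \Omega_p^{\mathrm{emp}} : z \text{ is linearly separable}\}. \tag{7}$$

Let us call $Q_p$ and $Q_p^{\mathrm{emp}}$ the uniform probability distributions on $\bar\Omega_p$ and $\bar\Omega_p^{\mathrm{emp}}$, respectively. The Kullback-Leibler (KL) divergence

$$D_{\mathrm{KL}}\big(Q_p^{\mathrm{emp}} \,\|\, Q_p\big) = \sum_{z \in \Omega_p} Q_p^{\mathrm{emp}}(z) \log \frac{Q_p^{\mathrm{emp}}(z)}{Q_p(z)} \tag{8}$$

then measures the surprise carried by the data with respect to the prior belief regarding its linear separability expressed by $Q_p$.

Because $Q_p$ and $Q_p^{\mathrm{emp}}$ are defined on sets ($\Omega_p$ and $\Omega_p^{\mathrm{emp}}$) of different cardinality, I define the (signed) surprise $S_p$ by subtracting the reference KL divergence between the uniform distributions on these spaces, here denoted $U_p$ and $U_p^{\mathrm{emp}}$:

$$S_p = D_{\mathrm{KL}}\big(Q_p^{\mathrm{emp}} \,\|\, Q_p\big) - D_{\mathrm{KL}}\big(U_p^{\mathrm{emp}} \,\|\, U_p\big). \tag{9}$$

Notice that the summand in the definition of KL divergence, Equation (8), is only nonzero for $z \in \bar\Omega_p^{\mathrm{emp}}$; one then obtains

$$S_p = \log \frac{|\bar\Omega_p|}{|\bar\Omega_p^{\mathrm{emp}}|} - \log \frac{|\Omega_p|}{|\Omega_p^{\mathrm{emp}}|} = \log \frac{c_{n,p}}{c_{n,p}^{\mathrm{emp}}}, \tag{10}$$

where I have defined the empirical separability $c_{n,p}^{\mathrm{emp}}$ as the fraction of linearly separable subsets of size $p$ in $Z_m$:

$$c_{n,p}^{\mathrm{emp}} = \frac{|\bar\Omega_p^{\mathrm{emp}}|}{|\Omega_p^{\mathrm{emp}}|}. \tag{11}$$

The signed surprise $S_p$ is positive (respectively negative) when the fraction of linearly separable subsets of size $p$ is smaller (respectively larger) than expected in the null model.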
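Since all the distributions involved are uniform, the signed surprise reduces to a log-ratio of separabilities, $S_p = \log(c_{n,p} / c^{\mathrm{emp}}_{n,p})$, which is positive exactly when the empirical fraction of separable subsets is smaller than the null-model one. The Python sketch below evaluates it, assuming natural logarithms and points in general position; the function names are illustrative.

```python
from math import comb, log


def log_null_separability(n, p):
    """Natural log of c_{n,p} = 2^(1-p) * sum_{k=0}^{n-1} C(p-1, k)."""
    total = sum(comb(p - 1, k) for k in range(min(n, p)))
    return log(total) - (p - 1) * log(2)


def signed_surprise(n, p, c_emp):
    """S_p = log(c_{n,p} / c_emp), with c_emp the empirical separability.

    c_emp is the measured fraction of linearly separable subsets of size p;
    it must be strictly positive for the logarithm to be finite.
    """
    if not 0.0 < c_emp <= 1.0:
        raise ValueError("c_emp must lie in (0, 1]")
    return log_null_separability(n, p) - log(c_emp)


# Usage: n = 100 dimensions, subsets of p = 400 points, all of them separable.
s_400 = signed_surprise(100, 400, 1.0)
```

When essentially all subsets are separable ($c^{\mathrm{emp}}_{n,p} \approx 1$), the surprise is dominated by the null-model term $\log c_{n,p}$.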

Separability in a Synthetic Data Set and in MNIST
The discussion above encourages the use of the empirical separability $c_{n,p}^{\mathrm{emp}}$ as a detailed description of the linear separability of a data set in an information theoretic framework. Despite being one of the simplest benchmark data sets used in machine learning, MNIST is already rather complex; its classes are known to have small intrinsic dimensions and varied geometries [15]. Therefore, before turning to MNIST, let us consider a simple controlled experiment, where the data are extracted from a simple one-parameter mixture distribution, defined as follows. Let $\sigma \in \{-1,+1\}$ be a Bernoulli random variable with parameter 1/2, which generates the labels. The data points $\xi \in \mathbb{R}^n$ are extracted from a multivariate normal distribution with $\sigma$-dependent mean. The joint probability distribution of each point-label pair is

$$P(\xi, \sigma) = \frac{1}{2}\, f_{\mathcal{N}(\mu_\sigma, I)}(\xi), \tag{12}$$

where $f_{\mathcal{N}(\mu, I)}$ is the probability density function of the multivariate normal distribution with mean $\mu$ and identity covariance matrix. The parameter $\delta$ measures the distance between the two means:

$$\delta = |\mu_{+1} - \mu_{-1}|.$$

Figure 2 shows the empirical separability $c_{n,p}^{\mathrm{emp}}$, as a function of the size $p$ of the subsets, for such a data set containing $m = 200$ data points in $n = 20$ dimensions. When $\delta = 0$, all of the data points are extracted from the same distribution, regardless of their labels: the data have no structure and the separability follows the null model, as in Equation (2). As $\delta$ increases, equally labelled points start to cluster, and the separability at any given $p > n$ increases, as expected from the qualitative discussion in Section 2. It is interesting to note that the width of the transition region ($\Delta p$ in Figure 1) is also an increasing function of $\delta$. This dependence was not expected a priori; in Section 7, I will show that the theory of structured data presented below allows for explaining this behavior.
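The synthetic experiment can be sketched in a few lines of Python. Several choices below are assumptions made for illustration only: the two means are placed at $\pm(\delta/2)$ along the first coordinate axis (the text only fixes their distance), separability of each subset is tested with a perceptron run for a fixed budget of epochs, and $c^{\mathrm{emp}}_{n,p}$ is estimated from randomly sampled subsets rather than from all of them. A subset on which the perceptron does not converge within the budget is counted as not separable, so the estimate is a lower bound.

```python
import random


def sample_mixture(m, n, delta, rng):
    """Draw m pairs (xi, sigma): sigma = +/-1 with probability 1/2 and
    xi ~ N(mu_sigma, I), with mu_sigma = sigma * (delta / 2) * e_1
    (assumed placement of the means; their distance is delta)."""
    data = []
    for _ in range(m):
        sigma = rng.choice((-1, 1))
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xi[0] += sigma * delta / 2.0
        data.append((xi, sigma))
    return data


def is_separable(pairs, max_epochs=200):
    """True if a perceptron finds a separating w within max_epochs sweeps."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(max_epochs):
        clean = True
        for xi, sigma in pairs:
            if sigma * sum(a * b for a, b in zip(w, xi)) <= 0:
                w = [a + sigma * b for a, b in zip(w, xi)]
                clean = False
        if clean:
            return True
    return False


def empirical_separability(data, p, trials, rng):
    """Fraction of `trials` random subsets of size p found to be separable."""
    hits = sum(is_separable(rng.sample(data, p)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for delta in (0.0, 3.0):
        data = sample_mixture(200, 20, delta, rng)
        print(delta, empirical_separability(data, 50, 10, rng))
```

Running the estimate over a range of $p$ at several values of $\delta$ yields curves that can be compared qualitatively with the right panel of Figure 2; a linear-programming feasibility test would remove the dependence on the epoch budget.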
Let us now compute $c_{n,p}^{\mathrm{emp}}$ for the rescaled MNIST data set. Figure 2 shows the results of three numerical experiments, as compared with the null model prediction, Equation (2), and prompts four observations. (i) MNIST data are significantly more separable than the null model. For instance, the signed surprise, with respect to the null model, of the empirical dichotomies separating the digits "3" and "7" takes the values $S_{400} \approx -55$, $S_{500} \approx -100$, $S_{600} \approx -150$. (ii) Even within the same data set, different classifications can have different probabilities of separation; the dichotomy separating the digits "4" and "9" in rescaled MNIST is closer to the null model than the dichotomy of "3" and "7" (e.g., $S_{400} \approx -48$). (iii) Destroying the structure by random reshuffling of the labels makes the separability collapse onto that of the null model; the surprise $S_p$ in this case is, at most, of order $10^{-1}$ for all $p$. (iv) Similarly to what happens in the more controlled experiment with the synthetic data above, the separability curve of the "3"/"7" dichotomy, which has its transition point at a larger value of $p$ than the "4"/"9" dichotomy, also has a wider transition region.
This analysis shows that, contrary to what appeared by looking solely at the whole data set, the dichotomies of rescaled MNIST are much more likely to be realized by a linear separator than random ones. In relation to the separability as a function of $p$, the null model has a single parameter, the dimension $n$. Is it possible to interpret the empirical curves as those of the null model with an effective dimension $n_{\mathrm{eff}}$? Increasing $n$ has the effect of increasing proportionally the value $p_c$, because the storage capacity is fixed to $\alpha_c = 2$. However, while fixing $n_{\mathrm{eff}} \approx 280$ indeed aligns the critical number of points $p_c$ with the empirical one, it yields a much smaller width of the transition region ($\Delta p \approx 80$ for the null model and $\Delta p \approx 300$ in the data). Furthermore, notice that the values of the surprise for the "3"-vs.-"7" and "4"-vs.-"9" experiments are not very different. The reason is the naivety of the null model, which hardly captures the properties of the empirical sets, and whose term $c_{n,p}$ therefore dominates in $S_p$. These observations, together with the motivations discussed above, call for the definition of a more nuanced and versatile model of the separability of structured data.

Parameterized Model of Structured Data
Fixing a model of data structure in this context means fixing a generative model of data. Here, I use the model first introduced in [28]. This should not be considered to be a realistic model of real data sets. It is useful as an effective or phenomenological parameterization of data structure. It has two main advantages: (i) it allows the analytical computation, within a mean field approximation, of the probability of separation c n,p ; and, (ii) it naturally points out the relevant geometric-probabilistic parameters that control the linear separability.
The model is expressed in the form of constraints between the points and the labels. The synthetic data set is constructed as a collection of $q$ "multiplets", i.e., subsets of $k$ points $\{\xi_\mu^1, \dots, \xi_\mu^k\}$ with prescribed geometric relations between them, and such that the labels are constant within each multiplet:

$$Z_q = \{(\xi_\mu^i, \sigma_\mu)\}_{\mu=1,\dots,q;\; i=1,\dots,k}. \tag{13}$$

The total number of point/label pairs is $p = qk$. Observe that, if one considers the set of all points $X = \{\xi_\mu^i\}$, not every dichotomy of $X$ is admitted by the parameterization of $Z_q$ in Equation (13). If a dichotomy assigns different labels to two elements of the same multiplet, it cannot be written in this form. The dichotomies that agree with the parameterization of Equation (13) are termed admissible.
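The notion of admissibility can be made concrete with a short Python sketch. It assumes that the $p = qk$ points are listed multiplet by multiplet, so that a dichotomy is a tuple of $qk$ labels in which consecutive blocks of length $k$ belong to the same multiplet; the function names are illustrative. Of the $2^{qk}$ dichotomies of the points, only $2^q$ are admissible.

```python
from itertools import product


def admissible_dichotomies(q, k):
    """Yield all label tuples of length q*k that are constant on each
    multiplet (points are assumed to be ordered multiplet by multiplet)."""
    for multiplet_labels in product((-1, 1), repeat=q):
        # Repeat the label of each multiplet for its k points.
        yield tuple(s for s in multiplet_labels for _ in range(k))


def is_admissible(labels, k):
    """True if the labels are constant within every block of k points."""
    return all(
        len(set(labels[start:start + k])) == 1
        for start in range(0, len(labels), k)
    )


# Usage: q = 3 doublets (k = 2) give 2^3 = 8 admissible dichotomies
# out of the 2^6 = 64 dichotomies of the six points.
admissible = list(admissible_dichotomies(3, 2))
```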
The relations between the points $\xi_\mu^i$ within each multiplet can be fixed, for instance, by prescribing that the $k(k-1)/2$ overlaps $\rho_{i,j} = \xi_\mu^i \cdot \xi_\mu^j$ be fixed and independent of $\mu$ (remember that $|\xi_\mu^i| = 1$). The statistical ensemble for $Z_q$, as specified by the probability density $dp(Z_q)$, is chosen in accordance with the maximum entropy principle: it is the uniform probability distribution on the points and the labels independently, given the constraints:

$$dp(Z_q) = \frac{1}{\mathcal{Z}_{n,q,\{\rho_{i,j}\}}} \prod_{\mu=1}^{q} \Big[ \prod_{i=1}^{k} d^n\xi_\mu^i \; \delta\big(|\xi_\mu^i|^2 - 1\big) \prod_{i<j} \delta\big(\xi_\mu^i \cdot \xi_\mu^j - \rho_{i,j}\big) \Big], \tag{14}$$

where $\mathcal{Z}_{n,q,\{\rho_{i,j}\}}$ is the partition function, fixed by the normalization condition

$$\sum_{\sigma_1,\dots,\sigma_q = \pm 1} \int dp(Z_q) = 1. \tag{15}$$

The null (unstructured) model of Section 3 is recovered in this parameterization in two different limits. First, if $k = 1$, each multiplet is composed of a single point, and no constraints are imposed other than the normalization. Second, for any $k$, if all overlaps are fixed to 1, then all points in each multiplet coincide, $\xi_\mu^1 = \xi_\mu^2 = \cdots = \xi_\mu^k$, and the model is equivalent to the null model with $p = q$.
The theory that will be described below depends on a natural set of parameters $\psi_m$, with $m = 2, \dots, k$. These quantities are conditional probabilities of geometric events that are related to single multiplets. They characterize the properties of the multiplets that are relevant for the linear separability of the whole set. Consider a multiplet $X = \{\xi^1, \dots, \xi^k\}$. $\psi_m$ is a measure of the likelihood that a subset $X' \subseteq X$ of $m \le k$ points is classified coherently by a random weight vector. More precisely, $\psi_m$ is the probability that the scalar product $w \cdot \xi$ has the same sign for all $\xi \in X'$, conditioned on the event that $w \cdot \xi$ has the same sign for all $\xi \in X' \setminus \{\bar\xi\}$. This probability is computed in the ensemble where the vector $w$ is uniformly distributed on the unit sphere $S^{n-1}$, $X'$ is uniformly distributed on the subsets of $X$ of $m$ points, and $\bar\xi$ is uniformly distributed on the elements of $X'$. This is coherent with the mean field nature of the combinatorial theory, which assumes uniformly distributed and uncorrelated quantities (see below).
In a few cases, $\psi_m$ can be computed explicitly. For instance, for a doublet $\{\xi, \bar\xi\}$ at fixed overlap $\rho = \xi \cdot \bar\xi$,

$$\psi_2(\rho) = \frac{2}{\pi} \arctan \sqrt{\frac{1+\rho}{1-\rho}}. \tag{16}$$

This is the probability that a random hyperplane does not intersect the segment that connects two points at overlap $\rho$. It is an increasing function of $\rho$, from $\psi_2(-1) = 0$ to $\psi_2(1) = 1$. If $k > 2$, then the quantity that enters the equations will be the mean of $\psi_2(\rho)$ over all the pairs in the multiplet. It can be shown that $\psi_m$, as a function of the overlaps $\rho_{i,j}$, does not explicitly depend on the dimensionality $n$ [28]; this property greatly simplifies the analytical computations. In summary, the parameters of the model are the following: the dimensionality $n$, the multiplicity $k$, and the $k - 1$ probabilities $\psi_m$. Actually, only two special combinations of the parameters $\psi_m$ emerge as relevant from the theory that is presented in the next sections: I will call them structure parameters. Other functions of the probabilities $\psi_m$ are relevant for other purposes, for instance, when considering the large-$p$ asymptotics of $c_{n,p}$, which relates to the generalization properties of the linear separator [32].
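The doublet probability can be checked numerically. The sketch below assumes the closed form $\psi_2(\rho) = \frac{2}{\pi}\arctan\sqrt{(1+\rho)/(1-\rho)}$, which equals $1 - \arccos(\rho)/\pi$, the standard probability that a uniformly random hyperplane through the origin leaves two unit vectors at overlap $\rho$ on the same side. It compares the formula with a Monte Carlo estimate in which $w$ is drawn with independent Gaussian components (its direction is then uniform on the sphere). The function names and the placement of the doublet in the first two coordinates are illustrative choices.

```python
import math
import random


def psi2(rho):
    """Closed form (2/pi) * arctan(sqrt((1 + rho) / (1 - rho)))."""
    if rho >= 1.0:
        return 1.0  # coincident points are never split
    return (2.0 / math.pi) * math.atan(math.sqrt((1.0 + rho) / (1.0 - rho)))


def psi2_monte_carlo(rho, n, samples, rng):
    """Estimate the probability that w . xi and w . xi_bar share a sign.

    The doublet is xi = e_1 and xi_bar = rho * e_1 + sqrt(1 - rho^2) * e_2,
    embedded in R^n with n >= 2 (an arbitrary but valid placement).
    """
    s = math.sqrt(1.0 - rho * rho)
    same = 0
    for _ in range(samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a = w[0]                    # w . xi
        b = rho * w[0] + s * w[1]   # w . xi_bar
        if a * b > 0:
            same += 1
    return same / samples


# Usage: orthogonal points are classified coherently half of the time.
half = psi2(0.0)
estimate = psi2_monte_carlo(0.5, 10, 20000, random.Random(1))
```

The estimate does not change with $n$ beyond statistical noise, in line with the statement that $\psi_m$ depends on the overlaps but not explicitly on the dimensionality.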

Combinatorial Computation of the Separability for Structured Data
Cover popularized a powerful combinatorial technique to compute the number of linearly realizable dichotomies in an old and highly cited paper [38]. Despite its appeal, the combinatorial approach (while certainly not extraneous to contemporary statistical physics, both theoretical and applied [40][41][42][43]) remained somewhat confined to very few papers in discrete mathematics, and it was only very recently extended to more modern questions, when it was used to obtain an equation for C n,q , the number of admissible dichotomies of q multiplets, for structured data of the type that is defined in the previous section. Ref. [28] first presented the arguments and computations leading to this equation. To make this article as self-contained as possible, I repeat most of the derivation here.

Exact Approach for Unstructured Data (k = 1 Points per Multiplet)
First, I recall the classic computation for unstructured data ($k = 1$ in our notation). The idea is to write a recurrence relation for the number of linearly realizable dichotomies $C_{n,p}$ and, consequently, for the probability $c_{n,p}$, by considering the addition of the $(p+1)$th element $\xi_{p+1}$ to the set $X_p = \{\xi_1, \dots, \xi_p\}$ composed of the first $p$ elements.
Consider one of the dichotomies of $X_p$, let us call it $\varphi_p$; how many linearly realizable dichotomies of $X_{p+1} = \{\xi_1, \dots, \xi_p, \xi_{p+1}\}$ agree with $\varphi_p$ (i.e., take the same values) on the points of $X_p$? When the point $\xi_{p+1}$ is added to the set, two different things can happen: (i) $\operatorname{sgn}(w \cdot \xi_{p+1})$ is the same for all possible weight vectors $w$ that realize $\varphi_p$; and, (ii) there is at least one weight vector $\hat w$ realizing $\varphi_p$, such that $\hat w \cdot \xi_{p+1} = 0$. These two cases lead to different contributions to $C_{n,p+1}$. In the first case, there is only one dichotomy of $X_{p+1}$ agreeing with $\varphi_p$, as the value that is assigned to $\xi_{p+1}$ is fixed. In the second case, the value that is assigned to $\xi_{p+1}$ can be either $+1$ or $-1$; therefore, the number of dichotomies of $X_{p+1}$ agreeing with $\varphi_p$ is 2.
Let $M_{n,p}$ be the number of dichotomies, among the $C_{n,p}$ dichotomies of $X_p$, for which (ii) holds for the new point; the number of those satisfying (i) is then $C_{n,p} - M_{n,p}$. The reasoning above leads to $C_{n,p+1} = (C_{n,p} - M_{n,p}) + 2M_{n,p} = C_{n,p} + M_{n,p}$. Here lies the keystone that allows for the closure of the recurrence equation: $M_{n,p}$ is the number of dichotomies conditioned to satisfy a linear constraint; therefore, it is equal to the number of dichotomies of the same number of points $p$ in $n-1$ dimensions: $M_{n,p} = C_{n-1,p}$. Finally, the recurrence relation is $C_{n,p+1} = C_{n,p} + C_{n-1,p}$, which, since $c_{n,p} = C_{n,p}/2^p$, translates into the following equation for the probability $c_{n,p}$:
$$c_{n,p+1} = \frac{1}{2}\left(c_{n,p} + c_{n-1,p}\right). \tag{19}$$
The boundary conditions (20) of the recurrence (19) come from the conditions $C_{1,p>0} = 2$ (there are only two normalized weight vectors in one dimension) and $C_{n>0,1} = 2$ (there is always a weight vector $w$ such that $\pm w \cdot \xi = \pm 1$). In terms of probabilities, they amount to $c_{n>0,1} = 1$ and $c_{1,p>0} = 2^{1-p}$, the latter being equivalent to setting $c_{0,p} = 0$ in Equation (19). The solution of Equation (19) is Equation (2), as can be checked directly. However, the more complicated equations satisfied by the probabilities for structured data are not as easily solvable. For this reason, in Section 7 below, I will show a method to compute useful quantities related to the shape of $c_{n,p}$ directly from the recurrence relations, with no need for a closed solution.
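The recurrence for unstructured data can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names are hypothetical, and the convention $C_{0,p} = 0$ is an assumption chosen so that the recurrence reproduces $C_{1,p} = 2$. It compares the recurrence with Cover's closed form $C_{n,p} = 2\sum_{i=0}^{n-1}\binom{p-1}{i}$.

```python
from math import comb

def cover_counts(n_max, p_max):
    """Table C[n][p] of linearly realizable dichotomies of p points in
    general position in n dimensions, built from C[n][p+1] = C[n][p] + C[n-1][p]."""
    C = [[0] * (p_max + 1) for _ in range(n_max + 1)]  # row n = 0 stays 0 (assumed convention)
    for n in range(1, n_max + 1):
        C[n][1] = 2  # a single point can be labelled +1 or -1
    for p in range(1, p_max):
        for n in range(1, n_max + 1):
            C[n][p + 1] = C[n][p] + C[n - 1][p]
    return C

def cover_closed_form(n, p):
    """Cover's closed-form count, 2 * sum_{i<n} binom(p-1, i)."""
    return 2 * sum(comb(p - 1, i) for i in range(n))

C = cover_counts(6, 12)
# Fraction of realizable dichotomies at p = 2n (here n = 3, p = 6).
fraction_at_capacity = C[3][6] / 2**6
```

At $p = 2n$ exactly half of the dichotomies are realizable, which is the classic statement of the capacity $\alpha_c = 2$.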

Mean-Field Approach for Pairs of Points (k = 2 Points per Multiplet)
The simplest non-trivial extension of Cover's computation to structured data is $k = 2$. From here on, I will use $\hat{c}_{n,q}$ and $\hat{C}_{n,q}$ to denote the fraction and number of linearly realizable admissible dichotomies of $q$ multiplets, because the symbols $c_{n,p}$ and $C_{n,p}$ are reserved for the fraction and number of linearly realizable dichotomies of $p$ points.
Notice that all the quantities appearing above are written with no explicit dependence on the points $\xi$. This is because the unstructured case enjoys a strong universality property (as proved in [38]): $C_{n,p}$ is independent of the points of $X_p$, as long as they are in general position. Such generality breaks down for structured data. In this case, the recurrence equations that will be obtained are not valid for all sets $X_p$; rather, they are satisfied by the ensemble averages of $\hat{C}_{n,q}$ and $\hat{c}_{n,q}$, in the spirit of the mean-field approximation of statistical physics.
The set of points is now $X_q \cup \bar{X}_q$, where $X_q$ is a set of $q$ points $\{\xi_1, \ldots, \xi_q\}$ and $\bar{X}_q$ is a set of partners $\{\bar{\xi}_1, \ldots, \bar{\xi}_q\}$, with $\xi_\mu \cdot \bar{\xi}_\mu = \rho$ for all $\mu = 1, \ldots, q$ (remember that all of the points are on the unit sphere). Consider the addition of the points $\xi_{q+1}$ and $\bar{\xi}_{q+1}$ to $X_q$ and $\bar{X}_q$, respectively. By repeating the reasoning described above for $k = 1$ with respect to the point $\bar{\xi}_{q+1}$, one finds a formula for the number $Q_{n,q}$ of dichotomies of the set $\{\xi_1, \bar{\xi}_1, \ldots, \xi_q, \bar{\xi}_q, \bar{\xi}_{q+1}\}$ that are admissible on the first $q$ pairs (and are unconstrained on $\bar{\xi}_{q+1}$): $Q_{n,q} = \hat{C}_{n,q} + \hat{C}_{n-1,q}$. These dichotomies can be separated into two classes, similarly to the two cases (i) and (ii) above: those that can be realized by a weight vector orthogonal to $\xi_{q+1}$ (let us denote their number by $R_{n,q}$) and those that cannot (their number is then $Q_{n,q} - R_{n,q}$). For each dichotomy $\phi$ of the first class, there exists one and only one admissible dichotomy of the full set $X_{q+1} \cup \bar{X}_{q+1}$ that agrees with $\phi$ and can be realized linearly. In fact, thanks to the orthogonality constraint, there is always, among the weight vectors realizing $\phi$, one vector $w$ such that
$$\mathrm{sgn}(w \cdot \xi_{q+1}) = \phi(\bar{\xi}_{q+1}), \tag{21}$$
thus satisfying the admissibility condition on the pair $\{\xi_{q+1}, \bar{\xi}_{q+1}\}$. The remaining $Q_{n,q} - R_{n,q}$ dichotomies do not allow this freedom. The fraction of them that are realized by weight vectors $w$ satisfying the admissibility condition (21) can be estimated, at the mean-field level, by the probability that, given a random weight vector $w$ chosen uniformly on the unit sphere, the scalar products $w \cdot \xi_{q+1}$ and $w \cdot \bar{\xi}_{q+1}$ have the same sign. This probability does not depend on the actual points, but only on their overlap $\rho$, and it is exactly the quantity $\psi_2(\rho)$ defined in the previous section, Equation (16). I will denote it by $\psi_2$ in the following, with the dependence on $\rho$ being understood.
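The geometric meaning of $\psi_2(\rho)$ can be illustrated with a short Monte Carlo experiment. This is only a sketch: the function names, the choice $n = 3$, and the sample size are assumptions made for illustration, and the closed form $1 - \arccos(\rho)/\pi$ is the standard expression for the probability that a random hyperplane leaves two unit vectors with overlap $\rho$ on the same side, which I assume coincides with Equation (16).

```python
import numpy as np

def psi2_exact(rho):
    """Probability that w.xi and w.xi_bar share a sign for isotropic random w."""
    return 1.0 - np.arccos(rho) / np.pi

def psi2_monte_carlo(rho, n=3, samples=200_000, seed=0):
    """Estimate the same probability by sampling random weight vectors."""
    rng = np.random.default_rng(seed)
    xi = np.zeros(n)
    xi[0] = 1.0
    xi_bar = np.zeros(n)
    xi_bar[0] = rho
    xi_bar[1] = np.sqrt(1.0 - rho**2)  # unit vector with overlap rho with xi
    # Gaussian vectors are isotropic; their norm does not affect the signs.
    w = rng.standard_normal((samples, n))
    same_side = np.sign(w @ xi) == np.sign(w @ xi_bar)
    return float(same_side.mean())

estimate = psi2_monte_carlo(0.5)
exact = psi2_exact(0.5)
```

For orthogonal partners ($\rho = 0$) the probability is $1/2$, and for coincident partners ($\rho = 1$) it is 1, in which case the pair behaves as a single point.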
The foregoing argument leads to the following equation:
$$\hat{C}_{n,q+1} = R_{n,q} + \psi_2 \left(Q_{n,q} - R_{n,q}\right). \tag{22}$$
As in the unstructured case, the unknown term $R_{n,q}$ can be expressed in terms of the variables $\hat{C}_{\bullet,q}$ by considering the same problem in a lower dimension.
In fact, remember that $Q_{n,q}$ above was computed by applying Cover's argument for $k = 1$, because it counts how the number of dichotomies is affected when the single point $\bar{\xi}_{q+1}$ is added to the set. $R_{n,q}$ must be computed in the same way, since it, again, counts the number of dichotomies that are admissible on the first $q$ pairs and free on $\bar{\xi}_{q+1}$. However, these dichotomies must satisfy the additional linear constraint $w \cdot \xi_{q+1} = 0$; therefore, the whole argument must be applied in $n - 1$ dimensions. This leads to $R_{n,q} = \hat{C}_{n-1,q} + \hat{C}_{n-2,q}$.
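Finally, substituting this expression of $R_{n,q}$, together with $Q_{n,q} = \hat{C}_{n,q} + \hat{C}_{n-1,q}$, into Equation (22) yields
$$\hat{C}_{n,q+1} = \psi_2\, \hat{C}_{n,q} + \hat{C}_{n-1,q} + (1 - \psi_2)\, \hat{C}_{n-2,q}.$$
As above, since there are $2^q$ admissible dichotomies of $q$ pairs, this translates to a similar equation for the probability $\hat{c}_{n,q}$:
$$\hat{c}_{n,q+1} = \frac{1}{2}\left[\psi_2\, \hat{c}_{n,q} + \hat{c}_{n-1,q} + (1 - \psi_2)\, \hat{c}_{n-2,q}\right]. \tag{25}$$
The boundary conditions of this recurrence are slightly different from those for $k = 1$. They are discussed in Appendix A, together with those for the general case.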
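The $k = 2$ recurrence is easy to iterate numerically. The sketch below assumes the form $\hat{c}_{n,q+1} = \frac{1}{2}[\psi_2 \hat{c}_{n,q} + \hat{c}_{n-1,q} + (1-\psi_2)\hat{c}_{n-2,q}]$ that follows from the expressions of $Q_{n,q}$ and $R_{n,q}$ above, and it uses the simplified boundary conditions $\hat{c}_{n>0,1} = 1$, $\hat{c}_{n \le 0,q} = 0$, which Appendix A describes as approximate for $k > 1$. The function name is hypothetical.

```python
from math import comb

def c_hat_pairs(n_max, q_max, psi2):
    """Mean-field fraction c[n][q] of realizable admissible dichotomies
    of q pairs in n dimensions, for 0 <= n <= n_max and 1 <= q <= q_max."""
    c = [[0.0] * (q_max + 1) for _ in range(n_max + 1)]
    for n in range(1, n_max + 1):
        c[n][1] = 1.0  # approximate boundary condition (assumption)

    def get(n, q):
        # Terms with negative dimension do not contribute.
        return c[n][q] if n >= 0 else 0.0

    for q in range(1, q_max):
        for n in range(1, n_max + 1):
            c[n][q + 1] = 0.5 * (psi2 * get(n, q)
                                 + get(n - 1, q)
                                 + (1.0 - psi2) * get(n - 2, q))
    return c

# With psi2 = 1 the partners coincide and each pair acts as a single point,
# so the table must reduce to Cover's result in the variable q.
collapsed = c_hat_pairs(5, 10, 1.0)
generic = c_hat_pairs(5, 10, 0.6)
```

The check at $\psi_2 = 1$ is a useful consistency test: the recurrence collapses to the unstructured one, with $q$ playing the role of $p$.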

General Case Parameterized by k
It is possible to extend the method described above to all $k$. I will only sketch the derivation; the details can be found in [28]. Just as the case $k = 2$ can be treated by making use of the recurrence formula for $k = 1$, the idea here is to construct the case $k$ recursively by using the formula (yet to be found) for $k - 1$, thereby obtaining a recurrence relation in $k$ as well as in $n$ and $q$. To this aim, the $(q+1)$th multiplet $\{\xi^1_{q+1}, \ldots, \xi^k_{q+1}\}$ is split into the two subsets $\{\xi^1_{q+1}\}$ and $\bar{\xi}_{q+1} = \{\xi^2_{q+1}, \ldots, \xi^k_{q+1}\}$. The formula for $k - 1$ allows for applying the argument to the set $\bar{\xi}_{q+1}$, thus obtaining the number $Q_{n,q}$ of dichotomies of the set $X_{q+1} \setminus \{\xi^1_{q+1}\}$ that are admissible on the first $q$ complete multiplets and on the $(q+1)$th incomplete multiplet $\bar{\xi}_{q+1}$. More formally, $Q_{n,q}$ is the number of linearly realizable dichotomies $\phi$ that are constant on each of the first $q$ multiplets and on $\bar{\xi}_{q+1}$. Now the argument goes exactly as for the case $k = 2$: some of these $Q_{n,q}$ dichotomies (their number being $R_{n,q}$) can be realized by a weight vector orthogonal to the point $\xi^1_{q+1}$; therefore, each of them contributes a single admissible dichotomy of the whole set $X_{q+1}$; the remaining $Q_{n,q} - R_{n,q}$ contribute with probability $\psi_k$. Again, $R_{n,q}$ can be expressed by applying the same argument in $n - 1$ dimensions.
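Finally, one finds that the probability $\hat{c}_{n,q}$ satisfies a recurrence equation in $n$ and $q$:
$$\hat{c}_{n,q+1} = \sum_{l=0}^{k} \theta^k_l\, \hat{c}_{n-l,q}, \tag{27}$$
where the coefficients $\theta^k_l$ are constants (independent of $n$ and $q$) satisfying a recurrence equation in $k$ and $l$:
$$\theta^k_l = \psi_k\, \theta^{k-1}_l + (1 - \psi_k)\, \theta^{k-1}_{l-1}. \tag{28}$$
The boundary conditions for Equation (28) are
$$\theta^k_{l<0} = 0, \qquad \theta^1_0 = \theta^1_1 = \frac{1}{2}, \qquad \theta^1_{l>1} = 0; \tag{29}$$
the conditions at $k = 1$ are those that reproduce Equation (19).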
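The coefficients of the general recurrence can be generated with a few lines of code. This is a sketch under a stated assumption: that the recurrence in $k$ and $l$ has the form $\theta^k_l = \psi_k \theta^{k-1}_l + (1-\psi_k)\theta^{k-1}_{l-1}$ with $\theta^1_0 = \theta^1_1 = 1/2$, which is the form obtained by repeating the $k = 2$ argument above. The function name and the sample values of $\psi_m$ are hypothetical.

```python
def theta_coefficients(psis):
    """Return [theta^k_0, ..., theta^k_k] given psis = [psi_2, ..., psi_k]."""
    theta = [0.5, 0.5]  # k = 1: reproduces the unstructured recurrence
    for psi in psis:
        same = theta + [0.0]       # theta^{k-1}_l, padded on the right
        shifted = [0.0] + theta    # theta^{k-1}_{l-1}
        theta = [psi * a + (1.0 - psi) * b for a, b in zip(same, shifted)]
    return theta

theta_k2 = theta_coefficients([0.7])        # k = 2 with psi_2 = 0.7
theta_k3 = theta_coefficients([0.7, 0.4])   # k = 3 with psi_2 = 0.7, psi_3 = 0.4
mean_k3 = sum(l * t for l, t in enumerate(theta_k3))
```

For $k = 2$ the output is $(\psi_2/2,\ 1/2,\ (1-\psi_2)/2)$, matching the coefficients of the pair recurrence, and the coefficients sum to one for every $k$.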

Computation of Compact Metrics of Linear Separability
The model of data structure leading to the foregoing equations is very detailed, in that it allows for the independent specification of a large number of parameters. However, the influence of each parameter on the separability $\hat{c}_{n,q}$ is not equal, with some combinations of parameters being more relevant than others. In this section, I compute two main descriptors of the shape of $\hat{c}_{n,q}$ as a function of $q$ at fixed $n$: the transition point $p_c$ (equivalently, the capacity $\alpha_c$) and the width $\Delta p$ of the transition region; they are defined more precisely below. We will see that only the structure parameters $\Psi_1$ and $\Psi_2$, the special combinations defined in Section 5, are needed to fix $p_c$ and $\Delta p$.

Diagonalization of the Recurrence Relation
Notice that, while the quantity $\hat{c}_{n,q}$ given by the theory is expressed as a function of the number of multiplets $q$, the definition of separability discussed in Section 5 is given in terms of the number of points $p = kq$. This is not really a problem in the thermodynamic limit, whereby the separability is expressed as a function of the load $\alpha$. In the following, I will define the location $q_c$ and the width $\Delta q$ of the transition region in the parameterization by the number of multiplets $q$; the corresponding quantities parameterized by $p$ are obtained by rescaling:
$$p_c = k\, q_c, \qquad \Delta p = k\, \Delta q. \tag{31}$$
Let us consider the discrete derivative of $\hat{c}_{n,q}$ with respect to $n$:
$$\gamma_{n,q} = \Delta_n \hat{c}_{n,q} \equiv \hat{c}_{n+1,q} - \hat{c}_{n,q}.$$
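As will be clear momentarily, working with $\gamma_{n,q}$ is convenient because it is normalized, as I will prove below. Since the recurrence for $\hat{c}_{n,q}$ is linear, $\gamma_{n,q}$ satisfies the same recurrence relation:
$$\gamma_{n,q+1} = \sum_{l=0}^{k} \theta^k_l\, \gamma_{n-l,q}. \tag{33}$$
The boundary conditions, in accordance with (20), are
$$\gamma_{n,1} = \delta_{n,0}, \qquad \gamma_{n<0,q} = 0. \tag{34}$$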
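The right hand side of Equation (33) has the form of a discrete convolution between $\theta^k_\bullet$ and $\gamma_{\bullet,q}$:
$$\gamma_{\bullet,q+1} = \theta^k_\bullet * \gamma_{\bullet,q}. \tag{35}$$
The convolution is diagonalized in Fourier space, by defining the characteristic functions
$$\tilde{\gamma}_q(t) = \sum_{n} \gamma_{n,q}\, e^{int}, \qquad \tilde{\theta}_k(t) = \sum_{l} \theta^k_l\, e^{ilt}.$$
Multiplying both sides of Equation (35) by $e^{int}$ and summing over $n$ yields
$$\tilde{\gamma}_{q+1}(t) = \tilde{\theta}_k(t)\, \tilde{\gamma}_q(t).$$
From the definition (36) and the boundary conditions (34), one gets $\tilde{\gamma}_1(t) = 1$; hence, the solution of the recurrence equation is
$$\tilde{\gamma}_q(t) = \left[\tilde{\theta}_k(t)\right]^{q-1}. \tag{39}$$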
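The diagonalization can be verified numerically: iterating the convolution $q - 1$ times must give the same distribution as raising the characteristic function to the power $q - 1$ and transforming back. The sketch below uses NumPy's discrete Fourier transform as a stand-in for the characteristic function (the sign convention of the exponent differs, which does not affect the result). The function names and the sample coefficients are assumptions made for illustration.

```python
import numpy as np

def gamma_by_recurrence(theta, q):
    """Iterate gamma_{.,q+1} = theta * gamma_{.,q}, starting from gamma_{n,1} = delta_{n,0}."""
    gamma = np.array([1.0])
    for _ in range(q - 1):
        gamma = np.convolve(theta, gamma)
    return gamma

def gamma_by_fourier(theta, q):
    """Same distribution from the (q-1)th power of the transformed coefficients."""
    size = (len(theta) - 1) * (q - 1) + 1   # support of the (q-1)-fold convolution
    theta_tilde = np.fft.fft(theta, n=size)  # zero-padded, so no wrap-around occurs
    return np.fft.ifft(theta_tilde ** (q - 1)).real

theta = np.array([0.35, 0.5, 0.15])  # k = 2 coefficients with psi_2 = 0.7 (example values)
g_rec = gamma_by_recurrence(theta, 6)
g_fft = gamma_by_fourier(theta, 6)
mean_q = float((np.arange(len(g_rec)) * g_rec).sum())
```

With these example coefficients the mean of $\theta$ is 0.8, so the mean of $\gamma_{\bullet,6}$ is $5 \times 0.8 = 4$, in line with the linear growth of the mean with $q - 1$.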

Defining the Location and Width of the Transition Region
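As mentioned above, $\gamma_{n,q}$ is normalized, which means that
$$\sum_{n} \gamma_{n,q} = 1$$
or, equivalently, $\tilde{\gamma}_q(0) = 1$. To prove this, it suffices to show that $\tilde{\theta}_k(0) = 1$, i.e., that $\theta^k_\bullet$ is normalized. Summing both sides of Equation (28) over $l$ from 0 to $\infty$ shows that $\tilde{\theta}_k(0)$ is constant in $k$; therefore,
$$\tilde{\theta}_k(0) = \tilde{\theta}_1(0) = 1,$$
as can be computed from the boundary conditions (29). Because it is normalized, $\gamma_{\bullet,q}$ can be interpreted as a probability distribution, whose cumulative distribution function is $\hat{c}_{\bullet,q}$. The $a$th moment of the distribution is
$$\langle n^a \rangle_q = \sum_{n} n^a\, \gamma_{n,q} = \left.\left(-i \frac{d}{dt}\right)^{a} \tilde{\gamma}_q(t)\right|_{t=0}. \tag{42}$$
The same holds for $\theta^k_\bullet$, whose moments $\langle \theta^a \rangle_k$ can be obtained from its characteristic function $\tilde{\theta}_k(t)$. Let us focus on the mean $\mu_q$ and the variance $\sigma_q^2$,
$$\mu_q = \langle n \rangle_q, \qquad \sigma_q^2 = \langle n^2 \rangle_q - \langle n \rangle_q^2.$$
Equation (39) allows for expressing these quantities in terms of the mean $\mu_\theta = \langle \theta \rangle_k$ and variance $\sigma_\theta^2 = \langle \theta^2 \rangle_k - \langle \theta \rangle_k^2$ of $\theta^k_\bullet$:
$$\mu_q = (q - 1)\, \mu_\theta, \qquad \sigma_q^2 = (q - 1)\, \sigma_\theta^2,$$
as can be checked by using Equation (42). We can now define the two main descriptors, $q_c$ and $\Delta q$, which summarize the separability as a function of $q$, through Equations (45)–(47).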

Expression in Terms of the Structure Parameters
To compute these quantities, all we need is $\mu_\theta$ and $\sigma_\theta$, or equivalently $\langle \theta \rangle_k$ and $\langle \theta^2 \rangle_k$. Solving Equation (45) for $q_c$ gives $q_c = n \mu_\theta^{-1} + 1$.
Solving Equations (46) and (47) for ∆q gives The corresponding expressions to leading order in n are the following The moments of θ k • satisfy the following equation, which can be obtained by multiplying both sides of Equation (28) by l a and summing over l: The boundary conditions are θ 0 k = 1 (computed above) and θ a 1 = 1/2, as given by Equation (29). In particular, for a = 1, we obtain whose solution is where the structure parameter Ψ 1 , as defined in Equation (17), implicitly depends on k. For a = 2, the recurrence Equation (51) becomes By substituting θ k−1 given by Equation (53) and solving the recurrence we obtain, after some algebra, where Ψ 2 is the second structure parameter that is defined in Equation (18). Finally, by combining the leading order expansions (50) and the moments (53) and (55), and by rescaling, as in Equation (31), we have the following explicit expressions for the two main metrics of separability as functions of the multiplicity k and the structure parameters Ψ 1 and Ψ 2 : For data that are structured as pairs of points, k = 2, Equation (56) gives the storage capacity of an ensemble of segments; this special result was first obtained, by means of replica calculations, in [44], and it was then rediscovered in other contexts in [8,45].
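The first two moments of $\theta^k_\bullet$ can also be read off from a probabilistic interpretation, sketched below as a worked equation. Three assumptions are made and should be checked against Equations (17), (18), and (28): that Equation (28) has the form $\theta^k_l = \psi_k \theta^{k-1}_l + (1-\psi_k)\theta^{k-1}_{l-1}$ with $\theta^1_0 = \theta^1_1 = 1/2$, that $\Psi_1 = \sum_{m=2}^{k} \psi_m$, and that $\Psi_2 = \sum_{2 \le m < l \le k} \psi_m \psi_l$. Under the first assumption, $\theta^k_\bullet$ is the law of a sum of $k$ independent Bernoulli variables, one with parameter $1/2$ and the others with parameters $1 - \psi_m$.

```latex
% Sketch under the assumptions stated in the lead-in.
\begin{align}
\mu_\theta &= \frac{1}{2} + \sum_{m=2}^{k} (1 - \psi_m)
            = k - \frac{1}{2} - \Psi_1, \\
\sigma_\theta^2 &= \frac{1}{4} + \sum_{m=2}^{k} \psi_m (1 - \psi_m)
            = \frac{1}{4} + \Psi_1 - \Psi_1^2 + 2 \Psi_2, \\
\frac{p_c}{n} &\simeq \frac{k}{\mu_\theta}
            = \frac{2k}{2k - 1 - 2\Psi_1}
            \qquad \text{(leading order in } n\text{)}.
\end{align}
```

Two sanity checks support this reading: for $k = 1$ the last line gives $p_c/n = 2$, Cover's capacity, and for $k = 2$ it gives $4/(3 - 2\psi_2)$, which interpolates between 2 for orthogonal partners and 4 for coincident ones.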

Dependence on the Structure Parameters and Scaling
The two structure parameters Ψ 1 and Ψ 2 , which control the two main metrics of linear separability, belong to k-dependent ranges: The two quantities are not independent, since they are constructed from the same set of k − 1 quantities ψ m ∈ [0, 1]. When conditioned on a fixed value of Ψ 1 , Ψ 2 has a lower bound Ψ − 2 and an upper bound Ψ + 2 that can be computed by considering the two following extreme cases. First, the supremum of Ψ 2 is realized in the maximum entropy case, where the value of Ψ 1 is uniformly distributed among the ψ m . Second, the infimum of Ψ 2 corresponds to the minimum entropy case, where Ψ 1 is distributed on the fewest possible ψ m 's. Explicitly, The definition of Ψ 2 , Equation (18), can be rewritten, as follows: Substituting (59) and (60) into (61), we obtain Figure 3 shows the location of the transition, p c , and the width of the region, ∆p, as functions of Ψ 1 and Ψ 2 for a few values of k. Notice that the range of ∆p at fixed k and Ψ 1 is itself bounded because of the limited range [Ψ − 2 , Ψ + 2 ] of Ψ 2 . There is an interesting observation to be made on a semi-quantitative level. At fixed k and n, p c is an increasing function of Ψ 1 . The width ∆p depends on both structure parameters, but, since the range of Ψ 2 at fixed Ψ 1 is so limited, one expects that, in practice, ∆p will be approximately an increasing function of Ψ 1 . Therefore, ∆p will be, in most cases, an increasing function of p c . This is exactly the phenomenology that is observed in Figure 2, in both the synthetic data and MNIST. The rescaled location of the transition p c /n, Equation (56), does not depend on Ψ 2 , and it depends on Ψ 1 only through the rescaled value Ψ 1 /k. For large k, it takes the scaling form The width ∆p, on the contrary, depends on both Ψ 1 and Ψ 2 . Because it is a monotonically increasing function of Ψ 2 , its upper bound ∆p + and lower bound ∆p − at fixed Ψ 1 can be obtained by substituting (62) and (63) in Equation (57). 
Expressing ∆p + again as a function of the rescaled parameter Ψ 1 /k, and only keeping the leading term in k → ∞, one obtains the scaling form Doing the same for ∆p − yields a complicated function, which is plotted in Figure 3. A simpler expression for the bound can be obtained by observing that Ψ − 2 ≥ (Ψ 2 1 − Ψ 1 )/2; using this more regular bound yields, at leading order in k, (66) Figure 3 shows the large-k scaling behavior of p c , ∆p + , and ∆p − . The two metrics are insensitive on most of the microscopic parameters of the theory, and they only depend on the two structure parameters, as shown analytically above. In addition, they display a large degree of robustness, even as functions of Ψ 1 and Ψ 2 : measuring p c /n from the data fixes (up to corrections in k) the quantity Ψ 1 /k, which, in turn, significantly narrows down the range of values that are attainable by ∆p, the more so the smaller is k.
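The bounds on $\Psi_2$ at fixed $\Psi_1$ can be explored numerically. The sketch below assumes, as an illustration rather than a restatement of Equations (17) and (18), that $\Psi_1 = \sum_m \psi_m$ and $\Psi_2 = \sum_{m<l} \psi_m \psi_l = (\Psi_1^2 - \sum_m \psi_m^2)/2$ over the $k - 1$ quantities $\psi_m \in [0,1]$. The function names are hypothetical. The upper bound corresponds to all $\psi_m$ equal (maximum entropy) and the lower bound to as many $\psi_m = 1$ as possible (minimum entropy).

```python
import numpy as np

def structure_parameters(psis):
    """Return (Psi1, Psi2) for psis = [psi_2, ..., psi_k] (assumed definitions)."""
    psis = np.asarray(psis, dtype=float)
    Psi1 = psis.sum()
    Psi2 = 0.5 * (Psi1**2 - (psis**2).sum())
    return float(Psi1), float(Psi2)

def psi2_bounds(Psi1, k):
    """Lower and upper bounds of Psi2 at fixed Psi1, for k - 1 values in [0, 1]."""
    m = k - 1
    upper = 0.5 * Psi1**2 * (m - 1) / m           # all psi_m equal to Psi1 / m
    whole = int(np.floor(Psi1))
    frac = Psi1 - whole
    lower = 0.5 * (Psi1**2 - whole - frac**2)     # 'whole' ones plus one fractional value
    return lower, upper

rng = np.random.default_rng(1)
k = 5
samples = [structure_parameters(rng.uniform(size=k - 1)) for _ in range(1000)]
```

Every sampled pair $(\Psi_1, \Psi_2)$ falls between the two bounds, and the lower bound is never below the smoother expression $(\Psi_1^2 - \Psi_1)/2$ quoted in the text.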

Discussion
The discussion above focused on the quantification of linear separability within a model that encodes simple relations between data points and their labels, in the form of constraints. Such a model has the advantage of being analytically tractable and allows the explicit expression of p c and ∆p in terms of model parameters. Moreover, the parameters appearing in the theory have direct interpretations as probabilities of geometric events, thus suggesting routes for further generalization.
In the face of its convenience for theoretical investigations, the definition of data structure used here does not aim at a realistic description of any specific data set. It must be interpreted as a phenomenological or effective parameterization of basic features of data structure that have a distinct effect on linear separability. The limited numerical experiments on MNIST data reported above are a proof of concept, showing a real data set with unexpectedly high linear separability, and they serve as a notable motivation for the investigation of data structure. The main goal of this article is the theoretical analysis; therefore, I postpone any comparison of theory and data. Moreover, MNIST is a relatively simple and clean data set. The numerical analysis signals the highly constrained nature of these data, where points that are close with respect to the Euclidean distance in R n are more likely to have the same label. However, more complex data sets, such as ImageNET, are expected to be less constrained at the level of raw data, due to the higher variability within each category, and due to what are referred to as "nuisances", i.e., elements that are present, but do not contribute to the classification. Yet, even in these cases, the aggregation of equally-labelled points emerges in the feature spaces towards the last layers of deep neural networks, which improves the efficacy of the linear readout downstream, as empirically observed [14,15].
An interesting, and perhaps unexpected, outcome of the theory concerns the universal properties of the probability of separation $c_{n,p}$. Here, I use the term "universality" in a much weaker sense than what is usually intended in statistical mechanics: I use it to denote (i) the qualitative robustness of the sigmoidal shape of the separability curve with respect to the details of the model, and (ii) the quantitative insensitivity of the separability metrics to all but a few special combinations of parameters [46]. Importantly, the two metrics of data structure that are computed for the model, $p_c$ and $\Delta p$, are the only two important parameters that fix $c_{n,p}$ in the thermodynamic limit, apart from the rescaling by $k$. The central limit theorem suggests this universality property. In fact, $\gamma_{n,q}$ is the probability distribution of the sum of $q - 1$ independent and identically distributed variables, as expressed by Equation (39). Therefore, $\gamma_{n,q}$ will converge to a Gaussian distribution with linearly increasing mean and variance. This indicates that $\mu_q$ and $\sigma_q^2$ are the only two nonzero cumulants in the thermodynamic limit and, thus, $q_c$ and $\Delta q$ are the only two nontrivial metrics related to $\hat{c}_{n,q}$. This does not, by any means, imply that the model of data structure itself can be reduced to only two degrees of freedom. In fact, the phenomenology is richer if one considers the combinatorial quantity $\hat{C}_{n,q}$ instead of the intensive one $\hat{c}_{n,q}$; see [32]. Still, regarding the probability of separation, the relevant metrics are the location and width of the transition region.
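The central-limit argument can be illustrated by comparing the exact separability curve, obtained by repeated convolution, with a Gaussian cumulative distribution that uses only the mean and variance of $\theta^k_\bullet$. This is a sketch: the function names and the example coefficients are assumptions, the boundary conditions are the simplified ones with $\hat{c}_{0,q} = 0$, and a continuity correction of $1/2$ is added because $\gamma_{\bullet,q}$ lives on the integers.

```python
import math
import numpy as np

def c_hat_exact(theta, n, q):
    """c_hat_{n,q} as the cumulative sum of gamma_{m,q} over m < n."""
    gamma = np.array([1.0])
    for _ in range(q - 1):
        gamma = np.convolve(theta, gamma)
    return float(gamma[:n].sum())

def c_hat_gaussian(theta, n, q):
    """Gaussian approximation using only the mean and variance of theta."""
    l = np.arange(len(theta))
    mu = float((l * theta).sum())
    var = float((l**2 * theta).sum() - mu**2)
    mu_q = (q - 1) * mu
    sigma_q = math.sqrt((q - 1) * var)
    z = (n - 0.5 - mu_q) / sigma_q  # continuity correction for the integer lattice
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

theta = np.array([0.35, 0.5, 0.15])  # example: k = 2, psi_2 = 0.7, so mu_theta = 0.8
n, q_c = 80, 101                     # q_c = n / mu_theta + 1
errors = [abs(c_hat_exact(theta, m, q_c) - c_hat_gaussian(theta, m, q_c))
          for m in range(60, 101)]
```

In this example the two curves agree to within a few parts in a thousand across the transition region, and the exact curve is close to $1/2$ at $q = q_c$.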

Appendix A. Boundary Conditions
The boundary conditions of the recurrence Equation (27) require some care. When a single ($q = 1$) multiplet is considered in dimension $n \ge k$, both its admissible dichotomies are linearly realizable. This is because all dichotomies of $k$ points can be realized in $n \ge k$ dimensions, as I mentioned above. Therefore
$$\hat{c}_{n \ge k,1} = 1. \tag{A1}$$
The boundary conditions for $n < k$ are not simply the same as for $k = 1$. To see this, consider for instance what happens in $n = 1$ dimensions when dealing with a single ($q = 1$) multiplet of $k = 2$ points, $\xi$ and $\bar{\xi}$. Two problems arise: (i) if the two points lie on opposite sides of the origin, a linearly realized dichotomy $\phi$ will always assign them different signs, $\phi(\xi) = -\phi(\bar{\xi})$; (ii) there are not enough degrees of freedom to fix the overlap $\rho = \xi \cdot \bar{\xi}$ while keeping $\xi$ and $\bar{\xi}$ normalized. These obstructions are problematic when trying to define the value of $\hat{c}_{1,1}$ for $k = 2$. This quantity appears in the right hand side of the recurrence Equation (25) when $n = 2$ and $q = 1$, where it is needed, alongside $\hat{c}_{2,1}$, to compute $\hat{c}_{2,2}$. Retracing the derivation for $k = 2$ shows that $\hat{c}_{1,1}$ in this context occurs when imposing a linear constraint in 2 dimensions, where it represents the fraction of admissible dichotomies of the doublet $\{\xi, \bar{\xi}\}$ that can be realized by a unit weight vector $w$ satisfying $w \cdot \xi = 0$. In 2 dimensions, the orthogonality condition fixes $w$ up to its sign. If this constrained $w$ is such that
$$\mathrm{sgn}(w \cdot \xi) = \mathrm{sgn}(w \cdot \bar{\xi}) \tag{A2}$$
then exactly 2 admissible dichotomies of $\{\xi, \bar{\xi}\}$ are realizable; otherwise, the only realizable dichotomies are not admissible. Therefore $\hat{c}_{1,1}$ expresses the probability that Equation (A2) is satisfied; in the mean-field approximation, this is $\psi_2(\rho)$. The foregoing argument actually applies for all $k \ge 1$. The probability that all $k$ points in a multiplet lie in the same half-space with respect to the hyperplane realized by a random unit vector fixes the first non-trivial boundary condition $\hat{c}_{1,1}$. For $k = 2$ this fixes everything.
Let us now consider k = 3. In this case Equation (A1) omitsĉ 2,1 . What should its value be? Again, going back to the argument in Section 6.3 is helpful.ĉ 2,1 appears in the recurrence when n = 3 and a linear constraint is imposed on w. This fixes w up to rotations around an axis, identified by a versor v. Now, whether the multiplet {ξ 1 , ξ 2 , ξ 3 } allows 2 or 0 admissible dichotomies depends on whether there exists a vector w satisfying the constraint and such that sgn(w · ξ 1 ) = sgn(w · ξ 2 ) = sgn(w · ξ 3 ). This happens if and only if the axis of rotation v lies outside the solid angle subtended by the three vectors ξ 1 , ξ 2 , ξ 3 . This characterization allows to computeĉ 2,1 by elementary methods of solid geometry. One findŝ where For larger values of k, the same reasoning allows to express the non trivial boundary conditionsĉ n<k,1 as geometric probabilities. Fortunately, the hassle of computing all these probabilities can be bypassed by using the boundary conditions (20), which are approximate for k > 1, but still provide asymptotically correct results [28]. In fact, as is evident from the discussion in Section 7, if one takes the thermodynamic limit (30) the contribution of the k − 1 approximate values ofĉ n,1 becomes negligible. Other ways of taking the thermodynamic limit (e.g., if k is extensive in n) may not enjoy this simplification, and may require a different analysis of the boundary conditions.