Linear and Fisher Separability of Random Points in the d-dimensional Spherical Layer

Stochastic separation theorems play an important role in high-dimensional data analysis and machine learning. It turns out that in high dimension any point of a random set of points can be separated from the other points by a hyperplane with high probability, even if the number of points is exponential in the dimension. This and similar facts can be used for constructing correctors for artificial intelligence systems, for determining the intrinsic dimension of data and for explaining various natural intelligence phenomena. In this paper, we refine the bounds for the number of points and for the probability in stochastic separation theorems, thereby strengthening some results obtained by Gorban, Tyukin, Burton, Sidorov, Zolotykh et al. We give and discuss the bounds for linear and Fisher separability when the points are drawn randomly, independently and uniformly from a d-dimensional spherical layer. These results allow us to better outline the applicability limits of the stochastic separation theorems in the mentioned applications.


I. INTRODUCTION
Recently, stochastic separation theorems [9] have been widely used in machine learning for constructing correctors and ensembles of correctors of artificial intelligence systems [6], [7], for determining the intrinsic dimension of data sets [1] and for explaining various natural intelligence phenomena, such as grandmother neurons [8]. If the dimension of the data is high, then any sample of the data set can be separated from all other samples by a linear (even Fisher) discriminant with probability close to 1, even if the number of samples is exponential in the dimension. Owing to the applications mentioned above, theorems of this kind can be considered a manifestation of the so-called blessing of dimensionality phenomenon [9].
In its usual form a stochastic separation theorem is formulated as follows. A random $n$-element set in $\mathbb{R}^d$ is linearly separable with probability $p > 1 - \vartheta$ if $n < a e^{bd}$. The exact form of the exponential function depends on the probability distribution that determines how the random set is drawn, and on the constant $\vartheta$ ($0 < \vartheta < 1$). In particular, various types of uniform distributions are considered in [5], [9], [10]. A wider class of distributions is considered in [7]. Roughly speaking, this class consists of distributions without sharp peaks in sets with exponentially small volume.
The work is supported by the Ministry of Education and Science of Russian Federation (project 14.Y26.31.0022).
We note that there are many algorithms for constructing a functional that separates a point from all other points in a data set (Fisher linear discriminant, linear programming algorithm, support vector machine, Rosenblatt perceptron etc.). Among all these methods the computationally cheapest is the Fisher discriminant [6]. Other advantages of the Fisher discriminant are its simplicity and robustness.
The papers [5]-[7], [9] deal only with Fisher separability, whereas [10] considers the (more general) linear separability. A comparison of the bounds for linear and Fisher separability allows us to clarify the applicability boundary of these methods, namely, to answer the question: for which $d$ and $n$ does it suffice to use only Fisher separability, so that there is no need to search for a more sophisticated linear discriminant?
Bounds for the cardinality of a set of points that guarantee its linear separability were obtained in [10] for points drawn randomly, independently and uniformly from a d-dimensional spherical layer and from the unit cube. These results give more accurate estimates than the bounds obtained in [5], [9] for Fisher separability. Here we give even more precise bounds for the number of points in the spherical layer that guarantee their linear separability. We also report and discuss the results of computational experiments comparing the theoretical bounds for the probability of linear and Fisher separability with the corresponding experimental frequencies.

II. DEFINITIONS
A point $X \in \mathbb{R}^d$ is linearly separable from a set $M \subset \mathbb{R}^d$ if there exists a hyperplane separating $X$ from $M$, i.e. there exists a linear function $L : \mathbb{R}^d \to \mathbb{R}$ such that $L(X) > L(Y)$ for all $Y \in M$.
A set of points $\{X_1, \dots, X_n\} \subset \mathbb{R}^d$ is called 1-convex [2] or linearly separable [9] if every point $X_i$ is linearly separable from all other points in the set, or, in other words, the set of vertices of their convex hull, $\mathrm{conv}(X_1, \dots, X_n)$, coincides with $\{X_1, \dots, X_n\}$.
The set $\{X_1, \dots, X_n\}$ is called Fisher separable if $(X_i, X_j) < (X_i, X_i)$ for all $i, j$ such that $i \ne j$ [6], [7]. Fisher separability implies linear separability but not vice versa (even if the set is centered and normalized to unit variance). Thus, if $M \subset \mathbb{R}^d$ is a random set of points from a certain probability distribution, then the probability that $M$ is linearly separable is not less than the probability that $M$ is Fisher separable.
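The Fisher separability condition above can be checked directly from pairwise inner products. The following is a minimal sketch in plain Python; the function names `dot` and `is_fisher_separable` are illustrative and do not come from the cited papers, and the check is the naive quadratic one rather than an optimized implementation.

```python
from typing import Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Standard inner product (x, y) of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


def is_fisher_separable(points: Sequence[Sequence[float]]) -> bool:
    """Return True if (X_i, X_j) < (X_i, X_i) for every ordered pair i != j."""
    for i, xi in enumerate(points):
        self_product = dot(xi, xi)
        for j, xj in enumerate(points):
            # A single violating pair is enough to reject the whole set.
            if i != j and dot(xi, xj) >= self_product:
                return False
    return True


if __name__ == "__main__":
    # Orthogonal unit vectors: (X_1, X_2) = 0 < 1, so the set is Fisher separable.
    print(is_fisher_separable([(1.0, 0.0), (0.0, 1.0)]))  # True
    # Two points on one ray from the origin: (X_1, X_2) = 0.5 >= 0.25 = (X_1, X_1).
    print(is_fisher_separable([(0.5, 0.0), (1.0, 0.0)]))  # False
```

The second example is also a small illustration of the "not vice versa" remark: the two points are the vertices of their convex hull (a segment), so the set is linearly separable, yet it is not Fisher separable.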
Let $B_d = \{X \in \mathbb{R}^d : \|X\| \le 1\}$ be the d-dimensional unit ball centered at the origin ($\|X\|$ denotes the Euclidean norm), and let $rB_d$ be the d-dimensional ball of radius $r < 1$ centered at the origin.
Let $M_n = \{X_1, \dots, X_n\}$ be a set of points chosen randomly and independently according to the uniform distribution on the spherical layer $B_d \setminus rB_d$. Denote by $P(d, r, n)$ the probability that $M_n$ is linearly separable, and by $P^F(d, r, n)$ the probability that $M_n$ is Fisher separable.
Denote by $P_1(d, r, n)$ the probability that a random point in the spherical layer $B_d \setminus rB_d$ is linearly separable from $M_n$, and by $P^F_1(d, r, n)$ the probability that a random point is Fisher separable from $M_n$.
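To make the sampling model concrete, here is a minimal sketch of one standard way to draw a point uniformly from the layer $B_d \setminus rB_d$: a normalized Gaussian vector gives a uniform direction, and the radius is obtained by inverting its distribution function $(t^d - r^d)/(1 - r^d)$ on $[r, 1]$. The function name `sample_spherical_layer` is hypothetical, and this is not necessarily the procedure used in the experiments reported here.

```python
import math
import random
from typing import List, Optional


def sample_spherical_layer(
    d: int, r: float, rng: Optional[random.Random] = None
) -> List[float]:
    """Draw one point uniformly from {x in R^d : r <= |x| <= 1}, 0 <= r < 1."""
    rng = rng or random.Random()
    # Direction: a normalised standard Gaussian vector is uniform on the sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    # Radius: the volume of a ball of radius t scales as t**d, so the radius
    # has CDF (t**d - r**d) / (1 - r**d) on [r, 1]; invert it at a uniform u.
    u = rng.random()
    radius = (r ** d + u * (1.0 - r ** d)) ** (1.0 / d)
    return [radius * c / norm for c in g]


if __name__ == "__main__":
    rng = random.Random(0)
    point = sample_spherical_layer(d=5, r=0.8, rng=rng)
    print(math.sqrt(sum(c * c for c in point)))  # a value between 0.8 and 1
```

Setting `r = 0` reduces the sketch to uniform sampling from the whole unit ball.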

III. PREVIOUS WORK
In [9] it was shown (among other results) that for all $r$, $\vartheta$, if $n$ satisfies the bound (1), then $M_n$ is Fisher separable with probability greater than $1 - \vartheta$.
The following statements are proved in [5].
• For all $r$, $\vartheta$, where $0 < r < 1$, $0 < \vartheta < 1$,
• For all $r$, $\vartheta$, where $0 < r < 1$, $0 < \vartheta < 1$, and for $d$ sufficiently large, if $n$ satisfies the bound (5), then $P^F(d, r, n) > 1 - \vartheta$.
Note that the estimate is not applicable for the case $r = 0$. Note also that the authors of [9], [5] formulate their results for linearly separable sets of points, but their proofs in fact establish Fisher separability.
Both estimates (1) and (5) depend exponentially on $d$ for fixed $r$, $\vartheta$, and the estimate (1) is worse than (5) (see Section V).

IV. NEW RESULTS
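The following theorem gives a lower bound for the probability of the linear separability of a random point from a random $n$-element set $M_n$ in $B_d \setminus rB_d$. The proof uses an approach borrowed from [2], [4].
Theorem 1. Let $0 \le r < 1$. Then
$$P_1(d, r, n) \ge 1 - \frac{n}{2^d}. \qquad (6)$$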
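Proof. A random point $Y$ is linearly separable from $M_n = \{X_1, \dots, X_n\}$ if and only if $Y \notin \mathrm{conv}(M_n)$. Denote this event by $C$. Thus $P_1(d, r, n) = \mathsf{P}(C)$. Let us find an upper bound for the probability of the complementary event $\overline{C}$, which means that the point $Y$ belongs to the convex hull of $M_n$. Since $Y$ is uniformly distributed in the layer and $\mathrm{conv}(M_n) \subseteq B_d$, for a fixed $M_n$ the probability of $\overline{C}$ is
$$\mathsf{P}(\overline{C}) = \frac{\mathrm{Vol}(\mathrm{conv}(M_n) \setminus rB_d)}{\mathrm{Vol}(B_d) - \mathrm{Vol}(rB_d)}.$$
Let us estimate the numerator of this fraction. We denote by $S_i$ the ball with diameter $OX_i$, where $O$ is the origin, and by $T_i$ the ball of diameter $r$ inside $S_i$ whose diameter lies on the segment $OX_i$ and has $O$ as an endpoint, so that $T_i \subseteq rB_d$. Then
$$\mathrm{conv}(M_n) \subseteq \bigcup_{i=1}^{n} S_i, \qquad \mathrm{conv}(M_n) \setminus rB_d \subseteq \bigcup_{i=1}^{n} (S_i \setminus T_i),$$
$$\mathrm{Vol}(\mathrm{conv}(M_n) \setminus rB_d) \le \sum_{i=1}^{n} \gamma_d \left( \frac{\|X_i\|^d}{2^d} - \frac{r^d}{2^d} \right) \le n \gamma_d \frac{1 - r^d}{2^d},$$
where $\gamma_d$ is the volume of a ball of radius 1. Hence
$$\mathsf{P}(\overline{C}) \le \frac{n \gamma_d (1 - r^d) / 2^d}{\gamma_d (1 - r^d)} = \frac{n}{2^d}$$
and
$$\mathsf{P}(C) \ge 1 - \frac{n}{2^d}.$$
Since this bound does not depend on the realization of $M_n$, it also holds for random $M_n$.
Note that the bound (6) obtained in Theorem 1 does not depend on $r$. Nevertheless, the bound is quite accurate, as illustrated in Figure 1. The results of the experiment show that the probabilities $P_1(d, r, n)$ and $P^F_1(d, r, n)$ are quite close and that the theoretical bound (6), compared with (2), approximates both probabilities well.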
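The following corollary gives an improved estimate for the number of points $n$ guaranteeing the linear separability of a random point from a random $n$-element set $M_n$ in $B_d \setminus rB_d$ with probability at least $1 - \vartheta$.
Corollary 1. Let $0 \le r < 1$, $0 < \vartheta < 1$, $n < \vartheta 2^d$. Then $P_1(d, r, n) > 1 - \vartheta$.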
Proof. If $n$ satisfies the condition $n < \vartheta 2^d$, then the inequality $P_1(d, r, n) > 1 - \vartheta$ holds by the previous theorem.
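The following theorem gives a lower bound for the probability of the linear separability of a random $n$-element set $M_n$ in $B_d \setminus rB_d$.
Theorem 2. Let $0 \le r < 1$. Then
$$P(d, r, n) \ge 1 - \frac{n(n-1)}{2^d}. \qquad (8)$$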
Proof. Denote by $A_n$ the event that $M_n$ is linearly separable and by $C_i$ the event that $X_i \notin \mathrm{conv}(M_n \setminus \{X_i\})$ ($i = 1, \dots, n$). Thus $P(d, r, n) = \mathsf{P}(A_n)$. Clearly $A_n = C_1 \cap \dots \cap C_n$ and
$$\mathsf{P}(A_n) = \mathsf{P}(C_1 \cap \dots \cap C_n) = 1 - \mathsf{P}(\overline{C_1} \cup \dots \cup \overline{C_n}) \ge 1 - \sum_{i=1}^{n} \mathsf{P}(\overline{C_i}).$$
Let us find an upper bound for the probability of the event $\overline{C_i}$. This event means that the point $X_i$ belongs to the convex hull of the remaining points, i.e. $X_i \in \mathrm{conv}(M_n \setminus \{X_i\})$. In the proof of the previous theorem (applied to the $(n-1)$-element set $M_n \setminus \{X_i\}$), it was shown that
$$\mathsf{P}(\overline{C_i}) \le \frac{n-1}{2^d}.$$
Hence
$$\mathsf{P}(A_n) \ge 1 - \frac{n(n-1)}{2^d}.$$
Note that the bound (8) obtained in Theorem 2 does not depend on $r$, although $P(d, r, n)$ seems to increase monotonically with increasing $r$ (for large enough $n$). Nevertheless, the bound is quite accurate, as illustrated in Figures 2, 3. The results of the experiment show that the probabilities $P_1(d, r, n)$ and $P^F_1(d, r, n)$ are quite close and that the theoretical bound (8), compared with (4), approximates both probabilities well.
Another important conclusion from the experiment is as follows. Although both probabilities $P^F(d, r, n)$ and $P(d, r, n)$ are close to 1 for sufficiently large $d$, the "threshold values" of such a sufficiently large $d$ differ greatly. In other words, the blessing of dimensionality when using linear discriminants comes noticeably earlier than if we use only Fisher discriminants. This comes at the higher cost of constructing a general linear discriminant compared with the Fisher one.
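An experiment of the kind discussed above can be sketched in a few lines of standard-library Python. The sketch below estimates only the empirical frequency of Fisher separability; checking general linear separability would additionally require solving a linear program for each point and is omitted. All function names are illustrative, the sampler (Gaussian direction, inverse-CDF radius) is an assumed implementation of the uniform distribution on the layer, and the parameter values are arbitrary rather than those used for the figures.

```python
import math
import random
from typing import List


def sample_layer(d: int, r: float, rng: random.Random) -> List[float]:
    """Uniform point in the layer r <= |x| <= 1 (assumed sampling scheme)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    radius = (r ** d + rng.random() * (1.0 - r ** d)) ** (1.0 / d)
    return [radius * c / norm for c in g]


def fisher_separable(points: List[List[float]]) -> bool:
    """Check (X_i, X_j) < (X_i, X_i) for all i != j."""
    for i, xi in enumerate(points):
        sq = sum(a * a for a in xi)
        for j, xj in enumerate(points):
            if i != j and sum(a * b for a, b in zip(xi, xj)) >= sq:
                return False
    return True


def fisher_frequency(d: int, r: float, n: int, trials: int, seed: int = 0) -> float:
    """Fraction of random n-point sets in the layer that are Fisher separable."""
    rng = random.Random(seed)
    hits = sum(
        fisher_separable([sample_layer(d, r, rng) for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # The frequency is expected to approach 1 as d grows with n fixed.
    for d in (5, 10, 20):
        print(d, fisher_frequency(d, r=0.5, n=30, trials=100))
```

Because Fisher separability implies linear separability, the frequency estimated this way is, up to sampling error, a lower estimate of the frequency of linear separability for the same $d$, $r$ and $n$.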
The following corollary gives an improved estimate for the number of points $n$ guaranteeing the linear separability of a random $n$-element set $M_n$ in $B_d \setminus rB_d$ with probability at least $1 - \vartheta$. This result strengthens the result obtained in [10].
Corollary 2. Let $0 \le r < 1$, $0 < \vartheta < 1$, $n < \sqrt{\vartheta 2^d}$. Then $P(d, r, n) > 1 - \vartheta$.
Proof. If $n$ satisfies the condition $n < \sqrt{\vartheta 2^d}$, then by the previous theorem
$$P(d, r, n) \ge 1 - \frac{n(n-1)}{2^d} > 1 - \frac{n^2}{2^d} > 1 - \vartheta.$$
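For a feeling of the scale of these conditions, the following sketch evaluates the largest integer $n$ allowed by the conditions $n < \vartheta 2^d$ and $n < \sqrt{\vartheta 2^d}$ that appear in the proofs of the two corollaries. The function names are hypothetical, and floating-point arithmetic is used, so values of $\vartheta 2^d$ that are extremely close to an integer may be rounded.

```python
import math


def max_n_point_separable(d: int, theta: float) -> int:
    """Largest integer n with n < theta * 2**d (random point vs. random set)."""
    return math.ceil(theta * 2 ** d) - 1


def max_n_set_separable(d: int, theta: float) -> int:
    """Largest integer n with n < sqrt(theta * 2**d) (whole set separable)."""
    return math.ceil(math.sqrt(theta * 2 ** d)) - 1


if __name__ == "__main__":
    # d = 20, theta = 0.01: theta * 2**d = 10485.76 and its square root is 102.4.
    print(max_n_point_separable(20, 0.01))  # 10485
    print(max_n_set_separable(20, 0.01))    # 102
```

The example shows the gap between the two regimes: separating one random point from the set tolerates about $\vartheta 2^d$ points, while separability of the whole set is guaranteed only up to about the square root of that number.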

V. COMPARISON OF THE RESULTS
The following statement establishes the asymptotics of the bound (1).
If $r$ and $\vartheta$ are fixed then the following asymptotic estimates hold: The equality
Let us compare the bound (5) with the bound (1) proposed in [9], [5].

Corollary 3. If $r$ and $\vartheta$ are fixed then the following asymptotic estimates of the quotient $f/g$ hold:
Proof. If
The following statement compares the estimates of the number of points that guarantee linear separability of random points in the spherical layer obtained in [5] and in Corollary 2.

VI. CONCLUSION
In this paper we refined the bounds for the number of points and for the probability in stochastic separation theorems. We gave new bounds for linear separability when the points are drawn randomly, independently and uniformly from a d-dimensional spherical layer. These results refine some results obtained in [5], [9], [10] and allow us to better understand the applicability limits of the stochastic separation theorems for high-dimensional data mining and machine learning problems.
One of the main results of the experiment comparing linear and Fisher separability is as follows. The blessing of dimensionality when using linear discriminants can come noticeably earlier (for smaller values of $d$) than if we use only Fisher discriminants. This comes at the higher cost of constructing a general linear discriminant compared with the Fisher one.