1. Introduction
The distribution of distances between elements in a set of points arises in many problems, particularly in spatial analysis, and in various fields of application: ecology, epidemiology, forestry, biology, astronomy, economics, particle physics, network applications, etc. [1]. For example, given two points randomly selected in a set of points independently and uniformly distributed (iud) in space, we aim to know the probability of observing the distance between these two points among the distances between all pairs of points (Figure 1).
This question is important when trying to evaluate or model spatial interactions between elements, such as clustering of objects, spatial autocorrelation of a variable across a set of locations, or neighbor relationships and connectivity [2]. Indeed, in nature many problems involve distance-based interactions between events or elements. For example, most methods used to measure spatial autocorrelation or to model spatial interactions are based on a weighted average of a variable between pairs of elements in a disk [2,3,4,5]. If such an index measures a phenomenon related to the distance between the elements, the index may favor the pairs of elements separated by the most likely distances. In order to avoid such bias, it is necessary to know the distribution of distances and to consider the relationship between the distance and the phenomenon independently of the distribution of distances between all pairs of elements [2].
The distribution of distances between two randomly selected points in a compact set has been studied for a long time. However, the results are fragmented because they are presented in different articles, with different methods of resolution, depending on the dimension of the space and the type of compact set studied. In this article, we first present a literature review of these results. We then propose a unified method of resolution which uses only standard mathematical objects, and which is generalizable to any type of compact set in any dimension. We describe this general approach and use it to calculate distributions of Euclidean distances between two randomly chosen points for compact sets such as lines, rectangles, disks, and cubes, with specific results for hyperspheres of any dimension.
2. Literature Review
The task of estimating the distribution of distances between points is related to stochastic geometry [6].
2.1. Rayleigh Distribution
The Nobel laureate physicist Lord Rayleigh (1842–1919) solved a slightly simpler problem than the one studied in this article: he modelled the distribution (known as the Rayleigh distribution) of Euclidean distances between a central point and a set of points normally distributed around this central point in a real vector space of dimension two [7].
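By positioning the central point at the origin, the problem addressed by Rayleigh was to evaluate the distribution of $\|X\|$, i.e., $\sqrt{X_1^2 + X_2^2}$ in Cartesian coordinates for the Euclidean distance, the vector $X = (X_1, X_2)$ corresponding to the realization of two real random variables $X_1$ and $X_2$, independent but generated by the same density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right),$$
where $\sigma$, the standard deviation of this normal distribution, allows one to set the concentration of points around the center.
The distribution of distances to the center can then be easily obtained by using the independence of the two random variables and by using polar coordinates:
$$f_{\|X\|}(r) = \int_0^{2\pi} f(r\cos\theta)\, f(r\sin\theta)\, r\, d\theta = \frac{r}{\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right), \quad r \ge 0.$$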
Many phenomena in various fields such as image processing, signal processing, particle physics, etc., follow a Rayleigh distribution.
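The Rayleigh result can be checked by simulation. The sketch below is only an illustration under stated assumptions: the function names, the sample size, and the value of `sigma` are hypothetical choices, and the check compares the empirical mean distance with the known Rayleigh mean $\sigma\sqrt{\pi/2}$.

```python
import math
import random


def rayleigh_pdf(r, sigma):
    """Rayleigh density (r / sigma^2) * exp(-r^2 / (2 sigma^2)) for r >= 0."""
    if r < 0:
        return 0.0
    return (r / sigma**2) * math.exp(-r**2 / (2.0 * sigma**2))


def simulate_radii(n_samples, sigma, seed=0):
    """Distances to the origin of points with independent N(0, sigma^2) coordinates."""
    rng = random.Random(seed)
    return [math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(n_samples)]


sigma = 2.0  # illustrative value, not taken from the article
radii = simulate_radii(100_000, sigma)
empirical_mean = sum(radii) / len(radii)
theoretical_mean = sigma * math.sqrt(math.pi / 2.0)  # mean of a Rayleigh law
print(empirical_mean, theoretical_mean)
```

The two printed values should agree up to Monte Carlo noise; a histogram of `radii` could be compared with `rayleigh_pdf` in the same way.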
2.2. Distance Distribution between Two Random Points Iud in a Region of ℝⁿ
More recently, in the 20th century, spatial point processes in one or two dimensions and related spatial properties, such as void or contact distributions or the Euclidean distance distribution between k-neighbors, started to receive special attention [8]. Nevertheless, “the research on distributions of distances in point processes of dimensions higher than one have never been an issue of systematic research and have been performed in rather ad hoc way in the past” ([1], p. 2). The problem was not addressed holistically, but depending on the field of application and the geometric form considered, in two or three dimensions, essentially circles and spheres or rectangles and cubes.
The distribution of Euclidean distances between two random points for an iud set of points in a circle, sphere or hypersphere has been addressed many times in the literature in different fields and with different techniques (geometry, differential equations): in mathematics and statistics [9,10,11,12,13], in chromosome analysis [14], in geography [15], in demography [16,17,18,19,20], in network analysis [1], and in physics [21,22].
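For example, in $\mathbb{R}^2$, for two random points $X$ and $Y$ in a circle of radius $R$, the geometric resolution of the distribution of $\|X - Y\|$ described in [1] uses Crofton’s fixed-point theorem and the mean value theorem [23], and the result has been known since the end of the 19th century:
$$f(r) = \frac{4r}{\pi R^2}\arccos\left(\frac{r}{2R}\right) - \frac{2r^2}{\pi R^3}\sqrt{1 - \frac{r^2}{4R^2}}, \quad 0 \le r \le 2R.$$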
Again in $\mathbb{R}^2$, the distribution of Euclidean distances between two random points in a rectangle has long been addressed [24], and an analytical resolution is presented in [25]. More recently, other studies have focused on polygons [26].
The distribution of Euclidean distances for a cube has also been addressed [27,28,29,30], but without a general formulation for any dimension. The distribution function for this random variable seems not to have been known before 1978 [27]. Robbins’s constant [31] was defined as the mean Euclidean distance between two random points in a unit cube.
As one of the many random quantities studied in geometric probability, results were extended to the 4th and 5th dimensions [33], but for higher dimensions the increase in algebraic complexity associated with derivation procedures was a strong limiting factor. These results can have practical applications in multidimensional analysis and data mining.
3. A Unified Method for the Evaluation of Euclidean Distance Distributions between Two Randomly Chosen Points
We now present an original approach using a unified method generalizable to any type of compact set in any dimension. The mathematical formalization and resolution use only well-known objects and methods such as random variables and density functions, convolution, marginal distributions, and some standard functions (Gamma, Beta). We will use this approach to calculate distributions of Euclidean distances between two randomly chosen points for hyperspheres and hypercubes of any dimension, and therefore confirm results already known in the literature, as mentioned before.
3.1. Mathematical Formalization
Let $x$ be a vector in a normed $\mathbb{R}$-vector space $E$ of dimension $n$. Its coordinates are noted $(x_1, \dots, x_n)$ in an orthonormal coordinate system, $\|x\|$ is the norm of $x$, and $d$ is the distance associated with the norm ($d(x, y) = \|x - y\|$. In this article, the Euclidean norm will be considered: $\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$). $B(c, r)$ is the closed ball of center $c$ and radius $r$, which corresponds to all the vectors $x$ whose distance to $c$ is less than or equal to $r$:
$$B(c, r) = \{x \in E : \|x - c\| \le r\}.$$
Let $X$ and $Y$ denote two random vectors in $E$, independent and identically distributed with the same probability distribution corresponding to a homogeneous Poisson point process $\mathcal{P}$ (a completely random spatial process). Random vectors $X$ and $Y$ allow us to simulate the pairs $(x, y)$ of elements of a subset $F$ of $E$, and to evaluate the distribution of their distances $d(x, y)$. Let $Z$ be the random variable in $E$ such that: $Z = X - Y$. $\|Z\|$ represents the distance between two vectors obtained randomly in $F$. Our problem is to determine the probability density of $\|Z\|$ from the process $\mathcal{P}$, i.e., to determine the function $f_{\|Z\|}$ such that
$$P(a \le \|Z\| \le b) = \int_a^b f_{\|Z\|}(r)\, dr.$$
3.2. Using Convolution of Density Functions
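Considering that $\|Z\|$ is the norm of $Z = X - Y$, finding the density function $f_Z$ using $f_X$ and $f_Y$ would lead to the expected distribution of $\|Z\|$.
When $E$ is uni-dimensional, $X$ and $Y$ are simply independent $\mathbb{R}$-random variables. Convolution can be used to find the density $f_{X+Y}$ from the density functions $f_X$ and $f_Y$ [34],
$$f_{X+Y}(s) = (f_X * f_Y)(s) = \int_{\mathbb{R}} f_X(s - y)\, f_Y(y)\, dy.$$
Therefore, $f_Z$ can be seen as the convolution of $f_X$ and $f_{-Y}$,
$$f_Z(z) = (f_X * f_{-Y})(z) = \int_{\mathbb{R}} f_X(z + y)\, f_Y(y)\, dy,$$
given that $f_{-Y}(y) = f_Y(-y)$, which stays true even in higher dimensions.
When $X$ and $Y$ are two independent $n$-dimensional random vectors in a vector space $E$, $(X, Y)$ is a $2n$-dimensional vector. Let $A$ be the $2n \times 2n$ matrix such that
$$\begin{pmatrix} Z \\ Y \end{pmatrix} = A \begin{pmatrix} X \\ Y \end{pmatrix}, \quad A = \begin{pmatrix} I_n & -I_n \\ 0 & I_n \end{pmatrix},$$
where $I_n$ is the identity matrix in $E$.
As a consequence, $f_Z$ is the marginal distribution of $(Z, Y)$:
$$f_Z(z) = \int_E f_{(Z,Y)}(z, y)\, dy.$$
Given that $|\det A| = 1$, the change of variables gives $f_{(Z,Y)}(z, y) = f_X(z + y)\, f_Y(y)$, and thus leads to
$$f_Z(z) = \int_E f_X(z + y)\, f_Y(y)\, dy.$$
As such we can obtain the distribution of $Z$ from the distributions of $X$ and $Y$ only.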
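As a numerical illustration of the convolution argument, the following sketch evaluates $f_Z(z) = \int f_X(z + y)\, f_Y(y)\, dy$ with a midpoint rule in dimension one. The uniform densities on $[0, L]$, the function names, and the step count are assumptions made for the example; the reference value $(L - |z|)/L^2$ is the triangular density of the difference of two independent uniform variables.

```python
def uniform_pdf(x, length):
    """Density of a uniform variable on [0, length]."""
    return 1.0 / length if 0.0 <= x <= length else 0.0


def difference_pdf(z, pdf_x, pdf_y, lo, hi, steps=20_000):
    """Midpoint-rule value of f_Z(z) = integral of f_X(z + y) f_Y(y) dy on [lo, hi].

    [lo, hi] must contain the support of pdf_y.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += pdf_x(z + y) * pdf_y(y)
    return total * h


length = 2.0  # illustrative segment length


def pdf(x):
    return uniform_pdf(x, length)


for z in (-1.5, -0.5, 0.0, 0.5, 1.5):
    numeric = difference_pdf(z, pdf, pdf, 0.0, length)
    exact = (length - abs(z)) / length**2  # triangular density of X - Y
    print(z, round(numeric, 4), round(exact, 4))
```

The same routine accepts any pair of one-dimensional densities, which is the point of the convolution formulation: only $f_X$ and $f_Y$ are needed.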
3.3. Distribution with Random Set of Points Iud in a Compact Set
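Compact sets are convenient to model any type of spatial region of any shape and with finite size (which is always the case in reality). We assume in the following that the set $F$ is a set of elements iud with uniform density $\rho$ in a compact set $K$, corresponding to a homogeneous Poisson process of density $\rho$ on $K$. So, we have:
$$f_X(x) = f_Y(x) = \frac{\mathbb{1}_K(x)}{\lambda(K)},$$
where $\mathbb{1}_K$ is the indicator function of $K$, and $\lambda$ the Lebesgue measure in $E$.
Using previously presented tools,
$$f_Z(z) = \frac{1}{\lambda(K)^2}\int_E \mathbb{1}_K(z + y)\, \mathbb{1}_K(y)\, dy = \frac{\lambda(K \cap K_z)}{\lambda(K)^2},$$
where $K_z = \{x - z : x \in K\}$ is the set $K$ translated by $-z$.
The distribution of $\|Z\|$ follows:
$$f_{\|Z\|}(r) = \int_{\|z\| = r} f_Z(z)\, dS(z) = \frac{1}{\lambda(K)^2}\int_{\|z\| = r} \lambda(K \cap K_z)\, dS(z).$$
Because $\lambda$ is defined as a measure with translational invariance, one can be sure that $f_{\|Z\|}$ is not affected by the position of $K$ inside $E$ but only by the “shape” of $K$ and $K_z$. This translational invariance is very intuitive; distances inside a spatial area are never affected by the global position of the area:
$$\lambda\big((K + a) \cap (K + a)_z\big) = \lambda(K \cap K_z) \quad \text{for all } a \in E.$$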
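The role of the overlap between a compact set and its translate can be illustrated by a small Monte Carlo sketch. It assumes that the density of $Z = X - Y$ at $z$ is proportional to the measure of the intersection of $K$ with $K$ translated by $-z$; the helper names, the rejection sampler, and the test set (a unit disk) are hypothetical choices for the example.

```python
import math
import random


def overlap_fraction(indicator, bbox, z, n_samples=50_000, seed=0):
    """Estimate lambda(K ∩ (K - z)) / lambda(K) for a planar set K.

    Points are drawn uniformly in K by rejection from the bounding box; the
    estimate is the share of points y in K for which y + z is also in K.
    """
    (xmin, xmax), (ymin, ymax) = bbox
    rng = random.Random(seed)
    inside = hits = 0
    while inside < n_samples:
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if not indicator(x, y):
            continue
        inside += 1
        if indicator(x + z[0], y + z[1]):
            hits += 1
    return hits / inside


def disk(cx, cy, radius):
    """Indicator function of a closed disk."""
    return lambda x, y: (x - cx) ** 2 + (y - cy) ** 2 <= radius**2


z = (1.0, 0.0)
centered = overlap_fraction(disk(0.0, 0.0, 1.0), ((-1.0, 1.0), (-1.0, 1.0)), z)
shifted = overlap_fraction(disk(5.0, -3.0, 1.0), ((4.0, 6.0), (-4.0, -2.0)), z)
# Dividing by lambda(K) = pi gives an estimate of the density of Z at z.
print(centered / math.pi, shifted / math.pi)
```

Both estimates should coincide up to sampling noise, which is the translational invariance stated above: moving the disk does not change the overlap, only the shape and the displacement $z$ matter.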
4. Using Equation (3) for Typical Compact Sets
We will apply this general resolution formula for typical compact sets, especially to hypercubes and hyperspheres of any dimension.
4.1. K Is a Segment in a 1-Dimensional Space
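In dimension 1, the compact set $K$ is a segment $[a, b]$ in $\mathbb{R}$. The set $F$ stands for a set of random values uniformly distributed in $[a, b]$. $X$ and $Y$ are simply independent $\mathbb{R}$-random variables. We can easily determine the well-known density function [9] (Figure 2). We have
$$f_Z(z) = \frac{\lambda\big([a, b] \cap [a - z, b - z]\big)}{(b - a)^2}.$$
Due to translational invariance, $[a, b]$ can be replaced with $[0, L]$, where $L = b - a$.
Therefore, $f_Z(z) = 0$ when $|z| > L$. Otherwise,
$$f_Z(z) = \frac{L - |z|}{L^2}.$$
Hence,
$$f_{\|Z\|}(r) = f_Z(r) + f_Z(-r) = \frac{2(L - r)}{L^2}, \quad 0 \le r \le L.$$
As the density is linear, it is easy to calculate the mean, variance and median (called $m$):
$$E(\|Z\|) = \frac{L}{3}, \quad \mathrm{Var}(\|Z\|) = \frac{L^2}{18}, \quad m = L\left(1 - \frac{\sqrt{2}}{2}\right).$$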
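The segment case is simple enough to verify numerically. The sketch below assumes a segment of length `length` (a name chosen for the example) and the classical linear density $2(L - r)/L^2$ of the distance between two independent uniform points; it recovers the mean $L/3$, the variance $L^2/18$, and the median $L(1 - 1/\sqrt{2})$ by midpoint integration.

```python
import math


def segment_density(r, length):
    """Density of |X - Y| for X, Y independent and uniform on a segment."""
    return 2.0 * (length - r) / length**2 if 0.0 <= r <= length else 0.0


def segment_cdf(r, length):
    """Integral of segment_density from 0 to r."""
    r = min(max(r, 0.0), length)
    return (2.0 * length * r - r * r) / length**2


def integrate(func, lo, hi, steps=10_000):
    """Plain midpoint rule."""
    h = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * h) for i in range(steps)) * h


length = 3.0  # illustrative value
mean = integrate(lambda r: r * segment_density(r, length), 0.0, length)
second_moment = integrate(lambda r: r * r * segment_density(r, length), 0.0, length)
variance = second_moment - mean**2
median = length * (1.0 - 1.0 / math.sqrt(2.0))
print(mean, variance, median, segment_cdf(median, length))
```

The last printed value is the cumulative probability at the median and should be one half.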
4.2. K Is a Rectangle in a 2-Dimensional Space
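The two-dimensional rectangle displays a highly convenient property, namely that the coordinates along the x-axis and the y-axis are statistically independent.
Let us introduce parameters for our rectangle:
$$K = [x_0, x_0 + a] \times [y_0, y_0 + b],$$
where $a > 0$ and $b > 0$. Thanks to the remark previously made about translational invariance, $K$ can be replaced by $[0, a] \times [0, b]$ without changing the final result. Here, the general expression of $f_Z$ gives
$$f_Z(z_1, z_2) = \frac{(a - |z_1|)(b - |z_2|)}{a^2 b^2} \quad \text{for } |z_1| \le a,\ |z_2| \le b, \text{ and } 0 \text{ otherwise.}$$
$f_Z$ must be integrated on the circle of radius $r$ in order to get the distribution $f_{\|Z\|}$ for $r$. By noticing that the longest distance inside $K$ is $\sqrt{a^2 + b^2}$, the distribution function becomes
$$f_{\|Z\|}(r) = r\int_0^{2\pi} f_Z(r\cos\theta, r\sin\theta)\, d\theta, \quad 0 \le r \le \sqrt{a^2 + b^2}.$$
Clearly, we need to separate different cases, according to the position of $r$ relative to $\min(a, b)$ and $\max(a, b)$. In the first case, $0 \le r \le \min(a, b)$, the whole circle lies in the support of $f_Z$ and the density is polynomial:
$$f_{\|Z\|}(r) = \frac{2r}{a^2 b^2}\left(\pi a b - 2(a + b)\, r + r^2\right).$$
In the particular case where $a = b$ (i.e., $K$ is a square), this polynomial function describes about the first 97.5% of all distances.
As an example, Figure 3 shows the graph of the density on a square (a) and inside some rectangles (b).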
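The rectangle case can be explored numerically without the piecewise closed form. The sketch below assumes a rectangle with side lengths `a` and `b` (names chosen for the example) and the product form $(a - |z_1|)(b - |z_2|)/(a^2 b^2)$ for the density of the difference vector; it integrates this density over a circle of radius $r$ and compares the result, for $r \le \min(a, b)$, with the polynomial $\frac{2r}{a^2 b^2}(\pi a b - 2(a + b) r + r^2)$.

```python
import math


def rect_fz(z1, z2, a, b):
    """Density of Z = X - Y for X, Y independent and uniform on an a-by-b rectangle."""
    return max(a - abs(z1), 0.0) * max(b - abs(z2), 0.0) / (a * b) ** 2


def rect_distance_density(r, a, b, steps=2000):
    """Density of the distance: midpoint-rule integral of rect_fz over the circle of radius r."""
    h = 2.0 * math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += rect_fz(r * math.cos(theta), r * math.sin(theta), a, b)
    return r * total * h


def small_distance_polynomial(r, a, b):
    """Closed form of the density, valid only for 0 <= r <= min(a, b)."""
    return 2.0 * r * (math.pi * a * b - 2.0 * (a + b) * r + r * r) / (a * b) ** 2


def mass_below(r_max, a, b, steps=200):
    """Probability that the distance is at most r_max (midpoint rule in r)."""
    h = r_max / steps
    return sum(rect_distance_density((i + 0.5) * h, a, b) for i in range(steps)) * h


print(rect_distance_density(0.5, 1.0, 1.0), small_distance_polynomial(0.5, 1.0, 1.0))
print(mass_below(1.0, 1.0, 1.0), mass_below(math.sqrt(2.0), 1.0, 1.0))
```

For the unit square, the second line prints the share of distances covered by the polynomial branch, followed by the total mass, which should be close to one.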
4.3. K Is a Disk in a 2-Dimensional Space
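$K$ is a disk with radius $R$,
$$K = B(c, R),$$
where $c$ is the center of the circle and $R$ its radius. It is obviously possible to limit our study to the case where $c = 0$.
To calculate $f_{\|Z\|}(r)$ we need to integrate the area $\lambda(K \cap K_z)$ on all angles $\theta$, where $K_z$ denotes $K$ translated by $-z$ and $z = (r\cos\theta, r\sin\theta)$.
We can demonstrate the isotropy of $\lambda(K \cap K_z)$ (the area remains the same for any value of $\theta$).
Let $\mathcal{R}_\theta$ be the rotation operator,
$$\lambda\big(K \cap K_{\mathcal{R}_\theta z}\big) = \lambda\big(\mathcal{R}_\theta (K \cap K_z)\big) = \lambda(K \cap K_z),$$
since a disk centered at the origin is invariant under rotation and a rotation preserves the Lebesgue measure.
To calculate the surface $S(r) = \lambda(K \cap K_z)$ for $\|z\| = r$, let $\ell(x)$ be the length of the chord for the $x$ coordinate on the x-axis [35]:
$$\ell(x) = 2\sqrt{R^2 - x^2}, \quad S(r) = 2\int_{r/2}^{R} \ell(x)\, dx = 2R^2\arccos\left(\frac{r}{2R}\right) - \frac{r}{2}\sqrt{4R^2 - r^2}.$$
This finally leads to the already-mentioned result above,
$$f_{\|Z\|}(r) = \frac{2\pi r\, S(r)}{(\pi R^2)^2} = \frac{4r}{\pi R^2}\arccos\left(\frac{r}{2R}\right) - \frac{2r^2}{\pi R^3}\sqrt{1 - \frac{r^2}{4R^2}}, \quad 0 \le r \le 2R,$$
$2R$ being the longest distance possible inside a circle.
Figure 4 shows the graph of the distribution of distances inside a circle.
We can calculate the mean and variance of this distribution:
$$E(\|Z\|) = \int_0^{2R} r\, f_{\|Z\|}(r)\, dr = \frac{128R}{45\pi} \approx 0.9054\, R.$$
If $m$ stands for the median, solving
$$\int_0^{m} f_{\|Z\|}(r)\, dr = \frac{1}{2}$$
for $m$ turns out to be unsolvable analytically. Nevertheless, we can show that $m < \mu$, where $\mu$ is the mean of the distribution. The median is only very slightly lower than the mean.
To calculate the variance, we first calculate $E(\|Z\|^2)$,
$$E(\|Z\|^2) = \int_0^{2R} r^2 f_{\|Z\|}(r)\, dr = R^2,$$
which is quite remarkable. Therefore,
$$\mathrm{Var}(\|Z\|) = R^2 - \left(\frac{128R}{45\pi}\right)^2 \approx 0.1802\, R^2.$$
Let us check, as an exercise, that the density is indeed normalized:
$$\int_0^{2R} f_{\|Z\|}(r)\, dr = 1.$$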
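The normalization and the moments of the disk distribution can be checked numerically. The sketch below assumes a disk of radius `radius` (a name chosen for the example) and the classical closed form $\frac{4r}{\pi R^2}\arccos\frac{r}{2R} - \frac{2r^2}{\pi R^3}\sqrt{1 - \frac{r^2}{4R^2}}$; the median is obtained by bisection on a numerically integrated distribution function, which is one possible approach among others.

```python
import math


def disk_density(r, radius):
    """Density of the distance between two independent uniform points in a disk."""
    if not 0.0 <= r <= 2.0 * radius:
        return 0.0
    u = r / (2.0 * radius)
    return (4.0 * r / (math.pi * radius**2)) * math.acos(u) \
        - (2.0 * r**2 / (math.pi * radius**3)) * math.sqrt(1.0 - u * u)


def integrate(func, lo, hi, steps=20_000):
    """Plain midpoint rule."""
    h = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * h) for i in range(steps)) * h


def disk_median(radius, tol=1e-9):
    """Bisection on the numerically integrated distribution function."""
    lo, hi = 0.0, 2.0 * radius
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        cdf = integrate(lambda r: disk_density(r, radius), 0.0, mid, steps=2000)
        if cdf < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


radius = 1.0  # illustrative value
total = integrate(lambda r: disk_density(r, radius), 0.0, 2.0 * radius)
mean = integrate(lambda r: r * disk_density(r, radius), 0.0, 2.0 * radius)
second_moment = integrate(lambda r: r * r * disk_density(r, radius), 0.0, 2.0 * radius)
median = disk_median(radius)
print(total, mean, second_moment, second_moment - mean**2, median)
```

The printed values can be compared with $1$, $128R/(45\pi)$, and $R^2$ for the total mass, the mean, and the second moment, and the median can be compared with the mean.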
4.4. K Is a Sphere in a Three-Dimensional Space
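When using Formula (2), it is clear that when a space’s dimension is equal to three or more, the issue is simply to calculate a volume (or hypervolume). Calculations are quite similar to those used for the circle. Firstly, let us give the density of random vectors $X$ and $Y$,
$$f_X(x) = f_Y(x) = \frac{3}{4\pi R^3}\, \mathbb{1}_{B(c, R)}(x),$$
where $B(c, R)$ is the 3D sphere centered at $c$ with a radius $R$. Just like previously, replacing $c$ with $0$ will not change any results. We have:
$$f_Z(z) = \frac{\lambda(K \cap K_z)}{\left(\frac{4}{3}\pi R^3\right)^2}, \quad K = B(0, R),$$
where $K_z$ denotes $K$ translated by $-z$.
Using spherical coordinates $(r, \theta, \varphi)$ gives us
$$f_{\|Z\|}(r) = \int_0^{2\pi}\int_0^{\pi} f_Z(r, \theta, \varphi)\, r^2 \sin\theta\, d\theta\, d\varphi,$$
giving us directly the density $f_{\|Z\|}$. A rotation matrix determinant is still equal to 1. Therefore angles $\theta$ and $\varphi$ will not change the value of the volume:
$$f_{\|Z\|}(r) = \frac{4\pi r^2\, V(r)}{\left(\frac{4}{3}\pi R^3\right)^2}, \quad V(r) = \lambda(K \cap K_z) \text{ for } \|z\| = r.$$
The volume of the intersection can be calculated similarly to the previous two-dimensional area. The difference is that for every $x$ between $r/2$ and $R$, the chord is replaced by a whole disc of radius $\sqrt{R^2 - x^2}$. Therefore:
$$V(r) = 2\int_{r/2}^{R} \pi (R^2 - x^2)\, dx = \frac{\pi}{12}\left(16R^3 - 12R^2 r + r^3\right), \quad f_{\|Z\|}(r) = \frac{3r^2}{R^3} - \frac{9r^3}{4R^4} + \frac{3r^5}{16R^6}, \quad 0 \le r \le 2R.$$
The density is polynomial in a three-dimensional space. Clearly, the third dimension favors the presence of longer distances:
$$E(\|Z\|) = \frac{36R}{35} \approx 1.0286\, R,$$
which is greater than the expected value calculated in a two-dimensional space. Calculating the median implies solving for $m$ a polynomial equation of degree 6:
$$\frac{m^6}{32R^6} - \frac{9m^4}{16R^4} + \frac{m^3}{R^3} = \frac{1}{2}.$$
Unfortunately it is impossible to give a general solution. Nevertheless, solving the equation numerically gives $m \approx 1.033\, R$, a very good approximation of the median. This estimation shows that this time, the median is greater than the mean. It is remarkable that, contrary to the 2D case, 3D distances are slightly more likely to be longer than average.
Finally, let us calculate the variance
$$\mathrm{Var}(\|Z\|) = \frac{6R^2}{5} - \left(\frac{36R}{35}\right)^2 = \frac{174R^2}{1225} \approx 0.142\, R^2,$$
which is, as expected, significantly lower than the two-dimensional variance.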
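For the three-dimensional ball, the degree-6 equation for the median has no closed-form root but is easy to solve numerically. The sketch below assumes a ball of radius `radius` (a name chosen for the example) and the classical polynomial density $3t^2 - \frac{9}{4}t^3 + \frac{3}{16}t^5$ with $t = r/R$, divided by $R$; bisection is one possible root-finding choice.

```python
def ball_density(r, radius):
    """Density of the distance between two independent uniform points in a 3D ball."""
    if not 0.0 <= r <= 2.0 * radius:
        return 0.0
    t = r / radius
    return (3.0 * t**2 - 2.25 * t**3 + 0.1875 * t**5) / radius


def ball_cdf(r, radius):
    """Distribution function: t^3 - (9/16) t^4 + t^6 / 32 with t = r / radius."""
    t = min(max(r / radius, 0.0), 2.0)
    return t**3 - 0.5625 * t**4 + t**6 / 32.0


def ball_median(radius, tol=1e-12):
    """Bisection on the degree-6 polynomial equation ball_cdf(m) = 1/2."""
    lo, hi = 0.0, 2.0 * radius
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ball_cdf(mid, radius) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def integrate(func, lo, hi, steps=20_000):
    """Plain midpoint rule."""
    h = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * h) for i in range(steps)) * h


radius = 1.0  # illustrative value
mean = integrate(lambda r: r * ball_density(r, radius), 0.0, 2.0 * radius)
second_moment = integrate(lambda r: r * r * ball_density(r, radius), 0.0, 2.0 * radius)
median = ball_median(radius)
print(mean, second_moment - mean**2, median)
```

The numerical mean can be compared with $36R/35$ and the variance with $6R^2/5 - (36R/35)^2$; the median comes out slightly above the mean, unlike in the disk case.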
4.5. K Is a Hypersphere in Higher Dimensions (n > 3)
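Random vectors $X$ and $Y$ are now considered independently and uniformly distributed in $B(0, R) \subset \mathbb{R}^n$, where $n > 3$. A generalization of previous methods allows us to expect $f_{\|Z\|}$ to take the following form:
$$f_{\|Z\|}(r) = c_n\, r^{n-1} \int_{r/2}^{R} V_{n-1}\left(\sqrt{R^2 - x^2}\right) dx, \quad 0 \le r \le 2R,$$
where $c_n$ is a constant and $V_n(\rho)$ the volume of a hypersphere of radius $\rho$ in n-dimensional space:
$$V_n(\rho) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)}\, \rho^n.$$
Therefore, we have
$$f_{\|Z\|}(r) = C_n\, r^{n-1} \int_{r/2}^{R} (R^2 - x^2)^{\frac{n-1}{2}}\, dx,$$
where $C_n$ can be evaluated as the normalizing constant. The generalized binomial theorem [36] gives us the way to evaluate the integral,
$$\int_{r/2}^{R} (R^2 - x^2)^{\alpha}\, dx = R^{2\alpha + 1} \sum_{k=0}^{\infty} \binom{\alpha}{k} \frac{(-1)^k}{2k + 1}\left[1 - \left(\frac{r}{2R}\right)^{2k+1}\right],$$
where $\binom{\alpha}{k}$ is the generalized binomial factor
$$\binom{\alpha}{k} = \frac{\alpha(\alpha - 1)\cdots(\alpha - k + 1)}{k!}$$
with $\alpha = \frac{n-1}{2}$.
Using the fact that $\int_0^{2R} f_{\|Z\|}(r)\, dr = 1$, we have
$$C_n = \frac{2n}{R^{2n}\, B\left(\frac{n+1}{2}, \frac{1}{2}\right)},$$
where $B$ is Euler’s Beta function [37].
This gives the explicit form of the density function $f_{\|Z\|}$ of the random variable $\|Z\|$ in dimension n (Figure 5a)
$$f_{\|Z\|}(r) = \frac{2n\, r^{n-1}}{R^{2n}\, B\left(\frac{n+1}{2}, \frac{1}{2}\right)} \int_{r/2}^{R} (R^2 - x^2)^{\frac{n-1}{2}}\, dx.$$
Therefore, the mean of $\|Z\|$ can be calculated using
$$E(\|Z\|) = \int_0^{2R} r\, f_{\|Z\|}(r)\, dr = \frac{n\, 2^{n+1}\, \Gamma\left(\frac{n}{2} + 1\right)^2}{(n + 1)\sqrt{\pi}\; \Gamma\left(n + \frac{3}{2}\right)}\, R,$$
where $\Gamma$ is Euler’s Gamma function.
Using Stirling’s approximation [35], we have $E(\|Z\|) \to \sqrt{2}\, R$ when $n \to \infty$.
We have then $E(\|Z\|^2) = \frac{2n}{n + 2} R^2 \to 2R^2$, which implies that $\mathrm{Var}(\|Z\|) \to 0$ (Figure 5b).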
As we can see with Equation (7),
, and it is well known that hyperspheres become "hollow" when the dimension is high enough [38], and most points in a hypersphere tend to agglomerate towards its hypersurface. This has a consequence on distances that is quite intuitive and explains why the variance of distances tends to zero when the dimension increases, i.e., the diversity of geometric configurations is increasingly limited as the dimension of the space increases. Our result shows how fast this phenomenon impacts the distances between points inside the hypersphere when the dimension increases.
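The concentration of distances in high dimension can be tabulated with a few lines of code. The sketch below is an illustration under stated assumptions: it uses a closed form for the mean distance in an n-dimensional ball of radius $R$, $\frac{n\, 2^{n+1}\, \Gamma(n/2 + 1)^2}{(n + 1)\sqrt{\pi}\, \Gamma(n + 3/2)} R$, and the second moment $\frac{2n}{n + 2} R^2$, both consistent with the classical values in dimensions 1 to 3. The function names are hypothetical, and logarithms of Gamma values are used only to avoid overflow for large n.

```python
import math


def mean_distance(n, radius=1.0):
    """Mean distance between two independent uniform points in an n-ball."""
    log_value = (math.log(n) - math.log(n + 1) + (n + 1) * math.log(2.0)
                 + 2.0 * math.lgamma(n / 2.0 + 1.0)
                 - 0.5 * math.log(math.pi)
                 - math.lgamma(n + 1.5))
    return radius * math.exp(log_value)


def second_moment(n, radius=1.0):
    """E(||X - Y||^2) = 2 E(||X||^2) = 2 n R^2 / (n + 2) for centered uniform X, Y."""
    return 2.0 * n * radius**2 / (n + 2)


def variance(n, radius=1.0):
    """Variance of the distance in dimension n."""
    return second_moment(n, radius) - mean_distance(n, radius) ** 2


for n in (1, 2, 3, 10, 100, 1000):
    print(n, mean_distance(n), variance(n))
```

As n grows, the printed means approach $\sqrt{2}$ (for a unit radius) and the variances shrink towards zero, which is the behavior described above.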
5. Conclusions
We have developed a general and unified method to obtain the distribution of distances between two points randomly selected in an iud cloud of points in a geometric figure. These distributions are useful, especially in spatial statistics, to know the statistical representativeness (the weight) of a distance between two points. In the case of an iud set of random points in a hypersphere, the expression of the density is given for any dimension, and the variance of these distributions converges to zero when the dimension increases. This result also opens new perspectives in multidimensional analysis and data mining.