1. Introduction
The now classic nearest neighbors classification algorithm, introduced in a 1951 technical report by Fix and Hodges (reprinted in [
1]), marked one of the early successes of machine learning research. The basic idea is that, given some notion of proximity between pairs of observations, the class of a new sample unit is determined by majority voting among its
k nearest neighbors in the training sample [
2,
3]. A natural question is whether it is possible to develop a probabilistic model that captures the essence of the mechanism contained in the classic nearest neighbors algorithm while adding proper uncertainty quantification of predictions made by the model. In a pioneering paper, Holmes and Adams [
4] defined a probabilistic nearest neighbors model by specifying a set of conditional distributions. A few years later, Cucala et al. [
5] pointed out the incompatibility of the conditional distributions specified by Holmes and Adams, which do not define a proper joint model distribution. As an alternative, Cucala et al. developed their own nearest neighbors classification model, directly defining a proper, Boltzmann-like joint distribution. A major difficulty with the Cucala et al. model is the fact that its likelihood function involves a seemingly intractable normalizing constant. Consequently, in order to perform a Bayesian analysis of their model, the authors engaged in a
tour de force of Monte Carlo techniques, with varied computational complexity and approximation quality.
In this paper, we introduce an alternative probabilistic nearest neighbors predictive model constructed from an aggregation of simpler models whose normalizing constants can be exactly summed in polynomial time. We begin by reviewing the Cucala et al. model in
Section 2, showing by an elementary argument that the computational complexity of the exact summation of its normalizing constant is directly tied to the concept of NP-completeness [
6]. The necessary concepts from the theory of computational complexity are briefly reviewed.
Section 3 introduces a family of nonlocal models, whose joint distributions take into account only the interactions between each sample unit and its
rth nearest neighbor. For each nonlocal model, we derive an analytic expression for its normalizing constant, which can be computed exactly in polynomial time. The nonlocal models are combined in
Section 4, yielding a predictive distribution that does not rely on costly Monte Carlo approximations. We run experiments with synthetic and real datasets, showing that our model matches the predictive performance of the Cucala et al. model at a more manageable computational cost. We present our conclusions in
Section 5.
2. A Case of Intractable Normalization
This section sets the environment for the general classification problem discussed in this paper. We begin in
Section 2.1 with the definition of the Cucala et al. Bayesian nearest neighbors classification model, whose normalizing constant requires an exponential number of operations for brute force calculation. We indicate the Monte Carlo techniques used by the authors to sample from the model posterior distribution, as well as the approximations made to circumvent the computational complexity issues.
Section 2.2 briefly reviews the fundamental concepts of the theory of computational complexity, ending with the characterization of NP-complete decision problems, which are considered intractable.
Section 2.3 establishes by an elementary argument a connection between the summation of the normalizing constant appearing in the likelihood of the Cucala et al. model and one of the classical NP-complete problems. In a nutshell, we show that the availability of an algorithm to exactly compute the normalizing constant of the Cucala et al. model in polynomial time on an ordinary computer would imply that all so-called NP problems could also be solved in polynomial time under equivalent conditions.
2.1. The Cucala et al. Model
Suppose that we have a training sample of size n such that for each sample unit we know the value of a vector ${x}_{i}\in {\mathbb{R}}^{p}$ of predictor variables and a response variable ${y}_{i}$ belonging to a set of class labels $\mathcal{L}=\{1,2,\cdots ,L\}$. Some notion of proximity between training sample units is given in terms of the corresponding vectors of predictors. For example, we may use the Euclidean distance between the vectors of predictors of every pair of training sample units to establish a notion of neighborhood in the training sample. Given this neighborhood structure, let the brackets ${\left[i\right]}_{r}$ denote the index of the sample unit in the training sample that is the rth nearest neighbor of the ith sample unit, for $i=1,\cdots ,n$, and $r=1,\cdots ,n-1$.
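For concreteness, the neighborhood brackets can be computed directly from pairwise Euclidean distances. The sketch below (plain Python; `neighbor_brackets` is our own illustrative name, not from the original papers) sorts, for each unit, all other units by distance, so that `bracket[i][r-1]` holds ${\left[i\right]}_{r}$:

```python
import math

def neighbor_brackets(x):
    """Compute [i]_r for all i and r = 1, ..., n-1 from Euclidean distances.

    x is a list of n points (tuples of floats). Returns a list `bracket`
    with bracket[i][r-1] = index of the r-th nearest neighbor of unit i.
    Ties in distance are broken by index order (an assumption of this sketch).
    """
    n = len(x)
    bracket = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Sort the other units by distance to unit i (then by index on ties).
        others.sort(key=lambda j: (math.dist(x[i], x[j]), j))
        bracket.append(others)
    return bracket

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
brackets = neighbor_brackets(points)
# For unit 0, the nearest neighbor is unit 1, then unit 2, then unit 3.
```

This quadratic-memory construction is enough for the toy examples used below; for large n one would only keep the first k columns.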
Introducing the notations
$x=({x}_{1},\cdots ,{x}_{n})$ and
$y=({y}_{1},\cdots ,{y}_{n})$, the Cucala et al. model [
5] is defined by the joint distribution
$$p(y\mid x,\beta ,k)=\frac{1}{Z(\beta ,k)}\exp\left(\frac{\beta}{k}\sum_{i=1}^{n}\sum_{r=1}^{k}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right),$$
in which
$\beta \ge 0$ and
$k=1,\cdots ,n-1$ are model parameters and
$\mathbb{I}(\cdot)$ denotes an indicator function. Notice that dependence on the predictors occurs through the neighborhood brackets
${\left[i\right]}_{r}$, which are determined by the
${x}_{i}$’s. The model normalizing constant is given by
$$Z(\beta ,k)=\sum_{y\in {\mathcal{L}}^{n}}\exp\left(\frac{\beta}{k}\sum_{i=1}^{n}\sum_{r=1}^{k}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right).$$
From this definition, we see that direct (brute force) summation of
$Z(\beta ,k)$ would involve an exponential number of operations (
$\mathcal{O}\left({L}^{n}\right)$). The much more subtle question about the possible existence of an algorithm that would allow us to exactly compute
$Z(\beta ,k)$ in polynomial time is addressed in
Section 2.3.
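To make the $\mathcal{O}\left({L}^{n}\right)$ growth concrete, a brute-force evaluation of $Z(\beta ,k)$ can be sketched as follows (illustrative Python, feasible only for very small n; the helper name and the toy neighborhood are our own):

```python
import itertools, math

def z_brute_force(brackets, n_labels, beta, k):
    """Brute-force Z(beta, k): sum over all n_labels**n label vectors y of
    exp((beta/k) * sum_{i} sum_{r=1..k} I(y[i] == y[[i]_r]]),
    where brackets[i][r-1] holds the neighborhood bracket [i]_r."""
    n = len(brackets)
    total = 0.0
    for y in itertools.product(range(n_labels), repeat=n):  # L**n terms
        agreements = sum(
            1
            for i in range(n)
            for r in range(k)
            if y[i] == y[brackets[i][r]]
        )
        total += math.exp(beta / k * agreements)
    return total

# Toy neighborhood: 4 units on a line at positions 0, 1, 2, 3.
brackets = [[1, 2, 3], [0, 2, 3], [1, 3, 0], [2, 1, 0]]
z = z_brute_force(brackets, n_labels=2, beta=1.0, k=1)
# Sanity property: with beta = 0 every term equals 1, so the sum is L**n.
```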
In their paper [
5], the authors relied on a series of techniques to implement Markov Chain Monte Carlo (MCMC) frameworks in the presence of the seemingly intractable model normalizing constant
$Z(\beta ,k)$. They developed solutions based on pseudolikelihood [
7], path sampling [
8,
9] (which essentially approximates
$Z(\beta ,k)$ using a computationally intensive process, for each value of the pair
$(\beta ,k)$ appearing in the iterations of the underlying MCMC procedure) and the Møller et al. auxiliary variable method [
10]. Although there is currently no publicly available source code for further experimentation, at the end of Section 3.4 the authors report computation times ranging from twenty minutes to more than one week for the different methods using compiled code. We refer the reader to [
5] for the technical details.
2.2. Computational Complexity
Informally, by a deterministic computer we mean a device or process that executes the instructions in a given algorithm one at a time in a single-threaded fashion. A decision problem is one whose computation ends with a “yes” or “no” output after a certain number of steps, which is referred to as the running time of the algorithm. The class of all decision problems, with the input size measured by a positive integer n, for which there exists an algorithm whose running time on a deterministic computer is bounded by a polynomial in n, is denoted by P. We think of P as the class of computationally “easy” or tractable decision problems. Notable P problems are the decision version of linear programming and the problem of determining whether a number is prime.
A nondeterministic computer is an idealized device whose programs are allowed to branch the computation at each step into an arbitrary number of parallel threads. The class of nondeterministic polynomial (NP) decision problems contains all decision problems for which there is an algorithm or program that runs in polynomial time on a nondeterministic computer. An alternative and equivalent view is that NP problems are those decision problems whose solution is difficult to find but easy to verify. We think of NP problems as the class of computationally “hard” or intractable decision problems. Notable NP problems are the Boolean satisfiability (SAT) problem and the travelling salesman problem. Every problem in P is obviously in NP. In principle, for any NP problem, it could be possible to find an algorithm solving the problem in polynomial time on a deterministic computer. However, a proof for a single NP problem that there is no algorithm running on a deterministic computer that could solve it in polynomial time would establish that the classes P and NP are not equal. The problem of whether P is or is not equal to NP is the most famous open question of theoretical computer science.
Two given decision problems can be connected by the device of polynomial reduction. Informally, suppose that there is a subroutine that solves the second problem. We say that the first problem is polynomial-time reducible to the second if both the time required to transform instances of the first problem into instances of the second and the number of times the subroutine is called are bounded by a polynomial in n.
In 1971, Stephen Cook [
11] proved that all NP problems are polynomial-time reducible to SAT, meaning that (1) no NP problem is harder than SAT and (2) a polynomial time algorithm that solves SAT on a deterministic computer would give a polynomial time algorithm solving every other NP problem on a deterministic computer, ultimately implying that P is equal to NP. In general terms, a problem is said to be NP-complete if it is NP and all other NP problems can be polynomial-time reduced to it, and SAT was the first ever problem proven to be NP-complete. In a sense, each NP-complete problem encodes the quintessence of intractability.
2.3. $Z(\beta ,k)$ and NP-Completeness
Let
$G=(V,E)$ be an undirected graph, in which
V is a set of vertices and
E is a set of edges
$e=\{v,{v}^{\prime}\}$, with
$v,{v}^{\prime}\in V$. Given a function
$w:E\to {\mathbb{Z}}_{+}$, we refer to
$w\left(e\right)$ as the weight of the edge
$e\in E$. A cut of
G is a partition of
V into disjoint sets
${V}_{0}$ and
${V}_{1}$. The size of the cut is the sum of the weights of the edges in
E with one endpoint in
${V}_{0}$ and one endpoint in
${V}_{1}$. The decision problem known as the
maximum cut can be stated as follows: for a given integer
m, is there a cut of
G with size at least
m? Karp [
12] proved that the general maximum cut problem is NP-complete. In what follows, we point to an elementary link between the exact summation of the normalizing constant
$Z(\beta ,k)$ of the Cucala et al. model and the decision of an associated maximum cut problem.
Without loss of generality, suppose that we are dealing with a binary classification problem in which the response variable
${y}_{i}\in \{0,1\}$, for
$i=1,\cdots ,n$. Define the
$n\times n$ matrix
$A=\left({a}_{ij}\right)$ by
${a}_{ij}=1$ if
j is one of the
k nearest neighbors of
i, and
${a}_{ij}=0$ otherwise. Letting
$B=\left({b}_{ij}\right)=A+{A}^{\top}$, we obtain the adjacency matrix of a weighted undirected graph
G, whose vertices represent the training sample units, and the edges connecting these vertices may have weights zero, one, or two, based on whether the corresponding training sample units do not belong to each other’s
$k$-neighborhoods, just one belongs to the other’s
$k$-neighborhood, or both are part of each other’s
$k$-neighborhoods, respectively. The double sum in the exponent of
$Z(\beta ,k)$ can be rewritten as
$$T(y):=\sum_{i=1}^{n}\sum_{r=1}^{k}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})=\frac{1}{2}\sum_{i,j=1}^{n}{b}_{ij}\,\mathbb{I}({y}_{i}={y}_{j}),$$
for every
$y\in {\{0,1\}}^{n}$. Furthermore, each
$y\in {\{0,1\}}^{n}$ corresponds to a cut of the graph
G if we define the disjoint sets of vertices
${V}_{0}=\{i\in V:{y}_{i}=0\}$ and
${V}_{1}=\{i\in V:{y}_{i}=1\}$. The respective cut size is given by:
$$C(y)=\frac{1}{2}\sum_{i,j=1}^{n}{b}_{ij}\,\mathbb{I}({y}_{i}\ne {y}_{j}).$$
Since for every
$y\in {\{0,1\}}^{n}$ we have that
$$\mathbb{I}({y}_{i}={y}_{j})+\mathbb{I}({y}_{i}\ne {y}_{j})=1,\quad \text{for all}\ i,j,$$
it follows that
$$C(y)=\frac{1}{2}\sum_{i,j=1}^{n}{b}_{ij}-T(y).\qquad (\ast )$$
Figure 1 gives an example for a specific neighborhood structure involving the three nearest neighbors with respect to Euclidean distance.
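The identity $(\ast )$ can also be checked numerically on a toy neighborhood structure. The sketch below (illustrative Python; the helper name is our own) builds $B=A+{A}^{\top}$ and verifies that $T(y)$ equals $\frac{1}{2}{\sum}_{i,j}{b}_{ij}$ minus the cut size for every $y\in {\{0,1\}}^{n}$:

```python
import itertools

def check_cut_identity(brackets, k):
    """Verify T(y) = (1/2) sum_ij b_ij - cut_size(y) for all y in {0,1}^n,
    where B = A + A^T and A is the k-nearest-neighbor adjacency matrix."""
    n = len(brackets)
    # Build A (a_ij = 1 if j is one of the k nearest neighbors of i).
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for r in range(k):
            A[i][brackets[i][r]] = 1
    B = [[A[i][j] + A[j][i] for j in range(n)] for i in range(n)]
    half_total_weight = sum(map(sum, B)) / 2  # equals n * k
    for y in itertools.product((0, 1), repeat=n):
        # T(y): number of (i, r) agreements in the double sum.
        T = sum(1 for i in range(n) for r in range(k) if y[i] == y[brackets[i][r]])
        # Cut size: half the symmetric sum of b_ij over disagreeing pairs.
        cut = sum(B[i][j] for i in range(n) for j in range(n) if y[i] != y[j]) / 2
        assert T == half_total_weight - cut
    return True

brackets = [[1, 2, 3], [0, 2, 3], [1, 3, 0], [2, 1, 0]]
check_cut_identity(brackets, k=2)
```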
By grouping each possible value of
$T\left(y\right)$ in the sum over
$y\in {\{0,1\}}^{n}$ appearing in the definition of
$Z(\beta ,k)$, we obtain the alternative polynomial representation
$$Z(\beta ,k)=\sum_{t=0}^{nk}{d}_{t}\,{z}^{t},$$
in which
$z={e}^{\beta /k}$ and
${d}_{t}={\sum}_{y\in {\{0,1\}}^{n}}\mathbb{I}(T\left(y\right)=t)$, for
$t=0,1,\cdots ,nk$. Note that
${d}_{t}$ is the number of vectors
$y\in {\{0,1\}}^{n}$ such that
$T\left(y\right)=t$, and from
$(\ast )$ we have that
${d}_{t}$ is also the number of possible cuts of the graph
G with size
$\left(\frac{1}{2}{\sum}_{i,j=1}^{n}{b}_{ij}\right)-t=nk-t$.
Suppose that we have found a way to sum
$Z(\beta ,k)$ in polynomial time on a deterministic computer, for every possible value of
$\beta $ and
k and any specified neighborhood structure. By polynomial interpolation (see [
13]), we would be able to compute the value of each coefficient
${d}_{t}$ in polynomial time, thus determining the number of cuts of
G with all possible sizes, thereby solving any maximum cut decision problem associated with the graph
G. In other words: the existence of a polynomial time algorithm to sum
$Z(\beta ,k)$ for an arbitrary neighborhood structure on a deterministic computer would imply that P is equal to NP.
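The interpolation argument can be illustrated in miniature: treating $Z(\beta ,k)$ as the polynomial ${\sum}_{t}{d}_{t}{z}^{t}$ in $z={e}^{\beta /k}$, we evaluate it at $nk+1$ distinct points and solve the resulting Vandermonde system exactly. The sketch below (illustrative Python, exact rational arithmetic via `fractions`; names are our own) recovers the coefficients ${d}_{t}$ for a tiny example. Of course, the “oracle” here is itself a brute-force summation, so nothing in this toy contradicts the hardness argument:

```python
import itertools
from fractions import Fraction

def z_poly(brackets, k, z):
    """Z as a polynomial in z = e^{beta/k}: sum over y in {0,1}^n of z**T(y)."""
    n = len(brackets)
    return sum(
        z ** sum(1 for i in range(n) for r in range(k) if y[i] == y[brackets[i][r]])
        for y in itertools.product((0, 1), repeat=n)
    )

def recover_coefficients(brackets, k):
    """Recover d_t, t = 0..n*k, by evaluating Z at n*k + 1 distinct points and
    solving the Vandermonde system by Gauss-Jordan elimination over Q."""
    n = len(brackets)
    deg = n * k
    zs = [Fraction(z) for z in range(1, deg + 2)]          # distinct nodes
    rows = [[z ** t for t in range(deg + 1)] + [z_poly(brackets, k, z)] for z in zs]
    for col in range(deg + 1):
        pivot = next(r for r in range(col, deg + 1) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(deg + 1):
            if r != col and rows[r][col] != 0:
                rows[r] = [a - rows[r][col] * b for a, b in zip(rows[r], rows[col])]
    return [rows[t][-1] for t in range(deg + 1)]

brackets = [[1, 2], [0, 2], [1, 0]]   # 3 units, toy neighborhood, k = 1
d = recover_coefficients(brackets, k=1)
# sum(d) must equal 2**n = 8, since every y is counted exactly once.
```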
3. Nonlocal Models Are Tractable
This section introduces a family of models that are related to the Cucala et al. model but differ in two significant ways. First, making use of a physical analogy, while the likelihood function of the Cucala et al. model is such that each sampling unit “interacts” with all of its
k nearest neighbors, for the models introduced in this section each sampling unit interacts only with its
rth nearest neighbor, for some
$r=1,\cdots ,n-1$. Keeping up with the physical analogy, we say that we have a family of nonlocal models (for the sake of simplicity, we are abusing terminology a little bit here, since the model with
$r=1$ is perfectly “local”). Second, the normalizing constants of the nonlocal models are tractable; the main result of this section is an explicit analytic expression for the normalizing constant of a nonlocal model that is computable in polynomial time. The purpose of these nonlocal models is to work as building blocks for our final aggregated probabilistic predictive model in
Section 4.
For
$r=1,\cdots ,n-1$, the likelihood of the
rth nonlocal model is defined as
$${p}^{(r)}(y\mid x,{\beta}_{r})=\frac{1}{{Z}_{r}({\beta}_{r})}\exp\left({\beta}_{r}\sum_{i=1}^{n}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right),$$
in which the normalizing constant is given by
$${Z}_{r}({\beta}_{r})=\sum_{y\in {\mathcal{L}}^{n}}\exp\left({\beta}_{r}\sum_{i=1}^{n}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right),$$
with parameter
${\beta}_{r}\ge 0$.
In line with what was pointed out in our discussion of the normalizing constant $Z(\beta ,k)$ of the Cucala et al. model, brute force computation of ${Z}_{r}\left({\beta}_{r}\right)$ is also hopeless for the nonlocal models, requiring the summation of an exponential number of terms ($\mathcal{O}\left({L}^{n}\right)$). However, the much simpler topology associated with the neighborhood structure of a nonlocal model can be exploited to give us a path to sum ${Z}_{r}\left({\beta}_{r}\right)$ analytically, resulting in an expression that can be computed exactly in polynomial time on an ordinary computer.
Throughout the remainder of this section, our goal is to derive a tractable closed form for the normalizing constant ${Z}_{r}\left({\beta}_{r}\right)$. For the rth nonlocal model, consider the directed graph $G=(V,E)$ representing the associated neighborhood structure of a given training sample. For $i=1,\cdots ,n$, each vertex $i\in V$ corresponds to one training sample unit, and the existence of an oriented edge $(i,j)\in E$, represented pictorially by an arrow pointing from i to j, means that the jth sample unit is the rth nearest neighbor of the ith sample unit.
An example is given in
Figure 2 for the nonlocal models with
$r=1$ and
$r=2$. We see that in general
G can be decomposed into totally disconnected subgraphs
${G}^{\prime}=({V}^{\prime},{E}^{\prime}),{G}^{\prime \prime}=({V}^{\prime \prime},{E}^{\prime \prime}),\cdots $, meaning that vertices in one subgraph have no arrows pointing to vertices in the other subgraphs. If
${V}^{\prime}=\{{i}_{1},\cdots ,{i}_{k}\}$, we use the following notation for the multiple sum over the corresponding labels:
$$\sum_{{y}_{{V}^{\prime}}}:=\sum_{{y}_{{i}_{1}}\in \mathcal{L}}\cdots \sum_{{y}_{{i}_{k}}\in \mathcal{L}}.$$
Since
$$\sum_{i=1}^{n}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})=\sum_{i\in {V}^{\prime}}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})+\sum_{i\in {V}^{\prime \prime}}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})+\cdots ,$$
the normalizing constant
${Z}_{r}\left({\beta}_{r}\right)$ can be factored as a product of summations involving only the
${y}_{i}$’s associated with each subgraph:
$${Z}_{r}({\beta}_{r})=\left(\sum_{{y}_{{V}^{\prime}}}\exp\left({\beta}_{r}\sum_{i\in {V}^{\prime}}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right)\right)\times \left(\sum_{{y}_{{V}^{\prime \prime}}}\exp\left({\beta}_{r}\sum_{i\in {V}^{\prime \prime}}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right)\right)\times \cdots .$$
For each subgraph, starting at some vertex and following the arrows pointing to each subsequent vertex, if we return to the first vertex after
m steps, we say that the subgraph has a simple cycle of size
m. The outdegree of a vertex is the number of arrows pointing from it to other vertices; the indegree of a vertex is defined analogously.
Figure 3 depicts the fact that each subgraph has exactly one simple cycle: in a subgraph without simple cycles, there would be at least one vertex with outdegree equal to zero. Moreover, a subgraph with more than one simple cycle would have at least one vertex in one of the simple cycles pointing to a vertex in another simple cycle, implying that such a vertex would have outdegree equal to two. Both cases contradict the fact that every vertex of each subgraph has outdegree equal to one, since each sample unit has exactly one
rth nearest neighbor.
Figure 4 portrays the reduction process used to perform the summations for one subgraph. For each vertex with indegree equal to zero, we sum over the correspondent
${y}_{i}$ and remove the vertex from the graph. We repeat this process until we are left with a summation over the vertices forming the simple cycle. The summation for each vertex
i with indegree equal to zero in this reduction process gives the factor
$$\sum_{{y}_{i}\in \mathcal{L}}\exp\left({\beta}_{r}\mathbb{I}({y}_{i}={y}_{{\left[i\right]}_{r}})\right)={e}^{{\beta}_{r}}+L-1,$$
because—and this is a crucial aspect of the reduction process—in this sum the indicator is equal to one for a single term, and it is equal to zero for all the remaining
$L-1$ terms,
whatever the value of ${y}_{{\left[i\right]}_{r}}$. Summation over the vertices forming the simple cycle is performed as follows. Relabeling the indexes of the sample units if necessary, suppose that the vertices forming a simple cycle of size
m are labeled as
$1,2,\cdots ,m$. Defining the matrix
$S=\left({s}_{a,b}\right)$ by
${s}_{a,b}=\exp\left({\beta}_{r}\mathbb{I}(a=b)\right)$, we have
$$\sum_{{y}_{1}\in \mathcal{L}}\cdots \sum_{{y}_{m}\in \mathcal{L}}\,\prod_{a=1}^{m}\exp\left({\beta}_{r}\mathbb{I}({y}_{a}={y}_{a+1})\right)=\mathrm{Tr}\left({S}^{m}\right),$$
with the convention that ${y}_{m+1}={y}_{1}$.
By the spectral decomposition [
14], we have that
$S=Q\mathsf{\Lambda}{Q}^{\top}$, with
$Q{Q}^{\top}={Q}^{\top}Q=I$. Therefore,
${S}^{m}=Q{\mathsf{\Lambda}}^{m}{Q}^{\top}$, implying that
$\mathrm{Tr}\left({S}^{m}\right)=\mathrm{Tr}\left({\mathsf{\Lambda}}^{m}{Q}^{\top}Q\right)=\mathrm{Tr}\left({\mathsf{\Lambda}}^{m}\right)={\sum}_{\ell =1}^{L}{\lambda}_{\ell}^{m}$, in which we used the cyclic property of the trace, and the
${\lambda}_{\ell}$’s are the eigenvalues of
S, which are easy to compute: the characteristic polynomial of
S is
$$\det(S-\lambda I)={\left({e}^{{\beta}_{r}}-1-\lambda \right)}^{L-1}\left({e}^{{\beta}_{r}}+L-1-\lambda \right),$$
yielding
$${\lambda}_{1}={e}^{{\beta}_{r}}+L-1,\qquad {\lambda}_{2}=\cdots ={\lambda}_{L}={e}^{{\beta}_{r}}-1.$$
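The spectral claims are easy to double-check numerically, since $S=\left({e}^{{\beta}_{r}}-1\right)I+J$, with J the all-ones matrix. The sketch below (plain Python, our own helper names) compares $\mathrm{Tr}\left({S}^{m}\right)$ computed by repeated matrix multiplication with the closed form obtained from the eigenvalues:

```python
import math

def trace_S_power(L, beta, m):
    """Compute Tr(S^m) for S with entries s_ab = exp(beta * I(a == b))
    by repeated matrix multiplication (no linear-algebra library needed)."""
    S = [[math.exp(beta) if a == b else 1.0 for b in range(L)] for a in range(L)]
    P = S
    for _ in range(m - 1):
        P = [[sum(P[a][c] * S[c][b] for c in range(L)) for b in range(L)]
             for a in range(L)]
    return sum(P[a][a] for a in range(L))

def trace_closed_form(L, beta, m):
    """Closed form from the eigenvalues: (e^b + L - 1)^m + (L - 1)(e^b - 1)^m."""
    return (math.exp(beta) + L - 1) ** m + (L - 1) * (math.exp(beta) - 1) ** m

# Agreement for, e.g., L = 3 labels, beta = 0.7, cycle size m = 5:
assert abs(trace_S_power(3, 0.7, 5) - trace_closed_form(3, 0.7, 5)) < 1e-6
```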
For the
rth nonlocal model, let
${c}_{m}^{\left(r\right)}$ be the number of simple cycles of size
m, considering all the associated subgraphs. Algorithm 1 shows how to compute
${c}_{m}^{\left(r\right)}$, for
$r=1,\cdots ,n-1$ and
$m=2,\cdots ,n$, in polynomial time. Taking into account all the subgraphs, and multiplying all the factors, we arrive at the final expression:
$${Z}_{r}({\beta}_{r})={\left({e}^{{\beta}_{r}}+L-1\right)}^{\,n-{\sum}_{m=2}^{n}m\,{c}_{m}^{(r)}}\;\prod_{m=2}^{n}{\left[{\left({e}^{{\beta}_{r}}+L-1\right)}^{m}+(L-1){\left({e}^{{\beta}_{r}}-1\right)}^{m}\right]}^{{c}_{m}^{(r)}}.$$
Algorithm 1 Count the occurrences of simple cycles of different sizes on the directed subgraphs representing the neighborhood structures of all nonlocal models
Require: Neighborhood brackets $\{{\left[i\right]}_{r}:i=1,\cdots ,n;r=1,\cdots ,n-1\}$.
1: function count_simple_cycles($\{{\left[i\right]}_{r}:i=1,\cdots ,n;r=1,\cdots ,n-1\}$)
2:   ${c}_{m}^{\left(r\right)}\leftarrow 0$ for $(r,m)\in \{1,\cdots ,n-1\}\times \{2,\cdots ,n\}$
3:   for $r\leftarrow 1$ to $n-1$ do
4:     $visited\leftarrow \varnothing$
5:     for $j\leftarrow 1$ to $n$ do
6:       if $j\in visited$ then continue to the next $j$
7:       $i\leftarrow j$
8:       $walk\leftarrow$ empty stack
9:       while $i\notin visited$ do
10:        $visited\leftarrow visited\cup \left\{i\right\}$
11:        push $i$ into $walk$
12:        $i\leftarrow {\left[i\right]}_{r}$
13:      end while
14:      $m\leftarrow 2$
15:      while $walk$ not empty do
16:        delete top element from $walk$
17:        if $walk$ not empty and top element of $walk$ equals $i$ then
18:          ${c}_{m}^{\left(r\right)}\leftarrow {c}_{m}^{\left(r\right)}+1$
19:          break
20:        end if
21:        $m\leftarrow m+1$
22:      end while
23:    end for
24:  end for
25:  return $\{{c}_{m}^{\left(r\right)}:r=1,\cdots ,n-1;m=2,\cdots ,n\}$
26: end function
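A compact Python transcription of the idea behind Algorithm 1 is given below for a single fixed r (looping over $r=1,\cdots ,n-1$ recovers the full algorithm); the function and variable names are our own. It also checks the analytic expression for ${Z}_{r}\left({\beta}_{r}\right)$, as derived above from the off-cycle factors and the cycle traces, against brute-force summation on a tiny example:

```python
import itertools, math
from collections import Counter

def count_simple_cycles(succ):
    """Count simple-cycle sizes in the functional graph i -> succ[i]
    (succ[i] plays the role of [i]_r for one fixed r). Returns a Counter
    mapping cycle size m to the number c_m of simple cycles of that size."""
    counts = Counter()
    visited = set()
    for j in range(len(succ)):
        if j in visited:
            continue
        walk, i = [], j
        while i not in visited:
            visited.add(i)
            walk.append(i)
            i = succ[i]
        if i in walk:                      # a new cycle was closed this round
            counts[len(walk) - walk.index(i)] += 1
    return counts

def z_r_closed_form(succ, n_labels, beta):
    """Z_r via the analytic expression: one factor e^b + L - 1 per off-cycle
    vertex, and Tr(S^m) = (e^b + L - 1)^m + (L - 1)(e^b - 1)^m per cycle."""
    n, L, eb = len(succ), n_labels, math.exp(beta)
    counts = count_simple_cycles(succ)
    z = (eb + L - 1) ** (n - sum(m * c for m, c in counts.items()))
    for m, c in counts.items():
        z *= ((eb + L - 1) ** m + (L - 1) * (eb - 1) ** m) ** c
    return z

def z_r_brute_force(succ, n_labels, beta):
    """Direct O(L^n) summation, for validating the closed form on tiny n."""
    n = len(succ)
    return sum(
        math.exp(beta * sum(1 for i in range(n) if y[i] == y[succ[i]]))
        for y in itertools.product(range(n_labels), repeat=n)
    )

succ = [1, 2, 0, 0, 3]       # cycle 0 -> 1 -> 2 -> 0, plus a tail 4 -> 3 -> 0
assert abs(z_r_closed_form(succ, 3, 0.5) - z_r_brute_force(succ, 3, 0.5)) < 1e-6
```

Instead of popping a stack, this version locates the closing vertex with `walk.index(i)`, which gives the cycle size directly as the number of walk entries from that vertex onward.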