## 1. Introduction

High-dimensional models with correlated predictors are common in practice. Most statistical models work well either in the low-dimensional correlated case or in the high-dimensional independent case; the few methods that handle high-dimensional correlated predictors usually have limited theoretical and practical capacity. Neural networks have been applied in practice for years and perform well under correlated predictors, because the activation functions and the nodes in the hidden layers bring in non-linearity and interactions. A universal approximation theorem guarantees that a single-layer artificial neural network can approximate any continuous function with an arbitrarily small approximation error, provided the architecture has a large enough number of hidden nodes. Thus, the artificial neural network (ANN) handles correlation and interactions automatically and implicitly. A popular machine learning application with this type of dependency is spatio-temporal data, where traditional statistical methods model the spatial covariance matrix of the predictors; with artificial neural networks, working with this big covariance matrix can be avoided. Moreover, artificial neural networks also perform well in computer vision tasks in practice.

A main drawback of neural networks is that they require a huge training sample because of the large number of inherent parameters. In some application fields, such as clinical trials, brain imaging data analysis and some computer vision applications, it is usually hard to obtain that many observations in the training sample. Thus, there is a need to develop high-dimensional neural networks with regularization or dimension reduction techniques. It is known that ${l}_{1}$ norm regularization [26] shrinks insignificant parameters to zero. Commonly used regularizations include ${l}_{p}$ norm regularization; see, for example, the keras package [6]. ${l}_{p}$ norm regularization with $p\ge 2$ controls the model sensitivity [15]. On the other hand, ${l}_{p}$ norm regularization with $p<2$, where one usually takes $p=1$ for computational efficiency, does not encourage group information. The group lasso regularization [27] yields group-wise sparseness while keeping parameters dense within the groups. A common regularization used in high-dimensional artificial neural networks is the sparse group lasso of [21], see for example [11], which is a weighted combination of the lasso regularization and the group lasso regularization. The group lasso part penalizes the input features’ weights group-wise: a feature is either selected or dropped, and it is connected to all nodes in the hidden layer if selected. The lasso part further shrinks some weights of the selected input features to zero: a feature need not be connected to all nodes in the hidden layer when selected. This penalization encourages as many zero weights as possible. Another common way to overcome the high dimensionality is to add dropout layers [23]: randomly setting parameters in the later layers to zero keeps fewer non-zero estimates and reduces the variance. Dropout has been shown to work well in practice, but no theoretical guarantee is available. [17] considers a deep network that combines regularization in the first layer with dropout in the other layers. With a deep representation, neural networks have more approximation power, which works well in practice, and they propose a fast and stable algorithm to train the deep network. However, no theoretical guarantee is given for the proposed method other than practical examples.
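For intuition, the dropout mechanism itself is only a few lines. The numpy sketch below implements the standard "inverted dropout" convention (rescaling the surviving activations at training time), which is one common variant; the function name and interface are ours, not from the paper or any particular library:

```python
import numpy as np

def dropout_forward(h, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the survivors by 1/(1 - rate), so the expected value
    of each activation is unchanged.  At test time, return h as-is."""
    if not training or rate == 0.0:
        return h
    mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones((4, 5))            # a toy batch of hidden activations
out = dropout_forward(h, rate=0.5, rng=rng)
# Each entry of `out` is either 0.0 or 2.0; averaged over many masks,
# the expectation equals h.
```

At prediction time (`training=False`) the layer is the identity, which is why the rescaling is done during training rather than at test time.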

On the other hand, though widely used, high-dimensional artificial neural networks still do not have a solid theoretical foundation for statistical validation, especially in the case of classification. Typical theory for low-dimensional ANNs traces back to the 1990s, including [1, 2, 8, 25]. The existing results include the universal approximation capability of single-layer neural networks and the estimation and classification consistency under the Gaussian assumption and the 0-1 loss in the low-dimensional case. These theories assume the 0-1 loss, which is rarely used nowadays, and are not sufficient for the high-dimensional case considered here. Current works focus more on developing new computing algorithms than on building theoretical foundations, or provide only limited ones. [11] derived a convergence rate of the log-likelihood function in the neural network model, but this does not guarantee universal classification consistency or the convergence of the classification risk; the convergence of the log-likelihood function is necessary but not sufficient for the classification risk to converge. In this paper, we obtain consistency results for the classification risk of high-dimensional artificial neural networks. We derive the convergence rate of the prediction error, and prove that, under mild conditions, the classification risk of a high-dimensional artificial neural network classifier actually tends to the optimal Bayes classifier’s risk. This type of property has been established for other classifiers, such as KNN [7], where the classification risk of KNN is shown to tend to the Bayes risk, and LDA [28], where the classification error rate is derived under Gaussian assumptions. Popular tasks, such as analyzing MRI data and computer vision models, were also included in these research papers, and we apply the high-dimensional neural network to these demanding tasks as well.

In Section 2, we formulate the problem and the high-dimensional neural network formally. In Section 3, we state the assumptions and the main consistency result. In Section 5, we apply the high-dimensional neural network to three different kinds of examples: gene data, MRI data and computer vision data. In Section 6, further ideas are discussed.

## 2. The Binary Classification Problem

Consider the binary classification problem

$$P(Y=1\mid \mathit{X}=\mathit{x})=f\left(\mathit{x}\right),$$

where $\mathit{x}\in {\mathbb{R}}^{p}$ is the feature vector drawn from the feature space according to some distribution ${P}_{\mathit{X}}$, and $f(\cdot ):{\mathbb{R}}^{p}\to \mathbb{R}$ is some continuous function. Note that, in the function $f\left(\mathit{x}\right)$, there can be any interactions among the predictors in $\mathit{x}$, which makes it possible to handle correlated predictors. Let ${P}_{\mathit{X},Y}$ be the joint distribution of $(\mathit{X},Y)$, where $\mathit{X}\in {\mathbb{R}}^{p}$ and $Y\in \{0,1\}$. Here $p$ could be large, possibly even larger than the training sample size $n$. To study the theory, we assume $p$ has some relationship with $n$, for example, $p=O\left(\mathrm{exp}\left(n\right)\right)$. Therefore, $p$ should be written as ${p}_{n}$, which indicates the dependency; however, for simplicity, we suppress the notation ${p}_{n}$ and denote it by $p$.

For a new observation ${\mathit{x}}_{0}\in {\mathbb{R}}^{p}$, the Bayes classifier, denoted ${C}^{*}\left(\mathit{X}\right)$, predicts 1 if $f\left({\mathit{x}}_{0}\right)\ge {p}_{s}$ and 0 otherwise, where ${p}_{s}\in (0,1)$ is a probability threshold, usually chosen as $1/2$ in practice. The Bayes classifier is proved to minimize the risk

$$R\left(C\right)=P\left(C\left(\mathit{X}\right)\ne Y\right).$$

However, the Bayes classifier is not useful in practice, since $f\left(\mathit{x}\right)$ is unknown. Thus a classifier has to be found based on the observations $\{({\mathit{x}}_{1},{y}_{1}),\dots ,({\mathit{x}}_{n},{y}_{n})\}$, which are drawn from ${P}_{\mathit{X},Y}$. A good classifier based on the sample should have its risk tend to the Bayes risk as the number of observations tends to infinity, without any requirement on the underlying probability distribution. This is the so-called universal consistency.
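A small simulation illustrates these definitions. The toy conditional probability $f$ below (a logistic function with an interaction term) is ours, chosen only for illustration; with $f$ known, thresholding it at ${p}_{s}$ gives the Bayes rule, and the empirical 0-1 risk estimates the Bayes risk:

```python
import numpy as np

def bayes_classify(f_x, p_s=0.5):
    """Bayes rule: predict 1 when the conditional probability
    f(x) = P(Y = 1 | X = x) is at least the threshold p_s."""
    return (f_x >= p_s).astype(int)

def empirical_risk(y_pred, y_true):
    """Empirical 0-1 risk: the fraction of misclassified observations."""
    return float(np.mean(y_pred != y_true))

rng = np.random.default_rng(1)
x = rng.normal(size=(10000, 2))
# Toy f(x) with an interaction between the two predictors (illustrative).
f_x = 1.0 / (1.0 + np.exp(-(x[:, 0] + 0.5 * x[:, 0] * x[:, 1])))
y = (rng.random(10000) < f_x).astype(int)   # draw Y | X from f(x)
risk = empirical_risk(bayes_classify(f_x), y)  # estimates the Bayes risk
```

Any classifier built from the sample alone cannot beat this risk asymptotically; universal consistency asks that its risk approach it.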

Multiple methods have been adopted to estimate $f\left(\mathit{x}\right)$, including logistic regression (a linear approximation), generalized additive models (GAM, a non-parametric nonlinear approximation that does not take interactions into account), neural networks (a complicated structure that is dense in the space of continuous functions), etc. The first two methods usually work well in practice and have a good theoretical foundation; however, they sometimes fail to capture the complicated dependency among the feature vector $\mathit{x}$ in a wide range of applications (brain images, computer vision and spatial data analysis). The neural network structure is proved to be able to capture this dependency implicitly, without explicitly specifying dependency hyper-parameters. Consider a single-layer neural network model with $p$ predictor variables. The hidden layer has ${m}_{n}$ nodes, where ${m}_{n}$ may be a diverging sequence depending on $n$. Similar to ${p}_{n}$, we suppress ${m}_{n}$ as $m$. A diagram is shown in Figure 1.

For an input vector $\mathit{x}\in {\mathbb{R}}^{p}$, a weight matrix $\mathit{\theta}\in {\mathbb{R}}^{p\times m}$ and a hidden-layer intercept vector $\mathit{t}\in {\mathbb{R}}^{m}$, let the vector $\mathit{\xi}\in {\mathbb{R}}^{m}$ collect the corresponding values in the hidden nodes, defined as

$$\mathit{\xi}={\mathit{\theta}}^{T}\mathit{x}+\mathit{t}.$$

Let $\psi (\cdot )$ be an activation function; then the output for a given set of node weights $\mathit{\beta}$ and intercept $b$ is calculated by

$$\eta ={\mathit{\beta}}^{T}\mathit{\psi}\left(\mathit{\xi}\right)+b,$$

where the function $\mathit{\psi}(\cdot )$ is the function $\psi (\cdot )$ applied element-wise. We have a wide range of choices for the activation function. [16] proved that as long as the activation is not an algebraic polynomial, the single-layer neural network is dense in the space of continuous functions, and can thus be used to approximate any continuous function. This structure can be considered as a model which, for a given activation function $\psi (\cdot )$, maps a $p\times 1$ input vector to a real-valued output

$${\eta}_{(\mathit{\theta},\mathit{t},\mathit{\beta},b)}\left(\mathit{x}\right)={\mathit{\beta}}^{T}\mathit{\psi}\left({\mathit{\theta}}^{T}\mathit{x}+\mathit{t}\right)+b,$$

where ${\eta}_{(\mathit{\theta},\mathit{t},\mathit{\beta},b)}\left(\mathit{x}\right)\in \mathbb{R}$ is the output of the single-hidden-layer neural network with parameters $(\mathit{\theta},\mathit{t},\mathit{\beta},b)$. Applying the logistic function $\sigma (\cdot )$, we use $\sigma \left({\eta}_{(\mathit{\theta},\mathit{t},\mathit{\beta},b)}\left(\mathit{x}\right)\right)\in (0,1)$ as an approximation of $f\left(\mathit{x}\right)$ with parameters $(\mathit{\theta},\mathit{t},\mathit{\beta},b)$, where $\sigma (\cdot )=\mathrm{exp}(\cdot )/[1+\mathrm{exp}(\cdot )]$. According to the universal approximation theorem, see [8], with a big enough $m$, the single-layer neural network is able to approximate any continuous function with a quite small approximation error.
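Concretely, this forward map is a few lines of numpy. The sketch below mirrors the definitions above; the choice of tanh as the activation and the random parameter values are illustrative only:

```python
import numpy as np

def single_layer_nn(x, theta, t, beta, b, psi=np.tanh):
    """Single-hidden-layer network: xi = theta^T x + t are the hidden
    node values, psi is applied element-wise, and the output is
    eta = beta^T psi(xi) + b."""
    xi = theta.T @ x + t          # shape (m,)
    return beta @ psi(xi) + b     # scalar eta_(theta,t,beta,b)(x)

def sigma(eta):
    """Logistic link mapping the network output into (0, 1)."""
    return np.exp(eta) / (1.0 + np.exp(eta))

rng = np.random.default_rng(2)
p, m = 6, 4
theta = rng.normal(size=(p, m))   # input weights
t = rng.normal(size=m)            # hidden intercepts
beta = rng.normal(size=m)         # node weights
b = 0.1                           # output intercept
prob = sigma(single_layer_nn(rng.normal(size=p), theta, t, beta, b))
```

Here `prob` plays the role of $\sigma \left({\eta}_{(\mathit{\theta},\mathit{t},\mathit{\beta},b)}\left(\mathit{x}\right)\right)$, the network's approximation of $f\left(\mathit{x}\right)$.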

By [2], assuming that $f\left(\mathit{x}\right)$ has a Fourier representation of the form $f\left(\mathit{x}\right)={\int}_{{\mathbb{R}}^{p}}{e}^{i{\mathit{\omega}}^{T}\mathit{x}}\tilde{F}\left(d\mathit{\omega}\right)$, let ${\Gamma}_{B,C}=\{f(\cdot ):{\int}_{B}{\Vert \mathit{\omega}\Vert}_{2}|\tilde{F}\left(d\mathit{\omega}\right)|<C\}$ for some bounded subset $B$ of ${\mathbb{R}}^{p}$ containing zero and some constant $C>0$. Then for every function $f\in {\Gamma}_{B,C}$, there exists a single-layer neural network output $\eta \left(\mathit{x}\right)$ such that ${\Vert \eta \left(\mathit{x}\right)-f\left(\mathit{x}\right)\Vert}_{2}=O(1/\sqrt{m})$ on $B$. Later, [20] generalized the result by relaxing the assumptions on the activation function and improved the rate of approximation by a logarithmic factor. They showed that on a bounded domain $\Omega \subset {\mathbb{R}}^{p}$ with Lipschitz boundary, assuming $f\in {H}^{r}(\Omega )$ satisfies $\gamma \left(f\right)={\int}_{{\mathbb{R}}^{p}}{(1+|\omega |)}^{m+1}|{\hat{f}}_{e}\left(\omega \right)|d\omega <\infty $ for some extension ${f}_{e}\in {H}^{r}\left({\mathbb{R}}^{p}\right)$ with ${f}_{e}{|}_{\Omega}=f$, if the activation function $\sigma \in {W}^{r,\infty}\left(\mathbb{R}\right)$ is non-zero and satisfies the polynomial decay condition $|{D}^{k}\sigma \left(t\right)|\le {C}_{r}{(1+|t|)}^{-s}$ for some $0\le k\le r$ and some $s>1$, then

$$\underset{\eta}{\mathrm{inf}}{\Vert f-\eta \Vert}_{{H}^{r}(\Omega )}\le C(s,r,\Omega ,\sigma )\,\gamma \left(f\right)\,{m}^{-1/2},$$

where the infimum is over single-layer networks $\eta $ with $m$ hidden nodes, the norm is in the Sobolev space of order $r$, and $C(s,r,\Omega ,\sigma )$ is a function of $s$, $r$, $\Omega $ and $\sigma $ only. Both results ensure the good approximation property of the single-layer neural network, and the convergence rate is independent of $p$, the dimension of $\mathit{x}$, as long as $f$ has a Fourier transform that decays sufficiently fast.

Towards building the high-dimensional ANN, we start by formalizing the model. Let $\mathit{X}$ be an $n\times p$ design or input matrix, let $\mathit{y}$ be an $n\times 1$ response or outcome vector, let $\mathit{\theta}$ be a $p\times m$ parameter or input weight matrix, let $\mathit{t}$ be an $m\times 1$ parameter vector of hidden intercepts, let $\mathit{\beta}$ be an $m\times 1$ parameter vector representing node weights, and let $b$ be a scalar parameter.

When one tries to bring neural networks into the high-dimensional set-up, or equivalently, the small-sample-size scenario, they usually do not work well. The estimability issue [14] arises from the fact that even a single-layer neural network may have too many parameters. This issue may already exist in the low-dimensional case ($n>p$), let alone the high-dimensional case. A single-layer neural network typically has $mp+2m+1$ parameters, which can be far more than the training sample size $n$. In practice, a neural network may work well at one of the local optimal solutions, although this is not guaranteed by theory. Regularization methods can be applied to help obtain a sparse solution. On one hand, a proper choice of regularization shrinks part of the parameters to zero, which addresses the statistical estimability issue. On the other hand, regularization makes the model more robust.

Assuming sparsity is usually the most efficient way of dealing with high dimensionality. A lasso-type regularization on the parameters has been shown numerically to perform poorly on neural network models. On one hand, the lasso does not drop a feature entirely but only disconnects it from some hidden nodes. On the other hand, the lasso does not select dependent predictor variables in a good manner [9]. Consider instead the sparse group lasso proposed by [21], which penalizes the predictors group-wise and individually simultaneously. It is a combination of the group lasso and the lasso, see for example [11]. The group lasso part penalizes the input features’ weights group-wise: a feature is either selected or dropped, and it is connected to all nodes in the hidden layer if selected. The lasso part further shrinks some weights of the selected input features to zero: a feature need not be connected to all nodes in the hidden layer when selected.

Define the loss function as the negative log-likelihood

$${l}_{\mathit{\varphi}}(y,\mathit{x})=-y\,{\eta}_{\mathit{\varphi}}\left(\mathit{x}\right)+\mathrm{log}\left[1+\mathrm{exp}\left({\eta}_{\mathit{\varphi}}\left(\mathit{x}\right)\right)\right],$$

where $\mathit{\varphi}=(\mathit{\theta},\mathit{t},\mathit{\beta},b)$. Besides the sparse group lasso regularization, we consider an ${l}_{2}$ regularization on the other parameters. Then we have the estimator

$$\hat{\mathit{\varphi}}=\underset{\mathit{\varphi}}{\mathrm{arg}\,\mathrm{min}}\ \frac{1}{n}\sum_{i=1}^{n}{l}_{\mathit{\varphi}}({y}_{i},{\mathit{x}}_{i})+\lambda \sum_{j=1}^{p}{\Omega}_{\alpha}\left({\mathit{\theta}}_{\left(j\right)}\right)+K{\Vert (\mathit{t},\mathit{\beta},b)\Vert}_{2}^{2},\qquad (3)$$

such that ${\Omega}_{\alpha}\left({\mathit{\theta}}_{\left(j\right)}\right)=(1-\alpha ){\Vert {\mathit{\theta}}_{\left(j\right)}\Vert}_{2}+\alpha {\Vert {\mathit{\theta}}_{\left(j\right)}\Vert}_{1}$, with ${\mathit{\theta}}_{\left(j\right)}$ the $j$-th row of $\mathit{\theta}$.

The sparse group lasso penalty [11, 21] includes a group lasso part and a lasso part, balanced through the hyper-parameter $\alpha \in (0,1)$. The group lasso part treats each input as a group of $m$ variables, namely the weights of the $m$ hidden nodes connected to that input; this regularization can include or drop an input variable’s $m$ hidden-node weights group-wise [27]. The lasso regularization further makes the weights sparse within each group, i.e., each input selected by the group lasso regularization does not have to connect to all hidden nodes. The combination of the two regularizations makes the estimation even easier for small-sample problems. The ${l}_{2}$ norm regularization on the other parameters is more of a practical concern, since it further reduces the risk of overfitting.
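As an illustration, the sparse group lasso penalty on $\mathit{\theta}$ can be computed as below. The exact relative weighting of the group and lasso terms varies across papers, so the unweighted convex combination here is only one common choice, and the function name is ours:

```python
import numpy as np

def sparse_group_lasso(theta, alpha):
    """Sparse group lasso penalty on the p x m input-weight matrix:
    each row theta[j] (the m weights of feature j) is one group.
    The group-lasso term (1 - alpha) * ||theta[j]||_2 keeps or drops
    feature j as a whole; the lasso term alpha * ||theta[j]||_1
    additionally zeroes individual connections of kept features."""
    group = np.linalg.norm(theta, ord=2, axis=1)   # one l2 norm per feature
    lasso = np.abs(theta).sum(axis=1)              # one l1 norm per feature
    return float(np.sum((1.0 - alpha) * group + alpha * lasso))
```

Setting `alpha=0` recovers the pure group lasso, `alpha=1` the pure lasso, and the penalty is zero exactly when all input weights vanish.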

Though the regularization differs slightly, [11] proposed a fast coordinate gradient descent algorithm for the estimation, which cycles through the gradient descent step for the differentiable part of the loss function, the threshold function for the group lasso part, and the threshold function for the lasso part. The three tuning parameters, $\alpha $, $\lambda $ and $K$, can be selected by cross-validation on a grid search.
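The two threshold steps in such a cycle are typically implemented with the standard proximal operators of the ${l}_{1}$ norm and the (unsquared) ${l}_{2}$ norm; a minimal numpy sketch (function names are ours, and this is not the full algorithm of [11]):

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1 (the lasso threshold):
    shrinks each coordinate toward zero by lam, zeroing the small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def group_threshold(v, lam):
    """Proximal operator of lam * ||.||_2 (the group lasso threshold):
    shrinks the whole group vector, setting it to zero if ||v||_2 <= lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v
```

Applied to a feature's row of $\mathit{\theta}$ after the gradient step, `group_threshold` drops or keeps the feature as a whole, while `soft_threshold` sparsifies the surviving connections.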

## 3. The Consistency of Neural Network Classification Risk

In this section, we conduct a theoretical investigation of the classification accuracy of the neural network model. Before stating the theorems, we need a few assumptions. The independence property of neural networks, see [1, 25] and [11], states that the first-layer weights in $\mathit{\theta},\mathit{t}$ satisfy

$${\mathit{\theta}}_{i}\ne \mathbf{0}\ \text{and}\ ({\mathit{\theta}}_{i},{t}_{i})\ne \pm ({\mathit{\theta}}_{j},{t}_{j})\ \text{for all}\ i\ne j,$$

and that the set of dilated and translated functions ${\mathbb{R}}^{p}\to \mathbb{R}$,

$$\{\mathit{x}\mapsto \psi \left({\mathit{\theta}}_{i}^{T}\mathit{x}+{t}_{i}\right):i=1,\dots ,m\},$$

is linearly independent, where ${\mathit{\theta}}_{i}$ denotes the $i$-th column of $\mathit{\theta}$.

The independence property means that different nodes depend on the input predictor variables through different linear combinations and that none of the hidden nodes is a linear combination of the other nodes, which is crucial for the universal approximation capability of neural networks. [20] proved that the above set is linearly independent if the ${\mathit{\theta}}_{i}$’s are pairwise linearly independent, as long as the non-polynomial activation function is integrable and satisfies a polynomial growth condition.

According to [11], if the parameters $\mathit{\varphi}=(\mathit{\theta},\mathit{t},\mathit{\beta},b)$ satisfy the independence property, the equivalence class of parameters

$$EQ\left(\mathit{\varphi}\right)=\{\tilde{\mathit{\varphi}}:{\eta}_{\tilde{\mathit{\varphi}}}\left(\mathit{x}\right)={\eta}_{\mathit{\varphi}}\left(\mathit{x}\right)\ \text{for all}\ \mathit{x}\in {\mathbb{R}}^{p}\}$$

contains only parameterizations that are sign-flips or permutations of the hidden nodes, and has cardinality exactly ${2}^{m}m!$.
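This invariance is easy to check numerically for an odd activation such as tanh, where jointly negating $({\mathit{\theta}}_{i},{t}_{i},{\beta}_{i})$ preserves the output; the sketch below (toy parameters, our construction) verifies one sign-flip-plus-permutation re-parameterization out of the ${2}^{m}m!$ possibilities:

```python
import numpy as np

def nn_out(x, theta, t, beta, b):
    """Single-hidden-layer network output with tanh activation."""
    return beta @ np.tanh(theta.T @ x + t) + b

rng = np.random.default_rng(3)
p, m = 5, 3
theta, t = rng.normal(size=(p, m)), rng.normal(size=m)
beta, b = rng.normal(size=m), 0.2
x = rng.normal(size=p)

# Flip the sign of node 0 and permute the hidden nodes.  Because tanh
# is odd, negating (theta_i, t_i, beta_i) together leaves each node's
# contribution unchanged, and permuting nodes only reorders the sum.
signs = np.array([-1.0, 1.0, 1.0])
perm = np.array([2, 0, 1])
theta2 = (theta * signs)[:, perm]
t2, beta2 = (t * signs)[perm], (beta * signs)[perm]
same = np.isclose(nn_out(x, theta, t, beta, b),
                  nn_out(x, theta2, t2, beta2, b))
```

All ${2}^{m}m!$ such re-parameterizations produce the same function, which is exactly why the estimand is an equivalence class rather than a single parameter vector.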

Let $\mathbb{P}$ be the distribution of $Y$ for fixed $\mathit{X}$ and ${\mathbb{P}}_{n}$ be the empirical measure. The best approximation in the neural network space is the equivalence class of parameters minimizing the population loss,

$$E{Q}_{0}=\underset{\mathit{\varphi}}{\mathrm{arg}\,\mathrm{min}}\int {l}_{\mathit{\varphi}}(y,\mathit{x})\,d{P}_{\mathit{X},Y},$$

where $\int {l}_{\mathit{\varphi}}(y,\mathit{x})\,d{P}_{\mathit{X},Y}$ is the expected loss with parameters $\mathit{\varphi}$. Let $Q$ be the number of equivalence classes in $E{Q}_{0}$. The excess loss is defined as

$$\mathcal{E}\left(\mathit{\varphi}\right)=\int \left[{l}_{\mathit{\varphi}}(y,\mathit{x})-{l}_{{\mathit{\varphi}}^{0}}(y,\mathit{x})\right]d{P}_{\mathit{X},Y},$$

where ${\mathit{\varphi}}^{0}$ is a set of parameters in $E{Q}_{0}$. Moreover, when we refer to a set of parameters in $E{Q}_{0}$ for some parameter $\mathit{\varphi}$, we mean the ${\mathit{\varphi}}^{0}\in E{Q}_{0}$ that has the minimum distance to $\mathit{\varphi}$. [11] has shown that this excess loss, plus the estimation error of the irrelevant weights, is bounded from above and may tend to zero with proper choices of $n$, $p$ and the tuning parameters.

Another concern is the estimability of the parameters. A common remedy is to assume sparsity of the predictors. Thus we make the following assumption.

**Assumption** **1.** (Sparsity) Only $s$ of the predictors are relevant for predicting $y$ (without loss of generality, we assume the first $s$ predictors, denoted $S$, are relevant, and the rest, denoted ${S}^{C}$, are irrelevant); all weights in $\mathit{\theta}$ associated with ${S}^{C}$, denoted ${\mathit{\theta}}_{{S}^{C}}$, are zero in the optimal neural network $E{Q}_{0}$.

The next assumption is standard for generalized models; it controls the variance of the response from below and above. Consider a general exponential family distribution on $y$ with canonical function $b\left(\theta \right)$; common assumptions bound ${b}^{\prime \prime}\left(\theta \right)$ and ${b}^{\prime \prime \prime}\left(\theta \right)$ from above and below. However, in binary classification problems, these functions are automatically bounded from above by 1, so we only need to assume boundedness from below. Some literature assumes constant bounds on these quantities; we allow the bounds to change with $n$ and to tend to zero as $n$ goes to infinity.

**Assumption** **2.** (Boundedness of variance) The true conditional probability of $y$ given $\mathit{x}$ is bounded away from 0 and 1 by a quantity $\tilde{\epsilon}$, which may tend to zero.

The following two assumptions are inherited from [11]. The next one is a relatively weak assumption on the local convexity around the optimal parameters.

**Assumption** **3.** (Local convexity) There is a constant ${h}_{min}>0$ that may depend on $m$, $s$, $f$ and the distribution ${P}_{\mathit{X},Y}$, but does not depend on $p$, such that for all $\mathit{\varphi}\in E{Q}_{0}$, we have

$${\nabla}_{\mathit{\varphi}}^{2}\int {l}_{\mathit{\varphi}}(y,\mathit{x})\,d{P}_{\mathit{X},Y}\succeq {h}_{min}\mathit{I},$$

where $A\succeq B$ means that $A-B$ is a positive semi-definite matrix.

The next assumption is made to bound the excess loss from below for parameters outside $E{Q}_{0}$, i.e., the true model is identifiable. Let ${d}_{0}\left(\mathit{\varphi}\right)$ be the minimum distance from an element in $E{Q}_{0}$ to $\mathit{\varphi}$; then we assume the following.

**Assumption** **4.** (Identifiability) For all $\epsilon >0$, there is an ${\alpha}_{\epsilon}>0$ that may depend on $m$, $s$, $f$ and the distribution ${P}_{\mathit{X},Y}$, but does not depend on $p$, such that

$$\underset{\mathit{\varphi}:{d}_{0}\left(\mathit{\varphi}\right)\ge \epsilon}{\mathrm{inf}}\ \mathcal{E}\left(\mathit{\varphi}\right)\ge {\alpha}_{\epsilon}.$$

Assumption 3 states that, though the neural network is a globally non-convex optimization problem, the parameters of the best neural network approximation of the true function $f\left(\mathit{x}\right)$ have a locally convex neighborhood. The assumption can be justified as follows. By the continuity of the neural network representation and of the loss function, the integral in Assumption 3 is infinitely continuously differentiable with respect to the nonzero parameters; therefore the second derivative is a continuous function of the nonzero parameters. By definition, the parameters ${\mathit{\varphi}}^{0}$ of the best neural network approximation minimize the integral in Assumption 3. If there were no positive ${h}_{min}$ satisfying the assumption, this would contradict either the continuity of the second derivative or the definition of ${\mathit{\varphi}}^{0}$.

Assumption 4 states that a non-optimal neural network can be distinguished from the best neural network approximation in terms of the excess loss, provided its parameters are not in the $\epsilon$-neighborhood of any parameters in the best neural network class $E{Q}_{0}$. Similar to the compatibility condition in [4], the condition need not (and may not) hold on the whole space; it is only needed on the subspace $\{\mathit{\varphi}:\Vert {\mathit{\theta}}_{{S}^{C}}{\Vert}_{1}\le 3{\sum}_{j\in S}{\Omega}_{\alpha}({\mathit{\theta}}_{\left(j\right)}-{\mathit{\theta}}_{\left(j\right)}^{0,\left(\mathit{\varphi}\right)})+\Vert (\mathit{t},\mathit{\beta},b)-({\mathit{t}}^{0,\left(\mathit{\varphi}\right)},{\mathit{\beta}}^{0,\left(\mathit{\varphi}\right)},{b}^{0,\left(\mathit{\varphi}\right)}){\Vert}_{2}\}$, so it is weaker than imposing the lower bound on the excess loss everywhere. The subspace is derived from the basic inequality in the definition of $\hat{\mathit{\varphi}}$ by rearranging terms and applying norm inequalities, see for example [4]. A similar subspace can also be found in the compatibility condition in [19]. Since $s$ is unknown, the condition cannot be checked in practice, but it is sufficient to check the inequality for all sets $S\subset \{1,\dots ,p\}$ with cardinality ${s}_{0}$ if ${s}_{0}$ is known, which is a stronger version of Assumption 4.

Now we are ready to state our main result. We must admit that our theory is based on the estimator from (3) being the global optimum, which runs into the biggest open problem in optimization research: the gap between the global optimum in theory and a local optimum in practice. We leave this computational issue to future research.

**Theorem** **1.** Under Assumptions 1–4, let the estimator be from Equation (3), with tuning parameter $\lambda \ge 2T\tilde{\lambda}$ for some constant $T\ge 1$ and $\tilde{\lambda}=c\sqrt{m\mathrm{log}n/n}(\sqrt{\mathrm{log}Q}+\sqrt{m\mathrm{log}p}\mathrm{log}\left(nm\right)/(1-\alpha +\alpha /\sqrt{m}))$. If $\mathrm{log}\left(n\right)/\left(n{\tilde{\epsilon}}^{2}\right)\to 0$, ${s}^{2}m{\lambda}^{2}/\left(n{\tilde{\epsilon}}^{2}\right)\to 0$ and ${n}^{-1}{m}^{9/2}{s}^{5/2}\sqrt{\mathrm{log}\left(p\right)}\to 0$ as $n\to \infty $, and if our prediction is within a constant distance of the best approximation ${\eta}^{0}\left(\mathit{x}\right)$, then the classification risk of the fitted neural network classifier tends to the Bayes classifier’s risk.

A proof of this theorem is given in the appendix. The theorem states that, with a proper choice of tuning parameters and under some mild assumptions and controls on $n$, $p$ and $s$, the high-dimensional neural network with sparse group lasso regularization attains the optimal classification risk asymptotically. This is a significant step in the theoretical study of neural networks, since it gives a theoretical guarantee that the high-dimensional neural network works in such situations.

## 6. Discussion

In this paper, we considered the sparse group lasso regularization on high-dimensional neural networks and proved that, under mild assumptions, the classification risk converges to the optimal Bayes classifier’s risk. To the best of our knowledge, this is the first result showing that the classification risk of a high-dimensional sparse neural network converges to the optimal Bayes risk. Neural networks approximate functions of correlated features very well, with applications including computer vision tasks, MRI data analysis and spatial data analysis. Further investigation is warranted in the future.

An innovative idea that deserves further investigation is to specify a larger number of hidden nodes and use an ${l}_{0}+{l}_{1}$ norm penalty on the hidden-node parameters $\mathit{\beta}$. This method searches a larger solution space and will give a model at least as good as the ${l}_{2}$ norm penalty. Moreover, the ${l}_{0}+{l}_{1}$ norm penalty has been shown to work well in low-signal cases [18]. In detail, the formulation minimizes (3) plus an extra regularization on $\mathit{\beta}$: ${\lambda}_{3}{\Vert \mathit{\beta}\Vert}_{0}+{\lambda}_{4}{\Vert \mathit{\beta}\Vert}_{1}$. This formulation does not bring in extra tuning parameters, since we release the penalization on the ${l}_{2}$ norm of $\mathit{\beta}$ and the number of hidden nodes $m$. With the ${l}_{0}+{l}_{1}$ norm penalty, the parameters can be trained using a coordinate descent algorithm, with the ${l}_{0}+{l}_{1}$ penalty handled by mixed integer second order cone optimization (MISOCO) algorithms via optimization software such as Gurobi. This adds an extra step to the algorithm in [11] to handle the ${l}_{0}+{l}_{1}$ norm penalty.
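As an aside, when the smooth part of a coordinate update reduces to a separable quadratic, the ${l}_{0}+{l}_{1}$ penalty admits a closed-form coordinate-wise proximal operator, so small instances can be handled without a MISOCO solver. The sketch below is our illustration of that special case, not the algorithm described above:

```python
import numpy as np

def prox_l0_l1(b, lam0, lam1):
    """Closed-form proximal operator of lam0 * ||z||_0 + lam1 * ||z||_1,
    applied coordinate-wise to b: soft-threshold by lam1, then keep the
    result only if its objective value beats setting the coordinate to
    zero (the l0 term charges a flat cost lam0 per nonzero entry)."""
    z = np.sign(b) * np.maximum(np.abs(b) - lam1, 0.0)   # lasso step
    keep = 0.5 * (z - b) ** 2 + lam0 + lam1 * np.abs(z) < 0.5 * b ** 2
    return np.where((z != 0) & keep, z, 0.0)
```

Large coordinates survive the soft-threshold and pay the flat ${l}_{0}$ cost, while moderate coordinates that the lasso alone would keep are zeroed out, which is exactly the extra sparsity the ${l}_{0}$ term buys.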

The computational issue of finding the global optimum in non-convex optimization problems remains open; resolving it would eliminate the gap between theory and practice and pave the way for further theoretical research.