The Compact Support Neural Network

Neural networks are popular and useful in many fields, but they suffer from giving high confidence responses for examples that are far away from the training data. This makes neural networks very confident in their predictions even when making gross mistakes, limiting their reliability in safety-critical applications such as autonomous driving and space exploration. This paper introduces a novel neuron generalization that has the standard dot-product-based neuron and the radial basis function (RBF) neuron as two extreme cases of a shape parameter. Using a rectified linear unit (ReLU) as the activation function results in a novel neuron that has compact support, i.e., its output is zero outside a bounded domain. To address the difficulties in training the proposed neural network, the paper introduces a novel training method that starts from a pretrained standard neural network and fine-tunes it while gradually increasing the shape parameter to the desired value. The theoretical findings of the paper are a bound on the gradient of the proposed neuron and a proof that a neural network with such neurons has the universal approximation property, i.e., the network can approximate any continuous and integrable function with an arbitrary degree of accuracy. Experiments on standard benchmark datasets show that the proposed approach has smaller test errors than state-of-the-art competing methods and outperforms them in detecting out-of-distribution samples on two out of three datasets.


Introduction
Neural networks have proven to be extremely useful in many applications, including object detection, speech and handwriting recognition, and medical imaging. They have become the state-of-the-art in these applications, and in some cases they even surpass human performance. However, neural networks have a major disadvantage: they do not know when they do not know, i.e., they cannot tell when the input is out-of-distribution (OOD), far away from the data they have been trained on. Instead of saying "I do not know", they give some output with high confidence [1,2]. An explanation of why this happens for rectified linear unit (ReLU) based networks has been given in [3]. This issue is very important for safety-critical applications such as space exploration, autonomous driving, and medical diagnosis. In these cases, it is important that the system knows when the input data are outside its nominal range, so it can alert the human (e.g., the driver in autonomous driving or the radiologist in medical diagnosis) to take charge.
A common way to address the problem of high confidence predictions for OOD examples is through ensembles [4], where multiple neural networks are trained with different random initializations and their outputs are averaged in some way. The reason why ensemble methods have low confidence on OOD samples is that each NN's high-confidence domain is random outside the training data, so the common high-confidence domain shrinks through averaging. This works well when the representation space (the space of the NN before the output layer) is high-dimensional but fails when this space is low-dimensional [5].
This paper makes the following contributions:
• It introduces a novel neuron formulation that generalizes the standard neuron and the radial basis function (RBF) neuron as two extreme cases of a shape parameter. Moreover, one can smoothly transition from a regular neuron to an RBF neuron by gradually changing this parameter. The RBF correspondent to a ReLU neuron is also introduced and observed to have compact support, i.e., its output is zero outside a bounded domain.
• It introduces a novel way to train a compact support neural network (CSNN) or an RBF network, starting from a pre-trained regular neural network. For that purpose, the construction mentioned above is used to smoothly bend the decision boundary of the standard neurons, obtaining the compact support or RBF neurons.
• It proves the universal approximation theorem for the proposed neural network, which guarantees that the network can approximate any function from L^p(R^d) with an arbitrary degree of accuracy.
• It shows through experiments on standard datasets that the proposed network usually outperforms existing out-of-distribution detection methods from the literature, both in terms of smaller test errors on the in-distribution data and larger areas under the ROC curve (AUROC) for detecting OOD samples.

Materials and Methods
This paper investigates the neuron design as the root cause of high confidence predictions on OOD data and proposes a different type of neuron to address its limitations. The standard neuron is f(x) = σ(wᵀx + b), a projection (dot product) x → wᵀx + b onto a direction w, followed by a nonlinearity σ(·). In this design, the neuron has a large response for vectors x ∈ R^d that lie in a half-space. This can be an advantage when training the neural network (NN), since it creates high connectivity in the weight space and makes the neurons sensitive to far-away signals. However, it can be a disadvantage when using the trained NN, since it leads to neurons unpredictably firing with high responses to far-away signals, which can result (with some probability) in high confidence responses of the whole network for examples that are far away from the training data.
To address these problems, a type of radial basis function (RBF) neuron [15], f(x) = g(||x − µ||²), is used and modified to have zero response at some distance R from µ. Therefore, the neuron has compact support, and the same applies to a layer formed entirely of such neurons. Using one such compact support layer before the output layer, one can guarantee that the space where the NN has a nonzero response is bounded, obtaining a more reliable neural network.
The loss function of such a compact support NN has many flat areas, and it can be difficult to train directly by backpropagation. However, the paper introduces a different way to train it, by starting with a trained regular NN and gradually bending the neuron decision boundaries to make them have smaller and smaller support.
The compact support neural network consists of a number of layers, where the layer before last contains only compact support neurons, which will be described next. The last layer is a regular linear layer without bias, so it can output an all-zero vector when appropriate.

The Compact Support Neuron
The radial basis function (RBF) neuron [15], f(x, w) = g(||x − w||²), for x, w ∈ R^d, typically uses the activation function g(u) = exp(−βu). In this paper, g(u) = max(R² − u, 0) will be used instead because it is related to the ReLU.
A flexible representation. Introducing an extra shape parameter α ∈ [0, 1], the RBF neuron can be written as:

f_α(x, w) = g(α(||x||² + ||w||²) − 2wᵀx),

which reduces to the RBF neuron g(||x − w||²) when α = 1. Using the parameter α, one can smoothly interpolate between an RBF neuron when α = 1 and a standard projection neuron when α = 0. However, starting with an RBF neuron with g(u) = exp(−βu), the projection neuron obtained for α = 0 is f_0(x, w) = exp(2βwᵀx), which has an undesirable exponential activation function. Using g(u) = max(R² − u, 0) instead yields a ReLU projection neuron for α = 0, and a neuron with compact support for any α > 0.
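To make the interpolation concrete, here is a minimal NumPy sketch of such a neuron with the ReLU-type activation. The blended radius R_α² = αR² + (1 − α)b used here is an assumption of this sketch, not necessarily the paper's exact parameterization:

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def csn(x, w, alpha, R, b=0.0):
    """Compact support neuron interpolating between a projection neuron
    (alpha = 0) and a compact-support RBF neuron (alpha = 1).
    The blended radius R_alpha^2 = alpha*R^2 + (1 - alpha)*b is an
    assumption of this sketch."""
    R2_alpha = alpha * R**2 + (1.0 - alpha) * b
    return relu(R2_alpha + 2.0 * (w @ x) - alpha * (x @ x + w @ w))

rng = np.random.default_rng(0)
x, w = rng.normal(size=5), rng.normal(size=5)

# alpha = 1 recovers the compact-support RBF neuron ReLU(R^2 - ||x - w||^2)
assert np.isclose(csn(x, w, alpha=1.0, R=2.0),
                  relu(2.0**2 - np.sum((x - w)**2)))

# alpha = 0 (with b = 0) recovers a ReLU projection neuron ReLU(2 w^T x)
assert np.isclose(csn(x, w, alpha=0.0, R=2.0), relu(2.0 * (w @ x)))
```

Note that for any α > 0 the quadratic term −α||x||² eventually dominates, so the output is exactly zero far from w, which is the compact support property.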

The Compact Support Neural Network
For a layer containing only compact support neurons (CSN), the weights can be combined into a matrix Wᵀ = (w_1, . . . , w_K), the biases into a vector b = (b_1, . . . , b_K) and the radii into a vector r = (R_1, . . . , R_K), so that the CSN layer can be written as:

f_α(x, W, b, r) = (f_α(x, w_1, b_1, R_1), . . . , f_α(x, w_K, b_K, R_K))ᵀ,

the vector of the K neuron outputs of that layer. This formulation enables the use of standard neural network machinery (e.g., PyTorch) to train a CSN. In practice, usually no bias term is used (i.e., b = 0), except in the low-dimensional experiments. The radius parameter r is trainable.

The simplest compact support neural network (CSNN) has two layers: a hidden layer containing compact support neurons and an output layer, which is a standard fully connected layer without bias, as shown in Figure 2a. A Batch Normalization layer without trainable parameters can be added to normalize the data as described below. A LeNet network with a CSN layer before the last layer is shown in Figure 2b.

Training. As with RBF networks, training a neural network with such neurons directly at α = 1 is difficult because the loss function has many local optima. To make matters worse, the compact support neurons have small support when α is close to 1, so the loss function has flat regions between the local minima. This is why another training approach is taken: a CSNN can be trained by first training a regular NN (α = 0) and then gradually increasing the shape parameter α from 0 towards 1 while continuing to update the NN parameters. Observe that whenever α > 0, the NN has compact support, and the support becomes smaller as α approaches 1. The training procedure is described in detail in Algorithm 1.
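A whole CSN layer can be sketched in vectorized form (NumPy here rather than PyTorch, and with the same assumed form of R_α² = αr² + (1 − α)b as elsewhere in these sketches):

```python
import numpy as np

def csn_layer(X, W, r, alpha, b=None):
    """Apply a layer of K compact support neurons to n inputs.
    X: (n, d) inputs, W: (K, d) weights, r: (K,) radii, b: (K,) biases.
    Returns an (n, K) matrix of neuron outputs."""
    if b is None:
        b = np.zeros(W.shape[0])                # in practice no bias is used
    R2_alpha = alpha * r**2 + (1.0 - alpha) * b  # assumed form of R_alpha^2
    # alpha * (||x||^2 + ||w||^2), broadcast to shape (n, K)
    quad = alpha * ((X**2).sum(axis=1, keepdims=True) + (W**2).sum(axis=1))
    return np.maximum(R2_alpha + 2.0 * X @ W.T - quad, 0.0)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
W = rng.normal(size=(3, 4))
r = np.full(3, 1.5)

out = csn_layer(X, W, r, alpha=1.0)
assert out.shape == (8, 3)
# at alpha = 1 each entry equals ReLU(R^2 - ||x - w||^2)
assert np.isclose(out[0, 0], max(1.5**2 - np.sum((X[0] - W[0])**2), 0.0))
```

A PyTorch version would look essentially the same, with `r` registered as a trainable parameter so the radii are learned by backpropagation.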

Algorithm 1 Compact Support Neural Network (CSNN) Training

1: Train a regular NN (α = 0), where W, L denote the last two layer weight matrices and g(x) the rest of the NN.
2: Freeze g(x) and compute u_i = g(x_i), i = 1, . . . , n, their mean µ and standard deviation σ.
3: Normalize v_i = (u_i − µ)/σ, i = 1, . . . , n.
4: for e = 1 to N_epochs do
5:   Set α = e/N_epochs
6:   Use the examples (v_i, y_i) to update (W, L, b, r) based on one epoch of backpropagation.
7: end for

In practice, the training is stopped at an α ≤ 1 where the training and validation errors still take acceptable values, e.g., a validation error less than the validation error for α = 0. However, the larger the value of α, the tighter the support is around the training data and the better the generalization. The whole network can then be fine-tuned using a few more epochs of backpropagation.
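The α ramp in Algorithm 1 can be sketched as a small driver loop, with the actual epoch of backpropagation and the validation error abstracted into hypothetical callbacks `train_one_epoch` and `val_error`:

```python
def ramp_training(n_epochs, train_one_epoch, val_error, max_val_error):
    """Gradually increase alpha from ~0 to 1, remembering the largest
    alpha whose validation error is still acceptable (sketch of
    Algorithm 1; train_one_epoch and val_error are hypothetical
    callbacks standing in for the real training machinery)."""
    best_alpha = 0.0
    for e in range(1, n_epochs + 1):
        alpha = e / n_epochs          # line 5 of Algorithm 1
        train_one_epoch(alpha)        # one epoch of backpropagation
        if val_error(alpha) <= max_val_error:
            best_alpha = alpha        # training could be stopped here
    return best_alpha

# toy check: pretend the validation error grows linearly with alpha;
# then the largest acceptable alpha is where val_error == max_val_error
alpha = ramp_training(10, lambda a: None, lambda a: a / 2, max_val_error=0.3)
assert alpha == 0.6
```

The real procedure would of course keep a checkpoint of the model at `best_alpha` rather than just the scalar.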

Datasets Used
Real data experiments are conducted on classifiers trained on three datasets: MNIST [16], CIFAR-10 and CIFAR-100 [17]. The OOD (out of distribution) detection is evaluated using the respective test sets as in-sample data and other datasets as OOD data, such as the test sets of EMNIST [18], FashionMNIST [19] and SVHN [20], and the validation set of ImageNet [21].
For MNIST, a grayscale version of CIFAR-10 is also used as OOD data, obtained by converting the 10,000 test images to grayscale and resizing them to 28 × 28.
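This conversion can be sketched as follows, with a naive nearest-neighbor resize standing in for whatever library interpolation was actually used:

```python
import numpy as np

def cifar_to_mnist_format(img):
    """Convert a 32x32x3 uint8 CIFAR image to a 28x28 grayscale image.
    Channel averaging plus nearest-neighbor resizing is an illustrative
    stand-in for a proper library resize (e.g., torchvision's)."""
    gray = img.astype(np.float32).mean(axis=2)          # average the 3 channels
    idx = np.round(np.linspace(0, 31, 28)).astype(int)  # 32 -> 28 sampling grid
    return gray[np.ix_(idx, idx)]

img = np.zeros((32, 32, 3), dtype=np.uint8)
out = cifar_to_mnist_format(img)
assert out.shape == (28, 28)
```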

Neural Network Architecture
For MNIST, a 4-layer LeNet convolutional neural network (CNN) backbone is used, with two 5 × 5 convolution layers with 32 and 64 filters, respectively, followed by ReLU and 2 × 2 max pooling, and two fully connected layers with 256 and 10 neurons. For the other two datasets, a ResNet-18 [22] backbone is used, with 4 residual blocks with 64, 128, 256 and 512 filters, respectively.
For the CSNN, two architectures, illustrated in Figure 2, will be investigated. The first is a small one (called CSNN), illustrated in Figure 2a, which takes the output of the last convolutional layer of the backbone as input, normalized as described in Section 2.2 using a batch normalization layer without any learnable affine parameters. The second one is a full network (called CSNN-F), illustrated in Figure 2b, where the backbone (LeNet or ResNet) is part of the backpropagation, and a Batch Normalization layer (BN) without any learnable parameters is used between the backbone and the CSN layer.
All experiments were conducted on an MSI GS-60 Core i7 laptop with 16 GB RAM and an Nvidia GTX 970M GPU, running the Windows 10 operating system. The CSNN and CSNN-F networks were implemented in PyTorch 1.9.0.

Training Details
For all datasets, data augmentation with padding (3 pixels for MNIST, 4 pixels for the rest) and random cropping is used to train the backbones. For CIFAR-100, random rotation of up to 15 degrees is also used. No data augmentation is used when training the CSNN and CSNN-F.
The CSNN was trained for 510 epochs with R = 0.01, of which 10 epochs at α = 0. The Adam optimizer is used with a learning rate of 0.001 and weight decay of 0.0001. SGD obtained similar results. The CSNN-F was trained with SGD with a learning rate of 0.001 and weight decay of 0.0005. Its layers were initialized with the trained backbone and the trained CSNN. Then α was kept fixed for two epochs and increased by 0.005 every epoch for 4 more epochs.
Training the CSNN from α = 0 to α = 1 for 510 epochs takes less than an hour. Each epoch of the CSNN-F takes less than a minute with the LeNet backbone and about 3 min with the ResNet-18 backbone. The CSNN-F was obtained by merging the corresponding CSNN head with the ResNet or LeNet backbone and training them together for 6 epochs.

Ood Detection
Out of distribution (OOD) detection is performed similarly to the way it is performed in a standard CNN. For any observation, the maximum value of the network's raw outputs is used as the OOD score for predicting whether the observation is OOD or not. If the observation is in-distribution, its score will usually be large; if it is OOD, the score will usually be close to zero or even exactly zero. The ROC (receiver operating characteristic) curve based on these scores for the test set of the in-distribution data (as class 0) and one OOD dataset (as class 1) gives the AUROC (area under the ROC). If the two distributions are not separable (have concept overlap), some of the OOD scores will be large, but for the OOD observations away from the area of overlap they will be small or even zero.
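Concretely, the score and the AUROC can be computed as follows; the hand-rolled AUROC here (the probability that a random in-distribution score exceeds a random OOD score, with ties counted as 1/2) is a small self-contained substitute for a library routine such as scikit-learn's `roc_auc_score`:

```python
import numpy as np

def ood_score(logits):
    """OOD score of an observation: the maximum raw network output."""
    return np.max(logits, axis=-1)

def auroc(scores_in, scores_ood):
    """AUROC for separating in-distribution (high scores) from OOD
    (low scores), computed as P(score_in > score_ood) over all pairs,
    ties counted as 1/2."""
    s_in = np.asarray(scores_in, dtype=float)[:, None]
    s_ood = np.asarray(scores_ood, dtype=float)[None, :]
    return float(np.mean((s_in > s_ood) + 0.5 * (s_in == s_ood)))

# perfectly separated scores give AUROC = 1.0
assert auroc([5.0, 4.0, 3.0], [0.0, 0.1]) == 1.0
# completely overlapping (identical) scores give 0.5
assert auroc([1.0, 1.0], [1.0, 1.0]) == 0.5
```

The pairwise formulation is O(n·m) and is only suitable for illustration; a rank-based computation is preferable at dataset scale.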

Methods Compared
The OOD (out of distribution) detection results of the CSNN and CSNN-F are compared with Adversarial Confidence Enhanced Training (ACET) [3], Deterministic Uncertainty Quantification (DUQ) [5], a standard CNN and a standard RBF network. The RBF network has the same architecture as the CSNN-F, but with an RBF layer instead of the CSN layer, and has a learnable σ for each neuron. Also shown are results for ensembles of 5 or 10 CNNs trained with different random initializations; however, these methods are more computationally expensive and are therefore not included in the main comparison. The ACET results are taken directly from [3], and the DUQ, CNN and ensemble results were obtained using the DUQ authors' PyTorch code. The parameters for the CNN, RBF and ensemble were tuned to obtain the smallest average test error. For DUQ, multiple models were trained with various combinations of the length scales σ ∈ {0.05, 0.1, 0.2, 0.3, 0.5, 1.0} and gradient penalty λ ∈ {0, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0}, and the combination with the best test error-AUROC trade-off was selected. For the CSNN methods, for each dataset, the classifier was selected to correspond to the largest α where the test error takes a value comparable to the other methods compared.

Results
This section presents theoretical results on the gradient of the proposed CSNN and its universal approximation properties, plus experimental results on a number of public datasets.

Gradient Bound
The authors of the Deterministic Uncertainty Quantification (DUQ) [5] paper claim that gradient regularization is needed in deep networks to enforce a local Lipschitz condition on the prediction function that limits how fast the output will change away from the training examples. Thus, it is of interest to see whether this Lipschitz condition is satisfied for the proposed CSNN.
The following result proves that the Lipschitz condition [5] is indeed satisfied for the CSN layer:

Theorem 1. The gradient of a CSN neuron

f_α(x, w) = ρ(R_α² + 2wᵀx − α(||x||² + ||w||²)), where ρ(u) = max(u, 0),

is bounded by:

||∇_x f_α(x, w)|| ≤ 2√(αR_α² + (1 − α²)||w||²).

Proof. The maximum gradient of a CSN neuron can be computed explicitly as follows. On the support of the neuron, the argument of ρ(·) is positive, i.e.,

α(||x||² + ||w||²) − 2wᵀx ≤ R_α².

After dividing by α and regrouping terms, this inequality can be rearranged as:

||x − w/α||² ≤ R_α²/α + (1 − α²)||w||²/α²,

where R_α² was defined in Equation (3). On the support, the gradient is

∇_x f_α(x, w) = 2(w − αx) = −2α(x − w/α),

and therefore the maximum gradient norm is:

max_x ||∇_x f_α(x, w)|| = 2α√(R_α²/α + (1 − α²)||w||²/α²) = 2√(αR_α² + (1 − α²)||w||²).
Thus, a small gradient can be enforced everywhere by penalizing R and ||w|| to be small using weight decay, by making α close to 1, or both (all assuming b = 0).
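As a numerical sanity check of Theorem 1, one can sample points inside the support and verify that the gradient norm never exceeds the bound. The sketch below assumes b = 0 and R_α² = αR² (assumptions of this sketch), under which the bound reads 2√(α²R² + (1 − α²)||w||²):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, R = 0.7, 1.5
w = rng.normal(size=4)
# Theorem 1 bound with b = 0 and the assumed R_alpha^2 = alpha * R^2
bound = 2.0 * np.sqrt(alpha**2 * R**2 + (1.0 - alpha**2) * (w @ w))

for _ in range(1000):
    x = rng.normal(scale=3.0, size=4)
    # neuron argument: R_alpha^2 + 2 w^T x - alpha (||x||^2 + ||w||^2)
    u = alpha * R**2 + 2.0 * (w @ x) - alpha * (x @ x + w @ w)
    if u > 0:  # x lies inside the neuron's support
        grad_norm = 2.0 * np.linalg.norm(w - alpha * x)  # ||grad of u||
        assert grad_norm <= bound + 1e-9
```

The gradient vanishes at the center x = w/α and grows toward the boundary of the support, where it approaches the bound.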

Universal Approximation for the Compact Support Neural Network
It was proved in [23,24] that standard two-layered neural networks have the universal approximation property, in the sense that they can approximate continuous functions or integrable functions with arbitrary accuracy under certain assumptions. Similar results have been proved for RBF networks in [25,26]; thus, it is of interest whether such results can be proved for the proposed compact support neural network. In this section, the universal approximation property of the compact support neural networks will be proved.
Let L^∞(R^d) and L^p(R^d) be the spaces of functions f : R^d → R that are essentially bounded and, respectively, whose p-th power |f|^p is integrable. Denote the L^p and L^∞ norms by ||·||_p and ||·||_∞, respectively.
The family of networks considered in this study consists of two-layer neural networks, which can be written as:

h(x) = Σ_{i=1}^H β_i f(x, w_i, b_i),

where H is the number of hidden nodes, β_i ∈ R are the weights from the i-th hidden node to the output node, and f(x, w_i, b_i) is the output of the i-th hidden neuron with weights w_i ∈ R^d and bias b_i ∈ R. The neurons used in the hidden layer can be either regular neurons or radial basis function (RBF) neurons. The regular neuron can be written as:

f(x, w, b) = g(wᵀx + b),

where g is an activation function such as the sigmoid or ReLU. An RBF neuron has the representation:

f(x, w) = g(||x − w||²),

where the exponential function g(u) = exp(−u) is typically used for activation in RBF networks.
The studies on regular neurons [23,24] showed that if the activation function g used in the hidden layer is continuous almost everywhere, locally essentially bounded, and not a polynomial, then a two-layered neural network can approximate any continuous function with respect to the || − || ∞ norm.
Universal approximation results for RBF networks are more limited; refs. [25,26] proved that if the RBF neuron used in the hidden layer is continuous almost everywhere, bounded and integrable on R^d, then the RBF network can approximate any function in L^p(R^d) with respect to the L^p norm, 1 ≤ p < ∞. More exactly, in [25] the following statement is proved:

Theorem 2 (Park 1991). Let K : R^d → R be an integrable bounded function such that K is continuous almost everywhere and ∫_{R^d} K(x)dx ≠ 0. Then the family of functions

q(x) = Σ_{i=1}^M β_i K((x − z_i)/σ), with M ∈ N, β_i ∈ R, z_i ∈ R^d, σ > 0,

is dense in L^p(R^d) for any p ∈ [1, ∞).

The above result will be used to prove the following statement for the CSNN:

Theorem 3. Let α ∈ (0, 1], R_0 ≥ 0, and let g : R → R be any non-negative, continuous, increasing activation function such that K(x) = g(R_0² − ||x||²) is integrable on R^d. Then the family of two-layer compact support neural networks

h(x) = Σ_{i=1}^H β_i g(R_0² + 2w_iᵀx − α||x||² − ||w_i||²/α), with H ∈ N, β_i ∈ R, w_i ∈ R^d,

is dense in L^p(R^d) for any p ∈ [1, ∞).

Proof. Since 0 < α ≤ 1, it implies that:

2w_iᵀx − α||x||² − ||w_i||²/α = −α||x − w_i/α||².

Therefore,

g(R_0² + 2w_iᵀx − α||x||² − ||w_i||²/α) = g(R_0² − α||x − w_i/α||²) = K(√α (x − w_i/α)).

Observe that K(x) is bounded because g ≥ 0 is increasing, and thus K(x) = g(R_0² − ||x||²) ≤ g(R_0²). Since K(x) is also integrable and continuous almost everywhere with ∫_{R^d} K(x)dx ≠ 0, Theorem 2 applies with z_i = w_i/α and σ = 1/√α, obtaining the desired result.
Observe that universal approximation results for α = 0 have already been proven in [23,24] and that the standard RBF kernel can be obtained in Theorem 3 by taking g(x) = exp(x), α = 1 and R 0 = 0.

Two-Dimensional Example
This experiment is on the moons 2D dataset, where the data are organized in two intertwining half-circle-like shapes, one containing the positives and one the negatives. The data are scaled so that all observations lie in [0, 1]² (shown as a white rectangle in Figure 3). The out of distribution (OOD) data are generated starting with 100 × 100 = 10,000 samples on a grid spanning [−0.5, 1.5]² and removing all samples at a distance of at most 0.1 from the moons data, obtaining 8763 samples. The confidence map for the trained classifier is shown in Figure 3d. One can see that the confidence is 0.5 (white) almost everywhere except near the training data, where it is close to 1 (black). This confirms that the method works as expected, shrinking the support of the neurons to a small domain around the training data. One can also see that the support is already reasonably small for α = 0.6, and it becomes tighter and tighter as α approaches 1.
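The OOD grid construction can be sketched as follows (with a toy stand-in for the moons data, since the exact dataset and scaling are not reproduced here):

```python
import numpy as np

def ood_grid(data, n=100, lo=-0.5, hi=1.5, min_dist=0.1):
    """Build an n x n grid spanning [lo, hi]^2 and keep only the points
    at distance > min_dist from every data point, to serve as OOD samples."""
    g = np.linspace(lo, hi, n)
    xx, yy = np.meshgrid(g, g)
    grid = np.column_stack([xx.ravel(), yy.ravel()])          # (n*n, 2)
    d = np.linalg.norm(grid[:, None, :] - data[None, :, :], axis=2)
    return grid[d.min(axis=1) > min_dist]

# toy stand-in for the moons data: a few points near the center
toy = np.array([[0.5, 0.5], [0.6, 0.5]])
ood = ood_grid(toy, n=50)
assert ood.shape[1] == 2 and len(ood) < 50 * 50   # some points were removed
```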
The training/test errors vs. α are shown in Figure 4, together with the AUROC (area under the ROC curve) for detecting the OOD data described above against the test set. Observe that the training and test errors for α = 0 are quite large, because the standard 2-layer NN with 128 neurons cannot fit these data well enough, and that they decrease as the neuron support shrinks and the model becomes better capable of fitting the data.
It is known [3,27] that the output of a ReLU-based neural network is piecewise linear, and the domains of linearity are given by the activation pattern of the neurons, i.e., the domains where the set of active neurons (with nonzero output) does not change. These activation pattern domains are polytopes, as shown in Figure 5a for a two-layer NN with 32 neurons. The activation domains for a CSNN are intersections of circles, as illustrated in Figure 5b, with the domain where all neurons are inactive shown in white. The corresponding confidence map is shown in Figure 5c. In real data applications, one does not need to go all the way to α = 1: even for smaller α the support is still bounded, and if the instance space is high dimensional (e.g., 512 to 1024 in the real data experiments below), the volume of the support of the CSNN will be extremely small compared to the instance space, making it unlikely to have high confidence on out-of-distribution data.

Real Data Experiments
In Figure 6, the train/test errors vs. α for the CSNN on the three datasets are shown, together with the area under the ROC curve (AUROC) for OOD detection on CIFAR-10 or CIFAR-100. Observe that all curves on the real data are very smooth, even though they are obtained from one run, not averaged. The training and test errors stay flat for a while and then start increasing from a certain α, which depends on the dataset. At the same time, the AUROC stays flat or slightly increases, and there is a range of values of α where the test error is low and the AUROC is large. Different values of α thus make different trade-offs between test error and AUROC. In practice, α should be chosen as large as possible while still obtaining an acceptable validation error, in order to have the smallest support possible. For example, one could choose the largest α such that the validation error at α is less than or equal to the validation error at α = 0.
The results are shown in Table 1. All results except the ACET results are averaged over 10 runs, and the standard deviation is shown in parentheses. The results of the best non-ensemble method are shown in bold and the second best in red.
Table 1. OOD detection comparison in terms of area under the ROC curve (AUROC) for models trained and tested on several datasets. Best non-ensemble results are bold and second best are red. For each model, the test error in % is shown in the "Train on" row. The ACET results are taken from [3]. All other results are averaged over 10 runs, and the standard deviation is shown in parentheses.

Discussion
From Table 1, one can see that the trained RBF network has difficulties fitting the in-distribution data well, with much larger test errors on CIFAR-10 and CIFAR-100 than the standard CNN. The most probable cause is that the training gets stuck in a local optimum, because the training errors were also observed to be large (not shown in the table).
Further comparing the test errors, Table 1 reveals that the proposed CSNN-F method obtains the smallest test errors on in-distribution data among the non-ensemble methods compared. These findings strengthen the claim that the CSNN can fit the in-distribution data as well as a standard CNN, a claim that is also supported by the universal approximation statement from Theorem 3.
Comparing the detection AUROC on different OOD datasets, the proposed CSNN and CSNN-F methods obtain the best results on MNIST and CIFAR-100, and ACET and SNGP obtain the best results on CIFAR-10. However, the test errors of ACET are the highest among all methods by a large margin.
One should be aware that there usually is a trade-off to be made between small test errors and good OOD detection. ACET makes this trade-off in favor of a high AUROC by training with adversarial samples, while the other methods are trying to obtain small test errors, comparable with a standard CNN. Observe that the test errors of the CSNN-F approach are smaller than the CSNN, and the AUROCs are comparable.
Compared to ACET, both CSNN and CSNN-F obtain smaller test errors on all three datasets and better average AUROC on two out of three datasets. Compared to DUQ, the CSNN and CSNN-F obtain comparable test errors and better average AUROC on all three datasets. The RBF network cannot obtain a small training or test error in most cases, and the test errors and OOD detection results are poor and have a large variance.
Comparing the training time, the CSNN methods are about four times faster than training a 5-ensemble, eight times faster than a 10-ensemble and about three times faster than DUQ.
Overall, compared to the other non-ensemble OOD detection methods evaluated, the proposed methods obtain smaller test errors and better OOD detection performance (except ACET, which has large test errors). Training the whole deep network together with the CSNN results in smaller test errors and improved OOD detection performance.
In the synthetic experiment in Figure 3, the train and test errors were close to 0 for α = 1 because there are neurons close to all observations. However, in the real data applications, where the representation space is high dimensional, the training, test and validation errors might first decrease a little but ultimately increase as α approaches 1; see the test errors vs. α for the synthetic dataset in Figure 4 and for the real datasets in Figure 6. This is due to the curse of dimensionality, which makes all distances between observations relatively large; the CSNN centers will also be far from the observations, so the neurons need a larger radius to cover the training data.
It is worth noting that in contrast to the weights of a standard neuron, the weights of the compact support neuron exist in the same space as the neuron inputs, and they can be regarded as templates. Thus, they have more meaning, and one could easily visualize the type of responses that make them maximal, using standard neuron visualization techniques such as [28]. Furthermore, one can also obtain samples from the compact support neurons, e.g., for generative or GAN models.

Conclusions
This paper presented four contributions. First, it introduced a novel neuron formulation that generalizes the standard projection-based neuron and the RBF neuron as two extreme cases of a shape parameter α ∈ [0, 1]; using ReLU as the activation function, it obtains a novel type of neuron with compact support. Second, it introduced a novel way to train the compact support neural network or an RBF network by starting with a pretrained standard neural network and gradually increasing the shape parameter α. This training approach avoids the difficulties in training compact support NNs and RBF networks, whose loss functions have many local optima. Third, it proved the universal approximation property of the proposed neural network, in that it can approximate any function from L^p(R^d) with arbitrary accuracy, for any 1 ≤ p < ∞. Finally, through experiments, it showed that the proposed compact support neural network outperforms the standard NN and the RBF network and even usually outperforms existing state-of-the-art OOD detection methods, both in terms of smaller test errors on in-distribution data and larger AUROC for detecting OOD samples.
The OOD detection feature is important in safety critical applications such as autonomous driving, space exploration and medical imaging.
The results have been obtained without any adversarial training or ensembling, and adversarial training or ensembling could be used in the proposed framework to obtain further improvements.
In real data applications, the compact support layer was used as the last layer before the output layer. This ensures that the compact support is imposed in the most relevant representation space of the CNN. However, because the CNN still has many projection-based layers to obtain this representation, the corresponding region in the original image space does not have compact support, and erroneous high-confidence predictions are still possible. Architectures with multiple compact support layers that have even smaller support in the image space are a subject of future study.
Data Availability Statement: The following publicly available datasets have been used in experiments: MNIST [16], CIFAR-10 and CIFAR-100 [17], EMNIST [18], FashionMNIST [19], SVHN [20], and the validation set of ImageNet [21].