Improving Convolutional Neural Networks’ Accuracy in Noisy Environments Using k-Nearest Neighbors

We present a hybrid approach to improve the accuracy of Convolutional Neural Networks (CNN) without retraining the model. The proposed architecture replaces the softmax layer with a k-Nearest Neighbor (kNN) algorithm for inference. Although this is a common technique in transfer learning, we apply it to the same domain for which the network was trained. Previous works show that neural codes (neuron activations of the last hidden layers) can benefit from the inclusion of classifiers such as support vector machines or random forests. In this work, our proposed hybrid CNN + kNN architecture is evaluated using several image datasets, network topologies and label noise levels. The results show significant accuracy improvements in the inference stage with respect to the standard CNN with noisy labels, especially with relatively large datasets such as CIFAR100. We also verify that applying the ℓ2 norm on neural codes is statistically beneficial for this approach.


Introduction
Deep convolutional neural networks are multi-layer architectures designed to extract high-level representations of a given input. They have dramatically improved the state-of-the-art in image, video, speech and audio recognition tasks [1]. When trained for supervised classification, the CNN layers are eventually able to extract a set of features suitable for the task at hand. These features are obtained by forwarding a raw input through the network, in which the last layer merely learns a linear mapping to get the most likely class.
We present a simple and effective technique for accounting for label noise on deep neural networks. Large datasets suffer from this problem, as researchers often resort to non-expert sources such as Amazon's Mechanical Turk or tags from social networking sites to label massive datasets, resulting in unreliable labels [2]. Large amounts of data (required for deep neural networks) typically contain some wrong labels, and the presence of this kind of noise can drastically degrade learning performance [3,4].
Learning with noisy labels has been studied previously, although not extensively, particularly on deep neural networks. In [5], a noise layer was added to simultaneously learn both the neural network parameters and the noise distribution. The work in [6] included a linear layer after the softmax to model the noise. Other approaches used bootstrapping [7], dropout regularization [2] or a modified loss function [8] to increase the CNN accuracy in the presence of noise.
Unlike the above-mentioned methods, the proposed approach is not performed in the training stage; therefore, it can be used for reliable prediction using existing models that were already trained with noisy labels.
The proposed method is based on a technique that has been extensively used in transfer learning. Due to the high generalization power of CNNs, transfer learning can apply models trained on one domain to a different task where the data are similar but the classes are different [9]. This transfer can be done by fine-tuning the weights of the pretrained network on the new dataset, continuing the backpropagation [10]. Alternatively, it can be performed by using the CNN as a feature extractor to obtain mid-level representations, forwarding samples through the network to get the activations of one of the last hidden layers, which is usually a fully-connected or a pooling layer. These representations are also referred to as neural codes [11].
Extracting neural codes and then applying Support Vector Machines (SVM) or k-Nearest Neighbors (kNN) is a common transfer learning technique [12]. Some previous works also used this strategy on the same domain where the network was trained, replacing the softmax layer with an SVM [13,14,15,16]. They proved that using neural codes extracted from a CNN to train an SVM improves the classification accuracy with respect to using the CNN softmax output directly. In addition, in a remarkable previous work, [17] proposed deep neural decision forests, replacing the softmax with differentiable Random Forests (RaF) and training this hybrid approach jointly.
The proposed architecture consists of replacing the softmax by a kNN classifier on inference. To our knowledge, there are no previous works using this technique to address the same task for which the network was trained, although this is a common approach for transfer learning. At the inference stage, neural codes from the last hidden layer are extracted to perform a kNN search on the training set in order to yield the class.
The motivation for using kNN is that it is a non-linear method that may obtain more complex decision boundaries than the common linear mapping used in the softmax layer of a CNN. Taking into account that CNN and kNN are totally complementary in terms of feature extraction and decision boundaries, in a hybrid system, both algorithms could exploit their potential and see their drawbacks mitigated by the other. In addition, CNN are usually trained with large datasets, and as the size of the training set approaches infinity, the NN classifier shows strong guarantees with respect to the Bayes error rate [18], the minimum achievable error rate given the distribution of the data.
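To make this decision rule concrete, the following minimal sketch (NumPy only; the arrays are toy stand-ins for real neural codes, and `knn_predict` is our own illustrative helper, not code from the paper) classifies each query by majority vote among its k nearest training prototypes under Euclidean distance:

```python
import numpy as np

def knn_predict(train_codes, train_labels, query_codes, k=5):
    """Classify each query by majority vote among its k nearest
    training neural codes (Euclidean distance)."""
    # Pairwise squared distances: |q - t|^2 = |q|^2 - 2 q.t + |t|^2
    d2 = (np.sum(query_codes**2, axis=1, keepdims=True)
          - 2 * query_codes @ train_codes.T
          + np.sum(train_codes**2, axis=1))
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest
    votes = train_labels[nearest]             # their labels
    # Majority vote per query
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy example: two well-separated clusters of "neural codes"
rng = np.random.default_rng(0)
codes = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(1, 0.1, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
queries = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
print(knn_predict(codes, labels, queries, k=5))  # → [0 1]
```

In the actual architecture, `train_codes` and `query_codes` would be the last-hidden-layer activations of the trained CNN for the training and test samples, respectively.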
Moreover, an interesting feature of kNN is its robustness to label noise in the training set when high values of the parameter k are used. We have performed the evaluation with different levels of induced noise to assess the robustness of the proposed method. Using different image datasets and network topologies, we show that the accuracy of the combined CNN and kNN consistently outperforms that of the softmax layer with noisy labels.
The rest of the paper is structured as follows: the proposed hybrid method, the datasets and the network topologies considered are described in Section 2; the evaluation is done in Section 3 using different label noise levels and performing statistical significance tests; finally, conclusions and future work are described in Section 4.

Materials and Methods
The presented work has an eminently experimental nature. Our hypothesis is that a hybrid CNN + kNN can outperform the predictive power of the widely-considered CNN + softmax in noisy environments. To assess this, we compare the results of a CNN trained for a classification task with the learned neural codes of the same CNN, but applying a kNN classifier to the last hidden layer output.
In addition, we consider the ℓ2 norm for the normalization of the neural codes [13,19]. Let x be a vector of size n representing the neural codes; the ℓ2 norm is defined as:

‖x‖₂ = √(x₁² + x₂² + … + xₙ²)

and the normalized neural codes are obtained as x/‖x‖₂. After training a CNN with standard backpropagation, four configurations are evaluated in the inference stage (see Figure 1):

1. Use of the CNN softmax layer.

2. Use of the last hidden CNN layer (before the softmax) to get the neural codes, which are fed to a kNN that compares them to the training set prototypes using Euclidean distance in order to get the most likely class.

3. Similar to the previous configuration, but ℓ2-normalizing the neural codes before the kNN.

4. Use of the kNN directly on the raw data, without any representation learning.

The presented configurations are evaluated with different datasets, selected to cover different numbers of features and samples. Specifically, our evaluation comprises the following seven datasets of images (summarized in Table 1):

• United States Postal Office (USPS) [20] and MNIST [21] are datasets of handwritten digits with binary images.

• Landsat is a dataset from the UCI repository [22]. It comprises 3 × 3 windows in the four spectral bands of satellite images. Each window is labeled according to the information of its central pixel.

• Handwritten Online Musical Symbol (HOMUS) [23] depicts binary images of isolated handwritten music symbols collected from 100 different musicians.

• NIST Special Database 19 (NIST) of the National Institute of Standards and Technology [24] is a huge dataset of isolated characters. For this work, a subset of the upper case characters was selected.

• CIFAR-10 and CIFAR-100 [25] are standard object recognition datasets in the computer vision community. They consist of 32 × 32 color images extracted from the 80 million tiny images dataset [26], containing 10 and 100 different categories, respectively.

As regards preprocessing of the input data, the pixel values of the MNIST, HOMUS and NIST images are divided by 255 for normalization, whereas the mean image is subtracted from the CIFAR images. The USPS and Landsat datasets are already normalized in origin.
Furthermore, Table 2 shows the evaluated network topologies for each dataset. The selected CNN architectures are designed to obtain close to state-of-the-art results on the target task. Custom models were designed for USPS, MNIST, Landsat, HOMUS and NIST, whereas the CIFAR datasets were trained using the well-known VGG16 [27] and the 50-layered ResNet (ResNet50) [28] models, scaling input images to 224 × 224 for the latter.

The CNN training is carried out by means of stochastic gradient descent [29], considering the adaptive learning rate method proposed by Zeiler [30]. A mini-batch size of 128 is used, and the networks are trained for a maximum of 200 epochs, with early stopping if the loss does not decrease for 10 epochs.
A five-fold cross-validation scheme is followed to provide more robust figures with respect to the variance of the training data. That is, each dataset is divided into 5 equal parts. In each iteration, 4 of these folds are used as the training set (80%) and the remaining one as the test set (20%). After performing the experiments, average results are provided. In addition, these folds have been organized preserving the original distribution of the number of samples per class in the dataset.
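A class-preserving (stratified) five-fold split such as the one described above can be sketched as follows, assuming scikit-learn is available (the 60/40 toy labels are illustrative only):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 60 samples of class 0, 40 of class 1 (imbalanced on purpose)
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the 60/40 class ratio: 12 zeros and 8 ones
    counts = np.bincount(y[test_idx])
    print(len(train_idx), len(test_idx), counts)  # 80 20 [12  8]
```

Each iteration yields an 80%/20% train/test split whose per-class proportions match the full dataset, as in the experimental setup.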
Since some of these datasets are not evenly balanced, the performance metric used for evaluation is the weighted average of the F1 scores of each class. The F1 score is defined as:

F1 = 2 · TP / (2 · TP + FP + FN)

where TP denotes the number of True Positives, FP the number of False Positives and FN the number of False Negatives.
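The metric can be computed directly from its definition, as in the following sketch (`weighted_f1` is an illustrative helper; the toy label vectors are not from the paper):

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Weighted average of per-class F1 = 2*TP / (2*TP + FP + FN),
    with weights proportional to each class's support."""
    total, n = 0.0, len(y_true)
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += (np.sum(y_true == c) / n) * f1
    return total

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.833
```

This matches scikit-learn's `f1_score(y_true, y_pred, average='weighted')`.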

Experiments
The four configurations (kNN, CNN + softmax, CNN + kNN and CNN + ℓ2 + kNN) have been evaluated with and without adding synthetic noise to the data by swapping the labels of pairs of randomly-chosen prototypes. The percentages of prototypes that change their label are 10%, 20%, 30% and 40%, since these are common values in this kind of experimentation [31]. It is important to remark that each CNN has been trained separately for each noise level, from 0% to 40%; the weights obtained at each noise level were then used as a basis for evaluating the different CNN configurations at that noise level. Different numbers of neighbors k (from 1 to 50) have also been tested for the kNN algorithm. Additionally, in the hybrid approach, the neural codes of the last hidden layer have been ℓ2-normalized to compare their performance with the unnormalized values.

Table 3 reports the detailed results with the image datasets evaluated, and Figure 2 illustrates the accuracy curves as the level of induced label noise is increased. Overall, using kNN over the raw input data obtains a lower average accuracy than the CNN-based methods, especially on the most challenging datasets. Likewise, accuracy is strongly affected by the level of induced label noise in the dataset: all figures depict a noticeable accuracy drop as this level is increased.

For simple datasets such as USPS, MNIST or Landsat (those in which kNN achieves good results), we observe that CNN-based approaches worsen their performance in a much more pronounced way than kNN as the level of label noise is increased. In fact, for high noise levels, the kNN eventually outperforms these approaches. In the remaining datasets (NIST, HOMUS, CIFAR10, CIFAR100), the kNN is far from the results obtained with the CNN approaches. In the case of NIST, it turns out that CNN + softmax is less accurate in the original scenario, but shows a greater noise robustness even with the lowest noise levels considered.
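The pair-swapping noise injection described above can be sketched as follows (`add_swap_noise` is an illustrative helper, not code from the paper; note that a swapped pair may share the same label, so the effective corruption can be slightly below the nominal fraction):

```python
import numpy as np

def add_swap_noise(labels, noise_frac, seed=0):
    """Corrupt roughly `noise_frac` of the labels by swapping the labels
    of randomly chosen pairs of prototypes."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_noisy = int(len(labels) * noise_frac)
    n_noisy -= n_noisy % 2                      # need an even count to pair up
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    a, b = idx[0::2], idx[1::2]                 # random disjoint pairs
    noisy[a], noisy[b] = labels[b], labels[a]   # swap within each pair
    return noisy

labels = np.arange(1000) % 10                   # 1000 samples, 10 classes
noisy = add_swap_noise(labels, 0.20)
# At most 20% of labels change; pairs with equal labels change nothing
print(np.mean(noisy != labels))
```

Because the corruption is a permutation of existing labels, the per-class sample counts are preserved, unlike uniform random relabeling.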
Hybrid approaches, however, depict a more pronounced degradation curve. Results reported in the rest of the datasets reflect a similar behavior: hybrid approaches achieve better results than the CNN + softmax, which also generally degrades faster as the noise is increased. Although the differences with and without the ℓ2 norm are not especially remarkable, the former is less sensitive to noise.
An interesting remark is that higher values of k are needed in the hybrid approach, even without the presence of induced noise. This might indicate that neural codes are somehow noisy in the closest neighborhood (when k is low), although this condition can be mitigated by considering more neighbors during classification, which means that the proposed technique can benefit from large amounts of annotated data. In this sense, we observe that the ℓ2-normalized codes usually need a lower value of the parameter k, which indicates that this normalization leads to more robust neural codes.

Figure 3 shows the impact on the average F1 when using different values of k. As can be seen, the inclusion of kNN is more beneficial when the noise is higher. In addition, the value of k needed to outperform the softmax depends on the amount of noise: the higher the noise, the higher k should be, but in all cases this is a relatively small value (lower than 15). Therefore, the value of k is not so decisive for improving on the base CNN + softmax, as long as it is increased enough to surpass the elbow of the curve.

As a general conclusion, the proposed approach consistently outperforms the CNN + softmax even with low noise levels. The accuracy improvement is especially remarkable with challenging datasets such as CIFAR. In addition, adding the ℓ2 norm consistently improves the accuracy with respect to using the raw neural codes, both with and without noise.
On the other hand, no unequivocal conclusions can be drawn with very high levels of noise, as in some cases, the hybrid approach outperforms the CNN (such as in CIFAR100), whereas in others (such as CIFAR10), that is not the case. However, there is a trend towards greater robustness to noise from the classical scheme (CNN + softmax) than from the hybrid schemes, contrary to what could be expected because of the use of kNN. A possible explanation is that training a network with noisy labels forces it to map the features onto an incorrect position of the neural space, making kNN misclassify them. That is, label noise might generate noise at the feature level, which prevents a correct kNN classification. The softmax, however, seems to palliate this phenomenon better because a lower degradation is observed as the induced noise is increased.
The addition of a kNN stage slightly increases the computational cost, but given the success of recent fast kNN methods [32][33][34], it might be more efficient than increasing the number of CNN parameters to achieve results similar to those obtained with a simpler CNN plus a kNN classifier. For instance, using the approach proposed in [34], a query on a dataset of five million prototypes with a 128-dimensional feature vector (the size of the Neural Code (NC) considered in this work) takes around 0.1 ms.

Statistical Significance Tests
The previous section has presented the evaluation results, which serve to analyze the general trend of our proposal. However, in this section, statistical tests are performed for comparing the results objectively [35]. Specifically, Wilcoxon signed-rank tests are used, which allow a pairwise comparison of the methods.
We performed 4 tests: CNN + softmax vs. kNN, CNN + kNN vs. CNN + softmax, CNN + ℓ2 + kNN vs. CNN + softmax, and CNN + ℓ2 + kNN vs. CNN + kNN. The first one verifies that, even in the cases with greater noise, the selected CNN are good predictors. The second and third tests verify the goodness of the proposal of this work: the use of a kNN algorithm instead of the softmax of the network. Finally, the last test measures the statistical improvement when using ℓ2 normalization. Table 4 reports the results of such tests, with significance established at a p-value < 0.05. The symbols in the table indicate whether the test is significantly accepted in one direction or the other; no symbol is depicted otherwise. These tests reflect that CNN + softmax significantly outperforms kNN for all noise levels, as may be expected. Furthermore, the statistical test only recommends using CNN + softmax instead of the hybrid model without normalization in one case (label noise level of 40%), yet it never does when normalization is considered. Consequently, adding ℓ2 is always statistically beneficial with respect to the original neural codes.
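A pairwise comparison of this kind can be sketched with SciPy's Wilcoxon signed-rank test; the per-experiment F1 values below are purely illustrative numbers, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-experiment F1 scores for two configurations
# (one value per dataset/noise-level combination; illustrative only)
f1_softmax = np.array([0.91, 0.85, 0.78, 0.70, 0.62, 0.88, 0.74, 0.66])
f1_hybrid = f1_softmax + np.array([0.01, 0.02, 0.03, 0.04,
                                   0.05, 0.06, 0.07, 0.08])

# Paired two-sided test on the per-experiment differences
stat, p = wilcoxon(f1_hybrid, f1_softmax)
print(p < 0.05)  # → True: significant at the 0.05 level
```

Since the hybrid configuration wins every pairing in this toy example, the rank sum of negative differences is zero and the test rejects the null hypothesis of equal medians.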

Comparison with Other Classifiers
One must also address the question of whether the improvement in the results is due to the robustness of kNN against noise or simply to replacing the softmax layer with another supervised classifier. That is why, in this section, we analyze the results of replacing the softmax with other schemes. In particular, we consider the following representative algorithms for supervised classification:

• Support Vector Machine (SVM) [36]: This learns a hyperplane that tries to maximize the distance to the nearest samples (support vectors) of each class. It makes use of kernel functions to handle non-linear decision boundaries. In our case, a radial basis function (or Gaussian) kernel is considered. Typically, SVM also considers a parameter that measures the cost of learning a non-optimal hyperplane, usually referred to as parameter c. During preliminary experiments, we tuned this parameter in the range c ∈ [1, 9].

• Random Forest (RaF) [37]: This builds an ensemble classifier by generating several random decision trees at the training stage. The final output is obtained by combining the individual decisions of each tree. The number of random trees has been established by experimenting in the range t ∈ [10, 500].
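Replacing the softmax with either classifier can be sketched as follows, assuming scikit-learn; the Gaussian clusters are toy stand-ins for real neural codes, and the specific hyperparameter values are illustrative picks from the ranges mentioned above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Stand-in "neural codes": two well-separated Gaussian clusters
rng = np.random.default_rng(1)
codes = np.vstack([rng.normal(0, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

# RBF-kernel SVM; the cost parameter C was tuned in [1, 9] in the paper
svm = SVC(kernel='rbf', C=1.0).fit(codes, labels)
# Random Forest; the number of trees was explored in [10, 500]
raf = RandomForestClassifier(n_estimators=100, random_state=1).fit(codes, labels)

queries = np.array([[0.0] * 8, [1.0] * 8])
print(svm.predict(queries), raf.predict(queries))  # each query lands in its own cluster
```

In the actual experiments, `codes` would be the ℓ2-normalized (or raw) last-hidden-layer activations rather than synthetic points.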
For the sake of clarity, we directly provide in Figure 4 the average F1 of these classifiers over all datasets, in comparison to the CNN + softmax, CNN + kNN and CNN + ℓ2 + kNN schemes. It can be observed that the approach proposed in this work (CNN + ℓ2 + kNN) attains the best results in all cases, increasing its margin over the other schemes as the noise level grows. Note also that, on average, SVM and RaF also outperform both CNN + softmax and CNN + kNN in the presence of noise, yet they remain noticeably below CNN + ℓ2 + kNN in terms of accuracy.

Conclusions and Future Work
In this work, we have proposed a hybrid method that replaces the softmax layer with a kNN classifier in the inference stage, without modifying the training stage. Different datasets, network topologies and noise levels have been evaluated, performing statistical tests to draw rigorous conclusions.
We show that replacing softmax by kNN significantly outperforms traditional CNN architectures with noisy labels. The optimal number of neighbors is higher in larger datasets; therefore, the proposed method can benefit from large amounts of annotated data. In these large datasets, the performance gain is remarkable, as can be seen in the CIFAR experiments.
It has been statistically verified that a non-linear classifier such as kNN applied to the CNN representations significantly increases the accuracy with low noise levels. However, with high label noise levels, softmax and kNN obtain similar average results. This is because training with label noise forces the CNN to learn an incorrect transformation of the data, which generates noise at the feature level. Although the softmax layer may learn to correct this problem, the noisy features confuse the kNN classifier. It has also been statistically verified that adding ℓ2 normalization is beneficial at any noise level.
An advantage of using kNN is that, besides the class, the most similar prototypes are also obtained, making it suitable for similarity search tasks. In addition, we observed that replacing the softmax layer by other classifiers such as SVM or RaF is also beneficial in terms of accuracy, yet to a lesser extent than considering kNN. On the other hand, computational cost is higher with kNN, but given the efficiency of the recent approximate methods, this may not be a severe issue.
As future work, the reported results could be improved by using networks that aim to maximize the kNN accuracy by jointly training the CNN with kNN in order to optimize the classification error on training data [38] or introducing metric learning techniques. Furthermore, it would be interesting to compare kNN with other classifiers such as SVM or RaF using neural codes.