2.2. State of the Art
Because MNIST has been widely used to test the behavior of many implementations of classifiers, a number of rankings using MNIST as a benchmark have been published in the past. Most of these rankings, as well as the established literature, use the “test error rate” metric when referring to performance on MNIST. This metric is the percentage of incorrectly classified instances.
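For clarity, this metric can be written explicitly as follows, where $N_{\mathrm{err}}$ denotes the number of misclassified test instances and $N$ the total number of test instances (the notation is ours, not taken from the cited works):

\[ \text{test error rate} = \frac{N_{\mathrm{err}}}{N} \times 100\% \]

For example, 100 misclassified digits out of the 10,000 instances of the MNIST test set yield a test error rate of 1.0%.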
One of the earliest rankings was published by LeCun et al. [2] themselves, and includes references up to 2012. This website provides a taxonomy of classifiers and points out whether each work performed data preprocessing or augmentation (the addition of new instances to the training set resulting from distortions or other modifications of the original data; a brief illustrative sketch of this idea is given below).
In this ranking, we can verify that the early machine learning approaches used by LeCun et al. [4] included linear classifiers (with error rates ranging from 7.6 to 12%), K-nearest neighbors approaches (K-NN, ranging from 1.1 to 5%), non-linear classifiers (about 3.5%), support vector machines (SVM, from 0.8 to 1.4%), neural networks (NN, from 1.6 to 4.7%) and convolutional neural networks (CNN, from 0.7 to 1.7%). It is remarkable that data augmentation leads to better results: in particular, the best error rate achieved in LeCun et al. [4] using a convolutional neural network with no distortions and no preprocessing was 0.95%.
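To make the notion of distortion-based augmentation concrete, the following minimal sketch (our own illustration, not the exact procedure of any of the cited works; the parameter ranges are arbitrary assumptions) creates additional labeled training instances by applying small random rotations and shifts to a digit image:

    import numpy as np
    from scipy.ndimage import rotate, shift

    def augment_digit(image, rng, n_copies=4):
        """Create distorted copies of a 28x28 digit by small random
        rotations and translations (illustrative parameter ranges)."""
        copies = []
        for _ in range(n_copies):
            angle = rng.uniform(-10, 10)           # rotation in degrees
            dy, dx = rng.integers(-2, 3, size=2)   # translation in pixels
            distorted = rotate(image, angle, reshape=False, order=1)
            distorted = shift(distorted, (dy, dx), order=1)
            copies.append(np.clip(distorted, 0.0, 1.0))
        return copies

    # Each distorted copy keeps the label of the original digit, so the
    # training set grows by a factor of (1 + n_copies).
    rng = np.random.default_rng(0)
    digit = np.zeros((28, 28)); digit[6:22, 13:15] = 1.0  # toy vertical stroke
    extra_training_instances = augment_digit(digit, rng)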
From the different works gathered in LeCun et al.’s ranking, those based on convolutional neural networks outperform the other techniques. However, some classical machine learning techniques are still able to provide competitive error rates (in this paper, we will consider “competitive” those techniques achieving an error rate under 1.0%). For instance, Belongie et al. [5] achieved 0.63% and Keysers et al. [6] achieved 0.54% and 0.52% using K-NN; Kégl and Busa-Fekete [7] achieved 0.87% using boosted stumps on Haar features; and LeCun et al. [4] achieved 0.8% and Decoste and Schölkopf [8] attained results from 0.56 to 0.68% using SVM.
When using non-convolutional neural networks, Simard et al. [9] achieved 0.9% and 0.7% using a 2-layer neural network with MSE and cross-entropy loss functions, respectively. Deng and Yu [10] achieved 0.83% using a deep convex network without data augmentation or preprocessing. Very interesting results were attained by Meier et al. [11] (0.39%) using a committee of 25 neural networks, and by Cireşan et al. [12] (0.35%) using a 6-layer neural network (it is worth mentioning that the reproducibility of this result has been put into question by Martin [13] in his blog).
On the other hand, works based on convolutional neural networks attained a much better average performance (in fact, the worst CNN-based result reported in LeCun’s ranking was 1.7%). LeCun et al. [4] combined different convolutional architectures along with data augmentation techniques, obtaining error rates ranging from 0.7 to 0.95%. Lauer et al. [14] attained between 0.54% and 0.83% using a trainable feature extractor along with SVMs, and Labusch et al. [15] reported an error rate of 0.59% using a similar technique consisting of a CNN to extract sparse features and SVMs for classification. Simard et al. [9] obtained error rates between 0.4% and 0.6% using CNNs with a cross-entropy loss function and data augmentation. Ranzato et al. [16] reported results between 0.39% and 0.6% using a large CNN along with unsupervised pretraining, and some years later Jarrett et al. [17] reported a test error of 0.59% with a similar approach and without data augmentation. The best results in this ranking are those obtained by Cireşan et al. [18], who reported an error rate of 0.35% using a large CNN, and 0.27% [19] and 0.23% [20] using committees of 7 and 35 neural networks, respectively, with data augmentation in all cases.
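The committees mentioned above combine the per-class outputs of several independently trained networks. The following minimal sketch (our illustration under simple assumptions, not the exact averaging scheme of the cited works) averages the softmax outputs of the members and predicts the class with the highest mean score:

    import numpy as np

    def committee_predict(member_probs):
        """Average the class-probability outputs of several trained networks and
        return, for each test instance, the class with the highest mean score."""
        mean_probs = np.mean(np.stack(member_probs), axis=0)  # (n_samples, n_classes)
        return np.argmax(mean_probs, axis=1)

    # Toy usage: three committee members, two test digits, ten classes.
    rng = np.random.default_rng(0)
    member_probs = [rng.dirichlet(np.ones(10), size=2) for _ in range(3)]
    predicted_digits = committee_predict(member_probs)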
In 2015, McDonnell et al. [21] proposed an approach using the ‘extreme learning machine’ (ELM) [22] algorithm to train shallow non-convolutional neural networks (the basic idea is sketched below), and they also found that data augmentation with distortions can reduce error rates even further. In their work, McDonnell et al. compare their results with previous ELM and selected non-ELM approaches, updated with some CNN-based works as of 2015. The most outstanding ELM-based results achieved a test error rate of 0.97% in the work by Kasun et al. [23], and ranged from 0.57 to 1.36% in the proposal by McDonnell et al. [21].
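In its basic form, an extreme learning machine is a single-hidden-layer network whose hidden weights are drawn at random and kept fixed, so that only the output weights need to be fitted, typically by least squares. The sketch below illustrates this idea under simple assumptions (tanh activation, arbitrary layer size); it is not the specific variant used in the works cited above:

    import numpy as np

    def train_elm(X, y_onehot, n_hidden=1000, rng=None):
        """Basic ELM: random, fixed hidden weights; output weights fitted
        by a least-squares solution (illustrative sizes and activation)."""
        rng = rng or np.random.default_rng(0)
        W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
        b = rng.normal(scale=0.1, size=n_hidden)
        H = np.tanh(X @ W + b)                               # hidden-layer activations
        beta, *_ = np.linalg.lstsq(H, y_onehot, rcond=None)  # output weights
        return W, b, beta

    def predict_elm(X, W, b, beta):
        return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)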
However, the most interesting results included in the comparison by McDonnell et al. [21] are not those provided in that work, but those from previous works where convolutional neural networks were used. For example, Wan et al. [24] used CNNs with DropConnect, a generalization of dropout (briefly sketched below), and reported an error rate of 0.57% without data augmentation and as low as 0.21% with data augmentation. Zeiler and Fergus [25] proposed the use of stochastic pooling, achieving an error rate of 0.47%. Goodfellow et al. [26] described the maxout model averaging technique, attaining a test error rate of 0.45% without data augmentation. In 2015, Lee et al. [27] described “deeply supervised nets”, an approach in which a classifier (SVM or softmax) is introduced at the hidden layers, reporting a result of 0.39% without data augmentation.
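For context, dropout randomly zeroes hidden activations during training, whereas DropConnect randomly zeroes individual weights. The sketch below contrasts the two for a single fully connected layer; it is our own simplified illustration (with an assumed drop probability and inverted scaling), not the implementation of Wan et al. [24]:

    import numpy as np

    def layer_with_dropout(x, W, rng, p=0.5):
        """Dropout: each hidden activation is zeroed with probability p."""
        h = np.maximum(x @ W, 0.0)            # ReLU activations
        keep = rng.random(h.shape) >= p
        return h * keep / (1.0 - p)           # inverted-dropout scaling

    def layer_with_dropconnect(x, W, rng, p=0.5):
        """DropConnect: each individual weight is zeroed with probability p."""
        keep = rng.random(W.shape) >= p
        return np.maximum(x @ (W * keep) / (1.0 - p), 0.0)

    rng = np.random.default_rng(0)
    x = rng.random((32, 784))                  # mini-batch of flattened 28x28 digits
    W = rng.normal(scale=0.01, size=(784, 256))
    h_dropout = layer_with_dropout(x, W, rng)
    h_dropconnect = layer_with_dropconnect(x, W, rng)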
A more recent ranking, updated as of 2016, has been made available by Benenson [3] on his GitHub page. This ranking includes many of the works reviewed so far, as well as others with very competitive results. For example, Sato et al. [
28] explore and optimize data augmentation, attaining a test error rate of 0.23% on the MNIST database using a convolutional neural network. A very interesting result is reported by Chang and Chen [29], where an error rate of 0.24% (the best result found as of 2017 in the state of the art without data augmentation) was obtained using a network-in-network approach with a maxout multi-layer perceptron, which they named MIN. Also, very competitive performances without data augmentation were reported by Lee et al. [
30] when using gated pooling functions in CNNs (0.29%), by Liang and Hu [
31] when using a recurrent CNN (0.31%), by Liao and Carneiro [
32] when using CNNs with normalization layers and piecewise linear activation units (0.31%) and competitive multi-scale convolutional filters (0.33%) [
33], or by Graham [
34] using fractional max-pooling with random overlapping (0.32%).
Additional works where state-of-the-art results were obtained without data augmentation include those by McDonnell and Vladusich [35], who reported a test error rate of 0.37% using a fast-learning shallow convolutional neural network; Mairal et al. [36], who achieved 0.39% using convolutional kernel networks; Xu et al. [37], who explored multi-loss regularization in CNNs, obtaining an error rate of 0.42%; and Srivastava et al. [38], who used so-called convolutional “highway” networks (inspired by LSTM recurrent networks) to achieve an error rate of 0.45%.
Lin et al. [
39] reported an error rate of 0.47% using a network-in-network approach, where micro-neural networks are used as convolutional layers. Ranzato et al. [40] used convolutional layers for feature extraction and two fully connected NNs for classification, attaining an error rate of 0.62%. Bruna and Mallat [41] designed a network that applies wavelet transform convolutions to the input followed by a generative PCA (principal component analysis) classifier, obtaining an error of 0.43%.
Calderón et al. [
42] proposed a variation of a CNN where the first layer applies Gabor filters over the input, obtaining an error of 0.68%. Also, Le et al. [
43] explored the application of limited-memory BFGS (L-BFGS) and conjugate gradient (CG) as alternatives to stochastic gradient descent for network optimization, achieving an error rate of 0.69%. Finally, Yang et al. [44] developed a new transform named Adaptive Fastfood to reparameterize the fully connected layers, obtaining a test error rate of 0.71%.
An interesting approach is followed by Hertel et al. [
45], in which they reuse the convolutional kernels previously learned on the ILSVRC-12 dataset, showing that CNNs can learn generic feature extractors that can be reused across tasks, and achieving a test error rate of 0.46% on the MNIST dataset (vs. 0.32% when training the network from scratch).
It can be seen that most recent works are based on convolutional neural networks, due to their high performance. However, some exceptions can be noticed. One example is the work by Wang and Tan [46], where they attained a test error rate of 0.35% using a single-layer centering support vector data description (C-SVDD) network with multi-scale receptive voting and SIFT (scale-invariant feature transform) features. Also, Zhang et al. [47] devised a model (HOPE) for projecting features from raw data that can be used to train a neural network, attaining a test error rate of 0.40% without data augmentation. Visin et al. [48] proposed an architecture in which the convolutional layers are replaced by recurrent neural networks (RNNs) that sweep across the image, attaining an error rate of 0.45% with data augmentation.
Additional examples with lower performance include the work by Azzopardi and Petkov [49], who used a combination of shifted filter responses (COSFIRE), attaining an error rate of 0.52%. Also, Chan et al. [50] used a PCA network to learn multi-stage filter banks, reporting an error rate of 0.62%. Mairal et al. [51] used task-driven dictionary learning, reporting a test error rate of 0.54%, albeit using data augmentation. Jia et al. [
52] explored the receptive field of pooling, and obtained an error of 0.64%. Thom and Palm [
53] explored the application of the sparse connectivity and sparse activity properties to neural networks, obtaining an error rate of 0.75% using a supervised online autoencoder with data augmentation. Lee et al. [
54] described convolutional deep belief networks with probabilistic max-pooling, achieving an error rate of 0.82%. Min et al. [55] reported an error of 0.95% using a deep encoder network to perform non-linear feature transformations, which are then fed to a K-NN classifier. Finally, Yang et al. [56] used supervised translation-invariant sparse coding with a linear SVM, attaining an error rate of 0.84%.
Another approach to the MNIST classification problem involves so-called “deep Boltzmann machines”. The first application of this technique was suggested by Salakhutdinov and Hinton [57], who reported an error rate of 0.95%. A few years later, Goodfellow et al. [
58] introduced the multi-prediction deep Boltzmann machine, achieving a test error rate of 0.88%.
The reviewed rankings gather most of the works providing very competitive results for the MNIST database. Besides those rankings, we have found a work by Mishkin and Matas [59], where the authors focus on the weight initialization of CNNs and replace the softmax classifier with an SVM, achieving a test error rate of 0.38%, and another work by Alom et al. [
60] where inception-recurrent CNNs were used to attain an error of 0.29%.
Finally, it is worth mentioning some techniques that automatically optimize the topology of CNNs: MetaQNN [61], which relies on reinforcement learning for the design of CNN architectures, has reported an error rate on MNIST of 0.44%, and of 0.32% when using an ensemble of the best networks found. Also, DEvol [62], which uses genetic programming, has obtained an error rate of 0.6%. Baldominos et al. [63] presented a work in 2018 where the topology of the network is evolved using grammatical evolution, attaining a test error rate of 0.37% without data augmentation; this result was later improved down to 0.28% by means of the neuroevolution of committees of CNNs [64]. Similar approaches of evolving a committee of CNNs were presented by Bochinski et al. [
65], achieving a very competitive test error rate of 0.24%; and by Baldominos et al. [
66], where the models comprising the committee were evolved using a genetic algorithm, reporting a test error rate of 0.25%.
A summary of all the reviewed works can be found in
Table 1 and
Table 2.
Table 1 includes works where some preprocessing or data augmentation took place before classification, whereas
Table 2 includes the remaining works. The top of each table shows the performance of classical machine learning techniques, while the lower part shows the results reported using CNNs. In those cases where authors have reported more than one result using a single classifier (e.g., with different parameters), only the best result is shown.