Structure and Base Analysis of Receptive Field Neural Networks in a Character Recognition Task

This paper explores extensions and restrictions of shallow convolutional neural networks with fixed kernels trained with a limited number of training samples. We extend the work recently done in research on Receptive Field Neural Networks (RFNN) and show their behaviour using different bases and step-by-step changes within the network architecture. To ensure the reproducibility of the results, we simplified the baseline RFNN architecture to a single-layer CNN network and introduced a deterministic methodology for RFNN training and evaluation. This methodology enabled us to evaluate the significance of changes using Bayesian comparison, which has recently become widely used in neural network research. The results indicate that a change in the base may have less of an effect on the results than re-training using another seed. We show that the simplified network with the tested bases has performance similar to the chosen baseline RFNN architecture. The data also show the positive impact of energy normalization of the used filters, which improves the classification accuracy even when using randomly initialized filters.


Introduction
Convolutional neural networks bring, along with considerable advantages, a number of challenging problems. One of them is the choice of hyperparameters for the individual layers. From previous research, we know the effects of the activation function, depth of the network, different types of normalization, preprocessing, and optimization methods on the overall accuracy achieved [1]. The choice of convolutional kernel sizes and their numbers in the individual layers is still an open problem. The popular answer is a minimum symmetric kernel of size 3 × 3 pixels [2], which also contains a small number of parameters. However, this solution is memory-intensive in terms of calculating convolution with high-resolution input images, and, therefore, the convolutional kernel can be enlarged with respect to physical constraints [3]. Another reason for changing the kernel size is that it directly affects the size of the receptive field, as a larger filter takes into account a larger surrounding area and captures higher-level image features. However, enlarging the size of a kernel quadratically increases the number of its possible parameters and may not improve network accuracy due to generalization problems [4,5]. Moreover, rapid reduction of dimensionality by convolution with a larger kernel leads to shallower architectures. The practical solution is the use of a grid search [6], as there is no general solution yet, and the architectural design of these networks requires some experience.
Another no less important problem of classic deep networks is their representative complexity, where it is not entirely clear what transformation the network performs and what the individual features mean. Therefore, some authors also believe the direction that research should take is to create simpler rather than deeper models [7,8]. This trend of architecture modifications towards parameter reduction and simplification has become an active area of research in recent years. In this case, it is important to maintain the ability of networks to produce representative features with an emphasis on achieved accuracy.
Last but not least, the need for a large amount of data for the training process itself significantly limits the wider deployment of deep learning. Although there are ways to achieve good results on smaller datasets, such as using transfer learning and expanding the training set by image augmentation [9], these techniques should be implemented with caution. The use of transfer learning is, in many cases, application-specific [10], and image augmentation techniques can create small image artefacts that must be considered in critical applications such as medical diagnostics [11].

Simplification of Convolutional Neural Networks
With regard to the success of deep neural networks, the main motivations for their simplification are deployment on devices with limited computing resources, as well as a deeper understanding and easier interpretation of the created models. One of the major simplifications was tested by Iandola et al. [12], who proposed the SqueezeNet architecture. In addition to the introduction of Fire Modules that reduce the dimensionality of data in the first step, followed by an expansion phase, the same authors also introduced complex bypass connections. These bridging connections are similar to those in the ResNet [13] architecture but with an introduction of the point-wise convolution operation. The overall architecture has 50-fold fewer parameters while retaining accuracy similar to that of the AlexNet [3] architecture. Another point of interest is that fully connected layers were not used in the SqueezeNet architecture intended for classification. Instead of the common approach, one feature map for each corresponding category of the classification task was generated. On top of these feature maps, the average of each feature map was taken separately, resulting in an output vector that was then directly fed into the softmax layer. This approach is generally called global average pooling and was introduced by Lin et al. [14] when proposing the Network in Network architecture. Building on the success of previous architectures, Howard et al. [15] presented an efficient convolutional neural network designed for deployment on mobile devices based on depthwise separable convolution [16].
One of the other approaches is the replacement of some convolutional kernels by approximated Gabor filters. In convolutional computational layers, these filters are used as fixed kernels for the extraction of intrinsic features. Some authors used Gabor filters only in the first convolutional layer [17,18], while Sarwar et al. [19] also looked at their use in other layers with an increasing depth of the network. Adding fixed filters to deeper layers of the network, however, results in a decrease in overall accuracy. Therefore, the same authors proposed a solution that uses a combination of fixed and trainable kernels. The interim results show that this method leads to a significant improvement in energy savings, shortened training time, and reduced memory requirements with a negligible drop in classification accuracy. The combination of learned free-form filters with structured ones also brings the ability to modify the size of convolutional kernels in the training process [20]. It can lead to an effective expansion of the total number of convolutional kernels [21] or a reduction of parameters by creating efficient layers for feature extraction [22,23].
In addition to these methods, there are many approaches based on compression, regularization, and pruning [24,25].

Path Towards Receptive Field Neural Networks
One of the possible solutions that could at least partially solve these problems is to replace standard convolutional kernels with a linear combination of predefined and fixed filters in convolutional neural network architectures. This influential concept, proposed by Jacobsen et al. [26], defines Receptive Field Neural Networks (RFNNs) as a group of networks that do not handle convolutional kernels as individual pixels but as functions in scale-space using a finite Gaussian derivative basis. The architecture of RFNNs allows the use of any unspecified base, which creates efficient filters designed for feature extraction. It is therefore not surprising that models based on fixed filters according to the original Receptive Field Neural Network (RFNN) architecture have aroused the interest of researchers in recent years. Schlimbach [27] showed that filter size has an impact on network performance, and it was therefore proposed to include scale as a parameter in the learning process itself. This idea was later elaborated by Pintea et al. [28], who created a network that effectively learns the dynamic size of the convolutional kernel. A similar approach was developed by Tabernik et al. [22,23], who created Displaced Aggregation Units (DAUs) capable of learning the receptive field size through spatial displacement. By extending the original model with a Gaussian base of multiple scales and orientations, the authors of [29] created a Multi Scale and Orientation Structured Receptive Neural Network (MSO-RFNN) inspired by the DenseNet architecture [30].
Although RFNN architectures are specific to their Gaussian derivative basis, other bases have also been tested. Verkes [31] proposed using Gabor kernels, where the real and imaginary parts are rotated to match a reference Gaussian basis accordingly, and achieved similar results. Discrete directional Parseval frames with compact support were tested in the role of a base by Labate et al. [32]. With the introduction of sparsity constraints, the Geometric-Biased CNN (GBCNN) was created, which is capable of solving the problem of hyperspectral image classification. By incorporating Parseval frames into RFNN ResNet-like models, the ResnetRF architecture was proposed by Karantzas et al. [33]. Ulicny et al. [34] proposed using spectral filters based on a Discrete Cosine Transform (DCT), thus creating harmonic neural networks achieving high accuracy in classification and segmentation tasks.
Although several bases have been proposed (Table 1), a valid comparison is still lacking, especially in the area in which these models excel: training with a small number of training samples. This implies our intention to verify the possibility of using different bases in the RFNN architecture under deterministic conditions.
In this work, we analyse and extend RFNN to study the consequences of changes in the individual compact bases. We restrict RFNN into shallow architectures to look more closely at the impact of different parameters and introduce a deterministic methodology for obtaining reproducible results and evaluation in the form of a Correlated Bayesian t-test.
The paper is structured as follows. In Section 2, we review in detail the RFN layer, which is the crucial layer for the RFNN architecture. In Section 3, we present the reference architecture chosen as the baseline for the experiments. The methodological changes that enable reproducibility of experiments are proposed in Section 4. The performed experiments and their results are presented in Section 5. The obtained results are discussed in Section 6. We conclude our contribution in Section 7.

Receptive Field Neural Layer
In neural network architectures, a convolution layer is commonly used for feature extraction. It may differ in implementation, but usually this layer performs a discrete convolution, denoted by the * operator, between the input tensor h_x ∈ ℝ^(1×D_x×W_x×H_x) and the set of convolutional kernels K ∈ ℝ^(N_K×D_x×W_K×H_K), which at the same time represent the set of parameters of this computational layer. The output is a multidimensional feature map h_y ∈ ℝ^(1×N_K×(W_x−W_K+1)×(H_x−H_K+1)), which is created by applying the convolution kernels at all positions where the kernel fits entirely within the input boundaries. Equation (1) shows the calculation of the j-th feature map h_y_j ∈ ℝ^(1×1×(W_x−W_K+1)×(H_x−H_K+1)), j ∈ {0, 1, 2, …, N_K − 1} in a standard 2D convolutional layer, where q ∈ {0, 1, 2, …, W_x − W_K} and p ∈ {0, 1, 2, …, H_x − H_K} are its horizontal and vertical indices, respectively:

h_y_j(p, q) = Σ_{d=0}^{D_x−1} Σ_{m=0}^{H_K−1} Σ_{n=0}^{W_K−1} h_x(d, p + m, q + n) · K_j(d, m, n). (1)
We can decompose the regular convolution into two simpler operations: extraction of features in the spatial plane using depthwise convolution and subsequent channel recombination using pointwise convolution, effectively creating a depthwise separable convolutional layer [44,45]. Depthwise convolution is a convolution in the spatial domain applied independently over every channel of the input; the calculation of the feature map for the k-th kernel and d-th input channel by the 2D depthwise convolution is described by Equation (2):

h_s_kd(p, q) = Σ_{m=0}^{H_K−1} Σ_{n=0}^{W_K−1} h_x(d, p + m, q + n) · Θ_{k,d}(m, n). (2)

The subsequent pointwise convolution can be seen as a regular convolution with a 1 × 1 kernel size, recombining the channels along the depth and projecting them onto a new channel space, creating the output j-th feature map h_y_j (Equation (3)):

h_y_j(p, q) = Σ_{k=0}^{N_K−1} Σ_{d=0}^{D_x−1} α_{j,kd} · h_s_kd(p, q). (3)

The original set of parameters K is hereby replaced by Θ ∈ ℝ^(N_K×D_x×W_K×H_K) as a set of effective filters and α ∈ ℝ^(D_Y×N_K·D_x×1×1) for the subsequent recombination of the created feature maps.
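The decomposition above can be illustrated with a minimal NumPy sketch. The function names are ours, and, as in common deep learning frameworks, the "convolution" is implemented as cross-correlation; a shared filter bank is applied to every channel here, which matches the RFNN setting described later.

```python
import numpy as np

def depthwise_conv2d(x, theta):
    """Valid-mode depthwise (cross-)correlation.

    x:     input of shape (D_x, H_x, W_x)
    theta: filter bank of shape (N_K, H_K, W_K), applied to every channel
    returns feature maps of shape (N_K * D_x, H_x - H_K + 1, W_x - W_K + 1)
    """
    D_x, H_x, W_x = x.shape
    N_K, H_K, W_K = theta.shape
    H_y, W_y = H_x - H_K + 1, W_x - W_K + 1
    out = np.empty((N_K * D_x, H_y, W_y))
    for k in range(N_K):
        for d in range(D_x):
            for p in range(H_y):
                for q in range(W_y):
                    # sum of the elementwise product of kernel and window
                    out[k * D_x + d, p, q] = np.sum(
                        x[d, p:p + H_K, q:q + W_K] * theta[k])
    return out

def pointwise_conv2d(h, alpha):
    """1x1 convolution: alpha of shape (D_y, C) recombines the C channels."""
    return np.tensordot(alpha, h, axes=([1], [0]))
```

The nested loops are written for clarity, not speed; real frameworks express the same computation as a single grouped convolution.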
The main idea of the RFNN computational layer is to replace the parameters of 2D spatial filters Θ with N K fixed-base filters θ. This makes it possible to introduce information into the network a priori in the form of fixed spatial 2D convolutional filters for the extraction of intrinsic features, which are then linearly combined to create effective feature maps h y at the output of the RFNN layer (Equation (4)).
By introducing fixed filters from a certain domain, the convolutional kernels can be treated as functions, while the network parameters are also reduced, as only α parameters are learned during the training process. The detailed architecture of the RFNN layer is shown in Figure 1.
The hyperparameters of this layer are:
• A set of fixed two-dimensional kernels (N_K, the number of kernels; W_K, H_K, the width and height of the kernels);
• The convolution parameters used for feature extraction (padding, stride, and possibly other additional parameters);
• The number of output feature maps (D_Y, the depth dimension of the output tensor).
Figure 1. Receptive Field Neural Layer. The set of N_K fixed filters is first copied D_x times to equal the number of input channels. Subsequently, intrinsic features are extracted using the depthwise convolution, and the output feature maps are created in the learning process by a linear combination using a pointwise convolution.
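A sketch of such a layer in PyTorch (the framework used later in the paper) might register the fixed basis as a non-trainable buffer and learn only the 1 × 1 recombination. The class name and structure here are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFNLayer(nn.Module):
    """Receptive Field Neural layer: fixed depthwise basis + learned 1x1 mix.

    basis: tensor of shape (N_K, H_K, W_K) holding the fixed kernels.
    Only the pointwise weights `alpha` are learned during training.
    """
    def __init__(self, in_channels, out_channels, basis):
        super().__init__()
        n_k = basis.shape[0]
        # copy the basis D_x times so every input channel is filtered
        weight = basis.unsqueeze(1).repeat(in_channels, 1, 1, 1)
        self.register_buffer("theta", weight)   # fixed, not a Parameter
        self.in_channels = in_channels
        self.alpha = nn.Conv2d(n_k * in_channels, out_channels,
                               kernel_size=1, bias=False)

    def forward(self, x):
        # depthwise: each input channel convolved with all N_K fixed filters
        h = F.conv2d(x, self.theta, groups=self.in_channels)
        # pointwise: learned linear combination of the intrinsic features
        return self.alpha(h)
```

Because `theta` is a buffer, it is saved with the model but receives no gradients, so only the α parameters are optimized, as described above.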

Choice of Reference Architecture
As a baseline for the comparative evaluation of RFNN networks, we selected the methodology and results from Jacobsen et al. [26], as described below. We re-implemented their proposed RFNN network to reproduce the originally published results and to ensure a correct implementation using the PyTorch [46] framework. We refer to this implementation as RFNN_ref. Since the original article contains several experiments and architectural modifications, we decided to concentrate on the results where the MNIST [47] dataset was used (Figure 5 in the original paper [26]).
The baseline architecture is a convolutional neural network with three layers of receptive fields and is used to classify the known problem of handwritten digits into ten classes. A detailed description of the individual layers together with the output dimensions and number of parameters is given in Table 2.

Table 2. Baseline reference neural network architecture (RFNN_ref) with 3 receptive field convolutional layers used in the classification task on the MNIST dataset. The output dimension of individual computational layers is denoted in the form N@H × W × D, where N is the number of objects with height, width, and depth, respectively.
Fixed convolution kernels from the Gaussian basis were formed in the same way for all layers, differing only in size. In the first layer, the convolutional kernels have a size of 11 × 11 pixels and σ = 1.5, while in the second and third layers the kernels are identical, with a size of 7 × 7 pixels and σ = 1. In each receptive field layer, the filtered outputs after depthwise convolution are recombined into 64 output feature maps.
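The Gaussian derivative basis used for these fixed kernels can be constructed as in the following NumPy sketch, which builds the 1D derivatives analytically via the Hermite recurrence and takes outer products, yielding the 10 filters of derivative orders 0–3. This is our own construction and may differ in detail from the original RFNN code.

```python
import numpy as np

def gaussian_derivative_basis(size=11, sigma=1.5, max_order=3):
    """2D Gaussian derivative basis G_{m,n} with m + n <= max_order.

    For max_order = 3 this yields 1 + 2 + 3 + 4 = 10 kernels,
    matching the 10-filter basis described in the text.
    """
    r = (size - 1) / 2
    x = np.arange(size) - r
    g = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    def hermite(m, t):
        # physicists' Hermite polynomials via the standard recurrence
        h_prev, h = np.ones_like(t), 2 * t
        if m == 0:
            return h_prev
        for k in range(1, m):
            h_prev, h = h, 2 * t * h - 2 * k * h_prev
        return h

    def gd(m):
        # m-th derivative of the 1D Gaussian:
        # d^m/dx^m G(x) = (-1/(sigma*sqrt(2)))^m H_m(x/(sigma*sqrt(2))) G(x)
        t = x / (sigma * np.sqrt(2))
        return (-1 / (sigma * np.sqrt(2)))**m * hermite(m, t) * g

    basis = [np.outer(gd(m), gd(n))
             for order in range(max_order + 1)
             for m in range(order + 1)
             for n in [order - m]]
    return np.stack(basis)
```

The zeroth-order kernel integrates to approximately one and the odd-order kernels sum to approximately zero, which is a quick sanity check on the construction.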

Proposed Changes
Although we managed to verify the performance of the baseline architecture (see Section 5.1) and even achieved better results (see Section 5.2) than the original authors using the baseline architecture, we noticed the shortcomings of such an evaluation, which are common in the field of neural networks. The results may be affected by the setting of hyperparameters and the training methodology, the randomness of the selection and order of samples, as well as stochastic regularization (e.g., dropout) during training. These effects are more pronounced with a smaller training set, which, in our case, manifested as an increase in variance. We encountered these difficulties when reproducing the original results. Therefore, we propose a change to a methodology that ensures reproducible results. This methodology is also used in our experiments, starting in Section 5.3:
• All used frameworks were set to deterministic mode;
• All parameters were initialised based on one chosen experimental seed;
• To ensure variability, the selected experimental seed was used to generate a seed vector for all runs of the experiment.
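The seeding scheme in the list above might be sketched as follows. The helper names are ours; the PyTorch-specific deterministic-mode calls are indicated only in comments, since they depend on the framework version.

```python
import random

def make_run_seeds(master_seed, n_runs):
    """Derive one seed per run from a single experimental master seed,
    so individual runs differ but the whole experiment is reproducible."""
    rng = random.Random(master_seed)
    return [rng.randrange(2**32) for _ in range(n_runs)]

def set_deterministic(seed):
    """Put the random sources into deterministic mode for one run (sketch)."""
    random.seed(seed)
    # With PyTorch and NumPy the corresponding calls would be, e.g.:
    #   torch.manual_seed(seed)
    #   torch.use_deterministic_algorithms(True)
    #   torch.backends.cudnn.benchmark = False
    #   np.random.seed(seed)
```

Regenerating the seed vector from the same master seed reproduces the exact sequence of runs, which is what makes the per-run variability itself repeatable.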
Based on the recommendations in [48,49], we chose a Bayesian comparison in the form of a Correlated Bayesian t-test [50]. We used the implementation of this test from the Baycomp library [48]. The results of this test are three probabilities: that the compared models are practically equivalent, that model C1 achieves better results than model C2, and vice versa. For all tests, we chose the standard 1% as the Region of Practical Equivalence (ROPE): if the difference in the metrics of the compared models is less than or equal to 1%, we consider the models to be equivalent.

Based on the previous adjustments and evaluations, we decided to change the methodology of the experiments. In each experiment, all frameworks using randomness were set to the deterministic mode. We used stratified 10-fold cross-validation repeated 10 times with different initial seedings (a total of 100 simulations). The same master seed is used for each experiment, so both the selection and the order of the training samples are maintained for all models on the same dataset. The statistical evaluation of the models in the form of the Correlated Bayesian t-test was performed on the validation part of the training set within the cross-validation. However, the numerical evaluation was performed on the excluded test set containing the same 10,000 samples to obtain an objective view of the real accuracy of the model. A possible alternative is to combine the training and test sets into one dataset. We chose not to use this procedure in order to be able to compare the results obtained on a dedicated test set with previous research in this area. The final training and evaluation methodology for ensuring repeatability is shown in Figure 2.
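The Baycomp library provides this test directly; purely for illustration, a self-contained sketch of the posterior computation behind the correlated Bayesian t-test (following Corani and Benavoli's formulation, with the correlation set to the test-set fraction of the cross-validation) could look like the following. The function names are ours.

```python
import math

def t_cdf(x, df, steps=20000):
    """CDF of Student's t distribution via Simpson's-rule integration."""
    c = math.gamma((df + 1) / 2) / (math.gamma(df / 2) * math.sqrt(df * math.pi))
    pdf = lambda t: c * (1 + t * t / df) ** (-(df + 1) / 2)
    if x < 0:
        return 1.0 - t_cdf(-x, df, steps)
    h = x / steps
    s = pdf(0) + pdf(x)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + s * h / 3  # integral over [0, x] added to P(T <= 0)

def correlated_bayesian_ttest(diffs, rho, rope=0.01):
    """Posterior probabilities (P(C2 better), P(rope), P(C1 better)) for the
    mean metric difference between two classifiers compared on the same
    cross-validation folds.

    diffs: per-fold differences (C1 minus C2); rho: correlation, i.e. the
    test fraction of each split (1/k for plain k-fold cross-validation).
    """
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    # correlation-corrected scale of the posterior Student-t distribution
    scale = math.sqrt((1.0 / n + rho / (1.0 - rho)) * var)
    cdf = lambda z: t_cdf((z - mean) / scale, n - 1)
    p_left = cdf(-rope)
    p_rope = cdf(rope) - p_left
    return p_left, p_rope, 1.0 - cdf(rope)
```

In practice, `baycomp.two_on_single` performs this computation (and the repeated-cross-validation correction) for you; the sketch only shows where the three probabilities come from.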

Experiments and Results
In this section, we describe our experiments with RFNN architecture using the MNIST database. In Section 5.1, we describe our procedure for repeating the results from the original work [26]. Next, in Section 5.2, we experiment with training methodology by applying only one change at a time (random or stratified sampling, early stopping, or different optimizer) to analyse its influence on the accuracy. To see the impact of changes in the RFNN architecture, we simplified the reference network. This simplification is specified in Section 5.3. Next, in Section 5.4, we analyse the impact of sample selection and energy normalization on the achieved accuracy. In Section 5.5, we experiment with different bases instead of a Gaussian compact base, and last, in Section 5.6, we remove the pooling layer and provide corresponding learning curves.

Experiment 1-Reference Architecture Evaluation
Receptive field neural networks excel when not enough data are available to use common deep learning models. We used a limited number of training samples in the range from 300 to 60,000 samples, i.e., the whole training set. For each experiment, the entire MNIST test set, consisting of 10,000 samples, was used to evaluate the classification accuracy. Apart from normalizing the data to the range ⟨0, 1⟩, no other form of sample preprocessing was used. The network was trained by a standard backpropagation algorithm using an average cross-entropy loss with a mini-batch size of 25 images. An Adadelta optimizer with decay rate ρ = 0.95, stability constant ε = 1 × 10⁻⁶, and linear learning rate decay was applied. The initial value of the learning rate was set to lr_i = 5 and decreased to the final value of lr_f = 0.05 during the training, while the fixed numbers of epochs were taken from the original article. The values are listed in Table 3 [26]. The resulting classification accuracy of the reference architecture RFNN_ref, representing the average of three experiments, is depicted in Figure 3. The original and our result lines approximately match, but there is an apparent difference in the last point for the 300 training samples. The variability of these results, even if the original methodology is followed, may be due to various factors such as the random number generator (RNG) initial state, hardware parameters, software versions of the used frameworks, and others. These data are missing in the majority of published studies [51], which makes the verification of original results more difficult.
Since we were unable to achieve the original published values for three repetitions, we decided to repeat each experiment 300 times and evaluate the results by displaying the full range ⟨min, max⟩ of the achieved accuracy. The resulting interval is also shown in Figure 3. For reference, we compared the RFNN_ref model with standard machine learning classifiers such as k-nearest neighbors (KNN) and support vector machine (SVM). We used an SVM classifier with a radial basis function (RBF) kernel, a unit regularization parameter, and an automatic kernel coefficient gamma. A KNN classifier with a uniform weight function and 5 neighbors was trained using a brute-force search with the standard Euclidean distance as a metric. The range of achieved values is represented in the following part of the article by a violin plot [52]. In addition to the interquartile range, this type of graph also shows the estimated distribution of the data. In our representation, we marked the median and displayed the entire range of values from min to max.
We can confirm that we have managed to achieve comparable results to those presented by Jacobsen et al. [26]. The RFNN_ref reference architecture achieves a significant improvement in classification accuracy compared to standard machine learning models in our reproduction test on the MNIST database if the number of training samples is limited.

Figure 3. Accuracies of the standard KNN and SVM machine learning classifiers are given as separate curves. The graph also shows the range of results obtained for repeating the experiment N = 300 times for the reference architecture RFNN_ref. For the smallest number of training samples, the estimated distribution of all repeated experimental results shown as a violin plot is also displayed separately (best viewed in colour).

Experiment 2-Reference Architecture and Limited Number of Training Samples
In this section, we focus on the most important contribution of the RFNN architecture, which is the ability to achieve high accuracy with a limited number of training samples. Analysing the results, we observed a high variability of classification accuracy for the smallest number of training samples, which is evident from the violin graph (Figure 3). From these results, it is not possible to determine whether the high variance is due to sampling, the characteristics of the chosen model, or specific training parameters. To analyse the variance and the impact of parameter changes, we tested several adjustments and modifications of the training methodology. Network training and pre-processing were done exactly the same as in the previous section with a random selection of 300 training samples. We repeated the experiment N = 300 times, making only one change in the methodology at a time. The results are shown in Figure 4 and also numerically in Table 4. Note that the leftmost case in Figure 4 is the same as the violin plot (rightmost case) in Figure 3.

Stratified Sampling
First, we replaced random with stratified sampling as a way to eliminate sampling bias and thus a possible source of variance. The resulting violin graph in Figure 4 is denoted as "(orig) + Stratification". We achieved comparable results to those of the reference methodology, and thus, the possible class imbalance did not have a significant impact on the observed variance.

Early Stopping
In the following experiment, in addition to the original methodology, we used early stopping as follows: 10% of the 300 random samples were selected via stratified sampling into the validation set. The remaining 90% of the samples were used for training. During training, the cross-entropy loss on the validation set was evaluated after each epoch. If this loss decreased, we saved the current parameters of the model and continued training. If the validation loss did not improve for p consecutive epochs, we stopped the training and used the best model whose parameters we had saved. We performed two different experiments for the early stopping condition, p = 20 and p = 100, to evaluate the impact of this parameter. When testing the longer stopping condition p = 100, better results were obtained, but they were still worse than those of the original methodology. The results show that a longer stopping condition generally produces better results, which is consistent with the findings of Prechelt [53]. Figure 4 shows that the use of early stopping can increase the variance of the results, especially if the stopping condition is too strict. The deterioration of the results may be affected by the small number of samples in the validation set and by the use of dropout in the architecture. However, their impact on the overall result would need to be examined in more detail. Due to mixed opinions on the use of early stopping and the deteriorating results in our particular case, we decided not to use early stopping in further experiments.
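The early-stopping procedure described above can be sketched framework-agnostically; the function names and callback signatures here are illustrative.

```python
def train_with_early_stopping(train_step, validate, max_epochs, patience):
    """Generic early-stopping loop (sketch): keep the parameters of the
    best-so-far validation loss and stop after `patience` epochs without
    improvement.

    train_step(epoch) -> model state after training this epoch
    validate(state)   -> validation loss for that state
    """
    best_loss, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        state = train_step(epoch)
        loss = validate(state)
        if loss < best_loss:
            # improvement: remember this state and reset the patience counter
            best_loss, best_state, since_best = loss, state, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # stopping condition p reached
    return best_state, best_loss
```

The `patience` argument corresponds to the stopping condition p in the text; a strict (small) value stops training on the first short plateau, which is one way the variance increase described above can arise.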

Optimizer Modification
Choi et al. [54] showed that not only the choice of the optimizer but also the setting of its parameters has a fundamental influence on the model accuracy. However, since tuning the optimizer hyperparameters would require a separate optimization process, and a grid search would be computationally intensive, we decided to test the latest optimizers with the default settings recommended by their respective authors, without using the learning rate decay. We used the Adadelta [55], Adam [56], AdamW [57], AdaBound [58], and Nadam [59] gradient descent optimization algorithms. The results of our tests (Figure 4 and Table 4) show that the optimizers are almost equally successful; only AdaBound and Adadelta lag behind with these settings. In the last comparison, we used the learning rate decay, as in the original experiments, for the most successful optimizer, AdamW, and managed to surpass the original results.

Experiment 3-Simplification of RFNN Architecture
For a valid comparison of the impact of changes in the RFNN architecture, we simplified the reference network as much as possible. Due to the possible mutual influence of parameters in deep architectures, we omitted the normalization and regularization layers and simplified the neural network to a shallow architecture with one computational RFConv layer. The architecture is described in detail in Table 5. Another simplification was the removal of computational layers (local response normalization and dropout) for which we could not ensure repeatability when training on a GPU. We kept the base and number of filters according to the original architecture. For training, we used the same settings as in the previous experiments, except for changing the optimizer to AdamW with a fixed learning rate of 1 × 10⁻³. For reference, we evaluated the simplified 1-layer architecture RFNN_L1 against the original 3-layer architecture RFNN_ref in the form of 10 times repeated 10-fold cross-validation using a Correlated Bayesian t-test for the random selection of 300 samples. Figure 5 shows the achieved accuracies on the test set in the form of a violin graph and the posterior probability distribution of a Correlated Bayesian t-test evaluated on the validation set in each cross-validation split. Compared to the original architecture, the results deteriorated significantly, which was expected. For a ROPE of 1%, the probability that RFNN_ref was better than RFNN_L1 was 99.03%, the probability that both models were practically equivalent was 0.95%, and the probability that RFNN_L1 was better than RFNN_ref was 0.02%.

Figure 5. Comparison of the RFNN_ref and RFNN_L1 neural models on the MNIST database using the proposed methodology for a random selection of 300 training samples. The evaluation of the results is shown as the test classification accuracy represented by a violin graph (a) and the posterior probability distribution between RFNN_ref and RFNN_L1 of a Correlated Bayesian t-test for the validation classification accuracy at 1% ROPE (b).

Experiment 4-Sample Selection and Energy Normalization
In this experiment, we analysed the impact of sample selection and energy normalization on the achieved accuracy of the created model. The settings and methodology of the training were the same as in the previous section. We chose RFNN_L1 as the reference architecture.

Sample Selection
We varied the selection of samples by changing the deterministic seed. We trained again on a random selection of 300 samples and re-evaluated the classification accuracies using the same methodology. A comparison of the results obtained with the seed change is shown in Figure 6.
The results differ due to a different selection of samples for the training set despite the same RFNN_L1 architecture. The probability that the model RFNN_L1 (seed#1) was better than RFNN_L1 (seed#2) was 39.87%, the probability that RFNN_L1 (seed#2) was better than RFNN_L1 (seed#1) was 31.69%, and the probability that both models were practically equivalent was 28.44% = 100% − (39.87% + 31.69%).

Figure 6. Comparison of RFNN_L1 neural models trained on different samples based on two deterministic seeds on the MNIST database using the proposed methodology. The evaluation of the results is shown as the test classification accuracy represented by a violin graph (a) and the posterior probability distribution between RFNN_L1 (seed#1) and RFNN_L1 (seed#2) of a Correlated Bayesian t-test for the validation classification accuracy at 1% ROPE (b).

Basis Energy Normalization
Normalization in neural network architectures can help to improve both convergence and generalization [60]. Different types of preprocessing and normalization are introduced to eliminate sharp differences between network parameters or extracted feature maps [61]. This effect can be even more pronounced when using fixed filters. For this reason, we evaluated the impact of energy normalization on the achieved accuracy. By energy normalization, we mean normalization to unit energy, where the sum of all squared elements of the k-th filter θ_k ∈ ℝ^(W_K×H_K) is equal to one (Equation (5)):

θ̂_k(m, n) = θ_k(m, n) / √( Σ_{m=0}^{H_K−1} Σ_{n=0}^{W_K−1} θ_k(m, n)² ), (5)
where m ∈ {0, 1, 2, . . . , H_K − 1} and n ∈ {0, 1, 2, . . . , W_K − 1} are the corresponding vertical and horizontal indices. The achieved accuracies on the test set in the form of a violin graph and the posterior probability distribution of a Correlated Bayesian t-test evaluated on the validation set in each cross-validation split are shown in Figure 7 for the Gaussian derivative basis and in Figure 8 for randomly initialized kernels.

Figure 7. Comparison of RFNN_L1 neural models with the Gaussian derivative basis with and without energy normalization according to Equation (5) on the MNIST database using the proposed methodology for a random selection of 300 training samples. The evaluation of the results is shown as the test classification accuracy represented by a violin graph (a) and the posterior probability distribution between RFNN_L1 (Gaussian) and the same model with the energy-normalized base RFNN_L1 (Gaussian normalized) of a Correlated Bayesian t-test for the validation classification accuracy at 1% ROPE (b).
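The unit-energy normalization of Equation (5) amounts to dividing each kernel by the square root of the sum of its squared elements, as in this NumPy sketch (the `eps` guard against an all-zero kernel is our addition):

```python
import numpy as np

def energy_normalize(basis, eps=1e-12):
    """Normalize every filter in the bank to unit energy (Equation (5)).

    basis: array of shape (N_K, H_K, W_K); each kernel is divided by the
    square root of the sum of its squared elements.
    """
    energy = np.sqrt((basis ** 2).sum(axis=(1, 2), keepdims=True))
    return basis / np.maximum(energy, eps)
```

After this step, every filter contributes the same amount of energy to the depthwise feature maps, regardless of how it was generated.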

The results of both experiments show that base energy normalization helped to improve classification accuracy. When using the original Gaussian derivative base, we found that RFNN L1 (Gaussian) was better than RFNN L1 (Gaussian normalized) with a probability of 6.75%, the probability that both models were practically equivalent was 55.06%, and the probability that RFNN L1 (Gaussian normalized) was better than RFNN L1 (Gaussian) was 38.19%. In the case of using a random base, the probability that RFNN L1 (Random) was better than RFNN L1 (Random normalized) was 21.15%, the probability that both models were practically equivalent was 49.21%, and the probability that RFNN L1 (Random normalized) was better than RFNN L1 (Random) was 29.64%.

Figure 8. Comparison of RFNN L1 neural models with randomly initialized kernels with normal distribution, with and without energy normalization according to Equation (5), on the MNIST database using the proposed methodology for random selection of 300 training samples. The evaluation of the results is shown for the test classification accuracy represented by a violin graph (a) and the posterior probability distribution between RFNN L1 (Random) and the same model with the energy-normalized base RFNN L1 (Random normalized) of a Correlated Bayesian t-test for the validation classification accuracy at 1% ROPE (b).

Experiment 5: Basis-Related Experiments
In the original article [26], the Gaussian compact base was used for feature extraction. The aim of our experiment was to verify whether we can replace this base and what effect such a change has on the achieved classification accuracy. We again chose RFNN L1 as the reference architecture, while the parameters and evaluation methodology remained unchanged. In each experiment, we kept all parameters constant except for the base used, which had a size of 11 × 11 pixels. Since the reference Gaussian base contained 10 filters, we selected these filters along the primary diagonal based on a triangular selection according to Ulicny et al. [34]. We compared the reference Gaussian derivative base (Gaussian), the orthonormal Discrete Cosine base (DCTII), the orthonormal Discrete Hartley base (DHT), randomly initialized kernels with normal distribution (RND), and a random orthonormal base (ORTHN_RND). The results for the selection of 300 samples are shown in Figure 9.
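The triangular selection of Gaussian derivative filters (all derivative orders i + j ≤ 3, giving 10 kernels) can be sketched as follows. The scale σ = 1.5 and the use of SciPy's `gaussian_filter` applied to a unit impulse are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_derivative_basis(size=11, sigma=1.5, max_order=3):
    """Build the triangular set of 2-D Gaussian derivative kernels:
    every filter with derivative orders (i, j) such that i + j <= max_order,
    which yields 10 kernels for max_order = 3."""
    impulse = np.zeros((size, size))
    impulse[size // 2, size // 2] = 1.0  # unit impulse at the centre
    kernels = [gaussian_filter(impulse, sigma, order=(i, total - i))
               for total in range(max_order + 1)
               for i in range(total + 1)]
    return np.stack(kernels)  # shape (10, size, size) for max_order = 3
```

The same triangular indexing can be reused to pick 10 filters along the primary diagonal of other 11 × 11 bases such as DCT-II or DHT.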
We evaluated the neural models using a Correlated Bayesian t-test with respect to the simplified reference architecture RFNN L1 with a Gaussian base, for each model change separately (Figure 10). The results showed that for all tested bases, the hypothesis that both compared models are equivalent has the highest probability, with the one exception being the random base, which worsened the results and had the greatest impact on the achieved classification accuracy. The probability that the RFNN L1 model with a Gaussian base is better than the model with random kernels is about 47.1%, which far exceeds the other results. Compared to the other bases tested, we obtained the best results using the discrete Hartley and discrete cosine bases. In both cases, the statistical test found the models practically equivalent, with probabilities of 51.06% and 44.77%, respectively.

Experiment 6: Learning Curve Analysis and Max-Pooling Removal
In the previous experiment in Section 5.5, we analysed the influence of the type of fixed basis used. As we can see in Figures 9 and 10, the random kernels also performed relatively well, which could be due to the simplicity of the MNIST database used or to excessive downsampling in the max-pooling layer. Based on the obtained results, we decided to explore the influence of the max-pooling layer on the classification accuracy. We compared the network performance with and without the max-pooling layer on the more complex Kuzushiji-MNIST [62] database. The RFNN L1 model was trained for a fixed number of 50 epochs using the AdamW optimizer with default settings while following the same methodology and preprocessing as in the previous experiments. We show the corresponding representative results in the form of learning curves in Figure 11. It can be seen from the results that the difference between randomly initialized filters and structured bases widens significantly when max-pooling is omitted.

Figure 11. Learning curves of RFNN L1 neural models with and without max-pooling (WMP) layer, displayed for the randomly initialized base, DCT, and Gaussian derivative basis on the Kuzushiji-MNIST database using the proposed methodology for random selection of 300 training samples (a) and 1000 training samples (b), respectively.
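For illustration, the downsampling performed by the max-pooling layer removed in this experiment amounts to a non-overlapping window maximum. The following NumPy sketch (names are ours, not the paper's implementation) shows the operation on a single feature map:

```python
import numpy as np

def maxpool2d(x, k=2):
    """Non-overlapping k x k max-pooling over the last two axes,
    truncating any remainder rows/columns (valid-style pooling)."""
    h = x.shape[-2] // k * k
    w = x.shape[-1] // k * k
    x = x[..., :h, :w]
    # Split each spatial axis into (blocks, k) and take the block maximum.
    blocks = x.reshape(*x.shape[:-2], h // k, k, w // k, k)
    return blocks.max(axis=(-3, -1))
```

Removing this layer keeps the full spatial resolution of the extracted feature maps, which is what exposes the gap between structured and random bases in Figure 11.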

Discussion
The outcomes of this paper have provided insight into the Receptive Field Neural Network architecture, which is specific in that it uses a linear combination of predefined fixed kernels to create a set of effective filters instead of learning entire convolutional kernels pixel-by-pixel. The results in Section 5.1 confirm the previously published results on the MNIST database (Figure 3) by Jacobsen et al. [26] and demonstrate the possibility of replacing the original Gaussian derivative base with other structured bases intended for the extraction of intrinsic features within the shallow RFNN architecture. Since the RFNN excels compared to other architectures when training with a limited amount of data, the experiments focused on the random selection of N = 300 training samples only, while maintaining repeatability of results by introducing a deterministic evaluation methodology (Figure 2). The methodology change was introduced to ensure the reliability of findings with respect to the high variability of the achieved classification accuracy on the test set (Sections 5.1 and 5.2), which we could not reduce by introducing sample stratification, using early stopping during training, or changing the optimizer (Figure 4 and Table 4), as shown in Section 5.2. We modified the training methodology to ensure repeatability and evaluated the proposed step-by-step changes in the form of a Correlated Bayesian t-test using 10 times repeated 10-fold cross-validation. By switching to this evaluation methodology, we ensured a reproducible environment for a reliable comparison of results, where not only the selection but also the order of the samples during training remained the same. To ensure repeatability within the neural network, we simplified the original RFNN ref (Table 2) architecture to a single-layer network RFNN L1 (Table 5), removing the normalization and regularization layers for which we could not guarantee determinism within the used framework.
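The linear-combination principle described above, where only the mixing weights are learned while the basis kernels stay fixed, can be sketched in a few lines (function name and shapes are our illustrative assumptions):

```python
import numpy as np

def rfnn_effective_filters(basis, alphas):
    """Form effective RFNN filters as linear combinations of fixed
    basis kernels: only the weights `alphas` are trained, never the
    kernels themselves.

    basis  : (K, H, W) fixed kernels, e.g. Gaussian derivatives.
    alphas : (C, K) learned combination weights for C effective filters.
    Returns the (C, H, W) bank of effective convolution filters.
    """
    # Contract the K axis: each output filter is sum_k alphas[c, k] * basis[k].
    return np.tensordot(alphas, basis, axes=([1], [0]))
```

Convolving the input with the K basis kernels first and mixing the resulting channels pointwise afterwards is mathematically equivalent and is the decomposition referred to later in the discussion.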
At the cost of a significant deterioration of RFNN L1 classification accuracy compared to the baseline (Section 5.3, Figure 5), we obtained reproducible conditions for further experiments (Sections 5.4 and 5.5). When evaluating, in the form of a Correlated Bayesian t-test, the influence of the seed of the pseudorandom number generator, on which both the sample selection and the parameter initialization within the RFNN L1 network depend, we found that the probability that the same model with different initial seedings and training sets is practically equivalent was only 28.44% (Figure 6b). This probability was the lowest among all experiments, meaning that this change had the greatest impact on the achieved classification accuracy. The results are in line with the findings of Crane et al. [51], who pointed out that unreported effects in evaluation methodology can substantially influence the achieved results. The data also show the positive impact of energy normalization (Section 5.4), which improved the classification accuracy not only for the Gaussian derivative base (Figure 7) but also for randomly initialized fixed filters (Figure 8). The next experiment was a comparison of specifically structured bases within the RFNN L1 architecture (Section 5.5). Although we managed to outperform the original Gaussian derivative basis using the DCT and DHT bases (Figure 9), the more important result is that the Correlated Bayesian t-test (Figure 10) did not show a significant difference between individual bases in terms of achieved classification accuracy, with the exception of randomly initialized filters (Table 6). In the last experiment (Section 5.6, Figure 11), we tested the impact of the max-pooling layer. The results show that omitting max-pooling increased the difference between structured bases and randomly initialized kernels.
Further research could be devoted to initialization with structured filters based on various transforms, which could, in theory, improve convergence, as shown recently by Li et al. [63]. Since the feature extraction in our reference RFNN architecture is divided into filtering with a set of filters along the spatial dimensions and a subsequent combination of channels via pointwise convolution, the representation of the extracted features, and their visual display, can be easier to interpret than in classical convolutional layers. Therefore, it would be interesting to examine the visual properties of these architectures with respect to the extraction of intrinsic features, for better understanding and for deployment in areas with a lack of training data, in which they excel.

Conclusions, Limitations, and Future Research Work
Although these results confirm the added value of structured filters within the RFNN architecture, the differences between the specific bases chosen in the context of neural networks need further investigation, mainly with respect to the various constraints introduced. Deployment of complex neural networks in critical applications and on devices with limited computing power is the main reason for the simplification and deeper analysis of the created models. In this work, we analysed the Receptive Field Neural Network (RFNN) as a promising simplified model that uses a Gaussian derivative basis inspired by scale-space theory. When validating the original results, we identified a problem with a large variation in the achieved test classification accuracy, especially with a small amount of training data. To enable the reproducibility of the results by introducing a deterministic methodology, we simplified the baseline RFNN architecture to a single-layer CNN network whose computational layers take as input a set of fixed filters of arbitrary size and number. Subsequently, we experimentally verified that bases other than Gaussian can be used as fixed filters within the RFNN architecture. We found that a change in the base may have less of an effect on the obtained results than re-training the network with another seed when repeatability of the results is not ensured. We also verified the positive impact of energy normalization of the used filters, which improves the achieved classification accuracy even when using randomly initialized kernels. Further research is needed to establish the specific differences between individual bases and the influence of various hyperparameters within the RFNN architecture.