A New Pooling Approach Based on Zeckendorf’s Theorem for Texture Transfer Information

The pooling layer is at the heart of every convolutional neural network (CNN), contributing to invariance to data variation. This paper proposes a pooling method based on Zeckendorf's number series. The maximum pooling layers are replaced with Z pooling layers, which capture texels from input images, convolution layers, etc. It is shown that the properties of Z pooling are better adapted to segmentation tasks than those of other pooling functions. The method was evaluated on a traditional image segmentation task and on a dense labeling task carried out with a series of deep learning architectures in which the usual maximum pooling layers were altered to use the proposed pooling mechanism. Not only does it arbitrarily increase the receptive field in a parameterless fashion, but it can also better tolerate rotations, since the pooling layers are independent of the geometric arrangement or sizes of the image regions. Different combinations of pooling operations produce images capable of emphasizing low/high frequencies, extracting ultrametric contours, etc.


Introduction
Deep neural networks (DNN) have revolutionized classical tasks of image analysis, in which they have accomplished outstanding results and continue to do so [1][2][3]. By modifying the architectures and introducing various techniques (often greedy), considerable improvements have been achieved.
Convolutional Neural Network (CNN) architectures are enhanced through multiresolution (pyramidal) structures, which stem from the idea that the network needs to see different levels of detail to produce good results. A CNN stacks three different types of processing layers: convolution, pooling, and fully connected (dense) layers [4].
The pooling layer receives multiple feature maps from convolutional layers and applies the pooling function to each of them. The pooling layer (a) reduces the number of parameters in the model (subsampling) and the number of calculations in the network while preserving their important characteristics, and (b) improves the efficiency of the network and prevents overfitting [4]. To do this, the maximum pooling function downsamples the input representation by reducing its dimensionality: the image is split into regular cells without overlapping, then the maximum value is kept within each cell. Thus, the pooling layer makes the network less sensitive to the position of features: the fact that a feature value is a little higher or lower, or even that it has a slightly different orientation, should not lead to a drastic change in the image classification.
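As an illustration of this cell-wise downsampling, the non-overlapping maximum pooling described above can be sketched in a few lines of NumPy (the function name `max_pool2d` is ours, not from the paper's implementation):

```python
import numpy as np

def max_pool2d(image, cell=2):
    """Non-overlapping max pooling: split the image into cell x cell
    blocks and keep only the maximum value within each block."""
    h, w = image.shape
    h, w = h - h % cell, w - w % cell  # crop to a multiple of the cell size
    blocks = image[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [1, 1, 4, 4]])
print(max_pool2d(x))  # [[4 8] [9 4]]
```

Shifting a feature by one pixel inside a cell leaves the pooled output unchanged, which is the positional insensitivity discussed above.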
The weaknesses of pooling functions are well identified [5]: (a) they do not preserve all the spatial information, since they reduce the spatial resolution, (b) the discrete maximum chosen by maximum pooling on the pixel grid may not be the true maximum, and (c) average pooling assumes a single mode with a single centroid. Hence, the question is how to take into account, in an optimal way, the characteristics of the input image being grouped in the pooling operation [6]. Part of the answer lies in Lazebnik's work, which demonstrated the importance of the spatial structure of pooling neighborhoods [7]. These local spatial variations of image pixel intensities (called textures in classical image processing) characterize an "organized area phenomenon" [8], which cannot be captured by standard pooling layers.
This paper proposes a new pooling operation, independent of the geometric arrangement or sizes of image regions, which can therefore better tolerate rotations. The operation is based on Zeckendorf's theorem for the decomposition of integers and is simple to implement. Zeckendorf's theorem is mainly used in cryptography [9], e.g., to design small microcontrollers that can resist certain fault attacks.
The rest of the article is organized as follows: Section 2 presents related works on pooling strategies. The Zeckendorf additive partition is presented in Section 3 and its implementation is explained in Section 4. Numerical experiments and results are presented in Section 5. Finally, the experimental work is discussed and future works are mentioned in Section 6.

Related Works
Throughout this paper, small Latin letters a, b, . . . represent integers, small bold letters a, b vectors, and capital letters A, B matrices or tensors, depending on the context. Brackets '{. . . }' indicate sets of values, and | · | denotes the cardinality operator.

Pooling Strategies in Image Processing
Convolutions in CNNs are discrete convolutions of an image V with a kernel K. Without loss of generality, an input image V in a high-dimensional space can be flattened into a vector v. Let us define N(i) as the set of all indices of elements in v that are neighbors of v_i in the neighborhood defined by the convolution kernel K:

N(i) = {j ∈ N | v_j ∈ neighborhood of v_i given by K}.

As the structure of the neighborhood is fixed, we assume that N(i, j) ∈ {1, 2, . . . , |N(i)|} is the index of j in N(i). The discrete convolution can then be defined as

(v ∗ k)_i = Σ_{j ∈ N(i)} k_{N(i,j)} v_j,

where the k_l are the weights of the convolution kernel K.
The exponential growth of the number of parameters makes convolutions with large kernel sizes computationally expensive. Therefore, most CNN architectures keep the kernel size at 3 × 3 or 5 × 5. However, how does one make a sensible prediction for an entire image if a single convolution "sees" only a 3 × 3 neighborhood? The solution is the stacking of convolutional layers: with two 3 × 3 layers following each other, the last one can "see" a 5 × 5 neighborhood. This means many convolutions must be stacked to obtain a receptive field as large as a reasonable input image. The increase in receptive field per convolution can be considerably higher when the image is downsampled to a lower resolution between two convolution operations. Various methods exist for resampling a given feature layer at multiple rates prior to convolution, such as dilated convolution, which "inflates" the kernel by inserting holes between the kernel elements [10], or atrous convolution [11].
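The growth of the receptive field under stacking and striding can be checked with the standard recurrence r ← r + (k − 1) · jump, where jump is the product of the strides of all previous layers (a small illustrative helper, not part of the paper's code):

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel, stride) pairs.
    Each layer widens the field by (kernel - 1) times the product of all
    previous strides."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Two stacked 3x3 convolutions: the second layer "sees" a 5x5 neighborhood.
print(receptive_field([(3, 1), (3, 1)]))          # 5
# A stride-2 downsampling between them grows the field much faster.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

This makes concrete why interleaving downsampling with convolutions is the usual way to cover an entire input image.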
Maximum pooling is a popular choice for this downsampling operation. Pooling operations have been little revised beyond the mainstream maximum, average, and stochastic pooling options, despite indications that choosing among multiple pooling functions can improve performance [12].
Sharma et al. analyzed and discussed qualitatively the performance of pooling strategies on different datasets [13]. Lee et al. [6] experimentally demonstrated that pooling operations combining maximum and average pooling provide an increase in invariance properties over conventional pooling; they also proposed to combine pooling filters that are themselves learned. In [14], Gulcehre et al. investigated a novel nonlinear unit, called the Lp unit, that generalizes a number of conventional pooling operators such as mean, root mean square, and maximum pooling.
Agostinelli et al. learned activation functions to improve DNNs in [15]. Boureau et al. analyzed theoretically why max pooling works well in a wide variety of contexts, even if similar or different factors come into play in each case [16].
Many researchers are working on the development of advanced pooling mechanisms to effectively exploit these essential features of pooling [13], in particular on how to bring the characteristics of the region being pooled into the pooling operation itself [6].

Pooling and Statistics
In Statistics, "pooling" describes the practice of bringing together small datasets that are assumed to have the same value of a characteristic, e.g., a mean, and using the larger combined set (the "pool") to get a more precise estimate of this feature. Poolability can be formulated on the basis of the concept of statistical equivalence. Sheskin compiled in [17] a bibliography dealing with pooling procedures, for example to combine several independent tests of the same hypothesis.
The goal of pooling is to transform the convolutional characteristics into a new representation that preserves important information while ignoring irrelevant details. For instance, if a t-test between the two within-group slopes is not "passed", these characteristics cannot be grouped [18].
In some ways, many other ensemble techniques, where a set of weak learners is combined to create a stronger learner, are very close to this notion of pooling [19].
So, should we pool or not? Or, putting it a little differently, when should we pool and when should we not? The answer depends on the training context. Moorthy et al. [20] proposed to weight image quality measures by visual importance to improve the correlation with subjective judgment. Achieving invariance to changes in position or lighting conditions, robustness to size, and compactness of representation are all common goals of pooling. We demonstrate experimentally here that these properties are successfully achieved with the Z pooling operator, based on Zeckendorf's theorem.
Experimental validation is continued in Section 5 on predefined architectures and obtained by replacing the standard pooling operations with Z pooling.

Texture Coding
Most image descriptors that encode local structures, e.g., local binary patterns (LBP) and its variants [21,22], depend on (a) the size of the neighborhood, (b) the reading order of the neighbors, and (c) the mathematical function used to compute the feature distance between neighboring pixels. The new pixel value L_R(P) in the image is an integer in the range 0 to 255 (for an 8-bit encoding) given by

L_R(P) = Σ_{p=0}^{P−1} t(g_p − g_c) 2^p,  with t(x) = 1 if x ≥ 0 and t(x) = 0 otherwise,  (3)

where P is the number of pixels in the neighborhood at distance R between the central pixel g_c and the neighboring pixels {g_p | p = 0, . . . , P − 1}. LBP-like texture descriptors have spread to almost all fields of computer vision because of their robustness to monotonic gray-scale changes, illumination invariance, and computational simplicity. Invariance w.r.t. any monotonic transformation of the gray scale is achieved by considering in Equation (3) only the signs of the differences t(g_i − g_c), i = 0, . . . , P − 1. The local texture can then be represented as a joint distribution of the values of the differences around the center pixel g_c, assuming the independence of g_c with respect to the differences (g_i − g_c), i = 0, . . . , P − 1. However, under certain circumstances, such as very low or high values of g_c, the range of possible differences shrinks, and LBP can miss the local structure since it does not consider the central pixel itself. To reduce noise sensitivity, mostly in uniform regions, a three-level operator has been proposed by Tan and Triggs [23], which describes a pixel relationship with its neighbors by a ternary encoding, i.e., −1, 0, 1, rather than a binary code, i.e., 0, 1. The size of this code is reduced by splitting it into two LBP (positive and negative) codes, which results in two 8-bit strings, thus needing a 16-bit space for representation.
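The basic LBP code of Equation (3) can be sketched for a single 3 × 3 patch as follows (a minimal illustration; the reading order of the neighbors is one of several possible conventions):

```python
import numpy as np

def lbp_code(patch):
    """Basic 8-neighbor LBP on a 3x3 patch: threshold each neighbor
    against the central pixel and read the resulting signs as an
    8-bit code, t(x) = 1 if x >= 0 else 0."""
    gc = patch[1, 1]
    # one possible clockwise reading order, starting at the top-left neighbor
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum(int(g >= gc) << p for p, g in enumerate(neighbors))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))  # an integer in [0, 255]
```

Adding a constant to the whole patch leaves the code unchanged, which demonstrates the invariance to monotonic gray-scale shifts mentioned above.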
In the next section, an algorithm is proposed for generating Z images, which could be utilized in contour detection or image segmentation.

Zeckendorf Additive Partition
In this section, an algorithm is proposed for the so-called Z pooling. In [24], the Belgian mathematician Édouard Zeckendorf states that any integer N may be uniquely represented as a sum of distinct Fibonacci numbers such that the sum does not include any two consecutive Fibonacci numbers. The Fibonacci series 1, 1, 2, 3, 5, 8, . . . is a sequence of numbers f(n) such that each term is the sum of the two previous values:

f(n) = f(n − 1) + f(n − 2),  with initial conditions f(1) = f(2) = 1.  (4)

This is a second-order linear constant-coefficient difference equation, whose closed-form solution may be found using z-transforms.

Theorem 1 (Zeckendorf's additive theorem). Any positive integer N can be expressed as a sum

N = Σ_{i=2}^{k} σ_i f(i),  σ_i ∈ {0, 1},  (6)

such that σ_i σ_{i+1} = 0, i = 1, 2, . . . .

Proof. For any positive integer N, there is always a positive integer m such that f(m) ≤ N < f(m + 1).
If N − f(m) is positive, there exists a positive integer p such that f(p) ≤ N − f(m) < f(p + 1), and the process continues. Ultimately, we must reach the point where the remainder equals a Fibonacci number, say f(t), and thereby obtain the desired representation N = f(m) + f(p) + · · · + f(t). The Zeckendorf partition is complete and canonical, i.e., every positive integer is the sum of distinct elements of the Fibonacci series, and the binary sequence σ_k, σ_{k−1}, . . . , σ_3, σ_2 with σ_i ∈ {0, 1} in Equation (6) is unique. From this additive property of integers, a new image encoding is proposed (see Algorithm 1), which encodes the local dependencies of pixels by combining a pooling operation and an integration operation, both chosen from supremum (max), infimum (min), summation, intersection (∩) or set difference (\) [25]. A texel is a texture element or texture pixel.
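The proof above is constructive: it is exactly the greedy algorithm that repeatedly subtracts the largest Fibonacci number not exceeding the remainder. A minimal sketch (the function name `zeckendorf` is ours):

```python
def zeckendorf(n):
    """Greedy Zeckendorf decomposition of a positive integer n:
    repeatedly subtract the largest Fibonacci number not exceeding
    the remainder. The parts are distinct Fibonacci numbers and no
    two of them are consecutive in the Fibonacci series."""
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts

print(zeckendorf(100))  # [89, 8, 3]
```

The greedy choice f(m) ≤ N < f(m + 1) guarantees that f(m − 1) cannot also appear in the decomposition, which is why no two parts are consecutive Fibonacci numbers.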
The way these operators are combined results in images that could be directly used in the computer vision pipeline for object segmentation or contour extraction. The result of applying various arithmetic operations after the intersection leads to different types of image variations.
Four of these variations on Lenna's image are shown in Figure 2. Each produces a characteristic contour response, which we explore below.
The first, Figure 2a, is produced by applying the supremum operation followed by another supremum. The edges are quite smooth and many edges are missed due to the maximum operation, which leaves smaller values in the intersection, resulting in fewer or no edges. Figure 2b is constructed by applying the supremum operation followed by an infimum. As expected, the max operator at the initial stage produces the set of relatively larger values, leaving out the small Fibonacci numbers. The minimum operator at the end slightly counteracts the maximum effect by selecting the minima for the central pixel. Figure 2c could be considered the complete opposite of the second: all the minimum values are first extracted using the infimum operator, then the supremum of the set is taken. It is intuitive to think of it as the dual of the second image. Figure 2d is produced by applying a summation operator, which is then followed by the minimum operation.
The difference between the fourth and the second images is that the values are out of range for some pixels, due to the intensity range saturating the summation operator.

Algorithm 1: Image Z coding.

In Algorithm 1, e.g., for w = 3, the list of neighbor pixels surrounding the central pixel contains w² − 1 = 8 values.
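Since the listing of Algorithm 1 is not reproduced above, the following is a hypothetical sketch of the Z coding of one 3 × 3 neighborhood, consistent with the description: the Zeckendorf partition of each neighbor is intersected with that of the central pixel, a pooling operation is applied inside each intersection, and an integration operation combines the pooled values (the names `zeckendorf` and `z_code` are ours):

```python
import numpy as np

def zeckendorf(n):
    """Greedy Zeckendorf decomposition of a positive integer n."""
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:
            parts.append(f)
            n -= f
    return parts

def z_code(patch, pool=max, integrate=max):
    """Sketch of a Z coding step on a 3x3 patch: intersect the
    Zeckendorf partition of each neighbor with that of the central
    pixel, pool inside each non-empty intersection, then integrate
    the pooled values over the whole neighborhood."""
    center = set(zeckendorf(int(patch[1, 1])))
    pooled = []
    for (i, j), g in np.ndenumerate(patch):
        if (i, j) == (1, 1):
            continue
        common = center & set(zeckendorf(int(g)))
        if common:  # neighbors sharing no Fibonacci part are skipped
            pooled.append(pool(common))
    return integrate(pooled) if pooled else 0

patch = np.array([[13, 8, 21],
                  [5, 100, 3],
                  [2, 34, 55]])
print(z_code(patch))                          # sup(sup), cf. Figure 2a
print(z_code(patch, pool=max, integrate=min))  # sup(inf), cf. Figure 2b
```

Swapping `pool` and `integrate` among max, min, and sum yields the four variations discussed around Figure 2.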

Evaluation and Result for Segmentation
Local image descriptors perform well on various computer vision tasks such as image retrieval [26], action recognition [27,28], object detection and recognition [29] etc. We discuss the Zeckendorf representation as a local image descriptor for two of these tasks.
Algorithm 1 results in ultrametric contours or segmented images based on the association of the aforementioned operations.
The union operator was not included in this work because performance in computer vision generally derives directly from the ability of image descriptors to be discriminative, and this is achieved by the intersection or set difference operators. Table 1 reports the performances of the top 10 algorithms and the Zeckendorf segmentation on the 500 test images of BSD500 [30], combining the set difference and max(max) operators. Segmented images obtained after region merging were also compared with the human-annotated images using the benchmark code available at Berkeley's website in Table 2 [30]. We evaluated the quality of the extracted boundaries using Precision and Recall measures. Here, the Precision P is the probability that an extracted borderline pixel is a true borderline pixel, and the Recall (sensitivity) R is the probability that the real borderline pixels are correctly extracted:

P = TP / (TP + FP),  R = TP / (TP + FN),

where TP, FP, and FN are resp. the true positives, the false positives, and the false negatives.
Precision is how sure one is of the true positives, whilst Recall is how sure one is about not missing any positives. Due to the trade-off between the two measures, we calculated an F-score from Equation (11) to compare the results obtained after regions are merged:

F = PR / (αR + (1 − α)P),  (11)

with α an adjustable parameter, selected here as 0.5 to compare our results with those available for other algorithms. The F-measure of Z coding is 0.6652, with an average Recall of 0.833 (the highest), indicating that edge pixels are rarely misclassified. This F-score could be further improved by refining certain factors such as post-segmentation region-fusion procedures. Figure 4 illustrates the computation of the performance metric of the segmentation process on the "horse" image from BSD500.
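The boundary metrics above can be computed directly from the pixel counts; with α = 0.5, Equation (11) reduces to the usual harmonic mean 2PR / (P + R) (helper name `f_measure` is ours):

```python
def f_measure(tp, fp, fn, alpha=0.5):
    """F-measure from boundary-pixel counts, in the weighted form
    F = P*R / (alpha*R + (1 - alpha)*P) of Equation (11)."""
    p = tp / (tp + fp)  # precision: extracted pixels that are true boundary
    r = tp / (tp + fn)  # recall: true boundary pixels that were extracted
    return p * r / (alpha * r + (1 - alpha) * p)

# With alpha = 0.5 this is the harmonic mean 2PR / (P + R).
print(round(f_measure(80, 20, 10), 4))
```

A high Recall with moderate Precision, as reported for Z coding, indicates few missed boundary pixels at the cost of some false detections.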

Z Pooling
Let h be the input volume (or image) with axis sizes n_k and g the convolution kernel with axis sizes m_k. In a CNN, the channel or feature axis plays a special role. By convention, g is the identity along the feature axis and the output of the convolution can have multiple features. The number of input features is n_N, given by h. The number of output features is n_{N+1}, a parameter of the convolution operation. This is achieved by packing multiple convolutions into one operation, one for each output feature. The set of valid indices for the input volume h is defined as A = {(i_1, i_2, . . . , i_N) | i_k ∈ [1, . . . , n_k], k ∈ [1, . . . , N]}. As there are multiple output features, the kernel g gets an additional axis and thus the indices of g are in the set B = {(j_1, j_2, . . . , j_{N+1}) | j_k ∈ [1, . . . , m_k], k ∈ [1, . . . , N + 1], m_N = n_N, m_{N+1} = n_{N+1}}. The resulting volume of the convolution operation has the same axis sizes as h except for the feature axis. The convolution with zero padding, written in terms of volumes, becomes

(h ∗ g)_i = Σ_{j ∈ B} g_j h_{i + j − ⌈m/2⌉},  i ∈ A,  (16)

where ∗ is the convolution operation, ⌈·⌉ the ceiling function, and entries of h with indices outside A are taken as zero (zero padding). In many publications, the kernel size is split into an image part and a feature part. The maximum pooling operation is defined as

maxpool(h)_x = max_{y ∈ B} h_{x+y},  (17)

where x and y are indices for the volume h and B can be seen simply as a selection mask. Note that this operation looks for the maximum in a neighborhood defined by B along the image axes. Unlike the convolution, the channels are not mixed in this operation. Often the maximum pooling operation is used for downsampling the volume by restricting x. This restriction is called striding with stride s ∈ N, and A is restricted to every s-th index along the image axes. The strided max pooling operation is then

maxpool_s(h)_x = max_{y ∈ B} h_{sx + y}.  (18)

The strided max pooling reduces the size of the input image by only considering every s-th entry along all image axes and discarding all others.
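The strided max pooling of Equation (18), with a window B and a stride s that may differ from the window size, can be sketched for a single channel as follows (a naive loop implementation for clarity, not the one used in the experiments):

```python
import numpy as np

def strided_max_pool(h, window=3, stride=2):
    """Max pooling over a window x window mask B, keeping only every
    stride-th output position along both image axes; channels (absent
    here) would be left unmixed."""
    n1, n2 = h.shape
    out = []
    for x1 in range(0, n1 - window + 1, stride):
        row = [h[x1:x1 + window, x2:x2 + window].max()
               for x2 in range(0, n2 - window + 1, stride)]
        out.append(row)
    return np.array(out)

h = np.arange(25).reshape(5, 5)
print(strided_max_pool(h))  # [[12 14] [22 24]]
```

With window = stride this reduces to the non-overlapping pooling shown earlier; overlapping windows (window > stride) trade more computation for less aggressive information loss.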
The concept of strides can be used for convolution operations as well, and fractional strides can even be used for upsampling [32].
Z pooling can easily replace maximum pooling in a CNN in Equation (17), the maximum being replaced by the Z coding of Algorithm 1, where the combining operation is the intersection or the set difference, and B is the mask (neighborhood) in which x is selected. Note that Z pooling, like max pooling, is an operator without parameters. With respect to fully connected neural networks, CNNs are translation invariant. The translation invariance comes from the fact that the convolution kernel W is the same for every possible position of the input, so once the network learns to recognize an object at one position in the image, it automatically recognizes it at any position. However, the use of convolutions comes at a cost: the number of parameters grows with the input and output sizes. Different pooling operations were carried out in a categorization context to compare their behavior.
The most relevant question at this stage is: are pooling layers more efficient when they pool texels or when they pool pixels? Experiments proposed in the following section give some answers.

Implementation
The experiments using the aforementioned algorithms were implemented in Python 3.7 using the TensorFlow and Keras frameworks, except for the cascaded network, for which the authors provide an implementation based on NiftyNet [33]. Computations were completed on a Tesla V100 GPU with a host CPU @ 3.60 GHz and 64 GB of RAM. This study focuses on a magnetic resonance imaging (MRI) dataset of brain tumors acquired from the multimodal Brain Tumor Segmentation (BraTS) challenge [34].

Miccai BraTS Dataset
Segmentation of brain tumors from multiple modalities can produce a prediction that facilitates surgical planning, postoperative analysis and radiotherapy [35].
Brain tumors require early detection and sometimes prolonged treatment. They can be benign or malignant; malignant tumors have a faster growth rate, whereas benign tumors are slower growing and include the low-grade variants (grades 1-2). Lower-grade gliomas (LGG) have a higher life expectancy and do not require immediate treatment. Both cases still require neuroimaging prior to, during, and after treatment. Medical imaging helps to assess tumor progression, surgical planning, and overall treatment [34]. Glioblastoma (GBM) is a very aggressive grade-4 brain tumor, among the deadliest of cancers, with a five-year survival rate of only 7%.
BraTS challenge requires not only the segmentation of the whole tumor but also subsequently the tumor core and enhancing tumor ( Figure 5). The Dice coefficient is used to measure the quality of the segmentation.

Experiment Details
In the first experiment we consider 2D U-Net, 3D U-Net, and Cascaded Network for which the training details are presented in Table 3.
The second experiment combines the best method in terms of the highest Dice (i.e., 2D U-Net) with the proposed enhancement methods. Hence, the results of the retrained model are presented following curriculum learning (CL) and data augmentation (DA). The third experiment considers the equally weighted majority voting performed using the 3D U-Net, the Cascaded Network, and the best performing model (i.e., 2D U-Net + CL) from the second experiment. When used, all the DA and CL transformations are applied on 25% of the initial training dataset. Curriculum learning was first proposed by Bengio et al. [37] to deal with nonconvex optimization and avoid local optima. The intuition behind curriculum learning is to mimic human learning with a gradual training process whose examples are sorted in increasing order of difficulty. Following this idea, we propose to pretrain the considered models on artificially downsampled MRIs with a progressively increasing level of resolution. This enhancement was carried out by downsampling then upsampling by successive factors of eight, four, and two. Hence, the first model is trained with the data downsampled/upsampled by a factor of eight. Once saved, it is retrained with the data downsampled/upsampled by a factor of four. This process is then repeated with the data downsampled/upsampled by a factor of two. Finally, the resulting model is trained with the data at its original resolution.
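The resolution curriculum described above can be sketched as follows; the degradation is shown here with nearest-neighbor resampling, which is an assumption (the paper does not state the interpolation used), and `model.fit` stands in for the actual retraining step:

```python
import numpy as np

def degrade(volume, factor):
    """Downsample then upsample a 2D slice by an integer factor using
    nearest-neighbour resampling, producing a coarser version of the
    data for the early curriculum stages."""
    coarse = volume[::factor, ::factor]                                   # downsample
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)  # upsample

# Curriculum: pretrain on factor 8, then 4, then 2, then full resolution.
schedule = [8, 4, 2, 1]
x = np.random.rand(144, 160)  # one slice at the cropped BraTS resolution
for factor in schedule:
    x_stage = degrade(x, factor)
    # model.fit(x_stage, ...)  # hypothetical: retrain the saved model each stage
```

Each stage starts from the weights saved at the previous, coarser stage, so the network first learns gross structure before fine detail.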
Data augmentation is used to improve the robustness of the model by artificially increasing the size of the training dataset. In this study, the following transformations are used with randomly chosen settings: (a) 90-degree rotation, (b) horizontal/vertical flip, (c) cropping, (d) Gaussian white noise.
In order to simultaneously benefit from all the investigated methods, this proposal consists in developing an original method that combines the predictions provided by each technique (i.e., 2D U-Net, 3D U-Net, Cascaded Network). An equally weighted majority vote is then applied to each pixel of the input MRI. For the prediction, all the methods have the same relevance (weight) when assigning a score to each prediction. The final decision uses the prediction that obtains the highest voting score. If several different predictions obtain an identical score, the final prediction is randomly chosen among the best candidates.
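The per-pixel vote with random tie-breaking can be sketched as follows (the function name `majority_vote` is ours, and the label maps are assumed to be integer arrays of identical shape):

```python
import numpy as np

def majority_vote(predictions, rng=np.random.default_rng(0)):
    """Equally weighted per-pixel majority vote over integer label maps;
    ties are broken by a random choice among the best-scoring labels."""
    stacked = np.stack(predictions)  # (n_models, H, W)
    n_labels = int(stacked.max()) + 1
    # votes[l, i, j] = number of models predicting label l at pixel (i, j)
    votes = np.stack([(stacked == l).sum(axis=0) for l in range(n_labels)])
    best = votes.max(axis=0)
    out = np.empty(stacked.shape[1:], dtype=int)
    for idx in np.ndindex(out.shape):
        tied = np.flatnonzero(votes[(slice(None),) + idx] == best[idx])
        out[idx] = rng.choice(tied)  # random pick among equally voted labels
    return out

a = np.array([[0, 1], [2, 2]])
b = np.array([[0, 1], [0, 2]])
c = np.array([[1, 1], [2, 0]])
print(majority_vote([a, b, c]))  # [[0 1] [2 2]]
```

With three voters and multi-class labels, three-way ties are possible, which is why the random tie-break is needed.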

Data
The BraTS dataset was split using 125 patients: 25 patients are used for test and, of the remaining 100, 75% for training and 25% for validation. To improve the computational efficiency of our evaluation, each MRI of the dataset was cropped from 240 × 240 × 155 to 144 × 160 × 60 voxels, removing background regions.

Training Protocol
All three supervised methods were trained using the Dice Loss Function (DLF), equal to one minus the Sørensen-Dice index:

DLF = 1 − (2 |P ∩ T| + ε) / (|P| + |T| + ε),  (20)

where P denotes the set of the predicted pixels (P_i being the i-th element) and T the set of the corresponding ground truth. We arbitrarily set ε = 1 in Equation (20) to deal with the particular case where P and T only contain background values equal to zero. The 2D and 3D U-Nets were trained for 300 epochs, while the Cascaded Network was only trained for 30 epochs due to time constraints: this network requires separate training for each region and each of the three views, which increases training time.

2D U-NET was first proposed for biomedical image segmentation by Ronneberger et al. [38] (Figure 6). This architecture contains two paths, respectively called encoder and decoder, with several convolutional and maximum pooling layers at the encoder level and transposed convolution (up-conv) layers at the decoder. The autoencoder is designed to find a latent representation of a dimension smaller than the input, which is used for the segmentation task. Unlike the originally proposed U-Net, zero padding is used in the convolutions to preserve the dimension of the output at each layer, allowing more flexibility in the dimension of the input. The U-Net used in this article follows the architecture proposed by Dong et al. [39], depicted in Figure 6.

3D U-NET extends the U-Net network to volumetric segmentation [40]. The input is taken as the voxels of the volumetric images and the resulting output is a 3D segmentation mask. All the operations are in 3D, and a batch size of 10 with batch normalization was shown to improve training convergence. Another difference is the reduction in the number of blocks in each path from five to four. The Dice loss function of Equation (20) was also used for the training of this network.
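The Dice loss of Equation (20), including the ε = 1 guard for the all-background case, can be sketched on binary masks as follows (an illustrative NumPy version, not the Keras loss actually used in training):

```python
import numpy as np

def dice_loss(pred, truth, eps=1.0):
    """Dice loss = 1 - Sørensen-Dice index; eps = 1 handles the corner
    case where both prediction and ground truth are all background."""
    inter = np.sum(pred * truth)  # |P ∩ T| for binary masks
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(truth) + eps)

p = np.array([1, 1, 0, 0], dtype=float)
t = np.array([1, 0, 0, 0], dtype=float)
print(dice_loss(p, t))                       # 1 - 3/4 = 0.25
print(dice_loss(np.zeros(4), np.zeros(4)))   # 0.0: all-background case
```

Without ε, the all-background case would divide zero by zero; with ε = 1 the loss is exactly 0 there, as desired for a perfect (empty) prediction.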
The encoder path contains two 3D convolutions followed by a Rectified Linear Unit (ReLU), and a 2 × 2 × 2 maximum pooling with strides of two. The decoder path blocks include a 2 × 2 × 2 transposed convolution (up-conv) with strides of two in each dimension and two 3D convolutions followed by ReLU. The entire image is analyzed in the contracting path and the subsequent expansions produce the final segmentation. CASCADE NETWORK, proposed by Wang et al. [33], includes a combination of three CNNs that segment each of the three subregions sequentially: whole tumor, tumor core, and enhancing tumor. Anisotropic convolutions (i.e., dependent on the direction) are used to deal with 3D MRI, but this results in higher model complexity and memory consumption. Lastly, the fusion of the CNN outputs in three orthogonal views (axial, sagittal, and coronal) is used to enhance the segmentation of the brain tumor. The three CNNs follow the hierarchical structure of the tumor subregions, as depicted in Figure 7.
After the convolutional layer with zero padding, we get feature maps of the same size as the input. Each feature map is then passed through Z pooling with stride one, using k different windows of sizes d_1 × d_1, d_2 × d_2, . . . , d_k × d_k. This second layer is responsible for the increase of the receptive field, which is determined by the largest window size d_k. For an input of size s × s, we suggest d_k ≈ 2s to ensure that the receptive field is as large as the input image; hence k = ⌈log_2(s)⌉ + 2. In these experiments, the multiplicity is chosen as m = 10 and the window sizes are d_i = 2^{i−1} + 1, i.e., d_i ∈ {1, 3, 5, 9}. This is a good compromise between the size of the network and the expected performance. The other window sizes determine the scales at which the information is collected. The initial convolution ensures that the features are relevant for each scale. The multiplicity m makes it possible to collect multiple features per scale. The convolution layers are followed by ELU [41] as the activation function.
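The window-size schedule above can be made explicit; a small helper (our naming, with d_1 = 1 treated as the special first window, since 2^{i−1} + 1 alone does not produce 1):

```python
import math

def z_window_sizes(s):
    """Window sizes for an s x s input: d_1 = 1, then d_i = 2^(i-1) + 1,
    with k = ceil(log2(s)) + 2 windows so that d_k >= 2s covers the
    whole image."""
    k = math.ceil(math.log2(s)) + 2
    return [1] + [2 ** (i - 1) + 1 for i in range(2, k + 1)]

print(z_window_sizes(4))   # [1, 3, 5, 9], as in the experiments
print(z_window_sizes(32))  # largest window 65 >= 2 * 32
```

The doubling of window sizes gives a pyramid of scales at a logarithmic cost in the number of windows, while the pooling itself stays parameterless.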

Results and Discussion
Results presented in Table 4 (and illustrated in Figure 7) show the effectiveness of each method, measured in terms of Sørensen-Dice index, Recall, and Precision on the tumor core region only, the most difficult to segment. The pooling layers with configuration-2 favor segmentation, unlike configuration-3, which favors ultrametric contours.
According to Table 4, the 2D U-Net obtains the highest Dice scores for the three subregions during the first experiment: tumor core = 0.65 (and, for the record, whole tumor = 0.65 and enhancing tumor = 0.46). The scores for the Cascaded Network and the 3D U-Net are not drastically lower. Given this result, the 2D U-Net was chosen for the improvement experiments: CL and DA. With regard to the equally weighted majority vote, the predictions of the first three methods were used to obtain the final prediction. The improvement in the 2D U-Net results shows that the three proposed enhancement methods improve the Dice score and Precision, but the 2D U-Net trained with CL outperforms the others. Z pooling works comparatively better than maximum pooling in terms of Precision, Recall, and Dice score. The Dice score indicates that Z pooling with max(min) misses fewer tumor cores on average than the max(max) combination. The best combination is obtained with a 2D U-Net with curriculum learning and Zeck min(max) (Dice = 0.77, Recall = 0.8, Precision = 0.87), while comparatively the 2D U-Net alone gives lower scores (Dice = 0.72, Recall = 0.79, Precision = 0.77). Note that associating the 2D U-Net with both DA and CL provides disappointing performance. The intuitive explanation is that Z pooling prepares a CNN for segmentation tasks better than maximum pooling by sharpening the edges. The weights of the CNN will accentuate the edges during training whenever there is a significant difference between two adjacent pixels. In capturing ultrametric contours, Z pooling can be seen as a kind of pretraining of the network that accelerates learning and enhances the segmentation result.

Conclusions
To conclude, the experiments presented along with the results have demonstrated an evaluation pipeline for the supervised segmentation of MRI images with Z pooling. CNNs have once again proven to excel in image processing and, more specifically, in learning and distinguishing the characteristics that enable segmentation. Simple or complex additions or changes based on Z pooling have been shown to improve results, which reinforces the need to further advance research in this area. The goal of this research was met: not only to examine the presented methods but also to introduce the enhancements and enable a thorough comparison.
Earlier, we raised two questions: when should we perform pooling? Is texel pooling more efficient than pixel pooling?
It is advisable to pool when we can extract features contained in the binned subregions from the input representation (input image, hidden layer, etc.). As mentioned in the discussion, some of the enhancements to the pooling improved certain results and diminished others. However, in most of our experiments, the texelization of the pooling layer improved the image segmentation capacity of the CNN. This is because Z coding, compared to other local descriptors, (a) can be extended to any neighborhood size or geometry, (b) is shift invariant, (c) is rotation invariant, (d) is nonlinear, (e) follows an integer generating function, and (f) is less sensitive to noise.
The correct scale is therefore part of the definition of texture and plays an important role. In other words, texel pooling is more efficient in general because our world is "textured", but performance decreases as the signal-to-noise ratio gets worse.
We challenged a concept of feature extraction that has been uncontested for three decades: the feature extraction pyramid. Our method offers a family of solutions that enhance Z pooling with different window sizes. The effective receptive field of our method can be modified freely through the pooling window sizes without affecting the number of parameters, whereas traditional feature extraction pyramids incur a high parameter cost as the receptive field increases.
Further investigations should target all combinations of Z pooling operators and find out a performance criterion to maximize that describes the pixel organization.