On the Scale Invariance in State of the Art CNNs Trained on ImageNet †

Abstract: The widespread practice of pre-training Convolutional Neural Networks (CNNs) on large natural image datasets such as ImageNet causes the automatic learning of invariance to object scale variations. This, however, can be detrimental in medical imaging, where pixel spacing has a known physical correspondence and size is crucial to the diagnosis, for example, the size of lesions, tumors or cell nuclei. In this paper, we use deep learning interpretability to identify at what intermediate layers such invariance is learned. We train and evaluate different regression models on the PASCAL-VOC (Pattern Analysis, Statistical modeling and ComputAtional Learning-Visual Object Classes) annotated data to (i) separate the effects of the closely related yet different notions of image size and object scale, (ii) quantify the presence of scale information in the CNN in terms of the layer-wise correlation between input scale and feature maps in InceptionV3 and ResNet50, and (iii) develop a pruning strategy that reduces the invariance to object scale of the learned features. Results indicate that scale information peaks at central CNN layers and drops close to the softmax, where the invariance is reached. Our pruning strategy uses this to obtain features that preserve scale information. We show that the pruning significantly improves the performance on medical tasks where scale is a relevant factor, for example for the regression of breast histology image magnification. These results show that the presence of scale information at intermediate layers legitimates transfer learning in applications that require scale covariance rather than invariance and that the performance on these tasks can be improved by pruning off the layers where the invariance is learned. All experiments are performed on publicly available data and the code is available on GitHub.


Introduction
Computer vision algorithms trained on natural images must achieve scale invariance for optimal robustness to viewpoint changes. Multi-scale scale invariant approaches are popular in both image processing (e.g., local descriptors, filter banks, wavelets and pyramid scale space [1]) and in recent deep learning techniques [2][3][4][5]. Deep Convolutional Neural Networks (CNNs) [6,7] achieve state-of-the-art performance in object recognition tasks with scale variations (e.g., ImageNet [8]) by implicitly learning scale invariance even without a pre-defined invariant design [9]. Such invariance, together with other learned features of color, edges and textures [10,11], is transferred to other tasks when pretrained models are used to learn from limited training data [12]. Training from scratch is sometimes a preferred alternative to introduce desired invariances in the learned features [13,14]. Scratch training is adopted by scale covariant [4] and multi-scale designs [15][16][17][18].
This work is based on the assumption that the scale invariance implicitly learned from pretraining on ImageNet can be detrimental to the transfer to applications for which scale is a relevant feature. With a controlled viewpoint and known voxel spacing dimensions, scale is informative (and often decisive) in some medical imaging tasks (e.g., size of lesions, tumoral regions or cell nuclei, as illustrated in Figure 1). The other transferred features such as shape, color and texture, however, are beneficial to the medical tasks, in particular for learning from limited data and improving the model accuracy and speed of convergence [10][19][20][21]. We therefore formulate the hypothesis that a specific design retaining helpful features from pretraining while discarding scale invariance can perform better than both a standard transfer and training from scratch. The experiments in this paper focus on validating this hypothesis by identifying the network layers where the invariance to scale is learned and by proposing a way to isolate and remove this unwanted behavior while maintaining the beneficial impact of transfer. We make use of deep learning interpretability to preserve the scale covariance of the deep features [22]. The network layers where invariance to scale is learned are identified by applying Regression Concept Vectors (RCVs) [23], a post-hoc interpretability method that uses linear probes [24,25] to determine the presence of a given concept in the network features. This information is used to optimize the transfer by developing a pruning strategy that maintains scale-covariant features without requiring the re-training from scratch in [13,14] or any specific network design.
The experiments in this paper extend results, discussions and visualizations of our published research in the Workshop on Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC) at the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI2020) [22] with new in-depth analyses and results. The additional contributions of this paper are stated in the following. New analyses including experiments on image resizing in Section 4.1 and inputs of random noise in Section 4.2 are used to show that object scale and input size have dissociated representations in the CNN layers. While the former is learned from the input data, the latter is shown to be intrinsically captured by the architecture (see Section 4.2). The results on the scale quantification are validated for multiple ImageNet object categories in Section 4.3. The significance of the results on the histopathology task is evaluated by statistical testing in Section 4.4. An additional study is performed on models trained from scratch for this task in Section 4.4, showing that our proposed pruning strategy outperforms both these models and the pretrained networks.
The results from this work increase our understanding of scale information in feature reuse. Scale covariance is highest at intermediate layers for all ImageNet object categories, while the invariance is learned in the last dense prediction layer (Section 4.3). This is relevant not only in medical imaging but also in other applications with a controlled viewpoint, for example, remote sensing, defect detection, material recognition and biometrics (e.g., iris and face recognition with registered images) [1]. Considering this information about scale may help to build models that predict the magnification range of images for which the physical dimension of voxels is unknown, for example, when the magnification level is not reported. In the medical context, these results may have a positive impact on the use of large and growing open-access biomedical data repositories such as PubMed Central (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, accessed on 2 April 2021) to extend existing medical datasets [26].

Related Work
Built-in scale covariance (features that are covariant with a transformation are also referred to as continuous with this transformation) and invariance in specific CNN designs have been studied and implemented in the literature [2][3][4][5][27,28]. While these methods, together with other types of inherent covariance, can alleviate the need for pre-training and large amounts of available data, transfer learning remains extremely common in deep learning applied to medical imaging [21]. In an attempt to understand CNN behavior with respect to scale, manually selected deep activations were shown to respond to faces viewed at different scales in [29]. Invariance to scale in classic CNN architectures has been analyzed in [30], where the authors used computer-generated images to control attributes (concept measures, including scale) of a single object and visualized the effect on the internal representations. In [9], the regression of geometric image transformations (e.g., image flips and half-rescaling) was studied in an attempt to learn the homomorphic transformations in the feature space that account for the transformations of the input. The authors conclude that scale invariance is implicitly learned on ImageNet as accuracy is not improved by reversing the scaling transformations in the feature space. While [9] learns transformations in the feature space of a trained network, an end-to-end supervised method is proposed in [31] to enforce the disentanglement of transformations including rotations and scales, providing built-in covariance properties. On another line, the vulnerability of CNNs to adversarial attacks with transformations including scaling was studied in [32,33].
Network pruning approaches were proposed in [34,35], with medical applications for PAP smear imaging [36] and Chest X-rays [37]. Pruned networks achieve a similar performance to, if not better [36] than, that of the original network. The asset of network pruning is that even if not providing massive increases in network performance it improves training convergence and it reduces the number of parameters to be trained and thus the computational complexity of the models [37]. This allows the training and fine-tuning of the models on smaller datasets, as shown by the study on PAP smears [36]. Pruning methods mostly focus on identifying the importance of individual elements in the network, such as individual neurons [34], individual filters and/or feature maps [36,37]. Particularly in [35], the authors dealt with multiple object scales by specific-design observations that can make their pruning responsive to multiple object scales. We propose a pruning strategy that, differently from [34,36], focuses on entire layers and that evaluates the layer importance in terms of the scale covariance of the extracted features. Our pruning strategy does not require an explicit design as in [35], nor expensive computations of evolutionary strategies as in [37]. Our method can be applied to any architecture pre-trained on ImageNet inputs to understand the scale covariance of intermediate layers and proposes a pruning strategy that can improve the transfer to applications where object scale is a relevant feature.
Post-hoc interpretability, as defined in the taxonomy of Lipton [38], is particularly suited to the analyses required by this paper since it does not require adding any additional constraints to the optimization. A post-hoc method can be applied to any model without the need to re-train the parameters. Linear classifier probes [24] were proposed to analyze class-separability at intermediate layers in terms of the classification of the class labels by a linear model. Kim et al. introduced Concept Activation Vectors (CAV) [39] to classify arbitrary concepts (e.g., striped texture) that can be either present or absent in a set of sample images. RCVs [23] extended CAVs to continuous concept measures with a linear regression at intermediate layers of the CNN. This approach led to insightful observations in general computer vision [25,40] and medical imaging [23,41,42].

Materials and Methods
This section outlines the proposed method and the setups used for the experiments. Section 3.1 introduces the notations in the paper, while Sections 3.2 and 3.3 describe the datasets and network architectures, respectively. We outline the main approach in Section 3.4, while the evaluation metrics are defined in Section 3.5. The hypotheses, scope and methodologies of the multiple experiments are described in Section 3.6.

Notations
We consider an input image $X \in \mathbb{R}^{h \times w}$, where $h$ is the image height and $w$ is the width. The function $\phi(\cdot)$, defined as $\phi : \mathbb{R}^{h \times w} \to \mathbb{R}^{d}$, maps the input image to a vector of arbitrary dimension $d$. At intermediate layers, the $d$ scalars are obtained from averaged feature maps. At the final fully-connected layer, $\phi(\cdot)$ transforms $X$ into a set of predictions. As further explained in Section 3.4, we analyze the scale information using covariance, defined through the transformation $g' : \mathbb{R}^{d} \to \mathbb{R}^{d}$ that predicts, in the feature space obtained by $\phi(g(X))$, the transformation $g : \mathbb{R}^{h \times w} \to \mathbb{R}^{h \times w}$ of the input image $X$. The scaling transformations are expressed as $g_\sigma(\cdot)$, being parameterized by a scale factor $\sigma$. We consider images of original size $S_o = h_o \times w_o$ containing a single object that is annotated by a bounding box of size $S_b$.

Datasets
The experiments in this paper involve two datasets since the scale analysis is performed on inputs of natural images and the proposed final architecture is evaluated on a medical image analysis task. For the scale quantification part, images with manual annotations of bounding boxes are selected from the publicly available PASCAL-VOC dataset [43]. We restrict our analysis to images containing a single bounding box from three object categories chosen among the available annotated classes. These are albatross (ID: n02058221, 441 images), kite (ID: n01608432, 406 images) and race car (ID: n04037443, 365 images).
For the histopathology application, the data consist of 141 Whole Slide Images (WSI) of Estrogen Receptor-positive Breast Cancer (ERBCa+). For these images of 2000 × 2000 pixels, manual annotations of 12,000 nuclei are available [44]. Smaller image regions are extracted as image patches from the WSIs. A total of 69,019 patches with nuclei segmentation masks were split into training, validation and test partitions (approximately 60%, 20% and 20%, respectively) as shown in Table 1. To avoid introducing bias, all the patches from a single image were assigned to the same data partition. The imbalance in the magnification categories is due to the area covered by each magnification level. The average nuclei area is extracted for each input image by computing the average number of pixels in the corresponding nuclei segmentation mask. Example images with overlaid segmentation masks are displayed in Figure 2.

Network Architectures
InceptionV3 [6] and ResNet50 [7] are used for the analysis with pre-trained ImageNet weights. The networks produce a vector of probabilities f(X). Transfer to the histopathology data is performed from both the original and pruned architectures. To predict the average nucleus area, a single-unit dense layer is trained to minimize the mean squared error loss between the true areas and the predicted ones. The nuclei area is expressed for each image as the average number of pixels within the segmentation of the nuclei present in the image. The magnification category is also obtained from the average nuclei areas. The predicted areas are mapped to the magnification category that has the closest mean value of the nuclei areas in the training set. This mapping approach was used since it outperforms the direct classification of the magnification in [17]. The networks are implemented in Keras and trained for five epochs with an Adam optimizer and standard hyperparameters (learning rate 1 × 10^−4, batch size 32 and default values of the exponential decay rates). The full pipeline is shown in Figure 3 and the source code is available on GitHub for reproducibility (https://github.com/medgift/scale_covariant_pruning, accessed on 2 April 2021).
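The mapping from predicted nuclei areas to magnification categories described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function names, the toy area values and the category labels (5, 10, 20) are hypothetical and are not taken from the dataset.

```python
import numpy as np

def fit_category_means(train_areas, train_categories):
    """Mean nuclei area of each magnification category in the training set."""
    train_areas = np.asarray(train_areas, dtype=float)
    train_categories = np.asarray(train_categories)
    categories = np.unique(train_categories)
    means = np.array([train_areas[train_categories == c].mean() for c in categories])
    return categories, means

def areas_to_categories(pred_areas, categories, means):
    """Assign each predicted area to the category with the closest training mean."""
    pred_areas = np.asarray(pred_areas, dtype=float)
    nearest = np.abs(pred_areas[:, None] - means[None, :]).argmin(axis=1)
    return categories[nearest]

# Toy values (hypothetical, for illustration only).
cats, means = fit_category_means(
    [50, 60, 210, 190, 800, 820],   # average nuclei areas in pixels
    [5, 5, 10, 10, 20, 20],         # hypothetical magnification labels
)
predicted = areas_to_categories([70.0, 300.0, 700.0], cats, means)
```

Because the category is derived from a regressed area, the same single-unit regression head serves both the area and the magnification predictions.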

Quantification of the Scale and Pruning Strategy
Our method quantifies object scale in the input and in the representation space. The act of scaling is defined in image processing as a transformation g σ (·) that generates a new image with a larger or smaller number of pixels, depending on the scaling factor σ.
In the input space, one may intuitively think of $g_\sigma(\cdot)$ as a reshaping operation. This transformation, however, causes the "train-test" resolution discrepancy in [45] during network inference. We focus this work on images containing a single object, for which we can define image scale as the solid angle of the object in the image, namely the proportion of the field of view occupied by the object [46]. Since a smaller bounding box corresponds to a smaller space in the field of view of the camera and thus a smaller solid angle, we measure scale as a function of the bounding box area $S_b$. Scale measures are thus defined as the ratio $r = S_b / S_o$.
In the feature space, we aim at finding a linear transformation $g'_\sigma(\cdot)$ that is a predictable transformation of $g_\sigma(\cdot)$ in the input space. We start by using the definitions of invariance (1) and covariance (2) of a mapping $\phi(\cdot)$ to a transformation $g(\cdot)$ as follows:
$$\phi(g(\cdot)) = \phi(\cdot), \tag{1}$$
$$\phi(g(\cdot)) = g'(\phi(\cdot)). \tag{2}$$
In our analysis, we consider functions of the input image $X$. We evaluate the covariance of the function $\phi(X)$, that is, whether we can find a transformation $g' : \mathbb{R}^{d} \to \mathbb{R}^{d}$ in the feature space that predicts a transformation $g : \mathbb{R}^{h \times w} \to \mathbb{R}^{h \times w}$ of the input image (in these terms, equivariance is a particular case of covariance, when $g'(\cdot) = g(\cdot)$; equivariance implies that the function $\phi(\cdot)$ maps an input image to a function in the same domain, which is not relevant in our scenario). $\phi(X)$ consists of $d$ scalars representing either the averaged feature maps of intermediate layers or the activations of fully-connected layers. To find the transformation $g'_\sigma(\cdot)$, we search for a regression vector $v$ (i.e., the RCV [23]) in the feature space to predict the scaling factor $\sigma$ as
$$\sigma = v \cdot \phi(g_\sigma(X)) = \sum_{i=1}^{d} v_i \, \phi_i(g_\sigma(X)). \tag{3}$$
For simplicity, we omit the intercept; in Equation (3), the intercept is $v_0$ with $\phi_0(g_\sigma(X)) = 1$. We then have that $g'_\sigma(\cdot)$ can be represented as a translation matrix (in $\mathbb{R}^{d}$) by $\sigma$ along $v$, so that $g'_\sigma(\phi(X)) = \phi(X) + v \cdot \sigma$.
The proposed pruning strategy compares the test $R^2$ (determination coefficient) of the regression vectors obtained at multiple depths to identify the layer where the scale covariance is the highest, that is, the layer with the highest test $R^2$ (the yellow layer in Figure 3). Layers deeper than this one are pruned off the architecture and a GAP operation is added to obtain a vector of aggregated features.
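The regression of a scale measure from averaged feature maps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the activations are synthetic (random noise plus a component that grows linearly with scale along a hypothetical direction), whereas in the paper they come from intermediate layers of InceptionV3 or ResNet50. The function names are illustrative, and the 70/30 split mirrors the one used in the experiments.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each feature map over its spatial axes: (n, h, w, d) -> (n, d)."""
    return feature_maps.mean(axis=(1, 2))

def fit_rcv(features, scales):
    """Least-squares fit of scales ~ features @ v + v0; returns (v, v0)."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, scales, rcond=None)
    return coef[:-1], coef[-1]

def r2_score(y_true, y_pred):
    """Determination coefficient on held-out samples."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, h, w, d = 200, 4, 4, 8
scales = rng.uniform(0.1, 1.0, size=n)     # synthetic scale measures
direction = rng.normal(size=d)             # hypothetical scale direction
# Synthetic activations: noise plus a scale-dependent shift along `direction`.
maps = rng.normal(scale=0.1, size=(n, h, w, d)) + scales[:, None, None, None] * direction

feats = global_average_pool(maps)
split = int(0.7 * n)                       # 70% to fit, 30% held out
v, v0 = fit_rcv(feats[:split], scales[:split])
test_r2 = r2_score(scales[split:], feats[split:] @ v + v0)
```

Repeating the fit at several depths and comparing the resulting `test_r2` values gives the layer-wise profile on which the pruning decision is based.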

Evaluation
In Section 4.1, the network performance is monitored in terms of top-5 accuracy and average probability of the correct class for a set of $N$ inputs, that is, $\bar{p} = \frac{1}{N} \sum_{j=1}^{N} f(X_j)_{y_j}$ for ground-truth labels $y_j$. The regression of image size in Section 4.2 and the RCV of scale in Section 4.3 are evaluated on held-out images using the determination coefficient
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (s_i - \hat{s}_i)^2}{\sum_{i=1}^{N} (s_i - \bar{s})^2},$$
where $N$ is the number of test data samples, $\hat{s}_i$ is the size predicted by the regression model and $\bar{s}$ is the mean of the true sizes $\{s_i\}_{i=1}^{N}$. A similar formulation applies to the evaluation of the regression of the scale ratio $r$. To keep the test $R^2$ within a [0,1] range for visualization and comparison, we report the normalized test $R^2$, computed as $e^{R^2}/e = e^{R^2 - 1}$, for which values below $1/e \approx 0.37$ (i.e., $R^2 < 0$) indicate bad performance. The transfer learning experiments on the histopathology task in Section 4.4 are evaluated by the Mean Absolute Error (MAE) and Cohen's kappa coefficient. MAE is used to evaluate the regression of the average nuclei areas, while Cohen's kappa coefficient is used to measure the inter-rater reliability of the prediction of the magnification classes.
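The metrics of this section can be written compactly with NumPy. The sketch below is illustrative and makes two assumptions that the text does not state explicitly: MAE is taken to be the mean absolute error, and Cohen's kappa is computed in its unweighted form. The function names are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Determination coefficient; can be arbitrarily negative for poor fits."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def normalized_r2(r2):
    """Map R^2 in (-inf, 1] to (0, 1]; R^2 = 0 maps to 1/e (about 0.37)."""
    return np.exp(r2 - 1.0)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two label sequences."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    classes = np.unique(np.concatenate([a, b]))
    p_observed = np.mean(a == b)
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in classes)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy checks (hypothetical values).
perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])     # 1.0
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])    # 0.0, mean predictor
kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])          # 0.5
```

The exponential normalization compresses large negative values of R² towards zero, which is why it is convenient for plotting layers where the regression fails badly.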

Experimental Setups
In the following, we clarify the objectives and setups of the performed experiments. The experiments in Section 4.1 address the main hypothesis that CNNs pretrained on ImageNet are invariant to transformations of the input size. We want to show, in particular, that this behavior also holds for images containing objects that naturally appear at various scales due to varying viewpoints (for which an example is given in Figure 4). To show this, we set up two related experiments. In the first experiment, we use filled images containing only one object covering the entire space in the image (i.e., S b ≈ S o ), which are selected manually from the pool of ImageNet validation images. In the experiment, each filled image is reshaped to a squared input of arbitrary size and the network output is monitored by checking the probability of the correct class (top-1 accuracy) and the top-5 accuracy. We use the Lanczos interpolation (similar results were obtained using bilinear, nearest and bicubic interpolations) to reshape the images to a squared input of S i = s i × s i , with s i ranging from 75 to 500 pixels. A total of 69 images of size S o = 500 × 500 and another 69 of smaller original size (mean S o = 285 × 285) were used. In the first set, images are either reduced or increased in size by the interpolation, whereas in the latter they are only reduced. In the second experiment, we separate the impact of input size S i from that of object size S b . We no longer use filled images and we release the condition S b ≈ S o . In other words, we compare images that are resized to the same sizes S i , but that contain objects of different sizes. Section 4.2 further analyzes the difference between changing input size and object scale. We formulate the main hypothesis that the scaling operation g σ (·) cannot be performed as a simple input reshaping operation because the CNN features encode information about image size differently from object scale.
We hypothesize that information about image size is encoded in the features from the padding effect of early convolutional layers. To verify this, we introduce the corrected Global Average Pooling (GAP) illustrated in Figure 5. This operation averages only the activations of the neurons with a receptive field contained entirely in the input image. This is in practice equivalent to discarding the activations at the border of the feature maps that are affected by padding operations. Images of white noise of different sizes are used for this experiment, since they contain neither objects nor a related scale. These images are generated by sampling pixel values from a uniform distribution in the range [0, 255]. The experiment aims at regressing the image size for these noise inputs in the intermediate layers of the CNN. If the network encodes information about the image size differently from object scale, then we should be able to regress the size from the noise inputs. If this information is encoded from the padding at early layers, then the regression with the corrected GAP should fail as this operation discards the edges of the feature maps. We thus compare the regression of image size with and without the corrected GAP to show that current state-of-the-art CNN architectures encode information about the image size. The regression vector v in Equation (3) is sought to regress the image width s i . Since the receptive fields grow throughout the network, the region of activations unaffected by the padding shrinks up to a point where no activation remains for the corrected GAP. Because of this limitation, we can only use this method to show the impact of zero-padding but we cannot use it for the analysis of scale invariance throughout the network.
In the next experiments, we use images with input size fixed to S i = 299 × 299 because of our hypothesis that input size and object scale are learned in different ways.
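The corrected GAP can be illustrated with a short NumPy sketch. It assumes that the number of border positions affected by padding at a given layer is known and passed in as `border`; this parameter, the function names and the toy feature map are hypothetical, and deriving `border` from the receptive field of an actual layer is not shown.

```python
import numpy as np

def gap(feature_map):
    """Regular GAP over the spatial axes of an (h, w, d) feature map."""
    return feature_map.mean(axis=(0, 1))

def corrected_gap(feature_map, border):
    """Average only positions at least `border` cells away from every edge."""
    h, w = feature_map.shape[:2]
    if 2 * border >= min(h, w):
        # Deep layers: every activation is affected by padding.
        raise ValueError("no activation is free of padding effects")
    inner = feature_map[border:h - border, border:w - border]
    return inner.mean(axis=(0, 1))

def white_noise_image(size, rng):
    """Square RGB noise image with values drawn uniformly from [0, 255)."""
    return rng.uniform(0, 255, size=(size, size, 3))

# Toy feature map: ones inside, zeros on the border, mimicking padding effects.
fm = np.ones((6, 6, 2))
fm[0, :, :] = fm[-1, :, :] = 0.0
fm[:, 0, :] = fm[:, -1, :] = 0.0

regular = gap(fm)                 # pulled down by the border: 16/36 per channel
corrected = corrected_gap(fm, 1)  # border discarded: 1.0 per channel
noise = white_noise_image(75, np.random.default_rng(0))
```

In this toy case the regular GAP depends on the fraction of border positions, which changes with the input size, while the corrected GAP does not. The `ValueError` branch reflects the limitation noted above: once the unaffected region vanishes, the corrected GAP can no longer be computed.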
The measures of scale are based on the ratio r defined in Section 3.4. The experiments in Section 4.3 focus on the regression of scale measures in ImageNet pretrained models for the object categories albatross (ID: n02058221), race car (ID: n04037443) and kite (ID: n01608432).
We use 70% of the input class images to learn the regression, while the remaining images are held out for evaluating the determination coefficient. Examples of images and their corresponding scale concept measures r are shown in Figure 4 for the albatross class. Finally, we run experiments on the transfer to the histopathology task in Section 4.4. The information extracted in the previous experiments is used to improve the transfer of pretrained features to the medical imaging task in [17]. This is obtained by implementing the pruning pipeline summarized in Figure 3. The medical task in these experiments is the regression of the average area of the nuclei in histopathology images. The pruning of network layers is performed by comparing the test R 2 on the natural images (Figures 8a and 9 for InceptionV3 and Figure 8b for ResNet50) to identify the layer where the scale covariance is the highest. This evaluation is averaged across object categories to remove the dependence on the class of the inputs (see Appendix A.1, Figure A2).
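The selection of the pruning layer from class-averaged test R² values can be sketched as below. The R² numbers are made up for illustration and are not results from the paper. The layer list is a hypothetical subset of InceptionV3 layer names, and averaging the raw (rather than normalized) R² is an assumption.

```python
import numpy as np

def select_pruning_layer(layer_names, r2_per_class):
    """Return the layer with the highest class-averaged test R^2.

    r2_per_class maps each object category to a list of test R^2 values,
    one per entry of layer_names (ordered from shallow to deep).
    """
    stacked = np.array([r2_per_class[c] for c in sorted(r2_per_class)])
    mean_r2 = stacked.mean(axis=0)
    best = int(mean_r2.argmax())
    return layer_names[best], mean_r2

layers = ["mixed2", "mixed5", "mixed8", "mixed10", "softmax"]
r2_values = {                       # illustrative numbers only
    "albatross": [0.2, 0.6, 0.8, 0.70, -0.5],
    "race car":  [0.1, 0.5, 0.9, 0.60, -1.0],
    "kite":      [0.3, 0.7, 0.7, 0.65, -0.2],
}
best_layer, mean_r2 = select_pruning_layer(layers, r2_values)
# Layers deeper than best_layer would be pruned and replaced by a GAP.
```

With these toy values the selected layer is `mixed8`; on real data the choice depends on the measured layer-wise profile.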

Invariance of the Predictions to Resizing
This section contains the results of the first two experiments described in Section 3.6. The CNN predictions for reshaping transformations of the filled images are reported in Figure 6. The two subsets of images show no marked differences. The results of the second experiment are not reported for brevity and because they are very similar to those in Figure 6. The predictions were only slightly better with filled images (S b ≈ S o ) and we did not notice a shift of lower probabilities towards smaller input sizes when the objects are smaller (S b < S o ). This shows that S i is more relevant for the predictions than the object size S b , as expected.

Experiments on Noise Inputs
In this section, we use inputs of white noise that contain neither objects nor scale information. The regression of s i is learned from five noise images (we intentionally use a small number of images to illustrate the simple linear correlation; similar results are obtained when using more images) and evaluated on 20 held-out images.
The results show that we can regress the size for the model with the regular GAP in deep layers, with the R 2 close to one in Figure 7a. On the contrary, Figure 7b shows that we cannot regress the size information when aggregating the feature maps using the corrected GAP (R 2 < 0).
In light of these results and those in Section 4.1, we do not associate the input size to the measure of object scale in the analyses of the next section.

Layer-wise Quantification of Scale Covariance
In this section, we start by focusing on 441 images of the albatross class containing a single object bounding box. Later in the section, the experiments are extended to the race car and kite classes.
In Figure 8a, we compare the scale regression in a randomly initialized InceptionV3 with one trained on ImageNet. We regress the scale concept measures in activations at different depths, as explained in Section 3.4. We also compare to a baseline in which the regression is trained with random concept measures, that is, shuffling the scale concept measures before regression. As explained in Section 3.5, we report the fraction $e^{R^2}/e$ to visualize positive values in the presence of large variations in negative values (as low as −11,077). The detailed values of $R^2$ are reported in Table A1.
The results are similar on the other two classes, for which the results in InceptionV3 are reported in Figure 9. The results on the ResNet50 architecture are also similar and can be seen in Figure 8b and in Figure A1.

Improvement of Transfer to Histopathology
The original InceptionV3 and ResNet50 networks are compared to their pruned counterparts in terms of performance in the nuclei area and magnification prediction in Table 2. We report the Mean Absolute Error (MAE) across ten repetitions (different seeds were used to initialize the dense connections to the last prediction layer) and the relative standard deviation for the prediction of the average area. We also report Cohen's kappa coefficient for the prediction of the magnification category. For both evaluated networks and both tasks (area and magnification prediction), the results show significant improvements when the networks are pruned at the relevant layer, validating the scale invariance analysis proposed in the previous sections. The non-parametric Wilcoxon signed-rank test was used to evaluate the statistical significance (p-value < 0.001 for the MAE and kappa with both networks). The average MAE (standard deviations reported in brackets) between the true nuclei areas and those predicted by the pruned InceptionV3 are respectively 55.33

Discussion
In this section, we discuss the results and give further insights regarding their interpretation, referring to previous studies in the field that support our hypotheses.
We first analyze the relationship between image size and object scale. Our first experiment in Section 4.1, reported in Figure 6, shows that the average probability for the correct class is approximately invariant to input sizes in the range [175,300]. Images in this range are likely seen during training since 299 is the size used to train the classification task on ImageNet. The top-5 accuracy reaches its maximum and, unlike p, plateaus for s i > 200. This is explained by probabilities being more spread across classes, while the highest probabilities are still given to the correct classes. The invariance of the predictions to the upsizing or downsizing of the original image size, also discussed in this section, confirms that the interpolation used for down- and up-sampling (bilinear, nearest, bicubic or Lanczos) has a negligible influence on the predictions. The experiment with resized images containing objects of multiple sizes shows that the information of input size prevails over that of scale when these two are not correctly separated. A similar yet less detailed analysis performed in [45] showed an increase of top-1 accuracy when training and testing sizes approximately match. The strong encoding of information about the input size within the network is attributed by the authors in [45] to the change in the distributions of the ReLU activations of deep layers for smaller input images. We further support our analysis with the experiments on noise inputs in Section 4.2. The white noise images do not contain any object, yet the information about image size is captured also in these images (as shown by the results in Figure 7a). By introducing the corrected GAP, we show that the regression of image size in noise images is mostly due to the padding effects at early convolution layers that encode information about the input size.
In Figure 7b, we confirm this hypothesis by showing the poor performance of the linear regression when removing the information on input size by manually correcting the GAP.
From the quantification of scale covariance in Section 4.3, we observe that information about scale is present at intermediate layers, and that invariance is reached only towards the last layers before softmax. Comparing the regression in the intermediate CNN layers of real concept measures (reported in blue in Figure 8a,b) and those of random concept measures (reported in green in the same figures), we conclude that the scale information is present at intermediate layers. We can linearly regress the true scale ratios better than random values of scale, with R 2 close to one. The R 2 of the randomly initialized model weights are close to the ones obtained with random concept measures and less than zero for almost all layers. This shows that an architecture with random weights does not contain information of scale and that this information is learned during network training. The low R 2 in the early layers of the trained networks seems to be due to the size of the receptive field, which is too narrow for correctly regressing the input scale. This is also supported by the previous results in [47,48], which suggest that early layers focus on local textures and small object parts. We show this further in Appendix A (Figure A3) by visualizing the internal features at different depths. Primitive features of color and texture are not sufficient for regressing the object scale. The more complex features of object parts learned after the mixed2 layer enable this regression. Finally, the drop in the regression performance at the end of the trained network shows that scale invariance is learned in deep layers, mostly in the last dense layer (pre- and post-softmax).
The task analyzed in the final experiment for improving the transfer of the learned weights to histopathology data represents an important problem in this field. Many open access repositories (e.g., PubMed Central) do not provide information about the magnification level of the images, which thus become difficult to integrate with other datasets. Data from open access repositories or social networks can provide examples of rare and under-represented cases since these images are often presented for visual comparison and discussion among experts [26]. The proposed pruning strategy drops the layers with scale-invariant features to improve the transfer and better regress the magnification level of histopathology images. For InceptionV3, the pruned features are the result of a GAP on top of the mixed8 features. As shown in Table 2, the MAE = 54.93 of the nuclei area regression in mixed8 is significantly lower than the MAE = 81.85 in mixed10. This corresponds to a better prediction of the magnification range, hence to a higher kappa coefficient. The pruned architectures also reduce complexity, requiring the training of 51% and 19% fewer parameters for InceptionV3 and ResNet50, respectively.

Conclusions
In this paper, we designed and used an experimental approach to analyze the covariance to object scale in CNNs trained on ImageNet. We then used this analysis of state-of-the-art CNNs to improve the transfer of these pre-trained networks to a medical task. We distinguished between input size and object scale, showing that these two measures should be properly separated to interpret the scale covariance of CNN features. Our scale quantification, based on the regression of scale ratios, is an intuitive and easy-to-apply method to determine the invariance to scale of intermediate network features. We showed that deep features (up to the penultimate layer) are linearly scale-covariant. These pre-trained features can therefore safely be used, either as fixed feature extractors or for fine-tuning, in tasks in which the scale provides crucial information.
Our network pruning strategy can improve transfer by maintaining the scale-covariance of the features without requiring any explicit design or retraining of the network weights and can thus be applied to state-of-the-art CNNs pre-trained on ImageNet. The proposed pruning largely improves the prediction of magnification in histopathology images.
We recognize the limitations of the proposed work, including the linearity of the regression: information about scale can be present in the features and yet be impossible to regress linearly. In future work, we will investigate non-linear regression and manifold learning of the feature space.

main paper for brevity. In each class, 70% of the images are used for learning the regression, and the remaining 30% are used to evaluate it. In Figure A2, we report the results, averaged across classes, that were used to select the pruning layer for both architectures. As mentioned in Section 4.4, we remove the dependency of the evaluation on the image selection (by using multiple splits) and on the category (by analyzing multiple classes). We average the results across 10 repetitions (ten splits of the images) for all classes, for a total of 30 evaluations. The difference in softmax regression between randomly initialized InceptionV3 and ResNet50 for all classes is notable. This can be explained by the softmax probabilities of InceptionV3 being close to the uniform value of 1/1000 for all 1000 classes, as opposed to the sparse, high probabilities of ResNet50. This difference in the probability distribution is due to the different pixel-value normalization used in the input pre-processing of the two networks. Preliminary analyses of the probability distributions further support these claims, yet this is out of the scope of this paper and will be analyzed in future work.
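The effect of the input range on the softmax output can be illustrated with a small numerical sketch. It rests on an assumption not verified here: that in a randomly initialized network, inputs with a larger numeric range lead to logits of larger magnitude. The two scaling factors (0.01 and 100) are arbitrary choices for the illustration and do not correspond to the actual pre-processing of either network.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
base_logits = rng.normal(size=1000)  # one logit per ImageNet class

# Small-magnitude logits: probabilities stay close to the uniform value 1/1000.
p_small = softmax(0.01 * base_logits)
# Large-magnitude logits: the probability mass concentrates on few classes.
p_large = softmax(100.0 * base_logits)

print(f"max probability, small logits: {p_small.max():.5f}")
print(f"max probability, large logits: {p_large.max():.5f}")
```

With small logits every class receives roughly the same probability, so the softmax output carries almost no variation to regress from, whereas large logits produce a sparse distribution.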

Appendix A.2. Detailed Determination Coefficients
In Table A1, we report the $R^2$ values obtained when evaluating the regression of the scale measure at different layers of InceptionV3 on images of the albatross class. The $R^2$ values were plotted in Figure 8a as $e^{R^2}/e$ due to their range.
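Assuming the plotted quantity is the exponential of the determination coefficient divided by e (our reading of the notation above), the following sketch shows why such a mapping is convenient for values that can be strongly negative. The function name `plot_transform` is hypothetical.

```python
import math

def plot_transform(r2):
    """Map a determination coefficient to exp(r2) / e, which lies in (0, 1] for r2 <= 1."""
    return math.exp(r2) / math.e

# A perfect fit maps to 1, a fit no better than the mean maps to 1/e,
# and arbitrarily negative coefficients are squeezed towards 0.
for r2 in (1.0, 0.0, -1.0, -50.0):
    print(r2, plot_transform(r2))
```

The mapping is monotonic, so the ordering of the layers is preserved while very negative coefficients no longer dominate the axis range.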

Visualization of Early Layer Features
As mentioned in the Discussion, early layers focus mostly on local pixel neighborhoods and do not extract sufficient information to regress the scale ratio (Figures 8a and 9). To support this claim, we use the Lucid toolbox (https://github.com/tensorflow/lucid, accessed on 2 April 2020) to visualize the internal features of InceptionV3 at different depths. As shown in Figure A3, early layers in InceptionV3 mostly focus on simple patterns and colors (see Figure A3a,b). Only at deeper layers is it possible to recognize object parts, as in Figure A3c, and entire dog faces, as in Figure A3d.