A Deep Learning Instance Segmentation Approach for Global Glomerulosclerosis Assessment in Donor Kidney Biopsies

Abstract: The histological assessment of glomeruli is fundamental for determining if a kidney is suitable for transplantation. The Karpinski score is essential to evaluate the need for a single or dual kidney transplant and includes the ratio between the number of sclerotic glomeruli and the overall number of glomeruli in a kidney section. The manual evaluation of kidney biopsies performed by pathologists is time-consuming and error-prone, so an automatic framework to delineate all the glomeruli present in a kidney section can be very useful. Our experiments have been conducted on a dataset provided by the Department of Emergency and Organ Transplantations (DETO) of Bari University Hospital. This dataset is composed of 26 kidney biopsies coming from 19 donors. The rise of Convolutional Neural Networks (CNNs) has led to a realm of methods which are widely applied in Medical Imaging. Deep learning techniques are also very promising for the segmentation of glomeruli, with a variety of existing approaches. Many methods focus only on semantic segmentation (which consists of classifying individual pixels) or ignore the problem of discriminating between non-sclerotic and sclerotic glomeruli, so these approaches are suboptimal or inadequate for transplantation assessment. In this work, we employed an end-to-end fully automatic approach based on Mask R-CNN for instance segmentation and classification of glomeruli. We also compared the results obtained with a baseline based on Faster R-CNN, which only allows detection at the bounding box level. With respect to the existing literature, we improved the Mask R-CNN approach in sliding window contexts by employing a variant of the Non-Maximum Suppression (NMS) algorithm, which we called Non-Maximum-Area Suppression (NMAS). The obtained results are very promising, leading to improvements over existing literature.
The baseline Faster R-CNN-based approach obtained an F-Measure of 0.904 and 0.667 for non-sclerotic and sclerotic glomeruli, respectively. The Mask R-CNN approach shows a significant improvement over the baseline, obtaining an F-Measure of 0.925 and 0.777 for non-sclerotic and sclerotic glomeruli, respectively. The proposed method is very promising for the instance segmentation and classification of glomeruli, and enables a robust evaluation of global glomerulosclerosis. We also compared the Karpinski score obtained with our algorithm to that obtained with pathologists’ annotations to show the soundness of the proposed workflow from a clinical point of view.


Introduction
To evaluate whether a kidney is eligible for transplantation, a key step is the histological assessment of renal biopsies by expert pathologists. The determination, by a pathologist, of the number of globally sclerosed glomeruli with respect to the total number of glomeruli is a fundamental criterion for accepting or discarding donor kidneys. Considering the shortage of organs suitable for transplantation, an automatic system for the rapid and effective evaluation of global glomerulosclerosis would be very important, because it would help retain the largest possible number of eligible kidneys. In this paper, we propose a Computer-Aided Diagnosis (CAD) system whose purpose is to support expert pathologists in the glomerular detection and classification task, allowing them to easily obtain global glomerulosclerosis information. Automated systems have proved useful in a variety of medical applications, including biometrical analysis for personal identification [1], cancer system biology [2], blood parameters evaluation [3], breast cancer classification [4], diagnosis of neurological disorders [5], analysis of nasal cytology [6], segmentation and investigation of the conjunctiva [7,8], and prediction of the on-target cleavage efficiency from sgRNA sequences [9]. The rise of Convolutional Neural Networks (CNNs) opened many opportunities for Computer Vision tasks such as object detection, semantic segmentation, and instance segmentation. This has led to extensive development of deep learning methods and techniques for these tasks, which cannot be detailed exhaustively here. A comprehensive review of object detection and instance segmentation approaches can be found in [10], whereas semantic segmentation is reviewed in [11]. In the realm of Digital Pathology, several recent studies have employed CNNs for glomerulus identification in renal biopsies [12][13][14][15][16][17][18][19][20][21][22][23].
Glomerulus detection has been approached as object detection task (e.g., [13]) or as semantic segmentation task (e.g., [17,22]). In this paper, we treat it like an instance segmentation task (e.g., [23]). CNN and medical imaging techniques have proven to be useful for evaluation of eligibility of donor kidneys [14,15,17,22,23].
A fundamental quantitative measure for assessing the eligibility for transplantation of kidneys from expanded criteria donors (ECD) is the Karpinski score [24]. Glomerular, tubular, interstitial, and vascular compartments are evaluated from a histological point of view. Then, each of these compartments is assigned a score in the range 0 to 3, where 0 corresponds to normal histology and 3 to the worst degree of, respectively, global glomerulosclerosis, tubular atrophy, interstitial fibrosis, and arterial and arteriolar narrowing [24,25]. The identification of all non-sclerotic and sclerotic glomeruli in the kidney biopsy is the preliminary task required to define a score for global glomerulosclerosis. Non-sclerotic glomeruli tend to have an elliptic shape. They are characterized by the Bowman's capsule and by the capillary tuft with the mesangium. The latter is located inside the glomerulus, whereas the former is peripheral and contains the tuft. There is a space between these two elements, which is known as Bowman's space. The capillary tuft features nuclei of cells (blue points), capillary lumens (white areas), and the mesangial matrix (regions with similar tonality and different levels of saturation), so it resembles a pomegranate. A (globally) sclerotic glomerulus is characterized by capillary lumens which are obliterated by an increase in extracellular matrix, and by collagenous material which completely fills the Bowman's space. Examples of non-sclerotic and sclerotic glomeruli are depicted in Figure 1. In this paper, we propose a deep learning framework, based on Mask R-CNN [26], for glomerular detection and classification with an end-to-end instance segmentation approach. Semantic segmentation networks can guarantee very high pixel-level results, but they may perform worse in the object detection task when compared to specialized architectures [15].
The key points of the proposed method are:
• the possibility to train an end-to-end instance segmentation neural network by exploiting Mask R-CNN, which strongly reduces the need for post-processing operations and allows all the required features to be learned in a unified process;
• the use of a variant of the standard Non-Maximum Suppression (NMS) algorithm, which we called Non-Maximum-Area Suppression (NMAS), that led to improved performance in our sliding window approach. Note that NMAS, like NMS, is a general-purpose algorithm and can also be useful for other detection tasks;
• superior performance compared to other alternatives proposed in the literature, without computational drawbacks. Alternatives include object detection approaches, such as Faster R-CNN (adopted in [13]), which is used herein as baseline, and semantic segmentation approaches (adopted in [15,17]).

Dataset
All the experiments conducted in this paper exploited a dataset provided by the Department of Emergency and Organ Transplantations (DETO) of Bari University Hospital. This dataset is composed of 26 kidney biopsies coming from 19 donors. Kidney donors sections contain 2344 non-sclerotic glomeruli and 428 sclerotic glomeruli [15]. The dataset has been split into a train-validation set composed of 19 biopsies and a test set composed of 7 biopsies. The train-validation set has been exploited for model fitting and hyperparameters tuning, whereas the final estimation of the results has been computed on the test set. The whole train-validation set contains 1852 non-sclerotic glomeruli and 341 sclerotic glomeruli; the test set contains 492 non-sclerotic glomeruli and 87 sclerotic glomeruli.

Object Detection with Deep Learning
Deep learning refers to the adoption of architectural processing models, composed by different layers, at the purpose of learning structured representation of the input data. The role of deep learning has been pivotal in different sectors, including visual object recognition and object detection [27]. Starting from the breakthrough obtained by AlexNet [28], CNNs have become widely used for almost every kind of computer vision problem. In this work, we will focus on CNNs for object detection, a problem which consists of finding the bounding boxes for all the objects of interest present in the image, and for instance segmentation, in which it is also required to delineate precise masks for the objects.
Among the CNN-based methods for object detection, the Region-Based Convolutional Neural Networks (R-CNN) family of models deserves particular mention. The original R-CNN was proposed in 2014 by R. Girshick et al. from UC Berkeley [29]. The method combines region proposals with CNNs for the purpose of performing object detection. The first part of the R-CNN algorithm generates category-agnostic region proposals that may contain objects. Then, those regions are fed to a CNN which extracts a feature vector for each region. Finally, the feature vector is given as input to a set of class-specific linear support vector machines (SVMs).
In 2015, R. Girshick improved the R-CNN method, creating a new object detection network named Fast R-CNN [30]. In Fast R-CNN, the whole input image is fed to the CNN to generate convolutional feature maps. Then, region proposals are discovered from the convolutional feature maps, and are warped into squares. An RoI-pooling layer (RoI stands for Region of Interest) is adopted to reshape the proposals to a fixed size, so that they can be forwarded to fully connected layers. The inclusion of region proposals based on selective search causes performance issues in Fast R-CNN.
This concern was solved in 2016 with a further evolution of the R-CNN architecture, Faster R-CNN, proposed by S. Ren, K. He, R. Girshick, and J. Sun [31]. The Microsoft Research team discovered that the feature maps computed in the first part of Fast R-CNN can be used to generate region proposals, instead of slower and non-learnable algorithms such as selective search. The major evolution in Faster R-CNN is the introduction of a Region Proposal Network (RPN) after the feature map extraction of Fast R-CNN. The RPN exploits a novel concept, namely anchor boxes, in contrast to previous architectures which adopted pyramids of images or pyramids of filters. In order to generate anchor boxes, it is possible to employ a small network whose input is an n × n spatial window of the feature map; the resulting anchor boxes are a collection of rectangular bounding box proposals, with the related scores. The scale and aspect ratio of anchor boxes are parameters that can be chosen by the architecture designer. In order to identify objects at different resolutions, it is necessary to use anchor boxes with different shapes.
A further improvement from the R-CNN family of detectors is Mask R-CNN, developed by a team of Facebook AI Research (FAIR) in 2017 [26]. Mask R-CNN allows to solve instance segmentation tasks, whereas Faster R-CNN and previous approaches were only able to perform object detection. The overall Mask R-CNN architecture is composed by two parts: the backbone architecture, which performs feature extraction, and the head architecture, which performs classification, bounding box regression and mask prediction.

Object Detection Definitions and Metrics
Reference metrics used for evaluating object detection models are based on object detection challenges as PASCAL VOC (http://host.robots.ox.ac.uk/pascal/VOC/), Google Open Images (https://opensource.google/projects/open-images-dataset), and COCO (https://cocodataset.org/). In general, the performance metrics used in these challenges offer a global level evaluation, estimating the performances of the model in the whole dataset. The adoption of global metrics makes benchmarking much simpler, but it does not provide insights on how and why the mistakes have been made.
In order to define object detection metrics, we first have to outline what we mean by a detection. For this purpose, we introduce Intersection over Union (IoU) and Intersection over Minimum (IoM). Given two bounding boxes A and B, we can define the IoU as the ratio between the area of their intersection and the area of their union:

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$

In (1), | · | denotes the set cardinality operator. IoU values lie in the range [0, 1], where 1 indicates a perfect match. We say that a predicted object matches a ground truth object when the IoU between them is above a certain threshold (a common choice for the threshold is 0.5). Another concept related to IoU is IoM, which can be quite useful for defining detections in post-processing algorithms. The IoM between two bounding boxes A and B is the ratio between the area of their intersection and the minimum of their areas:

$$\mathrm{IoM}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \tag{2}$$

IoM values lie in the range [0, 1], where 1 indicates that the smaller box is entirely contained in the larger one. Note that in these definitions we referred to bounding boxes, but IoU and IoM can be calculated between any finite sample sets.

Widespread evaluation metrics are Average Precision (AP), which can essentially be defined as the area under the precision-recall curve, and mean Average Precision (mAP), that is, AP averaged over all classes. A naive implementation of AP is described by the following equation:

$$\mathrm{AP} = \int_0^1 p(r)\, dr \tag{3}$$

However, we have to note that AP is usually calculated (e.g., in PASCAL VOC) by adopting the average interpolated precision value of the positive examples [32]. We can make the dependence of precision and recall on the confidence c explicit using the notation p = P(c) and r = R(c). Recall R(c) is the fraction of objects detected with confidence of at least c. Precision P(c) is the fraction of detections that are correct:

$$P(c) = \frac{R(c) \cdot N_j}{R(c) \cdot N_j + F(c)} \tag{4}$$

In (4), $N_j$ is the number of objects in class j and F(c) is the number of incorrect detections with confidence of at least c.
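The two overlap measures defined above can be illustrated with a short Python sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` tuples; this convention and the function names are illustrative choices, not taken from the implementation used in this work.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two boxes (0 if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a, b):
    """Intersection over Union: intersection divided by union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def iom(a, b):
    """Intersection over Minimum: intersection divided by the smaller area."""
    inter = intersection_area(a, b)
    smallest = min(box_area(a), box_area(b))
    return inter / smallest if smallest > 0 else 0.0


# A small box fully nested inside a larger one (hypothetical coordinates).
big = (0, 0, 100, 100)
small = (25, 25, 75, 75)
print(iou(big, small), iom(big, small))  # 0.25 1.0
```

In the nested example the IoU is only 0.25, so an IoU threshold of 0.5 would not flag the two boxes as overlapping, while the IoM is 1.0. This is the situation in which IoM is the more informative measure.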
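Mean Average Precision (mAP) for K classes can be calculated as reported in (5):

$$\mathrm{mAP} = \frac{1}{K} \sum_{j=1}^{K} \mathrm{AP}_j \tag{5}$$

where $\mathrm{AP}_j$ is the Average Precision for class j.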
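As a rough illustration of how these quantities relate, the sketch below computes precision and recall at each confidence level, a non-interpolated AP as the area under the resulting precision-recall points, and mAP as the class average. It assumes that each detection has already been marked correct or incorrect (for instance by IoU matching against the ground truth); the function names and the toy data are hypothetical, and the interpolated AP used by PASCAL VOC would give different values.

```python
def precision_recall_points(detections, num_objects):
    """detections: list of (confidence, is_correct); num_objects: N_j.

    Returns one (recall, precision) pair per detection, scanning the
    detections in order of decreasing confidence.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    points = []
    true_pos = 0
    false_pos = 0
    for _confidence, is_correct in ranked:
        if is_correct:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / num_objects                # R(c)
        precision = true_pos / (true_pos + false_pos)  # P(c)
        points.append((recall, precision))
    return points


def average_precision(detections, num_objects):
    """Non-interpolated AP: sum of precision times the recall increment."""
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(detections, num_objects):
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: arithmetic mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: three detections for a class with two ground truth objects.
toy_detections = [(0.9, True), (0.8, False), (0.7, True)]
toy_ap = average_precision(toy_detections, num_objects=2)
print(round(toy_ap, 4))  # 0.8333
```

Note that `true_pos` equals R(c)·N_j and `false_pos` equals F(c), so the precision computed in the loop is the same quantity as P(c) in the text.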

Non-Maximum Suppression
The NMS algorithm is a fundamental post-processing step for object detection when it is required to remove overlapped bounding boxes for avoiding duplicate detections. Object detection and instance segmentation architectures from the R-CNN family discussed before adopt NMS to reduce the number of proposals, since many of them are overlapped. NMS is reported in Algorithm 1. Different improvements of the NMS algorithm have been proposed, as Soft-NMS by N. Bodla et al. [33]. In NMS, we pick the detection box B with the maximum score, and then we suppress all other detection boxes that overlap more than a predefined threshold. We continue with this procedure in a recursive way until all boxes have been processed. The NMS algorithm is designed so that objects lying within the predefined overlap threshold lead to misses. Soft-NMS attempts to solve this problem by decaying the detection scores of all other objects as a continuous function of their overlap with B. Therefore, no object is discarded in this procedure. Algorithm 1: Non-Maximum Suppression (NMS) [33].
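The greedy procedure described above can be sketched in Python as follows. This is a minimal illustration written from the textual description, not a reproduction of the Algorithm 1 listing; the box format `(x1, y1, x2, y2)`, the function names and the example data are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: returns the indices of the boxes that are kept."""
    # Candidate indices, highest confidence first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # box with the maximum remaining score
        keep.append(best)
        # Suppress every remaining box that overlaps it too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate detections and one separate detection (hypothetical).
example_boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
example_scores = [0.9, 0.8, 0.7]
print(nms(example_boxes, example_scores, iou_threshold=0.3))  # [0, 2]
```

The loop is written iteratively, but it is equivalent to the recursive formulation: each pass selects the best remaining box and discards its overlapping neighbours.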
NMS can also be used when applying object detectors in a sliding window fashion, to remove duplicate detections at the boundaries of adjacent windows. However, both NMS and Soft-NMS suffer from the problem of not considering the area of the detected objects. This means that, if there are two detected bounding boxes for an object, one inside the other, the algorithm can choose the smaller box even if its confidence score is only very slightly higher. In this paper, we define an algorithm, similar to NMS, but better suited to handling overlapped bounding boxes in sliding window approaches. We called it NMAS, since it is a modification of NMS which also considers the area of the bounding box and not only its confidence $s_j$. We introduced a new parameter $f_j = w_j h_j s_j^2$, which incorporates both the area of the bounding box ($w_j h_j$) and the square of the confidence ($s_j^2$). Since $s_j$ falls in the range from 0 to 1, we used the square of the confidence to penalize lower values. NMAS is reported in Algorithm 2. Another improvement of NMAS is the usage of IoM together with IoU to detect overlapping boxes. IoM makes it easy to recognize bounding boxes mainly contained in other ones, a common case in overlapping sliding window approaches.

Algorithm 2: Non-Maximum-Area Suppression (NMAS).
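One possible reading of the NMAS idea is sketched below in Python. It is an illustration only, built from the textual description: boxes are ranked by the area-weighted score w·h·s² instead of the confidence alone, and a box is suppressed when either its IoU or its IoM with a kept box is too high. The greedy structure, the way the two overlap tests are combined, the threshold values and all names are assumptions, not the Algorithm 2 listing.

```python
def _area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def _intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a, b):
    inter = _intersection(a, b)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def iom(a, b):
    inter = _intersection(a, b)
    smallest = min(_area(a), _area(b))
    return inter / smallest if smallest > 0 else 0.0


def nmas(boxes, scores, iou_threshold=0.3, iom_threshold=0.5):
    """Area-aware suppression sketch; returns indices of kept boxes.

    Thresholds are placeholder values chosen for illustration.
    """
    # Ranking criterion: area times squared confidence (w * h * s^2).
    f = [_area(b) * s ** 2 for b, s in zip(boxes, scores)]
    order = sorted(range(len(boxes)), key=lambda i: f[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Assumed rule: drop a box if IoU or IoM with the kept box is high.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold
                 and iom(boxes[best], boxes[i]) < iom_threshold]
    return keep


# A large box, a nested smaller box with slightly higher confidence, and
# an unrelated box (all hypothetical values).
demo_boxes = [(0, 0, 100, 100), (20, 20, 70, 70), (300, 300, 400, 400)]
demo_scores = [0.90, 0.95, 0.80]
print(nmas(demo_boxes, demo_scores))  # [0, 2]
```

In this toy case the nested box has the higher confidence, so confidence-ranked NMS would pick it first. Ranking by w·h·s² selects the larger box, and the IoM test then removes the nested one even though the IoU between the two is only 0.25.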
A high-level overview of the proposed CAD system is depicted in Figure 2. The pathologists can visualize and annotate whole slide images (WSIs) using Aperio ImageScope. An XML interface has been implemented for both the MATLAB and Python environments. This makes it possible to create the training set and also to make the network predictions available to the clinicians, with a very smooth integration. To accomplish the task of calculating the Karpinski histological score, we have to make a careful choice for the architecture of the network. In this work, we compare an object detection framework with an instance segmentation one. For a semantic segmentation approach, see our previous work [15]. All the models have been trained and validated on the same machine used in [15]. We used a dual boot system; the MATLAB implementation has been tested on Windows, whereas Ubuntu has been used for the Python implementation.

Faster R-CNN
The implemented baseline is based on Faster R-CNN, with the workflow depicted in Figure 3. Starting from a WSI, we segmented its sections using Section Extractor [15]; then we obtained kidney sections undersampled by a factor of 4. These undersampled biopsy sections are divided into patches of size 500 × 500, with a stride of 250 × 250. The stride has been chosen to guarantee an overlap of 250 × 250, so that there is at least one patch in which each glomerulus is fully contained. Since the dimensions of glomeruli in images at full resolution (20×) are smaller than 800 × 800, at undersampled resolution (5×) they are smaller than 200 × 200, thus the claimed condition is easily satisfied. In this way we did not discard any glomerulus from the training data. Note that in the inference phase we can apply this procedure again, reducing the chance of missing glomeruli. Dividing the original image into patches poses the problem of how partially contained glomeruli should be considered in the training patch (compare Figures 4 and 5). To solve this issue, a hyperparameter has been introduced, the tolerance, indicating the maximum allowed percentage of glomerulus size that can lie outside the patch for that glomerulus to be considered a positive example for training.

Due to the small dataset sample size, composed of 26 WSIs which contain 101 sections, we exploited oversampling as a data rebalancing methodology. In particular, for each training patch that contains at least one sclerotic glomerulus (the underrepresented class), we performed data augmentation by rotating this patch by 90°, 180° and 270°. In this way, we roughly quadrupled the number of sclerotic glomeruli (note that the number of non-sclerotic glomeruli is also increased by this operation).
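The tiling scheme and the containment argument can be checked with a small Python sketch. It generates patch origins for a window of 500 and a stride of 250 and verifies that an object of side 200 is fully contained in at least one window at every position. The handling of the image border (shifting the last window so that it ends at the border) is an assumption made for this illustration, as are the function names.

```python
def window_origins(length, window=500, stride=250):
    """Start offsets of the windows along one axis, covering [0, length)."""
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] + window < length:
        # Assumed border policy: add a last window flush with the border.
        origins.append(length - window)
    return origins


def patch_grid(width, height, window=500, stride=250):
    """Top-left corners (x, y) of all patches for an image of given size."""
    return [(x, y)
            for y in window_origins(height, window, stride)
            for x in window_origins(width, window, stride)]


def fully_contained(start, size, origins, window=500):
    """True if the interval [start, start + size) fits in some window."""
    return any(o <= start and start + size <= o + window for o in origins)


# A 2500 x 2500 section yields a 9 x 9 grid of overlapping patches.
print(len(patch_grid(2500, 2500)))  # 81

# Along a 1200-pixel axis, an object of side 200 fits in some window
# wherever it is placed.
axis_origins = window_origins(1200)
print(all(fully_contained(p, 200, axis_origins) for p in range(0, 1001)))
```

With a 250-pixel overlap, any object whose side does not exceed 250 pixels falls entirely inside at least one window, which is why glomeruli smaller than 200 × 200 at 5× resolution are never lost to patch boundaries.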
Since our model has been trained on small patches (with size of 500 × 500), it is not advisable to directly adopt it for performing inference on images of full sections (up to 2500 × 2500). Moreover, some sections can be too large to fit in memory. The proposed solution is straightforward: we also divided into patches the images used for inference. Again, we used patches of 500 × 500 with a stride of 250 × 250, thus reducing the probability of missing a glomerulus (since, as stated before, each glomerulus is fully contained in at least one patch). The use of overlapping windows for patches posed the problem of overlapped detections in the full image (when we projected patch-level detections onto the original image), as can be seen in Figure 6. For suppressing duplicated bounding boxes, we used two iterations of NMS (Algorithm 1): standard NMS and NMS with matches computed on IoM instead of IoU. We exploited the MATLAB selectStrongestBboxMulticlass function (https://www.mathworks.com/help/vision/ref/selectstrongestbboxmulticlass.html). The result of applying NMS with the threshold for IoU set to 0.3 to the bounding boxes in Figure 6 is depicted in Figure 7. Since in some cases there are small bounding boxes mainly contained inside larger ones, we also performed NMS on IoM with the threshold set to 0.5 (i.e., we performed NMS on all the boxes that overlap with IoM greater than or equal to 0.5). In the case of Figure 7, this step did not result in further suppression. Further details about the hyperparameter configuration of the Faster R-CNN approach can be found in Appendix A.1.

Mask R-CNN
The general schema that we used for the instance segmentation approach is depicted in Figure 8. In the training phase we sampled from each section (obtained using the Sections Extractor algorithm already employed in [15]) random patches of 1024 × 1024 pixels, then we performed random data augmentations on-the-fly, so that the network processes different data for each epoch. In the inferencing phase we used larger windows, since the memory requirements are less restrictive. We selected patches with size 1536 × 1536, with an overlap between adjacent patches of 250 × 250, for the same reason we explained in Faster R-CNN based detector. We performed zero padding for the missing information. When we project back the patch-level detections to WSI-level detections, we perform NMAS described in Algorithm 2, which results in an improvement over NMS. Examples of patch-level and WSI-level detections can be seen in Figures 9 and 10, respectively.  Note that, compared with Faster R-CNN, we have also a mask besides the bounding box, since Mask R-CNN purpose is to solve instance segmentation task and not only object detection task. Using NMAS proved to be very useful in sliding window approaches. An example is depicted in Figure 11. We can see that using simple NMS, the chosen bounding box in one case is not the most suitable, since it does not overlay the whole glomerulus. NMAS solves this problem by considering also the areas of involved bounding boxes and not only their confidence scores. We used ResNet-50 as backbone, since it allows quality feature extraction but is lighter than ResNet-101 [35]. In the training process, we used a pretrained model on the COCO dataset. In order to exploit in the best way the pretraining, we trained only the network heads for the first 20 epochs. Then, for the subsequent 40 epochs, we fine-tuned ResNet stage 4 and layers above. For the last 40 epochs, we trained all the layers of the network, and we lowered the learning rate to 0.0001. 
Further details about hyperparameters configuration of Mask R-CNN approach can be found in Appendix A.2.

Baseline: Faster R-CNN
With the Faster R-CNN-based approach, we get the results reported in Tables 1 and 2. The mAP for the Faster R-CNN approach is 0.803.

Mask R-CNN
The results obtained with the Mask R-CNN-based approach are reported in Tables 3 and 4. Using NMAS instead of NMS for suppressing overlapped bounding boxes leads to an improvement of mAP from 0.881 to 0.902, and of F-measure for non-sclerotic glomeruli from 0.917 to 0.925.

Karpinski Score Assessment
In order to assess the clinical validity of the obtained results, we compared the Karpinski score computed by our CNN with that of expert pathologists.
The comparison between the baseline Faster R-CNN and Mask R-CNN is shown in Table 5. Ratio refers to the number of sclerosed glomeruli divided by the overall number of glomeruli: $\mathrm{Ratio} = \frac{S}{S + NS}$, where S and NS denote the numbers of sclerotic and non-sclerotic glomeruli, respectively. The corresponding Karpinski score for the glomerular compartment is determined according to the following: 0, if there are no globally sclerosed glomeruli; 1, if there is <20% global glomerulosclerosis; 2, if there is 20-50% global glomerulosclerosis; 3, if there is >50% global glomerulosclerosis [24]. We note that the Faster R-CNN approach makes five errors in assessing the Karpinski score: four times it gives a score of 1 instead of a score of 2; once it gives a score of 2 instead of a score of 1. The Mask R-CNN approach makes only three errors in assessing the Karpinski score: once it gives a score of 0 instead of a score of 1; twice it gives a score of 1 instead of a score of 2.
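The mapping from glomeruli counts to the glomerular component of the Karpinski score can be written as a short Python function. This is a sketch of the thresholds stated above; the function name, the treatment of the exact 20% and 50% boundaries (both assigned to score 2) and the example counts are assumptions for illustration.

```python
def glomerular_karpinski_score(num_sclerotic, num_non_sclerotic):
    """Glomerular component (0 to 3) of the Karpinski score."""
    total = num_sclerotic + num_non_sclerotic
    if total == 0:
        raise ValueError("no glomeruli detected: the ratio is undefined")
    if num_sclerotic == 0:
        return 0                      # no globally sclerosed glomeruli
    ratio = num_sclerotic / total     # Ratio = S / (S + NS)
    if ratio < 0.20:
        return 1                      # < 20% global glomerulosclerosis
    if ratio <= 0.50:
        return 2                      # 20-50% global glomerulosclerosis
    return 3                          # > 50% global glomerulosclerosis


# Hypothetical counts (sclerotic, non-sclerotic) for four biopsies.
for s, ns in [(0, 30), (3, 27), (6, 24), (20, 10)]:
    print(s, ns, glomerular_karpinski_score(s, ns))
```

Because the score depends on a thresholded ratio, a small number of missed or misclassified sclerotic glomeruli can shift a biopsy close to 20% or 50% into the adjacent score class, which is consistent with the off-by-one errors reported for both detectors.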

Discussion
Recent studies have tried to accomplish glomerular detection in kidney biopsies using a wealth of techniques, most of which are based on deep learning. Nonetheless, many of these approaches did not consider the task of classifying between non-sclerotic and sclerotic glomeruli. A full comparison of our approach with recent research works on the task of glomerular detection is given in Table 6, extending the one proposed by Kawazoe et al. [13] to the glomerulosclerosis classification case when available. We note that our model performs well in the detection of non-sclerotic glomeruli, with very high recall and precision values, but the metrics for sclerotic glomeruli suffer from a higher number of false negatives. From the tests performed in this paper, it is possible to observe that glomerular detection and classification should be approached as an instance segmentation task. Even if object detection approaches can guarantee respectable results, they do not exploit the mask information in the dataset. Semantic segmentation approaches also obtain decent results, but they are slightly worse than instance segmentation ones. Indeed, training a CNN which classifies at pixel level is a less powerful method for a detection task. Difficulties that occur with semantic segmentation networks include the presence of noisy points in the output and the lack of distinction between touching objects. Semantic segmentation networks can mainly exploit texture information but are less capable of capturing concepts such as shape, and thus perform worse on detection tasks compared to specialized architectures. Nonetheless, in [17], Marsh et al. used a fully convolutional network (FCN), together with BLOB detection as post-processing of the semantic segmentation network output, to measure global glomerulosclerosis from kidney biopsies.
The proposed Mask R-CNN approach outperforms their FCN-based one, improving the F-score for healthy glomeruli from 0.848 to 0.925 and the F-score for sclerosed glomeruli from 0.649 to 0.777. An important reason for this better performance may lie in the choice of model, relying on an instance segmentation network instead of a semantic segmentation one. However, it has to be noted that Marsh et al. dealt with HE stained biopsies, whereas the dataset adopted for our experiments is made up of Periodic acid-Schiff (PAS) stained biopsies, which can be a better staining for glomerular recognition tasks, although it remains to be demonstrated that CNNs work consistently better on PAS compared to HE. It is also worth noting that in [17] the class imbalance ratio is lower than ours, being 3.44:1 compared to 5.48:1, thus allowing a smoother training process for the underrepresented class. Other works do not address the task of determining glomerulosclerosis, but focus only on glomerular detection. Though this is a simpler task, we also compare our work with them, considering healthy and sclerotic glomeruli as a single class. In [19,20] the authors used classical machine learning approaches, obtaining worse results than ours. In [21], Temerinac-Ott et al. compared a machine learning approach, based on Histogram of Oriented Gradients (HOG) [36] feature extraction and a support vector machine (SVM) classifier, with a deep learning one based on a CNN. Both obtained lower performance than our end-to-end instance segmentation framework. Gallego et al. exploited a CNN for classifying whether each patch is a glomerulus or not in a sliding window fashion [16]. Although this may look like a more naive approach compared to adopting a detector from the R-CNN family (which can also reduce the problem of redundant computation across neighboring patches), the results obtained in the paper are quite impressive, with a recall of 1. However, it is worth noting that Gallego et al. considered only glomeruli with an area of at least 200 × 200 pixels (>100 µm of diameter), whereas we consider glomeruli of all sizes in the metrics, and many of our false negatives are among the small glomeruli. Furthermore, we provide a precise mask for each glomerulus found, while Gallego et al. can only determine coarse masks composed of the union of the rectangular patches they considered. Kawazoe et al. used Faster R-CNN for the glomerular detection task, obtaining results comparable with the proposed Mask R-CNN approach, with an F-score of 0.925 (ours is 0.919) [13]. We believe that the availability of a larger training dataset (200 WSIs instead of 26) can explain why they obtain comparable (or even slightly better) results even with a less powerful model. As already noted, our Faster R-CNN model performs worse than our Mask R-CNN one.

Conclusions
In this paper, we developed a framework that could aid pathologists by automatically detecting and classifying non-sclerotic and sclerotic glomeruli in sections of kidney biopsies. The proposed approach relies on Mask R-CNN, which proved to be a very sensible choice for a glomerular detection and classification task, improving over the baseline Faster R-CNN method and our previous work based on semantic segmentation approaches [15]. The proposed method makes it possible to train an end-to-end instance segmentation neural network, therefore strongly reducing the need for post-processing operations and allowing all the required features to be learned in a unified process. An interesting novelty concerning post-processing is the development of the Non-Maximum-Area Suppression algorithm, which, with seemingly minor changes compared to the standard NMS algorithm, led to improved performance in our sliding window approaches. Note that NMAS, like NMS, is a general-purpose algorithm and can also be useful for other detection tasks. The best model we trained is based on Mask R-CNN, and exploits NMAS for projection onto full images. It outperforms related works in the field of the determination of global glomerulosclerosis, such as [15,17]. The methods we used for evaluating the validity of our detection models are more specific than the widespread global metrics (such as mAP) used in benchmark datasets such as PASCAL VOC or COCO. The analysis of object detection confusion matrices allows a better understanding of the model performance, providing insight into the model response for each problem class. At the moment, the proposed framework provides a reliable estimate of global glomerulosclerosis; the pathologists can benefit from the glomeruli annotations provided by our CAD through an XML interface with the commonly used Aperio ImageScope software, easing the burden of manual annotation. In the future, it could be extended to other kidney biopsy analysis tasks, making it possible to define the complete Karpinski histological score.

We performed data augmentation, exploiting the imgaug library (https://imgaug.readthedocs.io/en/latest/) [37], as reported in Table A4. In particular, of the augmentations listed there, we randomly performed none, one, or two augmentations.