Reproducibility and Feasibility of Classification and National Guidelines for Histological Diagnosis of Canine Mammary Gland Tumours: A Multi-Institutional Ring Study

Simple Summary

Tumours of the mammary gland are common in both humans and dogs. They are very heterogeneous, with numerous morphological variants and different biological behaviours. In the last few decades, several efforts have been made to classify these tumours histologically and to establish their level of malignancy using histological grading systems. However, the reproducibility and diagnostic agreement of such classification and grading have only rarely been assessed. In this study, we tested the variability in diagnoses made by 15 pathologists using the same classification and grading system. Prior to the study, the pathologists agreed on guidelines regarding how to apply these systems. Pathologists worked blindly on 36 digital histological slides of canine mammary tumours. Agreement was statistically analysed using Cohen’s kappa coefficient, which indicates perfect agreement when equal to 1. The overall agreement in the identification of hyperplastic-dysplastic/benign/malignant lesions was substantial (kappa 0.76), while outcomes on morphological classification had only a moderate agreement (k = 0.54). Tumour grade assigned by pathologists was the least concordant and kappa could not be calculated. Although promising, the results underline that each diagnostic/grading system should be assessed and optimized for standardization and high diagnostic agreement.

Abstract

Histological diagnosis of Canine Mammary Tumours (CMTs) provides the basis for proper treatment and follow-up. Currently, its accuracy is poorly understood, and variable interpretation of histological criteria leads to a lack of standardisation and makes it impossible to compare studies. This study aimed to quantify the reproducibility of histological diagnosis and grading in CMTs.
A blinded ring test on 36 CMTs was performed by 15 veterinary pathologists with different levels of education, after discussing critical points of the Davis-Thompson Foundation Classification and agreeing on consensus guidelines. Kappa statistics were used to compare the interobserver variability. The overall concordance rate of the diagnostic interpretations of the working panel (WP) on the identification of hyperplasia-dysplasia/benign/malignant lesions showed a substantial agreement (average k ranging from 0.66 to 0.82, with a k-combined of 0.76). In contrast, outcomes on the ICD-O-3.2 morphological code/diagnosis of histotype had only a moderate agreement (average k ranging from 0.44 to 0.64, with a k-combined of 0.54). The results demonstrated that standardised classification and consensus guidelines can produce moderate to substantial agreement; however, further efforts are needed to increase this agreement in distinguishing benign versus malignant lesions and in histological grading.


Introduction
The histopathological diagnosis and grade of Canine Mammary Tumours (CMTs) are considered the gold standard for patient management and research outcomes [1][2][3]. In this regard, misdiagnosis and/or interinstitutional diagnostic variability between pathologists can seriously affect the interpretation of clinical data that use the histological output as the reference standard. In particular, the evaluation of therapeutic protocols as well as the interpretation of predictive/prognostic molecular markers can be adversely affected [4]. Histopathological diagnostic criteria for CMTs have been established over the years through updated, internationally recognized classifications from the World Health Organization (WHO) [5,6] and from the Davis-Thompson DVM Foundation (DTF) [7]. However, it is unclear whether these classification systems are uniformly applied and which challenges pathologists encounter in agreeing on the diagnosis and grade of CMTs. Two studies on CMT diagnostic agreement have already highlighted some diagnostic disagreement [8,9]. In addition, significant problems can be related to the application of different classification and grading systems (human versus veterinary; national versus international) and, even when applying common systems, to the subjective interpretation of histological criteria or to misleading concepts and terms used to classify and grade CMTs.
Many factors, therefore, could promote interobserver variability (IOV) and decrease the diagnostic reproducibility among pathologists, resulting in classification and/or grading errors that lead to failure to predict tumour behaviour. Standardisation of diagnosis is also a prerequisite to allow a comparison between research studies worldwide. Efforts for standardisation are common both in human and veterinary medicine and assessment of agreement has been performed on some tumoral and non-tumoral diseases, as nonexhaustively summarised in Table 1 and references therein. Studies evaluating the agreement based on common classification/grading systems or histological criteria underlined that the concordance, despite never being excellent/perfect, is higher when the applied system/criteria are shared and discussed in consensus meetings, thus, establishing the sources of IOV [10][11][12][13][14].
To address the issue of IOV for the diagnosis of CMTs, as part of a national initiative of the Italian Association of Veterinary Pathologists (AIPVET), a group of 15 Italian veterinary pathologists developed national guidelines for CMT assessment using, as a starting point, the last DTF classification of CMTs [7].
Therefore, the aim of this study was to assess IOV in the classification and grading of CMTs when applying the same system and guidelines. The effect of this standardisation on diagnostic concordance and agreement rates was evaluated and critical aspects were pointed out and discussed.

Table 1. Scientific studies dealing with inter- and intraobserver variability in pathological diagnosis (references are listed in alphabetical order by topic, separately for humans and animals).

Materials and Methods
Fifteen veterinary pathologists (see authors list) from academic Schools of Veterinary Medicine, from Veterinary State laboratories (Experimental Zooprophylactic Institutes), and from private veterinary diagnostic laboratories constituted a working group (Working panel, WP) to discuss critical aspects of the recently published DTF classification of CMTs. Two (RR and VZ) of the fifteen pathologists were among the authors of the DTF classification. The WP produced national guidelines [63] and established consensus criteria for histological diagnosis and grading of malignancies for the entities of the DTF classification. For this purpose, and due to the COVID-19 pandemic, twenty telematic meetings lasting an average of 90 min were attended by all members to address common challenges and misconceptions in the histopathological assessment of CMTs. Moreover, unified national guidelines were created to reduce the discrepancies between pathologists operating in different institutions and private diagnostic laboratories. The WP identified the principal causes of diagnostic disagreement by bringing their own direct and indirect (e.g., seminars held and discussions with other colleagues) experiences into the discussion.
More specifically, the consensus regarded the following critical aspects: a. histological subtypes; b. grading; c. criteria for malignancy; d. approach for lymph node metastases and micrometastases; e. pathological prognostic factors; f. markers for phenotype and prognosis; g. content of the histopathological report; and h. application and revision of ICD-O-3.2 codes [64]. ICD-O codes have been used for more than 35 years, principally in human tumour or cancer registries, for coding the site (topography) and the histology (morphology) of the neoplasm, thereby helping standardization. For the purpose of this study, only aspects a to c and h will be presented.
A consensus on the aforementioned critical aspects was reached and applied during the ring study. The ring study was performed on selected histological samples to evaluate the effect of the classification and the national guidelines on the reproducibility of morphological diagnosis. One experienced pathologist (VZ, ECVP diplomate), internationally recognized for research and continuing veterinary education on mammary gland pathology, selected 34 slides best representing the hyperplastic/dysplastic/neoplastic lesions of the canine mammary gland described in the DTF classification (Table 2). Slides were chosen from the available archive of a university diagnostic veterinary pathology service (BCA Dept., University of Padua, Italy). As per Directive 2010/63/EU of the European Parliament and of the Council of 22 September 2010, regarding the protection of animals used for scientific purposes, the Italian legislature (D. Lgs. n. 26/2014) does not require approval from ethical committees for the use of stored samples in retrospective studies. Additionally, submitting veterinarians sign an informed consent for privacy and to allow the use of protected data regarding samples in research studies. Two more slides were provided by a second participant (RR, ECVP diplomate), also with broad experience in mammary gland pathology. The 36 slides, with minimal repetition of the same histological diagnosis, were progressively numbered (from 1 to 36), digitally scanned (D Sight Menarini), and distributed to the WP for digital examination. Participants were provided with the same single hematoxylin-and-eosin-stained slide per case. WP diagnoses were anonymous and blinded to previous and to each other's interpretations, and the diagnosis was recorded through multiple pre-filled answer choices to minimize errors (i.e., one drop-down menu for H/B/M and one drop-down menu for the three possible features associated with the scoring for each criterion of the grading).
WP participants were asked to interpret the cases following the DTF classification and the newly established national guidelines. They had to classify the lesion as hyperplasia-dysplasia (H) for non-neoplastic lesions, or as benign (B) or malignant (M) for tumours, and to identify the specific histological diagnosis, also including the corresponding ICD-O-3.2 [64] code as reported in Table 2. Regarding ICD-O codes, the WP analysed the codes available at the time of the study, taking as a reference the International Classification of Diseases for Oncology, ICD-O-3.2 [64]. The goal of this action was to update and standardise the cancer codes for the current veterinary cancer registries active in Italy. WP members were also asked to report the histological CMT grading features (i.e., mitotic count, percentage of tubule formation, and degree of pleomorphism) according to Peña and co-authors [2] to calculate the grade. Participants were given a time frame of 4 weeks to complete the evaluations. No clinical details or immunohistochemistry (IHC) results were provided.

Statistical Analysis
Statistical analysis was carried out by calculating Cohen's kappa (k) [65][66][67]. The k evaluates the agreement between panellists, taking into account agreements due solely to chance. The lacking gold standard was replaced by the mode of the results given by panellists for each sample (majority opinion, GM) [68]. The k is scaled to be 0 when the amount of agreement is what would be expected by chance and 1 when there is perfect agreement between the observers. Kappa values between 0.21 and 0.40 were considered to represent fair agreement; 0.41-0.60 indicated moderate agreement; 0.61-0.80 substantial agreement; and 0.81-1.00 excellent agreement [67]. The k was calculated for each panellist versus all (k_ava) and for each panellist versus the GM (k_vGM). The performance of each single panellist was obtained by calculating the mean of the k_ava and of the k_vGM. To synthesise the overall results, a k-combined was calculated for each statistic separately, according to Fleiss and co-authors, indicating the mean of all the k_ava means and k_vGM means, respectively [68]. Statistics were computed for the following parameters: (1) samples identified as H/B/M and (2) specific histological diagnosis, reported as ICD-O-3.2 code (Table 2).
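The statistics described above can be illustrated with a minimal sketch. It assumes that k_ava is the mean of the pairwise Cohen's kappa values of one panellist against each of the others, that k_vGM is the kappa of one panellist against the per-sample mode, and that samples without a unique mode are left out of k_vGM; the text does not specify these details, so they are assumptions of this sketch. The function names and the toy ratings are hypothetical and are not study data.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same samples."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)  # chance agreement
    if p_exp == 1.0:            # both raters used one identical label throughout
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


def majority_opinion(ratings):
    """Per-sample mode across panellists (GM); None when the top labels tie."""
    gm = []
    for labels in zip(*ratings.values()):
        top = Counter(labels).most_common(2)
        tied = len(top) == 2 and top[0][1] == top[1][1]
        gm.append(None if tied else top[0][0])
    return gm


def k_ava(ratings):
    """Mean kappa of each panellist versus every other panellist."""
    pairwise = {frozenset(pair): cohen_kappa(ratings[pair[0]], ratings[pair[1]])
                for pair in combinations(ratings, 2)}
    return {p: mean(pairwise[frozenset((p, q))] for q in ratings if q != p)
            for p in ratings}


def k_vgm(ratings):
    """Kappa of each panellist versus the GM; samples without a GM are dropped
    (an assumption of this sketch, not something stated in the text)."""
    gm = majority_opinion(ratings)
    keep = [i for i, g in enumerate(gm) if g is not None]
    reference = [gm[i] for i in keep]
    return {p: cohen_kappa([labels[i] for i in keep], reference)
            for p, labels in ratings.items()}


def interpret(k):
    """Agreement bands quoted in the text; values below 0.21 are not labelled there."""
    for lower, label in ((0.81, "excellent"), (0.61, "substantial"),
                         (0.41, "moderate"), (0.21, "fair")):
        if k >= lower:
            return label
    return "below fair"


# Hypothetical toy ratings (three panellists, six samples), not study data.
toy = {
    "P1": ["H", "B", "M", "M", "B", "M"],
    "P2": ["H", "B", "M", "M", "M", "M"],
    "P3": ["H", "B", "M", "B", "B", "M"],
}
per_panellist = k_ava(toy)
k_combined = mean(per_panellist.values())   # mean of the per-panellist means
```

In this toy example P1 coincides with the majority opinion on every sample, so its k_vGM is 1 while its k_ava is lower, which mirrors the pattern of k_vGM exceeding k_ava that the study reports.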
To detect and comment on the differences among the specific histological diagnosis under study, the proportion of cases correctly identified (i.e., cases corresponding to GM) over the total of cases was calculated. This measure does not take into account the effect of chance and, for the purpose of this paper, was referred to as concordance.
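As a small illustration of this measure, the sketch below computes, for a single case, the majority opinion and the share of panellists whose diagnosis matches it. Reading concordance as a per-case proportion is one possible interpretation of the definition above, and the function name is hypothetical. The two example rows are the diagnoses recorded for cases 1 and 2 in Table 5.

```python
from collections import Counter


def concordance(diagnoses):
    """Return (GM, proportion of panellists matching the GM) for one case.

    When the two most frequent diagnoses tie, no GM exists and (None, None)
    is returned.
    """
    ranked = Counter(diagnoses).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, None
    gm, votes = ranked[0]
    return gm, votes / len(diagnoses)


# Diagnoses of panellists P01..P15 for cases 1 and 2, as listed in Table 5.
case_1 = ["IDPA", "IDPA", "DC", "DC", "DC", "IDPC", "DC", "IDPC",
          "IDPC", "DC", "IDPC", "IDPA", "IDPC", "DC", "IDPC"]
case_2 = ["IMPC", "IMPC", "STC", "IMPC", "IMPC", "IMPC", "IMPC", "IMPC",
          "IMPC", "IMPC", "IMPC", "STC", "IMPC", "IMPC", "IMPC"]

gm_1, share_1 = concordance(case_1)   # DC and IDPC tie, so there is no GM
gm_2, share_2 = concordance(case_2)   # 13 of 15 panellists chose IMPC
```

Unlike kappa, this proportion is not corrected for chance, so it is best read alongside the k values rather than in place of them.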
The panellists' experience was then evaluated by performing a hierarchical cluster analysis with the Ward's method. Cluster analysis was firstly applied to the self-reported variables denoting experience (years of experience, caseload per week, number of published papers), then a second analysis was performed on the classification of lesions as H/B/M. Ward's method of clustering joins the two groups that result in the minimum increase in the error sum of squares [69].
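The merging rule of Ward's method can be sketched in a few lines. This is a naive pure-Python illustration, not the software used in the study: at each step it merges the two clusters whose union gives the smallest increase in the error sum of squares, computed as n_a·n_b/(n_a+n_b) times the squared distance between the cluster centroids. The panellist labels and feature values (years of experience, weekly caseload, published papers) are hypothetical, and in practice the variables would usually be standardised first.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def ward_cost(a, b):
    """Increase in the error sum of squares if clusters a and b were merged."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def ward_clusters(data, n_clusters):
    """Agglomerate `data` (label -> feature vector) down to n_clusters groups."""
    clusters = [([vector], {label}) for label, vector in data.items()]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the cheapest merge.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]][0], clusters[ij[1]][0]))
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [labels for _, labels in clusters]


# Hypothetical panellists: [years of experience, cases per week, published papers].
panel = {
    "A": [2, 10, 1],
    "B": [3, 12, 2],
    "C": [20, 40, 30],
    "D": [22, 38, 35],
    "E": [4, 11, 0],
}
groups = ward_clusters(panel, n_clusters=2)
```

Stopping at two clusters corresponds to the first level of partition; continuing the same loop to larger numbers of clusters gives the lower levels.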

Guidelines and WP Composition
For the purpose of this study, the WP discussed some critical points and established and reported in the guidelines a consensus regarding the following aspects.
a. Histological subtypes: to precisely apply the histological diagnoses reported in the DTF classification, as proposed by the authors. For example, the term "carcinoma in situ" was not applied; atypical hyperplasia or atypical epitheliosis was used instead, depending on specific morphological aspects. As another example, it was agreed that, where more than one pattern was observed (e.g., tubular and solid), the tumour histotype was defined based on the prevalent morphological pattern.
b. Grading: to use the canine grading system proposed by Peña and colleagues [2], as summarised in Table 3. The histological grading was reported regardless of the presence or absence of vascular invasion.
c. Criteria for malignancy: to employ the following parameters as criteria for malignancy: (I) tumour architecture with reduced tubular organisation (with no objective measurement and no specific cut-off); (II) marked cellular and nuclear pleomorphism (with no objective measurement and no specific cut-off); and (III) high mitotic count. A cut-off of ≥6 mitoses per 2.37 mm² was proposed and applied exclusively when other criteria for malignancy were borderline/unclear. This was to allow for the possibility of a lesion with clear evidence of malignancy (e.g., anaplastic carcinoma) and a mitotic count below 6, or of a clearly benign lesion (e.g., ductal adenoma) with a number of mitoses higher than or equal to 6. The mitotic count was performed digitally by the WP following these criteria: a total area of observation of 2.37 mm² [70], taking into consideration that the digital fields used to obtain this total area had to be highly cellular and avoid cystic/necrotic areas. If the expected area (2.37 mm²) could not be obtained, the mitotic count was proportionally determined. The most mitotically active areas (usually at the periphery of the tumour) were chosen to start, moving to consecutive fields. After two fields with no mitoses, the third new field was chosen as the next new mitotically active field, to then proceed again consecutively, and so on until ten counted fields in total. In order to do so, each participant calculated the number of fields to be examined on their screen to cover the standardised 2.37 mm² area. This was done by dividing 2.37 mm² by the total area of a 40× image field, which was measured with a ruler tool on the screen [70].
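The arithmetic behind the mitotic count procedure can be sketched as follows. The screen field size used in the example (600 × 400 µm) is a hypothetical value standing in for whatever each participant measured with the ruler tool; rounding the number of fields up to a whole field and the linear rescaling of counts from a smaller area are assumptions of this sketch, since the text only states that the count was "proportionally determined".

```python
import math

TARGET_AREA_MM2 = 2.37      # standardised area of observation [70]
MITOTIC_CUTOFF = 6          # cut-off proposed by the WP for borderline lesions


def field_area_mm2(width_um, height_um):
    """Area of one on-screen 40x field, from ruler measurements in micrometres."""
    return (width_um / 1000.0) * (height_um / 1000.0)


def fields_needed(area_per_field_mm2):
    """Number of fields covering 2.37 mm2 (rounded up: an assumption of this sketch)."""
    return math.ceil(TARGET_AREA_MM2 / area_per_field_mm2)


def scaled_mitotic_count(mitoses, area_examined_mm2):
    """Proportionally rescale a count obtained on a smaller area to 2.37 mm2."""
    return mitoses * TARGET_AREA_MM2 / area_examined_mm2


def reaches_cutoff(mitoses_per_target_area):
    """True when the count is at or above the cut-off; per the guidelines this is
    only consulted when the other criteria for malignancy are borderline/unclear."""
    return mitoses_per_target_area >= MITOTIC_CUTOFF


# Hypothetical example: a 600 x 400 micrometre field on the participant's screen.
area = field_area_mm2(600, 400)
n_fields = fields_needed(area)

# Hypothetical small lesion: 4 mitoses counted over only 1.58 mm2 of suitable tissue.
rescaled = scaled_mitotic_count(4, 1.58)
```

Because the field area depends on the monitor and viewer settings, the number of fields differs between participants even though the target area is the same.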
Additional criteria for malignancy were (IV) presence of small areas of random necrosis (groups of neoplastic cells with karyolysis and karyorrhexis), keeping in mind that central wide necrosis can be present both in benign and malignant lesions; (V) peripheral infiltration, determined as an irregular contour of the tumour showing a desmoplastic reaction, often associated with a mixed inflammatory infiltrate; (VI) pluristratification of neoplastic cells with loss of polarity, atypia, and dysplasia; and (VII) lymphatic vessel invasion by neoplastic cells.

Table 3. Histological grading for canine mammary tumours [2]. Main categories used for grading are indicated in bold.

B. Nuclear pleomorphism (b) | Score
Uniform, regular, small nuclei with occasional small nucleoli | 1
Moderate degree of variation in nuclear size and shape, hyperchromatic nucleus, presence of nucleoli (some of which can be prominent) | 2
Marked variation in nuclear size, hyperchromatic nucleus, often with more than 1 prominent nucleoli | 3

WP participants' characteristics and relative data and experience are shown in Table 4.

Outcomes Expressed in Terms of Hyperplasia-Dysplasia/Benign/Malignant (H, B, M) Showed a Substantial Agreement
The results, in terms of hyperplasia-dysplasia (H), benign (B), or malignant (M), communicated by individual readers are reported in Figure 1. The highest discordance was seen for four cases (cases 7, 10, 26, and 29), with fewer than 10 panellists agreeing on a diagnosis and having a GM = B or no GM. Figure 2a shows that the agreement between the participating laboratories was not uniform: the participants, in fact, had an average k ranging from 0.66 to 0.82 (with the 95% CI limits varying between 0.43 and 0.98) and the k-combined was equal to 0.76 (0.74-0.79). The k_vGM, shown in Figure 2b, presents relatively better results than those relating to the k_ava: panellists had an average k_vGM ranging between 0.71 and 0.95 (with the 95% CI limits varying between 0.47 and 1.00). The k-combined for the panellists vs. GM was 0.86 (range of means 0.62-1.00).

Outcomes Expressed in Terms of ICD-O Morphological Code/Diagnosis Had a Moderate Agreement
The results expressed by the participants as morphological diagnosis together with the GM are reported in Table 5. As such, 14/15 participants repeated a diagnosis at least once. The estimate of the k for each participant reported in Figure 3a shows that agreement among participants was not uniform: participants had an average kappa ranging between 0.44 and 0.64 (with 95% CI limits ranging between 0.39 and 0.70). The k-combined is equal to 0.54 (95% CI 0.54-0.55). The analysis with respect to the k-vGM, shown in Figure 3b, presents relatively better results than those relating to the k-ava. Panellists had an average k-vGM ranging between 0.52 and 0.94 (with 95% CI limits ranging between 0.47 and 1.00). The k-combined for the panellists vs GM was 0.70 (range of means 0.64-0.76).

Table 5. Classification of histological subtypes by the 15 panellists (P) for the 36 canine mammary tumour samples included in the study. In bold red the diagnoses that differed from the majority opinion (GM), in grey boxes diagnoses repeated by the same panellist.

Case | P01 | P02 | P03 | P04 | P05 | P06 | P07 | P08 | P09 | P10 | P11 | P12 | P13 | P14 | P15 | GM
1 | IDPA | IDPA | DC | DC | DC | IDPC | DC | IDPC | IDPC | DC | IDPC | IDPA | IDPC | DC | IDPC |
2 | IMPC | IMPC | STC | IMPC | IMPC | IMPC | IMPC | IMPC | IMPC | IMPC | IMPC | STC | IMPC | IMPC | IMPC | IMPC
3 | CAD | …

… 20), and one of the six had a combined tubular and papillary pattern (GM = simple tubulopapillary carcinoma, case no. 5). An additional 3/11 discordant cases had a GM = B and included 1/3 cases with a GM = fibroadenoma (case no. 26) mainly differentially diagnosed as hyperplasia with fibrosis, 1/3 case with a GM of simple adenoma (case no. 10), which had one of the lowest concordances (5/15 panellists) and included several differential diagnoses (i.e., ductal adenoma/carcinoma; simple tubular carcinoma; lobular hyperplasia with atypia), indicating a difficulty in identifying also the M/B/H nature.
A further 2/11 discordant cases had a GM = H, 1 with a GM = lobular hyperplasia with fibrosis (case no. 18) also diagnosed as fibroadenoma (3/15) or hyperplasia with atypia (4/15) and 1 with a GM = lobular hyperplasia with atypia (case no. 29) associated with several differential diagnoses, including lobular hyperplasia with fibrosis (1/15), simple adenoma (3/15), complex adenoma (4/15), and complex carcinoma (1/15).

Outcomes Expressed in Terms of Grading
Since grading was assessed only for samples diagnosed as malignant, a certain amount of heterogeneity was seen. Therefore, we decided not to calculate the k, but to give instead a description of the most discordant elements of grading.
Grading (Supplementary Table S1) was never 100% concordant. Among the lesions with a GM of malignant tumour and a GM of histological subtype for which the grading was applicable (15 cases), all three grades were used by panellists in 11/15 cases, while in the remaining 4 cases either grade III or grade I was not applied. Two tumours did not reach a GM for grade (n. 30 and n. 36), and the most common GM was grade II (9/13). The highest concordance was for one grade II tumour (case n. 5, 73% with 11/15 panellists and with a 100% concordant GM of simple tubulopapillary carcinoma) and for one grade III tumour (case n. 11, 66% with 10/15 panellists and with a 100% concordant GM of comedocarcinoma). Cases with GM = B, when diagnosed as malignant (six cases), were predominantly scored as grade I; two cases with GM = H diagnosed as malignant were scored as grade I (n. 13 by two panellists) and grade II (n. 12 by one panellist).

Outcomes Expressed Considering Panellist Features
The cluster analysis performed on the variables synthesising the panellists' experience pointed out the existence, at the first level of partition, of two groups: one with 11 members and one with 4 members (Figure 4). The four members (3, 8, 11, 12) in the smaller group were among the five "experts". This definition coincides with being considered an expert on CMTs by colleagues, as reported in Table 4. The cluster analysis performed on the classification of lesions as H/B/M showed, at the first level of partition, two groups of eight and seven members. All the experts belonged to the first group and clustered together at the second and third levels of partition (Figure 4). The five panellists considered experts by reputation had 100% concordance among themselves in 28/36 cases (77.8%) in terms of H/B/M classification, and these cases were always concordant with the GM. With regard to the morphological diagnosis, their concordance was 100% in only 13/36 cases (36.1%), and these were always concordant with the GM. In only 6/36 cases (16.7%) did their discordance concern non-tumoural/benign versus malignant histotypes.

Discussion
In this study, we evaluated concordance and agreement in the diagnosis of CMTs applying the same histological classification system (DTF classification [7]), and consensus guidelines [63]. However, as already demonstrated in the literature (see Table 1 and references therein), the overall concordance was below 100% and the overall agreement was below excellent values.
Difficulties in reaching perfect diagnostic consensus are both reader related and lesion related and multiple variables are involved ( [71] and references therein).
Considering the reader-related elements, the application of the same classification and grading systems is fundamental for standardization, as is the establishment of international consensus working groups and guidelines [11][12][13][14]71,72]. Nevertheless, even when applying approved systems, high agreement is not easily achieved [71]. In our study, we tested the reproducibility and the feasibility of the DTF classification [7] implemented with national consensus guidelines [63]. No previous studies have been performed on the application of a detailed histological classification and/or guidelines for the diagnosis of CMTs. For human breast cancer (HBC), several attempts have been made (see references in Table 1), obtaining 75% concordance or a very variable agreement depending on specific subtypes [27,31]. The need for consensus discussions and shared guidelines has, therefore, already been pointed out in human medicine, both for tumoural and non-tumoural lesions, as the classification systems are still too prone to variability of application.
This variability is related to many additional factors. Among reader-related factors, the greater the expertise and the longer the experience of the pathologists in a specific field, the higher the consensus in that specific area can be, as demonstrated by our cluster analysis, in which the experienced CMT pathologists are grouped together. In some human studies, multivariate analyses similarly associated higher diagnostic agreement with higher diagnostic confidence, similar years of experience, and expertise in a specific area [35,71]. In addition, variable diagnostic approaches are also pathologist-related aspects that affect diagnostic variability during routine work, such as the number of sections evaluated per lesion, the application of ancillary analyses (e.g., histochemical and immunohistochemical tests), and the combination of both pathological and clinical aspects to produce a diagnosis [73][74][75][76][77]. All these aspects are very hard to standardise, and complex dedicated protocol guidelines should be considered in the attempt to reduce this variability [78][79][80]. They were not targeted in our study but should certainly receive further attention.
With regard to lesion-related aspects, some histological features (e.g., cellular/nuclear pleomorphism for establishing malignancy and grading) involve an intrinsically qualitative, subjective evaluation, so that IOV is very hard to minimise [81,82]. Beyond this, biological processes are often a continuum of progressive steps, and identifying those steps necessitates detailed morphological thresholds, which are not always available [83,84]. In CMTs and tumours in general, a major point of discussion is the identification of the transition of a lesion from non-neoplastic to neoplastic and, even more importantly, from benign to malignant [85,86]. In consideration of this point, we investigated the concordance/agreement in the identification of hyperplasia-dysplasia and benign/malignant lesions. Our study showed a relatively good result, with the k-combined considered in the literature from "moderate" to "substantial" (means of k_ava ranging from 0.66 to 0.82 with the 95% CI limits varying between 0.43 and 0.98 and the k-combined equal to 0.76) and 23/36 cases with 100% concordance. It can, therefore, be said that the level of histological diagnosis in discriminating between benign, malignant, or hyperplastic-dysplastic lesions was quite satisfactory. However, since k_ava is strongly affected by the diagnosis of each single panellist (i.e., a strong disagreement of only one participant can severely decrease the k), in this study, we also calculated the k_vGM, representing the distance of the single panellist from the majority opinion (GM) (ISO 13528:2015). With k_vGM, we observed, indeed, an even better agreement (k-combined = 0.86, 95% CI 0.62-1.00).
In a similar study conducted in Taiwan, 10 experienced pathologists classified 15 CMTs as either benign or malignant with no further histological classification and obtained a moderate average level of agreement (k = 0.43) [9]. Prior to and during the study, these authors did not agree on any specific classification criteria or guidelines; however, they did not include hyperplasia/dysplasia as a possible diagnosis, decreasing the possibilities of discordance. In our study, a strong discordance was observed in 4/36 cases, in which morphological aspects overlapped between hyperplastic/benign/malignant lesions. In such cases, the distinction relies more on subjective evaluation than on (missing) objective criteria (e.g., a lesion with a simple tubular organisation with mild atypia can receive a diagnosis of lobular hyperplasia, simple adenoma, or simple carcinoma grade I). In the attempt to improve agreement, particularly in these more subjective/borderline lesions, the application of specific parameters/thresholds (e.g., mitotic count threshold) was agreed by the WP and probably helped consensus. Application of specific thresholds/methodologies has already been demonstrated to improve concordance in specific areas [49,79]. The parameters applied in our study were taken from the DTF classification and were based on the authors' experience and not on published data. Before establishing precise morphological features/thresholds allowing the identification of tumour progression, the parameters should be carefully evaluated in follow-up large-scale studies, which, however, are largely lacking in veterinary medicine [87]. For this reason, the authors still believe that multi-institutional and international application of similar default thresholds would help standardisation, comparison of studies, and collection of large-scale data to assess and possibly redefine the thresholds themselves.
When it comes to the identification of specific histological tumour subtypes, the complexity of the lesions can increase difficulties in reaching blinded consensus diagnoses [7,31,35].
In our study, the agreement on the diagnostic code (identification of a specific histotype) was less satisfactory; the k-combined for k_ava showed values considered in the literature as moderate (k = 0.54; 95% CI 0.54-0.55). In this case as well, the k_vGM gave a better agreement (k-combined = 0.70, 95% CI 0.64-0.76), suggesting that this type of statistic should always be calculated versus either a standard diagnosis or a majority opinion, which is usually lacking within the studies (Table 1). No similar studies have been performed in CMTs. However, similarly to ours, two distinct works analysed agreement in classifying canine soft tissue sarcomas and canine and feline nervous system tumours [59] applying specific histological systems and obtained, respectively, moderate (k = 0.60) and substantial (k = 0.66) agreement for IOV. In this regard, CMTs are well known for their heterogeneity and complexity of classification [7]. In our study, the tumours characterised by proliferation of myoepithelial cells were among the lesions with the lowest concordance/agreement. The presence of more than one cell type (including myoepithelial cells) in CMTs often requires IHC for definitive characterization; therefore, ancillary tests could be necessary for a definitive diagnosis and should be suggested and accounted for within the report [7]. Additional histochemical staining and IHC have also been demonstrated to improve agreement in other types of tumours [21,59].
In our study, tumour grading also showed some discordance, and the agreement could not be calculated because all three grades were frequently applied by the WP for the same malignant lesions. Grading has often been found to be one of the most reliable prognostic parameters in multivariate analysis [1,2,88]. However, our study and others, both on humans and on dogs ([25] and references therein), indicate that the grading system contains weakly standardizable parameters that can be more easily affected by subjective evaluation [8]. As already reported, a two-tiered system might eventually increase the concordance [10]. In one study evaluating IOV of histological grading of 46 malignant CMTs performed by three mammary pathologists from the same institution, a moderate to substantial agreement (range of kappa means 0.51-0.71) was obtained [8]. This was in accordance with other similar human studies [10,25]. The lowest values were those for nuclear pleomorphism (k = 0.51) and mitotic count (k = 0.69) [8]. Evaluation of pleomorphism has already been considered one of the least concordant features in tumours, due to its heterogeneity within the same tumour and the qualitative, subjective nature of the evaluation [10,25]. The mitotic count is instead strongly affected by the selection of areas for the evaluation [8,25,27]. In our study, it was performed on digital slides with a precisely defined methodology; however, the fields evaluated varied depending on the starting field, which was subjectively established, as was the chosen direction of consecutive fields. Within this framework, digital and computer-aided pathology (CAD), which refers to a computational diagnosis system or a set of methodologies that utilises computers or software to interpret pathologic images, is considered an emerging field that will deeply change the temporal and spatial domains of pathologic diagnosis.
Thus, CAD systems using machine learning algorithms have been demonstrated to improve classification accuracy and improve reproducibility, reducing the IOV [89][90][91][92].
Taking into account that in veterinary medicine, ring studies to assess IOV are few [8,53,57,59,61] and that a multitude of methodologies are utilised, our study should be interpreted considering some limitations.
First, the pathologists were aware that they were evaluating slides covering nearly all entities present in the DTF classification, and this could have influenced interpretive performance. This bias is likely to have been at least partially overcome during slide evaluation, since all participants ultimately repeated the same diagnosis once or twice.
Second, we used only a single section per case. However, in clinical practice, pathologists typically review multiple slides per case and can request additional levels or ancillary immunohistochemical stains to reach a final diagnosis, particularly when more than one cell type is suspected, for example, involving pleomorphic myoepithelial cells [93,94].
Third, being aware of the complexity of CMT diagnoses, the WP carefully defined consensus guidelines based on the DTF classification, which could have raised the level of concordance. To assess precisely the role of the guidelines versus the DTF classification alone, a new study should be performed comparing two groups of pathologists applying the same DTF classification, with or without the discussed guidelines. The application of guidelines has already been shown to be useful in increasing consensus and, therefore, should be considered in addition to or within internationally recognised classification systems [39,71,95].
Further ring studies should be performed to correct some of these biases. They should include more pathologists with more varied professional expertise from countries worldwide, in which the incidence of CMTs and, therefore, experience with them can differ (e.g., Mediterranean countries have more CMTs than the US owing to cultural attitudes towards spaying female dogs) [96,97]. They should also use a distribution of cases with more realistic frequencies, as would be encountered in routine diagnosis.

Conclusions
There is no doubt that pathological examination has led to many of the currently used classifications and that morphological observation, together with its correlation with clinical parameters, has provided a sound basis for clinical medicine as it is today. It is also true, however, that subjective histopathological approaches can invalidate these overall concepts. Therefore, it is of critical importance to have a diagnosis that is reproducible. Reducing methodological variability among veterinary pathologists would also improve the comparison of studies on CMTs. To achieve this goal, we set out to revisit the histopathological criteria for the diagnosis of CMTs, considering the main findings of all entities described in the most recent classification of CMTs, and to assign a weighting to the criteria that drive the diagnosis and grade of these tumours. In this study, the overall agreement between the individual pathologists' interpretations and the reference diagnosis (majority opinion) was relatively high when classifying the nature of the lesion (H/B/M), but somewhat lower when categorising the specific histotype.
Therefore, further efforts are still needed to standardise the application of international classification systems, particularly for heterogeneous diseases such as mammary tumours in dogs.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/xxx/s1, Table S1: histological grading given by panellists to the 36 studied cases.

Informed Consent Statement:
Informed consent was obtained from V.Z. and was the one signed by owners or clinicians submitting the samples.

Data Availability Statement:
Original sample data are held in institutional archives and were mainly submitted through the platform www.simbavet.org; however, personal data on clinicians, owners and animals are not available to the public.

Conflicts of Interest:
The authors declare no conflict of interest.