A Modified HSIFT Descriptor for Medical Image Classification of Anatomy Objects

Abstract: Modeling low level features to high level semantics in medical imaging is an important aspect of filtering anatomy objects. Bag of Visual Words (BOVW) representations have proven effective for modeling these low level features into mid level representations. Convolutional neural networks are learning systems that can automatically extract high-quality representations from raw images. However, their deployment in the medical field is still challenging due to the lack of training data. In this paper, learned features obtained by training convolutional neural networks are compared with our proposed hand-crafted HSIFT features. The HSIFT feature is a symmetric fusion of the Harris corner detector and the Scale Invariant Feature Transform (SIFT) with a BOVW representation. Both the SIFT process and the classification technique are enhanced, the latter by adopting bagging with a surrogate split method. Quantitative evaluation shows that our proposed hand-crafted HSIFT feature outperforms the learned features from convolutional neural networks in discriminating anatomy image classes.


Introduction
Medical images acquired from various imaging sources play an extremely important role in a healthcare facility's diagnostic process. The images contain information about the different conditions of a patient [1]. This information can be used to facilitate therapeutic and surgical treatments. Due to advancements in medical imaging technology, such as the use of Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), the volume of images has increased significantly [2]. As a result, the need for automatic methods of indexing, annotating, analyzing and classifying these medical images has grown. The classification of images to allow for automated storage and retrieval of relevant images has become a critical and difficult task, as these images are generated and archived at an increasing rate every day [3].
From the perspective of radiology workflow, images are usually archived within Picture Archiving and Communication Systems (PACS) [4]. Retrieving similar anatomy cases from a large archive composed of different modalities can be a daunting task and is considered one of the pressing issues in the rapidly growing field of content-based medical image retrieval [5]. Apart from data disproportion [6], there are two main issues in the classification of medical images based on anatomies [7]: (1) high intra-class visual variability, where images belonging to the same category may look very different in terms of varying contrast and shapes deformed by the advancement of various pathologies; (2) inter-class visual similarity, where images may look quite similar while belonging to different visual classes. For instance, as shown in Figure 1, images that belong to different anatomy object classes might look quite identical. In spite of some remarkable work on building content-based retrieval systems for images of specific modalities and anatomies, i.e., X-ray and CT images of different body parts [8], lung images based on CT modality [9] and breast ultrasound images [10], the retrieval efficacy of medical image retrieval systems is critically dependent on the feature representations. The human body has a high degree of symmetry, which is visible from the outside, and many organs, including the brain, lungs, and visual system, are symmetric as well. Even though a number of feature representation schemes have been developed for classifying medical images, these representations are domain specific and cannot readily be applied across different modality classes, since variability in medical images always exists.
The emerging trend in image classification tasks is to adopt Convolutional Neural Networks (CNN), in which the features are self-learned to discriminate the anatomical objects [11]. Nevertheless, there is not yet a large pool of training data such as ImageNet [12] to guarantee the success of this approach in medical anatomy classification. How well does a CNN perform in feature extraction with moderately sized datasets? Is it comparable with the conventional or hand-crafted features that have been used in the field?
In this study, a hand-crafted feature, the ensemble descriptor HSIFT, combined with the Bag-of-Visual-Words (BOVW) technique, is proposed for the representation of anatomy images. HSIFT is the result of embedding the Harris corner detector [13] into the Scale Invariant Feature Transform (SIFT) [14]. It is a modification of SIFT in which the stages of scale space extrema detection and keypoint localization are replaced with Harris corner detection, since these steps are the most time-consuming and do not represent edge information efficiently. To improve accuracy, an ensemble classifier equipped with surrogate splits [15] is proposed for medical image classification. The aim of the proposed approach is to overcome the limitations of traditional classification approaches [16,17]. The schematic of the proposed method is shown in Figure 2. Although HSIFT can be seen as a conventional feature compared with CNNs, which are currently a hot topic in image classification, the discrimination power of this conventional feature relative to CNNs in the anatomy classification of medical images is worth investigating.
The proposed feature representation technique is robust because of its ability to perform recognition tasks across different domains such as modality and anatomy. The novelty of this feature representation compared with existing representations is severalfold: (1) the modification of the SIFT descriptor by embedding the Harris corner as the keypoint detection technique leads to a generalized feature representation that can be applied to various medical image classification tasks, as carried out in this research for multiple modalities and multiple anatomical images; (2) a surrogate splits technique is introduced with bagging that is capable of handling noisy data. The second major contribution of this study is to the trending area of deep learning for medical image analysis and classification. This contribution is shown through the comparative evaluation of the hand-crafted features and the deep learned features; these evaluations indicate that deep learning architectures still need large amounts of data to perform well in automatic medical image classification, whereas hand-crafted features showed good performance with different dataset sizes.
The following points summarize the contributions of this paper:
• We discuss the problem of intra-class and inter-class variability in medical image classification.
• We develop a robust classification method for classifying medical images based on modality and anatomy, addressing the challenges of intra-class and inter-class variability.
• We provide a detailed comparative analysis of the conventional method and deep learning methods for medical image classification.
• We evaluate the efficacy of the developed method. The experiments demonstrate its effectiveness for medical image classification based on modality and anatomy.
The rest of the paper is organized as follows: related work is discussed in Section 2, our HSIFT feature extraction method is presented in Section 3 while Section 4 shows the performance evaluation of the CNN architectures and our proposed feature in classifying medical image anatomies. Finally, Section 5 concludes the paper.

Related Work
A variety of low level feature descriptors have been proposed over the years for image representation, ranging from global features such as texture and edge features [18] to the more recently used local feature representations such as Harris corner detectors and the Scale Invariant Feature Transform (SIFT).
Bag of Visual Words (BOVW) representation with SIFT has been applied in anatomy-specific classification: [19] extracted SIFT descriptors from 2D slices of CT liver images for the classification of CT liver images, whereas [20] adopted a BOVW representation for the classification of focal liver lesions. In addition, SIFT has also been combined with other texture features such as Haralick features and the Local Binary Pattern (LBP) for the classification of distal lung images [21].
According to [22], descriptors like SIFT are designed to deal with regular images coming from a digital camera or a video system. Many keypoints can easily be detected and distinguished owing to the kind of images and the rich information they contain. However, when applied to medical images, these classical descriptors, which were effective on regular images, are no longer applicable without adaptation, as noted in [23].
Another keypoint detector, the Harris corner [13], exhibits strong invariance to rotation, illumination variation, and image noise. The effectiveness of the Harris corner in terms of repeatability under rotation and various illuminations has also been noted in [24]. It has been applied in the classification of breast mammography images [25], the classification of X-ray images [26] and enhanced breast cancer image classification [27]. The Harris corner has shown fast corner detection in terms of computation time [28] when used in the classification of breast infrared images, and it has also been used in brain medical image classification [29]. A hybrid approach to corner detection combined the Harris corner and SUSAN operators for brain magnetic resonance image registration [30]. In [31], Harris corners were extracted as feature points and constructed as a collection of virtual grids for image comparison. The Harris corner was also applied to the non-rigid registration of lung CT images by detecting the anatomic tissue features of the lungs [32], and to multi-modal retinal image registration [33,34]. In [35], the Harris corner was used for image saliency detection to locate the foreground object.
It is evident from previous studies that the Harris corner has shown prominent performance in various medical image analysis tasks [36]. In addition, the Harris corner has been used as an initial phase for the segmentation of the vertebra in X-ray images [37], while [38] adopted it as an initial phase for the segmentation of the human lung from two-dimensional X-ray images. Ref. [39] combined the Harris corner and SIFT by computing the scale coverage of the SIFT descriptor, subjecting images to bi-linear interpolation at different scales.
A modified version of SIFT was used by [40] for the segmentation of prostate MRI images to reduce computation time. Changes were made to SIFT to adapt the contrast threshold of the classical SIFT method for deformable medical image registration [41]. It works efficiently for brain MRI images; however, this was not the case for ultrasound images, where many invalid keypoints were detected.
Hence, recent years have witnessed a surge of active research efforts toward finding a suitable feature representation for a specific task. One of the main drawbacks of these studies is that the feature representations are designed for a specific anatomical structure and are capable of capturing only one facet of the anatomical image. They are not robust enough for the various transformations that medical images in routine work may be subjected to, such as pathological deformations, varying contrast and compression. Therefore, a robust feature representation is needed that can map the visual contents of different anatomical structures under the aforementioned deformations. Based on the success of the Harris corner and SIFT keypoint detectors, it is believed that fusing the processes of both detectors may give promising results; hence our proposed hand-crafted feature is developed by combining the efficient steps of the Harris corner with the SIFT process.
In biomedical applications, deep learning with Convolutional Neural Networks (CNN) has made sound advancements [42]. Recent findings have shown that incorporating CNNs into state-of-the-art computer-aided detection (CADe) systems, such as medical image classification, can significantly improve their performance [43][44][45], including the classification of medical image anatomies [11,12,46]. These studies, on the other hand, do not provide a comprehensive evaluation of milestone deep nets [11,46] and have been applied to a single modality, for instance, CT images [46].
In this paper, we present our proposed hand-crafted feature, HSIFT, and its effectiveness in classifying anatomy objects across multiple modalities compared with CNNs.

Formulation of the Proposed HSIFT Model with Bag of Visual Words Representation
After evaluating the properties of SIFT and Harris corner features in [36], we proposed that SIFT and Harris corner be fused together because both features complement each other and have the least redundancy. The fused feature was given the name HSIFT.
By replacing scale space extrema detection and keypoint localization with the Harris corner computation, the Harris corner detector was embedded into the SIFT structure. As a result, no scale difference coverage was required in this study, resulting in faster computation. Because first derivatives are used instead of second derivatives and a larger image neighborhood is taken into account, Harris points are typically more precisely located. Compared with the original SIFT corner points, Harris points are preferred when looking for exact corners or when precise localization is required. This is important in determining the correspondence of points in the same anatomies when matching anatomies. Furthermore, the Harris corner has demonstrated remarkable robustness to intensity changes and noise [29]. This matters because medical images are subjected to different transformations, contain a lot of noise and have low contrast resolution due to the low dose associated with them, as shown in Figure 3, all of which are important factors in discriminating anatomical images. The features extracted from the various anatomical images in the form of HSIFT are then projected into a Bag of Visual Words representation: the visual vocabulary is constructed by reducing the number of features through quantization of the feature space using K-means clustering, in order to improve classification accuracy. On top of the proposed feature, we investigated bagging with surrogate splits as the classification technique.
Instead of using bilinear interpolation to minimize the impact of scale differences, as in [39], we use the Harris corner detector and do not conduct a scale space analysis, so no scale difference coverage is required. This significantly cuts down the time required to compute the SIFT descriptor.

Data Collection
We began our experiments with the data set obtained from the National Library of Medicine, National Institutes of Health, Department of Health and Human Services [11]. It is a free, open-access medical image database under the National Library of Medicine, with thousands of anonymized, annotated medical images. The content is organized by organ system, pathology category, patient profiles, image classification and image captions. The collection is searchable by patient symptoms and signs, diagnosis, organ system, image modality and image description. For this study, anatomical images from the CT, MRI, PET, ultrasound and X-ray modalities were used in the experiments. The database also contains images with various pathologies. We used 37,198 images of five anatomies to train both our HSIFT and CNN models for evaluation. Another 500 images, 100 images per anatomy and disjoint from the training set, were used to test both the HSIFT and CNN models. The anatomical medical images in this study have different dimensions ranging from 200 × 150 to 490 × 150. The lung, liver, heart, kidney, and lumbar spine are the anatomies studied in our experiments. Sample images are shown in Figure 4. Both normal and pathological images were used in our experiments so that our proposed features can generalize to classify any image of the same organ even when it varies in shape, contrast and modality.

Feature Extraction
The extraction of the features begins with modifying the SIFT feature, in which the Harris corner detector is used as the starting point for the computation of keypoints. Conventionally, the SIFT algorithm has four major stages: detection of scale space extrema, keypoint localization, keypoint orientation assignment and keypoint descriptor computation. The proposed Harris corner detector replaces the first two stages of the SIFT process, scale space extrema detection and keypoint localization. These two steps take a long time because of the scale space analysis used to calculate the feature point positions. In addition, the Harris corner is invariant to rotation and translation, which are important factors in distinguishing different anatomies and quantifying the correspondence between two features.
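To make this modification concrete, the following sketch (assuming Python with OpenCV's SIFT implementation, e.g., opencv-python ≥ 4.4) detects Harris keypoints and then reuses SIFT only for the orientation assignment and descriptor stages; the function name, window size and parameter values are illustrative assumptions and not values reported in the paper.

```python
import cv2
import numpy as np

def extract_hsift(gray, max_corners=500):
    """Sketch of HSIFT: Harris keypoints + SIFT orientation/descriptor stages."""
    # Harris corner detection replaces SIFT's scale space extrema detection
    # and keypoint localization stages.
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=5,
                                      useHarrisDetector=True, k=0.04)
    if corners is None:
        return np.empty((0, 128), dtype=np.float32)
    keypoints = [cv2.KeyPoint(float(x), float(y), 7.0) for [[x, y]] in corners]
    # Orientation assignment and the 128-D descriptor are computed by SIFT
    # at the Harris keypoint locations (no scale space analysis is performed).
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors

# Usage (hypothetical file name):
# descriptors = extract_hsift(cv2.imread("ct_slice.png", cv2.IMREAD_GRAYSCALE))
```

Descriptors extracted in this way from all training images are later quantized into the visual vocabulary used for the BOVW representation described below.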

Detection of Harris Corner
The mathematical interpretation of the Harris corner is as follows:

E(u, v) = Σ_{x,y} w(x, y) [I(x + u, y + v) − I(x, y)]²

where:
• E is the difference between the original and the shifted window,
• u is the x-axis shift of the window,
• v is the y-axis shift of the window,
• w(x, y) is the window function at (x, y), which acts as a mask ensuring that only the appropriate window contributes to the sum,
• I(x + u, y + v) is the shifted intensity and I(x, y) is the intensity at (x, y).
We have to maximize the equation above, because we are looking for windows with corners, i.e., windows with a lot of intensity variation. In particular, the term [I(x + u, y + v) − I(x, y)]² is expanded using the first-order Taylor expansion I(x + u, y + v) ≈ I(x, y) + u I_x + v I_y, which gives the bilinear approximation of the average intensity change in direction [u, v]:

E(u, v) ≈ [u v] M [u v]ᵀ

where M is the structure tensor of a pixel, a 2 × 2 matrix that characterizes the information of all pixels within the window:

M = Σ_{x,y} w(x, y) [I_x², I_x I_y; I_x I_y, I_y²]

with I_x and I_y the partial derivatives of the image in x and y.
Furthermore, as a measure of the corner response, a point is described in terms of the eigenvalues λ₁, λ₂ of M as follows:

H = det(M) − k (tr(M))²

with det(M) = λ₁ λ₂ and tr(M) = λ₁ + λ₂. Here det(·) and tr(·) are the determinant and trace, respectively, and k is the empirical sensitivity constant of the Harris detector [13]. In order to see why the matrix M has the capacity to determine the corner's characteristics, we computed the covariance matrix Σ of the gradient (I_x, I_y) at a pixel. Let g = (I_x, I_y)ᵀ and let ḡ denote its mean over the window. By the definition of the covariance matrix we have:

Σ = (1/f) Σ_{x,y} (g − ḡ)(g − ḡ)ᵀ    (5)

On the other hand, if we apply an averaging filter that gives uniform weight to the pixels within the window, the matrix M becomes:

M = (1/f) Σ_{x,y} [I_x², I_x I_y; I_x I_y, I_y²]

where f is the total number of pixels within the window. From the above equations, it can be seen that the matrix M is essentially an estimate of the covariance matrix of the gradients of the pixels within the window. Therefore, findings on the covariance matrix of the gradients calculated in Equation (5) can be used to analyze the structure tensor.
A good corner has a large intensity change in all directions [13], i.e., H should be a large positive number. To compute the Harris corner response while keeping the basic working principle of the Harris corner in mind, we obtain the image derivatives by convolving the image with derivative masks and then smooth these derivatives with a Gaussian filter.
The pixels that are not part of the local maxima are then set to zero using non-maximal suppression. This allows us to extract the local maxima by dilation, then find the points in the corner strength image that match the dilated image, and finally obtain their coordinates, which are referred to as keypoints. Figure 5 summarizes the output of each step in Harris corner detection. Each stage in Figure 5 is obtained in the mathematical formulation through the following steps:
• Compute the image derivatives I_x and I_y by convolving the image with the x and y derivative masks.
• Calculate the derivative products at each pixel: I_x² = I_x × I_x, I_x I_y = I_x × I_y and I_y² = I_y × I_y. (10)
• Finally, compute the response of the detector at each pixel as H = det(M) − k (tr(M))².
In order to assess the discriminating capability of adopting the Harris corner as the starting point for keypoint detection, the first two steps of SIFT and the Harris corner were compared. The output of the scale space extrema stage of the SIFT descriptor is shown in Figure 6. As can be seen in Figure 6, the scale space extrema process in the fourth octave loses the details of an image, as it is unable to determine the actual fine details of the image, which results in fewer keypoints, as shown in Figure 7. These characteristics are important in classifying images with variability, in contrast to Figure 5, in which fine details of the images are still preserved even when the image is of low resolution. Furthermore, it can be seen in Figure 7 that the keypoints detected by the Harris corner capture the overall shape of the object, which helps in better classification. Medical images with low resolution cannot be classified with the SIFT process alone because of the loss of detail as it progresses through the different octaves.
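As an illustration of these steps, the sketch below (assuming NumPy and OpenCV; the parameter values, including the sensitivity constant k and the smoothing scale, are assumptions rather than values reported in the paper) computes the derivative products, the Gaussian-smoothed structure tensor entries, the response H, and the dilation-based non-maximal suppression described above.

```python
import cv2
import numpy as np

def harris_keypoints(gray, k=0.04, sigma=1.5, rel_thresh=0.01):
    """Sketch of the Harris detection steps: derivatives, products, response, NMS."""
    gray = np.float32(gray)
    # Image derivatives Ix, Iy obtained by convolving with derivative masks
    Ix = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    # Derivative products at each pixel
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    # Gaussian smoothing of the products gives the entries of the matrix M
    Sxx = cv2.GaussianBlur(Ixx, (0, 0), sigma)
    Syy = cv2.GaussianBlur(Iyy, (0, 0), sigma)
    Sxy = cv2.GaussianBlur(Ixy, (0, 0), sigma)
    # Corner response H = det(M) - k * tr(M)^2 at each pixel
    H = (Sxx * Syy - Sxy * Sxy) - k * (Sxx + Syy) ** 2
    # Non-maximal suppression: keep pixels equal to the dilated (local maximum) image
    local_max = (H == cv2.dilate(H, None)) & (H > rel_thresh * H.max())
    ys, xs = np.nonzero(local_max)
    return list(zip(xs.tolist(), ys.tolist()))  # keypoint coordinates (x, y)
```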

Keypoint Orientation Assignment
Following the extraction of the Harris corners, the next step in the SIFT algorithm is orientation assignment, which determines a dominant orientation for each keypoint based on its local gradient directions. This is accomplished by creating a 36-bin histogram from the image gradients around the keypoint. Each keypoint is then normalized by aligning it to its dominant orientation, as determined by the following equations.
m(x, y) = √( (H(x + 1, y) − H(x − 1, y))² + (H(x, y + 1) − H(x, y − 1))² )

θ(x, y) = tan⁻¹( (H(x, y + 1) − H(x, y − 1)) / (H(x + 1, y) − H(x − 1, y)) )

where m(x, y) and θ(x, y) are the gradient magnitude and orientation of pixel (x, y) and H is the Hessian measure. The orientation of the computed features is shown in Figure 8. The last step is to create a keypoint descriptor, which computes the local image descriptor based on the image gradients in its immediate vicinity. The area surrounding the keypoint is divided into 4 × 4 sub-regions. For each sub-region, an orientation histogram with eight bins is constructed, yielding a 4 × 4 × 8 = 128 dimensional vector for each region. The process is summarized as shown in Figure 9.
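The orientation assignment can be sketched as follows (a simplified NumPy illustration of the 36-bin, magnitude-weighted histogram; the use of central differences is an assumption, and the Gaussian weighting used in standard SIFT is omitted for brevity).

```python
import numpy as np

def dominant_orientation(patch, num_bins=36):
    """Dominant gradient orientation of a keypoint neighbourhood (e.g., a 16x16 patch)."""
    patch = patch.astype(np.float32)
    dy, dx = np.gradient(patch)                     # central-difference gradients
    m = np.sqrt(dx ** 2 + dy ** 2)                  # gradient magnitude m(x, y)
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0  # gradient orientation in [0, 360)
    # 36-bin orientation histogram weighted by the gradient magnitude
    hist, _ = np.histogram(theta, bins=num_bins, range=(0.0, 360.0), weights=m)
    # The peak bin gives the keypoint's dominant orientation (bin centre, in degrees)
    return (np.argmax(hist) + 0.5) * (360.0 / num_bins)
```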

Construction of the Codebook
Encoding and pooling are the two steps in the construction of a bag of visual words (BOVW) representation. The local descriptors are encoded into codebook elements in the encoding step, and the codebook elements are aggregated into a vector in the pooling step.
The encoding step is the vital component because it links feature extraction and feature pooling and has a great impact on image classification in terms of accuracy and speed. The most commonly used encoding method is vector quantization (VQ) [47]. The problem with this method is that it assigns each descriptor to a single visual word, which results in quantization loss, and it also ignores the relationship between different bases. In the case of medical images, this limitation imposes a constraint on feature encoding because medical image datasets are high dimensional and often represent complex nonlinear phenomena. To overcome this, the Locality-constrained Linear Coding (LLC) [48] encoding method is used, which achieves lower reconstruction error by using multiple bases.
In LLC, the following least squares fitting problem is solved:

min_k  Σ_{i=1}^{N} ‖z_i − B k_i‖² + λ ‖d_i ⊙ k_i‖²,   s.t. 1ᵀ k_i = 1, ∀i

where k = [k_1, k_2, ..., k_N] is the set of codes for the local descriptors z = [z_1, z_2, ..., z_N], B is the codebook and ⊙ denotes element-wise multiplication. Furthermore, d_i is the locality constraint, which is given as:

d_i = exp( dist(z_i, B) / σ )

where dist(z_i, B) is the Euclidean distance between z_i and the codebook bases and σ is the weight decay speed adjuster for the locality constraint.
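A minimal sketch of the codebook construction and of an approximated LLC encoding (following the k-nearest-bases approximation proposed in [48]; the vocabulary size, number of neighbours, regularization constant and the use of max pooling are assumptions, not settings taken from the paper) could look as follows.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, num_words=500, seed=0):
    """Visual vocabulary: K-means quantization of the HSIFT descriptor space."""
    km = KMeans(n_clusters=num_words, n_init=10, random_state=seed)
    return km.fit(all_descriptors).cluster_centers_

def llc_encode_image(descriptors, codebook, knn=5, beta=1e-4):
    """Approximated LLC: each descriptor is coded with its k nearest codewords."""
    codes = np.zeros((len(descriptors), len(codebook)))
    for i, z in enumerate(descriptors):
        dists = np.linalg.norm(codebook - z, axis=1)
        idx = np.argsort(dists)[:knn]            # locality: nearest bases only
        Z = codebook[idx] - z                    # shift the local bases to the descriptor
        C = Z @ Z.T                              # local covariance
        C += np.eye(knn) * beta * np.trace(C)    # regularization for numerical stability
        w = np.linalg.solve(C, np.ones(knn))
        codes[i, idx] = w / w.sum()              # enforce the sum-to-one constraint
    return codes.max(axis=0)                     # pooling -> image-level BOVW(HSIFT) vector
```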

Ensemble Classifier with Surrogate Splits
An ensemble classifier consists of a set of individually trained classifiers whose predictions are combined for classifying new instances [49]. Ensemble classifier with surrogate splits [15] may help in minimizing the generalization error by handling the missing values which occurred due to the noisy data in some anatomical classes. Moreover, it has been stated that tree based ensembles handle high dimensional data easily [50]. Surrogate splits can be used to estimate the value of a missing feature based on the value of a feature with which it is highly correlated.

Bagging
Bagging, as a bootstrap aggregating method, generates the training sets for its ensemble by training each classifier on a random redistribution of the training set [51]. Each classifier's training set is created by resampling the original training set with replacement. In this way, a sequence of decision tree classifiers C_m, m = 1, 2, ..., M, is created. After that, an ensemble classifier is obtained by combining the individual classifiers. The basic formulation of bagging is given in Algorithm 1.

Algorithm 1: Bagging Algorithm
Input: training set S, decision tree inducer I, integer T (number of bootstrap samples).
Output: classifier C*
1: for m = 1 to T do
2:   S_m = bootstrap sample of S (sample |S| instances with replacement)
3:   C_m = I(S_m)
4: end for
5: C*(x) = argmax_y Σ_{m=1}^{T} 1[C_m(x) = y]  (majority vote)
When two features have a high correlation, surrogates estimate the value of a missing attribute based on the feature with which it is most highly correlated. If an attribute's value is missing, the mean over all instances with the same estimated class value, i.e., the mean of the instances with the same class label, is used to estimate it.
In addition to this, surrogate splits have also shown good performance in handling both training and testing missing cases as reported in [50].
Our modified algorithm using bagging with surrogate splits is given in Algorithm 2. For each ensemble member, a bootstrap replica is created by randomly sampling with replacement from the training set T, and a decision tree is grown on this replica. At each node, in addition to the primary split, a surrogate split X, X ≤ s_X, is created on the feature most highly correlated with the primary split feature, so that instances with missing values can still be routed down the tree. During testing, each sample t_i is classified into class s_j according to the number of votes obtained from the individual classifiers.
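Since surrogate splits are not exposed by common tree libraries, the sketch below (Python with scikit-learn; all names and settings are illustrative) imitates the idea described above: missing values are first estimated from the most correlated feature, with the class-conditional mean as a fallback, and bagged decision trees are then trained on the completed BOVW(HSIFT) vectors. It is an approximation of the surrogate-split behaviour, not the exact algorithm of the paper.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def surrogate_impute(X, y):
    """Estimate missing values from the most correlated (surrogate) feature,
    falling back to the class-conditional mean."""
    X = X.copy()
    corr = np.nan_to_num(np.abs(np.corrcoef(np.nan_to_num(X), rowvar=False)))
    np.fill_diagonal(corr, 0.0)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        s = int(np.argmax(corr[j]))                     # surrogate feature index
        both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, s])
        slope, intercept = np.polyfit(X[both, s], X[both, j], 1)
        est = slope * X[:, s] + intercept               # surrogate-based estimate
        for c in np.unique(y):                          # class-conditional mean fallback
            fb = miss & (y == c) & np.isnan(est)
            est[fb] = np.nanmean(X[y == c, j])
        X[miss, j] = est[miss]
    return X

# Bagging of decision trees (majority vote over the individual trees)
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# clf.fit(surrogate_impute(X_train, y_train), y_train)
```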

Comparative Experiments and Results for HSIFT and CNN
We began our experiments with a machine equipped with an NVIDIA GeForce GTX 980M graphics card and a data set obtained from the National Institutes of Health, Department of Health and Human Services [11] as mentioned in Section 3.1.

Experimental Setup for HSIFT
Our experiments were run with 10-fold cross-validation. Various codebook sizes were experimented with empirically, and the optimal codebook size for anatomy recognition was found to be 500, as shown in Figure 10. Hence, the dimension of the BOVW(HSIFT) vector adopted is 500 for each image instance. We conducted a thorough experiment on SIFT [52], SIFT + Harris corner [39] and BOVW(HSIFT) by applying the SVM and the ensemble classifiers. As shown in Table 1, BOVW(HSIFT) outperformed SIFT + Harris corner [39] and BOVW(SIFT) [53], although the latter two had shown good matching in natural images and good modality classification, respectively, in their original work. BOVW(HSIFT) performed the best among all the tested features, producing an error rate of only 9.7% with the SVM and 2% with the bagging with surrogate splits classification technique. The results are summarized in Table 1. Experiments on regular bagging and bagging with surrogate splits with M = [1, 50] were conducted to see how different numbers of trees affected the cross-validation error, in order to determine the optimal number of trees M to be used in the testing stage. It can be seen from the experimental results in Figure 11 that the minimum cross-validation error was achieved at M = 50. Taking this into account, bagging with surrogate splits with 50 trees was chosen for the testing stage. For bagging with surrogate splits, the cross-validated error is 2%. Figure 11. Comparative analysis of cross-validated error for training data.
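The cross-validation protocol can be reproduced in a few lines (a sketch assuming the BOVW(HSIFT) matrix X of dimension N × 500 and the anatomy labels y are already available; scikit-learn's bagged trees serve as a stand-in, since surrogate splits are not available there).

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, n_trees=50):
    """10-fold cross-validated error for SVM vs. bagged decision trees on BOVW(HSIFT)."""
    models = {
        "SVM": SVC(kernel="rbf"),
        "Bagging (50 trees)": BaggingClassifier(DecisionTreeClassifier(),
                                                n_estimators=n_trees, random_state=0),
    }
    for name, clf in models.items():
        acc = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
        print(f"{name}: cross-validated error = {1.0 - acc.mean():.3f}")
```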
After analyzing the cross-validation results, we tested the trained model on the test data set of 500 images. The results are shown in Figure 12. They clearly show that the proposed method outperforms regular bagging, achieving an accuracy of 98.1%. The ROC curve for the diagnostic evaluation of the proposed feature is shown in Figure 13. It depicts an AUC of 0.98, indicating good performance of the method.

Experimental Setup for Convolutional Neural Network (CNN)
CNNs have demonstrated cutting-edge performance in various computer vision and image recognition competitions [54]. In order to assess the effectiveness of CNNs for medical image anatomy classification, we analyze the comparative performance of our proposed features and CNN features to investigate the discrimination power of the two methods.
In order to quantitatively analyze the discriminative capacity of the feature maps, normalized mutual information was adopted as the evaluation metric for the two feature extraction methods. The experiments were conducted based on calculations of entropy and mutual information.
Let L be the category labels and F the features. The entropy H(·), the amount of uncertainty of a partition, is defined for the two label assignments (L, F) as:

H(L) = − Σ_{i=1}^{|L|} P(i) log P(i)    (17)

where P(i) = |L_i| / N is the probability that an instance picked at random from L falls in class L_i. Similarly, for F:

H(F) = − Σ_{j=1}^{|F|} P′(j) log P′(j)    (18)

where P′(j) = |F_j| / N is the probability that an instance picked at random from F falls in class F_j. The mutual information between L and F is calculated as:

MI(L, F) = Σ_{i=1}^{|L|} Σ_{j=1}^{|F|} P(i, j) log( P(i, j) / (P(i) P′(j)) )    (19)

where P(i, j) = |L_i ∩ F_j| / N is the probability that an instance belongs to both class L_i and class F_j.
Based on Equations (17)–(19), the normalized mutual information [55] is defined as:

NMI(L, F) = MI(L, F) / √( H(L) H(F) )

where MI(L, F) is the mutual information between the category labels L and the feature map activations F, H(·) is the entropy and √(H(L) H(F)) normalizes the mutual information to the range [0, 1]. Table 2 shows the computed Normalized Mutual Information (NMI) for the architectures used in this study. The layers ip2/fc8/pool5/7 × 7_s1 represent the last fully connected layers of the three milestone architectures, respectively. An NMI score close to 1 indicates a strong correlation between the output activations and the category labels, while an NMI score close to zero indicates a poor correlation. It is evident from the results in Table 2 that the proposed HSIFT feature outperforms the CNN features in terms of the discriminative power of the features in depicting the category labels. Table 3 summarizes the performance of the proposed HSIFT feature versus the three milestone architectures in terms of runtime, training loss, validation accuracy and test accuracy. Because these three landmark architectures were designed for natural image classification, parameter tuning and layer formulation required a more subtle approach. Even with optimized CNN architecture tuning, the HSIFT feature still outperforms the deep learned features in terms of discriminating ability. Modifying the basic CNN architecture in terms of the number of layers, together with the normalization function and subtle tuning of the hyperparameters, may yield better results for the task of medical image anatomy classification. Nevertheless, the results in Table 2 reveal that architectures designed for natural image classification cannot be applied directly to medical image objects or anatomies. The few misclassified instances out of 500 images are shown in Figure 14. Figure 14. Misclassified instances of heart as lung, kidney and lumbar spine. (a) heart misclassified as lung; (b) heart misclassified as kidney; (c) heart misclassified as lumbar spine; (d) heart misclassified as lung.
As can be seen from Figure 14, the shape of the organ and its intensity play an important role in discriminating the anatomical structures.
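The NMI-based comparison can be sketched as follows (assuming scikit-learn; since Equations (17)–(19) and the NMI definition above require discrete partitions, the continuous feature activations are discretized here by clustering before computing the NMI against the anatomy labels, which is one reasonable reading of the evaluation rather than the paper's exact procedure).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def feature_label_nmi(features, labels, n_clusters=5, seed=0):
    """NMI between category labels L and a clustering F of the feature activations."""
    # Discretize the activations (e.g., fc8 / pool5 outputs or BOVW(HSIFT) vectors)
    assignments = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(features)
    # Geometric averaging matches the sqrt(H(L) * H(F)) normalization used above
    return normalized_mutual_info_score(labels, assignments,
                                        average_method="geometric")
```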

Conclusions
We proposed a modified SIFT descriptor with the Harris corner to form the Bag-of-Visual-Words feature for image representation in this paper. Using the Harris corner to replace the first two steps of the SIFT process still allows an orientation to be assigned to each keypoint, which provides rotation invariance and, at the same time, helps to deal with viewpoint changes. As anatomical images are sometimes captured from different viewpoints, this is important for distinguishing between them. The combination of these primitive descriptors results in a more discriminative codebook in the BOVW representation, allowing for efficient anatomy classification while remaining robust to transformations. Compared with the traditional SVM classification technique, which is commonly used for classifying medical images, the ensemble classifier, bagging with surrogate splits, is good at handling missing data and improves classification accuracy.
According to the experiments, the proposed ensemble descriptor outperforms some previous work on medical image anatomy classification. Even across different modalities, it aids in the creation of more discriminative codebooks for anatomy classification. In comparison with convolutional neural networks, the proposed approach of bagging with surrogate splits on the HSIFT feature provides better discrimination power in the extracted features. We also conclude that, at the current stage, a Convolutional Neural Network does not guarantee success in general medical image classification when trained with only moderate data sizes of a few tens of thousands of images. The work can be expanded in the future to include the recognition and classification of pathological structures within these anatomies, resulting in a fully automated medical image classification system.