Feature Importance for Human Epithelial (HEp-2) Cell Image Classification †

Abstract: Indirect immunofluorescence (IIF) microscopy imaging of human epithelial (HEp-2) cells is a popular method for diagnosing autoimmune diseases. Considering the large data volumes involved, computer-aided diagnosis (CAD) systems based on image classification can help in terms of time, effort, and reliability of diagnosis. Such approaches rely on extracting representative features from the images. This work explores the selection of the most distinctive features for HEp-2 cell images using various feature selection (FS) methods. Considering that there is no single universally optimal feature selection technique, we also propose a hybridization of one class of FS methods (filter methods). Furthermore, the notion of variable importance for ranking features, provided by another type of approach (embedded methods such as the random forest and the random uniform forest), is exploited to select a good subset of features from a large set, such that the addition of new features does not increase classification accuracy. In this work, we have also carefully designed class-specific features to capture the morphological visual traits of the cell patterns. We perform various experiments and discussions to demonstrate the effectiveness of the FS methods along with the proposed and a standard feature set. We achieve state-of-the-art performance even with a small number of features obtained after feature selection.


Introduction
Antinuclear antibody (ANA) detection with an HEp-2 substrate is used as a standard test to reveal the presence of autoimmune antibodies. If antibodies are present in the blood serum of the patient, their presence manifests in distinct nuclear staining patterns of fluorescence on the HEp-2 cells [1]. Owing to the higher sensitivity of HEp-2 cell lines, ANA determination by indirect immunofluorescence (IIF) is arguably the most popular initial screening test for suspected autoimmune diseases (such as myasthenia gravis, pernicious anemia, reactive arthritis, and Sjogren syndrome) [2]. The key step of the ANA HEp-2 medical test is the interpretation of the obtained stained patterns of HEp-2 cells for establishing a correct diagnosis. The ANA HEp-2 test produces diverse staining patterns due to the staining of different cell regions or domains. These domains, such as the nucleoli, nucleus, cytosol, and chromosomes, differ in size, shape, number, and localization inside the cell. This allows interpreters to distinguish staining patterns characteristic of different autoimmune diseases. Classifying these patterns using computer-aided diagnosis (CAD) systems has important clinical applications, as manual evaluation demands long hours, causes fatigue, and is also highly subjective. As a second-opinion system, a CAD system reduces the workload of specialists, contributing to both diagnostic efficiency and cost reduction.
As such, many different staining patterns (approximately 30) have been reported [3,4], on the basis of the formation of molecular complexes at different sites. However, it is observed that only a few of these staining patterns are clinically significant, and these are often detected by indirect immunofluorescence (IIF) microscopy on HEp-2 cells in the sera of patients with autoimmune disease [5]. Perhaps this is the reason why a standard publicly available dataset [6] also provides images only with respect to six patterns, viz. Homogeneous, Speckled, Nucleolar, Centromere, Nuclear Membrane, and Golgi. In our work, we use this dataset. Figure 1 depicts examples of these fluorescence staining patterns, where the task is to classify a given test cell image into one of these six classes. The characteristic definitions of the fluorescence patterns, based on their visual traits, can be stated as follows [3,7]:
1. Homogeneous: A uniformly diffused fluorescence covering the entire nucleoplasm, sometimes accentuated in the nuclear periphery.
2. Speckled: This pattern can be explained under two categories (which, however, are not separately labeled in the dataset):
   - Coarse speckled: Densely distributed, various-sized speckles, generally associated with larger speckles, throughout the nucleoplasm of interphase cells; nucleoli are negative.
   - Fine speckled: Fine speckled staining in a uniform distribution, sometimes so dense that an almost homogeneous pattern is attained; nucleoli may be positive or negative.
3. Nucleolar: Characterized by clustered large granules in the nucleoli of interphase cells which tend towards homogeneity, with fewer than six granules per cell.
4. Centromere: Characterized by several discrete speckles distributed throughout the interphase nuclei.
5. Nuclear Membrane: A smooth homogeneous ring-like fluorescence at the nuclear membrane.
6. Golgi: Staining of a polar organelle adjacent to and partly surrounding the nucleus, composed of irregular large granules. Nuclei and nucleoli are negative.
In addition to the classification of the above-depicted interphase staining patterns, the diagnostic procedure based on ANA IIF images also involves an initial step of mitotic cell recognition. The mitotic cell recognition step is important for two reasons. First, the presence of at least one mitotic cell indicates a well-prepared sample [8]. While mitotic detection (a binary classification problem) is important, as shown in some recent works [9,10], in this work we focus on the multi-class interphase cell classification problem, which is also, in itself, important from the perspective of autoimmune disease diagnosis. Indeed, the importance of the interphase classification problem is indicated by the presence of the relatively large dataset that we use in this work [6], which contains only cell images of interphase patterns.
Feature extraction plays an important role in the classification of staining patterns. It is a process that transforms the raw input data into a set of features. The generated feature set encodes numerous important and relevant characteristics and distinctive properties of the raw data that help in differentiating between the categories of input patterns. To represent such characteristics for HEp-2 cell images, various automated methods have been developed, which typically compute a large number of standard image-based features (e.g., morphology, texture, and other sophisticated features like the scale-invariant feature transform (SIFT) [11], speeded-up robust features (SURF) [11], local binary patterns (LBP) [12], histograms of oriented gradients (HOG), etc.).
We note that variations existing within the input patterns of cell images are typically reflected in visual traits of morphology or texture. We can notice many features within the staining patterns which have various distinct morphological traits, such as differences in the area of foreground (white) pixels, the size of granules (objects), the number of granules (objects), etc. In this work, we have employed some class-specific features, formulated in one of our previous works [13], and some texture features [14] to represent such visual traits in order to discriminate the classes of HEp-2 cells.

Feature Selection
Among a large number of extracted features, it often turns out that not all features are relevant for discrimination. If carefully chosen, a reduced feature set can perform the desired task. Thus, the task of feature selection (FS) aims to find a feature subset that can describe the data for a learning task more compactly than the original set. As indicated above, since there can be numerous features to characterize cell morphology and texture in the present task, we have primarily focused on feature selection among these.
Feature selection can induce the following benefits [15,16]: (1) improved classification performance due to the removal of unreliable features that can negatively affect performance; (2) reduced dimensionality of the feature space, leading to lower test-phase computation and storage; (3) removal of redundant features, which often leads to better generalization towards new samples; (4) better interpretability of the classification problem, considering a smaller number of features.
Feature selection algorithms can be categorized into two groups, i.e., supervised and unsupervised feature selection [15]. Supervised feature selection algorithms are based on using the discriminative information enclosed in labels. Below, we discuss some of these in brief:

• Filtering methods use statistical properties of features to assign a score to each feature. Some examples are: information gain, the chi-square test [17], Fisher score, correlation coefficient, variance threshold, etc. These methods are computationally efficient but do not consider dependencies among the feature variables or their interaction with the classifier.

• Wrapper methods explore the feature space to find an optimal subset by selecting features based on classifier performance. These methods are computationally expensive due to the need to train and cross-validate a model for each feature subset. Some examples are: recursive feature elimination, sequential feature selection [18] algorithms, genetic algorithms, etc.

• Embedded methods perform feature selection as part of the learning procedure, and are generally specific to a given classifier framework. Such approaches are more elegant than filtering methods, and are computationally less intensive and less prone to over-fitting than wrapper methods. Some examples are: random forests [19] for feature ranking, recursive feature elimination (RFE) with SVM, AdaBoost for feature selection, etc.
In the unsupervised scenario, however, no label information is used, and feature selection involves approaches such as clustering, spectral regression, exploitation of feature correlations, etc. Various unsupervised feature selection methods have been reported in the literature. Some of these, which we consider, are: Infinite Feature Selection (InFS) [20], Regularized Discriminative Feature Selection (UDFS) [21], and Multi-Cluster Feature Selection (MCFS) [22]. Note that a reduced representation may also be obtained by dimensionality reduction techniques that are based on projecting features into a subspace (e.g., principal component analysis) or on compression (e.g., using information theory). However, unlike these, feature selection techniques do not alter the original representation of the variables, but merely select a subset of them. Hence, they preserve the original semantics of the features and retain the advantage of interpretability.

Scope of the Work
The primary focus of this paper is to consider the effectiveness of different feature selection (FS) methods from the perspective of HEp-2 cell image classification. This work involves various traditional and contemporary feature selection methods from the categories mentioned above, including the chi-square test, t-test, information gain, sequential forward search (SFS), random subset feature selection (RSFS), and variants of random forests. This study aims to explore the feature selection aspect, considering the possibility of improving the classification process over the case which employs the complete set of baseline features. We consider both supervised and unsupervised feature selection methods.
Knowing that there is often no single universally optimal feature selection technique [23], it is useful to consider the hybridization of several FS methods, which can yield a more compact feature subset. This aspect is inspired by ensemble learning, where, in a similar spirit, various classifiers are integrated to obtain a stronger classifier.
Motivated by this, we propose a hybridization approach for one class of FS methods, viz. the filter-based FS methods. We consider only filter methods for hybridization, as these methods are potentially ad hoc, i.e., they provide a feature ranking, but the selection of an optimal set of features is decided by an empirical threshold on the feature ranking (score). The hybridization combines various filter methods to obtain a more robust feature ranking. The problem of ad hoc threshold selection is also addressed by the hybridization approach, as it automatically selects the eventual feature subset.
In addition, as discussed above, embedded algorithms provide feature importance through the internal mechanisms of the classification algorithm. For example, decision trees and their variants, such as random forests [19] and random uniform forests [24], perform recursive segregation of the given information by focusing on the features which have more potential to discriminate between classes. These methods are computationally less intensive and much less prone to over-fitting. In this work, we demonstrate the ability of the random forest (RF) and the random uniform forest (RUF) to select an important discriminative feature set, and thereby examine the contribution of each feature for the application of HEp-2 cell image classification. Unlike existing works on HEp-2 cell classification, we stress the feature selection aspect of the random forest. Moreover, we also propose to utilize another variant of the random forest (viz. the random uniform forest) for this work, which yields superior performance over the standard random forest.
The choice of RF for this study is due to its various advantages over other classifiers: (1) From the feature selection perspective, RF inherently provides the notion of variable importance together with classification accuracy, which indicates how much the accuracy will degrade when a particular feature is removed from the feature set. By choosing the features with higher ranks, an optimal subset can be found. (2) It has fewer parameters to tune (the number and depth of trees), as compared to ensemble structures with popular classifiers (e.g., SVMs, neural networks), where the number of classifiers, the choice of kernel and its parameters, the cost function, the number of hidden layers, the number of nodes in the hidden layers, etc., require tuning. In this work, we only tune the number of trees. (3) For a large dataset, it can be parallelized, and it can easily handle uneven datasets that have missing variables and mixed data (although this is not relevant in the current work). (4) It is known to perform well across various domains.
In addition to our study on feature selection, we focus on using simple feature definitions and limited training data (40%), and demonstrate state-of-the-art performance with the same. We believe that such a direction of feature selection is interesting, given that this is an emerging research area.

Summary of contributions:
(1) Exploring various feature selection (FS) techniques for the task of HEp-2 cell image classification; (2) a hybridization technique to combine filter-based FS techniques to automatically select a feature subset; (3) utilization of random forests and random uniform forests for feature selection for HEp-2 cell image classification; (4) employing simple and visually more interpretable feature definitions, yielding state-of-the-art performance; (5) experimental analysis including: (a) demonstration of performance using low-dimensional and high-dimensional feature sets, (b) an insightful discussion on the performance of various FS methods, and (c) a positive comparison with state-of-the-art methods.
This paper is a significantly extended version of our earlier work [25] published in the proceedings of the Int. Conf. on Medical Image Understanding and Analysis (MIUA) 2017. In [25], we reported only random-forest-based feature selection for the classification of HEp-2 cell images. The present work explores various feature selection methods for this classification task, and provides insightful discussions and comparisons.

Related Work
This section discusses some previous work, including state-of-the-art methods, aiming to automate the IIF diagnostic procedure in the context of ANA testing. Perner et al. [26] presented an early attempt in this direction. They introduced various sets of features which were extracted from the cell region using multilevel gray thresholding, and then used a data mining algorithm to find the relevant features among large feature sets. Huang et al. [27] utilized 14 textural and statistical features along with a self-organizing map to classify the fluorescence patterns, whereas in [28] learning vector quantization (LVQ) with eight textural features was used to identify the fluorescence pattern. The methods discussed above were evaluated on private datasets, which makes performance comparison very difficult. Recently, the contests on HEp-2 cell classification [3,29], held in conjunction with ICPR-2012, ICIP-2013, and ICPR-2014, released public datasets which are suitable for the evaluation of methods as part of the relevant contests. These contests have given a strong impetus to the development of algorithms for the HEp-2 cell image classification task. Since the ICPR-2014 dataset (the same as the ICIP-2013 dataset) is the most recent and of much larger scale than that of ICPR-2012, we use this dataset for validation and comparison of our framework. This is similar to other works such as [11,30-34], which also use the same dataset. More recent reviews of this topic were provided by [6,35].
In this context, several new approaches were proposed for cell classification and evaluated on a common basis.
The reader can refer to [36] for an overview of the literature and of the open challenges in this research area.Below, we discuss some specific works in terms of features and classifiers.
While the area of HEp-2 cell image classification has been somewhat explored, to the best of our knowledge, such a study on feature selection, yielding good classification performance, has not been reported. The random forest has previously been utilized only as a classifier for this problem. Prasath et al. [30] utilized texture features such as rotation-invariant co-occurrence (RIC) versions of the well-known local binary pattern (LBP), the median binary pattern (MBP), the joint adaptive median binary pattern (JAMBP), and the motif co-occurrence matrix (MCM), along with other optimized features, and reported mean class accuracy using different classifiers such as the k-nearest neighbors (kNN), support vector machine (SVM), and random forest (RF). The highest accuracy was obtained with RIC-LBP combined with a RIC variant of MCM. In [14], the authors utilized a large pool of features and used a random forest as the classifier. They achieved good accuracy, but the work uses the ICPR-2012 dataset, which is much smaller than the one we use (ICIP-2013/ICPR-2014) for validation, and employs leave-one-out cross-validation, thus involving a large amount of training data. Bran et al. [31] made use of the wavelet scattering network, which gives rotation-invariant wavelet coefficients as representations of cell images. The work in [32] extracted various features (shape and size, texture, statistical, LBP, HOG, and boundary features) and used the random forest for classification. Note that the above-discussed methods employ random forests, as done in a part of this work. However, the random forest is used only as a classifier, and the feature selection aspect is not considered.
In [11,33,34], the authors achieved state-of-the-art performance for HEp-2 cell image classification. In [11], a bag-of-words model based on sparse coding was proposed; the authors used scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) with sparse coding and max pooling. Siyamalan et al. [33] presented an ensemble of SVMs based on sparse encoding of texture features with cell pyramids, capturing spatial, multi-scale structure for HEp-2 cell classification, and reported mean accuracy. Diego et al. [37] utilized a biologically-inspired dense local descriptor for the characterization of cell images. In [34], the authors utilized a deep convolutional neural network for cell classification. All the above-discussed methods were benchmarked on the ICIP-2013/ICPR-2014 dataset.
The feature selection ability of RF has been reported in some other domains, such as gene selection [38], breast cancer diagnosis [39,40], and the analysis of radar data [41], which inspires us to explore it for the problem of HEp-2 cell image classification. In [42,43], the authors provided surveys on the importance of feature selection, reviews of feature selection methods, and their utilization in various fields.
Most of the feature selection methods considered here have been applied in different domains. However, the direction of feature selection has not been explored for the cell classification problem. We believe that a reduction in the number of features helps in improving the performance of a CAD system, allowing faster and more cost-effective models, while providing a better understanding of the inherent regularities in the data.
In the context of combinations of FS methods, several schemes have been proposed that produce either one score or a feature subset. In [44], the authors merged three different feature selection methods, i.e., principal component analysis (PCA), genetic algorithms (GA), and decision trees (CART), based on union, intersection, and multi-intersection approaches, to examine their prediction accuracy and errors for predicting stock prices in the financial market. Abdulmohsen et al. [45] combined five feature selection methods, namely chi-squared (CHI), information gain (IG), relevancy score (RS), the Galavotti-Sebastiani-Simi (GSS) coefficient, and the Ng-Goh-Low (NGL) coefficient, using intersection and union approaches, considering the top 50, 100, and 200 ranked features for Arabic text classification. Rajab et al. [46] generated a new score by combining two feature selection methods (IG, CHI) for website phishing classification. In [47], the authors compared the performance of various feature selection methods and their combinations for the classification of speaker likability. Multi-criteria ranking of features was proposed by Doan [48] for text characterization.

Methodology
We divide this section into four parts, in which we discuss the details of the filter, wrapper, embedded, and unsupervised methods that we consider for feature selection.

Filter Methods and Their Hybridization
This subsection discusses the four filter methods chosen for this work, and the proposed hybridization approach in detail. Figure 2 shows the architecture of the proposed scheme.

• Chi-square [17]: The chi-square test is a statistical test that is applied to test the independence of two variables. In the context of feature selection, it is analogous to hypothesis testing on the distributions of the class labels (target) and the feature values. The greater the chi-square score of a feature, the more dependent that feature is on the class variable, and hence the more relevant it is for classification.

• Information gain (IG) [49]: computes how much information a feature gives about the class label. Typically, it is estimated as the difference between the class entropy and the conditional entropy in the presence of the feature:
IG(C, A) = H(C) − H(C | A), where C is the class variable, A is the attribute variable, and H(·) is the entropy. Features with a higher IG score are ranked higher than features with lower scores.
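As a concrete illustration, the information gain of a discrete feature can be computed directly from the entropy definitions above. The following is a minimal sketch (binning of continuous features is omitted):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(C) of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(C, A) = H(C) - H(C|A) for a discrete feature A."""
    h_c = entropy(labels)
    h_c_given_a = 0.0
    for v in np.unique(feature):
        mask = feature == v
        # Weight the entropy of each feature-value group by its frequency
        h_c_given_a += mask.mean() * entropy(labels[mask])
    return h_c - h_c_given_a

# Toy example: a feature that perfectly separates two classes
labels = np.array([0, 0, 1, 1])
feature = np.array([5, 5, 9, 9])
print(information_gain(feature, labels))  # -> 1.0 (one full bit of information)
```

A feature that is independent of the class would score close to zero under the same computation.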

• Statistical dependency (SD) [47]: measures the dependency between the features and their associated class labels. The larger the dependency, the higher the feature score.
The involved filter methods are univariate, i.e., each feature is considered separately, without involving any feature dependencies. These methods have various advantages over other methods: (1) they can be easily scaled to high-dimensional datasets; (2) they are computationally simple and fast; (3) they are independent of the learning algorithm, i.e., not specific to a given learning algorithm. Moreover, feature selection needs to be performed only once.
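As an illustration of such filter-based scoring, the sketch below uses scikit-learn (an assumed tool; any implementation of these statistics would serve) to rank features by their chi-square score and retain the top k. The data here are random placeholders, not actual HEp-2 features:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((100, 8))      # 100 cells x 8 non-negative feature values
y = rng.integers(0, 6, 100)   # six HEp-2 classes (placeholder labels)

# Chi-square scores per feature (chi2 requires non-negative features)
scores, _ = chi2(X, y)
ranking = np.argsort(scores)[::-1]   # rank-sorted feature indices, best first

# Keep only the 4 best-scoring features
X_reduced = SelectKBest(chi2, k=4).fit_transform(X, y)
print(X_reduced.shape)  # -> (100, 4)
```

The score computation is done once, independently of any downstream classifier, which is exactly the property noted above.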

Hybridization Process
This subsection describes the hybridization process, which involves various filter methods that provide feature rankings based on different criteria.
We believe that by hybridizing various filter methods, a feature set which has enough variation and yet is representative enough can be obtained automatically. Our proposed scheme of hybridization involves iteratively choosing common features from the rank-sorted sets produced by each of the filter methods. The procedure is described below:
1. We recursively divide the sorted feature set from each of the N filter methods into halves based on ranking. For example, if there are 128 features in total, then we first process the top 64 features, followed by 32, 16, 8, 4, 2, and 1.
2. Processing the first half: (1) An initial selected feature set contains the features that are common to the first halves of the rank-sorted features from all N methods. (2) We then keep adding to this list the common features from among the first halves of the rank-sorted lists of N − 1 methods, if the addition improves the accuracy. We follow this process with N − 2, N − 3, . . ., 2 methods. Note that at each stage, we consider the common features from all combinations of N − 1, N − 2, . . ., 2 methods.
3. Processing lower partitions: The above process is then carried out for the lower partitions and sub-partitions, recursively. Note that as one proceeds lower in the partitions, the number of features added to the final feature set reduces, as many do not contribute to an increase in accuracy. At each evaluation, the improvement in accuracy is computed on a training and validation set by training a support vector machine (SVM).
Figure 2 shows the architecture of the hybridization process, and the algorithm is illustrated in Algorithm 1. We note that this is an effective way of selecting a 'good' feature set without manually setting an empirical threshold on the rank to determine the number of features to be selected. Moreover, it also suggests an approach to integrate the rankings provided by different filter methods. As the proposed hybridization approach employs a validation-based approach using a classifier, one can argue that it is similar to a wrapper FS method. However, the proposed method differs from typical wrapper methods as follows: (1) It uses already-ranked features to arrive at the final feature subset. Hence, it requires far fewer iterations than wrapper methods, which involve an exhaustive or random search considering individual features. (2) Wrapper methods stop adding features when no increment in accuracy is found. Therefore, to proceed further, there is a need to define a parameter which determines the number of features that can be added to the selected feature set without improvement.
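A simplified sketch of the hybridization idea is given below. It captures only the intersect-and-validate core (top halves of all N rankings, then intersections over every combination of fewer methods, then smaller partitions) and omits the full bookkeeping of Algorithm 1; `evaluate` stands in for the SVM-based validation accuracy:

```python
from itertools import combinations

def hybridize(rankings, evaluate, n_features):
    """Simplified sketch of the proposed hybridization scheme.

    rankings   : list of rank-sorted feature-index lists, one per filter method
    evaluate   : callback returning validation accuracy for a feature subset
    n_features : total number of features
    """
    selected, best_acc = set(), 0.0
    half = n_features // 2
    while half >= 1:
        tops = [set(r[:half]) for r in rankings]   # top partition of each ranking
        # Features common to all N methods seed the candidates at this level,
        # followed by intersections over combinations of N-1, ..., 2 methods.
        candidates = [set.intersection(*tops)]
        for k in range(len(tops) - 1, 1, -1):
            for combo in combinations(tops, k):
                candidates.append(set.intersection(*combo))
        for cand in candidates:
            trial = selected | cand
            acc = evaluate(sorted(trial))
            if acc > best_acc:                     # keep additions only if accuracy improves
                selected, best_acc = trial, acc
        half //= 2                                 # recurse into the smaller partition
    return sorted(selected)
```

Because the candidates come from already-ranked lists, the number of accuracy evaluations stays small compared to a wrapper-style search over individual features, which is the point made above.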
To compare the proposed hybridization approach with existing hybrid approaches, we implement the method of [46], which was used for website phishing classification. In this method, the authors combine two filter methods (IG and chi-square). First, they normalize both feature scores to make them comparable, and then calculate the root mean square of the score values.
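A minimal sketch of this comparison baseline follows; min-max normalization is assumed here for the "normalize both feature scores" step, as the exact normalization of [46] is not detailed in this paper:

```python
import numpy as np

def combined_score(ig_scores, chi_scores):
    """Fuse IG and chi-square scores by root mean square, in the spirit of the
    two-filter combination of Rajab et al. [46] (normalization scheme assumed)."""
    def minmax(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min())  # assumes non-constant scores
    ig, chi = minmax(ig_scores), minmax(chi_scores)
    return np.sqrt((ig ** 2 + chi ** 2) / 2.0)

# Three features scored by IG and by chi-square on different scales
print(combined_score([0.1, 0.5, 0.9], [10.0, 30.0, 20.0]))
```

Features can then be ranked by this single fused score instead of by either filter alone.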
Algorithm 1 For finding feature subset using hybridization

Wrapper Methods
Wrapper methods embed the model hypothesis (predictor performance) to find feature subsets. The space of feature subsets grows exponentially with the number of features, which increases both the computational cost and the risk of over-fitting.

• Sequential Forward Selection (SFS) [18]: This is an iterative method in which we start with an empty set. In each iteration, we add the feature (one at a time) which best improves the model, until the addition of a new variable no longer improves the performance of the model. It is a greedy search algorithm, as it always includes or excludes a feature based on classification performance. The contribution of including or excluding a new feature is measured with respect to the set of previously chosen features, using a hill-climbing scheme, in order to optimize the criterion function (classification performance).
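In practice, SFS is available off-the-shelf; the sketch below uses scikit-learn's `SequentialFeatureSelector` (an assumption about tooling, with synthetic data standing in for cell features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=10,
                           n_informative=4, random_state=0)

# Greedy forward search: starting from the empty set, add one feature at a
# time, keeping the addition that most improves cross-validated accuracy.
sfs = SequentialFeatureSelector(SVC(kernel="linear"),
                                n_features_to_select=4,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(np.flatnonzero(sfs.get_support()))  # indices of the selected features
```

Here a fixed target size is used; in recent scikit-learn versions, `n_features_to_select="auto"` with a `tol` threshold can approximate the stop-when-no-improvement behavior described above.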

• Random Subset Feature Selection (RSFS) [50]: In this method, at each step, a random subset of features from all possible feature subsets is chosen and evaluated. The relevance of the participating features keeps adjusting according to the model performance. As more iterations are performed, the more relevant features obtain higher scores, and the quality of the feature set gradually improves. Unlike SFS, where the feature gain is estimated directly by including or excluding it from an existing set of features, in RSFS each feature is evaluated in terms of its average usefulness in the context of feature combinations. This method is not as susceptible to local optima as SFS.

Embedded Methods and Feature Selection Procedure
In the following subsections, we discuss the embedded methods considered in this work, viz. the random forest and its variant, the random uniform forest, for measuring feature (variable) importance, and the feature selection procedure which we employ for both methods.

Random Forest
Random forest [19] is a collection of decision trees, where each tree is constructed using a different bootstrap sample of the original data. At each node of each tree, a random feature subset of fixed size is chosen, and the feature which yields the maximum decrease in the Gini index is chosen for the split [19]. In the forest, trees are grown fully and left unpruned. About one-third of the samples, called out-of-bag (OOB) samples, are left out of the bootstrap sample and used as validation samples to estimate the error rate. To predict the class of a new sample, the votes received by each class are counted, and the class with the majority of the votes is assigned to the new sample. Figure 3 shows the general structure of a random forest. The random forest offers two different measures to gauge the importance of features: the variable importance (VI) and the Gini importance (GI) [19]. The average decrease in model accuracy on the OOB samples when a specific feature is randomly permuted gives the VI of that feature. The random permutation essentially indicates the replacement of that feature with a randomly modified version of it. VI shows how much the model fit decreases when we permute a variable; the greater the decrease, the more significant the variable. For a variable X_j, the VI is calculated as follows: (1) calculate the model accuracy without permutation using the OOB samples; (2) randomly permute the variable X_j; (3) calculate the model accuracy with the permuted variable together with the remaining non-permuted variables; (4) the variable importance is obtained by taking the difference of the accuracies before and after permutation, averaged over all trees.
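The permutation-based VI described above can be sketched with scikit-learn (an assumption about tooling; note that scikit-learn's `permutation_importance` permutes on a held-out set rather than on each tree's own OOB samples, a slight deviation from the per-tree OOB procedure described here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Unpruned trees on bootstrap samples; OOB samples give a free error estimate
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
print(f"OOB accuracy: {rf.oob_score_:.3f}")

# Permutation VI: accuracy drop when one feature at a time is shuffled
vi = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
ranking = np.argsort(vi.importances_mean)[::-1]
print("Features ranked by VI:", ranking)
```

Features whose permutation barely changes the accuracy end up at the bottom of the ranking and are candidates for removal.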
The Gini importance (GI) indicates the overall explanatory power of the variables. The GI uses the drop in the Gini index (impurity) as a measure of feature relevance. However, GI is a biased estimate when features vary in their scale of measurement [51]. Owing to this, VI is more reliable for feature selection, particularly when subsampling without replacement is used instead of bootstrap sampling to construct the forest. Therefore, we consider the VI measure for variable ranking in this paper.

Random Uniform Forest (RUF)
An RUF is an ensemble of random uniform decision trees, which are unpruned, binary, random decision trees. The random cut-points in an RUF, used to partition each node, are generated assuming a uniform distribution for each candidate variable.
An important purpose of the algorithm is to obtain trees which are less correlated, to allow a better analysis of variable importance. For more details, please refer to [24].
Like RF, RUF also provides a measure of variable importance for the assessment of features. However, in this case, the procedure for its estimation is decomposed into two steps: (1) the first is the same as in RF, where the VI is computed for all features; (2) in the second step, the relative influence of each feature is calculated. Thus, it is termed a global variable importance measure. Each variable has an equal chance of being selected, but it gains importance only if it is the one that decreases the overall entropy at each node. We rank the features considering the variable importance from RF and RUF. We then examine how the overall accuracy varies when sets of top-ranked features are chosen, and how many features are actually required to achieve an accuracy equal to (or, in some cases, even greater than) the accuracy obtained with all features. In the results section, we briefly discuss the impact of the top-ranked features on accuracy using two different feature sets.

Feature Selection Procedure
This subsection describes the procedure we employ to seek a subset of features based on the variable importance. The same procedure is followed for both RF and RUF.
1. Training: Random forest (RF) internally divides the training data in a 67-33% ratio for each tree at random, where 67% is used to train the model while the remaining 33% (out of bag) is used for testing (cross-validation). The OOB error rate is calculated using this 33% of the data only. The OOB error rate is generally used to automatically decide the values of RF parameters such as the number of trees and the number of features used at each node [52]. In this work, we only decide the number of trees this way.
For the number of features at each node, we employ the typical default value (viz., the square root of the total number of features). Figure 4 illustrates the variation in OOB error rate with the number of trees. There are seven lines in the graph: six correspond to the individual classes, and the black line corresponds to the mean OOB error rate over all classes. The point beyond which the OOB error decreases negligibly can be considered a good point at which to fix the number of trees [52]. 2. Feature selection: Using the variable importance (VI) provided by RF (after cross-validation), a subset of good features is chosen to test the remaining 30% of the data. The procedure for selecting a good feature subset is as follows: (a) features are tested on the validation data in increments of one, in decreasing order of VI; (b) after some point, the addition of more features does not increase the classification accuracy further; (c) as there is no significant improvement after this point, the features up to this point are taken as the good feature subset. This feature subset gives accuracy equal to that obtained using all features.
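The two-step procedure above (fixing the number of trees via the OOB error, then growing a top-ranked subset until accuracy saturates) can be sketched as follows. The data, the candidate tree counts, and the matching tolerance are illustrative assumptions, not the paper's exact settings, and the impurity-based importance is used here for brevity in place of the permutation VI.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=6, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 1: pick the number of trees where the OOB error stops improving.
oob_err = {n: 1.0 - RandomForestClassifier(n_estimators=n, oob_score=True,
                                           random_state=1).fit(X_tr, y_tr).oob_score_
           for n in (25, 50, 100)}
n_trees = min(oob_err, key=oob_err.get)

# Step 2: rank features by importance, then add them one by one until the
# reduced subset matches the all-features accuracy (within a small tolerance).
rf = RandomForestClassifier(n_estimators=n_trees, random_state=1).fit(X_tr, y_tr)
order = np.argsort(rf.feature_importances_)[::-1]
full_acc = rf.score(X_te, y_te)
subset = order  # fallback: all features
for k in range(1, len(order) + 1):
    cols = order[:k]
    acc = RandomForestClassifier(n_estimators=n_trees, random_state=1)\
        .fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
    if acc >= full_acc - 0.01:
        subset = cols
        break
```

The loop typically stops well before all features are added, which is exactly the "negligible improvement" point used to fix the subset size.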

Unsupervised Feature Selection
• Infinite Feature Selection (InFS) [20]: It is a graph-based method. Each feature is a node in the graph, a path is a selection of features, and the higher the centrality score of a feature, the more important (or more different) the feature.

• Regularized Discriminative Feature Selection (UDFS) [21]: It is an L2,1-norm regularized discriminative feature selection method, which simultaneously exploits discriminative information and feature correlations. It selects the most discriminative feature subset from the whole feature set in batch mode.

• Multi-Cluster Feature Selection (MCFS) [22]: It selects the features by which the multi-cluster structure of the data can be best preserved. Using spectral analysis, it suggests a way to measure the correlations between different features without label information. Hence, it can handle data with a multiple-cluster structure well.
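As a contrast to the supervised methods, the following toy sketch ranks features without using labels. It uses plain variance as a deliberately simple stand-in criterion; the actual InFS/UDFS/MCFS criteria are graph-, regularization-, and spectral-analysis-based and considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
# Inject cluster structure into the first three features only: four groups
# of 50 samples, shifted by 0, 1, 2, 3 respectively.
X[:, :3] += np.repeat(np.arange(4), 50)[:, None]

# Unsupervised ranking: score each feature by its variance over the data.
scores = X.var(axis=0)
ranking = np.argsort(scores)[::-1]
```

Features 0-2 rank first because the between-cluster spread inflates their variance; criteria such as MCFS refine this idea by scoring how well each feature preserves the cluster structure found by spectral analysis.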

Dataset and Feature Extraction
This subsection discusses the dataset used in this work and the features we employ for the study. Indeed, one type of feature that we employ is closely related to the visual traits of cell organelles observed in the images.

Dataset
In this work, we use the publicly available dataset from the ICPR 2014 HEp-2 cell image classification contest, which comprises more than 13,000 cell images [7].
The dataset consists of six classes, termed: Homogeneous (H), Speckled (S), Nucleolar (N), Centromere (C), Nuclear Membrane (NM), and Golgi (G). Each class consists of positive and intermediate images; the intermediate images are generally lower in contrast than the positive images. The dataset also includes mask images which specify the region of interest of each cell image.

Feature Extraction
One of our intentions in this work is to demonstrate that even simplistic feature definitions (unlike more sophisticated ones such as SIFT, HOG, SURF, etc.) can yield good classification performance. Thus, we define two types of features. The first type, which we term 'class-specific features', was proposed in one of our earlier works [13], wherein we explicitly represent some semantic morphological aspects of the cell. For the second type, we choose some standard texture descriptors, which yield only scalar values and are thus quite simplistic.
1. Class-specific features: Motivated by expert knowledge [9], which characterizes each class by some unique morphological traits, we define features based on such traits. As the location of these traits may differ per class, features are extracted from class-specific regions of interest (ROI), computed using the mask images. Figure 6 shows some of the unique traits of each class. For example, in the NM class, useful information can be found in a ring centered on the cell boundary, and the features for the NM class are extracted utilizing such visually observed traits. We only list these features in Table 1, and refer the reader to our previous work [13], which contains a more elaborate discussion. These features involve simple image processing operations such as scalar image enhancement, thresholding, connected components, and morphological operations. They are scalar-valued, simple, efficient, and more interpretable. As listed in Table 1, a total of 18 feature definitions are extracted across all classes; using various combinations of threshold and enhancement parameters, a total of 128 features are obtained.
2. Traditional scalar texture features [14]: These include morphology descriptors (e.g., number of objects, area, area of the convex hull, eccentricity, Euler number, perimeter) and texture descriptors (e.g., intensity, standard deviation, entropy, range, GLCM), extracted at 20 intensity thresholds equally spaced from the minimum to the maximum intensity of the image. For a detailed description of these features, one can refer to [14]; note, however, that in [14] these features were applied to a much smaller dataset. Considering various parameter variations, a much larger set of 932 scalar-valued features is obtained in this case.
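A minimal sketch of this threshold-sweep idea is given below. The specific descriptors (object count, foreground area, intensity spread) and the synthetic image are illustrative only, not the full 932-feature set of [14].

```python
import numpy as np
from scipy import ndimage

def threshold_features(img, n_thresh=20):
    """A few scalar morphology/texture descriptors of the binarized image,
    swept over n_thresh intensity thresholds between min and max."""
    feats = []
    for t in np.linspace(img.min(), img.max(), n_thresh + 2)[1:-1]:
        mask = img > t
        _, n_obj = ndimage.label(mask)        # number of connected objects
        area = int(mask.sum())                # foreground area in pixels
        spread = float(img[mask].std()) if mask.any() else 0.0
        feats += [n_obj, area, spread]
    return np.asarray(feats)

rng = np.random.default_rng(0)
cell = rng.random((64, 64))    # stand-in for a masked HEp-2 cell image
fv = threshold_features(cell)  # 20 thresholds x 3 descriptors = 60 values
```

Varying the descriptor list and the enhancement applied before thresholding is what multiplies a handful of definitions into a feature vector of several hundred entries.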

Results and Discussion
In this section, we discuss the experimental protocols and results, compare the different FS methods, and finally provide a comparison with state-of-the-art approaches.

Training and Testing Protocols
In this section, we discuss the experimental results obtained using the two feature sets described above. We divide the given data into three parts: training (40%), validation (30%), and testing (30%) (Experimental protocol 1). All the experiments are run for four random trials, and average accuracies are reported. The state-of-the-art methods differ in their experimental protocols; hence, we also consider an experimental setup in which the training, validation, and testing ratios are defined in the same manner as the methods that yield the highest performance. This is a 5-fold cross-validation protocol where the training-validation data consists of an 80% portion and the test data of 20% (Experimental protocol 2). Note that protocol 1 involves less training data than the best performing approaches (protocol 2); this is motivated by the fact that in realistic applications the amount of training data is low. We first provide and discuss results with protocol 1 for all the approaches, before providing results for protocol 2.

Evaluation Metrics
Various metrics can be utilized to evaluate performance. In the paper [7] associated with the dataset used in this work, mean class accuracy (MCA) was used to evaluate performance. Here, we utilize mean class accuracy, and in addition we employ overall accuracy and false positive rate as evaluation measures, for a fair comparison with other state-of-the-art methods [11,30-34] which use the same metrics.
The MCA is given as:

MCA = (1/N) Σ_{i=1}^{N} CC(i)

where CC(i) and N_i are the classification accuracy and the total number of samples of class i, respectively, and N is the total number of classes (here, 6). False Positive (FP): the proportion of negative samples that are incorrectly classified as positive, given for class i as:

FP(i) = WC(i) / FS_i
where WC(i) is the number of samples wrongly classified as class i, and FS_i is the number of false (negative) samples for class i (for example, for class 1, the samples of the other classes 2, 3, 4, 5, and 6 are the false samples).
The overall accuracy (OA), i.e., the overall correct classification rate, is also calculated for ease of comparison, and is given as:

OA = Σ_{i=1}^{N} CC(i) · N_i / Σ_{i=1}^{N} N_i

To follow the tables below, the notations A_h, N_h, A_s, and N_s are defined as: • A_h: Highest accuracy with respect to the number of selected features.

• N_h: Number of features that yield the highest accuracy.

• A_s: Accuracy considering all features.
• N_s: Number of features that match the accuracy considering all features.
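The three metrics defined above can be computed from a confusion matrix as sketched below (`conf[i, j]` counts class-i samples predicted as class j). Reading WC(i) as "negative samples wrongly assigned to class i" is our interpretation of the definition above.

```python
import numpy as np

def hep2_metrics(conf):
    """MCA, OA, and per-class false positive rate from a confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    cc = conf.diagonal() / conf.sum(axis=1)    # CC(i): per-class accuracy
    mca = cc.mean()                            # mean class accuracy
    oa = conf.diagonal().sum() / conf.sum()    # overall accuracy
    # FP(i) = WC(i) / FS_i: negatives wrongly assigned to class i, divided
    # by the number of samples not belonging to class i.
    fp = (conf.sum(axis=0) - conf.diagonal()) / (conf.sum() - conf.sum(axis=1))
    return mca, oa, fp

# Toy 2-class example: 9/10 and 8/10 samples classified correctly.
mca, oa, fp = hep2_metrics([[9, 1], [2, 8]])
```

With balanced classes, as here, MCA and OA coincide; on the imbalanced HEp-2 classes (Golgi is much rarer than the others), MCA is the more informative of the two.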
Each table provides the exact numbers for N_h and N_s, and their corresponding A_h and A_s, for four random trials. From the tables, it is clear that the highest accuracy is slightly higher than the accuracy obtained using all features. The reason is that including unimportant or correlated features can negatively impact accuracy. In all tables, the average values of A_h, N_h, A_s, and N_s across the four random trials are reported.

Filter Methods
For each filter method, the following procedure is adopted to choose the feature subset that produces the highest accuracy using the obtained feature ranking: (1) Features are tested on the validation data in increments of 1 (i.e., the top 1, 2, 3, 4, . . . ranked features). We utilize a support vector machine (SVM) with a Gaussian kernel for classification. (2) After some point, the addition of more features does not increase the classification accuracy further; in some cases, it even decreases it.
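Steps (1)-(2) can be sketched as follows, using mutual information as one example filter criterion (chi-square, t-test, etc. plug in the same way); the synthetic data and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=25, n_informative=6,
                           n_classes=6, n_clusters_per_class=1, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=2)

# (1) Filter ranking, computed independently of any classifier.
order = np.argsort(mutual_info_classif(X_tr, y_tr, random_state=2))[::-1]

# (2) Evaluate a Gaussian-kernel SVM on growing prefixes of the ranking;
# accuracy typically saturates (or dips) after some prefix length.
accs = []
for k in range(1, len(order) + 1):
    cols = order[:k]
    accs.append(SVC(kernel="rbf").fit(X_tr[:, cols], y_tr)
                .score(X_val[:, cols], y_val))
best_k = int(np.argmax(accs)) + 1   # N_h for this trial
```

Because the ranking ignores feature interactions, the accuracy curve tends to flatten late, which is why filter methods end up keeping comparatively large subsets.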
Table 2 describes the results obtained with the filter methods using both feature sets. The following observations can be made: (1) These methods select more features, i.e., they are able to reject relatively few features. This happens because interactions among features (relative importance) are not considered when building the feature ranking. Approximately 75% of the class-specific features and 84% of the texture features are utilized for the highest accuracy, whereas 68% of the class-specific features and 78% of the texture features are needed to match the performance obtained using all features. (2) Although these methods select a larger number of features, they generalize better to unseen data, i.e., to cases where the distribution of the test data differs somewhat from that of the training data. This can be seen in the small difference between the validation and testing results.

Hybridization
The results of the proposed hybridization strategy for the filter methods are given in Table 3 for four random trials. It can be noticed from the table that the hybridization produces the highest accuracy (among all methods, especially with the class-specific features), with fewer selected features than its component filter methods. However, the selected feature subset still contains more features than those of the wrapper and embedded methods (to be discussed next). Nevertheless, it also produces the highest testing accuracy with both feature sets.
An important point is that the growth of the feature subset during the selection process can be controlled in wrapper and embedded methods (features can be added one by one), while in the hybridization method this is automatic, and a large chunk of features (those common across the high-ranked filter methods) is selected simultaneously. This could be one reason for the larger selected feature subset.
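One plausible reading of this hybridization, keeping the features that several filter rankings agree on, is sketched below with hypothetical rankings (position 0 = best); the exact aggregation rule used in the paper may differ.

```python
# Hypothetical rankings from three filter methods over 10 features.
rank_chi2 = [3, 1, 7, 0, 5, 2, 9, 4, 8, 6]
rank_ig   = [1, 3, 0, 7, 2, 5, 4, 9, 6, 8]
rank_sd   = [3, 7, 1, 5, 0, 9, 2, 8, 4, 6]

def hybrid_subset(rankings, top_k=5):
    """Keep the features appearing in the top_k of a majority of rankings."""
    n = len(rankings)
    votes = {}
    for r in rankings:
        for f in r[:top_k]:
            votes[f] = votes.get(f, 0) + 1
    return sorted(f for f, v in votes.items() if v > n // 2)

subset = hybrid_subset([rank_chi2, rank_ig, rank_sd])
```

Because the agreed-upon chunk is taken in one shot, its size is fixed by the rankings themselves rather than grown one feature at a time, mirroring the behavior described above.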

Wrapper Methods
Table 4 illustrates the results for the wrapper methods. The following inferences can be made from observation: a wrapper method considers the relative importance of features while building the optimal subset; hence, the feature with high individual as well as relative importance is chosen first, since the method calculates the importance of each feature with respect to the other features. As the method selects features only as long as it finds an increment in performance, it can sometimes get trapped in local minima. Nevertheless, it can be observed from the table that this approach selects the fewest features (<15%) compared to the others to reach high accuracy. However, the difference between validation and testing accuracy shows that it may not generalize as well to new (unseen) samples, as it sometimes over-fits the training data, even with parameters chosen on the validation data.
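Greedy forward selection, the archetypal wrapper, can be sketched with scikit-learn's `SequentialFeatureSelector`; the wrapped classifier, data, and target subset size here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_classes=6, n_clusters_per_class=1, random_state=3)

# At each step, add the single feature whose inclusion most improves the
# cross-validated accuracy of the wrapped classifier; stop at 5 features.
sfs = SequentialFeatureSelector(SVC(kernel="rbf"), n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
chosen = sfs.get_support(indices=True)
```

Because each step commits to the locally best feature, the search can stall in a local optimum, which is the behavior noted above for the wrapper results.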

Embedded Methods
To determine the effectiveness of random forest and random uniform forest for feature importance, the experiment involves two steps: (1) computing the variable importance (VI) of the features; (2) computing the classification performance using the top-ranked features (in sets of 10). The same procedure is repeated for four random trials. The testing and validation results show that the embedded methods generalize better to unseen data than the wrapper methods, even with fewer features. We note that (Table 4) only 25-35% of the features are sufficient to match the standard accuracy (i.e., that obtained using all features) for the class-specific feature set, while for the texture feature set, fewer than approximately 20% of the features suffice. Thus, as in the wrapper methods, the number of selected features is quite low. Figure 7 shows the ranking of the features based on the variable importance given by random forest and random uniform forest for one of the feature sets (class-specific, one trial). The variable importance along with the variable names is shown in Figure 8: Figure 8a shows the decrease in accuracy when a variable is randomly permuted for random forest, and Figure 8b shows the relative influence of the variables for random uniform forest.

Unsupervised Feature Selection Methods
The results for the unsupervised FS methods are given in Table 5. The following can be observed: (1) They produce validation accuracy somewhat lower than the supervised filter methods, but higher test accuracy. (2) They require more features than the supervised methods. From the results obtained using protocol 1, it can be seen that the higher accuracies are produced by the filter, embedded, hybrid, and unsupervised methods. For this reason, we only evaluate these methods under the second protocol.
Table 6 describes the results for the filter and hybrid methods using both feature sets. As protocol 2 uses more training data, it produces higher accuracy than protocol 1; however, it also utilizes more features. The other observations are the same as above. Results for the unsupervised methods are shown in Table 7.

Comparison among All Feature Selection Methods
Table 8 provides a comparison among all the feature selection methods under one roof for protocol 1. From this, and from Tables 2-4, it can be observed that: (1) Filter methods have better generalization ability than the other methods, which involve the classification performance, at the cost of an optimal subset containing a larger number of features; hence, they perform well on unseen data. (2) Since filter methods select feature subsets of larger dimension, in cases where the number of samples is smaller than the number of features (for example, in bioinformatics applications [42,53]), wrapper methods typically perform better than filter methods. (3) Hybridization of the filter methods produces better performance than the individual filter methods, while also selecting fewer features. (4) It can be seen that, in general, for the class-specific features there is relatively little difference in the number of optimal features selected across the methods. This indicates that the class-specific features are designed thoughtfully, based on semantic visual characteristics of the cell images, and hence there may not be many irrelevant features. For the texture features, however, this difference in the number of selected features is clearly visible. (5) In general, the highest accuracy obtained using the class-specific features is also higher than that of the texture features. (6) Both wrapper and embedded methods yield high accuracy, importantly with fewer features. However, for the texture feature set, the generalization capability of the wrapper methods is low, as there is a considerable difference between the validation and the testing accuracy. (7) In general, considering the overall testing accuracy and the number of selected features, the embedded methods can be said to perform best. However, if the computational complexity (i.e., the number of selected features) is not much of a concern, then the hybridization of filter methods yields the best results.
As we report all the results for both protocols in the comparison with contemporary methods below, we do not add a separate table for protocol 2.

Performance Comparison with Contemporary Methods
Table 9 illustrates the performance comparison (under both protocols) among various methods, where [30-32] utilized random forest for classification, [11,33] are recent methods using, arguably, more sophisticated features, and the work of [34] uses deep learning. For the methods in [33,34], the numbers in brackets denote the results with data augmentation. In this table, for our approaches, we report the average results over four trials corresponding to N_h (the number of features that yields the highest accuracy). In each category, results are shown only for the FS method that produces the highest accuracy.
From the table, the following can be observed: (1) Considering the best results, the proposed approach using the class-specific features outperforms all the other methods in all terms (overall accuracy, mean accuracy, and false positives). (2) All types of feature selection methods used with the class-specific features outperform the state-of-the-art methods. (3) Even with the texture features, comparable performance is achieved by the hybrid filter method and the embedded methods.
Experimental protocol 1, which uses less training data than protocol 2, produces comparable performance with fewer features. This can be considered a positive aspect of our approach: high performance can be obtained using less training data.
Thus, it is clear that many of the proposed approaches outperform the other contemporary methods, even with simple morphology-based and texture features. Importantly, oftentimes very few features are utilized.

Computational Time Analysis
We have not carried out an extensive computational time analysis, as different methods may use different platforms. However, we can speculate that, during the testing process, the proposed approaches would arguably be more efficient, as they involve fewer features and a more efficient classifier than other methods, especially compared to the CNN-based approach. Having said that, during the training and feature selection stage, we note that the complexity of the methods can vary based on various factors; for example, in the case of random forests, a larger number of trees requires more time than a smaller number.

Conclusions
This work explored various feature selection methods from the perspective of HEp-2 cell classification. We also proposed a technique which combines filter methods, and showed that constructing hybrid feature selection techniques can automatically improve the robustness of feature ranking and feature subset selection.
We explored random forest and random uniform forest for feature selection for HEp-2 cell image classification. The notion of variable importance is used to select important features from a large set of simple features. Our experiments show that such feature selection yields a significantly reduced feature subset, which can, in fact, result in accuracy higher than that obtained using the original large feature set. For comparison, we also employed some wrapper methods. The approach also generalizes well to unseen data, maintaining high performance with the reduced feature set.
The results demonstrate, in some cases, a large reduction of the feature set for the best performance on the test data, indicating potential computational savings. Also, in general, the approach with the proposed class-specific features outperforms the state-of-the-art methods.

Figure 2. Architecture of the proposed scheme for feature selection.

3.1. Filter Methods
Filter techniques assess the relevance of features based on discriminative criteria. These are relatively independent of model performance, and hence provide the score (feature ranking) by looking only at the intrinsic properties of the data. Chi-square, t-test, information gain (IG), and statistical dependency (SD) are some examples of filter methods.

Figure 5 provides the details of examples for each class.

Figure 5. Sample images (positive and intermediate) and the number of examples of each class in the dataset.

Figure 6. Unique characteristics of each class.

Figure 8. Variable importance (class-specific feature set): (a) random forest, (b) random uniform forest. Here, different ranges of symbols 'Ai-Aj' ('i', 'j' denoting a number) are used for the different parametric variations of each class-specific feature, as provided in the legend table.

Table 1. Class-specific features for all classes.

Table 6. Filter and hybrid methods: Experimental protocol 2.