
This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (

A common approach to improving medical image classification is to add more features to the classifiers; however, this increases the time required for preprocessing raw data and training the classifiers, and the additional features are not always beneficial. More than 50 features are commonly used in the literature for training image feature classifiers. Existing algorithms for selecting a subset of the available features for image analysis fail to adequately eliminate redundant features. This paper presents a new selection algorithm based on graph analysis of the interactions among features and between features and the classifier decision. Path analysis is modified by applying regression analysis, multiple logistic regression, and posterior Bayesian inference in order to eliminate features that provide the same contributions. A database of 113 mammograms from the Mammographic Image Analysis Society was used in the experiments. Tested on two classifiers (ANN and logistic regression), cancer detection accuracy (true-positive and false-positive rates) using a 13-feature set selected by our algorithm was substantially similar to that obtained using a 26-feature set selected by SFS and using all 50 features. However, the 13-feature set greatly reduced the amount of computation needed.

Breast cancer is among the most frequent forms of cancers found in women [

Image features are conceptual descriptions of images that are needed in image processing for analyzing image content or meaning. Features are usually represented as data structures holding directly extractable information, such as colors and gray levels, and higher-level derivatives computed mathematically from the basic features, such as edges, histograms, and Fourier descriptors. Each type of feature requires a specific algorithm to process it. Therefore, only features that carry essential and non-redundant information about an image should be considered. Moreover, feature-extraction techniques should be practical and feasible to compute. Many researchers have tried to improve the accuracy of CAD by introducing more features on the assumption that this will lead to better precision. However, adding more features necessarily increases the cost and computation time.

The addition of more features does not always improve system efficiency, which has led to an investigation of feature pruning techniques [

Among the algorithms for discarding non-significant features are sequential forward search (SFS), sequential backward search (SBS), and stepwise regression. SFS and SBS focus on reducing the MSE of the detection process, while stepwise regression involves both the interaction of features and the MSE value. Stepwise logistic regression is costly because it is based on calculations over all possible permutations of every feature in the prediction model. These techniques select features on the assumption that the best features are those with a stronger relation to the classifier decision output. However, an optimal set of features must be orthogonal. With the above techniques, information from two or more candidate features may be redundant, and one feature may be dependent on another.

To improve the effectiveness of feature-discarding techniques, we propose a new method using modified path analysis for feature pruning. A weighted dependency graph from the features to the classifier output is constructed, together with correlation matrices among features. Statistical quantitative analysis methods (regressions and posterior Bayes) and hypothesis testing are used to determine the effectiveness of each feature in the classifier decision. Experiments are performed using 50 features found in the literature, and the effectiveness of feature selection is evaluated on two learning models: ANN and logistic regression. The resulting 13-feature set is compared with prediction using all 50 original features and a 26-feature set selected by the SFS method. We found that the quality is nearly equal; however, the number of feature computations is reduced to one-half of that for the 26-feature set and to 13/50 of that for the all-feature set.

The paper is organized as follows. Section 2 discusses problems with medical image features and surveys the features used in medical image research. Section 3 describes the feature extraction domains. Section 4 details the collaborative statistical methods. Section 5 describes our proposed algorithm, and Section 6 evaluates the experiments.

Medical image detection from mammograms is limited to analysis of gray-scale features. Distinction between normal and malignant tissue by image density is nearly impossible because of the minuteness of the differences [

In a previous study [

Fu

Zhang ^{14} possible feature subsets. The results showed that a few feature subsets (5 features) achieved the highest classification rate of 85%. For mammography with a huge number of features, however, selecting features with the neural-genetic approach is very costly.

The Information Retrieval in Medical Applications (IRMA) [

The researchers' choices of medical image features depend on the objectives of the individual research. Cosit

Studies of feature extraction have found that the effects of significant features can be direct or indirect, and that some features do not relate to the detection results at all. Therefore, ineffective and redundant features must be discarded.

This section presents details on feature domains that are used for medical image classification. Generally, the original digital medical image is in the form of a gray-scale or multiple spectrum bitmap, consisting of integer values corresponding to properties

The spatial domain is composed of features extracted and summarized directly from grid information. It implicitly contains spatial relations among semantically important parts of the image. Examples of spatial features are shapes, edges, foreground information, background information, contrasts, and sets of intensity statistics such as mean, median, standard deviation, coefficient of variation, variance, skewness, kurtosis, entropy, and modified moment. In this research, we also use radian of mass.

Texture features are relations among pixels in a bitmap. Representation of texture features commonly uses co-occurrence matrices to describe their properties. The co-occurrence matrix of texture describes the repeated occurrence of gray-level configuration in an image. For a texture image, _{φ,d}

The frequencies of co-occurrence as functions of angle and distance can be defined as:

In this paper, we take

Energy or angular second moment (an image homogeneity measure):

Entropy:

Maximum probability:

Contrast:

Inverse difference moment:

Correlation (a measure of image linearity for linear directional structures), expressed in terms of the means $\mu_x$, $\mu_y$ and standard deviations $\sigma_x$, $\sigma_y$ of the co-occurrence matrix:
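For reference, the usual textbook forms of these co-occurrence descriptors can be written as shown here. This is a sketch under assumptions rather than the authors' exact notation: $P_{\phi,d}(a,b)$ is taken to be the normalized frequency with which gray levels $a$ and $b$ co-occur at angle $\phi$ and distance $d$, $\kappa$ and $\lambda$ are free exponents, and $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$ are the means and standard deviations of the row and column marginals of $P_{\phi,d}$.

```latex
\begin{align*}
\text{Energy} &= \sum_{a,b} P_{\phi,d}^{2}(a,b) \\
\text{Entropy} &= -\sum_{a,b} P_{\phi,d}(a,b)\,\log_2 P_{\phi,d}(a,b) \\
\text{Maximum probability} &= \max_{a,b} P_{\phi,d}(a,b) \\
\text{Contrast} &= \sum_{a,b} |a-b|^{\kappa}\,P_{\phi,d}^{\lambda}(a,b) \\
\text{Inverse difference moment} &= \sum_{a,b;\,a \neq b} \frac{P_{\phi,d}^{\lambda}(a,b)}{|a-b|^{\kappa}} \\
\text{Correlation} &= \frac{\sum_{a,b} (ab)\,P_{\phi,d}(a,b) - \mu_x \mu_y}{\sigma_x \sigma_y}
\end{align*}
```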

Spectral features [

Spectral entropy:

Block activity:

The above features are frequently found in the literature of medical image analysis; there are many more features available.
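As an illustration of how the texture descriptors in this section can be computed, the following minimal sketch builds a co-occurrence matrix and evaluates three of the measures. Several details are assumptions made for the example and are not taken from the paper: pairs are counted symmetrically, the distance defaults to one pixel, the function names are hypothetical, and the 4x4 test image is a toy example.

```python
import math

# Pixel offsets (row, column) for the four directions, applied at distance d.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def cooccurrence(image, angle, levels, d=1):
    """Normalized gray-level co-occurrence matrix P[a][b] for one direction."""
    dr, dc = OFFSETS[angle]
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d * dr, c + d * dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1  # assumption: count each pair in both orders
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]

def energy(P):
    """Angular second moment: sum of squared matrix entries."""
    return sum(p * p for row in P for p in row)

def entropy(P):
    """Shannon entropy (base 2) of the matrix, skipping zero entries."""
    return -sum(p * math.log2(p) for row in P for p in row if p > 0)

def contrast(P, kappa=2, lam=1):
    """Contrast with free exponents kappa and lam."""
    return sum(abs(a - b) ** kappa * p ** lam
               for a, row in enumerate(P) for b, p in enumerate(row))

# Toy 4x4 image with four gray levels (hypothetical data).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
P = cooccurrence(image, angle=0, levels=4)
features = {"energy": energy(P), "entropy": entropy(P), "contrast": contrast(P)}
```

Repeating the call for the angles 45, 90, and 135 gives the four rotations of each descriptor.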

We hypothesize that using only one statistical method for classification will not be successful because of the restrictions on the measurement values of features and output. Because of these restrictions, we combine several statistical techniques to carry out the feature selection process. These techniques consist of four parts: 1) feature classification, 2) path analysis, 3) exploration of relations among features and outputs, and 4) hypothesis testing. In feature classification, we use correlation analysis to divide the features into groups. In path analysis, the conceptual relations among different feature classes are constructed. Then, relations among features and between features and outputs are determined by three methods: logistic regression, simple regression, and multiple regression. Finally, hypotheses about feature relationships are tested by a Bayesian technique.

Since most low-level features are extracted from the spatial and texture domains, which are highly correlated, the feature selection strategy is subject to this limitation. The correlation coefficient is used to analyze these features. The correlation coefficient

Correlation coefficients of features can be used to classify many highly related features into groups.
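A minimal sketch of this grouping idea follows. It is not the authors' implementation: the 0.8 cut-off, the function names, and the sample feature values are hypothetical, and features are grouped as connected components of the "highly correlated" relation. Constant features, which have zero variance, are not handled.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def group_features(columns, threshold=0.8):
    """Group features whose |r| >= threshold (assumed cut-off) via union-find."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())

# Hypothetical feature values for five regions of interest.
columns = {
    "entropy_0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "entropy_45": [1.1, 2.0, 3.2, 3.9, 5.1],
    "mean": [5.0, 1.0, 4.0, 2.0, 3.0],
}
groups = group_features(columns)
```

With these sample values the two entropy features fall into one group and `mean` forms a group of its own.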

From the previous phase, we can identify groups of highly related features. The relationships of features within each group, and the relationships between groups and the final output, can be determined by path analysis.

Path analysis utilizes multiple regression analysis. Regression analysis is an analysis of causal models when single indicators are endogenous variables of the model. In a path model, there are two types of variables: exogenous and endogenous. Exogenous variables may be correlated and may have direct effects as well as indirect effects on endogenous variables. Causality is a relationship between an exogenous variable and endogenous variable(s); philosophical causation refers to the set of all particular “causal” relations.

Being a regression-based technique, path analysis is limited by the requirement that all variables be continuous. Because our study involves continuous cause variables while the endogenous output variable is dichotomous (discrete), we cannot use path analysis directly; however, the analysis is still a graph-based process. Causal relation analysis can be explained by dependent variables that are measured on an interval or ratio scale [

A path diagram not only shows the nature and direction of causal relationships but also estimates the strength of relationships. Comparatively weak relationships can be discarded; thus some features are eliminated. A path coefficient is the standardized slope of the regression model. This standardized coefficient is a Pearson product-moment correlation. Basically, these relationships are assumed to be unidirectional and linear. To overcome this limitation, we use regressions and Bayesian inference to construct a graphical model.

Given the features and the path analysis described above, it is necessary to explore the cause and effect features by regression analysis. For this purpose, we use logistic regression, simple regression, and multiple regression.

a) Using logistic regression. Logistic regression is a regression model for Bernoulli-distributed dependent variables. It is a linear model that utilizes the logit as its link function. Logistic regression has been used extensively in medical and social sciences [

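where $p_i = \Pr(y_i = 1)$ is the probability that observation $i$ belongs to the positive class and $\beta_j$ are the regression coefficients of the features.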

The logistic regression model can be used to predict whether the response is 0 or 1 (benign or malignant in the case of mammogram detection). Rather than classifying an observation into one group or the other, logistic regression predicts the probability

b) Using simple regression and multiple regression. Simple regression has the same basic concepts and assumptions as logistic regression, but the dependent variable is continuous and the model has only a single independent variable. The simple regression can be modeled as $y_i = \beta_0 + \beta_1 x_{1i} + \varepsilon_i$, where $y_i$ is the dependent variable, $\beta_0$ and $\beta_1$ are parameters (weights), $x_{1i}$ is the independent variable, and $\varepsilon_i$ is the error term. Regression yields a p value for the estimator of $\beta_1$ that can be used to decide whether

Simple logistic regression and multiple logistic regression are used to explore how the cause features affect the output.
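The single-feature case can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the model is fitted by Newton-Raphson, significance is judged with a two-sided Wald test, the function names are hypothetical, and the eight-sample data set is invented for the example. Perfectly separable data, for which the estimate diverges, is not handled.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def simple_logistic(x, y, iterations=50, tol=1e-10):
    """Fit logit(p) = b0 + b1*x; return (b0, b1, p_value of the test b1 = 0)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries, weights w = p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard error of b1 from the information matrix of the final step.
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return b0, b1, p_value

# Hypothetical data: y is the class label, x_good tracks y, x_noise does not.
y = [0, 0, 0, 1, 0, 1, 1, 1]
x_good = [1, 2, 3, 4, 5, 6, 7, 8]
x_noise = [1, 2, 1, 1, 2, 2, 1, 2]
_, b1_good, p_good = simple_logistic(x_good, y)
_, b1_noise, p_noise = simple_logistic(x_noise, y)
```

A feature would be kept as a candidate cause when its p value falls below the chosen threshold; the multiple-regression case follows the same pattern with a larger information matrix.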

Although the statistical techniques in the previous section can be used to identify causal features, they cannot classify those features as direct or indirect. We use hypothesis testing for this.

An appropriate way to test hypotheses about the direction of causal relationships is by analogy with Bayesian inference. Bayesian inference uses the scientific method, which involves collecting evidence that may or may not be relevant to a given phenomenon. As more evidence accumulates, the degree of belief in a hypothesis changes. With enough evidence, the degree of belief will often become very high or very low. It can therefore be used to discriminate between conflicting hypotheses. Bayesian inference usually relies on degrees of belief, or subjective probabilities. Bayes's theorem adjusts probabilities based on new evidence as
$$P(H_0 \mid E) = \frac{P(E \mid H_0)\,P(H_0)}{P(E)},$$
where $H_0$ is the hypothesis under test, $E$ is the observed evidence, $P(H_0)$ is the prior probability of $H_0$, $P(E \mid H_0)$ is the probability of observing $E$ if $H_0$ is true, $P(E)$ is the marginal probability of $E$, and $P(H_0 \mid E)$ is the posterior probability of $H_0$ given $E$.

Using hypothesis testing on the regression results, we can apply path analysis to the discrete output.
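The way accumulated evidence shifts the degree of belief can be shown with a short numerical sketch. The prior of 0.5 and the likelihoods 0.8 and 0.3 are hypothetical values chosen only for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood_h, likelihood_not_h):
    """Bayes's theorem for a hypothesis H and its complement."""
    evidence = likelihood_h * prior + likelihood_not_h * (1.0 - prior)
    return likelihood_h * prior / evidence

# Start undecided, then observe the same kind of supporting evidence three times.
belief = 0.5
history = [belief]
for _ in range(3):
    belief = posterior(belief, likelihood_h=0.8, likelihood_not_h=0.3)
    history.append(belief)
```

Each observation multiplies the odds in favor of the hypothesis by 0.8/0.3, so the belief rises monotonically toward 1.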

To solve the causality extraction problem, the algorithm combines simple regression, logistic regression, and Bayesian inference. Its steps are as follows.

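Step 1. Partition the original feature set $(x_1, x_2, \ldots, x_n)$ into feature subsets $S_i = (x_{1i}, x_{2i}, \ldots, x_{ji})$ according to the correlation $r_{ij}$ between features $x_i$ and $x_j$.

This step partitions all features into feature subsets $S_i$, so that each feature belongs to exactly one subset ($S_i \cap S_j = \emptyset$ for $i \neq j$).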

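Step 2. Perform simple logistic regression of the output on each independent feature $x_{ji}$ in $S_i$ and select the significant features $A_i$.

The result of this step is a subset $A_i = (x_{ri}, x_{pi}, \ldots, x_{ki})$ of features from $S_i$; $A_i$ is a set of direct cause features.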

Step 3. Perform multiple logistic regression using all features in set $S_i$, $i = 1, 2, \ldots, k$, in the model and select the significant features $B_i = (x_{ti}, x_{li}, \ldots, x_{zi})$; $B_i$ is a set of direct and indirect cause features.

Step 4. Let $D_i = A_i$ Ə $B_i$, where Ə is our hypothesis-testing operator for exploring the causal relations using the Bayesian inference conceptual framework.

This step is performed using Bayesian inference as in the following example for two features:

This step iteratively refines the search for the indirect cause feature with the highest correlation with the direct cause $x_{mi}$.

Through the above predicates (1) to (4), we can accept the _{ni}_{ni}_{ti}_{ni}_{ti}

Step 5. Repeat from Step 2 for each remaining feature subset $S_i$.

Step 6. Construct graph $G$ by merging the subgraphs $D_i$:

$$G = \bigcup_{i=1}^{k} D_i$$
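The graph-construction part of the steps above can be sketched as follows. This is a simplified stand-in, not the authors' procedure: the Bayesian hypothesis test behind the Ə operator is not reproduced, and each feature that is significant only in the multiple regression is simply linked to the direct cause with which it correlates most strongly. The function names, the output node name `y`, and the correlation value are hypothetical.

```python
def build_subgraph(direct, candidates, corr, output="y"):
    """Edges for one feature subset.

    direct:     features significant in simple logistic regression (A_i)
    candidates: features significant in multiple logistic regression (B_i)
    corr:       dict mapping frozenset({f, g}) to a correlation coefficient
    """
    edges = [(f, output, "direct") for f in direct]
    for f in candidates:
        if f in direct or not direct:
            continue
        # Assumption: attach an indirect cause to its most correlated direct cause.
        partner = max(direct, key=lambda g: abs(corr[frozenset((f, g))]))
        edges.append((f, partner, "indirect"))
    return edges

def merge_subgraphs(subgraphs):
    """Union of the edge lists of all subsets, preserving order."""
    merged = []
    for edges in subgraphs:
        for edge in edges:
            if edge not in merged:
                merged.append(edge)
    return merged

# Feature set #1 as an example; the correlation value is hypothetical.
A1 = ["entropy_0"]
B1 = ["entropy_0", "entropy_45"]
corr1 = {frozenset(("entropy_45", "entropy_0")): 0.9}
D1 = build_subgraph(A1, B1, corr1)
G = merge_subgraphs([D1])
```

The features retained by the selection are those that appear as the source of at least one edge in `G`.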

Our experiment is based on a training set of 113 ROIs from the Mammographic Image Analysis Society (MIAS) mammogram images, segmented by radiologists. After image segmentation, 50 features from the spatial, texture, and spectral domains are extracted. The feature set consists of mass radian, mean, maximum, median, standard deviation, skewness, and kurtosis of gray level from the spatial domain, and energy, entropy, modified entropy, contrast, inverse difference moment, correlation, maximum, $\sigma_x$, $\sigma_y$ of $P_{\phi,d}$

After Step 1, simple and multiple logistic regression analyses are performed on each feature set.

From

From the second column of

From the third column of

Finally, with Bayes inference, the direct effect is Entropy 0° and the indirect effect is the interaction of Entropy 0° and Entropy 45° cause

$D_i = A_i$ Ə $B_i$

The effectiveness of our selected 13-feature set (our-13) is compared with the results of the all-feature set (all-50) and the 26-feature set from SFS (SFS-26) on two learning systems: ANN and logistic regression. The true-positive rate (TP), false-positive rate (FP), and mean squared error (MSE) are the metrics used in the comparison.
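These three metrics can be computed from classifier outputs as in the following minimal sketch. The function name `evaluate`, the 0.5 decision threshold, and the sample labels and scores are hypothetical; rates are reported as percentages.

```python
def evaluate(y_true, y_prob, threshold=0.5):
    """Return (TP rate %, FP rate %, MSE) for binary labels and predicted scores."""
    tp = fp = positives = negatives = 0
    for label, score in zip(y_true, y_prob):
        predicted = 1 if score >= threshold else 0
        if label == 1:
            positives += 1
            tp += predicted
        else:
            negatives += 1
            fp += predicted
    mse = sum((label - score) ** 2 for label, score in zip(y_true, y_prob)) / len(y_true)
    return 100.0 * tp / positives, 100.0 * fp / negatives, mse

# Hypothetical labels (1 = malignant) and classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.1, 0.2, 0.7, 0.3]
tp_rate, fp_rate, mse = evaluate(y_true, y_prob)
```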

Graph-based analysis was examined using statistical techniques to identify the crucial direct or indirect features for breast cancer detection in medical images. Our algorithm has time complexity $O(n^2)$. We can accept the hypothesis that there is no significant difference between the 50-feature and 13-feature sets for ANN and logistic regression at the 5% threshold. A comparison of performance across the two feature sets (50 and 13 features) and the two classifiers (ANN and logistic regression) indicates that the selected 13 features provide the best results in terms of precision relative to computation time. With our approach, the cost of feature computation in the detection step falls in proportion to the number of features, from 50 to 13. Moreover, the proposed method demonstrates satisfactory performance and cost compared to SFS.

In our experiment, the 50 features were partitioned into 12 feature sets with _{11} being the largest set. With this set, the search space for direct cause features (_{7}) is (^{7}_{1}) while indirect cause (_{7}) exploration was (^{7}

On the theoretical aspect of finding a best combination feature set, the only way to guarantee the selection of an optimal feature set is an exhaustive search of all possible subsets of features. However, the search space could be very large: 2^{N}^{k}C_{i}

In this research, a method to reduce the number of features for medical image detection is proposed. We used mammograms from the Mammographic Image Analysis Society (MIAS) as test data and applied the proposed algorithm to reduce the number of features from a frequently used set of 50 to 13, while the accuracies of two learning models remained substantially the same. Our method can reduce the computation cost of mammogram image analysis and can be applied to other image analysis applications. The algorithm uses simple statistical techniques (path analysis, simple logistic regression, multiple logistic regression, and hypothesis testing) in collaboration to form a novel feature selection technique for medical image analysis. The value of this technique is that it not only tackles the measurement problem by path analysis but also provides a visualization of the relations among features. In addition to being easy to use, this approach effectively addresses the feature redundancy problem. The proposed method was found to be easier to apply and to require less computing time than SFS, SBS, and genetic algorithms. For further research, a deeper analysis of the texture domain and the dispersion of microcalcification may provide a more efficient breast CAD system, with cost reduction and higher precision.

This research is partially supported by the Kasetsart University Research and Development Institute. The authors would like to thank Nutakarn Somsanit, MD, of Rajburi Hospital for her advice about the training data. Lastly, the authors would also like to thank Dr. James Brucker of the Department of Computer Engineering, Kasetsart University, for his comments on the writing.

An example of a general recursive causal system with four independent features and a dependent output. (A) Illustration of possible relations among features and output. (B) The result of feature selection by analogy with graph base.

The connected graph of two cause features and the effect y. There is no direct effect of feature x_{ti} on y in (A) but, as shown in (B), there is an interaction effect of feature x_{ti} together with x_{ni} on y.

Complete graph for the experiment, showing the direct and indirect effects retained by the selection process (dotted lines show indirect effects).

Feature selection and classification method from previous work.

| Researcher | Domain | Features used (examples) | Classifier |
|---|---|---|---|
| Fu | Texture | Co-occurrence matrix rotation with angle 0°, 45°, 90°, 135°: difference entropy, entropy, difference variance, contrast, angular second moment, correlation | GRNN (SFS, SBS) |
| | Spatial | Mean, area, standard deviation, foreground/background ratio, area, shape moment intensity variance, energy-variance | |
| | Spectral | Block activity, spectral entropy | |
| G. Samuel | Spatial | Volume, sphericity, mean gray level, gray level standard deviation, gray level threshold, radius of sphere, maximum eccentricity, maximum circularity, maximum compactness | Rule-based, linear discriminant analysis |
| E. Lori | Spatial, patient profile | Patient profile, nodule size, shape (measured with ordinal scale) | Regression analysis |
| Shiraishi | Multi-domain | Patient profile, root-mean-square of power spectrum, histogram frequency, full width at half maximum of the histogram for the outside region of the segmented nodule on the background-corrected image, degree of irregularity, full width at half maximum for the inside region of the segmented nodule on the original image | Linear discriminant analysis |
| Hening [ | Spatial | Average gray level, standard deviation, skew, kurtosis, min-max of the gray level, gray level histogram | SVM |
| Zhao | Spatial | Number of pixels, histogram, average gray, boundary gray, contrast, difference, energy, modified energy, entropy, standard deviation, modified standard deviation, skewness, modified skewness | ANN |
| Ping | Spatial | Number of pixels, average, average gray level, average histogram, energy, modified energy, entropy, modified entropy, standard deviation, modified standard deviation, skew, modified skew, difference, contrast, average boundary gray level | ANN and statistical classifier |
| Songyang and Ling, [ | Mixed features | Mean, standard deviation, edge, background, foreground-background ratio, foreground-background difference, difference ratio of intensity, compactness, elongation, shape moment I-IV, invariant moment I-IV, contrast, area, shape, entropy, angular second moment, inverse difference moment, correlation, variance, sum average | Multi-layer neural network |

Partition of the 50 original features into 12 feature sets.

| Feature set | Number of features | Features |
|---|---|---|
| #1 | 4 | Entropy rotations from 0°, 45°, 90°, 135° |
| #2 | 4 | Energy rotations from 0°, 45°, 90°, 135° |
| #3 | 4 | Inverse difference moment rotations from 0°, 45°, 90°, 135° |
| #4 | 4 | Mean co-occurrence rotations from 0°, 45°, 90°, 135° |
| #5 | 4 | Max co-occurrence rotations from 0°, 45°, 90°, 135° |
| #6 | 4 | Contrast rotations from 0°, 45°, 90°, 135° |
| #7 | 4 | Homogeneity rotations from 0°, 45°, 90°, 135° |
| #8 | 4 | Standard deviations on X rotation from 0°, 45°, 90°, 135° |
| #9 | 4 | Standard deviations on Y rotation from 0°, 45°, 90°, 135° |
| #10 | 4 | Modified entropy rotations from 0°, 45°, 90°, 135° |
| #11 | 7 | Mean, maximum, median, standard deviation (SD), coefficient of variation (CV), skewness, kurtosis (intensity of gray level) |
| #12 | 3 | Block activity, spectral entropy, mass radian |

The effects among features in feature set #1.

| Relation between features | p-value |
|---|---|
| Entropy 0° to Entropy 45° | 0.000 |
| Entropy 0° to Entropy 90° | 0.004 |
| Entropy 0° to Entropy 135° | 0.000 |
| Entropy 45° to Entropy 90° | 0.000 |
| Entropy 45° to Entropy 135° | 0.022 |
| Entropy 90° to Entropy 135° | 0.000 |

Values below 0.05 are significant at the 5% threshold; values below 0.01 are highly significant at the 1% threshold.

The effects of features in feature set #1 on output.

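| Feature | Simple logistic regression (p-value) | Multiple logistic regression (p-value) |
|---|---|---|
| Entropy 0° | 0.034 | 0.026 |
| Entropy 45° | 0.433 | 0.031 |
| Entropy 90° | 0.363 | 0.241 |
| Entropy 135° | 0.159 | 0.169 |

Values below 0.05 are significant at the 5% threshold; values below 0.01 are highly significant at the 1% threshold.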

Performance of logistic regression using all-50, SFS-26 and our-13 feature sets.

| Feature set | TP (%) | FP (%) | MSE |
|---|---|---|---|
| Using original 50 features (all-50) | 82.94 | 14.51 | 0.052 |
| Using selected 26 features (SFS-26) | 77.41 | 18.72 | 0.102 |
| Using selected 13 features (our-13) | 81.64 | 15.06 | 0.084 |

Performance of ANN using all-50, SFS-26, and our-13 feature sets.

| Feature set | TP (%) | FP (%) | MSE |
|---|---|---|---|
| Using original 50 features (all-50) | 83.32 | 14.42 | 0.034 |
| Using selected 26 features (SFS-26) | 78.59 | 16.02 | 0.083 |
| Using selected 13 features (our-13) | 82.35 | 15.02 | 0.065 |