A Clinical Decision Support Framework for Incremental Polyps Classification in Virtual Colonoscopy

Abstract: We present a novel dynamic learning method for classifying polyp candidate detections in Computed Tomographic Colonography (CTC) using an adaptation of the Least Square Support Vector Machine (LS-SVM). The proposed technique, called Weighted Proximal Support Vector Machine (WP-SVM), extends the offline capabilities of the SVM scheme to address practical CTC applications. Incremental data are incorporated in the WP-SVM as a weighted vector space, and the only storage requirements are the hyper-plane parameters. A WP-SVM performance evaluation based on 169 clinical CTC cases, using a 3D computer-aided diagnosis (CAD) scheme for feature reduction, compares favorably with previously published CTC CAD studies, which have, however, involved only binary and offline classification schemes. The experimental results obtained from iteratively applying WP-SVM to improve detection sensitivity demonstrate its viability for incremental learning, thereby motivating further follow-on research to address a wider range of true positive subclasses such as pedunculated, sessile, and flat polyps, and a wider range of false positive subclasses such as folds, stool, and tagged materials.


Introduction
Due to recent advancements in Computed Tomography (CT) technology, and with more than 57,000 colon cancer deaths per year in the United States, CT Colonography (CTC), also known as virtual colonoscopy, is becoming a promising tool for early diagnosis of colon cancer. CTC is a minimally invasive technique that detects colorectal polyps and masses based on CT scans of the distended colon [1]. One of the major obstacles for CTC to be an effective tool for detecting polyps is that radiologists' expertise is required for analyzing the CTC images, in particular for the detection of small polyps. Because diagnostic interpretation is a complicated task and any erroneous decision may lead to painful consequences for patients, computer-aided detection (CAD) of polyps would provide a clinical decision support system that benefits patients by reducing the variability of detection accuracy among radiologists [1]. Such a CAD system typically employs a shape-based method for the initial detection of polyp candidates, followed by a machine learning (ML) method for the classification of polyps from non-polyps (normal colonic structures). The CAD system then generates the final list of polyps that is provided to the radiologist as a "second opinion" [2]. Typically, the input to a CAD system is a large number of CT images, ranging from 300 to 3000 images per patient.
The large amount of data is one of the major obstacles to training the ML method in any CAD system. Moreover, to update the CAD system, the ML method needs to be retrained when new CTC patient image data become available. Therefore, the need to scale up inductive learning algorithms in CAD systems is drastically increasing in order to extract valid and novel patterns from incremental data without a major ML retrain. Dynamic, incremental, or online learning refers, in this context, to the situation where a training image data set is not fully available at the beginning of the learning process. The data can arrive at different time intervals and need to be incorporated into the training data to preserve the class concept. Thus, constructing an ML method capable of incremental classification, as opposed to batch-mode learning, is very attractive and will become a strategic necessity for CTC for two reasons. First, the training period is the most resource-intensive element in ML. Second, CTC data arrive by nature as a continuous, large, and unbalanced stream, which makes them an ideal candidate for online learning.
To the best of our knowledge, no prior work has addressed incremental multi-classification of polyps using a support vector machine (SVM) approach within the framework of dynamic learning. The novel method we present in this work is called Weighted Proximal Support Vector Machine (WP-SVM) and extends traditional SVM beyond its existing static learning context to handle dynamic and multiple classifications of unbalanced data sets of polyps. The selection of SVM as a machine learning tool for assisting in CTC clinical decisions stems from several of its main advantages. SVM has its roots in statistical learning theory, which ensures strong learning and generalization capabilities. It is computationally efficient in learning a hyper-plane that correctly classifies the high-dimensional feature space, and it is also highly resistant to noisy data [3].
The remainder of this paper is organized as follows: Section 2 presents an overview of multi-classification Least Square SVM (LS-SVM) principles. Section 3 covers our proposed multi-classification WP-SVM techniques for unbalanced data sets. Section 4 validates the effectiveness of WP-SVM in terms of detection performance, computation time, and storage requirements. Finally, Section 5 concludes this work with remarks and outlines for future research.

Multi-Classification LS-SVM Survey
The SVM technique, invented by Boser, Guyon, and Vapnik, was first introduced at the Computational Learning Theory (COLT) conference in 1992 [4]. Since then it has established itself as one of the leading approaches in pattern recognition and machine learning, as demonstrated by the results obtained in a broad range of practical applications and recent research work. In terms of structural or representational capacity, SVM behaves like a neural network, but it differs in the learning technique. SVM solves a quadratic programming (QP) problem and finds a computationally efficient way of learning a hyper-plane that correctly classifies the high-dimensional feature space using a linear combination of kernel functions that have to be positive definite [28]. With kernel functions centered on the training input data, SVM minimizes the confidence interval and keeps the training error e fixed while retaining only the support vectors (SV) from the input data.
For instance, consider a classical linearly separable multi-classification task in which the attribute (feature) vector x_i represents the i-th input image and y_i the output class. An LS-SVM as introduced in [8] optimizes the objective function of Eq. (1), where λ is a suitable positive penalty parameter that controls the tradeoff between the classification error e of the c different classes and the margin maximization during the training phase. The error term e, often referred to as the slack variable, accounts for the non-separable data points. Hsu and Lin [29] showed that SVM accuracy rates in general are influenced by the selection of λ, which varies depending on the problem under investigation. A suitable λ can be found heuristically or by a grid search.
Depending on the decomposition strategy for converting the multi-classification problem into a set of binary ones, the ML constraint can be formulated as a multi-classification objective function, one-versus-rest, pair-wise, or error-correcting output code [9]. The multi-classification objective function has probably the most compact form, as it solves the problem in a single step. It constructs c two-class rules, where each classifier separates training vectors of class y_i from the other m classes using the constraint of Eq. (2) on the hyper-planes defined by their slopes w and intercepts b. LS-SVM classifiers reduce the optimization problem from a QP to a linear one and optimize the corresponding Lagrangian, in which α_i represent the Lagrange multipliers, which can be either positive or negative. These parameters are derived from the Karush-Kuhn-Tucker (KKT) conditions, which are valid as long as the objective function and conditions stay convex [28]. The LS-SVM solution can then be rewritten as a linear system of equations in matrix format.
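To make the reduction from a QP to a linear system concrete, the following is a minimal sketch of the textbook binary LS-SVM classifier with a linear kernel. It is not the multi-class system of this paper: the function name `lssvm_train`, the parameter `gamma` (playing the role of the penalty λ), and the toy data are assumptions introduced only for illustration.

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0):
    """Train a textbook binary LS-SVM (linear kernel) by solving one linear system.

    X: (N, f) feature matrix; y: (N,) labels in {-1, +1};
    gamma: positive penalty (assumed stand-in for the paper's lambda).
    Returns the slope w, the intercept b, and the Lagrange multipliers alpha.
    """
    N = X.shape[0]
    K = X @ X.T                                   # linear kernel matrix
    omega = (y[:, None] * y[None, :]) * K         # omega_ij = y_i * y_j * K_ij
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = y                                  # constraint: sum_i alpha_i * y_i = 0
    M[1:, 0] = y
    M[1:, 1:] = omega + np.eye(N) / gamma         # ridge term from the squared slack
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(M, rhs)
    b, alpha = sol[0], sol[1:]
    w = (alpha * y) @ X                           # w = sum_i alpha_i * y_i * x_i
    return w, b, alpha

# Toy, linearly separable data (hypothetical values).
X_demo = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y_demo = np.array([1.0, 1.0, -1.0, -1.0])
w_demo, b_demo, alpha_demo = lssvm_train(X_demo, y_demo)
pred_demo = np.sign(X_demo @ w_demo + b_demo)
```

Because the slack enters the objective squared and the constraints are equalities, every multiplier is generally nonzero and may take either sign, which is why no sparse set of support vectors emerges from this system.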

Proposed Multi-Classification
We propose several novel modifications to the standard multi-classification LS-SVM highlighted in Section 2. First, we modify the objective function represented by Eq. (1) by adding the plane intercept b in order to uniquely define the hyper-plane by its slope and intercept. Second, because the input data can be unbalanced with respect to the class distribution, and because the penalty parameter of Eq. (1) could be biased towards controlling the overall error term e at the expense of specific classes, we include a new controlling parameter ζ, which acts as a local penalty variable for each class. The proposed new objective function is given in Eq. (5). We also modify the constraint relationship between different classes in the multi-classification objective function of Eq. (2) to be an equality instead of an inequality, as expressed in Eq. (6). Furthermore, instead of incorporating the constraint function into the objective function as proposed by LS-SVM, we use Eq. (6) to find an expression for the slack variable in terms of the hyper-plane parameters, which is substituted in Eq. (5) in a manner similar to [19]. The optimization problem represented in Eq. (7) now becomes fundamentally different from the standard SVM.
With the proposed changes and by dropping the Lagrange multipliers, the problem is reduced to an unconstrained optimization whose solution is obtained from the rate of change of the objective function. This makes it faster than the standard LS-SVM, which is known to converge more slowly than neural networks for a given generalization performance. In a traditional SVM, nonzero Lagrange multipliers correspond to the SVs that summarize the training data set. By storing only the SVs and discarding the training set after the classifier model has been established, the storage requirement is reduced compared with storing the complete training set. However, depending on the classification task, the SVs can still be numerous, and the order of operations needed for N training points, with f as the dimension of the feature space and N_s as the total number of SVs, varies depending on the SV location with respect to the hyper-planes [4]. WP-SVM does not numerically solve for the support vectors and their nonzero Lagrange multipliers. Instead, it classifies points by assigning them to the closest parallel planes without explicitly calculating the SVs. The hyper-planes are still pushed apart by a maximum margin w, but the data are clustered around the planes rather than on the planes. The uniqueness of the global solution is still valid because it is a property of the Hessian being positive definite or semi-definite [28]. The mathematical steps for the optimization start by solving for the partial derivatives of L(w,b) with respect to both w and b, which yield Eqs. (8) and (9).

Defining the intermediate term S_w, Eq. (8) can be rearranged. Let q(n) represent the size of class n; S_w can then be rewritten in terms of the class sizes. Applying similar reasoning for b, we can rearrange Eq. (9) in the same way. To rewrite Eq. (10) in matrix form, we use the series of definitions described in Table 1.
Diagonal matrix of size (f*c) by c, whose diagonal elements are column vectors.
Square matrix of size c, made from the row vector q_n of length c.
Column vector of size c, made from the elements u_n.
Square diagonal matrix of size c, with diagonal elements r_n.
The above definitions allow us to manipulate Eq. (10) and rewrite it as a system of equations. Solving these equations for W and B yields Eq. (12). We define the matrices A and L as in Eqs. (13) and (14). These definitions allow us to rewrite Eq. (12) in a very compact form, Eq. (15), which provides the separating hyper-plane slopes and the intercept values for the c different classes. The hyper-plane parameters are uniquely defined by the matrices A and L, and do not depend on the SVs or the Lagrange multipliers.
A data point is tested against the decision function shown in Eq. (16) and is assigned to the class that shows the highest output value.
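The closed-form character of the solution and the highest-output decision rule can be illustrated with a generic proximal, one-versus-rest least-squares classifier. This is a sketch under stated assumptions rather than the paper's Eq. (15) or Eq. (16): the per-class penalty ζ is omitted, the targets are coded as +1 and -1, and the names `fit_proximal_ovr`, `predict`, `gram`, and `rhs` are hypothetical.

```python
import numpy as np

def fit_proximal_ovr(X, labels, n_classes, lam=1.0):
    """Closed-form proximal classifier: one hyper-plane (w, b) per class.

    For each class, minimizes 0.5 * (||w||^2 + b^2) + 0.5 * lam * sum(e^2) with the
    slack e substituted from an equality constraint, so no Lagrange multipliers
    or support vectors are computed.
    """
    N, f = X.shape
    E = np.hstack([X, np.ones((N, 1))])        # append a constant column for b
    Y = -np.ones((N, n_classes))               # -1 for "other classes"
    Y[np.arange(N), labels] = 1.0              # +1 for the true class
    gram = E.T @ E + np.eye(f + 1) / lam       # regularizes both w and b
    rhs = E.T @ Y
    WB = np.linalg.solve(gram, rhs)            # shape (f + 1, n_classes)
    return WB[:-1], WB[-1]                     # slopes W, intercepts B

def predict(X, W, B):
    """Assign each row of X to the class with the highest output value."""
    return np.argmax(X @ W + B, axis=1)

# Three small, well-separated clusters (hypothetical values).
X_demo = np.array([[0.0, 4.0], [0.0, 5.0], [4.0, 0.0],
                   [5.0, 0.0], [-4.0, -4.0], [-5.0, -5.0]])
labels_demo = np.array([0, 0, 1, 1, 2, 2])
W_demo, B_demo = fit_proximal_ovr(X_demo, labels_demo, n_classes=3)
pred_demo = predict(X_demo, W_demo, B_demo)
```

The model is fully described by `W_demo` and `B_demo`, whose size depends only on the number of features and classes, not on the number of training points.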

Proposed WP-SVM
Once the hyper-plane slopes have been defined, incorporating recently acquired image data into a traditional LS-SVM model necessitates a full retraining of the system in order to calculate the new model parameters of Eq. (17). For large data sets, such retraining is not efficient: it is expensive in terms of memory and computation time. To maintain an acceptable balance between storage, accuracy, and computation time, we propose the WP-SVM, a dynamic Weighted Proximal SVM approach. Whenever the model needs to be updated, each incremental sequence will alter matrices C, G, D, H, E, R, Q, and U as defined in Eqs. (13) and (14) by the amounts ΔC, ΔG, ΔD, ΔH, ΔE, ΔR, ΔQ, and ΔU, respectively. As an example, consider a recently acquired data point x_{N+1} that belongs to class t, for which Eq. (17) is updated accordingly. In order to adequately capture the effect of the newly acquired sequences, and to ensure that their impact on the hyper-plane orientations W and B is accounted for despite the unbalanced classes, we scale the incremental changes ΔC, ΔG, ΔD, ΔH, ΔE, ΔR, ΔQ, and ΔU by weight factors (Ψ). The basic idea of the WP-SVM is to assign an entropy measure to each incremental data point. These weight factors are determined based on the misclassification rate, the relative importance of a dynamic data point with respect to its class, and its variance with respect to the other classes. The proposed weight factors are defined in Eq. (18). We define v as the frequency of the incremental data sequence acquired, and N as the total number of sequence data used to determine the initial hyper-plane parameters of the model. The s_fc^2 factor is the Mahalanobis distance between an incremental data feature f and the hyper-plane parameters for class c, scaled by ζ_c, which represents the error rate observed in class c before the introduction of new incremental data. Eq. (18) ensures that different data points have different impacts on the classifier parameters, and that data points that have a low probability of occurrence but are nevertheless important with respect to the hyper-plane position are not outnumbered and neglected in the dynamic model update process.

1) Dynamic Processing for Sequential Data
Sequential data refers to incremental data that are processed serially as they are acquired. To assist in the mathematical manipulation, we define a set of auxiliary matrices, with which the incremental change and the dynamic model parameters can be expressed. We can thus rewrite Eq. (15) to reflect incremental learning, which yields Eq. (19). Eq. (19) shows that the separating hyper-plane slopes and intercepts of Eq. (15) for the c different classes can be efficiently updated by use of the old model parameters. The incremental change introduced by the recently acquired data stream is incorporated as a weighted 'perturbation' to the initially established system parameters. Any changes in ΔA are absorbed by the changes in ΔL and vice versa. L(w,b) remains convex and the proposed solution still satisfies the KKT conditions.
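The idea of folding one new data point into an existing model as a weighted perturbation can be sketched with a generic proximal least-squares classifier. This is an illustration under stated assumptions, not the paper's Eq. (19): the class name `IncrementalProximalModel`, the accumulated matrices `gram` and `rhs`, and the scalar `weight` (a stand-in for the weight factor Ψ, whose exact form is not reproduced here) are all hypothetical. Unlike the storage described in the text, this sketch keeps two small accumulated matrices rather than only W and B.

```python
import numpy as np

class IncrementalProximalModel:
    """One-versus-rest proximal classifier updated one data point at a time."""

    def __init__(self, n_features, n_classes, lam=1.0):
        d = n_features + 1                          # +1 for the intercept
        self.n_classes = n_classes
        self.gram = np.eye(d) / lam                 # regularization term only
        self.rhs = np.zeros((d, n_classes))

    def update(self, x, label, weight=1.0):
        """Add one point as a weighted rank-one perturbation; x is not stored."""
        e = np.append(np.asarray(x, dtype=float), 1.0)
        y = -np.ones(self.n_classes)
        y[label] = 1.0
        self.gram += weight * np.outer(e, e)
        self.rhs += weight * np.outer(e, y)

    def params(self):
        """Return the current slopes W (f, c) and intercepts B (c,)."""
        WB = np.linalg.solve(self.gram, self.rhs)
        return WB[:-1], WB[-1]

# Feed six hypothetical points one by one.
X_demo = np.array([[0.0, 4.0], [0.0, 5.0], [4.0, 0.0],
                   [5.0, 0.0], [-4.0, -4.0], [-5.0, -5.0]])
labels_demo = [0, 0, 1, 1, 2, 2]
model_demo = IncrementalProximalModel(n_features=2, n_classes=3)
for x_i, t_i in zip(X_demo, labels_demo):
    model_demo.update(x_i, t_i)
W_demo, B_demo = model_demo.params()
```

With unit weights, the sequential result coincides with a single batch solve on the same points. A weight larger than one makes a rare but important point count as several observations, which is the effect the weighting is meant to achieve for unbalanced classes.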

2) Dynamic Processing for Chunk Data
For incremental chunk processing, the data are still acquired incrementally, but they are stored in a buffer awaiting batch processing. To update the model after capturing k sequences, the recently acquired data are processed and the model is updated as described in Eq. (18). Alternatively, we can use the Sherman-Morrison-Woodbury (SMW) generalization formula [30] to account for the perturbation introduced by matrices M and L, as given in Eq. (20). Using Eqs. (17) and (20), the new model can represent the incrementally acquired sequences as shown in Eq. (21). Eq. (21) shows the influence of the incremental data on calculating the new separating hyper-plane slopes and intercept values for the c different classes. The proposed WP-SVM meets all the main requirements for online learning and uses the learned knowledge towards incorporating new 'experiences' in a computationally efficient manner. In Figure 2, the leftmost sub-figure 2.a represents the plane orientation before the acquisition of x_{N+1}, whereas the rightmost sub-figure 2.b shows the effect of x_{N+1} on shifting the plane orientation whenever an update is necessary.
Table 2 algorithm steps:
Train the initial model using TrainSet, which consists of N patient data sets each having f features.
Validate the generalization performance using the decision function of Initial_Model with the independent TestSet.
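The Sherman-Morrison-Woodbury identity mentioned in the chunk-processing discussion can be illustrated in isolation. The sketch below updates the inverse of a small matrix after a weighted chunk of k rows is added, without inverting the full matrix again. It shows the general identity only, not the paper's Eq. (20) or Eq. (21); the function name `woodbury_chunk_update` and the variable names are assumptions.

```python
import numpy as np

def woodbury_chunk_update(gram_inv, E_chunk, weights):
    """Return inv(gram + E_chunk.T @ diag(weights) @ E_chunk) given inv(gram).

    gram_inv: (d, d) inverse of the current matrix; E_chunk: (k, d) new rows;
    weights: (k,) strictly positive weights. Only a k-by-k system is solved.
    """
    left = gram_inv @ E_chunk.T                       # (d, k)
    right = E_chunk @ gram_inv                        # (k, d)
    small = np.diag(1.0 / weights) + E_chunk @ left   # (k, k) capacitance matrix
    return gram_inv - left @ np.linalg.solve(small, right)

# Hypothetical example: d = 4 dimensions, chunk of k = 2 weighted rows.
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 4))
gram_demo = base.T @ base + np.eye(4)                 # symmetric positive definite
chunk_demo = rng.normal(size=(2, 4))
weights_demo = np.array([1.0, 3.0])
updated_inv = woodbury_chunk_update(np.linalg.inv(gram_demo), chunk_demo, weights_demo)
direct_inv = np.linalg.inv(gram_demo + chunk_demo.T @ np.diag(weights_demo) @ chunk_demo)
```

The benefit appears when the chunk size k is much smaller than the dimension of the matrix being inverted, since the only system solved is k by k.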

Data Set Details and Feature Selection
To assess the classification accuracy of WP-SVM, we used volumes of interest (VOIs) representing lesion candidates in clinical CTC data sets. The VOIs were labeled as true polyps (TP) or false positives (FP) by expert radiologists. The CTC data were acquired using helical single-slice and multi-slice CT scanners (GE HiSpeed CTi, LightSpeed QX/I, and LightSpeed Ultra; GE Medical Systems, Milwaukee, WI). The patients' colons were prepared with standard laxative pre-colonoscopy cleansing and scanned in supine and prone positions with collimations of 1.25-5.0 mm, reconstruction intervals of 1.0-5.0 mm, X-ray tube currents of 50-260 mAs with 120-140 kVp, in-plane voxel sizes of 0.51-0.94 mm, and a CT image matrix size of 512 x 512. Two CTC scan positions (supine and prone) are generally used for each patient to improve the specificity of polyp detection through improved differentiation of mobile residual stool from polypoid lesions [31,32]. We further divided the VOIs in the TP class into two categories: medium-size polyps that were 6-9 mm in size (hereafter, TP1), and large polyps ≥10 mm (hereafter, TP2). This partition was determined by correlating the CTC data with colonoscopy reports. The motivation for this size-based partitioning is that in colorectal screening, large polyps are considered to require polypectomy, whereas for smaller polyps follow-up surveillance may suffice. A total of 61 colonoscopy-confirmed polyps measured 6 mm or larger: 28 polyps were identified as TP1, and 33 polyps as TP2. The number of entries in the TP class is higher than the number of actual polyps because a lesion may be seen in both supine and prone positions, and because some large lesions could be represented by more than one detection.

Table 3. Database properties.
To compare the classification performance of WP-SVM with previously published CAD results, and to confine the variability to the classifier method itself, we used the technique proposed in our earlier work [2]. For feature extraction, we adopted the 3D CAD scheme also developed in [2], which extracts a thick region encompassing the entire colonic wall in an isotropic CTC volume. Discriminative geometric features (shape index, curvedness, CT value, gradient, gradient concentration, and directional gradient concentration, each of which is characterized by nine statistics) identify polyps at each voxel of the extracted colon and are used for detecting polyp candidates.

Performance and validation criteria for WP-SVM
Because ML algorithms have a tradeoff between the classification accuracy on training data and the generalization accuracy on novel data, and because FP occurrences are much more frequent than those of TP1 and TP2, we calculated four performance measurements: the confusion rate (Mis_Err), the True Positive Ratio, the True Negative Ratio, and the False Positive Ratio. These can be derived from the entries s_ij of the confusion matrix CM.
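As an illustration of how the four measurements can be derived from a confusion matrix, the sketch below uses one plausible reading of the definitions: rows index the correct class, columns the predicted class, the sensitivity of a polyp class is its per-class recall, and the specificity is the recall of the FP class. The function name `confusion_metrics`, the class ordering, and the counts are hypothetical and are not taken from the study.

```python
import numpy as np

def confusion_metrics(cm, negative_class):
    """Compute Mis_Err, per-class TPR, TNR, and FPR from a confusion matrix.

    cm[i, j] is the number of samples of correct class i predicted as class j.
    negative_class is the row index of the false-positive (non-polyp) class.
    Assumes every class has at least one sample, so no row sum is zero.
    """
    cm = np.asarray(cm, dtype=float)
    mis_err = 1.0 - np.trace(cm) / cm.sum()        # overall misclassification rate
    recall = np.diag(cm) / cm.sum(axis=1)          # per-class s_ii / sum_j s_ij
    tnr = recall[negative_class]                   # specificity
    tpr = {i: recall[i] for i in range(cm.shape[0]) if i != negative_class}
    return {"Mis_Err": mis_err, "TPR": tpr, "TNR": tnr, "FPR": 1.0 - tnr}

# Illustrative counts only: rows and columns ordered as TP1, TP2, FP.
cm_demo = [[9, 1, 0],
           [0, 8, 2],
           [5, 5, 90]]
metrics_demo = confusion_metrics(cm_demo, negative_class=2)
```

Reporting the sensitivity separately for each polyp class keeps the much larger FP class from dominating the summary, which matters when the classes are as unbalanced as described above.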

WP-SVM Performance in Processing Chunk versus Sequential Data
To characterize the detection performance of WP-SVM, we divided DB1 into three independent sets of data: a training set (hereafter, TrainSet), a testing set (hereafter, TestSet), and an incremental set (hereafter, IncSet), in such a manner that all data belonging to a patient are kept in one of these sets. This validation technique ensures that no criterion optimized during the model training phase can optimistically bias the model generalization performance in the validation step. We compared CTC classification performance when the ML model was retrained (hereafter, Retrain_Model) to the case where incremental learning using WP-SVM was applied. In the latter case, the dynamic data were processed either in a chunk manner (hereafter, Inc_Model) or by incorporating the data sequentially into the classifier (hereafter, Inc_Seq_Model). We also compared the confusion rate of WP-SVM with that of a simple incremental SVM, both relative to the Retrain_Model. Table 4 summarizes the average results for 20 different experiments as well as the CPU requirements normalized to the baseline of Retrain_Model, measured with Matlab's etime routine in order to ensure that the analysis is independent of machine specifics. The ratio of the confusion rates of Inc_Model to the Retrain_Model was found to be 1.2 on average. For sequential processing, the ratio of the confusion rates of the WP-SVM to the Retrain_Model improved to 1.07 on average, which represents almost a 16% improvement over the chunk data processing scenario. Table 4 also shows the CPU usage times for the models normalized with respect to Initial_Model CPU requirements. On average, the ratio of the CPU times of Inc_Model to the Retrain_Model was 0.401. We also observed a marginal degradation in the CPU time for the Inc_Seq_Model: the ratio of the CPU times of Inc_Seq_Model to the Retrain_Model was 0.675. This means that the improvement in the sequential classifier's accuracy came at the cost of a longer CPU time than batch processing. However, this is a reasonable price to incur for enhancing polyp detection by almost 16% with respect to batch processing.
To illustrate that incremental training can improve classifier accuracy over the baseline retrain method by incorporating more data into the model, we started with an intentionally poor-performing Initial_Model to which WP-SVM was applied iteratively to test the model convergence. Figure 5 indicates that as TPR increases, FPR increases as well; a good decision method should simultaneously have reasonably high detection sensitivity and specificity for a specific setting of ζ. WP-SVM reached sensitivities of 91% and 96% for TP1 and TP2 with specificities of 90.3% and 90%, respectively, and an average of 3.2 false positives per patient. Since the area under the ROC curve of WP-SVM is greater than the area under a hypothetical diagonal line, which would represent a random guess, we can conclude that the obtained WP-SVM ROC curves are informative and that WP-SVM is a promising online learning algorithm for detecting polyps. On average, the detection performance reported by the CAD schemes used for binary classification in a non-dynamic scheme, as shown in Table 5, has varied between 50% and 100% with 1.5 to 15.7 false positives per patient. The WP-SVM results for false-positive findings per patient compare favorably with these published results, especially when we consider that WP-SVM is being applied as an online multi-classifier on a larger database. Note that multi-classification accuracy is expected to be negatively impacted in comparison to binary classification. The parametric model of the SVM allows for adjustments when constructing the discriminant function. However, for multi-class problems these parameters do not always exhibit a perfect fit across the entire data set. This is partly supported by the fact that the VC dimension h impacts the generalization error and, according to [28], is bounded in terms of the slope of the hyper-plane w and the radius R of the smallest sphere that contains all the training points.
Finally, Table 6 compares the storage requirements of the Retrain_Model and the Inc_Model when WP-SVM is applied. Over time, as the polyp database grows, the Retrain_Model will require memory space proportional to the number of CTC data sets acquired, so that the model can be retrained each time a new CTC data set is acquired. For the Inc_Model, the memory space is reduced to the number of features f multiplied by the number of different classes.
Table 6. Storage Requirements.
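The storage comparison above amounts to simple counting, which the following sketch makes explicit. The numbers are assumptions for illustration: f = 54 supposes that all six features with nine statistics each are used, c = 3 corresponds to TP1, TP2, and FP, and the candidate count of 1000 is hypothetical. The function names are also hypothetical.

```python
def retrain_storage(n_candidates, n_features):
    """Feature values that must be kept so the model can be retrained from scratch."""
    return n_candidates * n_features

def incremental_storage(n_features, n_classes):
    """Values kept by the incremental model: f multiplied by the number of classes."""
    return n_features * n_classes

F_ASSUMED = 6 * 9      # six features, nine statistics each (assumed all are used)
C_ASSUMED = 3          # TP1, TP2, FP
retrain_values = retrain_storage(1000, F_ASSUMED)       # hypothetical 1000 candidates
incremental_values = incremental_storage(F_ASSUMED, C_ASSUMED)
```

The first count grows linearly with every newly acquired candidate, whereas the second stays constant no matter how much data has been processed.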

Conclusions
We presented a novel extension to LS-SVM that provides a dynamic multi-classification framework for CTC classification. The ratio of the confusion rates of Inc_Seq_Model and Retrain_Model was 1.07 on average, and the CPU requirements of the WP-SVM were 0.675 times those of the Retrain_Model. The accuracy of the proposed model was more constrained by the initial model accuracy when chunk learning rather than iterative learning was applied. A performance evaluation based on 169 clinical CTC cases, using a 3D computer-aided diagnosis (CAD) scheme for feature reduction, showed polyp detection sensitivities of 91% and 96% for 6-9 mm and ≥10 mm polyps with specificities of 90.3% and 90%, respectively. We also showed that the storage requirements of WP-SVM are drastically reduced compared to standard classification, because only the hyper-plane parameters are required for updating the classifier. The experimental results demonstrate the capability of WP-SVM in detecting polyps and motivate further work to improve accuracy and specificity measures as well as to validate the detection rates on a larger TP database. Further developments will include the application of kernel methods to WP-SVM and an adaptation of SVM as an image preprocessing technique for feature extraction. Future work will also involve the validation of the WP-SVM over a wider range of TP subclasses such as pedunculated, sessile, and flat polyps, and over a wider range of FP subclasses such as folds, stool, and tagged materials.

Figure 1. (a) represents a standard multiclass SVM with the SVs lying on the hyperplanes, whereas (b) illustrates the proposed WP-SVM, where data points are instead clustered around the hyperplanes.

Figure 2. Effect of x_{N+1} on plane orientation when the model is updated.

FP: False Positive; TP1: True Positive (medium-size polyps: 6-9 mm); TP2: True Positive (polyps ≥10 mm); VOI: Volume of Interest.

Figure 3 (a) represents an axial CT slice where the white box indicates a region of interest with a polyp, whereas Figure 3 (b) is a magnification of the region of interest with the polyp indicated by a white arrow. Folds are shown in light gray and the colonic wall in dark gray. Suspicious regions identified by connected components are further segmented by use of hysteresis thresholding followed by fuzzy clustering to distinguish true polyps from non-polyps.

Figure 3. Effect of Shape Index in Differentiating Polyps.
In the confusion matrix CM, index i represents the correct class and index j the predicted class. Thus, s_ij represents the number of data points belonging to class i that WP-SVM classified as belonging to class j. The True Positive Ratio (TPR), also known as sensitivity, reflects how sensitive WP-SVM is in detecting polyps, whereas the True Negative Ratio (TNR), also referred to as specificity, represents how accurately the classifier identifies false positives. The False Positive Ratio (FPR) is simply the complement of TNR, and Mis_Err is the overall misclassification rate.
C: Diagonal matrix of size (f*c) by (f*c), the diagonal elements are composed of the square matrix

Table 2 depicts the workflow of the WP-SVM classifier.
Store only W and B as Initial_Model. Discard TrainSet.
Acquire incremental data IncSet.
Case 1: If Mis_Err < Acceptable Rate:
- Initial_Model is still valid.
- Store the incrementally acquired images in a buffer so that they are included in future updates. Storing these sequences helps ensure vital learning for the classifier even after several steps without a model update.
- Increment the counter Count by 1 to keep track of the consecutive instances in which Initial_Model was not updated.
- If Mis_Err is statistically increasing and/or Count == Limit, initiate a model retrain to ensure learning and delete the incrementally acquired sequences stored in the buffer. New Model = Retrain_Model.
- Go to Step 2.
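The control flow of Case 1 can be sketched as a small decision function. Only Case 1 is spelled out in the workflow text, so the branch taken when Mis_Err is not below the acceptable rate (an incremental WP-SVM update) and the resetting of the counter are assumptions made for illustration. The function name `next_action` and its arguments are hypothetical.

```python
def next_action(mis_err, acceptable_rate, count, limit, err_increasing):
    """Decide what to do with a newly acquired increment.

    Returns (action, new_count), where action is one of
    "buffer", "retrain", or "incremental_update".
    """
    if mis_err < acceptable_rate:
        # Case 1: the current model is still valid, so the data are buffered.
        count += 1
        if err_increasing or count == limit:
            # Too many skipped updates or a worsening trend: retrain, clear buffer.
            return "retrain", 0            # counter reset is an assumption
        return "buffer", count
    # Assumed remaining case: error too high, apply an incremental update.
    return "incremental_update", 0

# Hypothetical thresholds: acceptable rate 0.10, retrain after 3 skipped updates.
first = next_action(0.05, 0.10, count=0, limit=3, err_increasing=False)
forced = next_action(0.05, 0.10, count=2, limit=3, err_increasing=False)
update = next_action(0.20, 0.10, count=1, limit=3, err_increasing=False)
```

Keeping the decision separate from the numerical update makes it easy to change the thresholds without touching the classifier itself.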

Table 4. Normalized Confusion Rates and CPU Requirements with respect to Retrain_Model for Inc_Model, Inc_Seq_Model, and Incremental SVM.

Table 5. Binary and Offline CAD Results as Reported in Literature Compared with WP-SVM.