TopoResNet: A hybrid deep learning architecture and its application to skin lesion classification

Skin cancer is one of the most common cancers in the United States. As technological advancements are made, algorithmic diagnosis of skin lesions is becoming more important. In this paper, we develop algorithms for segmenting the actual diseased area of skin in a given image of a skin lesion, and for classifying different types of skin lesions pictured in a given image. The cores of the algorithms used were based in persistent homology, an algebraic topology technique that is part of the rising field of Topological Data Analysis (TDA). The segmentation algorithm utilizes a similar concept to persistent homology that captures the robustness of segmented regions. For classification, we design two families of topological features from persistence diagrams---which we refer to as {\em persistence statistics} (PS) and {\em persistence curves} (PC), and use linear support vector machine as classifiers. We also combined those topological features, PS and PC, into ResNet-101 model, which we call {\em TopoResNet-101}, the results show that PS and PC are effective in two folds---improving classification performances and stabilizing the training process. Although convolutional features are the most important learning targets in CNN models, global information of images may be lost in the training process. Because topological features were extracted globally, our results show that the global property of topological features provide additional information to machine learning models.


I. INTRODUCTION
The International Skin Imaging Collaboration (ISIC, [1]) has put forth a number of imaging challenges to the scientific community [2], [3], [4]. These challenges have presented unique opportunities for researchers to test novel computer vision ideas to improve the detection of skin cancer with the long-term goal of facilitating early treatment and greatly improving patient outcomes. This is a worthy goal. In the United States, the five-year survival rate for treated melanoma in the United States is 98% among those with localized disease and 17% among those in whom spread has occurred [5].
The ISIC 2018 challenge [6] was to design skin lesion diagnostic algorithms (and test their efficacy) using ISIC's Clifford Smyth was supported by the Simons Foundation grant 360468. archive of over 13000 dermatoscopic images collected from a variety of sources [7]. Each image was one of the following seven types: MEL, NV, BCC, AKIEC, BKL, DF, and VASC (see Fig. 1 for sample images). The algorithms were trained on a set containing 10015 images, and validated on a set of 193 images. The holdout set consisted of 1512 images and we report our scores from this set. The algorithms we developed were based on the powerful tool of persistent homology [8] and a novel concept that called persistence curves (PC) [9], and persistence statistics (PS) [10]. Dermatologist-level diagnosis of skin lesions has been obtained via machine learning analysis of raw images in [11] (see [11] also for references to earlier approaches).
The main contribution of this work has two folds. First, we craft topological features based on PC and PS for the classification task. Second, we design a novel deep learning architecture, called TopoResNet-101. The essential idea is to Fig. 2: Schematic pipeline of our proposed features extractions. First, for each image, we apply the mask produced by the segmentation algorithm developed in Section II-C; second, for each mask image, we calculate its cubical persistent homology which we review in Section II-A; finally, from the persistence diagrams, we extract PS and PC, which are discussed in Section II-B.
combine the above topological features with those produced by ResNet-101. The features generated by ResNet-101 are often local and geometric (e.g. gradients, edges) information. On the other hand, topological ones are global information, and hence, can be used as additional information for the original neural network model. As shown in Section III, TopoResNet-101 shows evidence on several advantages, such as accuracy and stability. To the best of our knowledge, this work is the first one to combine such topological features and those from ResNet-101 in the classification task.
The outline of this paper is as follows. In Section II, we discuss those mathematical backgrounds in order to properly define PC and PS. In Section III, we introduce the TopoResNet-101 and topological rate α. The main classification results are shown in Section IV and the conclusion is in Section V.

II. TOPOLOGICAL FEATURES
The goal of this section is to extract PS and PC from skin images. Because these features are based on persistence diagrams and persistent homology, we review those mathematical backgrounds in II-A. The PC and PS as features will be presented in II-B. As shown in Figure 1, typical images consist of lesions and non-lesions. In addition to feature engineering, we also propose an intuitive method in II-C for segmenting the lesion part of the image. There are deep learning methods to perform image segmentation, such as [12], [13]. It would be interesting to explore those segmentation methods in the content of skin images, but it will be beyond the scope of this paper. The main focus of this paper is to design topological features and combine them with ResNet-101. We propose a segmentation algorithm in Section II-C that is purely datadriven and only dependent on the topology and geometry of these skin lesion images.

A. Persistent Homology
Algebraic topology is a classical subject and has a long history within mathematics. Persistent homology, formally introduced in [8], brings the power of algebraic topology to bear on real world data. The field has proven useful in many applications, such as neuroscience [14], medical biology [15], sensor networks [16], social networks [17], physics [18], computation [19], nanotechnology [20], natural language processes [21] and more. We'll give a brief overview of homology and persistent homology for images and refer the reader to [22] and [23] for a more detailed exposition.
Informally, homology counts topological features such as connected components (0-dimensional homological features), holes (1-dimensional homological features), voids (2dimensional homological features), and so on. The counts of such k-dimensional holes are the well-known Betti numbers. In binary images, a black pixel is indicated by a value of 0 and a white pixel by a value of 1. We interpret 0and 1dimensional homological features in binary images as follows. We count connected clusters of white pixels as 0-dimensional homological features and connected clusters of black pixels (surrounded by white pixels) as 1-dimensional homological features (see Fig. 3 for an example). Let X be a binary image. We denote the kth Betti numbers of X by β k (X). To formalize this, we will treat binary images as cubical complexes, which we describe below. We refer readers to [22] for more detailed discussions on cubical complexes and homology.
We consider intervals of the form [ , Intervals of the form [ ] are called degenerate. We define an elementary cube to be a finite product of such intervals. In other words, Q is an elementary cube if Q = I 1 ×I 2 ×. . .×I n where I j is an elementary interval for j = 1, . . . n. The dimension of Q, denoted dim(Q), is the number of nondegenerate intervals in the product. We say the set X is cubical if it can be written as a finite union of elementary cubes. Let K(X) = {Q ∈ K | Q ⊂ X} denote the set of cubes making up X and let K k (X) = {Q ∈ K(X) | dimQ = k} denote the set of k-dimensional cubes in X. Now that we have a topological framework, we seek to provide a complementary algebraic framework. For this, we fix a ring R and cubical set X. (Note: a ring [24] is a system of numbers which admits addition and multiplication e.g. R, Z or Z 2 = {0, 1}.) We define the k-th chain module over X, denoted C k (X) to be the formal span of its elementary cubes of dimension k. That is, For each k ∈ N, C k (X; R) is an algebraic structure known as an R-module. If R = R (or more generally is a field) then the R-module C k (X; R) is just a vector space over R. It is worth mentioning that computing persistent homology using R = Z 2 is much simpler and efficient than using R = R, and, although the homologies are different in general, much of the utility is the same [21].
Each element of a chain module is called a chain. We now define the algebraic boundary map, denoted ∂, between chain modules. First, for any interval Note that if I j = [ , + 1] we define the term and if I j is degenerate, we define the term to be 0. Finally, It is a well-known and important fact that ∂∂c = 0. Note also that ∂ k := ∂ | C k (X;R) : C k (X; R) → C k−1 (X; R) is a map from the k-chain module to the (k − 1)-chain module. Thus given a cubical set X we can construct the chain complex of X, denoted C(X; R), the collection of all k-chain modules together with their boundary maps. To avoid notation overload, we will write C k for C k (X; R) when the context is clear. We can visualize a chain complex through the following sequence of mappings The kernel, ker ∂ k := {c ∈ C k | ∂ k (c) = 0}, of the k-th boundary map is called the set of k-cycles or just cycles and the image im ∂ k := {∂ k (c)|c ∈ C k } is the k + 1-boundary or just the boundary. Because ∂∂ ≡ 0 we im ∂ k is a normal additive subgroup of ker ∂ k+1 . Thus we define the k-th homology group of K to be the quotient group [24] The k-th Betti number is defined to be the rank (or order) of the k-th homology group (its dimension as an R-module). We denote this by β k (X; R).
In computer vision, a natural generation of binary images is grayscale image. Consider a grayscale image I where each pixel value I(x, y) is between 0 and 255. As I(x, y) increases from 0 to 255 the pixel (x, y) transitions from a color of black at 0, to steadily lightening shades of gray, to pure white at 255.
To obtain a binary image from I, one might threshold I by some value t to obtain a binary image T (I, t). The pixel function of T (I, t) is T (x, y, t) where T (x, y, t) = 1 if I(x, y) ≤ t and 0 otherwise. We alternately view T (I, t) as the set of pixels (x, y) for which T (x, y, t) = 1. Thus The Betti numbers of an binary image. By convention a binary image represents the cubical complex X of white pixels in the image, and its Betti number is β 0 (X) = 4 and β 1 (X) = 2. Note that if the image is surrounded by a boundary of white pixels, then β 0 (X) = 5 and β 1 (X) = 3.
A user's choice of any given threshold t at which to calculate the homologies of T (I, t) would be arbitrary. Persistent homology offers a methodology to consider all possible threshold values on I at once.
Suppose X represents a cubical set. We define a filtration of X to be a sequence of cubical sets indexed by a finite discrete set (often of real numbers) It is also handy to sometimes set X 0 = ∅. We can define a filtering function f : K(X) → {a 1 , . . . , a n } that assigns each cube C ∈ K(X) to the complex in which it appears first. That is if f (C) = a k then C ∈ K(X a k ) and C / ∈ K(X a k−1 ). Because of the inclusions, this also guarantees This last definition is quite useful in the case of images. Recall that every binary image can be viewed as a cubical complex. Hence given a grayscale image I, we can create a filtering function for this image to send each elementary 2-cube and its boundary elements to the pixel value of that cube. Thus, an n × m grayscale image acts as a filtering function on the cubical set [0, Suppose we have a filtration of cubical sets, The inclusion induces a homomorphism (see [24]) between the homology groups so that for each k we have, and one of the following hold: The last condition is described as the elder rule, which allows us to uniquely define the death of a class. This rule says in the choice between two classes, we choose to keep the "oldest" class. We can guarantee that every homology class α ∈ H k (X i ) for some i has a birth time. We cannot guarantee that each class has a death time. For such classes, we assign the "death time" as ∞. This procedure allows us to define a unique multi-set of points, one point (b, d) for each homology class where b is the birth time of the class and d is its death time. By collecting these pairs accounting for their multiplicity, we obtain a summary of the data called a persistence diagram for each dimension k. Refer to Fig. 4 for an example of a filtration and corresponding persistence diagrams for an image.
It has been proven that the persistence diagram is a stable summary of a space in the sense that small changes in the space correspond to small changes in the corresponding diagram [25] . These diagrams are an integral part of our algorithm as described in Section IV For a grayscale image I we use the filtration {T (I, t)} 255 t=0 . and calculate persistence diagrams for dimensions 0 and 1 (2 dimensional homology for 2-D images is trivial). We extend this notion to color images, by considering each channel in the color space individually. Suppose image I has I(x, y) = [R(x, y), G(x, y), B(x, y)] where 0 ≤ R(x, y), G(x, y), B(x, y) ≤ 255 are the intensities of red, green, and blue light at pixel (x, y), with 0 being no intensity and 255 being full intensity. Then one could obtain the persistence diagrams of R(x, y), B(x, y) and G(x, y) viewed each as a separate grayscale image.

B. Persistence Statistics and Persistence Curves
Both PC and PS can be viewed as summaries of persistence diagrams. The main idea is to extract topological information from persistence diagrams, and use them as features to build the classification model. PCs were introduced in [9], and are proven successful to texture datasets; persistence statistics were studied in [10] to describe different types of human red blood cells.
PS are statistical measurements of the birth and death coordinates. Given a persistence diagram P , denote the birth and death coordinates by b and d, respectively. Recall that from  ) where b < d, and hence, P contains finite number of pairs of points. The widelyused quantity, d − b, represents the "life" of this generator, meaning how persist this generator is in the filtration. One of the simplest summaries of the persistence diagram is called total persistence, and is defined as We also consider the midlife coordinates, where i = 0, 1. As shown in Table I, they are samples of persistence statistics used in the article. In addition to above statistics measurements, we also consider so called persistence entropy introduced in [26], which is defined as − (b,d)∈Pi p log(p), for i = 0, 1. Persistence entropy can be viewed as the diversity of the lifespans.
On the other hand, PCs offer a method to create a vectorized summary of a persistence diagram. By mapping a persistence diagram to a vector space, we open the door to the application of machine learning algorithms. The motivation for this class of curves lies in the Fundamental Lemma of Persistent Homology which states that given a k-dimensional persistence diagram P corresponding to a filtration X 1 ⊂ . . . X n , the space in the filtration, X t corresponding to threshold t has exactly β k (X t ) = |{(b, d) ∈ P | b ≤ t, d > t}|. Recall the formal definition of PCs from [9], which is a generalization of this idea. Let D represent the set of all persistence diagrams. let F represent the set of all functions ψ : D × R 3 → R so that ψ(D; x, x, t) = 0 for all x ∈ R. Let T represent the set of statistics or operators that map multi-sets to the reals, and finally let R represent the set of functions on R. We define a map P : The function P (D, ψ, T ) is called a persistence curve on D with respect to ψ and T . In [9], it is shown that persistence landscapes [27] are a special case of PCs.
In the present application, all filtrations have exactly 255 space. Thus, for each diagram, a persistence curve is a vector in R 255 . The two particular functions that were of greatest use were the functions ψ(b, d, t) = 1 giving rise to the Betti curve β(t) and e(b, d, t) = − d−b L log d−b L giving rise to a variant of the entropy summary (curve) E(t). The entropy summary and its stability are discussed in [26], [28]. In [9], a general stability result for an entire class of PC is given. We calculate the curves for the 0 and 1 dimensional persistence diagrams for each channel in our color space. Finally, we fed these features into machine learning models. The persistence curves we used in our final model are 1) β 0 (t) and β 1 (t).
We summarize our approach as follows. First, we apply the segmentation algorithm discussed in Section II-C to obtain the image mask. Second, we apply the mask to the original image. Third, we transform the RGB color space into RGB, HSV, or XYZ color space, and extract each channel. Fourth, we use persistent homology software, specifically, Perseus [29] and CubicalRipser [30], to compute persistence diagrams for each channel. Finally, from each persistence diagram, we calculate persistence curves, and persistence statistics as features.

C. Segmentation
We now describe the segmentation algorithm. The goal is to take an image I of a patch of skin that has a (contiguous) skin lesion and isolate the connected component of pixels corresponding to the diseased skin. The output is to be a mask M , a binary image with M (x, y) = 255 (or white) if pixel (x, y) is part of the lesion and M (x, y) = 0 (or black) if pixel (x, y) is part of the healthy skin in the image.
Naturally this is a subjective task. As a check, a mask M that is obtained from an image I is compared to the mask M that is obtained by a dermatologist. The evaluation metric is the Intersection over Union (IOU) score, namely J(M, M ) = |M ∩ M |/|M ∪ M |.
The segmentation algorithm proceeds in the following steps.
Step 1: We transform each RGB image I into a gray image I * that is the average the color intensities, i.e.
Step 2: Because the region of skin in the lesion is usually darker than the healthy region, we first compute the average value a of I * . If the pixel value in I * is less than a we take this to mean that it is more likely to be a part of the lesion. The second step is to observe the life interval of each pixel. Like persistent homology, for each step t with 0 ≤ t ≤ T , we define I * t to be the binary image Because pixels in I * t usually become 0 when t ≥ 30, we choose T = 50 in our implementation.
We define S t to be the set of all white pixels in I * t in step t and get the following filtration: This filtration is illustrated in Fig. 6 for one training image. As in persistent homology, we measure the life interval of each white pixel in S 1 , white pixels in S 1 with long life intervals have a more robust property in the whole image, and are more likely to be a part of the lesion. If (x, y) ∈ S t for 1 ≤ t ≤ t 0 and (x, y) ∈ S t for t > t 0 we say the life-span of (x, y) is L(x, y) = t 0 . If (x, y) ∈ S 1 we set L(x, y) = 0.
Step 3: For each image, we calculate a threshold T with 1 < T < T . The output segmentation is determined by connected components in S T . In general, as t gets larger, the number of connected components will decrease since noisy parts would disappear first. However, when t gets very large, the main components would be separated by small components. Therefore, we choose where 1 ≤ T ≤ T is the first time step such that S T +1 has more connected components than S T +1 , and 4 is a choice to ensure the main disease component is more complete.
Step 4: A binary image S T with small T usually contains many tiny (or noisy) white components, so we remove them and define life scores LS of connected components of S T . More precisely, if C is a (white) connected component of S T , then its life score is defined by modified average life-span where (x, y) is a pixel in C and d(C, o), d(C, b) are the minimal distance between c and midpoint o and boundary of the image respectively. Both of 1 + d(C, b) and 1 + d(C, o) factors occur in (5) as punishments on components not near the center of the image. (Usually the lesion appears close to the center of the image.) A connected component C with higher score means that this region might be more significant in the whole image. Thus we choose the convex hull of those connected components C 1 , C 2 , . . . , C k with life scores larger than the average life score among all connected components. The proposed method was considered by topological properties of raw images purely, hence any training process were not required. The average IOU score among 2595 images we obtained was 0.6651.

III. TOPOLOGICAL RESNET-101
Convolution neural networks (CNN) become main tools in the deep learning. Many researchers have been developing several CNN models, such as (just to name a few) AlexNet [31], VGG [32] and ResNet [33]. Applications of those deep learning models to the computer vision have proven successful, such as [12], [13], [33]. In this work, we will focus  (3). In this class, our segmentation algorithm selects t = 2 by equation (4), and the output segmentation would be the convex hull of the main connected component in S 2 , which is derived by (5).
on the residual neural network (ResNet); in particular, we will use the ResNet-101 [33], which provides an end-to-end architecture for the image classification. ResNet optimizes the residues between input and desired convolution features. The desired features can be extracted easier and more efficient than other CNN models. Therefore, this optimization of residues can be applied to reducing the number of parameters in a heavy network. Because of the benefit of reduction of parameters, the number of layers can be lifted efficiently. In this work, we modified ResNet with 101 layers by equipping the topological information.
The architecture of the TopoResNet-101 can be shown in Fig. 7. We introduce a new parameter α ∈ [0, 1], called the topological rate, as a weight of different features. We multiply each component in topological features (ResNet-101 output features) by α (1−α, respectively). To be more specifically, the input vector before the last fully-connected in the TopoResNet-101 i.e., the pink bar in Fig. 7 can be represented by where v ResNet−101 is the ResNet-101 output features (yellow bar in Fig. 7, and v Topology is the topological features (blue bar in Fig. 7), and ⊕ is the concatenation operator. Since α is a parameter in the TopoResNet-101, it will be changed in the learning process. In practice, α was initially set to be σ(0.5) ≈ 0.6, where σ : R → [0, 1] is the sigmoid function This sgimoid function will ensure that the α is always in between 0 and 1. In TopoResNet-101, we will always apply the sigmoid function to α. By regarding ResNet-101 as a regression model of data points, α records the importance of topological features for classifying skin lesions. We also allow  α as a weight in the network, so it can be optimized in training epochs. Fig. 10 and 9 show the role of α in the training process for each model.

IV. EXPERIMENTS
In this section, we describe the our main experiments and results. There are two models in this work-SVM and TopoResNet-101. Since the experimental settings of SVM model and TopoResNet-101 are slightly different, we explain the differences between those settings in Section IV-A. In Section IV-B and IV-C, we present the performances of SVM and TopoResNet-101, respectively. The affects of the topological rate, α, are discussed in IV-D.

A. Data Description and Experiment Setting
As shown in Table II, the distribution of images among classes is extremely unbalanced. In the SVM model, we use multi-class SVM with a "one-against-one" strategy and all 10015 skin lesion images as training input of SVM. An additional set of images as a validation dataset was provided by ISIC 2018 website as part of the challenge and the class information of validation images were not available to the end users. Because unbalanced training set may lead to over-fitting on classes which have large cardinality, we train the model by using a subset of the full training set (5000 images chosen randomly from the training set by MATLAB's random seed 1). On the other hand, because the ISIC 2018 validation system had been closed, we collected 50 images from each class as testing dataset (totally 350 images) in original 10015 skin lesion images. For the training process, we separated 70% of the rest images as training data and the others 30% as the validation data. Because CNN models usually need massive training set, the random sampling of training set is not applied for CNN models; in other words, all 10015 skin lesion images were inputted in training process.

B. Performance of SVM
The best 2 scores we had on the validation set was 65.6%, and 67.2% as shown in Table III. The experiment result shows that global information (PS and PC) may be useful features for classifying skin lesions. However, because local features were not considered in the model. Therefore, this motivates us to combine PS and PC into CNN models.

C. Performance of TopoResNet-101
For each topological features (totally 4 sets of features), we trained corresponding TopoResNet-101. For each model in the experiments, we observe convergence of mean class  Table IV presents the performance of the Model 1 ∼ 5 on balanced test dataset. Because features with large dimensions (PC-RGB and PC-XYZ) may result in unstable in the training process, we design a sub-net architecture for reducing the dimension of topological features before concatenation (i.e., pink one in Fig.7). Table IV presents the performance of the Model 6 ∼ 9, where the blue layer in Fig.7 was replaced by Fig. 8.
The reduced dimension in Fig. 8 is set to be 512 via a fully-connected sub-net merged into TopoResNet-101, and the weights of the sub-net were optimized synchronously by backpropagation technique in training processes.
Finally, because adding noise into neural networks may improve the performance [34], we also consider 512 dimensional noise vectors as an additional input of ResNet-101 (Model 9). In practice, all input features were normalized into a vector where each component of the vector is a value between 0 and 1 by Max-Min normalization.  By observing Table IV and Table V, the results show that Model 8 has the best performance among these 9 models. To measure the effect of topological features in neural networks, we plotted mean accuracy and class accuracy of Models 1, 8 and 9 in Fig. 11 in training processes on test dataset. By observing curves in Fig. 11, Model 8 has most stable performance on testing dataset. As we mentioned in Section IV-A, noise embedding in neural networks would also makes the neural network performs more robust. However, except AKIEC lesions (Class 3), topological features performs well than noise inputs significantly.

D. α Rates in TopoResNet-101
First, we note that α converges in all models as shown in Fig. 9 and 10. Observe from Table IV that the α rate seems to be influenced by the dimension of input features as Model 4 and 5 have higher feature dimension that Model 2 and 3 do. This is one of reasons we consider Model 6, 7, and 8. In Fig. 9, α rates converge in training processes, and the α rate of Model 8 is lager than Model 9, this phenomenon suggests that PS and PC are useful for recognizing skin lesions.
On the other hand, to investigate the effects of dimension reduction, we also plotted α curves of Model 4, Model 5, Model 6, Model 7 in Fig. 10. It shows that the reduction network does help optimize α rate more efficiently. However, observe that curves of Model 4 and Model 5 (yellow curve and green curve) in Fig. 10 also show significant peaks and increasing of α rates in earlier epochs. This may suggest that some global information in PC-RGB and PC-XYZ might be lost in the unbalanced training dataset.

E. Discussion
In the deep learning, choosing weights of models is a diffcult task, and it is still an open question whether there is a mathematical/theoretical algorithm or criterion to determine the optimal weight in training epochs. Moreover, because a typical method for choosing weights strongly depends on the performance of validation sets, the accuracy between testing and validation datasets may have a gap if the validation set is unbalanced or which has small amount, while this situation occurs frequently in medical images and data. Therefore, the stability of performance is definitely an advantage for choosing models. In our proposed method, neural models equipped with (reduced) topological features have most stable and good accuracy curve on classification task.

V. CONCLUSION AND FUTURE WORK
The appeal of the PCs and PSs lie in their simplicity. The features themselves do not require user defined parameters, thus one only needs to tune the attached machine learning algorithm. In addition, these features give intuitive shape summaries of the original space. The generalized nature of the persistence curve definition allow for a rich library of usable curves. In this paper, we have chosen to combine the PSs with the Betti and entropy curves and then feed them into SVM and ResNet-101. The performance shows that PS and PC can be used for lifting the models who considered convolution features only. One future direction is to apply these features into other classification tasks. Also, phenomenon in Fig. 9 and 11 shows that TopoResNet-101 with high α rate may perform well in specific classes (Class 2), so it may exist better way to embed α rate to neural models. Finally, because topological features were assumed to be stable on image with noise and it may occurs frequently in real application, it is another important future work to extend the data set with noise and consider the performances between pure ResNet-101 and TopoResNet-101.