Multiple Instance Learning with Differential Evolutionary Pooling

Abstract: While implementing Multiple Instance Learning (MIL) through Deep Neural Networks, the most important task is to design the bag-level pooling function that defines the instance-to-bag relationship and eventually determines the class label of a bag. In this article, Differential Evolutionary (DE) pooling, an MIL pooling function based on Differential Evolution (DE), a bio-inspired metaheuristic, is proposed for the optimization of the instance weights in parallel with training the Deep Neural Network. This article also presents the effects of different parameter adaptation techniques with different variants of DE on MIL.


Introduction
The medical domain is evolving with the use of huge amounts of data, including text, images, and videos, for personal sensing, computer-aided diagnosis, and the treatment of severe diseases. Since these contents are generated from real scenarios, they are loosely controlled, and it is difficult to label them manually. Thus, Multiple Instance Learning (MIL) is employed to deal with the inconsistent, incomplete, and weakly annotated nature of real-world medical data [1,2]. MIL is a form of weakly supervised learning that was initially proposed by Dietterich et al. [3]. MIL can be used in contexts in which training samples are ambiguous; i.e., where various instances corresponding to the same class label have different numbers of attributes or features. In MIL, training samples are formulated as bags, each of which is associated with a class label and contains a set of multiple instances. The objective of MIL is to train the classifier so that it can predict the class of unseen bags (bag-level classification).
To label a bag, the learning model goes through every instance of the bag and determines the impact of the instances on the bag labeling; i.e., which instances actively participate in bag labeling. This process identifies the instance-to-bag relationship. The use of Deep Neural Networks for MIL is a relatively new paradigm in machine learning, where the instance-to-bag relationship is established through pooling functions. Throughout the literature, various pooling functions are used. These pooling functions can be non-trainable or trainable. Sum pooling, mean pooling, max pooling, and log-sum-exp pooling [4] are non-trainable pooling techniques, while attention-based pooling, gated attention-based pooling [5], dynamic pooling [6], adaptive pooling [7], and genetic pooling [8] are trainable pooling techniques. In various works, different pooling techniques are also used in conjunction [9–11]. K. Bhattacharjee et al. [8] used a bio-inspired metaheuristic, the Genetic Algorithm (GA), to design an MIL pooling function. In [12], K. Bhattacharjee et al. used another metaheuristic technique, Differential Evolution (DE) [13].
As suggested as future work in [12], different DE variants can be explored to design a better MIL pooling function. This paper can be considered an extension of [12], in which the most popular DE variants are applied to generate MIL pooling functions. The variants explored in this paper are SaDE [14], jDE [15], JADE [16], and SHADE [17].
The rest of the paper is divided into five sections. Section 2 gives an overview of Multiple Instance Learning (MIL). Section 3 briefly discusses the basic Differential Evolution (DE) and its variants used in this paper. Section 4 defines the proposed methodology, and in Section 5, results are analyzed. Finally, the concluding remarks are given in Section 6.

Multiple Instance Learning
In classical supervised learning, each input vector (i.e., feature vector $X$) has a corresponding output label $Y$. In the case of MIL, the output label $Y$ is mapped to a set of instances, i.e., a bag $X = \{x_1, x_2, \ldots, x_i\}$, instead of a single instance. The cardinalities of the bags are independent of each other; i.e., $i$ can vary from bag to bag, and the bag instances are permutation-invariant. It is assumed that each instance of the bag $\{x_1, x_2, \ldots, x_i\}$ is associated with an output label $\{y_1, y_2, \ldots, y_i\} \in Y$. The prediction of the classifier is based on the prior assumption that a bag is labeled positive if it contains at least one positive instance, and negative if all instances in the bag are negative.
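The standard MIL assumption stated above can be illustrated with a minimal Python sketch. The function name `bag_label` and the example bags are hypothetical and are not taken from the paper; instance labels are assumed to be binary (0 or 1).

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive (1) if at least one
    instance is positive, and negative (0) only if all are negative."""
    if not instance_labels:
        raise ValueError("a bag must contain at least one instance")
    return int(any(label == 1 for label in instance_labels))


# Hypothetical bags of different cardinalities.
positive_bag = [0, 0, 1, 0]
negative_bag = [0, 0, 0]

# The bag label does not depend on the order of the instances.
assert bag_label(positive_bag) == bag_label(list(reversed(positive_bag)))
```

Under this assumption the bag label equals the maximum of the instance labels, which is why max pooling is a natural non-trainable baseline.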
Benefitting from fast and accurate learning from examples, deep learning with MIL was initially proposed by J. Ramon and L. De Raedt [4]. The authors extended the classical neural network to MIL using simple backpropagation. For a deeper understanding of MIL, interested readers may refer to [18,19].
MIL with deep learning has been applied for the training and prediction of imbalanced and incomplete real-world medical imaging data [1,5,8,10,11,20–24]. In [20], S. Wang et al. designed a recalibrated multi-instance deep learning method to classify gastric cancer. M. Yousefi et al. [21] integrated MIL with the randomized tree to classify digital breast tomosynthesis images. To diagnose diabetic retinopathy, P. Cao et al. [22] used multi-class MIL. Landmark-based deep MIL was proposed by M. Liu et al. [23] for brain disease diagnosis. J. Yao et al. [24] used attention-based pooling for whole slide image feature learning. A trainable pooling function using the genetic algorithm was designed in [8] for bag-level labelling. Z. Wang et al. [11] proposed AMI-Net+, an improved AMI-Net neural network using multi-head attention and gated attention-based pooling. In [5], an MIL permutation-invariant bag score function is generated using the attention-based operator and is fully parameterized by the neural network.

Differential Evolution
Differential Evolution (DE) is a population-based metaheuristic technique that is used to solve complex structured optimization problems in many application areas. DE was initially proposed by Storn and Price [13] in 1996. For a more profound understanding of this topic, readers can refer to [25]. In general, DE formulation is divided into two phases: initialization and evolution. The initialization phase comprises random population generation, and the evolution phase consists of mutation, crossover, and selection for generating the new population for the next generation. In view of its "one-to-one-spawning" selection mechanism, it resembles Swarm Intelligence methods such as PSO, but at the same time, it is similar to Evolutionary Algorithms such as the GA, as it requires a "mutation" operator, followed by a "crossover" strategy to produce a new solution. A general flowchart of the operations of DE is presented in Figure 1.

Initialization
In this step, a uniformly distributed random population is generated. Its individuals represent the initial solution points in the search space:
$$x_{j,i} = lower + rand_{j,i} \cdot (upper - lower)$$
Here, $G$ denotes the generation, $NP$ is the number of individuals in the population, $D$ is the dimension of an individual, $lower$ is the lower bound, $upper$ is the upper bound, $rand_{j,i} \in [0, 1]$ is a random number, $i \in \{1, \ldots, NP\}$, and $j \in \{1, \ldots, D\}$.
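As a minimal sketch of this initialization step, each component can be drawn as lower + rand * (upper - lower). The function name `initialize_population` and its parameters are illustrative assumptions, not the authors' implementation; only the Python standard library is used.

```python
import random


def initialize_population(np_size, dim, lower, upper, rng=None):
    """Create NP individuals of dimension D, each component drawn
    uniformly as lower + rand * (upper - lower)."""
    rng = rng or random.Random()
    return [
        [lower + rng.random() * (upper - lower) for _ in range(dim)]
        for _ in range(np_size)
    ]


# Example: 6 individuals with 4 components each in [0, 1] (hypothetical sizes).
population = initialize_population(6, 4, 0.0, 1.0, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible without touching the global random state.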

Mutation
After population generation, mutation is performed to expand the search space. In the mutation strategy, for each target vector, a corresponding mutant vector is generated. DE has various mutation strategies. In this paper, the "DE/rand/1" strategy is used to generate the mutant vector $V_i = (v_{1,i}, v_{2,i}, \ldots, v_{D,i})$:
$$V_i = X_{r_1} + F \cdot (X_{r_2} - X_{r_3})$$
where $V_i$ is the mutant vector, $F \in (0, 1.2]$ is the scaling factor, $X$ are individuals in the population, and $r_1, r_2, r_3 \in \{1, \ldots, NP\}$ are mutually distinct indices with $r_1 \neq r_2 \neq r_3 \neq i$. After mutation, each attribute of the mutant vector is checked for infeasibility. If it is found infeasible (out of bounds), saturation correction is applied [26]; i.e., it is saturated with the nearest bound: if it is less than the lower bound, it is set to the lower bound, and if it is greater than the upper bound, it is set to the upper bound.
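The "DE/rand/1" mutation followed by saturation correction can be sketched as follows. The function name `mutate_rand_1` and the default values are illustrative assumptions; the sketch requires at least four individuals so that three distinct indices different from i exist.

```python
import random


def mutate_rand_1(population, i, f=0.5, lower=0.0, upper=1.0, rng=None):
    """DE/rand/1: V_i = X_r1 + F * (X_r2 - X_r3), with r1, r2, r3 mutually
    distinct and different from i, followed by saturation to the bounds."""
    if len(population) < 4:
        raise ValueError("DE/rand/1 needs at least 4 individuals")
    rng = rng or random.Random()
    candidates = [k for k in range(len(population)) if k != i]
    r1, r2, r3 = rng.sample(candidates, 3)  # three distinct indices
    x1, x2, x3 = population[r1], population[r2], population[r3]
    mutant = [a + f * (b - c) for a, b, c in zip(x1, x2, x3)]
    # Saturation correction: clip every component to the nearest bound.
    return [min(max(v, lower), upper) for v in mutant]
```

Clipping is only one way of repairing infeasible components; it is used here because it matches the saturation correction described in the text.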

Crossover
Crossover is performed between the target vector and the mutant vector to increase the diversity of the population and to assimilate the best individual. After the crossover, trial vectors are generated. A trial vector $U_i = (u_{1,i}, u_{2,i}, \ldots, u_{D,i})$ is built component-wise as
$$u_{j,i} = \begin{cases} v_{j,i}, & \text{if } rand_{j,i} \le CR \text{ or } j = j_r \\ x_{j,i}, & \text{otherwise} \end{cases}$$
where $CR \in [0, 1]$ is the crossover probability, $rand_{j,i} \in [0, 1]$ is a random number, and $j_r \in \{1, \ldots, D\}$ is a randomly chosen dimension. The crossover probability (CR) is the DE parameter that controls the crossover between the target vector and the mutant vector; i.e., for a given dimension, if $rand_{j,i} \le CR$ or $j = j_r$, the value for that dimension is copied from the mutant vector to the trial vector; otherwise, it is copied from the target vector. Thus, the greater the value of CR, the more the trial vector resembles the mutant vector; the smaller the value of CR, the more the trial vector resembles the target vector.
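A minimal sketch of this crossover rule is given below. The function name `binomial_crossover` is an assumption for illustration; the index `j_rand` plays the role of j_r and guarantees that at least one component comes from the mutant vector.

```python
import random


def binomial_crossover(target, mutant, cr=0.8, rng=None):
    """Copy a component from the mutant if rand <= CR or j == j_rand,
    otherwise keep the component of the target vector."""
    rng = rng or random.Random()
    j_rand = rng.randrange(len(target))  # forced mutant dimension
    return [
        m if (rng.random() <= cr or j == j_rand) else t
        for j, (t, m) in enumerate(zip(target, mutant))
    ]
```

With CR close to 1 the trial vector is almost a copy of the mutant; with CR close to 0 it differs from the target in roughly one dimension only.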

Selection
Tournament selection is performed between the trial and the target vector, and the vector with the better fitness value moves on to the next generation. For a minimization problem,
$$X_i^{G+1} = \begin{cases} U_i, & \text{if } f(U_i) \le f(X_i) \\ X_i, & \text{otherwise} \end{cases}$$
where $f(\cdot)$ is the objective function and $X_i^{G+1}$ is the individual carried over to generation $G + 1$.
DE is one of the most popular and heavily applied metaheuristics. DE variants regularly rank among the top algorithms at the IEEE Congress on Evolutionary Computation (CEC), one of the largest and most important conferences in the field of evolutionary computation. In recent times, LSHADE-RSP [27], LSHADE-cnEpSin [28], and LSHADE-EpSin [29] gained third, second, and first ranks in CEC 2018, CEC 2017, and CEC 2016, respectively [25]. A modified L-SHADE (mL-SHADE) [30] for single objective real-parameter optimization was introduced in CEC 2019 by Yeh et al. A new adaptive variant, Explicit Adaptive Differential Evolution (EaDE) [31], was proposed by Zhang et al. this year; it explicitly controls the exploitation and exploration strategies of DE. A new selection operation for DE [32] was also proposed this year by Zeng et al. In CEC 2019, a new, improved SALSHADE-cnEpSin algorithm with adaptive parameters [33], a Bi-Level Differential Evolutionary Algorithm (BLDE) for Constrained Optimization [34], NSADE [35], and Differential Evolutionary Multi-Task Optimization (DEMTO) [36] were published. In CEC 2020, a Novel Center-Based Differential Evolution Algorithm [37], Clustering-Based Adaptive Differential Evolution for Numerical Optimization (FCADE) [38], a Differential Evolution Algorithm with Q-Learning (DE-QL) for Solving Engineering Design Problems [39], Multi-Population Modified L-SHADE for Single Objective Bound Constrained Optimization [40], and an extension of mL-SHADE were proposed. The CEC and GECCO conferences and the journals Applied Soft Computing, Soft Computing, Swarm and Evolutionary Computation, and Evolutionary Computation are the top-ranking venues in which most of the important research regarding DE is published. A thorough discussion of the recent advances in DE is not within the scope of this paper; the proceedings of the aforementioned conferences and journals should be explored for a better understanding of these advances.
Over time, many variants of DE have been proposed by many researchers. Among those, the four most popular DE variants in recent times have been chosen to study their effects on MIL. The parameter adaptation techniques of these four variants are discussed in the following subsections.
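Before turning to the variants, the four steps of basic DE (initialization, DE/rand/1 mutation with saturation, crossover, and one-to-one selection) can be put together in a compact sketch. This is an illustrative implementation under stated assumptions, not the authors' code: the function name, the default population size, the number of generations, and the test objective are all hypothetical, and the population is updated synchronously (all trial vectors are built from the current generation).

```python
import random


def differential_evolution(f, dim, np_size=20, f_scale=0.5, cr=0.8,
                           lower=0.0, upper=1.0, generations=100, seed=0):
    """Minimize f over [lower, upper]^dim with basic DE/rand/1/bin."""
    rng = random.Random(seed)
    pop = [[lower + rng.random() * (upper - lower) for _ in range(dim)]
           for _ in range(np_size)]
    fitness = [f(x) for x in pop]
    for _ in range(generations):
        new_pop, new_fit = [], []
        for i in range(np_size):
            others = [k for k in range(np_size) if k != i]
            r1, r2, r3 = rng.sample(others, 3)
            # Mutation with saturation correction.
            mutant = [
                min(max(pop[r1][j] + f_scale * (pop[r2][j] - pop[r3][j]),
                        lower), upper)
                for j in range(dim)
            ]
            # Crossover between target and mutant.
            j_rand = rng.randrange(dim)
            trial = [mutant[j] if (rng.random() <= cr or j == j_rand)
                     else pop[i][j] for j in range(dim)]
            # One-to-one selection: the better of trial and target survives.
            trial_fit = f(trial)
            if trial_fit <= fitness[i]:
                new_pop.append(trial)
                new_fit.append(trial_fit)
            else:
                new_pop.append(pop[i])
                new_fit.append(fitness[i])
        pop, fitness = new_pop, new_fit
    best = min(range(np_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


# Hypothetical objective: squared distance to the point (0.5, 0.5, 0.5).
best_x, best_f = differential_evolution(
    lambda x: sum((v - 0.5) ** 2 for v in x), dim=3)
```

Because selection never replaces a target with a worse trial, the best fitness in the population cannot deteriorate from one generation to the next.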

SaDE
In SaDE [14], $F_i$ takes different random values in the range (0, 2], drawn from a normal distribution with a mean of 0.5 and a standard deviation of 0.3, for different individuals in each generation. $F$ is related to the convergence speed and has more flexibility compared to $CR$. A normally distributed $F$ can maintain both a local (with small $F$ values) and a global (with large $F$ values) search ability to generate potentially good mutant vectors throughout the evolution process. $CR$ is much more sensitive to the property and complexity of a problem than $F$. In SaDE, previous learning experience within a certain generation interval is accumulated to dynamically adapt the value of $CR_i$ to a suitable range. $CR_i$ is initialized for each individual in the population through a normal distribution in the range (0, 1] with an initial mean of 0.5 and a standard deviation of 0.1. The $CR_i$ values for all individuals remain the same for several generations (five in the experiments performed here), and then a new set of $CR_i$ values is generated under the same normal distribution. During every generation, the $CR_i$ values associated with trial vectors that successfully enter the next generation are recorded. After a specified number of generations (10 in the experiments performed here), the mean of the normal distribution of $CR_i$ is recalculated from all the recorded $CR_i$ values corresponding to successful trial vectors during this period. With this new mean and a standard deviation of 0.1, the aforementioned procedure is repeated. As a result, a proper $CR$ value range can be learned for the particular problem. The record of the successful $CR_i$ values is emptied once the mean of the normal distribution is recalculated, to avoid possible inappropriate long-term accumulation effects.
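The SaDE parameter sampling described above can be sketched as follows. The function names are hypothetical, and the handling of values that fall outside the stated ranges (resampling until the value is in range) is an assumption of this sketch; the text only specifies the ranges and the distributions.

```python
import random


def sample_f(rng):
    """Draw F_i from N(0.5, 0.3), resampling until it lies in (0, 2].
    Resampling out-of-range values is an assumption of this sketch."""
    while True:
        f = rng.gauss(0.5, 0.3)
        if 0.0 < f <= 2.0:
            return f


def sample_cr(rng, cr_mean):
    """Draw CR_i from N(cr_mean, 0.1), resampling until it lies in (0, 1]."""
    while True:
        cr = rng.gauss(cr_mean, 0.1)
        if 0.0 < cr <= 1.0:
            return cr


def update_cr_mean(successful_crs, cr_mean):
    """Recompute the CR mean from the recorded successful CR values and
    empty the record; keep the old mean if nothing was recorded."""
    if not successful_crs:
        return cr_mean
    new_mean = sum(successful_crs) / len(successful_crs)
    successful_crs.clear()
    return new_mean
```

In a full SaDE loop, `sample_cr` would be called every five generations and `update_cr_mean` every ten, matching the intervals quoted in the text.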

jDE
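In jDE [15], $F_i$ and $CR_i$ are initialized for each individual $X_i$ in the population (the initial $F_i$ and $CR_i$ are 0.5 and 0.9 for all individuals in the experiments performed here). $F_i$ and $CR_i$ are updated in each generation in the following manner:
$$F_i^{G+1} = \begin{cases} F_l + rand_1 \cdot F_u, & \text{if } rand_2 < t_1 \\ F_i^{G}, & \text{otherwise} \end{cases}$$
$$CR_i^{G+1} = \begin{cases} rand_3, & \text{if } rand_4 < t_2 \\ CR_i^{G}, & \text{otherwise} \end{cases}$$
where $rand_1$, $rand_2$, $rand_3$, and $rand_4$ are uniform random values in the interval [0, 1]. $t_1$, $t_2$, $F_l$, and $F_u$ are constants with values of 0.1, 0.1, 0.1, and 0.9, respectively.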
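A minimal sketch of the jDE self-adaptation rule follows, assuming the standard form from [15]: with probability t_1, F_i is replaced by F_l + rand_1 * F_u, and with probability t_2, CR_i is replaced by a fresh uniform random value; otherwise the previous values are kept. The function name `jde_update` is hypothetical.

```python
import random


def jde_update(f_i, cr_i, rng, t1=0.1, t2=0.1, f_l=0.1, f_u=0.9):
    """Return the (F_i, CR_i) pair to be used in the next generation."""
    rand1, rand2, rand3, rand4 = (rng.random() for _ in range(4))
    new_f = f_l + rand1 * f_u if rand2 < t1 else f_i
    new_cr = rand3 if rand4 < t2 else cr_i
    return new_f, new_cr


# Each individual carries its own parameters (initial values from the text).
rng = random.Random(0)
f_i, cr_i = 0.5, 0.9
f_i, cr_i = jde_update(f_i, cr_i, rng)
```

With the constants given in the text, a regenerated F_i lies in [0.1, 1.0), so the scaling factor never collapses to zero.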

JADE
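In JADE [16], for each individual $X_i$ in a population, $F_i$ and $CR_i$ are generated in the interval [0, 1] with the following equations:
$$F_i = randc_i(\mu_F, 0.1), \quad CR_i = randn_i(\mu_{CR}, 0.1)$$
where $randc_i(\mu, \sigma)$ and $randn_i(\mu, \sigma)$ denote draws from a Cauchy distribution (with location $\mu$ and scale $\sigma$) and a normal distribution (with mean $\mu$ and standard deviation $\sigma$), respectively. $\mu_F$ is the location parameter of the Cauchy distribution used to generate $F_i$, while $\mu_{CR}$ is the mean of the normal distribution used to generate $CR_i$. Both $\mu_F$ and $\mu_{CR}$ are initialized to 0.5. The $F$ and $CR$ values associated with trial vectors successfully entering the next generation are recorded in $S_F$ and $S_{CR}$. In each generation, $\mu_F$ and $\mu_{CR}$ are updated as follows:
$$\mu_F = (1 - c) \cdot \mu_F + c \cdot mean_L(S_F)$$
$$\mu_{CR} = (1 - c) \cdot \mu_{CR} + c \cdot mean_A(S_{CR})$$
where $c$ is the learning rate, which is taken to be 0.1, $mean_L$ is the Lehmer mean, $mean_L(S_F) = \sum_{F \in S_F} F^2 / \sum_{F \in S_F} F$, and $mean_A$ is the arithmetic mean.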
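The JADE sampling and update rules can be sketched as below. Several details are assumptions of this sketch rather than statements from the text: the scale 0.1 for both distributions, regenerating F_i when it is not positive and truncating it to 1 when it exceeds 1, clipping CR_i to [0, 1], and leaving the means unchanged when no successful values were recorded. The function names are hypothetical.

```python
import math
import random


def lehmer_mean(values):
    """Lehmer mean: sum(F^2) / sum(F); it favors larger values."""
    return sum(v * v for v in values) / sum(values)


def jade_sample(mu_f, mu_cr, rng):
    """Draw F_i from a Cauchy distribution and CR_i from a normal one."""
    while True:
        # Inverse-CDF sampling of a Cauchy variate (location mu_f, scale 0.1).
        f = mu_f + 0.1 * math.tan(math.pi * (rng.random() - 0.5))
        if f > 0.0:
            break
    f = min(f, 1.0)
    cr = min(max(rng.gauss(mu_cr, 0.1), 0.0), 1.0)
    return f, cr


def jade_update(mu_f, mu_cr, s_f, s_cr, c=0.1):
    """Move mu_F and mu_CR toward the means of the successful values."""
    if s_f:
        mu_f = (1 - c) * mu_f + c * lehmer_mean(s_f)
    if s_cr:
        mu_cr = (1 - c) * mu_cr + c * sum(s_cr) / len(s_cr)
    return mu_f, mu_cr
```

The heavy tails of the Cauchy distribution occasionally produce large F values, which helps maintain diversity, while the Lehmer mean biases the update toward larger successful scaling factors.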

SHADE
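SHADE [17] performs parameter adaptation depending on the success history. It maintains a historical memory with $H$ entries (in the experiments performed here, $H$ is taken to be equal to the population size $NP$) for both $F$ and $CR$, denoted $M_F$ and $M_{CR}$, respectively. All elements of $M_F$ and $M_{CR}$ are initialized to 0.5. In each generation, the $F_i$ and $CR_i$ used by each individual $X_i$ are generated as follows:
$$F_i = randc_i(M_{F,r}, 0.1), \quad CR_i = randn_i(M_{CR,r}, 0.1)$$
where $r$ is a random index in the interval $[1, H]$. The $F$ and $CR$ values associated with trial vectors successfully entering the next generation are recorded in $S_F$ and $S_{CR}$, respectively. In each generation, $M_F$ and $M_{CR}$ are updated as follows (following the update rule of the original SHADE formulation [17]):
$$M_{F,k} = \begin{cases} mean_{WL}(S_F), & \text{if } S_F \neq \emptyset \\ M_{F,k}, & \text{otherwise} \end{cases} \qquad M_{CR,k} = \begin{cases} mean_{WA}(S_{CR}), & \text{if } S_{CR} \neq \emptyset \\ M_{CR,k}, & \text{otherwise} \end{cases}$$
where $k$ is the memory position to be updated (it starts at 1, is incremented after each update, and is reset to 1 when it exceeds $H$), $mean_{WL}$ is the weighted Lehmer mean, and $mean_{WA}$ is the weighted arithmetic mean, with weights proportional to the fitness improvement achieved by the corresponding successful trial vectors.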
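The success-history memory update of SHADE can be sketched as follows, assuming the rule of the original SHADE formulation [17]: when successful values were recorded, one memory slot is overwritten with the weighted Lehmer mean of the successful F values and the weighted arithmetic mean of the successful CR values, with weights proportional to the fitness improvements. The function names, the 0-based slot index, and the uniform-weight fallback for zero total improvement are assumptions of this sketch.

```python
def weighted_lehmer_mean(values, weights):
    """Weighted Lehmer mean: sum(w * F^2) / sum(w * F)."""
    num = sum(w * v * v for v, w in zip(values, weights))
    den = sum(w * v for v, w in zip(values, weights))
    return num / den


def shade_update(m_f, m_cr, k, s_f, s_cr, improvements):
    """Overwrite memory slot k (0-based here) and return the next slot.
    improvements[n] is |f(trial) - f(target)| for the n-th success."""
    if not s_f:
        return k  # no successful trial vector: memory is left unchanged
    total = sum(improvements)
    if total > 0:
        weights = [d / total for d in improvements]
    else:
        weights = [1.0 / len(s_f)] * len(s_f)  # fallback, an assumption
    m_f[k] = weighted_lehmer_mean(s_f, weights)
    m_cr[k] = sum(w * cr for w, cr in zip(weights, s_cr))
    return (k + 1) % len(m_f)  # wrap around after the last slot


# Hypothetical memory with H = 3 entries, all initialized to 0.5.
m_f, m_cr = [0.5] * 3, [0.5] * 3
next_k = shade_update(m_f, m_cr, 0, [0.4, 0.8], [0.2, 1.0], [1.0, 3.0])
```

Because only one slot is overwritten per generation, the memory keeps a sliding window of recently successful parameter settings instead of a single running mean as in JADE.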

Proposed Method
Ilse et al. [5] proposed an attention-based pooling technique where the weighted average of instances is calculated to determine the bag labels. Weights are generated through a deep neural network. The problem can be formulated as follows.
For a bag $H = \{h_1, h_2, \ldots, h_K\}$ with $K$ instances, the bag label is computed through MIL pooling as below:
$$z = \sum_{k=1}^{K} a_k h_k$$
where $a_k$ is the weight corresponding to the instance $h_k$ in the bag. This can be treated as an optimization model where the objective is to determine the best combination of weights so that the classification loss computed from $z$ is minimized. In this paper, DE is used as the optimizer for these weights.
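The weighted pooling operation z = sum over k of a_k * h_k can be illustrated with a short sketch. The function name `weighted_pooling` and the example values are hypothetical; instances are represented as plain lists of floats.

```python
def weighted_pooling(instances, weights):
    """Bag representation z = sum_k a_k * h_k, computed per dimension."""
    if len(instances) != len(weights):
        raise ValueError("one weight per instance is required")
    dim = len(instances[0])
    return [
        sum(a * h[d] for a, h in zip(weights, instances))
        for d in range(dim)
    ]


# Hypothetical bag with two 2-dimensional instances.
z = weighted_pooling([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75])
```

When the weights sum to one, z is a weighted average of the instances; mean pooling is the special case in which all weights equal 1/K.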
In the case of general pooling methods (i.e., sum pooling, mean pooling, max pooling, or log-sum-exp pooling), a bag of instances is fed to the neural network, which generates the instance labels, and the pooling function extracts the bag label from these instance labels. This process is presented in Figure 2. In attention-based and gated attention-based pooling, extra dense layers are used for generating instance weights; i.e., along with instance label generation, the neural network also generates instance weights. This instance weight generator is a fully connected block with three layers. Then, these instance labels and instance weights are used by the attention-based and gated attention-based pooling functions to generate the bag labels. This process is presented in Figure 3. The evolutionary pooling function removes these extra layers for instance weight generation, as it randomly initializes a population of attention weights in [0, 1], and as the neural network trains, these weights are optimized simultaneously through DE or GA. For each set of weights representing the individuals of the population, the model is trained, and through a number of generations or passes, optimum values of the weights are obtained, thereby minimizing the loss. This process is presented in Figure 4.
The instance weight optimization process through DE is presented in Figure 5. First, a population of instance weights is initialized with NP individuals. Meanwhile, the neural network generates the instance labels. Using these instance labels and the population of instance weights, NP bag labels are predicted by the model (target bag labels). Then, through mutation and crossover operations, a trial population of instance weights is generated. Again, using the instance labels and the trial population of instance weights, NP bag labels are predicted by the model (trial bag labels). Since the true bag labels are known, the prediction error (classification loss) is calculated for both the target bags and the trial bags. Then, for each individual, a tournament selection is performed between the target vector and the trial vector. The vector with the smaller error is forwarded to the next generation. This process is repeated until the stopping criteria are met. An algorithm/pseudo-code for the MIL pooling function using DE is presented in Algorithm 1.
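The weight optimization loop described above can be sketched in a strongly simplified form. This is not the authors' Algorithm 1. The assumptions are: the network is frozen and replaced by fixed, hypothetical instance scores in (0, 1); one weight is kept per instance position; the bag score is the weight-normalized sum of instance scores; and the loss is the mean binary cross-entropy over all bags. All names and data are illustrative.

```python
import math
import random


def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one bag, with clipping for stability."""
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def bag_score(weights, scores):
    """Weighted pooling, normalized by the weight sum (an assumption)."""
    w = weights[:len(scores)]
    total = sum(w)
    if total == 0:
        return sum(scores) / len(scores)  # fall back to mean pooling
    return sum(a * h for a, h in zip(w, scores)) / total


def loss(weights, bags, labels):
    return sum(bce(y, bag_score(weights, b))
               for b, y in zip(bags, labels)) / len(bags)


def de_pooling(bags, labels, np_size=10, f_scale=0.5, cr=0.8,
               generations=50, seed=0):
    rng = random.Random(seed)
    dim = max(len(b) for b in bags)
    pop = [[rng.random() for _ in range(dim)] for _ in range(np_size)]
    fit = [loss(x, bags, labels) for x in pop]  # target losses
    for _ in range(generations):
        new_pop, new_fit = [], []
        for i in range(np_size):
            others = [k for k in range(np_size) if k != i]
            r1, r2, r3 = rng.sample(others, 3)
            j_rand = rng.randrange(dim)
            trial = []
            for j in range(dim):
                if rng.random() <= cr or j == j_rand:
                    v = pop[r1][j] + f_scale * (pop[r2][j] - pop[r3][j])
                    trial.append(min(max(v, 0.0), 1.0))  # keep in [0, 1]
                else:
                    trial.append(pop[i][j])
            trial_fit = loss(trial, bags, labels)  # trial loss
            # The vector with the smaller error moves on.
            if trial_fit <= fit[i]:
                new_pop.append(trial)
                new_fit.append(trial_fit)
            else:
                new_pop.append(pop[i])
                new_fit.append(fit[i])
        pop, fit = new_pop, new_fit
    best = min(range(np_size), key=fit.__getitem__)
    return pop[best], fit[best]


# Hypothetical instance scores for four bags and their true labels.
bags = [[0.9, 0.1, 0.2], [0.1, 0.2, 0.1], [0.2, 0.8, 0.1], [0.1, 0.1, 0.3]]
labels = [1, 0, 1, 0]
best_weights, best_loss = de_pooling(bags, labels)
```

In the method described in the text, the instance labels change as the network trains, so the losses would be re-evaluated against the current network output in every pass; the frozen scores here only serve to isolate the DE part.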
The pooling function is independent of the neural network architecture. In this paper, the AMI-Net [10] is used. The architecture and experimental setup are the same as those used in [8] to allow a comparison of the results with other pooling functions. The AMI-Net is described pictorially in Figure 6. First, through the embedding layer, each instance of the input (bag of instances) is mapped to a dense vector. Then, in the multi-head attention [41] layer, the intra-relationship of instances in different embedding subspaces is captured, where the subspaces represent the organs or body parts affected by a particular disease. These symptoms are often related to each other, and this relation is formulated mathematically by the multi-head attention layer. Then, to mine the instance correlations, layer normalization [42] and residual connection are used. In the next step, a set of fully-connected layers is employed to obtain the instance representations. These instance labels are obtained through instance-level pooling and act as the bag representations in bag-level pooling, where bag labels are obtained to classify the bags.
The DE parameters used include a scaling factor F of 0.5 and a crossover probability CR of 0.8. Binary cross-entropy is used as the objective function to compute the loss between the calculated and the desired bag labels, as the dataset has two classes: schizophrenia relapse and non-relapse. The maximum number of iterations or epochs is set at 100. Experiments are run for five iterations on the dataset for a fair comparison. To optimize the network, an Adam optimizer is used with a learning rate of 0.00001, β1 = 0.9, β2 = 0.98, and ε = 1e−8.

Experimental Setup
The dataset used in [8,10,11] is used in this study; it is known as The Western Medicine (WM) dataset, which is a schizophrenia dataset of 3927 patients with 88 medical features. For a particular patient, there can be a maximum of 21 features and a minimum of 5 features. This dataset is quite imbalanced, with a positive rate of only 0.057. The objective of this work is to predict the possibility of relapse of a schizophrenic patient within a duration of three months.
The evaluation metrics used in this paper are the area under the curve (AUC), average precision (AP), accuracy, balanced accuracy, negative predictive value (NPV), specificity or true negative rate (TNR), Zero One loss, and Hamming loss.
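For reference, the threshold-based metrics in this list can be computed from the confusion matrix as sketched below. The function name `classification_metrics` and the example labels are hypothetical; the threshold-free metrics (AUC and AP) are omitted because they require predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Metrics derived from the binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    tpr = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    tnr = tn / (tn + fp) if tn + fp else 0.0   # specificity
    npv = tn / (tn + fn) if tn + fn else 0.0   # negative predictive value
    accuracy = (tp + tn) / n
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (tpr + tnr) / 2,
        "npv": npv,
        "specificity": tnr,
        "zero_one_loss": 1 - accuracy,
    }


# Hypothetical labels: 2 positive and 4 negative bags.
metrics = classification_metrics([1, 0, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0])
```

For single-label binary classification, the Hamming loss is the fraction of misclassified samples and therefore coincides with the Zero One loss. Balanced accuracy is informative here because, with a positive rate of 0.057, plain accuracy is dominated by the negative class.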
Experiments were conducted with the Spyder 4.1.4 Integrated Development Environment (IDE) and Python 3.7.7 through an Anaconda distribution on an Intel Xeon Gold 6240 2.6 GHz dual-processor system with 192 GB RAM, an Nvidia Quadro RTX 8000 GPU, and a 64-bit Windows 10 Education Operating System.

Results Analysis
Six bag-level pooling functions (max pooling, mean pooling, sum pooling, log-sum-exp pooling, gated attention-based pooling, and genetic pooling) were used to compare the proposed pooling function numerically as well as graphically. The pooling function designed through the basic DE variant was compared with the other methods first to check whether it performed better than those methods. The numerical results obtained for different performance metrics are presented in Table 1. Highlighted values represent the best results obtained for each metric. The results in the table clearly indicate the robust performance of the proposed method, as it outperformed or performed on par with almost all the other methods used in this study. From a classification perspective, AUC is the most important evaluation metric for determining any classification model's performance. In [8], it was shown that genetic pooling outperformed the other pooling functions in terms of AUC. Here, the DE pooling showed a significant increase in AUC (around 12% more than genetic pooling).
DE operates directly on real numbers, while for GA, it is necessary to work with a chromosomal representation of variables. Thus, in the case of numerical optimization, DE tends to be more effective than GA. The strength of the Differential Evolution approach is that it often displays better results than a genetic algorithm and other evolutionary algorithms and can be easily applied to a wide variety of real-valued problems despite noisy, multi-modal, and multi-dimensional spaces, which usually make the optimization of problems very difficult [43]. This paper deals with a numerical combinatorial optimization problem. The one-to-one spawning technique of DE explores the search space better than GA; thus, DE provides better results than GA. There is also a large increase in the Average Precision score, again supporting the superiority of DE pooling. The approach also exhibits higher accuracy and lower losses. In Table 1, DE pooling outperforms the other methods in every aspect.
The decaying value of the loss is shown through the learning curves depicted in Figure 7. Here, it can be seen that the DE pooling learning curve shows the same characteristics as that of genetic pooling. As the number of generations increases, the genetic and DE pooling methods achieve the lowest loss values. Through this experiment, it was determined that, for metaheuristic approaches, increasing the number of epochs results in a decrease in loss (i.e., better solutions), which is not the case for the other methods. Generally, deep neural network models are trained with a large number of epochs, and thus metaheuristics-based pooling techniques are suitable in this case. The training curves for GA and DE pooling are quite similar; i.e., they converge towards the same solution. Their accuracies are also not greatly different.
Statistical analysis is needed when, even after comparing the algorithms, no concrete conclusion can be reached. Non-parametric statistical models are required in the case of evolutionary algorithms [44]. Here, the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) was used to determine which algorithm ranked higher in terms of various performance measurement metrics. This test is an alternative to the independent samples t-test when the assumptions required by the latter are not met by the data. It is used to compare differences between two independent groups (GA and DE in this case) when the dependent variable is either ordinal or continuous but not normally distributed. For this test, both algorithms were run for five iterations and the critical performance metrics (AUC, accuracy, balanced accuracy, NPV, and Zero One loss) were recorded. The Mann-Whitney test was applied through IBM SPSS software. The statistical results are presented in Tables 2 and 3, where N represents the number of samples. Table 2 presents the rankings, while Table 3 shows the test statistics.
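To make the ranking procedure concrete, the mean ranks and the U statistics of the Mann-Whitney test can be computed as in the following sketch. The function name and the sample values are hypothetical and are not the results reported in Tables 2 and 3; the p-value computation performed by statistical software is omitted.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b, mean_rank_a, mean_rank_b) for two independent
    samples, using average ranks for tied values."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    rank_sum_b = sum(ranks) - rank_sum_a
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b, rank_sum_a / n_a, rank_sum_b / n_b


# Hypothetical AUC values from five runs of two algorithms.
runs_a = [0.80, 0.82, 0.85, 0.81, 0.83]
runs_b = [0.70, 0.72, 0.84, 0.71, 0.73]
u_a, u_b, mean_rank_a, mean_rank_b = mann_whitney_u(runs_a, runs_b)
```

The algorithm with the higher mean rank tends to produce larger values of the metric; whether the difference is significant depends on the p-value associated with the smaller of the two U statistics.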
From Table 2, if we follow the mean rank, then it is clear that the DE pooling outranks the GA pooling in terms of AUC, accuracy, balanced accuracy, and NPV, but falls slightly behind in the case of loss. The corresponding test statistics are given in Table 3.
The results obtained with the DE variants are presented in Table 4. The best result for each metric is highlighted in the table. From Table 4, it can be seen that, for most of the evaluation metrics, the variants produce a better result. Although there is no clear winner among the DE variants, it can be concluded that parameter adaptation improves the MIL DE pooling.

Conclusions
A trainable MIL pooling function based on Differential Evolution (DE) is proposed in this paper and implemented with the AMI-Net architecture. The validation of the method is conducted on The Western Medicine (WM) dataset, and the results are compared with well-known pooling functions such as max pooling, mean pooling, sum pooling, log-sum-exp pooling, gated attention pooling, and genetic pooling. The performance metrics used for comparison are the AUC, average precision score, accuracy, balanced accuracy, NPV, specificity/TNR, Zero One loss, and Hamming loss. It was observed that the proposed pooling function outperformed other approaches for all the metrics. Though there was no single best variant that outperformed the others in every criterion, it is evident that parameter adaptation in DE improved the result. For future work, hybrid metaheuristic techniques can be explored in this case.
Data Availability Statement: Data available in a publicly accessible repository that does not issue DOIs. Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/Zeyuan-Wang/AMI-Net/blob/master/sample_data.xlsx (accessed on 31 March 2021).