MILDMS: Multiple Instance Learning via DD Constraint and Multiple Part Similarity

Abstract: As a subject area of symmetry, multiple instance learning (MIL) is a special form of weakly supervised learning in which the label is related to the bag, not the instances contained in it. The difficulty of MIL lies in the incomplete label information of instances. To resolve this problem, in this paper we propose a novel diverse density (DD) and multiple part similarity combination method for multiple instance learning, named MILDMS. First, we model the target concept optimization with a DD function constraint on the positive and negative instance spaces, which greatly improves robustness to the label noise problem. Next, we combine the positive and negative instances in the bag (generated by hand-crafted and convolutional neural network features) with multiple part similarities to construct an MIL kernel. We evaluate the proposed approach on the MUSK dataset, where the results on MUSK1 (91.9%) and MUSK2 (92.2%) show that our method is comparable to other MIL algorithms. To further demonstrate generality, we also present experimental results on PASCAL VOC 2007 and 2012 (46.5% and 42.2%) and COREL (78.6%) that significantly outperform state-of-the-art algorithms, including deep MIL and other non-deep MIL algorithms.


Introduction
Multiple instance learning (MIL), i.e., learning from ambiguous data (the labels are related to bags rather than the instances within them, meaning that we have only partial or incomplete knowledge about training instances), has been widely studied and applied to many challenging tasks, such as text categorization [1], object tracking [2], person re-identification [3], and computer-aided medical diagnosis [4]. It has therefore received considerable attention, and various algorithms, for example APR [5], DD [6], Citation-KNN [7], EM-DD [8], MI-Kernel [9], miSVM and MISVM [10], DD-SVM [11], MILES [12], MissSVM [13], MIGraph and miGraph [14], MILIS [15], MILDS [16], MILEAGE [17], mi-DS [18], CK_MIL [19], SMILE [20], MIKI [21], TreeMIL [22], MILDM [23], mi-Net and MI-Net [24], and Attention and Gated-Attention MIL [25], have been proposed to deal with the MIL problem. However, two issues hinder its practical application. One is that most of these methods are sensitive to noise, so mislabeled bags degrade classification performance. The other is that these methods cannot handle both the one target concept and the multiple target concepts problems within one MIL framework. In this paper, we combine a novel diverse density (DD) constraint optimization model with multiple part similarity, named MILDMS, to handle these two issues, using hand-crafted and deep features for image categorization.
As an important paradigm in machine learning, MIL was first proposed by Dietterich [5] in the context of the drug activity prediction task, where a drug molecule is suitable for drug design if one of its conformations can tightly bind to the protein molecules, but the biochemical data can only tell us the binding capability of the drug molecule, not its particular conformations. In MIL, to facilitate expression, the drug molecule and its conformations are called the bag and the instances, respectively. In this learning scheme, rather than representing each training bag as a single fixed-length vector, each bag contains a set of instances (fixed-length vectors whose number differs from bag to bag). Under the MIL assumption, all instances in negative bags are negative, whereas instances in positive bags are either positive or negative. Thus, intuitive MIL algorithms such as APR [5] and DD [6] attempt to find a dense positive instance rectangle or ellipse region (also named the target concept region) in instance space, which can be seen as an intersection of positive bags, by expanding or shrinking an axis-parallel rectangle (APR) or by optimizing a DD function. Experimental results on drug activity prediction have proven the performance of these two algorithms. Unfortunately, both APR and DD are sensitive to noise, because noisy data will greatly change the rectangle of APR and influence the DD value. Moreover, APR and DD cannot capture multiple target concepts in many real-world MIL applications, such as image categorization and image retrieval.
The image categorization problem can be naturally cast as multiple instance learning by treating the image as the bag and its segmented regions as instances. Unlike drug activity prediction, image categorization is collectively determined by some of the instances (segmented regions) of the bag (image), or even all of them. For example, an image labeled 'africa' should contain not only people but also elephants, savannah, lions, etc., because the high-level semantic 'africa' is determined by these objects.
Noting the multiple target concepts problem in image categorization and the difficulty it causes, Chen advocates seeking multiple DD points in instance space and converting the image (bag) to fixed-length features [11]. Unfortunately, this method is also sensitive to noise. To overcome this problem, Chen subsequently proposed another algorithm named MILES [12], which maps the bag to a very high-dimensional space generated by all instances of the training bags and uses a 1-norm SVM to select the target concepts. The shortcoming of this method is that the mapping process has high computational complexity.
In this paper, we propose a novel MIL method named MILDMS, which introduces an indicator vector for each instance and models the target concept optimization process with a DD function constraint on the positive and negative instance spaces separately, which can deal well with the label noise problem to some extent. Moreover, we combine the positive and negative instances in the bag (generated by hand-crafted and deep learning features) with multiple part similarities to construct an MIL kernel. Hence, the proposed MILDMS is a robust MIL algorithm that can capture multiple target concepts.
To emphasize the main contribution of this paper, we summarize the following distinct advantages of our novel MILDMS below.

•	We make the first attempt to convert MIL to a multiple part similarity problem and analyze their relationship.
•	Our combination of most positive and most negative instance similarity with multiple part similarities achieves significant improvements in robust MIL when noisy labels are provided.
•	The one target concept and multiple target concepts problems in MIL can be tackled in one framework. Meanwhile, we combine hand-crafted and CNN features into our framework, which provides more powerful feature representation ability.
•	Experiments on the MIL datasets MUSK, PASCAL VOC, and COREL show that our proposal outperforms state-of-the-art baselines, including traditional MIL and deep MIL algorithms.
The rest of the paper is organized as follows. Section 2 provides a brief review of related work. Section 3 analyzes the relationship between multiple instance learning and multiple part similarity and presents a brief overview of our proposed method. The details of the method and its pseudo code are elaborated in Section 4. In Section 5, we present experimental results and analyses to evaluate the proposed algorithm. Finally, Section 6 gives the conclusions.

Related Works
During the past decade, many MIL algorithms have been proposed, which can be roughly divided into three main categories: positive instance identification, bag structure similarity, and deep multiple instance learning methods.
We first introduce the positive instance identification and bag structure similarity algorithms. The difference between these two groups of methods is that the positive instance identification methods first determine the instance labels and then infer the bag label, whereas the bag structure similarity methods reverse these two steps or focus only on the bag label.
The main idea of the positive instance identification methods is to locate a target concept region in instance space in which positive instances reside and from which all negative instances are far away. The obtained target concept region can then be used to select the most positive instance and infer an unknown bag's label. For example, viewing MIL from a geometric perspective, Dietterich [5] advocates that the positive instances exist in an axis-parallel rectangle (APR) region, which can be further used to predict the bag label. Treating MIL as an instance density estimation problem, Maron [6] defines a diverse density function and maximizes it to learn the target concept in a Gaussian-like compact region; Zhang extends this with expectation-maximization (EM) to speed up the process [8]. Variants of support vector machine (SVM) methods, such as MissSVM [13], mi-SVM, and MI-SVM [10], have also been proposed to tackle MIL; they hold the assumption that the positive instances appear in a half-space of the Hilbert space. Unlike the above methods, MILIS [15] selects instances by intertwining the steps of instance selection and classifier learning in an iterative manner, while SMILE [20] introduces a similarity weight for each instance in the positive bag.
Rather than focusing on the most positive instance in the bag, the bag structure similarity methods predict the bag label either directly on the bag or on a structure generated from learned target concepts. Taking the view that bags of the same class should have similar structure, the Citation-KNN [7] and mi-DS [18] algorithms compute bag similarity through references and citations, which is then used to infer the bag label. The difference between these two methods is that Citation-KNN directly applies the minimal Hausdorff distance to compute bag distances, while mi-DS builds the bag distances from rules generated by DataSqueezer. Other researchers, such as Gartner [9] and Zhou [14], directly construct an MIL kernel from bags without pre-processing to obtain the multiple target concepts: Gartner computes statistical characteristics of bags as bag similarity, while Zhou maps each bag to an undirected graph and calculates graph similarity as the MIL kernel. By mapping the bag to the space generated by multiple target concepts, the bag can be converted to a new fixed-length representation vector. Following this idea, Chen proposed two algorithms, DD-SVM [11] and MILES [12], which obtain multiple target concepts either by the DD algorithm or by a 1-norm SVM in the bag-to-instances mapping space.
The third group of MIL methods is deep multiple instance learning. Motivated by the success of convolutional neural networks (CNN), many such algorithms have been proposed. For example, in [24] the authors propose two kinds of methods, mi-Net and MI-Net, with or without deep supervision or residual connections, to deal with the MIL problem; in [25] the authors adopt an attention mechanism in convolutional neural networks and achieve good performance on MUSK, MNIST, Breast Cancer, etc.; in [26] the authors propose an end-to-end learning framework based on deep multiple instance learning, which classifies Panchromatic (PAN) and Multispectral (MS) imagery by joint spectral and spatial information fusion. Meanwhile, deep MIL has also been applied to tooth-marked tongue recognition [27] and image auto-annotation [28].
To understand the relationships among the above MIL algorithms more clearly, we provide in Section 3.1 an analysis that motivates our proposed model.
Let B_i and x_ij be a bag and its j-th instance, and y_i ∈ {+1, −1} and y_ij ∈ {+1, −1} the corresponding labels. For the positive instance identification methods, the posterior probability of bag B_i being positive can be defined as

Pr(y_i = +1 | B_i) = max_j Pr(y_ij = +1 | x_ij),	(1)

which indicates that the bag is totally dominated and determined by the most positive instance. Unlike the positive instance identification methods, the bag structure similarity methods deem that all instances within the bag have equal influence on the bag label, and the posterior probability can be denoted as

Pr(y_i = +1 | B_i) = (1/n_i) Σ_{j=1}^{n_i} Pr(y_ij = +1 | x_ij),	(2)

where n_i is the number of instances in B_i. The shortcoming of these two formulations is that positive instance identification methods fail when the instances in bags have strong co-occurrence, while bag structure similarity methods fail when one predominant instance plays the key role in classification. Noting this, we advocate combining the most positive and most negative instances with multiple part similarity to improve MIL performance, which can be seen as a variant of Equation (2). Letting x_il and x_im denote the most positive and most negative instances in B_i and 1 > α + β > 0, α > 0, β > 0, the posterior probability can be defined as

Pr(y_i = +1 | B_i) = α Pr(y_il = +1 | x_il) + β Pr(y_im = +1 | x_im) + ((1 − α − β)/(n_i − 2)) Σ_{j≠l,m} Pr(y_ij = +1 | x_ij).	(3)
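The three bag-posterior schemes above can be sketched in a few lines. The function names and the default weights below are illustrative assumptions, not the paper's implementation; each function takes the per-instance positive probabilities of one bag.

```python
def bag_posterior_max(inst_probs):
    # Positive instance identification view: the most positive
    # instance alone determines the bag label.
    return max(inst_probs)

def bag_posterior_mean(inst_probs):
    # Bag structure similarity view: all instances contribute equally.
    return sum(inst_probs) / len(inst_probs)

def bag_posterior_combined(inst_probs, alpha=0.4, beta=0.3):
    # Combined view: weight the most positive and most negative
    # instances separately and average the remaining instances.
    probs = sorted(inst_probs)
    most_neg, most_pos = probs[0], probs[-1]
    rest = probs[1:-1]
    avg_rest = sum(rest) / len(rest) if rest else 0.0
    return alpha * most_pos + beta * most_neg + (1 - alpha - beta) * avg_rest
```

For a bag with instance probabilities [0.1, 0.9, 0.5], the three schemes give 0.9, 0.5, and a value in between, illustrating how the combination interpolates between the two classical views.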

Algorithm Overview
In this section, we first give an analysis of MIL and multiple part similarity to inspire our MILDMS algorithm. Then, based on the relationship obtained by this analysis, we present a brief overview of our proposed algorithm.

The Analysis of MIL and Multiple Part Similarity
The multiple part similarity problem lies at the heart of image similarity computation, where an image is represented by multiple parts, all of which play an equal role in the image similarity. For example, in the image classification task, all the hand-crafted Scale Invariant Feature Transform (SIFT) [29] features extracted from an image, which can be seen as a multiple part representation of the image, jointly decide the image label. Multiple part similarity methods, such as the SIFT kernel [30] and the Earth Mover's Distance (EMD) [31], have been intensively studied and applied in many image understanding tasks.
The common point of MIL and multiple part similarity is that both involve elements of different sizes, so both can be treated as set similarity problems. From this set similarity view, we analyze MIL and multiple part similarity to motivate our method. Given two sets A = {a_1, . . . , a_|A|} and B = {b_1, . . . , b_|B|}, a trivial idea is to compute the similarity element by element, so the similarity between the two sets can be defined as f(A, B) = Σ_i Σ_j w_ij sim(a_i, b_j), where sim(·, ·) is a similarity function and w_ij is the corresponding weight. In the multiple part similarity problem, the multiple parts contribute equally to the image similarity; therefore, we give the same coefficient to all parts, and the similarity becomes f(A, B) = w Σ_i Σ_j sim(a_i, b_j). In contrast to multiple part similarity, bag similarity is dominated by the most positive and most negative instances within the two bags. Therefore, letting 1 > α + β > 0, α > 0, β > 0, and denoting the most positive and negative instances in bags A and B as a_l, a_m and b_t, b_s, the bag similarity can be denoted as f(A, B) = α sim(a_l, b_t) + β sim(a_m, b_s) + (1 − α − β) Σ_{i≠l,m} Σ_{j≠t,s} sim(a_i, b_j), which means that the instances other than the most positive and most negative ones play an equal role in the bag similarity. If we replace (1 − α − β) with w, the third part of the bag similarity becomes exactly a multiple part similarity problem. That is to say, once the most positive and negative instances in each bag are selected, multiple instance learning is naturally converted to multiple part similarity.
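A minimal sketch of this decomposed bag similarity, assuming an RBF instance similarity and given the already-selected indices of the most positive and most negative instances in each bag; all names here are hypothetical illustrations, not the paper's code.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian (RBF) similarity between two instance vectors.
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def bag_similarity(A, B, pos_idx, neg_idx, alpha=0.4, beta=0.3):
    # Decomposed bag similarity: weighted similarity of the most
    # positive pair, the most negative pair, and the averaged
    # similarity of all remaining instance pairs.
    (l, t), (m, s) = pos_idx, neg_idx
    rest_A = [a for i, a in enumerate(A) if i not in (l, m)]
    rest_B = [b for j, b in enumerate(B) if j not in (t, s)]
    if rest_A and rest_B:
        part = np.mean([rbf(a, b) for a in rest_A for b in rest_B])
    else:
        part = 0.0
    return (alpha * rbf(A[l], B[t]) + beta * rbf(A[m], B[s])
            + (1 - alpha - beta) * part)
```

For two identical bags with matching most positive and most negative indices, every term equals 1 and the similarity is exactly 1, as expected of a normalized kernel.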

Overview of Our Proposed MILDMS Algorithm
To present an overview of MILDMS, we first introduce some notation to formulate multiple instance learning. Here, we only consider the two-class problem; the multi-class problem can be handled with a few minor modifications. Let L = {B_1, B_2, . . . , B_s} represent the training bags, where each bag B_i = {x_i1, . . . , x_in_i} contains n_i instances; among the s bags, p are positive and q are negative. The goal of multiple instance learning is to predict the labels of unlabeled bags B_u, i.e., to find a classifier f(B_u) that classifies bags correctly. By lining up all instances within the positive bags one by one, we obtain L+ = {x_11, . . . , x_1n_1, x_21, . . . , x_2n_2, . . . , x_p1, . . . , x_pn_p}, which can be re-indexed as L+ = {x_1, . . . , x_T+}, T+ = Σ_{i=1}^p n_i. In a similar way, we obtain the negative instance set L−. The basic steps of MILDMS are illustrated in Figure 1, where each block represents an object to operate on and each arrow indicates an operation defined on the objects. In the training phase, we first learn the positive and negative concepts (target concept locating) through a two-step inverse DD optimization model. We then perform the de-noising process and obtain one most positive and one most negative instance per bag by selecting the instances within the bag nearest to the positive and negative target concepts; hence, the instances in each bag are split into three parts: the most positive instance, the most negative instance, and the other instances. Next, we construct three kernels, for the most positive instances, the most negative instances, and the other instances, which are combined via multiple kernel learning into a new MIL kernel used to train an SVM model. In the testing phase, as in training, we construct the MIL kernel from the selected positive, negative, and other instances, and finally apply the trained SVM to predict the unknown bag's label.

MILDMS Algorithm
We introduce the MILDMS algorithm in this section, which consists of four steps: target concept locating, instance de-noising, multiple part kernel construction, and MIL kernel construction. Before elaborating on the algorithm, we first give a brief review of the DD algorithm, which serves as a constraint in our target concept locating process. The DD algorithm optimizes a DD function modeled on all training instances to directly locate the point of maximum diverse density, while our MILDMS algorithm first finds the potential target concepts through optimization and then verifies them using the DD constraint.

Review of Diverse Density (DD) Algorithm
As the most successful positive instance identification MIL algorithm, the DD method proposed by Maron attempts to locate the point of maximum diverse density and then uses it to infer instance labels [6]. Therefore, the bag label is determined by the most positive instance within the bag. Assuming that there exists one single target concept, denoted t, the diverse density function is defined as the probability conditioned on the p positive and q negative bags:

DD(t) = Pr(t | B_1^+, . . . , B_p^+, B_1^−, . . . , B_q^−).	(4)
The point that maximizes the above DD(t) function is chosen as the target concept. By assuming conditional independence over all bags and a uniform prior over the target concept, and applying Bayes' rule, we can rewrite Equation (4) as

DD(t) = Π_{i=1}^p Pr(t | B_i^+) Π_{i=1}^q Pr(t | B_i^−).	(5)

To compute the Pr(t | B_i) terms, Maron adopts the noisy-or model, defining for a positive bag Pr(t | B_i^+) = 1 − Π_j (1 − Pr(t | x_ij)) and for a negative bag Pr(t | B_i^−) = Π_j (1 − Pr(t | x_ij)).	(6) Here, the probability of an instance belonging to a potential target is defined as Pr(t | x_ij) = exp(−||x_ij − t||^2). The optimization problem argmax_t DD(t) can be solved by a gradient ascent approach with multiple starting points to locate the target concept t.
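The DD function under the noisy-or model can be sketched as a direct transcription of the formulas above (not an optimized implementation; in practice the log of this product is maximized for numerical stability):

```python
import numpy as np

def pr_t_given_instance(t, x):
    # Gaussian-like probability that instance x belongs to concept t.
    return np.exp(-np.sum((np.asarray(x) - np.asarray(t)) ** 2))

def dd(t, pos_bags, neg_bags):
    # Diverse density under the noisy-or model: t should be close to
    # at least one instance in every positive bag, and far from every
    # instance in every negative bag.
    val = 1.0
    for bag in pos_bags:
        val *= 1.0 - np.prod([1.0 - pr_t_given_instance(t, x) for x in bag])
    for bag in neg_bags:
        val *= np.prod([1.0 - pr_t_given_instance(t, x) for x in bag])
    return val
```

With two positive bags containing instances near the origin and one negative bag far away, DD evaluated at the origin is close to 1, while DD at the negative instance collapses toward 0.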

Target Concept Locating
Due to the success of the DD algorithm, we adopt the DD constraint in our target concept locating model, paying particular attention to overcoming the DD algorithm's shortcoming. To obtain meaningful results, we begin by analyzing the limitation of the DD function. Equation (5) contains two parts, Pr(t | B_i^+) and Pr(t | B_i^−). These two parts work together to make the obtained target concept t close to positive instances and far from negative instances. Unfortunately, mislabeled training bags will greatly influence the DD function value and lead to a wrong target concept, because the target concept is modeled on the whole positive and negative bag space. To overcome this problem, we advocate modeling the target concept locating problem on the positive and negative bag spaces separately, with the DD constraint. To reduce the influence of noisy data on the DD function, we choose the P nearest instances to these target concepts to calculate the DD function, where P equals the number of positive or negative bags.
To facilitate the exposition, we first introduce some notation. Suppose in MIL there are k+ potential positive target concepts {p_i}_{i=1}^{k+} and k− potential negative target concepts {q_i}_{i=1}^{k−}. We use auxiliary variables δ_ij to indicate whether instance x_i belongs to the potential positive target concept p_j; each δ_ij takes values in {0, 1}, and δ_ij = 1 implies that x_i belongs to p_j. In a similar way, we define ξ_ij to represent whether instance x_i belongs to the potential negative target concept q_j. N_P(x_i) denotes the P nearest instances of point x_i, and N denotes the set of positive integers. T+ and T− are the numbers of instances in all positive and all negative bags, respectively. With these ingredients, we define the target concept locating optimization problem as minimizing

Σ_{i=1}^{T+} Σ_{j=1}^{k+} δ_ij ||x_i − p_j||_2^2 + Σ_{i=1}^{T−} Σ_{j=1}^{k−} ξ_ij ||x_i − q_j||_2^2	(7)

over P, Q, δ, ξ, k+, and k−, subject to the constraints detailed below, where DD(·) is the diverse density function and ||·||_2 is the 2-norm of a vector.
To analyze the optimization problem (Equation (7)) more clearly, we split its constraints into four parts, named C1, C2, C3, and C4, and reformulate it as Equation (8). Obviously, C1 requires each instance of L+ or L− to belong to exactly one potential target concept; C2 means that any positive integers can be chosen as the target concept numbers k+ and k−; C3 chooses as the most positive and most negative target concepts those elements of P and Q that contain more instances than the others; C4 encodes the DD constraint into our model, ensuring that the most positive target concept attains the maximum DD value over its P nearest neighbor instances, while the most negative target concept attains the minimum DD value.
Since k+ and k− in the above optimization problem (Equation (7)) are discrete variables and δ_ij, ξ_ij belong to {0, 1}, it is both a combinatorial and an integer optimization problem, which is NP-hard. To make it solvable, we split it into two easier pieces: a much simpler optimization problem and a constraint set. Specifically, we split Equation (8) into the two parts below (named OP1 and CS1), where the first is the objective function with constraints C1 and C2, and the second consists of constraints C3 and C4.

OP1: Unfortunately, OP1 (Equation (9)) is still an integer programming and combinatorial optimization problem that cannot be solved easily. However, the CS1 variables k+ and k− can simply be ignored (i.e., held fixed). Then OP1 decomposes into two separate sub-problems (Equation (10)), each of which is a clustering problem whose local minimum can be found by iterative, alternating optimization with respect to P and δ, and Q and ξ, respectively.
After obtaining the potential positive and negative target concepts, we choose the most positive and most negative target concepts p_t1 and q_t2 using constraint C3, i.e., the columns of the matrices δ and ξ that contain more non-zero entries than the others are selected. The next step is to determine whether p_t1 and q_t2 satisfy constraint C4. For each potential target concept in P and Q, we choose its L nearest neighbor instances, where L can be any positive integer. Thus, we obtain instance sets {x_i}_{i=1}^L and {y_i}_{i=1}^L around the potential positive and negative target concepts p_t1 and q_t2. Since our DD function is based on instances rather than bags, DD(t) can be rewritten as

DD(t) = Π_{i=1}^L Pr(t | x_i) Π_{i=1}^L Pr(t | y_i),	(11)

where Pr(t | x_i) = exp(−||x_i − t||^2) and Pr(t | y_i) = 1 − exp(−||y_i − t||^2). If constraint C4 is satisfied, the most positive and negative target concepts of MIL are obtained.
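Since the sub-problems reduce to clustering, a minimal alternating-optimization sketch (essentially k-means) illustrates how the potential concepts and the indicator assignment would be obtained, and how constraint C3 picks the best-supported concept. Initializing from the first k instances is a simplification we make here for determinism; multiple random restarts would be used in practice.

```python
import numpy as np

def cluster_concepts(X, k, iters=50):
    # Alternating optimization: assign each instance to its nearest
    # potential concept (the indicator update), then move each concept
    # to the mean of its assigned instances (the concept update).
    X = np.asarray(X, dtype=float)
    P = X[:k].copy()  # simplified deterministic initialization
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - P[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                P[j] = X[assign == j].mean(axis=0)
    return P, assign

def most_supported(P, assign):
    # Constraint C3 sketch: the concept owning the most instances is
    # the candidate most positive (or most negative) target concept.
    counts = np.bincount(assign, minlength=len(P))
    return P[int(np.argmax(counts))], counts
```

On two well-separated groups of points, the concept owning the larger group is returned as the candidate target concept, which would then be checked against the DD constraint C4.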

De-Noising Mislabeled Instances
Since noisy data greatly influence MIL performance, we perform a special process to remove noisy instances, based on a simple idea: positive instances should lie in the most positive target concept region and far away from the most negative target concept region. In short, positive instances should be closer to the positive concept and negative instances closer to the negative concept. For a two-class classification problem, we treat one class as positive and the other as negative. Therefore, we define a function noise(·) in Equation (12) to detect a mislabeled noisy instance x, which is then deleted from the training sets L+ and L−:

noise(x) = 1 if (label(B) = +1 and ||x − q_t2||_2 < ||x − p_t1||_2) or (label(B) = −1 and ||x − p_t1||_2 < ||x − q_t2||_2), and 0 otherwise,	(12)

where ||·||_2 is the 2-norm of a vector, B is the bag containing x, and label(·) denotes the label of a bag. Note that Equation (12) not only removes mislabeled instances but also deletes negative instances in positive bags, which helps alleviate the ambiguity inherent in MIL.
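The de-noising rule reduces to a distance comparison against the two target concepts; `is_noise` below is a hypothetical helper following Equation (12), not the paper's code.

```python
import numpy as np

def is_noise(x, bag_label, pos_concept, neg_concept):
    # An instance from a positive bag is flagged when it lies closer
    # to the negative target concept than to the positive one, and
    # vice versa for instances from negative bags.
    d_pos = np.linalg.norm(np.asarray(x) - np.asarray(pos_concept))
    d_neg = np.linalg.norm(np.asarray(x) - np.asarray(neg_concept))
    return bool(d_neg < d_pos) if bag_label == +1 else bool(d_pos < d_neg)
```

Instances for which `is_noise` returns True would be dropped from L+ or L− before kernel construction.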

Multiple Part Similarity Kernel Construction
Once the most positive and negative target concepts p_t1 and q_t2 are selected, we can easily choose the most positive and most negative instance of each bag: the instances nearest to these two target concepts. The indices of the most positive and negative instances are given by

h+ = argmin_j ||x_ij − p_t1||_2, h− = argmin_j ||x_ij − q_t2||_2,	(13)

where x_ij is an instance in bag B_i and ||·||_2 is the 2-norm of a vector. Therefore, we obtain one most positive and one most negative instance per bag, denoted x_ih+ and x_ih− for bag B_i. As discussed in Section 3, the label of bag B_i largely depends on the similarities between the instances x_ih+, x_ih− of different bags, computed by a traditional kernel such as the radial basis function (RBF). Measuring the similarity between the remaining instances, denoted C_i = {x_ij | j ≠ h+, j ≠ h−}, is still a special set similarity problem, namely multiple part similarity.
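Selecting the most positive and negative instances and forming the remainder set C_i is a pair of nearest-neighbor lookups; the sketch below uses hypothetical names for illustration.

```python
import numpy as np

def split_bag(bag, pos_concept, neg_concept):
    # Pick the instances nearest to the positive and negative target
    # concepts; the remaining instances form the set C_i used by the
    # multiple part similarity kernel.
    bag = np.asarray(bag, dtype=float)
    d_pos = np.linalg.norm(bag - np.asarray(pos_concept), axis=1)
    d_neg = np.linalg.norm(bag - np.asarray(neg_concept), axis=1)
    h_pos, h_neg = int(np.argmin(d_pos)), int(np.argmin(d_neg))
    others = [x for j, x in enumerate(bag) if j not in (h_pos, h_neg)]
    return bag[h_pos], bag[h_neg], others
```

Note that when a bag's most positive and most negative instances coincide, the same instance is returned for both roles; handling that degenerate case is left out of this sketch.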
As reported in [31], the Earth Mover's Distance (EMD) can deal with multiple part similarity problems such as SIFT matching and comparing histogram-based local descriptors. Thus, one natural way to compute the similarity between sets C_i and C_j is to use a kernel function based on the EMD distance.
To embed our multiple part similarity problem into the EMD framework, we first line up all the instances in set C_i and represent the suppliers of the EMD as P = {(p_1, w_p1), (p_2, w_p2), . . . , (p_|C_i|, w_p|C_i|)}, where p_i ∈ C_i, |C_i| is the cardinality of C_i, and w_pi is the weight of supplier p_i. Normally, the weight w_pi represents the total supply of a supplier or the total capacity of a consumer in the EMD. Since all instances p_i in C_i play an equal role in set similarity, we set w_pi = 1/|C_i|. Similarly, we denote the set C_j as Q = {(q_1, w_q1), (q_2, w_q2), . . . , (q_|C_j|, w_q|C_j|)}, where q_j ∈ C_j and w_qj = 1/|C_j|. Letting d_ij = ||p_i − q_j||_2 represent the distance between supplier p_i and consumer q_j, the EMD optimization problem between C_i and C_j can be defined as

min_{f_ij} Σ_i Σ_j f_ij d_ij, s.t. f_ij ≥ 0, Σ_j f_ij ≤ w_pi, Σ_i f_ij ≤ w_qj, Σ_i Σ_j f_ij = min(Σ_i w_pi, Σ_j w_qj).	(14)

Finally, the EMD distance between C_i and C_j is computed as

D(P, Q) = (Σ_i Σ_j f*_ij d_ij) / (Σ_i Σ_j f*_ij),	(15)

where f*_ij is the optimal flow determined by solving Equation (14) with the standard simplex method. Note that f*_ij can also be interpreted as the optimal match between the instances of bags B_i and B_j. Since the EMD distance between two sets is non-negative and symmetric, we incorporate it into a Gaussian function to construct the multiple part similarity kernel

K(P, Q) = exp(−D(P, Q)).	(16)
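A sketch of the EMD-based kernel for uniform weights, solving the flow problem with a generic LP solver (SciPy's `linprog`) instead of the simplex implementation used in the paper. Because the uniform weights on both sides sum to 1, the inequality constraints of Equation (14) tighten to equalities and the normalization in Equation (15) is trivial.

```python
import numpy as np
from scipy.optimize import linprog

def emd_kernel(Ci, Cj):
    # EMD between two instance sets with uniform weights 1/|C_i| and
    # 1/|C_j|, wrapped in a Gaussian to give the part-similarity kernel.
    Ci, Cj = np.asarray(Ci, float), np.asarray(Cj, float)
    n, m = len(Ci), len(Cj)
    # Ground distances d_ij, flattened row-major so f[i*m + j] pairs with d_ij.
    d = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2).ravel()
    A_eq, b_eq = [], []
    for i in range(n):  # each supplier ships exactly 1/n
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(1.0 / n)
    for j in range(m):  # each consumer receives exactly 1/m
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(1.0 / m)
    res = linprog(d, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return np.exp(-res.fun)  # total flow is 1, so D(P, Q) = res.fun
```

Identical sets yield an EMD of 0 and hence a kernel value of exactly 1.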

MIL Kernel Construction
To construct the MIL kernel, we need to model these similarities, including the most positive instance, most negative instance, and multiple part similarities, in a unified framework. Noting that multiple kernel learning [32] has shown considerable success in combining similarities from multiple sources of information, we incorporate these similarities into the multiple kernel framework.
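The convex combination itself is a one-liner once the base Gram matrices over all bag pairs are computed; the resulting matrix can then be fed to a standard SVM with a precomputed kernel. A sketch with fixed weights, whereas in multiple kernel learning the weights themselves are learned:

```python
import numpy as np

def combine_kernels(grams, weights):
    # A convex combination of Mercer kernels (non-negative weights
    # summing to 1) is itself a valid Mercer kernel.
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-9
    return sum(w * K for w, K in zip(weights, grams))
```

For example, with scikit-learn one could pass the combined Gram matrix to `SVC(kernel="precomputed")`; the weights here would come from the multiple kernel optimization described below.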
The key idea of multiple kernel learning is to construct a new kernel by linearly combining some pre-defined kernels. The multiple kernel is a convex combination of pre-defined kernels:

k(x, y) = Σ_{m=1}^M d_m k_m(x, y), d_m ≥ 0, Σ_{m=1}^M d_m = 1,	(17)

where M is the total number of kernels and k_m(x, y) is any kernel function satisfying the Mercer condition. The objective of multiple kernel learning for MIL is to learn both the coefficients α_i and the weights d_m in the quadratic optimization problem

max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j Σ_{m=1}^M d_m k_m(B_i, B_j), s.t. 0 ≤ α_i ≤ C, Σ_i α_i y_i = 0,	(18)

where B_i and B_j are two bags in MIL, k_m(B_i, B_j) and d_m denote the m-th base kernel and its weight, respectively, and C is the penalty parameter controlling the trade-off between accuracy and regularization. The MIL classifier is defined as

f(B) = sign(Σ_i α*_i y_i Σ_{m=1}^M d*_m k_m(B_i, B) + b*),	(19)

where α*_i and d*_m are the optimal decision variables obtained from Equation (18), and b* can be computed as

b* = y_i − Σ_j α*_j y_j Σ_{m=1}^M d*_m k_m(B_j, B_i) for any support vector B_i with 0 < α*_i < C.	(20)

By integrating the multiple part similarity kernel defined in Equation (16) with the most positive instance kernel and the most negative instance kernel, our MIL kernel is defined as

k(B_i, B_j) = α k+(x_ih+, x_jh+) + β k−(x_ih−, x_jh−) + (1 − α − β) k_o(C_i, C_j),	(21)

where k+ and k− are the kernel functions used to compute the most positive and most negative instance similarities, and k_o(C_i, C_j) is the pre-defined multiple part similarity kernel. Based on the aforementioned fact that MIL contains two cases, one target concept and multiple target concepts, we can see that in the one target concept case the most positive and negative instances in the bag determine the bag label without the multiple part similarity kernel, so the MIL kernel reduces to

k(B_i, B_j) = α k+(x_ih+, x_jh+) + (1 − α) k−(x_ih−, x_jh−).	(22)

To distinguish between one target concept and multiple target concepts, we count the total numbers of instances contained by the concepts in P and Q, and compute the ratio between the two largest counts, which is compared to a threshold θ given in advance. Indeed, these counts can be obtained simply by counting the non-zero entries in each column of the matrices δ and ξ from Equation (10).
Since the entries δ_ij and ξ_ij are either 0 or 1, the numbers of instances contained in p_j and q_j can be computed as a_j = Σ_{i=1}^{T+} δ_ij (j = 1, . . . , k+) and b_j = Σ_{i=1}^{T−} ξ_ij (j = 1, . . . , k−), i.e., the 1-norms of the columns of δ and ξ. We sort the sets S = {a_j} and T = {b_j} in descending order and take the two largest numbers of each, denoted s_b1, s_b2 and t_b1, t_b2. Then, the rule used to distinguish one target concept from multiple target concepts is defined as Class(MIL) = 1 if s_b2/s_b1 < θ and t_b2/t_b1 < θ, and 0 otherwise. It is not difficult to see that the MIL problem is treated as one target concept when Class(MIL) equals 1, and as multiple target concepts otherwise. In this paper, θ is chosen to be 0.1.
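The one-vs-multiple target concept test can be sketched as a ratio check on concept support sizes. The direction of the comparison (second-largest over largest below θ indicating a single dominant concept) is our reading of the text, so treat this as an assumption:

```python
import numpy as np

def is_single_concept(assign, theta=0.1):
    # assign[i] is the concept index of instance i (the column of its
    # non-zero indicator entry). If the second-largest concept holds a
    # negligible share relative to the largest, one concept dominates.
    counts = np.sort(np.bincount(np.asarray(assign)))[::-1]
    if len(counts) < 2 or counts[0] == 0:
        return True
    return bool(counts[1] / counts[0] < theta)
```

With θ = 0.1, a 20-vs-1 split is treated as one target concept, while a 10-vs-5 split is treated as multiple target concepts.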

Feature Extraction
Feature extraction plays a core role in image understanding, for both the one target concept and multiple target concepts cases in MIL; thus, it is very important to choose suitable features to represent the image. There are currently two ways to extract image features: hand-crafted and deep learning methods. While the hand-crafted methods benefit from professional human knowledge, which is very important in certain domains (see, e.g., [33]), the deep learning methods benefit from large data sets (big data) and deep network structures. To compare fairly with existing MIL methods that adopt hand-crafted features, we follow a similar procedure to acquire a nine-dimensional feature vector, including the color, texture, and shape of five regions in the image, whose details can be found in [11].
In view of the powerful representation ability of convolutional neural networks (CNN), we also used a CNN to extract deep features. To benefit from reusing an existing deep learning architecture and fine-tuning a pre-trained model, we adopt the network architecture and training strategy below.
(1) Architecture: As an efficient CNN framework, VGG-16 contains 16 weight layers in total, of which 13 are convolutional layers and the other three are fully connected layers. The second fully connected layer has 4096 units, whose output is used as the feature vector. To make VGG-16 fit our MIL task, we replaced the last 1000-way fully connected layer with a two-way fully connected layer and used the softmax function as the final prediction function.
(2) Network Training: The network was first pretrained on the ImageNet dataset and then fine-tuned. We used stochastic gradient descent to fine-tune the network with a batch size of 128, a learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0005. We stopped training after 36 epochs, when the accuracy stopped increasing.
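A single parameter update with the stated hyperparameters can be written out explicitly. This is the classic formulation in which L2 weight decay is folded into the gradient; deep learning frameworks may differ slightly in where the decay term enters.

```python
def sgd_step(w, grad, velocity, lr=1e-4, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and L2 weight decay (scalar sketch)."""
    g = grad + weight_decay * w            # add weight decay to the gradient
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity

# one update step on a toy scalar parameter
w, v = 1.0, 0.0
w, v = sgd_step(w, grad=0.5, velocity=v)   # w moves slightly toward lower loss
```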

Algorithm View
We summarize our MILDMS below. To facilitate the exposition, we denote the training bag set B as the union of p positive and q negative bags, the instances from all positive bags as L+ = {x_k | k = 1, . . . , T+}, and the instances from all negative bags as L− = {x_k | k = 1, . . . , T−}. Algorithm 1 learns an MIL classifier defined by α*, d*, and b*; its input is the maximum target concept number M.

Algorithm 1. Learning MIL Classifier
Repeat
  For k+ = 1 : M
    For k− = 1 : M
      Step 1: Solve the optimization (Equation (10)) to obtain potential target concepts P, Q and indicator matrices δ, ξ
      Step 2: Use C3 in Equation (13) to select the most positive and negative target concepts
      Step 3: Denoise the mislabeled instances by applying Equation (12)
      Step 4: Compute the DD function using Equation (12) to check constraint C4 in Equation (8)
    End
  End
Until constraint C4 is met
For each training bag B_i in set B
  Step 1: Choose the most positive and negative instances x_ih+, x_ih− using Equation (13)
  Step 2: Optimize Equation (14) to obtain the optimal flow and construct the multiple part kernel using Equation (15)
End
Optimize Equation (18) to obtain the decision variables α*, d*, and b*

The procedure for classifying unknown bags is given in Algorithm 2 below. Its input is an unknown bag set U = {B_t}, t = 1, . . . , T, each bag of which is classified by the learned bag classifier (α*, d*, b*).

Algorithm 2. Classifying MIL Bag
For each unknown bag B_i in set U
  Step 1: Choose the most positive and negative instances x_ih+, x_ih− using Equation (13)
  Step 2: Optimize Equation (14) to obtain the optimal flow and construct the multiple part kernel using Equation (15)
  Step 3: Use Equation (19) to predict bag B_i's label
End
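Algorithm 2 can be sketched schematically in Python. Since Equations (13)-(15) and (19) are not reproduced here, `instance_score` and `part_similarity` are hypothetical stand-ins: the score plays the role of the Equation (13) selection criterion, and the kernel is a simple average of Gaussian part similarities to the selected instances of each training bag.

```python
from math import exp

def instance_score(x):
    # hypothetical stand-in for the Equation (13) criterion
    return sum(x)

def part_similarity(x, y, gamma=1.0):
    # Gaussian similarity between two instances, as used in the paper
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def classify_bag(bag, train_selected, alpha, b):
    """Predict a bag label in {-1, +1}.

    bag: list of instance feature vectors.
    train_selected: per training bag, its selected (most positive,
        most negative) instances.  alpha, b: learned decision variables.
    """
    pos = max(bag, key=instance_score)   # most positive instance
    neg = min(bag, key=instance_score)   # most negative instance
    score = b
    for a_i, (tp, tn) in zip(alpha, train_selected):
        k = 0.5 * (part_similarity(pos, tp) + part_similarity(neg, tn))
        score += a_i * k
    return 1 if score >= 0 else -1
```

This sketch only mirrors the control flow of Algorithm 2; the real method replaces the stand-ins with the optimization of Equation (14) and the decision rule of Equation (19).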

Experiments
To evaluate the performance of our MILDMS, in this section we conduct experiments on two kinds of MIL datasets: standard MIL datasets and image classification datasets. We use the standard MIL datasets to show the effectiveness of MILDMS on the one-target-concept MIL problem and compare it with APR [5], SMILE [20], MILES [12], Attention and Gated-Attention [25], DD [6], DD-SVM [11], and MI-Net with DS [24]. For the object detection and image classification tasks, we extracted image features as described in Section 4.6 above. The Elephant, Fox, Tiger and PASCAL VOC 2007 and 2012 datasets were used to show the performance on one-target-concept MIL, and the COREL dataset shows the generality of our method in multiple-concept detection.
For all our experiments, we fixed the maximum number of target concepts M to 10. A Gaussian function was chosen as the positive and negative instance similarity kernel, and the regularization parameter C of Equation (18) was chosen by 5-fold cross-validation on the training data using grid search. The threshold θ in Equation (24), which classifies the problem as one target concept or multiple target concepts, was fixed to 0.1; this value was determined empirically by letting θ range from 0.1 to 0.9.
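The 5-fold cross-validation grid search over C can be sketched as follows. Here `train_and_score` is a hypothetical stand-in for training the kernel classifier of Equation (18) on one fold split and returning validation accuracy; the dummy score simply peaks at C = 1 so the sketch is self-contained.

```python
import random

def train_and_score(train_idx, val_idx, C):
    # placeholder: in the real pipeline this trains the MIL classifier
    # with regularization C and evaluates it on the held-out fold
    return 1.0 / (1.0 + abs(C - 1.0))      # dummy accuracy, maximal at C = 1

def select_C(n_bags, grid=(0.01, 0.1, 1, 10, 100), k=5, seed=0):
    """Pick C by k-fold cross-validated grid search over the training bags."""
    idx = list(range(n_bags))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    best_C, best_acc = None, -1.0
    for C in grid:
        accs = []
        for i in range(k):
            val = folds[i]
            train = [j for f in folds if f is not folds[i] for j in f]
            accs.append(train_and_score(train, val, C))
        acc = sum(accs) / k                # mean validation accuracy
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C
```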
We applied the SimpleMKL [32] software toolkit to solve the multiple kernel learning. The clustering problem existing in Equation (10) was solved by Cartesian k-means [34].

Standard MIL Dataset
We first chose the most commonly used MIL benchmark, the MUSK dataset, to test MILDMS. A drug-activity task, MUSK consists of MUSK1 and MUSK2: the former contains 47 positive and 45 negative bags with 2 to 40 instances per bag, while the latter contains 39 positive and 63 negative bags with 1 to 1044 instances per bag. Table 1 summarizes the datasets by the number of positive and negative bags, instances, and features. We also used the well-known three-class image problem, comprising the classes Elephant, Fox, and Tiger, in which each image is segmented into regions with features given by color, texture, and shape [11], to verify the efficiency of MILDMS. We report our experimental results in Table 2, which indicate that MILDMS is highly competitive with other state-of-the-art MIL methods: it achieves the best result on MUSK2, Elephant, Tiger, and Fox, is competitive with SMILE [20], and achieves higher classification precision than the deep learning methods MI-Net with DS [24], Gated-Attention, and Attention [25].

Object Detection
To further test the performance of MILDMS on one-target-concept MIL, we evaluated our method on the challenging PASCAL VOC 2007 and 2012 datasets [35], which contain 9962 and 22,531 images, respectively, divided into 20 distinct categories: person, bird, cat, cow, dog, horse, sheep, airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and tv/monitor. For the object detection task, we compared our method with the SMILE [20], MILES [12], Gated-Attention [25], and MI-Net with DS [24] algorithms; their average accuracy over the 20 classes is shown in Figure 2. As can be seen from Figure 2, MILDMS achieves the best results and clearly exceeds the other MIL algorithms, both traditional and deep learning based, on this task. The reason is that the positive and negative instances of the images are generated from hand-crafted and deep features under the DD constraint, which fully exploits both kinds of features.

Image Classification Application
In addition, we applied our MILDMS to the COREL image classification dataset, which contains 20 categories: Africa people and villages, Beaches, Historical buildings, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains and glaciers, Food, Dogs, Lizards, Fashion models, Sunset scenes, Cars, Waterfalls, Antique furniture, Battle ships, Skiing, and Desserts. The 2000 images in the dataset are divided into these 20 distinct categories, each containing 100 images. We selected one sample image from each of the 20 categories and display them in Figure 3, ordered row-wise from the upper-leftmost image (Africa people and villages) to the lower-rightmost image (Desserts).

We report the results of MILDMS on the COREL dataset in Table 3. The results show that the proposed method also yields competitive performance on the COREL 2000 image set. Compared with other MIL methods, such as SMILE [20], MILES [12], DD-SVM [11], Attention and Gated-Attention [25], and MI-Net with DS [24], our algorithm achieves the best performance. The reason is that our method can capture the multiple concepts of an image and combine them to achieve good performance.

Sensitivity to Labeling Noise
Labeling noise is an unavoidable problem in real image applications because labeling data is long and difficult work in which mistakes are easily made. We therefore designed a labeling-noise experiment to verify our algorithm's robustness. For fairness, we adopted the same noise sensitivity setting as evaluated by MILES [12] on the COREL dataset. We compared MILDMS with MILES [12], SMILE [20], DD-SVM [11], Attention and Gated-Attention [25], and MI-Net with DS [24] on Category 3 (Historical buildings) and Category 8 (Horses) of the COREL 2000 image set, to which we added d% noise by flipping the labels of d% of the positive bags and d% of the negative bags. Under each noise level, we randomly split the image set into two equal-size subsets for training and testing 20 times and report the average classification accuracy in Figure 4.
As shown in Figure 4, MILDMS is more robust than DD-SVM and SMILE at high noise levels, because our method models the positive and negative instance spaces separately. It is worth noting that MILDMS achieves performance equal to MILES and outperforms the deep MIL methods, including Attention and Gated-Attention [25] and MI-Net with DS [24].
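The noise-injection protocol described above can be sketched directly: flip the labels of d% of the positive bags and d% of the negative bags, with labels encoded as +1 / −1.

```python
import random

def add_label_noise(labels, d, seed=0):
    """Flip the labels of d% of positive and d% of negative bags."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    noisy = list(labels)
    for group in (pos, neg):
        n_flip = round(len(group) * d / 100)   # d% of this group
        for i in rng.sample(group, n_flip):
            noisy[i] = -noisy[i]
    return noisy
```

For example, with 10 positive and 10 negative bags and d = 20, exactly two labels are flipped in each group.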

Conclusions
Multiple instance learning constitutes a framework for classification problems with ambiguity in instance labelling. In this paper, we introduced a new approach, MILDMS, in which the most positive and negative instances, generated from hand-crafted and deep features, are selected under a DD constraint. In addition, we integrated multiple part similarity into our framework, which yields strong performance on MIL datasets. The experiments demonstrate that MILDMS is comparable to state-of-the-art MIL algorithms. There are several interesting directions to investigate in the future. First, we could seek a more efficient and effective way to solve our optimization problem. Second, we would like to explore the explainability of our method using explainable artificial intelligence (AI) techniques.