Multiple Instance Learning with Trainable Soft Decision Tree Ensembles
Abstract
1. Introduction
- A new MIL model based on a random forest (RF) converted to neural networks is proposed, which outperforms many existing MIL models when dealing with small tabular data.
- A new type of soft decision trees, similar to the soft oblique trees, is proposed. In contrast to the soft oblique trees, the proposed trees have a smaller number of trainable parameters. Nevertheless, the soft decision trees can be trained in the same way as the soft oblique trees. Outputs of each soft decision tree are viewed as a set of vectors (embeddings) which are formed from the class probability distributions in a specific way.
- An original algorithm for converting the decision trees into neural networks of a specific form for efficiently training parameters of the trees is proposed.
- An attention mechanism is proposed to aggregate the instance embeddings into bag embeddings with the aim of minimizing the corresponding loss function (a minimal sketch of such attention pooling is given after this list).
- The whole MIL model, including the soft decision trees, neural networks, the attention mechanism and a classifier, is trained in an end-to-end manner.
- Numerical experiments with the well-known datasets Musk1, Musk2 [4], Fox, Tiger, and Elephant [15] illustrate STE-MIL. These datasets have numerical features and are represented as tabular data. The corresponding code implementing STE-MIL is publicly available at https://github.com/andruekonst/ste_mil (accessed on 17 July 2023).
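For illustration, the following minimal sketch shows attention-based aggregation of instance embeddings into a single bag embedding, in the spirit of the attention pooling of [23]. It is not the exact STE-MIL implementation; the module name `AttentionPooling`, the dimensions, and the two-layer scoring network are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregates instance embeddings (n_instances x embed_dim) into one bag embedding."""
    def __init__(self, embed_dim: int, hidden_dim: int = 32):
        super().__init__()
        # Two-layer scoring network with a tanh nonlinearity, one attention score per instance.
        self.score = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, instance_embeddings: torch.Tensor) -> torch.Tensor:
        # instance_embeddings: (n_instances, embed_dim) for a single bag.
        scores = self.score(instance_embeddings)                     # (n_instances, 1)
        weights = torch.softmax(scores, dim=0)                       # attention weights sum to 1
        bag_embedding = (weights * instance_embeddings).sum(dim=0)   # (embed_dim,)
        return bag_embedding

# Usage: a bag with 5 instances, each embedded into a vector of length 8.
pool = AttentionPooling(embed_dim=8)
bag = torch.randn(5, 8)
print(pool(bag).shape)  # torch.Size([8])
```

The bag embedding produced this way can then be passed to a classifier, and the attention weights indicate which instances contribute most to the bag-level decision.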
2. Related Work
3. Preliminary
3.1. Multiple Instance Learning
3.2. Oblique Binary Soft Trees
4. A Softmax Representation of the Decision Tree Function
- the tree has M non-leaf nodes parametrized by pairs $(\mathbf{f}_j, \tau_j)$, $j = 1, \dots, M$, where
  - $\mathbf{f}_j \in \{0, 1\}^m$ is a one-hot vector having 1 at the position corresponding to the node feature;
  - $\tau_j \in \mathbb{R}$ is a threshold;
- the tree also has L leaves with values $\mathbf{v}_1, \dots, \mathbf{v}_L$, where $\mathbf{v}_j$ is an output vector corresponding to the j-th leaf (a small worked example of this parametrization is given below).
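As a small worked example of this parametrization (a sketch in the notation above; the helper names `hard_node` and `soft_node` are ours, not from the paper), an axis-parallel node predicate and its soft sigmoid relaxation can be written as follows.

```python
import numpy as np

def hard_node(x: np.ndarray, f: np.ndarray, tau: float) -> int:
    """Hard axis-parallel predicate: 1 if the selected feature exceeds the threshold."""
    # f is a one-hot vector, so f @ x extracts the node's feature value.
    return int(f @ x > tau)

def soft_node(x: np.ndarray, f: np.ndarray, tau: float, w: float = 10.0) -> float:
    """Soft (sigmoid) relaxation of the same predicate; w is a temperature."""
    return 1.0 / (1.0 + np.exp(-w * (f @ x - tau)))

x = np.array([0.2, 1.5, -0.3])             # an instance with m = 3 features
f = np.array([0.0, 1.0, 0.0])              # one-hot vector selecting feature 1
print(hard_node(x, f, tau=1.0))            # 1
print(round(soft_node(x, f, tau=1.0), 3))  # close to 1 for a large temperature
```

Because only the threshold and the temperature are trainable while the one-hot vector stays fixed, such a node has far fewer trainable parameters than an oblique node with a full weight vector.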
5. Soft Tree Ensemble for MIL
5.1. Soft Tree Ensemble
- Let us assign possibly incorrect labels to instances of a bag, for example, by assigning each instance the same label as that of the corresponding bag. The instance labels may be incorrect because the true labels are unknown and their determination is our task. However, these labels are needed to build an initial RF. This is a kind of initialization procedure for the whole model, which is trained in an end-to-end manner (a minimal sketch of this initialization is given after this list).
- The next step is to convert the initial RF to a neural network having a specific architecture. To implement this step, non-leaf nodes of each tree in the RF are parametrized by trainable parameters (the thresholds and the temperature parameters) and by non-trainable parameters (the one-hot feature-selection vectors).
- Parameters of the tree nodes are updated by using the stochastic gradient descent algorithm to minimize the bag loss defined in (5). To implement the updating algorithm, we propose approximating the tree path indicators by using the specific softmax representation (11). This is a key step of the algorithm, which allows us to update trees by updating neural networks and to incorporate trees or the whole RF into the scheme of modules, including the attention mechanism and a classifier.
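The initialization step from the first item of this list can be sketched with scikit-learn as follows; the toy bags, variable names, and forest hyperparameters are illustrative assumptions rather than the authors' exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy bags: each bag is an array of instances (rows), with a single bag-level label.
bags = [np.random.randn(4, 5), np.random.randn(6, 5), np.random.randn(3, 5)]
bag_labels = np.array([1, 0, 1])

# Initialization: every instance temporarily inherits the label of its bag.
X = np.vstack(bags)
y = np.concatenate([np.full(len(b), lab) for b, lab in zip(bags, bag_labels)])

# The initial (axis-parallel) random forest built on these possibly incorrect labels;
# its tree structure is later converted to neural networks and refined end-to-end.
init_rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)
print(len(init_rf.estimators_), "trees in the initial forest")
```

The forest itself is only a starting point: its feature choices fix the one-hot vectors, while the thresholds and leaf values become trainable parameters of the converted networks.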
5.2. Trees to Neural Networks
- The first layer aims to approximate the node predicates. It is a fully connected layer with m inputs (the dimensionality of the instance $\mathbf{x}$) and M outputs, i.e., it holds that $\mathbf{s} = \sigma\left(w \cdot (\mathbf{F}\mathbf{x} - \boldsymbol{\tau})\right)$, where $\sigma$ is the sigmoid function and $\boldsymbol{\tau} = (\tau_1, \dots, \tau_M)$. As a result, the first layer has only the trainable parameters $\boldsymbol{\tau}$ and the temperature $w$. The matrix $\mathbf{F}$ consists of the one-hot vectors $\mathbf{f}_j$ having 1 at positions corresponding to the node features.
- The second layer aims to estimate the leaf indices. It is a fully connected layer with M inputs and L outputs having one trainable parameter, the softmax temperature $\alpha$: $\mathbf{q} = \mathrm{softmax}\left(\alpha \cdot (\mathbf{R}\mathbf{s} + \mathbf{b})\right)$. The matrix $\mathbf{R}$ consists of values from the set $\{-1, 0, 1\}$. If the path to the i-th leaf does not contain the j-th node, then $R_{ij} = 0$; otherwise, $R_{ij} = -1$ if the path goes to the left branch, and $R_{ij} = 1$ if the path goes to the right branch. The vector $\mathbf{b}$ is needed to balance the decision paths. The sum of the sigmoid functions along the path to the k-th leaf in (11) can then be represented as the k-th component of $\mathbf{R}\mathbf{s} + \mathbf{b}$.
- The third layer aims to calculate the output values (embeddings). It is trainable and fully connected. Each leaf generates the class probability vector of size C. We take the probability of class 1 and repeat it E times, such that the whole embedding has the length E. The final output of the network (or the third layer) is of the form $\mathbf{e} = \mathbf{V}^{\mathrm{T}}\mathbf{q}$, where $\mathbf{V}$ is the $L \times E$ matrix of leaf embeddings (a layer-by-layer sketch of this network is given after this list).
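Under the notation reconstructed above, a minimal forward pass through the three layers might look as follows. This is a hedged sketch (tensor names, initialization, and the exact formulas are our assumptions), not the released STE-MIL code.

```python
import torch
import torch.nn as nn

class SoftTreeNetwork(nn.Module):
    """One soft axis-parallel tree expressed as a three-layer network."""
    def __init__(self, F: torch.Tensor, R: torch.Tensor, b: torch.Tensor,
                 tau_init: torch.Tensor, embed_dim: int):
        super().__init__()
        self.register_buffer("F", F)                  # (M, m) one-hot feature selectors, fixed
        self.register_buffer("R", R)                  # (L, M) routing matrix in {-1, 0, 1}, fixed
        self.register_buffer("b", b)                  # (L,) path-balancing vector, fixed
        self.tau = nn.Parameter(tau_init.clone())     # (M,) trainable thresholds
        self.w = nn.Parameter(torch.ones(1))          # trainable sigmoid temperature
        self.alpha = nn.Parameter(torch.ones(1))      # trainable softmax temperature
        self.V = nn.Parameter(torch.randn(R.shape[0], embed_dim) * 0.1)  # (L, E) leaf embeddings

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.w * (x @ self.F.T - self.tau))             # layer 1: soft node predicates
        q = torch.softmax(self.alpha * (s @ self.R.T + self.b), dim=-1)   # layer 2: soft leaf indicator
        return q @ self.V                                                 # layer 3: embeddings, (batch, E)

# Tiny example: a stump with M = 1 node and L = 2 leaves over m = 2 features.
F = torch.tensor([[1.0, 0.0]])
R = torch.tensor([[-1.0], [1.0]])   # left leaf gets -1, right leaf gets +1
b = torch.tensor([1.0, 0.0])        # balances the left path
net = SoftTreeNetwork(F, R, b, tau_init=torch.tensor([0.5]), embed_dim=4)
print(net(torch.tensor([[0.9, 0.0]])).shape)  # torch.Size([1, 4])
```

Only the thresholds, the two temperatures, and the leaf-embedding matrix are trainable; the one-hot and routing matrices are kept as fixed buffers, which is what keeps the number of trainable parameters small.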
Algorithm 1: Recursive matrix construction
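Since the body of Algorithm 1 is not reproduced here, the following sketch shows one plausible recursive construction of the routing matrix R and the balancing vector b from a fitted scikit-learn tree; the sign convention (−1 for the left branch, +1 for the right branch, with b counting left branches) matches the reconstruction in Section 5.2 and is an assumption rather than the authors' exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_routing(tree) -> tuple[np.ndarray, np.ndarray, list, list]:
    """Recursively build R (leaves x internal nodes) and the path-balancing vector b."""
    left, right = tree.children_left, tree.children_right
    internal = [i for i in range(tree.node_count) if left[i] != -1]   # non-leaf node ids
    col = {node: j for j, node in enumerate(internal)}                # node id -> column in R
    rows, b, leaves = [], [], []

    def visit(node: int, row: np.ndarray, balance: float) -> None:
        if left[node] == -1:                  # leaf reached: store its row and balance term
            rows.append(row)
            b.append(balance)
            leaves.append(node)
            return
        left_row = row.copy()
        left_row[col[node]] = -1.0            # left branch: -1 in R, +1 added to b
        right_row = row.copy()
        right_row[col[node]] = +1.0           # right branch: +1 in R
        visit(left[node], left_row, balance + 1.0)
        visit(right[node], right_row, balance)

    visit(0, np.zeros(len(internal)), 0.0)
    return np.array(rows), np.array(b), leaves, internal

X = np.random.randn(50, 3)
y = (X[:, 0] > 0).astype(int)
t = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).tree_
R, b, leaves, internal = build_routing(t)
print(R.shape, b)   # (L, M) matrix and one balance value per leaf
```

The same recursion also yields the leaf ordering, which determines how the leaf values are stacked into the third-layer embedding matrix.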
5.3. Peculiarities of the Proposed Soft Trees
- The sigmoid and softmax temperature parameters are trained, starting from a fixed initial value, to avoid having to fit them as hyperparameters. Temperatures as trainable parameters are not redundant: the first layer of the neural network contains the fixed weight matrix $\mathbf{F}$, so scaling by a trainable temperature cannot be absorbed into the layer weights. The same holds for the softmax operation, whose argument contains a fixed number of terms taking values from 0 to 1.
- In contrast to [50], we did not use oblique trees, as they may lead to overfitting on tabular data. Trees with the axis-parallel separating hyperplanes allow us to build accurate models for tabular data where linear combinations of features often do not make sense.
- Therefore, we also did not use overparametrization, which is a key element for convergence in training the decision trees with quantized decision rules (when the indicator is represented not by a sigmoid function, but by the so-called straight-through operator [52]).
- We used softmax as an approximation of the argmax operation instead of an approximation of the sum of indicator functions. At the prediction stage, the implementation of the algorithm proposed in [50], which uses the sigmoid function, can output the sum of the values of several leaves at the same time (a small numerical illustration is given below).
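The following toy computation illustrates the last point. It is a hedged example: the product-of-sigmoids routing below stands in for a generic sigmoid-based soft tree rather than the exact implementation of [50]. With sigmoid-based routing, several leaves keep non-negligible weight, whereas the softmax over path scores with a sufficiently large temperature concentrates on a single leaf.

```python
import numpy as np

# Soft predicate values at the three internal nodes of a depth-2 tree
# (root, left child, right child); each is the probability of going right.
s = {"root": 0.9, "left": 0.9, "right": 0.9}

# Leaf weights under product-of-sigmoids routing (a common soft-tree formulation):
product_weights = np.array([
    (1 - s["root"]) * (1 - s["left"]),   # leaf LL
    (1 - s["root"]) * s["left"],         # leaf LR
    s["root"] * (1 - s["right"]),        # leaf RL
    s["root"] * s["right"],              # leaf RR
])

# Path scores (sum of sigmoid terms along each path) and the softmax used in STE-MIL:
path_scores = np.array([
    (1 - s["root"]) + (1 - s["left"]),   # leaf LL
    (1 - s["root"]) + s["left"],         # leaf LR
    s["root"] + (1 - s["right"]),        # leaf RL
    s["root"] + s["right"],              # leaf RR
])
alpha = 10.0                              # softmax temperature
softmax_weights = np.exp(alpha * path_scores) / np.exp(alpha * path_scores).sum()

print(product_weights.round(3))   # [0.01 0.09 0.09 0.81] -> several leaves still contribute
print(softmax_weights.round(3))   # ~[0. 0.001 0.001 0.999] -> essentially one leaf
```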
6. Attention and the Whole Scheme of STE-MIL
7. Numerical Experiments
8. Conclusions
8.1. Discussion
8.2. Open Research Questions
8.3. Concluding Remarks
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hagele, M.; Seegerer, P.; Lapuschkin, S.; Bockmayr, M.; Samek, W.; Klauschen, F.; Muller, K.R.; Binder, A. Resolving challenges in deep learning-based analyses of histopathological images using explanation methods. Sci. Rep. 2020, 10, 6423.
- van der Laak, J.; Litjens, G.; Ciompi, F. Deep learning in histopathology: The path to the clinic. Nat. Med. 2021, 27, 775–784.
- Yamamoto, Y.; Tsuzuki, T.; Akatsuka, J. Automated acquisition of explainable knowledge from unannotated histopathology images. Nat. Commun. 2019, 10, 5642.
- Dietterich, T.; Lathrop, R.; Lozano-Perez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 1997, 89, 31–71.
- Zhu, L.; Zhao, B.; Gao, Y. Multi-class multi-instance learning for lung cancer image classification based on bag feature selection. In Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Jinan, China, 18–20 October 2008; Volume 2, pp. 487–492.
- Wei, X.S.; Ye, H.J.; Mu, X.; Wu, J.; Shen, C.; Zhou, Z.H. Multiple instance learning with emerging novel class. IEEE Trans. Knowl. Data Eng. 2019, 33, 2109–2120.
- Amores, J. Multiple instance classification: Review, taxonomy and comparative study. Artif. Intell. 2013, 201, 81–105.
- Babenko, B. Multiple Instance Learning: Algorithms and Applications; Technical Report; University of California: San Diego, CA, USA, 2008.
- Carbonneau, M.A.; Cheplygina, V.; Granger, E.; Gagnon, G. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognit. 2018, 77, 329–353.
- Cheplygina, V.; de Bruijne, M.; Pluim, J. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med. Image Anal. 2019, 54, 280–296.
- Quellec, G.; Cazuguel, G.; Cochener, B.; Lamard, M. Multiple-Instance Learning for Medical Image and Video Analysis. IEEE Rev. Biomed. Eng. 2017, 10, 213–234.
- Yao, J.; Zhu, X.; Jonnagaddala, J.; Hawkins, N.; Huang, J. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning network. Med. Image Anal. 2020, 65, 101789.
- Zhou, Z.H. Multi-Instance Learning: A Survey; Technical Report; National Laboratory for Novel Software Technology, Nanjing University: Nanjing, China, 2004.
- Srinidhi, C.; Ciga, O.; Martel, A.L. Deep neural network models for computational histopathology: A survey. Med. Image Anal. 2021, 67, 101813.
- Andrews, S.; Tsochantaridis, I.; Hofmann, T. Support vector machines for multiple-instance learning. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02; MIT Press: Cambridge, MA, USA, 2002; pp. 577–584.
- Chevaleyre, Y.; Zucker, J.D. Solving multiple-instance and multiple-part learning problems with decision trees and rule sets. Application to the mutagenesis problem. In Proceedings of the Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2056, pp. 204–214.
- Kraus, O.; Ba, J.; Frey, B. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics 2016, 32, i52–i59.
- Sun, M.; Han, T.; Liu, M.C.; Khodayari-Rostamabad, A. Multiple instance learning convolutional neural networks for object recognition. In Proceedings of the International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3270–3275.
- Wang, X.; Yan, Y.; Tang, P.; Bai, X.; Liu, W. Revisiting multiple instance neural networks. Pattern Recognit. 2018, 74, 15–24.
- Wang, J.; Zucker, J.D. Solving the multiple-instance problem: A lazy learning approach. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML, Stanford, CA, USA, 29 June–2 July 2000; pp. 1119–1126.
- Pappas, N.; Popescu-Belis, A. Explicit Document Modeling through Weighted Multiple-Instance Learning. J. Artif. Intell. Res. 2017, 58, 591–626.
- Fuster, S.; Eftestol, T.; Engan, K. Nested multiple instance learning with attention mechanisms. arXiv 2021, arXiv:2111.00947.
- Ilse, M.; Tomczak, J.; Welling, M. Attention-based Deep Multiple Instance Learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 2127–2136.
- Jiang, S.; Suriawinata, A.; Hassanpour, S. MHAttnSurv: Multi-Head Attention for Survival Prediction Using Whole-Slide Pathology Images. arXiv 2021, arXiv:2110.11558.
- Konstantinov, A.; Utkin, L. Multi-attention multiple instance learning. Neural Comput. Appl. 2022, 34, 14029–14051.
- Rymarczyk, D.; Kaczynska, A.; Kraus, J.; Pardyl, A.; Zielinski, B. ProtoMIL: Multiple Instance Learning with Prototypical Parts for Fine-Grained Interpretability. arXiv 2021, arXiv:2108.10612.
- Wang, Q.; Zhou, Y.; Huang, J.; Liu, Z.; Li, L.; Xu, W.; Cheng, J.Z. Hierarchical Attention-Based Multiple Instance Learning Network for Patient-Level Lung Cancer Diagnosis. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020; pp. 1156–1160.
- Heath, D.; Kasif, S.; Salzberg, S. Induction of oblique decision trees. In Proceedings of the International Joint Conference on Artificial Intelligence, Chambéry, France, 28 August–3 September 1993; Volume 1993, pp. 1002–1007.
- Taser, P.; Birant, K.; Birant, D. Comparison of Ensemble-Based Multiple Instance Learning Approaches. In Proceedings of the 2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), Sofia, Bulgaria, 3–5 July 2019; pp. 1–5.
- Doran, G.; Ray, S. Multiple-Instance Learning from Distributions. J. Mach. Learn. Res. 2016, 17, 4384–4433.
- Feng, J.; Zhou, Z.H. Deep MIML network. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 1884–1890.
- Liu, Q.; Zhou, S.; Zhu, C.; Liu, X.; Yin, J. MI-ELM: Highly efficient multi-instance learning based on hierarchical extreme learning machine. Neurocomputing 2016, 173, 1044–1053.
- Xu, Y. Multiple-instance learning based decision neural networks for image retrieval and classification. Neurocomputing 2016, 171, 826–836.
- Rymarczyk, D.; Borowa, A.; Tabor, J.; Zielinski, B. Kernel Self-Attention for Weakly-supervised Image Classification using Deep Multiple Instance Learning. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1721–1730.
- Tang, X.; Liu, M.; Zhong, H.; Ju, Y.; Li, W.; Xu, Q. MILL: Channel Attention–based Deep Multiple Instance Learning for Landslide Recognition. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–11.
- Li, B.; Li, Y.; Eliceiri, K. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14318–14328.
- Qi, C.; Hao, S.; Kaichun, M.; Leonidas, J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
- Schmidt, A.; Morales-Alvarez, P.; Molina, R. Probabilistic attention based on Gaussian processes for deep multiple instance learning. arXiv 2021, arXiv:2302.04061.
- Costa, V.; Pedreira, C. Recent advances in decision trees: An updated survey. Artif. Intell. Rev. 2022, 56, 4765–4800.
- Wickramarachchi, D.; Robertson, B.; Reale, M.; Price, C.; Brown, J. HHCART: An oblique decision tree. Comput. Stat. Data Anal. 2016, 96, 12–23.
- Carreira-Perpinan, M.; Tavallali, P. Alternating optimization of decision trees, with application to learning sparse oblique trees. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31, pp. 1–11.
- Xu, Z.; Zhu, G.; Yuan, C.; Huang, Y. One-Stage Tree: End-to-end tree builder and pruner. Mach. Learn. 2022, 111, 1959–1985.
- Menze, B.; Kelm, B.; Splitthoff, D.; Koethe, U.; Hamprecht, F. On oblique random forests. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011; Springer: Berlin/Heidelberg, Germany, 2011; Volume 22, pp. 453–469.
- Katuwal, R.; Suganthan, P.; Zhang, L. Heterogeneous oblique random forest. Pattern Recognit. 2020, 99, 107078.
- Cantu-Paz, E.; Kamath, C. Inducing oblique decision trees with evolutionary algorithms. IEEE Trans. Evol. Comput. 2003, 7, 54–68.
- Hehn, T.; Kooij, J.; Hamprecht, F. End-to-End Learning of Decision Trees and Forests. Int. J. Comput. Vis. 2020, 128, 997–1011.
- Lee, G.H.; Jaakkola, T. Oblique decision trees from derivatives of ReLU networks. arXiv 2019, arXiv:1909.13488.
- Hazimeh, H.; Ponomareva, N.; Mol, P.; Tan, Z.; Mazumder, R. The tree ensemble layer: Differentiability meets conditional computation. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 4138–4148.
- Frosst, N.; Hinton, G. Distilling a neural network into a soft decision tree. arXiv 2017, arXiv:1711.09784.
- Karthikeyan, A.; Jain, N.; Natarajan, N.; Jain, P. Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent. arXiv 2021, arXiv:2102.07567.
- Madaan, L.; Bhojanapalli, S.; Jain, H.; Jain, P. Treeformer: Dense Gradient Trees for Efficient Attention Computation. arXiv 2022, arXiv:2208.09015.
- Bengio, Y.; Leonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432.
- Leistner, C.; Saffari, A.; Bischof, H. MIForests: Multiple-instance learning with randomized trees. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; pp. 29–42.
- Gartner, T.; Flach, P.; Kowalczyk, A.; Smola, A. Multi-instance kernels. In Proceedings of the ICML, Sydney, Australia, 8–12 July 2002; Volume 2, pp. 179–186.
- Zhang, Q.; Goldman, S. EM-DD: An improved multiple-instance learning technique. In Proceedings of the NIPS, Vancouver, BC, Canada, 9–14 December 2002; pp. 1073–1080.
- Zhou, Z.H.; Sun, Y.Y.; Li, Y.F. Multi-instance learning by treating instances as non-i.i.d. samples. In Proceedings of the ICML, Montreal, QC, Canada, 14–18 June 2009; pp. 1249–1256.
- Wei, X.S.; Wu, J.; Zhou, Z.H. Scalable algorithms for multi-instance learning. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 975–987.
- Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
- Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- Friedman, J. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
Data Set | Bags (N) | Instances (n) | Features (m) |
---|---|---|---|
Elephant | 200 | 1391 | 230 |
Fox | 200 | 1302 | 230 |
Tiger | 200 | 1220 | 230 |
Musk1 | 92 | 476 | 166 |
Musk2 | 102 | 6598 | 166 |
Method | Elephant | Fox | Tiger |
---|---|---|---|
mi-SVM [15] | 0.822 ± N/A | 0.582 ± N/A | 0.784 ± N/A |
MI-SVM [15] | 0.843 ± N/A | 0.578 ± N/A | 0.840 ± N/A |
MI-Kernel [54] | 0.843 ± N/A | 0.603 ± N/A | 0.842 ± N/A |
EM-DD [55] | 0.771 ± 0.097 | 0.609 ± 0.101 | 0.730 ± 0.096 |
mi-Graph [56] | 0.869 ± 0.078 | 0.620 ± 0.098 | 0.860 ± 0.083 |
miVLAD [57] | 0.850 ± 0.080 | 0.620 ± 0.098 | 0.811 ± 0.087 |
miFV [57] | 0.852 ± 0.081 | 0.621 ± 0.109 | 0.813 ± 0.083 |
mi-Net [19] | 0.858 ± 0.083 | 0.613 ± 0.078 | 0.824 ± 0.076 |
MI-Net [19] | 0.862 ± 0.077 | 0.622 ± 0.084 | 0.830 ± 0.072 |
MI-Net with DS [19] | 0.872 ± 0.072 | 0.630 ± 0.080 | 0.845 ± 0.087 |
MI-Net with RC [19] | 0.857 ± 0.089 | 0.619 ± 0.104 | 0.836 ± 0.083 |
Attention [23] | 0.868 ± 0.022 | 0.615 ± 0.043 | 0.839 ± 0.022 |
Gated-Attention [23] | 0.857 ± 0.027 | 0.603 ± 0.029 | 0.845 ± 0.018 |
STE-MIL | 0.885 ± 0.038 | 0.730 ± 0.080 | 0.875 ± 0.039 |
Method | Musk1 | Musk2 |
---|---|---|
mi-SVM [15] | 0.874 ± N/A | 0.836 ± N/A |
MI-SVM [15] | 0.779 ± N/A | 0.843 ± N/A |
MI-Kernel [54] | 0.880 ± N/A | 0.893 ± N/A |
EM-DD [55] | 0.849 ± 0.098 | 0.869 ± 0.108 |
mi-Graph [56] | 0.889 ± 0.073 | 0.903 ± 0.086 |
miVLAD [57] | 0.871 ± 0.098 | 0.872 ± 0.095 |
miFV [57] | 0.909 ± 0.089 | 0.884 ± 0.094 |
mi-Net [19] | 0.889 ± 0.088 | 0.858 ± 0.110 |
MI-Net [19] | 0.887 ± 0.091 | 0.859 ± 0.102 |
MI-Net with DS [19] | 0.894 ± 0.093 | 0.874 ± 0.097 |
MI-Net with RC [19] | 0.898 ± 0.097 | 0.873 ± 0.098 |
Attention [23] | 0.892 ± 0.040 | 0.858 ± 0.048 |
Gated-Attention [23] | 0.900 ± 0.050 | 0.863 ± 0.042 |
STE-MIL | 0.918 ± 0.077 | 0.854 ± 0.061 |