Liver-VLM: Enhancing Focal Liver Lesion Classification with Self-Supervised Vision-Language Pretraining
Abstract
1. Introduction
- (1) We propose Liver-VLM, a vision-language model (VLM) for focal liver lesion (FLL) classification. It is trained on a dedicated multi-phase CT FLL dataset, which makes it better suited to the multi-phase CT domain. We prioritize clinically driven design over architectural complexity and position Liver-VLM as a strong, reproducible, and clinically interpretable baseline for liver lesion diagnosis.
- (2) We design a self-supervised pre-training strategy for Liver-VLM built on the phase shuffle prediction task proposed in our previous work [6]. To our knowledge, this is the first work to integrate the three-channel phase-shuffle pretext task directly into a VLM training pipeline for medical imaging; previous work used it only to pre-train visual-only models. We demonstrate that this task, which is inherently aligned with the multi-phase nature of liver CT, is effective not only for learning visual representations but also for aligning those representations with clinical language, thereby enhancing the discriminative power of the final VLM.
- (3) We design enriched textual prompts that stabilize optimization and enable robust classification even with limited labeled data. We also propose a data augmentation technique based on phase shuffle, which expands the training dataset by generating all six possible permutations of the CT phase order. This increases data diversity and improves the model's robustness to variations in phase presentation.
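
The phase-shuffle idea in contributions (2) and (3) can be illustrated with a minimal sketch. It assumes three phases per sample, which is what "three-channel" and "six possible permutations" imply; the phase names `NC`, `ART`, and `PV`, the function names, and the choice of the permutation index as the pretext label are illustrative assumptions, not details taken from the paper.

```python
from itertools import permutations

# Hypothetical phase names; the text only states that three phases are used.
PHASES = ("NC", "ART", "PV")

# All 3! = 6 orderings. Using the index of an ordering as the pretext-task
# label is an assumption made for this sketch.
PERMUTATIONS = list(permutations(range(len(PHASES))))


def phase_shuffle_views(phase_slices):
    """Return (shuffled_slices, permutation_label) for every phase ordering.

    phase_slices: sequence of three per-phase images (any objects), given in
    the original acquisition order.
    """
    if len(phase_slices) != len(PHASES):
        raise ValueError("expected one slice per phase")
    views = []
    for label, order in enumerate(PERMUTATIONS):
        shuffled = [phase_slices[i] for i in order]
        views.append((shuffled, label))
    return views


def augment_dataset(samples):
    """Expand (phase_slices, lesion_class) pairs sixfold, keeping the class."""
    augmented = []
    for phase_slices, lesion_class in samples:
        for shuffled, _ in phase_shuffle_views(phase_slices):
            augmented.append((shuffled, lesion_class))
    return augmented
```

In this sketch the same permutation table serves two purposes: during pre-training the permutation index would be the prediction target, while during augmentation it is discarded and the lesion class is kept unchanged.
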
2. Related Work
2.1. CLIP
2.2. MedCLIP
3. Method
3.1. Overview of the Proposed Method
3.2. Self-Supervised Pre-Training with Phase Shuffle Prediction Task
3.3. Target Training (Fine-Tuning)
3.4. Data Augmentation Based on Phase Shuffle
4. Experimental Results
4.1. Dataset and Implementations
4.2. Results
4.2.1. Ablation Studies
4.2.2. Comparison with Other Models
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Smithuis, R. CT Contrast Injection and Protocols; Radiology Department of the Rijnland Hospital: Leiderdorp, The Netherlands, 2014; Available online: http://www.radiologyassistant.nl/en/p52c04470dbd5c/ct-contrast-injection-and-protocols.html (accessed on 19 October 2025).
- Yasaka, K.; Akai, H.; Abe, O.; Kiryu, S. Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: A preliminary study. Radiology 2018, 286, 887–896.
- Liang, D.; Liu, M.; Zhang, J.; Wang, Y.; Zhang, D. Combining convolutional and recurrent neural networks for classification of focal liver lesions in multi-phase CT imaging. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2018, 2nd ed.; Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G., Eds.; Springer: Cham, Switzerland, 2018; Volume 11071, pp. 666–675.
- Wang, W.; Wang, Y.; Liang, D.; Zhang, J. Classification of focal liver lesions using deep learning with fine-tuning. In Proceedings of the Digital Medicine and Image Processing (DMIP2018), Chengdu, China, 24–26 November 2018; pp. 56–60.
- Dong, H.; Iwamoto, Y.; Han, X.H.; Lin, L.; Hu, H.; Cai, X.; Chen, Y.-W. Case Discrimination: Self-supervised Feature Learning for the Classification of Focal Liver Lesions. In Innovation in Medicine and Healthcare, Proceedings of the 9th KES-InMed 2021, Virtual Event, 14–16 June 2021; Chen, Y.-W., Tanaka, S., Howlett, R.J., Jain, L.C., Eds.; Springer: Singapore, 2021; Volume 254, pp. 241–249.
- Song, J.; Dong, H.; Chen, Y.; Lin, L.; Hu, H.; Chen, Y.-W. Deep Neural Network-Based Classification of Focal Liver Lesions Using Phase-Shuffle Prediction Pre-training. In Innovation in Medicine and Healthcare, Proceedings of the 11th KES-InMed 2023, Rome, Italy, 14–16 June 2023; Chen, Y.-W., Tanaka, S., Howlett, R.J., Jain, L.C., Eds.; Smart Innovation, Systems and Technologies; Springer: Cham, Switzerland, 2023; Volume 357, pp. 235–243.
- Song, J.; Dong, H.; Chen, Y.; Zhang, X.; Zhan, G.; Jain, R.K.; Chen, Y.-W. Early Recurrence Prediction of Hepatocellular Carcinoma Using Deep Learning Frameworks with Multi-Task Pre-Training. Information 2024, 15, 493.
- Desai, K.; Johnson, J. VirTex: Learning Visual Representations from Textual Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 19–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 11162–11172.
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv 2020, arXiv:2010.00747.
- Hayat, M.; Aramvith, S.; Bhattacharjee, S.; Ahmad, N. Attention GhostUNet++: Enhanced Segmentation of Adipose Tissue and Liver in CT Images. arXiv 2025, arXiv:2504.11491.
- Bawazir, A.; Wu, K.; Li, W. Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training. arXiv 2024, arXiv:2411.15207.
- Song, J.; Hu, Y.; Wang, H.; Chen, Y.-W. Liver-VLM: A Vision-Language Model for Focal Liver Lesion Classification. In Proceedings of the 2025 International Conference on Innovation in Medicine and Healthcare, Solin, Croatia, 25–27 June 2025.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (ClinicalNLP 2019), Minneapolis, MN, USA, 7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 72–78.
- Bodenreider, O. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Res. 2004, 32, D267–D270.
- Available online: https://github.com/google-research/bert (accessed on 19 October 2025).
- Xu, Y.; Zhou, H.; Zhang, Z.; Wang, Y.; Xie, Y. PA-ResSeg: A Phase Attention Residual Network for Liver Tumor Segmentation from Multi-phase CT Images. Med. Phys. 2021, 48, 3752–3766.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
- Chen, X.; Li, X.; Fan, H. Exploring Simple Siamese Representation Learning. arXiv 2020, arXiv:2011.10566.
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. arXiv 2021, arXiv:2106.13230.
- Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545.
- Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. arXiv 2021, arXiv:2104.00298.
- Student. The Probable Error of a Mean. Biometrika 1908, 6, 1–25.
- DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics 1988, 44, 837–845.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. arXiv 2016, arXiv:1610.02391.





| Group | Unit | CYST | FNH | HCC | HEM | Total |
|---|---|---|---|---|---|---|
| G1 | case | 5 | 4 | 4 | 4 | 17 |
| G1 | slice | 29 | 15 | 30 | 21 | 95 |
| G2 | case | 6 | 3 | 4 | 4 | 17 |
| G2 | slice | 31 | 17 | 29 | 33 | 110 |
| G3 | case | 6 | 3 | 4 | 4 | 17 |
| G3 | slice | 37 | 7 | 36 | 17 | 97 |
| G4 | case | 6 | 3 | 4 | 4 | 17 |
| G4 | slice | 24 | 17 | 35 | 19 | 95 |
| G5 | case | 7 | 3 | 3 | 4 | 17 |
| G5 | slice | 28 | 20 | 32 | 12 | 92 |
| Total | case | 30 | 16 | 19 | 20 | 85 |
| Total | slice | 149 | 76 | 162 | 102 | 489 |
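
The slice counts in the dataset table can be checked, and turned into train/test sizes, with a short sketch. It assumes that the five groups G1 to G5 act as cross-validation folds with one group held out at a time; the table layout suggests this, but the protocol is not stated in this excerpt. The function names are hypothetical.

```python
CLASSES = ("CYST", "FNH", "HCC", "HEM")

# Slice counts per group, copied from the dataset table (order follows CLASSES).
SLICES = {
    "G1": (29, 15, 30, 21),
    "G2": (31, 17, 29, 33),
    "G3": (37, 7, 36, 17),
    "G4": (24, 17, 35, 19),
    "G5": (28, 20, 32, 12),
}


def class_totals(counts):
    """Sum each class column across all groups."""
    return tuple(sum(column) for column in zip(*counts.values()))


def leave_one_group_out(counts):
    """Train/test slice counts when each group is held out once (assumed protocol)."""
    total = sum(sum(row) for row in counts.values())
    return {
        group: {"train": total - sum(row), "test": sum(row)}
        for group, row in counts.items()
    }


if __name__ == "__main__":
    print(dict(zip(CLASSES, class_totals(SLICES))))
    for group, sizes in leave_one_group_out(SLICES).items():
        print(group, sizes)
```

Under this assumption every fold trains on roughly four fifths of the 489 slices, and the per-class totals reproduce the last row of the table.
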

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX A6000 (NVIDIA, Santa Clara, CA, USA) |
| CPU | Intel(R) Core(TM) i9-10980XE @ 3.00 GHz (Intel, Santa Clara, CA, USA) |
| OS | Ubuntu 20.04.5 LTS |
| Deep Learning Framework | PyTorch 2.1.1 |

| Model | Avg. Acc (%) | AUC |
|---|---|---|
| CLIP (zero-shot inference, single medical prompt) [8] | 20.54 ± 5.61 | 0.43 ± 0.09 |
| CLIP (zero-shot inference, ensemble of 16 medical prompts) [8] | 18.74 ± 2.87 | 0.42 ± 0.07 |
| Liver-VLM (from scratch) [12] | 81.15 ± 7.36 | 0.92 ± 0.04 |
| Liver-VLM (ImageNet) [12] | 83.35 ± 4.81 | 0.92 ± 0.03 |
| Liver-VLM (Self-Supervised) | 85.09 ± 3.89 | 0.92 ± 0.03 |
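
The two CLIP rows above differ only in whether one prompt or an ensemble of prompts represents each class. A common way to ensemble prompts in CLIP-style zero-shot inference is to average the normalized text embeddings of all prompts for a class and compare the result with the image embedding by cosine similarity. The sketch below shows that scoring step only; whether the paper ensembles prompts this way is an assumption, the encoders are omitted, and the embeddings in the usage note are toy vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def ensemble_class_embedding(prompt_embeddings):
    """Average the normalized embeddings of several prompts for one class."""
    normalized = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[d] for e in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(mean)


def zero_shot_predict(image_embedding, class_prompt_embeddings):
    """Return (best_class, scores) using cosine similarity to each class."""
    image = l2_normalize(image_embedding)
    scores = {}
    for name, prompts in class_prompt_embeddings.items():
        text = ensemble_class_embedding(prompts)
        scores[name] = sum(i * t for i, t in zip(image, text))
    return max(scores, key=scores.get), scores
```

As a usage illustration with made-up two-dimensional embeddings, `zero_shot_predict([0.2, 1.0], {"CYST": [[1.0, 0.0], [0.9, 0.1]], "HCC": [[0.0, 1.0], [0.1, 0.9]]})` selects `"HCC"`, since the image vector points mostly along the second axis. Passing a single prompt per class reduces the function to the single-prompt setting.
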

| Model | CYST | FNH | HCC | HEM | Avg. Acc (%) | AUC |
|---|---|---|---|---|---|---|
| Liver-VLM (pre-trained with SimSiam SSL) | 96.50 ± 5.54 | 92.79 ± 6.04 | 71.05 ± 11.23 | 58.46 ± 20.96 | 80.19 ± 5.16 | 0.92 ± 0.03 |
| Liver-VLM (proposed method; pre-trained with phase shuffle SSL) | 95.79 ± 3.54 | 87.60 ± 9.54 | 80.99 ± 10.67 | 68.54 ± 24.83 | 85.09 ± 3.89 | 0.92 ± 0.03 |

| Image Encoder | CYST | FNH | HCC | HEM | Avg. Acc (%) | AUC |
|---|---|---|---|---|---|---|
| Liver-VLM (Video Swin) | 97.09 ± 2.70 | 57.63 ± 39.40 | 47.95 ± 25.09 | 32.94 ± 22.12 | 62.81 ± 14.42 | 0.82 ± 0.08 |
| Liver-VLM (ConvNeXt) | 96.48 ± 3.94 | 75.13 ± 27.52 | 57.23 ± 20.64 | 40.36 ± 33.83 | 70.26 ± 9.35 | 0.87 ± 0.04 |
| Liver-VLM (EfficientNet) | 97.86 ± 4.28 | 80.69 ± 21.75 | 84.44 ± 13.85 | 57.52 ± 23.24 | 84.52 ± 2.39 | 0.93 ± 0.03 |
| Liver-VLM (ResNet18) | 95.79 ± 3.54 | 87.60 ± 9.54 | 80.99 ± 10.67 | 68.54 ± 24.83 | 85.09 ± 3.89 | 0.92 ± 0.03 |

| Model | CYST | FNH | HCC | HEM | Avg. Acc (%) | AUC |
|---|---|---|---|---|---|---|
| Liver-VLM (Self-Supervised, misregistered) | 94.81 ± 4.41 | 72.39 ± 12.25 | 90.66 ± 6.59 | 67.88 ± 25.77 | 84.98 ± 5.66 | 0.91 ± 0.04 |
| Liver-VLM (Self-Supervised) | 95.79 ± 3.54 | 87.60 ± 9.54 | 80.99 ± 10.67 | 68.54 ± 24.83 | 85.09 ± 3.89 | 0.92 ± 0.03 |

| Model | CYST | FNH | HCC | HEM | Avg. Acc (%) | AUC |
|---|---|---|---|---|---|---|
| Model 0: Liver-VLM (CLIP prompt) | 92.97 ± 8.27 | 85.96 ± 4.51 | 73.7 ± 5.58 | 64.95 ± 11.12 | 82.46 ± 7.34 | 0.93 ± 0.03 |
| Model 1: Liver-VLM (proposed prompt) | 95.79 ± 3.54 | 87.60 ± 9.54 | 80.99 ± 10.67 | 68.54 ± 24.83 | 85.09 ± 3.89 | 0.92 ± 0.03 |
| Model 2 (proposed method): Liver-VLM (proposed prompt + data augmentation) | 98.03 ± 2.79 | 83.22 ± 21.07 | 83.45 ± 15.7 | 72.74 ± 10.51 | 85.63 ± 3.18 | 0.94 ± 0.01 |

| Model | CYST | FNH | HCC | HEM | Avg. Acc (%) | AUC |
|---|---|---|---|---|---|---|
| Data augmentation with FC layer | 98.49 ± 1.97 | 77.32 ± 13.46 | 83.04 ± 7.90 | 48.60 ± 15.93 | 80.1 ± 3.58 | 0.93 ± 0.02 |
| Liver-VLM (proposed method; proposed prompt + data augmentation) | 98.03 ± 2.79 | 83.22 ± 21.07 | 83.45 ± 15.7 | 72.74 ± 10.51 | 85.63 ± 3.18 | 0.94 ± 0.01 |

| Model | Image Encoder | CYST | FNH | HCC | HEM | Acc (%) | AUC |
|---|---|---|---|---|---|---|---|
| CLIP (zero-shot inference) [8] | - | 0.00 | 0.00 | 0.00 | 100.00 | 20.54 ± 5.61 | 0.43 ± 0.09 |
| MedCLIP (zero-shot inference) [9] | - | 31.94 ± 3.60 | 0.00 | 0.00 | 15.63 ± 5.58 | 26.12 ± 2.28 | 0.48 ± 3.45 |
| Liver-VLM (from scratch) [12] | ResNet50 | 96.50 ± 5.54 | 90.79 ± 7.06 | 76.49 ± 13.52 | 39.08 ± 22.72 | 77.64 ± 7.83 | 0.89 ± 0.04 |
| Liver-VLM (from scratch) [12] | ResNet18 | 97.32 ± 4.14 | 90.44 ± 8.95 | 76.02 ± 16.46 | 55.14 ± 28.83 | 81.15 ± 7.36 | 0.92 ± 0.04 |
| Liver-VLM (ImageNet) [12] | ResNet50 | 95.94 ± 4.17 | 68.01 ± 30.87 | 78.11 ± 13.55 | 76.30 ± 16.96 | 83.05 ± 4.44 | 0.93 ± 0.02 |
| Liver-VLM (ImageNet) [12] | ResNet18 | 95.94 ± 4.17 | 73.72 ± 25.15 | 76.44 ± 12.60 | 70.41 ± 16.90 | 83.35 ± 4.81 | 0.92 ± 0.03 |
| Phase shuffle prediction with a traditional FC layer [7] | - | 98.27 ± 1.42 | 86.22 ± 12.27 | 82.90 ± 5.84 | 63.72 ± 12.68 | 84.82 ± 1.99 | 0.93 ± 0.02 |
| Liver-VLM (proposed) | ResNet18 | 98.03 ± 2.79 | 83.22 ± 21.07 | 83.45 ± 15.7 | 72.74 ± 10.51 | 85.63 ± 3.18 | 0.94 ± 0.01 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Song, J.; Hu, Y.; Wang, H.; Chen, Y.-W. Liver-VLM: Enhancing Focal Liver Lesion Classification with Self-Supervised Vision-Language Pretraining. Appl. Sci. 2025, 15, 12578. https://doi.org/10.3390/app152312578

