On Isotropy of Multimodal Embeddings
Abstract
1. Introduction
- The distribution of CLIP embeddings is not zero-centered and therefore forms a cone; nevertheless, the embeddings exhibit a high degree of isotropy in a local sense [8];
- Both the CLIP image and text encoders produce embeddings that lie in a cone not only after training but already at initialization, even when layer biases are set to zero;
- While most elementary linear-algebra methods for increasing the isotropy of embeddings do not substantially improve CLIP's zero-shot classification performance, certain dimensionality-reduction and distribution-alignment techniques can yield improvements.
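The isotropy claims above can be made concrete with two standard diagnostics: mean pairwise cosine similarity (near 0 for an isotropic cloud, near 1 for a cone) and the partition-function isotropy score of Mu and Viswanath. The sketch below uses synthetic Gaussian data as a stand-in for CLIP embeddings; the paper's exact metric may differ.

```python
import numpy as np

def mean_cosine_similarity(X):
    """Average pairwise cosine similarity: close to 0 for an isotropic
    cloud, close to 1 when embeddings lie in a narrow cone."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = len(X)
    return (S.sum() - n) / (n * (n - 1))  # exclude the diagonal

def partition_isotropy(X):
    """Isotropy score of Mu & Viswanath (2018): min/max of the partition
    function Z(c) = sum_i exp(c . x_i) over right singular vectors c of X.
    Equals 1 for a perfectly isotropic distribution, near 0 for a cone."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = np.exp(X @ Vt.T).sum(axis=0)
    return Z.min() / Z.max()

rng = np.random.default_rng(0)
iso = rng.normal(size=(500, 64))   # roughly isotropic Gaussian cloud
cone = iso + 5.0                   # shifted off the origin: forms a cone
```

On this toy data the shifted cloud scores near 1 in cosine similarity and near 0 in partition isotropy, mirroring the cone behaviour reported for CLIP.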
2. Literature Review
3. Method Description
3.1. Visualization, Isotropy, and Transformations
3.1.1. Data Description
3.1.2. Isotropy Analysis
3.1.3. How to Improve Isotropy
3.1.4. Evaluation Settings
3.2. Multilingual CLIP
Evaluating the Quality of Embeddings
4. Results
4.1. Visualization and Isotropy Metrics
4.2. Procrustes and LSTSQ Transformations
4.3. CLIP Loss Evaluation
4.4. CIFAR-100 Zero-Shot Accuracy
4.5. Linear Probe Evaluation
4.6. Semantic Visualization of the Embedding Space
4.7. Multilingual CLIP
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Semantic Visualization
References
- Gao, J.; He, D.; Tan, X.; Qin, T.; Wang, L.; Liu, T. Representation Degeneration Problem in Training Natural Language Generation Models. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Fuster Baggetto, A.; Fresno, V. Is anisotropy really the cause of BERT embeddings not being semantic? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 20 December 2022; pp. 4271–4281. [Google Scholar]
- Li, B.; Zhou, H.; He, J.; Wang, M.; Yang, Y.; Li, L. On the Sentence Embeddings from Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9119–9130. [Google Scholar] [CrossRef]
- Su, J.; Cao, J.; Liu, W.; Ou, Y. Whitening Sentence Representations for Better Semantics and Faster Retrieval. arXiv 2021, arXiv:2103.15316. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research. Volume 139, pp. 8748–8763. [Google Scholar]
- Wang, L.; Huang, J.; Huang, K.; Hu, Z.; Wang, G.; Gu, Q. Improving neural language generation with spectrum control. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Zhou, W.; Lin, B.; Ren, X. IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 14621–14629. [Google Scholar] [CrossRef]
- Cai, X.; Huang, J.; Bian, Y.; Church, K. Isotropy in the contextual embedding space: Clusters and manifolds. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics; Long and Short Papers. pp. 4171–4186. [Google Scholar] [CrossRef]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics. pp. 8440–8451. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; pp. 5998–6008. [Google Scholar]
- Ding, Y.; Martinkus, K.; Pascual, D.; Clematide, S.; Wattenhofer, R. On Isotropy Calibration of Transformer Models. arXiv 2021, arXiv:2109.13304. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar] [CrossRef]
- Schamoni, S.; Hitschler, J.; Riezler, S. A dataset and reranking method for multimodal MT of user-generated image captions. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Boston, MA, USA, 17–21 March 2018; pp. 140–153. [Google Scholar]
- McInnes, L.; Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
- Krizhevsky, A.; Nair, V. CIFAR-100 (Canadian Institute for Advanced Research); Technical Report; University of Toronto: Toronto, ON, Canada. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar]
- Jacot, A.; Gabriel, F.; Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 8580–8589. [Google Scholar]
Distribution | | | | |
---|---|---|---|---|
CLIP | 0.84 | 0.03 | 0.83 | 0.03 |
+0-bias w/ random weights | 0.02 | 0.69 | 0.00 | 16.61 |
+centering | 0.99 | 0.00 | 1.00 | 0.00 |
+whitening-128 | 0.99 | 0.00 | 1.00 | 0.00 |
BERT | 0.84 | 0.03 | - | - |
+centering | 0.99 | 0.00 | - | - |
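The centering and whitening-128 rows above can be reproduced in spirit with a short transform. This is a minimal numpy sketch, assuming "whitening-k" means PCA-style whitening truncated to the top k directions (in the spirit of Su et al.); the data here is synthetic, not CLIP output.

```python
import numpy as np

def whitening_k(X, k):
    """Center the embeddings, then decorrelate with an eigendecomposition of
    the covariance and keep the top-k directions, so the result has
    (approximately) identity covariance in k dimensions."""
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov(X - mu, rowvar=False)
    U, S, _ = np.linalg.svd(cov)          # eigenvectors / eigenvalues of cov
    W = U[:, :k] / np.sqrt(S[:k])         # d x k whitening matrix
    return (X - mu) @ W, mu, W

rng = np.random.default_rng(1)
# anisotropic, off-center cloud: distinct per-dimension scales plus an offset
X = rng.normal(size=(1000, 512)) * np.linspace(0.1, 3.0, 512) + 2.0
Xw, mu, W = whitening_k(X, 128)
print(Xw.shape)  # (1000, 128)
```

After the transform the cloud is exactly centered and its covariance is the identity, which is why the isotropy scores in the table jump to ~1.0 and the mean cosine similarity drops to ~0.0.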
Transform | Loss on Train | Loss on Test |
---|---|---|
CLIP | 0.4183 | 0.4097 |
centering | 1.6153 | 1.6201 |
whitening-128 | 1.5737 | 1.6042 |
Procrustes | 1.0223 | 1.2970 |
LSTSQ | 1.1785 | 1.3523 |
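The Procrustes and LSTSQ rows refer to linear maps fitted between the two modalities: Procrustes constrains the map to be orthogonal (preserving pairwise geometry), while LSTSQ is an unconstrained least-squares fit. A sketch on synthetic paired embeddings, with the Procrustes solution computed via SVD (the data and dimensions are illustrative, not the paper's):

```python
import numpy as np

def procrustes(A, B):
    """Orthogonal map R minimizing ||A @ R - B||_F (the orthogonal
    Procrustes problem): R = U V^T, where U S V^T is the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(2)
img = rng.normal(size=(200, 64))                      # stand-in image embeddings
R_true = np.linalg.qr(rng.normal(size=(64, 64)))[0]   # hidden orthogonal map
txt = img @ R_true + 0.05 * rng.normal(size=(200, 64))  # paired text embeddings

R = procrustes(txt, img)                       # orthogonal alignment
M, *_ = np.linalg.lstsq(txt, img, rcond=None)  # unconstrained least squares
```

On training data the unconstrained LSTSQ fit can never be worse than Procrustes, but its extra freedom is exactly what lets it overfit, consistent with the larger train/test gap in the loss table.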
Transform | Accuracy |
---|---|
CLIP | 63.09% |
Procrustes on COCO | 38.42% |
LSTSQ on COCO | 39.21% |
Procrustes on CIFAR train | 61.39% |
LSTSQ on CIFAR train | 65.53% |
Whitening-128 on COCO | 42.59% |
Whitening-64 on COCO | 30.49% |
Whitening-32 on COCO | 17.27% |
Whitening-128 on CIFAR train | 56.77% |
Whitening-128 on CIFAR test | 56.04% |
Whitening-64 on CIFAR train | 54.87% |
Whitening-32 on CIFAR train | 38.37% |
PCA-450 on COCO | 64.1% |
PCA-450 on CIFAR test | 63.6% |
PCA-90 on text prompts | 62.6% |
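The zero-shot accuracies above all follow the standard CLIP recipe: L2-normalize image and class-prompt embeddings and pick the most cosine-similar class; any transform (centering, whitening, PCA) is applied to both sides before normalization. A minimal sketch on synthetic embeddings:

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_emb):
    """CLIP-style zero-shot classification: L2-normalize both sides and,
    for each image, pick the class whose text embedding is most
    cosine-similar."""
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return (I @ T.T).argmax(axis=1)

rng = np.random.default_rng(3)
T = rng.normal(size=(100, 64))                     # one text embedding per class
labels = rng.integers(0, 100, size=500)
I = T[labels] + 0.3 * rng.normal(size=(500, 64))   # images near their class text
acc = (zero_shot_predict(I, T) == labels).mean()
```

Because cosine similarity is invariant to isotropic rescaling but not to centering or per-direction rescaling, transforms such as whitening genuinely change the predictions, as the table shows.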
Transform | Known Classes Accuracy | New Classes Accuracy |
---|---|---|
CLIP | 66.40% | 59.78% |
Procrustes | 70.88% | 3.46% |
LSTSQ | 70.54% | 19.68% |
Transform | Accuracy |
---|---|
CLIP | 75.61% |
Whitening-256 on COCO | 74.75% |
Whitening-128 on COCO | 77.1% |
Whitening-64 on COCO | 74.75% |
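Linear-probe evaluation trains a linear classifier on frozen embeddings, typically with scikit-learn's LogisticRegression. Below is a self-contained numpy stand-in (full-batch softmax regression with gradient descent) on synthetic embeddings; it is a sketch of the protocol, not the paper's implementation:

```python
import numpy as np

def linear_probe(X, y, n_classes, lr=0.05, steps=300):
    """Minimal multinomial logistic-regression probe trained with
    full-batch gradient descent on frozen embeddings."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
        W -= lr * X.T @ (P - Y) / n                  # cross-entropy gradient
    return W

rng = np.random.default_rng(4)
centers = rng.normal(size=(10, 32))                  # one "class prototype" per class
y = rng.integers(0, 10, size=600)
X = centers[y] + 0.5 * rng.normal(size=(600, 32))    # separable synthetic embeddings
W = linear_probe(X, y, 10)
acc = ((X @ W).argmax(axis=1) == y).mean()
```

Unlike zero-shot classification, the probe can absorb any invertible linear transform of the embeddings into its weights, which is why whitening barely moves linear-probe accuracy in the table.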
Quantity | Fr-En | Ru-En | De-En |
---|---|---|---|
Mean pairwise distance | 10.19 | 10.57 | 14.44 |
Mean global distance | 39.43 | 37.85 | 39.88 |
Ratio | 0.26 | 0.28 | 0.36 |
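Assuming the table's quantities are the mean distance between aligned translation pairs versus the mean distance between arbitrary caption pairs, the ratio can be computed as below; a ratio well below 1 indicates that translated captions land close together in the shared multilingual space. The data here is synthetic and the pairing interpretation is an assumption.

```python
import numpy as np

def pair_vs_global_ratio(A, B, seed=0):
    """Mean distance between aligned pairs (A[i], B[i]) divided by the mean
    distance between randomly re-paired rows of A and B."""
    pair = np.linalg.norm(A - B, axis=1).mean()
    idx = np.random.default_rng(seed).permutation(len(B))
    global_ = np.linalg.norm(A - B[idx], axis=1).mean()
    return pair / global_

rng = np.random.default_rng(5)
en = 3.0 * rng.normal(size=(300, 64))        # stand-in English caption embeddings
fr = en + 0.5 * rng.normal(size=(300, 64))   # stand-in French translations
ratio = pair_vs_global_ratio(en, fr)
```

The reported ratios of 0.26-0.36 suggest translation pairs are markedly closer than random caption pairs, though far from identical.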
Transform | Accuracy |
---|---|
ruCLIP | 53.14% |
Procrustes on CIFAR train | 59.35% |
LSTSQ on CIFAR train | 65.84% |
Share and Cite
Tyshchuk, K.; Karpikova, P.; Spiridonov, A.; Prutianova, A.; Razzhigaev, A.; Panchenko, A. On Isotropy of Multimodal Embeddings. Information 2023, 14, 392. https://doi.org/10.3390/info14070392