References
[1] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. ICML, 2021, pp. 8748–8763.
[2] X. Zhai et al., “LiT: Zero-Shot Transfer with Locked-image Text Tuning,” in Proc. CVPR, 2022, pp. 18123–18133.
[3] Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” in Proc. ICCV, 2021, pp. 10012–10022.
[4] C. Wah et al., “The Caltech-UCSD Birds-200-2011 Dataset,” Caltech, Tech. Rep. CNS-TR-2011-001, 2011.
[5] A. Frome et al., “DeViSE: A Deep Visual-Semantic Embedding Model,” in Proc. NeurIPS, 2013, pp. 2121–2129.
[6] X. Liu et al., “Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models,” in Proc. UAI, 2024, pp. 2309–2330.
[7] S. Jie et al., “Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning,” in Proc. ICML, 2024, vol. 235, pp. 22062–22074.
[8] L. Lan et al., “Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification,” arXiv preprint arXiv:2403.08271, 2024.