Comparative Study of Zero-shot and Fine-tuned Vision Models--Evaluating CLIP, LiT, and Swin Transformer on Fine-grained Bird Classification
Research Article
Open Access
CC BY

Yihua Wang 1*
1 Wuhan Polytechnic University
*Corresponding author: 2786250363@qq.com
Published on 14 October 2025

Abstract

Fine-grained image classification is challenging because categories are often separated by only subtle visual cues, requiring models to capture fine local details for accurate discrimination. Recent advances in vision-language models such as CLIP and LiT have demonstrated strong zero-shot performance on general image recognition tasks, but their effectiveness in fine-grained domains remains underexplored. In this study, we conduct a comparative evaluation of CLIP, LiT, and the vision-only Swin Transformer on the CUB-200-2011 bird dataset. For zero-shot classification, we assess CLIP and LiT using a consistent prompt template; for fine-tuning, we train both CLIP and Swin end-to-end with the AdamW optimizer. Results show that LiT outperforms CLIP in the zero-shot setting (63.96% vs. 51.55% Top-1 accuracy), while Swin achieves the highest performance after fine-tuning (83.47% Top-1 accuracy). These findings highlight a trade-off between generalization and fine-grained specialization, and suggest that future work should explore lightweight adaptation techniques to close the performance gap without sacrificing zero-shot flexibility.
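As a concrete illustration of the zero-shot protocol described above, the following Python sketch classifies a bird image with CLIP using a single fixed prompt template. This is a minimal sketch assuming the Hugging Face transformers CLIP API; the checkpoint name, prompt wording, and image path are illustrative assumptions, since the abstract does not specify the exact template or model variant used in the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's exact CLIP variant is not stated in the abstract.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A few CUB-200-2011 class names; the full dataset has 200 categories.
class_names = ["Black-footed Albatross", "Laysan Albatross", "Sooty Albatross"]
# One consistent template applied to every class name.
prompts = [f"a photo of a {name}, a type of bird." for name in class_names]

image = Image.open("bird.jpg")  # illustrative path to a CUB test image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity per class
print(class_names[logits.softmax(dim=-1).argmax().item()])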
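The fine-tuned baselines can be sketched in the same spirit. Below is a minimal AdamW training loop for a Swin Transformer with a 200-way classification head, assuming the timm library; the learning rate, weight decay, and placeholder data are assumptions, as the abstract does not report hyperparameters.

import torch
import timm
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for CUB-200-2011 batches (224x224 RGB, 200 classes).
dummy = TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 200, (8,)))
train_loader = DataLoader(dummy, batch_size=4)

# End-to-end fine-tuning: all backbone weights are updated, not just the head.
model = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=200)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)  # assumed values
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()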

Keywords:

Zero-shot learning, Fine-tuning, Vision-language models, CLIP, Swin Transformer.

References

[1]. A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. ICML, 2021, pp. 8748–8763.

[2]. X. Zhai et al., “LiT: Zero-Shot Transfer with Locked-image Text Tuning,” in Proc. CVPR, 2022, pp. 18123–18133.

[3]. Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” in Proc. ICCV, 2021, pp. 10012–10022.

[4]. C. Wah et al., “The Caltech-UCSD Birds-200-2011 Dataset,” Caltech, Tech. Rep. CNS-TR-2011-001, 2011.

[5]. A. Frome et al., “DeViSE: A Deep Visual-Semantic Embedding Model,” in Proc. NeurIPS, 2013, pp. 2121–2129.

[6]. X. Liu et al., “Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models,” in Proc. UAI, 2024, pp. 2309–2330.

[7]. S. Jie et al., “Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning,” in Proc. ICML, 2024, vol. 235, pp. 22062–22074.

[8]. L. Lan et al., “Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification,” arXiv preprint arXiv:2403.08271, 2024.

Cite this article

Wang, Y. (2025). Comparative Study of Zero-shot and Fine-tuned Vision Models--Evaluating CLIP, LiT, and Swin Transformer on Fine-grained Bird Classification. Applied and Computational Engineering, 191, 59–64.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

About volume

Volume title: Proceedings of CONF-MLA 2025 Symposium: Intelligent Systems and Automation: AI Models, IoT, and Robotic Algorithms

ISBN: 978-1-80590-184-6 (Print) / 978-1-80590-129-7 (Online)
Editor: Hisham AbouGrad
Conference date: 17 November 2025
Series: Applied and Computational Engineering
Volume number: Vol. 191
ISSN: 2755-2721 (Print) / 2755-273X (Online)