Semantic Segmentation in the Era of Foundation Models: Technical Evolution and Applications

Zhaolin Yu

doi:10.54254/2755-2721/2025.BJ24684

Applied and Computational Engineering (Open access)

Research Article

Zhaolin Yu ¹*

¹ Beijing University of Posts and Telecommunications, Beijing, China, 100876

* Corresponding author: zhuzaiyuhudie@gmail.com

Published on 4 July 2025

ACE Vol.177

ISSN (Print): 2755-273X

ISSN (Online): 2755-2721

ISBN (Print): 978-1-80590-241-6

ISBN (Online): 978-1-80590-242-3

Abstract

Semantic segmentation has evolved from traditional computer vision approaches to deep learning architectures and, most recently, to foundation models. This survey examines the technical progression of semantic segmentation methods, with particular emphasis on vision foundation models such as the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP). It analyzes how these large-scale pretrained models enable capabilities that were previously out of reach, including zero-shot learning and cross-domain generalization, and it identifies persistent challenges in computational efficiency and boundary precision. The survey covers applications in medical imaging, remote sensing, and video understanding, and reports both the benefits and the technical limitations observed there. The paper concludes that foundation models represent a paradigm shift that calls for hybrid approaches combining general capabilities with domain-specific optimizations.

Keywords: Semantic Segmentation, Foundation Models, SAM, CLIP, Zero-shot Learning
