Semantic Segmentation in the Era of Foundation Models: Technical Evolution and Applications
Research Article
Open Access
CC BY

Zhaolin Yu 1*
1 Beijing University of Posts and Telecommunications, Beijing, China, 100876
*Corresponding author: zhuzaiyuhudie@gmail.com
Published on 4 July 2025
ACE Vol.177
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-80590-241-6
ISBN (Online): 978-1-80590-242-3

Abstract

Semantic segmentation has evolved from traditional computer vision approaches to deep learning architectures and, most recently, to foundation models. This survey examines the technical progression of semantic segmentation methods, with particular emphasis on vision foundation models such as the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP). It analyzes how these large-scale pretrained models enable capabilities that were previously out of reach, including zero-shot learning and cross-domain generalization, and identifies persistent challenges in computational efficiency and boundary precision. The survey covers applications in medical imaging, remote sensing, and video understanding, revealing both substantial benefits and technical limitations. It concludes that foundation models represent a fundamental paradigm shift, one that requires hybrid approaches combining general capabilities with domain-specific optimizations.

Keywords:

Semantic Segmentation, Foundation Models, SAM, CLIP, Zero-shot Learning

Yu, Z. (2025). Semantic Segmentation in the Era of Foundation Models: Technical Evolution and Applications. Applied and Computational Engineering, 177, 10-15.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of CONF-MLA 2025 Symposium: Applied Artificial Intelligence Research

ISBN: 978-1-80590-241-6 (Print) / 978-1-80590-242-3 (Online)
Editor: Hisham AbouGrad
Conference website: https://2025.confmla.org/
Conference date: 3 September 2025
Series: Applied and Computational Engineering
Volume number: Vol.177
ISSN: 2755-2721 (Print) / 2755-273X (Online)