Research Article
Open Access
CC BY

Moving Object Tracking Using Context-Aware Attention Transformer

Yimeng Wang 1*
1 School of Cyber Security and Computer / Department of Computer Teaching, Hebei University
*Corresponding author: 2545250732@qq.com
Published on 4 July 2025
ACE Vol. 176
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-80590-239-3
ISBN (Online): 978-1-80590-240-9

Abstract

In video content analysis, accurately tracking and recognizing objects is a complex task. Existing research has focused primarily on complex scenes and fast-moving targets, yet small objects, long temporal dependencies, and occlusion remain challenging. In this paper, we propose the Intelli-Context Transformer for tracking objects in dynamic environments. To address these challenges, the Intelli-Context Transformer integrates attention mechanisms with contextual and semantic information to improve the accuracy of video object tracking. The model is trained end to end and incorporates a Contextual Spatiotemporal Attention Module that dynamically adjusts its focus across different sources of information to improve recognition accuracy. The proposed method captures and analyzes the spatiotemporal features of a single target in video in real time and handles tracking tasks in complex scenes effectively. Compared with state-of-the-art methods, the Intelli-Context Transformer demonstrates strong generalization in video object recognition. This research provides an efficient and reliable approach to dynamic target tracking in complex scenes and supports functions such as behavior analysis and anomaly detection, contributing to the development of intelligent video surveillance and navigation.
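The Contextual Spatiotemporal Attention Module is not detailed on this page, so the sketch below is only a hypothetical PyTorch illustration of the general idea the abstract describes: per-frame spatial self-attention and cross-frame temporal self-attention, mixed by a learned gate that dynamically re-weights the two context streams. The class name, tensor layout, and gating scheme are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class ContextualSpatiotemporalAttention(nn.Module):
    """Hypothetical sketch of a contextual spatiotemporal attention block."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Learned gate that re-weights spatial vs. temporal context per token
        # (an assumption; the paper may fuse the streams differently).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) per-frame patch/token features.
        b, t, n, d = x.shape
        # Spatial stream: attend across tokens within each frame.
        xs = x.reshape(b * t, n, d)
        s, _ = self.spatial_attn(xs, xs, xs)
        s = s.reshape(b, t, n, d)
        # Temporal stream: attend across frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        tm, _ = self.temporal_attn(xt, xt, xt)
        tm = tm.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Gate mixes the two context streams; residual connection plus norm.
        g = self.gate(torch.cat([s, tm], dim=-1))
        return self.norm(x + g * s + (1 - g) * tm)

if __name__ == "__main__":
    block = ContextualSpatiotemporalAttention(dim=256)
    feats = torch.randn(2, 8, 49, 256)  # 2 clips, 8 frames, 7x7 tokens each
    print(block(feats).shape)           # torch.Size([2, 8, 49, 256])

Because the block preserves the (batch, frames, tokens, dim) shape, several such layers could be stacked inside a transformer encoder for end-to-end training, as the abstract suggests.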

Keywords:

Object Detection, Object Tracking, Transformer Architecture, Attention Mechanism, Deep Learning


References

[1]. O. Abdelaziz, M. Shehata, and M. Mohamed, “Beyond traditional single object tracking: A survey,” 2024, arXiv:2405.10439. [Online]. Available: https://arxiv.org/abs/2405.10439

[2]. X. Wang and Z. Zhu, “Context understanding in computer vision: A survey,” Comput. Vis. Image Underst., vol. 229, p. 103646, Mar. 2023, doi: 10.1016/j.cviu.2023.103646.

[3]. S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, “A survey of modern deep learning based object detection models,” 2021, arXiv:2104.11892. [Online]. Available: https://arxiv.org/abs/2104.11892

[4]. G. Ciaparrone, F. Luque Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, “Deep learning in video multi-object tracking: A survey,” Neurocomputing, vol. 381, pp. 61–88, Mar. 2020, doi: 10.1016/j.neucom.2019.11.023.

[5]. A. Kamboj, “The progression of transformers from language to vision to MOT: A literature review on multi-object tracking with transformers,” 2024, arXiv:2406.16784. [Online]. Available: https://arxiv.org/abs/2406.16784

[6]. H. Ouyang, Q. Wang, Y. Xiao, Q. Bai, J. Zhang, K. Zheng, X. Zhou, Q. Chen, and Y. Shen, “CoDeF: Content deformation fields for temporally consistent video processing,” 2023.

[7]. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (NIPS 2017), pp. 6000–6010, Dec. 2017.

[8]. W. G. C. Bandara and V. M. Patel, “A transformer-based Siamese network for change detection,” 2022, arXiv:2201.01293. [Online]. Available: https://arxiv.org/abs/2201.01293

[9]. L. Lin, H. Fan, Z. Zhang, Y. Xu, and H. Ling, “SwinTrack: A simple and strong baseline for transformer tracking,” 2022, arXiv:2112.00995. [Online]. Available: https://arxiv.org/abs/2112.00995

[10]. Y. Cui, C. Jiang, L. Wang, and G. Wu, “MixFormer: End-to-end tracking with iterative mixed attention,” 2022, arXiv:2203.11082. [Online]. Available: https://arxiv.org/abs/2203.11082

[11]. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” 2016, arXiv:1506.01497. [Online]. Available: https://arxiv.org/abs/1506.01497

[12]. A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE Int. Conf. Image Process. (ICIP), pp. 3464–3468, Sep. 2016, doi: 10.1109/ICIP.2016.7533003.

[13]. G. Ciaparrone, F. Luque Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, “Deep learning in video multi-object tracking: A survey,” Neurocomputing, vol. 381, pp. 61–88, Mar. 2020, doi: 10.1016/j.neucom.2019.11.023.

[14]. T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2017, arXiv:1609.02907. [Online]. Available: https://arxiv.org/abs/1609.02907

[15]. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017, arXiv:1706.03762. [Online]. Available: https://arxiv.org/abs/1706.03762

[16]. B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting,” in Proc. 27th Int. Joint Conf. Artif. Intell. (IJCAI), pp. 3634–3640, Jul. 2018, doi: 10.24963/ijcai.2018/505.

[17]. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016, arXiv:1506.02640. [Online]. Available: https://arxiv.org/abs/1506.02640

[18]. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” 2017, arXiv:1612.03144. [Online]. Available: https://arxiv.org/abs/1612.03144

[19]. J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767

[20]. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” 2020, arXiv:2005.12872. [Online]. Available: https://arxiv.org/abs/2005.12872

[21]. J. Xie, B. Zhong, Z. Mo, S. Zhang, L. Shi, S. Song, and R. Ji, “Autoregressive queries for adaptive tracking with spatio-temporal transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, doi: 10.1109/CVPR52733.2024.01826.

Cite this article

Wang, Y. (2025). Moving Object Tracking Using Context-Aware Attention Transformer. Applied and Computational Engineering, 176, 8-15.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

About volume

Volume title: Proceedings of the 3rd International Conference on Machine Learning and Automation

ISBN: 978-1-80590-239-3 (Print) / 978-1-80590-240-9 (Online)
Editor: Hisham AbouGrad
Conference date: 17 November 2025
Series: Applied and Computational Engineering
Volume number: Vol. 176
ISSN: 2755-2721 (Print) / 2755-273X (Online)