The Latest Research Progress of Attention Mechanism in Deep Learning

  • Xu Jiang, School of Software, Harbin Institute of Information Technology, Harbin 150431, Heilongjiang, China
  • Xiaoling Bai, School of Software, Harbin Institute of Information Technology, Harbin 150431, Heilongjiang, China
  • Lifeng Yin, School of Intelligent Railway Engineering, Dalian Jiaotong University, Dalian 116028, Liaoning, China
Keywords: Natural language processing, Computer vision, Attention mechanism, Large models

Abstract

With the development of artificial intelligence and deep learning, the attention mechanism has become a key technique for improving model performance on complex tasks. This paper reviews the evolution of attention mechanisms, from soft attention and hard attention to recent innovations such as multi-head latent attention and cross-attention. It focuses on the latest research results, including lightning attention, the PADRe polynomial attention drop-in replacement, the context anchor attention module, and improvements to attention mechanisms for large models. These advances improve model efficiency and accuracy and broaden the application potential of attention mechanisms in fields such as computer vision, natural language processing, and remote sensing object detection. The review aims to give readers a comprehensive understanding of the area and to stimulate innovative thinking.

Published
2025-05-29