AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition


Computer Vision ICLR

Temporal modelling is key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly reduce computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with the current pruned feature maps, with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something V1 & V2, Jester, and Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods.
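
The channel-fusion idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a hypothetical per-channel policy with three actions (keep the current channel, reuse the channel from the previous frame, or skip it), and the policy is supplied by hand rather than learned. The names `Action`, `fuse_channels`, and `computed_fraction` are illustrative only.

```python
from enum import Enum
from typing import List, Sequence


class Action(Enum):
    """Hypothetical per-channel decisions (an assumption, not the paper's API)."""
    KEEP = "keep"    # use the channel computed for the current frame
    REUSE = "reuse"  # copy the channel from the previous frame's feature map
    SKIP = "skip"    # zero the channel: nothing computed, no history used


Channel = List[float]  # one flattened feature-map channel


def fuse_channels(
    current: Sequence[Channel],
    history: Sequence[Channel],
    policy: Sequence[Action],
) -> List[Channel]:
    """Build the fused feature map channel by channel according to the policy."""
    if not (len(current) == len(history) == len(policy)):
        raise ValueError("current, history, and policy must have the same length")
    fused: List[Channel] = []
    for cur, past, action in zip(current, history, policy):
        if action is Action.KEEP:
            fused.append(list(cur))
        elif action is Action.REUSE:
            fused.append(list(past))
        else:  # Action.SKIP
            fused.append([0.0] * len(cur))
    return fused


def computed_fraction(policy: Sequence[Action]) -> float:
    """Fraction of channels that must actually be computed for the current frame."""
    if not policy:
        raise ValueError("policy must not be empty")
    return sum(action is Action.KEEP for action in policy) / len(policy)


# Toy example: 4 channels with 2 values each.
current = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
history = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
policy = [Action.KEEP, Action.REUSE, Action.SKIP, Action.KEEP]

fused = fuse_channels(current, history, policy)
print(fused)                      # [[1.0, 2.0], [0.3, 0.4], [0.0, 0.0], [7.0, 8.0]]
print(computed_fraction(policy))  # 0.5
```

In this toy setting only the channels marked `KEEP` would need to be computed for the current frame, which is where the computation savings in such a scheme would come from. The 0.5 here is a property of the hand-picked policy and is unrelated to the savings reported in the paper.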

This paper has been published at ICLR 2021.

Please cite our work using the BibTeX below.

title={AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition},
author={Yue Meng and Rameswar Panda and Chung-Ching Lin and Prasanna Sattigeri and Leonid Karlinsky and Kate Saenko and Aude Oliva and Rogerio Feris},
booktitle={International Conference on Learning Representations},