AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
Authors
- Yue Meng
- Chung-Ching Lin
- Rameswar Panda
- Prasanna Sattigeri
- Leonid Karlinsky
- Aude Oliva
- Kate Saenko
- Rogerio Feris
Published on
08/23/2020
Action recognition is an open and challenging problem in computer vision. While current state-of-the-art models offer excellent recognition results, their computational expense limits their impact for many real-world applications. In this paper, we propose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard backpropagation. Extensive experiments on several challenging action recognition benchmark datasets well demonstrate the efficacy of our proposed approach over state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AR-Net.
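The per-frame selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate resolutions, the stand-in policy scores, and the helper names `select_resolution` and `relative_cost` are all hypothetical, and the cost estimate assumes compute scales with pixel count.

```python
from typing import Sequence

# Hypothetical candidate resolutions (square side length in pixels); assumed for illustration.
RESOLUTIONS = [224, 168, 112, 84]


def select_resolution(logits: Sequence[float]) -> int:
    """Return the candidate resolution with the highest policy score."""
    if len(logits) != len(RESOLUTIONS):
        raise ValueError("expected one score per candidate resolution")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return RESOLUTIONS[best]


def relative_cost(chosen: Sequence[int], full: int = RESOLUTIONS[0]) -> float:
    """Estimate compute relative to always using `full`.

    Assumption: per-frame cost is proportional to the number of pixels.
    """
    return sum(r * r for r in chosen) / (len(chosen) * full * full)


# Stand-in policy outputs for three frames (a real policy network would produce these).
frame_logits = [
    [2.0, 0.1, -1.0, -2.0],
    [0.0, 0.2, 1.5, 0.3],
    [-1.0, 0.0, 0.5, 2.2],
]
chosen = [select_resolution(scores) for scores in frame_logits]
print(chosen)
print(round(relative_cost(chosen), 3))
```

In this toy setting, frames that the policy scores as needing less detail are processed at a lower resolution, which is where the efficiency gain would come from under the stated cost assumption.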
Please cite our work using the BibTeX below.
@inproceedings{10.1007/978-3-030-58571-6_6,
author = {Meng, Yue and Lin, Chung-Ching and Panda, Rameswar and Sattigeri, Prasanna and Karlinsky, Leonid and Oliva, Aude and Saenko, Kate and Feris, Rogerio},
title = {AR-Net: Adaptive Frame Resolution for Efficient Action Recognition},
year = {2020},
isbn = {978-3-030-58570-9},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-030-58571-6_6},
doi = {10.1007/978-3-030-58571-6_6},
abstract = {Action recognition is an open and challenging problem in computer vision. While current state-of-the-art models offer excellent recognition results, their computational expense limits their impact for many real-world applications. In this paper, we propose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on several challenging action recognition benchmark datasets well demonstrate the efficacy of our proposed approach over state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AR-Net.},
booktitle = {Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII},
pages = {86--104},
numpages = {19},
keywords = {Multi-resolution processing, Adaptive learning, Efficient action recognition},
location = {Glasgow, United Kingdom}
}