Big-Little-Video-Net: Work smarter, not harder, for video understanding
Authors
- Quanfu Fan
- Hildegard Kühne
- David Cox
- Chun-Fu (Richard) Chen
- Marco Pistoia
Published on
12/10/2019
Video understanding has made rapid progress in recent years, and many approaches use complex 3D convolutional neural networks (3D CNNs) to learn spatio-temporal representations for recognition. The popularity of 3D models versus simpler 2D models might be attributed to a perceived advantage in processing spatial and temporal information jointly, whereas 2D models process these separately. However, 3D models need to be quite deep and require long sequences of input frames. This makes training and inference of a 3D model very expensive. So the question arises: how useful is that extra dimension?
Perhaps we’re overcomplicating things.
More is less in AI video understanding
In our paper, More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation, presented at NeurIPS 2019, we show that a simple, memory-friendly 2D network architecture can outperform more sophisticated 3D ones, and at a fraction of the cost. In other words, simpler is better.
What’s exciting here is that we can show a lightweight architecture can combine frames at different resolutions to learn video information efficiently in both space and time. We also show that, by entangling spatial and temporal modeling, a simple way of aggregating temporal features can outperform more costly 3D approaches. This means video understanding need not be prohibitively costly in real-world settings.
Step 1: Big-Little-Video-Net (bLVNet)
We call our architecture Big-Little-Video-Net (bLVNet) because the first part of our approach is inspired by the Big-Little-Net architecture (bLNet). Building on bLNet, we design a video architecture with two network branches of different complexity: one branch processes low-resolution frames in a very deep subnet, while the other processes high-resolution frames in a compact subnet. The two branches complement each other by merging at the end of each network layer. Using bLVNet, we can process twice as many frames as the baseline model without compromising efficiency.
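To make this concrete, here is a minimal PyTorch sketch of a single two-branch stage. It is an illustration rather than the authors' implementation: the class name BigLittleBlock, the branch depths, the 2x downsampling factor, and merging by element-wise addition are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigLittleBlock(nn.Module):
    """One two-branch stage: a deep 'big' branch on low-resolution input and a
    compact 'little' branch on high-resolution input, merged at the end."""

    def __init__(self, channels):
        super().__init__()
        # Deep branch: more layers, but applied to a downsampled (cheaper) input.
        self.big = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Compact branch: a single cheap layer on the full-resolution input.
        self.little = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x_high):
        # Low-resolution view for the deep branch (half the spatial size).
        x_low = F.avg_pool2d(x_high, kernel_size=2)
        big_out = self.big(x_low)
        little_out = self.little(x_high)
        # Upsample the deep branch and merge the two branches by addition.
        big_up = F.interpolate(big_out, size=x_high.shape[-2:],
                               mode="bilinear", align_corners=False)
        return big_up + little_out

# Example: a clip of 8 frames with 64 channels at 56x56 resolution.
frames = torch.randn(8, 64, 56, 56)
out = BigLittleBlock(64)(frames)
print(out.shape)  # torch.Size([8, 64, 56, 56])
```

The intuition is that the expensive, deep computation runs on a cheap low-resolution view, while the light branch preserves high-resolution detail, which is what lets the network afford more input frames overall.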
Step 2: Depthwise Temporal Aggregation Module (TAM)
While this is exciting, bLVNet suffers from a key limitation: it cannot capture temporal dependencies, which are crucial for video understanding. To address this, we further develop a method we call the Depthwise Temporal Aggregation Module (TAM). The method enables the exchange of temporal information between frames through weighted channel-wise aggregation. The TAM is implemented as an independent network module, and it is so compact that it adds only negligible computational cost and parameters to bLVNet.
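As a rough sketch of what weighted channel-wise aggregation over time can look like, the module below applies a depthwise 1D convolution along the frame axis, so each channel learns its own small set of temporal weights. The class name, kernel size, and residual connection are illustrative assumptions rather than the exact TAM formulation from the paper.

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Aggregate each channel across neighboring frames with learned
    channel-wise (depthwise) weights."""

    def __init__(self, channels, n_frames, kernel_size=3):
        super().__init__()
        self.n_frames = n_frames
        # Depthwise 1D convolution over time: one small weight vector per
        # channel, so it adds only ~channels * kernel_size parameters.
        self.aggregate = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=channels, bias=False)

    def forward(self, x):
        # x: (batch * n_frames, C, H, W), as produced by a 2D backbone.
        nt, c, h, w = x.shape
        n = nt // self.n_frames
        # Move time onto the convolution axis: (batch * H * W, C, T).
        y = x.view(n, self.n_frames, c, h * w).permute(0, 3, 2, 1)
        y = y.reshape(n * h * w, c, self.n_frames)
        y = self.aggregate(y)
        # Restore the original layout and keep a residual path to the input.
        y = y.reshape(n, h * w, c, self.n_frames).permute(0, 3, 2, 1)
        y = y.reshape(nt, c, h, w)
        return x + y

# Example: features for 2 clips of 8 frames each, 64 channels, 28x28 maps.
feats = torch.randn(2 * 8, 64, 28, 28)
tam = TemporalAggregation(channels=64, n_frames=8)
print(tam(feats).shape)  # torch.Size([16, 64, 28, 28])
```

Because the aggregation is depthwise, its parameter count grows only with the number of channels and the temporal kernel size, which is why such a module can be dropped into a 2D backbone at negligible cost.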
Experimental Results
Putting bLVNet and TAM together, we see surprising results against today’s best-performing models, all of which are based on 3D convolutions. The table in the paper shows our results on the Moments-in-Time dataset, a large-scale action dataset with about three times more training samples than the popular Kinetics-400 benchmark, which we also evaluate on.
Scalable video understanding
Today only a handful of powerful companies and governments can afford to do research on large-scale video understanding models, let alone deploy these models in real-world settings. With a more efficient, memory-friendly architecture like bLVNet-TAM, we can make these action models more accessible, affordable, and scalable.
Perhaps more importantly, this research teaches us a broader lesson: working harder is not always optimal; often we need to work smarter. This sometimes means challenging our assumptions and going back to first principles. In this case, many people took it for granted that 3D convolution would naturally be superior to 2D. Indeed, sometimes it might be, but not always. Sometimes the surprising and impactful results start by questioning what we think we know for sure.
Sometimes you have to go backward before you can go forward.
Please cite our work using the BibTeX below.
@incollection{NIPS2019_8498,
title = {More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation},
author = {Fan, Quanfu and Chen, Chun-Fu (Richard) and Kuehne, Hilde and Pistoia, Marco and Cox, David},
booktitle = {Advances in Neural Information Processing Systems 32},
editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
pages = {2261--2270},
year = {2019},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/8498-more-is-less-learning-efficient-video-representations-by-big-little-network-and-depthwise-temporal-aggregation.pdf}
}