Big-Little-Video-Net: Work smarter, not harder, for video understanding



Video understanding has made rapid progress in recent years, and many approaches use complex 3D convolutional neural networks (3D CNNs) to learn spatio-temporal representations for recognition. The popularity of 3D models versus simpler 2D models might be attributed to a perceived advantage in processing spatial and temporal information jointly, whereas 2D models process these separately. However, 3D models need to be quite deep and require long sequences of input frames. This makes training and inference of a 3D model very expensive. So the question arises: how useful is that extra dimension?

Perhaps we’re overcomplicating things.

More is less in AI video understanding

In our paper, More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation, presented at NeurIPS 2019, we show that a simple, memory-friendly 2D network architecture can outperform more sophisticated 3D ones at a fraction of the cost. In other words, simpler is better.

What’s exciting here is we can show that a lightweight architecture can combine frames at different resolutions to learn video information in both space and time efficiently. We also show that by entangling spatial modeling and temporal modeling, a simple way of aggregating temporal features can outperform more costly 3D approaches. This means video understanding need not be prohibitively costly in real-world settings.

Step 1: Big-Little-Video-Net (bLVNet)

We call our architecture Big-Little-Video-Net (bLVNet) because the first part of our approach is inspired by the Big-Little-Net architecture (bLNet). To bLNet we add a new video architecture that has two network branches with different complexities: one branch processing low-resolution frames in a very deep subnet, and another branch processing high-resolution frames in a compact subnet. The two branches complement each other through merging at the end of each network layer. Using bLVNet, we can process twice as many frames as the baseline model without compromising efficiency.

Figure: Different architectures for action recognition.

Above we show different architectures for action recognition: a) TSN uses a shared CNN to process each frame independently, so there is no temporal interaction between frames. b) TSN-bLNet is a variant of TSN that uses bLNet [8] as its backbone. It is efficient, but still lacks temporal modeling. c) bLVNet feeds odd and even frames separately into different branches in bLNet. The branch merging at each layer (local fusion) captures short-term temporal dependencies between adjacent frames. d) bLVNet-TAM includes the proposed aggregation module, represented as a red box, which further empowers bLVNet to model long-term temporal dependencies across frames (global fusion).
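To make the odd/even split and the local fusion in c) more concrete, here is a minimal NumPy sketch of a single big-little layer. It is an illustration under stated assumptions, not the code from the paper: the helper names (`downsample2x`, `upsample2x`, `big_little_layer`), the choice of which frames go to which branch, the 2x average pooling, the nearest-neighbor upsampling, and merging by addition are all assumptions, and the two subnets are passed in as plain functions.

```python
import numpy as np

def downsample2x(x):
    """Halve spatial resolution by 2x2 average pooling. x: (T, C, H, W)."""
    t, c, h, w = x.shape
    return x.reshape(t, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def upsample2x(x):
    """Double spatial resolution by nearest-neighbor repetition."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def big_little_layer(frames, big_fn, little_fn):
    """One hypothetical big-little video layer.

    frames:    (T, C, H, W) with T, H and W even.
    big_fn:    stand-in for the deep subnet, applied to low-resolution frames.
    little_fn: stand-in for the compact subnet, applied to high-resolution frames.
    Both stand-ins are assumed to preserve the shape of their input.
    """
    t, _, h, w = frames.shape
    assert t % 2 == 0 and h % 2 == 0 and w % 2 == 0
    # Assumption: frames at even indices feed the big branch, odd the little one.
    big_in = downsample2x(frames[0::2])
    little_in = frames[1::2]
    # Local fusion: bring the big branch back to full resolution and add.
    return upsample2x(big_fn(big_in)) + little_fn(little_in)

# Usage with trivial stand-ins for the two subnets.
frames = np.random.rand(8, 3, 32, 32)
fused = big_little_layer(frames, big_fn=lambda x: 2.0 * x, little_fn=lambda x: x)
print(fused.shape)  # (4, 3, 32, 32): each adjacent frame pair yields one fused map
```

Because each fused feature map mixes two neighboring frames, the merge only sees short-range motion, which is the gap the aggregation module in d) is meant to close.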

Step 2: Depthwise Temporal Aggregation Module (TAM)

While this is exciting, bLVNet suffers from a key limitation: on its own it captures only short-term dependencies between adjacent frames, not the long-term temporal dependencies that are crucial for video understanding. To address this, we developed the Depthwise Temporal Aggregation Module (TAM), which enables the exchange of temporal information between frames through weighted channel-wise aggregation. The TAM is implemented as an independent network module and is extremely compact, so it adds only negligible computational cost and parameters to bLVNet.
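A back-of-the-envelope count gives a feel for how compact channel-wise aggregation is. The sketch below assumes one learned weight per channel for each of `r` temporal offsets and no bias terms, and compares that against a dense `k×k×k` 3D convolution with the same number of input and output channels. The function names, the window size `r = 3`, and the channel count of 256 are illustrative assumptions, not figures from the paper.

```python
def tam_params(channels, r=3):
    """Weights in one aggregation module: one scalar per channel per offset."""
    return r * channels

def conv3d_params(c_in, c_out, k=3):
    """Weights in a dense k x k x k 3D convolution, ignoring bias."""
    return k * k * k * c_in * c_out

channels = 256  # hypothetical layer width
print(tam_params(channels))               # 768
print(conv3d_params(channels, channels))  # 1769472
print(conv3d_params(channels, channels) // tam_params(channels))  # 2304
```

Under these assumptions the aggregation module grows linearly with the channel count, while a dense 3D convolution grows quadratically.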

Figure: Temporal aggregation module (TAM).

Above we show the temporal aggregation module (TAM). The TAM takes as input a batch of tensors, each of which is the activation of a frame, and produces a batch of tensors with the same order and dimension. The module consists of three operations: 1) 1×1 depthwise convolutions to learn a weight for each feature channel; 2) temporal shifts (left or right, in the direction indicated by the smaller arrows; the white cubes are zero-padded tensors); and 3) aggregation by summing up the weighted activations from 1).
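The three operations can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's implementation: the function name `temporal_aggregation`, the `(r, C)` weight layout, and the odd window size `r` are assumptions, the weights are plain arrays instead of learned parameters, and any activation or normalization applied after the sum is left out.

```python
import numpy as np

def temporal_aggregation(x, weights):
    """Weighted channel-wise aggregation over neighboring frames.

    x:       (T, C, H, W) activations, one tensor per frame.
    weights: (r, C) one scalar per channel for each of r temporal offsets
             (a 1x1 depthwise convolution reduces to such a per-channel scale).
    Returns an array with the same shape and frame order as x.
    """
    x = np.asarray(x, dtype=float)
    t = x.shape[0]
    r = weights.shape[0]
    assert r % 2 == 1, "this sketch assumes an odd, centered window"
    half = r // 2
    out = np.zeros(x.shape)
    for j, offset in enumerate(range(-half, half + 1)):
        # 1) Weight each feature channel.
        weighted = x * weights[j][None, :, None, None]
        # 2) Temporal shift with zero padding: output frame i receives frame i + offset.
        shifted = np.zeros(x.shape)
        lo, hi = max(0, -offset), min(t, t - offset)
        shifted[lo:hi] = weighted[lo + offset:hi + offset]
        # 3) Aggregate by summation.
        out += shifted
    return out

# Usage: average each frame with its two temporal neighbors.
x = np.random.rand(8, 16, 7, 7)
w = np.full((3, 16), 1.0 / 3.0)
y = temporal_aggregation(x, w)
print(y.shape)  # (8, 16, 7, 7)
```

Stacking such a module at several depths lets information travel further in time with each layer, which is one way to read the "global fusion" described earlier.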

Experimental Results

Putting bLVNet and TAM together, we see surprising results against today’s best-performing models, which are all based on 3D convolutions. The table provided in the paper shows our results on the Moments-in-Time dataset, a large-scale action dataset with about three times more training samples than the popular Kinetics-400 benchmark, on which we also evaluate our models.

Scalable video understanding

Today only a handful of powerful companies and governments can afford to do research on large-scale video understanding models, let alone deploy these models in real-world settings. With a more efficient, memory-friendly architecture like bLVNet-TAM, we can make these action models more accessible, affordable, and scalable.

Perhaps more important, this research teaches us a lesson: working harder is not always optimal; often we need to work smarter. This sometimes means challenging our assumptions and going back to first principles. In this case, many people took for granted that 3D convolution would naturally be superior to 2D. Indeed, sometimes it might be, but not always. Sometimes the most surprising and impactful results come from questioning what we think we know for sure.

Sometimes you have to go backward before you can go forward.

Please cite our work using the BibTeX below.

    title = {More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation},
    author = {Fan, Quanfu and Chen, Chun-Fu (Richard) and Kuehne, Hilde and Pistoia, Marco and Cox, David},
    booktitle = {Advances in Neural Information Processing Systems 32},
    editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
    pages = {2261--2270},
    year = {2019},
    publisher = {Curran Associates, Inc.},
    url = {}