Cross-Modal Discrete Representation Learning
Authors
- Aude Oliva
- James Glass
- Alexander H. Liu
- SouYoung Jin
- Cheng-I Jeff Lai
- Andrew Rouditchenko
Published on
05/27/2022
In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework capable of learning a representation that captures finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. We show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
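
To make the two core ideas concrete, below is a minimal PyTorch sketch of (i) quantizing fine-grained features from each modality against a single shared codebook and (ii) a cross-modal code matching term that pushes paired inputs toward similar distributions over the codes. The class and function names, the softmax-based soft assignments, and the symmetric KL comparison are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the authors' released code) of a shared codebook with
# vector quantization and a cross-modal code matching loss. Names, dimensions,
# and the straight-through estimator details are assumptions.
import torch
import torch.nn.functional as F


class SharedCodebookVQ(torch.nn.Module):
    """Vector quantization against a single codebook used by all modalities."""

    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, dim)

    def forward(self, features: torch.Tensor):
        # features: (batch, length, dim) fine-grained features (e.g., video frames
        # or speech frames) produced by a modality-specific encoder.
        flat = features.reshape(-1, features.size(-1))             # (B*L, D)
        dists = torch.cdist(flat, self.codebook.weight)            # (B*L, K)
        codes = dists.argmin(dim=-1)                               # nearest code ids
        quantized = self.codebook(codes).view_as(features)
        # Straight-through estimator so gradients still reach the encoder.
        quantized = features + (quantized - features).detach()
        # Soft assignment over codes, used by the code matching loss below.
        probs = F.softmax(-dists, dim=-1).view(*features.shape[:-1], -1)
        return quantized, probs


def code_matching_loss(probs_a: torch.Tensor, probs_b: torch.Tensor) -> torch.Tensor:
    """Encourage paired inputs from two modalities to use similar code distributions.

    probs_*: (batch, length, num_codes) soft code assignments. We pool over the
    sequence to obtain one distribution per clip/utterance and compare the two
    with a symmetric KL divergence (one plausible choice of divergence).
    """
    p = probs_a.mean(dim=1).clamp_min(1e-8)
    q = probs_b.mean(dim=1).clamp_min(1e-8)
    kl_pq = (p * (p.log() - q.log())).sum(-1)
    kl_qp = (q * (q.log() - p.log())).sum(-1)
    return 0.5 * (kl_pq + kl_qp).mean()


if __name__ == "__main__":
    vq = SharedCodebookVQ()
    video_feats = torch.randn(4, 32, 256)   # e.g., 32 video frames per clip
    audio_feats = torch.randn(4, 100, 256)  # e.g., 100 speech frames per utterance
    _, video_probs = vq(video_feats)
    _, audio_probs = vq(audio_feats)
    loss = code_matching_loss(video_probs, audio_probs)
    print(loss.item())
```

Because both modalities index into the same codebook, matching their code distributions ties individual codes to semantic concepts that recur across modalities, which is what enables the unsupervised cross-modal localization described above.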