Cross-Modal Discrete Representation Learning
Authors
- Aude Oliva
- James Glass
- Alexander H. Liu
- SouYoung Jin
- Cheng-I Jeff Lai
- Andrew Rouditchenko
Published on
05/27/2022
In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space such that cross-modal objects/actions localization can be performed without direct supervision. We show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
This work was presented at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
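A minimal PyTorch-style sketch of the core idea, not the paper's implementation: fine-grained features from two modalities are softly assigned to a single shared codebook, and a matching loss penalizes divergence between the resulting per-clip code distributions. The codebook size, the softmax-over-negative-distance assignment, the symmetric KL formulation, and all names and shapes below are illustrative assumptions; see the paper for the exact Cross-Modal Code Matching objective and quantization details.

import torch
import torch.nn.functional as F

class SharedCodebook(torch.nn.Module):
    """A single codebook of discrete embeddings shared by all modalities (hypothetical module)."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codes = torch.nn.Parameter(torch.randn(num_codes, dim))

    def code_distribution(self, features):
        # features: (batch, seq_len, dim) fine-grained features from one modality encoder.
        # Soft assignment of each feature to the codes via softmax over negative L2 distance.
        codes = self.codes.unsqueeze(0).expand(features.size(0), -1, -1).contiguous()
        dists = torch.cdist(features, codes)        # (batch, seq_len, num_codes)
        return F.softmax(-dists, dim=-1)

def cross_modal_code_matching_loss(p_video, p_audio):
    # Pool the per-position code assignments into one distribution per clip, then
    # penalize divergence between the two modalities' distributions (symmetric KL here,
    # an assumption for illustration).
    q_v = p_video.mean(dim=1)                       # (batch, num_codes)
    q_a = p_audio.mean(dim=1)
    kl_va = F.kl_div(q_a.clamp_min(1e-8).log(), q_v, reduction="batchmean")
    kl_av = F.kl_div(q_v.clamp_min(1e-8).log(), q_a, reduction="batchmean")
    return 0.5 * (kl_va + kl_av)

# Toy usage with paired fine-grained features from two modality encoders (hypothetical shapes).
codebook = SharedCodebook()
video_feats = torch.randn(4, 49, 256)    # e.g. a 7x7 spatial grid of visual features per clip
audio_feats = torch.randn(4, 100, 256)   # e.g. 100 audio frames per clip
loss = cross_modal_code_matching_loss(
    codebook.code_distribution(video_feats),
    codebook.code_distribution(audio_feats),
)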
Please cite our work using the BibTeX below.
@inproceedings{liu-etal-2022-cross,
    title = "Cross-Modal Discrete Representation Learning",
    author = "Liu, Alexander and
      Jin, SouYoung and
      Lai, Cheng-I and
      Rouditchenko, Andrew and
      Oliva, Aude and
      Glass, James",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.215",
    doi = "10.18653/v1/2022.acl-long.215",
    pages = "3013--3035",
    abstract = "In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space such that cross-modal objects/actions localization can be performed without direct supervision. We show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.",
}