Humans reason with concepts and metaconcepts: we recognize red and blue from visual input; we also understand that they are colors, i.e., red is an instance of color. In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts from images and associated question-answer pairs. The key is to exploit the bidirectional connection between visual concepts and metaconcepts. Visual representations provide grounding cues for predicting relations between unseen pairs of concepts. Knowing that red and blue are instances of color, we generalize to the fact that green is also an instance of color since they all categorize the hue of objects. Meanwhile, knowledge about metaconcepts empowers visual concept learning from limited, noisy, and even biased data. From just a few examples of purple cubes we can understand a new color purple, which resembles the hue of the cubes instead of the shape of them. Evaluation on both synthetic and real-world datasets validates our claims.
In this post, we share a brief Q&A with the authors of the paper, Visual Concept-Metaconcept Learning, presented at NeurIPS 2019.
How would you describe your paper in less than 5 words?
Jointly learning concepts and metaconcepts.
What is your paper about?
This paper is inspired by humans’ ability to reason with concepts and metaconcepts: we recognize _red_ and _green_ from visual input; we also understand that they _describe the same property of objects_ (i.e., the color). In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts.
VCML is a unified symbolic-reasoning framework running on images and associated question-answer pairs. Our training data are composed of two parts: 1) questions about the visual grounding of concepts (e.g., is there any red cube?), and 2) questions about the abstract relations between concepts (e.g., do red and green describe the same property of objects?).
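To make the two kinds of training questions concrete, here is a minimal Python sketch of how such question-answer pairs might be represented. The `Question` class, its field names, the `scene_0001` identifier, and the example answers are all hypothetical and not taken from the VCML codebase. The sketch also assumes that metaconcept questions carry no image.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    """One training example (hypothetical layout, for illustration only)."""
    text: str
    answer: bool
    concepts: List[str]
    image_id: Optional[str] = None  # assumed to be None for metaconcept questions

    @property
    def kind(self) -> str:
        # Questions grounded in an image are "visual"; the rest are "metaconcept".
        return "visual" if self.image_id is not None else "metaconcept"


# 1) A question about the visual grounding of concepts.
visual_q = Question(
    text="Is there any red cube?",
    answer=True,  # made-up answer
    concepts=["red", "cube"],
    image_id="scene_0001",  # made-up identifier
)

# 2) A question about the abstract relation between two concepts.
meta_q = Question(
    text="Do red and green describe the same property of objects?",
    answer=True,
    concepts=["red", "green"],
)

print(visual_q.kind, meta_q.kind)
```

Keeping both question types in one structure reflects the idea of joint training: the same concept names appear in visually grounded questions and in purely relational ones.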
What is new and significant about your paper?
In this paper, we study the new challenge of incorporating metaconcepts, i.e., relational concepts about concepts, into visual concept learning. Beyond just learning from questions regarding visual scenes (e.g., is there any red cube?), our model learns from questions about metaconcepts (e.g., do red and yellow describe the same property of objects?).
As evidence of the significance of metaconcepts, VCML generalizes well in two ways. It can successfully categorize objects with new combinations of visual attributes, or objects whose attributes have limited training data; it can also predict relations between unseen pairs of concepts. We present a systematic evaluation on both synthetic and real-world images, with a focus on learning efficiency and strong generalization.
What will the impact be on the real world?
This paper provides clues about how abstract concepts in natural language could be learned and understood. The model in this paper learns visual concepts through a linguistic interface. It answers visual-reasoning questions about visual scenes, which provide visual grounding for concepts. The learned concepts then provide cues for predicting relational metaconcepts between pairs of concepts.
In addition, this paper shows one way in which metaconcepts could help in learning concepts in natural language. For example, with only a few examples of purple cubes, it may be hard to tell whether the concept purple describes the shape or the hue of these objects. However, given the metaconcept knowledge that purple and red describe the same property of objects, we can understand that purple is actually a new type of color.
What would be the next steps?
Our next step is to generalize our framework to other types of metaconcepts. We also plan to further investigate the problem of visual compositional generalization with the help of metaconcepts.
What was the most interesting thing you learned?
I was most intrigued by the overfitting problem we ran into during the experiments. Ideally, when faced with the question _is cube a synonym of block?_, the model should associate _cube_ and _block_ with similar embeddings in order to produce the correct answer. However, it appeared that the model had another solution: it could overfit the question by memorizing the _synonym_ relationship mechanically, while keeping the embeddings of _cube_ and _block_ intact. To avoid this behavior, we limited the width of the metaconcept operators.
What was the most challenging part of your research?
The greatest challenge was in the design of concept and metaconcept representations. Our first attempt drew inspiration from simple geometric embeddings such as TransE. We associated visual concepts with embedding vectors, and metaconcepts with translational vectors. We used cosine distance as a metric of the similarity between object and concept embeddings.
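The first attempt described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' code: the two-dimensional vectors are hand-picked, and the names `cosine`, `transe_distance`, and `same_kind` are made up for this example. The translation score follows the standard TransE idea that head + relation should land near tail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def transe_distance(head, relation, tail):
    """TransE-style score: distance between head + relation and tail (lower is better)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical, hand-picked embeddings.
apple = [3.0, 4.0]       # an object embedding
red = [1.0, 0.0]         # concept embeddings
green = [0.0, 1.0]
same_kind = [-1.0, 1.0]  # metaconcept as a translation, chosen so red + same_kind = green

object_concept_similarity = cosine(apple, red)     # 3 / 5 = 0.6
forward = transe_distance(red, same_kind, green)   # 0.0: the translation fits exactly
backward = transe_distance(green, same_kind, red)  # sqrt(8), about 2.83

print(object_concept_similarity, forward, backward)
```

In this toy setup a single translation vector fits the direction from red to green exactly but not the reverse, even though "describe the same property" is a symmetric relation. This is a general property of translational models and may or may not be the specific issue the authors ran into.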
The TransE embedding later turned out to be too low-rank to model the complex relations between visual concepts. Additionally, the cosine metric failed to model (dis)similarity in some cases. For example, an apple may be both _red_ and _spherical_, but _red_ and _spherical_ are not themselves similar.
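One way to see why a cosine metric struggles with the apple example is the triangle inequality for angles: if an object vector is within angle θ of two concept vectors, those two concept vectors are within 2θ of each other. The sketch below computes the resulting lower bound on the concept-to-concept cosine. It is an illustration under that geometric reading, not the authors' analysis, and the function name is hypothetical.

```python
import math


def min_cosine_between_concepts(cos_obj_a, cos_obj_b):
    """Lower bound on cos(a, b) given cos(obj, a) and cos(obj, b).

    Angles on the unit sphere obey the triangle inequality, so
    angle(a, b) <= angle(obj, a) + angle(obj, b), capped at pi.
    """
    max_angle = math.acos(cos_obj_a) + math.acos(cos_obj_b)
    return math.cos(min(max_angle, math.pi))


# If the apple must have cosine 0.9 with both "red" and "spherical",
# the two concepts are forced to be fairly similar to each other.
strict = min_cosine_between_concepts(0.9, 0.9)  # 2 * 0.9**2 - 1 = 0.62

# With a looser requirement of 0.6, the bound drops below zero,
# so the two concepts are free to be dissimilar.
loose = min_cosine_between_concepts(0.6, 0.6)   # 2 * 0.6**2 - 1 = -0.28

print(round(strict, 2), round(loose, 2))
```

The tighter the match required between an object and each of its concepts, the more similar those concepts are forced to be, which is at odds with _red_ and _spherical_ being unrelated properties.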
To solve these problems, we turned to probabilistic embeddings as the visual concept representation, because this matches the nature of visual concepts, which are classifiers over object distributions. We associated metaconcepts with neural operators. Subsequent experiments showed this approach to be successful.
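The view of concepts as classifiers over objects can be illustrated with a small, self-contained sketch. It is not the formulation used in the paper: the object names and probabilities are invented, and the conditional estimate assumes the two concepts are independent given each object. It only shows how a probabilistic reading handles the apple example.

```python
# Hypothetical classifier outputs: P(object is described by concept).
scores = {
    "apple":       {"red": 0.9, "spherical": 0.9},
    "brick":       {"red": 0.9, "spherical": 0.1},
    "tennis_ball": {"red": 0.1, "spherical": 0.9},
    "leaf":        {"red": 0.1, "spherical": 0.1},
}


def marginal(concept):
    """Estimate P(concept) as the average classifier output over all objects."""
    return sum(s[concept] for s in scores.values()) / len(scores)


def conditional(a, b):
    """Estimate P(a | b), assuming a and b are independent given each object."""
    joint = sum(s[a] * s[b] for s in scores.values())
    given = sum(s[b] for s in scores.values())
    return joint / given


p_red = marginal("red")                                  # about 0.5
p_red_given_spherical = conditional("red", "spherical")  # about 0.5

print(round(p_red, 3), round(p_red_given_spherical, 3))
```

Here the apple is confidently both red and spherical, yet knowing that an object is spherical says nothing about whether it is red. Statistics like these are the kind of input a learned metaconcept operator could consume; the operator itself (a neural network, according to the authors) is not implemented here.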
Please cite our work using the BibTeX below.