CVPR

All Work

Computer vision system marries image recognition and generation
MIT News
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
 
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
 
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
 
MaskSketch: Unpaired Structure-guided Masked Image Generation
 
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
 
Understanding and Improving Visual Prompting: A Label-Mapping Perspective
 
Video Test-Time Adaptation for Action Recognition
 
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
 
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
 
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
 
Masked Motion Encoding for Self-Supervised Video Representation Learning
 
EC^2: Emergent Communication for Embodied Control
 
Learning Situation Hyper-Graphs for Video Question Answering
 
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
 
Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners
 
3D Concept Learning and Reasoning from Multi-View Images
 
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
 
Teaching Structured Vision & Language Concepts to Vision & Language Models
 
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning
 
More Language, Less Labeling with Kate Saenko
This Week in Machine Learning & AI (TWIML) podcast
A safer, lower-cost alternative to real data for pretraining computer vision models
IBM Research blog
Hallucinating to better text translation
MIT News
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
 
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
 
Non-Adversarial Video Synthesis with Learned Priors
 
Camera On-boarding for Person Re-identification using Hypothesis Transfer Learning
 
Semi-Supervised Action Recognition with Temporal Contrastive Learning
 
Fashion IQ: A New Dataset towards Retrieving Images by Natural Language Feedback
 
GAN Compression: Efficient Architectures for Interactive Conditional GANs
 
Separating Skills and Concepts for Novel Visual Question Answering
 
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules
 
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models
 
Fine-grained Angular Contrastive Learning with Coarse Labels
 
Black-box Explanation of Object Detectors via Saliency Maps
 
Anycost GANs for Interactive Image Synthesis and Editing
 
Relationship Matters: Relation Guided Knowledge Transfer for Incremental Learning of Object Detectors
 
Identifying Interpretable Action Concepts in Deep Networks
 
Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation