Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Microsoft Cloud AI, City University of Hong Kong

Brief Summary

How much commonsense knowledge does Contrastive Language-Image Pre-training (CLIP) have? And how can we improve it through contrastive learning on image-text pairs that are augmented with commonsense?

We inject commonsense by pairing images (e.g., a lemon image) with commonsense riddles (e.g., “It tastes sour”), which can be generated at unlimited scale and with high quality by leveraging any knowledge graph and image-text dataset. We provide a fair and widely applicable benchmark and show that VL-models still have a long way to go. We achieve significant improvements on commonsense benchmarks.



Existing VL-models lack common sense. Given the text “It tastes sour”, none of the existing models, e.g., CLIP, correctly identifies the lemon. So why do they lack commonsense knowledge? We discover that VL datasets contain far less commonsense knowledge than regular NLP corpora. In addition, other objectives such as VQA or generation are not widely applicable for training, and their datasets are limited in size. This implies that we should enrich VL data with commonsense knowledge.



However, it is still challenging to inject this knowledge into image-text pairs without changing the model structure or relying on external knowledge at inference. We therefore propose a novel commonsense-aware training strategy, DANCE. We first re-organize the commonsense knowledge graph into entries of the form (entity, relation, entity) and pair each entry with images that contain one of its entities. We then hide the names of those entities behind demonstrative pronouns, e.g., “this item”. The generated descriptions are in textual form and therefore readily applicable to the training of most VL-models.
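
To make the pipeline concrete, here is a minimal sketch of the riddle-generation step described above. The toy triples, relation templates, and helper names are illustrative assumptions for exposition, not the released DANCE code.

# Minimal sketch of turning knowledge-graph triples into riddles paired with images.
# All data and names below are illustrative placeholders.

TRIPLES = [
    ("lemon", "HasProperty", "sour"),
    ("lemon", "IsA", "citrus fruit"),
]

# Simple textual templates for a few relations (hypothetical; the paper
# linearizes sub-graphs in both directions).
TEMPLATES = {
    "HasProperty": "{head} is {tail}",
    "IsA": "{head} is a kind of {tail}",
}

def make_riddle(triple, hidden_entity, pronoun="this item"):
    """Turn a (entity, relation, entity) triple into a riddle by hiding
    one entity behind a demonstrative pronoun."""
    head, relation, tail = triple
    sentence = TEMPLATES[relation].format(head=head, tail=tail)
    return sentence.replace(hidden_entity, pronoun)

def riddles_for_image(caption, triples=TRIPLES):
    """Pair an image (via its caption) with riddles about entities the caption mentions."""
    riddles = []
    for head, relation, tail in triples:
        if head in caption.lower():  # the image contains this entity
            riddles.append(make_riddle((head, relation, tail), hidden_entity=head))
    return riddles

print(riddles_for_image("A lemon on a wooden table."))
# ['this item is sour', 'this item is a kind of citrus fruit']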



In addition, existing VL commonsense evaluations are restricted to visual question answering and generation, which are neither a good fit for nor widely supported by the majority of VL-models. Therefore, we propose a new diagnostic benchmark in a more widely applicable form, i.e., image-text and text-image retrieval, to achieve a fair evaluation of pre-trained VL-models. The test set is further refined with neighborhood hard-negative filtering to ensure data quality.
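
As a rough illustration of how such a retrieval-style evaluation can be scored, the sketch below ranks candidate images for a riddle text with an off-the-shelf CLIP model via the Hugging Face transformers API. The sample format and accuracy helper are assumptions for exposition, not the benchmark's official evaluation script.

# Illustrative text-to-image retrieval scoring with CLIP; data paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_to_image_accuracy(samples):
    """samples: list of (riddle_text, candidate_image_paths, index_of_correct_image)."""
    correct = 0
    for text, image_paths, gold in samples:
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = processor(text=[text], images=images, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_text  # shape: (1, num_images)
        if logits.argmax(dim=-1).item() == gold:      # correct image ranked first
            correct += 1
    return correct / len(samples)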


We achieve significant improvements in commonsense ability, and the models can even generalize to unseen knowledge, not only on our test set but also on the widely used OK-VQA commonsense benchmark. We also evaluate a variety of cutting-edge VL-models and discover weaknesses that are not widely known: commonsense that is easy for humans (83%) is hard for current state-of-the-art VL-models (<42%).


Illustration of the commonsense deficiency of VL-models (officially pre-trained and fine-tuned CLIP, ViLT, and BLIP), a problem that is not widely known. All models fail to retrieve the correct image.

Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge. This motivates us to improve the commonsense of VL-models by augmenting VL data with commonsense knowledge.

We propose a novel commonsense-aware training strategy, DANCE. It injects commonsense into existing VL datasets on the fly during training by leveraging a commonsense knowledge graph, so that the data-pair generation is automatic, scalable, and trustworthy. It is compatible with most VL-models and requires no change at the inference stage.
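
The sketch below shows one way such on-the-fly injection could be wired into a standard training pipeline; the wrapper class, the probability p, and the riddle_fn hook are illustrative assumptions rather than the authors' implementation.

# Hedged sketch of on-the-fly commonsense augmentation for contrastive VL training.
import random
from torch.utils.data import Dataset

class CommonsenseAugmentedPairs(Dataset):
    """Wraps an existing (image, caption) dataset and, with probability p,
    swaps the caption for a knowledge-graph riddle about an entity in the image,
    so any contrastive VL-model can train on it unchanged."""

    def __init__(self, base_dataset, riddle_fn, p=0.5):
        self.base = base_dataset      # yields (image, caption) pairs
        self.riddle_fn = riddle_fn    # e.g. riddles_for_image from the earlier sketch
        self.p = p

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, caption = self.base[idx]
        riddles = self.riddle_fn(caption)
        if riddles and random.random() < self.p:
            caption = random.choice(riddles)  # inject commonsense text on the fly
        return image, caption                 # same interface as the original dataset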

We propose a new retrieval-based, widely applicable commonsense benchmark and use it to analyze a suite of VL-models, revealing weaknesses that are not widely known: commonsense that is easy for humans (83%) is hard for current state-of-the-art VL-models (<42%). Extensive experiments demonstrate the effectiveness of the proposed strategy and diagnostic test set.

Abstract

This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.

Despite their great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective.

Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of the text descriptions in VL datasets via bidirectional sub-graph sequentialization.

For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark.

By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks.

Qualitative comparison


Comparison on our text-image retrieval test set.




Comparison on our image-text retrieval test set.




Comparison on OK-VQA.

BibTeX

@misc{ye2022commonsense,
      title={Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles},
      author={Shuquan Ye and Yujia Xie and Dongdong Chen and Yichong Xu and Lu Yuan and Chenguang Zhu and Jing Liao},
      year={2022},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }