Multimodal ArXiv

A Dataset for Improving Scientific Comprehension of Large Vision-Language Models


The University of Hong Kong
Peking University

Abstract

Large vision-language models (LVLMs), exemplified by GPT-4V, excel across diverse tasks involving concrete images from natural scenes. However, their ability to interpret abstract figures, such as geometric shapes and scientific plots, remains limited due to a scarcity of training datasets in scientific domains. To fill this gap, we introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension. ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions sourced from 572K ArXiv papers spanning various scientific domains. Drawing from ArXivCap, we introduce ArXivQA, a question-answering dataset generated by prompting GPT-4V based on scientific figures. ArXivQA greatly enhances LVLMs' mathematical reasoning capabilities, achieving a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark. Furthermore, employing ArXivCap, we devise four vision-to-text tasks for benchmarking LVLMs. Evaluation results with state-of-the-art LVLMs underscore their struggle with the nuanced semantics of academic figures, with domain-specific training yielding substantial performance gains. Our error analysis uncovers misinterpretations of visual context, recognition errors, and the production of overly simplified captions by current LVLMs, shedding light on future improvements.

Multimodal ArXiv

Comparison with previous scientific figure datasets. Our ArXivCap is the largest captioning dataset and our ArXivQA is the only QA dataset that covers a wide range of domains from real papers.



Overview of our dataset curation process. Starting from the ArXiv paper source files, we ensure paper quality by selecting papers according to their publication records. Figure-caption pairs are then extracted and cleaned according to manually designed rules. ArXivQA is generated by prompting GPT-4V with a curated template.



Prompt used for GPT-4V to generate QA pairs based on scientific figures.

Multiple-choice Question Answer Pairs Generation for Scientific Figures
Guideline
The goal of this task is to create answerable multiple-choice questions based on figures from scientific papers, to improve the ability of a large vision language model.
The questions should be challenging, and require college-level reasoning. The type of questions should be diverse. The question should be answerable based on the figure. The answer should be one of the answer choices. The answer choices should be plausible and challenging.
Format
Below is an example of the format of the input and output for the task.
Input
Figures: [Figures input in the task]
Output
Question: [Question]
Answer Options: [Answer choices, a bullet list.]
Correct Choice: [Correct answer choice, e.g., A]
Rationale: [Rationale for the correct answer, explain why the answer is correct]
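A response that follows the output format above can be split back into its four fields with a small parser. The following is a minimal sketch, not the authors' pipeline: the names `QAPair` and `parse_qa_response` are hypothetical, and it assumes each field label starts its own line and that answer options arrive one per line, optionally prefixed with a bullet character.

```python
import re
from dataclasses import dataclass
from typing import List

# Field labels taken from the output format in the prompt above.
FIELDS = ("Question", "Answer Options", "Correct Choice", "Rationale")
FIELD_RE = re.compile(
    r"^(Question|Answer Options|Correct Choice|Rationale):", re.MULTILINE
)


@dataclass
class QAPair:  # hypothetical container, not part of the released code
    question: str
    options: List[str]
    correct_choice: str
    rationale: str


def parse_qa_response(text: str) -> QAPair:
    """Split a model response into the four labeled fields."""
    matches = list(FIELD_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A field's body runs until the next label (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()

    missing = [name for name in FIELDS if name not in sections]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")

    # Assumption: one option per line, optionally led by '-', '*' or '•'.
    options = [
        re.sub(r"^[-*•]\s*", "", line.strip())
        for line in sections["Answer Options"].splitlines()
        if line.strip()
    ]
    return QAPair(
        question=sections["Question"],
        options=options,
        correct_choice=sections["Correct Choice"],
        rationale=sections["Rationale"],
    )
```

Raising on a missing field, rather than returning a partial record, makes it easy to discard malformed generations before they reach a training set; whether the original pipeline filters this way is not stated here.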

Example

A single-figure caption pair in our ArXivCap dataset. (The figure and caption are from paper arxiv:1908.04642.)



A multiple-figure caption pair in our ArXivCap dataset. (The figure and caption are from paper arxiv:1810.10761.)



A case from ArXivQA. (The figure and caption are from paper arxiv:2011.09217.)



Evaluation

Evaluation on the MathVista dataset. ArXivCap and ArXivQA together enhance Qwen-VL-Chat's overall performance, surpassing that of the commercial model Bard. The best results are highlighted in bold, and the second-best scores are underlined.



Evaluation results for single-figure captioning. Grey results are obtained from a 200-sample subset. Although most LVLMs struggle to produce high-quality captions for scientific figures, training on ArXivCap significantly improves performance.



Relative accuracy changes from training on ArXivQA samples from different domains.
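The exact definition behind this figure is not given on this page. One common reading, assumed here, is that a relative change divides the accuracy difference by the baseline accuracy, in contrast to the absolute gain (a plain difference in percentage points) quoted in the abstract. The sketch below illustrates the distinction; the function names and the numbers in the usage note are hypothetical, not results from the paper.

```python
def absolute_gain(baseline_acc: float, new_acc: float) -> float:
    """Difference in accuracy, in the same units as the inputs (e.g. points)."""
    return new_acc - baseline_acc


def relative_change(baseline_acc: float, new_acc: float) -> float:
    """Accuracy difference as a fraction of the baseline (assumed definition)."""
    if baseline_acc == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return (new_acc - baseline_acc) / baseline_acc
```

For a made-up baseline of 50.0 and a new accuracy of 55.0, the absolute gain is 5.0 points while the relative change is 0.1, i.e. 10%.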



Manual Evaluation

We conduct a manual inspection for single-figure captioning results. To ensure a more informed evaluation, we focus on a paper from the CS domain, leveraging our domain knowledge to assess caption quality better.



Case Study

ArXivQA enables the model not only to answer questions about scientific figures in papers (left) but also to improve its mathematical understanding (right). The model not only selects the correct options but also gives a reasonable rationale.



BibTeX


@misc{li2024multimodal,
  title={Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models},
  author={Lei Li and Yuqi Wang and Runxin Xu and Peiyi Wang and Xiachong Feng and Lingpeng Kong and Qi Liu},
  year={2024},
  eprint={2403.00231},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
  

Acknowledgement

This website is adapted from Nerfies and LLaVA-RLHF, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of Qwen-VL and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.