Classification-Regression for Chart Comprehension

Matan Levy1   Rami Ben-Ari2   Dani Lischinski1
1The Hebrew University of Jerusalem, Israel
2OriginAI, Israel

ECCV 2022



Chart question answering (CQA) is a task used for assessing chart comprehension, which is fundamentally different from understanding natural images. CQA requires analyzing the relationships between the textual and the visual components of a chart in order to answer general questions or infer numerical values. Most existing CQA datasets and models are based on simplifying assumptions that often enable surpassing human performance. In this work, we address this discrepancy and propose a new model that jointly learns classification and regression. Our language-vision setup uses co-attention transformers to capture the complex real-world interactions between the question and the textual elements. We validate our design with extensive experiments on the realistic PlotQA dataset, outperforming previous approaches by a large margin, while showing competitive performance on FigureQA. Our model is particularly well suited for realistic questions with out-of-vocabulary answers that require regression.


The task: Chart Question Answering (CQA)

Figures and charts play a major role in modern communication, helping to convey messages by curating data into an easily comprehensible visual form that highlights trends and outliers.

The Chart Question Answering (CQA) task is closely related to Visual Question Answering (VQA), which is typically applied to natural images. VQA is usually treated as a classification task, where the answer is a category. In contrast, answering questions about charts often requires regression. Furthermore, a small local change in a natural image typically has limited effect on the visual recognition outcome, while in a chart the impact might be extensive. A chart comprehension model must consider the interactions between the question and the various chart elements in order to provide correct answers.



We present an overview of our CRCT architecture for CQA in Fig. 3. In our approach, the image is first parsed by a trained object detector (see the object classes in Fig. 2). The outputs of the parsing stage are object classes, positions (bounding boxes), and visual features. These are projected into a single representation per visual element and stacked to form the visual sequence.
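The per-element projection described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact design: the embedding size, the linear projections, and the additive fusion are illustrative assumptions, and `embed_visual_element` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # shared embedding size (assumed)

def embed_visual_element(cls_id, bbox, feat, W_feat, W_box, cls_table):
    """Fuse class, position, and appearance into one D-dim vector.

    cls_id: detector class index (e.g. bar, axis label, legend marker)
    bbox:   (x1, y1, x2, y2), normalized to [0, 1]
    feat:   pooled visual feature from the detector backbone
    """
    cls_emb = cls_table[cls_id]          # learned class embedding
    box_emb = W_box @ np.asarray(bbox)   # linear projection of position
    app_emb = W_feat @ feat              # project backbone feature to D
    return cls_emb + box_emb + app_emb   # single representation per element

# Toy parameters standing in for learned weights.
n_classes, feat_dim = 12, 1024
cls_table = rng.normal(size=(n_classes, D))
W_feat = rng.normal(size=(D, feat_dim)) * 0.01
W_box = rng.normal(size=(D, 4))

# Stack the per-element vectors to form the visual sequence.
elements = [(3, (0.1, 0.2, 0.3, 0.9), rng.normal(size=feat_dim)),
            (5, (0.4, 0.2, 0.6, 0.7), rng.normal(size=feat_dim))]
visual_seq = np.stack([embed_visual_element(c, b, f, W_feat, W_box, cls_table)
                       for c, b, f in elements])
print(visual_seq.shape)  # (2, 256)
```

The key point is that heterogeneous detector outputs end up as a uniform sequence of fixed-size vectors, which is what a transformer consumes.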

Similarly, each textual element is represented by fusing its text tokens, positional encoding, and class. Together with the question's text tokens, these form the text sequence.
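The textual side can be sketched in the same spirit. Again a minimal NumPy illustration under assumptions: mean-pooling over an element's tokens, additive fusion, and the helper name `embed_text_element` are all hypothetical stand-ins for the learned components.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256  # shared embedding size (assumed)

def embed_text_element(token_embs, bbox, cls_emb, W_box):
    """Fuse a textual element's tokens, position, and class into one vector.

    token_embs: (n_tokens, D) embeddings of the element's text tokens
    bbox:       normalized (x1, y1, x2, y2) position of the text on the chart
    cls_emb:    embedding of the element class (e.g. axis label, legend entry)
    """
    tok = token_embs.mean(axis=0)      # simple pooling over tokens (assumed)
    pos = W_box @ np.asarray(bbox)     # positional encoding via projection
    return tok + pos + cls_emb

W_box = rng.normal(size=(D, 4)) * 0.1
axis_label = embed_text_element(rng.normal(size=(3, D)),
                                (0.0, 0.95, 0.2, 1.0),
                                rng.normal(size=D), W_box)

# The question's token embeddings are appended to form the text sequence.
question = rng.normal(size=(8, D))
text_seq = np.concatenate([question, axis_label[None, :]], axis=0)
print(text_seq.shape)  # (9, 256)
```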

The two sequences are fed in parallel into a bimodal co-attention transformer (co-transformer). The outputs of the co-transformer are pooled visual and textual representations, which are fused by Hadamard product and concatenation and fed into our unified classification-regression head.
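The fusion and the dual-output head can be sketched as below. This NumPy snippet follows the Hadamard-product-plus-concatenation fusion named above, but the hidden size, the single-layer MLP, and how the model chooses between the classification and regression outputs are simplifying assumptions; `answer_head` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n_answers = 256, 100  # embedding size and answer vocabulary (assumed)

def relu(x):
    return np.maximum(x, 0.0)

def answer_head(h_vis, h_txt, W_fuse, W_cls, W_reg):
    """Fuse the pooled modalities and emit both answer types.

    Fusion: Hadamard product concatenated with the two pooled vectors.
    The head predicts classification logits over a fixed answer vocabulary
    and a scalar value for out-of-vocabulary numeric (regression) answers.
    """
    fused = np.concatenate([h_vis * h_txt, h_vis, h_txt])  # (3D,)
    hidden = relu(W_fuse @ fused)
    logits = W_cls @ hidden           # fixed-vocabulary answer scores
    value = float(W_reg @ hidden)     # regressed numeric answer
    return logits, value

# Toy parameters standing in for learned weights.
W_fuse = rng.normal(size=(D, 3 * D)) * 0.01
W_cls = rng.normal(size=(n_answers, D)) * 0.1
W_reg = rng.normal(size=(1, D)) * 0.1

logits, value = answer_head(rng.normal(size=D), rng.normal(size=D),
                            W_fuse, W_cls, W_reg)
print(logits.shape, type(value))  # (100,) <class 'float'>
```

Training both outputs jointly is what lets one model handle categorical answers and out-of-vocabulary numeric answers.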

Citation - BibTeX

This webpage code was adapted from this source code.