The Definition of the Abstraction and Reasoning Corpus (ARC): A Beginner's Lesson

11 Apr 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Sébastien Ferré, Univ Rennes, CNRS, Inria, IRISA, Campus de Beaulieu, 35042 Rennes, France ([email protected]).

Abstract & Introduction

Abstraction and Reasoning Corpus (ARC)

Related Work

Object-centric Models for ARC Grids

MDL-based Model Learning

Evaluation

Conclusion and Perspectives, References & Supplementary Materials

2 Abstraction and Reasoning Corpus (ARC)

ARC is a collection of tasks, where each task is made of training examples (3.3 on average) and test examples (usually 1). Each example is made of an input grid and an output grid. Each grid is a 2D array (of size up to 30×30) filled with integers that encode colors (10 distinct colors). For a given task, the size of the grids can vary from one example to another, and between the input and the output.
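Concretely, in the public ARC repository each task is stored as a JSON file with "train" and "test" lists of input/output grid pairs, where a grid is a list of rows of integers in 0..9. A minimal loading sketch, assuming that layout (the file name is illustrative):

```python
import json

# Assumed layout of the public ARC repository: one JSON file per task,
# {"train": [{"input": grid, "output": grid}, ...], "test": [...]},
# where a grid is a list of rows of integers in 0..9 (one integer per color).
with open("data/training/0520fde7.json") as f:  # illustrative task file
    task = json.load(f)

for example in task["train"]:
    grid_in, grid_out = example["input"], example["output"]
    # Sizes may differ between input and output, and across examples.
    print(f"{len(grid_in)}x{len(grid_in[0])} -> {len(grid_out)}x{len(grid_out[0])}")
```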

Each task is a machine learning problem whose goal is to learn, from the few training examples only, a model that can generate the output grid from the input grid. Prediction is successful only if the predicted output grid is strictly equal to the expected grid for all test examples; there is no partial success. However, three trials are allowed for each test example, to compensate for potential ambiguities in the training examples. Figure 1 shows two ARC tasks (with the expected test output grid omitted). The first is used as a running example in this paper.
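This all-or-nothing criterion, with up to three trials per test example, can be made concrete with a short sketch (function and variable names are ours; grids are compared as lists of lists):

```python
def solves_task(predictions, expected_outputs):
    """Return True iff, for every test example, at least one of (at most)
    three candidate grids is strictly equal to the expected output grid.

    predictions: list, per test example, of up to three candidate grids
    expected_outputs: list of expected output grids, in the same order
    """
    return all(
        any(candidate == expected for candidate in candidates[:3])
        for candidates, expected in zip(predictions, expected_outputs)
    )

# Example: one test example, where the second trial matches exactly.
assert solves_task([[[[0, 1]], [[1, 0]]]], [[[1, 0]]])
```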

We now define grids, examples, and tasks more formally.
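The paper's formal definitions are not reproduced in this excerpt; a minimal formalization consistent with the description above (notation ours) is:

```latex
\[
  g \in \mathcal{C}^{h \times w}, \qquad
  \mathcal{C} = \{0, \dots, 9\}, \quad 1 \le h, w \le 30
\]
\[
  e = (g^{\mathrm{in}}, g^{\mathrm{out}}), \qquad
  T = (E_{\mathrm{train}}, E_{\mathrm{test}})
\]
```

Here a grid g is a matrix over the 10 colors, an example e is a pair of grids, and a task T pairs the set of training examples with the set of test examples.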

A task has one or two test examples (most often one). As illustrated by Figure 1, the different input grids of a task need not have the same size, nor use the same colors; the same applies to the test grids.

ARC is composed of 1000 tasks in total: 400 “training tasks”, 400 evaluation tasks, and 200 secret tasks for independent evaluation. Figure 1 shows two of the 400 training tasks. Developers should only look at the training tasks, not at the evaluation tasks; the latter should only be used to evaluate the broad generalization capability of the developed systems.
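Assuming the directory layout of the public ARC repository (one JSON file per task under data/training and data/evaluation; the 200 secret tasks are withheld), the two public splits can be enumerated to confirm their sizes:

```python
from pathlib import Path

# Assumed layout of the public ARC repository: data/training/ and
# data/evaluation/ each hold one JSON file per task (the secret split
# is not distributed).
for split in ("training", "evaluation"):
    n_tasks = sum(1 for _ in Path("data", split).glob("*.json"))
    print(f"{split}: {n_tasks} tasks")  # expected: 400 each
```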

