Evaluation Our Approach on ARC and Beyond: A Look Back at Our Experiments

cover
19 Apr 2024

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Sebastien Ferr ´e, Univ Rennes, CNRS, Inria, IRISA Campus de Beaulieu, 35042 Rennes, France and [email protected].

Abstract & Introduction

Abstraction and Reasoning Corpus (ARC)

Related Work

Object-centric Models for ARC Grids

MDL-based Model Learning

Evaluation

Conclusion and Perspectives, References & Supplementary Materials

6 Evaluation

In this section, we first evaluate our approach on ARC, comparing it to existing approaches in terms of success rates, efficiency, model complexity, and model naturalness. We then evaluate the generality of our approach beyond ARC by applying it to a different domain, spreadsheets, where inputs and outputs are rows of strings. Our experiments were run with single-thread implementations on Fedora 32, Intel Core i7x12 with 16GB memory. We used one run per task set as there is no randomness involved.

6.1 Abstraction and Reasoning Corpus (ARC)

We evaluated our approach on the 800 public ARC tasks, and we also took part in the ARCathon 2022 challenge as team MADIL. The few parameters were set based on the training tasks. To ensure a good balance of the computational time between parsing and learning, we set some limits that remained stable across our experiments.

The number of descriptions produced by the parsing of a grid is limited to 64 and only the 3 most compressive are retained for the computation of refinements. At each step, at most 100,000 expressions are considered and only the 20 most promising refinements,

Table 5: Number of solved tasks (and percentage) and average learning time for solved tasks, for different methods on different task sets

according to a DL estimate, are evaluated. The rehearsal rate α is set to 10. The tasks are processed independently of each other, without learning from one to the other. The results are given for a learning time per task limited to 60s plus 10s for the pruning phase.

The learning and prediction logs and the screenshots of the solved training tasks are available as supplementary materials.

Task sets and baselines. We consider four task sets for which results have been reported: the 400 training and 400 evaluation public tasks, the 100 secret tasks of Kaggle’20, and the 100 secret tasks of ARCathon’22. We presume that those secret tasks are taken from the 200 secret ARC tasks. As baselines, we consider published methods that report results on the considered task sets [10,3,23,2]. We also include the winners of the two challenges for reference. Unfortunately, the reported results are scarce, and the papers do not provide their code. The code of our method is available as open source on GitHub6 . Version 2.7 was used for the experiments reported here.

Success rates. On the training tasks, for which we have the more results to compare with, our method solves 96 (24%) training tasks, almost on par with the best method, by Ainooson et al (26%). Both methods also solved a similar number of evaluation tasks (23 vs 26 tasks), and both solved 2/100 tasks in ARCathon’22, and ranked 4th ex-aequo.

Comparing the different task sets, it appears that the evaluation tasks are significantly more difficult than the training tasks, and the secret tasks of ARCathon seem even more difficult as the winner could only solve 6 tasks. Icecuber managed to correctly predict an amazing 20.6% of the test output grids in Kaggle’20, but at the cost of the hand-coding of 142 primitives, 10k lines of code, and brute-force search (millions of computed grids per task).

The ARC evaluation protocol allows for three predictions per test example. However, the first prediction of our method is actually correct in 90 of the 96 solved training tasks. This shows that our learned models are accurate in their understanding of the tasks. To better evaluate the generalization capability of learned models, we also measured the generalization rate as the proportion of models that are correct on training examples that are also correct on test examples: 92% (94/102) on training tasks, and 72% (23/32) on evaluation tasks. This again suggests that the evaluation tasks feature a higher generalization difficulty.

Without the pruning phase, this rate decreases to 89% (91/102) on training tasks. This shows that the pruning phase is useful, although description-oriented model learning is already good at generalization. Reasons for failures to generalize are: e.g., the test example has several objects while all training examples have a single object; the training examples have a misleading invariant.

Efficiency and model complexity. Intelligence is the efficiency at acquiring new skills, according to Chollet. Although ARC enforces data efficiency by having only a few training examples per task, and unique tasks, it does not enforce efficiency in the amount of priors, nor in the computation resources. It is therefore useful to assess the latter. We already mentioned Icecuber’s method that relies on a large number of primitives, and intense computations. The method of Ainooson et al, which has comparable performance to ours, uses 52 primitives and about 700s on average per solved task.

In comparison, our method uses 30 primitives and 4.6s per solved training task (21.7s over all training tasks). Moreover, doubling the learning timeout at 120s does not lead to solving more tasks, so 60s just seems to be enough to find a solution if there is one. Note also that our method does not stop learning when a solution is found but when no more compression can be achieved.

Another way to evaluate efficiency is to look at the complexity of learned models, typically the number of primitives composing the model in program synthesis approaches. A good proxy for this complexity is the depth of search that was reached in the allocated time. In our case, it is equal to the number of refinements applied to the initial empty model. Few methods provide this information: Icecuber limits depth to 4, and Ainooson’s best results are achieved with a brute-force search with maximum depth 3.

Methods based on DreamCoder [3] have similar limits but can learn more complex programs by discovering and defining new operations as common compositions of primitives, and reusing them from one task to another. Our method can dive much deeper in less computation time, thanks to its greedy strategy.

The number of refinement steps achieved in a timeout of 60s on the training tasks ranges from 4 to 57, with an average of 19 steps. This demonstrates the effectiveness of the MDL criteria to guide the search towards correct models. This claim is reinforced by the fact that a beam search (width=3) did not lead to solving more tasks.

Learned models. The learned models for solved tasks are very diverse despite the simplicity of our models. They express various transformations: e.g., moving an object, extending lines, putting one object behind another, order objects from largest to smallest, remove noise, etc. Note that none of these transformations is a primitive in our models, they are learned in terms of objects, basic arithmetics, simple geometry, and the MDL principle.

We compared our learned models to the natural programs of LARC [1]. Remarkably, many of our models involve the same objects and similar operations than the natural programs.

For example, the natural program for task b94a9452 is: “[The input has] a square shape with a small square centered inside the large square on a black background. The two squares are of different colors. Make an output grid that is the same size as the large square. The size and position of the small inner square should be the same as in the input grid. The colors of the two squares are exchanged.”

For other tasks, our models miss some notions used by natural programs but manage to compensate them: e.g., topological relations such as ”next to” or ”on top” are compensated by the three attempts; the majority color is compensated by the MDL principle selecting the largest object. However, in most cases, the same objects are identified.

These observations demonstrate that our object-centric models align well with the natural programs produced by humans, unlike approaches based on the composition of grid transformations. An example of a program learned by [10] on the task 23b5c85d is strip black; split colors; sort Area; top; crop, which is a sequence of grid-to-grid transformations, without explicit mention of objects.

6.2 From Grids to Strings (FlashFill)

A similar yet different kind of tasks, compared to ARC, is the automatic filling of some columns in a spreadsheet given already filled columns, from only a few input-output examples. A notable work in program synthesis [13] has led to a new feature in Microsoft Excel 2013, called FlashFill. As a simple example, consider a spreadsheet where column A contains lastnames (e.g., Smith), column B contains firstnames (e.g., Jones Paul), and column C is expected to contain the initial of the first firstname followed by the lastname (e.g., J. Smith).

Like in ARC, each task comes with a few input-output pairs, and the output should be predicted from the input. The main difference lies in the type of inputs and outputs, here rows of strings instead of colored grids. The research hypothesis here is that by changing only the definition of patterns, functions, and the model-specific DLs, our approach is able to learn models that solve the tasks given as examples in the work cited above.

Models. Table 6 lists the patterns of our models for rows of strings. A row model describes a row of strings, and is simply an array of cell models. A cell model decribes the content of a spreadsheet cell, i.e. a string. It is either the empty string (Nil), or the factorization of a string with a token in the middle and two substrings on each side.

Tokens here play the role of objects. A token model is either a constant string or a regular expression, taken among a list of predefined ones: e.g., Digits = [0-9]+ matches contiguous sequences of digits. Unknowns (?) can be used as cell models and as conditions (Cond). Finally, the Alt constructor can be used in cell models and token models to express alternatives (Alt(?,M1,M2)), conditionals (Alt(expr,M1,M2)), and optionals (Alt(?,M,Nil)). The available functions are so far limited: string length, filtering chars (digits, letters, upper letters, lower letters), converting a string to uppercase or lowercase, converting ints and bools to strings, equality to some constant value and logical operators for conditions. Expressions and references are so far restricted to tokens.

For comparison, the DSL of FlashFill also uses predefined regular expressions but uses them to locate positions in the string, rather than tokens. Their programs are conditional expressions (switch), where each branch is a concatenation of substrings specified by position, and constant strings. In contrast, our models allows for free nestings of conditionals (Alt) and concatenation (Factor). However, their DSL has loops that have so far no counterpart in our models.

Task set. For a preliminary evaluation, we used as a task set the 14 examples in [13]. Each task has one or two strings as inputs and one string as output, and 2-6 training examples (avg. 3.4). We complement them with 3-6 evaluation examples, some of which feature some generalization difficulty. Those 14 tasks are available in the supplementary materials in the same JSON format as ARC tasks.

Efficiency and success rates. Learning takes 1s or less, except for Task 1 and Task 13 that have longer input strings and take respectively 9.9s and 5.2s. The depth of search ranges from 11 to 76, and averages at 35 steps. For 11/14 tasks (all except Tasks 4, 5, 9), the learned model correctly describes and predicts the training examples.

However, only 5 of those learned models generalize to all test examples: 3 models fail on a single test example (e.g., in Task 3 the file extension .mp4 contains a digit unlike other file extensions); in Task 8, the training examples are ambiguous because the input string is a date in different formats, and the output string could either be the day or last two digits of the year; in Task 11, there is a typo in the training examples (on purpose), which makes the task under-specified. The (partial) failure for other tasks is explained by missing features in our models, notably the counterpart of loops, or by a wrong sequence of refinements. For instance, Task 9 can be solved by delaying the insertion of alternatives.

Learned models. Input models can be expressed as regular expressions with groups on tokens and alternatives, and output models can be expressed as string interpolations with group identifiers as variables. For Task 10, we obtain the following model.

The input is made of three integers, the first one being optional. The output is the concatenation of those three integers, separated by dashes, and the first integer is 425 when missing in the input. In FlashFill, in general, a large number of programs is generated as an exhaustive search is performed. For Task 10, the program given as solution in [13] is the following (A refers to the input column):

This is illustrative of the different programming style between our pattern-based models and the computation-based DSL programs.


This paper is available on Arxiv under CC 4.0 license.