LexSym: Compositionality as Lexical Symmetry

In tasks like semantic parsing, instruction following, and question answering, standard deep networks fail to generalize compositionally from small datasets. Many existing approaches overcome this limitation with model architectures that enforce a compositional process of sentence interpretation. In this paper, we present a domain-general and model-agnostic formulation of compositionality as a constraint on symmetries of data distributions rather than models. Informally, we prove that whenever a task can be solved by a compositional model, there is a corresponding data augmentation scheme — a procedure for transforming examples into other well-formed examples — that imparts a compositional inductive bias to any model trained to solve the same task. We describe a procedure called LexSym that discovers these transformations automatically, then applies them to training data for ordinary neural sequence models. Unlike existing compositional data augmentation procedures, LexSym can be deployed agnostically across text, structured data, and even images. It matches or surpasses state-of-the-art, task-specific models on the COGS semantic parsing, SCAN and Alchemy instruction following, and CLEVR-CoGenT visual question answering datasets.

Figure 1: Example of our approach to compositional modeling in the visual question answering domain. Given a dataset of (image, question, answer) triples, we extract a lexicon that relates words to their visual groundings. We then find homomorphic transformations (Section 3) of this lexicon that, when applied to training examples, produce new, well-formed examples.
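To make the lexicon-extraction step concrete, the sketch below induces a toy lexicon by perfect co-occurrence between words and meaning symbols. The corpus, the `extract_lexicon` helper, and the co-occurrence criterion are illustrative assumptions, not the paper's exact induction procedure (in the VQA setting of Fig. 1 the meanings would be visual groundings):

```python
from collections import Counter

# Toy corpus of (sentence, meaning) pairs; an illustrative stand-in for
# the (image, question, answer) triples described in the caption above.
corpus = [
    ("the yellow cube",   {"YELLOW", "CUBE"}),
    ("the green cube",    {"GREEN", "CUBE"}),
    ("the yellow sphere", {"YELLOW", "SPHERE"}),
]

def extract_lexicon(pairs):
    """Relate each word to the meaning symbols it perfectly co-occurs with."""
    word_n, mean_n, joint_n = Counter(), Counter(), Counter()
    for sentence, meaning in pairs:
        words = set(sentence.split())
        word_n.update(words)
        mean_n.update(meaning)
        joint_n.update((w, m) for w in words for m in meaning)
    lexicon = {}
    for w in word_n:
        # Keep m only if w and m always appear together in the corpus.
        groundings = {m for m in mean_n
                      if joint_n[w, m] == word_n[w] == mean_n[m]}
        if groundings:
            lexicon[w] = groundings
    return lexicon

print(extract_lexicon(corpus))
# e.g. {'yellow': {'YELLOW'}, 'cube': {'CUBE'},
#       'green': {'GREEN'}, 'sphere': {'SPHERE'}}  (order may vary)
```

Here "the" receives no grounding because it co-occurs with every meaning only imperfectly, while content words align one-to-one with their meaning symbols.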
{f(x) : x ∈ X} = X (1)

Figure 2: Idealized compositional semantic parser following Definition 3. A (sentence, logical form) pair is translated into a lexical abstraction containing information about each token's type and semantic equivalences. We then determine whether the sentence evaluates to the logical form using only the type and equivalence matrices, using types to assign the sentence an abstract logical form, and equivalences to determine whether it matches the target.

Each x ∈ X is a discrete sequence [x₁, …, xₙ] of tokens from a vocabulary Σ.

We aim to prove that if I can be computed compositionally, then the data distribution admits nontrivial symmetries: transformations that map well-formed examples to other well-formed examples (Fig. 2). We say that a function f is a homomorphism of Σ with respect to L (an "L-homomorphism") if it preserves the lexicon's type and equivalence relationships. An example is depicted in Fig. 1; note that both the words yellow and green and the corresponding meanings must be swapped in order to satisfy Eq. 3.
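The following minimal sketch makes the yellow/green example concrete; the `SWAP` table, the function `f`, and the example format are assumptions for illustration. The key point is that the swap must rewrite the sentence and the meaning together; applying it to only one side would break the (sentence, meaning) pairing:

```python
# A lexical swap defined over both words and their meaning symbols.
SWAP = {"yellow": "green", "green": "yellow",
        "YELLOW": "GREEN", "GREEN": "YELLOW"}

def f(example):
    """Apply the lexical swap to both sides of a (sentence, meaning) pair."""
    sentence, meaning = example
    new_sentence = " ".join(SWAP.get(t, t) for t in sentence.split())
    new_meaning = tuple(SWAP.get(t, t) for t in meaning)
    return (new_sentence, new_meaning)

x = ("the yellow cube", ("CUBE", "YELLOW"))
print(f(x))          # ('the green cube', ('CUBE', 'GREEN'))
print(f(f(x)) == x)  # True: swapping twice is the identity
```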

Our main claim is then as follows:

Proposition 1. If X is L-compositional, f is an L-homomorphism, and x ∈ X, then f(x) ∈ X.

³ The full NL framework of MacCartney and Manning can model a rich set of sentence relations beyond equivalence, including contradiction and entailment, via a similarly enriched set of word-level relations. Our framework has a natural generalization that replaces ϵ with a set of n-ary relations and modifies Definition 4 to require that f be a homomorphism with respect to each (see Appendix C for more details).
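On a finite dataset, Proposition 1 (together with Eq. 1) can be checked mechanically: applying an L-homomorphism to every example should reproduce the dataset as a set. A minimal sketch, reusing the toy swap from above; the dataset and swap map are illustrative assumptions:

```python
SWAP = {"yellow": "green", "green": "yellow",
        "YELLOW": "GREEN", "GREEN": "YELLOW"}

def f(example):
    sentence, meaning = example
    return (" ".join(SWAP.get(t, t) for t in sentence.split()),
            tuple(SWAP.get(t, t) for t in meaning))

# A toy L-compositional dataset, closed under the yellow/green swap.
X = {
    ("the yellow cube",   ("CUBE", "YELLOW")),
    ("the green cube",    ("CUBE", "GREEN")),
    ("the yellow sphere", ("SPHERE", "YELLOW")),
    ("the green sphere",  ("SPHERE", "GREEN")),
}

# Eq. 1: f permutes the well-formed examples, so {f(x) : x in X} = X.
assert {f(x) for x in X} == X
```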

(Because the lexicon is one-to-many, either …)

The homomorphism f is closely related to the swap operations used in existing compositional data augmentation procedures; our analysis provides a justification for these data augmentation procedures, and shows how to modify them to ensure that they preserve the well-formedness of data.

Table 1: Exact match accuracies on the CLEVR and CLEVR-CoGenT validation sets. We report average results over 4 random restarts of our method. We obtain state-of-the-art results after applying our data augmentation scheme (without using any pre-trained image representations). Data augmentation with L-homomorphisms also yields higher accuracies than the substitution variant of Section 4.3, specifically for questions that require counting and comparisons.

Recall that the lexicon and data augmentation procedure … Akyürek and Andreas (2021).

We train these base models by performing data augmentation …

Table 2: Semantic parsing results on the COGS generalization set. We report means and standard deviations over 10 random seeds. Augmentation improves significantly over the LSTM baseline and obtains performance on par with the LexLSTM model, but behind specialized semantic parsing approaches.

In CLEVR, we apply one transformation uniformly to each batch; in COGS, we apply one transformation to each example with 20% probability.
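Written as code, the two application policies look as follows; this is a minimal sketch in which the function names and the `transforms` list are placeholders, and the 20% rate follows the text above:

```python
import random

def augment_clevr_batch(batch, transforms):
    """CLEVR policy: sample one transformation uniformly, apply to the whole batch."""
    f = random.choice(transforms)
    return [f(x) for x in batch]

def augment_cogs_example(x, transforms, p=0.2):
    """COGS policy: with probability p, apply one sampled transformation to the example."""
    if random.random() < p:
        return random.choice(transforms)(x)
    return x
```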