Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems

Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.


Introduction
In the last few years, the application of deep neural models has allowed rapid progress in NLP. Tasks in phonology and morphology have been no exception to this, with neural encoder-decoder models achieving strong results in recent shared tasks in phonology (Gorman et al., 2020) and morphology (Vylomova et al., 2020). However, the neural models that perform well on these tasks make use of hundreds, if not thousands, of training examples for each language. Additionally, the patterns that neural models identify are not interpretable. In this paper, we explore the problem of learning interpretable phonological and morphological rules from only a small number of examples, a task that humans are able to perform.
Consider the example of verb forms in the language Mandar presented in Table 1. How would a neural model tasked with filling the two blank cells do? The data comes from a language that is not represented in large-scale text datasets that could allow the model to harness pretraining, and the number of samples presented here is likely not sufficient for the neural model to learn the task. However, a human would fare much better at this task even if they didn't know Mandar. Identifying rules and patterns in a different language is a principal concern of a descriptive linguist (Brown and Ogilvie, 2010). Even people who aren't trained in linguistics would be able to solve such a task, as evidenced by contestants in the Linguistics Olympiads and by general-audience puzzle books (Bellos, 2020). In addition to being able to solve the task, humans would be able to express their solution explicitly in terms of rules, that is to say, a program that maps inputs to outputs.
[Table 1: Mandar verb forms, with columns "to V" and "to be Ved".]
* Work done while at the University of Richmond.
Program synthesis (Gulwani et al., 2017) is a method that can be used to learn programs that map an input to an output in a domain-specific language (DSL). It has been shown to be a highly sample-efficient technique to learn interpretable rules by specifying the assumptions of the task in the DSL (Gulwani, 2011).
This raises the following questions: (i) Can program synthesis be used to learn linguistic rules from only a few examples? (ii) If so, what kind of rules can be learnt? (iii) What kind of operations need to be explicitly defined in the DSL to allow it to model linguistic rules? (iv) What knowledge must be implicitly provided with these operations to allow the model to choose rules that generalize well?
In this work, we use program synthesis to learn phonological rules for solving Linguistics Olympiad problems, where only the minimal number of examples necessary to generalize are given (Şahin et al., 2020). We present a program synthesis model and a DSL for learning phonological rules, and curate a set of Linguistics Olympiad problems for evaluation.
We perform experiments and comparisons to baselines, and find that program synthesis does significantly better than our baseline approaches. We also present some observations about the ability of our system to find rules that generalize well, and discuss examples of where it fails.

Program synthesis
Program synthesis is "the task of automatically finding programs from the underlying programming language that satisfy (user) intent expressed in some form of constraints" (Gulwani et al., 2017). This method allows us to specify domain-specific assumptions as a language, and use generic synthesis approaches like FlashMeta (Polozov and Gulwani, 2015) to synthesize programs.
The ability to explicitly encode domain-specific assumptions gives program synthesis broad applicability to various tasks. In this paper, we explore applying it to the task of learning phonological rules. Whereas previous work on rule-learning has focused on learning rules of a specific type (Brill, 1992; Johnson, 1984), the DSL in program synthesis allows learning rules of different types, and in different rule formalisms.
In this work, we explore learning rules similar to rewrite rules (Chomsky and Halle, 1968) that are used extensively to describe phonology. Sequences of rules are learnt using a noisy disjunctive synthesis algorithm NDSyn (Iyer et al., 2019) extended to learn stateful multi-pass rules (Sarthi et al., 2021).

Phonological rules as programs
The synthesis task we solve is to learn a program in a domain-specific language (DSL) for string transduction, that is, to transform a given sequence of input tokens i ∈ I* to a sequence of output tokens o ∈ O*, where I is the set of input tokens, and O is the set of output tokens. Each token is a symbol accompanied by a feature set, a set of key-value pairs that maps feature names to boolean values.
We learn programs for token-level examples, which transform an input token in its context to output tokens. The program is a sequence of rules which are applied to each token in an input string to produce the output string. The rules learnt are similar to rewrite rules, of the form

$$\phi_{-l} \cdots \phi_{-2}\,\phi_{-1}\; X\; \phi_{1}\,\phi_{2} \cdots \phi_{r} \rightarrow T$$

where (i) X : I → B is a boolean predicate that determines the input tokens to which the rule is applied, (ii) φ_i : I → B is a boolean predicate applied to the i-th character relative to X, with l and r the number of context positions to the left and right of X, so that the predicates φ collectively determine the context in which the rule is applied, and (iii) T : I → O* is a function that maps an input token to a sequence of output tokens.
X and φ belong to a set of predicates P, and T is a function belonging to a set of transformation functions T. Both sets are specified by the DSL. We allow the model to synthesize programs that apply multiple rules to a single token by synthesizing rules in passes and maintaining state from one pass to the next. This allows the system to learn stateful multi-pass rules (Sarthi et al., 2021).
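As an illustration of the rule format described above, the following minimal Python sketch represents a rule as a target predicate X, a set of context predicates indexed by offset, and a transformation T. The class name `Rule`, the helper `apply_rules`, and the example rule (n becomes m before p) are hypothetical and are not taken from our system; tokens are plain strings here, without feature sets.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Token = str
Predicate = Callable[[Token], bool]


@dataclass
class Rule:
    target: Predicate                          # X: which tokens the rule rewrites
    context: Dict[int, Predicate]              # offset i -> phi_i, relative to X
    transform: Callable[[Token], List[Token]]  # T: one token -> output tokens

    def matches(self, word: List[Token], pos: int) -> bool:
        if not self.target(word[pos]):
            return False
        for offset, phi in self.context.items():
            j = pos + offset
            # Assumption: a context predicate fails outside the word.
            if j < 0 or j >= len(word) or not phi(word[j]):
                return False
        return True


def apply_rules(rules: List[Rule], word: List[Token]) -> List[Token]:
    out: List[Token] = []
    for pos, tok in enumerate(word):
        for rule in rules:
            if rule.matches(word, pos):
                out.extend(rule.transform(tok))
                break
        else:
            out.append(tok)  # no rule fired: copy the token through
    return out


# Hypothetical rule for illustration: n -> m when the next token is p.
nasal_assimilation = Rule(
    target=lambda t: t == "n",
    context={1: lambda t: t == "p"},
    transform=lambda t: ["m"],
)

print("".join(apply_rules([nasal_assimilation], list("anpa"))))
```

Predicates are evaluated against the input word rather than the partially built output, which matches the description of rules being applied to each token of the input string.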

Domain-specific language
The domain-specific language (DSL) is the declarative language which defines the allowable string transformation operations. The DSL is defined by a set of operators, a grammar which determines how they can be combined, and a semantics which determines what each operator does. By defining operators to capture domain-specific phenomena, we can reduce the space of programs to be searched to include those programs that capture distinctions relevant to the domain. This allows us to explicitly encode knowledge of the domain into the system.
Each operator in the DSL also has an associated score, which allows for setting domain-specific preferences for certain kinds of programs. We can combine the scores of the operators in a program to compute a ranking score that we can use to identify the most preferred program among candidates. The ranking score can capture implicit preferences such as shorter programs, more or less general programs, or certain classes of transformations.
The DSL defines the predicates P and set of transformations T that can be applied to a particular token. The predicates and transformations in the DSL we use, along with the description of their semantics, can be found in Tables 2 and 3.

IsToken(w, s, i): Is x equal to the token s? This allows us to evaluate matches with specific tokens.
Is(w, f, i): Is f true for x? This allows us to generalize beyond single tokens and use features that apply to multiple tokens.
TransformationApplied(w, t, i): Has the transformation t been applied to x in a previous pass? This allows us to reference previous passes in learning rules for the current pass.
Not(p): Negates the predicate p.
Table 2: Predicates that are used for synthesis. The predicates are applied to a token x that is at an offset i from the current token in the word w. The offset may be positive to refer to tokens after the current token, zero to refer to the current token, or negative to refer to tokens before the current token.
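The predicates in Table 2 can be sketched in Python as follows. This is a hedged illustration rather than our implementation: the `Token` class, the function names, and the choice to treat offsets that fall outside the word as not matching are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Token:
    symbol: str
    features: Dict[str, bool] = field(default_factory=dict)


Word = List[Token]
Predicate = Callable[[Word, int], bool]  # (word, position of current token)


def _at(word: Word, pos: int, offset: int) -> Optional[Token]:
    j = pos + offset
    return word[j] if 0 <= j < len(word) else None


def is_token(symbol: str, offset: int) -> Predicate:
    """IsToken: is the token at the given offset equal to `symbol`?"""
    def pred(word: Word, pos: int) -> bool:
        x = _at(word, pos, offset)
        return x is not None and x.symbol == symbol
    return pred


def is_feature(feature: str, offset: int) -> Predicate:
    """Is: does the token at the given offset have `feature` set to true?"""
    def pred(word: Word, pos: int) -> bool:
        x = _at(word, pos, offset)
        return x is not None and bool(x.features.get(feature, False))
    return pred


def not_(p: Predicate) -> Predicate:
    """Not: negate another predicate."""
    return lambda word, pos: not p(word, pos)


word = [
    Token("a", {"vowel": True}),
    Token("s", {"fricative": True}),
    Token("i", {"vowel": True}),
]
print(is_token("s", 1)(word, 0), is_feature("vowel", -1)(word, 1))
```

A feature-based predicate such as `is_feature("vowel", -1)` holds for every vowel in the inventory, whereas `is_token` has to be repeated once per symbol.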

Transformation
ReplaceBy(x, s1, s2): If x is s1, it is replaced with s2. This allows the system to learn conditional substitutions.
ReplaceAnyBy(x, s): x is replaced with s. This allows the system to learn unconditional substitutions.
Insert(x, S): This inserts a sequence of tokens S after x at the end of the pass. It allows for the insertion of variable-length affixes.
Delete(x): This deletes x from the word at the end of the pass.
CopyReplace(x, i): These are analogues of the ReplaceBy and Insert transformations in which the token that is added is the same as the token at an offset i from x. They allow the system to learn phonological transformations such as assimilation and gemination.
Identity(x): This returns x unchanged. It allows the system to handle cases where a transformation applies under certain conditions, but does not under others.
Table 3: Transformations that are used for synthesis. The offset i for the Copy transformations may be positive to refer to tokens after the current token, zero to refer to the current token, or negative to refer to tokens before the current token.

Sequences of rules are learnt as disjunctions of IfThen operators, and are applied to each token of the input using a Map operator (Figure 1). The conjunction of predicates X and φ that defines the context is learnt by nesting IfThen operators.
A transformation produces a token that is tagged with the transformation that is applied. This allows for maintaining state across passes.
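The following Python sketch illustrates one way a single pass could work: each token is mapped through the first rule whose condition holds, the result is tagged with the transformation that produced it, and insertions and deletions are realised only at the end of the pass. The names `OutToken`, `run_pass`, and `finish_pass` are hypothetical, and treating a non-matching `ReplaceBy` as leaving the token unchanged is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class OutToken:
    symbol: str
    tag: str                            # transformation that produced the token
    insert_after: Tuple[str, ...] = ()  # added at the end of the pass
    deleted: bool = False               # removed at the end of the pass


def replace_by(s1: str, s2: str) -> Callable[[str], OutToken]:
    def t(x: str) -> OutToken:
        # Assumption: a token other than s1 is left unchanged.
        return OutToken(s2, "ReplaceBy") if x == s1 else OutToken(x, "Identity")
    return t


def insert(tokens: List[str]) -> Callable[[str], OutToken]:
    return lambda x: OutToken(x, "Insert", insert_after=tuple(tokens))


def delete() -> Callable[[str], OutToken]:
    return lambda x: OutToken(x, "Delete", deleted=True)


Condition = Callable[[List[str], int], bool]
Rule = Tuple[Condition, Callable[[str], OutToken]]


def run_pass(rules: List[Rule], word: List[str]) -> List[OutToken]:
    """Map: transform each token with the first rule whose condition holds."""
    tagged = []
    for pos, x in enumerate(word):
        for cond, transform in rules:
            if cond(word, pos):
                tagged.append(transform(x))
                break
        else:
            tagged.append(OutToken(x, "Identity"))
    return tagged


def finish_pass(tagged: List[OutToken]) -> List[str]:
    """Realise the deferred insertions and deletions."""
    out: List[str] = []
    for t in tagged:
        if not t.deleted:
            out.append(t.symbol)
        out.extend(t.insert_after)
    return out


# Hypothetical rules: t -> c before i, and append "ka" after the final token.
rules = [
    (lambda w, p: w[p] == "t" and p + 1 < len(w) and w[p + 1] == "i",
     replace_by("t", "c")),
    (lambda w, p: p == len(w) - 1, insert(["k", "a"])),
]
tagged = run_pass(rules, list("mati"))
print("".join(finish_pass(tagged)), [t.tag for t in tagged])
```

Deferring insertions and deletions keeps token positions stable while predicates are being evaluated, so offsets always refer to the word as it stood at the start of the pass.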
The operators in our DSL are quite generic and can be applied to other string transformations as well. In addition to designing our DSL for string transformation tasks, we allow for phonological information to be specified as features, which are a set of key-value pairs that map attributes to boolean values. While we restrict our investigation to features based only on the symbols in the input, more complex features based on meaning and linguistic categories can be provided to a system that works on learning rules for more complex domains like morphology or syntax. We leave this investigation for future work.

Sarthi et al. (2021) apply the NDSyn algorithm (Iyer et al., 2019) for disjunctive synthesis to the task of grapheme-to-phoneme (G2P) conversion in Hindi and Tamil. They propose the idea of learning transformations on token-aligned examples, and use language-specific predicates and transformations to learn rules for G2P conversion. We use a similar approach, but with a different set of predicates and transformations that are language-agnostic. Figure 2 sketches the working of the algorithm.

NDSyn is an algorithm for learning disjunctions of rules of the form shown in Figure 1. Given a set of examples, it first generates a set of candidate rules using the FlashMeta synthesis algorithm (Polozov and Gulwani, 2015). This algorithm searches for a program in the DSL that satisfies a set of examples by recursively breaking down the search problem into smaller subproblems. Given an operator and the input-output constraints it must satisfy, it infers constraints on each of the arguments to the operator, allowing it to recursively search for programs that satisfy these constraints on each of the arguments. For example, given the IsToken predicate and a set of examples where the predicate is true or false, the algorithm infers constraints on its arguments, the token s and the offset i, such that the set of examples is satisfied. The working of FlashMeta is illustrated with an example in Figure 3. We use the implementation of the FlashMeta algorithm available as part of the PROSE framework.
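To make the idea of inferring argument constraints concrete, here is a small Python sketch of an inverse-semantics step for a token-matching predicate. It is a simplified illustration written for this passage and does not use the PROSE API; the function `witness_is_token`, the `max_offset` bound, and the example words are assumptions.

```python
from typing import List, Set, Tuple

# (word, position of the current token, desired predicate output)
Example = Tuple[str, int, bool]


def is_token(word: str, s: str, i: int, pos: int) -> bool:
    j = pos + i
    return 0 <= j < len(word) and word[j] == s


def witness_is_token(spec: List[Example], max_offset: int = 1) -> Set[Tuple[str, int]]:
    """Return every (token, offset) pair that agrees with all examples."""
    positives = [(w, p) for w, p, out in spec if out]
    if not positives:
        return set()
    # Any valid (s, i) must hold on the first positive example, so that
    # example alone bounds the candidate set.
    w0, p0 = positives[0]
    candidates = {
        (w0[p0 + i], i)
        for i in range(-max_offset, max_offset + 1)
        if 0 <= p0 + i < len(w0)
    }
    return {
        (s, i)
        for s, i in candidates
        if all(is_token(w, s, i, p) == out for w, p, out in spec)
    }


# The predicate should be true for the b in "abc" but false for the b in "xbc".
spec = [("abc", 1, True), ("xbc", 1, False)]
print(witness_is_token(spec))
```

From the positive example alone the candidates are ("a", -1), ("b", 0), and ("c", 1); the negative example then rules out the last two, leaving only the preceding-a condition.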
From the set of candidate rules, NDSyn selects a subset of rules with a high ranking score that correctly answers the most examples while incorrectly answering the fewest. Additional details about the algorithm are provided in Appendix A.
The synthesis of multi-pass rules proceeds in passes. In each pass, a set of token-aligned examples is provided as input to the NDSyn algorithm. The resulting rules are then applied to all the examples, and those that are not solved are passed as the set of examples to NDSyn in the next pass. This proceeds until all the examples are solved, or for a maximum number of passes.
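The pass structure described above can be sketched as a simple loop. In this hedged Python example, `toy_learner` is a hypothetical stand-in for NDSyn that learns a single character substitution per pass, and examples are whole strings rather than token-aligned pairs; only the control flow is meant to mirror the description.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (current form, target form)
Program = Callable[[str], str]
Learner = Callable[[Sequence[Example]], Program]


def synthesize_multipass(examples: Sequence[Example], learn: Learner,
                         max_passes: int = 3) -> List[Program]:
    passes: List[Program] = []
    current = list(examples)
    for _ in range(max_passes):
        program = learn(current)
        passes.append(program)
        # Apply the newly learnt rules, then keep only the unsolved examples.
        current = [(program(src), tgt) for src, tgt in current]
        current = [(src, tgt) for src, tgt in current if src != tgt]
        if not current:
            break
    return passes


def toy_learner(examples: Sequence[Example]) -> Program:
    """Hypothetical learner: fix the first mismatching character everywhere."""
    for src, tgt in examples:
        for a, b in zip(src, tgt):
            if a != b:
                return lambda s, a=a, b=b: s.replace(a, b)
    return lambda s: s


def run(passes: List[Program], s: str) -> str:
    for program in passes:
        s = program(s)
    return s


passes = synthesize_multipass([("kat", "gad"), ("tak", "dag")], toy_learner)
print(len(passes), run(passes, "kit"))
```

Here the first pass learns k to g and the second learns t to d, after which every training example is solved and the loop stops before reaching `max_passes`.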

Dataset
To test the ability of our program synthesis system to learn linguistic rules from only a few examples, we require a task with a small number of training examples, and a number of test examples which measure how well the model generalises to unseen data. Additionally, to ensure a fair evaluation, the test examples should be chosen such that the samples in the training data provide sufficient evidence to correctly solve the test examples.
To this end, we use problems from the Linguistics Olympiad. The Linguistics Olympiad is an umbrella term describing contests for high school students across the globe. Students are tasked with solving linguistics problems, a genre of composition that presents linguistic facts and phenomena in enigmatic form (Derzhanski and Payne, 2010). These problems typically have two parts: the data and the assignments.
The data consists of examples where the solver is presented with the application of rules to some linguistic forms (words, phrases, sentences) and the forms derived by applying the rules to these forms. The data typically consists of 20-50 forms; only the minimal number of examples required to infer the correct rules is presented (Şahin et al., 2020).
The assignments provide other linguistic forms, and the solver is tasked with applying the rules inferred from the data to these forms. The forms in the assignments are carefully selected by the designer to test whether the solver has correctly inferred the rules, including making generalizations to unseen data. This allows us to see how much of the intended solution has been learnt by the solver by examining responses to the assignments.

[Figure 3 shows the candidate predicates IsToken(w, "a", -1), IsToken(w, "b", 0), and IsToken(w, "c", 1), and the search for a rule yielding IfThen(IsToken(w, "a", -1), ReplaceBy(x, "b", "d")), IfThen(IsToken(w, "b", 0), ReplaceAnyBy(x, "d")), and IfThen(IsToken(w, "c", 1), ReplaceAnyBy(x, "d")).]
Figure 3: An illustration of the search performed by the FlashMeta algorithm. The blue boxes show the specification that an operator must satisfy in terms of input-output examples, with the input token underlined in the context of the word. The Inverse Semantics of an operator is a function that is used to infer the specification for each argument of the operator based on the semantics of the operator. This may be a single specification (as for predicate) or a disjunction of specifications (as for token and offset). The algorithm then recursively searches for programs to satisfy the specification for each argument, and combines the results of the search to obtain a program. The search for the rule in an IfThen statement proceeds similarly to the search for a predicate. Examples of programs that are inferred from a specification are indicated with ⟹. A dashed line between inferred specifications indicates that the specifications are inferred jointly.
The small number of training examples (data) tests the generalization ability and sample efficiency of the system, and presents a challenging learning problem for the system. The careful selection of test examples (assignment) lets us use them to measure how well the model learns these generalizations.
We present a dataset of 34 linguistics problems, collected from various publicly accessible sources. These problems are based on phonology, and some aspects of the morphology of languages, as well as the orthographic properties of languages. These problems are chosen such that the underlying rules depend only on the given word forms, and not on inherent properties of the word like grammatical gender or animacy. The problems involve (1) inferring phonological rules in morphological inflection (Table 4a), (2) inferring phonological changes between multiple related languages (Table 4b), (3) converting between the orthographic form of a language and the corresponding phonological form (Table 4c), and (4) marking the phonological stress on a given word (Table 4d). We refer to these categories of problems as morphophonology, multilingual, transliteration, and stress respectively. We further describe the dataset in Appendix B.

Structure of the problems
Each problem is presented in the form of a matrix M. Each row of the matrix contains data pertaining to a single word/linguistic form, and each column contains the same form of different words, i.e., an inflectional or derivational paradigm, the word form in a particular language, the word in a particular script, or the stress values for each phoneme in a word. A test sample in this case is presented as a particular cell M_ij in the table that has to be filled. The model has to use the data from other words in the same row (M_i:) and the words in the column (M_:j) to predict the form of the word in M_ij.
In addition to the data in the table, each problem contains some additional information about the symbols used to represent the words. This additional information is meant to help the solver understand the meaning of a symbol they may not have seen before. We manually encode this information in the feature set associated with each token for synthesis. Where applicable, we also add consonant/vowel distinctions in the given features, since this is a basic distinction assumed in the solutions to many Olympiad problems.
We use the assignments that accompany every problem as the test set, ensuring that the correct answer can be inferred based on the given data.

Dataset statistics
The dataset we present is highly multilingual. The 34 problems contain samples from 38 languages, drawn from across 19 language families. There are 15 morphophonology problems, 7 multilingual problems, 6 stress, and 6 transliteration problems. The set contains 1452 training words with an average of 43 words per problem, and 319 test words with an average of 9 per problem. Each problem has a matrix that has between 7 and 43 rows, with an average of 23. The number of columns ranges from 2 to 6, with most problems having 2.

Baselines
Given that we model our task as string transduction, we compare with the following transduction models used as baselines in shared tasks on G2P conversion (Gorman et al., 2020) and morphological reinflection (Vylomova et al., 2020).
Neural: We use LSTM-based sequence-to-sequence models with attention as well as Transformer models as implemented by Wu (2020). For each problem, we train a single neural model that takes the source and target column numbers, and the source word, and predicts the target word.
WFST: We use models similar to the pair n-gram models (Novak et al., 2016), with the implementation similar to that used by Lee et al. (2020). We train a model for each pair of columns in a problem. For each test example M_ij, we find the column with the smallest index j′ such that M_ij′ is non-empty and use M_ij′ as the source string to infer M_ij.
Additional details of baselines are provided in Appendix C.

Program synthesis experiments
As discussed in Section 3.1, the examples in a problem are in a matrix, and we synthesize programs to transform entries in one column to entries in another. Given a problem matrix M, we refer to a program to transform an entry in column i to an entry in column j as M_:i → M_:j. To obtain token-level examples, we use the Smith-Waterman alignment algorithm (Smith et al., 1981), which favours contiguous sequences in aligned strings.
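For readers unfamiliar with it, the following is a minimal Python sketch of Smith-Waterman local alignment over two strings. The scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, not the settings used in our experiments, and the sketch does not show how the resulting alignment is turned into token-level examples.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -1) -> Tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")   # gap in b: a token of a with no counterpart
            i -= 1
        else:
            out_a.append("-")   # gap in a: a token of b with no counterpart
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("abxcd", "abcd"))
```

With these scores, aligning "abxcd" with "abcd" keeps both contiguous matching runs and pays one gap for the unmatched x, rather than settling for a single shorter run.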
We train three variants of our synthesis system with different scores for the Is and IsToken operators. The first one, NOFEATURE, does not use features, or the Is predicate. The second one, TOKEN, assigns a higher score to IsToken and prefers more specific rules that reference tokens. The third one, FEATURE, assigns a higher score to Is and prefers more general rules that reference features instead of tokens. All other aspects of the model remain the same across variants.
Morphophonology and multilingual problems: For every pair of columns (s, t) in the problem matrix M, we synthesize the program M_:s → M_:t. To predict the form of a test sample M_ij, we find a column k such that the program M_:k → M_:j has the best ranking score, and evaluate it on M_ik.
Transliteration problems: Given a problem matrix M, we construct a new matrix M′ for each pair of columns (s, t) such that all entries in M′ are in the same script. We align word pairs (M_is, M_it) using the Phonetisaurus many-to-many alignment tool (Jiampojamarn et al., 2007), and build a simple mapping f for each source token to the target token with which it is most frequently aligned. We fill in M′_is by applying f to each token of M_is, and set M′_it = M_it. We then find a program M′_:s → M′_:t.
Stress problems: For these problems, we do not perform any alignment, since the training pairs are already token-aligned. The synthesis system learns to transform the source string to the sequence of stress values.
Table 5: Metrics for all problems, and for problems of each type. The CHRF score for stress problems is not calculated, and not used to determine the overall CHRF score.
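The token mapping f used for transliteration problems can be sketched in a few lines of Python. The function names and the toy Cyrillic-to-Latin pairs are hypothetical, the alignments are assumed to be one-to-one and already computed, and leaving unseen tokens unchanged is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

AlignedPair = Tuple[Sequence[str], Sequence[str]]


def build_mapping(aligned_pairs: Sequence[AlignedPair]) -> Dict[str, str]:
    """Map each source token to the target token it is most often aligned with."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src_tokens, tgt_tokens in aligned_pairs:
        for s, t in zip(src_tokens, tgt_tokens):
            counts[s][t] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def apply_mapping(f: Dict[str, str], tokens: Sequence[str]) -> List[str]:
    # Assumption: tokens never seen in training are left unchanged.
    return [f.get(t, t) for t in tokens]


# Made-up aligned pairs; "д" aligns with "d" twice and with "t" once.
pairs = [
    (["д", "а"], ["d", "a"]),
    (["д", "о"], ["d", "o"]),
    (["д", "а"], ["t", "a"]),
]
f = build_mapping(pairs)
print(apply_mapping(f, ["д", "о", "м"]))
```

Because f ignores context, any context-dependent correspondence has to be recovered by the synthesized program that runs after the mapping.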

Metrics
We calculate two metrics: exact match accuracy, and CHRF score (Popović, 2015). The exact match accuracy measures the fraction of examples the synthesis system gets fully correct.

$$\mathrm{EXACT} = \frac{\#\{\text{correctly predicted test samples}\}}{\#\{\text{test samples}\}}$$
The CHRF score is calculated only at the token level. It measures the n-gram overlap between the predicted answer and the true answer, which allows us to give credit for partially correct answers. We do not calculate the CHRF score for stress problems as n-gram overlap is not a meaningful measure of performance for these problems.

Table 5 summarizes the results of our experiments. We report the average of each metric across problems for all problems and by category. We find that neural models that don't have specific inductive biases for the kind of tasks we present here are not able to perform well with this amount of data. The synthesis models do better than the WFST baseline overall, and on all types of problems except transliteration. This could be due to the simple map computed from alignments before program synthesis causing errors that the rule learning process cannot correct.
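The two metrics can be sketched in Python as follows. The CHRF function is a simplified version that averages n-gram precision and recall over n and combines them into an F-score; the maximum n-gram order `max_n` and the weight `beta` are assumptions for illustration and are not the settings used to produce Table 5.

```python
from collections import Counter
from typing import List, Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def chrf(pred: Sequence[str], ref: Sequence[str],
         max_n: int = 2, beta: float = 1.0) -> float:
    precisions: List[float] = []
    recalls: List[float] = []
    for n in range(1, max_n + 1):
        p, r = ngrams(pred, n), ngrams(ref, n)
        if not p or not r:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((p & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(p.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    avg_p = sum(precisions) / len(precisions)
    avg_r = sum(recalls) / len(recalls)
    if avg_p + avg_r == 0:
        return 0.0
    return (1 + beta ** 2) * avg_p * avg_r / (beta ** 2 * avg_p + avg_r)


print(exact_match(["masa", "masi"], ["masa", "masa"]))
print(round(chrf(list("masi"), list("masa")), 3))
```

A prediction that differs from the reference in one token scores zero on exact match but still receives partial credit from the n-gram overlap.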

Analysis
We examine two aspects of the program synthesis models we propose. The first is the way it uses the explicit knowledge in the DSL and implicit knowledge provided as the ranking score to generalize. We then consider specific examples of problems, and show examples of where our models succeed and fail in learning different types of patterns.

Features aid generalization
Since the test examples are chosen to test specific rules, the number of test examples solved correctly is indicative of the number of rules inferred correctly. In Table 6, we see that providing the model with features allows it to infer more general rules, solving a greater fraction of the test examples in more problems. We see that allowing the model to use features increases its performance, and having it prefer more general rules involving features lets it do even better.

Correct programs are short
In Figure 4 we see that the number of rules in a problem tends to be higher when the model gets the problem wrong than when it gets it right. This indicates that when the model finds many specific rules, it overfits to the training data, and fails to generalize well. This holds true for all the variants, as seen in the downward slope of the lines. We also find that allowing and encouraging a model to use features leads to shorter programs. The average length of a program synthesized by NOFEATURE is 30.5 rules, while it is 25.8 for TOKEN, and 20.7 for FEATURE. This suggests that explicit access to features, and an implicit preference for them, leads to fewer, more general rules.
Figure 4: Number of rules plotted against EXACT score.

Using features
Some problems provide additional information about certain sounds. For example, a problem based on the alternation of retroflexes in Warlpiri words (Laughren, 2011) explicitly identifies retroflex sounds in the problem statement. In this case, a program produced by our FEATURE system is able to use these features, and isolate the focus of the problem by learning rules such as
IfThen(Not(Is(w, "retroflex", 0)), Identity(x))
The system learns a concise solution, and is able to generalize using features rather than learning separate rules for individual sounds.
In the case of inflecting a Mandar verb (McCoy, 2018), the FEATURE system uses a feature to find a rule that is more general than the data warrants. To capture the rule that the prefix di- changes to mas- when the root starts with s, the model synthesizes
IfThen(Is(w, "fricative", 1), ReplaceBy(x, "i", "s"))
However, since s is the only fricative in the data, this rule is equivalent to a rule specific to s. This rule also covers examples where the root starts with s, and causes the model to miss the more general rule that a voiceless sound at the beginning of the root is copied to the end of the prefix. It identifies this rule only for roots starting with p, as
IfThen(IsToken(w, "p", 1), CopyReplace(x, w, 1))
The TOKEN system does not synthesize these rules based on features, and instead chooses rules specific to each initial character in the root.
Since the DSL allows for substituting one token with one other, or inserting multiple tokens, the system has to use multiple rules to substitute one token with multiple tokens. In the case of Mandar, we see one way it does this, by performing multiple substitutions: to transform di- to mas-, it replaces d and i with a and s respectively, and then inserts m.

Multi-pass rules
In a problem on Somali verb forms (Somers, 2016), we see a different way of handling multi-token substitutions by using multi-pass rules to create a complex rule using simpler elements. The problem requires being able to convert verbs from 1st person to 3rd person singular. The solution includes a rule where a single token (l) is replaced with (sh). The learned program uses two passes to capture this rule through sequential application of two rules: first ReplaceBy(x, "l", "h"), followed by
IfThen(TransformationApplied(w, "{ReplaceBy, h}", 1), Insert(x, "s"))
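The two-pass behaviour described above can be traced with a small Python sketch. The `Tok` class, the tag string, and the input strings are hypothetical (the inputs are made-up strings, not Somali data); the sketch only illustrates how a tag left by the first pass lets the second pass distinguish an h derived from l from an original h.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tok:
    symbol: str
    tag: Optional[str] = None  # transformation applied in an earlier pass


def pass_one(word: List[Tok]) -> List[Tok]:
    """ReplaceBy(x, "l", "h"), tagging every token it rewrites."""
    return [Tok("h", "ReplaceBy,h") if t.symbol == "l" else Tok(t.symbol)
            for t in word]


def transformation_applied(word: List[Tok], tag: str, offset: int, pos: int) -> bool:
    j = pos + offset
    return 0 <= j < len(word) and word[j].tag == tag


def pass_two(word: List[Tok]) -> List[Tok]:
    """Insert "s" after x when the next token was produced by ReplaceBy."""
    out: List[Tok] = []
    for pos, t in enumerate(word):
        out.append(t)
        if transformation_applied(word, "ReplaceBy,h", 1, pos):
            out.append(Tok("s", "Insert"))
    return out


def run(s: str) -> str:
    word = [Tok(c) for c in s]
    return "".join(t.symbol for t in pass_two(pass_one(word)))


print(run("dala"), run("daha"))
```

In this sketch the insertion attaches to the token before the rewritten one, so a word-initial l would become h without a preceding s; whether the data contains such forms is not something the sketch addresses.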

Selecting spans of the input
In a problem involving reduplication in Tarangan (Warner, 2019), all variants fail to capture any synthesis rules. Reduplication in Tarangan involves copying one or two syllables in the source word to produce the target word. However, the DSL we use does not have any predicates or transformations that allow the system to reference a span of multiple tokens (which would form a syllable) in the input. Therefore, it fails to model reduplication.

Global constraints
Since we provide the synthesis model with token-level examples, it does not have access to word-level information. This results in poor performance on stress problems, as stress depends on the entire word. Consider the example of Chickasaw stress (Vaduguru, 2019). The system correctly learns the rule
IfThen(Is(w, "long", 0), ReplaceAnyBy(x, "1"))
that stresses any long vowel in the word. However, since it cannot check if the word has a long vowel that has already been stressed, it is not able to correctly model the case when the word doesn't have a long vowel. This results in some samples being marked with stress at two locations, one where the rule for long vowels applies, and one where the rule for words without long vowels applies.

Related work
Gildea and Jurafsky (1996) also study the problem of learning phonological rules from data, and explicitly controlling generalization behaviour. We pursue a similar goal, but in a few-shot setting. Barke et al. (2019) and Ellis et al. (2015) study program synthesis applied to linguistic rule learning. They make much stronger assumptions about the data (the existence of an underlying form, and the availability of additional information like IPA features). We take a different approach, and study program synthesis models that can work only on the tokens in the word (like NOFEATURE), and also explore the effect of providing features in these cases. We also test our approach on a more varied set of problems that involves aspects of morphology, transliteration, multilinguality, and stress.
Şahin et al. (2020) also present a set of Linguistics Olympiad problems as a test of the metalinguistic reasoning abilities of NLP models. While problems in their set involve finding phonological rules, they also require knowledge of syntax and semantics that is out of the scope of our study. We present a set of problems that only requires reasoning about surface word forms, without requiring their meanings.

Conclusion
In this paper, we explore the problem of learning linguistic rules from only a few training examples. We approach this using program synthesis, and demonstrate that it is a powerful and flexible technique for learning phonology rules in Olympiad problems. These problems are designed to be challenging tasks that require learning rules from a minimal number of examples. These problems also allow us to specifically test for generalization.
We compare our approach to various baselines, and find that it is capable of learning phonological rules that generalize much better than the baseline approaches we evaluate. We show that using the DSL, we can explicitly control the structure of rules, and using the ranking score, we can provide the model with implicit preferences for certain kinds of rules.
Having demonstrated the potential of program synthesis as a learning technique that can work with very little data and provide human-readable models, we hope to apply it to learning more complex types of linguistic rules in the future.
In addition to being a way to learn rules from data, the ability to explicitly control the generalization behaviour of the model allows for the use of program synthesis to understand the kinds of learning biases and operations that are required to model various linguistic processes. We leave this exploration to future work.

We use the NDSyn algorithm to learn disjunctions of rules. We apply NDSyn in multiple passes to allow the model to learn multi-pass rules. At each pass, the algorithm learns rules to perform token-level transformations that are applied to each element of the input sequence. The token-level examples are passed to NDSyn, which learns the if-then-else statements that constitute a set of rules. This is done by first generating a set of candidate rules by randomly sampling a token-level example and synthesizing a set of rules that satisfy the example. Then, rules are selected to cover the token-level examples.
Rules that satisfy a randomly sampled example are learnt using the FlashMeta program synthesis algorithm (Polozov and Gulwani, 2015). The synthesis task is given by the DSL operator P and the specification of constraints X that the synthesized program must satisfy. In our application, this specification is in the form of token-level examples, and the DSL operators are the predicates and transformations defined in the paper. The algorithm recursively decomposes the synthesis problem (P, X) into smaller tasks (P_i, X_i) for each argument P_i to the operator. X_i is inferred using the inverse semantics of the operator P_i, which is encoded as a witness function. The inverse semantics provides the possible values for the arguments of an operator, given the output of the operator. We refer the reader to the paper by Polozov and Gulwani (2015) for a full description of the synthesis algorithm.
After the candidates are generated, they are ranked according to a ranking score of each program. The ranking score for an operator in a program is computed as a function of the scores of its arguments. The arguments may be other operators, offsets, or other constants (like tokens or features). The score for an operator in the argument is computed recursively. The score for an offset favours smaller numbers and local rules by decreasing the score for larger offsets. The score for other constants is chosen to be a small negative constant. The scores for the arguments are added up, along with a small negative penalty to favour shorter programs, to obtain the final score for the operator.
This ranking score selects for shorter programs. It can also be set to favour either more general rules, by giving the Is predicate a higher score (FEATURE), or more specific rules, by giving the IsToken predicate a higher score (TOKEN). The top k programs according to the ranking function are chosen as candidates for the next step.
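The recursive ranking score can be sketched as below. The structure follows the description above (argument scores are summed, larger offsets score lower, constants receive a small negative score, and each operator adds a small negative penalty), but every numeric value, the class names, and the `op_bonus` mechanism for the FEATURE and TOKEN settings are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Illustrative constants only; the values used in the paper are not given here.
SIZE_PENALTY = -0.1     # per-operator penalty, favours shorter programs
CONSTANT_SCORE = -0.05  # small negative score for tokens and features
OFFSET_WEIGHT = -0.2    # larger |offset| gives a lower score (local rules win)


@dataclass(frozen=True)
class Offset:
    value: int


@dataclass(frozen=True)
class Const:
    value: str


@dataclass(frozen=True)
class Op:
    name: str
    args: Tuple["Node", ...]


Node = Union[Offset, Const, Op]


def score(node, op_bonus=None):
    """Score a program tree; higher is better."""
    op_bonus = op_bonus or {}
    if isinstance(node, Offset):
        return OFFSET_WEIGHT * abs(node.value)
    if isinstance(node, Const):
        return CONSTANT_SCORE
    # Operator: sum of argument scores, plus the size penalty, plus an
    # optional per-operator bonus (hypothetical way to express FEATURE/TOKEN).
    args_total = sum(score(arg, op_bonus) for arg in node.args)
    return args_total + SIZE_PENALTY + op_bonus.get(node.name, 0.0)


FEATURE = {"Is": 0.5}       # prefer feature-based (more general) predicates
TOKEN = {"IsToken": 0.5}    # prefer token-based (more specific) predicates

by_token_near = Op("IsToken", (Const("a"), Offset(1)))
by_token_far = Op("IsToken", (Const("a"), Offset(3)))
by_feature = Op("Is", (Const("vowel"), Offset(1)))
```

Under these assumed values, `by_token_near` outranks `by_token_far` in either setting, while the relative order of `by_feature` and `by_token_near` flips between `FEATURE` and `TOKEN`.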
To choose the final set of rules from the candidates generated using the FlashMeta algorithm, we use a set covering algorithm that chooses the rules that correctly answer the largest number of examples while incorrectly answering the fewest. These rules are applied to each example, and the output tokens are tagged with the transformation that is applied. These outputs are then the input to the next pass of the algorithm.
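One simple way to realize such a selection step is a greedy set cover, sketched below. The greedy strategy, the gain function (newly covered examples minus incorrectly answered examples), and the toy candidates are assumptions for illustration; the exact covering procedure used by NDSyn may differ.

```python
def select_rules(candidates, examples):
    """Greedily pick rules that cover the examples.

    candidates: list of (name, fn); fn(x) returns an output token, or None
                when the rule does not apply to x.
    examples:   list of (input_token, expected_output_token).
    """
    uncovered = set(range(len(examples)))
    chosen = []
    while uncovered:
        best_name, best_gain, best_hits = None, 0, set()
        for name, fn in candidates:
            if name in chosen:
                continue
            # Examples not yet covered that this rule answers correctly.
            hits = {i for i in uncovered if fn(examples[i][0]) == examples[i][1]}
            # Examples on which the rule fires but gives the wrong answer.
            misses = sum(1 for x, y in examples
                         if fn(x) is not None and fn(x) != y)
            gain = len(hits) - misses
            if gain > best_gain:
                best_name, best_gain, best_hits = name, gain, hits
        if best_name is None:  # no remaining rule helps; stop
            break
        chosen.append(best_name)
        uncovered -= best_hits
    return chosen


# Invented token-level examples and candidate rules.
examples = [("a", "e"), ("a", "e"), ("o", "o"), ("k", "k")]
candidates = [
    ("vowel->e", lambda t: "e" if t in "aeiou" else None),  # wrong on "o"
    ("a->e", lambda t: "e" if t == "a" else None),
    ("copy-non-a", lambda t: t if t != "a" else None),
]

selected = select_rules(candidates, examples)
```

Here the over-general `vowel->e` rule is never selected, because it answers the `o` example incorrectly while the more specific `a->e` rule covers the same examples without errors.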

B Dataset
We select problems from various Linguistics Olympiads to create our dataset. We include publicly available problems that have appeared in Olympiads before. We choose problems that only involve rules based on the symbols in the data, and not based on knowledge of notions such as gender, tense, case, or semantic role. These problems are based on the phonology of a particular language, and include aspects of morphology and orthography, and in some cases also the phonology of a different language. Where a single Olympiad problem involves multiple components that can be solved independently of each other, we include them as separate problems in our dataset.
We put the data and assignments in a matrix, as described in Section 3.1. We separate tokens in a word by a space while transcribing the problems from their source PDFs. We do not separate diacritics as different tokens, and include them as part of the same token. For each token in the Roman script, we add the boolean features vowel and consonant, and manually tag the tokens according to whether they are a vowel or consonant.
We store the problems in JSON files with details about the languages, the families to which the languages belong, the data matrix, the notes used to create the features, and the feature sets for each token.
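The following sketch shows what one such problem record might look like when built and serialized. The key names, the placeholder language and family, the invented word forms, and the manual tag table are all hypothetical; the actual schema of the dataset files is not specified here.

```python
import json

# Hypothetical manual tag table; a token with a diacritic ("á") is a single token.
MANUAL_TAGS = {"a": "vowel", "á": "vowel", "k": "consonant", "t": "consonant"}


def tokenize(word):
    """Tokens within a word are separated by a single space."""
    return word.split(" ")


def token_features(token):
    """Boolean vowel/consonant features looked up from the manual table."""
    tag = MANUAL_TAGS[token]
    return {"vowel": tag == "vowel", "consonant": tag == "consonant"}


# Invented data matrix: one row, two columns.
data = [[tokenize("k a t"), tokenize("k á t a")]]

record = {
    "languages": ["ExampleLang"],        # placeholder, not a real language
    "families": ["ExampleFamily"],       # placeholder
    "data": data,
    "notes": "Vowels: a, á. Consonants: k, t.",
    "features": {tok: token_features(tok)
                 for row in data for cell in row for tok in cell},
}

serialized = json.dumps(record, ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps tokens with diacritics readable in the stored file rather than escaping them.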

C.1 Neural
Following Şahin et al. (2020), we use small neural models for sequence-to-sequence tasks. We train a single neural model for each task, and provide the column numbers as tags in addition to the source sequence. We find that the single model approach works better than training a model for each pair of columns.
LSTM: We use LSTM models with soft attention (Luong et al., 2015), with embeddings of size 64, hidden layers of size 128, a 2-layer encoder and a single-layer decoder. We apply a dropout of 0.3 for all layers. We train the model for 100 epochs using the Adam optimizer with a learning rate of $10^{-3}$, learning rate reduction on plateau, and a batch size of 2. We clip the gradient norm to 5.
Transformer: We use Transformer models (Vaswani et al., 2017) with embeddings of size 128, hidden layers of size 256, a 2-layer encoder and a 2-layer decoder. We apply a dropout of 0.3 for all layers. We train the model for 2000 steps using the Adam optimizer with a learning rate of $10^{-3}$, warmup of 400 steps, learning rate reduction on plateau, and a batch size of 2. We use a label smoothing value of 0.1, and clip the gradient norm to 1.
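For reference, the hyperparameters above can be collected into plain configuration dictionaries, together with a sketch of how column tags might be prepended to a source sequence. The dictionary keys, the tag format, and the choice to tag both the source and the target column are assumptions; only the numeric values come from the text.

```python
LSTM_CONFIG = {
    "embedding_size": 64,
    "hidden_size": 128,
    "encoder_layers": 2,
    "decoder_layers": 1,
    "dropout": 0.3,
    "epochs": 100,
    "learning_rate": 1e-3,
    "batch_size": 2,
    "max_grad_norm": 5.0,
}

TRANSFORMER_CONFIG = {
    "embedding_size": 128,
    "hidden_size": 256,
    "encoder_layers": 2,
    "decoder_layers": 2,
    "dropout": 0.3,
    "train_steps": 2000,
    "warmup_steps": 400,
    "learning_rate": 1e-3,
    "batch_size": 2,
    "label_smoothing": 0.1,
    "max_grad_norm": 1.0,
}


def tag_source(tokens, source_column, target_column):
    """Prepend hypothetical column tags so one model can serve all column pairs."""
    return [f"<src:{source_column}>", f"<tgt:{target_column}>"] + list(tokens)


tagged = tag_source(["k", "a", "t"], source_column=0, target_column=2)
```

With tags of this kind, a single model sees which mapping between columns is requested, which is what allows one model per task instead of one per pair of columns.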

C.2 WFST
We use the implementation of the WFST models available at https://github.com/sigmorphon/2020/tree/master/task1/baselines/fst. We train a model for each pair of columns. We report the results for models of order 5, which were found to perform the best on the test data (highest EXACT score) among models of order 3 to 9.