When is Wall a Pared and when a Muro?: Extracting Rules Governing Lexical Selection

Learning fine-grained distinctions between vocabulary items is a key challenge in learning a new language. For example, the noun "wall" has different lexical manifestations in Spanish: "pared" refers to an indoor wall while "muro" refers to an outdoor wall. However, such lexical distinctions may not be obvious to non-native learners unless they are explicitly explained. In this work, we present a method for automatically identifying fine-grained lexical distinctions and extracting rules explaining these distinctions in a human- and machine-readable format. We confirm the quality of these extracted rules in a language learning setup for two languages, Spanish and Greek, where we use the rules to teach non-native speakers when to translate a given ambiguous word into its different possible translations.


Introduction
With increasing globalization, there is a widespread need for good materials and tools to help people learn languages. Curating such content manually requires a large investment of time and money, which poses a challenge particularly for languages where protection and revival efforts are ongoing (Moline, 2020). One of the most important and challenging processes in learning a new language (L2) is vocabulary acquisition (Ellis, 1996; Moore, 1996), which is generally made easier by associating L2 words with words from the first language (L1) (Hulstijn et al., 1996; Watanabe, 1997). In many cases, L1 words or word senses can be unambiguously associated with L2 words. For example, "linguistics" and "lingüística" essentially form a one-to-one mapping between English and Spanish. However, different languages carve up the semantic space of the world in different ways, leading to semantic subdivisions: distinctions made in one language that are not made in another. For example, "wall" in English is manifested differently in Spanish as "pared" or "muro", as shown in Figure 1, and for an L1 English speaker it may not be immediately obvious when one should be used over the other. A skilled teacher or comprehensive language learning resource may be able to provide explanations that resolve this ambiguity. For example, Robertson (2020) present word definitions in-context for Finnish learners, while CAVOCA (Groot, 2000) takes a learner through various stages of the word acquisition process, including word usage and syntax.

1 https://github.com/Aditi138/LexSelection

Figure 1: The English word "wall" (En: wall) corresponds to two Spanish words (Es: muro, Es: pared).
In this work, we propose a method to automatically discover rules regarding fine-grained lexical distinctions and present L2 learners with concise descriptions derived from them in an interactive framework. Research in L2 vocabulary acquisition (Groot, 2000) has shown that it is effective to combine strategies using explicit definitions and examples in context. However, our work contrasts with most prior work in this field, such as CAVOCA (Groot, 2000) or Duolingo, 2 which use learning content manually created by subject matter experts. This necessity for curation makes it difficult to scale this approach comprehensively to many languages.
Specifically, our framework consists of two steps: i) use a parallel corpus to identify words in L1 which have different lexical manifestations owing to a semantic subdivision in L2, and ii) create human-and machine-readable concise descriptions that allow for easier interpretation of each lexical distinction. First, we extract source (L1) and target (L2) parallel sentences for each shortlisted L1 word. We then extract lexical and semantic features, as well as a label encoding the lexical choice in the target language from these parallel sentences for each L1 word. Finally, we train a prediction model that distinguishes between the lexical choices, and extract human-understandable descriptions from this model. These descriptions could either be used as-is, or could be used as a starting point for further curation by educators.
To confirm the quality of the extracted descriptions, we conduct a study where we use them to teach English native speakers lexical distinctions arising from semantic subdivisions in Spanish and Greek. We make our study interactive by presenting the learning content in the form of cloze tests (Taylor, 1953), where the English word to be taught is presented to the learner in context along with an extracted concise description. The learner is then required to select the most appropriate lexical choice from the given set. The main methodological contributions are therefore automated methods to:
• Identify fine-grained lexical distinctions arising from semantic subdivisions. To evaluate this and future work, we also create a lexical selection dataset for two language pairs, English-Spanish and English-Greek.
• Extract rules that help humans understand the usage of lexical distinctions in context. Studies with 7 Spanish and 9 Greek learners show that they learn faster when given access to our extracted descriptions; for example, they achieve an average accuracy of 81% within roughly 20 questions, as opposed to the more than 40 questions required otherwise.

Problem Formulation
For the purpose of this paper, we define the task of lexical selection as choosing the contextually correct translation from a set of target translations for an ambiguous word in the source language (Lefever and Hoste, 2010). We first define some variables: x = x_1, x_2, ..., x_{|x|} denotes a sentence in the source language (L1), y = y_1, y_2, ..., y_{|y|} is its translation in the target language (L2), and V_x and V_y are the source and target vocabularies respectively. Given a source sentence x containing an ambiguous word x_i, trans(x_i) ⊆ V_y denotes the set of its "possible" target translations, i.e. words in the target language to which the ambiguous word x_i might be translated (concrete methods to define this set are explained later). The task of lexical selection involves choosing the most appropriate translation y_i ∈ trans(x_i), and can be performed either by machines or humans. In this work, we particularly focus on machine-learned methods to help humans learn lexical selection, extracting lexical selection models that are not only usable by machines but also interpretable by humans, in order to aid the process of learning a new language. We thus aim to extract the rule set R_{v_x} which governs this lexical selection process in a human- and machine-readable format.
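A minimal sketch of this formulation, with a toy lexicon and an invented scoring function (the real trans(·) sets are induced automatically from parallel data, as described in the following section):

```python
# Toy lexicon: trans(v_x, t_x) for the English noun "wall" in Spanish.
# All names and the scoring function here are illustrative, not the
# paper's actual implementation.
TRANS = {("wall", "NOUN"): {"pared", "muro"}}

def lexical_selection(focus, pos, score):
    """Pick the translation in trans(focus, pos) maximizing a context score."""
    return max(TRANS[(focus, pos)], key=score)

# A toy scorer that prefers "muro" when outdoor cues appear in context:
context = {"stone", "garden"}
pick = lexical_selection("wall", "NOUN",
                         lambda y: len(context) if y == "muro" else 0)
```

The learned models described below play the role of the `score` function, ranking the lexical choices given features of the source context.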

Identifying Semantic Subdivisions
In this section, we describe in detail the procedure for identifying L1 words that have different lexical manifestations in L2 owing to semantic subdivisions. For the purpose of this work, we refer to these different lexical manifestations in L2 as lexical choices and the corresponding L1 words as focus words. Our work is "loosely inspired" by ContraWSD (Rios et al., 2018) and SemEval-2013 (Lefever and Hoste, 2013), which construct datasets for cross-lingual word sense disambiguation using a semi-automatic approach combining frequency-based heuristics with human supervision. These datasets are restricted to a subset of manually selected nouns (20 for SemEval-2013 and 70-80 for ContraWSD). In contrast, our approach is fully automated, going beyond frequency-based filters. Furthermore, we do not restrict ourselves to any one word class, identifying words across different word classes (nouns, verbs, adjectives, adverbs) for both Spanish and Greek. We start with a parallel corpus D = {(x^1, y^1), ..., (x^{|D|}, y^{|D|})}, where (x^m, y^m) denotes the m-th source and target sentence pair. Next, we extract word alignments automatically using a word aligner that finds sets of pairs of source and target words A_m = {⟨x_i, y_j⟩ : x_i ∈ x^m, y_j ∈ y^m}, where for each word pair ⟨x_i, y_j⟩, x_i and y_j are semantically similar to each other within this context.
To focus on translations of the underlying content, as opposed to morphological variations, we then lemmatize all words in both the source and target sentence pairs. Thus, V_x and V_y refer to the lemmatized vocabularies of the source and target language. Going forward, all words refer to their respective lemmatized forms. We perform automatic part-of-speech (POS) tagging, dependency parsing and word sense disambiguation (WSD) on the source-side data, resulting in a POS tag and word sense associated with each source word, tag(x_i) ∈ T_x and sense(x_i) ∈ S_x, where T_x is the set of POS tags and S_x is the word sense vocabulary in the source language.
In order to identify the focus words, we extract a list of lemmatized L1 word types v_x filtered by their part-of-speech (POS) tags t_x, giving us tuples of the form ⟨v_x, t_x⟩. This ensures that we do not conflate meanings across POS tags, because in many languages the semantics of a word can vary widely across its different POS tags. 5 We refer to the extracted tuples ⟨v_x, t_x⟩ as focus words for simplicity. We then extract the focus words with their respective lexical choices as follows:

1. Extract translations: For each aligned word pair ⟨x_i, y_j⟩, compute the number of times c(v_x, t_x, v_y) that the lemmatized source word type (v_x = lemma(x_i)) along with its POS tag (t_x = tag(x_i)) is aligned to the lemmatized target word type (v_y = lemma(y_j)) across the whole corpus. Also, store the number of times the word sense of x_i (s_x = sense(x_i)) appears with the source word type, source POS tag and the translation word type in g(v_x, t_x, s_x, v_y).
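Step 1 amounts to maintaining two count tables over the aligned, lemmatized corpus. A minimal sketch, where `aligned_pairs` stands in for the combined output of the aligner, lemmatizer, tagger, and WSD model (data and names are illustrative):

```python
from collections import Counter

# Each item: ((source lemma v_x, POS t_x, sense s_x), target lemma v_y).
aligned_pairs = [
    (("wall", "NOUN", "wall.n.01"), "pared"),
    (("wall", "NOUN", "wall.n.01"), "muro"),
    (("wall", "NOUN", "wall.n.01"), "pared"),
]

c = Counter()  # c(v_x, t_x, v_y): alignment counts per translation
g = Counter()  # g(v_x, t_x, s_x, v_y): sense co-occurrence counts
for (v_x, t_x, s_x), v_y in aligned_pairs:
    c[(v_x, t_x, v_y)] += 1
    g[(v_x, t_x, s_x, v_y)] += 1
```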
2. Filter on frequency: Extract tuples of source types and POS tags ⟨v_x, t_x⟩ that have been aligned to at least two target words at least 50 times each (|{v_y : c(v_x, t_x, v_y) ≥ 50}| ≥ 2), to account for alignment errors. To avoid ambiguity on the target side, translations that are also aligned (at least 3 times) to source words other than the word v_x in question are excluded.
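The frequency filter of Step 2 can be sketched as follows (thresholds from the paper; function names are ours):

```python
from collections import Counter

# Keep a (lemma, POS) tuple only if it is aligned to >= 2 distinct target
# words, each seen at least 50 times.
def frequent_choices(c, v_x, t_x, min_count=50):
    return {v_y for (vx, tx, v_y), n in c.items()
            if (vx, tx) == (v_x, t_x) and n >= min_count}

c = Counter({("wall", "NOUN", "pared"): 120,
             ("wall", "NOUN", "muro"): 75,
             ("wall", "NOUN", "tapia"): 4})   # rare alignment, likely noise
choices = frequent_choices(c, "wall", "NOUN")
keep = len(choices) >= 2
```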
3. Filter on entropy: Remove source tuples that have an entropy H(v_x, t_x) less than a pre-selected threshold. The entropy is computed using the conditional probability of a target translation given the source type and POS tag:

H(v_x, t_x) = − Σ_{v_y ∈ trans(v_x, t_x)} p(v_y | v_x, t_x) log p(v_y | v_x, t_x)

where trans(v_x, t_x) is the set of target translations for the source tuple ⟨v_x, t_x⟩ and p(v_y | v_x, t_x) is the conditional probability of the target translation for this source type v_x and its POS tag t_x. High entropy suggests that a word is ambiguous, with fine-grained distinctions that likely require context to be resolved, and is thus a word we should focus on.

5 "Brown" as a verb (as in "brown the meat") is treated differently from the adjective sense (as in "brown hair").
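The entropy of Step 3 follows directly from the alignment counts of Step 1. A minimal sketch (names ours):

```python
import math
from collections import Counter

# Entropy of the translation distribution p(v_y | v_x, t_x), computed from
# the alignment counts c built in Step 1.
def translation_entropy(c, v_x, t_x):
    counts = {v_y: n for (vx, tx, v_y), n in c.items() if (vx, tx) == (v_x, t_x)}
    total = sum(counts.values())
    probs = [n / total for n in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

c = Counter({("wall", "NOUN", "pared"): 60, ("wall", "NOUN", "muro"): 60})
H = translation_entropy(c, "wall", "NOUN")  # uniform over 2 choices -> 1 bit
```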
4. Filter on word sense: Remove source tuples whose target translations have distinct source-word senses. For some words, the differences between target translations can be straightforwardly explained by the different source word senses. For example, "banco" in Spanish refers to the financial institution, given by the WordNet (Miller, 1995) sense 'bank.n.02', while "orilla" refers to the edge of a river, matched to 'bank.n.01'. For such words, the word sense definitions would be an easy-to-provide rule for learners, but we want to go beyond that. We are interested in finding those words where the word sense information alone is insufficient to distinguish between the lexical choices, and which are hence likely to be hard for human learners. For a source tuple, we use the most frequent word sense for a given target translation v_y, computed as:

s*(v_x, t_x, v_y) = argmax_{s_x} g(v_x, t_x, s_x, v_y)

Finally, we retain the source tuples whose target translations all have the same sense, giving us L lexical choices trans(v_x, t_x) = {v_{y_1}, ..., v_{y_L}} for a source tuple ⟨v_x, t_x⟩.
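Step 4 can be sketched with the sense counts g from Step 1: a focus word is kept only when all of its translations share the same majority source sense, i.e. sense information alone cannot separate the choices (function names and toy counts are ours):

```python
# Majority source sense for a given target translation (the argmax above).
def majority_sense(g, v_x, t_x, v_y):
    senses = {s_x: n for (vx, tx, s_x, vy), n in g.items()
              if (vx, tx, vy) == (v_x, t_x, v_y)}
    return max(senses, key=senses.get)

def passes_sense_filter(g, v_x, t_x, choices):
    return len({majority_sense(g, v_x, t_x, v_y) for v_y in choices}) == 1

g = {("wall", "NOUN", "wall.n.01", "pared"): 50,
     ("wall", "NOUN", "wall.n.01", "muro"): 40,
     ("bank", "NOUN", "bank.n.02", "banco"): 30,   # financial institution
     ("bank", "NOUN", "bank.n.01", "orilla"): 25}  # river edge
keep_wall = passes_sense_filter(g, "wall", "NOUN", ["pared", "muro"])    # kept
keep_bank = passes_sense_filter(g, "bank", "NOUN", ["banco", "orilla"])  # dropped
```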

Lexical Selection Model
After identifying a set of focus words in the source language, we train a lexical selection model parameterized by θ_{v_x,t_x} for each focus word ⟨v_x, t_x⟩. We extract the parallel sentences from D that include the focus word and its corresponding lexical choices, denoting them with D_{v_x,t_x}. The model takes as input the source sentences x ∈ D_{v_x,t_x} and predicts the contextually correct target translation v_y from the set of possible translations trans(v_x, t_x) = {v_{y_1}, v_{y_2}, ..., v_{y_k}}. Since we aim to induce concise, human-understandable explanations of semantic distinctions that can be presented to learners to help them better understand the lexical selection process, we train a prediction model which allows us to easily extract such descriptions for each lexical choice v_y ∈ trans(v_x, t_x). In this paper, we use human-readable descriptions of the features learned by a linear model, where these features are defined over a set of lexical and semantic features extracted from the source sentences in D_{v_x,t_x}. For designing features, we take inspiration from prior work which uses extracted contextual information to improve cross-lingual sense disambiguation in machine translation systems (Garcia-Varea et al., 2001; Carpuat and Wu, 2007b,a).

Model Features
For training a lexical selection model θ_{v_x,t_x} for the focus word ⟨v_x, t_x⟩, we construct training data from the source-target sentence pairs D_{v_x,t_x}. We focus on features extracted only from the current source sentence, although the framework can easily be extended to include features from the target sentence as well. We represent each source sentence x ∈ D_{v_x,t_x} with a set of features extracted from the neighborhood of the focus word that is relevant to the lexical selection process. This neighborhood includes (1) words from the source sentence that occur within a fixed window of the given ambiguous word, and (2) the head and dependents of the focus word as given by the dependency parse of the sentence. For each word in this relevant context, we extract the following lexical features:
• Lemma: the lemma of the token.
• WSD: the word sense of the token, as given by a state-of-the-art word sense disambiguation (WSD) model.
• Bigram: bigrams constructed from the lemmas of the words within a fixed window around the focus word, excluding punctuation and stop words within the window.
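A minimal sketch of the window-based features for a single lemmatized source sentence. The window size, stop list, and feature-name format are illustrative, and the dependency-based neighborhood would contribute analogous Lemma/WSD features:

```python
# Extract Lemma, WSD, and Bigram features from a fixed window around the
# focus word. Inputs are pre-lemmatized; `senses` holds WSD output (or None).
def window_features(lemmas, senses, focus_idx, window=3,
                    stop={"the", "a", "of"}):
    feats = set()
    lo, hi = max(0, focus_idx - window), min(len(lemmas), focus_idx + window + 1)
    ctx = [i for i in range(lo, hi) if i != focus_idx and lemmas[i] not in stop]
    for i in ctx:
        feats.add(f"lemma={lemmas[i]}")
        if senses[i]:
            feats.add(f"sense={senses[i]}")
    for a, b in zip(ctx, ctx[1:]):  # bigrams over the filtered window
        feats.add(f"bigram={lemmas[a]}_{lemmas[b]}")
    return feats

f = window_features(["build", "a", "stone", "wall", "outside"],
                    [None, None, "stone.n.02", None, None], focus_idx=3)
```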

Model Training
To enable extraction of human-understandable descriptions, we use a model that is conducive to interpretation: the linear SVM (LinearSVM; Cortes and Vapnik, 1995), which gives us feature weights θ_{v_x,t_x} that can be easily interpreted as the importance of each feature in making the decision. Since there can be n-ary lexical choices for a given focus word, we train using the one-vs-rest (OvR) method, which trains one model per lexical choice v_{y_k}, where data from v_{y_k} are treated as positive examples and data from all other choices as negative, allowing us to extract feature weights for each decision.
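A sketch of the OvR training with scikit-learn on toy feature dictionaries (the data, feature names, and prediction helper are invented for illustration; the paper's actual hyperparameters are selected by cross-validation):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training data for the focus word <wall, NOUN>.
X_dicts = [{"lemma=stone": 1, "lemma=outside": 1},
           {"lemma=bedroom": 1, "lemma=paint": 1},
           {"lemma=garden": 1, "lemma=stone": 1},
           {"lemma=kitchen": 1, "lemma=hang": 1}]
y = ["muro", "pared", "muro", "pared"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

# One-vs-rest: one binary LinearSVC per lexical choice.
models = {}
for choice in set(y):
    target = [1 if label == choice else 0 for label in y]
    models[choice] = LinearSVC(C=1.0).fit(X, target)

def predict(feats):
    """Pick the choice whose binary classifier scores the context highest."""
    x = vec.transform([feats])
    return max(models, key=lambda ch: models[ch].decision_function(x)[0])

pred = predict({"lemma=stone": 1})
```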

Rule Extraction
As mentioned above, we use human-readable descriptions of the features learned by a linear model to be presented to the human learners. More broadly, we refer to these descriptions as "rules", however these rules could take other forms as well, and we hope that future work by us or others could find other creative ways to induce or define these rules.
For each focus word ⟨v_x, t_x⟩, we extract the rule set R_{v_x,t_x,v_{y_k}}, which is the set of rules for selecting a given lexical choice v_{y_k} from the set of possible choices trans(v_x, t_x). For this, we extract salient features from the trained model θ_{v_x,t_x} for each lexical choice. As mentioned above, using the OvR classification method we get one model per choice v_{y_k}, from which we can then extract the top-N features having the highest weight coefficients for each choice. In order to present these rules in a human-readable form, we create concise rule templates, as shown in Appendix B.1.
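Rule extraction then reads off the largest positive coefficients of each per-choice classifier. A self-contained sketch with toy data (top-N reduced to 2; in the paper, the top-20 features per choice are turned into rule templates):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

X_dicts = [{"lemma=stone": 1}, {"lemma=bedroom": 1},
           {"lemma=garden": 1}, {"lemma=kitchen": 1}]
y = ["muro", "pared", "muro", "pared"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
# OvR binary model for the choice "muro".
clf_muro = LinearSVC().fit(X, [1 if label == "muro" else 0 for label in y])

def top_features(clf, vec, n=2):
    """Features with the highest positive weights for this choice's model."""
    order = np.argsort(clf.coef_.ravel())[::-1][:n]
    return [vec.feature_names_[i] for i in order]

rules = top_features(clf_muro, vec)
```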

Automated Validation
Since our main research goal is to aid human learners in their learning, we focus on two approaches to evaluation: (a) automated validation, a preliminary evaluation where we validate to what extent our interpretable model can perform cross-lingual lexical selection, and (b) human evaluation (§6), which answers our main question of whether it can teach human learners the usage of L2 words.
For the automated evaluation in particular, we verify several things. First, we check whether our interpretable lexical selection model is able to learn cross-lingual lexical selection at all, by measuring its performance against selecting the most frequently occurring translation in the corpus for a given focus word ("Frequency"). We also compare with another interpretable model, a decision tree (DTree) trained using the same features as LinearSVM, to validate the choice of SVMs as an interpretable model over other alternatives. Further, we check how our interpretable linear SVM model compares with a "performance skyline": a less interpretable BERT-based neural model (Devlin et al., 2019) that extracts representations of the source sentence from BERT and trains a classifier to predict the correct lexical choice.

Figure 2: Learning Interface. Rules for the correct answer are displayed to the learner after each question. Individual rules that apply to the given example are highlighted for the convenience of the learner. "wall" here refers to an outside wall and the adjective "stone" serves as a hint in arriving at the correct answer.

Setup
Data: We experiment with two L2 languages: Spanish and Greek. These languages were chosen due to (1) the availability of parallel corpora with which to train models, and (2) the availability of linguists and annotators to verify and analyze the data used in our experimental setting. For Spanish, we use 10 million English-Spanish parallel sentences from OpenSubtitles (Lison and Tiedemann, 2016), Tatoeba, TED (Tiedemann, 2012), and Europarl (Koehn, 2005). 7 For Greek, we use 31 million English-Greek parallel sentences extracted from OpenSubtitles. For word alignment we use the AWESOME aligner (Dou and Neubig, 2021), for lemmatization we use spaCy (Honnibal et al., 2020), for POS tagging and dependency parsing we use Stanza (Qi et al., 2020), and for English WSD we use EWISER (Bevilacqua and Navigli, 2020). 8

7 We use only 1 million sentences from Europarl because we found sentences from Europarl to contain fewer semantic subdivisions, owing to the very specific domain of the dataset.
8 POS tagging, dependency parsing and WSD are required only for the source language, here English.

Using our automatic pipeline (§3), we identify 157 English words which have fine-grained distinctions in Spanish and 707 English words for Greek. Among these, for Spanish there are 127 nouns, 15 verbs, 10 adjectives, and 5 adverbs, and for Greek there are 452 nouns, 123 verbs, 126 adjectives, and 6 adverbs. Along with nouns, which account for much of the data, we also find a significant number of verbs and adjectives exhibiting ≥ 2 lexical choices. A manual inspection by a Greek-English bilingual speaker revealed that most automatically created lexical choices were correct. In just a couple of cases, lemmatizer errors led to two choices corresponding to the same actual lemma (these were manually corrected for the user studies).
Model: We train a linear SVM lexical selection model with sklearn (Pedregosa et al., 2011) for each L1 focus word, and divide the extracted parallel sentences into train/test splits with an 80-20 ratio per lexical choice. We perform 5-fold cross-validation to select the best model hyperparameters (detailed in Appendix A.2), from which we then extract the top-20 features for each lexical choice to form our rule set. Details on the setup of DTree and BERT are also in Appendix A.2.

LinearSVM outperforms both Frequency and DTree by a significant margin, indicating that it is both learning to perform lexical selection to a significant degree and outperforming other reasonable alternatives for interpretable models. This gives us confidence to proceed to use it in our human learning experiments. Interestingly, our interpretable LinearSVM model is within 97% relative accuracy of the skyline BERT model (just 2.09 percentage points behind). The fact that the more complicated but less inherently interpretable BERT model is better overall paves the way for future work applying model interpretation techniques (Abnar and Zuidema, 2020, inter alia) to extract human-interpretable rules for lexical selection, although this is beyond the scope of the current paper. We find that lexical selection accuracy varies by part of speech; all models perform poorly on adverbs, with an average gain of only +0.97 points over the baseline (cf. gains of +8.04 for nouns, +5.16 for verbs, and +6.24 for adjectives).

Evaluation with Human Learners
We move to our main evaluation where we examine how effective our extracted rules are in aiding human learners in understanding the distinctions in L2 words.

Evaluation Methodology
We take inspiration from existing research on second language acquisition (SLA) to design our evaluation method. For instance, Groot (2000) highlights different learning strategies based on generally accepted language acquisition theories (Nation, 2005; Richards et al., 1999), which suggest that a learner is required to go through different levels of language processing to learn vocabulary effectively. In particular, Groot (2000) empirically shows that some of these levels can be accelerated with appropriate design of the language tasks, by combining learning strategies that use both examples in context and definitions. Our cloze-style tasks are essentially examples in context showing the word usage, and the extracted rules are a proxy for human-provided definitions. Specifically, we set up an interactive exercise where a human learner is presented with the English focus word in context, along with a set of possible L2 (Spanish or Greek) lexical choices. The learner is then required to select the lexical choice which they think correctly translates the focus word in the given source context. They must also mark how confident they are in their answer ("Not at all", "Slightly", "Somewhat", "Quite" or "Very"). After they select the answer to each question, they are told the correct answer immediately. For each focus word, we ask the learner to answer up to N multiple-choice questions in sequence, containing a roughly equal number of questions for each lexical choice.
In order to evaluate how effective the extracted rules are in aiding the learning process, we perform this study in two setups, a baseline one without rules, and one using our proposed system with rules.
Baseline Setup: In this setup, the human learner does not have access to any rules and immediately starts answering questions. If the learners do not know the target language, they are likely to start out with approximately chance accuracy (e.g. 50% if there are two choices), but as they are given feedback they may be able to grasp the patterns under which one particular translation or another is used, and gradually rise above chance accuracy.
Proposed Setup: In the proposed setup, before starting the task, the learner is shown brief rules regarding when one would use each possible lexical choice v_{y_k} ∈ trans(v_x, t_x), constructed from the rule set R_{v_x,t_x,v_{y_k}}. They can take as much time as they want to review these rules, and then move to answering questions. The interface for answering questions is the same as in the baseline, but below the task screen they can review the rules for the different translation choices (figures in Appendix B.2). On selecting a choice, the learner is shown the correct answer, accompanied by the human-readable rules of the correct answer only. Further, we highlight those individual rules that helped decide the correct answer (Figure 2) for the convenience of the learner. By highlighting them in the two bottom panes, we hope to draw the learner's attention to these hints and thus strengthen their understanding of the underlying concept.
In this setting, the annotator may achieve non-chance accuracy even at the very beginning, as they have been given an explanation of the underlying rules that they can leverage in answering questions. The accuracy will likely increase further as they practice and become familiar with actual examples and how the extracted features apply to them.

Experimental Details
We select native English speakers: 7 for the Spanish study and 9 for the Greek study. 12 Each annotator is presented with the same set of English words (tasks). For each study, half of the words are annotated using the baseline setup and the remaining half with the proposed setup. To ensure an unbiased setup, we randomize whether each focus word uses rules or not, while ensuring that at least half the annotators see the proposed setup and the other half perform the same task in the baseline setup for each word. We further shuffle the order in which the words are presented. For each English word, we select up to 40 examples for each of its lexical choices. However, as an incentive, we end a task early if the annotator correctly answers 10 questions in a row for each lexical choice. We explain below the selection procedure for the English words used in the experiments.
Word Selection: In an ideal situation, we would like to conduct these experiments for all identified English focus words, but this would involve annotating thousands of sentences, requiring a large time commitment from the annotators. Instead, we shortlist a handful of words using the following automated procedure: First, for a given L2 study, we sort all focus words by the number of available data points (|D_{v_x,t_x}|). Next, from the trained lexical selection model θ_{v_x,t_x} we compute an F1-score for each lexical choice and keep focus words where the model achieves an F1 > 0.5 for every lexical choice. Finally, we select up to 10 focus words with the most data points that satisfy the above condition. For each word ⟨v_x, t_x⟩, we then select 40 representative examples for each lexical choice (see paragraph below). Details on the shortlisted words can be found in Appendix B.3.

12 We allow participants who know other languages, but none who are familiar with the L2 or its related languages.

Representative Example Selection:
To facilitate an effective learning process, we present examples to the learner that have sufficient source-side context for correctly identifying the target-side lexical choice. This is important because for some examples in the corpus, the context needed for disambiguation spans multiple sentences. To make our learning content both concise and effective, we focus only on context self-contained in a single sentence. Further, to conduct a high-quality study efficiently, we enlist the help of native speakers of the L2 to filter the required sentences. We note, though, that the relevant sentences could also potentially be filtered automatically (left for future work).
To get such meaningful examples, we present bilingual English-Spanish and English-Greek speakers with the English sentence containing the focus word and the set of possible lexical choices in Spanish and Greek respectively. They then select the word which best suits the given context and mark their confidence in the selection. The interface for the example selection is the same as in Figure 2 (but without rules). We collect these annotations from multiple native speakers and keep only those sentences on which all native speakers agree (see Appendix B.3 for details).

Results and Discussion
To confirm whether the extracted rules are effective for the learning process, we examine the following questions: Do the extracted rules result in increased learner accuracy? We compute the learner accuracy across all learners for each L2 study. If a learner attains higher accuracy with fewer attempted examples in the experiment with rules than without, then the extracted rules can be considered effective for the learning process. However, we cannot directly use the learner accuracy as-is, because of the possibility of other sources of variability such as (a) underlying learner ability, as some learners may be more proficient than others, (b) underlying task difficulty, as some words may be harder to disambiguate than others, or (c) word ordering, as learners may become more proficient as they do more tasks. Therefore, we use a mixed-effects model (McLean et al., 1991), which models random effects and fixed effects to account for such random variability. Random effects are variables responsible for random variation, such as task identity, task order and the learner, while fixed effects, such as the presence of rules, are the variables of interest for determining the response variable, i.e. learner accuracy. A linear mixed-effects model (LME) is defined as:

y = Xβ + Zu + ε

where y is the learner accuracy, β and u are the fixed-effect and random-effect regression coefficients, X and Z are the respective design matrices, and ε is the noise.
We fit LME models on our data, varying the number of first attempted examples n ∈ {5, 10, 20, 30, 40, 50, all}. Each fitted LME model gives us an intercept, which informs us of the learner accuracy in the absence of rules, and the fixed-effect coefficient β, which informs us about the gain from rules. As shown in Figure 3, learners with access to our automatically extracted rules achieve higher accuracy with fewer examples than learners without. As expected, with an increasing number of attempted examples the gap in accuracy between the two settings narrows. Interestingly, we find that the rules still have a significant effect on the learner's confidence even later in the learning process. This suggests that with our rules, learners require fewer examples to infer the patterns governing each lexical choice, and they further become more confident in their understanding. This is encouraging because in real settings the learning exercise would be conducted for every focus word that the learner is attempting to learn, and because this process will have to be repeated many times, making it more efficient is of significant value. In Appendix B.4 we report the p-values for the fitted LME models, which show that the positive gains from the presence of rules are most significant for ≤ 20 examples for Spanish and for all examples for Greek.
Overall, we find our extracted rules help both Spanish and Greek learners in their learning process. We note that the results on Greek are promising as it does not enjoy the same luxuries as Spanish in having a high-quality lemmatizer or word aligner. This is encouraging especially for researchers involved in the revival efforts of endangered languages.
Do the extracted rules result in increased learner confidence? While answering the questions, the learner marks how confident they are in their answer. As before, we fit LME models for each n, using annotator confidence as the response variable y and the presence of rules as the fixed effect. We find that the learners' confidence in the correct answer increases more when they are provided rules (Figure 3) for both languages.
Do the extracted rules help some words more than others? Since focus words may vary in difficulty, we check whether our extracted rules are more effective for some words than others. We fit an LME model on each focus word and compute the β coefficient to measure the effect of rules on learner accuracy after 20 attempted examples. In Figure 4 we plot the β coefficient against the accuracy (averaged across all learners) for each focus word when learners did not have rules, and find that the words on which the learners performed worst, such as wall, oil, farmer, and vote for Spanish, benefit most from our rules. Similar observations hold for Greek, where learners benefit more for the words (break, wheel, tour, old, roof) on which they performed worst. Some of these words indeed have finer semantic subdivisions than the rest. For instance, for the choices for farmer: "agricultor" refers exclusively to one who works the land, harvests, sows, etc., whereas "granjero" is less formal, referring to one who manages a farm, or works or lives on it. This analysis shows that, encouragingly, our rules especially help learners with more difficult words.
We also plot the β coefficient against the lexical selection model's accuracy (Figure 5) and find a positive correlation: rules help more for words on which the model performs well. This suggests that if we can develop more accurate models with the same level of interpretability, the learning effect might become even stronger.

Related Work
Computer-assisted language learning CALL systems have increasingly used NLP to create learning content. Both SMILLE (Zilio et al., 2017) and WERTi (Meurers et al., 2010) aim to aid the text understanding process by highlighting linguistic structures using hand-written rules and automatically acquired syntactic analyses. Apertium (Tyers et al., 2012), a rule-based MT system, while not aimed at language learning, does use human- and machine-readable rules, but its formalism can account only for fixed-length ordered contexts, restricting its application. Further, these rules use a combination of only lemmas and POS tags, while our framework uses a richer feature set.
Cross-lingual word sense disambiguation CL-WSD disambiguates a word in context by providing the appropriate translation across languages. Lefever and Hoste (2010) construct a dataset (25 ambiguous English nouns across six languages) semi-automatically from parallel corpora, which is then verified by expert translators. Such lexical choice tasks have also been created for evaluating MT systems (Rios Gonzales et al., 2017; Rios et al., 2018). However, these methods cover a limited set of words (mostly nouns) and require some manual intervention during data creation. To the best of our knowledge, our proposed pipeline is

Future Work
While we have demonstrated the efficacy of our extracted rules in teaching new words for two languages, we plan to apply our framework to much lower-resourced languages, which have fewer available learning resources and whose learners would therefore benefit more from an automated system. We also plan to use automated methods, such as selection by model confidence, to choose 'representative' examples for the learning setup instead of relying on native speakers. Furthermore, multimodal features have proven their utility in automatic methods for lexical acquisition (Hewitt et al., 2018), and we plan to examine their effectiveness for L2 learning.

A.1 Identifying Semantic Subdivisions
In Section 3, we describe the procedure for identifying focus words in L1. For the filter-on-entropy step within that procedure, we use a threshold of 0.69, so only focus words having an entropy H > 0.69 are selected. For binary lexical choices, an ambiguous word aligned to each choice with uniform probability would have an entropy of ln 2 ≈ 0.69; hence we are interested in words that exceed this minimum threshold. In Figure 6 we show the distribution of the number of lexical choices for all extracted focus words, broken down by POS tag, for Spanish and Greek. We also check the CEFR levels 15, which measure reading proficiency in a language. Using the automated tool provided by Duolingo 16 (currently available only for Spanish and English), we find that 60% of the extracted Spanish lexical choices belong to the intermediate (B) level and 20% to the advanced level. This suggests that the identified words are indeed more challenging.
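The entropy filter can be expressed in a few lines; the alignment counts below are made up purely for illustration.

```python
# Keep only focus words whose aligned-translation distribution has entropy
# above ln(2) ~ 0.69, the entropy of a uniform binary lexical choice.
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy alignment counts (English word -> counts of aligned Spanish words).
alignments = {
    "wall": Counter({"pared": 50, "muro": 50}),                # ambiguous: kept
    "linguistics": Counter({"lingüística": 99, "idioma": 1}),  # near one-to-one: dropped
}
THRESHOLD = 0.69
focus_words = [w for w, c in alignments.items() if entropy(c) > THRESHOLD]
print(focus_words)  # → ['wall']
```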

A.2 Model Hyperparameters and Results
For the LinearSVM and DTree models, we clean the data to remove punctuation and extract features within a 3-word window of the focus word.
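A minimal sketch of this preprocessing, assuming whitespace tokens; the w[±i] feature-name format is our own illustration (the full system also uses POS, bigram, and WSD features).

```python
# Drop punctuation tokens, then collect the words within a 3-word window
# on each side of the focus word as positional features.
import string

def window_features(tokens, focus, window=3):
    clean = [t.lower() for t in tokens if t not in string.punctuation]
    i = clean.index(focus)
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(clean):
            feats[f"w[{offset:+d}]"] = clean[j]
    return feats

print(window_features("The old stone wall around the garden .".split(), "wall"))
```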
As mentioned before, we train a lexical selection model for each focus word, and in Tables 5, 6, 7, and 8 we report the per-word test accuracy for the LinearSVM, DTree, BERT, and baseline methods for both Spanish and Greek. We also report the training accuracy for our main model, LinearSVM.
DTree We also experimented with other interpretable models, such as decision trees (Quinlan, 1986) trained with the CART algorithm (Breiman et al., 1984); however, we found them to perform worse than the SVM model. We used the following hyperparameters: criterion = [gini, entropy], max depth = [6, 15], min impurity decrease = 1e-3.
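This grid maps directly onto scikit-learn's CART implementation; the data below is synthetic, purely to show the search.

```python
# Grid-search the decision-tree hyperparameters quoted above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [6, 15],
    "min_impurity_decrease": [1e-3],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```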
BERT We compare the interpretable models LinearSVM and DTree with a more complex neural model based on the popular BERT (Devlin et al., 2019). We retain the same hyperparameters as the original paper, using 768 dimensions for the encoder representations. We train the model for 20 epochs using the AdamW optimizer with a learning rate of 5e-5.

B.1 Rule Templates
The human-understandable "rules" are essentially those features from the training set that the model deemed important for determining the correct label (i.e., features that were given higher weights for a given label). In particular, for each label (i.e., muro or pared), we choose the top-20 features. We then group these individual features by feature type: for instance, all bigram features are grouped under the category "Short Phrase", lemma features under the category "Words", and WSD features are first expanded into their natural-language form using the WordNet (Miller, 1995) knowledge base and then grouped under the category "Concepts", as shown in Table 2.

B.2 Language Learning Interface
In our proposed language learning setup, the learner is first presented with a screen showing concise explanations for each lexical choice (Figure 7(a)). They can take as much time as they need to review the rules and then proceed to the tasks. Within each task, the learner is shown an English sentence with the focus word highlighted and a set of possible lexical choices. The page also displays the concise explanations for the learner to refer to if they wish (Figure 7(b)). The learner is required to select one of the lexical choices and mark how confident they are in their answer. Once they submit, the learner is immediately shown the correct answer along with the individual rules that applied to that example highlighted (Figure 2 in the main text). Learners took 3-4 hours (avg.) in total to complete all tasks. Table 3 presents the tasks performed by the respective Spanish and Greek learners. Since English speakers might not be familiar with the Greek alphabet, we display the English transliteration of the respective Greek words. Some of the lexical choices (e.g. muralla/muro/muros) contain multiple inflections of the same lemma (muro). This is due to errors in the automatic Spanish lemmatizer, which failed to correctly map the inflections to a single lemma. We therefore run an edit-distance-based post-processing step to combine lexical choices having the same prefix. We note that this simple heuristic might not be ideal for languages, such as Indonesian, that use affixes and/or reduplication and have far-from-perfect lemmatizers; nevertheless, such post-processing, when applied carefully, fixes many of the erroneous lemmatizations.
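A minimal sketch of this post-processing, assuming a simple shared-prefix criterion (the exact merge rule used in the paper may differ):

```python
# Greedily merge surface forms that share a long-enough common prefix with an
# existing group's representative, collapsing unlemmatized inflections.
import os.path

def merge_choices(choices, min_overlap=0.6):
    groups = []
    for form in sorted(choices, key=len):  # shortest form becomes representative
        for group in groups:
            rep = group[0]
            prefix = os.path.commonprefix([rep, form])
            if len(prefix) >= min_overlap * min(len(rep), len(form)):
                group.append(form)
                break
        else:
            groups.append([form])
    return groups

print(merge_choices(["muro", "muros", "muralla", "pared", "paredes"]))
# → [['muro', 'muros', 'muralla'], ['pared', 'paredes']]
```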

B.3 Representative Example Selection
The shortlisted words for both the Spanish and Greek studies can be found in Table 3. We use native speakers to filter for sentences that have sufficient context to correctly identify a lexical choice. We enlist 3 Spanish native speakers, each of whom annotates roughly 200 examples for 10 English focus words. The inter-annotator agreement for Spanish, computed using Fleiss' kappa, is 0.77. For Greek, we use 2 native speakers to annotate 10 English words. For 7 out of the 10 words we did not always have access to 2 native speakers, so we relied on a single expert annotator. The (avg.) inter-annotator agreement between the two annotators for the remaining 3 words (tour, tie, bill) is 0.83. Of the 10 selected words, we discard words/lexical choices with < 10 examples on which all native speakers agree (Table 3), giving us 9 English words for the Spanish study and 10 English words for the Greek study.
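The agreement statistic can be reproduced with statsmodels; the rating table below is a toy example, not the study's annotations.

```python
# Fleiss' kappa over per-item category counts: rows are examples, columns are
# lexical choices, entries count how many annotators picked each choice.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# 5 examples rated by 3 annotators choosing between "pared" and "muro".
table = np.array([
    [3, 0],  # unanimous: pared
    [0, 3],  # unanimous: muro
    [2, 1],  # split
    [3, 0],
    [0, 3],
])
print(round(fleiss_kappa(table), 3))  # → 0.732
```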

B.4 Results
In Table 4 we report the p-values for the linear mixed-effects (LME) models fitted to predict learner accuracy with rules as the fixed effect. The results show that the positive effect of rules on accuracy is statistically significant up to the first 20 attempted examples for Spanish and for all examples for Greek.
Figure 7(b) caption: Rules for the correct answer are displayed to the learner after each question. Individual rules that apply to the given example are highlighted for the learner's convenience.