ETS Lexical Associations System for the COGALEX-4 Shared Task

We present an automated system that computes multi-cue associations and generates associated-word suggestions, using lexical co-occurrence data from a large corpus of English texts. The system performs expansion of cue words to their inflectional variants, retrieves candidate words from corpus data, finds maximal associations between candidates and cues, computes an aggregate score for each candidate, and outputs an n-best list of candidates. We present experiments using several measures of statistical association, two methods of score aggregation, ablation of resources, and additional filters on retrieved candidates. The system achieves 18.6% precision on the COGALEX-4 shared task data. Results with additional evaluation methods are presented. We also describe an annotation experiment which suggests that the shared task may underestimate the appropriateness of candidate words produced by the corpus-based system.


Introduction
The COGALEX-4 shared task is a multi-cue association task: finding a target word that is associated with a set of cue words. The task is motivated, for example, by a tip-of-the-tongue search application, as described by the organizers: "Suppose, we were looking for a word expressing the following ideas: 'superior dark coffee made of beans from Arabia', but could not remember the intended word 'mocha'. Since people always remember something concerning the elusive word, it would be nice to have a system accepting this kind of input, to propose then a number of candidates for the target word. Given the above example, we might enter 'dark', 'coffee', 'beans', and 'Arabia', and the system would be supposed to come up with one or several associated words such as 'mocha', 'espresso', or 'cappuccino'." The data for the shared task were sampled from the Edinburgh Associative Thesaurus (EAT, http://www.eat.rl.ac.uk). For each of about 8,000 stimulus words, the EAT lists the associations (words) provided by human respondents, sorted according to the number of respondents who provided the respective word. Generally, when more people provided the same response, the underlying association is considered to be stronger (Kiss et al., 1973). For the COGALEX-4 shared task, the cues were the five strongest responses to an unknown stimulus word, and the task was to recover (guess) the stimulus word (henceforth, target word). The data for the task consisted of a training set of 2000 items (for which target words were provided), and a test set of 2000 items. The origin of the data was not disclosed before or during the system development and evaluation phases of the shared task competition.
The ETS entry consisted of a system that uses corpus-based distributional information about pairs of words in English. No use was made of human association data (EAT or other), nor of any other information such as the order of importance of the cue words, or any special preference for the British spelling often used in the EAT.

$$\mathrm{SLL}(a,b) = 2 \cdot \left( P(a,b) \cdot \log \frac{P(a,b)}{P(a)\,P(b)} - \bigl( P(a,b) - P(a)\,P(b) \bigr) \right)$$

P(a,b) signifies the probability of joint co-occurrence. For bigrams, that is joint co-occurrence in a specific sequential order (e.g. AB vs. BA); for DSM data the co-occurrence is order-independent.
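As an illustration, the association measures compared in this paper (PMI, NPMI and simplified log-likelihood) can be sketched in Python as follows. The PMI and NPMI forms are the standard textbook definitions; the SLL form follows one reading of the formula above. The function names and the probability-based interface are assumptions made for this sketch and are not taken from the system itself.

```python
import math


def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information (base-2 log), standard definition."""
    return math.log2(p_ab / (p_a * p_b))


def npmi(p_ab, p_a, p_b):
    """Normalized PMI: PMI divided by -log2 P(a,b); undefined when P(a,b) == 1."""
    return pmi(p_ab, p_a, p_b) / -math.log2(p_ab)


def sll(p_ab, p_a, p_b):
    """Simplified log-likelihood, assuming the form 2*(O*log(O/E) - (O - E))
    with O = P(a,b) and E = P(a)*P(b)."""
    expected = p_a * p_b
    return 2 * (p_ab * math.log(p_ab / expected) - (p_ab - expected))
```

All three functions return 0 when the two words are statistically independent, i.e. when P(a,b) = P(a)P(b). All probabilities are assumed to be strictly positive.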

Procedure for generating candidate multi-cue associates
Our general procedure for generating target candidates is as follows. For each of the five cue words, candidate targets are generated separately, from the corpus-based resources:
1. From the DSM (generally associated words)
2. Left words from bigrams (words that, in the corpus, appeared immediately to the left of the cue)
3. Right words from bigrams (words that appeared immediately to the right of the cue)
Retrieved lists of candidates can be quite large, with hundreds and even thousands of different neighbors. One specific filter implemented at this stage was that only word-forms (alphabetic strings) were allowed, and any punctuation or '#' strings were filtered out.
Since our resources are not lemmatized, we extended the candidate retrieval procedure by expanding the cue words to their inflectional variants. This provides richer information about semantic association. We used an in-house morphological analyzer/generator. Inflectional expansions were not constrained for part of speech or word sense. For example, given the cue set {1:letters 2:meaning 3:sentences 4:book 5:speech} (from the training set of the shared task, target: 'words'), after expansion the set of cues is {1:letters, lettered, letter, lettering 2:meaning, means, mean, meant, meanings 3:sentences, sentence, sentenced, sentencing 4:book, books, booking, booked 5:speech, speeches}. The vector of right neighbors for the cue 'letters' brings such words as {sent, from, between, written, came, addressed, ...}. The vector of left neighbors for the same cue word brings such candidates as {write, send, love, capital, review, ...}. From the DSM, the vector of co-occurrence may bring some of the same words (but with different values of association), as well as words that do not generally occur immediately before or after the cue word, e.g. {time, people, word, now, ...}.
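The retrieval step can be sketched as follows. This is a minimal illustration, not the system's code: the in-house morphological generator is replaced by a hypothetical lookup table (`inflections`), and each resource is assumed to be a dictionary mapping a word-form to its neighbors and their association values.

```python
def expand_cue(cue, inflections):
    """Return the cue family: the cue plus its inflectional variants.
    `inflections` is a hypothetical stand-in for a morphological generator."""
    return [cue] + [v for v in inflections.get(cue, []) if v != cue]


def retrieve_candidates(cue, inflections, dsm, left_bigrams, right_bigrams):
    """Pool candidates for one cue family from the three resources.
    Each resource is assumed to map word-form -> {neighbor: association}."""
    pooled = set()
    for form in expand_cue(cue, inflections):
        for resource in (dsm, left_bigrams, right_bigrams):
            # Keep only alphabetic word-forms; drop punctuation and '#' strings.
            pooled.update(w for w in resource.get(form, {}) if w.isalpha())
    return pooled
```

For instance, if the left-bigram resource lists 'capital' for 'letters' and 'love' for 'letter', both words end up in the candidate pool of the cue family of 'letters'.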
Next, we apply filtering that ensures the minimal requirement for multi-word association: a candidate must be related to all cues. The candidate must appear (at least once) on the list of words generated from each cue family. A candidate word that does not meet this requirement is filtered out.
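The all-cues requirement amounts to a set intersection over the per-cue candidate lists. A minimal sketch, assuming the candidates of each cue family are available as a collection keyed by cue (the function name and data layout are illustrative):

```python
def related_to_all_cues(candidates_by_cue):
    """Keep only candidates that were retrieved for every cue family.
    `candidates_by_cue` maps each cue to an iterable of candidate words."""
    candidate_sets = [set(words) for words in candidates_by_cue.values()]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)
```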

Scoring of candidate associates
Scoring of candidate associate-words is a two-stage process. First, for each candidate, we look for the strongest association value it has with each of the five cue families. Then, the five strongest values are combined into an aggregated score.
For a given cue family, several instances of the same candidate associate might be retrieved, with various values of association score (from DSM and n-grams, and also for each specific inflectional form of the cue). We pick the highest score, siding with the source that provides the strongest evidence of connection between the cue and the candidate associate. The maximal association value is stored as the best score for this candidate with the given cue family. We note that since the same measure of association is used, the scores from the different sources are numerically comparable. For example, when PMI is used as the association measure, the following values were obtained for candidate 'capital' with cue family 'letters, lettered, letter, lettering' (expanded from 'letters'). General co-occurrence (DSM): capital & letters: 0.477, capital & letter: 0.074, etc.; left bigrams: capital letters: 5.268, capital letter: 2.474, etc. The strongest association here is the bigram 'capital letters', and the value 5.268 is the best association of the candidate 'capital' with this cue family.
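The first scoring stage can be sketched as a maximum over all evidence retrieved for a (cue family, candidate) pair. The tuple layout below is an assumption made for illustration; the four values are the PMI figures quoted above for 'capital'.

```python
def best_scores(evidence):
    """Return the strongest association per (cue family, candidate) pair.
    `evidence` is an iterable of (cue_family, candidate, score) tuples,
    pooled over DSM, left/right bigrams, and all inflected cue forms."""
    best = {}
    for family, candidate, score in evidence:
        key = (family, candidate)
        if key not in best or score > best[key]:
            best[key] = score
    return best


evidence = [
    ("letters", "capital", 0.477),  # DSM: capital & letters
    ("letters", "capital", 0.074),  # DSM: capital & letter
    ("letters", "capital", 5.268),  # left bigram: capital letters
    ("letters", "capital", 2.474),  # left bigram: capital letter
]
print(best_scores(evidence)[("letters", "capital")])  # 5.268
```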
Next, for each candidate we compute an aggregate score that represents its overall association with all five cues. In the current study, we experimented with two forms of aggregation: 1) sum of best scores (SBS), and 2) product (multiplication) of ranks (MR). Sum of best scores is simply the sum of best association scores that a candidate has with each of the five cues (families). To produce a final ranked list of candidate targets, candidates are sorted by their aggregate sum value (better candidates have higher values). Multiplication of ranks has been proposed as an aggregation procedure by Rapp (2008, 2014). In this procedure, all candidates are sorted by their association scores with each of the five cues (families) separately, and five rank values are registered for each candidate. The five rank values are then multiplied to produce the final aggregate score. All candidates are then sorted by the aggregate score, and in such ranking better candidates have lower aggregate scores. Multiplication of ranks is computationally more intensive than sum of scores: for a given set of candidate words from five cues, multiplication of ranks requires six calls for sorting, while aggregation via sum-of-best-scores performs sorting only once.
Finally, all candidates are sorted by their aggregate score, and the top N are output for the calculation of precision@N, described below.
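The two aggregation procedures can be sketched as below. This is an illustrative reading of the description rather than the system's implementation: `best` is assumed to map each candidate to its list of best scores, one per cue family, and ties are broken arbitrarily by sort order, which the description does not specify.

```python
def sum_of_best_scores(best):
    """Rank candidates by the sum of their best scores (higher is better)."""
    total = {c: sum(scores) for c, scores in best.items()}
    return sorted(total, key=lambda c: total[c], reverse=True)


def multiplication_of_ranks(best):
    """Rank candidates by the product of their per-cue ranks (lower is better)."""
    if not best:
        return []
    candidates = list(best)
    n_cues = len(next(iter(best.values())))
    product = {c: 1 for c in candidates}
    for i in range(n_cues):  # one sort per cue family
        ordered = sorted(candidates, key=lambda c: best[c][i], reverse=True)
        for rank, c in enumerate(ordered, start=1):
            product[c] *= rank
    return sorted(candidates, key=lambda c: product[c])  # one final sort
```

With five cue families, `multiplication_of_ranks` sorts six times and `sum_of_best_scores` once, matching the cost comparison above. The two procedures can disagree: a candidate with one very high score and one weak score may lead under SBS yet fall behind a uniformly good candidate under MR.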

Results
Our system ran with several different configuration settings, using various association measures and score aggregation procedures. Under any given configuration, the system produces, for each item (i.e. a set of five cue words), a ranked list of candidates. According to the rules of the shared task, official results are computed by selecting the single best candidate for the item as the suggested target word. If the suggested word strictly matches the gold-standard word (ignoring upper/lower case), it is considered a match. If the two strings differ even slightly, it is considered a mismatch. The reported result is precision (percent matches) over the test set of 2000 items.
With strict matching, our best result for the test set was a precision of 18.6% (372 correctly suggested targets). This was obtained by using NPMI as the association measure, product of ranks as the score aggregation procedure, and with filtering of candidates using a stoplist and a frequency filter. 4 The shared task was described as multi-cue association for finding a sought-after 'missing' word, a situation not unlike a tip-of-the-tongue phenomenon. In such a situation, a person looking for an associated word might find it useful if the system returns not just one highest-ranked suggestion (which would often be a miss), but a list of several top-ranked suggestions; the target word might be somewhere on such a list. 5 Thus, we also present our results in terms of precision for n-best suggestions, i.e. in how many cases the target word was among the top n returned by the system, with n ranging from 1 up to 25.
A similar consideration applies to inflectional variants. A person looking for a word associated with a set of cue words might be satisfied when a system returns either a base-form or an inflected variant of the target word. Thus, we report our results both in terms of strict matches to gold-standard targets and under a condition of 'inflections-allowed'. 6 On the test set, our best result for precision@1, with inflections allowed, is 24.35% (487 matching suggestions).
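The evaluation conditions described here (strict or inflections-allowed matching, over n-best lists) can be sketched as a single function. The names and data layout are assumptions for illustration; `variants` stands in for the output of a morphological generator applied to each gold-standard target.

```python
def precision_at_n(ranked_lists, gold_targets, n, variants=None):
    """Percentage of items whose gold target appears among the top-n candidates.
    Matching ignores upper/lower case. If `variants` (gold word -> iterable of
    inflected forms) is given, a match on any variant also counts as a hit.
    Assumes a non-empty list of items."""
    hits = 0
    for ranked, gold in zip(ranked_lists, gold_targets):
        accepted = {gold.lower()}
        if variants is not None:
            accepted |= {v.lower() for v in variants.get(gold, ())}
        if any(candidate.lower() in accepted for candidate in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(gold_targets)
```

Calling the function with `n=1` and no `variants` corresponds to the official strict-match measure of the shared task.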
First, we present our baseline results. Figure 1 presents the results of our system for the training set of 2000 items, using the NPMI association measure. Panel 1A presents data obtained using aggregation via sum-of-best-scores (SBS). Panel 1B presents data obtained using aggregation via multiplication of ranks (MR). Figure 2 presents a similar breakdown for results of the test set. Both sets of results are quite similar. Thus, we restrict our attention to just the results of the test set. 7 We found, as expected, that performance improves when the target is sought among the n-best candidates produced by the system. With NPMI and MR aggregation, strict-match precision improves from 16.1% for precision@1 to 30.3% for precision@5, 37% for precision@10, and 46.9% for precision@25 (Figure 2B).
4 We initially submitted a result of 14.95% strict-match precision@1 (see Figure 2A). This was improved to 16.1% (Figure 2B), and with additional filters to 18.6% (see section 4.2).
5 A list of n-best suggestions is a standard approach for presenting candidate corrections for misspellings (Flor, 2013; Mitton, 2008). Also, precision "at n documents" is a well-known evaluation approach in information retrieval (Manning et al., 2008). A recent use of n-best suggestions in an interactive NLP system is illustrated by Madnani and Cahill (2014).
6 Each target word form, both in the training set and the test set, was automatically expanded to all its inflectional variants, using our morphological analyzer/generator. In our evaluations, a candidate target is considered a 'hit' if it matches the gold-standard target or one of its inflectional variants.
7 We did not use the training set for any training or parameter tuning. We used it to select the optimal association measures for this task: we also experimented with t-score, weighted PMI and conditional probability, but PMI and NPMI performed much better than the others.
Another expected result is that performance is better when matching of targets allows inflectional variants. This is clearly seen on the charts, as the difference between the two lines. With NPMI and MR aggregation, precision@1 improves from 16.1% to 21.45%, precision@5 improves from 30.3% to 36.3%, and precision@25 improves from 46.9% to 54%. A similar improvement is observed when using aggregation via sum-of-best-scores.
Our third finding is that multiplication of ranks achieves slightly better results than sum-of-best-scores (Figure 2, panel B vs. panel A). For precision@1 with strict matches, using NPMI, MR achieves 16.1% and with inflectional variants 21.45%, while SBS achieves 14.95% and 20.25% respectively. For precision@10, MR achieves 37% (43.55%), while SBS achieves 36% (42%). Notably, MR is consistently superior to SBS for all values of n-best, from 1 to 25, under both strict and inflections-allowed matching, with both NPMI and PMI (see Figure 3). However, the advantage is consistently rather small, about 1-1.5%. Since MR is computationally more intensive, SBS emerges as a viable alternative.
We have also conducted experiments with three different measures of association. Results are presented in Figure 3. With MR aggregation, NPMI achieves better results than the PMI measure. Both measures clearly outperform the Simplified log-Likelihood. Similar results are obtained with SBS aggregation. For each association measure, allowing inflections provides better results than strict matching to gold-standard targets.

Additional studies
In several additional experiments we looked at the contribution of different factors to overall performance. We tried several variations of resource combination and also tested filtering of candidates by frequency and by using a list of stopwords.

Ablation experiments
We investigated how the restriction of resources impacts the performance on this task. Specifically, we restricted the resources as follows. In one condition we used only the bigrams data, retrieving candidates only from the vectors of left co-occurring words (immediately preceding words) of each cue word (condition NL: n-grams left). A similar restriction is when candidates are retrieved only from right (immediate successor) words (condition NR: n-grams right). A third condition still uses only bigrams, but admits candidates from both left and right vectors (condition NL+NR). Under the fourth condition (DSM), n-grams data is not used at all; only the DSM resource is used. In the fifth and sixth conditions we combine candidates from DSM with n-gram candidates (left or right vectors only, respectively). The seventh condition is our standard: candidates from DSM and both left and right neighbors from bigrams are admitted. For those experiments, we used the NPMI association measure with MR aggregation, and included inflections in evaluation. The results are presented in Figure 4. Using only right-hand associates (typical textual successors of cue words) provides very low performance (precision@1 is 2.95%). Using only left-hand associates (typical textual predecessors of cue words) provides slightly better performance (precision@1 is 4.5%). However, it is notable that there are some items in the EAT data where all cues are strong bigrams with the target, e.g. {orange, fruit, lemon, apple, tomato} with target 'juice'. Combining these two resources (condition NL+NR) provides much better performance: precision@1 is 8.5%. Using just the DSM, the system achieves 10.5% precision@1, which may seem rather close to the combined NL+NR 8.5%. However, with DSM, for n-best lists precision rises quite sharply (e.g. 24.35% for precision@5), while for the NL+NR setting precision tends to be under 17% for all values of n up to 25.
Since our DSM and bigrams resources are built on the same corpus of text, for any given set of cues the DSM produces all the candidates that the bigrams resource does (but with different association values) and a lot of other candidates. However, results for DSM+NR and DSM+NL settings (which are better than DSM alone) indicate that association values from bigrams contribute substantially to overall performance. The best result in this experiment is achieved by a setting that combines candidates (and association values) from all three resources, indicating further that associations from sequential word combinations (bigrams) provide a substantial contribution to performance in this task.

Applying filters on retrieved candidates
We also experimented with applying some filters on the retrieved candidates for each item. One of the obvious filters to use is to filter out stopwords. For general tip-of-the-tongue search cases, common stopwords are rarely useful as target words; thus presenting stopwords as candidates makes little sense. We used a list of 87 very common English stopwords, including the articles {the, a, an}, common prepositions, pronouns, wh-question words, etc. However, since the data of the shared task comes from EAT, common stopwords are actually targets in some cases in that collection. Therefore, we used the following strategy. For a given item, if at least one of the five cue words is a stopword, then we assume that the target might also be a stopword, and so we do not use the stoplist to filter candidates for this item. However, if none of the cues is a stopword, we do apply filtering: any retrieved candidate word is filtered out if it is on the stoplist. An additional filter, applied with the stoplist, was defined as follows: if a candidate word is strictly identical to one of the cue words, the candidate is filtered out (to allow for potentially more suitable candidates). The other filter considers frequency of words. The PMI measure is known to overestimate the strength of pair association when one of the words is a low-frequency word (Manning & Schütze, 1999). Normalized PMI is also sensitive to this aspect, although less than PMI. Thus, we use a frequency filter to drop some candidate words. For technical reasons, it was easier for us to apply a cutoff on the joint frequency of a candidate and a cue word. We used a cutoff value of 10: a candidate is dropped if the corpus data indicates that it co-occurs with the cue word fewer than 10 times.
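The two filters can be sketched as follows. Two points are assumptions of this sketch rather than statements about the system: the cue-identity filter is applied for every item (the text does not say whether it is skipped together with the stoplist), and the frequency cutoff is applied per (cue, candidate) pair. Function names and data layout are illustrative.

```python
def stoplist_filter(cues, candidates, stoplist):
    """Conditional stoplist filter plus removal of candidates identical to a cue.
    The stoplist is skipped when any cue is itself a stopword."""
    cue_set = set(cues)
    use_stoplist = not (cue_set & set(stoplist))
    kept = []
    for word in candidates:
        if word in cue_set:  # assumption: applied regardless of the stoplist
            continue
        if use_stoplist and word in stoplist:
            continue
        kept.append(word)
    return kept


def frequency_filter(joint_counts, cutoff=10):
    """Drop (cue, candidate) pairs whose joint corpus frequency is below cutoff.
    `joint_counts` maps (cue, candidate) -> co-occurrence count."""
    return {pair: n for pair, n in joint_counts.items() if n >= cutoff}
```

For the cue set {mine, yours, his, is, theirs}, where 'is' is a stopword, the stoplist is skipped, so a stopword candidate such as 'the' would survive this step.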
We applied the stoplist filter, the frequency filter and a combination of those two filters, always using NPMI as our association measure, aggregating scores via multiplication-of-ranks, and allowing inflections in evaluation. No ablation of resources was applied. The results are presented in Figure 5. The baseline condition is when neither of the two filters is applied. The frequency filter with cutoff=10 provides a very small improvement for precision@1, and for higher values of best-n it actually hurts performance. Application of a stoplist provides a very slight improvement of performance. The combination of a stoplist and frequency cutoff=10 provides a sizable improvement of performance (precision@1 is 24.35% vs. baseline 21.45%, and precision@10 is 44.55% vs. baseline 43.55%). However, for n-best lists of size 15 and above, performance without filters is slightly better than with those filters. For the shared task (using strict matching -no inflections), our best result is 18.6% precision@1 with two filters (16.1% without filters).
Given that the gold-standard targets in the shared task are original stimulus words from the EAT collection, we can use a special restriction: restrict the candidates to just the EAT stimuli word-list (Rapp, 2014). Notably, this is a very specific restriction, suited to the specific dataset, and not applicable to the general case of multi-cue associations or tip-of-the-tongue word searches. We used the list of 7913 single-word stimuli from EAT as a filter in our system; generated candidates that were not on this list were dropped from consideration. The results (Figure 5) indicate that this restriction (EATvocab) provides a substantial improvement over the baseline condition. For precision@1, using EATvocab (24.55%) is comparable to using a stoplist+cutoff10 (24.35%). However, for larger n-best lists, the EATvocab filter provides substantially better performance.
Figure 5. System performance on the test set with different filtering conditions. All runs use NPMI association and MR aggregation. Inflections allowed in evaluation. C10: frequency cutoff=10.

Small-scale evaluation using direct human judgments
Inspecting results from training-set data, we observed a number of cases where the system produced very plausible targets which, however, were struck down as incorrect (not matching the gold standard). For example, for the cue set {music, piano, play, player, instrument} the gold-standard target was 'accordion'. But why not 'violin' or 'trombone'? To provide a more in-depth evaluation of the results, we sampled 180 items at random from the test set, along with the candidate targets produced by our system, 9 and submitted those to evaluation by two research assistants. For each item, evaluators were given the five cue words and the best candidate target generated by the system. They were told that the word is supposed to be a common associate of the five cues, and asked to indicate, for each item, whether the candidate was (a) Just Right; or (b) OK; or (c) Inadequate (a, b, c are on an ordinal scale). Out of the 180 items, 80 were judged by both annotators. Table 1 presents the agreement matrix between the two annotators. Agreement on the 3 classes was kappa=0.49. If Just Right and OK are collapsed, the agreement is kappa=0.60. The discrepancy is largely due to a substantial number of instances that one annotator judged OK and the other judged Just Right.
Table 1. Inter-annotator agreement matrix for a subset of items from the test-set.
9 Using all resources, NPMI association measure, MR aggregation, and with the general stoplist filter.
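For readers who wish to reproduce this kind of agreement analysis, the sketch below computes kappa from an agreement matrix and shows how two categories can be collapsed. It assumes unweighted Cohen's kappa, which the text does not state explicitly, and the matrices used to exercise it are hypothetical, not the study's data.

```python
def cohen_kappa(matrix):
    """Unweighted Cohen's kappa for a square agreement matrix, where
    matrix[i][j] counts items labeled i by annotator 1 and j by annotator 2.
    Undefined when chance agreement equals 1."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


def collapse_first_two(m):
    """Merge the first two categories of a 3x3 matrix (e.g. Just Right and OK)."""
    return [
        [m[0][0] + m[0][1] + m[1][0] + m[1][1], m[0][2] + m[1][2]],
        [m[2][0] + m[2][1], m[2][2]],
    ]
```

Collapsing Just Right and OK turns every Just Right/OK disagreement into an agreement, so kappa rises when most disagreements fall between those two categories.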
We note that one annotator commented on a difficulty making a decision in a number of cases where the cues are a list of mostly adjectives or possessives, and the target produced by the system is an adverb. For example, the cue set {busy, house, vacant, engaged, empty} with the proposed candidate target 'currently'; the cue set {food, thirsty, tired, empty, starving} with the proposed candidate 'perpetually'; the cue set {fat, short, build, thick, built} with the proposed candidate 'slightly'; the cue set {mine, yours, his, is, theirs} with the proposed target 'rightfully'. This annotator felt that these responses were OK, while the other annotator rejected them.
We merged the two annotations to provide a single annotation for the full set of 180 items by taking one annotator's judgment on single-annotated cases and taking the lower of the two judgments for the double-annotated cases where the annotators disagreed (thus, OK and Inadequate are merged to Inadequate; Just Right and OK are merged to OK). We next compare these annotations to the EAT gold standard. Table 2 shows the confusion matrix between the "gold label" from EAT and our annotation. We observe that the totals for Just Right and EAT-match are almost identical (43 vs 42); however, only 17 items were both Just Right and EAT-matches. There were 24 EAT matches that were judged as OK by the annotators (presumably, these did not quite create the "just right" impression for at least one annotator). Examples include: the cue set {beer, tea, storm, ale, bear} with the proposed correct target 'brewing' (one annotator commented that the relationship with 'bear' was unclear); the cue set {exam, match, tube, try, cricket} with the proposed correct target 'test' (one annotator commented that the relationship with 'cricket' was unclear); the cue set {school, secondary, first, education, alcohol} with the proposed correct target 'primary' (one annotator commented that the relationship with 'alcohol' was unclear). These results might reflect cultural differences between original EAT respondents (British undergraduates circa 1970) and present-day American young adults who, e.g., might not know much about cricket. Another possibility is that in the EAT collection, the 5th cue sometimes corresponds to a very weak associate provided by just a single respondent out of 100, as in the brewing-bear and primary-alcohol cases. Interestingly, the weak cues did not confuse the system, but replicability of the human judgments for such cases is doubtful. There were also 26 instances that were judged as Just Right yet were not EAT-matches.
Three of these were derivationally related, like 'build' (EAT target) vs 'buildings' (proposed) for the cue set {house, up, construct, destroy, bricks}; the others were 'dwell' vs 'dwellings' and 'collector' vs 'collecting'. In the rest of the cases, the generated candidates seemed as good as, or better than, the EAT words. For example, the cue set {ships, boat, sea, ship, ocean} had 'liners' as the EAT target, whereas the system proposed 'cruise'. For the cue set {natural, animal, nature, birds, fear}, the gold-standard EAT target is 'instinct', whereas the system proposed 'predatory'. For the cue set {sound, speak, sing, noise, speech} the gold-standard EAT target is 'voice', while the system produced 'louder'. For the cue set {music, band, noise, club, folk} the target was 'jazz', whereas the system proposed 'dance'. For the cue set {violin, music, orchestra, bow, instrument} the target was 'cello', while the system produced 'stringed'. Furthermore, in as many as 58 cases (32%) the response produced by the system did not match the target from EAT, but was OK-ed by the annotators. Some examples include: the cue set {fool, loaf, idiot, lout, lazy} with proposed candidate 'ignorant'; the cue set {hard, problems, work, hardship, trouble} with proposed candidate 'economic'; {interesting, intriguing, amazing, book, exciting} with proposed candidate 'discoveries'; {lazy, chair, about, lying, sitting} with proposed candidate 'motionless'. In all, if the system were evaluated by counting Just Right and OK annotations as correct, the precision@1 would have been (43+82)/180 = 69%. The estimation of performance based on gold-standard EAT data for this set is 42/180 = 23%, one-third of what annotators found to be reasonable responses. This suggests that evaluation of multi-cued retrieval on targets from EAT rejects many good semantic associates, and thus might be considered too harsh.

Conclusions
This paper presented an automated system that computes multi-cue associations and generates associated-word suggestions, using lexical co-occurrence data from a large corpus of English texts. The system uses pre-existing resources, a large n-gram database and a large word co-occurrence database, which have been previously used for a range of different NLP tasks. The system performs expansion of cue words to their inflectional variants, retrieves candidate words from corpus data, finds maximal associations between candidates and cues, and then computes an aggregate score for each candidate. The collection of candidates is then sorted and an n-best list is presented as output. In the paper we presented experiments using various measures of statistical association and two methods of score aggregation. We also experimented with limiting the lexical resources, and with applying additional filters on retrieved candidates.
For test-set evaluation, the shared task requires strict matches to gold-standard targets. Our system, in optimal configuration, was correct in 372 of 2000 cases, that is, a precision of 18.6%. We have also suggested a more lenient evaluation, where a candidate target is also considered correct if it is an inflectional variant of the gold-standard word. When inflections are allowed, our system achieves precision of 24.35%. Performance improves dramatically when evaluation considers in how many cases the gold-standard target (or one of its inflectional variants) is found among the n-best suggestions provided by the system. For example, with a list of 10-best suggestions, precision rises to 45%, and to 54% with a list of 25-best. Using an n-best list of suggestions makes sense for applications such as tip-of-the-tongue word search.
We note that the specific data set used in the COGALEX-4 shared task, i.e. the Edinburgh Associative Thesaurus, might be sub-optimal for evaluation of multi-cue associative search. With the EAT dataset, the gold-standard words were the original stimuli from EAT, and the cue words were the associated words that were most frequently produced by respondents in the original EAT experiment (Kiss et al., 1973). Rapp (2014) has argued that corpus-based computation of reverse-associations is a reasonable test case for multi-cued word search. However, Rapp also notes that in many cases, suggestions provided by a corpus-based system are quite reasonable, but are not correct for the EAT dataset. We have conducted a pilot human annotation on a small subset of the test set, judging how reasonable the top suggestion of our system is in general, and not whether it matched EAT targets. In this experiment, 69% of the system's first responses were judged acceptable by humans, while only 23% matched targets. This provides a quantitative confirmation that EAT-based evaluation underestimates the quality of results produced by a corpus-based multi-cue association system.
The use of data from EAT hints at the following direction for future research. In the original EAT data, the first cue is actually the strongest associate of the target word (original stimulus), while other cues are much weaker associates. In our current implementation, we treated all cues as equally important. Future research may include consideration for relative importance or relevance of the different cues. In potential applications, like the tip-of-the-tongue word search, a user may be able to specify which cues are more relevant than others.