Deep Learning from Web-Scale Corpora for Better Dictionary Interfaces

This paper explores advanced learning mechanisms – neural networks trained by the Word2Vec method – for predicting word associations. We discuss how the approach can be built into dictionary interfaces to help tip-of-the-tongue searches. We also describe our contribution to the CogALex 2014 shared task. We argue that the reverse response-stimulus word associations chosen for the shared task are only mildly related to the motivating idea of a lexical access support system. The methods employed in our contribution are briefly introduced. We present results of experiments with various parameter settings and show what improvement can be expected if more than one answer is allowed. The paper concludes with a proposal for a new collective effort to assemble real tip-of-the-tongue situation records for future, more realistic evaluations.


Introduction
Human memory is fundamentally associative. To focus just on lexical access issues, it is often the case that people cannot immediately recall a word expressing a specific concept but they can give one or more words referring to concepts associated with the desired one in their minds. The failure to retrieve a word from memory, combined with partial recall and the feeling that retrieval is imminent, is generally referred to as the tip-of-the-tongue phenomenon (TOT), sometimes called presque vu (Brown, 1991).
Before one starts to think about automatic means of supporting lexical access, it is important to distinguish the various situations in which TOT appears. First, the personal state of the language producer (writer/speaker) plays a crucial role. Fatigue or lack of attention can increase the frequency of TOT situations. Specific problems come with mild cognitive impairment (incipient dementia), which is more frequent in the elderly. The communication mode (written or spoken language) also needs to be taken into account: it often helps to recollect an intended word if one just says associated words aloud. Consequently, people may prefer to express their hesitation over a TOT word as a question to a family member, a friend or an automatic assistant. Spoken communication generally brings longer, more specific and more detailed clues that can potentially lead to better identification of the word to be recalled. The language (mother tongue v. foreign language) and the producer's familiarity and proficiency also need to be considered. Language learners frequently associate a word with others that sound similar but are not related semantically; they may combine clues in their native language and the target one, misspell or mispronounce words, etc. Although search across languages is not typically considered a kind of TOT phenomenon, we include this situation in the considered scenario.
Research prototypes of automatic assistants have to consider the above-mentioned settings and clearly identify in what types of TOT they can help. The primary decision a tool designer needs to make relates to the appropriate interface. The ultimate goal of the work described in this paper is to integrate TOT-aware assistants into natural user interfaces. Rather than on a desktop or tablet computer with a standard keyboard or (hand)written input, we focus on smartphones or even wearable interfaces (smart watches, glasses), intelligent home/office infrastructure components, or robotic companions that can communicate in natural (spoken) language and that help users in their language-producing tasks. Although the current research deals strictly with explicitly expressed requests for help in a TOT situation, there is a possibility of automatic detection of TOT-related hesitations and immediate generation of word suggestions. In any case, research on this topic is at an early stage and these types of automatic assistants are mostly research prototypes.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organizers. Licence details: http://creativecommons.org/licenses/by/4.0/
To be able to evaluate the ongoing development work, the first author of this paper started to document and collect real TOT events. These include personal experiences as well as cases that appeared during his communication with colleagues, family members, etc. In a relatively short period of three months, 19 documented cases were recorded. This shows that a collective effort in this area could easily lead to a new, reasonably large resource that would help to direct future research (see the concluding section). As we aim at a general TOT setting, the collected data include full descriptions of the clues, not just keyword-based TOT searches. For written-only interfaces, we provide a list of extracted keywords too. Thus, a record would contain the full sentence "It is like racism but on women" (the correct answer: discrimination) and the set of two keywords, racism and women, for the written case.
Although its current limited size does not allow deriving statistically significant results (only 3 out of 19 TOT cases can be correctly retrieved by our method if we allow 4 suggestions), the resource can be used to demonstrate crucial differences between the tasks of TOT prediction and reverse-association prediction (see the next section).
In addition to this discussion, the paper presents methods used for the stimulus-response association prediction submitted as our contribution to the CogALex 2014 shared task and their results. Section 3 introduces the methods, while Section 4 summarizes results under varying parameters. We conclude with future directions of our research and a proposal for a joint TOT-related activity.

Related Work
There is a long-standing interest in intelligent dictionary interfaces that reflect natural lexical-access needs. Yet, advanced mechanisms of access by meaning are rarely implemented, as their integration presents significant challenges. Zock and Bilac (2004) discuss lookup mechanisms on the basis of word associations. Sinopalnikova and Smrz (2004) introduce lexical access-supporting dictionary enhancements based on various language resources: corpora, Wordnets, explanatory dictionaries and word association norms.
Free word associations are frequently used as test data for word relatedness experiments. Church and Hanks (1990) estimate word associations by a corpus-based association ratio. Word association thesauri or norms, representing collections of empirical data obtained through large-scale psycholinguistic free-association tests, often define a gold standard. In particular, Zesch and Gurevych (2007) employ the University of South Florida word association, rhyme, and word fragment norms (Nelson et al., 1998) to compare characteristics of their graph representation to that of Wikipedia. Rapp (2008) experiments with associative responses to multiword stimuli on the Edinburgh Associative Thesaurus (EAT). The CogALex 2014 shared task is very close to the experimental setting discussed in (Rapp, 2013), which also aims at computing a stimulus word leading to responses given in EAT. A fixed window size is used first to count word co-occurrences. Log-likelihood ratios are then employed to rank candidate words, and products of the ranks define the winner. Provided with 7 responses (as compared to 5 in the CogALex 2014 shared task), the method predicts the stimulus word with 54% accuracy. However, only a specific subset of EAT (Kent and Rosanoff, 1910), which comprises 100 words, is used. It is also not fully clear from which set of potential words the target answers are chosen. This is a crucial aspect that influences accuracy. For example, Rapp (2014) took into account only primary associative responses from EAT, i.e., only 2,792 words. Obviously, it is far simpler to choose the correct answer from a limited set than from all existing words.
Other word association resources are also frequently used as test data. In addition to Wordnet itself, they include the TOEFL (Landauer and Dumais, 1997) and ESL synonym questions (Turney, 2001), the RG-65 (Rubenstein and Goodenough, 1965) and WordSimilarity-353 (Finkelstein et al., 2001) test collections for the degree of synonymy, and SAT analogy questions (Turney et al., 2003) for relational similarity.
Additionally, Heath et al. (2013) evaluate their association model in word guessing games (games with a purpose).

Free word associations v. TOT -similarities and differences
The CogALex 2014 shared task was motivated by natural lexical access but it was defined as computing reversed free-word associations. Participating automatic systems were employed to determine the most probable stimulus leading to the five given most frequent responses from a free association test. For example, given the words circus, funny, nose, fool, and fun, participating systems were supposed to compute the word clown as the answer.
Training and test datasets came from the Edinburgh Associative Thesaurus (EAT) (Kiss et al., 1972). EAT comprises about 100 associative responses given by British students for each of 8,400 stimuli. Items containing multi-word units and non-alphabetical characters were filtered out from the CogALex 2014 experimental data.
Although it has been shown that free word association norms and thesauri provide a valuable source of information for TOT-assisting (Sinopalnikova and Smrz, 2006), the two corresponding phenomena are not identical. Indeed, available data and experience clearly point out similarities but also significant differences.
Both individual free associations and TOT clues can be full of idiosyncrasies. However, while association norms and thesauri try to present prototypical, generalized, most frequent associations, TOT assistants need to cope with personal specificity. Ideally, a system should be able to help its user recall a word given the clue "it was mentioned by Mary during our conversation yesterday".
Both phenomena are also strongly culture-dependent. Among other things, this can make some resources, such as large-scale corpora for particular language variants, unusable. For example, let us consider the very first item from the CogALex test set: the word capable is to be guessed as the stimulus for the responses able, incapable, brown, clever, good. Putting aside for a while the first two response words, which share their roots with the stimulus (see the related discussion below), we come to the word brown. This refers to Lancelot Brown, more commonly known as Capability Brown, an 18th-century English landscape architect. This association is specific to the U.K. and is hardly known to Americans. For example, the two words never collocate in the 450-million-word Corpus of Contemporary American English (COCA), while Capability Brown is mentioned 36 times in the 100-million-word British National Corpus (BNC).
This observation led us to the question of how large the overlap is between two distinct word association thesauri/norms. To explore this, we compared EAT to the University of South Florida Word Association Norms (SFWAN) (Nelson et al., 1998). SFWAN consists of 5,019 normed words and their 72,176 responses. EAT and SFWAN have 3,545 stimulus terms in common. There are 11,788 words used as one or more responses in both sets. Despite the substantial overlap of the stimulus and response sets, responses for the same stimulus words in SFWAN rarely correspond to those given in EAT. Using a simple algorithm that selects the stimulus with the highest overlap among response sets, only 106 stimuli from the CogALex test set (out of 2,000) can be correctly determined from SFWAN. This can be partially explained by the cultural differences between the U.K. and the U.S.A., but also by the relatively distant times of collecting/publishing the resources (1972 v. 1998), slightly different settings of the experiments and non-uniform presentation of the results. In any case, this finding casts doubt on the suitability of EAT for the shared task if no available (large) corpus data reflects the time and the setting of the corresponding word association experiments (reflecting the background of students in 1972).
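The highest-overlap algorithm is not specified further in the text. The following is a minimal sketch of one plausible reading: each stimulus in a reference norm is scored by how many of the given responses also occur among its own recorded responses, and the best-scoring stimulus is returned. The function name, the toy data and the alphabetical tie-breaking are illustrative assumptions, not the procedure actually applied to SFWAN.

```python
def predict_by_overlap(responses, norm):
    """Return the stimulus in `norm` whose response set shares the most
    words with `responses`, or None if nothing overlaps.

    `norm` maps each stimulus to the set of its recorded responses.
    Ties are broken alphabetically (an assumption made for determinism).
    """
    given = set(responses)
    best_stimulus, best_overlap = None, 0
    for stimulus in sorted(norm):
        overlap = len(given & norm[stimulus])
        if overlap > best_overlap:
            best_stimulus, best_overlap = stimulus, overlap
    return best_stimulus


# Toy reference norm (hypothetical data, not taken from SFWAN or EAT).
toy_norm = {
    "clown": {"circus", "funny", "laugh", "joker"},
    "joke": {"funny", "laugh", "humour"},
    "able": {"capable", "can", "good"},
}

# Two of the five responses occur under "clown", one under "joke".
print(predict_by_overlap(["circus", "funny", "nose", "fool", "fun"], toy_norm))
```

With real norms, the low hit rate reported above (106 out of 2,000) would correspond to cases where the top-scoring SFWAN stimulus differs from the EAT stimulus.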
It can also be argued that the observed associations corresponding to TOT clue words are of a different nature than (reversed) free-word associations. In particular, the number of given clues varies: sometimes there are only two or three words, sometimes there are full sentences giving more than 5 keywords to associate with. Spoken clues also frequently state explicitly the kind of relation between the searched word and a clue (e.g., it is an opposite to..., it is used for...).
Subjects in free word association tests are usually instructed to give the first response that comes to mind for the stimulus. TOT clues, on the other hand, are usually related to the searched word in a much more subtle way. In TOT situations, it is usually enough to mention any word of the same root/stem as a candidate and the subject finds the word. Thus, testing free associations such as choler-cholera, capable-incapable, misuse-abuse, actor-actress is completely irrelevant for vocabulary access problems.
Native speakers usually have no problem retrieving a word from memory if it forms part of an idiom and the other part of the idiom is suggested. Thus, predicting either word of tooth and nail is not relevant for TOT situations (in any language that lexicalizes the Latin dentibus et unguibus). Considering lexical access in a foreign language, the reason for the same conclusion can be the opposite: an idiom can be unknown to a learner, so it is not probable that a part of it will be given as a clue.
In languages naturally conceptualizing different parts of speech, writers or speakers always know what word category they are searching for. The collected data as well as intuition also suggest that TOT clues would not mix various senses of a word to be recalled. Consequently, free associations such as stage-theatre/coach or March-April/Hare also have nothing to do with TOT.

Methods
This section introduces methods used to compute multi-word reversed associations in our experiments. The primary method applied in the submitted system takes advantage of deep learning from web-scale corpora. To be sure that computed word associations automatically derived from large textual data cannot be matched by those resulting from a manually created resource, associations predicted by various Wordnet-based measures were also considered.
The Word2Vec technique available in the Python package GenSim (Řehůřek and Sojka, 2010) was primarily utilized to predict a stimulus word from a list of the most frequent responses. Word2Vec defines an efficient way to work with the continuous bag-of-words (CBOW) and skip-gram (SG) architectures, which compute vector representations from very large data sets (Mikolov et al., 2013). The CBOW and SG approaches are both based on distributed representations of words learned by neural networks. The CBOW architecture predicts a current word based on its context, while the SG algorithm predicts surrounding words given a current word. Mikolov et al. (2013) showed that the SG algorithm achieves better accuracies in tested cases. We have therefore applied only this architecture in our experiments. Various parameters of the training model need to be set: the dimensionality of feature vectors, the maximum distance between a current and a predicted word within a sentence, or the initial learning rate. Consequently, we built various instances of the stimulus predictor with varying values of the parameters. Their detailed evaluation is given in the next section.
The CogALex 2014 shared task was divided into two categories. Unrestricted systems could use any kind of data to compute results, while restricted systems were allowed to draw only on the freely available UKWaC corpus (Ferraresi et al., 2008) in order to predict word associations. We implemented systems for both categories. Our unrestricted system employs the ClueWeb12 corpus. UKWaC comprises about 2 billion words and has a size of about 30 GB (including annotations). The ClueWeb12 dataset consists of more than 733 million English web pages (collected between February and May 2012). The size of the complete ClueWeb12 data is 1.95 TB. To speed up the training process, only a fraction of the ClueWeb12 dataset was used to compute the Word2Vec models. It consists of about 8.7 billion words and has a size of 131 GB.
The ClueWeb12 data was pre-processed by removing web-page boilerplate and duplicate content. The original UKWaC dataset already contains POS and lemma annotations. TreeTagger was used to produce the same input for the ClueWeb12 dataset. Some models were created from identified lemmata rather than individual tokens.
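The text does not spell out how a single stimulus is derived from the vectors of several response words. One common choice, which to our understanding is also how GenSim's most_similar query treats a list of positive words, is to average the unit-normalized response vectors and rank the remaining vocabulary by cosine similarity to that mean. The sketch below illustrates this on hand-made three-dimensional toy vectors; the vectors, the function names and the combination rule itself are assumptions made for illustration, whereas a real model would be trained on UKWaC or ClueWeb12.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def predict_stimuli(responses, vectors, topn=1):
    """Rank vocabulary words by cosine similarity to the mean of the
    unit-normalized response vectors; the responses themselves are excluded."""
    known = [normalize(vectors[w]) for w in responses if w in vectors]
    if not known:
        return []
    dim = len(known[0])
    mean = normalize([sum(v[i] for v in known) / len(known) for i in range(dim)])
    scored = []
    for word, vec in vectors.items():
        if word in responses:
            continue
        cosine = sum(a * b for a, b in zip(mean, normalize(vec)))
        scored.append((cosine, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]


# Hypothetical toy embeddings; the models in this paper use 100 to 500 dimensions.
toy_vectors = {
    "circus": [0.9, 0.1, 0.0],
    "funny": [0.7, 0.3, 0.1],
    "nose": [0.6, 0.0, 0.4],
    "clown": [0.8, 0.2, 0.2],
    "table": [0.0, 0.1, 0.9],
}

print(predict_stimuli(["circus", "funny", "nose"], toy_vectors, topn=2))
```

Returning a ranked list rather than a single word makes it straightforward to evaluate the predictor when more than one suggestion is allowed.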
We took advantage of the NLTK toolkit to experiment with Wordnet-based measures. A candidate list of all possible Wordnet-related words that could be considered as potential stimuli was first computed for each of the five given responses. The word with the highest sum of similarities to all five response words was returned as the best stimulus candidate.
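The selection rule (the highest sum of similarities to all response words) can be sketched independently of Wordnet. The example below uses a tiny hand-made is-a taxonomy and the path similarity as a stand-in for the actual Wordnet measures; the taxonomy, the function names and the candidate list are hypothetical and serve only to show the summation step.

```python
# Hypothetical toy is-a taxonomy: child -> parent.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "puppy": "dog",
    "chair": "artifact",
}


def ancestors(node):
    """Return the path from `node` up to the root, starting with the node."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    depth_in_b = {n: i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if n in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[n])
    return 0.0  # no common ancestor


def best_stimulus(responses, candidates):
    """Pick the candidate with the highest summed similarity to all responses."""
    return max(candidates,
               key=lambda c: sum(path_similarity(c, r) for r in responses))


print(best_stimulus(["puppy", "cat"], ["dog", "chair", "artifact"]))
```

In the real setting, the candidates would come from the Wordnet relations of the five responses and the similarity function would be one of the six measures listed below.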
To populate the set of all possibly related words, standard Wordnet relations (Fellbaum, 1998) were considered: hypernyms/hyponyms, instances, holonyms/meronyms (including members and substances), attributes, entailments, causes, verb groups, see-also and similar-to relations. As the similarity measures, we used: the path similarity, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy; Wu-Palmer's similarity (Wu and Palmer, 1994), based on the depth of the two senses in the taxonomy and that of their least common subsumer; Leacock-Chodorow's similarity (Leacock and Chodorow, 1998), based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur; Resnik's similarity (Resnik, 1995), based on the Information Content (IC) of the least common subsumer; Jiang-Conrath's similarity (Jiang and Conrath, 1997), based on the IC of the least common subsumer and that of the two input synsets; and Lin's similarity (Lin, 1998), based on the IC of the least common subsumer and that of the two input synsets.
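For reference, the six measures are usually written as follows. The notation is ours and is given only as a summary of the standard formulations: len(s1, s2) is the length of the shortest is-a path between senses s1 and s2, D is the maximum depth of the taxonomy, lcs is the least common subsumer of s1 and s2, and IC(c) = -log P(c) for a corpus-estimated concept probability P(c). Conventions for counting path length (edges or nodes) differ between implementations, so the exact values produced by a particular toolkit may deviate slightly.

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{path}}(s_1, s_2) &= \frac{1}{1 + \mathrm{len}(s_1, s_2)} \\
\mathrm{sim}_{\mathrm{WuP}}(s_1, s_2) &= \frac{2\,\mathrm{depth}(\mathrm{lcs})}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)} \\
\mathrm{sim}_{\mathrm{LCh}}(s_1, s_2) &= -\log \frac{\mathrm{len}(s_1, s_2)}{2D} \\
\mathrm{sim}_{\mathrm{Res}}(s_1, s_2) &= \mathrm{IC}(\mathrm{lcs}) \\
\mathrm{sim}_{\mathrm{JCn}}(s_1, s_2) &= \frac{1}{\mathrm{IC}(s_1) + \mathrm{IC}(s_2) - 2\,\mathrm{IC}(\mathrm{lcs})} \\
\mathrm{sim}_{\mathrm{Lin}}(s_1, s_2) &= \frac{2\,\mathrm{IC}(\mathrm{lcs})}{\mathrm{IC}(s_1) + \mathrm{IC}(s_2)}
\end{align*}
```

The last two measures use the same three quantities; Jiang-Conrath combines them as a difference and Lin as a ratio.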

Word2Vec approach
There are various parameters to tune when creating the models for the SG algorithm. We experimented with three of them: values 100, 300 and 500 were tested as dimensionalities of feature vectors, values 3, 5 and 7 as maximum distances between current and predicted words within a sentence, and lemmatization was switched on or off. The value word means that original lowercased tokens were used for the computation of models, whereas the value lemma means that we used lowercased lemmata corresponding to the original words. The resulting models are named accordingly as size-window-token (e.g., 100-3-lemma, 500-3-word), where size denotes the dimensionality of the feature vectors, window the maximum distance between the current and a predicted word within a sentence, and token determines whether original words or lemmata were used for a given model. We restricted the parameters to these values mainly to cope with computational requirements. Although the Word2Vec toolkit supports multi-threaded computation, it took significant time to build all the models. For example, 60 hours on 8 threads were needed to compute the 500-7-lemma model for the ClueWeb12 data. Although higher values for the size and window parameters would probably bring better accuracies, they were not tested due to time constraints. Results for various combinations of parameters are summarized in Table 1.
EAT sometimes gives inflectional variants of words (e.g., plurals) as stimuli or responses. A strict evaluation comparing exact strings can then harm systems that do not try to match particular wordforms. To quantify the effect, we compared the results of our system on two versions of the test sets: expanded target word lists, which allow all wordforms for each target word, and the original lists. Results are given in Table 1, in the columns denoted inflectional for the expanded lists and non-inflectional for the original data.
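The size-window-token naming scheme spans a small parameter grid, which can be enumerated as in the sketch below. The helper names are hypothetical, and the sketch only generates and parses model names under the assumption that every combination of the three settings is used; it does not train any model.

```python
from itertools import product

SIZES = (100, 300, 500)      # dimensionality of the feature vectors
WINDOWS = (3, 5, 7)          # maximum distance between current and predicted word
TOKENS = ("word", "lemma")   # lowercased tokens or lowercased lemmata


def model_names():
    """All size-window-token combinations, e.g. '500-7-lemma'."""
    return [f"{size}-{window}-{token}"
            for size, window, token in product(SIZES, WINDOWS, TOKENS)]


def parse_model_name(name):
    """Split a model name back into its three settings."""
    size, window, token = name.split("-")
    return {"size": int(size), "window": int(window), "token": token}


names = model_names()
print(len(names), names[0], names[-1])
print(parse_model_name("500-7-lemma"))
```

Under this assumption, the grid yields 18 configurations per corpus, which gives a sense of the training cost when a single large model takes tens of hours.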
As can be seen, model 500-5-lemma reaches the best accuracy for the unrestricted task and models 300-7-word and 500-5-word win in the restricted task. As only one set of results was allowed to be submitted for each task, we employed the 500-5-word model in our submission.
Although the CogALex 2014 shared task was defined as predicting exactly one stimulus word for five given responses, lexical access helpers can easily accommodate more suggestions. This can be evaluated by checking how frequently a gold-standard stimulus appears among the top n predicted words. Figures 1 and 2 compare results of our unrestricted and restricted systems, respectively, for up to 10 suggestions (the inflectional case). As expected, the accuracy increases with the number of candidate words taken into account. The best value of 0.4865 for the unrestricted system is reached using model 500-5-lemma, while the best accuracy of 0.4575 for the restricted system comes from model 500-7-word.
Together with their original Word2Vec implementation, Mikolov et al. (2013) also made available word vectors resulting from training on a part of the Google News dataset (consisting of 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases (no lemmatization was performed). We repeated the CogALex 2014 shared task experiments with this pre-trained model as well. The resulting accuracy was 0.1375. The lower value is probably caused by the fact that the model is trained on a specific dataset with different pre-processing.
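The top-n evaluation, together with the inflectional and non-inflectional variants of the test sets, can be expressed compactly. The following is a minimal sketch under the assumption that each test item comes with a ranked list of predictions and that accepted wordforms are available as a mapping from each target word to its variants; the function name and the toy data are illustrative only.

```python
def accuracy_at_n(predictions, gold, n, variants=None):
    """Fraction of items whose gold stimulus appears among the top-n predictions.

    If `variants` is given (target word -> set of accepted wordforms), any of
    those forms counts as a hit (the inflectional case); otherwise only the
    exact target string is accepted (the non-inflectional case).
    """
    hits = 0
    for ranked, target in zip(predictions, gold):
        accepted = {target}
        if variants is not None:
            accepted |= variants.get(target, set())
        if accepted & set(ranked[:n]):
            hits += 1
    return hits / len(gold)


# Hypothetical ranked predictions for three test items.
preds = [
    ["clowns", "clown", "joker"],
    ["able", "capable", "skilled"],
    ["dog", "cat", "pet"],
]
gold = ["clown", "capable", "horse"]
forms = {"clown": {"clowns"}}

print(accuracy_at_n(preds, gold, 1))                  # strict, one suggestion
print(accuracy_at_n(preds, gold, 1, variants=forms))  # inflectional, one suggestion
print(accuracy_at_n(preds, gold, 2))                  # strict, two suggestions
```

In this toy example, both accepting wordforms and allowing a second suggestion increase the score, mirroring the two effects reported above.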

Wordnet-based measures
The Wordnet-based approach was evaluated in the same way as the Word2Vec one. Results for all six similarity measures are listed in Table 2. As in the previous case, accuracies for the top n (1 ≤ n ≤ 10) predicted responses are considered. The best-performing Wordnet similarity measure for the task proved to be Lin's similarity, based on the Information Content of the least common subsumer and that of the two input synsets. Yet, the best values are far from the accuracies of the Word2Vec-based methods, especially when only a few predicted responses are allowed. This confirms our hypothesis that approaches deriving their lexical knowledge from large textual corpora outperform those based only on Wordnet.

Conclusions and future directions
The CogALex 2014 shared task focused on computing reversed multi-word response-stimulus relations extracted from the Edinburgh Associative Thesaurus. We showed that this setting is only weakly related to computer-aided lexical access problems, namely to the tip-of-the-tongue phenomenon.
The submitted results were obtained with a system based on the Word2Vec distributional similarity model. The best of the implemented systems reaches an accuracy of 0.1975 when trained on a subset of the ClueWeb12 dataset. Unfortunately, at the time of writing this paper, the official results of other teams have not been published. Hence, no comparison with other participants could be included.
Section 2 also mentions our experience in collecting real TOT data. We believe that a collective effort could lead to a much larger resource that better reflects the nature of the TOT phenomenon. We propose to establish a task force aiming at this goal. During discussions at the workshop, we could focus on actual procedures and technical support means (through a web-based system) to build the resource within the next year. The collected dataset could then be used for future shared tasks in the domain.