Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language

Current NLP datasets targeting ambiguity can be solved by a native speaker with relative ease. We present Cryptonite, a large-scale dataset based on cryptic crosswords, which is both linguistically complex and naturally sourced. Each example in Cryptonite is a cryptic clue, a short phrase or sentence with a misleading surface reading, whose solving requires disambiguating semantic, syntactic, and phonetic wordplays, as well as world knowledge. Cryptic clues pose a challenge even for experienced solvers, though top-tier experts can solve them with almost 100% accuracy. Cryptonite is a challenging task for current models; fine-tuning T5-Large on 470k cryptic clues achieves only 7.6% accuracy, on par with the accuracy of a rule-based clue solver (8.6%).


Introduction
The ambiguity of natural language is one of the most fundamental challenges in NLP research. While there are works and datasets specifically targeting ambiguity (Levesque et al., 2011; Raganato et al., 2017; Sakaguchi et al., 2020), these can be solved by a native speaker with relative ease. Can we design a dataset with ambiguities that pose a challenge even to competent native speakers?
We present Cryptonite, a large-scale dataset based on cryptic crosswords, which is both linguistically complex and naturally sourced. Cryptonite's 523K examples are taken from professionally-authored cryptic crosswords, making them less prone to artifacts and biases than examples created by crowdsourcing (Gururangan et al., 2018; Geva et al., 2019). Each example in Cryptonite is a cryptic clue, a short phrase or sentence with a misleading surface reading, which poses a challenge even for humans experienced in cryptic crossword solving. A cryptic clue usually consists of two underlying parts: wordplay and definition. The clue's answer both disambiguates the wordplay and, at the same time, directly answers the definition. While solving cryptic clues requires disambiguating semantic, syntactic, and phonetic wordplays, as well as world knowledge, clues are designed to have only one possible answer. See Figure 1 for an example clue and its solution.
Figure 1: How to solve the cryptic clue "One doesn't like shifting earth (5)": Solving usually starts by figuring out which of the clue's words belong to the definition (blue) and which to the wordplay (orange). Next, one needs to figure out the type of wordplay, which is often hinted by an indicator (purple). In our case, "shifting" hints that the answer is an anagram of some part of the wordplay. As the enumeration (gray) states the answer is a five-letter word, "earth" is a promising candidate for anagramming. Finally, given that "hater" is both an anagram of "earth" and a synonym of the definition, we conclude it to be the correct answer.
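The anagram step for the clue "One doesn't like shifting earth (5)" can be sketched in a few lines. This is only an illustration under stated assumptions: the vocabulary is a hypothetical toy list, and the function names are not taken from any existing solver. It shows that the wordplay alone leaves several candidates, and that the definition ("One doesn't like") is what singles out "hater".

```python
from collections import Counter


def is_anagram(a: str, b: str) -> bool:
    """True if a and b use exactly the same letters (ignoring case and spaces)."""
    def letters(s: str) -> Counter:
        return Counter(s.replace(" ", "").lower())
    return letters(a) == letters(b)


def anagram_candidates(fodder, length, vocabulary):
    """Return vocabulary words of the given length that rearrange `fodder`."""
    return [
        word for word in vocabulary
        if len(word) == length and word != fodder and is_anagram(word, fodder)
    ]


# Toy vocabulary (hypothetical); a real solver would use a full dictionary.
VOCAB = ["earth", "hater", "heart", "rathe", "loath", "enemy"]

# "shifting" indicates an anagram, and the enumeration (5) fixes the length.
candidates = anagram_candidates("earth", 5, VOCAB)
print(candidates)  # several anagrams survive; only "hater" matches the definition
```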
We provide a standard baseline by fine-tuning the generic T5-Large conditional language model (Raffel et al., 2020) on Cryptonite, achieving only 7.6% accuracy. For comparison, a rule-based cryptic clue solver (Deits, 2021) achieves 8.6% accuracy. These results highlight the challenge posed by Cryptonite, making it a candidate for assessing the disambiguation capabilities of future models.
Analyzing the results of both baselines, we find a correlation between performance on individual clues and a human assessment of the clue's difficulty, and that the enumeration (answer length) is highly informative. Finally, we show that ensuring that the answers of train and test examples are mutually exclusive is critical for a candid estimation of T5's ability to solve cryptic clues in general.

Cryptic Crosswords
A cryptic crossword, just like a regular (non-cryptic) crossword, is a puzzle designed to entertain humans, comprised of clues whose answers are to be filled in a letter grid. Unlike regular crosswords, where answers are typically synonyms or hyponyms of the clues (Severyn et al., 2015), cryptic clues have a misleading surface reading, and solving them requires disambiguating wordplays. A cryptic clue has only one possible answer, even when taken outside the context of the letter grid.
Generally, a cryptic clue (henceforth, "clue") consists of two parts: wordplay and definition. The wordplay-definition split is not given to the solver, and parsing it is usually the first step in solving a clue. Both the wordplay and the definition lead to the answer, but each in a different manner. While the definition is directly related to the answer (e.g. a synonym or a hypernym), the wordplay needs to be deciphered, usually with the help of an indicator that hints at the wordplay's type.
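To make the notion of a wordplay type concrete, the following sketch implements a few letter-level operations that indicators can signal, and applies them to two example clues shown later in the paper ("Got staff back in case of blockage (6)" and "Model of car and every train (5)"). The function names are hypothetical, and the synonym lookups ("staff" to "mace", "every" to "each") are supplied by hand rather than derived from any lexical resource.

```python
def reverse(word: str) -> str:
    """Reversal, signalled here by the indicator "back"."""
    return word[::-1]


def boundary_letters(word: str) -> str:
    """First and last letter of a word, signalled here by "case of"."""
    return word[0] + word[-1]


def insert_into(inner: str, outer: str, position: int = 1) -> str:
    """Place `inner` inside `outer`, signalled here by "in"."""
    return outer[:position] + inner + outer[position:]


def concatenate(*parts: str) -> str:
    """Additive wordplay, signalled here by "and"."""
    return "".join(parts)


# "Got staff back in case of blockage (6)"
staff_synonym = "mace"                       # hand-supplied synonym of "staff"
container = boundary_letters("blockage")     # "be"
answer_c = insert_into(reverse(staff_synonym), container)

# "Model of car and every train (5)"
answer_a = concatenate("t", "each")          # Ford Model T + synonym of "every"

print(answer_c, answer_a)
```

The hard part of solving, which this sketch deliberately leaves out, is deciding which words are indicators, which are wordplay material, and which form the definition.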

Dataset
We introduce Cryptonite, a dataset of 523,114 cryptic clues from 17,375 English-language crosswords published in The Times and The Telegraph between October 2000 and October 2020. For preprocessing, we remove clue-answer duplicates and examples whose answer and enumeration do not match. In addition, we remove any examples with the same clue but with a different answer. While this occurred in less than 0.1% of the data, these examples violate the principle that a clue must have a single solution once the wordplay is deciphered (see Section 2).
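The three preprocessing filters described above could look roughly as follows. This is a sketch, not the released preprocessing code: the record schema (`clue`, `answer`, `enumeration`), the order of the filters, and the assumption that multi-word answers are separated by spaces are all illustrative choices, and the toy records include deliberately invalid rows made up for the example.

```python
from collections import defaultdict


def enumeration_of(answer: str) -> str:
    """Build an enumeration such as "(3,2,3)" from the word lengths of an answer."""
    return "(" + ",".join(str(len(word)) for word in answer.split()) + ")"


def preprocess(examples):
    """Apply the three filters to a list of dicts (hypothetical schema)."""
    # 1. Drop exact clue-answer duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for ex in examples:
        key = (ex["clue"], ex["answer"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    # 2. Drop examples whose answer does not match the stated enumeration.
    matched = [ex for ex in unique if enumeration_of(ex["answer"]) == ex["enumeration"]]

    # 3. Drop every clue that appears with more than one distinct answer.
    answers_per_clue = defaultdict(set)
    for ex in matched:
        answers_per_clue[ex["clue"]].add(ex["answer"])
    return [ex for ex in matched if len(answers_per_clue[ex["clue"]]) == 1]


raw = [
    {"clue": "Second parasite", "answer": "tick", "enumeration": "(4)"},
    {"clue": "Second parasite", "answer": "tick", "enumeration": "(4)"},      # duplicate
    {"clue": "Assumed diamonds to be shelved", "answer": "put on ice", "enumeration": "(3,2,3)"},
    {"clue": "Rent going up, it's said", "answer": "hire", "enumeration": "(5)"},  # made-up mismatch
    {"clue": "Made-up clue", "answer": "alpha", "enumeration": "(5)"},         # conflicting answers
    {"clue": "Made-up clue", "answer": "omega", "enumeration": "(5)"},
]
clean = preprocess(raw)
print([ex["answer"] for ex in clean])
```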

Figure 2, panel (a): "Model of car and every train (5)". Answer: teach. T (world knowledge: Ford Model T) + each (synonym of "every") = teach (synonym of "train"). A clue with a relatively simple additive wordplay, also requiring world knowledge. One can decipher the wordplay by identifying "and" as a concatenation indicator.
Figure 2, panel (b): "Rent going up, it's said (4)". Answer: hire. "Rent" is a synonym of hire, "going up" is a synonym of higher, and hire sounds like higher. Clues can also have phonetic wordplays. Here, "it's said" implies that the wordplay is a homophone of the answer.
Figure 2, panel (c): "Got staff back in case of blockage (6)". Answer: became. "staff" is a synonym of mace, which is reversed to ecam; the boundary letters of "blockage" are be; putting one inside the other yields became, a synonym of "Got". Many clues combine more than one type of wordplay. This clue composes three: reversing the letters of a word ("back"), taking the boundary letters of another ("case of"), and inserting the former into the latter ("in").
Figure 2, panel (d): "Getting fed up about midday (2,5)". Answer: at lunch. A possible surface reading is "someone being mad about it being noon". A potential source of ambiguity is "fed up". Try finding a different, perhaps more literal reading of the clue. Although many wordplays can be roughly clustered into types and deciphered based on indicators, there is no silver bullet for solving cryptic crosswords. For this clue, even the standard wordplay-definition split does not apply; instead, the entire clue points to the answer.
We follow the recent findings of Lewis et al. (2020), and split Cryptonite into train, validation, and test sets, where no answer is shared between them. Answer splitting creates a far more challenging benchmark for supervised models than naive random splits (see Section 4.3). Table 1 shows some basic statistics of the final Cryptonite dataset.
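An answer-disjoint split can be obtained by partitioning the set of distinct answers rather than the set of clues. The sketch below illustrates the idea; the split fractions, the seed, and the `(clue, answer)` tuple format are assumptions for the example and do not reproduce the official partition.

```python
import random
from collections import defaultdict


def answer_split(examples, fractions=(0.9, 0.05, 0.05), seed=0):
    """Partition (clue, answer) pairs so that no answer occurs in two splits.

    `fractions` are illustrative proportions of distinct answers assigned to
    train and validation; the remaining answers go to test.
    """
    by_answer = defaultdict(list)
    for clue, answer in examples:
        by_answer[answer].append((clue, answer))

    answers = sorted(by_answer)              # deterministic order before shuffling
    random.Random(seed).shuffle(answers)

    cut_train = int(fractions[0] * len(answers))
    cut_valid = cut_train + int(fractions[1] * len(answers))
    groups = {
        "train": answers[:cut_train],
        "validation": answers[cut_train:cut_valid],
        "test": answers[cut_valid:],
    }
    # All clues of an answer follow that answer into its split.
    return {
        name: [pair for answer in group for pair in by_answer[answer]]
        for name, group in groups.items()
    }


# Toy data (made up): 20 answers with 3 clues each.
toy = [(f"clue {i} for answer {a}", f"answer{a}") for a in range(20) for i in range(3)]
splits = answer_split(toy)
print({name: len(items) for name, items in splits.items()})
```

A naive random split would instead shuffle `examples` directly, which lets clues for the same answer land on both sides of the train-test boundary.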

Experiments
We provide initial results on Cryptonite using two baselines: T5-Large (Raffel et al., 2020) and a rule-based cryptic clue solver (Deits, 2021). Despite training on half a million clues (T5) or being tailored to the task (rule-based solver), both approaches solve only a small portion of the test data, demonstrating that Cryptonite is indeed a challenging task. We further investigate two properties of the data: how difficulty (as perceived by humans) correlates with accuracy, and the informativeness of enumeration. In addition, we analyze how a naive data split affects the performance of T5, demonstrating that partitioning by answers is crucial for obtaining a candid estimate of the neural model's ability to generalize to new cryptic clues.

Baselines
T5-Large Following current NLP methodology, we fine-tune the 770M parameter T5-Large (Raffel et al., 2020) on Cryptonite. The model's encoder takes the clue as input, and its decoder predicts the answer, using teacher forcing during training and beam search (b = 5) during inference.
We use HuggingFace (Wolf et al., 2020) with the recommended settings (Raffel et al., 2020), optimizing with AdaFactor (Shazeer and Stern, 2018) at a constant learning rate of 0.001. We train until convergence with a patience of 10 epochs and a batch size of 7000 tokens, selecting the best model checkpoint using validation set accuracy.
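The checkpoint selection rule (stop once validation accuracy has not improved for a fixed number of epochs, then keep the best epoch) can be written independently of any training framework. The sketch below is a generic illustration with a hypothetical function name and made-up accuracy values; it is not the training script used for the reported results.

```python
def select_checkpoint(validation_accuracies, patience=10):
    """Return (best_epoch, best_accuracy, epochs_run) under patience-based stopping.

    Training stops at the first epoch that is `patience` epochs past the best
    epoch without a strict improvement in validation accuracy.
    """
    best_epoch, best_accuracy = -1, float("-inf")
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_epoch, best_accuracy = epoch, accuracy
        elif epoch - best_epoch >= patience:
            return best_epoch, best_accuracy, epoch + 1
    return best_epoch, best_accuracy, len(validation_accuracies)


# Made-up per-epoch validation accuracies, with a short patience for brevity.
history = [0.01, 0.03, 0.05, 0.04, 0.05, 0.02]
print(select_checkpoint(history, patience=2))
```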
T5 uses SentencePiece tokenization (Kudo and Richardson, 2018), which might incur some information loss, as many clues require character-level manipulations.
Rule-based Solver We also gauge the abilities of a rule-based solver with a manually-crafted probabilistic grammar (Deits, 2021). Building on the assumption that a clue can usually be split into a wordplay and a definition (Section 2), the solver tries to find the most probable parse such that the wordplay yields a semantically-similar result to the definition. The similarity between the definition and the parsed wordplay is calculated using expert-authored resources such as WordNet (Miller, 1995). Some less frequent wordplay types, such as homophones (Figure 2b) and hidden-at-intervals (Moorey, 2018, Chapter 3), are not implemented in the solver's grammar.

Main Benchmark
We first evaluate our baselines on the main dataset. [...] Fine, 2016, 2018), though this expertise is acquired through significant training. Appendix A shows a selection of examples and the respective predictions of T5.

Analysis
Correlation with human perception of difficulty Quick cryptic crosswords are a subgenre of cryptic crosswords aimed at beginners, with clues designed to be easier to solve. Cryptonite's test set contains 2,081 such clues. Examining the results of our main benchmark, Table 3 shows that both baselines perform better on quick clues, suggesting a correlation between human assessment of linguistic difficulty and the models' performance on clues. 5
5 All quick clues are taken from Times Quick Cryptic (TQC). For a fair comparison to the quick clues, we consider a clue as non-quick only if it was published in The Times after March 10th 2014 (when TQC was introduced), and not as a part of TQC. Cryptonite's test set contains 4,653 such non-quick clues.
Table 4: Comparison of baseline accuracy when enumeration is provided and when it is not provided.

The effect of enumeration The enumeration is the number (or numbers) in parentheses at the end of a clue indicating the number of letters in its answer, e.g. (7) or (5,4). To measure the informativeness of enumeration, we run our main experiment again, this time without providing the enumeration. Table 4 shows an accuracy drop in both baselines when the enumeration is not provided. 6 While it is to be expected that enumeration helps the rule-based solver, we see that T5 is able to leverage this information as well.
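For the no-enumeration setting, the trailing parenthesized numbers have to be separated from the clue text. The following sketch shows one way to do this and to check whether a candidate answer fits an enumeration. The regular expression and function names are assumptions; in particular, treating hyphens as a second separator alongside commas is a guess about the enumeration format rather than something stated in the paper.

```python
import re

# Trailing enumeration such as "(7)" or "(5,4)"; hyphen support is an assumption.
ENUM_PATTERN = re.compile(r"\s*\((\d+(?:[,-]\d+)*)\)\s*$")


def split_enumeration(clue: str):
    """Split "text (5,4)" into ("text", [5, 4]); return (clue, None) if absent."""
    match = ENUM_PATTERN.search(clue)
    if match is None:
        return clue, None
    lengths = [int(n) for n in re.split(r"[,-]", match.group(1))]
    return clue[:match.start()], lengths


def fits(answer: str, lengths) -> bool:
    """True if the word lengths of `answer` match the enumeration."""
    return [len(word) for word in re.split(r"[ -]", answer)] == lengths


text, lengths = split_enumeration("Getting fed up about midday (2,5)")
print(text, lengths, fits("at lunch", lengths))
```

A length check of this kind prunes the candidate space sharply, which is consistent with the accuracy drop reported when the enumeration is withheld.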
Why do we split the data by answer? Many clues that share the same answer are paraphrases of each other (Appendix B). A neural model such as T5 might exploit this information by copying answers from memorized training examples. Therefore, to test whether a model has learnt a general process for solving cryptic clues, we follow Lewis et al. (2020) and make Cryptonite's default split the answer split, in which the answers of the train, validation, and test sets are mutually exclusive.
We compare the answer split with a naive (random) partition of the data. Table 5 shows that a naive split of Cryptonite will grossly overestimate the performance of T5; while the rule-based solver's performance barely changes, T5-Large is now able to solve an additional 50% of the entire test set. Further analyzing the naive test set (Figure 3), we observe that the probability of T5 solving a clue is highly correlated with the number of times its answer appeared in the training set. This result indicates that a significant part of the performance difference is due to the paraphrasing artifact, and that ensuring unseen test answers is critical for establishing a true estimate of a model's ability to solve cryptic clues.
6 Cryptonite's metadata contains additional information that could help a solver, such as orientation (whether the clue is across or down in the grid). Knowing the orientation can help in finding the clue's wordplay-definition split.
Figure 3: The naive split exhibits a strong correlation between T5's accuracy on clues from the test set (vertical axis) and the number of times their answer appears in the train set (horizontal). Each dot's size represents the number of clues from the test set whose answer appears in the train set n times. Trend line is logarithmic.
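The analysis behind Figure 3 amounts to grouping test clues by how often their answer occurs in the training set and computing accuracy per group. A minimal sketch follows; the function name and the toy answers and predictions are made up for illustration and are not taken from the actual experiment.

```python
from collections import Counter, defaultdict


def accuracy_by_train_frequency(train_answers, test_examples):
    """Map n -> accuracy over test clues whose answer appears n times in training.

    `test_examples` is a list of (gold_answer, predicted_answer) pairs.
    """
    train_counts = Counter(train_answers)
    correct, total = defaultdict(int), defaultdict(int)
    for gold, predicted in test_examples:
        n = train_counts[gold]               # 0 for answers unseen in training
        total[n] += 1
        correct[n] += int(predicted == gold)
    return {n: correct[n] / total[n] for n in sorted(total)}


# Toy data (made up) mimicking a naive split with overlapping answers.
train_answers = ["tick", "tick", "tick", "band", "mimic"]
test_examples = [
    ("tick", "tick"), ("tick", "tick"),      # answer seen 3 times in training
    ("band", "band"), ("band", "bans"),      # answer seen once
    ("hater", "heart"),                      # answer never seen
]
print(accuracy_by_train_frequency(train_answers, test_examples))
```

Under the answer split every test clue falls into the n = 0 group by construction, which is what removes the memorization effect.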

Related Work
Cryptic crosswords Williams and Woodhead (1979) attempt to devise a formal language for describing cryptic clues. Hart and Davis (1992) define four stages of rule-based solving, and implement the second stage, "syntactic identification". In our work we focus on creating a large-scale dataset of cryptic clues and applying neural and rule-based methods to establish a strong baseline. Hardcastle (2001, 2007) focuses on rule-based approaches for creating cryptic clues given a word as an answer. Although in our work we test solving abilities, the reverse direction of creating a clue from an answer is also challenging, and the Cryptonite dataset could prove useful in this direction as well.
Language disambiguation In addition to works and datasets specifically targeting disambiguation on the word level (Levesque et al., 2011; Raganato et al., 2017; Sakaguchi et al., 2020), there are other domains strongly related to language disambiguation. Among them are pun disambiguation (Miller and Gurevych, 2015; Miller et al., 2017), and sarcasm detection (Joshi et al., 2017; Oprea and Magdy, 2020). However, to the best of our knowledge Cryptonite is the first dataset both large in scale (unlike pun disambiguation), and containing a variety of wordplays (unlike sarcasm detection).
Non-cryptic crosswords As described in Section 2, non-cryptic ("regular") crosswords are the common crosswords found in most newspapers. There are works introducing regular crossword datasets, some even containing a small percentage of more "tricky" clues (Littman et al., 2002). However, identifying this small portion of clues requires human effort, whereas Cryptonite is already guaranteed to consist entirely of cryptic clues. In addition, works on solving regular crosswords typically rely on an external database of clues (Ernandes et al., 2005; Barlacchi et al., 2014; Severyn et al., 2015). When given a clue as an input, these systems search the database for the most similar clues, in the hope that they share the answer with the input clue. In Cryptonite, the answers of the train, validation, and test sets are mutually exclusive (Section 3). In doing so, we hope to shift the focus of solving from memorization to reasoning, which is especially interesting in the setting of cryptic clues.

Conclusion
We presented Cryptonite, a large-scale dataset based on cryptic crosswords, whose solving requires disambiguating a variety of wordplays. We saw that the standard approach of fine-tuning T5-Large on Cryptonite does not outperform an existing rule-based model, achieving 7.6% and 8.6% accuracy respectively, while human experts achieve close to 100% accuracy. These results highlight the challenge posed by Cryptonite, and will hopefully encourage further research on disambiguation tasks that are not easily solved by a native speaker.

A Example Predictions
Clue                                          Answer
Act like tragic heroine with cold extremity   mimic
Group of musicians prohibited on the radio    band
Assumed diamonds to be shelved                put on ice
Is in control of distant armies abroad        administrates
Second parasite                               tick