Lexical Semantic Recognition

In lexical semantics, full-sentence segmentation and segment labeling of various phenomena are generally treated separately, despite their interdependence. We hypothesize that a unified lexical semantic recognition task is an effective way to encapsulate previously disparate styles of annotation, including multiword expression identification / classification and supersense tagging. Using the STREUSLE corpus, we train a neural CRF sequence tagger and evaluate its performance along various axes of annotation. As the label set generalizes that of previous tasks (PARSEME, DiMSUM), we additionally evaluate how well the model generalizes to those test sets, finding that it approaches or surpasses existing models despite training only on STREUSLE. Our work also establishes baseline models and evaluation metrics for integrated and accurate modeling of lexical semantics, facilitating future work in this area.


Introduction
Many NLP tasks traditionally approached as tagging, such as named entity recognition, supersense tagging, and multiword expression identification, focus on lexical semantic behavior: they aim to identify and categorize lexical semantic units in running text using a general set of labels. By analogy with named entity recognition, we can use the term lexical semantic recognition (LSR) for such chunking-and-labeling tasks that apply to lexical meaning generally, not just entities. This disambiguation can serve as a foundational layer of analysis for downstream applications in natural language processing, and it provides an initial level of organization for compiling lexical resources such as semantic nets and thesauri.
In this paper, we investigate a more inclusive LSR task of lexical semantic segmentation and disambiguation in the STREUSLE corpus (Schneider and Smith, 2015; Schneider et al., 2018a). As detailed in §2, STREUSLE contains comprehensive annotations of MWEs (along with their holistic syntactic status) and noun, verb, and preposition/possessive supersenses. It thus subsumes evaluations like those featured in the DiMSUM and PARSEME shared tasks (Schneider et al., 2016; Ramisch et al., 2018). We train a baseline neural CRF using BERT embeddings and find that it obtains encouraging results on the overall task and on the tasks it subsumes (§3). This paper thus contributes:
• a baseline neural CRF for an inclusive English lexical semantic recognition task as defined in the STREUSLE corpus; and
• comparisons with the state of the art on existing tasks subsumed by our tagger.

Tagging Frameworks
In this work, we mainly address the rich lexical semantic analysis in STREUSLE (§2.1), but we also consider PARSEME and DiMSUM (§2.2).

STREUSLE
STREUSLE (Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions; Schneider and Smith, 2015; Schneider et al., 2018b) is a corpus of online reviews annotated comprehensively for lexical semantic units and supersense labels. This is done with three annotation layers on top of the tokens: multiword expressions, lexical categories, and supersenses.
The supersenses apply to noun, verb, and prepositional/possessive units. Figure 1 shows an example; the annotation layers are described in §2.1.1.

Figure 1: Example sentence from the Reviews training set (reviews-086839-0003, "We took our vehicle in for a repair to the air conditioning"), with STREUSLE and UD annotations. The (strong) multiword expressions "took ... in" and "air conditioning" each receive a single lexcat and supersense.

Annotation Layers
STREUSLE comprises the entire Reviews section of the English Web Treebank (Bies et al., 2012), for which there are gold Universal Dependencies (UD; Zeman et al., 2019) graphs, and adopts the same train/dev/test split. The part-of-speech and syntactic dependency parts of the UD annotation appear below the sentence in Figure 1. The lexical-level annotations do not make use of the UD parse directly, but there are constraints on compatibility between lexical categories and UPOS tags (see §3.1).
Multiword expressions (MWEs; Baldwin and Kim, 2010) are expressed as groupings of two or more tokens into idiomatic or collocational units. As detailed by Schneider et al. (2014a,b), these units may be contiguous or gappy (discontinuous). Each unit is marked with a binary strength value: idiomatic/noncompositional expressions are strong; collocations that are nevertheless semantically compositional, like "highly recommended", are weak. The gap in a discontinuous MWE may contain single-word and/or other multiword expressions, provided that those embedded MWEs do not themselves contain gaps.
We use the term lexical unit for a token or strong group of tokens that is not contained by any larger strong MWE. That is, ignoring weak MWE groupings, each of the lexical units of a sentence is a single-word expression or strong MWE. A sentence's lexical units must be disjoint in their tokens. The other layers of semantic annotation augment lexical units, and weak MWEs are groupings of (entire) lexical units.
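The definition above can be sketched concretely: given the strong MWE groupings of a sentence, the lexical units are those groups plus a singleton for every uncovered token. This is a minimal illustration under our own assumptions (strong groups represented as sets of token indices), not STREUSLE's file format:

```python
def lexical_units(n_tokens, strong_groups):
    """Compute a sentence's lexical units: each strong MWE group, plus a
    singleton unit for every token not covered by any strong group.
    Weak MWE groupings are ignored, per the definition of lexical unit."""
    covered = {i for group in strong_groups for i in group}
    units = [tuple(sorted(group)) for group in strong_groups]
    units += [(i,) for i in range(n_tokens) if i not in covered]
    return sorted(units)  # units are disjoint in their tokens
```

For the sentence in Figure 1, with "took ... in" as a gappy strong MWE over (hypothetical) token indices 1 and 4, `lexical_units(5, [{1, 4}])` yields `[(0,), (1, 4), (2,), (3,)]`.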
Lexical categories (lexcats) elaborate on the syntax of lexical units. They are similar to the Universal POS tags available as part of the UD annotations in the corpus, but they are necessary in order to (a) express refinements relevant to the criteria for the application of supersenses, and (b) account for the overall syntactic behavior of strong MWEs, as this may not be obvious from their internal syntactic structure. Table 1 gives the full list of lexcats.
Supersenses semantically classify lexical units and provide a measure of disambiguation in context. There are three sets of supersense labels: nominal, verbal, and prepositional/possessive. The set of applicable labels (and indeed, whether any supersense should be applied) is determined by the lexcat. Preposition tokens are labeled with two supersenses in some cases: scene role labels represent the semantic role of the prepositional phrase marked by the preposition, and functional role labels represent the lexical contribution of the preposition in itself.

Tag Serialization
STREUSLE specifies token-level tags to allow modeling lexical semantic recognition as sequence tagging. Each token receives an MWE position tag: O for a token outside any MWE, B for the first token of an MWE, I_ for a continuation of a strong MWE, and I~ for a continuation of a weak MWE. The lowercase counterparts o, b, i_, i~ are the same except they are used within the gap of a discontinuous MWE. For MWE identification, local constraints on tag bigrams, e.g., that the bigrams ⟨B,B⟩ and ⟨B,O⟩ are invalid, and that the sentence must end with I_, I~, or O, ensure a valid overall segmentation into units (Schneider and Smith, 2015). The lexcat and (where applicable) supersense information is incorporated in the first tag of each lexical unit. Thus B-N-n.ARTIFACT indicates the beginning of an MWE whose lexcat is N and supersense is N.ARTIFACT. I_ and i_ tags never contain lexcat or supersense information, as they continue a lexical unit, whereas O, B, I~, o, b, and i~ always do. Figure 2 illustrates the full tagging. All told, STREUSLE has 601 complete tags.
(Though in named entity recognition it is typical to include the class label on every token in the multiword unit, STREUSLE does not do this because it would create a nonlocal constraint across gaps: the tags at either end of a gap would have to have matching lexcat and supersense information. A tagger would either need a more expensive decoding algorithm or a greatly enhanced state space so that within-gap tags capture information about the gappy expression.)
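The local bigram constraints can be checked with a single pass over the position tags. The following is a partial sketch covering only the constraints named here, not STREUSLE's full validation logic:

```python
UNIT_END = {"I_", "I~", "O"}  # tags that may legally end a sentence

def position(tag):
    # The MWE position is the first hyphen-separated field of a full tag,
    # e.g. "B-N-n.ARTIFACT" -> "B"; bare tags like "I_" are their own position.
    return tag.split("-", 1)[0]

def valid_mwe_bigrams(tags):
    """Check the constraints named in the text: B must not be followed by
    B or O (an MWE, once opened, needs a continuation), and the sentence
    must end with I_, I~, or O (a gap or unit cannot be left open)."""
    pos = [position(t) for t in tags]
    for prev, cur in zip(pos, pos[1:]):
        if prev == "B" and cur in {"B", "O"}:
            return False
    return pos[-1] in UNIT_END
```

For example, `valid_mwe_bigrams(["O", "B-N-n.ARTIFACT", "I_", "O"])` is `True`, while a B immediately followed by O, or a sentence ending in B, fails the check.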

Related Frameworks
The Universal Semantic Tagset (Bjerva et al., 2016; Abzianidze and Bos, 2017; Abdou et al., 2018) takes a similar approach: it defines a crosslinguistic inventory of semantic classes for content and function words, designed as a substrate for compositional semantics, and has no trivial mapping to STREUSLE categories. Two shared task datasets, however, consist of subsets of the categories used for STREUSLE annotation, applied to text from different sources.

PARSEME Verbal MWEs
The first is the English test set for the PARSEME 1.1 Shared Task (Savary et al., 2017; Ramisch et al., 2018), which covers several genres (including literature and several web genres) and is annotated only for verbal multiword expressions. The STREUSLE lexcats for verbal MWEs in Table 1 were borrowed from PARSEME; thus, a tagger that predicts full STREUSLE-style annotations can be evaluated for verbal MWE identification and subtyping by simply discarding the supersenses and the non-verbal MWEs and lexcats from the output.

DiMSUM
The second shared task dataset is Detecting Minimal Semantic Units and their Meanings (DiMSUM; Schneider et al., 2016), which was annotated in three genres (TrustPilot web reviews, TED talk transcripts, and tweets), echoing the annotation style of STREUSLE when it contained only multiword expressions and noun and verb supersenses. Thus, DiMSUM contains neither prepositional/possessive supersenses nor lexcats. It also lacks weak MWEs.

Baseline Model Performance
To establish the baseline performance on the full task of lexical semantic recognition with MWEs and noun/verb/preposition/possessive supersenses, we develop a strong neural sequence tagger.
Our tagger is a token-level sequence tagger. We pass pre-trained contextual representations from the (large, cased) BERT model through a bidirectional LSTM. A linear function projects the BiLSTM outputs into the label space, and we use a linear-chain conditional random field (CRF) to produce our final output (Lafferty et al., 2001). During training, we minimize the negative log-likelihood of the tag sequence using Adam (Kingma and Ba, 2014). Our model is implemented in the AllenNLP framework (Gardner et al., 2018); for further implementation details, see appendix A.
The predicted tag for each token is the conjunction of its MWE position, lexcat, and supersense. The supersense may consist of a pair of labels (in the case of prepositions and possessives), or may be a single label serving dually as the scene role and function. There are 572 such tags in the STREUSLE training set, and only 12 unique conjoined tags in the development set are unseen during training (≈5% of the development set tagging space, corresponding to ≈0.2% of the tokens in the development set).
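The conjoined tag space can be illustrated with a small helper. This is a sketch: the `|` separator for a distinct scene/function pair is our assumption about the serialization, and the example strings are illustrative rather than drawn from the corpus:

```python
def conjoin(mwe_pos, lexcat=None, scene=None, function=None):
    """Compose a full tag from an MWE position, a lexcat, and supersense
    information. When scene role and function differ (possible only for
    prepositions/possessives), both labels appear; otherwise a single
    label serves dually as scene role and function."""
    parts = [mwe_pos]
    if lexcat is not None:
        parts.append(lexcat)
    if scene is not None:
        if function is not None and function != scene:
            parts.append(f"{scene}|{function}")  # distinct scene/function pair
        else:
            parts.append(scene)  # one label serving both roles
    return "-".join(parts)
```

For instance, `conjoin("B", "N", "n.ARTIFACT")` yields `B-N-n.ARTIFACT`, while `conjoin("I_")` stays bare, since continuation tags carry no lexcat or supersense.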

Constrained Decoding
Given the hard constraints between the different forms of lexical semantic annotation, we apply several constraints to explicitly enforce them at evaluation time. To enforce valid MWE chunks, we use Viterbi decoding with the appropriate corpus-specific constraints (e.g., the BbIiOo_~ tagging scheme for STREUSLE MWEs; see §2.1.2). In addition, a given token's possible lexcats are constrained by the token's UPOS tag and lemma. For instance, a token with the AUX UPOS tag can only take the AUX lexcat; however, if the token's UPOS is AUX and its lemma is "be", it can take either the AUX or V lexcat. To enforce these constraints, we use the predictions of an off-the-shelf UPOS tagger and lemmatizer (Qi et al., 2018) to restrict the model to the tags whose lexcats are valid for the predicted UPOS tag and lemma. (Code and pretrained models will be released at https://nelsonliu.me/papers/lexical-semantic-recognition.)
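The lexcat constraint can be pictured as a masking step over the tag vocabulary before decoding. The sketch below encodes only the AUX rule stated above (other rows of the full UPOS-lexcat compatibility table are omitted), and the tag strings are illustrative:

```python
def allowed_lexcats(upos, lemma):
    """Lexcats a token may take given its predicted UPOS and lemma.
    Only the AUX case from the text is encoded; None means unconstrained
    in this sketch (the real table covers every UPOS tag)."""
    if upos == "AUX":
        # "be" may head copular constructions, so it may also be tagged V
        return {"AUX", "V"} if lemma == "be" else {"AUX"}
    return None

def mask_tags(tags, upos, lemma):
    """Filter full tags, keeping those whose lexcat field is compatible.
    Continuation tags like I_ carry no lexcat and are always kept."""
    allowed = allowed_lexcats(upos, lemma)
    if allowed is None:
        return tags
    kept = []
    for tag in tags:
        parts = tag.split("-", 2)  # [position, lexcat, supersense?]
        lexcat = parts[1] if len(parts) > 1 else None
        if lexcat is None or lexcat in allowed:
            kept.append(tag)
    return kept
```

For a token predicted as AUX with lemma "be", `mask_tags(["O-AUX", "O-V-v.stative", "O-N-n.ACT"], "AUX", "be")` keeps only the first two candidates; with any other lemma, only the AUX tag survives.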

Experiments
We train the tagger on the training set from version 4.3 of the English STREUSLE corpus ( §2.1), using the development set for early stopping. We evaluate it on the STREUSLE test set, as well as the English PARSEME and DiMSUM test sets ( §2.2). We use the latter two as out-of-domain test sets for our tagger, without retraining it on the shared task training sets.
To evaluate the performance contribution of pretrained contextual representations, we try replacing them with a concatenation of 300-dimensional pretrained GloVe embeddings (Pennington et al., 2014) and the output of a character-level convolutional neural network (CNN). The CNN uses 200 output filters with a window size of 5 over 64-dimensional character embeddings.
We also experiment with providing the model with gold POS tags and lemmas at training and test time to establish an upper bound on performance. Since the model itself uses only word representations as input, the difference between gold and predicted POS tags and lemmas affects only the constrained decoding.

Table 2 shows all standard STREUSLE evaluation metrics on the test set. For preposition supersenses (SNACS), we compare to the results of Schneider et al. (2018b), who performed MWE identification and supersense labeling for prepositions only. Note that Schneider et al. (2018b) used version 4.0 of the STREUSLE corpus, which differs slightly from the version we use (some of the SNACS annotations have since been revised). Nevertheless, our baseline tagger, even with GloVe embeddings, outperforms Schneider et al. (2018b) on that subset. Using BERT embeddings with predicted POS tags and lemmas improves performance substantially; on preposition supersense tagging, it even outperforms using gold POS tags and lemmas. Prior probing work also found that BERT embeddings improved SNACS labeling on STREUSLE 4.0, although in a simplified setting (gold preposition identification, and only considering single words).

Table 3 shows standard PARSEME test set evaluation metrics for models trained on the STREUSLE training set. While using GloVe embeddings does not reach the performance of the existing PARSEME systems, some of which were trained on the PARSEME training set, replacing them with BERT embeddings approaches the state-of-the-art MWE-based F-score and exceeds the best reported token-based F-score (it is unclear whether Rohanian et al. (2019) used gold syntactic dependencies at test time), despite the challenging zero-shot out-of-domain generalization setting. This demonstrates that pre-training contextualized embeddings on large corpora can help models generalize to out-of-domain settings. (A small fraction of the sentences in the PARSEME test set, 194 of 3,965, are EWT reviews sentences that also appear in STREUSLE's dev set. The rest of the PARSEME test set contains other web and non-web genres (Walsh et al., 2018), and thus it is mostly out-of-domain relative to STREUSLE. None of the PARSEME training set overlaps with STREUSLE.)

Table 4 shows standard DiMSUM test set evaluation metrics for models trained on the STREUSLE training set. Again, this is a zero-shot out-of-domain evaluation setting. In this case, the BERT model did not outperform the best shared task system, likely owing to the comparative difficulty of the full lexical semantic recognition task versus the restricted DiMSUM setting.

Related Work
The computational study of MWEs has a long history (Sag et al., 2002; Diab and Bhutada, 2009; Baldwin and Kim, 2010; Ramisch, 2015), as does supersense tagging (Segond et al., 1997; Ciaramita and Altun, 2006). Vincze et al. (2011) developed a sequence tagger for both MWEs and named entities. Richardson (2017) was the first to perform joint supersense tagging of nouns, verbs, and prepositions, using a feature-based structural SVM tagger trained and evaluated on STREUSLE 3.0. He found that the first-order model was far superior to a local model for preposition supersenses, but did not model MWEs. Bingel and Søgaard (2017) used multitask learning to improve MWE identification and supersense tagging, showing the largest benefits with syntactic chunking as an auxiliary task. Subsequent work probed pretrained language model-based contextualized embeddings for adposition supersense disambiguation, among other tasks, and found that simple linear probing models substantially outperform the state of the art (Schneider et al., 2018a). Shwartz and Dagan (2019) evaluated the capacity of various (contextualized) word embedding models to identify and classify MWEs, including on a simplified version of STREUSLE. They found that the models mostly rely on syntactic cues, failing to recognize semantic subtleties such as idiomatic meaning and level of compositionality.

Conclusion
We propose the task of lexical semantic recognition, which seeks to segment text and label the units that carry lexical meaning. We study the lexical semantic recognition task defined by the STREUSLE corpus, which involves joint MWE identification and coarse-grained (supersense) disambiguation of noun, verb, and preposition expressions; this task subsumes and unifies the previous PARSEME and DiMSUM evaluations. We develop a strong baseline neural sequence model and see encouraging results on the task. Furthermore, zero-shot out-of-domain evaluation of our baselines on partial versions of the task yields scores that are comparable to the fully supervised in-domain state of the art.

[Table 4 note: Kirilin et al. (2016) is the best performing system from Schneider et al. (2016). Kirilin et al. (2016) and other shared task systems had access to gold POS/lemmas and Twitter training data in addition to all of STREUSLE for training.]


Appendices

A Baseline Implementation Details
Our tagger uses the BERT (large, cased) pretrained model to produce input word representations; these are a learned scalar mixture of the BERT layers, following observations that the topmost layer of BERT is highly attuned to the pretraining task and generalizes poorly. The representation for a token is taken to be the BERT output corresponding to its first wordpiece. We freeze the BERT representations during training.
The word representations from the frozen BERT contextualizer are then fed into a 2-layer bidirectional LSTM with 256 hidden units in each direction. The LSTM outputs are then projected into the label space with a learned linear function, and a linear-chain conditional random field produces the final output.
For training, we minimize the negative log-likelihood of the tag sequence with the Adam optimizer, using a batch size of 64 sequences and a learning rate of 0.001. We train our model for 75 epochs, rescaling gradient norms to a maximum of 5.0, and apply early stopping with a patience of 25 epochs.