LexiClean: An annotation tool for rapid multi-task lexical normalisation

NLP systems are often challenged by difficulties arising from noisy, non-standard, and domain-specific corpora. The task of lexical normalisation aims to standardise such corpora, but currently lacks suitable tools to acquire high-quality annotated data to support deep learning based approaches. In this paper, we present LexiClean, the first open-source web-based annotation tool for multi-task lexical normalisation. LexiClean’s main contribution is support for simultaneous in situ token-level modification and annotation that can be rapidly applied corpus wide. We demonstrate the usefulness of our tool through a case study on two sets of noisy corpora derived from the specialised domain of industrial mining. We show that LexiClean allows for the rapid and efficient development of high-quality parallel corpora. A demo of our system is available at: https://youtu.be/P7_ooKrQPDU.


Introduction
Garbage in, garbage out is a well-known adage in the computer science and machine learning community. In NLP it has become a central focus, demanding a task in its own right; namely, lexical normalisation (Baldwin et al., 2015). Lexical normalisation is the task of identifying and normalising non-canonical tokens (e.g. erroneous spelling, acronyms, ...) in noisy, non-standard corpora (Han and Baldwin, 2011).
Largely made popular after the 2015 ACL-IJCNLP Workshop on Noisy User-generated Text (W-NUT) (Baldwin et al., 2015), lexical normalisation has demonstrated marked improvements on down-stream applications such as entity recognition, text classification, and part-of-speech (POS) tagging (Derczynski et al., 2013; Hua et al., 2015; Núñez et al., 2019). These improvements have centred around the fact that many NLP tools are not amenable to noisy corpora, such as those in micro-blogging domains like Twitter (Liu et al., 2011), and in specialised domains such as industrial mining (Stewart et al., 2018).
To date, the most popular lexical normalisation corpus is based on English Twitter and was released as part of W-NUT (Baldwin et al., 2015). This has resulted in a number of algorithmic contributions to the lexical normalisation task, with the current state-of-the-art using ensemble learning methods (van der Goot and van Noord, 2017). More recently, attention has shifted towards neural techniques that i) contextually normalise tokens based on high-level classifications (Stewart et al., 2019b), ii) modify and fine-tune large pre-trained transformer based representations (Muller et al., 2019), or iii) perform joint normalisation and sanitisation (e.g. masking sensitive tokens) (Nguyen and Cavallari, 2020).
However, neural models typically demand large volumes of high-quality training data, which is not available for the task of lexical normalisation. Despite the prevalence of open-source token-level annotation tools (Stenetorp et al., 2012; Yimam et al., 2013; Yang et al., 2017; Kummerfeld, 2019), there still remains a lack of support for lexical normalisation.
Lexical normalisation research currently lacks both large-scale annotated corpora and scalable, task-specific tools for their construction. To fill this gap, we introduce LexiClean, an annotation tool for multi-task lexical normalisation that is:
i. Rapid: Enables fast corpus wide multi-task annotation.
iii. Intuitive: Maintains a simple and easy-to-use interface.
iv. Dynamic: Permits organic schema development during annotation.
The remainder of this paper is organised as follows. We define the task of lexical normalisation in Section 2 and briefly review related work in Section 3. Following this, we present and describe key features of LexiClean in Section 4. LexiClean's system architecture is then discussed in Section 5 with a case study presented in Section 6. Lastly, conclusions are drawn and future work is proposed in Section 7. An online demonstration of LexiClean is located at https://lexiclean.nlp-tlp.org and the source code is available under an Apache-2.0 license at https://github.com/nlp-tlp/lexiclean.

Problem Formulation
Lexical normalisation is defined as the mapping of non-canonical, out-of-vocabulary (OOV) tokens to canonical, in-vocabulary (IV) forms (Han and Baldwin, 2011). Non-canonical tokens are largely a result of i) unconventional and phonetic spelling, ii) improper casing, iii) acronyms, iv) abbreviations and initialisms, v) domain-specific terms, vi) neologisms, and vii) erroneous concatenation or tokenization. This task is akin to grammatical error correction (GEC) (Ng et al., 2014), although it does not involve token reordering that is core to GEC.
In contrast, token classification structures the task in a modular fashion where OOV candidates are identified and normalised in multiple stages. Typically a noisy sequence, $X$, is mapped to an intermediate sequence of semantic classes, $Z = (z_1, \ldots, z_n)$. Token classification can be simple binary classification, $L_{n=2} = \{\mathrm{OOV}, \mathrm{IV}\}$, or comprehensive, $L_{n=4} = \{\texttt{self}, \texttt{spelling\_error}, \texttt{domain\_specific}, \texttt{acronym}\}$, where $L$ is a space consisting of $n$ pre-defined classes of token categories. After classification, alignment to suitable canonical forms is performed using similarity or distance based measures conditioned on labels in $Z$ (Han and Baldwin, 2011; Baldwin et al., 2015).
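The two-stage formulation can be illustrated with a minimal Python sketch. It is not the method of any cited system: the lexicon, the acronym table, the three-class label set, and the use of `difflib` string similarity for alignment are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical in-vocabulary lexicon and acronym table (illustrative only).
LEXICON = {"replace", "the", "pump", "bearing"}
ACRONYMS = {"asap": "as soon as possible"}


def classify(token):
    """Stage 1: map a token to a class (assumed label set)."""
    if token in LEXICON:
        return "self"
    if token in ACRONYMS:
        return "acronym"
    return "spelling_error"


def normalise(tokens):
    """Return (classes, output): the intermediate labels and normalised tokens."""
    classes = [classify(t) for t in tokens]
    output = []
    for token, label in zip(tokens, classes):
        if label == "acronym":
            output.append(ACRONYMS[token])
        elif label == "spelling_error":
            # Stage 2: align to the most similar IV form, if one is close enough.
            match = get_close_matches(token, sorted(LEXICON), n=1, cutoff=0.6)
            output.append(match[0] if match else token)
        else:
            output.append(token)
    return classes, output


classes, output = normalise(["replce", "the", "pmp", "asap"])
```

Conditioning the alignment step on the predicted class is what distinguishes this formulation from a single dictionary lookup: acronyms are expanded from a table, whereas spelling errors are resolved by similarity.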

Related Work
In the last decade, many open-source annotation tools have been developed for token-level classification tasks such as entity recognition and POS tagging, notably BRAT (Stenetorp et al., 2012), WebAnno (Yimam et al., 2013), YEDDA (Yang et al., 2017), and SLATE (Kummerfeld, 2019). The contributions of the current generation of tools have been significant, but support for the task of lexical normalisation has been overlooked. As a result, these tools do not have features that enable in situ token modification or data quality improvements such as decatenation and tokenization whilst performing their main tasks. On the other hand, proprietary writing assistants such as Grammarly, ProWritingAid, and Ginger do contain features required for lexical normalisation, but are prohibitively expensive and not designed for the task of corpora annotation.

LexiClean - Key Features
This section provides an overview of the key features of LexiClean that enable rapid multi-task token-level annotation that supports both seq2seq and token classification task formats. An overview of the system is presented in Figure 1 with a web-based interface in Figure 3.

Project Creation and Automatic Labelling
On project creation, LexiClean allows users to upload a predefined OOV to IV (1:1) replacement dictionary (e.g. {"hel" : "hello", "worl": "world"}) and an unlimited number of plain-text gazetteers (Figure 3). Gazetteers are lists of tokens mapped to a high-level concept (e.g. domain_specific → {u/s, ..., c/o}). Here, these concepts are referred to as meta-tags and are used to support the token classification formulation of lexical normalisation. These resources are used to automatically label tokens in the entire corpus before an annotation session commences (Figure 1), notably reducing annotation effort. Depending on the resources used, replacements will be automatically applied as suggested replacements (Figure 3(a)) whereas meta-tags will be applied directly (Figure 2). However, any accepted suggested replacements or automatically applied meta-tags can be removed at any time throughout an annotation session if deemed unsuitable (see Figure 3(b) and Figure 2).
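The pre-labelling step can be sketched as follows. This is a simplified illustration rather than LexiClean's implementation: the function name `auto_label`, the whitespace tokenisation, and the dictionary layout of each token are assumptions.

```python
def auto_label(texts, replacements, gazetteers):
    """Pre-label every token before an annotation session starts.

    replacements: OOV -> IV dictionary (1:1).
    gazetteers:   meta-tag name -> set of member tokens.
    """
    labelled = []
    for text in texts:
        tokens = []
        for value in text.split():  # assumed: whitespace tokenisation
            tokens.append({
                "value": value,
                # Dictionary hits stay suggestions until the user accepts them.
                "suggested_replacement": replacements.get(value),
                # Gazetteer hits are applied directly as meta-tags.
                "meta_tags": sorted(
                    tag for tag, members in gazetteers.items() if value in members
                ),
            })
        labelled.append(tokens)
    return labelled


labelled = auto_label(
    ["hel worl", "pump u/s"],
    {"hel": "hello", "worl": "world"},
    {"domain_specific": {"u/s", "c/o"}},
)
```

Keeping suggestions separate from accepted replacements mirrors the distinction drawn above: a suggestion can be accepted or ignored, while a directly applied meta-tag can still be removed later.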

Single and Multiple Replacements
Instead of iteratively constructing replacement dictionaries only as a 1:1 mapping throughout the annotation process, LexiClean allows the correction of single tokens in situ (1:1) or across the entire corpus via cascading (1:N) (see apply and apply all in Figure 3(c)).
This has two main benefits: i) single non-canonical tokens can be replaced in situ, enabling contextual normalisations to be captured, and ii) cascading replacements across the entire corpus increases annotation speed. The importance of this is illustrated by considering the following texts: around the wod, cut the wod, and burn fire wod. A 1:1 dictionary-based method (e.g. replace all) could only capture the replacement as either wood or world, and would therefore incorrectly annotate either one or two of the texts. Here, LexiClean allows users to modify wod → world in situ and cascade wod → wood across the remainder of the corpus (if deemed suitable). In some instances, the application of both styles of normalisation can indirectly lead to N:1 mappings being formed.
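The wod example can be worked through in a short Python sketch. The function names echo the apply and apply all actions, but the data layout is hypothetical, and the rule that a cascade skips tokens that already carry a replacement is an assumption about how "the remainder of the corpus" is determined.

```python
def apply(corpus, text_idx, token_idx, replacement):
    """In situ (1:1): replace a single token occurrence."""
    corpus[text_idx][token_idx]["replacement"] = replacement


def apply_all(corpus, value, replacement):
    """Cascade (1:N): replace every occurrence of `value` not yet annotated.

    Assumption: tokens that already have a replacement are left untouched.
    """
    changed = 0
    for text in corpus:
        for token in text:
            if token["value"] == value and token["replacement"] is None:
                token["replacement"] = replacement
                changed += 1
    return changed


def render(text):
    """Join a text using replacements where present, original values otherwise."""
    return " ".join(t["replacement"] or t["value"] for t in text)


corpus = [
    [{"value": v, "replacement": None} for v in s.split()]
    for s in ("around the wod", "cut the wod", "burn fire wod")
]
apply(corpus, 0, 2, "world")                    # contextual fix, one token
n_changed = apply_all(corpus, "wod", "wood")    # cascade over the rest
normalised = [render(text) for text in corpus]
```

Under these assumptions the in situ correction is preserved and the cascade only touches the two remaining occurrences.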

Easily Identifiable Token Markup
Identifying and normalising OOV tokens in large corpora can be a demanding task, especially over thousands of texts. As a result, consistency can suffer because users are unable to recall corrections they have already made to non-canonical token forms. To overcome this, LexiClean marks up tokens using a colour system. Colours for replacements, suggested replacements, and IV and OOV candidates are set to a default palette (Figure 3) whereas meta-tag colours are specified by the project creator on project creation. Using distinct colours to mark up tokens supports rapid identification and helps preserve consistency. For example, users can quickly see where suggestions have been made and decide to accept or ignore them.

Dynamic Schema
Similar to token-level annotation tools that employ dynamic schemas (Stewart et al., 2019a), LexiClean allows users to update their meta-tag schema throughout the annotation process. This feature permits users to organically modify their schema based on phenomena present in the corpora rather than fitting to a prescriptive set of classes. Updates include additional classes of meta-tags and toggling the active state of existing ones. Toggling of meta-tag active states within the schema permits a soft-deletion that can be reversed if required by the user.
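A dynamic schema with reversible soft-deletion can be sketched as a mapping from meta-tag names to active flags. The class and method names below are hypothetical and are not taken from the LexiClean code base.

```python
class MetaTagSchema:
    """Meta-tag schema that can grow and be soft-deleted during annotation."""

    def __init__(self, tags):
        # Every tag starts active; insertion order is preserved.
        self._active = {tag: True for tag in tags}

    def add(self, tag):
        """Add a new meta-tag class; existing tags keep their state."""
        self._active.setdefault(tag, True)

    def toggle(self, tag):
        """Flip the active state: a reversible soft-delete."""
        self._active[tag] = not self._active[tag]

    def active_tags(self):
        return [tag for tag, on in self._active.items() if on]


schema = MetaTagSchema(["domain_specific", "noise"])
schema.add("sensitive")
schema.toggle("noise")            # soft-delete
after_delete = schema.active_tags()
schema.toggle("noise")            # restore
after_restore = schema.active_tags()
```

Because a toggled tag is only hidden rather than removed, any annotations made with it would remain recoverable when it is reactivated.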

Decatenation and Tokenization
Concatenation and irregular tokenization of texts are common in noisy corpora. Consider the following problematic example that exhibits both cases:
original: hewalkedacross th er oad
corrections: he, ␣, walked, ␣, across, {th er → the}, ␣, {r oad → road}
normalisation: he walked across the road
LexiClean manages this by first allowing the user to decatenate the concatenated tokens by introducing additional white space (␣). Secondly, incorrect tokenization is corrected through a utility function that allows users to change the annotation mode of a text and modify its token spans (see Figure 3(d)).
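Both operations can be expressed as simple list transformations over a text's tokens. The helper names `decatenate` and `merge` are hypothetical, and the step-by-step sequence below is one assumed way to reach the normalised form of the example.

```python
def decatenate(tokens, index, pieces):
    """Split tokens[index] into `pieces`, which must rejoin to the original."""
    if "".join(pieces) != tokens[index]:
        raise ValueError("pieces do not reconstruct the token")
    return tokens[:index] + list(pieces) + tokens[index + 1:]


def merge(tokens, start, end):
    """Merge the span tokens[start:end] into a single token."""
    return tokens[:start] + ["".join(tokens[start:end])] + tokens[end:]


tokens = "hewalkedacross th er oad".split()
tokens = decatenate(tokens, 0, ["he", "walked", "across"])  # insert white space
tokens = decatenate(tokens, 4, ["e", "r"])                  # split "er"
tokens = merge(tokens, 3, 5)                                # "th" + "e"  -> "the"
tokens = merge(tokens, 4, 6)                                # "r" + "oad" -> "road"
result = " ".join(tokens)
```

The check inside `decatenate` ensures that splitting only introduces white space and never alters characters, which keeps span edits separate from replacements.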

Sorting Algorithm
To optimise annotation speed, LexiClean computes the average inverse tf-idf weight (Manning and Schutze, 1999) on project creation from all OOV candidates in each text. Using these weights, texts are presented to the user in ranked order with the most prominent candidates appearing first. The rationale behind this technique is that the immediate annotation of high-frequency OOV candidates will have a significant impact on the conversion rate of texts when using the cascade style annotation.
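The ranking idea can be sketched as below. The exact weighting used by LexiClean is not given here, so the formula is an assumption: each OOV candidate receives the reciprocal of a smoothed tf-idf score, the reciprocals are averaged per text, and texts are sorted in descending order so that texts containing widely occurring candidates come first.

```python
import math
from collections import Counter


def rank_texts(texts, lexicon):
    """Return text indices ordered by mean inverse tf-idf of OOV candidates."""
    n_texts = len(texts)
    # Document frequency of each OOV candidate (token absent from the lexicon).
    df = Counter(tok for text in texts for tok in set(text) if tok not in lexicon)

    def score(text):
        counts = Counter(tok for tok in text if tok not in lexicon)
        if not counts:
            return 0.0  # no candidates: present last
        weights = []
        for tok, count in counts.items():
            tf = count / len(text)
            idf = math.log(n_texts / df[tok]) + 1.0  # +1 keeps idf positive
            weights.append(1.0 / (tf * idf))
        return sum(weights) / len(weights)

    return sorted(range(n_texts), key=lambda i: score(texts[i]), reverse=True)


lexicon = {"bearing", "leaking", "replace"}
texts = [
    ["chk", "bearing"],   # "chk" occurs in one text
    ["pmp", "leaking"],   # "pmp" occurs in three texts
    ["replace", "pmp"],
    ["pmp", "bearing"],
]
ranking = rank_texts(texts, lexicon)
```

In this toy corpus the three texts containing the frequent candidate pmp are ranked ahead of the text containing the rare candidate chk, so a single cascaded correction of pmp would resolve most texts early in the session.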

Exporting Annotations and Normalisation Maps
At any stage of an annotation project, users can download their annotated corpora in an extended W-NUT JSON-based format (Baldwin et al., 2015). Additionally, replacements and meta-tag gazetteers generated over the course of the project can also be exported for use in new projects or external systems.

System Architecture
LexiClean is built using the modern full stack web development framework MERN (MongoDB-Express-React-NodeJS; https://www.mongodb.com/mern-stack). All annotations are captured at the token-level as shown in the MongoDB (NoSQL) entity relationship diagram in Figure 4. Here, the Project model stores information related to a project including references to Texts, Maps and Users. The Maps model captures replacements and meta-tag gazetteers, as well as static assets such as a standard English lexicon. The Texts model comprises information pertaining to individual texts such as its original value, aggregate tf-idf weight and resulting rank, whether it has been annotated, and its constituent tokens. Texts reference the Tokens model that is composed of the token's original value, and accepted or suggested annotations (replacement, meta_tags, suggested_replacement). Lastly, Users contains information about users such as their username, password and email.
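The relationship between texts and tokens can be mirrored in a few Python dataclasses. This is only an illustration of the described structure, not the actual MongoDB schema: apart from replacement, meta_tags and suggested_replacement, every field name and type below is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    value: str                                    # original token value
    replacement: Optional[str] = None             # accepted normalisation
    suggested_replacement: Optional[str] = None   # pending suggestion
    meta_tags: list = field(default_factory=list)


@dataclass
class Text:
    original: str                                 # original text value
    tokens: list = field(default_factory=list)    # constituent Token objects
    weight: float = 0.0                           # aggregate tf-idf weight
    rank: int = 0                                 # position in presentation order
    annotated: bool = False


text = Text(
    original="hel worl",
    tokens=[Token("hel", suggested_replacement="hello"), Token("worl")],
)
```

Storing annotations on each token, rather than on the text as a whole, is what allows a single replacement or meta-tag to be accepted, removed, or cascaded independently.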

Case Study
Without comparable systems, we demonstrate the efficacy of LexiClean through the annotation of user generated content (UGC) from the specialised domain of industrial mining (IM) (Sikorska et al., 2016). To date, UGC in industrial domains has received little attention from the NLP community, with state-of-the-art systems relying heavily on hand-crafted rules and heuristics for normalisation (Hodkiewicz and Ho, 2016; Gao et al., 2020). More recently, it has also been highlighted that corpora derived from such domains can pose challenges to state-of-the-art NLP systems (Dima et al., 2021).

Task Setup
We experiment on two corpora (IM-Pub and IM-Priv) and release one to the public. As LexiClean is currently a single-user application, we focus on the performance of a single user annotating under two modes to illustrate the efficacy of LexiClean's features. The two modes are i) from scratch (no automatic labelling using prepopulated replacements or meta-tag gazetteers), and ii) with automatic labelling from prepopulated assets. The same annotator was used for both modes. The annotator's native language was English and they had prior familiarity with the domain of industrial mining.
In both modes, OOV token candidates are detected by matching against an English lexicon. Annotation guidelines are borrowed from Baldwin et al. (2015), with extensions to support multi-task annotation. For both cases, a set of four meta-tags is used, consisting of domain_specific, sensitive, unsure and noise. An overview of the statistics of both corpora, compared with W-NUT15, is shown in Table 1.
Case one (IM-Pub) was annotated from scratch, without automatic labelling. This corpus consisted of 4.5k texts and 3.9k candidate OOV tokens. The annotator performance shown in Figure 5 highlights the rapidity of OOV token annotation early on in the session owing to features such as cascading corpus-wide annotation and the sorting algorithm. The impact of these features is also demonstrated by the user's annotation rate at the start of the session and its increasing nature through to completion. Moreover, a substantial number of normalisations and meta-tags were captured, as is evidenced in Table 2.
Figure 5: Overview of annotator performance for case one (progress is cumulative).

Case Two - Annotation from Prepopulated Assets
To evaluate the effectiveness of the automatic labelling feature of LexiClean, annotation of a corpus (IM-Priv) of equivalent size to that of case one was performed. Here, replacements and meta-tag gazetteers generated in case one were exported and used for automatic labelling. This feature reduced the OOV tokens requiring annotation in IM-Priv from 4,013 to 1,897 (to 47% of the original count) and reduced the vocabulary size by 3.5%. Comparable with case one, Figure 6 also demonstrates the rapidity of annotation and the ability to apply a significant number of normalisations and associated meta-tags to noisy corpora within a short period (Table 2).

Conclusion and Future Work
We have introduced LexiClean, an open-source annotation tool for multi-task lexical normalisation.
Stemming from gaps in current token-level annotation tools, we have demonstrated how a dedicated, task-specific tool can enable rapid annotation of large corpora to support both seq2seq and token classification formulations of the lexical normalisation task. As a result, LexiClean is well positioned to enable future annotation efforts to support the development of the next generation of lexical normalisation algorithms and systems. Future work will focus on converting LexiClean from a single user tool to one that supports multi-user collaborative annotation akin to the current generation of token-level annotation tools.

Figure 6: Overview of annotator performance for case two (progress is cumulative).

Table fragment (columns: case One, case Two): Replacements, case One: 706.