AD-NLP: A Benchmark for Anomaly Detection in Natural Language Processing



Introduction
An anomaly, sometimes referred to as an outlier, discordant observation, or novelty, can be intuitively described as an observation that appears to be inconsistent with the remainder of the data set (Ord, 1996) to the degree that it arouses suspicion (Hawkins, 1980). Such an observation can be described as atypical, irregular, erroneous, or simply strange (Ruff et al., 2020). Anomalies are relative to the data distribution at hand and can be perceived as having characteristics that are not defining for the vast majority of the population. Researchers have been interested in the Anomaly Detection (AD) problem for several decades (Knorr and Ng, 1997; Chandola et al., 2009; Aggarwal and Reddy, 2014), with some of the first formal mentions of "discordant observations" going back to the 19th century (Edgeworth, 1887). Both general anomaly detectors (Schölkopf et al., 1999; Liu et al., 2008; Manevitz and Yousef, 2002; Ruff et al., 2018) and narrow-scoped ones (Wang et al., 2019c,b; Ruff et al., 2019; Manolache et al., 2021) have shown promising results in various domains, such as network monitoring (Stolfo et al., 2000; Sharafaldin et al., 2018; Radford et al., 2018), time series (Braei and Wagner, 2020), computer vision (Wang et al., 2019c; Yu et al., 2020), and natural language processing (Ruff et al., 2019; Manolache et al., 2021).
The advent of deep learning methods for detecting anomalies amplified interest in the field, producing a great variety of models, such as ones based on generative networks (Wang et al., 2019b; Zhou and Paffenroth, 2017), self-supervised learning (Wang et al., 2019c; Manolache et al., 2021), or one-class classification (Ruff et al., 2018). Although the field has seen a burst of activity, most of the datasets introduced for Anomaly Detection are crafted specifically for Computer Vision, Intrusion Detection Systems (Sharafaldin et al., 2018), or network traffic (Stolfo et al., 2000; Radford et al., 2018). Approaches in NLP are usually benchmarked on ad-hoc setups, typically by making use of an annotated dataset and downsampling some classes to produce outliers (Wang et al., 2019c; Ruff et al., 2019; Manolache et al., 2021). This does not represent an impediment in computer vision, due to the large number and variety of annotated datasets. In natural language processing, however, deciding what is and what is not an anomaly is a delicate matter. This problem is rooted in the intrinsic complexity of text: the form the language of a text takes is influenced not only by the style and lexicon of the author but also by the literary or historical era it was written in and by its purpose (e.g., newspaper article, novel, letter, satire, etc.). Thus, anomalies can be defined through a multitude of factors, ranging from concrete syntax, style, and grammar to more abstract semantics, metaphorical meaning, and contextual topic. A notable aspect of certain linguistic phenomena that can be effectively analyzed with Anomaly Detection is the lack of clear negative examples. For instance, while there are countless texts that were not written by Shakespeare, determining which ones serve as a good representation of non-Shakespearean texts can be tricky. In a similar vein, identifying a typical non-scientific text can also be challenging.
We introduce AD-NLP: an anomaly detection benchmark for Natural Language Processing that can be used to evaluate AD systems on syntactic, semantic, pragmatic, and stylistic anomalies. It does so by aggregating different tasks, such as sentence acceptability detection, topic detection, metaphor detection, and authorship detection, and treating them as a general anomaly detection task. Furthermore, we experiment with two strong classical baselines and two recent deep methods for detecting anomalies in text and make observations on both the quantitative and qualitative results of the models.
In the following, we use the terms "anomaly" and "outlier" interchangeably to refer to the same concept of a divergent observation with respect to the overall data distribution.
The paper is organized as follows: Section 2 covers related work. In Section 3, we outline our task definition, data collection approach, and datasets. Sections 4 and 5 delve into the models we used and our experimental assessment. Finally, we wrap up with our conclusions in Section 6.

AD Datasets
Most of the AD benchmarks were historically used in the domain of Anomaly-based Network Intrusion Detection Systems (A-NIDS). Some of these datasets are synthetic, such as KDD99 (Stolfo et al., 2000), CIC-IDS (Sharafaldin et al., 2018), and the LANL log dataset (Turcotte et al., 2018), while others are obtained using honeypots, like the Kyoto IDS dataset (Song et al., 2011).
In recent years, there has been a focus on Computer Vision, especially on video anomaly detection that uses such benchmarks as Avenue (Lu et al., 2013), ShanghaiTech (Zhang et al., 2016) and UCSD Ped 2 (Mahadevan et al., 2010).
In NLP, we note TAP-DLND 2.0 (Tirthankar Ghosal, 2022), a document-level novelty classification dataset, which focuses on semantic and syntactic novelty and tests multiple baselines on a single task. All datasets of the TAP benchmark are in the newswire domain, whereas we wish to offer a larger variety in terms of anomaly types. On top of this, TAP-DLND 2.0 focuses on detecting the degree of novelty of paraphrased or plagiarized text with respect to one or multiple ground-truth texts, while we target out-of-distribution samples with respect to an overarching distribution.
Additionally, out-of-distribution detection setups (Arora et al., 2021) can be used to construct ad-hoc outliers in the validation or test data, although these distribution-shifted samples are artificial.

NLP Datasets
Anomaly detection can be viewed as a particular case of the one-class classification setting. Classification datasets can be ideal for AD tasks due to the ease with which they can be adapted for one-class unsupervised classification (Ruff et al., 2019; Manolache et al., 2021). Various NLP classification benchmarks are widely used, with purposes including news topic detection (Sam Dobbins, 1987; Lang, Ken and Rennie, Jason, 2008; Zhang et al., 2015), sentiment analysis (Maas et al., 2011; Socher et al., 2013; He and McAuley, 2016), and authorship verification (Bevendorff et al., 2020).
Recently, more general NLP datasets containing multi-task challenges have emerged from the desire to obtain general NLP models. decaNLP (McCann et al., 2018) provides a general framework for multitask learning as question answering and proposes ten tasks together with a leaderboard to facilitate research in areas such as multitask learning, transfer learning, and general-purpose question answering. GLUE (Wang et al., 2018) is a multi-task benchmark and analysis platform for Natural Language Understanding. GLUE provides a suite of nine sentence- or sentence-pair NLU tasks, an evaluation leaderboard, and a so-called "diagnostic evaluation dataset". Performance on the GLUE benchmark has since surpassed the human baseline.

Data

Task Definition
Problem Setting. Our benchmark is aimed at addressing a broad spectrum of anomaly detection scenarios. In this regard, we concentrate on three crucial elements: diversity in the domain of natural language where anomalies occur; diversity within each of these domains (see Table 2), covering the various settings that might arise within the same NLP domain; and diversity in terms of sample counts. The final aspect captures the variation in train and test sample counts across datasets, as well as the variation in the number of samples for each class within each dataset.

Dataset Properties. Firstly, our benchmark covers four outlier classes, as can be seen in Table 2. On top of this, it also aims at delivering intra-class variety by supplying multiple datasets for our content category. We believe this is important so as not to lock an anomaly class to a specific instance of the outlier distribution. As an example, content anomalies can appear in news data through a minority of articles with a diverging topic, whereas in music genres, lyricism, or fiction writing, the outliers can present a cluster of multiple similar subjects that differ from those of the majority class. Some dataset statistics are available in Table 1, with more detailed information in our benchmark's datasheet. Similarly, training and validation code can be obtained through our GitHub repository.

Data Collection
Our data acquisition methodology was designed with the clear goal of providing a large diversity of what we can define as anomalies. This has been done so the data mimics the in-the-wild distribution of classes as well as possible. Our benchmark consists of four already-available datasets, namely 20Newsgroups (Lang, Ken and Rennie, Jason, 2008), AG News (Zhang et al., 2015), CoLA (Warstadt et al., 2018), and VUA (Steen et al., 2010), as well as three novel datasets: Song Genres, Gutenberg Categories, and Gutenberg Authors. Table 3 provides an overview of the data through examples of outliers and inliers.

Available Datasets
We selected a set of representative existing datasets due to their prior utilization in the literature (Ruff et al., 2019; Manolache et al., 2021). The 20Newsgroups and AG News datasets have been frequently used in various experimental setups; hence, we included them to ensure comprehensiveness. Additionally, we incorporated CoLA for syntactic anomaly detection and VUA for metaphor detection.
20Newsgroups. The 20Newsgroups dataset (Lang, Ken and Rennie, Jason, 2008) amounts to almost 20,000 news documents clustered into twenty groups, each of which corresponds to a different topic. Some newsgroups are closely related, such as comp.sys.ibm.pc.hardware and comp.sys.mac.hardware, while others are highly unconnected, namely misc.forsale and soc.religion.christian.

Syntactic: A pronounced divergence in the arrangement of words and phrases.

Semantic: A difference in the subject or content appearing throughout the sample.

Pragmatic: The presence of metaphors or figures of speech in the sample.

Stylistic: A distinctive manner of expression, including but not limited to: repetition of verbs or phrases, use of stopwords or punctuation.
We have extracted six classes from the initial twenty: computer, recreation, science, miscellaneous, politics, and religion, as done by Ruff et al. (2019). Each category is represented by 577-2,856 training samples and 382-1,909 validation samples. Despite the relatively small size of the dataset, its classical relevance for NLP tasks prompted us to include it in our analysis.
AG News. The AG News dataset (Zhang et al., 2015) encompasses 496,835 categorized news articles from over 2,000 news sources. This topic classification corpus was gathered from multiple news sources for over a year. Out of all the classes, we have chosen the four largest in the dataset: business, sci, sports, and world. We use the full 30,000 training samples per class, selecting only the title and description fields.
CoLA. CoLA (Warstadt et al., 2018) consists of a corpus of 10,657 sentences from 23 linguistics publications, books, and articles, annotated as being grammatically acceptable (in other words, grammatically correct) or not. Each sample is a sequence of words annotated with whether it forms a grammatical English sentence. The public version of this dataset that we used contains 8,551 training sentences and 1,043 development sentences. As the test set is not publicly available, we used the development set as a de-facto test set for our work.
VUA. VUA (Steen et al., 2010) consists of 117 fragments sampled across four genres of the British National Corpus: Academic, News, Conversation, and Fiction, and contains word-level annotations of all content-word metaphors. The train set contains 12,122 lines (sentences) and is the only publicly available subset of VUA. Under these circumstances, we applied an 80-20 train-test split on this open subset, producing 8,485 train samples and 3,637 test samples.
The data is annotated according to the MIPVU procedure described by its authors. As a consequence, the words annotated as metaphors are prefixed with the "M_" string in the original annotation setting. To transform this initial text segmentation setup into one of anomaly detection, and for the data to comply with our methodology, we removed the word-level annotations and instead labeled whole sentences as containing a metaphor or not.
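This relabeling can be sketched as follows (a minimal illustration assuming whitespace-tokenized lines; the function names are ours, not from the released code):

```python
def sentence_label(tokens):
    """Label a sentence as an outlier (1) if any token carries the
    MIPVU metaphor prefix "M_", otherwise as an inlier (0)."""
    return int(any(tok.startswith("M_") for tok in tokens))

def strip_annotations(tokens):
    """Drop the word-level "M_" prefixes to recover the plain text."""
    return [tok[2:] if tok.startswith("M_") else tok for tok in tokens]

# Example on one annotated VUA-style line:
tokens = ["the", "M_collapse", "of", "Keyser", "Ullman"]
label = sentence_label(tokens)               # sentence contains a metaphor
text = " ".join(strip_annotations(tokens))   # plain, unannotated sentence
```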

Newly-Proposed Datasets
We introduce three new datasets: Song Genres, Gutenberg Categories, and Gutenberg Authors. The latter two were extracted from the Project Gutenberg website. We scraped the entire website and parsed all bookshelves, which store the book texts, their authors, titles, and the category in which Project Gutenberg placed them. We annotated the books with said category. The result is a corpus of over 15,000 literary texts, along with their authors, titles, and bookshelves (the term that Gutenberg maintainers use for categories). We then filtered this dataset to produce Gutenberg Categories and Gutenberg Authors.
Song Genres. Song Lyrics is a dataset composed of four sources, consisting of over 290,000 multilingual song lyrics and their respective genres. The initial data was forwarded from the 2018 Textract Hackathon. This was enhanced with data collected from three other Kaggle datasets: 150K Lyrics Labeled with Spotify Valence, dataset lyrics musics, and AZLyrics song lyrics.
To deal with the lack of labels, we built a labeling system using the spotipy library, which uses the Spotify API to retrieve an artist's genres. The Spotify API returns a list of genres for one artist, so we consider the mode of that list to be the dominant genre of that artist's lyrics. Additionally, we used the langdetect library to automatically label the lyrics with a language; in total, the lyrics come in 34 languages. Please note that we only used the lyrics as our training data, with their respective genres as labels, leaving aside the corresponding artist, year, and song name. We applied this procedure to all the data apart from the original 2018 Textract data.

Table 3: Data samples for each anomaly domain included in our benchmark. Apart from Pragmatic and Syntactic, we used only a chunk of each sample. We do not list the inlier class in this table, as the majority class is constructed by pooling all non-outlier classes of the dataset as inliers.

Domain: Semantic. Outlier class: computers.
Outlier sample: Apple has patented their implementation of regions, which presumably includes the internal data structure.
Inlier sample: I am looking for an inexpensive motorcycle, nothing fancy, have to be able to do all maintenance my self.

Domain: Syntactic. Outlier class: grammatically unacceptable.
Outlier sample: They caused him to become angry by making him.
Inlier sample: Bill coughed his way out of the restaurant.

Domain: Pragmatic. Outlier class: metaphor.
Outlier sample: Mr Franklin went there at the end of the 1970s, after the collapse of Keyser Ullman.
Inlier sample: It would be a criticism if I was doing it to impoverish myself.

Domain: Stylistic. Outlier class: Arthur Conan Doyle.
Outlier sample: You may place considerable confidence in Mr. Holmes, sir, said the police agent loftily.
Inlier sample: Mars, I scarcely need remind the reader, revolves around the sun at a mean distance of 140,000,000 miles.
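The genre-labeling heuristic for Song Genres can be sketched as follows. The mode computation is runnable; the Spotify lookup itself requires credentials, so it is only indicated in comments (the variable `name` is hypothetical):

```python
from collections import Counter

def dominant_genre(genres):
    """Pick the most frequent entry of an artist's Spotify genre list
    (the "mode"); ties resolve to the first most-common entry."""
    return Counter(genres).most_common(1)[0][0] if genres else None

# Sketch of the spotipy lookup (requires Spotify API credentials):
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
# items = sp.search(q="artist:" + name, type="artist")["artists"]["items"]
# genre = dominant_genre(items[0]["genres"]) if items else None
```

Language labeling follows the same pattern with langdetect's `detect` function applied to each lyric.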
From the original dataset, we curated our Song Genres subset, which encompasses nine genres: Pop, Hip-Hop, Electronic, Indie, Rock, Metal, Country, Folk, and Jazz. Song Genres is designed to present an anomaly setup in which crucial data aspects (e.g., melody, rhythm, etc.) are obscured or absent. This accentuates the necessity of discerning subtle textual variations in songs to distinguish between different groups. Through this, we aim to foster the advancement of more robust models for AD in NLP.
Gutenberg Categories. The first subset derived from the original Gutenberg data is the Gutenberg Categories dataset. It comprises texts corresponding to 10 categories from the Project Gutenberg website: Biology, Botany, United States Central Intelligence Agency (CIA), Canada, Detective, Historical, Mystery, Science, Children's, and Harvard. It is important to note that the categories are not inherently distinct. Some, like CIA and Children's, are expected to be easily distinguishable, while others, such as Biology and Botany or Science and Harvard, might exhibit significant overlap. We specifically included the CIA category to offer a class that stands distinctly separable from the rest of the text distribution.
We selected 500 train samples from each class for the train set and 100 samples per class for the test set. The samples were extracted from multiple authors for each category, so as to offer a wider distribution of styles, syntax, and grammar.
Gutenberg Authors. The Gutenberg Authors dataset is our second subset generated from the Project Gutenberg data. It comprises the texts of 10 authors: Arthur Conan Doyle, Charles Dickens, Charles Darwin, Mark Twain, Edgar Allan Poe, Walter Scott, the United States Central Intelligence Agency, H. G. Wells, L. M. Montgomery, and Agatha Christie. Again, we aimed at providing different levels of complexity throughout our data, which is why we included authors whose novels are within the same genre (Doyle and Christie), authors writing about the same historical era or similar historical events (Twain and Dickens), and female authors who supposedly share a common sensibility (Montgomery and Christie). We added the CIA and Darwin classes with the same purpose as for Gutenberg Categories.
We have sampled between 400 and 500 train text chunks for each author and 100 test samples.
The samples were extracted from multiple books for each author considered in this experiment. This was done to avoid the case in which a particular anomaly class might become locked onto a repetitive word or phrasing, e.g., character names (Sherlock Holmes, Huckleberry Finn, etc.) or places (London, Washington, etc.), or simply onto the sample length.

Experimental Methodology
Our methodology consists of creating multiple data splits for each dataset in our benchmark, running our models on all splits, and finally aggregating the results. By data split, we refer to a separation of classes into one inlier class and a cluster of classes considered to be outliers in this setup. To achieve this, we iterate through all the classes of every dataset and, at each iteration, choose one of them as the inlier, while the rest are treated as outliers. Through this, we achieve an important objective: we unfold an exhaustive series of experiments over all possible combinations of outliers and inliers, outlining which of the former are the most prominent and which are the hardest to detect. We ran a hyperparameter search for the two classical models on each split, thus finding the best parameters for detecting each outlier choice. The deep models were trained with a limited set of hyperparameters, as detailed in Subsections 4.2 and 4.3.
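The split construction described above can be sketched as follows (function and variable names are illustrative, not taken from our released code):

```python
def make_splits(labeled_samples):
    """Given (text, class) pairs, build one split per class: that class
    becomes the inlier set and all remaining classes are pooled as outliers."""
    classes = sorted({y for _, y in labeled_samples})
    splits = {}
    for inlier in classes:
        inliers = [x for x, y in labeled_samples if y == inlier]
        outliers = [x for x, y in labeled_samples if y != inlier]
        splits[inlier] = (inliers, outliers)
    return splits

# Toy example: three classes yield three inlier-vs-rest splits.
data = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("c1", "C")]
splits = make_splits(data)
```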
Evaluation Metrics. We use AUROC in Table 4, as well as AUPR-In and AUPR-Out in Table 6 and Supplementary Tables 7 and 8. AUROC (Area Under the Receiver Operating Characteristic) is the area under the curve with the false positive rate on the X-axis and the true positive rate on the Y-axis. AUPR-In (Area Under the Precision-Recall curve for Inliers) is the area under the curve with recall on the X-axis and precision on the Y-axis, treating inliers as the positive class. AUPR-Out (Area Under the Precision-Recall curve for Outliers) has the same definition as AUPR-In but is computed on inverted labels.
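These metrics can be computed with scikit-learn, using average precision as the standard estimator of the area under the precision-recall curve. The convention below (label 1 marks outliers, and higher scores mean more anomalous) matches our setup; the function name is ours:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ad_metrics(is_outlier, anomaly_score):
    """is_outlier: 1 for outliers, 0 for inliers;
    anomaly_score: higher means more anomalous."""
    is_outlier = np.asarray(is_outlier)
    anomaly_score = np.asarray(anomaly_score)
    auroc = roc_auc_score(is_outlier, anomaly_score)
    # AUPR-Out: outliers are the positive class.
    aupr_out = average_precision_score(is_outlier, anomaly_score)
    # AUPR-In: invert labels and score ranking so inliers are positive.
    aupr_in = average_precision_score(1 - is_outlier, -anomaly_score)
    return auroc, aupr_in, aupr_out
```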

Classical approaches
The SVM classifier is a versatile model adaptable to outlier detection tasks in the form of the One-Class Support Vector Machine (OC-SVM), as detailed in (Schölkopf et al., 1999). The OC-SVM learns from an inlier class, designating test samples as anomalies when they deviate from the training data. The Isolation Forest is another outlier detection approach, drawing inspiration from the Random Forest model, as described in (Liu et al., 2008). In an n-dimensional space, inliers typically form denser clusters, whereas outliers tend to be more dispersed, so outliers can be isolated with fewer random partitions.
To optimize the performance of the traditional methods, we conducted a comprehensive hyperparameter tuning. For the OC-SVM, we explored different kernels, namely rbf, polynomial, and linear, and assessed a range of ν values: ν ∈ {0.05, 0.1, 0.2, 0.5}. For the Isolation Forest, we evaluated various numbers of estimators: {64, 100, 128, 256}.
For both models, we compared the effectiveness of two embedding methods: FastText and GloVe, each with an embedding size of 300.
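As an illustration of this setup, the sketch below fits both detectors on synthetic 300-dimensional vectors standing in for FastText/GloVe document embeddings; the data and hyperparameter choices here are illustrative, not our exact configuration:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Stand-ins for 300-dimensional document embeddings:
# the training set contains inliers only.
X_train = rng.normal(0.0, 1.0, size=(200, 300))
# The test set mixes 20 inliers with 20 shifted (anomalous) samples.
X_test = np.vstack([rng.normal(0.0, 1.0, size=(20, 300)),
                    rng.normal(5.0, 1.0, size=(20, 300))])

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X_train)
iforest = IsolationForest(n_estimators=128, random_state=0).fit(X_train)

# Negate sklearn's scores so that higher means "more anomalous".
svm_scores = -ocsvm.decision_function(X_test)
if_scores = -iforest.score_samples(X_test)
```

Both detectors assign clearly higher anomaly scores to the shifted samples than to the held-out inliers.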

Neural approaches
CVDD. Context Vector Data Description (CVDD) (Ruff et al., 2019) is a method that takes advantage of pre-trained word embeddings to perform AD on text. CVDD jointly learns the so-called "context vectors" and a multi-head self-attention mechanism that projects the word representations near these context vectors by minimizing the cosine distance between them.
CVDD allows the disentanglement of the context vectors such that they can provide more interpretable results, and it penalizes non-orthogonal context vectors. The resulting projection function and context vectors act like a clustering method and cluster centroids, respectively. Anomalies are detected based on the mean distance from the sequence projections of a sample to the context vectors. As in (Ruff et al., 2019), we only search for the optimal number of context vectors (c ∈ {3, 5, 10}) and report the best-performing models.

DATE. Detecting Anomalies in Text using ELECTRA (DATE) (Manolache et al., 2021) is an approach that uses self-supervision to train Transformer networks with two pretext tasks tailored for detecting anomalies: Replaced Token Detection (RTD) (Clark et al., 2020) and Replaced Mask Detection (RMD). The method uses a generator to sample masked tokens and a discriminator to identify the replaced tokens and the masking patterns. The generator can be any distribution over the vocabulary. The discriminator is a BERT (Devlin et al., 2018) model trained from scratch. The Replaced Token Detection head outputs a token-wise anomaly score, which is aggregated into a sequence-level anomaly score.
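The CVDD scoring step described above can be illustrated as follows. This is a simplified sketch of inference only; in the actual method, the attention projections and context vectors are learned jointly during training:

```python
import numpy as np

def cvdd_score(heads, contexts):
    """Simplified CVDD anomaly score: for each attention head k, compute
    the cosine distance between the head's sequence projection m_k and
    its context vector c_k; the sample score is the mean over heads."""
    heads = np.asarray(heads, dtype=float)
    contexts = np.asarray(contexts, dtype=float)
    dists = [1.0 - (m @ c) / (np.linalg.norm(m) * np.linalg.norm(c))
             for m, c in zip(heads, contexts)]
    return float(np.mean(dists))
```

A sample whose projections align with the context vectors scores near 0 (inlier), while orthogonal projections score near 1 (outlier).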
We show in Table 5 how CVDD creates clusters of semantically similar words when trained on the Hip-Hop subset of Song Genres. The context vectors act as topic centroids. CVDD can not only distinguish between colloquial usage of verbs, pronouns, and foreign languages, but also associates unreproducible words (e.g., obscene or insulting ones) with certain contexts.

Conclusions
We introduce AD-NLP, a benchmark for anomaly detection in text over an extensive assortment of outlier scenarios, covering syntactic, semantic, pragmatic, and stylistic language anomalies. Additionally, we introduce three new datasets as part of AD-NLP: Song Genres, Gutenberg Categories, and Gutenberg Authors. Song Genres provides a complex setting in which part of the information about the data has been obscured, enforcing a distinct focus on the subtle differences between texts, while the two datasets derived from the Gutenberg data provide variety on multiple levels: syntax, style, genre, and literary movement. We find that anomalies which depend solely on semantic or stylistic aspects of the text are easier to recognize, whereas those that only partially depend on the text, like song lyrics, are harder to detect and separate. We also report results for various models and find that the neural models react well to domain-specific words, author idiosyncrasies, and punctuation when deciding whether a sample is anomalous. We hope that the proposed benchmark and tools will facilitate research in Text Anomaly Detection.

Limitations & Further Work
Some of our simpler baselines managed to outperform more sophisticated anomaly detection methods in some scenarios. For instance, we observed that on datasets such as CoLA and VUA, both CVDD and DATE obtain weak results; as an example, in Tables 7 and 8 of the Supplementary Material, the OC-SVM and the Isolation Forest outperform DATE on the metaphor detection task from VUA. Moreover, for every dataset in the benchmark, there are instances where the Isolation Forest and the OC-SVM outperform the more sophisticated CVDD and DATE methods. Therefore, we believe it is very important to analyze the limitations and inductive biases of an Anomaly Detection methodology across a wide range of scenarios while developing it. One way of accomplishing this would be to study AD-NLP at an even more granular level. For example, one could determine various linguistic properties of texts written by Edgar Allan Poe, aiming to discover the reasons behind the poor performance of DATE on this author compared to the others, as can be observed in Table 4. Further linguistic analysis would benefit the quality of AD-NLP, and we leave this undertaking for further work.
Benchmark updates.We commit to enhancing AD-NLP with future datasets that would expand the intra or inter-domain variety, either by adding new datasets to AD-NLP or scraping, aggregating, and labeling new data off the web ourselves.We also commit to making the dataset more accessible through multiple hosting services, as well as updating our GitHub repository.
Supplementary Materials

More quantitative results
In Tables 7 and 8, we present the Area Under the Precision-Recall curve (AUPR) results for all our models and datasets. This provides a clear measure of how each model performed across the different datasets.

Figure 1: Qualitative examples of outliers and inliers detected by DATE for the Botany subset of the Gutenberg Categories dataset. A stronger red highlight indicates a greater anomaly score. In the case of the inlier sample, there is no strong indication of abnormality, since the contents relate to Botany. For the outlier sample, words such as "ship" and "cars" are detected as anomalous.

Figure 2: Qualitative examples of DATE on the Gutenberg data. From top to bottom: a Biology True Positive sample, an Agatha Christie True Negative sample, and a CIA True Negative sample. The darker the highlight, the greater the anomaly score for the word.

Table 1: Data statistics for our benchmark. By the "Several" domain, we denote a blend of Politics, Science, Fiction, Academia, and News. The Avg #Words column denotes each dataset's average number of words per sample.

Table 2: The outlier classes contained in our benchmark and a succinct definition of each data type.

Table 4: Best AUROC scores for each model and each split. We also provide the mean and standard deviation across splits for all datasets apart from CoLA and VUA, where we use a single split.

Table 5: CVDD's ability to cluster semantically related words from the Hip-Hop Music Genre subset. The model effectively identifies variations in verb and pronoun usage, recognizes foreign-language terms, and associates specific contexts with obscene terms. We highlight the second context, as it is the most meaningful one when detecting anomalies.

Table 6: The Area Under the Precision-Recall curve for all of the datasets, averaged over the subsets. We highlight the best scores in bold.

Table 7: Best AUPR-In scores for each model and each split.

Table 8: Best AUPR-Out scores for each model and each split.