Semantic Novelty Detection in Natural Language Descriptions

This paper studies a fine-grained semantic novelty detection task, which can be illustrated with the following example. It is normal that a person walks a dog in the park, but if someone says "A man is walking a chicken in the park", it is novel. Given a set of natural language descriptions of normal scenes, we want to identify descriptions of novel scenes. We are not aware of any existing work that solves this problem. Although existing novelty or anomaly detection algorithms are applicable, they are usually topic-based and thus perform poorly on our fine-grained semantic novelty detection task. This paper proposes an effective model (called GAT-MA) to solve the problem and also contributes a new dataset. Experimental evaluation shows that GAT-MA outperforms 11 baselines by large margins.


Introduction
Novelty or anomaly detection has been an important research topic since the 1970s (Barnett and Lewis, 1994) due to its numerous applications (Chalapathy et al., 2018; Pang et al., 2021). Recently, it has also become important for natural language processing (NLP). Many researchers have studied the problem in the text classification setting (Fei and Liu, 2016; Shu et al., 2017; Xu et al., 2019; Lin and Xu, 2019; Zheng et al., 2020). However, these text novelty classifiers are mainly coarse-grained, working at the document or topic level: given a text document, their goal is to detect whether the text belongs to a known class or an unknown class. This paper introduces a new text novelty detection problem: fine-grained semantic novelty detection. Specifically, given a text description d, we detect whether d represents a semantically novel fact or not.

This work considers text data that describe scenes of real-world phenomena in natural language (NL). In our daily lives, we observe different real-world phenomena (events, activities, situations, etc.) and often describe these observations (referred to as "scenes" henceforth) in NL to others or write about them. It is quite natural to observe scenes that we have not seen before (i.e., novel scenes). For example, it is a common scene that "A person walks a dog in the park", but if someone says "A man is walking a chicken in the park", it is quite unexpected and novel. Detecting such semantic novelty requires complex conceptual and semantic reasoning over text and is thus a challenging NLP problem. Note that, conceptually, the judgment of the novelty of a scene is subjective and may differ from person to person. However, there are some scenes about whose novelty a majority of people agree. A good example of this majority view of novelty is the widely spread meme pictures on social media, which contain novel interactions between objects. In this work, we restrict our research to this majority-based view of novelty and leave the personalized view of novelty to future work.
In this work, we leverage the captions of images from popular datasets like COCO and Flickr to build a semantic novelty detection dataset (Sec. 3), where we consider an image as a scene and the corresponding image captions as different NL descriptions of the scene. Detecting text that describes semantically novel observations has many applications, e.g., recommending novel news, novel images and videos (based on their text descriptions), social media posts, and conversations. The problem of semantic novelty detection is defined as follows.
Problem Definition: Given a set of natural language descriptions $D = \{d_1, d_2, \ldots, d_n\}$ of common scenes, build a model $M$ using $D$ to score the semantic novelty of a test NL description $d$ with respect to $D$, i.e., classify $d$ into one of the two classes {NORMAL, NOVEL}. "NORMAL" means that $d$ describes a common scene and "NOVEL" means that $d$ describes a semantically novel scene. As the detection model $M$ is built only with "NORMAL" class data, the task is a one-class text classification problem.
We are unaware of any existing work that can effectively solve this problem. Although existing novelty/anomaly detection and one-class classification algorithms are applicable, they are coarse-grained or topic-based and perform poorly on our task (see Sec. 5). Note that although we focus on semantic novelty detection for NL descriptions of scenes, the proposed task and solution framework are generally applicable to other applications. This paper proposes a new technique, called GAT-MA (Graph Attention network with Max-margin loss and knowledge-based contrastive data Augmentation), to identify NL descriptions of novel scenes. Since our task is fine-grained and at the sentence level, we apply a Graph Attention Network (GAT) to the dependency parse graph of each sentence, which fuses both the semantic and the syntactic information in the sentence for reasoning over the internal interactions of entities and actions. To enable the model to capture long-range interactions, we stack multiple GAT layers to build a deep GAT model with multi-hop graph attention. We also create pseudo-novel training data from the given normal training data through contrastive data augmentation. GAT-MA is thus trained with the original normal scene descriptions and the augmented pseudo-novel scene descriptions (Sec. 4).
GAT-MA is evaluated on our newly created Novel Scene Description Detection (NSD2) dataset. The results show that GAT-MA outperforms a wide range of recent novelty or anomaly detection baselines by very large margins. Our main contributions are as follows: 1. We propose a new task of semantic novelty detection in text. Whereas existing work focuses on coarse-grained document- or topic-level novelty, our task requires fine-grained sentence-level semantic and syntactic analysis.
2. We propose a highly effective technique called GAT-MA to solve the proposed semantic novelty detection problem, based on GAT with dependency parsing and knowledge-based contrastive data augmentation.
3. We create a new dataset called NSD2 for the proposed task. The dataset can be used as a benchmark dataset by the NLP community.

Related Work
Our work is closely related to anomaly, outlier, or novelty detection. Earlier approaches include the one-class SVM (OCSVM) (Schölkopf et al., 2001; Manevitz and Yousef, 2001). Our GAT-MA is based on stacked graph attention networks, dependency parsing, and data augmentation. Novelty detection has also been studied in out-of-distribution (OOD) detection (Fei and Liu, 2016; Fei et al., 2016; Liang et al., 2018; Shu et al., 2018; Erfani et al., 2017; Xu et al., 2019). However, these methods work in the multi-class classification setting, whereas our work focuses on one-class classification.
Our work is also related to document or sentence topical novelty detection (Dasgupta and Dey, 2016; Ghosal et al., 2018; Nandi and Basak, 2020; Jo et al., 2020; Zhang et al., 2003; Ru et al., 2004; Li and Croft, 2005; Zhang and Tsai, 2009). These tasks differ from our problem setting, as we focus on fine-grained semantic novelty detection.
Our work is also related to semantic plausibility (SPLA) and selectional preference (SPRE). SPLA is concerned with whether an event is plausible, and SPRE with the "typicality" of an event. For SPLA, existing models employ pretrained language models (Porada et al., 2019) and manually elicited entity property knowledge (Wang et al., 2018) to model physical plausibility in the supervised setting. Other related work includes creating datasets with plausibility ratings (Keller and Lapata, 2003) and multi-event inference (Zhang et al., 2017; Sap et al., 2019). For SPRE, early works include (Resnik, 1996; Clark and Weir, 2001; Erk and Padó, 2010; Bergsma et al., 2008; Ritter et al., 2010; Ó Séaghdha, 2010; Van de Cruys, 2009); performance was later improved with neural networks (Van de Cruys, 2014; Dasigi and Hovy, 2014; Tilk et al., 2016). Our work differs in that: (1) conceptually, SPLA and SPRE are related to but different from novelty; (2) they are mostly based on structured Subject-Verb-Object triples rather than natural language sentences; and (3) they use fully labeled data (Dasigi and Hovy, 2014), while we perform novelty detection with only normal data in training.
Our work is also related to trivia fact mining (Merzbacher, 2002; Ganguly et al., 2014; Gamon et al., 2014; Prakash et al., 2015; Fatma et al., 2017; Mahesh and Karanth; Tsurel et al., 2017; Niina and Shimada, 2018; Korn et al., 2019; Kwon et al., 2020). However, trivia is more about interestingness: some trivia facts are interesting because they are rare, but they are not necessarily novel. Existing works use labeled training data for learning, or rely on the Wikipedia structure to retrieve interesting facts using information retrieval methods (Tsurel et al., 2017; Kwon et al., 2020), whereas we have only normal data and no novel data.
Our proposed model learns text representations using a graph neural network and dependency parsing. Other NLP works that use graph neural networks and dependency structures include (Huang and Carley, 2019; Ma et al., 2020; Guo et al., 2019; Wang et al., 2020b; Pouran Ben Veyseh et al., 2020; Xiao and Zhou, 2020). However, they solve different problems, such as sentiment analysis and argument mining; their approaches are also different from ours, and they do not perform novelty detection.

Dataset Collection and Annotation
As there is no semantic novelty detection dataset available for text, we build a new one. Our proposed task requires learning latent semantic knowledge in text, such as the interactions among entities and verbs (e.g., "person" and "food" are related to each other by the verb "cook"), the actions (verbs) that an entity can perform (e.g., only a person can perform the action "cook"), and the actions that can be applied to an entity (e.g., "cook" can be applied to the entity "vegetables"). We thus aim to build a corpus rich in such knowledge. Text data like news articles, social media posts, and reviews generally contain such knowledge in low density and are thus not very suitable. Instead, we leverage image captions to build our dataset, which we found to be suitable for our task.
Image caption data collection. We found that the captions of non-iconic images (depicting multiple objects and their interactions) meet the aforementioned dataset requirements. We chose three popular benchmark image caption datasets, COCO (Chen et al., 2015; Lin et al., 2014), Flickr30k (Plummer et al., 2015), and Visual Genome (Krishna et al., 2017), to build our dataset. COCO consists of 616,435 captions of Flickr images, Flickr30k contains 158,915 captions about people and animals, and Visual Genome contains 5.4 million captions describing interactions among various objects. To ensure a diverse dataset for learning interactions among entities and verbs, we merge the three datasets into one large dataset.

NSD2 dataset preparation. Given the merged NL caption dataset, we build our proposed NSD2 dataset as follows. We consider the captions from the NL caption dataset as normal or common scene descriptions. As our proposed GAT-MA model uses only "NORMAL" class data, we build our training dataset with only normal scene descriptions and compile a test dataset with scene descriptions of both the "NORMAL" and "NOVEL" classes.
Due to budgetary constraints, we cannot evaluate on all verbs. We selected 20 verbs (see Appendix Sec. A) frequently used in the NL caption dataset and built our training and test datasets with scene descriptions involving these 20 verbs. For training, we extract the captions from the merged set that contain any of the 20 verbs as the "NORMAL" class text examples. For the test dataset, we employ human annotators to write NL scene descriptions of both the "NORMAL" and "NOVEL" classes (discussed below).

Test dataset preparation. The test dataset is prepared by 5 volunteer graduate students with an advanced level of English as crowd workers. We divide the task into 20 small subtasks, one verb per subtask. For each subtask, the designated worker is asked to write at least 100 normal and 100 novel scene descriptions from scratch for the verb. To train the workers, each of them is asked to write 25 normal and 25 novel sentences for a verb; we then check these sentences and give them feedback, and any disagreements are discussed. After the training session, each subtask is carried out by each worker independently. The workers are unaware of the proposed model. After the initial writing of each subtask is done, the scene descriptions are assigned to the other four workers (who are not the writer) to label as normal or novel. If the consensus (majority judgment) is the same as the original writer's label, the scene description's label aligns with the majority view of novelty; if not, the scene description is discarded. The worker is then asked to write more descriptions, and the above voting process is iterated until 100 normal and 100 novel scene descriptions are collected for the verb. Table 1 summarizes our NSD2 dataset statistics. More detailed statistics, including per-verb training data statistics and description token counts, are provided in Appendix Sec. A.

The Proposed GAT-MA Model
The proposed GAT-MA model consists of two main components: (i) a Knowledge-based Contrastive Data Generator (CDG), and (ii) a Text Semantic Novelty Scorer (SNS). Given a set of NL descriptions $D_{tr} = \{d_1, d_2, \ldots, d_n\}$ of normal scenes in the training data, CDG dynamically generates pseudo-novel descriptions by perturbing the normal scene descriptions in $D_{tr}$ using the lexical knowledge base WordNet (Fellbaum, 2010). The normal descriptions in $D_{tr}$ are augmented with these pseudo-novel descriptions (used as NOVEL class examples in training) to learn the SNS.
The SNS is a deep GAT model that learns to score an input text to measure its semantic novelty with respect to $D_{tr}$. To capture the semantic and syntactic information in an input text $d$, GAT-MA parses $d$ into a dependency graph and feeds the graph, enriched with additional word-level features, to the SNS, which is trained to assign a higher score to a normal scene description than to a novel one.

Knowledge-based Contrastive Data Generator (CDG)

We propose to use the lexical knowledge base WordNet to generate contrastive instances for the normal scene descriptions in $D_{tr}$. These contrastive instances serve as pseudo-novel data and enable supervised learning of the Text Semantic Novelty Scorer (SNS). WordNet contains a rich taxonomy of words and is thus beneficial to our semantic novelty detection task. The key component of our generator is a knowledge-based misfit sampler $S_{misfit}(\cdot)$. Given a normal scene description $d \in D_{tr}$ and an entity $e$ in $d$ (a noun or a noun phrase), $S_{misfit}(e)$ samples an entity $e'$ that is semantically distant from $e$ in WordNet. We use Wu-Palmer Similarity (Wu and Palmer, 1994) to measure the semantic distance between $e$ and $e'$: we randomly sample $e'$ from WordNet such that the similarity score between $e$ and $e'$ is less than 0.9 (an empirically set threshold). Since $e'$ is semantically distant from $e$, $e'$ is a misfit in the original description $d$; replacing $e$ with $e'$ in $d$ generates a pseudo-novel description. For example, "a man is driving a car" describes a normal scene. It is commonsense that the subject of the verb "drive" should be a person; anything outside of this category introduces novelty, e.g., "a dog is driving a car".
When replacing an entity $e$, the choice of $e$ is also critical. For our task, we focus on three novelty aspects in a given description: (1) what actions an entity can perform, (2) what actions can be applied to an entity, and (3) how several entities interact with each other. In the interactions between entities and verbs, verbs are the core. Thus, we only replace entities that are syntactically related to a verb to create pseudo-novel descriptions. We refer to the verb of interest in $d$ as the target verb, which is used later in Sec. 4.2. The details of finding and extracting the entities syntactically related to the target verb are given in Appendix Sec. B.
Note that the novel scene description $d'$ generated by the perturbation is contrastive to the original description $d$. We dynamically generate one (empirically set) pseudo-novel description for each normal description in $D_{tr}$ in every training epoch.
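To make the sampler concrete, here is a minimal sketch assuming NLTK's WordNet interface; the function name sample_misfit and the retry loop are illustrative, while the first-sense choice, Wu-Palmer similarity, and the 0.9 threshold follow the description above.

```python
import random
from nltk.corpus import wordnet as wn

SIM_THRESHOLD = 0.9  # empirically set threshold from the paper

def sample_misfit(entity: str, max_tries: int = 100) -> str:
    """Sample a noun semantically distant from `entity` under Wu-Palmer similarity."""
    synsets = wn.synsets(entity.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return entity                      # entity not in WordNet: leave d unchanged
    source = synsets[0]                    # first (most frequent) sense
    nouns = list(wn.all_synsets(pos=wn.NOUN))
    for _ in range(max_tries):
        cand = random.choice(nouns)
        sim = source.wup_similarity(cand)  # Wu-Palmer similarity in [0, 1]
        if sim is not None and sim < SIM_THRESHOLD:
            return cand.lemma_names()[0].replace("_", " ")
    return entity

# e.g., replacing "man" in "a man is driving a car" with sample_misfit("man")
# can yield a pseudo-novel description like "a dog is driving a car".
```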

Text Semantic Novelty Scorer (SNS)
The recent success of applying GAT (Velickovic et al., 2018) to text data (Huang and Carley, 2019; Ma et al., 2020; Guo et al., 2019) has shown the advantage of explicitly combining syntactic structure (the dependency parse graph) and word-level semantics for fine-grained text analysis, such as aspect-level sentiment analysis and argument mining. Because our task is inherently a fine-grained semantic reasoning task, we build the SNS on GAT. GAT fuses graph-structured information and node features, employing masked self-attention layers that allow a node to attend to its neighborhood features and to learn different attention weights for different neighboring nodes for graph representation learning. More details can be found in Appendix Sec. D.

Input Representation
We use a dependency parser (Chen and Manning, 2014) to convert an input scene description $d$ into a dependency parse graph. For a description $d = \{w_1, w_2, \ldots, w_n\}$, each word $w_i$ corresponds to a node $n_i$ in the graph. The node feature of $n_i$ is a word embedding vector $X_i \in \mathbb{R}^F$, where $F$ is the word embedding size. Since a description contains $n$ words, the input node feature matrix is $X \in \mathbb{R}^{n \times F}$.
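As an illustration, here is a minimal sketch of constructing this input. It assumes spaCy for dependency parsing (the paper uses the parser of Chen and Manning (2014)) and a hypothetical `glove` dictionary mapping words to vectors; the adjacency matrix is built undirected with self-loops, anticipating the simplification described in Sec. 4.2.2.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the Stanford dependency parser

def build_graph_input(description: str, glove: dict, dim: int = 300):
    """Return node features X (n x F) and the adjacency matrix A (n x n)."""
    doc = nlp(description)
    n = len(doc)
    X = np.stack([glove.get(t.text.lower(), np.zeros(dim)) for t in doc])
    A = np.eye(n)                          # self-loops
    for t in doc:
        if t.i != t.head.i:                # undirected dependency edges (Sec. 4.2.2)
            A[t.i, t.head.i] = A[t.head.i, t.i] = 1.0
    return X, A
```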

Enriching Entity Word Embeddings with Hypernym Information
We consider a noun or a noun phrase in $d$ as an entity if it exists in WordNet, and henceforth refer to the word(s) comprising an entity as entity word(s) and the corresponding word embedding(s) as entity word embedding(s). Intuitively, the hypernym information of entities is beneficial to our task. Consider a normal description, "a golden retriever is chasing a flying frisbee". One of the hypernym chains of the entity "golden retriever" in WordNet is: golden retriever → retriever → sporting dog → hunting dog → dog → canine → carnivore → ... → animal → ... → entity. This hypernym chain tells us that a golden retriever is a breed of dog. If we leverage the hypernym information, the model can not only learn that a specific breed of dog like "golden retriever" can chase a frisbee, but also generalize to other breeds of dogs. Additionally, this hypernym chain contains other commonsense knowledge, such as "dogs eat meat", since dogs belong to the category "carnivore". We perform the following three steps to incorporate hypernym features into GAT-MA.

Step-1. Candidate Entity Set Extraction. We add hypernym features to the entities that are syntactically related to the target verb in a description; we henceforth call these the candidate entities. Given an input description $d$, this step extracts the candidate entities from $d$ using a rule-based extractor that leverages dependency parsing and POS tagging information (details in Appendix Sec. B). In the example above, the candidate entities are "golden retriever" and "frisbee", and the target verb is "chase".
Step-2. Obtaining the Hypernym Name Set from WordNet. Given an entity $e$, the Hypernym Name Set of $e$ is the set of synset names of the hypernyms of $e$ in WordNet. For the entity "golden retriever", we obtain its Hypernym Name Set as follows:

1. Obtain the synset of the entity. The concept of hypernym is defined between synsets in WordNet, and the word sense of an entity $e$ in the description context corresponds to a synset in WordNet. Ideally, a Word Sense Disambiguation (WSD) model should be employed to tag the entity with an appropriate synset. We tried state-of-the-art WSD models and found that they do not work well on our dataset; on analysis, we found that choosing the first sense of the entity works better. Note that, according to the WordNet documentation, "Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1", which conforms to our findings.

2. Find the complete Hypernym Synset Set. With the chosen synset of the entity, we recursively collect the set of all its hypernym synsets from WordNet. For instance, given the entity "golden retriever", the set of synsets in all hypernym chains from the {golden retriever} synset to the {entity} synset in the WordNet hypernym hierarchy forms the Hypernym Synset Set of "golden retriever".

3. Filter general hypernym synsets. In practice, when compiling the entity hypernym information, we do not consider the whole Hypernym Synset Set of an entity, because some hypernyms are too general to contribute useful knowledge for our task. Thus, we manually collect a set of synsets that are too general and remove them from the complete Hypernym Synset Set. The 24 general synsets are given in Appendix Sec. C.

4. Get the Hypernym Name Set. A hypernym synset contains a set of lemma names; e.g., for the hypernym synset Synset('dog.n.01') of the entity "golden retriever", the lemma names are "dog", "domestic dog", and "Canis familiaris". We obtain the Hypernym Name Set of an entity by collecting all lemma names from all synsets in its Hypernym Synset Set.
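A minimal sketch of these four sub-steps, assuming NLTK's WordNet interface; GENERAL_SYNSETS abbreviates the 24 hand-collected general synsets (the full list is in Appendix Sec. C):

```python
from nltk.corpus import wordnet as wn

# abbreviated stand-in for the 24 hand-collected general synsets (Appendix Sec. C)
GENERAL_SYNSETS = {"entity.n.01", "physical_entity.n.01", "object.n.01"}

def hypernym_name_set(entity: str) -> set[str]:
    synsets = wn.synsets(entity.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return set()
    stack, hypernyms = [synsets[0]], set()   # sub-step 1: first sense
    while stack:                             # sub-step 2: recursive hypernym collection
        for hyp in stack.pop().hypernyms():
            if hyp not in hypernyms:
                hypernyms.add(hyp)
                stack.append(hyp)
    names = set()
    for syn in hypernyms:
        if syn.name() not in GENERAL_SYNSETS:        # sub-step 3: filter general synsets
            names.update(l.replace("_", " ") for l in syn.lemma_names())  # sub-step 4
    return names

# hypernym_name_set("golden retriever") includes "dog", "domestic dog",
# "Canis familiaris", "carnivore", ...
```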
Step-3. Construction of the Hypernym Feature Vector. A Hypernym Feature Vector is created for each entity based on its Hypernym Name Set; it is computed as the pointwise addition of all Hypernym Name Embeddings, one for each Hypernym Name in the Hypernym Name Set of the entity. We use two types of Hypernym Name Embeddings:

• GloVe-based Hypernym Name Embedding. For a single-word Hypernym Name, the Hypernym Name Embedding is the corresponding GloVe word embedding. For a multi-word Hypernym Name, it is the average of the GloVe embeddings of the words in the Hypernym Name.

• BERT-based Hypernym Name Embedding. Since BERT produces a contextual embedding for each word, the input to BERT should contain the context information. Given an input description, we replace the entity in the description with the Hypernym Name and feed this description into BERT. Because the BERT tokenizer segments words into word pieces (subword tokens), we average the embeddings of all word pieces corresponding to the Hypernym Name to obtain the final Hypernym Name Embedding.

The Hypernym Feature Vector is calculated as $F_{hyper} = \sum_{k=1}^{M} E_k$, where $E_k$ is the embedding of the $k$-th Hypernym Name in the Hypernym Name Set of an entity, and $M$ is the size of the Hypernym Name Set.
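A minimal sketch of Step-3 with the GloVe-based variant; `glove` is again a hypothetical word-to-vector dictionary:

```python
import numpy as np

def hypernym_feature_vector(name_set: set[str], glove: dict, dim: int = 300) -> np.ndarray:
    """Pointwise sum of Hypernym Name Embeddings: F_hyper = sum_k E_k."""
    f_hyper = np.zeros(dim)
    for name in name_set:
        vecs = [glove.get(w, np.zeros(dim)) for w in name.split()]
        f_hyper += np.mean(vecs, axis=0)   # multi-word names: average word vectors
    return f_hyper
```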

Modeling Dependency using Deep GAT
We observe that the dependency parse graph of a description $d$ contains rich syntactic information that helps explicitly learn the interactions between entities and actions in a scene description, especially long-range interactions. For a novel description like "a monkey with a white beard and brown hair is driving a car down the street", the interaction among "monkey", "drive", and "car" makes it semantically novel. Note that the entity "monkey" and the verb "drive" have a sequential word distance of 9, making it difficult for a sequential representation learning method to model the interaction. In contrast, "monkey" and "drive" are only one hop apart in the dependency parse tree.
In addition, we find that for these three key words, "drive" is the parent of both "monkey" and "car" in the original directed dependency graph. To encourage interactions between them and allow semantic information to flow freely along the dependency graph structure during training, we simplify the original directed dependency graph into an undirected graph. Importantly, the GAT model does not attend to all neighbors of a given node equally: the attention weights are trained to give higher weight to the neighbors more useful for the task.
The input-output behavior of a single GAT layer is summarized as $H_{out} = GAT(X, A; \Theta)$. The input is $X \in \mathbb{R}^{n \times F}$ and the output is $H_{out} \in \mathbb{R}^{n \times F'}$, where $n$ is the number of nodes, $F$ is the node feature size, and $F'$ is the GAT hidden size. The dependency graph structure is encoded in the adjacency matrix $A \in \mathbb{R}^{n \times n}$.
In a single GAT layer, a word or an entity in the graph only attends over local information from its 1-hop neighbors. To enable the model to capture long-range interactions between entities and actions, we stack $L$ layers to make a deep model, which allows information from $L$ hops away to propagate into each word.
As illustrated in Figure 1, the stacking architecture is $H^{l+1} = GAT(H^{l}, A; \Theta^{l})$, i.e., the output of layer $l$ is the input of layer $l+1$. The initial input is $H^{0} = X W_0 + b_0$, where $W_0 \in \mathbb{R}^{F \times F'}$ and $b_0$ are the projection matrix and bias vector. For an $L$-layer GAT-MA model, the output of the final layer is $H^{L}_{out} \in \mathbb{R}^{n \times F'}$. For our task, we are concerned with the interactions of verbs and entities. As mentioned in Sec. 4.1, when perturbing the normal descriptions, we only replace entities that are syntactically related to a verb in the dependency graph; this verb is our target verb, and any novelty introduced by the replacement is related to it. If a description contains multiple verbs, the target verb of an entity is the one closest to it along the dependency parse graph.
We use a mask layer to fetch the output embedding of the target verb $v_i$ from the GAT output: $h_{v_i} = m^{\top} H^{L}_{out}$, where $m \in \{0, 1\}^{n}$ is a one-hot vector indicating the position of the target verb. Next, we use a feed-forward layer to project $h_{v_i}$ into a semantic novelty score. We denote the score function of the SNS by $S(d)$ for an input description $d$.
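A minimal sketch of the SNS in PyTorch Geometric (the framework used in Appendix E); the module layout and names are illustrative, while the hidden size, number of heads, and depth follow Appendix E:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class SNS(nn.Module):
    """Deep GAT scorer: stacked GAT layers, target-verb mask, feed-forward score."""
    def __init__(self, in_dim=768, hid=300, heads=6, num_layers=5):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid)               # H^0 = X W_0 + b_0
        self.gats = nn.ModuleList(
            [GATConv(hid, hid // heads, heads=heads) for _ in range(num_layers)]
        )
        self.score = nn.Linear(hid, 1)                   # feed-forward scorer

    def forward(self, x, edge_index, verb_idx):
        h = self.proj(x)
        for gat in self.gats:                            # stack L GAT layers
            h = torch.relu(gat(h, edge_index))
        h_v = h[verb_idx]                                # mask: target-verb embedding
        return self.score(h_v).squeeze(-1)               # semantic novelty score S(d)
```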
Training. GAT-MA is trained end-to-end by minimizing a max-margin ranking objective:

$$\mathcal{L} = \sum_{d \in D_{tr}} \max\big(0,\ \gamma - S(d) + S(d')\big) \qquad (1)$$

where $D_{tr}$ is the set of normal descriptions, $d'$ is the pseudo-novel description corresponding to $d \in D_{tr}$, and $\gamma$ is the margin. $\mathcal{L}$ encourages the score $S(d)$ of a normal description $d$ to be higher than $S(d')$ of its pseudo-novel description $d'$.
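A minimal sketch of this objective; the margin value is an assumption, as the paper does not state it here:

```python
import torch

def max_margin_loss(s_normal: torch.Tensor, s_pseudo: torch.Tensor,
                    margin: float = 1.0) -> torch.Tensor:
    """L = sum_d max(0, margin - S(d) + S(d')), averaged over the batch."""
    return torch.clamp(margin - s_normal + s_pseudo, min=0.0).mean()
```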

Experiment Setup
For dataset details, please refer to Sec. 3. The Appendix has additional information about the data and model implementation details.

Baselines. We compare GAT-MA with three categories of baselines: (1) four language model (LM) based novelty detection models, (2) seven one-class classification models, and (3) other models based on different text encoders and loss functions (see Sec. 5.2). All results in this section are the average of five runs with different seeds, and the results are statistically significant with p < 0.001. A trained language model can intuitively be used as a novelty detection model for the following reasons: (1) When training an LM on normal scene descriptions, the model minimizes the perplexity of the training data by maximizing the likelihood of each word appearing in its context; in this way, it indirectly learns the semantic meaning of words and sentences. (2) An LM trained on normal descriptions can output the probability of each word in a description appearing in its context, so a sentence probability can be calculated from the list of word probabilities. We tried various ways of calculating the sentence score from the word probability list, such as the arithmetic mean, geometric mean, harmonic mean, and product of all word probabilities, and found the harmonic mean to be the best choice. We use N-gram and three other LMs; the full list and results are in Table 2.
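As an illustration, a minimal sketch of the harmonic-mean sentence scoring described above (the function name and the numerical floor are ours):

```python
def sentence_score(word_probs: list[float]) -> float:
    """Harmonic mean of per-word LM probabilities; higher means more normal."""
    return len(word_probs) / sum(1.0 / max(p, 1e-12) for p in word_probs)

# sentence_score([0.3, 0.05, 0.2]) -> ~0.106; one unlikely word drags the score
# down, which is why the harmonic mean worked better than the arithmetic mean.
```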
Most general one-class classification models only work on images; we modified the related components of these models to make them suitable for text data (details on model modification and parameter settings are in Appendix Sec. F). The following 7 baselines are compared: (1) OCSVM, (2) iForest, (3) VAE, (4) DSVDD, (5) ICS, (6) OCGAN, and (7) HRN. Following prior work, we only produce a score function, ignore the binary decision problem, and use AUC (Area Under the ROC curve) as the evaluation metric. All compared models are trained with only normal scene descriptions.

Results and Analysis
Baseline Comparison. Table 2 shows the predictive performance of the baselines and our proposed model GAT-MA. Note that GAT-MA is our proposed model using BERT embeddings enhanced with the hypernym embedding features. From Table 2, we conclude the following: (1) All general one-class classifiers perform poorly on our task; even the reported state-of-the-art model HRN achieves an AUC score of only 56.89. We tried various ways of producing the description embedding used as the input feature for these models, such as (a) averaging all words' GloVe embeddings, (b) feeding the description into BERT and using the first token [CLS]'s embedding as the sentence embedding, (c) feeding the description into BERT and averaging all output tokens' embeddings as the sentence embedding, and (d) feeding the description into the pretrained sentence embedding extractor InferSent. However, none of these options gives good performance. These one-class classifiers perform well on image data because the images of a given class (e.g., in the MNIST dataset) have very similar latent representations; thus, auto-encoder and GAN-based models can learn latent representations for all instances of an image class that lie very close to each other in the latent space. In contrast, our normal scene descriptions cover many topics, making it hard to learn latent representations that are close to each other in the latent space. (We use glove.840B.300d and the BERT model "bert-base-uncased" as text encoders in our experiments; we expect larger transformer embeddings to give better results, but due to limited computing resources we use the base BERT model.)
(2) Language model-based methods are in general better than one-class classifiers because, in some sense, they do not try to learn a latent representation but instead exploit the sequential and semantic information of the input text to produce word probabilities; thus, they are comparatively more effective for fine-grained semantic novelty detection. However, they still perform much worse than GAT-MA, as they mainly learn the word distribution of the normal description data but do not explicitly capture the interactions of entities and verbs.
In summary, GAT-MA outperforms all baselines by large margins and is more effective for our proposed task. Below, we discuss ablation and additional experiments.
Effects of word embeddings and hypernyms. In Table 4, GAT-MA_vanilla is our proposed model using BERT embeddings without the hypernym features, and GAT-MA_GloVe is our proposed model using GloVe embeddings without the hypernym features. Comparing GAT-MA_GloVe and GAT-MA_vanilla, we see that BERT embeddings contain richer semantic knowledge, which is more beneficial to our task than GloVe embeddings. It is also interesting that enhancing GAT-MA_vanilla with the hypernym embedding features (denoted GAT-MA) improves the AUC score from 88.12 to 89.22, which means the hypernym features help our model generalize better.

Table 3: Example descriptions predicted wrongly by BERT_MM but correctly by GAT-MA_MM.
1. a monkey with glasses is cooking food on a stovetop in a kitchen. (Novel)
2. a couple of seal dogs carry their surfboard across the beach. (Novel)
3. a giant panda in a white smock prepares to cut the hair of an older balding gentleman in front of a case holding several hair supplies. (Novel)
4. an adult is walking on the sidewalk in St. Luis. (Normal)
5. a guy eats food on a table in front of a food shop on the street while a passerby walks by. (Normal)
6. a group of people stands around are drinking some vermouth. (Normal)

Effects of model depth. From Figure 2, we see that increasing the number of stacked layers from 1 to 5 improves the performance of GAT-MA_vanilla; with more than 5 stacked layers, the performance drops. This is because most interactions between entities and actions occur between nodes near each other in the dependency parse graph. Stacking 5 layers is enough, and more stacked layers do not help but hurt the performance.
Effects of using the max-margin ranking loss. Table 5 compares fine-tuned BERT and GAT-MA variants in terms of the loss function used in training. Here, [·]_CE denotes a model trained with the cross-entropy loss and [·]_MM denotes a model trained with the max-margin loss proposed in Sec. 4.2.3. From Table 5, we see that the CE variants are weaker than the MM variants for both BERT and GAT-MA. Both GAT-MA_CE and GAT-MA_MM use BERT embeddings without the hypernym features.
Effects of using the dependency parse structure. Table 5 shows that BERT_MM, which does not directly use any syntactic features, easily fails on examples that are dissimilar to the training data in terms of word distribution. GAT-MA_MM performs better by explicitly modeling the dependency parse structure, which means that modeling the dependency parse structure is beneficial for capturing the interactions between entities and actions in our task. Some descriptions predicted wrongly by BERT_MM but correctly by GAT-MA_MM are shown in Table 3.

Error Analysis
We carried out error analysis on our test data and found that the errors are mainly due to two factors. The first factor is the quality of the pretrained word embeddings, which is critical for GAT-MA to reason effectively; GAT-MA makes mistakes when the pretrained embedding of a word is of poor quality, e.g., "talapoin" in "the talapoin at the zoo is leaning down to drink some water". The second factor is the limitation of the knowledge acquired by GAT-MA during training. GAT-MA relies on the taxonomy information in WordNet to generate contrastive novel descriptions during training, but the reasoning about a novel description sometimes requires more complex world knowledge. For example, "two kids are sitting in the bar drinking spirit" is novel and requires the knowledge that kids are not old enough to drink alcohol. Another example, "A dog is eating onions on the ground", is novel and requires the world knowledge that onions are poisonous to dogs.

Conclusion
Novelty detection is an important problem because anything novel is of interest. This paper proposed a semantic novelty detection problem and designed a graph attention network based approach (called GAT-MA) exploiting dependency parsing and data augmentation to solve it. As there was no existing evaluation dataset for the proposed task, an evaluation dataset has been created. Experimental comparisons with a wide range of baselines showed that GAT-MA outperforms them by very large margins.

A Dataset Details

For the descriptions in the NSD2 dataset, the average token number is 11.10, the maximum token number is 68, the minimum token number is 4, and the standard deviation is 3.8.

B Knowledge Based Contrastive Data Generator Details
In the novel scene detection task, we focus on three novelty aspects: (1) what actions an entity can perform, (2) what actions can be applied to an entity, and (3) how several entities interact with each other. In the interactions between entities and verbs, verbs are the core. Thus, we only replace entities that are syntactically related to a verb to create pseudo-novel descriptions.
Extraction of candidate entities. The candidate entities are those syntactically related to a verb. If a sentence contains only a single verb, the sentence describes a single event and all nouns (noun phrases) are syntactically related to this verb. If a sentence contains multiple verbs, the candidate entities for a target verb are the nouns (noun phrases) that are closest to the target verb along the dependency parse graph; in this multi-verb case, we create multiple training instances, one per verb as the target verb. We use a simple rule-based extraction technique based on the dependency parse paths and Part-of-Speech (POS) tags to extract the relevant entities for each target verb: the nouns or noun phrases one hop away from the target verb along the dependency parse graph are the candidate entities.
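A minimal sketch of this extractor, using spaCy in place of the Stanford parser used in the paper; the helper name is illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the Stanford parser + POS tagger

def candidate_entities(description: str, target_verb: str) -> list[str]:
    """Nouns/noun chunks one hop from the target verb in the dependency graph."""
    doc = nlp(description)
    entities = []
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ == target_verb:
            for nb in list(token.children) + [token.head]:   # one-hop neighbors
                if nb.pos_ in ("NOUN", "PROPN"):
                    chunk = next((c.text for c in doc.noun_chunks
                                  if c.start <= nb.i < c.end), nb.text)
                    entities.append(chunk)
    return entities

# candidate_entities("a golden retriever is chasing a flying frisbee", "chase")
# -> ["a golden retriever", "a flying frisbee"]
```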

D Graph Attention Network (GAT)
Graph Attention Network (GAT) (Velickovic et al., 2018) fuses graph-structured information and node features within the model. Its masked self-attention layers allow a node to attend to its neighborhood features and to learn different attention weights for different neighboring nodes.
The node features fed into a GAT layer are $X = [x_1, x_2, \ldots, x_n]$, $x_i \in \mathbb{R}^F$, $X \in \mathbb{R}^{n \times F}$, where $n$ is the number of nodes and $F$ is the feature size of each node. Specifically, in our context, each word corresponds to a node and $F$ is the word embedding size. In equation (2), node $i$ attends over its 1-hop neighbors $j \in \mathcal{N}_i$:

$$h_i^{out} = \big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k W^k x_j\Big), \qquad \alpha_{ij}^k = \mathrm{softmax}_j\Big(f\big(a^{\top} [W^k x_i \,\Vert\, W^k x_j]\big)\Big) \qquad (2)$$

where $\Vert_{k=1}^{K}$ denotes the concatenation of the $K$ multi-head attention outputs, $h_i^{out} \in \mathbb{R}^{F'}$ is the output of node $i$ at the current layer, $\alpha_{ij}^k$ is the $k$-th attention weight between nodes $i$ and $j$, $\Vert$ is the concatenation operation, $W^k \in \mathbb{R}^{\frac{F'}{K} \times F}$ is a linear transformation, $a \in \mathbb{R}^{\frac{2F'}{K}}$ is the weight vector, and $f(\cdot)$ is a LeakyReLU non-linearity function.
Overall, the input-output behavior of a single GAT layer is summarized as $H_{out} = GAT(X, A; \Theta)$. The input is $X \in \mathbb{R}^{n \times F}$ and the output is $H_{out} \in \mathbb{R}^{n \times F'}$, where $n$ is the number of nodes, $F$ is the node feature size, $F'$ is the GAT hidden size, and $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the graph.

E GAT-MA Model Implementation Details
We employ the Stanford Neural Network Dependency Parser (Chen and Manning, 2014) to convert each scene description into a dependency parse graph. In our experiments, two pretrained embeddings are used: GloVe (Pennington et al., 2014) and BERT (Devlin et al., 2019). To produce BERT embeddings, the input to BERT is formatted by adding "[CLS]" before and "[SEP]" after the tokens of the description; this input is tokenized by the BERT tokenizer into word pieces. The output of the pretrained BERT model is a sequence of vectors, each of size 768, where each output vector corresponds to one word-piece token. The BERT tokenizer splits some words into word pieces (sub-word tokens); e.g., "tokenizer" is tokenized into the word pieces "token" and "##izer". We take the average of the word-piece embeddings of a word to obtain the embedding of that word. Note that we use the BERT embeddings as static input features for GAT-MA; the model does not fine-tune BERT. We empirically set the GAT-MA hyper-parameters as follows: the hidden state size is 300; BERT embeddings are mapped into 300 dimensions using a linear layer; and 6 attention heads are used for the GAT layers. The mini-batch size is set to 256 and the learning rate to 5e-5; we use a larger batch size to speed up training. We apply 0.1 embedding dropout (Srivastava et al., 2014) and 0.1 attention dropout, and $l_2$ regularization with $\lambda = 10^{-4}$. The Adam (Kingma and Ba, 2015) optimizer is used for training. The model is trained for 5 epochs, and each epoch takes around 200 minutes to run.
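A minimal sketch of this word-piece averaging, assuming the HuggingFace transformers API (the paper does not specify its BERT tooling):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()  # static features, no fine-tuning

@torch.no_grad()
def word_embeddings(words: list[str]) -> torch.Tensor:
    """One 768-d vector per word, averaging the word-piece vectors of each word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]       # (num_pieces, 768), incl. [CLS]/[SEP]
    piece_to_word = enc.word_ids(0)                  # None for [CLS]/[SEP]
    vecs = []
    for wi in range(len(words)):
        idx = [i for i, w in enumerate(piece_to_word) if w == wi]
        vecs.append(hidden[idx].mean(dim=0))         # average the word's pieces
    return torch.stack(vecs)                         # (num_words, 768)

# word_embeddings("a man is walking a chicken in the park".split())
```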
The implementation of the model is based on PyTorch Geometric (PyG) (Fey and Lenssen, 2019) and an NVIDIA GTX 1080 Ti GPU.

F Baseline Models Implementation Details
For all the baselines, we run experiments with various embeddings, such as GloVe, BERT, and InferSent embeddings, and report the best results for comparison.
InferSent pretrained models. InferSent-1 is trained using GloVe embeddings; InferSent-2 is trained using fastText (Mikolov et al., 2018) embeddings.

OCSVM. The parameter settings of OCSVM are as follows: we use the "poly" kernel, gamma set to "scale", and a nu value of 0.1; for the other parameters, we use the default settings of the scikit-learn implementation. OCSVM obtains its best result using GloVe-AVG sentence embeddings.
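A minimal sketch of this baseline with the stated settings, using placeholder embeddings in place of the real GloVe-AVG features:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# placeholder 300-d GloVe-AVG sentence embeddings; the real features average
# glove.840B.300d vectors over each description's words
X_train = np.random.randn(1000, 300)   # NORMAL descriptions only
X_test = np.random.randn(200, 300)

ocsvm = OneClassSVM(kernel="poly", gamma="scale", nu=0.1)  # settings stated above
ocsvm.fit(X_train)
scores = ocsvm.decision_function(X_test)  # higher score = more normal
```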
iForest. The parameter settings of iForest are as follows: we use 100 base estimators in the ensemble, and we set the amount of contamination of the dataset to 0.0 because there are no novel scene descriptions in our training dataset. For the other parameters, we follow the default settings of the scikit-learn implementation. iForest obtains its best result using InferSent-1 embeddings.
VAE. We use the text encoder structure of Convolutional Neural Networks (CNN) for Sentence Classification (Kim, 2014) to implement a VAE that can take text data as input. For all hyper-parameters, we follow the settings of the original work. Among the 4 methods of converting descriptions into sentence embeddings, BERT-CLS gives the best result.
DSVDD. The LeNet implementation is used as our baseline model. The latent dimension of the autoencoder, as well as the final fully connected layer of the model, is changed to 96 dimensions to better accommodate the size of our description embeddings. For all other parameters, we use the default settings from the original work and implementation. Among the 4 methods of converting descriptions into sentence embeddings, BERT-CLS gives the best result.
ICS. We use the default settings from the original work and implementations. Among the 4 methods of converting descriptions into sentence embeddings, BERT-CLS gets the best result.
OCGAN. To make OCGAN work better for text data, we change the depth of the generator and discriminator from 3 layers to 2, the noise factor of the training data from 0.02 to 0.05, and the weight of the reconstruction loss from 500 to 600. For the other hyper-parameters, we follow the settings of the original work. Among the 4 methods of converting descriptions into sentence embeddings, BERT-CLS gives the best result.
HRN. We follow the default settings of the original work.