Semantic Context Path Labeling for Semantic Exploration of User Reviews

In this paper we present a prototype demonstrator showcasing a novel method to perform semantic exploration of user reviews. The system enables effective navigation in a rich contextual semantic schema with a large number of structured classes indicating relevant information. In order to identify instances of the structured classes in the reviews, we defined a new Information Extraction task called Semantic Context Path (SCP) labeling, which simultaneously assigns types and semantic roles to entity mentions. Reviews can rapidly be explored based on the fine-grained and structured semantic classes. As a proof-of-concept, we have implemented this system for reviews on Points-of-Interest, in English and Korean.


Introduction
In this paper, we demonstrate a system that proposes a novel approach for extracting rich and diverse information from informative texts and for navigating semantically among the extracted categories. Our system is specifically designed for exploring large quantities of user reviews, which contain a wealth of information that is useful to potential new users but difficult to exploit on existing review platforms.
The users of our system can navigate in a rich semantic schema representing a large number of hierarchically structured information categories relevant in the reviews, where top classes have multiple layers of subclasses and attribute classes. The semantic navigation can start from the top classes, or directly from the attribute classes.
Mentions of the structured classes in the reviews are automatically assigned labels that we call Semantic Context Paths (SCPs), which denote their types and semantic roles. By semantic roles, we mean classes for which the mentions instantiate attributes. For example, in the sentence "See a good movie for $5!", the mention $5 is assigned the SCP label ShowAndExhibition.Payment.PriceValue, which means that it is the price of a movie show.
We implemented a supervised SCP tagging system for POI reviews: (1) we created an SCP semantic schema for the domain; (2) we labeled datasets both for English and Korean using an annotation tool that we developed for this task; and (3) we developed and evaluated models for this IE task.
We demonstrate the novel exploration method based on structured semantic classes through a dedicated exploration interface. It is not targeted at end-users at this stage: our purpose is to showcase the new approach in an intuitive way.
In the following sections, we first present the advantages of exploring user reviews through semantic context paths compared to the methods used in existing social media review sites (Section 2). We then describe the SCP labeling task (Section 3) and finally present our SCP tagging system for POI reviews (Section 4), where we show that sequence labeling built on a state-of-the-art contextual language model extracts large numbers of SCPs from user reviews in multiple languages with high accuracy.

Semantic Exploration of User Reviews
Popular POI review sites like Google Maps or Tripadvisor provide their users with multiple methods for obtaining information on POIs. In this section, we briefly describe the navigation methods offered by existing major user review sites on POIs, and compare them with navigation based on SCPs.
The most widespread way to explore reviews is based on pre-selected frequent terms, as shown in Figure 1.
The user can click on the terms, and thus rapidly access the reviews containing them. However, the main limitation is that the frequent terms are not grouped semantically, so similar information is handled separately (cf. jacket and sweater). In addition, important information can be conveyed by clausal expressions like dress accordingly, and therefore be missed by term detection algorithms. Finally, for many relevant topics, like prices, booking, or the availability of food on-site, linguistic expression is so heterogeneous that a method based on term frequency will not detect them.
Keyword search is another way to explore reviews, as shown in Figure 2. Its advantage over the previous method is that the user is not limited by a pre-selected set of terms. However, it is slower, since the user needs to enter queries, and getting all the relevant information may require multiple queries. Moreover, some queries will fail, either because the information is absent or because the query keywords are not matched, which wastes time and can be frustrating.
Finally, a few sites propose pre-selected flat categories for navigation, as illustrated in Figure 3.
This method provides fast access to information by clicking on the categories, and improves on pre-selected term navigation. However, since the categories are not structured, they can be ambiguous. Taken together, keyword-based navigation lacks the generality offered by structured search, whereas the structured search currently offered does not disambiguate general categories. Compared to these methods, the advantage of navigation based on semantic context paths is that it provides rapid and effective access to precise information on a great diversity of topics.

Information Extraction with Semantic Path Labeling

Presentation of the Task
We introduce Semantic Context Paths (SCPs) as representations of entity mentions enriched with contextual semantic labels. SCP labels encode three types of information: (1) context-free semantic types, as traditionally extracted by Named Entity Recognition (NER) systems, (2) hierarchically represented context-free semantic types, traditionally extracted by fine-grained NER systems, and (3) context-dependent semantic roles specifying the classes for which the mentions instantiate attributes.
As an example, assume we want to extract various kinds of useful information from Point-of-Interest reviews about recreation places (e.g. movie theaters), such as food, prices, visitors, etc. Figure 4 illustrates how SCP labeling is applied to a real POI comment.
The proposed IE task goes beyond traditional Named Entity Recognition (NER), which merely assigns semantic categories to entity mentions (Akbik et al., 2018; Pawar et al., 2017), even when the set of categories is fine-grained and hierarchical (Mai et al., 2018; Zhang et al., 2020). In contrast, SCP labels capture not only the semantic content of the entities or concepts within the mentions, but also their semantic roles and types of relations within the text.
In the example in Figure 4, good price does not have a simple PriceProperty label, as it could have in a classical NER system, but ShowAndExhibition.PriceProperty, which indicates that it is related to the mentions with the same root label, ShowAndExhibition (in this example, Regency Theatres and movie, which are tagged as ShowAndExhibition.Loc and ShowAndExhibition.Subject, respectively).
SCP tagging therefore presents the advantage of extracting rich contextual information, while still being a sequence labeling task.
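To make the role of the shared root label concrete, the following minimal sketch groups tagged mentions by the first class of their SCP label. The helper names `parse_scp` and `group_by_root` are hypothetical and introduced only for illustration; the three mentions and labels are the ones discussed for Figure 4.

```python
from collections import defaultdict


def parse_scp(label):
    """Split an SCP label into (root class, list of remaining classes)."""
    classes = label.split(".")
    return classes[0], classes[1:]


def group_by_root(mentions):
    """Group (mention text, SCP label) pairs by the root class of the label."""
    groups = defaultdict(list)
    for text, label in mentions:
        root, rest = parse_scp(label)
        groups[root].append((text, ".".join(rest)))
    return dict(groups)


# Mentions and labels taken from the Figure 4 discussion.
mentions = [
    ("Regency Theatres", "ShowAndExhibition.Loc"),
    ("movie", "ShowAndExhibition.Subject"),
    ("good price", "ShowAndExhibition.PriceProperty"),
]
related = group_by_root(mentions)
```

Under this reading, mentions that end up in the same group are the ones that a flat NER label set would leave unconnected.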

Formal Definition of the Task
A semantic schema $S$ is a tuple $\langle C, I, R \rangle$ where $C$ is a finite set of semantic classes, $I \subseteq C$ is the subset of instantiable classes, and $R \subset C \times C$ is the set of attribute relations. $R$ defines which classes have attributes and their types: $(c_i, c_j) \in R$ means that class $c_i$ has an attribute of type $c_j$. A class $c \in C$ is a top class iff $\nexists\, c' \in C$ such that $(c', c) \in R$ (i.e. $c$ is not an attribute of another class), and it is terminal iff $\nexists\, c' \in C$ such that $(c, c') \in R$ (i.e. it does not have attributes).
A Semantic Context Path (SCP) in $S$ is any non-empty and finite sequence of classes $p = c_1.c_2.c_3 \ldots c_n$ where $c_1$ is a top class, and $(c_i, c_{i+1}) \in R$ for every $1 \le i < n$. We use $P_S$ to denote the set of all acyclic, instantiable paths of $S$.
We can now define the task of Semantic Context Path tagging as follows: given a semantic schema $S$ and a tokenized text document $d = t_1 t_2 \ldots t_n$, SCP tagging assigns a (possibly empty) subset of $P_S$ to every token in $d$.
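The label set of the task can be enumerated directly from a schema. The sketch below is illustrative only and rests on two assumptions that the text does not spell out: a path is treated as instantiable when its last class belongs to the instantiable subset, and as acyclic when no class occurs twice in it. The function name `enumerate_paths` and the toy schema (including its attribute relations) are hypothetical.

```python
def enumerate_paths(classes, instantiable, relations):
    """List all acyclic paths from a top class to an instantiable class.

    classes: iterable of class names (C)
    instantiable: set of instantiable class names (I, assumed to mark valid path ends)
    relations: iterable of (class, attribute class) pairs (R)
    """
    attributes = {c: [] for c in classes}
    has_parent = set()
    for parent, child in relations:
        attributes[parent].append(child)
        has_parent.add(child)
    # A top class is never the attribute of another class.
    top_classes = [c for c in classes if c not in has_parent]

    paths = []

    def extend(path):
        if path[-1] in instantiable:
            paths.append(".".join(path))
        for child in attributes[path[-1]]:
            if child not in path:  # acyclic: never visit a class twice
                extend(path + [child])

    for top in top_classes:
        extend([top])
    return sorted(paths)


# Toy schema; the relations are assumed for illustration only.
toy_classes = ["ShowAndExhibition", "Payment", "PriceValue", "Loc"]
toy_instantiable = {"PriceValue", "Loc"}
toy_relations = [
    ("ShowAndExhibition", "Payment"),
    ("Payment", "PriceValue"),
    ("ShowAndExhibition", "Loc"),
]
toy_paths = enumerate_paths(toy_classes, toy_instantiable, toy_relations)
```

The tagger's output space is then this list of paths plus the special label O.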

System Description
The semantic navigation system has two main components: the IE component, which identifies mentions and labels them with SCPs off-line, and a navigation component, which allows exploring the reviews based on the structured classes. We describe these two components in the following subsections.

Model Description
We model the SCP labeling task as a multi-label classification problem at the token level: review texts are tokenized, and each token receives one or more SCP labels if it is part of a relevant mention, or the special label O otherwise. The set of possible labels is therefore the set $P_S$ of all acyclic, instantiable paths of a semantic schema $S$, plus the special label O. We implement the model as a neural network that jointly learns to predict the final path labels and the individual classes that make up the paths. Since many paths share individual classes (i.e. they have identical prefixes or suffixes), learning to predict the individual classes inside the paths can alleviate data sparseness with regard to whole path labels. Figure 5 depicts the model architecture.
First, the model assigns a contextualized embedding to each token $w_i$ of the input text using a pre-trained, transformer-based language model, specifically a RoBERTa model (Equation 1). The contextualized token representations are then fed to a fully connected hidden layer, yielding hidden token representations $h_i$ (Equation 2). From these hidden representations, we first compute the probabilities of all individual (flat) classes of the semantic schema (Equation 3). We then concatenate the class probability vector with the hidden token representation $h_i$ and feed the result to a linear layer with a sigmoid activation in order to compute the probabilities of the path (SCP) labels (Equation 4). We train the model using the sum of the binary cross-entropy losses of the individual class predictions and the whole path predictions.
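The four equations of the model description are not reproduced in this text. One formulation consistent with the prose is sketched below. It is a reconstruction under assumptions, not the authors' exact notation: the symbols $e_i$, $W_\ast$, $b_\ast$, $y$ and the generic hidden activation $g$ are introduced here, and the sigmoid in Equation 3 is inferred from the binary cross-entropy training objective rather than stated in the source.

```latex
\begin{align}
e_i &= \mathrm{RoBERTa}(w_1, \dots, w_n)_i \tag{1} \\
h_i &= g\left(W_h e_i + b_h\right) \tag{2} \\
p^{\mathrm{class}}_i &= \sigma\left(W_c h_i + b_c\right) \tag{3} \\
p^{\mathrm{path}}_i &= \sigma\left(W_p \left[\, p^{\mathrm{class}}_i \,;\, h_i \,\right] + b_p\right) \tag{4} \\
\mathcal{L} &= \mathrm{BCE}\left(p^{\mathrm{class}}, y^{\mathrm{class}}\right)
             + \mathrm{BCE}\left(p^{\mathrm{path}}, y^{\mathrm{path}}\right) \notag
\end{align}
```

Here $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation, $\sigma$ the element-wise sigmoid, and $y^{\mathrm{class}}$, $y^{\mathrm{path}}$ the multi-hot gold vectors for classes and paths.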

Experiments
Dataset. The dataset is a collection of POI users' comments in both English and Korean, randomly sampled from three main categories: Food (food places), Art & Entertainment, and Outdoor & Recreation. The data was acquired legally through an agreement with a company providing POI-based services. It contains a set of user comments with corresponding POI identifiers, geolocation and categories, and does not contain any user information. Every comment has a main category among these three, and a set of one or more sub-categories (e.g. Park for Outdoor & Recreation, or Movie Theater for Art & Entertainment). Since there was no rich, off-the-shelf hierarchical semantic schema to represent useful information types for POIs, we defined one from scratch, based on a development subset of the data. The final fine-grained schema comprises 42 semantic classes, whose combination leads to 185 potentially instantiable semantic paths. For example, Visit.Accessibility.Mobility.Device is an SCP associated with the mention wheelchair in "Most parts are fine for wheelchairs."; ShowAndExhibition.Subject.Name is an SCP associated with the mention Last Judgement in "The pride of the Gdansk museum is a triptych of the Last Judgement by Hans Memling (1433-1494)", etc. Annotation guidelines were defined together with the semantic schema.
We (the authors) manually annotated the English version of the dataset according to the semantic schema and the guidelines. Following the same schema and annotation guidelines, the Korean version of the dataset was annotated by a native Korean speaker specialized in data annotation (hired under a Naver employment contract). No crowd-workers were involved in the annotation process. The distribution of SCP labels is unbalanced for both languages, with a long tail of infrequent labels: 50% of the observed path labels in the training set for Korean had a maximum of 10 occurrences. Table 1 reports some statistics regarding both versions of the annotated dataset. In order to assess the reliability of the annotations, we calculated Krippendorff's alpha (Hayes and Krippendorff, 2007) on the English annotated dataset. On a token basis, we obtained a relatively high agreement level, α = 0.838, which indicates good dataset reliability. This dataset is, to our knowledge, one of the finest-grained annotated datasets for information extraction.
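For readers unfamiliar with the agreement measure, the sketch below computes Krippendorff's alpha for nominal data through the standard coincidence-matrix formulation. It is an illustration only: it assumes a single nominal label per token and per annotator, which simplifies the multi-label setting of the task, and the text does not state how the reported value was computed. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: list of label lists, one list per token, holding the labels
    given to that token by the annotators who labeled it.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a token with fewer than two labels cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        # Alpha is undefined when a single label is used throughout;
        # returning 1.0 is a choice made for this sketch.
        return 1.0
    return 1.0 - (n - 1) * observed / expected


# Toy data: two annotators, four tokens, one disagreement.
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

On the toy data, the observed disagreement weight is 2 and the expected one is 30 with n = 8, so alpha equals 1 - 7 * 2 / 30.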
Hyperparameters. We used the xlm-roberta-large pretrained, transformer-based model from the HuggingFace transformers library (Wolf et al., 2020) to produce the contextualized token embeddings (equation 1). Experiments with multiple monolingual pre-trained language models showed either similar performance (roberta-large for English) or lower performance (bert-large for English, and kobert, kobart, koelectra and KR-BERT-char16424 for Korean). We used an initial learning rate of 1e-5 with a scheduler, a batch size of 16, and a maximum of 100 epochs, with an early stopping strategy.
Evaluation. We evaluate the performance of the models as fully automatic information extraction systems. The input contains raw text (user comments), and the system needs to identify mentions (and their boundaries) and tag them with SCP labels. We measured performance using the traditional precision, recall, and F1 scores in their relaxed variants. That is, the boundaries of predicted mentions do not need to fully match the ground-truth mention spans: they are considered correct if they have at least one token in common with a ground-truth span, as long as the predicted SCP label is entirely correct. This relaxed span evaluation is motivated by the target application: in the UI for exploring POI reviews (see section 4.2), when the user navigates in the hierarchical semantic classes and selects one class, mentions of that class are automatically displayed and highlighted within their context (review snippets). Thus, the user can read the immediate context, including any token missing from the identified mention. Performance results are shown in Table 2.
An interesting side result is that since the IE component is built on top of a multilingual pretrained language model (xlm-roberta-large), it is applicable to other languages in a zero-shot setting. Although we have not performed a quantitative evaluation on languages other than English and Korean, due to the lack of labeled data, a preliminary qualitative evaluation shows promising results. An example of SCP labeling of Arabic is provided in Figure 6.
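The relaxed matching criterion used in the evaluation can be sketched as follows. This is a minimal illustration, not the evaluation script behind Table 2: spans are assumed to be (start, end-exclusive, label) token triples, precision counts predicted mentions that overlap a ground-truth mention with an identical label, and recall counts ground-truth mentions matched in the same way. The function names and the toy spans are hypothetical.

```python
def overlaps(a, b):
    """True if token spans a and b (start, end-exclusive, ...) share a token."""
    return a[0] < b[1] and b[0] < a[1]


def relaxed_prf(gold, predicted):
    """Relaxed precision, recall and F1 over (start, end, label) mentions."""

    def matched(span, others):
        # A match needs an identical SCP label and at least one shared token.
        return any(span[2] == o[2] and overlaps(span, o) for o in others)

    correct_pred = sum(matched(p, gold) for p in predicted)
    found_gold = sum(matched(g, predicted) for g in gold)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: one partial-boundary hit, one label error.
gold_spans = [
    (0, 2, "ShowAndExhibition.Loc"),
    (5, 7, "ShowAndExhibition.PriceProperty"),
]
pred_spans = [
    (1, 2, "ShowAndExhibition.Loc"),      # overlaps gold, same label: correct
    (5, 7, "ShowAndExhibition.Subject"),  # same span, wrong label: incorrect
]
scores = relaxed_prf(gold_spans, pred_spans)
```

The first prediction counts as correct even though it misses the first token of the mention, which mirrors the motivation given above: the missing token remains visible in the displayed snippet.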

Exploration of User Reviews
We have designed an interface on top of Naver map to demonstrate the application of SCP tagging for exploring rich and detailed information conveyed in user reviews. Currently, the demo covers two languages, English and Korean, but since the system is operational for a great number of other languages, these could be added as well. When a user opens the map, she first selects the language and the POI types (one or several) that she would like to explore. The current system covers 3 POI types: Arts & Entertainment, Outdoors & Recreation, and Food.
Hovering over a POI in the map makes its name appear along with a circular chart displaying the different topics (i.e. the top classes of the semantic schema) that are covered by the reviews, as well as the number of available reviews. The width of each category is determined by the number of subcategories covered by the reviews (cf. Figure 7). In order to explore the reviews in more detail, the user can click on the circular chart, which will expand into a sunburst displaying all the subcategories that have mentions in the reviews. As we indicated in the introduction, the demo system is not targeted at end users. We have chosen the sunburst interface since it allows a comprehensive and straightforward visualization of a great number of SCPs: the innermost circle contains the main categories, and each subsequent layer displays the subcategories in the same segment. Thus, at a glance, the chart offers an overview of all the categories and SCPs covered by a review. Moving the mouse over any category opens a callout with the snippets of the relevant user reviews (Figure 8).
If review snippets are not sufficient to understand the information conveyed in the review, the user can click on the callouts to access the full reviews (Figure 10). Figure 9 shows the categories extracted from the Korean reviews on the Seoul Tower, and the snippets about recommended visitor companions.
Figure 8: Visualizing snippets containing class mentions. In this example, besides a tip on the best time for a general visit, we find information on time when you can dance (Recreation&Sport.Time), when you can get specific drinks (Food&Drinks.Time), as well as on when you can listen to music (Show&Recreation.Time). In the interface, snippets of one category can be visualized at a time by hovering over it; here we present the snippets of several categories on the same figure for a concise presentation.
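The data behind such a sunburst can be derived from the tagger output by nesting SCP labels into a tree. The sketch below is an assumption about one possible data structure, not a description of the demo's implementation: the function names, the snippet texts, and the `Food&Drinks.Payment.PriceValue` path are hypothetical, and the segment width is approximated by the number of descendant classes that have mentions.

```python
def build_sunburst(mentions):
    """Nest (SCP label, snippet) pairs into a tree keyed by class name."""
    root = {"children": {}, "snippets": []}
    for label, snippet in mentions:
        node = root
        for cls in label.split("."):
            node = node["children"].setdefault(
                cls, {"children": {}, "snippets": []}
            )
        node["snippets"].append(snippet)  # snippets are stored at the path end
    return root


def count_subcategories(node):
    """Number of descendant classes below a node (assumed segment width)."""
    return sum(
        1 + count_subcategories(child) for child in node["children"].values()
    )


# Hypothetical tagger output for one POI.
tagged = [
    ("Food&Drinks.Time", "snippet about drink hours"),
    ("Food&Drinks.Payment.PriceValue", "snippet about a price"),
    ("Recreation&Sport.Time", "snippet about dancing hours"),
]
tree = build_sunburst(tagged)
widths = {
    name: count_subcategories(child) for name, child in tree["children"].items()
}
```

Each top-level key corresponds to a segment of the innermost circle, and each nesting level to one further layer of the chart; hovering over a segment would display the snippets stored at or below that node.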

Conclusion
We have presented a new method for the semantic exploration of user reviews.
The system underlying the demonstrator relies on an IE component which identifies relevant mentions in the reviews and labels them with Semantic Context Paths, denoting their types and semantic roles. We have implemented an SCP tagger that extracts information from user reviews on Points-of-Interest. We have defined a dedicated semantic schema, created datasets, and developed sequence labeling models for the task. The IE component was quantitatively evaluated on English and Korean, and showed promising qualitative results on other languages.
We have designed a review exploration interface exploiting the output of the SCP tagger. A sunburst chart in the interface allows navigation among the classes, and rapid access to relevant information in the reviews. Future work includes integrating a wider range of languages into the review exploration interface.