Natural language processing (NLP) often forms the backbone of today's systems for user interaction, information retrieval, and related applications. Many such NLP applications rely on specialized learned representations (e.g., neural word embeddings, topic models) that improve the ability to reason about the relationships between the documents of a corpus. Alongside the progress in learned representations, the similarity metrics used to compare document representations are also evolving, with numerous proposals differing in computation time and interpretability. In this paper we propose an extension to an emerging hybrid document distance metric that combines topic models and word embeddings: Hierarchical Optimal Topic Transport (HOTT). Specifically, we extend HOTT with context-enhanced word representations. We validate our approach on public datasets, using the language model BERT for a document categorization task. Results indicate competitive performance of the extended HOTT metric. We furthermore apply the HOTT metric and its extension to support educational media research, with a retrieval task of matching topics in German curricula to passages of educational textbooks, along with an auxiliary explanatory document representing the dominant topic of the retrieved document. In a user study, our explanation method is preferred over regular topic keywords.
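To make the underlying idea concrete, here is a minimal sketch (not the implementation evaluated in the paper) of a HOTT-style distance: the distance between two documents is the optimal transport cost between their topic proportions, with topic-to-topic costs assumed to be precomputed from word embeddings (e.g., as word mover's distances between the topics' top words). The sketch uses the POT library; the toy cost matrix is purely illustrative.

```python
# Minimal sketch of the HOTT idea (not the authors' implementation).
# Requires the POT library: pip install pot
import numpy as np
import ot  # Python Optimal Transport


def hott_distance(p, q, topic_cost):
    """p, q: topic proportions of two documents (1-D arrays summing to 1).
    topic_cost: precomputed matrix of pairwise topic distances, e.g. word
    mover's distances between the topics' top words (assumed given here)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Exact earth mover's distance between the two topic distributions.
    return ot.emd2(p, q, topic_cost)


# Toy example: 3 topics and an assumed, illustrative topic-distance matrix.
C = np.array([[0.0, 0.8, 1.0],
              [0.8, 0.0, 0.6],
              [1.0, 0.6, 0.0]])
print(hott_distance([0.7, 0.2, 0.1], [0.1, 0.3, 0.6], C))
```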
This paper describes the first steps in the development of an ontology for the discipline of textbook research. The aim of the WorldViews project is to establish a digital edition focussing on views of the world depicted in textbooks. For this purpose, an initial TEI profile has been formalised and tested in a use case to enable the semantic encoding of the resource 'textbook'. This profile is intended to provide a basic data model describing the major facets of a textbook's structure that are relevant to historians.
Determining the real-world referents for name mentions of persons, organizations and other named entities in texts has become an important task in many information retrieval scenarios and is referred to as Named Entity Disambiguation (NED). While comprehensive datasets support the development and evaluation of NED approaches for English, there are no public datasets to assess NED systems for other languages, such as German. This paper describes the construction of an NED dataset based on a large corpus of German news articles. The dataset is closely modeled on the datasets used for the Knowledge Base Population tasks of the Text Analysis Conference, and contains gold standard annotations for the NED tasks of Entity Linking, NIL Detection and NIL Clustering. We also present first experimental results on the new dataset for each of these tasks in order to establish a baseline for future research efforts.
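For illustration only, the following sketch shows naive baselines for the three annotated tasks (Entity Linking, NIL Detection, NIL Clustering); the knowledge-base format, the exact-match linking rule, and the surface-form clustering are assumptions of this example, not the systems evaluated in the paper.

```python
# Illustrative baselines for the three NED tasks; the KB format and
# linking rule are hypothetical, not the paper's actual system.
from collections import defaultdict


def link_mentions(mentions, kb):
    """mentions: list of surface strings; kb: dict name -> entity id."""
    linked, nil_mentions = {}, []
    for m in mentions:
        entity_id = kb.get(m.lower())   # entity linking by exact match
        if entity_id is not None:
            linked[m] = entity_id
        else:
            nil_mentions.append(m)      # NIL detection: no KB entry found
    # NIL clustering: group unlinkable mentions by normalized surface form.
    clusters = defaultdict(list)
    for m in nil_mentions:
        clusters[m.lower()].append(m)
    return linked, list(clusters.values())


kb = {"angela merkel": "E1", "berlin": "E2"}
print(link_mentions(["Angela Merkel", "Müller", "müller"], kb))
```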
Current search engines are used to retrieve relevant documents from the huge amount of data available and have become an essential tool for the majority of Web users. Standard search engines do not consider semantic information that can help in recognizing the relevance of a document with respect to the meaning of a query. In this paper, we present our system architecture and a first user study, which shows that the use of semantics can help users find relevant information, filter it, and gain quicker access to data.
In this paper, we present the multilingual Sense Folder Corpus. After analyzing different corpora, we describe the requirements that have to be satisfied for evaluating semantic multilingual retrieval approaches. Since existing corpora leave these requirements unfulfilled, we created a small bilingual hand-tagged corpus of 502 documents retrieved from Web searches. The documents in this collection were gathered using Google queries: a single ambiguous word was searched and the related documents (approximately the first 60 documents per keyword) were retrieved. The collection was built at the query-word level, using single ambiguous words for English (argument, bank, chair, network and rule) and for Italian (argomento, lingua, regola, rete and stampa). The search and annotation process was carried out monolingually for both English and Italian; 252 English and 250 Italian documents were retrieved from Google and saved in their original rank. The performance of semantic multilingual retrieval systems has been evaluated on this corpus against three baselines (Random, First Sense and Most Frequent Sense), which are formally presented and discussed. A fine-grained evaluation of the Sense Folder approach is discussed in detail.
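As an illustration of the three baselines (not the corpus's official evaluation code), the following sketch scores Random, First Sense, and Most Frequent Sense predictions against gold sense annotations; the data format is a simplifying assumption.

```python
# Sketch of the three baselines, assuming each document carries one
# gold sense label for the query word. Data format is illustrative.
import random
from collections import Counter


def evaluate_baselines(gold_senses, sense_inventory, seed=0):
    """gold_senses: list of gold sense labels, one per document.
    sense_inventory: ordered candidate senses (first entry = First Sense)."""
    rng = random.Random(seed)
    random_pred = [rng.choice(sense_inventory) for _ in gold_senses]
    first_pred = [sense_inventory[0]] * len(gold_senses)
    # Most frequent sense taken from the gold labels here for brevity;
    # in practice it would be estimated from separate training data.
    mfs = Counter(gold_senses).most_common(1)[0][0]
    mfs_pred = [mfs] * len(gold_senses)

    def accuracy(pred):
        return sum(p == g for p, g in zip(pred, gold_senses)) / len(gold_senses)

    return {"Random": accuracy(random_pred),
            "First Sense": accuracy(first_pred),
            "Most Frequent Sense": accuracy(mfs_pred)}


print(evaluate_baselines(["bank#1", "bank#1", "bank#2"], ["bank#1", "bank#2"]))
```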
In this paper we present two experiments conducted to compare different language identification algorithms. Approaches based on short words, frequent words, and n-grams are considered and combined with the Ad-Hoc Ranking classification method. The language identification process can be subdivided into two main steps: first, a document model is generated for the document and a language model for each language; second, the language of the document is determined on the basis of the language models and added to the document as additional information. We present our evaluation results and discuss the importance of a dynamic value for the out-of-place measure.
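The n-gram variant with the out-of-place measure can be sketched as follows (in the spirit of Cavnar and Trenkle's classic method; the fixed profile size below stands in for the dynamic value discussed in the paper):

```python
# Minimal sketch of n-gram language identification with the
# out-of-place measure. The profile size is fixed here purely for
# illustration; the paper argues for a dynamic value.
from collections import Counter


def ngram_profile(text, n=3, size=300):
    """Ranked list of the most frequent character n-grams in the text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between the two profiles; n-grams missing
    from the language profile receive the maximum penalty."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - ranks[g]) if g in ranks else penalty
               for r, g in enumerate(doc_profile))


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund " * 20),
}
print(identify("the dog jumps", profiles))
```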
In this paper, we discuss the integration of metaphor information into the RDF/OWL representation of EuroWordNet. First, the lexical database WordNet and its variants are presented. After a brief description of the Hamburg Metaphor Database, examples of its conversion into the RDF/OWL representation of EuroWordNet are discussed. The metaphor information is added to the general EuroWordNet data and the new resulting RDF/OWL structure is shown in LexiRes, a visualization tool developed and adapted for handling structures of ontological and lexical databases. We show how LexiRes can be used to further edit the newly added metaphor information, and explain some problems with this new type of information on the basis of examples.
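As a hedged illustration of what attaching metaphor information to an RDF representation can look like, the following sketch uses rdflib with placeholder namespaces and property names; the actual EuroWordNet RDF/OWL vocabulary and the Hamburg Metaphor Database schema differ from these assumptions.

```python
# Illustrative sketch only: the namespaces, synset identifiers, and
# property names below are placeholders, not the real EuroWordNet
# RDF/OWL vocabulary. Requires: pip install rdflib
from rdflib import Graph, Literal, Namespace

EWN = Namespace("http://example.org/ewn/")       # placeholder namespace
HMD = Namespace("http://example.org/metaphor/")  # placeholder for metaphor terms

g = Graph()
source = EWN["synset-fire-noun-1"]     # literal sense (assumed id scheme)
target = EWN["synset-passion-noun-1"]  # metaphorical sense (assumed id scheme)

# Link the literal synset to its metaphorical reading and attach an example.
g.add((source, HMD.hasMetaphoricalSense, target))
g.add((source, HMD.metaphorExample, Literal("the fire of her argument")))

print(g.serialize(format="turtle"))
```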
In this paper we discuss the problem of sense disambiguation using lexical resources such as ontologies or thesauri, with a focus on applying sense detection and merging methods in information retrieval systems. For an information retrieval task it is important to detect the meaning of a query word in order to retrieve the related relevant documents. To recognize the meaning of a search word, lexical resources such as WordNet can be used for word sense disambiguation. Analyzing the WordNet structure, however, we see that this ontology is fraught with several problems: the overly fine-grained distinction between word senses, for example, is unfavorable for use in information retrieval. We describe the related problems and present four implemented online methods for merging SynSets based on relations such as hypernyms and hyponyms, as well as further context information such as glosses and domains. We then show a first evaluation of our approach, compare the different merging methods, and briefly discuss future work.
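To illustrate the flavor of such merging methods (with simplified criteria that are assumptions of this sketch, not the paper's four methods), the following uses NLTK's WordNet interface to propose merge candidates among the synsets of a word that share a direct hypernym or have strongly overlapping glosses:

```python
# Simplified sketch of two merging ideas (shared hypernym, gloss
# overlap); the concrete criteria are assumptions for illustration.
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn


def gloss_overlap(s1, s2):
    """Number of tokens shared by the two glosses (a crude stand-in
    for gloss similarity)."""
    g1 = set(s1.definition().split())
    g2 = set(s2.definition().split())
    return len(g1 & g2)


def merge_candidates(word, min_overlap=2):
    """Pairs of synsets of `word` that share a direct hypernym or whose
    glosses overlap strongly; such pairs would be merged."""
    synsets = wn.synsets(word)
    pairs = []
    for i, s1 in enumerate(synsets):
        for s2 in synsets[i + 1:]:
            shared_hypernym = set(s1.hypernyms()) & set(s2.hypernyms())
            if shared_hypernym or gloss_overlap(s1, s2) >= min_overlap:
                pairs.append((s1.name(), s2.name()))
    return pairs


print(merge_candidates("bank"))
```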