Press Freedom Monitor: Detection of Reported Press and Media Freedom Violations in Twitter and News Articles
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Freedom of the press and media is of vital importance for democratically organised states and open societies. We introduce the Press Freedom Monitor, a tool that aims to detect reported press and media freedom violations in news articles and tweets. It is used by press and media freedom organisations to support their daily monitoring and to trigger rapid response actions. The Press Freedom Monitor gives monitoring experts a fast overview of recently reported incidents, and it has performed well in this respect. This paper presents our work on the tool, starting with the training phase, which comprises defining the topic-related keywords used to query APIs for news and Twitter content, and evaluating different machine learning models on a training dataset created specifically for our use case. We then describe the components of the production pipeline, including data gathering, duplicate removal, country mapping, case mapping and the user interface. We also report a usability study conducted to evaluate the effectiveness of the user interface, and describe improvement plans for future work.
Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patent Texts
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes the creation of a gold standard for chemistry-disease relations in patent texts. We start with an automated annotation of named entities from the domains of chemistry (e.g. propranolol) and diseases (e.g. hypertension), as well as from related domains such as methods and substances. Domain-relevant relations between these entities, e.g. propranolol treats hypertension, are then annotated manually. The corpus is intended for developing and evaluating relation extraction methods. In addition, we present two high-precision reasoning methods for automatically extending the set of extracted relations. Chain reasoning infers and integrates additional, indirectly expressed relations that occur in relation chains. Enumeration reasoning exploits the frequent occurrence of enumerations in patents to derive additional relations automatically. Both methods can be used to verify and extend the manually annotated data, and may also improve automatic relation extraction.
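The two reasoning methods can be illustrated with a minimal sketch. The abstract does not give the actual inference rules, so everything below is an assumption made for illustration: the `Relation` triple, the function names, the rule "if A links to B and B has a target relation to C, then A has that relation to C" for chains, and the rule "a relation to one member of an enumeration extends to all members" for enumerations. The entities "composition X", "angina" and the predicate "contains" are hypothetical examples, not taken from the corpus.

```python
from typing import Iterable, NamedTuple, Set


class Relation(NamedTuple):
    """A (subject, predicate, object) triple; a hypothetical representation."""
    subject: str
    predicate: str
    obj: str


def enumeration_reasoning(relations: Iterable[Relation],
                          enumerations: Iterable[Set[str]]) -> Set[Relation]:
    """Assumed rule: if a relation's object is part of an enumeration,
    derive the same relation for every other member of that enumeration."""
    known = set(relations)
    derived = set()
    for rel in known:
        for members in enumerations:
            if rel.obj in members:
                for member in members:
                    candidate = Relation(rel.subject, rel.predicate, member)
                    if candidate not in known:
                        derived.add(candidate)
    return derived


def chain_reasoning(relations: Iterable[Relation],
                    link_predicate: str,
                    target_predicate: str) -> Set[Relation]:
    """Assumed rule: A -link-> B and B -target-> C yields A -target-> C."""
    known = set(relations)
    derived = set()
    for first in known:
        if first.predicate != link_predicate:
            continue
        for second in known:
            if (second.predicate == target_predicate
                    and second.subject == first.obj):
                candidate = Relation(first.subject, target_predicate, second.obj)
                if candidate not in known:
                    derived.add(candidate)
    return derived


# Hypothetical input: one annotated relation plus one chain link.
annotated = {
    Relation("propranolol", "treats", "hypertension"),
    Relation("composition X", "contains", "propranolol"),
}
from_enumeration = enumeration_reasoning(annotated, [{"hypertension", "angina"}])
from_chain = chain_reasoning(annotated, "contains", "treats")
```

Both functions return only newly derived relations, so the output can be reviewed separately before it is merged into the manually annotated set.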
Learning Categories and their Instances by Contextual Features
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We present a 3-step framework that learns categories and their instances from natural language text based on given training examples. Step 1 extracts the contexts of training examples from text as rules describing the category, using part of speech, capitalization and category membership as features. Step 2 selects high-quality rules using two consecutive filters. The first filter is based on the number of rule occurrences; the second takes two non-independent characteristics into account: a rule's precision and the number of instances it acquires. Our framework adapts the filters' threshold values to the respective category and textual genre by automatically evaluating the rule sets that result from different filter settings and selecting the best-performing rule set. Step 3 then identifies new instances of a category by applying the filtered rules within a previously proposed algorithm. We inspect the rule filters' impact on rule set quality and evaluate our framework by learning first names, last names, professions and cities from a hitherto unexplored textual genre, search engine result snippets, and achieve high precision on average.
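Step 2 and the threshold adaptation can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the `Rule` fields, the function names, the exhaustive grid search and the scoring function `net_correct` are all hypothetical, since the abstract does not say how rule sets are evaluated or how candidate thresholds are generated.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """A context rule with the statistics used by the two filters (assumed fields)."""
    pattern: str
    occurrences: int   # how often the rule context occurs in the text
    precision: float   # share of acquired instances that are correct
    instances: int     # number of instances the rule acquires


def filter_rules(rules: Sequence[Rule], min_occurrences: int,
                 min_precision: float, min_instances: int) -> List[Rule]:
    """Apply the two filters one after the other."""
    # Filter 1: keep rules that occur often enough.
    frequent = [r for r in rules if r.occurrences >= min_occurrences]
    # Filter 2: precision and instance count are considered together.
    return [r for r in frequent
            if r.precision >= min_precision and r.instances >= min_instances]


def select_best_rule_set(rules: Sequence[Rule],
                         grid: Tuple[Sequence[int], Sequence[float], Sequence[int]],
                         score: Callable[[List[Rule]], float]):
    """Try every threshold combination and keep the best-scoring rule set."""
    best_settings, best_rules, best_score = None, [], float("-inf")
    for settings in product(*grid):
        candidate = filter_rules(rules, *settings)
        value = score(candidate)
        if value > best_score:
            best_settings, best_rules, best_score = settings, candidate, value
    return best_settings, best_rules


def net_correct(rule_set: List[Rule]) -> float:
    """Hypothetical score: expected correct minus expected incorrect instances."""
    return sum(r.instances * (2 * r.precision - 1) for r in rule_set)


# Hypothetical rules for a "last name" category.
rules = [
    Rule("Mr. <X>", occurrences=10, precision=0.9, instances=50),
    Rule("the <X>", occurrences=100, precision=0.2, instances=500),
    Rule("Dr. <X> said", occurrences=1, precision=1.0, instances=1),
]
settings, selected = select_best_rule_set(
    rules, grid=([1, 2], [0.1, 0.5], [1, 5]), score=net_correct)
```

Passing the scoring function as a parameter keeps the sketch neutral about the evaluation measure, so the same search can be rerun per category and per textual genre with whatever measure is appropriate.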