Johannes Hoffart


2021

Unsupervised Multi-View Post-OCR Error Correction With Language Models
Harsh Gupta | Luciano Del Corro | Samuel Broscheit | Johannes Hoffart | Eliot Brenner
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We investigate post-OCR correction in a setting where we have access to different OCR views of the same document. The goal of this study is to understand whether a pretrained language model (LM) can be used in an unsupervised way to reconcile the different OCR views such that their combination contains fewer errors than each individual view. This approach is motivated by scenarios in which unconstrained text generation for error correction is too risky. We evaluated different pretrained LMs on two datasets and found significant gains in realistic scenarios, with up to 15% WER improvement over the best OCR view. We also show the importance of domain adaptation for post-OCR correction on out-of-domain documents.

KGPool: Dynamic Knowledge Graph Context Selection for Relation Extraction
Abhishek Nadgeri | Anson Bastos | Kuldeep Singh | Isaiah Onando Mulang’ | Johannes Hoffart | Saeedeh Shekarpour | Vijay Saraswat
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata
Manoj Prabhakar Kannan Ravi | Kuldeep Singh | Isaiah Onando Mulang’ | Saeedeh Shekarpour | Johannes Hoffart | Jens Lehmann
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

In this paper, we propose CHOLAN, a modular approach to end-to-end entity linking (EL) over knowledge bases. CHOLAN consists of a pipeline of two transformer-based models integrated sequentially to accomplish the EL task. The first transformer model identifies surface forms (entity mentions) in a given text. For each mention, a second transformer model is employed to classify the target entity among a predefined candidate list. The latter transformer is fed an enriched context captured from the sentence (i.e., local context) and an entity description obtained from Wikipedia. Such external contexts have not been used in state-of-the-art EL approaches. Our empirical study was conducted on two well-known knowledge bases (i.e., Wikidata and Wikipedia). The empirical results suggest that CHOLAN outperforms state-of-the-art approaches on standard datasets such as CoNLL-AIDA, MSNBC, AQUAINT, ACE2004, and T-REx.

From Stock Prediction to Financial Relevance: Repurposing Attention Weights to Assess News Relevance Without Manual Annotations
Luciano Del Corro | Johannes Hoffart
Proceedings of the Third Workshop on Economics and Natural Language Processing

We present a method to automatically identify financially relevant news using stock price movements and news headlines as input. The method repurposes the attention weights of a neural network initially trained to predict stock prices to assign a relevance score to each headline, eliminating the need for manually labeled training data. Our experiments on the four most relevant US stock indices and 1.5M news headlines show that the method ranks relevant news highly, and that this ranking quality is positively correlated with the accuracy of the initial stock price prediction task.

2018

A Study of the Importance of External Knowledge in the Named Entity Recognition Task
Dominic Seyler | Tatiana Dembelova | Luciano Del Corro | Johannes Hoffart | Gerhard Weikum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this work, we discuss the importance of external knowledge for performing Named Entity Recognition (NER). We present a novel modular framework that divides the knowledge into four categories according to the depth of knowledge they convey. Each category consists of a set of features automatically generated from different information sources, such as a knowledge base, a list of names, or document-specific semantic annotations. Further, we show the effects on performance when incrementally adding deeper knowledge and discuss effectiveness/efficiency trade-offs.

diaNED: Time-Aware Named Entity Disambiguation for Diachronic Corpora
Prabal Agarwal | Jannik Strötgen | Luciano Del Corro | Johannes Hoffart | Gerhard Weikum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Named Entity Disambiguation (NED) systems perform well on news articles and other texts covering a specific time interval. However, NED quality drops when inputs span long time periods, as in archives or historic corpora. This paper presents the first time-aware method for NED that resolves ambiguities even when mention contexts give only a few cues. The method is based on computing temporal signatures for entities and comparing these to the temporal contexts of input mentions. Our experiments show superior quality on a newly created diachronic corpus.

2016

DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences
Patrick Ernst | Amy Siu | Dragan Milchevski | Johannes Hoffart | Gerhard Weikum
Proceedings of ACL-2016 System Demonstrations

2013

HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
Mohamed Amir Yosef | Sandro Bauer | Johannes Hoffart | Marc Spaniol | Gerhard Weikum
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

HYENA: Hierarchical Type Classification for Entity Names
Mohamed Amir Yosef | Sandro Bauer | Johannes Hoffart | Marc Spaniol | Gerhard Weikum
Proceedings of COLING 2012: Posters

2011

Robust Disambiguation of Named Entities in Text
Johannes Hoffart | Mohamed Amir Yosef | Ilaria Bordino | Hagen Fürstenau | Manfred Pinkal | Marc Spaniol | Bilyana Taneva | Stefan Thater | Gerhard Weikum
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing