Samuel Broscheit


pdf bib
Unsupervised Multi-View Post-OCR Error Correction With Language Models
Harsh Gupta | Luciano Del Corro | Samuel Broscheit | Johannes Hoffart | Eliot Brenner
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We investigate post-OCR correction in a setting where we have access to different OCR views of the same document. The goal of this study is to understand if a pretrained language model (LM) can be used in an unsupervised way to reconcile the different OCR views such that their combination contains fewer errors than each individual view. This approach is motivated by scenarios in which unconstrained text generation for error correction is too risky. We evaluated different pretrained LMs on two datasets and found significant gains in realistic scenarios with up to 15% WER improvement over the best OCR view. We also show the importance of domain adaptation for post-OCR correction on out-of-domain documents.


pdf bib
LibKGE - A knowledge graph embedding library for reproducible research
Samuel Broscheit | Daniel Ruffinelli | Adrian Kochsiek | Patrick Betz | Rainer Gemulla
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

LibKGE ( ) is an open-source PyTorch-based library for training, hyperparameter optimization, and evaluation of knowledge graph embedding models for link prediction. The key goals of LibKGE are to enable reproducible research, to provide a framework for comprehensive experimental studies, and to facilitate analyzing the contributions of individual components of training methods, model architectures, and evaluation methods. LibKGE is highly configurable and every experiment can be fully reproduced with a single configuration file. Individual components are decoupled to the extent possible so that they can be mixed and matched with each other. Implementations in LibKGE aim to be as efficient as possible without leaving the scope of Python/Numpy/PyTorch. A comprehensive logging mechanism and tooling facilitates in-depth analysis. LibKGE provides implementations of common knowledge graph embedding models and training methods, and new ones can be easily added. A comparative study (Ruffinelli et al., 2020) showed that LibKGE reaches competitive to state-of-the-art performance for many models with a modest amount of automatic hyperparameter tuning.

pdf bib
Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction
Samuel Broscheit | Kiril Gashteovski | Yanjie Wang | Rainer Gemulla
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Open Information Extraction systems extract (“subject text”, “relation text”, “object text”) triples from raw text. Some triples are textual versions of facts, i.e., non-canonicalized mentions of entities and relations. In this paper, we investigate whether it is possible to infer new facts directly from the open knowledge graph without any canonicalization or any supervision from curated knowledge. For this purpose, we propose the open link prediction task,i.e., predicting test facts by completing (“subject text”, “relation text”, ?) questions. An evaluation in such a setup raises the question if a correct prediction is actually a new fact that was induced by reasoning over the open knowledge graph or if it can be trivially explained. For example, facts can appear in different paraphrased textual variants, which can lead to test leakage. To this end, we propose an evaluation protocol and a methodology for creating the open link prediction benchmark OlpBench. We performed experiments with a prototypical knowledge graph embedding model for openlink prediction. While the task is very challenging, our results suggests that it is possible to predict genuinely new facts, which can not be trivially explained.


pdf bib
On Evaluating Embedding Models for Knowledge Base Completion
Yanjie Wang | Daniel Ruffinelli | Rainer Gemulla | Samuel Broscheit | Christian Meilicke
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Knowledge graph embedding models have recently received significant attention in the literature. These models learn latent semantic representations for the entities and relations in a given knowledge base; the representations can be used to infer missing knowledge. In this paper, we study the question of how well recent embedding models perform for the task of knowledge base completion, i.e., the task of inferring new facts from an incomplete knowledge base. We argue that the entity ranking protocol, which is currently used to evaluate knowledge graph embedding models, is not suitable to answer this question since only a subset of the model predictions are evaluated. We propose an alternative entity-pair ranking protocol that considers all model predictions as a whole and is thus more suitable to the task. We conducted an experimental study on standard datasets and found that the performance of popular embeddings models was unsatisfactory under the new protocol, even on datasets that are generally considered to be too easy. Moreover, we found that a simple rule-based model often provided superior performance. Our findings suggest that there is a need for more research into embedding models as well as their training strategies for the task of knowledge base completion.

pdf bib
Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking
Samuel Broscheit
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

A typical architecture for end-to-end entity linking systems consists of three steps: mention detection, candidate generation and entity disambiguation. In this study we investigate the following questions: (a) Can all those steps be learned jointly with a model for contextualized text-representations, i.e. BERT? (b) How much entity knowledge is already contained in pretrained BERT? (c) Does additional entity knowledge improve BERT’s performance in downstream tasks? To this end we propose an extreme simplification of the entity linking setup that works surprisingly well: simply cast it as a per token classification over the entire entity vocabulary (over 700K classes in our case). We show on an entity linking benchmark that (i) this model improves the entity representations over plain BERT, (ii) that it outperforms entity linking architectures that optimize the tasks separately and (iii) that it only comes second to the current state-of-the-art that does mention detection and entity disambiguation jointly. Additionally, we investigate the usefulness of entity-aware token-representations in the text-understanding benchmark GLUE, as well as the question answering benchmarks SQUAD~V2 and SWAG and also the EN-DE WMT14 machine translation benchmark. To our surprise, we find that most of those benchmarks do not benefit from additional entity knowledge, except for a task with very small training data, the RTE task in GLUE, which improves by 2%.


pdf bib
A Neural Autoencoder Approach for Document Ranking and Query Refinement in Pharmacogenomic Information Retrieval
Jonas Pfeiffer | Samuel Broscheit | Rainer Gemulla | Mathias Göschl
Proceedings of the BioNLP 2018 workshop

In this study, we investigate learning-to-rank and query refinement approaches for information retrieval in the pharmacogenomic domain. The goal is to improve the information retrieval process of biomedical curators, who manually build knowledge bases for personalized medicine. We study how to exploit the relationships between genes, variants, drugs, diseases and outcomes as features for document ranking and query refinement. For a supervised approach, we are faced with a small amount of annotated data and a large amount of unannotated data. Therefore, we explore ways to use a neural document auto-encoder in a semi-supervised approach. We show that a combination of established algorithms, feature-engineering and a neural auto-encoder model yield promising results in this setting.

pdf bib
Learning Distributional Token Representations from Visual Features
Samuel Broscheit
Proceedings of The Third Workshop on Representation Learning for NLP

In this study, we compare token representations constructed from visual features (i.e., pixels) with standard lookup-based embeddings. Our goal is to gain insight about the challenges of encoding a text representation from low-level features, e.g. from characters or pixels. We focus on Chinese, which—as a logographic language—has properties that make a representation via visual features challenging and interesting. To train and evaluate different models for the token representation, we chose the task of character-based neural machine translation (NMT) from Chinese to English. We found that a token representation computed only from visual features can achieve competitive results to lookup embeddings. However, we also show different strengths and weaknesses in the models’ performance in a part-of-speech tagging task and also a semantic similarity task. In summary, we show that it is possible to achieve a text representation only from pixels. We hope that this is a useful stepping stone for future studies that exclusively rely on visual input, or aim at exploiting visual features of written language.


pdf bib
A Multigraph Model for Coreference Resolution
Sebastian Martschat | Jie Cai | Samuel Broscheit | Éva Mújdricza-Maydt | Michael Strube
Joint Conference on EMNLP and CoNLL - Shared Task


pdf bib
BART: A Multilingual Anaphora Resolution System
Samuel Broscheit | Massimo Poesio | Simone Paolo Ponzetto | Kepa Joseba Rodriguez | Lorenza Romano | Olga Uryupina | Yannick Versley | Roberto Zanoli
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Extending BART to Provide a Coreference Resolution System for German
Samuel Broscheit | Simone Paolo Ponzetto | Yannick Versley | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a flexible toolkit-based approach to automatic coreference resolution on German text. We start with our previous work aimed at reimplementing the system from Soon et al. (2001) for English, and extend it to duplicate a version of the state-of-the-art proposal from Klenner and Ailloud (2009). Evaluation performed on a benchmarking dataset, namely the TueBa-D/Z corpus (Hinrichs et al., 2005b), shows that machine learning based coreference resolution can be robustly performed in a language other than English.