A Multilingual Information Extraction Pipeline for Investigative Journalism

Gregor Wiedemann, Seid Muhie Yimam, Chris Biemann


Abstract
We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g., Freedom of Information Act requests, or from unofficial data leaks.
Anthology ID:
D18-2014
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2018
Address:
Brussels, Belgium
Editors:
Eduardo Blanco, Wei Lu
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
78–83
URL:
https://aclanthology.org/D18-2014
DOI:
10.18653/v1/D18-2014
Cite (ACL):
Gregor Wiedemann, Seid Muhie Yimam, and Chris Biemann. 2018. A Multilingual Information Extraction Pipeline for Investigative Journalism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 78–83, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
A Multilingual Information Extraction Pipeline for Investigative Journalism (Wiedemann et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-2014.pdf
Data
Polyglot-NER