Stephan Oepen - ACL Anthology

Stephan Oepen

2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Proceedings of the 18th International Conference on Parsing Technologies (IWPT, SyntaxFest 2025)
Kenji Sagae | Stephan Oepen
Proceedings of the 18th International Conference on Parsing Technologies (IWPT, SyntaxFest 2025)

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. When evaluated on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

Small Languages, Big Models: A Study of Continual Training on Languages of Norway
David Samuel | Vladislav Mikhailov | Erik Velldal | Lilja Øvrelid | Lucas Georges Gabriel Charpentier | Andrey Kutuzov | Stephan Oepen
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

2024

Argument Sharing in Meaning Representation Parsing
Maja Buljan | Stephan Oepen | Lilja Øvrelid
Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024

We present a contrastive study of argument sharing across three graph-based meaning representation frameworks, where semantically shared arguments manifest as reentrant graph nodes. For a state-of-the-art graph parser, we observe how parser performance – in terms of output quality – covaries with overall graph complexity, on the one hand, and presence of different types of reentrancies, on the other hand. We identify common linguistic phenomena that give rise to shared arguments, and therefore node reentrancies, through a small-case and partially automated annotation study and parallel error analysis of actual parser outputs. Our results provide new insights into the distribution of different types of reentrancies in meaning representation graphs for three distinct frameworks, as well as on the effects that these structures have on parser performance, thus suggesting both novel cross-framework generalisations as well as avenues for focussed parser development.

A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert | Graeme Nail | Nikolay Arefyev | Marta Bañón | Jelmer van der Linde | Shaoxiong Ji | Jaume Zaragoza-Bernabeu | Mikko Aulamo | Gema Ramírez-Sánchez | Andrey Kutuzov | Sampo Pyysalo | Stephan Oepen | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

2022

Direct parsing to sentiment graphs
David Samuel | Jeremy Barnes | Robin Kurtz | Stephan Oepen | Lilja Øvrelid | Erik Velldal
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper demonstrates how a graph-based semantic parser can be applied to the task of structured sentiment analysis, directly predicting sentiment graphs from text. We advance the state of the art on 4 out of 5 standard benchmark sets. We release the source code, models and predictions.

2021

Structured Sentiment Analysis as Dependency Graph Parsing
Jeremy Barnes | Robin Kurtz | Stephan Oepen | Lilja Øvrelid | Erik Velldal
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Structured sentiment analysis attempts to extract full opinion tuples from a text, but over time this task has been subdivided into smaller and smaller sub-tasks, e.g., target extraction or targeted polarity classification. We argue that this division has become counterproductive and propose a new unified framework to remedy the situation. We cast the structured sentiment problem as dependency graph parsing, where the nodes are spans of sentiment holders, targets and expressions, and the arcs are the relations between them. We perform experiments on five datasets in four languages (English, Norwegian, Basque, and Catalan) and show that this approach leads to strong improvements over state-of-the-art baselines. Our analysis shows that refining the sentiment graphs with syntactic dependency information further improves results.

Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)
Stephan Oepen | Kenji Sagae | Reut Tsarfaty | Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

Large-Scale Contextualised Language Modelling for Norwegian
Andrey Kutuzov | Jeremy Barnes | Erik Velldal | Lilja Øvrelid | Stephan Oepen
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We present the ongoing NorLM initiative to support the creation and use of very large contextualised language models for Norwegian (and in principle other Nordic languages), including a ready-to-use software environment, as well as an experience report for data preparation and training. This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks. In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian. For additional background and access to the data, models, and software, please see: http://norlm.nlpl.eu

2020

Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O'Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O’Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

DRS at MRP 2020: Dressing up Discourse Representation Structures as Graphs
Lasha Abzianidze | Johan Bos | Stephan Oepen
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

Discourse Representation Theory (DRT) is a formal account for representing the meaning of natural language discourse. Meaning in DRT is modeled via a Discourse Representation Structure (DRS), a meaning representation with a model-theoretic interpretation, which is usually depicted as nested boxes. In contrast, a directed labeled graph is a common data structure used to encode semantics of natural language texts. The paper describes the procedure of dressing up DRSs as directed labeled graphs to include DRT as a new framework in the 2020 shared task on Cross-Framework and Cross-Lingual Meaning Representation Parsing. Since one of the goals of the shared task is to encourage unified models for several semantic graph frameworks, the conversion procedure was biased towards making the DRT graph framework somewhat similar to other graph-based meaning representation frameworks.

Proceedings of the Second International Workshop on Designing Meaning Representations
Nianwen Xue | Johan Bos | William Croft | Jan Hajič | Chu-Ren Huang | Stephan Oepen | Martha Palmer | James Pustejovsky
Proceedings of the Second International Workshop on Designing Meaning Representations

Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Yuji Matsumoto | Stephan Oepen | Kenji Sagae | Djamé Seddah | Weiwei Sun | Anders Søgaard | Reut Tsarfaty | Dan Zeman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

End-to-End Negation Resolution as Graph Parsing
Robin Kurtz | Stephan Oepen | Marco Kuhlmann
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

We present a neural end-to-end architecture for negation resolution based on a formulation of the task as a graph parsing problem. Our approach allows for the straightforward inclusion of many types of graph-structured features without the need for representation-specific heuristics. In our experiments, we specifically gauge the usefulness of syntactic information for negation resolution. Despite the conceptual simplicity of our architecture, we achieve state-of-the-art results on the Conan Doyle benchmark dataset, including a new top result for our best model.

A Tale of Three Parsers: Towards Diagnostic Evaluation for Meaning Representation Parsing
Maja Buljan | Joakim Nivre | Stephan Oepen | Lilja Øvrelid
Proceedings of the Twelfth Language Resources and Evaluation Conference

We discuss methodological choices in contrastive and diagnostic evaluation in meaning representation parsing, i.e. mapping from natural language utterances to graph-based encodings of their semantic structure. Drawing inspiration from earlier work in syntactic dependency parsing, we transfer and refine several quantitative diagnosis techniques for use in the context of the 2019 shared task on Meaning Representation Parsing (MRP). As in parsing proper, moving evaluation from simple rooted trees to general graphs brings along its own range of challenges. Specifically, we seek to begin to shed light on relative strengths and weaknesses in different broad families of parsing techniques. In addition to these theoretical reflections, we conduct a pilot experiment on a selection of top-performing MRP systems and one of the five meaning representation frameworks in the shared task. Empirical results suggest that the proposed methodology can be meaningfully applied to parsing into graph-structured target representations, uncovering hitherto unknown properties of the different systems that can inform future development and cross-fertilization across approaches.

2019

Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
Stephan Oepen | Omri Abend | Jan Hajič | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

MRP 2019: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Jan Hajič | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue | Jayeol Chun | Milan Straka | Zdeňka Urešová
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

The ERG at MRP 2019: Radically Compositional Semantic Dependencies
Stephan Oepen | Dan Flickinger
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

The English Resource Grammar (ERG) is a broad-coverage computational grammar of English that outputs underspecified logical-form representations of meaning in a framework dubbed English Resource Semantics (ERS). Two of the target representations in the 2019 Shared Task on Cross-Framework Meaning Representation Parsing (MRP 2019) derive graph-based simplifications of ERS, viz. Elementary Dependency Structures (EDS) and DELPH-IN MRS Bi-Lexical Dependencies (DM). As a point of reference outside the official MRP competition, we parsed the evaluation strings using the ERG and converted the resulting meaning representations to EDS and DM. These graphs yield higher evaluation scores than the purely data-driven parsers in the actual shared task, suggesting that the general-purpose linguistic knowledge about English grammar encoded in the ERG can add value when parsing into these meaning representations.

Graph-Based Meaning Representations: Design and Processing
Alexander Koller | Stephan Oepen | Weiwei Sun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

This tutorial is on representing and processing sentence meaning in the form of labeled directed graphs. The tutorial will (a) briefly review relevant background in formal and linguistic semantics; (b) semi-formally define a unified abstract view on different flavors of semantic graphs and associated terminology; (c) survey common frameworks for graph-based meaning representation and available graph banks; and (d) offer a technical overview of a representative selection of different parsing approaches.

Proceedings of the First International Workshop on Designing Meaning Representations
Nianwen Xue | William Croft | Jan Hajič | Chu-Ren Huang | Stephan Oepen | Martha Palmer | James Pustejovsky
Proceedings of the First International Workshop on Designing Meaning Representations

Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing
Joakim Nivre | Leon Derczynski | Filip Ginter | Bjørn Lindi | Stephan Oepen | Anders Søgaard | Jörg Tiedemann
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)
Marie Candito | Kilian Evang | Stephan Oepen | Djamé Seddah
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

Transfer and Multi-Task Learning for Noun–Noun Compound Interpretation
Murhaf Fares | Stephan Oepen | Erik Velldal
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we empirically evaluate the utility of transfer and multi-task learning on a challenging semantic classification task: semantic interpretation of noun–noun compounds. Through a comprehensive series of experiments and in-depth error analysis, we show that transfer learning via parameter initialization and multi-task learning via parameter sharing can help a neural classification model generalize over a highly skewed distribution of relations. Further, we demonstrate how dual annotation with two distinct sets of relations over the same set of compounds can be exploited to improve the overall accuracy of a neural classifier and its F1 scores on the less frequent, but more difficult relations.

The 2018 Shared Task on Extrinsic Parser Evaluation: On the Downstream Utility of English Universal Dependency Parsers
Murhaf Fares | Stephan Oepen | Lilja Øvrelid | Jari Björne | Richard Johansson
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We summarize empirical results and tentative conclusions from the Second Extrinsic Parser Evaluation Initiative (EPE 2018). We review the basic task setup, downstream applications involved, and end-to-end results for seventeen participating teams. Based on in-depth quantitative and qualitative analysis, we correlate intrinsic evaluation results at different layers of morpho-syntactic analysis with observed downstream behavior.

2017

Word vectors, reuse, and replicability: Towards a community repository of large-text resources
Murhaf Fares | Andrey Kutuzov | Stephan Oepen | Erik Velldal
Proceedings of the 21st Nordic Conference on Computational Linguistics

Representation and Interchange of Linguistic Annotation. An In-Depth, Side-by-Side Comparison of Three Designs
Richard Eckart de Castilho | Nancy Ide | Emanuele Lapponi | Stephan Oepen | Keith Suderman | Erik Velldal | Marc Verhagen
Proceedings of the 11th Linguistic Annotation Workshop

For decades, most self-respecting linguistic engineering initiatives have designed and implemented custom representations for various layers of, for example, morphological, syntactic, and semantic analysis. Despite occasional efforts at harmonization or even standardization, our field today is blessed with a multitude of ways of encoding and exchanging linguistic annotations of these types, both at the levels of ‘abstract syntax’, naming choices, and of course file formats. To a large degree, it is possible to work within and across design plurality by conversion, and often there may be good reasons for divergent design reflecting differences in use. However, it is likely that some abstract commonalities across choices of representation are obscured by more superficial differences, and conversely there is no obvious procedure to tease apart what actually constitute contentful vs. mere technical divergences. In this study, we seek to conceptually align three representations for common types of morpho-syntactic analysis, pinpoint what in our view constitute contentful differences, and reflect on the underlying principles and specific requirements that led to individual choices. We expect that a more in-depth understanding of these choices across designs may lead to increased harmonization, or at least to more informed design of future representations.

2016

Squibs: Towards a Catalogue of Linguistic Graph Banks
Marco Kuhlmann | Stephan Oepen
Computational Linguistics, Volume 42, Issue 4 - December 2016

OPT: Oslo–Potsdam–Teesside. Pipelining Rules, Rankers, and Classifier Ensembles for Shallow Discourse Parsing
Stephan Oepen | Jonathon Read | Tatjana Scheffler | Uladzimir Sidarenka | Manfred Stede | Erik Velldal | Lilja Øvrelid
Proceedings of the CoNLL-16 shared task

Towards Comparability of Linguistic Graph Banks for Semantic Parsing
Stephan Oepen | Marco Kuhlmann | Yusuke Miyao | Daniel Zeman | Silvie Cinková | Dan Flickinger | Jan Hajič | Angelina Ivanova | Zdeňka Urešová
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We announce a new language resource for research on semantic parsing, a large, carefully curated collection of semantic dependency graphs representing multiple linguistic traditions. This resource is called SDP 2016 and provides an update and extension to previous versions used as Semantic Dependency Parsing target representations in the 2014 and 2015 Semantic Evaluation Exercises. For a common core of English text, this third edition comprises semantic dependency graphs from four distinct frameworks, packaged in a unified abstract format and aligned at the sentence and token levels. SDP 2016 is the first general release of this resource and available for licensing from the Linguistic Data Consortium in May 2016. The data is accompanied by an open-source SDP utility toolkit and system results from previous contrastive parsing evaluations against these target representations.

2015

Proceedings of the ACL-IJCNLP 2015 Student Research Workshop
Kuan-Yu Chen | Angelina Ivanova | Ellie Pavlick | Emily Bender | Chin-Yew Lin | Stephan Oepen
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop

SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing
Stephan Oepen | Marco Kuhlmann | Yusuke Miyao | Daniel Zeman | Silvie Cinková | Dan Flickinger | Jan Hajič | Zdeňka Urešová
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Semantic Dependency Graph Parsing Using Tree Approximations
Željko Agić | Alexander Koller | Stephan Oepen
Proceedings of the 11th International Conference on Computational Semantics

Layers of Interpretation: On Grammar and Compositionality
Emily M. Bender | Dan Flickinger | Stephan Oepen | Woodley Packard | Ann Copestake
Proceedings of the 11th International Conference on Computational Semantics

2014

RDF Triple Stores and a Custom SPARQL Front-End for Indexing and Searching (Very) Large Semantic Networks
Milen Kouylekov | Stephan Oepen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

Semantic Technologies for Querying Linguistic Annotations: An Experiment Focusing on Graph-Structured Data
Milen Kouylekov | Stephan Oepen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

With growing interest in the creation and search of linguistic annotations that form general graphs (in contrast to formally simpler, rooted trees), there also is an increased need for infrastructures that support the exploration of such representations, for example logical-form meaning representations or semantic dependency graphs. In this work, we heavily lean on semantic technologies and in particular the data model of the Resource Description Framework (RDF) to represent, store, and efficiently query very large collections of text annotated with graph-structured representations of sentence meaning.

Towards an Encyclopedia of Compositional Semantics: Documenting the Interface of the English Resource Grammar
Dan Flickinger | Emily M. Bender | Stephan Oepen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We motivate and describe the design and development of an emerging encyclopedia of compositional semantics, pursuing three objectives. We first seek to compile a comprehensive catalogue of interoperable semantic analyses, i.e., a precise characterization of meaning representations for a broad range of common semantic phenomena. Second, we operationalize the discovery of semantic phenomena and their definition in terms of what we call their semantic fingerprint, a formal account of the building blocks of meaning representation involved and their configuration. Third, we ground our work in a carefully constructed semantic test suite of minimal exemplars for each phenomenon, along with a ‘target’ fingerprint that enables automated regression testing. We work towards these objectives by codifying and documenting the body of knowledge that has been constructed in a long-term collaborative effort, the development of the LinGO English Resource Grammar. Documentation of its semantic interface is a prerequisite to use by non-experts of the grammar and the analyses it produces, but this effort also advances our own understanding of relevant interactions among phenomena, as well as of areas for future work in the grammar.

Off-Road LAF: Encoding and Processing Annotations in NLP Workflows
Emanuele Lapponi | Erik Velldal | Stephan Oepen | Rune Lain Knudsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Linguistic Annotation Framework (LAF) provides an abstract data model for specifying interchange representations to ensure interoperability among different annotation formats. This paper describes an ongoing effort to adapt the LAF data model as the interchange representation in complex workflows as used in the Language Analysis Portal (LAP), an on-line and large-scale processing service that is developed as part of the Norwegian branch of the Common Language Resources and Technology Infrastructure (CLARIN) initiative. Unlike several related on-line processing environments, which predominantly instantiate a distributed architecture of web services, LAP achieves scalability to potentially very large data volumes through integration with the Norwegian national e-Infrastructure, and in particular job submission to a capacity compute cluster. This setup leads to tighter integration requirements and also calls for efficient, low-overhead communication of (intermediate) processing results with workflows. We meet these demands by coupling the LAF data model with a lean, non-redundant JSON-based interchange format and integration of an agile and performant NoSQL database, allowing parallel access from cluster nodes, as the central repository of linguistic annotation.

Simple Negation Scope Resolution through Deep Parsing: A Semantic Solution to a Semantic Problem
Woodley Packard | Emily M. Bender | Jonathon Read | Stephan Oepen | Rebecca Dridan
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

SemEval 2014 Task 8: Broad-Coverage Semantic Dependency Parsing
Stephan Oepen | Marco Kuhlmann | Yusuke Miyao | Daniel Zeman | Dan Flickinger | Jan Hajič | Angelina Ivanova | Yi Zhang
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

In-House: An Ensemble of Pre-Existing Off-the-Shelf Parsers
Yusuke Miyao | Stephan Oepen | Daniel Zeman
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

Survey on parsing three dependency representations for English
Angelina Ivanova | Stephan Oepen | Lilja Øvrelid
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
Stephan Oepen | Kristin Hagen | Janne Bondi Johannessen
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Tidying up the Basement: A Tale of Large-Scale Parsing on National eInfrastructure
Stephan Oepen
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Simple and Accountable Segmentation of Marked-up Text
Jonathon Read | Rebecca Dridan | Stephan Oepen
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

HPC-ready Language Analysis for Human Beings
Emanuele Lapponi | Erik Velldal | Nikolay A. Vazov | Stephan Oepen
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

On Different Approaches to Syntactic Analysis Into Bi-Lexical Dependencies. An Empirical Comparison of Direct, PCFG-Based, and HPSG-Based Parsers
Angelina Ivanova | Stephan Oepen | Rebecca Dridan | Dan Flickinger | Lilja Øvrelid
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

Document Parsing: Towards Realistic Syntactic Analysis
Rebecca Dridan | Stephan Oepen
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2012

Sentence Boundary Detection: A Long Solved Problem?
Jonathon Read | Rebecca Dridan | Stephan Oepen | Lars Jørgen Solberg
Proceedings of COLING 2012: Posters

Speculation and Negation: Rules, Rankers, and the Role of Syntax
Erik Velldal | Lilja Øvrelid | Jonathon Read | Stephan Oepen
Computational Linguistics, Volume 38, Issue 2 - June 2012

The WeSearch Corpus, Treebank, and Treecache – A Comprehensive Sample of User-Generated Content
Jonathon Read | Dan Flickinger | Rebecca Dridan | Stephan Oepen | Lilja Øvrelid
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We present the format of syntacto-semantic annotations found in this resource and present initial parsing results for these data, as well as some reflections following a first round of treebanking.

Tokenization: Returning to a Long Solved Problem — A Survey, Contrastive Experiment, Recommendations, and Toolkit —
Rebecca Dridan | Stephan Oepen
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

UiO1: Constituent-Based Discriminative Ranking for Negation Resolution
Jonathon Read | Erik Velldal | Lilja Øvrelid | Stephan Oepen
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Towards an ACL Anthology Corpus with Logical Document Structure. An Overview of the ACL 2012 Contributed Task
Ulrich Schäfer | Jonathon Read | Stephan Oepen
Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Towards High-Quality Text Stream Extraction from PDF. Technical Background to the ACL 2012 Contributed Task
Øyvind Raddum Berg | Stephan Oepen | Jonathon Read
Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Who Did What to Whom? A Contrastive Study of Syntacto-Semantic Dependencies
Angelina Ivanova | Stephan Oepen | Lilja Øvrelid | Dan Flickinger
Proceedings of the Sixth Linguistic Annotation Workshop

2011

Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
Emily M. Bender | Dan Flickinger | Stephan Oepen | Yi Zhang
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Treeblazing: Using External Treebanks to Filter Parse Forests for Parse Selection and Treebanking
Andrew MacKinlay | Rebecca Dridan | Dan Flickinger | Stephan Oepen | Timothy Baldwin
Proceedings of 5th International Joint Conference on Natural Language Processing

Parser Evaluation Using Elementary Dependency Matching
Rebecca Dridan | Stephan Oepen
Proceedings of the 12th International Conference on Parsing Technologies

2010

Syntactic Scope Resolution in Uncertainty Analysis
Lilja Øvrelid | Erik Velldal | Stephan Oepen
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

WikiWoods: Syntacto-Semantic Annotation for English Wikipedia
Dan Flickinger | Stephan Oepen | Gisle Ytrestøl
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for English Wikipedia. We sketch an automated processing pipeline to extract relevant textual content from Wikipedia sources, segment documents into sentence-like units, parse and disambiguate using a broad-coverage precision grammar, and support the export of syntactic and semantic information in various formats. The full parsed corpus is accompanied by a subset of Wikipedia articles for which gold-standard annotations in the same format were produced manually. This subset was selected to represent a coherent domain, Wikipedia entries on the broad topic of Natural Language Processing.

Resolving Speculation: MaxEnt Cue Classification and Dependency-Based Scope Rules
Erik Velldal | Lilja Øvrelid | Stephan Oepen
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

2009

Automatic Translation of Norwegian Noun Compounds
Lars Bungum | Stephan Oepen
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Hybrid Multilingual Parsing with HPSG for SRL
Yi Zhang | Rui Wang | Stephan Oepen
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

2008

Some Fine Points of Hybrid Natural Language Parsing
Peter Adolphs | Stephan Oepen | Ulrich Callmeier | Berthold Crysmann | Dan Flickinger | Bernd Kiefer
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Large-scale grammar-based parsing systems nowadays increasingly rely on independently developed, more specialized components for pre-processing their input. However, different tools make conflicting assumptions about very basic properties such as tokenization. To make linguistic annotation gathered in pre-processing available to deep parsing, a hybrid NLP system needs to establish a coherent mapping between the two universes. Our basic assumption is that tokens are best described by attribute value matrices (AVMs) that may be arbitrarily complex. We propose a powerful resource-sensitive rewrite formalism, chart mapping, that allows us to mediate between the token descriptions delivered by shallow pre-processing components and the input expected by the grammar. We furthermore propose a novel way of unknown word treatment where all generic lexical entries are instantiated that are licensed by a particular token AVM. Again, chart mapping is used to give the grammar writer full control as to which items (e.g. native vs. generic lexical items) enter syntactic parsing. We discuss several further uses of the original idea and report on early experiences with the new machinery.

2007

Towards hybrid quality-oriented machine translation – on linguistics and probabilities in MT
Stephan Oepen | Erik Velldal | Jan Tore Lønning | Paul Meurer | Victoria Rosén | Dan Flickinger
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

Exploiting Semantic Information for HPSG Parse Selection
Sanae Fujita | Francis Bond | Stephan Oepen | Takaaki Tanaka
ACL 2007 Workshop on Deep Linguistic Processing

Efficiency in Unification-Based N-Best Parsing
Yi Zhang | Stephan Oepen | John Carroll
Proceedings of the Tenth International Conference on Parsing Technologies

2006

Using a Bi-Lingual Dictionary in Lexical Transfer
Lars Nygaard | Jan Tore Lønning | Torbjørn Nordgård | Stephan Oepen
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

Discriminant-Based MRS Banking
Stephan Oepen | Jan Tore Lønning
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present an approach to discriminant-based MRS banking, i.e. the construction of an annotated corpus where each input item is paired with a logical-form semantics. Semantic annotations are produced by parsing with a broad-coverage precision grammar, followed by manual disambiguation. The selection of the preferred analysis for each item (and hence its semantic form) builds on a notion of semantic discriminants, essentially localized dependencies extracted from a full-fledged, underspecified semantic representation.

Re-Usable Tools for Precision Machine Translation
Jan Tore Lønning | Stephan Oepen
Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions

Statistical Ranking in Tactical Generation
Erik Velldal | Stephan Oepen
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

2005

Holistic regression testing for high-quality MT: some methodological and technological reflections
Stephan Oepen | Helge Dyvik | Dan Flickinger | Jan Tore Lønning | Paul Meurer | Victoria Rosén
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

Maximum Entropy Models for Realization Ranking
Erik Velldal | Stephan Oepen
Proceedings of Machine Translation Summit X: Papers

In this paper we describe and evaluate different statistical models for the task of realization ranking, i.e. the problem of discriminating between competing surface realizations generated for a given input semantics. Three models are trained and tested: an n-gram language model, a discriminative maximum entropy model using structural features, and a combination of these two. Our realization component forms part of a larger, hybrid MT system.

SEM-I Rational MT: Enriching Deep Grammars with a Semantic Interface for Scalable Machine Translation
Dan Flickinger | Jan Tore Lønning | Helge Dyvik | Stephan Oepen | Francis Bond
Proceedings of Machine Translation Summit X: Papers

In the LOGON machine translation system where semantic transfer using Minimal Recursion Semantics is being developed in conjunction with two existing broad-coverage grammars of Norwegian and English, we motivate the use of a grammar-specific semantic interface (SEM-I) to facilitate the construction and maintenance of a scalable translation engine. The SEM-I is a theoretically grounded component of each grammar, capturing several classes of lexical regularities while also serving the crucial engineering function of supplying a reliable and complete specification of the elementary predications the grammar can realize. We make extensive use of underspecification and type hierarchies to maximize generality and precision.

Open Source Machine Translation with DELPH-IN
Francis Bond | Stephan Oepen | Melanie Siegel | Ann Copestake | Dan Flickinger
Workshop on open-source machine translation

High Efficiency Realization for a Wide-Coverage Unification Grammar
John Carroll | Stephan Oepen
Second International Joint Conference on Natural Language Processing: Full Papers

High Precision Treebanking—Blazing Useful Trees Using POS Information
Takaaki Tanaka | Francis Bond | Stephan Oepen | Sanae Fujita
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Som å kapp-ete med trollet? – Towards MRS-based Norwegian-English machine translation
Stephan Oepen | Helge Dyvik | Jan Tore Lønning | Erik Velldal | Dorothee Beerman | John Carroll | Dan Flickinger | Lars Hellan | Janne Bondi Johannessen | Paul Meurer | Torbjørn Nordgård | Victoria Rosén
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

A Lexicon Module for a Grammar Development Environment
Ann Copestake | Fabre Lambeau | Benjamin Waldron | Francis Bond | Dan Flickinger | Stephan Oepen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Road-testing the English Resource Grammar Over the British National Corpus
Timothy Baldwin | Emily M. Bender | Dan Flickinger | Ara Kim | Stephan Oepen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora
Silvia Hansen-Schirra | Stephan Oepen | Hans Uszkoreit
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora

2002

The LinGO Redwoods Treebank: Motivation and Preliminary Applications
Stephan Oepen | Kristina Toutanova | Stuart Shieber | Christopher Manning | Dan Flickinger | Thorsten Brants
COLING 2002: The 17th International Conference on Computational Linguistics: Project Notes

The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-linguistically Consistent Broad-Coverage Precision Grammars
Emily M. Bender | Dan Flickinger | Stephan Oepen
COLING-02: Grammar Engineering and Evaluation

Parallel Distributed Grammar Engineering for Practical Applications
Stephan Oepen | Emily M. Bender | Uli Callmeier | Dan Flickinger | Melanie Siegel
COLING-02: Grammar Engineering and Evaluation

2001

Using an Open-Source Unification-Based System for CL/NLP Teaching
Anne Copestake | John Carroll | Dan Flickinger | Robert Malouf | Stephan Oepen
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

2000

Measure for Measure: Parser Cross-fertilization - Towards Increased Component Comparability and Exchange
Stephan Oepen | Ulrich Callmeier
Proceedings of the Sixth International Workshop on Parsing Technologies

Over the past few years, significant progress has been made in efficient processing with wide-coverage HPSG grammars. HPSG-based parsing systems are now available that can process medium-complexity sentences (of ten to twenty words, say) in average parse times equivalent to real (i.e. human reading) time. A large number of engineering improvements in current HPSG systems were achieved through collaboration of multiple research centers and mutual exchange of experience, encoding techniques, algorithms, and even pieces of software. This article presents an approach to grammar and system engineering, termed competence & performance profiling, that makes systematic experimentation and the precise empirical study of system properties a focal point in development. Adapting the profiling metaphor familiar from software engineering to constraint-based grammars and parsers enables developers to maintain an accurate record of system evolution, identify grammar and system deficiencies quickly, and compare to earlier versions or between different systems. We discuss a number of exemplary problems that motivate the experimental approach, and apply the empirical methodology in a fairly detailed discussion of what was achieved during a development period of three years. Given the collaborative nature in setup, the empirical results we present involve research and achievements of a large group of people.

Ambiguity Packing in Constraint-based Parsing: Practical Results
Stephan Oepen | John Carroll
1st Meeting of the North American Chapter of the Association for Computational Linguistics

Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems
John Carroll | Robert C. Moore | Stephan Oepen
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems

Efficient Large-Scale Parsing – a Survey
John Carroll | Stephan Oepen
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems

Cross-Platform, Cross-Grammar Comparison – Can it be Done?
Ulrich Callmeier | Stephan Oepen
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems

1996

TSNLP - Test Suites for Natural Language Processing
Sabine Lehmann | Stephan Oepen | Sylvie Regnier-Prost | Klaus Netter | Veronika Lux | Judith Klein | Kirsten Falkedal | Frederik Fouvry | Dominique Estival | Eva Dauphin | Herve Compagnion | Judith Baur | Lorna Balkan | Doug Arnold
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics

1994

DISCO: An HPSG-based NLP System and its Application for Appointment Scheduling (Project Note)
Hans Uszkoreit | Rolf Backofen | Stephan Busemann | Abdel Kader Diagne | Elizabeth A. Hinkelman | Walter Kasper | Bernd Kiefer | Hans-Ulrich Krieger | Klaus Netter | Gunter Neumann | Stephan Oepen | Stephen P. Spackman
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

Co-authors

Jonathon Read 9

Emily M. Bender 8

John A. Carroll 8

Jan Tore Lønning 8

Marco Kuhlmann 7

Andrey Kutuzov 7

Angelina Ivanova 6

Ann Copestake 4

Daniel Hershcovich 4

Vladislav Mikhailov 4

Tim O’Gorman 4

Lasha Abzianidze 3

Nikolay Arefyev 3

Jeremy Barnes 3

Marta Bañón 3

Ulrich Callmeier 3

Emanuele Lapponi 3

Sampo Pyysalo 3

Gema Ramírez-Sánchez 3

Victoria Rosén 3

Djamé Seddah 3

Jörg Tiedemann 3

Zdenka Uresova 3

Jaume Zaragoza-Bernabeu 3

Ona de Gibert 3

Timothy Baldwin 2

Laurie Burchell 2

Silvie Cinková 2

William Croft 2

Mariia Fedorova 2

Liane Guillou 2

Jindřich Helcl 2

Erik Henriksson 2

Chu-Ren Huang 2

Janne Bondi Johannessen 2

Alexander Koller 2

Milen Kouylekov 2

Veronika Laippala 2

Bhavitvya Malik 2

Christopher D. Manning 2

Farrokh Mehryary 2

Amanda Myntti 2

Petter Mæhlum 2

Torbjørn Nordgård 2

Dayyán O’Brien 2

Woodley Packard 2

Martha Palmer 2

James Pustejovsky 2

Melanie Siegel 2

Pavel Stepachev 2

Anders Søgaard 2

Takaaki Tanaka 2

Reut Tsarfaty 2

Hans Uszkoreit 2

Peter Adolphs 1

Željko Agić 1

Rolf Backofen 1

Dorothee Beerman 1

Øyvind Raddum Berg 1

Magnus Breder Birkenes 1

Rolv-Arild Braaten 1

Thorsten Brants 1

Svein Arne Brygfjeld 1

Stephan Busemann 1

Uli Callmeier 1

Marie Candito 1

Stephen Clark 1

Herve Compagnion 1

Anne Copestake 1

Berthold Crysmann 1

Javier De La Rosa 1

Leon Derczynski 1

Abdel Kader Diagne 1

Richard Eckart De Castilho 1

Dominique Estival 1

Kirsten Falkedal 1

Hans Christian Farsethås 1

Frederik Fouvry 1

Lucas Georges Gabriel Charpentier 1

Jon Atle Gulla 1

Kristin Hagen 1

Silvia Hansen-Schirra 1

Elizabeth A. Hinkelman 1

Julia Hockenmaier 1

Richard Johansson 1

Aravind Joshi 1

Ronald M. Kaplan 1

Walter Kasper 1

Tracy Holloway King 1

Mateusz Klimaszewski 1

Rune Lain Knudsen 1

Ville Komulainen 1

Hans-Ulrich Krieger 1

Joona Kytöniemi 1

Sandra Kübler 1

Fabre Lambeau 1

Sabine Lehmann 1

Andrew MacKinlay 1

Robert Malouf 1

Yuji Matsumoto 1

Robert C. Moore 1

Aslak Sira Myhre 1

Günter Neumann 1

Ellie Pavlick 1

Sylvie Regnier-Prost 1

Tatjana Scheffler 1

Ulrich Schäfer 1

Stuart M. Shieber 1

Uladzimir Sidarenka 1

Lars Jørgen Solberg 1

Stephen P. Spackman 1

Manfred Stede 1

Keith Suderman 1

Jörg Tidemann 1

Kristina Toutanova 1

Jelmer Van Der Linde 1

Nikolay A. Vazov 1

Marc Verhagen 1

Tereza Vojtěchová 1

Benjamin Waldron 1

Freddy Wetjen 1

Gisle Ytrestøl 1

Josef van Genabith 1

Wilfred Østgulen 1

Venues