Miriam Butt


2021

Is that really a question? Going beyond factoid questions in NLP
Aikaterini-Lida Kalouli | Rebecca Kehlbeck | Rita Sevastjanova | Oliver Deussen | Daniel Keim | Miriam Butt
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

Research in NLP has mainly focused on factoid questions, with the goal of finding quick and reliable ways of matching a query to an answer. However, human discourse involves more than that: it contains non-canonical questions deployed to achieve specific communicative goals. In this paper, we investigate this under-studied aspect of NLP by introducing a targeted task, creating an appropriate corpus for the task and providing baseline models of diverse nature. With this, we are also able to generate useful insights on the task and open the way for future research in this direction.

2020

Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias
Christin Beck | Hannah Booth | Mennatallah El-Assady | Miriam Butt
Proceedings of the 14th Linguistic Annotation Workshop

The development of linguistic corpora is fraught with various problems of annotation and representation. These constitute a very real challenge for the development and use of annotated corpora, but as yet not much literature exists on how to address the underlying problems. In this paper, we identify and discuss five sources of representation problems, which are independent though interrelated: ambiguity, variation, uncertainty, error and bias. We outline and characterize these sources, discussing how their improper treatment can have stark consequences for research outcomes. Finally, we discuss how an adequate treatment can inform corpus-related linguistic research, both computational and theoretical, improving the reliability of research results and NLP models, as well as informing the more general reproducibility issue.

Dependency Parsing for Urdu: Resources, Conversions and Learning
Toqeer Ehsan | Miriam Butt
Proceedings of the 12th Language Resources and Evaluation Conference

This paper adds to the available resources for the under-resourced language Urdu by converting different types of existing treebanks for Urdu into a common format that is based on Universal Dependencies. We present comparative results for training two dependency parsers, the MaltParser and a transition-based BiLSTM parser, on this new resource. The BiLSTM parser incorporates word embeddings, which improve the parsing results significantly. The BiLSTM parser outperforms the MaltParser with a UAS of 89.6 and an LAS of 84.2 with respect to our standardized treebank resource.
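
The UAS and LAS figures above are standard dependency-parsing metrics: the unlabeled attachment score counts tokens whose predicted head is correct, and the labeled attachment score additionally requires the correct dependency label. The following is a minimal sketch of how such scores are commonly computed. It is not the evaluation code used in the paper, and the function name `attachment_scores` and the toy sentence are assumptions made for illustration.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
# Head index 0 conventionally denotes the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned token sequences.

    Hypothetical helper for illustration; assumes gold and predicted
    analyses cover exactly the same tokens in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")

    # UAS: the head matches, the label is ignored.
    head_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, predicted)
                    if g_head == p_head)
    # LAS: both head and label match.
    full_hits = sum(1 for g, p in zip(gold, predicted) if g == p)

    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy four-token sentence (invented data, not from the Urdu treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")
```

In the toy example, three of the four heads are correct and two of the four tokens have both head and label correct, so LAS can never exceed UAS.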

2019

Complex Predicates and Multidimensionality in Grammar
Miriam Butt
Linguistic Issues in Language Technology, Volume 17, 2019

This paper contributes to the on-going discussion of how best to analyze and handle complex predicate formations, commenting in particular on the properties of Hindi N-V complex predicates as set out by Vaidya et al. (2019). I highlight features of existing LFG analyses and focus in particular on the modular architecture of LFG, its attendant multidimensional lexicon and the analytic consequences which follow from this. I point out where the previously existing LFG proposals have been misunderstood as viewed from the lens of theories such as LTAG and HPSG, which assume a very different architectural set-up, and provide a comparative discussion of the issues.

Using Meta-Morph Rules to develop Morphological Analysers: A case study concerning Tamil
Kengatharaiyer Sarveswaran | Gihan Dias | Miriam Butt
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

This paper describes a new and larger coverage Finite-State Morphological Analyser (FSM) and Generator for the Dravidian language Tamil. The FSM has been developed in the context of computational grammar engineering, adhering to the standards of the ParGram effort. Tamil is a morphologically rich language and the interaction between linguistic analysis and formal implementation is complex, resulting in a challenging task. In order to allow the development of the FSM to focus more on the linguistic analysis and less on the formal details, we have developed a system of meta-morph(ology) rules along with a script which translates these rules into FSM processable representations. The introduction of meta-morph rules makes it possible for computationally naive linguists to interact with the system and to expand it in future work. We found that the meta-morph rules help to express linguistic generalisations and reduce the manual effort of writing lexical classes for morphological analysis. Our Tamil FSM currently handles mainly the inflectional morphology of 3,300 verb roots and their 260 forms. Further, it also has a lexicon of approximately 100,000 nouns along with a guesser to handle out-of-vocabulary items. Although the Tamil FSM was primarily developed to be part of a computational grammar, it can also be used as a web or stand-alone application for other NLP tasks, as per general ParGram practice.

ParHistVis: Visualization of Parallel Multilingual Historical Data
Aikaterini-Lida Kalouli | Rebecca Kehlbeck | Rita Sevastjanova | Katharina Kaiser | Georg A. Kaiser | Miriam Butt
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

The study of language change through parallel corpora can be advantageous for the analysis of complex interactions between time, text domain and language. Often, those advantages cannot be fully exploited due to the sparse but high-dimensional nature of such historical data. To tackle this challenge, we introduce ParHistVis: a novel, free, easy-to-use, interactive visualization tool for parallel, multilingual, diachronic and synchronic linguistic data. We illustrate the suitability of the components of the tool based on a use case of word order change in Romance wh-interrogatives.

Visualizing Linguistic Change as Dimension Interactions
Christin Schätzle | Frederik L. Dennig | Michael Blumenschein | Daniel A. Keim | Miriam Butt
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Historical change typically is the result of complex interactions between several linguistic factors. Identifying the relevant factors and understanding how they interact across the temporal dimension is the core remit of historical linguistics. With respect to corpus work, this entails a separate annotation, extraction and painstaking pair-wise comparison of the relevant bits of information. This paper presents a significant extension of HistoBankVis, a multilayer visualization system which allows a fast and interactive exploration of complex linguistic data. Linguistic factors can be understood as data dimensions which show complex interrelationships. We model these relationships with the Parallel Sets technique. We demonstrate the powerful potential of this technique by applying the system to understanding the interaction of case, grammatical relations and word order in the history of Icelandic.

lingvis.io - A Linguistic Visual Analytics Framework
Mennatallah El-Assady | Wolfgang Jentner | Fabian Sperrle | Rita Sevastjanova | Annette Hautli-Janisz | Miriam Butt | Daniel Keim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present a modular framework for the rapid-prototyping of linguistic, web-based, visual analytics applications. Our framework gives developers access to a rich set of machine learning and natural language processing steps, through encapsulating them into micro-services and combining them into a computational pipeline. This processing pipeline is auto-configured based on the requirements of the visualization front-end, making the linguistic processing and the visualization design detached, independent development tasks. This paper describes the constellation and modality of our framework, which continues to support the efficient development of various human-in-the-loop, linguistic visual analytics research techniques and applications.

2018

A Multilingual Approach to Question Classification
Aikaterini-Lida Kalouli | Katharina Kaiser | Annette Hautli-Janisz | Georg A. Kaiser | Miriam Butt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

HistoBankVis: Detecting Language Change via Data Visualization
Christin Schätzle | Michael Hund | Frederik Dennig | Miriam Butt | Daniel Keim
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Interactive Visual Analysis of Transcribed Multi-Party Discourse
Mennatallah El-Assady | Annette Hautli-Janisz | Valentin Gold | Miriam Butt | Katharina Holzinger | Daniel Keim
Proceedings of ACL 2017, System Demonstrations

2015

Self Organizing Maps for the Visual Analysis of Pitch Contours
Dominik Sacha | Yuki Asano | Christian Rohrdantz | Felix Hamborg | Daniel Keim | Bettina Braun | Miriam Butt
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

The CLE Urdu POS Tagset
Saba Urooj | Sarmad Hussain | Asad Mustafa | Rahila Parveen | Farah Adeeba | Tafseer Ahmed Khan | Miriam Butt | Annette Hautli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper presents a design schema and details of a new Urdu POS tagset. The tagset was designed in response to challenges encountered in working with existing tagsets for Urdu. It uses tags that judiciously incorporate information about special morpho-syntactic categories found in Urdu. With respect to the overall naming schema and the basic divisions, the tagset draws on the Penn Treebank and a Common Tagset for Indian Languages. The resulting CLE Urdu POS Tagset consists of 12 major categories with subdivisions, resulting in 32 tags. The tagset has been used to tag 100k words of the CLE Urdu Digest Corpus, giving a tagging accuracy of 96.8%.

Automatic Detection of Causal Relations in German Multilogs
Tina Bögel | Annette Hautli-Janisz | Sebastian Sulger | Miriam Butt
Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL)

2013

ParGramBank: The ParGram Parallel Treebank
Sebastian Sulger | Miriam Butt | Tracy Holloway King | Paul Meurer | Tibor Laczkó | György Rákosi | Cheikh Bamba Dione | Helge Dyvik | Victoria Rosén | Koenraad De Smedt | Agnieszka Patejuk | Özlem Çetinoğlu | I Wayan Arka | Meladel Mistica
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Miriam Butt | Sarmad Hussain
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

Identifying Urdu Complex Predication via Bigram Extraction
Miriam Butt | Tina Bögel | Annette Hautli | Sebastian Sulger | Tafseer Ahmed
Proceedings of COLING 2012

Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH
Miriam Butt | Sheelagh Carpendale | Gerald Penn | Jelena Prokić | Michael Cysouw
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Introduction
Miriam Butt | Jelena Prokić | Thomas Mayer | Michael Cysouw
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Lexical Semantics and Distribution of Suffixes - A Visual Analysis
Christian Rohrdantz | Andreas Niekler | Annette Hautli | Miriam Butt | Daniel A. Keim
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

A Reference Dependency Bank for Analyzing Complex Predicates
Tafseer Ahmed | Miriam Butt | Annette Hautli | Sebastian Sulger
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

When dealing with languages of South Asia from an NLP perspective, a problem that repeatedly crops up is the treatment of complex predicates. This paper presents a first approach to the analysis of complex predicates (CPs) in the context of dependency bank development. The efforts originate in theoretical work on CPs done within Lexical-Functional Grammar (LFG), but are intended to provide a guideline for analyzing different types of CPs in an independent framework. Despite the fact that we focus on CPs in Hindi and Urdu, the design of the dependencies is kept general enough to account for CP constructions across languages.

2011

Discovering Semantic Classes for Urdu N-V Complex Predicates
Tafseer Ahmed | Miriam Butt
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

Towards a Computational Semantic Analyzer for Urdu
Annette Hautli | Miriam Butt
Proceedings of the 9th Workshop on Asian Language Resources

Towards Tracking Semantic Change by Visual Analytics
Christian Rohrdantz | Annette Hautli | Thomas Mayer | Miriam Butt | Daniel A. Keim | Frans Plank
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Transliterating Urdu for a Broad-Coverage Urdu/Hindi LFG Grammar
Muhammad Kamran Malik | Tafseer Ahmed | Sebastian Sulger | Tina Bögel | Atif Gulzar | Ghulam Raza | Sarmad Hussain | Miriam Butt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a system for transliterating the Arabic-based script of Urdu to a Roman transliteration scheme. The system is integrated into a larger system consisting of a morphology module, implemented via finite state technologies, and a computational LFG grammar of Urdu that was developed with the grammar development platform XLE (Crouch et al. 2008). Our long-term goal is to handle Hindi alongside Urdu; the two languages are very similar with respect to syntax and lexicon and hence, one grammar can be used to cover both languages. However, they are not similar concerning the script -- Hindi is written in Devanagari, while Urdu uses an Arabic-based script. By abstracting away to a common Roman transliteration scheme in the respective transliterators, our system can be enabled to handle both languages in parallel. In this paper, we discuss the pipeline architecture of the Urdu-Roman transliterator, mention several linguistic and orthographic issues and present the integration of the transliterator into the LFG parsing system.

Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint
Thomas Mayer | Christian Rohrdantz | Frans Plank | Peter Bak | Miriam Butt | Daniel A. Keim
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

2002

Urdu and the Parallel Grammar Project
Miriam Butt | Tracy Holloway King
COLING-02: The 3rd Workshop on Asian Language Resources and International Standardization

The Parallel Grammar Project
Miriam Butt | Helge Dyvik | Tracy Holloway King | Hiroshi Masuichi | Christian Rohrer
COLING-02: Grammar Engineering and Evaluation

1996

Syntactic Analyses for Parallel Grammars: Auxiliaries and Genitive NPs
Miriam Butt | Christian Fortmann | Christian Rohrer
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics