Constantine Lignos


ParaNames: A Massively Multilingual Entity Name Corpus
Jonne Sälevä | Constantine Lignos
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

We present ParaNames, a Wikidata-derived multilingual parallel name resource consisting of names for approximately 14 million entities spanning over 400 languages. ParaNames is useful for multilingual language processing, both in defining name translation tasks and as supplementary data for other tasks. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English.

Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling
Elena Álvarez-Mellado | Constantine Lignos
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings—words from one language that are introduced into another without orthographic adaptation—and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.

Toward More Meaningful Resources for Lower-resourced Languages
Constantine Lignos | Nolan Holley | Chester Palen-Michel | Jonne Sälevä
Findings of the Association for Computational Linguistics: ACL 2022

In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two massively multilingual resources used in language technology development, identifying shortcomings that limit their usefulness. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be, requiring non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotations for lower-resourced languages in a thoughtful and ethical way that includes the language speakers as part of the development process. We conclude with recommended guidelines for resource development.


SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation
Chester Palen-Michel | Nolan Holley | Constantine Lignos
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center around transparency regarding how chunks are encoded and scored. We demonstrate that despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can result in changes to scores that are both of noticeable magnitude and statistically significant. We describe SeqScore, which addresses many of the issues that cause replication failures.

MasakhaNER: Named Entity Recognition for African Languages
David Ifeoluwa Adelani | Jade Abbott | Graham Neubig | Daniel D’souza | Julia Kreutzer | Constantine Lignos | Chester Palen-Michel | Happy Buzaaba | Shruti Rijhwani | Sebastian Ruder | Stephen Mayhew | Israel Abebe Azime | Shamsuddeen H. Muhammad | Chris Chinenye Emezue | Joyce Nakatumba-Nabende | Perez Ogayo | Aremu Anuoluwapo | Catherine Gitau | Derguene Mbaye | Jesujoba Alabi | Seid Muhie Yimam | Tajuddeen Rabiu Gwadabe | Ignatius Ezeani | Rubungo Andre Niyongabo | Jonathan Mukiibi | Verrah Otiende | Iroro Orife | Davis David | Samba Ngom | Tosin Adewumi | Paul Rayson | Mofetoluwa Adeyemi | Gerald Muriuki | Emmanuel Anebi | Chiamaka Chukwuneke | Nkiruka Odu | Eric Peter Wairagala | Samuel Oyerinde | Clemencia Siro | Tobius Saul Bateesa | Temilola Oloyede | Yvonne Wambui | Victor Akinode | Deborah Nabagereka | Maurice Katusiime | Ayodele Awokoya | Mouhamadane MBOUP | Dibora Gebreyohannes | Henok Tilaye | Kelechi Nwaike | Degaga Wolde | Abdoulaye Faye | Blessing Sibanda | Orevaoghene Ahia | Bonaventure F. P. Dossou | Kelechi Ogueji | Thierno Ibrahima DIOP | Abdoulaye Diallo | Adewale Akinfaderin | Tendai Marengereke | Salomey Osei
Transactions of the Association for Computational Linguistics, Volume 9

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.

Macro-Average: Rare Types Are Important Too
Thamme Gowda | Weiqiu You | Constantine Lignos | Jonathan May
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric, MacroF1, and study its applicability to MT evaluation. We find that MacroF1 is competitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Further, we show that MacroF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods’ outputs.
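The abstract names MacroF1 as a type-based classifier metric but does not give its formula. As a rough sketch only, assuming each word type is treated as a class, per-type matches are clipped counts, and the per-type F1 scores are averaged without weighting or smoothing, the idea might look like the following. The function name `macro_f1` and these simplifications are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from typing import Sequence


def macro_f1(hyp_tokens: Sequence[str], ref_tokens: Sequence[str]) -> float:
    """Unweighted mean of per-type F1 (illustrative; no smoothing applied)."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    scores = []
    for token_type in set(hyp) | set(ref):
        # A type is matched as many times as it occurs in both sides.
        match = min(hyp[token_type], ref[token_type])
        precision = match / hyp[token_type] if hyp[token_type] else 0.0
        recall = match / ref[token_type] if ref[token_type] else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    # Every type counts equally, so rare types weigh as much as frequent ones.
    return sum(scores) / len(scores) if scores else 0.0


score = macro_f1("the the cat".split(), "the cat sat".split())
```

Because the average runs over types rather than tokens, a missed rare word lowers the score as much as a mishandled frequent one, which is the property the title points to.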

TMR: Evaluating NER Recall on Tough Mentions
Jingxuan Tu | Constantine Lignos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of "tough" mentions: unseen mentions, those whose tokens or token/type combination were not observed in training, and type-confusable mentions, token sequences with multiple entity types in the test data. We demonstrate the usefulness of these metrics by evaluating corpora of English, Spanish, and Dutch using five recent neural architectures. We identify subtle differences between the performance of BERT and Flair on two English NER corpora and identify a weak spot in the performance of current models in Spanish. We conclude that the TMR metrics enable differentiation between otherwise similar-scoring systems and identification of patterns in performance that would go unnoticed from overall precision, recall, and F1.
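To make the unseen-mention subsets concrete, here is a minimal sketch of recall restricted to gold mentions that were not observed in training. The `Mention` type, the function names, and the exact-match criterion (same sentence, span, and type) are assumptions for illustration; the paper's own definitions and tooling may differ.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Mention(NamedTuple):
    sent: int                 # sentence index
    start: int                # first token index
    end: int                  # end token index (exclusive)
    tokens: Tuple[str, ...]   # surface tokens of the mention
    etype: str                # entity type, e.g. "PER"


def recall(gold: List[Mention], predicted: Set[Mention]) -> Optional[float]:
    """Fraction of gold mentions predicted exactly; None if the subset is empty."""
    if not gold:
        return None
    return sum(m in predicted for m in gold) / len(gold)


def unseen_recall(
    train: List[Mention], gold: List[Mention], predicted: Set[Mention]
) -> Dict[str, Optional[float]]:
    seen_tokens = {m.tokens for m in train}
    seen_pairs = {(m.tokens, m.etype) for m in train}
    # Gold mentions whose token sequence never appeared as a training mention.
    unseen_tokens = [m for m in gold if m.tokens not in seen_tokens]
    # Gold mentions whose token/type combination never appeared in training.
    unseen_pairs = [m for m in gold if (m.tokens, m.etype) not in seen_pairs]
    return {
        "all": recall(gold, predicted),
        "unseen_tokens": recall(unseen_tokens, predicted),
        "unseen_pairs": recall(unseen_pairs, predicted),
    }


train = [Mention(0, 0, 1, ("Paris",), "LOC")]
gold = [
    Mention(0, 0, 1, ("Paris",), "LOC"),
    Mention(1, 0, 1, ("Paris",), "PER"),
    Mention(2, 0, 2, ("Ada", "Lovelace"), "PER"),
]
predicted = {gold[0], gold[2]}
result = unseen_recall(train, gold, predicted)
```

In this toy example the system finds the new name "Ada Lovelace" but misses "Paris" used as a person, so recall on unseen token/type pairs is lower than recall on unseen tokens.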

The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation
Jonne Saleva | Constantine Lignos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, and predict that using morphologically-based segmentation methods would lead to better performance in this setting. However, compared to BPE, we find that no consistent and reliable differences emerge between the segmentation methods. While morphologically-based methods outperform BPE in a few cases, what performs best tends to vary across tasks, and the performance of segmentation methods is often statistically indistinguishable.


Effective Architectures for Low Resource Multilingual Named Entity Transliteration
Molly Moran | Constantine Lignos
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

In this paper, we evaluate LSTM, biLSTM, GRU, and Transformer architectures for the task of name transliteration in a many-to-one multilingual paradigm, transliterating from 590 languages to English. We experiment with different encoder-decoder combinations and evaluate them using accuracy, character error rate, and an F-measure based on longest continuous subsequences. We find that using a Transformer for the encoder and decoder performs best, improving accuracy by over 4 points compared to previous work. We explore whether manipulating the source text by adding macrolanguage flag tokens or pre-romanizing source strings can improve performance and find that neither manipulation has a positive effect. Finally, we analyze performance differences between the LSTM and Transformer encoders when using a Transformer decoder and find that the Transformer encoder is better able to handle insertions and substitutions when transliterating.

If You Build Your Own NER Scorer, Non-replicable Results Will Come
Constantine Lignos | Marjan Kamyab
Proceedings of the First Workshop on Insights from Negative Results in NLP

We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing transparency regarding the handling of improper label sequences.
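The abstract does not spell out how the two scorers treat improper label sequences. As a minimal sketch, assuming BIO encoding and two hypothetical repair policies ("begin": an I- tag that does not continue a chunk starts a new one; "discard": such a tag is treated as O), the following shows how the same predicted labels can decode to different chunks. The function name `decode_chunks` and both policy names are illustrative, not taken from either scorer.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def decode_chunks(labels: List[str], repair: str) -> List[Chunk]:
    """Decode BIO labels into chunks under a given repair policy."""
    if repair not in ("begin", "discard"):
        raise ValueError(f"unknown repair policy: {repair}")
    chunks: List[Chunk] = []
    current_type: Optional[str] = None
    start = 0
    for i, label in enumerate(labels):
        if label == "O":
            prefix, etype = "O", None
        else:
            prefix, etype = label.split("-", 1)
        # A proper I- tag extends the open chunk of the same type.
        if prefix == "I" and etype == current_type:
            continue
        # Anything else closes the open chunk, if there is one.
        if current_type is not None:
            chunks.append((current_type, start, i))
            current_type = None
        # B- always opens a chunk; an improper I- opens one only under "begin".
        if prefix == "B" or (prefix == "I" and repair == "begin"):
            current_type, start = etype, i
    if current_type is not None:
        chunks.append((current_type, start, len(labels)))
    return chunks


labels = ["O", "I-PER", "I-PER", "O", "B-LOC", "I-ORG"]
lenient = decode_chunks(labels, "begin")
strict = decode_chunks(labels, "discard")
```

Here the "begin" policy recovers a PER chunk and an ORG chunk that the "discard" policy drops, so precision and recall computed from the two chunk lists can differ even though the model output is identical.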


The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval
Constantine Lignos | Daniel Cohen | Yen-Chieh Lien | Pratik Mehta | W. Bruce Croft | Scott Miller
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

When performing cross-language information retrieval (CLIR) for lower-resourced languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can be best predicted from MT quality. We explore performance at varying amounts of MT training data, byte pair encoding (BPE) merge operations, and across two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.

SARAL: A Low-Resource Cross-Lingual Domain-Focused Information Retrieval System for Effective Rapid Document Triage
Elizabeth Boschee | Joel Barry | Jayadev Billa | Marjorie Freedman | Thamme Gowda | Constantine Lignos | Chester Palen-Michel | Michael Pust | Banriskhem Kayang Khonglah | Srikanth Madikeri | Jonathan May | Scott Miller
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

With the increasing democratization of electronic media, vast information resources are available in less-frequently-taught languages such as Swahili or Somali. That information, which may be crucially important and not available elsewhere, can be difficult for monolingual English speakers to effectively access. In this paper we present an end-to-end cross-lingual information retrieval (CLIR) and summarization system for low-resource languages that 1) enables English speakers to search foreign language repositories of text and audio using English queries, 2) summarizes the retrieved documents in English with respect to a particular information need, and 3) provides complete transcriptions and translations as needed. The SARAL system achieved the top end-to-end performance in the most recent IARPA MATERIAL CLIR+summarization evaluations. Our demonstration system provides end-to-end open query retrieval and summarization capability, and presents the original source text or audio, speech transcription, and machine translation, for two low resource languages.


Modeling Infant Word Segmentation
Constantine Lignos
Proceedings of the Fifteenth Conference on Computational Natural Language Learning


Recession Segmentation: Simpler Online Word Segmentation Using Limited Resources
Constantine Lignos | Charles Yang
Proceedings of the Fourteenth Conference on Computational Natural Language Learning