Constantine Lignos

2025

Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)
Constantine Lignos | Idris Abdulmumin | David Adelani
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Chester Palen-Michel | Maxwell Pickering | Maya Kruse | Jonne Sälevä | Constantine Lignos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.

Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
Jonne Sälevä | Duygu Ataman | Constantine Lignos
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for quantifying the replication uncertainty of various quantities used in leaderboards such as model rankings and pairwise differences between models.
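
As a rough illustration of the resampling idea described in this abstract, the sketch below computes a percentile bootstrap interval for a mean score while resampling both training runs (model-related variation) and test instances (data-related variation). It is not the paper's procedure: the function name `bootstrap_interval`, the input layout, and the choice of a percentile interval are all assumptions made for the example.

```python
import random
from statistics import mean


def bootstrap_interval(scores_by_seed, n_resamples=1000, alpha=0.05, rng_seed=0):
    """Percentile bootstrap interval for the mean score.

    scores_by_seed[s][i] is the score of training run s on test instance i
    (a hypothetical layout chosen for this sketch). Each replicate resamples
    runs and instances with replacement, so both sources of variation
    contribute to the width of the interval.
    """
    rng = random.Random(rng_seed)
    n_seeds = len(scores_by_seed)
    n_items = len(scores_by_seed[0])
    estimates = []
    for _ in range(n_resamples):
        seeds = [rng.randrange(n_seeds) for _ in range(n_seeds)]
        items = [rng.randrange(n_items) for _ in range(n_items)]
        estimates.append(mean(scores_by_seed[s][i] for s in seeds for i in items))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Three runs scored on four test instances (made-up 0/1 scores).
scores = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1]]
print(bootstrap_interval(scores, n_resamples=200))
```

Resampling only the instances, with the run held fixed, would ignore run-to-run variation and would typically give a narrower interval, which is the kind of underestimation the abstract warns about.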

MetaMeme: A Dataset for Meme Template and Meta-Category Classification
Benjamin Lambright | Jordan Youner | Constantine Lignos
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

This paper introduces a new dataset for classifying memes by their template and communicative intent. It includes a broad selection of meme templates and examples scraped from imgflip and a smaller hand-annotated set of memes scraped from Reddit. The Reddit memes have been annotated for meta-category using a novel annotation scheme that classifies memes by the structure of the perspective they are being used to communicate. YOLOv11 and ChatGPT 4o are used to provide baseline modeling results. We find that YOLO struggles with template classification on real-world data but outperforms ChatGPT in classifying meta-categories.

2024

Language Model Priors and Data Augmentation Strategies for Low-resource Machine Translation: A Case Study Using Finnish to Northern Sámi
Jonne Sälevä | Constantine Lignos
Findings of the Association for Computational Linguistics: ACL 2024

We investigate ways of using monolingual data in both the source and target languages for improving low-resource machine translation. As a case study, we experiment with translation from Finnish to Northern Sámi. Our experiments show that while conventional backtranslation remains a strong contender, using synthetic target-side data when training backtranslation models can be helpful as well. We also show that monolingual data can be used to train a language model which can act as a regularizer without any augmentation of parallel data.

CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English
Andrew Rueda | Elena Alvarez-Mellado | Constantine Lignos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state-of-the-art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages Using Wikidata
Jonne Sälevä | Constantine Lignos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.

QueryNER: Segmentation of E-commerce Queries
Chester Palen-Michel | Lizzie Liang | Zhe Wu | Constantine Lignos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present QueryNER, a manually-annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token and entity dropping for null and low recall query recovery. Challenging test sets are created using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.

2023

LR-Sum: Summarization for Less-Resourced Languages
Chester Palen-Michel | Constantine Lignos
Findings of the Association for Computational Linguistics: ACL 2023

We introduce LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe abstractive and extractive summarization experiments to establish baselines and discuss the limitations of this dataset.

What changes when you randomly choose BPE merge operations? Not much.
Jonne Saleva | Constantine Lignos
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We introduce two simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into morphologically rich languages, hypothesizing that this task may show sensitivity to the method of choosing subwords. Analysis using a Bayesian linear model indicates that one variant performs nearly indistinguishably compared to standard BPE while the other degrades performance less than we anticipated. We conclude that although standard BPE is widely used, there exists an interesting universe of potential variations on it worth investigating. Our code is available at: https://github.com/bltlab/random-bpe.

This paper provides an overview of the first shared task on choosing beneficial instances for machine translation, conducted as part of the CoCo4MT 2023 Workshop at MTSummit. This shared task was motivated by the need to make the data annotation process for machine translation more efficient, particularly for low-resource languages for which collecting human translations may be difficult or expensive. The task involved developing methods for selecting the most beneficial instances for training a machine translation system without access to an existing parallel dataset in the target language, such that the best selected instances can then be manually translated. Two teams participated in the shared task, namely the Williams team and the AST team. Submissions were evaluated by training a machine translation model on each submission’s chosen instances and comparing their performance using the chrF++ score. The first-ranked system, from the Williams team, finds representative instances by clustering the training data.

Improving NER Research Workflows with SeqScore
Constantine Lignos | Maya Kruse | Andrew Rueda
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We describe the features of SeqScore, an MIT-licensed Python toolkit for working with named entity recognition (NER) data. While SeqScore began as a tool for NER scoring, it has been expanded to help with the full lifecycle of working with NER data: validating annotation, providing at-a-glance and detailed summaries of the data, modifying annotation to support experiments, scoring system output, and aiding with error analysis. SeqScore is released via PyPI (https://pypi.org/project/seqscore/) and development occurs on GitHub (https://github.com/bltlab/seqscore).

2022

Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling
Elena Álvarez-Mellado | Constantine Lignos
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings—words from one language that are introduced into another without orthographic adaptation—and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.

Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)
John E. Ortega | Marine Carpuat | William Chen | Katharina Kann | Constantine Lignos | Maja Popovic | Shabnam Tafreshi
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)

Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference
Jonne Sälevä | Constantine Lignos
Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference

African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets as well as the lack of understanding of which settings, languages, and recently proposed methods like cross-lingual transfer will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest to-date human-annotated NER dataset for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14% over 20 languages as compared to using English.

Toward More Meaningful Resources for Lower-resourced Languages
Constantine Lignos | Nolan Holley | Chester Palen-Michel | Jonne Sälevä
Findings of the Association for Computational Linguistics: ACL 2022

In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two massively multilingual resources used in language technology development, identifying shortcomings that limit their usefulness. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be, requiring non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotations for lower-resourced languages in a thoughtful and ethical way that includes the language speakers as part of the development process. We conclude with recommended guidelines for resource development.

Multilingual Open Text Release 1: Public Domain News in 44 Languages
Chester Palen-Michel | June Kim | Constantine Lignos
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001 and 2022 and collected from Voice of America’s news websites. We describe our process for collecting, filtering, and processing the data. The source material is in the public domain, our collection is licensed using a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License. The corpus will be regularly updated as additional documents are published.

Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing
Elena Alvarez-Mellado | Constantine Lignos
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a new corpus of Twitter data annotated for codeswitching and borrowing between Spanish and English. The corpus contains 9,500 tweets annotated at the token level with codeswitches, borrowings, and named entities. This corpus differs from prior corpora of codeswitching in that we attempt to clearly define and annotate the boundary between codeswitching and borrowing and do not treat common “internet-speak” (lol, etc.) as codeswitching when used in an otherwise monolingual context. The result is a corpus that enables the study and modeling of Spanish-English borrowing and codeswitching on Twitter in one dataset. We present baseline scores for modeling the labels of this corpus using Transformer-based language models. The annotation itself is released with a CC BY 4.0 license, while the text it applies to is distributed in compliance with the Twitter terms of service.

ParaNames: A Massively Multilingual Entity Name Corpus
Jonne Sälevä | Constantine Lignos
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

We present ParaNames, a Wikidata-derived multilingual parallel name resource consisting of names for approximately 14 million entities spanning over 400 languages. ParaNames is useful for multilingual language processing, both in defining name translation tasks and as supplementary data for other tasks. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English.

2021

TMR: Evaluating NER Recall on Tough Mentions
Jingxuan Tu | Constantine Lignos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of “tough” mentions: unseen mentions, those whose tokens or token/type combination were not observed in training, and type-confusable mentions, token sequences with multiple entity types in the test data. We demonstrate the usefulness of these metrics by evaluating corpora of English, Spanish, and Dutch using five recent neural architectures. We identify subtle differences between the performance of BERT and Flair on two English NER corpora and identify a weak spot in the performance of current models in Spanish. We conclude that the TMR metrics enable differentiation between otherwise similar-scoring systems and identification of patterns in performance that would go unnoticed from overall precision, recall, and F1.

The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation
Jonne Saleva | Constantine Lignos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, and predict that using morphologically-based segmentation methods would lead to better performance in this setting. However, comparing to BPE, we find that no consistent and reliable differences emerge between the segmentation methods. While morphologically-based methods outperform BPE in a few cases, what performs best tends to vary across tasks, and the performance of segmentation methods is often statistically indistinguishable.

SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation
Chester Palen-Michel | Nolan Holley | Constantine Lignos
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center around transparency regarding how chunks are encoded and scored. We demonstrate that despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can result in changes to scores that are both of noticeable magnitude and statistically significant. We describe SeqScore, which addresses many of the issues that cause replication failures.

Macro-Average: Rare Types Are Important Too
Thamme Gowda | Weiqiu You | Constantine Lignos | Jonathan May
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric, MacroF1, and study its applicability to MT evaluation. We find that MacroF1 is competitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Further, we show that MacroF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods’ outputs.
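
To make the macro-averaging idea in this abstract concrete, here is a simplified sketch that treats each word type as a class, computes a per-type F1 from clipped match counts, and contrasts the unweighted (macro) average with a pooled (micro) score. The function names and the whitespace tokenization are assumptions for illustration, and the sketch omits any smoothing or other details the paper's metric may use.

```python
from collections import Counter


def type_counts(hyps, refs):
    """Count hypothesis, reference, and matched tokens per word type."""
    hyp_counts, ref_counts, match_counts = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp.split()), Counter(ref.split())
        hyp_counts.update(h)
        ref_counts.update(r)
        match_counts.update(h & r)  # per-segment clipped matches
    return hyp_counts, ref_counts, match_counts


def f1(matches, n_hyp, n_ref):
    """F1 from raw counts; zero when there are no matches."""
    if matches == 0:
        return 0.0
    precision, recall = matches / n_hyp, matches / n_ref
    return 2 * precision * recall / (precision + recall)


def macro_f1(hyps, refs):
    """Unweighted mean of per-type F1, so rare types count as much as frequent ones."""
    hyp_counts, ref_counts, match_counts = type_counts(hyps, refs)
    types = set(hyp_counts) | set(ref_counts)
    return sum(f1(match_counts[t], hyp_counts[t], ref_counts[t]) for t in types) / len(types)


def micro_f1(hyps, refs):
    """F1 over pooled counts, which is dominated by frequent types."""
    hyp_counts, ref_counts, match_counts = type_counts(hyps, refs)
    return f1(sum(match_counts.values()), sum(hyp_counts.values()), sum(ref_counts.values()))


hyps = ["the cat sat"]
refs = ["the dog sat"]
print(macro_f1(hyps, refs), micro_f1(hyps, refs))
```

In this toy pair the types "the" and "sat" score 1.0 while "cat" and "dog" score 0.0, so the macro average is 0.5, whereas the pooled score is 2/3.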

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.

2020

If You Build Your Own NER Scorer, Non-replicable Results Will Come
Constantine Lignos | Marjan Kamyab
Proceedings of the First Workshop on Insights from Negative Results in NLP

We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing transparency regarding the handling of improper label sequences.
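
The effect described in this abstract can be illustrated with a small sketch of BIO chunk decoding under two policies for improper label sequences, here an I- label that does not continue a chunk of the same type. The function `decode_chunks` and the policy names "begin" and "discard" are hypothetical and chosen for this example; they are not the API or exact behavior of the scorers examined in the paper.

```python
def decode_chunks(labels, policy="begin"):
    """Decode BIO labels into (type, start, end) chunks; end is exclusive.

    policy="begin":   treat an improper I-X as if it were B-X.
    policy="discard": produce no chunk for the improper run of I-X labels.
    """
    chunks = []
    current = None  # [entity type, start index, keep this chunk?]

    def close(end):
        nonlocal current
        if current is not None and current[2]:
            chunks.append((current[0], current[1], end))
        current = None

    for i, label in enumerate(labels):
        if label == "O":
            close(i)
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "B":
            close(i)
            current = [etype, i, True]
        elif prefix == "I":
            if current is not None and current[0] == etype:
                continue  # continues the open chunk (kept or discarded)
            close(i)
            current = [etype, i, policy == "begin"]
        else:
            raise ValueError(f"Unexpected label: {label}")
    close(len(labels))
    return chunks


predicted = ["O", "I-PER", "I-PER", "O", "B-LOC"]
print(decode_chunks(predicted, policy="begin"))    # [('PER', 1, 3), ('LOC', 4, 5)]
print(decode_chunks(predicted, policy="discard"))  # [('LOC', 4, 5)]
```

Because the two policies yield different predicted chunks from the same label sequence, precision, recall, and F1 computed from them can differ even though the model output is identical.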

Effective Architectures for Low Resource Multilingual Named Entity Transliteration
Molly Moran | Constantine Lignos
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

In this paper, we evaluate LSTM, biLSTM, GRU, and Transformer architectures for the task of name transliteration in a many-to-one multilingual paradigm, transliterating from 590 languages to English. We experiment with different encoder-decoder combinations and evaluate them using accuracy, character error rate, and an F-measure based on longest continuous subsequences. We find that using a Transformer for the encoder and decoder performs best, improving accuracy by over 4 points compared to previous work. We explore whether manipulating the source text by adding macrolanguage flag tokens or pre-romanizing source strings can improve performance and find that neither manipulation has a positive effect. Finally, we analyze performance differences between the LSTM and Transformer encoders when using a Transformer decoder and find that the Transformer encoder is better able to handle insertions and substitutions when transliterating.

2019

The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval
Constantine Lignos | Daniel Cohen | Yen-Chieh Lien | Pratik Mehta | W. Bruce Croft | Scott Miller
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

When performing cross-language information retrieval (CLIR) for lower-resourced languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can be best predicted from MT quality. We explore performance at varying amounts of MT training data, byte pair encoding (BPE) merge operations, and across two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.

With the increasing democratization of electronic media, vast information resources are available in less-frequently-taught languages such as Swahili or Somali. That information, which may be crucially important and not available elsewhere, can be difficult for monolingual English speakers to effectively access. In this paper we present an end-to-end cross-lingual information retrieval (CLIR) and summarization system for low-resource languages that 1) enables English speakers to search foreign language repositories of text and audio using English queries, 2) summarizes the retrieved documents in English with respect to a particular information need, and 3) provides complete transcriptions and translations as needed. The SARAL system achieved the top end-to-end performance in the most recent IARPA MATERIAL CLIR+summarization evaluations. Our demonstration system provides end-to-end open query retrieval and summarization capability, and presents the original source text or audio, speech transcription, and machine translation, for two low resource languages.