Sharid Loáiciga

Also published as: Sharid Loaiciga

2025

Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes | Chuyuan Li
Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)

Exploring smaller batch sizes for a high-performing BabyLM model architecture
Sharid Loáiciga | Eleni Fysikoudi | Asad B. Sayeed
Proceedings of the First BabyLM Workshop

We explore the conditions under which the highest-performing entry to the BabyLM task in 2023, Every Layer Counts BERT or ELC-BERT, is best-performing given more constrained resources than the original run, with a particular focus on batch size. ELC-BERT’s relative success, as an instance of model engineering compared to more cognitively-motivated architectures, could be taken as evidence that the “lowest-hanging” fruit is to be found from non-linguistic machine learning approaches. We find that if we take away the advantage of training time from ELC-BERT, the advantage of the architecture mostly disappears, but some hyperparameter combinations nevertheless differentiate themselves in performance.

Information Divergence in Translation and Interpreting: Findings from Same-Source Texts
Maria Kunilovskaya | Sharid Loáiciga | Ekaterina Lapshinova-Koltunski
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

Coreference as an indicator of context scope in multimodal narrative
Nikolai Ilinykh | Shalom Lappin | Asad B. Sayeed | Sharid Loáiciga
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality. Materials, metrics, and code for our study are available at https://github.com/GU-CLASP/coreference-context-scope.

Active Curriculum Language Modeling over a Hybrid Pre-training Method
Eleni Fysikoudi | Sharid Loáiciga | Asad B. Sayeed
Proceedings of the First BabyLM Workshop

We apply the Active Curriculum Language Modeling (ACLM) method to the constrained pretraining setting of the 2025 BabyLM Challenge, where models are limited by both data and compute budgets. Using GPT-BERT (Charpentier and Samuel, 2024) as the base architecture, we investigate the impact of surprisal-based example selection for constructing a training curriculum. In addition, we conduct a targeted hyperparameter search over tokenizer size and batch size. Our approach yields stable pretrained models that surpass the official baseline on multiple evaluation tasks, demonstrating ACLM’s potential for improving performance and generalization in low-resource pretraining scenarios.

2024

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Mohsen Mesgar | Sharid Loáiciga
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Multilingual Models for ASR in Chibchan Languages
Rolando Coto-Solano | Tai Wan Kim | Alexander Jones | Sharid Loáiciga
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present experiments on Automatic Speech Recognition (ASR) for Bribri and Cabécar, two languages from the Chibchan family. We fine-tune four ASR algorithms (Wav2Vec2, Whisper, MMS & WavLM) to create monolingual models, with the Wav2Vec2 model demonstrating the best performance. We then proceed to use Wav2Vec2 for (1) experiments on training joint and transfer learning models for both languages, and (2) an analysis of the errors, with a focus on the transcription of tone. Results show effective transfer learning for both Bribri and Cabécar, but especially for Bribri. A post-processing spell checking step further reduced character and word error rates. As for the errors, tone is where the Bribri models make the most errors, whereas the simpler tonal system of Cabécar is better transcribed by the model. Our work contributes to developing better ASR technology, an important tool that could facilitate transcription, one of the major bottlenecks in language documentation efforts. Our work also assesses how existing pre-trained models and algorithms perform for genuine extremely low-resource languages.

A surprisal oracle for when every layer counts
Xudong Hong | Sharid Loáiciga | Asad Sayeed
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a learner-directed approach to training a language model. We proposed the original version of this process in our submission to the BabyLM 2023 task, and now we propose an updated ACLM process for the BabyLM 2024 task. ACLM involves an iteratively and dynamically constructed curriculum informed over the training process by a model of uncertainty; other training items that are similarly uncertain to a least certain candidate item are prioritized. Our new process improves the similarity model so that it is more dynamic, and we run ACLM over the most successful model from the BabyLM 2023 task: ELC-BERT (Charpentier and Samuel, 2023). We find that while our models underperform on fine-grained grammatical inferences, they outperform the BabyLM 2024 official baselines on common-sense and world-knowledge tasks. We make our code available at https://github.com/asayeed/ActiveBaby.

2023

Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)
Ellen Breitholtz | Shalom Lappin | Sharid Loaiciga | Nikolai Ilinykh | Simon Dobnik
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes
Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)

A surprisal oracle for active curriculum language modeling
Xudong Hong | Sharid Loáiciga | Asad Sayeed
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

2022

New or Old? Exploring How Pre-Trained Language Models Represent Discourse Entities
Sharid Loáiciga | Anne Beyer | David Schlangen
Proceedings of the 29th International Conference on Computational Linguistics

Recent research shows that pre-trained language models, built to generate text conditioned on some context, learn to encode syntactic knowledge to a certain degree. This has motivated researchers to move beyond the sentence-level and look into their ability to encode less studied discourse-level phenomena. In this paper, we add to the body of probing research by investigating discourse entity representations in large pre-trained language models in English. Motivated by early theories of discourse and key pieces of previous work, we focus on the information-status of entities as discourse-new or discourse-old. We present two probing models, one based on binary classification and another one on sequence labeling. The results of our experiments show that pre-trained language models do encode information on whether an entity has been introduced before or not in the discourse. However, this information alone is not sufficient to find the entities in a discourse, opening up interesting questions about the definition of entities for future work.

Anaphoric Phenomena in Situated dialog: A First Round of Annotations
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

We present a first release of 500 documents from the multimodal corpus Tell-me-more (Ilinykh et al., 2019) annotated with coreference information according to the ARRAU guidelines (Poesio et al., 2021). The corpus consists of images and short texts of five sentences. We describe the annotation process and present the adaptations to the original guidelines in order to account for the challenges of grounding the annotations to the image. 50 documents from the 500 available are annotated by two people and used to estimate inter-annotator agreement (IAA) relying on Krippendorff’s alpha.

2021

Event and Entity Coreference Across Five Languages: Effects of Context and Referring Expression
Luca Bevacqua | Sharid Loáiciga | Hannah Rohde | Christian Hardmeier
Dialogue & Discourse, Volume 12

Current work on coreference focuses primarily on entities, often leaving unanalysed the use of anaphors to corefer with antecedents such as events and textual segments. Moreover, the anaphoric forms that speakers use for entity and non-entity coreference are not mutually exclusive. This ambiguity has been the subject of recent work in English, with evidence of a split between comprehenders’ preferential interpretation of personal versus demonstrative pronouns. In addition, comprehenders are shown to be sensitive to antecedent complexity and aspectual status, two verb-driven cues that signal how an event is being portrayed. Here we extend this work via a comparison across five languages (English, French, German, Italian, and Spanish). With a story-continuation experiment, we test how different referring expressions corefer with entity and event antecedents and whether verbal features such as argument structure and aspect influence this choice. Our results show widely consistent, not categorical biases across languages: entity coreference is favoured for personal pronouns and event coreference for demonstratives. Antecedent complexity increases the rate at which anaphors are taken to corefer with an event antecedent, but portraying an event as completed does not reach statistical significance (though showing quite uniform patterns). Lastly, we report a comparison of the same referring expressions to refer to entity and event antecedents in a trilingual parallel corpus annotated with coreference. Together, the results provide a first crosslingual picture of coreference preferences beyond the restricted entity-only patterns targeted by most existing work on coreference. The five languages are all shown to allow gradable use of pronouns for entity and event coreference, with biases that align with existing generalizations about the link between prominence and the use of reduced referring expressions. The studies also show the feasibility of manipulating targeted verb-driven cues across multiple languages to support crosslingual comparisons.

Is Incoherence Surprising? Targeted Evaluation of Coherence Prediction from Language Models
Anne Beyer | Sharid Loáiciga | David Schlangen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Coherent discourse is distinguished from a mere collection of utterances by the satisfaction of a diverse set of constraints, for example choice of expression, logical relation between denoted events, and implicit compatibility with world-knowledge. Do neural language models encode such constraints? We design an extendable set of test suites addressing different aspects of discourse and dialogue coherence. Unlike most previous coherence evaluation studies, we address specific linguistic devices beyond sentence order perturbations, which allow for a more fine-grained analysis of what constitutes coherence and what neural models trained on a language modelling objective are capable of encoding. Extending the targeted evaluation paradigm for neural language models (Marvin and Linzen, 2018) to phenomena beyond syntax, we show that this paradigm is equally suited to evaluate linguistic qualities that contribute to the notion of coherence.

Towards Universal Dependencies for Bribri
Rolando Coto-Solano | Sharid Loáiciga | Sofía Flores-Solórzano
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

Annotating anaphoric phenomena in situated dialogue
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

In recent years several corpora have been developed for vision and language tasks. With this paper, we intend to start a discussion on the annotation of referential phenomena in situated dialogue. We argue that there is still significant room for corpora that increase the complexity of both visual and linguistic domains and which capture different varieties of perceptual and conversational contexts. In addition, a rich annotation scheme covering a broad range of referential phenomena and compatible with the textual task of coreference resolution is necessary in order to take the most advantage of these corpora. Consequently, there are several open questions regarding the semantics of reference and annotation, and the extent to which standard textual coreference accounts for the situated dialogue genre. Working with two corpora on situated dialogue, we present our extension to the ARRAU (Uryupina et al., 2020) annotation scheme in order to start this discussion.

Reference and coreference in situated dialogue
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the Second Workshop on Advances in Language and Vision Research

In recent years several corpora have been developed for vision and language tasks. We argue that there is still significant room for corpora that increase the complexity of both visual and linguistic domains and which capture different varieties of perceptual and conversational contexts. Working with two corpora approaching this goal, we present a linguistic perspective on some of the challenges in creating and extending resources combining language and vision while preserving continuity with the existing best practices in the area of coreference annotation.

2020

Exploiting Cross-Lingual Hints to Discover Event Pronouns
Sharid Loáiciga | Christian Hardmeier | Asad Sayeed
Proceedings of the Twelfth Language Resources and Evaluation Conference

Non-nominal co-reference is much less studied than nominal coreference, partly because of the lack of annotated corpora. We explore the possibility to exploit parallel multilingual corpora as a means of cheap supervision for the classification of three different readings of the English pronoun ‘it’: entity, event or pleonastic, from their translation in several languages. We found that the ‘event’ reading is not very frequent, but can be easily predicted provided that the construction used to translate the ‘it’ example is a pronoun as well. These cases, nevertheless, are not enough to generalize to other types of non-nominal reference.

Exploring Span Representations in Neural Coreference Resolution
Patrick Kahardipraja | Olena Vyshnevska | Sharid Loáiciga
Proceedings of the First Workshop on Computational Approaches to Discourse

In coreference resolution, span representations play a key role to predict coreference links accurately. We present a thorough examination of the span representation derived by applying BERT on coreference resolution (Joshi et al., 2019) using a probing model. Our results show that the span representation is able to encode a significant amount of coreference information. In addition, we find that the head-finding attention mechanism involved in creating the spans is crucial in encoding coreference knowledge. Last, our analysis shows that the span representation cannot capture non-local coreference as efficiently as local coreference.

2019

Cross-lingual Incongruences in the Annotation of Coreference
Ekaterina Lapshinova-Koltunski | Sharid Loáiciga | Christian Hardmeier | Pauline Krielke
Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference

In the present paper, we deal with incongruences in English-German multilingual coreference annotation and present automated methods to discover them. More specifically, we automatically detect full coreference chains in parallel texts and analyse discrepancies in their annotations. In doing so, we wish to find out whether the discrepancies rather derive from language typological constraints, from the translation or the actual annotation process. The results of our study contribute to the referential analysis of similarities and differences across languages and support evaluation of cross-lingual coreference annotation. They are also useful for cross-lingual coreference resolution systems and contrastive linguistic studies.

Analysing concatenation approaches to document-level NMT in two different domains
Yves Scherrer | Jörg Tiedemann | Sharid Loáiciga
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

In this paper, we investigate how different aspects of discourse context affect the performance of recent neural MT systems. We describe two popular datasets covering news and movie subtitles and we provide a thorough analysis of the distribution of various document-level features in their domains. Furthermore, we train a set of context-aware MT models on both datasets and propose a comparative evaluation scheme that contrasts coherent context with artificially scrambled documents and absent context, arguing that the impact of discourse-aware MT models will become visible in this way. Our results show that the models are indeed affected by the manipulation of the test data, providing a different view on document-level translation quality than absolute sentence-level scores.

Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)
Andrei Popescu-Belis | Sharid Loáiciga | Christian Hardmeier | Deyi Xiong
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

2018

A Pronoun Test Suite Evaluation of the English–German MT Systems at WMT 2018
Liane Guillou | Christian Hardmeier | Ekaterina Lapshinova-Koltunski | Sharid Loáiciga
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We evaluate the output of 16 English-to-German MT systems with respect to the translation of pronouns in the context of the WMT 2018 competition. We work with a test suite specifically designed to assess system quality in various fine-grained categories known to be problematic. The main evaluation scores come from a semi-automatic process, combining automatic reference matching with extensive manual annotation of uncertain cases. We find that current NMT systems are good at translating pronouns with intra-sentential reference, but the inter-sentential cases remain difficult. NMT systems are also good at the translation of event pronouns, unlike systems from the phrase-based SMT paradigm. No single system performs best at translating all types of anaphoric pronouns, suggesting unexplained random effects influencing the translation of pronouns with NMT.

Event versus entity co-reference: Effects of context and form of referring expression
Sharid Loáiciga | Luca Bevacqua | Hannah Rohde | Christian Hardmeier
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

Anaphora resolution systems require both an enumeration of possible candidate antecedents and an identification process of the antecedent. This paper focuses on (i) the impact of the form of referring expression on entity-vs-event preferences and (ii) how properties of the passage interact with referential form. Two crowd-sourced story-continuation experiments were conducted, using constructed and naturally-occurring passages, to see how participants interpret It and This pronouns following a context sentence that makes available event and entity referents. Our participants show a strong, but not categorical, bias to use This to refer to events and It to refer to entities. However, these preferences vary with passage characteristics such as verb class (a proxy in our constructed examples for the number of explicit and implicit entities) and more subtle author intentions regarding subsequent re-mention (the original event-vs-entity re-mention of our corpus items).

Forms of Anaphoric Reference to Organisational Named Entities: Hoping to widen appeal, they diversified
Christian Hardmeier | Luca Bevacqua | Sharid Loáiciga | Hannah Rohde
Proceedings of the Seventh Named Entities Workshop

Proper names of organisations are a special case of collective nouns. Their meaning can be conceptualised as a collective unit or as a plurality of persons, allowing for different morphological marking of coreferent anaphoric pronouns. This paper explores the variability of references to organisation names with 1) a corpus analysis and 2) two crowd-sourced story continuation experiments. The first shows that the preference for singular vs. plural conceptualisation is dependent on the level of formality of a text. In the second, we observe a strong preference for the plural they otherwise typical of informal speech. Using edited corpus data instead of constructed sentences as stimuli reduces this preference.

2017

What is it? Disambiguating the different readings of the pronoun ‘it’
Sharid Loáiciga | Liane Guillou | Christian Hardmeier
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we address the problem of predicting one of three functions for the English pronoun ‘it’: anaphoric, event reference or pleonastic. This disambiguation is valuable in the context of machine translation and coreference resolution. We present experiments using a MAXENT classifier trained on gold-standard data and self-training experiments of an RNN trained on silver-standard data, annotated using the MAXENT classifier. Lastly, we report on an analysis of the strengths of these two models.

We describe the design, the setup, and the evaluation results of the DiscoMT 2017 shared task on cross-lingual pronoun prediction. The task asked participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provided a lemmatized target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. The aim of the task was to predict, for each target-language pronoun placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the entire document. We offered four subtasks, each for a different language pair and translation direction: English-to-French, English-to-German, German-to-English, and Spanish-to-English. Five teams participated in the shared task, making submissions for all language pairs. The evaluation results show that most participating teams outperformed two strong n-gram-based language model-based baseline systems by a sizable margin.

Annotating tense, mood and voice for English, French and German
Anita Ramm | Sharid Loáiciga | Annemarie Friedrich | Alexander Fraser
Proceedings of ACL 2017, System Demonstrations

A BiLSTM-based System for Cross-lingual Pronoun Prediction
Sara Stymne | Sharid Loáiciga | Fabienne Cap
Proceedings of the Third Workshop on Discourse in Machine Translation

We describe the Uppsala system for the 2017 DiscoMT shared task on cross-lingual pronoun prediction. The system is based on a lower layer of BiLSTMs reading the source and target sentences respectively. Classification is based on the BiLSTM representation of the source and target positions for the pronouns. In addition we enrich our system with dependency representations from an external parser and character representations of the source sentence. We show that these additions perform well for German and Spanish as source languages. Our system is competitive and is in first or second place for all language pairs.

2016

Discontinuous Verb Phrases in Parsing and Machine Translation of English and German
Sharid Loáiciga | Kristina Gulordava
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we focus on the verb-particle (V-Prt) split construction in English and German and its difficulty for parsing and Machine Translation (MT). For German, we use an existing test suite of V-Prt split constructions, while for English, we build a new and comparable test suite from raw data. These two data sets are then used to perform an analysis of errors in dependency parsing, word-level alignment and MT, which arise from the discontinuous order in V-Prt split constructions. In the automatic alignments of parallel corpora, most of the particles align to NULL. These mis-alignments and the inability of phrase-based MT system to recover discontinuous phrases result in low quality translations of V-Prt split constructions both in English and German. However, our results show that the V-Prt split phrases are correctly parsed in 90% of cases, suggesting that syntactic-based MT should perform better on these constructions. We evaluate a syntactic-based MT system on German and compare its performance to the phrase-based system.

Predicting and Using a Pragmatic Component of Lexical Aspect of Simple Past Verbal Tenses for Improving English-to-French Machine Translation
Sharid Loáiciga | Cristina Grisot
Linguistic Issues in Language Technology, Volume 13, 2016

This paper proposes a method for improving the results of a statistical Machine Translation system using boundedness, a pragmatic component of the verbal phrase’s lexical aspect. First, the paper presents manual and automatic annotation experiments for lexical aspect in English-French parallel corpora. It will be shown that this aspectual property is identified and classified with ease both by humans and by automatic systems. Second, Statistical Machine Translation experiments using the boundedness annotations are presented. These experiments show that the information regarding lexical aspect is useful to improve the output of a Machine Translation system in terms of better choices of verbal tenses in the target language, as well as better lexical choices. Ultimately, this work aims at providing a method for the automatic annotation of data with boundedness information and at contributing to Machine Translation by taking into account linguistic data.

It-disambiguation and source-aware language models for cross-lingual pronoun prediction
Sharid Loáiciga | Liane Guillou | Christian Hardmeier
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

Rule-Based Pronominal Anaphora Treatment for Machine Translation
Sharid Loáiciga | Éric Wehrli
Proceedings of the Second Workshop on Discourse in Machine Translation

Predicting Pronoun Translation Using Syntactic, Morphological and Contextual Features from Parallel Data
Sharid Loáiciga
Proceedings of the Second Workshop on Discourse in Machine Translation

2014

English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling
Sharid Loáiciga | Thomas Meyer | Andrei Popescu-Belis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a method for verb phrase (VP) alignment in an English-French parallel corpus and its use for improving statistical machine translation (SMT) of verb tenses. The method starts from automatic word alignment performed with GIZA++, and relies on a POS tagger and a parser, in combination with several heuristics, in order to identify non-contiguous components of VPs, and to label the aligned VPs with their tense and voice on each side. This procedure is applied to the Europarl corpus, leading to the creation of a smaller, high-precision parallel corpus with about 320,000 pairs of finite VPs, which is made publicly available. This resource is used to train a tense predictor for translation from English into French, based on a large number of surface features. Three MT systems are compared: (1) a baseline phrase-based SMT; (2) a tense-aware SMT system using the above predictions within a factored translation model; and (3) a system using oracle predictions from the aligned VPs. For several tenses, such as the French “imparfait”, the tense-aware SMT system improves significantly over the baseline and is closer to the oracle system.

2013

Anaphora Resolution for Machine Translation (Résolution d’anaphores et traitement des pronoms en traduction automatique à base de règles) [in French]
Sharid Loáiciga
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

Improving machine translation of null subjects in Italian and Spanish
Lorenza Russo | Sharid Loáiciga | Asheesh Gulati
Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics

Italian and Spanish Null Subjects. A Case Study Evaluation in an MT Perspective.
Lorenza Russo | Sharid Loáiciga | Asheesh Gulati
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Thanks to their rich morphology, Italian and Spanish allow pro-drop pronouns, i.e., non lexically-realized subject pronouns. Here we distinguish between two different types of null subjects: personal pro-drop and impersonal pro-drop. We evaluate the translation of these two categories into French, a non pro-drop language, using Its-2, a transfer-based system developed at our laboratory; and Moses, a statistical system. Three different corpora are used: two subsets of the Europarl corpus and a third corpus built using newspaper articles. Null subjects turn out to be quantitatively important in all three corpora, but their distribution varies depending on the language and the text genre. From an MT perspective, translation results are determined by the type of pro-drop and the pair of languages involved. Impersonal pro-drop is harder to translate than personal pro-drop, especially for the translation from Italian into French, and a significant portion of incorrect translations consists of missing pronouns.

2011

Étude inter-langues de la distribution et des ambiguïtés syntaxiques des pronoms (A study of cross-language distribution and syntactic ambiguities of pronouns)
Lorenza Russo | Yves Scherrer | Jean-Philippe Goldman | Sharid Loáiciga | Luka Nerima | Éric Wehrli
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This work describes the distribution of pronouns by text style (literary or journalistic) and by language (French, English, German and Italian). Based on part-of-speech tagging performed automatically and then verified manually, we observe that the proportion of the different types of pronouns varies with the text type and the language. We discuss the most ambiguous categories in detail. Since we used the Fips syntactic parser to tag the pronouns, we also evaluated it and obtained an average precision of over 95%.

La traduction automatique des pronoms. Problèmes et perspectives (Automatic translation of pronouns. Problems and perspectives)
Yves Scherrer | Lorenza Russo | Jean-Philippe Goldman | Sharid Loáiciga | Luka Nerima | Éric Wehrli
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this study, our machine translation system, Its-2, underwent a manual evaluation of pronoun translation for five language pairs and on two corpora: a literary corpus and a corpus of press releases. The results show that error rates can reach 60% depending on the language pair and the corpus. We therefore discuss two research directions for improving the performance of Its-2: resolving parsing ambiguities and resolving pronominal anaphora.