Johanna Monti - ACL Anthology

Johanna Monti

2025

ITALERT: Assessing the Quality of LLMs and NMT in Translating Italian Emergency Response Text
Maria Carmen Staiano | Lifeng Han | Johanna Monti | Francesca Chiusaroli
Proceedings of Machine Translation Summit XX: Volume 1

This paper presents the outcomes of an initial investigation into the performance of Large Language Models (LLMs) and Neural Machine Translation (NMT) systems in translating high-stakes messages. The research employed a novel bilingual corpus, ITALERT (Italian Emergency Response Text) and applied a human-centric post-editing based metric (HOPE) to assess translation quality systematically. The initial dataset contains eleven texts in Italian and their corresponding English translations, both extracted from the national communication campaign website of the Italian Civil Protection Department. The texts deal with eight crisis scenarios: flooding, earthquake, forest fire, volcanic eruption, tsunami, industrial accident, nuclear risk, and dam failure. The dataset has been carefully compiled to ensure usability and clarity for evaluating machine translation (MT) systems in crisis settings. Our findings show that current LLMs and NMT models, such as ChatGPT (OpenAI’s GPT-4o model) and Google MT, face limitations in translating emergency texts, particularly in maintaining the appropriate register, resolving context ambiguities, and managing domain-specific terminology.

Balancing Translation Quality and Environmental Impact: Comparing Large and Small Language Models
Antonio Castaldo | Petra Giommarelli | Johanna Monti
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Proceedings of the 5th Conference on Language, Data and Knowledge
Mehwish Alam | Andon Tchechmedjiev | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov
Proceedings of the 5th Conference on Language, Data and Knowledge

Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

Proceedings of the 5th Conference on Language, Data and Knowledge: TermTrends 2025
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov
Proceedings of the 5th Conference on Language, Data and Knowledge: TermTrends 2025

Proceedings of the 5th Conference on Language, Data and Knowledge: The 5th OntoLex Workshop
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov
Proceedings of the 5th Conference on Language, Data and Knowledge: The 5th OntoLex Workshop

Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing
Antonio Castaldo | Sheila Castilho | Joss Moorkens | Johanna Monti
Proceedings of Machine Translation Summit XX: Volume 1

Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing (PE) LLM-generated translations significantly reduces editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators.

UniOr PET: An Online Platform for Translation Post-Editing
Antonio Castaldo | Sheila Castilho | Joss Moorkens | Johanna Monti
Proceedings of Machine Translation Summit XX: Volume 2

UniOr PET is a browser-based platform for machine translation post-editing and a modern successor to the original PET tool. It features a user-friendly interface that records detailed editing actions, including time spent, additions, and deletions. Fully compatible with PET, UniOr PET introduces two advanced timers for more precise tracking of editing time and computes widely used metrics such as hTER, BLEU, and ChrF, providing comprehensive insights into translation quality and post-editing productivity. Designed with translators and researchers in mind, UniOr PET combines the strengths of its predecessor with enhanced functionality for efficient and user-friendly post-editing projects.

2024

Emojilingo: Harnessing AI to Translate Words into Emojis
Francesca Chiusaroli | Federico Sangati | Johanna Monti | Maria Laura Pierucci | Tiberio Uricchio
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper presents an AI experiment of translation in emoji conducted on a glossary from Dante Alighieri’s Comedy. The experiment is part of a project aiming to build up an automated emoji-based pivot language providing an interlingua as a tool for linguistic simplification, accessibility, and international communication: Emojilingo. The present test involves human (Emojitaliano) and machine (Chat-GPT) translations in a comparative analysis to devise an automated integrated model highlighting emojis’ expressive ability in transferring senses, clarifying semantic obscurities and ambiguities, and simplifying language. A first preliminary evaluation highlights Chat-GPT’s ability to deal with a classic archaic literary vocabulary, also raising issues on managing criteria for better grasping the meanings and forms and about the multicultural extent of content transfer.

Large Language Models as Legal Translators of Arabic Legislation: Do ChatGPT and Gemini Care for Context and Terminology?
Khadija Ait ElFqih | Johanna Monti
Proceedings of the Second Arabic Natural Language Processing Conference

Accurate translation of terminology and adaptation to in-context information are pillars of high-quality translation. Recently, there has been remarkable interest in the use and evaluation of Large Language Models (LLMs), particularly for Machine Translation tasks. Nevertheless, despite their recent advancement and ability to understand and generate human-like language, these LLMs are still far from perfect, especially in domain-specific scenarios, and need to be thoroughly investigated. This is particularly evident in automatically translating legal terminology from Arabic into English and French, where, beyond the inherent complexities of legal language and specialised translations, technical limitations of LLMs further hinder accurate generation of text. In this paper, we present a preliminary evaluation of two evolving LLMs, namely GPT-4 (Generative Pre-trained Transformer) and Gemini, as legal translators of Arabic legislative texts, to test their accuracy and the extent to which they care for context and terminology across two language pairs (AR→EN / AR→FR). The study targets the evaluation of Zero-Shot prompting for in-context and out-of-context scenarios of both models, relying on a gold standard dataset verified by professional translators who are also experts in the field. We evaluate the results by applying the Multidimensional Quality Metrics to classify translation errors. Moreover, we also evaluate the general LLM outputs to verify their correctness, consistency, and completeness. In general, our results show that the models are far from perfect and call for more fine-tuning efforts using specialised terminological data in the legal domain from Arabic into English and French.

Riddle Me This: Evaluating Large Language Models in Solving Word-Based Games
Raffaele Manna | Maria Pia di Buono | Johanna Monti
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

In this contribution, we examine the proficiency of Large Language Models (LLMs) in solving the linguistic game “La Ghigliottina,” the final game of the popular Italian TV quiz show “L’Eredità”. This game is particularly challenging as it requires LLMs to engage in semantic inference reasoning for identifying the solutions of the game. Our experiment draws inspiration from Ghigliottin-AI, a task of EVALITA 2020, an evaluation campaign focusing on Natural Language Processing (NLP) and speech tools designed for the Italian language. To benchmark our experiment, we use the results of the most successful artificial player in this task, namely Il Mago della Ghigliottina. The paper describes the experimental setting and the results which show that LLMs perform poorly.

The SETU-ADAPT Submission for WMT 24 Biomedical Shared Task
Antonio Castaldo | Maria Zafar | Prashanth Nayak | Rejwanul Haque | Andy Way | Johanna Monti
Proceedings of the Ninth Conference on Machine Translation

This system description paper presents SETU-ADAPT’s submission to the WMT 2024 Biomedical Shared Task, where we participated for the language pairs English-to-French and English-to-German. Our approach focused on fine-tuning Large Language Models, using in-domain and synthetic data, employing different data augmentation and data retrieval strategies. We introduce a novel MT framework, involving three autonomous agents: a Translator Agent, an Evaluator Agent and a Reviewer Agent. We present our findings and report the quality of the outputs.

Prompting Large Language Models for Idiomatic Translation
Antonio Castaldo | Johanna Monti
Proceedings of the 1st Workshop on Creative-text Translation and Technology

Large Language Models (LLMs) have demonstrated impressive performance in translating content across different languages and genres. Yet, their potential in the creative aspects of machine translation has not been fully explored. In this paper, we seek to identify the strengths and weaknesses inherent in different LLMs when applied to one of the most prominent features of creative works: the translation of idiomatic expressions. We present an overview of their performance in the EN→IT language pair, a context characterized by an evident lack of bilingual data tailored for idiomatic translation. Lastly, we investigate the impact of prompt design on the quality of machine translation, drawing on recent findings which indicate a substantial variation in the performance of LLMs depending on the prompts utilized.

2023

On the Evaluation of Terminology Translation Errors in NMT and PB-SMT in the Legal Domain: a Study on the Translation of Arabic Legal Documents into English and French
Khadija Ait ElFqih | Johanna Monti
Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)

In the translation process, terminological resources are used to solve translation problems, so information on terminological equivalence is crucial to make the most appropriate choices in terms of translation equivalence. Neural models have improved the state of the art in Machine Translation considerably in recent years. However, they still underperform in domain-specific fields and in under-resourced languages. This is particularly evident in translating legal terminology for Arabic, where current Machine Translation outputs do not adhere to the contextual, linguistic, cultural, and terminological constraints posed by translating legal terms in Arabic. In this paper, we conduct a comparative qualitative evaluation and comprehensive error analysis on legal terminology translation in Phrase-Based Statistical Machine Translation and Neural Machine Translation in two translation language pairs: Arabic-English and Arabic-French. We propose an error typology taking the legal terminology translation from Arabic into account. We present our findings, highlighting the strengths and weaknesses of both approaches in the area of legal terminology translation for Arabic. We also introduce a multilingual gold standard dataset that we developed using our Arabic legal corpus. This dataset serves as a reliable benchmark and/or reference during the evaluation process to decide the degree of adequacy and fluency of the Phrase-Based Statistical Machine Translation and Neural Machine Translation systems.

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus represents 26 languages now. All monolingual corpora therein use Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts impact on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

GPT-based Language Models meet Emojitaliano: A Preliminary Assessment Test between Automation and Creativity
Francesca Chiusaroli | Tiberio Uricchio | Johanna Monti | Maria Laura Pierucci | Federico Sangati
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Formalizing Translation Equivalence and Lexico-Semantic Relations Among Terms in a Bilingual Terminological Resource
Giulia Speranza | Maria Pia Di Buono | Johanna Monti
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Assessing the Quality of an Italian Crowdsourced Idiom Corpus: the Dodiom Experiment
Giuseppina Morza | Raffaele Manna | Johanna Monti
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes how idiom-related language resources, collected through a crowdsourcing experiment carried out by means of Dodiom, a Game-with-a-purpose, have been analysed by language experts. The paper focuses on the criteria adopted for the data annotation and evaluation process. The main scope of this project is, indeed, the evaluation of the quality of the linguistic data obtained through a crowdsourcing project, namely to assess if the data provided and evaluated by the players who joined the game are actually considered of good quality by the language experts. Finally, results of the annotation and evaluation processes as well as future work are presented.

Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis
Johanna Monti | Valerio Basile | Maria Pia Di Buono | Raffaele Manna | Antonio Pascucci | Sara Tonelli
Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis

2021

gENder-IT: An Annotated English-Italian Parallel Challenge Set for Cross-Linguistic Natural Gender Phenomena
Eva Vanmassenhove | Johanna Monti
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

Languages differ in terms of the absence or presence of gender features, the number of gender classes and whether and where gender features are explicitly marked. These cross-linguistic differences can lead to ambiguities that are difficult to resolve, especially for sentence-level MT systems. The identification of ambiguity and its subsequent resolution is a challenging task for which currently there aren’t any specific resources or challenge sets available. In this paper, we introduce gENder-IT, an English–Italian challenge set focusing on the resolution of natural gender phenomena by providing word-level gender tags on the English source side and multiple gender alternative translations, where needed, on the Italian target side.

2020

Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language
Johanna Monti | Valerio Basile | Maria Pia Di Buono | Raffaele Manna | Antonio Pascucci | Sara Tonelli
Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

Is this hotel review truthful or deceptive? A platform for disinformation detection through computational stylometry
Antonio Pascucci | Raffaele Manna | Ciro Caterino | Vincenzo Masucci | Johanna Monti
Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management

In this paper, we present a web service platform for disinformation detection in hotel reviews written in English. The platform relies on a hybrid approach of computational stylometry techniques, machine learning and linguistic rules written using COGITO, Expert System Corp.’s semantic intelligence software, which makes it possible to analyze texts and extract all their characteristics. We carried out a research experiment on the Deceptive Opinion Spam corpus, a balanced corpus composed of 1,600 hotel reviews of 20 Chicago hotels split into four datasets: positive truthful, negative truthful, positive deceptive and negative deceptive reviews. We investigated four different classifiers and found that Simple Logistic is the best-performing algorithm for this type of classification.

UNIOR NLP at MWSA Task - GlobaLex 2020: Siamese LSTM with Attention for Word Sense Alignment
Raffaele Manna | Giulia Speranza | Maria Pia di Buono | Johanna Monti
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

In this paper we describe the system submitted to the ELEXIS Monolingual Word Sense Alignment Task. We test different systems, which are two types of LSTMs and a system based on a pretrained Bidirectional Encoder Representations from Transformers (BERT) model, to solve the task. LSTM models use fastText pre-trained word vectors features with different settings. For training the models, we did not combine external data with the dataset provided for the task. We select a sub-set of languages among the proposed ones, namely a set of Romance languages, i.e., Italian, Spanish, Portuguese, together with English and Dutch. The Siamese LSTM with attention and PoS tagging (LSTM-A) performed better than the other two systems, achieving a 5-Class Accuracy score of 0.844 in the Overall Results, ranking the first position among five teams.

From Linguistic Resources to Ontology-Aware Terminologies: Minding the Representation Gap
Giulia Speranza | Maria Pia di Buono | Johanna Monti | Federico Sangati
Proceedings of the Twelfth Language Resources and Evaluation Conference

Terminological resources have proven crucial in many applications ranging from Computer-Aided Translation tools to authoring software and multilingual and cross-lingual information retrieval systems. Nonetheless, with the exception of a few felicitous examples, such as the IATE (Interactive Terminology for Europe) Termbank, many terminological resources are not available in standard formats, such as Term Base eXchange (TBX), thus preventing their sharing and reuse. Yet, these terminologies could be improved by associating the corresponding ontology-based information. The research described in the present contribution demonstrates the process and the methodologies adopted in the automatic conversion into TBX of this type of resource, together with their semantic enrichment based on the formalization of ontological information into terminologies. We present a proof-of-concept using the Italian Linguistic Resource for the Archaeological domain (developed according to Thesauri and Guidelines of the Italian Central Institute for the Catalogue and Documentation). Further, we introduce the conversion tool developed to support the process of creating ontology-aware terminologies for improving interoperability and sharing of existing language technologies and data sets.

The Role of Computational Stylometry in Identifying (Misogynistic) Aggression in English Social Media Texts
Antonio Pascucci | Raffaele Manna | Vincenzo Masucci | Johanna Monti
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In this paper, we describe the UniOr_ExpSys team’s participation in the TRAC-2 (Trolling, Aggression and Cyberbullying) shared task, held at a workshop organized as part of LREC 2020. The TRAC-2 shared task is organized in two sub-tasks: Aggression Identification (a 3-way classification between “Overtly Aggressive”, “Covertly Aggressive” and “Non-aggressive” text data) and Misogynistic Aggression Identification (a binary classifier for classifying the texts as “gendered” or “non-gendered”). Our approach is based on linguistic rules, stylistic feature extraction through stylometric analysis, and the Sequential Minimal Optimization algorithm for building the two classifiers.

Preface
Johanna Monti | Felice Dell’Orletta | Fabio Tamburini
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Monitoring Social Media to Identify Environmental Crimes through NLP. A preliminary study
Raffaele Manna | Antonio Pascucci | Wanda Punzi Zarino | Vincenzo Simoniello | Johanna Monti
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

ItaGLAM: A corpus of Cultural Communication on Twitter during the Pandemic
Gennaro Nolano | Carola Carlino | Maria Pia Di Buono | Johanna Monti
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)
Johanna Monti | Felice Dell'Orletta | Fabio Tamburini
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

The Challenge of the TV game La Ghigliottina to NLP
Federico Sangati | Antonio Pascucci | Johanna Monti
Workshop on Games and Natural Language Processing

In this paper, we describe a Telegram bot, Mago della Ghigliottina (Ghigliottina Wizard), able to solve La Ghigliottina game (The Guillotine), the final game of the Italian TV quiz show L’Eredità. Our system relies on linguistic resources and artificial intelligence and achieves better results than human players (and competitors of L’Eredità too). In addition to solving a game, Mago della Ghigliottina can also generate new game instances and challenge the users to match the solution.

“Spotto la quarantena”: per una analisi dell’italiano scritto degli studenti universitari via social network in tempo di COVID-19
Francesca Chiusaroli | Johanna Monti | Maria Laura Pierucci | Gennaro Nolano
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

A Case Study of Natural Gender Phenomena in Translation. A Comparison of Google Translate, Bing Microsoft Translator and DeepL for English to Italian, French and Spanish
Argentina Anna Rescigno | Eva Vanmassenhove | Johanna Monti | Andy Way
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

A Case Study of Natural Gender Phenomena in Translation: A Comparison of Google Translate, Bing Microsoft Translator and DeepL for English to Italian, French and Spanish
Argentina Anna Rescigno | Johanna Monti | Andy Way | Eva Vanmassenhove
Workshop on the Impact of Machine Translation (iMpacT 2020)

The Archaeo-Term Project: Multilingual Terminology in Archaeology
Giulia Speranza | Raffaele Manna | Maria Pia Di Buono | Johanna Monti
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

2019

Gender Detection and Stylistic Differences and Similarities between Males and Females in a Dream Tales Blog
Raffaele Manna | Antonio Pascucci | Johanna Monti
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

2018

PARSEME-IT - Issues in verbal Multiword Expressions Identification and Classification
Johanna Monti | Valeria Caruso | Maria Pia Di Buono
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

EnetCollect in Italy
Lionel Nicolas | Verena Lyding | Luisa Bentivogli | Federico Sangati | Johanna Monti | Irene Russo | Roberto Gretter | Daniele Falavigna
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

DialettiBot: a Telegram Bot for Crowdsourcing Recordings of Italian Dialects
Federico Sangati | Ekaterina Abramova | Johanna Monti
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Advances in Multiword Expression Identification for the Italian language: The PARSEME Shared Task Edition 1.1
Johanna Monti | Silvio Ricardo Cordeiro | Carlos Ramisch | Federico Sangati | Agata Savary | Veronika Vincze
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

2017

Survey: Multiword Expression Processing: A Survey
Mathieu Constant | Gülşen Eryiǧit | Johanna Monti | Lonneke van der Plas | Carlos Ramisch | Michael Rosner | Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.

PARSEME-It Corpus An annotated Corpus of Verbal Multiword Expressions in Italian
Johanna Monti | Maria Pia Di Buono | Federico Sangati
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

2016

PARSEME Survey on MWE Resources
Gyri Smørdal Losnegaard | Federico Sangati | Carla Parra Escartín | Agata Savary | Sascha Bargmann | Johanna Monti
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper summarizes the preliminary results of an ongoing survey on multiword resources carried out within the IC1207 Cost Action PARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogs and the inventory of multiword datasets on the SIGLEX-MWE website, multiword resources are scattered and difficult to find. In many cases, language resources such as corpora, treebanks, or lexical databases include multiwords as part of their data or take them into account in their annotations. However, these resources need to be centralized to make them accessible. The aim of this survey is to create a portal where researchers can easily find multiword(-aware) language resources for their research. We report on the design of the survey and analyze the data gathered so far. We also discuss the problems we have detected upon examination of the data as well as possible ways of enhancing the survey.

2014

Linguistic Evaluation of Support Verb Constructions by OpenLogos and Google Translate
Anabela Barreiro | Johanna Monti | Brigitte Orliac | Susanne Preuß | Kutz Arrieta | Wang Ling | Fernando Batista | Isabel Trancoso
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a systematic human evaluation of translations of English support verb constructions produced by a rule-based machine translation (RBMT) system (OpenLogos) and a statistical machine translation (SMT) system (Google Translate) for five languages: French, German, Italian, Portuguese and Spanish. We classify support verb constructions by means of their syntactic structure and semantic behavior and present a qualitative analysis of their translation errors. The study aims to verify how machine translation (MT) systems translate fine-grained linguistic phenomena, and how well-equipped they are to produce high-quality translation. Another goal of the linguistically motivated quality analysis of SVC raw output is to reinforce the need for better system hybridization, which leverages the strengths of RBMT to the benefit of SMT, especially in improving the translation of multiword units. Taking multiword units into account, we propose an effective method to achieve MT hybridization based on the integration of semantico-syntactic knowledge into SMT.

2013

Multi-word processing in an ontology-based cross-language information retrieval model for specific domain collections
Maria Pia di Buono | Johanna Monti | Mario Monteleone | Federica Marano
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technologies

Cross-Lingual Information Retrieval and Semantic Interoperability for Cultural Heritage Repositories
Johanna Monti | Mario Monteleone | Maria Pia di Buono | Federica Marano
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

When multiwords go bad in machine translation
Anabela Barreiro | Johanna Monti | Brigitte Orliac | Fernando Batista
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technologies

2011

In search of knowledge: text mining dedicated to technical translation
Johanna Monti | Annibale Elia | Alberto Postiglione | Maria Monteleone | Federica Marano
Proceedings of Translating and the Computer 33

Taking on new challenges in multi-word unit processing for machine translation
Johanna Monti | Anabela Barreiro | Annibale Elia | Federica Marano | Antonella Napoli
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper discusses the qualitative comparative evaluation performed on the results of two machine translation systems with different approaches to the processing of multi-word units. It proposes a solution for overcoming the difficulties multi-word units present to machine translation by adopting a methodology that combines the lexicon grammar approach with OpenLogos ontology and semantico-syntactic rules. The paper also discusses the importance of qualitative evaluation metrics to correctly evaluate the performance of machine translation engines with regard to multi-word units.

2010

Mixed up with machine translation: multi-word units disambiguation challenge
Anabele Barreiro | Annibale Elia | Johanna Monti | Mario Monteleone
Proceedings of Translating and the Computer 32

Co-authors

Carlos Ramisch 5

Francesca Chiusaroli 4

Dagmar Gromann 4

Federica Marano 4

Giulia Speranza 4

Verginica Barbu Mititelu 3

Anabela Barreiro 3

Archna Bhatia 3

Marie Candito 3

Annibale Elia 3

Katerina Gkirtzou 3

Uxoa Iñurrieta 3

Chaya Liebeskind 3

Mario Monteleone 3

Carla Parra Escartín 3

Maria Laura Pierucci 3

Eva Vanmassenhove 3

Veronika Vincze 3

Abigail Walsh 3

Slavko Žitnik 3

Khadija Ait ElFqih 2

Valerio Basile 2

Fernando Batista 2

Sheila Castilho 2

Silvio Cordeiro 2

Felice Dell’Orletta 2

Polona Gantar 2

Bruno Guillaume 2

Jolanta Kovalevskaitė 2

Vincenzo Masucci 2

Joss Moorkens 2

Gennaro Nolano 2

Brigitte Orliac 2

Renata Ramisch 2

Argentina Anna Rescigno 2

Ivelina Stoyanova 2

Fabio Tamburini 2

Tiberio Uricchio 2

Ashwini Vaidya 2

Ekaterina Abramova 1

Sascha Bargmann 1

Anabele Barreiro 1

Eduard Bejček 1

Chérifa Ben Khelil 1

Luisa Bentivogli 1

Carola Carlino 1

Valeria Caruso 1

Ciro Caterino 1

Matthieu Constant 1

Gülşen Eryiğit 1

Daniele Falavigna 1

Petra Giommarelli 1

Roberto Gretter 1

Najet Hadj Mohamed 1

Rejwanul Haque 1

Abdelati Hawwari 1

Menghan Jiang 1

Cvetana Krstev 1

Nikola Ljubešić 1

Verena Lyding 1

Maria Monteleone 1

Giuseppina Morza 1

Antonella Napoli 1

Prashanth Nayak 1

Lionel Nicolas 1

Thomas Pickard 1

Alberto Postiglione 1

Susanne Preuß 1

Wanda Punzi Zarino 1

Behrang QasemiZadeh 1

Michael Rosner 1

Nathan Schneider 1

Mehrnoush Shamsfard 1

Vincenzo Simoniello 1

Gyri Smørdal Losnegaard 1

Maria Carmen Staiano 1

Andon Tchechmedjiev 1

Amalia Todirascu 1

Isabel Trancoso 1

Jakub Waszczuk 1

Lonneke van der Plas 1

Venues