Luciano Del Corro

Also published as: Luciano Del Corro, Luciano del Corro

2025

Despite the remarkable success of large language models (LLMs) in English, a significant performance gap remains in non-English languages. To address this, we introduce a novel approach for strategically constructing a multilingual synthetic instruction tuning dataset, sPhinX. Unlike prior methods that directly translate fixed instruction-response pairs, sPhinX enhances diversity by selectively augmenting English instruction-response pairs with multilingual translations. Additionally, we propose LANGIT, a novel N-shot guided fine-tuning strategy, which further enhances model performance by incorporating contextually relevant examples in each training sample. Our ablation study shows that our approach enhances the multilingual capabilities of Mistral-7B and Phi-3-Small, improving performance by an average of 39.8% and 11.2%, respectively, across multilingual benchmarks in reasoning, question answering, reading comprehension, and machine translation. Moreover, sPhinX maintains strong performance on English LLM benchmarks while exhibiting minimal to no catastrophic forgetting, even when trained on 51 languages.

Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching
Juan Wisznia | Cecilia Bolaños | Juan Tollo | Giovanni Franco Gabriel Marraffini | Agustín Andrés Gianolini | Noe Fabian Hsueh | Luciano Del Corro
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce a novel framework for analyzing sorting algorithms in pairwise ranking prompting (PRP), re-centering the cost model around LLM inferences rather than traditional pairwise comparisons. While classical metrics based on comparison counts have traditionally been used to gauge efficiency, our analysis reveals that expensive LLM inferences overturn these predictions; accordingly, our framework encourages strategies such as batching and caching to mitigate inference costs. We show that algorithms optimal in the classical setting can lose efficiency when LLM inferences dominate the cost under certain optimizations.
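The shift from counting comparisons to counting inferences can be illustrated with a small sketch. It is not the authors' implementation: `oracle` is a hypothetical stand-in for a single LLM pairwise inference, `CachedComparator` and `bubble_sort` are illustrative names, and only caching (not batching) is shown. Bubble sort revisits adjacent pairs across passes, so a cache lets repeated comparisons reuse an earlier inference.

```python
class CachedComparator:
    """Wraps a pairwise oracle and counts how often it is really called.

    `oracle(a, b)` is a hypothetical placeholder for one LLM inference that
    returns True when `a` should rank before `b`. Items must be hashable.
    """

    def __init__(self, oracle):
        self.oracle = oracle
        self.cache = {}
        self.comparisons = 0  # comparisons requested by the sorting algorithm
        self.inferences = 0   # oracle calls actually made (cache misses)

    def __call__(self, a, b):
        self.comparisons += 1
        if a == b:
            return 0
        if (a, b) not in self.cache:
            self.inferences += 1
            a_first = self.oracle(a, b)
            # Store both orientations so (b, a) is answered without a new call.
            self.cache[(a, b)] = a_first
            self.cache[(b, a)] = not a_first
        return -1 if self.cache[(a, b)] else 1


def bubble_sort(items, cmp):
    """Plain bubble sort driven by a three-way comparator."""
    items = list(items)
    n = len(items)
    for i in range(n):
        for j in range(n - 1 - i):
            if cmp(items[j], items[j + 1]) > 0:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items


# Toy oracle: numeric order stands in for an LLM preference judgment.
cmp = CachedComparator(lambda a, b: a < b)
ranked = bubble_sort([2, 1, 3, 4], cmp)
print(ranked, cmp.comparisons, cmp.inferences)
```

In this toy run the sort requests more comparisons than the oracle answers, because later passes revisit pairs that are already cached. Under a cost model where only `inferences` is expensive, that gap is what matters.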

2024

The Greatest Good Benchmark: Measuring LLMs’ Alignment with Utilitarian Moral Dilemmas
Giovanni Franco Gabriel Marraffini | Andrés Cotton | Noe Fabian Hsueh | Axel Fridman | Juan Wisznia | Luciano Del Corro
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The question of how to make decisions that maximise the well-being of all persons is highly relevant to designing language models that are beneficial to humanity and free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards. Most LLMs have a marked preference for impartial beneficence and rejection of instrumental harm. These findings showcase the ‘artificial moral compass’ of LLMs, offering insights into their moral alignment.

2021

From Stock Prediction to Financial Relevance: Repurposing Attention Weights to Assess News Relevance Without Manual Annotations
Luciano Del Corro | Johannes Hoffart
Proceedings of the Third Workshop on Economics and Natural Language Processing

We present a method to automatically identify financially relevant news using stock price movements and news headlines as input. The method repurposes the attention weights of a neural network initially trained to predict stock prices to assign a relevance score to each headline, eliminating the need for manually labeled training data. Our experiments on the four most relevant US stock indices and 1.5M news headlines show that the method ranks relevant news highly, and that this ranking is positively correlated with the accuracy of the initial stock price prediction task.

Unsupervised Multi-View Post-OCR Error Correction With Language Models
Harsh Gupta | Luciano Del Corro | Samuel Broscheit | Johannes Hoffart | Eliot Brenner
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We investigate post-OCR correction in a setting where we have access to different OCR views of the same document. The goal of this study is to understand if a pretrained language model (LM) can be used in an unsupervised way to reconcile the different OCR views such that their combination contains fewer errors than each individual view. This approach is motivated by scenarios in which unconstrained text generation for error correction is too risky. We evaluated different pretrained LMs on two datasets and found significant gains in realistic scenarios with up to 15% WER improvement over the best OCR view. We also show the importance of domain adaptation for post-OCR correction on out-of-domain documents.

2018

Facts That Matter
Marco Ponza | Luciano Del Corro | Gerhard Weikum
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This work introduces fact salience: the task of generating a machine-readable representation of the most prominent information in a text document as a set of facts. We also present SalIE, the first fact salience system. SalIE is unsupervised and knowledge agnostic, based on open information extraction to detect facts in natural language text, PageRank to determine their relevance, and clustering to promote diversity. We compare SalIE with several baselines (including positional, standard for saliency tasks), and in an extrinsic evaluation, with state-of-the-art automatic text summarizers. SalIE outperforms baselines and text summarizers, showing that facts are an effective way to compress information.
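As a rough illustration of the PageRank step only, the sketch below runs a plain power-iteration PageRank over a small graph whose nodes stand for extracted facts. How SalIE builds its fact graph is not described in this abstract, so the graph, the node names, and the `pagerank` helper are all hypothetical. The open information extraction and clustering stages are not shown.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [successor, ...]}.

    Every successor must also appear as a key in `graph`. Nodes without
    outgoing edges spread their rank evenly over all nodes.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleportation term shared by every node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            successors = graph[node]
            if successors:
                share = damping * rank[node] / len(successors)
                for target in successors:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical fact graph: an edge means one fact "supports" another.
fact_graph = {
    "fact_a": ["fact_b"],
    "fact_b": ["fact_c"],
    "fact_c": ["fact_b"],
    "fact_d": ["fact_b"],
}
scores = pagerank(fact_graph)
most_salient = max(scores, key=scores.get)
print(most_salient)
```

Facts that many other facts point to accumulate rank, which is the intuition behind using a graph centrality measure as a relevance signal.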

diaNED: Time-Aware Named Entity Disambiguation for Diachronic Corpora
Prabal Agarwal | Jannik Strötgen | Luciano del Corro | Johannes Hoffart | Gerhard Weikum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Named Entity Disambiguation (NED) systems perform well on news articles and other texts covering a specific time interval. However, NED quality drops when inputs span long time periods, as in archives or historic corpora. This paper presents the first time-aware method for NED that resolves ambiguities even when mention contexts give only a few cues. The method is based on computing temporal signatures for entities and comparing these to the temporal contexts of input mentions. Our experiments show superior quality on a newly created diachronic corpus.
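One way to picture the comparison of temporal signatures with mention contexts is the sketch below. It is an assumption-laden toy, not the method from the paper: it treats a signature as a normalized histogram of years, compares signatures with cosine similarity, and uses invented candidate names and years. The helper names `signature`, `cosine`, and `rank_candidates` are hypothetical.

```python
import math
from collections import Counter


def signature(years):
    """Normalized histogram of years (assumed form of a temporal signature)."""
    counts = Counter(years)
    total = sum(counts.values())
    return {year: count / total for year, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse year distributions."""
    dot = sum(value * q.get(year, 0.0) for year, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_candidates(context_years, candidate_years):
    """Order candidate entities by temporal fit with the mention context."""
    context = signature(context_years)
    scores = {
        name: cosine(context, signature(years))
        for name, years in candidate_years.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Invented example: years found near the mention vs. years tied to each entity.
ranking = rank_candidates(
    [1912, 1913],
    {"candidate_A": [1910, 1912, 1913], "candidate_B": [1995, 1998]},
)
print(ranking)
```

Exact year matching is deliberately crude here. A practical system would likely smooth over neighboring years and combine the temporal score with textual context features.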

A Study of the Importance of External Knowledge in the Named Entity Recognition Task
Dominic Seyler | Tatiana Dembelova | Luciano Del Corro | Johannes Hoffart | Gerhard Weikum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this work, we discuss the importance of external knowledge for performing Named Entity Recognition (NER). We present a novel modular framework that divides the knowledge into four categories according to the depth of knowledge they convey. Each category consists of a set of features automatically generated from different information sources, such as a knowledge base, a list of names, or document-specific semantic annotations. Further, we show the effects on performance when incrementally adding deeper knowledge and discuss effectiveness/efficiency trade-offs.

2017

MinIE: Minimizing Facts in Open Information Extraction
Kiril Gashteovski | Rainer Gemulla | Luciano del Corro
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The goal of Open Information Extraction (OIE) is to extract surface relations and their arguments from natural-language text in an unsupervised, domain-independent manner. In this paper, we propose MinIE, an OIE system that aims to provide useful, compact extractions with high precision and recall. MinIE approaches these goals by (1) representing information about polarity, modality, attribution, and quantities with semantic annotations instead of in the actual extraction, and (2) identifying and removing parts that are considered overly specific. We conducted an experimental study with several real-world datasets and found that MinIE achieves competitive or higher precision and recall than most prior systems, while at the same time producing shorter, semantically enriched extractions.