We present Sherlock, our first-place system in the UNLP 2024 Shared Task on Question Answering. We employ a mix of methods: supervised fine-tuning and direct preference optimization of instruction-tuned models on automatically translated datasets, model weight merging, and retrieval-augmented generation. We present and motivate our chosen sequence of steps, as well as an ablation study quantifying the effect of each additional step. The resulting model and code are made publicly available (download links provided in the paper).
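As a rough illustration of the weight-merging step, the sketch below linearly interpolates two checkpoints of the same architecture; this is a generic merging recipe, not necessarily the exact one used for Sherlock, and the checkpoint names are hypothetical.

```python
# Minimal sketch of linear model weight merging (illustrative, not the
# paper's exact recipe). Assumes two fine-tuned checkpoints sharing the
# same architecture and parameter names.
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Interpolate two state dicts: alpha * A + (1 - alpha) * B."""
    merged = {}
    for name, tensor_a in sd_a.items():
        tensor_b = sd_b[name]
        if tensor_a.is_floating_point():
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            merged[name] = tensor_a  # keep integer buffers from model A
    return merged

# Hypothetical usage with checkpoints produced by SFT and DPO:
# sd_sft = torch.load("sherlock-sft.pt")
# sd_dpo = torch.load("sherlock-dpo.pt")
# model.load_state_dict(merge_state_dicts(sd_sft, sd_dpo, alpha=0.5))
```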
We introduce NLP-Cube: an end-to-end natural language processing framework, evaluated in the CoNLL 2018 Shared Task "Multilingual Parsing from Raw Text to Universal Dependencies". It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks and written in Python, this ready-to-use open-source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview of the results obtained in the competition.
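To make the general pattern concrete, here is a minimal bidirectional-LSTM tagger in PyTorch; this is only a rough illustration of a recurrent per-task architecture, not NLP-Cube's actual networks, and all dimensions are arbitrary.

```python
# Minimal BiLSTM sequence tagger, illustrating the kind of recurrent
# architecture used per task (not NLP-Cube's exact implementation).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, tagset_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> tag scores: (batch, seq_len, tags)
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)

tagger = BiLSTMTagger(vocab_size=10000, tagset_size=17)
scores = tagger(torch.randint(1, 10000, (2, 12)))  # dummy batch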
This paper addresses multi-word expression (MWE) detection by employing a new decoding strategy inspired by graph-based parsing. We show that this architecture achieves state-of-the-art results with minimal feature engineering, relying only on lexicalized and morphological attributes. We validate our approach in a multilingual setting, using the standard MWE corpora supplied in the PARSEME Shared Task.
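For intuition on graph-based decoding, the toy sketch below scores candidate head-dependent arcs and lets each token greedily pick its highest-scoring head; the paper's actual decoding strategy is not reproduced here, and the scores are dummy data.

```python
# Toy sketch of arc-factored graph decoding: each token selects its
# highest-scoring head over a fully connected arc-score matrix.
# Illustrative only; not the paper's exact algorithm.
import numpy as np

def greedy_heads(arc_scores):
    """arc_scores[i, j] = score of token j taking token i as head
    (index 0 is an artificial ROOT). Returns a head index per token."""
    n = arc_scores.shape[0]
    heads = []
    for dep in range(1, n):
        candidates = arc_scores[:, dep].copy()
        candidates[dep] = -np.inf  # forbid self-loops
        heads.append(int(np.argmax(candidates)))
    return heads

scores = np.random.rand(5, 5)  # ROOT + 4 tokens, dummy scores
print(greedy_heads(scores))
```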
This paper presents RACAI's approach, experiments and results in the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Starting from raw text, we cover tokenization, sentence splitting, word segmentation, tagging, lemmatization and parsing. All results are reported under strict training, development and testing conditions, in which the corpora provided for the shared task are used "as is", without any modifications to the composition of the train and development sets.
Decision trees have previously been employed in many machine-learning tasks such as part-of-speech tagging, lemmatization, morphological-attribute resolution, letter-to-sound conversion and statistical parametric speech synthesis. In this paper we introduce an optimized tree-computation algorithm based on the original ID3 algorithm. We also introduce a tree-pruning method that uses a development set to delete nodes from over-fitted models; the pruning algorithm additionally caches intermediate results for speed-up. Our algorithm is almost 200 times faster than a naive implementation and yields accurate results on our test datasets.
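For reference, the core of ID3 is selecting the attribute with the highest information gain at each node; a minimal sketch follows, using made-up linguistic features. The paper's optimized computation, caching and development-set pruning are not reproduced here.

```python
# Minimal ID3-style attribute selection via information gain
# (illustrative; does not include the paper's optimizations or pruning).
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr):
    base = entropy(labels)
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# Dummy training rows with hypothetical morphological features:
rows = [{"pos": "NN", "suffix": "s"}, {"pos": "VB", "suffix": "s"},
        {"pos": "NN", "suffix": "e"}]
labels = ["plural", "3sg", "singular"]
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print(best)  # attribute chosen for the root split
```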
Voice-enabled human-computer interfaces (HCI) that integrate automatic speech recognition, text-to-speech synthesis and natural language understanding have become a commodity, driven by the ubiquity of smartphones and other gadgets in our daily lives. Smart assistants are able to respond to simple queries (similar to text-based question-answering systems), perform simple tasks (call a number, reject a call etc.) and help organize appointments. In this paper we introduce a newly created process automation platform that enables the user to control applications and home appliances and to query the system for information using a natural voice interface. We offer an overview of the technologies that enabled us to construct our system and present different usage scenarios in home and office environments.
Multiword expressions are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions are the subgroup in which a verb, considered in its canonical (dictionary) form, is the syntactic head of the group. All multiword expressions pose a great challenge for natural language processing, but the verbal ones are particularly interesting for tasks such as parsing, as the verb is the central element in the syntactic organization of a sentence. In this paper we introduce our data-driven approach to verbal multiword expressions, which was objectively validated during the PARSEME shared task on verbal multiword expression identification. We tested our approach on 12 languages, and we provide detailed information about corpus composition, the feature selection process, the validation procedure and performance on all languages.
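To illustrate the flavor of features such a data-driven approach might use, the sketch below extracts lexicalized and morphological attributes for a verb-centered candidate pair; the feature names are hypothetical and do not reflect the paper's actual feature set.

```python
# Illustrative feature extraction for a verb-centered MWE candidate,
# combining lexicalized and morphological attributes (hypothetical
# feature set, not the paper's).
def vmwe_features(verb, dependent):
    return {
        "verb_lemma": verb["lemma"],
        "dep_lemma": dependent["lemma"],
        "dep_pos": dependent["pos"],
        "lemma_pair": verb["lemma"] + "+" + dependent["lemma"],
        "same_case": verb.get("case") == dependent.get("case"),
    }

print(vmwe_features({"lemma": "take", "pos": "VERB"},
                    {"lemma": "place", "pos": "NOUN"}))
```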
The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country's most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted, and for the oral one 300 hours of recordings. Texts are chosen according to their functional style, domain and subdomain, also with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and transformed into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches, as well as the original files when this does not contravene the stipulations in our protocols with text providers.
Spoken language translation is currently a hot topic in the research community. The task is very complex, involving automatic speech recognition, text normalization and machine translation. We present our speech translation system, which was compared against the other systems participating in the IWSLT 2016 Shared Task. We introduce our ASR system for English and our MT systems for the English-to-French (En-Fr) and English-to-German (En-De) language pairs. Additionally, for the English-to-French challenge we introduce a methodology that enhances statistical phrase-based translation with translation equivalents deduced from monolingual corpora using neural word embeddings.
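One common way to deduce translation equivalents from monolingual embeddings is nearest-neighbor search in a shared vector space; the sketch below assumes source and target vectors have already been mapped into a common space (the mapping step is not shown), and all vectors are dummy data. This is a generic illustration, not the paper's exact methodology.

```python
# Sketch of extracting translation equivalents via cosine nearest
# neighbors in a shared embedding space (illustrative; the cross-lingual
# mapping step is assumed to have been done already).
import numpy as np

def translation_candidates(src_vec, tgt_words, tgt_matrix, k=3):
    """Return the k target words most cosine-similar to src_vec."""
    tgt_norm = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    src_norm = src_vec / np.linalg.norm(src_vec)
    sims = tgt_norm @ src_norm
    top = np.argsort(-sims)[:k]
    return [(tgt_words[i], float(sims[i])) for i in top]

tgt_words = ["maison", "chien", "livre"]   # dummy target vocabulary
tgt_matrix = np.random.rand(3, 50)         # dummy target vectors
print(translation_candidates(np.random.rand(50), tgt_words, tgt_matrix))
```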
This paper introduces a recent extension of a Romanian speech corpus to include prosodic annotations of the speech data in the form of ToBI labels. We describe the methodology for determining the pitch patterns common in Romanian, annotate the speech resource, and then compare two text-to-speech synthesis systems to establish the benefits of adding this type of information to our speech resource. The result is a publicly available speech dataset which can be used to further develop speech synthesis systems or to automatically learn the prediction of ToBI labels from Romanian text.
The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian-to-English translation. We describe the baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models, and our post-translation cascading system designed to improve the translation without external resources. We further present our attempts at creating a more controllable decoder than the open-source Moses system offers.