Traian Rebedea - ACL Anthology

Traian Rebedea

2026

Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering
Vlad-Andrei Negoiță | Mihai Masala | Traian Rebedea
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)

Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work we study the characteristics and coverage of Romanian pretraining corpora and we examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments show noteworthy trends in the topics present in Romanian and English data, while also proving the effectiveness of filtering data through improved LLM pretraining performance across multiple benchmarks.

2025

Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications
Traian Rebedea | Leon Derczynski | Shaona Ghosh | Makesh Narsimhan Sreedhar | Faeze Brahman | Liwei Jiang | Bo Li | Yulia Tsvetkov | Christopher Parisien | Yejin Choi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)

Pretrained generative models, especially large language models, provide novel ways for users to interact with computers. While generative NLP research and applications had previously aimed at very domain-specific or task-specific solutions, current LLMs and applications (e.g. dialogue systems, agents) are versatile across many tasks and domains. Despite being trained to be helpful and aligned with human preferences (e.g., harmlessness), enforcing robust guardrails on LLMs remains a challenge. And, even when protected against rudimentary attacks, just like other complex software, LLMs can be vulnerable to attacks using sophisticated adversarial inputs. This tutorial provides a comprehensive overview of key guardrail mechanisms developed for LLMs, along with evaluation methodologies and a detailed security assessment protocol - including auto red-teaming of LLM-powered applications. Our aim is to move beyond the discussion of single prompt attacks and evaluation frameworks towards addressing how guardrailing can be done in complex dialogue systems that employ LLMs.

MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification
Iustin Sirbu | Robert-Adrian Popovici | Cornelia Caragea | Stefan Trausan-Matu | Traian Rebedea
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce **MultiMatch**, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques - heads agreement from **Multi**head Co-training, self-adaptive thresholds from Free**Match**, and Average Pseudo-Margins from Margin**Match** - resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.

Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models
Makesh Narsimhan Sreedhar | Traian Rebedea | Christopher Parisien
Findings of the Association for Computational Linguistics: EMNLP 2025

Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrails models in real-world systems.

AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
Shaona Ghosh | Prasoon Varshney | Makesh Narsimhan Sreedhar | Aishwarya Padmakumar | Traian Rebedea | Jibin Rajan Varghese | Christopher Parisien
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM “jury” system to assess the safety of responses, we obtain Aegis2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets generated using GPT-4. Additionally, we introduce a novel training blend that combines topic following data with safety data. This approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis2.0 data and models to the research community to aid in safety guardrailing of LLMs.

2024

GunStance: Stance Detection for Gun Control and Gun Regulation
Nikesh Gyawali | Iustin Sirbu | Tiberiu Sosea | Sarthak Khanal | Doina Caragea | Traian Rebedea | Cornelia Caragea
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The debate surrounding gun control and gun regulation in the United States has intensified in the wake of numerous mass shooting events. As perspectives on this matter vary, it becomes increasingly important to comprehend individuals’ positions. Stance detection, the task of determining an author’s position towards a proposition or target, has gained attention for its potential use in understanding public perceptions towards controversial topics and identifying the best strategies to address public concerns. In this paper, we present GunStance, a dataset of tweets pertaining to shooting events, focusing specifically on the controversial topics of “banning guns” versus “regulating guns.” The tweets in the dataset are sourced from discussions on Twitter following various shooting incidents in the United States. Amazon Mechanical Turk was used to manually annotate a subset of the tweets relevant to the targets of interest (“banning guns” and “regulating guns”) into three classes: In-Favor, Against, and Neutral. The remaining unlabeled tweets are included in the dataset to facilitate studies on semi-supervised learning (SSL) approaches that can help address the scarcity of the labeled data in stance detection tasks. Furthermore, we propose a hybrid approach that combines curriculum-based SSL and Large Language Models (LLM), and show that the proposed approach outperforms supervised, semi-supervised, and LLM-based zero-shot models in most experiments on our assembled dataset.

Unsupervised Extraction of Dialogue Policies from Conversations
Makesh Narsimhan Sreedhar | Traian Rebedea | Christopher Parisien
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Dialogue policies play a crucial role in developing task-oriented dialogue systems, yet their development and maintenance are challenging and typically require substantial effort from experts in dialogue modeling. While in many situations, large amounts of conversational data are available for the task at hand, people lack an effective solution able to extract dialogue policies from this data. In this paper, we address this gap by first illustrating how Large Language Models (LLMs) can be instrumental in extracting dialogue policies from datasets, through the conversion of conversations into a unified intermediate representation consisting of canonical forms. We then propose a novel method for generating dialogue policies utilizing a controllable and interpretable graph-based methodology. By combining canonical forms across conversations into a flow network, we find that running graph traversal algorithms helps in extracting dialogue flows. These flows are a better representation of the underlying interactions than flows extracted by prompting LLMs. Our technique focuses on giving conversation designers greater control, offering a productivity tool to improve the process of developing dialogue policies.

“Vorbești Românește?” A Recipe to Train Powerful Romanian LLMs with English Instructions
Mihai Masala | Denis Ilie-Ablachim | Alexandru Dima | Dragos Georgian Corlatescu | Miruna-Andreea Zavelca | Ovio Olaru | Simina-Maria Terian | Andrei Terian | Marius Leordeanu | Horia Velicu | Marius Popescu | Mihai Dascalu | Traian Rebedea
Findings of the Association for Computational Linguistics: EMNLP 2024

In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) with the goal of supporting and encouraging research on Romanian LLMs while concurrently creating a generalizable recipe adequate for other low or less-resourced languages.

CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Traian Rebedea | Makesh Sreedhar | Shaona Ghosh | Jiaqi Zeng | Christopher Parisien
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent advancements in instruction-tuning datasets have predominantly focused on specific tasks like mathematical or logical reasoning. There has been a notable gap in data designed for aligning language models to maintain topic relevance in conversations - a critical aspect for deploying chatbots to production. We introduce the CantTalkAboutThis dataset to help language models remain focused on the subject at hand during task-oriented interactions. It consists of synthetic dialogues on a wide range of conversation topics from different domains. These dialogues are interspersed with distractor turns that intentionally divert the chatbot from the predefined topic. Fine-tuning language models on this dataset helps make them resilient to deviating from the assigned role and improves their ability to maintain topical coherence compared to general-purpose instruction-tuned LLMs like gpt-4-turbo and Mixtral-Instruct. Additionally, preliminary observations suggest that training models on this dataset also enhances their performance on fine-grained instruction following tasks, including safety alignment.

Improving Legal Judgement Prediction in Romanian with Long Text Encoders
Mihai Masala | Traian Rebedea | Horia Velicu
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

In recent years, the entire field of Natural Language Processing (NLP) has enjoyed amazing novel results, achieving almost human-like performance on a variety of tasks. The Legal NLP domain has also been part of this process, as it has seen impressive growth. However, general-purpose models are not readily applicable to the legal domain. Due to the nature of the domain (e.g. specialized vocabulary, long documents), specific models and methods are often needed for Legal NLP. In this work we investigate both specialized and general models for predicting the final ruling of a legal case, a task known as Legal Judgment Prediction (LJP). We particularly focus on methods to extend the sequence length of Transformer-based models to better understand the long documents present in legal corpora. Extensive experiments on 4 LJP datasets in Romanian, originating from 2 sources with significantly different sizes and document lengths, show that specialized models and handling long texts are critical for good performance.

2023

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea | Razvan Dinu | Makesh Narsimhan Sreedhar | Christopher Parisien | Jonathan Cohen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. There are several mechanisms that allow LLM providers and developers to add guardrails that are embedded into a specific model at training, e.g. using model alignment. Using a runtime inspired from dialogue management, NeMo Guardrails provides a different approach by allowing developers to add programmable rails to LLM applications - these are user-defined, independent of the underlying LLM, and interpretable. Our initial results show that the proposed approach can be used with several LLM providers to develop controllable and safe LLM applications using programmable rails.

2022

Multimodal Semi-supervised Learning for Disaster Tweet Classification
Iustin Sirbu | Tiberiu Sosea | Cornelia Caragea | Doina Caragea | Traian Rebedea
Proceedings of the 29th International Conference on Computational Linguistics

During natural disasters, people often use social media platforms, such as Twitter, to post information about casualties and damage produced by disasters. This information can help relief authorities gain situational awareness in nearly real time, and enable them to quickly distribute resources where most needed. However, annotating data for this purpose can be burdensome, subjective and expensive. In this paper, we investigate how to leverage the copious amounts of unlabeled data generated on social media by disaster eyewitnesses and affected individuals during disaster events. To this end, we propose a semi-supervised learning approach to improve the performance of neural models on several multimodal disaster tweet classification tasks. Our approach shows significant improvements, obtaining up to 7.7% improvements in F-1 in low-data regimes and 1.9% when using the entire training data. We make our code and data publicly available at https://github.com/iustinsirbu13/multimodal-ssl-for-disaster-tweet-classification.

Distilling the Knowledge of Romanian BERTs Using Multiple Teachers
Andrei-Marius Avram | Darius Catrina | Dumitru-Clementin Cercel | Mihai Dascalu | Traian Rebedea | Vasile Pais | Dan Tufis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Running large-scale pre-trained language models in computationally constrained environments remains a challenging problem yet to be addressed, while transfer learning from these models has become prevalent in Natural Language Processing tasks. Several solutions, including knowledge distillation, network quantization, or network pruning have been previously proposed; however, these approaches focus mostly on the English language, thus widening the gap when considering low-resource languages. In this work, we introduce three light and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base, and DistilMulti-BERT-base-ro. The first two models resulted from the individual distillation of knowledge from two base versions of Romanian BERTs available in literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available Romanian distilled BERT models, which were thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity, and dialect identification. Our experimental results argue that the three distilled models offer performance comparable to their teachers, while being twice as fast on a GPU and ~35% smaller. In addition, we further test the similarity between the predictions of our students versus their teachers by measuring their label and probability loyalty, together with regression loyalty - a new metric introduced in this work.

2021

BART-TL: Weakly-Supervised Topic Label Generation
Cristian Popa | Traian Rebedea
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We propose a novel solution for assigning labels to topic models by using multiple weak labelers. The method leverages generative transformers to learn accurate representations of the most important topic terms and candidate labels. This is achieved by fine-tuning pre-trained BART models on a large number of potential labels generated by state of the art non-neural models for topic labeling, enriched with different techniques. The proposed BART-TL model is able to generate valuable and novel labels in a weakly-supervised manner and can be improved by adding other weak labelers or distant supervision on similar tasks.

jurBERT: A Romanian BERT Model for Legal Judgement Prediction
Mihai Masala | Radu Cristian Alexandru Iacob | Ana Sabina Uban | Marina Cidota | Horia Velicu | Traian Rebedea | Marius Popescu
Proceedings of the Natural Legal Language Processing Workshop 2021

Transformer-based models have become the de facto standard in the field of Natural Language Processing (NLP). By leveraging large unlabeled text corpora, they enable efficient transfer learning leading to state-of-the-art results on numerous NLP tasks. Nevertheless, for low resource languages and highly specialized tasks, transformer models tend to lag behind more classical approaches (e.g. SVM, LSTM) due to the lack of aforementioned corpora. In this paper we focus on the legal domain and we introduce a Romanian BERT model pre-trained on a large specialized corpus. Our model outperforms several strong baselines for legal judgement prediction on two different corpora consisting of cases from trials involving banks in Romania.

Dialect Identification through Adversarial Learning and Knowledge Distillation on Romanian BERT
George-Eduard Zaharia | Andrei-Marius Avram | Dumitru-Clementin Cercel | Traian Rebedea
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

Dialect identification is a task with applicability in a vast array of domains, ranging from automatic speech recognition to opinion mining. This work presents our architectures used for the VarDial 2021 Romanian Dialect Identification subtask. We introduced a series of solutions based on Romanian or multilingual Transformers, as well as adversarial training techniques. At the same time, we experimented with a knowledge distillation tool in order to check whether a smaller model can maintain the performance of our best approach. Our best solution managed to obtain a weighted F1-score of 0.7324, allowing us to obtain the 2nd place on the leaderboard.

2020

Neural Approaches for Natural Language Interfaces to Databases: A Survey
Radu Cristian Alexandru Iacob | Florin Brad | Elena-Simona Apostol | Ciprian-Octavian Truică | Ionel Alexandru Hosu | Traian Rebedea
Proceedings of the 28th International Conference on Computational Linguistics

A natural language interface to databases (NLIDB) enables users without technical expertise to easily access information from relational databases. Interest in NLIDBs has resurged in the past years due to the availability of large datasets and improvements to neural sequence-to-sequence models. In this survey we focus on the key design decisions behind current state of the art neural approaches, which we group into encoder and decoder improvements. We highlight the three most important directions, namely linking question tokens to database schema elements (schema linking), better architectures for encoding the textual query taking into account the schema (schema encoding), and improved generation of structured queries using autoregressive neural models (grammar-based decoders). To foster future research, we also present an overview of the most important NLIDB datasets, together with a comparison of the top performing neural models and a short insight into recent non deep learning solutions.

UPB at SemEval-2020 Task 9: Identifying Sentiment in Code-Mixed Social Media Texts Using Transformers and Multi-Task Learning
George-Eduard Zaharia | George-Alexandru Vlad | Dumitru-Clementin Cercel | Traian Rebedea | Costin Chiru
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Sentiment analysis is a process widely used in opinion mining campaigns conducted today. This phenomenon presents applications in a variety of fields, especially in collecting information related to the attitude or satisfaction of users concerning a particular subject. However, the task of managing such a process becomes noticeably more difficult when it is applied in cultures that tend to combine two languages in order to express ideas and thoughts. By interleaving words from two languages, the user can express themselves with ease, but at the cost of making the text far less intelligible both for those who are not familiar with this technique and for standard opinion mining algorithms. In this paper, we describe the systems developed by our team for SemEval-2020 Task 9 that aims to cover two well-known code-mixed languages: Hindi-English and Spanish-English. We intend to solve this issue by introducing a solution that takes advantage of several neural network approaches, as well as pre-trained word embeddings. Our approach (multilingual BERT) achieves promising performance on the Hindi-English task, with an average F1-score of 0.6850, registered on the competition leaderboard, ranking our team 16th out of 62 participants. For the Spanish-English task, we obtained an average F1-score of 0.7064, ranking our team 17th out of 29 participants, by using another multilingual Transformer-based model, XLM-RoBERTa.

Exploring the Power of Romanian BERT for Dialect Identification
George-Eduard Zaharia | Andrei-Marius Avram | Dumitru-Clementin Cercel | Traian Rebedea
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Dialect identification represents a key aspect for improving a series of tasks, for example, opinion mining, considering that the location of the speaker can greatly influence the attitude towards a subject. In this work, we describe the systems developed by our team for VarDial 2020: Romanian Dialect Identification, a task specifically created for challenging participants to solve the previously mentioned issue. More specifically, we introduce a series of neural systems based on Transformers, that combine a BERT model exclusively pre-trained on the Romanian language with techniques such as adversarial training or character-level embeddings. By using these approaches, we were able to obtain a 0.6475 macro F1 score on the test dataset, thus allowing us to be ranked 5th out of 8 participant teams.

2019

Answering questions by learning to rank - Learning to rank by answering questions
George Sebastian Pirtoaca | Traian Rebedea | Stefan Ruseti
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Answering multiple-choice questions in a setting in which no supporting documents are explicitly provided continues to stand as a core problem in natural language processing. The contribution of this article is two-fold. First, it describes a method which can be used to semantically rank documents extracted from Wikipedia or similar natural language corpora. Second, we propose a model employing the semantic ranking that holds the first place in two of the most popular leaderboards for answering multiple-choice questions: ARC Easy and Challenge. To achieve this, we introduce a self-attention based neural network that latently learns to rank documents by their importance related to a given question, whilst optimizing the objective of predicting the correct answer. These documents are considered relevant contexts for the underlying question. We have published the ranked documents so that they can be used off-the-shelf to improve downstream decision models.

Cross-Domain Training for Goal-Oriented Conversational Agents
Alexandra Maria Bodîrlău | Stefania Budulan | Traian Rebedea
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Goal-Oriented Chatbots in fields such as customer support, providing certain information or general help with bookings or reservations, suffer from low performance partly due to the difficulty of obtaining large domain-specific annotated datasets. Given that the problem is closely related to the domain of the conversational agent and the data belonging to a specific domain is difficult to annotate, there have been some attempts at surpassing these challenges such as unsupervised pre-training or transfer learning between different domains. A more thorough analysis of the transfer learning mechanism is justified by the significant improvement of the results demonstrated in the results section. We describe extensive experiments using transfer learning and warm-starting techniques with improvements of more than 5% in relative percentage of success rate in the majority of cases, and up to 10x faster convergence as opposed to training the system without them.

2018

Natural Language Interface for Databases Using a Dual-Encoder Model
Ionel Alexandru Hosu | Radu Cristian Alexandru Iacob | Florin Brad | Stefan Ruseti | Traian Rebedea
Proceedings of the 27th International Conference on Computational Linguistics

We propose a sketch-based two-step neural model for generating structured queries (SQL) based on a user’s request in natural language. The sketch is obtained by using placeholders for specific entities in the SQL query, such as column names, table names, aliases and variables, in a process similar to semantic parsing. The first step is to apply a sequence-to-sequence (SEQ2SEQ) model to determine the most probable SQL sketch based on the request in natural language. Then, a second network designed as a dual-encoder SEQ2SEQ model using both the text query and the previously obtained sketch is employed to generate the final SQL query. Our approach shows improvements over previous approaches on two recent large datasets (WikiSQL and SENLIDB) suitable for data-driven solutions for natural language interfaces for databases.

2017

Dataset for a Neural Natural Language Interface for Databases (NNLIDB)
Florin Brad | Radu Cristian Alexandru Iacob | Ionel Alexandru Hosu | Traian Rebedea
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Progress in natural language interfaces to databases (NLIDB) has been slow mainly due to linguistic issues (such as language ambiguity) and domain portability. Moreover, the lack of a large corpus to be used as a standard benchmark has made data-driven approaches difficult to develop and compare. In this paper, we revisit the problem of NLIDBs and recast it as a sequence translation problem. To this end, we introduce a large dataset extracted from the Stack Exchange Data Explorer website, which can be used for training neural natural language interfaces for databases. We also report encouraging baseline results on a smaller manually annotated test corpus, obtained using an attention-based sequence-to-sequence neural network.

Neural Paraphrase Generation using Transfer Learning
Florin Brad | Traian Rebedea
Proceedings of the 10th International Conference on Natural Language Generation

Progress in statistical paraphrase generation has been hindered for a long time by the lack of large monolingual parallel corpora. In this paper, we adapt the neural machine translation approach to paraphrase generation and perform transfer learning from the closely related task of entailment generation. We evaluate the model on the Microsoft Research Paraphrase (MSRP) corpus and show that the model is able to generate sentences that capture part of the original meaning, but fails to pick up on important words or to show large lexical variation.

2016

Using Embedding Masks for Word Categorization
Stefan Ruseti | Traian Rebedea | Stefan Trausan-Matu
Proceedings of the 1st Workshop on Representation Learning for NLP

Co-authors

Andrei-Marius Avram 3

Cornelia Caragea 3

Ionel Alexandru Hosu 3

Stefan Ruseti 3

George-Eduard Zaharia 3

Doina Caragea 2

Mihai Dascalu 2

Marius Popescu 2

Tiberiu Sosea 2

Stefan Trausan-Matu 2

Elena-Simona Apostol 1

Alexandra Maria Bodîrlău 1

Faeze Brahman 1

Stefania Budulan 1

Darius Catrina 1

Marina Cidota 1

Jonathan Cohen 1

Dragos Georgian Corlatescu 1

Leon Derczynski 1

Alexandru Dima 1

Nikesh Gyawali 1

Denis Ilie-Ablachim 1

Sarthak Khanal 1

Marius Leordeanu 1

Vlad-Andrei Negoiță 1

Aishwarya Padmakumar 1

George Sebastian Pirtoaca 1

Cristian Popa 1

Robert-Adrian Popovici 1

Makesh Sreedhar 1

Simina-Maria Terian 1

Andrei Terian 1

Ciprian-Octavian Truică 1

Yulia Tsvetkov 1

Ana Sabina Uban 1

Jibin Rajan Varghese 1

Prasoon Varshney 1

George-Alexandru Vlad 1

Miruna-Andreea Zavelca 1

Venues