Paramita Mirza

2025

Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy
Paramita Mirza | Lucas Weber | Fabian Küch
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both—efficient and universal—by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data—crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.

2024

pdf bib abs

ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler
Paramita Mirza | Viju Sudhi | Soumya Ranjan Sahoo | Sinchana Ramakanth Bhat
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

State-of-the-art intent classification (IC) and slot filling (SF) methods often rely on data-intensive deep learning models, limiting their practicality for industry applications. Large language models on the other hand, particularly instruction-tuned models (Instruct-LLMs), exhibit remarkable zero-shot performance across various natural language tasks. This study evaluates Instruct-LLMs on popular benchmark datasets for IC and SF, emphasizing their capacity to learn from fewer examples. We introduce ILLUMINER, an approach framing IC and SF as language generation tasks for Instruct-LLMs, with a more efficient SF-prompting method compared to prior work. A comprehensive comparison with multiple baselines shows that our approach, using the FLAN-T5 11B model, outperforms the state-of-the-art joint IC+SF method and in-context learning with GPT3.5 (175B), particularly in slot filling by 11.1–32.2 percentage points. Additionally, our in-depth ablation study demonstrates that parameter-efficient fine-tuning requires less than 6% of training data to yield comparable performance with traditional full-weight fine-tuning.

2021

pdf bib abs

AligNarr: Aligning Narratives on Movies
Paramita Mirza | Mostafa Abouhamra | Gerhard Weikum
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

High-quality alignment between movie scripts and plot summaries is an asset for learning to summarize stories and to generate dialogues. The alignment task is challenging as scripts and summaries substantially differ in details and abstraction levels as well as in linguistic register. This paper addresses the alignment problem by devising a fully unsupervised approach based on a global optimization model. Experimental results on ten movies show the viability of our method with 76% F1-score and its superiority over a previous baseline. We publish alignments for 914 movies to foster research in this new topic.

pdf bib abs

PRIDE: Predicting Relationships in Conversations
Anna Tigunova | Paramita Mirza | Andrew Yates | Gerhard Weikum
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Automatically extracting interpersonal relationships of conversation interlocutors can enrich personal knowledge bases to enhance personalized search, recommenders and chatbots. To infer speakers’ relationships from dialogues we propose PRIDE, a neural multi-label classifier, based on BERT and Transformer for creating a conversation representation. PRIDE utilizes dialogue structure and augments it with external knowledge about speaker features and conversation style. Unlike prior works, we address multi-label prediction of fine-grained relationships. We release large-scale datasets, based on screenplays of movies and TV shows, with directed relationships of conversation participants. Extensive experiments on both datasets show superior performance of PRIDE compared to the state-of-the-art baselines.

2020

pdf bib abs

RedDust: a Large Reusable Dataset of Reddit User Traits
Anna Tigunova | Paramita Mirza | Andrew Yates | Gerhard Weikum
Proceedings of the Twelfth Language Resources and Evaluation Conference

Social media is a rich source of assertions about personal traits, such as “I am a doctor” or “my hobby is playing tennis”. Precisely identifying explicit assertions is difficult, though, because of the users’ highly varied vocabulary and language expressions. Identifying personal traits from implicit assertions like I’ve been at work treating patients all day is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling for over 300k Reddit users across five attributes: profession, hobby, family status, age,and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models to deal with implicit assertions. RedDust consists of users’ personal traits, which are (attribute, value) pairs, along with users’ post ids, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and show interesting statistics and insights into the data. We also compare different classifiers, which can be learned from RedDust. To the best of our knowledge, RedDust is the first annotated language resource about Reddit users at large scale. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.

pdf bib abs

CHARM: Inferring Personal Attributes from Conversations
Anna Tigunova | Andrew Yates | Paramita Mirza | Gerhard Weikum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Personal knowledge about users’ professions, hobbies, favorite food, and travel preferences, among others, is a valuable asset for individualized AI, such as recommenders or chatbots. Conversations in social media, such as Reddit, are a rich source of data for inferring personal facts. Prior work developed supervised methods to extract this knowledge, but these approaches can not generalize beyond attribute values with ample labeled training samples. This paper overcomes this limitation by devising CHARM: a zero-shot learning method that creatively leverages keyword extraction and document retrieval in order to predict attribute values that were never seen during training. Experiments with large datasets from Reddit show the viability of CHARM for open-ended attributes, such as professions and hobbies.

2019

pdf bib abs

KnowledgeNet: A Benchmark Dataset for Knowledge Base Population
Filipe Mesquita | Matteo Cannaviccio | Jordan Schmidek | Paramita Mirza | Denilson Barbosa
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

KnowledgeNet is a benchmark dataset for the task of automatically populating a knowledge base (Wikidata) with facts expressed in natural language text on the web. KnowledgeNet provides text exhaustively annotated with facts, thus enabling the holistic end-to-end evaluation of knowledge base population systems as a whole, unlike previous benchmarks that are more suitable for the evaluation of individual subcomponents (e.g., entity linking, relation extraction). We discuss five baseline approaches, where the best approach achieves an F1 score of 0.50, significantly outperforming a traditional approach by 79% (0.28). However, our best baseline is far from reaching human performance (0.82), indicating our dataset is challenging. The KnowledgeNet dataset and baselines are available at https://github.com/diffbot/knowledge-net

pdf bib abs

Discovering the Functions of Language in Online Forums
Youmna Ismaeil | Oana Balalau | Paramita Mirza
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

In this work, we revisit the functions of language proposed by linguist Roman Jakobson and we highlight their potential in analyzing online forum conversations. We investigate the relationship between functions and other properties of comments, such as controversiality. We propose and evaluate a semi-supervised framework for predicting the functions of Reddit comments. To accommodate further research, we release a corpus of 165K comments annotated with their functions of language.

pdf bib abs

Coverage of Information Extraction from Sentences and Paragraphs
Simon Razniewski | Nitisha Jain | Paramita Mirza | Gerhard Weikum
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Scalar implicatures are language features that imply the negation of stronger statements, e.g., “She was married twice” typically implicates that she was not married thrice. In this paper we discuss the importance of scalar implicatures in the context of textual information extraction. We investigate how textual features can be used to predict whether a given text segment mentions all objects standing in a certain relationship with a certain subject. Preliminary results on Wikipedia indicate that this prediction is feasible, and yields informative assessments.

2018

pdf bib abs

KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents
Paramita Mirza | Fariz Darari | Rahmad Mahendra
Proceedings of the 12th International Workshop on Semantic Evaluation

We present KOI (Knowledge of Incidents), a system that given news articles as input, builds a knowledge graph (KOI-KG) of incidental events. KOI-KG can then be used to efficiently answer questions such “How many killing incidents happened in 2017 that involve Sean?” The required steps in building the KG include: (i) document preprocessing involving word sense disambiguation, named-entity recognition, temporal expression recognition and normalization, and semantic role labeling; (ii) incidental event extraction and coreference resolution via document clustering; and (iii) KG construction and population.

2017

pdf bib abs

Cardinal Virtues: Extracting Relation Cardinalities from Text
Paramita Mirza | Simon Razniewski | Fariz Darari | Gerhard Weikum
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Information extraction (IE) from text has largely focused on relations between individual entities, such as who has won which award. However, some facts are never fully mentioned, and no IE method has perfect recall. Thus, it is beneficial to also tap contents about the cardinalities of these relations, for example, how many awards someone has won. We introduce this novel problem of extracting cardinalities and discusses the specific challenges that set it apart from standard IE. We present a distant supervision method using conditional random fields. A preliminary evaluation results in precision between 3% and 55%, depending on the difficulty of relations.

2016

pdf bib abs

CATENA: CAusal and TEmporal relation extraction from NAtural language texts
Paramita Mirza | Sara Tonelli
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present CATENA, a sieve-based system to perform temporal and causal relation extraction and classification from English texts, exploiting the interaction between the temporal and the causal model. We evaluate the performance of each sieve, showing that the rule-based, the machine-learned and the reasoning components all contribute to achieving state-of-the-art performance on TempEval-3 and TimeBank-Dense data. Although causal relations are much sparser than temporal ones, the architecture and the selected features are mostly suitable to serve both tasks. The effects of the interaction between the temporal and the causal components, although limited, yield promising results and confirm the tight connection between the temporal and the causal dimension of texts.

pdf bib abs

On the contribution of word embeddings to temporal relation classification
Paramita Mirza | Sara Tonelli
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Temporal relation classification is a challenging task, especially when there are no explicit markers to characterise the relation between temporal entities. This occurs frequently in inter-sentential relations, whose entities are not connected via direct syntactic relations making classification even more difficult. In these cases, resorting to features that focus on the semantic content of the event words may be very beneficial for inferring implicit relations. Specifically, while morpho-syntactic and context features are considered sufficient for classifying event-timex pairs, we believe that exploiting distributional semantic information about event words can benefit supervised classification of other types of pairs. In this work, we assess the impact of using word embeddings as features for event words in classifying temporal relations of event-event pairs and event-DCT (document creation time) pairs.

Paramita Mirza

2025

2024

2021

2020

2019

2018

2017

2016

2015

2014

2013

Co-authors

Venues