Anubhav Sharma

2024

Lack of diverse perspectives causes neutrality bias in Wikipedia content leading to millions of worldwide readers getting exposed by potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWIKIBIAS and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.

2023

pdf bib abs

Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. This has resulted into substantial work on fact-to-text generation systems recently. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages alongwith a dataset, XAlign for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend XAlign dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XAlignV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to best results (30.90 BLEU, 55.12 METEOR and 59.17 chrF++) across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.

pdf bib abs

Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling
Anubhav Sharma | Sagar Joshi | Tushar Abhishek | Radhika Mamidi | Vasudeva Varma
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Clickbait Challenge targets spoiling the clickbaits using short pieces of information known as spoilers to satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and differences in the spoiler forms, make the task challenging. Hence, to tackle the large context, we propose an Information Condensation-based approach, which prunes down the unnecessary context. Given an article, our filtering module optimised with a contrastive learning objective first selects the parapraphs that are the most relevant to the corresponding clickbait.The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both the tasks. Overall, we win the task of spoiler type classification and achieve competitive results on spoiler generation.

pdf bib abs

WebNLG Challenge 2023: Domain Adaptive Machine Translation for Low-Resource Multilingual RDF-to-Text Generation (WebNLG 2023)
Kancharla Aditya Hari | Bhavyajeet Singh | Anubhav Sharma | Vasudeva Varma
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)

This paper presents our submission to the WebNLG Challenge 2023 for generating text in several low-resource languages from RDF-triples. Our submission focuses on using machine translation for generating texts in Irish, Maltese, Welsh and Russian. While a simple and straightfoward approach, recent works have shown that using monolingual models for inference for multilingual tasks with the help of machine translation (translate-test) can out-perform multilingual models and training multilingual models on machine-translated data (translate-train) through careful tuning of the MT component. Our results show that this approach demonstrates competitive performance for this task even with limited data.

2022

pdf bib abs

Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages
Bhavyajeet Singh | Siri Venkata Pavan Kumar Kandru | Anubhav Sharma | Vasudeva Varma
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However a lot of information present in the form of natural text in low resource languages is often missed out. Cross Lingual Information Extraction aims at extracting factual information in the form of English triples from low resource Indian Language text. Despite its massive potential, progress made on this task is lagging when compared to Monolingual Information Extraction. In this paper, we propose the task of Cross Lingual Fact Extraction(CLFE) from text and devise an end-to-end generative approach for the same which achieves an overall F1 score of 77.46

Co-authors

Rudra Dhar 1

Sagar Joshi 1

Siri Venkata Pavan Kumar Kandru 1

Ankita Maity 1

Radhika Mamidi 1

Shivprasad Sagare 1

Venues

SIGDIAL1

WILDRE1

Fix author