2024
AssameseBackTranslit: Back Transliteration of Romanized Assamese Social Media Text
Hemanta Baruah | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents a novel back-transliteration dataset capturing native-language text originally composed in the Roman/Latin script, harvested from popular social media platforms, along with its corresponding representation in the native Assamese script. Assamese, a low-resource language within the Indo-Aryan language family, predominantly spoken in the north-east Indian state of Assam, faces a scarcity of linguistic resources. The dataset comprises a total of 60,312 Roman-native parallel transliterated sentences. Diverging from conventional forward-transliteration datasets, which consist mainly of named entities and technical terms, this paper presents a novel transliteration dataset cultivated from three prominent social media platforms, Facebook, Twitter (currently X), and YouTube, in the backward transliteration direction. The paper offers a comprehensive examination of ten state-of-the-art word-level transliteration models on this dataset, encompassing transliteration evaluation benchmarks, extensive performance assessments, and a discussion of the unique challenges encountered while processing transliterated social media content. Our approach first applies two statistical transliteration models, then trains two state-of-the-art neural network-based transliteration models, evaluates three publicly available pre-trained models, and finally fine-tunes one existing state-of-the-art multilingual transliteration model along with two pre-trained large language models on the collected dataset. Notably, the Neural Transformer model outperforms all other baseline transliteration models, achieving the lowest Word Error Rate (WER) of 55.05, the lowest Character Error Rate (CER) of 19.44, and the highest BLEU (up to 4-gram) score of 69.15.
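For reference, a minimal sketch of the word-level evaluation metrics reported above (WER, CER, and corpus BLEU up to 4-grams, computed here over character n-grams); the helper names and toy data are illustrative, not from the paper:

```python
from collections import Counter
import math

def levenshtein(a, b):
    """Edit distance between two sequences (here: character strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    """Word-level error rate: fraction of words not matched exactly."""
    return 100.0 * sum(r != h for r, h in zip(refs, hyps)) / len(refs)

def cer(refs, hyps):
    """Character error rate: edit distance over total reference length."""
    dist = sum(levenshtein(h, r) for r, h in zip(refs, hyps))
    return 100.0 * dist / sum(len(r) for r in refs)

def bleu4(refs, hyps):
    """Corpus-level BLEU over character n-grams, n = 1..4, with brevity penalty."""
    log_prec = 0.0
    for n in range(1, 5):
        match = total = 0
        for r, h in zip(refs, hyps):
            r_ng = Counter(r[i:i + n] for i in range(len(r) - n + 1))
            h_ng = Counter(h[i:i + n] for i in range(len(h) - n + 1))
            match += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            total += sum(h_ng.values())
        if match == 0:
            return 0.0
        log_prec += math.log(match / total)
    ref_len, hyp_len = sum(map(len, refs)), sum(map(len, hyps))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100.0 * bp * math.exp(log_prec / 4)

refs = ["নমস্কাৰ", "অসম"]   # hypothetical gold native-script words
hyps = ["নমস্কাৰ", "আসম"]   # hypothetical model outputs
print(wer(refs, hyps), cer(refs, hyps), bleu4(refs, hyps))
```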
ClusterCore at SemEval-2024 Task 7: Few Shot Prompting With Large Language Models for Numeral-Aware Headline Generation
Monika Singh | Sujit Kumar | Tanveen . | Sanasam Ranbir Singh
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Headline generation, a crucial aspect of abstractive summarization, aims to compress an entire article into a concise, single line of text. Despite the effectiveness of modern encoder-decoder models for text generation and summarization tasks, these models commonly face challenges in accurately generating numerical content within headlines. This study empirically explores LLMs for numeral-aware headline generation and proposes few-shot prompting with LLMs for this task. Experiments conducted on the NumHG dataset and the NumEval-2024 test set suggest that fine-tuning LLMs on the NumHG dataset enhances their performance for numeral-aware headline generation. Furthermore, few-shot prompting with LLMs surpasses the performance of fine-tuned LLMs for numeral-aware headline generation.
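A minimal sketch of how such a few-shot prompt might be assembled; the instruction wording and the in-context examples are assumptions, not the paper's actual prompt, and any LLM completion API could consume the result:

```python
# Hypothetical in-context (body, headline) pairs stressing exact numerals.
FEW_SHOT_EXAMPLES = [
    ("The city council approved a budget of 4.2 million dollars for road "
     "repairs across 13 districts.",
     "Council approves $4.2M for roads in 13 districts"),
    ("Researchers surveyed 1,500 adults and found 62 percent support the "
     "new policy.",
     "62% of 1,500 surveyed adults back new policy"),
]

def build_prompt(article: str) -> str:
    """Assemble a few-shot prompt that stresses copying numerals exactly."""
    parts = ["Write a one-line headline. Reproduce every numeral from the "
             "article exactly; do not round or invent numbers.\n"]
    for body, headline in FEW_SHOT_EXAMPLES:
        parts.append(f"Article: {body}\nHeadline: {headline}\n")
    parts.append(f"Article: {article}\nHeadline:")
    return "\n".join(parts)

prompt = build_prompt("Ticket sales rose 18 percent to 2.3 million units in 2023.")
# `prompt` is then sent to the LLM of choice (fine-tuned or off-the-shelf).
```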
2023
Jack-flood at SemEval-2023 Task 5: Hierarchical Encoding and Reciprocal Rank Fusion-Based System for Spoiler Classification and Generation
Sujit Kumar | Aditya Sinha | Soumyadeep Jana | Rahul Mishra | Sanasam Ranbir Singh
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
The rise of social media has witnessed an exponential increase in clickbait posts that grab users’ attention. Although work has been done to detect clickbait posts, this is the first task focused on generating appropriate spoilers for such potential clickbait. This paper presents our approach in this direction. We use different encoding techniques that capture the context of the post text and the target paragraph. For spoiler-type classification, we propose a hierarchical encoding model with count and document-length features, which uses Recurrence over Pretrained Encoding. We also propose combining multiple rankings with reciprocal rank fusion for passage-spoiler retrieval, and a question-answering approach for phrase-spoiler retrieval. For multipart-spoiler retrieval, we combine the above two spoiler retrieval methods. Experimental results over the benchmark suggest that our proposed spoiler retrieval methods retrieve spoilers that are semantically very close to the ground-truth spoilers.
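A minimal sketch of the reciprocal rank fusion step named above: each document's fused score is the sum of 1/(k + rank) across the input rankings. The constant k=60 is the value common in the RRF literature, and the two retrievers are hypothetical, not necessarily this system's exact configuration:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum_i 1 / (k + rank_i(d))."""
    scores = defaultdict(float)
    for ranking in rankings:                      # each ranking: best-first list of ids
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical passage ids ranked by two different retrievers:
bm25  = ["p3", "p1", "p7", "p2"]
dense = ["p1", "p7", "p3", "p5"]
print(reciprocal_rank_fusion([bm25, dense]))      # ['p1', 'p3', 'p7', ...]
```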
Can Big Models Help Diverse Languages? Investigating Large Pretrained Multilingual Models for Machine Translation of Indian Languages
Telem Joyson Singh | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Machine translation of Indian languages is challenging due to several factors, including linguistic diversity, limited parallel data, language divergence, and complex morphology. Recently, large pre-trained multilingual models have shown promise in improving translation quality. In this paper, we conduct a large-scale study on applying large pre-trained models to English-Indic machine translation through transfer learning across languages and domains. The study systematically evaluates the practical gains these models can provide and analyzes their capabilities for Indian-language translation via transfer learning. Specifically, we experiment with several models, including Meta’s mBART, mBART many-to-many, NLLB-200, M2M-100, and Google’s mT5. These models are fine-tuned on small, high-quality English-Indic parallel data across languages and domains. Our findings show that adapting large pre-trained models to particular languages by fine-tuning improves translation quality across the Indic languages, even for languages unseen during pretraining. Domain adaptation through continued fine-tuning further improves results. Our study provides insights into utilizing large pre-trained models to address the distinct challenges of MT for Indian languages.
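A minimal sketch of running inference with one of the evaluated pre-trained models (NLLB-200) via Hugging Face transformers; the checkpoint size and language codes are assumptions about a typical English-Assamese setup, not the paper's exact configuration:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("The weather is pleasant today.", return_tensors="pt")
out = model.generate(
    **inputs,
    # Force the decoder to start with the target-language tag (Assamese).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("asm_Beng"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
# Fine-tuning on small, high-quality parallel data would follow the standard
# seq2seq recipe on top of this checkpoint.
```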
Multiset Dual Summarization for Incongruent News Article Detection
Sujit Kumar | Rohan Jaiswal | Mohit Ram Sharma | Sanasam Ranbir Singh
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
The prevalence of deceptive and incongruent news headlines has highlighted their substantial role in the propagation of fake news, exacerbating the spread of both misinformation and disinformation. Existing studies on incongruity detection primarily concentrate on estimating the similarity between the encoded representation of the headline and either the encoded representation or a summary representative vector of the news body. To encode the news body, researchers typically use either sequential or hierarchical encoding; to obtain a summary representative vector, they explore techniques such as summarization or dual summarization. Nevertheless, when it comes to detecting partially incongruent news, dual summarization-based methods tend to outperform hierarchical encoding-based methods. On the other hand, for fake news detection datasets, where the hierarchical structure within a news article plays a crucial role, hierarchical encoding-based methods tend to perform better than summarization-based methods. Recognizing this contradictory performance of hierarchical encoding-based and summarization-based methods across datasets with different characteristics, we introduce a novel approach called Multiset Dual Summarization (MDS). MDS combines the strengths of hierarchical encoding and dual summarization to leverage their respective advantages. We conducted experiments on datasets with diverse characteristics, and our findings demonstrate that the proposed model outperforms established state-of-the-art baseline models.
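An illustrative sketch of the general idea of fusing a hierarchical-encoding branch with a dual-summarization branch for incongruity scoring, as the abstract describes; the layer sizes, the GRU document encoder, and fusion by concatenated similarities are assumptions, not the authors' exact MDS architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPlusDualSummary(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.sent_rnn = nn.GRU(dim, dim, batch_first=True)  # hierarchical branch
        self.queries = nn.Parameter(torch.randn(2, dim))    # dual-summary branch
        self.clf = nn.Linear(3, 1)

    def forward(self, headline, sents):
        # headline: (dim,); sents: (1, num_sentences, dim) sentence embeddings
        _, doc_vec = self.sent_rnn(sents)                   # document-level encoding
        attn = F.softmax(self.queries @ sents[0].T, dim=-1) # (2, num_sentences)
        summaries = attn @ sents[0]                         # two summary vectors
        h = headline.unsqueeze(0)
        feats = torch.cat([
            F.cosine_similarity(doc_vec[0], h, dim=-1),     # headline vs document
            F.cosine_similarity(summaries, h, dim=-1),      # headline vs each summary
        ])
        return torch.sigmoid(self.clf(feats))               # P(incongruent)

model = HierarchicalPlusDualSummary(dim=128)
p = model(torch.randn(128), torch.randn(1, 10, 128))        # 10 body sentences
```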
Subwords to Word Back Composition for Morphologically Rich Languages in Neural Machine Translation
Telem Joyson Singh | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
Assamese Back Transliteration - An Empirical Study Over Canonical and Non-canonical Datasets
Hemanta Baruah | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
2022
Detecting Incongruent News Articles Using Multi-head Attention Dual Summarization
Sujit Kumar | Gaurav Kumar | Sanasam Ranbir Singh
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
With the increasing use of misleading, incongruent news headlines to spread fake news, detecting incongruent news articles has become an important research challenge. Most earlier studies on incongruity detection focus on estimating the similarity between the headline and the encoding of the body or its summary. However, most of these methods fail to handle incongruent news articles created with embedded noise. Motivated by this issue, this paper proposes a Multi-head Attention Dual Summary (MADS) based method, which generates two types of summaries that separately capture the congruent and incongruent parts of the body. From various experimental setups over three publicly available datasets, it is evident that the proposed model outperforms the state-of-the-art baseline counterparts.
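An illustrative sketch of using multi-head attention to pool the news body into two headline-conditioned summaries, one meant to capture congruent content and one incongruent content; the dimensions, the two separate attention modules, and the final margin score are assumptions, not the authors' exact MADS model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, heads = 128, 4
congruent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
incongruent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

headline = torch.randn(1, 1, dim)    # (batch, 1, dim) headline encoding
body = torch.randn(1, 20, dim)       # (batch, sentences, dim) body encodings

# The headline attends over the body through two separately trained modules,
# yielding two summary vectors with opposite training objectives.
congruent, _ = congruent_attn(headline, body, body)
incongruent, _ = incongruent_attn(headline, body, body)

score = (F.cosine_similarity(congruent, headline, dim=-1)
         - F.cosine_similarity(incongruent, headline, dim=-1))
# A low (or negative) margin would indicate an incongruent article.
```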
2021
Manipuri-English Machine Translation using Comparable Corpus
Lenin Laitonjam | Sanasam Ranbir Singh
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
Unsupervised Machine Translation (MT) models, which can perform MT without parallel sentences by using comparable corpora, are becoming a promising approach to developing MT for low-resource languages. However, the majority of studies in unsupervised MT have considered resource-rich language pairs with similar linguistic characteristics. In this paper, we investigate the effectiveness of unsupervised MT models over a Manipuri-English comparable corpus. Manipuri is a low-resource language with linguistic characteristics different from those of English. This paper focuses on identifying the challenges in building unsupervised MT models over the comparable corpus. From various experimental observations, it is evident that developing MT over a comparable corpus using unsupervised methods is feasible. Further, the paper identifies future directions for developing effective MT for the Manipuri-English language pair under unsupervised scenarios.
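A schematic sketch of iterative back-translation, a standard ingredient of unsupervised MT over non-parallel data; `seed_translate`, `train_seq2seq`, and `translate` are hypothetical placeholders standing in for a real training and decoding stack, and this is not necessarily the paper's pipeline:

```python
def seed_translate(sents):
    """Placeholder: e.g. word-by-word lexicon translation for round one."""
    return list(sents)

def train_seq2seq(src, tgt):
    """Placeholder for training a real seq2seq model on synthetic pairs."""
    return {"pairs": list(zip(src, tgt))}

def translate(model, sent):
    """Placeholder for decoding with the current model."""
    return sent

def unsupervised_mt(mono_mni, mono_eng, rounds=3):
    """Iterative back-translation over Manipuri/English monolingual sides."""
    mni2eng = eng2mni = None
    for _ in range(rounds):
        synth_eng = ([translate(mni2eng, s) for s in mono_mni]
                     if mni2eng else seed_translate(mono_mni))
        synth_mni = ([translate(eng2mni, s) for s in mono_eng]
                     if eng2mni else seed_translate(mono_eng))
        # Each direction is retrained on the other's synthetic output.
        eng2mni = train_seq2seq(src=synth_eng, tgt=mono_mni)
        mni2eng = train_seq2seq(src=synth_mni, tgt=mono_eng)
    return mni2eng, eng2mni
```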
2020
Sentiment Analysis of Tweets using Heterogeneous Multi-layer Network Representation and Embedding
Loitongbam Gyanendro Singh | Anasua Mitra | Sanasam Ranbir Singh
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Sentiment classification on tweets often needs to deal with the problems of under-specificity, noise, and multilingual content. This study proposes a heterogeneous multi-layer network-based representation of tweets that generates multiple representations of a tweet to address the above issues. The generated representations are then ensembled and classified using a neural early-fusion approach. Further, we propose a centrality-aware random walk for node embedding and tweet representation suitable for the multi-layer network. From various experimental analyses, it is evident that the proposed method can address the problems of under-specificity, noisy text, and multilingual content present in tweets, and provides better classification performance than its text-based counterparts. Further, the proposed centrality-aware random walk provides better representations than unbiased and other biased counterparts.
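A minimal sketch of a centrality-aware biased random walk: neighbors are sampled proportionally to their centrality rather than uniformly. The choice of degree centrality as the bias and the toy graph are assumptions for illustration; the resulting walks would then feed a skip-gram-style embedding model, as in DeepWalk/node2vec:

```python
import random

def centrality_walk(graph, start, length):
    """graph: dict mapping node -> list of neighbors (one layer of the
    multi-layer network); returns one biased walk starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = graph[walk[-1]]
        if not nbrs:
            break
        weights = [len(graph[n]) for n in nbrs]   # degree-centrality bias
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy layer mixing tweet (t), user (u), and word (w) nodes:
toy = {"t1": ["u1", "w1"], "u1": ["t1", "t2"], "w1": ["t1"], "t2": ["u1"]}
print(centrality_walk(toy, "t1", 5))              # e.g. ['t1', 'u1', 't2', ...]
```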
2016
Automatic Syllabification for Manipuri language
Loitongbam Gyanendro Singh | Lenin Laitonjam | Sanasam Ranbir Singh
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Developing hand-crafted rules for syllabifying the words of a language is an expensive task. This paper proposes several data-driven methods for the automatic syllabification of words written in Manipuri, one of the scheduled Indian languages. First, we propose a language-independent rule-based approach formulated using entropy-based phonotactic segmentation. Second, we cast syllabification as a sequence labeling problem and investigate the effect of various sequence labeling approaches. Third, we combine sequence labeling with the rule-based method and investigate the performance of the hybrid approach. From various experimental observations, it is evident that the proposed methods outperform the baseline rule-based method. The entropy-based phonotactic segmentation provides a word accuracy of 96%, the CRF-based sequence labeling approach 97%, and the hybrid approach 98%.
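A minimal sketch of the entropy-based segmentation idea: place a boundary where the forward branching entropy of the next character is high, i.e., where many continuations are possible. The context length, threshold, and toy corpus are illustrative; the paper's phonotactic formulation is richer than this:

```python
import math
from collections import defaultdict

def branching_entropy(corpus, context):
    """Entropy of the next-character distribution after `context`."""
    counts = defaultdict(int)
    for word in corpus:
        for i in range(len(word) - len(context)):
            if word[i:i + len(context)] == context:
                counts[word[i + len(context)]] += 1
    total = sum(counts.values())
    return (-sum(c / total * math.log2(c / total) for c in counts.values())
            if total else 0.0)

def syllabify(word, corpus, threshold=1.0):
    """Insert '-' wherever the preceding context's branching entropy peaks."""
    out = []
    for i in range(1, len(word)):
        out.append(word[i - 1])
        if branching_entropy(corpus, word[max(0, i - 2):i]) > threshold:
            out.append("-")          # high uncertainty -> likely boundary
    out.append(word[-1])
    return "".join(out)

corpus = ["mapari", "mari", "pari", "mana"]   # hypothetical romanized word list
print(syllabify("mapari", corpus))            # e.g. "ma-pa-ri" if entropy peaks
```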