Naman Ahuja
2022
PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality
Prashant Kodali
|
Tanmay Sachan
|
Akshay Goindani
|
Anmol Goel
|
Naman Ahuja
|
Manish Shrivastava
|
Ponnurangam Kumaraguru
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine gen- erated code-mixed text is an open problem. In our submission to HinglishEval, a shared- task collocated with INLG2022, we attempt to build models factors that impact the quality of synthetically generated code-mix text by pre- dicting ratings for code-mix quality. Hingli- shEval Shared Task consists of two sub-tasks - a) Quality rating prediction); b) Disagree- ment prediction. We leverage popular code- mixed metrics and embeddings of multilin- gual large language models (MLLMs) as fea- tures, and train task specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on F-1 and Co- hen’s Kappa Score measures and first for Mean Squared Error measure. For Subtask-B our ap- proach ranked third for F1 score, and first for Mean Squared Error measure. Code of our submission can be accessed here.
HashSet - A Dataset For Hashtag Segmentation
Prashant Kodali
|
Akshala Bhatnagar
|
Naman Ahuja
|
Manish Shrivastava
|
Ponnurangam Kumaraguru
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways - transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used for the hashtag segmentation task - STAN, BOUN - are small and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising of: a) 1.9k manually annotated dataset; b) 3.3M loosely supervised dataset. HashSet dataset is sampled from a different set of tweets when compared to existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We analyze the performance of SOTA models for Hashtag Segmentation, and show that the proposed dataset provides an alternate set of hashtags to train and assess models.
Search
Fix data
Co-authors
- Prashant Kodali 2
- Ponnurangam Kumaraguru 2
- Manish Shrivastava 2
- Akshala Bhatnagar 1
- Anmol Goel 1
- show all...