Ganesh Bagler


2024

Deep Learning Based Named Entity Recognition Models for Recipes
Ayush Agarwal | Janak Kapuriya | Shubham Agrawal | Akhil Vamshi Konam | Mansi Goel | Rishabh Gupta | Shrey Rastogi | Niharika Niharika | Ganesh Bagler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Food touches our lives through various endeavors, including flavor, nourishment, health, and sustainability. Recipes are cultural capsules transmitted across generations via unstructured text. Automated protocols for recognizing named entities, the building blocks of recipe text, are of immense value for applications ranging from information extraction to novel recipe generation. Named entity recognition is a technique for extracting information from unstructured or semi-structured data with known labels. Starting with manually annotated data of 6,611 ingredient phrases, we created an augmented dataset of 26,445 phrases in total. Simultaneously, we systematically cleaned and analyzed ingredient phrases from RecipeDB, the gold-standard recipe data repository, and annotated them using the Stanford NER tagger. Based on this analysis, we sampled a subset of 88,526 phrases using a clustering-based approach that preserves diversity, creating the machine-annotated dataset. A thorough investigation of NER approaches on these three datasets, spanning statistical models, fine-tuned deep learning-based language models, and few-shot prompting with large language models (LLMs), provides deep insights. We conclude that few-shot prompting on LLMs has abysmal performance, whereas the fine-tuned spaCy-transformer emerges as the best model, with macro-F1 scores of 95.9%, 96.04%, and 95.71% on the manually annotated, augmented, and machine-annotated datasets, respectively.
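As a rough illustration of how such a fine-tuned spaCy pipeline would be applied at inference time, the sketch below tags a single ingredient phrase. The model name (recipe_ner_model) and the label set (QUANTITY, UNIT, INGREDIENT) are hypothetical placeholders assumed for illustration, not artifacts released with the paper.

import spacy

# Load a fine-tuned spaCy-transformer NER pipeline from disk.
# "recipe_ner_model" is a hypothetical path, assumed for illustration.
nlp = spacy.load("recipe_ner_model")

# A typical ingredient phrase of the kind annotated in the datasets.
doc = nlp("2 cups finely chopped red onion")

# Each predicted entity span carries a label from the training scheme.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Illustrative output (label names are assumed):
#   2          QUANTITY
#   cups       UNIT
#   red onion  INGREDIENT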

2020

Building Hierarchically Disentangled Language Models for Text Generation with Named Entities
Yash Agarwal | Devansh Batra | Ganesh Bagler
Proceedings of the 28th International Conference on Computational Linguistics

Named entities pose a unique challenge to traditional methods of language modeling. While several domains are characterised by a high proportion of named entities, the occurrence of specific entities varies widely. Cooking recipes, for example, contain many named entities: ingredients, cooking techniques (also called processes), and utensils. However, some ingredients occur frequently within the instructions while most occur rarely. In this paper, we build upon previous work on language models for text with named entities by introducing a Hierarchically Disentangled Model. Training is divided into multiple branches, with each branch producing a model over an overlapping subset of the vocabulary. We found the existing datasets insufficient to accurately judge the performance of the model. Hence, we curated 158,473 cooking recipes from several publicly available online sources. To reliably derive the entities within this corpus, we employ a combination of Named Entity Recognition (NER) and an unsupervised method of interpretation using dependency parsing and POS tagging, followed by further cleaning of the dataset. This unsupervised interpretation models instructions as action graphs and is specific to the corpus of cooking recipes, unlike NER, which is a general method applicable to all corpora. To demonstrate the utility of our language model, we apply it to tasks such as graph-to-text generation and ingredients-to-recipe generation, comparing it to previous state-of-the-art baselines. We make our dataset (including annotations and processed action graphs) available, considering its potential uses for language modeling and text generation research.
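A minimal sketch of the unsupervised interpretation step described above, assuming spaCy's small English model: each verb is treated as a cooking action, and its direct and prepositional objects become the nodes it connects to in an action graph. This is an illustrative approximation of dependency-based extraction, not the authors' exact pipeline.

import spacy

# Small English pipeline with POS tagging and dependency parsing.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Chop the onions, then saute them in a pan with butter.")

# Treat each verb as an action node; collect its direct objects and,
# via prepositions, the objects of attached prepositional phrases.
for token in doc:
    if token.pos_ == "VERB":
        args = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
        for prep in (c for c in token.children if c.dep_ == "prep"):
            args.extend(g.text for g in prep.children if g.dep_ == "pobj")
        print(token.lemma_, "->", args)

# Expected output is roughly:
#   chop -> ['onions']
#   saute -> ['them', 'pan']   (attachment of "with butter" may vary)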