Journal for Language Technology and Computational Linguistics, Vol. 36 No. 2

Christian Wartena (Editor)


Anthology ID:
2023.jlcl-2
Month:
May
Year:
2023
Venue:
JLCL
Publisher:
German Society for Computational Linguistics and Language Technology
URL:
https://aclanthology.org/2023.jlcl-2


Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks
Barack Wanjawa | Lilian Wanzare | Florence Indede | Owen McOnyango | Edward Ombui | Lawrence Muchemi

Indigenous African languages are underserved in Natural Language Processing and consequently suffer from poor digital inclusivity and information access. The central challenge in processing such languages has been the lack of the data required to train machine learning and deep learning models. The Kencorpus project intends to bridge this gap by collecting and curating text and speech data suitable for data-driven solutions in applications such as machine translation, question answering, and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo, and Luhya (the latter covering three dialects: Lumarachi, Lulogooli, and Lubukusu). Data collection was carried out by researchers deployed to various sources such as communities, schools, media, and publishers. The Kencorpus dataset comprises 5,594 items: 4,442 texts totalling 5.6 million words and 1,152 speech files amounting to 177 hours. From this data, further datasets were derived, including part-of-speech-tagged sets of 50,000 words for Dholuo and 93,000 words for the Luhya dialects. We developed 7,537 question-answer pairs from 1,445 Swahili texts and created a translation set of 13,400 sentences from Dholuo and Luhya into Swahili. These datasets support downstream machine learning tasks such as model training and translation. Additionally, we developed two proof-of-concept systems: a Kiswahili speech-to-text system and a machine learning system for question answering, which achieved a word error rate of 18.87% and an Exact Match (EM) score of 80%, respectively. These initial results are promising for the usability of Kencorpus in the machine learning community. Kencorpus is one of the few public-domain corpora for these three low-resource languages and forms a basis for learning from and sharing experience with similar work, especially on low-resource languages. Challenges in developing the corpus included deficiencies in the data sources, data cleaning, relatively short project timelines, and the COVID-19 pandemic, which restricted movement and hence timely data collection.
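
For readers unfamiliar with the two evaluation metrics quoted above, the following minimal Python sketch shows how word error rate (for the speech-to-text system) and Exact Match (for the question-answering system) are conventionally computed; the function names and whitespace tokenization are illustrative assumptions, not part of the Kencorpus codebase.

# Hedged sketch of the two metrics cited in the abstract; illustrative only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def exact_match(predictions: list[str], answers: list[str]) -> float:
    """Fraction of QA predictions that are string-identical to the gold answer."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(predictions)

A WER of 18.87% thus means that roughly one word in five of the reference transcript had to be substituted, deleted, or inserted to recover the system output, while 80% EM means four out of five predicted answers matched the gold answer exactly.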

The Proof is in the Pudding: Using Automated Theorem Proving to Generate Cooking Recipes
Louis Mahon | Carl Vogel

This paper presents FASTFOOD, a rule-based natural language generation (NLG) program for cooking recipes. We treat the representation of cooking recipes as a problem of discourse representation, because the meaning of each sentence depends on the context of the others. Our discourse representation system is based on states of affairs and transitions between states of affairs, and does not use discourse referents. Recipes are generated by an automated theorem-proving procedure that selects the ingredients and instructions, with ingredients corresponding to axioms and instructions to implications. FASTFOOD also contains a temporal optimization module that can rearrange a recipe to make it more time-efficient for the user, e.g. by specifying that the vegetables be chopped while the rice is boiling. The system is described in detail, including the decision to forgo discourse referents and how plausible representations of nouns and verbs emerge purely as a by-product of the practical requirements of efficiently representing recipe content. A comparison is then made with existing recipe generation systems, with NLG systems more generally, and with automated theorem provers.
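
To make the axioms-and-implications framing concrete, here is a toy forward-chaining sketch in Python in the spirit of the abstract: ingredients are axioms (facts), instructions are implications whose premises must already hold, and a recipe is the chain of implications that derives the goal dish. All names, rules, and the chaining strategy are invented for illustration; this is not the FASTFOOD implementation.

# Toy forward-chaining "prover" over invented recipe rules; illustrative only.

AXIOMS = {"rice", "water", "carrot"}  # available ingredients (axioms)
RULES = [                             # implications: premises -> conclusion
    (frozenset({"rice", "water"}), "boiled_rice", "Boil the rice in water."),
    (frozenset({"carrot"}), "chopped_carrot", "Chop the carrot."),
    (frozenset({"boiled_rice", "chopped_carrot"}), "carrot_rice",
     "Stir the chopped carrot into the boiled rice."),
]

def prove(goal: str) -> list[str]:
    """Forward-chain from the axioms until the goal state is derived,
    recording the instruction attached to each implication that fires."""
    known, recipe = set(AXIOMS), []
    changed = True
    while goal not in known and changed:
        changed = False
        for premises, conclusion, instruction in RULES:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                recipe.append(instruction)
                changed = True
    if goal not in known:
        raise ValueError(f"No proof (recipe) found for {goal!r}")
    return recipe

print("\n".join(prove("carrot_rice")))

Running the sketch prints the three instructions in a valid dependency order; FASTFOOD's temporal optimization module would go further and interleave steps that can proceed in parallel, such as chopping while the rice boils.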