LLM-Generated Contexts to Practice Specialised Vocabulary: Corpus Presentation and Comparison

Iglika Nikolova-Stoupak; Serge Bibauw; Amandine Dumont; Françoise Stas; Patrick Watrin; Thomas François

Abstract

This project evaluates the potential of LLMs and dynamic corpora to generate contexts aimed at the practice and acquisition of specialised English vocabulary. We compared reference contexts, handpicked by expert teachers for a specialised vocabulary list, to contexts generated by three recent large language models (LLMs) of different sizes (Mistral-7B-Instruct, Vicuna-13B, and Gemini 1.0 Pro) and to contexts extracted from articles web-crawled from specialised websites. The comparison uses a representative set of length-based, morphosyntactic, semantic, and discourse-related textual characteristics. We conclude that the LLM-based corpora can be combined effectively with a web-crawled one to form an academic corpus characterised by appropriate complexity and textual variety.

Anthology ID: 2024.jeptalnrecital-taln.33
Volume: Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
Month: 7
Year: 2024
Address: Toulouse, France
Editors: Mathieu Balaguer, Nihed Bendahman, Lydia-Mai Ho-dac, Julie Mauclair, Jose G Moreno, Julien Pinquier
Venue: JEP/TALN/RECITAL
Publisher: ATALA and AFPC
Pages: 472–498
URL: https://aclanthology.org/2024.jeptalnrecital-taln.33
Cite (ACL): Iglika Nikolova-Stoupak, Serge Bibauw, Amandine Dumont, Françoise Stas, Patrick Watrin, and Thomas François. 2024. LLM-Generated Contexts to Practice Specialised Vocabulary: Corpus Presentation and Comparison. In Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position, pages 472–498, Toulouse, France. ATALA and AFPC.
Cite (Informal): LLM-Generated Contexts to Practice Specialised Vocabulary: Corpus Presentation and Comparison (Nikolova-Stoupak et al., JEP/TALN/RECITAL 2024)
PDF: https://aclanthology.org/2024.jeptalnrecital-taln.33.pdf
