Adrian Jan Zasina


2025

Automatic Generation of Corpus-Based Exercises Using Generative AI
Adrian Jan Zasina
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

This study explores the automatic generation of corpus-based language exercises using generative AI models. We focus on the interaction between language models and corpus data, detailing a workflow in which lexical and syntactic patterns are extracted from a tagged corpus and structured prompts are constructed to guide the model in producing sentence-level exercises. The generated exercises reveal the potential of AI-driven approaches. However, observations highlight the necessity of careful design and critical evaluation when integrating generative models with corpus-based language materials. By analysing these processes from a computational linguistics perspective, this study contributes to understanding how generative AI can interact with structured linguistic data, informing future applications in automated language resources.

2016

SYN2015: Representative Corpus of Contemporary Written Czech
Michal Křen | Václav Cvrček | Tomáš Čapka | Anna Čermáková | Milena Hnátková | Lucie Chlumská | Tomáš Jelínek | Dominika Kováříková | Vladimír Petkevič | Pavel Procházka | Hana Skoumalová | Michal Škrabal | Petr Truneček | Pavel Vondřička | Adrian Jan Zasina
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper concentrates on the design, composition and annotation of SYN2015, a new 100-million-word representative corpus of contemporary written Czech. SYN2015 is a sequel to the representative corpora of the SYN series that can be described as traditional (as opposed to the web-crawled corpora), featuring cleared copyright issues, well-defined composition, reliability of annotation and high-quality text processing. At the same time, SYN2015 is designed as a reflection of the variety of written Czech text production with necessary methodological and technological enhancements that include a detailed bibliographic annotation and text classification based on an updated scheme. The corpus has been produced using a completely rebuilt text processing toolchain called SynKorp. SYN2015 is lemmatized and morphologically and syntactically annotated with state-of-the-art tools. It has been published within the framework of the Czech National Corpus and is available via the standard corpus query interface KonText at http://kontext.korpus.cz and as a dataset in shuffled format.