2024
Teaching LLMs at Charles University: Assignments and Activities
Jindřich Helcl | Zdeněk Kasner | Ondřej Dušek | Tomasz Limisiewicz | Dominik Macháček | Tomáš Musil | Jindřich Libovický
Proceedings of the Sixth Workshop on Teaching NLP
This paper presents teaching materials, particularly assignments and ideas for classroom activities, from a new course on large language models. The assignments include experiments with LLM inference for weather report generation and machine translation. The classroom activities include class quizzes, focused research on downstream tasks and datasets, and an interactive “best paper” session aimed at reading and comprehension of research papers.
Exploring Interpretability of Independent Components of Word Embeddings with Automated Word Intruder Test
Tomáš Musil | David Mareček
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people in the same room speaking at the same time. Unlike Principal Component Analysis (PCA), ICA permits the representation of a word as an unstructured set of features, without any particular feature being deemed more significant than the others. In this paper, we use ICA to analyze word embeddings. We find that ICA can be used to identify semantic features of words, and that these features can easily be combined to search for words that satisfy the combination. We show that most of the independent components represent such features. To quantify the interpretability of the components, we use the word intruder test, performed both by humans and by large language models. We propose to use the automated version of the word intruder test as a fast and inexpensive way of quantifying vector interpretability without the need for human effort.
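The first step of the abstract's pipeline — decomposing an embedding matrix into independent components and reading each component off as a candidate semantic feature — can be sketched roughly as follows. This is a minimal illustration using scikit-learn's FastICA on a random stand-in matrix; the matrix, vocabulary, and component count are placeholder assumptions, not the paper's data or setup.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Toy stand-in for a word embedding matrix: 1000 "words" x 50 dimensions.
# In the paper's setting, rows would be embeddings of real words.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 50))
words = [f"word{i}" for i in range(1000)]

# ICA decomposes the embedding space into independent components;
# each component is then inspected as a candidate semantic feature.
ica = FastICA(n_components=10, random_state=0)
sources = ica.fit_transform(embeddings)  # shape: (1000, 10)

# The words scoring highest on a component characterize that feature
# (these top-word lists are what a word intruder test would probe).
def top_words(component, k=5):
    idx = np.argsort(-sources[:, component])[:k]
    return [words[i] for i in idx]

print(top_words(0))
```

With real embeddings, the `top_words` lists for a component would ideally share a coherent meaning, which the (automated) word intruder test then checks.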
2022
THEaiTRobot: An Interactive Tool for Generating Theatre Play Scripts
Rudolf Rosa | Patrícia Schmidtová | Alisa Zakhtarenko | Ondřej Dušek | Tomáš Musil | David Mareček | Saad Ul Islam | Marie Nováková | Klára Vosecká | Daniel Hrbek | David Košťák
Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations
We present a free online demo of THEaiTRobot, an open-source bilingual tool for interactively generating theatre play scripts, in two versions. THEaiTRobot 1.0 uses the GPT-2 language model with minimal adjustments. THEaiTRobot 2.0 uses two models created by fine-tuning GPT-2 on purposefully collected and processed datasets and several other components, generating play scripts in a hierarchical fashion (title → synopsis → script). The underlying tool is used in the THEaiTRE project to generate scripts for plays, which are then performed on stage by a professional theatre.
GPT-2-based Human-in-the-loop Theatre Play Script Generation
Rudolf Rosa | Patrícia Schmidtová | Ondřej Dušek | Tomáš Musil | David Mareček | Saad Obaid | Marie Nováková | Klára Vosecká | Josef Doležal
Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)
We experiment with adapting generative language models for the generation of long coherent narratives in the form of theatre plays. Since fully automatic generation of whole plays is not currently feasible, we created an interactive tool that allows a human user to steer the generation somewhat while minimizing intervention. We pursue two approaches to long-text generation: a flat generation with summarization of context, and a hierarchical text-to-text two-stage approach, where a synopsis is generated first and then used to condition generation of the final script. Our preliminary results and discussions with theatre professionals show improvements over vanilla language model generation, but also identify important limitations of our approach.
2021
Representations of Meaning in Neural Networks for NLP: a Thesis Proposal
Tomáš Musil
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Neural networks are the state-of-the-art method of machine learning for many problems in NLP. Their success in machine translation and other NLP tasks is phenomenal, but their interpretability is challenging. We want to find out how neural networks represent meaning. In order to do this, we propose to examine the distribution of meaning in the vector space representation of words in neural networks trained for NLP tasks. Furthermore, we propose to consider various theories of meaning in the philosophy of language and to find a methodology that would enable us to connect these areas.
2019
Derivational Morphological Relations in Word Embeddings
Tomáš Musil | Jonáš Vidra | David Mareček
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
Derivation is a type of word-formation process which creates new words from existing ones by adding, changing or deleting affixes. In this paper, we explore the potential of word embeddings to identify properties of word derivations in the morphologically rich Czech language. We extract derivational relations between pairs of words from DeriNet, a Czech lexical network, which organizes almost one million Czech lemmas into derivational trees. For each such pair, we compute the difference of the embeddings of the two words, and perform unsupervised clustering of the resulting vectors. Our results show that these clusters largely match manually annotated semantic categories of the derivational relations (e.g. the relation ‘bake–baker’ belongs to category ‘actor’, and a correct clustering puts it into the same cluster as ‘govern–governor’).
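The procedure the abstract describes — one difference vector per derivational pair, then unsupervised clustering — can be sketched as below. The embeddings, pairs, and cluster count here are illustrative placeholders, not DeriNet data or the paper's actual configuration; k-means stands in for whatever clustering algorithm is used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy random embeddings; in the paper these come from models trained on
# Czech, and the pairs from the DeriNet lexical network.
rng = np.random.default_rng(0)
vocab = ["bake", "baker", "govern", "governor", "teach", "teacher"]
emb = {w: rng.standard_normal(50) for w in vocab}

# Derivational pairs (base word, derived word); illustrative examples.
pairs = [("bake", "baker"), ("govern", "governor"), ("teach", "teacher")]

# Each pair is represented by the difference of its two embeddings.
diffs = np.stack([emb[derived] - emb[base] for base, derived in pairs])

# Unsupervised clustering of the difference vectors; the resulting
# clusters are compared against annotated semantic categories.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(diffs)
print(labels)
```

With real embeddings, pairs sharing a semantic relation (e.g. the ‘actor’ category) would be expected to land in the same cluster.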
A Test Suite and Manual Evaluation of Document-Level NMT at WMT19
Kateřina Rysová | Magdaléna Rysová | Tomáš Musil | Lucie Poláková | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
As the quality of machine translation rises and neural machine translation (NMT) is moving from sentence to document level translations, it is becoming increasingly difficult to evaluate the output of translation systems. We provide a test suite for WMT19 aimed at assessing discourse phenomena of MT systems participating in the News Translation Task. We have manually checked the outputs and identified types of translation errors that are relevant to document-level translation.
2018
Neural Monkey: The Current State and Beyond
Jindřich Helcl | Jindřich Libovický | Tom Kocmi | Tomáš Musil | Ondřej Cífka | Dušan Variš | Ondřej Bojar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
2017
Results of the WMT17 Neural MT Training Task
Ondřej Bojar | Jindřich Helcl | Tom Kocmi | Jindřich Libovický | Tomáš Musil
Proceedings of the Second Conference on Machine Translation