Jue Hou


2024

What Do Transformers Know about Government?
Jue Hou | Anisia Katinskaia | Lari Kotilainen | Sathianpong Trangcasanchai | Anh-Duc Vu | Roman Yangarber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers, and data from two morphologically rich languages. Our experiments show that information about government is encoded across all transformer layers, but predominantly in the early layers of the model. We find that, for both languages, a small number of attention heads encode enough information about the government relations to enable us to train a classifier capable of discovering new, previously unknown types of government, never seen in the training data. Currently, data is lacking for the research community working on grammatical constructions, and government in particular. We release the Government Bank—a dataset defining the government relations for thousands of lemmas in the languages in our experiments.

Intelligent Tutor to Support Teaching and Learning of Tatar
Alsu Zakirova | Jue Hou | Anisia Katinskaia | Anh-Duc Vu | Roman Yangarber
Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)

This paper presents our work on tools to support the Tatar language, using Revita, a web-based Intelligent Tutoring System for language teaching and learning. The system allows the users (teachers and learners) to upload arbitrary authentic texts, and automatically creates exercises based on these texts that engage the learners in active production of language. It provides graduated feedback when they make mistakes, and performs continuous assessment, based on which the system selects exercises for the learners at the appropriate level. The assessment also helps the students maintain their learning pace, and helps the teachers to monitor their progress. The paper describes the functionality currently implemented for Tatar, which enables learners who possess basic proficiency beyond the beginner level to improve their competency, using texts of their choice as learning content. Support for Tatar is being developed to increase public interest in learning the language of this important regional minority, as well as to provide tools for improving fluency to “heritage speakers”: those who have substantial passive competency, but lack active fluency and need support for regular practice.

2023

Linguistic Constructs Represent the Domain Model in Intelligent Language Tutoring
Anisia Katinskaia | Jue Hou | Anh-Duc Vu | Roman Yangarber
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This paper presents the development of the AI-based language-learning platform, Revita. It is an intelligent online tutor, developed to support learners of multiple languages, from lower-intermediate toward advanced levels. It has been in pilot use with hundreds of students at several universities, whose feedback and needs shape the development. One of the main emerging features of Revita is the system of linguistic constructs to represent the domain knowledge. The system of constructs is developed in collaboration with experts in language pedagogy. Constructs define the types of exercises, the content of the feedback, and enable detailed modeling and evaluation of learner progress.

Effects of sub-word segmentation on performance of transformer language models
Jue Hou | Anisia Katinskaia | Anh-Duc Vu | Roman Yangarber
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation: Morfessor and StateMorph. We train the models for several languages (including ones with very rich morphology) and compare their performance with different segmentation algorithms, vocabulary sizes, and model sizes. The results show that training with morphological segmentation allows the LMs to: (1) achieve lower perplexity, (2) converge more efficiently in terms of training time, and (3) achieve equivalent or better evaluation scores on downstream tasks. Lastly, we show that (4) LMs of smaller size using morphological segmentation can perform comparably to models of larger size trained with BPE, both in terms of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) bear on sustainability, since they reduce the model cost: (2) reduces cost only in the training phase, while (4) also reduces it in the inference phase.

2022

Applying Gamification Incentives in the Revita Language-learning System
Jue Hou | Ilmari Kylliäinen | Anisia Katinskaia | Giacomo Furlan | Roman Yangarber
Proceedings of the 9th Workshop on Games and Natural Language Processing within the 13th Language Resources and Evaluation Conference

We explore the importance of gamification features in a language-learning platform designed for intermediate-to-advanced learners. Our main thesis is: learning toward advanced levels requires a massive investment of time. If the learner engages in more practice sessions, and if the practice sessions are longer, we can expect the results to be better. This principle appears to be self-evident. Yet, keeping the learner engaged in general, and building gamification features in particular, requires substantial effort on the part of developers. Our goal is to keep the learner engaged in long practice sessions over many months, rather than for the short term. This creates a conflict: in academic research on language learning, resources are typically scarce, and gamification usually is not considered an essential priority for allocating resources. We argue in favor of giving serious consideration to gamification in the language-learning setting, as a means of enabling in-depth research. In this paper, we introduce several gamification incentives in the Revita language-learning platform. We discuss the problems in obtaining quantitative measures of the effectiveness of gamification features.

Semi-automatically Annotated Learner Corpus for Russian
Anisia Katinskaia | Maria Lebedeva | Jue Hou | Roman Yangarber
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present ReLCo (the Revita Learner Corpus), a new semi-automatically annotated learner corpus for Russian. The corpus was collected while several thousand L2 learners were performing exercises using the Revita language-learning system. All errors were detected automatically by the system and annotated by type. Part of the corpus was annotated manually; this part was created for further experiments on automatic assessment of grammatical correctness. The Learner Corpus provides valuable data for studying patterns of grammatical errors, experimenting with grammatical error detection and grammatical error correction, and developing new exercises for language learners. Automating the collection and annotation makes the process of building the learner corpus much cheaper and faster, in contrast to the traditional approach of building learner corpora. We make the data publicly available.

2019

Modeling language learning using specialized Elo rating
Jue Hou | Maximilian Koppatz | José María Hoya Quecedo | Nataliya Stoyanova | Roman Yangarber
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Automatic assessment of the proficiency levels of the learner is a critical part of Intelligent Tutoring Systems. We present methods for assessment in the context of language learning. We use a specialized Elo formula in conjunction with educational data mining. We simultaneously obtain ratings for the proficiency of the learners and for the difficulty of the linguistic concepts that the learners are trying to master. From the same data we also learn a graph structure representing a domain model capturing the relations among the concepts. This application of Elo provides ratings for learners and concepts which correlate well with subjective proficiency levels of the learners and difficulty levels of the concepts.
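
To illustrate the general idea of rating learners and concepts at the same time, here is a minimal sketch of the standard (textbook) Elo update, treating each exercise attempt as a "match" between a learner and a concept. This is not the specialized formula from the paper, whose details the abstract does not give; the function names, the scale of 400, and the K-factor of 32 are conventional defaults assumed here for illustration only.

```python
def expected_score(learner_rating: float, concept_rating: float,
                   scale: float = 400.0) -> float:
    """Probability of a correct answer under the standard Elo logistic model.

    The scale of 400 is the conventional chess default (an assumption here).
    """
    return 1.0 / (1.0 + 10 ** ((concept_rating - learner_rating) / scale))


def elo_update(learner_rating: float, concept_rating: float,
               correct: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (learner, concept) ratings after one exercise attempt.

    A correct answer raises the learner's proficiency rating and lowers the
    concept's difficulty rating by the same amount; an error does the reverse.
    The K-factor of 32 is an assumed default, not a value from the paper.
    """
    expected = expected_score(learner_rating, concept_rating)
    outcome = 1.0 if correct else 0.0
    delta = k * (outcome - expected)
    return learner_rating + delta, concept_rating - delta


if __name__ == "__main__":
    # A learner and a concept start with equal ratings; the learner answers correctly.
    learner, concept = elo_update(1500.0, 1500.0, correct=True)
    print(learner, concept)
```

Because the update is symmetric, a single stream of attempt outcomes is enough to estimate both learner proficiency and concept difficulty, which is the property the abstract highlights.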

Projecting named entity recognizers without annotated or parallel corpora
Jue Hou | Maximilian Koppatz | José María Hoya Quecedo | Roman Yangarber
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.