Tomáš Machálek


2026

We present an AI assistant designed to help researchers interact with language corpora using natural language instead of formal query languages. Built as a custom GPT with access to multilingual corpora via the Czech National Corpus platform API, the system translates research questions into CQL queries, retrieves corpus data, and guides users through linguistic analysis. After more than a year of deployment, the system has processed over 1000 interactions with human users. We discuss the hybrid approach combining rule-based translation with LLM intelligence, the challenges of building on a constantly evolving platform, and lessons learned from production usage. Notably, this system represents the first voice-enabled corpus interface in history, significantly lowering barriers to corpus-based research for non-technical users and users outside linguistics.

2020

We present KonText, an advanced, highly customizable corpus query interface built on top of the core libraries of the open-source corpus search engine NoSketch Engine (NoSkE). The aim is to overcome some limitations of the original NoSkE user interface and to provide integration capabilities that allow the basic search service to be connected with other language resources (LRs). The introduced features are based on long-term feedback from users and researchers of the Czech National Corpus (CNC), along with other LR providers running KonText as part of their services. KonText is fully operational, mature software that has been deployed at the CNC since 2014 and currently handles thousands of user queries per day.
Word at a Glance (WaG) is a word profile aggregator that provides a means of exploring individual words, comparing them, and translating them, based on existing language resources and related software services. It is designed as a building-kit-like application that fetches data from different sources and compiles them into a single, comprehensible, and structured web page. WaG can be easily configured to support many tasks, but in general it is intended to be used not only by language experts but also by the general public.