Matyáš Kopp


2024

Multilingual Power and Ideology identification in the Parliament: a reference dataset and simple baselines
Çağrı Çöltekin | Matyáš Kopp | Katja Meden | Vaidas Morkevicius | Nikola Ljubešić | Tomaž Erjavec
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We describe the dataset, explain the reasoning behind some of the choices made during its creation, present statistics on it, and report baseline results from a simple classifier on two tasks: predicting political orientation on the left-to-right axis, and identifying power position, i.e., distinguishing speeches delivered by members of governing coalition parties from those delivered by members of opposition parties.

ParlaMint in TEITOK
Maarten Janssen | Matyáš Kopp
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

This paper describes the ParlaMint 4.0 parliamentary corpora as made available in TEITOK at LINDAT. The TEITOK interface makes it possible to search through the corpus, to view each session in a readable manner, and to explore the names in the corpus. The interface does not present any new data, but provides an access point to the ParlaMint corpus that is less exclusively oriented toward linguistic use and more accessible to the general public and to researchers from other fields.

2022

Annotating Attribution in Czech News Server Articles
Barbora Hladka | Jiří Mírovský | Matyáš Kopp | Václav Moravec
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper focuses on detecting sources in Czech articles published on a news server of Czech public radio. In particular, we search for attribution in sentences and recognize attributed sources and their sentence context (signals). We organized a crowdsourcing annotation task that resulted in a data set of 2,167 stories with manually recognized signals and sources. In addition, the sources were classified as named or unnamed.

ParlaMint II: The Show Must Go On
Maciej Ogrodniczuk | Petya Osenova | Tomaž Erjavec | Darja Fišer | Nikola Ljubešić | Çağrı Çöltekin | Matyáš Kopp | Katja Meden
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

In ParlaMint I, a CLARIN ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments was developed and released in 2021. For 2022 and 2023, the project has been extended to ParlaMint II, again with CLARIN ERIC financial support, in order to enhance the existing corpora with new data and metadata; upgrade the XML schema; add corpora for 10 new parliaments; provide more application scenarios; and carry out additional experiments. The paper reports on these planned steps, including some that have already been taken, and outlines future plans.

2020

Compiling Czech Parliamentary Stenographic Protocols into a Corpus
Barbora Hladka | Matyáš Kopp | Pavel Straňák
Proceedings of the Second ParlaCLARIN Workshop

The Parliament of the Czech Republic consists of two chambers: the Chamber of Deputies (Lower House) and the Senate (Upper House). In our work, we focus exclusively on the agenda and documents that relate to the Chamber of Deputies. We pay particular attention to the stenographic protocols that record the Chamber of Deputies’ meetings. Our overall goal is to (1) compile the protocols into a ParlaCLARIN TEI encoded corpus, (2) make this corpus accessible and searchable in the TEITOK web-based platform, (3) annotate the corpus using the modules available in TEITOK, e.g. detect and recognize named entities, and (4) highlight the annotations in TEITOK. In addition, we add two more goals that we consider innovative: (5) update the corpus every time a new stenographic protocol is published online by the Chamber of Deputies and (6) expose the annotations as linked open data in order to improve the protocols’ interoperability with other existing linked open data. This paper is devoted to goals (1) and (5).