Maarten Janssen

2025

Searchable Language Documentation Corpora: DoReCo meets TEITOK
Maarten Janssen | Frank Seifart
Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics

In this paper, we describe a newly created searchable interface for DoReCo, a database that contains spoken corpora from a world-wide sample of 53 mostly lesser-described languages, with audio, transcription, translation, and (for most languages) interlinear morpheme glosses. Until now, DoReCo data were available for download via the DoReCo website and via the Nakala repository in a number of different formats, but not directly accessible online. We created a graphical interface to view, listen to, and search these data online, providing direct and intuitive access for linguists and laypeople. The new interface uses the TEITOK corpus infrastructure to provide a number of different visualizations of individual documents in DoReCo and a search interface to perform detailed searches on individual languages. The use of TEITOK also makes the corpus usable in NLP pipelines, either to train NLP models on the data or to enrich the data with NLP models.

Alignment of Historical Manuscript Transcriptions and Translations
Maarten Janssen | Piroska Lendvai | Anna Jouravel
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Using an XML-based framework, we compiled a gold standard for alignments in five primary as well as derived texts, related to De Lepra ad Sistelium by Methodius Olympius. These comprise diplomatic transcripts, editions, and translations of this work, involving both historical and modern languages. Using the TEITOK corpus platform, we created sentence-level gold standard alignments for our parallel or comparable texts, and applied both neural and classical alignment methods (SentenceBERT, Hunalign, Awesome-Align). We evaluated the methods in terms of Alignment Error Rate. We show that for alignment of our historical texts, Hunalign performs better than deep-learning-based methods.

2024

UDMorph: Morphosyntactically Tagged UD Corpora
Maarten Janssen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

UDMorph provides an infrastructure parallel to that provided by UD for annotated corpus data that follow the UD guidelines, but do not provide dependency relations: a place where new annotated data-sets can be deposited, and existing data-sets can be found and downloaded. It also provides a corpus creation environment to easily create annotated data for additional languages. And it provides a REST and GUI interface to a growing collection of taggers with CoNLL-U output, currently for around 150 different languages.

ParlaMint in TEITOK
Maarten Janssen | Matyáš Kopp
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

This paper describes the ParlaMint 4.0 parliamentary corpora as made available in TEITOK at LINDAT. The TEITOK interface makes it possible to search through the corpus, to view each session in a readable manner, and to explore the names in the corpus. The interface does not present any new data, but provides an access point to the ParlaMint corpus that is less oriented to linguistic use only, and more accessible for the general public or researchers from other fields.

We present the COPLE2 corpus, a learner corpus of Portuguese that includes written and spoken texts produced by learners of Portuguese as a second or foreign language. The corpus currently includes a total of 182,474 tokens and 978 texts, classified according to the CEFR scales. The original handwritten productions are transcribed in TEI-compliant XML format and keep a record of all the original information, such as reformulations, insertions and corrections made by the teacher, while the recordings are transcribed and aligned with EXMARaLDA. The TEITOK environment enables different views of the same document (XML, student version, corrected version), a CQP-based search interface, and POS tagging, lemmatization and normalization of the tokens, and will soon be used for error annotation in stand-off format. The corpus has already been a source of data for phonological, lexical and syntactic interlanguage studies and will be used for a data-informed selection of language features for each proficiency level.

TEITOK: Text-Faithful Annotated Corpora
Maarten Janssen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

TEITOK is a web-based framework for corpus creation, annotation, and distribution that combines textual and linguistic annotation within a single TEI-based XML document. TEITOK provides several built-in NLP tools to automatically (pre)process texts, and is highly customizable. It features multiple orthographic transcription layers, and a wide range of user-defined token-based annotations. For searching, TEITOK interfaces with a local CQP server. TEITOK can handle various types of additional resources, including facsimile images and linked audio files, making it possible to have a combined written/spoken corpus. It also has additional modules for PSDX syntactic annotation and several types of stand-off annotation.

Towards error annotation in a learner corpus of Portuguese
Iria del Río | Sandra Antunes | Amália Mendes | Maarten Janssen
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

2012

The Common Orthographic Vocabulary of the Portuguese Language: a set of open lexical resources for a pluricentric language
José Pedro Ferreira | Maarten Janssen | Gladis Barcellos de Oliveira | Margarita Correia | Gilvan Müller de Oliveira
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper outlines the design principles and choices, as well as the ongoing development process of the Common Orthographic Vocabulary of the Portuguese Language (VOC), a large scale electronic lexical database which was adopted by the Community of Portuguese-Speaking Countries' (CPLP) Instituto Internacional da Língua Portuguesa to implement a spelling reform that is currently taking place. Given the different available resources and lexicographic traditions within the CPLP countries, a range of different solutions was adopted for different countries and integrated into a common development framework. Although the publication of lexicographic resources to implement spelling reforms has always been done for Portuguese, VOC represents a paradigm change, switching from idiosyncratic, closed source, paper-format official resources to standardized, open, free, web-accessible and reusable ones. We start by outlining the context that justifies the resource development and its requirements, then focus on the description of the methodology, workflow and tools used, showing how a collaborative project in a common web-based platform and administration interface makes the creation of such a long-sought and ambitious project possible.

NeoTag: a POS Tagger for Grammatical Neologism Detection
Maarten Janssen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

POS Taggers typically fail to correctly tag grammatical neologisms: for known words, a tagger will only take known tags into account, and hence discard any possibility that the word is used in a novel or deviant grammatical category in the text at hand. Grammatical neologisms are relatively rare, and therefore do not pose a significant problem for the overall performance of a tagger. But for studies on neologisms and grammaticalization processes, this makes traditional taggers rather unfit. This article describes a modified POS tagger that explicitly considers new tags for known words, hence making it better fit for neologism research. This tagger, called NeoTag, has an overall accuracy that is comparable to other taggers, but scores much better for grammatical neologisms. To achieve this, the tagger applies a system of lexical smoothing, which adds new categories to known words based on known homographs. NeoTag also lemmatizes words as part of the tagging system, achieving a high accuracy on lemmatization for both known and unknown words, without the need for an external lexicon. The use of NeoTag is not restricted to grammatical neologism detection, and it can be used for other purposes as well.

2010

Combining Resources: Taxonomy Extraction from Multiple Dictionaries
Rogelio Nazar | Maarten Janssen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The idea that dictionaries are a good source for (computational) information has been around for a long while, and the extraction of taxonomic information from them is something that has been attempted several times. However, such information extraction was typically based on the systematic analysis of the text of a single dictionary. In this paper, we demonstrate how it is possible to extract taxonomic information without any analysis of the specific text, by comparing the same lexical entry in a number of different dictionaries. Counting word frequencies in the dictionary entry for the same word in different dictionaries leads to a surprisingly good recovery of taxonomic information, without the need for any syntactic analysis of the entries in question nor any kind of language-specific treatment. As a case in point, we will show in this paper an experiment extracting hyperonymy relations from several Spanish dictionaries, measuring the effect that the number of dictionaries has on the results.

2008

Spock - a Spoken Corpus Client
Maarten Janssen | Tiago Freitas
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Spock is an open source tool for the easy deployment of time-aligned corpora. It is fully web-based, and has very limited server-side requirements. It allows the end-user to search the corpus in a text-driven manner, obtaining both the transcription and the corresponding sound fragment in the result page. Spock has an administration environment to help manage the sound files and their respective transcription files, and also provides statistical data about the files at hand. Spock uses a proprietary file format for storing the alignment data, but the integrated admin environment allows you to import files from a number of common file formats. Spock is not intended as a transcriber program: it is not meant as an alternative to programs such as ELAN, Wavesurfer, or Transcriber, but rather to make corpora created with these tools easily available online. For the end user, Spock provides a very easy way of accessing spoken corpora, without the need to install any special software, which might make time-aligned corpora accessible to a large group of users who might otherwise never look at them.
