Challenges in the Management of Large Corpora (2022)

Volumes

Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10) 7 papers

pdf (full)
bib (full) Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)

pdf bib
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Piotr Banski | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen

Following the successful creation of a national representative corpus of contemporary Romanian language, we turned our attention to the social media text, as present in micro-blogging platforms. In this paper, we present the current activities as well as the challenges faced when trying to apply existing tools (for both annotation and indexing) to a Romanian language micro-blogging corpus. These challenges are encountered at all annotation levels, including tokenization, and at the indexing stage. We consider that existing tools for Romanian language processing must be adapted to recognize features such as emoticons, emojis, hashtags, unusual abbreviations, elongated words (commonly used for emphasis in micro-blogging), multiple words joined together (within oroutside hashtags), and code-mixed text.

pdf bib abs
Exhaustive Indexing of PubMed Records with Medical Subject Headings
Modest von Korff

With fourteen million publication records the PubMed database is one of the largest repositories in medical science. Analysing this database to relate biological targets to diseases is an important task in pharmaceutical research. We developed a software tool, MeSHTreeIndexer, for indexing the PubMed medical literature with disease terms. The disease terms were taken from the Medical Subject Heading (MeSH) Terms compiled by the National Institutes of Health (NIH) of the US. In a first semi-automatic step we identified about 5’900 terms as disease related. The MeSH terms contain so-called entry points that are synonymously used for the terms. We created an inverted index for these 5’900 MeSH terms and their 58’000 entry points. From the PubMed database fourteen million publication records were stored in Lucene. These publication records were tagged by the inverted MeSH term index. In this contribution we demonstrate that our approach provided a significant higher enrichment in MeSH terms than the indexing of the PubMed records by the NIH themselves. Manual control proved that our enrichment is meaningful. Our software was written in Java and is available as open source.

pdf bib abs
UDeasy: a Tool for Querying Treebanks in CoNLL-U Format
Luca Brigada Villa

Many tools are available to query a dependency treebank, but they require the users to know a query language. In this paper I present UDeasy, an application whose main goal is to allow the users to easily query and extract patterns from a dependency treebank in CoNLL-U format.

pdf bib abs
Matrix and Double-Array Representations for Efficient Finite State Tokenization
Nils Diewald

This paper presents an algorithm and implementation for efficient tokenization of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.

pdf bib abs
Count-Based and Predictive Language Models for Exploring DeReKo
Peter Fankhauser | Marc Kupietz

We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skipgram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis we visualize the high dimensional word embeddings in two dimensions using t-stochastic neighbourhood embeddings. Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models w.r.t. scalability and maintainability in very large corpora.

pdf bib abs
“The word expired when that world awoke.” New Challenges for Research with Large Text Corpora and Corpus-Based Discourse Studies in Totalitarian Times
Hanno Biber

In the following poster proposal a report will be given on the prospects of a promising corpus project initiated by one of the large digital text corpora hosted by the Austrian Academy of Sciences. First, the resources of the AAC-Austrian Academy Corpus, that has been founded in 2001, which is one of the very valuable examples of digital diachronic text corpora suitable for corpus-based discourse studies and lexicography based upon historical sources, can be used as a basis for trying to answer new questions concerning the challenges for doing linguistic research with large digital text corpora in the context of studying totalitarian language use. The questions, as well as the chances and limits of such an approach, have very obvious actual references to the historic events unfolding today as well as a clearly historical dimension, precisely because the digital text sources that have been created to analyse the German language use of the Nazi-period from 1933 to 1945 can be understood as a model to deal with related questions of contemporary language use, particularly in the context of the new war of extermination of Russia in Ukraine of the year 2022 and how it is represented in contemporary media.