Juhani Luotolahti


2020

From Web Crawl to Clean Register-Annotated Corpora
Veronika Laippala | Samuel Rönnqvist | Saara Hellström | Juhani Luotolahti | Liina Repo | Anna Salmela | Valtteri Skantsi | Sampo Pyysalo
Proceedings of the 12th Web as Corpus Workshop

The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to both steps. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare our boilerplate removal against the CoNLL texts and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are also promising in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a small set of registers, which are relatively similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.

2017

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

TurkuNLP: Delexicalized Pre-training of Word Embeddings for Dependency Parsing
Jenna Kanerva | Juhani Luotolahti | Filip Ginter
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the TurkuNLP entry in the CoNLL 2017 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies. The system is based on the UDPipe parser, and our focus is on exploring various techniques for pre-training the word embeddings used by the parser in order to improve its performance, especially on languages with small training sets. The system ranked 11th among the 33 participants overall: 8th on the small treebanks, 10th on the large treebanks, 12th on the parallel test sets, and 26th on the surprise languages.

Creating register sub-corpora for the Finnish Internet Parsebank
Veronika Laippala | Juhani Luotolahti | Aki-Juhani Kyröläinen | Tapio Salakoski | Filip Ginter
Proceedings of the 21st Nordic Conference on Computational Linguistics

Dep_search: Efficient Search Tool for Large Dependency Parsebanks
Juhani Luotolahti | Jenna Kanerva | Filip Ginter
Proceedings of the 21st Nordic Conference on Computational Linguistics

Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks v2.0
Juhani Luotolahti | Jenna Kanerva | Filip Ginter
Proceedings of the Third Workshop on Discourse in Machine Translation

In this paper, we present our system for the DiscoMT 2017 Shared Task on Cross-Lingual Pronoun Prediction. Our entry builds on last year's success, when our system based on deep recurrent neural networks outperformed all the other systems by a clear margin. This year we investigate whether different pre-trained word embeddings can be used to improve the neural systems, and whether the recently published Gated Convolutions outperform the Gated Recurrent Units used last year.

2016

Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks
Juhani Luotolahti | Jenna Kanerva | Filip Ginter
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

SETS: Scalable and Efficient Tree Search in Dependency Graphs
Juhani Luotolahti | Jenna Kanerva | Sampo Pyysalo | Filip Ginter
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

Turku: Semantic Dependency Parsing as a Sequence Classification
Jenna Kanerva | Juhani Luotolahti | Filip Ginter
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Sentence Compression For Automatic Subtitling
Juhani Luotolahti | Filip Ginter
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Towards Universal Web Parsebanks
Juhani Luotolahti | Jenna Kanerva | Veronika Laippala | Sampo Pyysalo | Filip Ginter
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

2014

Turku: Broad-Coverage Semantic Parsing with Rich Features
Jenna Kanerva | Juhani Luotolahti | Filip Ginter
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)