Tapio Salakoski


2019

An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora
Hans Moen | Laura-Maria Peltonen | Henry Suhonen | Hanna-Maria Matinolli | Riitta Mieronkoski | Kirsi Telen | Kirsi Terho | Tapio Salakoski | Sanna Salanterä
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We present our work towards developing a system that should find, in a large text corpus, contiguous phrases expressing a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised and uses a combination of a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different order. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that the use of distributional semantic models trained with uni-, bi- and trigrams seems to work better than a more traditional unigram model.

Template-free Data-to-Text Generation of Finnish Sports News
Jenna Kanerva | Samuel Rönnqvist | Riina Kekki | Tapio Salakoski | Filip Ginter
Proceedings of the 22nd Nordic Conference on Computational Linguistics

News articles such as sports game reports are often thought to closely follow the underlying game statistics, but in practice they contain a notable amount of background knowledge, interpretation, insight into the game, and quotes that are not present in the official statistics. This poses a challenge for automated data-to-text news generation with real-world news corpora as training data. We report on the development of a corpus of Finnish ice hockey news, edited to be suitable for training end-to-end news generation methods, and demonstrate the generation of text that journalists judged to be relatively close to a viable product. The new dataset and system source code are available for research purposes.

Is Multilingual BERT Fluent in Language Generation?
Samuel Rönnqvist | Jenna Kanerva | Tapio Salakoski | Filip Ginter
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

The multilingual BERT model is trained on 104 languages and meant to serve as a universal language model and tool for encoding sentences. We explore how well the model performs on several languages across several tasks: a diagnostic classification probing the embeddings for a particular syntactic property, a cloze task testing the language modelling ability to fill in gaps in a sentence, and a natural language generation task testing for the ability to produce coherent text fitting a given context. We find that the currently available multilingual BERT model is clearly inferior to the monolingual counterparts, and cannot in many cases serve as a substitute for a well-trained monolingual model. We find that the English and German models perform well at generation, whereas the multilingual model is lacking, in particular, for Nordic languages. The code of the experiments in the paper is available at: https://github.com/TurkuNLP/bert-eval

2018

Biomedical Event Extraction Using Convolutional Neural Networks and Dependency Parsing
Jari Björne | Tapio Salakoski
Proceedings of the BioNLP 2018 workshop

Event and relation extraction are central tasks in biomedical text mining. Where relation extraction concerns the detection of semantic connections between pairs of entities, event extraction expands this concept with the addition of trigger words, multiple arguments and nested events, in order to more accurately model the diversity of natural language. In this work we develop a convolutional neural network that can be used for both event and relation extraction. We use a linear representation of the input text, where information is encoded with various vector space embeddings. Most notably, we encode the parse graph into this linear space using dependency path embeddings. We integrate our neural network into the open source Turku Event Extraction System (TEES) framework. Using this system, our machine learning model can be easily applied to a large set of corpora from e.g. the BioNLP, DDI Extraction and BioCreative shared tasks. We evaluate our system on 12 different event, relation and NER corpora, showing good generalizability to many tasks and achieving improved performance on several corpora.

Evaluation of a Prototype System that Automatically Assigns Subject Headings to Nursing Narratives Using Recurrent Neural Network
Hans Moen | Kai Hakala | Laura-Maria Peltonen | Henry Suhonen | Petri Loukasmäki | Tapio Salakoski | Filip Ginter | Sanna Salanterä
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

We present our initial evaluation of a prototype system designed to assist nurses in assigning subject headings to nursing narratives written in the context of documenting patient care in hospitals. Currently nurses may need to memorize several hundred subject headings from standardized nursing terminologies when structuring and assigning the right section/subject headings to their text. Our aim is to allow nurses to write in a narrative manner without having to plan and structure the text with respect to sections and subject headings; instead, the system should assist with the assignment of subject headings and restructuring afterwards. We hypothesize that this could reduce the time and effort needed for nursing documentation in hospitals. A central component of the system is a text classification model based on a long short-term memory (LSTM) recurrent neural network architecture, trained on a large data set of nursing notes. A simple Web-based interface has been implemented for user interaction. To evaluate the system, three nurses write a set of artificial nursing shift notes in a fully unstructured narrative manner, without planning for or considering the use of sections and subject headings. These are then fed to the system, which assigns subject headings to each sentence and then groups them into paragraphs. Manual evaluation is conducted by a group of nurses. The results show that about 70% of the sentences are assigned to correct subject headings. The nurses believe that such a system can be of great help in making nursing documentation in hospitals easier and less time consuming. Finally, various measures and approaches for improving the system are discussed.

Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task
Jenna Kanerva | Filip Ginter | Niko Miekka | Akseli Leino | Tapio Salakoski
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper we describe the TurkuNLP entry at the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies. Compared to last year, this year's shared task includes two new main metrics that measure morphological tagging and lemmatization accuracy in addition to syntactic trees. Motivated by these new metrics, we developed an end-to-end parsing pipeline, focusing especially on developing a novel and state-of-the-art component for lemmatization. Our system reached the highest aggregate ranking on three main metrics out of 26 teams by achieving 1st place on the metric involving lemmatization, and 2nd on both morphological tagging and parsing.

2017

Creating register sub-corpora for the Finnish Internet Parsebank
Veronika Laippala | Juhani Luotolahti | Aki-Juhani Kyröläinen | Tapio Salakoski | Filip Ginter
Proceedings of the 21st Nordic Conference on Computational Linguistics

A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
Aleksi Vesanto | Filip Ginter | Hannu Salmi | Asko Nivala | Tapio Salakoski
Proceedings of the 21st Nordic Conference on Computational Linguistics

Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910
Aleksi Vesanto | Asko Nivala | Heli Rantala | Tapio Salakoski | Hannu Salmi | Filip Ginter
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

End-to-End System for Bacteria Habitat Extraction
Farrokh Mehryary | Kai Hakala | Suwisa Kaewphan | Jari Björne | Tapio Salakoski | Filip Ginter
BioNLP 2017

We introduce an end-to-end system capable of named-entity detection, normalization and relation extraction for extracting information about bacteria and their habitats from biomedical literature. Our system is based on deep learning, CRF classifiers and vector space models. We train and evaluate the system on the BioNLP 2016 Shared Task Bacteria Biotope data. The official evaluation shows that the joint performance of our entity detection and relation extraction models outperforms the winning team of the Shared Task by 19 percentage points in F1-score, establishing a new top score for the task. We also achieve state-of-the-art results in the normalization task. Our system is open source and freely available at https://github.com/TurkuNLP/BHE.

Detecting mentions of pain and acute confusion in Finnish clinical text
Hans Moen | Kai Hakala | Farrokh Mehryary | Laura-Maria Peltonen | Tapio Salakoski | Filip Ginter | Sanna Salanterä
BioNLP 2017

We study and compare two different approaches to the task of automatic assignment of predefined classes to clinical free-text narratives. In the first approach this is treated as a traditional mention-level named-entity recognition task, while the second approach treats it as a sentence-level multi-label classification task. Performance comparison across these two approaches is conducted in the form of sentence-level evaluation and state-of-the-art methods for both approaches are evaluated. The experiments are done on two data sets consisting of Finnish clinical text, manually annotated with respect to the topics pain and acute confusion. Our results suggest that the mention-level named-entity recognition approach outperforms sentence-level classification overall, but the latter approach still manages to achieve the best prediction scores on several annotation classes.

2016

Syntactic analyses and named entity recognition for PubMed and PubMed Central — up-to-the-minute
Kai Hakala | Suwisa Kaewphan | Tapio Salakoski | Filip Ginter
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task 2016
Farrokh Mehryary | Jari Björne | Sampo Pyysalo | Tapio Salakoski | Filip Ginter
Proceedings of the 4th BioNLP Shared Task Workshop

UTU at SemEval-2016 Task 10: Binary Classification for Expression Detection (BCED)
Jari Björne | Tapio Salakoski
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality
Veronika Laippala | Jenna Kanerva | Anna Missilä | Sampo Pyysalo | Tapio Salakoski | Filip Ginter
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

Care Episode Retrieval
Hans Moen | Erwin Marsi | Filip Ginter | Laura-Maria Murtola | Tapio Salakoski | Sanna Salanterä
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)

2013

UTurku: Drug Named Entity Recognition and Drug-Drug Interaction Extraction Using SVM Classification and Domain Knowledge
Jari Björne | Suwisa Kaewphan | Tapio Salakoski
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

TEES 2.1: Automated Annotation Scheme Learning in the BioNLP 2013 Shared Task
Jari Björne | Tapio Salakoski
Proceedings of the BioNLP Shared Task 2013 Workshop

EVEX in ST’13: Application of a large-scale text mining resource to event extraction and network construction
Kai Hakala | Sofie Van Landeghem | Tapio Salakoski | Yves Van de Peer | Filip Ginter
Proceedings of the BioNLP Shared Task 2013 Workshop

Predicting Conjunct Propagation and Other Extended Stanford Dependencies
Jenna Nyblom | Samuel Kohonen | Katri Haverinen | Tapio Salakoski | Filip Ginter
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

Towards a Dependency-Based PropBank of General Finnish
Katri Haverinen | Veronika Laippala | Samuel Kohonen | Anna Missilä | Jenna Nyblom | Stina Ojala | Timo Viljanen | Tapio Salakoski | Filip Ginter
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Building a Large Automatically Parsed Corpus of Finnish
Filip Ginter | Jenna Nyblom | Veronika Laippala | Samuel Kohonen | Katri Haverinen | Simo Vihjanen | Tapio Salakoski
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

PubMed-Scale Event Extraction for Post-Translational Modifications, Epigenetics and Protein Structural Relations
Jari Björne | Sofie Van Landeghem | Sampo Pyysalo | Tomoko Ohta | Filip Ginter | Yves Van de Peer | Sophia Ananiadou | Tapio Salakoski
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

2011

EVEX: A PubMed-Scale Resource for Homology-Based Generalization of Text Mining Predictions
Sofie Van Landeghem | Filip Ginter | Yves Van de Peer | Tapio Salakoski
Proceedings of BioNLP 2011 Workshop

Generalizing Biomedical Event Extraction
Jari Björne | Tapio Salakoski
Proceedings of BioNLP Shared Task 2011 Workshop

2010

Dependency-Based PropBanking of Clinical Finnish
Katri Haverinen | Filip Ginter | Timo Viljanen | Veronika Laippala | Tapio Salakoski
Proceedings of the Fourth Linguistic Annotation Workshop

Scaling up Biomedical Event Extraction to the Entire PubMed
Jari Björne | Filip Ginter | Sampo Pyysalo | Jun’ichi Tsujii | Tapio Salakoski
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

Reconstruction of Semantic Relationships from Their Projections in Biomolecular Domain
Juho Heimonen | Jari Björne | Tapio Salakoski
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

2009

Extracting Complex Biological Events with Rich Graph-Based Feature Sets
Jari Björne | Juho Heimonen | Filip Ginter | Antti Airola | Tapio Pahikkala | Tapio Salakoski
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

Learning to Extract Biological Event and Relation Graphs
Jari Björne | Filip Ginter | Juho Heimonen | Sampo Pyysalo | Tapio Salakoski
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers
Katri Haverinen | Filip Ginter | Veronika Laippala | Tapio Salakoski
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

A Graph Kernel for Protein-Protein Interaction Extraction
Antti Airola | Sampo Pyysalo | Jari Björne | Tapio Pahikkala | Filip Ginter | Tapio Salakoski
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

2007

On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA
Sampo Pyysalo | Filip Ginter | Veronika Laippala | Katri Haverinen | Juho Heimonen | Tapio Salakoski
Biological, translational, and clinical language processing

Utterance-Initial Duration of Finnish Non-Plosive Consonants
Tuomo Saarni | Jussi Hakokari | Olli Aaltonen | Jouni Isoaho | Tapio Salakoski
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

Role of Different Spectral Attributes in Vowel Categorization: the Case of Udmurt
Janne Savela | Stina Ojala | Olli Aaltonen | Tapio Salakoski
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

A Probabilistic Search for the Best Solution Among Partially Completed Candidates
Filip Ginter | Aleksandr Mylläri | Tapio Salakoski
Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing

2004

Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-Protein Interactions
Sampo Pyysalo | Filip Ginter | Tapio Pahikkala | Jorma Boberg | Jouni Järvinen | Tapio Salakoski | Jeppe Koivula
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)