Atro Voutilainen

2018

Analysing Finnish with word lists: the DDI approach to morphology revisited
Atro Voutilainen | Maria Palolahti
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

2012

Refining the Design of a Contracting Finite-State Dependency Parser
Anssi Yli-Jyrä | Jussi Piitulainen | Atro Voutilainen
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

Specifying Treebanks, Outsourcing Parsebanks: FinnTreeBank 3
Atro Voutilainen | Kristiina Muhonen | Tanja Purtonen | Krister Lindén
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Corpus-based treebank annotation is known to result in incomplete coverage of mid- and low-frequency linguistic constructions: the linguistic representation and corpus annotation quality are sometimes suboptimal. Large descriptive grammars also cover many mid- and low-frequency constructions. We argue for the use of large descriptive grammars and their sample sentences as a basis for specifying higher-coverage grammatical representations. We present a sample case from an ongoing project (FIN-CLARIN FinnTreeBank) where a grammatical representation is documented as an annotator's manual alongside manual annotation of sample sentences extracted from a large descriptive grammar of Finnish. We outline the linguistic representation (morphology and dependency syntax) for Finnish, and show how the resulting 'Grammar Definition Corpus' and the documentation are used as a task specification for an external subcontractor building a parser engine for morphological and dependency syntactic analysis of large volumes of Finnish for parsebanking purposes. The resulting corpus, FinnTreeBank 3, is due for release in June 2012, and will contain tens of millions of words from publicly available corpora of Finnish with automatic morphological and dependency syntactic analysis, for use in research on corpus linguistics and language engineering.

Improving corpus annotation productivity: a method and experiment with interactive tagging
Atro Voutilainen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Corpus linguistic and language technological research needs empirical corpus data with nearly correct annotation and high volume to enable advances in language modelling and theorising. Recent work on improving corpus annotation accuracy presents semiautomatic methods to correct some of the analysis errors in available annotated corpora, while leaving the remaining errors undetected in the annotated corpus. We review recent advances in linguistics-based partial tagging and parsing, and regard the achieved analysis performance as sufficient for reconsidering a previously proposed method: combining nearly correct but partial automatic analysis with a minimal amount of human postediting (disambiguation) to achieve nearly correct corpus annotation accuracy at a competitive annotation speed. We report a pilot experiment with morphological (part-of-speech) annotation using a partial linguistic tagger of a kind previously reported with a very attractive precision-recall ratio, and observe that a desired level of annotation accuracy can be reached by using human disambiguation for less than 10% of the words in the corpus.