The hunvec framework for NN-CRF-based sequential tagging
Katalin Pajkossy | Attila Zséder
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this work we present the open source hunvec framework for sequential tagging, built upon Theano and Pylearn2. The underlying statistical model, which connects linear CRF-s with neural networks, was used by Collobert and co-workers, and several other researchers. For demonstrating the flexibility of our tool, we describe a set of experiments on part-of-speech and named-entity-recognition tasks, using English and Hungarian datasets, where we modify both model and training parameters, and illustrate the usage of custom features. Model parameters we experiment with affect the vectorial word representations used by the model; we apply different word vector initializations, defined by Word2vec and GloVe embeddings and enrich the representation of words by vectors assigned trigram features. We extend training methods by using their regularized (l2 and dropout) version. When testing our framework on a Hungarian named entity corpus, we find that its performance reaches the best published results on this dataset, with no need for language-specific feature engineering. Our code is available at http://github.com/zseder/hunvec
Structure Learning in Weighted Languages
András Kornai | Attila Zséder | Gábor Recski
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)
Rapid creation of large-scale corpora and frequency dictionaries
Attila Zséder | Gábor Recski | Dániel Varga | András Kornai
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We describe, and make public, large-scale language resources and the toolchain used in their creation, for fifteen medium density European languages: Catalan, Czech, Croatian, Danish, Dutch, Finnish, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Spanish, and Swedish. To make the process uniform across languages, we selected tools that are either language-independent or easily customizable for each language, and reimplemented all stages that were taking too long. To achieve processing times that are insignificant compared to the time data collection (crawling) takes, we reimplemented the standard sentence- and word-level tokenizers and created new boilerplate and near-duplicate detection algorithms. Preliminary experiments with non-European languages indicate that our methods are now applicable not just to our sample, but the entire population of digitally viable languages, with the main limiting factor being the availability of high quality stemmers.
NP Alignment in Bilingual Corpora
Gábor Recski | András Rung | Attila Zséder | András Kornai
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Aligning the NPs of parallel corpora is logically halfway between the sentence- and word-alignment tasks that occupy much of the MT literature, but has received far less attention. NP alignment is a challenging problem, capable of rapidly exposing flaws both in the word-alignment and in the NP chunking algorithms one may bring to bear. It is also a very rewarding problem in that NPs are semantically natural translation units, which means that (i) word alignments will cross NP boundaries only exceptionally, and (ii) within sentences already aligned, the proportion of 1-1 alignments will be higher for NPs than words. We created a simple gold standard for English-Hungarian, Orwells 1984, (since this already exists in manually verified POS-tagged format in many languages thanks to the Multex and MultexEast project) by manually verifying the automaticaly generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximum NP chunking problem is much harder than base NP chunking, with F-measure in the .7 range (as opposed to over .94 for base NPs). Since the results are highly impacted by the quality of the NP chunking, we tested our alignment algorithms both with real world (machine obtained) chunkings, where results are in the .35 range for the baseline algorithm which propagates GIZA++ word alignments to the NP level, and on idealized (manually obtained) chunkings, where the baseline reaches .4 and our current system reaches .64.