2019
Evaluating Automatic Term Extraction Methods on Individual Documents
Antonio Šajatović | Maja Buljan | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
Automatic Term Extraction (ATE) extracts terminology from domain-specific corpora. ATE is used in many NLP tasks, including Computer Assisted Translation, where it is typically applied to individual documents rather than the entire corpus. While corpus-level ATE has been extensively evaluated, it is not obvious how the results transfer to document-level ATE. To fill this gap, we evaluate 16 state-of-the-art ATE methods on full-length documents from three different domains, at both the corpus and document levels. Unlike existing studies, our evaluation is more realistic as we take into account all gold terms. We show that no single method is best in corpus-level ATE, but C-Value and KeyConceptRelatedness surpass others in document-level ATE.
2017
Two Layers of Annotation for Representing Event Mentions in News Stories
Maria Pia di Buono | Martin Tutek | Jan Šnajder | Goran Glavaš | Bojana Dalbelo Bašić | Nataša Milić-Frayling
Proceedings of the 11th Linguistic Annotation Workshop
In this paper, we describe our preliminary study on annotating event mentions as part of our research on high-precision news event extraction models. To this end, we propose a two-layer annotation scheme, designed to separately capture the functional and conceptual aspects of event mentions. We hypothesize that the precision of models can be improved by modeling and extracting the different aspects of news events separately, and then combining the extracted information by leveraging the complementarities of the models. In addition, we carry out a preliminary annotation using the proposed scheme and analyze the annotation quality in terms of inter-annotator agreement.
Predicting News Values from Headline Text and Emotions
Maria Pia di Buono | Jan Šnajder | Bojana Dalbelo Bašić | Goran Glavaš | Martin Tutek | Natasa Milic-Frayling
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism
We present a preliminary study on predicting news values from headline text and emotions. We perform a multivariate analysis on a dataset manually annotated with news values and emotions, discovering interesting correlations among them. We then train two competitive machine learning models – an SVM and a CNN – to predict news values using headline text and emotions as features. We find that, while both models yield satisfactory performance, some news values are more difficult to detect than others, and some profit more from including emotion information.
2015
TKLBLIIR: Detecting Twitter Paraphrases with TweetingJay
Vanja Mladen Karan | Goran Glavaš | Jan Šnajder | Bojana Dalbelo Bašić | Ivan Vulić | Marie-Francine Moens
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2012
Evaluation of Classification Algorithms and Features for Collocation Extraction in Croatian
Vanja Mladen Karan | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Collocations can be defined as words that occur together significantly more often than would be expected by chance. Many natural language processing applications, such as natural language generation, word sense disambiguation, and machine translation, can benefit from having access to information about collocated words. We approach collocation extraction as a classification problem where the task is to classify a given n-gram as either a collocation (positive) or a non-collocation (negative). Among the features used are word frequencies, classical association measures (Dice, PMI, chi2), and POS tags. In addition, semantic word relatedness modeled by latent semantic analysis is also included. We apply wrapper feature subset selection to determine the best set of features. The performance of various classification algorithms is tested. Experiments are conducted on a manually annotated set of bigrams and trigrams sampled from a Croatian newspaper corpus. The best results obtained are an F1 measure of 79.8 for bigrams and 67.5 for trigrams. The best classifier for bigrams was SVM, while for trigrams the decision tree gave the best performance. The features that contributed the most to overall performance were PMI, semantic relatedness, and POS information.
TakeLab: Systems for Measuring Semantic Text Similarity
Frane Šarić | Goran Glavaš | Vanja Mladen Karan | Jan Šnajder | Bojana Dalbelo Bašić
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
Experiments on Hybrid Corpus-Based Sentiment Lexicon Acquisition
Goran Glavaš | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data
2010
Corpus Aligner (CorAl) Evaluation on English-Croatian Parallel Corpora
Sanja Seljan | Marko Tadić | Željko Agić | Jan Šnajder | Bojana Dalbelo Bašić | Vjekoslav Osmann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
An increasing demand for new language resources for recent EU members and acceding countries has in turn initiated the development of different language tools and resources, such as alignment tools and corresponding translation memories for new language pairs. The primary goal of this paper is to describe CorAl (Corpus Aligner), a free sentence alignment tool developed at the Faculty of Electrical Engineering and Computing, University of Zagreb. The tool performs paragraph alignment as the first step of the alignment process, followed by sentence alignment. The description of the tool is followed by its evaluation. The paper describes an experiment in which the CorAl aligner is applied to an English-Croatian parallel corpus from the legislative domain, evaluated using precision, recall, and F1-measure. The results are discussed, and the concluding sections outline future directions of CorAl development.
2009
String Distance-Based Stemming of the Highly Inflected Croatian Language
Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the International Conference RANLP-2009
2008
Evolving New Lexical Association Measures Using Genetic Programming
Jan Šnajder | Bojana Dalbelo Bašić | Saša Petrović | Ivan Sikirić
Proceedings of ACL-08: HLT, Short Papers