Towards Instance-Level Parser Selection for Cross-Lingual Transfer of Dependency Parsers
Robert Litschko | Ivan Vulić | Željko Agić | Goran Glavaš
Proceedings of the 28th International Conference on Computational Linguistics

Current methods of cross-lingual parser transfer focus on predicting the best parser for a low-resource target language globally, that is, “at treebank level”. In this work, we propose and argue for a novel cross-lingual transfer paradigm: instance-level parser selection (ILPS), and present a proof-of-concept study focused on instance-level selection in the framework of delexicalized parser transfer. Our work is motivated by an empirical observation that different source parsers are the best choice for different Universal POS-sequences (i.e., UPOS sentences) in the target language. We then propose to predict the best parser at the instance level. To this end, we train a supervised regression model, based on the Transformer architecture, to predict parser accuracies for individual POS-sequences. We compare ILPS against two strong single-best parser selection baselines (SBPS): (1) a model that compares POS n-gram distributions between the source and target languages (KL) and (2) a model that selects the source based on the similarity between manually created language vectors encoding syntactic properties of languages (L2V). The results from our extensive evaluation, coupling 42 source parsers and 20 diverse low-resource test languages, show that ILPS outperforms KL and L2V on 13/20 and 14/20 test languages, respectively. Further, we show that by predicting the best parser “at treebank level” (SBPS), using the aggregation of predictions from our instance-level model, we outperform the same baselines on 17/20 and 16/20 test languages.
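The KL baseline mentioned in the abstract above can be illustrated with a minimal sketch. It assumes POS trigrams, add-one smoothing and the divergence direction KL(target || source); none of these details are stated in the abstract, and the function names `ngram_counts`, `kl_divergence` and `select_source` are hypothetical.

```python
from collections import Counter
from math import log


def ngram_counts(sentences, n=3):
    """Count POS n-grams over a list of UPOS tag sequences."""
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def kl_divergence(target_counts, source_counts, alpha=1.0):
    """KL(target || source) over the joint n-gram vocabulary.

    Additive smoothing (alpha) is an assumption made here so that
    n-grams unseen on one side do not yield an infinite divergence.
    """
    vocab = set(target_counts) | set(source_counts)
    t_total = sum(target_counts.values()) + alpha * len(vocab)
    s_total = sum(source_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (target_counts[gram] + alpha) / t_total
        q = (source_counts[gram] + alpha) / s_total
        kl += p * log(p / q)
    return kl


def select_source(target_sents, sources, n=3):
    """Return the name of the source whose n-gram distribution is closest."""
    target = ngram_counts(target_sents, n)
    return min(
        sources,
        key=lambda name: kl_divergence(target, ngram_counts(sources[name], n)),
    )


# Toy data, invented for illustration only.
target = [["DET", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]]
sources = {
    "similar": [["DET", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]],
    "different": [["VERB", "NOUN", "DET"], ["VERB", "NOUN", "ADJ", "DET"]],
}
best = select_source(target, sources)
```

A selector of this kind picks one source parser for the whole target treebank, which is the "treebank level" setting that instance-level selection is contrasted with.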

MultiQT: Multimodal learning for real-time question tracking in speech
Jakob D. Havtorn | Jan Latko | Joakim Edin | Lars Maaløe | Lasse Borgholt | Lorenzo Belgrano | Nicolai Jacobsen | Regitze Sdun | Željko Agić
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We address a challenging and practical task of labeling questions in speech in real time during telephone calls to emergency medical services in English, which embeds within a broader decision support system for emergency call-takers. We propose a novel multimodal approach to real-time sequence labeling in speech. Our model treats speech and its own textual representation as two separate modalities or views, as it jointly learns from streamed audio and its noisy transcription into text via automatic speech recognition. Our results show significant gains of jointly learning from the two modalities when compared to text or audio only, under adverse noise and limited volume of training data. The results generalize to medical symptoms detection where we observe a similar pattern of improvements with multimodal learning.


JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages
Željko Agić | Ivan Vulić
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Viable cross-lingual transfer critically depends on the availability of parallel texts. Shortage of such resources imposes a development and evaluation bottleneck in multilingual processing. We introduce JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. In this paper, we present the resource and showcase its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.


Toward Universal Dependencies for Shipibo-Konibo
Alonso Vasquez | Renzo Ego Aguirre | Candy Angulo | John Miller | Claudia Villanueva | Željko Agić | Roberto Zariquiey | Arturo Oncevay
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We present an initial version of the Universal Dependencies (UD) treebank for Shipibo-Konibo, the first South American, Amazonian, Panoan and Peruvian language with a resource built under UD. We describe the linguistic aspects of how the tagset was defined and the treebank was annotated; in addition we present our specific treatment of linguistic units called clitics. Although the treebank is still under development, it allowed us to perform a typological comparison against Spanish, the predominant language in Peru, and dependency syntax parsing experiments in both monolingual and cross-lingual approaches.

Low-resource named entity recognition via multi-source projection: Not quite there yet?
Jan Vium Enghoff | Søren Harrison | Željko Agić
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

Projecting linguistic annotations through word alignments is one of the most prevalent approaches to cross-lingual transfer learning. Conventional wisdom suggests that annotation projection “just works” regardless of the task at hand. We carefully consider multi-source projection for named entity recognition. Our experiment with 17 languages shows that to detect named entities in true low-resource languages, annotation projection may not be the right way to move forward. On a more positive note, we also uncover the conditions that do favor named entity projection from multiple sources. We argue these are infeasible under noisy low-resource constraints.

Distant Supervision from Disparate Sources for Low-Resource Part-of-Speech Tagging
Barbara Plank | Željko Agić
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce a cross-lingual neural part-of-speech tagger that learns from disparate sources of distant supervision, and realistically scales to hundreds of low-resource languages. The model exploits annotation projection, instance selection, tag dictionaries, morphological lexicons, and distributed representations, all in a uniform framework. The approach is simple, yet surprisingly effective, resulting in a new state of the art without access to any gold annotated data.

Baselines and Test Data for Cross-Lingual Inference
Željko Agić | Natalie Schluter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Cross-Lingual Parser Selection for Low-Resource Languages
Željko Agić
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

Empirically Sampling Universal Dependencies
Natalie Schluter | Željko Agić
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages
Tanja Samardžić | Mirjana Starović | Željko Agić | Nikola Ljubešić
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

The paper documents the procedure of building a new Universal Dependencies (UDv2) treebank for Serbian starting from an existing Croatian UDv1 treebank and taking into account the other Slavic UD annotation guidelines. We describe the automatic and manual annotation procedures, discuss the annotation of Slavic-specific categories (case governing quantifiers, reflexive pronouns, question particles) and propose an approach to handling deverbal nouns in Slavic languages.

How (not) to train a dependency parser: The curious case of jackknifing part-of-speech taggers
Željko Agić | Natalie Schluter
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In dependency parsing, jackknifing taggers is indiscriminately used as a simple adaptation strategy. Here, we empirically evaluate when and how (not) to use jackknifing in parsing. On 26 languages, we reveal a preference that conflicts with, and surpasses, the ubiquitous ten-folding. We show no clear benefits of tagging the training data in cross-lingual parsing.

Parsing Universal Dependencies without training
Héctor Martínez Alonso | Željko Agić | Barbara Plank | Anders Søgaard
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We present UDP, the first training-free parser for Universal Dependencies (UD). Our algorithm is based on PageRank and a small set of specific dependency head rules. UDP features two-step decoding to guarantee that function words are attached as leaf nodes. The parser requires no training, and it is competitive with a delexicalized transfer system. UDP offers a linguistically sound unsupervised alternative to cross-lingual parsing for UD. The parser has very few parameters and is distinctly robust to domain change across languages.

Cross-lingual tagger evaluation without test data
Željko Agić | Barbara Plank | Anders Søgaard
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We address the challenge of cross-lingual POS tagger evaluation in absence of manually annotated test data. We put forth and evaluate two dictionary-based metrics. On the tasks of accuracy prediction and system ranking, we reveal that these metrics are reliable enough to approximate test set-based evaluation, and at the same time lean enough to support assessment for truly low-resource languages.


Multilingual Projection for Parsing Truly Low-Resource Languages
Željko Agić | Anders Johannsen | Barbara Plank | Héctor Martínez Alonso | Natalie Schluter | Anders Søgaard
Transactions of the Association for Computational Linguistics, Volume 4

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.

New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian
Nikola Ljubešić | Filip Klubička | Željko Agić | Ivo-Pavao Jazbec
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present newly developed inflectional lexicons and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex - two freely available inflectional lexicons of Croatian and Serbian - and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manually annotated corpus of Croatian, 500 thousand tokens in size. We showcase the three newly developed resources on the task of morphosyntactic annotation of both languages by using a recently developed CRF tagger. We achieve the best results yet reported on the task for both languages, beating the HunPos baseline trained on the same datasets by a wide margin.

Joint part-of-speech and dependency projection from multiple sources
Anders Johannsen | Željko Agić | Anders Søgaard
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


Do dependency parsing metrics correlate with human judgments?
Barbara Plank | Héctor Martínez Alonso | Željko Agić | Danijela Merkler | Anders Søgaard
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

Inverted indexing for cross-lingual NLP
Anders Søgaard | Željko Agić | Héctor Martínez Alonso | Barbara Plank | Bernd Bohnet | Anders Johannsen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages
Željko Agić | Dirk Hovy | Anders Søgaard
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Semantic Dependency Graph Parsing Using Tree Approximations
Željko Agić | Alexander Koller | Stephan Oepen
Proceedings of the 11th International Conference on Computational Semantics

Universal Dependencies for Croatian (that work for Serbian, too)
Željko Agić | Nikola Ljubešić
The 5th Workshop on Balto-Slavic Natural Language Processing


Potsdam: Semantic Dependency Parsing by Bidirectional Graph-Tree Transformations and Syntactic Parsing
Željko Agić | Alexander Koller
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Language Processing Infrastructure in the XLike Project
Lluís Padró | Željko Agić | Xavier Carreras | Blaz Fortuna | Esteban García-Cuesta | Zhixing Li | Tadej Štajner | Marko Tadić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the linguistic analysis tools and their infrastructure developed within the XLike project. The main goal of the implemented tools is to provide a set of functionalities for supporting some of the main objectives of XLike, such as enabling cross-lingual services for publishers, media monitoring or developing new business intelligence applications. The services cover seven major and minor languages: English, German, Spanish, Chinese, Catalan, Slovenian, and Croatian. These analyzers are provided as web services following a lightweight SOA architecture approach; they are publicly callable and catalogued in META-SHARE.

The SETimes.HR Linguistically Annotated Corpus of Croatian
Željko Agić | Nikola Ljubešić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present SETimes.HR ― the first linguistically annotated corpus of Croatian that is freely available for all purposes. The corpus is built on top of the SETimes parallel corpus of nine Southeast European languages and English. It is manually annotated for lemmas, morphosyntactic tags, named entities and dependency syntax. We couple the corpus with domain-sensitive test sets for Croatian and Serbian to support direct model transfer evaluation between these closely related languages. We build and evaluate statistical models for lemmatization, morphosyntactic tagging, named entity recognition and dependency parsing on top of SETimes.HR and the test sets, providing the state of the art in all the tasks. We make all resources presented in the paper freely available under a very permissive licensing scheme.

Croatian Dependency Treebank 2.0: New Annotation Guidelines for Improved Parsing
Željko Agić | Daša Berović | Danijela Merkler | Marko Tadić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a new version of the Croatian Dependency Treebank. It constitutes a slight departure from the previously closely observed Prague Dependency Treebank syntactic layer annotation guidelines as we introduce a new subset of syntactic tags on top of the existing tagset. These new tags are used in explicit annotation of subordinate clauses via subordinate conjunctions. Introducing the new annotation to the Croatian Dependency Treebank, we also modify head attachment rules addressing subordinate conjunctions and subordinate clause predicates. In an experiment with data-driven dependency parsing, we show that implementing these new annotation guidelines leads to a statistically significant improvement in parsing accuracy. We also observe a substantial improvement in inter-annotator agreement, facilitating more consistent annotation in further treebank development.

XLike Project Language Analysis Services
Xavier Carreras | Lluís Padró | Lei Zhang | Achim Rettinger | Zhixing Li | Esteban García-Cuesta | Željko Agić | Božo Bekavac | Blaz Fortuna | Tadej Štajner
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Treebank Translation for Cross-Lingual Parser Induction
Jörg Tiedemann | Željko Agić | Joakim Nivre
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
Željko Agić | Jörg Tiedemann | Danijela Merkler | Simon Krek | Kaja Dobrovoljc | Sara Može
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants


Lemmatization and Morphosyntactic Tagging of Croatian and Serbian
Željko Agić | Nikola Ljubešić | Danijela Merkler
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Parsing Croatian and Serbian by Using Croatian Dependency Treebanks
Željko Agić | Danijela Merkler | Daša Berović
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

Building and Evaluating a Distributional Memory for Croatian
Jan Šnajder | Sebastian Padó | Željko Agić
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


Rule-Based Sentiment Analysis in Narrow Domain: Detecting Sentiment in Daily Horoscopes Using Sentiscope
Zeljko Agic | Danijela Merkler
Proceedings of the 2nd Workshop on Sentiment Analysis where AI meets Psychology

K-Best Spanning Tree Dependency Parsing With Verb Valency Lexicon Reranking
Zeljko Agic
Proceedings of COLING 2012: Posters

Croatian Dependency Treebank: Recent Development and Initial Experiments
Daša Berović | Željko Agić | Marko Tadić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the current state of development of the Croatian Dependency Treebank, with special emphasis on adapting the Prague Dependency Treebank formalism to Croatian language specifics, and illustrate its possible applications in an experiment with dependency parsing using MaltParser. The treebank currently contains approximately 2870 sentences, of which 2699 sentences and 66930 tokens were used in this experiment. Three linear-time projective algorithms implemented by the MaltParser system (Nivre eager, Nivre standard and stack projective), running on default settings, were used in the experiment. The highest performing system, implementing the Nivre eager algorithm, scored LAS 71.31, UAS 80.93 and LA 83.87 within our experiment setup. The results obtained serve as an illustration of the treebank's usefulness in natural language processing research and as a baseline for further research in dependency parsing of Croatian.


Corpus Aligner (CorAl) Evaluation on English-Croatian Parallel Corpora
Sanja Seljan | Marko Tadić | Željko Agić | Jan Šnajder | Bojana Dalbelo Bašić | Vjekoslav Osmann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

An increasing demand for new language resources of recent EU members and acceding countries has in turn initiated the development of different language tools and resources, such as alignment tools and corresponding translation memories for new language pairs. The primary goal of this paper is to provide a description of a free sentence alignment tool CorAl (Corpus Aligner), developed at the Faculty of Electrical Engineering and Computing, University of Zagreb. The tool performs paragraph alignment at the first step of the alignment process, which is followed by sentence alignment. Description of the tool is followed by its evaluation. The paper describes an experiment with applying the CorAl aligner to an English-Croatian parallel corpus from the legislative domain using metrics of precision, recall and F1-measure. Results are discussed and the concluding sections discuss future directions of CorAl development.

Improving Chunking Accuracy on Croatian Texts by Morphosyntactic Tagging
Kristina Vučković | Željko Agić | Marko Tadić
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present the results of an experiment with utilizing a stochastic morphosyntactic tagger as a pre-processing module of a rule-based chunker and partial parser for Croatian in order to raise its overall chunking and partial parsing accuracy on Croatian texts. In order to conduct the experiment, we have manually chunked and partially parsed 459 sentences from the Croatia Weekly 100 kw newspaper sub-corpus taken from the Croatian National Corpus, that were previously also morphosyntactically disambiguated and lemmatized. Due to the lack of resources of this type, these sentences were designated as a temporary chunking and partial parsing gold standard for Croatian. We have then evaluated the chunker and partial parser in three different scenarios: (1) chunking previously morphosyntactically untagged text, (2) chunking text that was tagged using the stochastic morphosyntactic tagger for Croatian and (3) chunking manually tagged text. The obtained F1-scores for the three scenarios were, respectively, 0.874 (P: 0.825, R: 0.930), 0.891 (P: 0.856, R: 0.928) and 0.914 (P: 0.904, R: 0.925). The paper provides the description of language resources and tools used in the experiment, its setup and discussion of results and perspectives for future work.
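The F1-scores reported above follow from the stated precision and recall values through the usual harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them; the helper name `f1_score` and the scenario labels are chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract above.
scenarios = {
    "untagged text": (0.825, 0.930),
    "stochastic tagger": (0.856, 0.928),
    "manually tagged text": (0.904, 0.925),
}

for name, (p, r) in scenarios.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
# prints 0.874, 0.891 and 0.914, matching the reported scores
```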

Towards Sentiment Analysis of Financial Texts in Croatian
Željko Agić | Nikola Ljubešić | Marko Tadić
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents results of an experiment dealing with sentiment analysis of Croatian text from the domain of finance. The goal of the experiment was to design a system model for automatic detection of general sentiment and polarity phrases in these texts. We have assembled a document collection from web sources writing on the financial market in Croatia and manually annotated articles from a subset of that collection for general sentiment. Additionally, we have manually annotated a number of these articles for phrases encoding positive or negative sentiment within a text. In the paper, we provide an analysis of the compiled resources. We show a statistically significant correspondence (1) between the overall market trend on the Zagreb Stock Exchange and the number of positively and negatively accented articles within periods of trend and (2) between the general sentiment of articles and the number of polarity phrases within those articles. We use this analysis as an input for designing a rule-based local grammar system for automatic detection of polarity phrases and evaluate it on held out data. The system achieves F1-scores of 0.61 (P: 0.94, R: 0.45) and 0.63 (P: 0.97, R: 0.47) on positive and negative polarity phrases.


Evaluating Morphosyntactic Tagging of Croatian Texts
Željko Agić | Marko Tadić
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes results of the first successful effort in applying a stochastic strategy, namely a second-order Markov model paradigm implemented by the TnT trigram tagger, to morphosyntactic tagging of Croatian texts. Besides the tagger, for purposes of both training and testing, we had at our disposal only a 100 Kw Croatia Weekly newspaper subcorpus, manually tagged using approximately 1000 different MULTEXT-East v3 morphosyntactic tags. The test basically consisted of randomly assigning a variable size portion of the corpus for the tagger’s training procedure and also another fixed-size portion, sized at 10% of the corpus, for the tagging procedure itself; this method allowed us not only to provide preliminary results regarding tagger accuracy on Croatian texts, but also to inspect the behavior of the stochastic tagging paradigm in general. The results were then taken from the test case providing 90% of the corpus for training purposes and varied from around 86% in the worst case scenario up to a peak of around 95% correctly assigned full MSD tags. Results on PoS only expectedly reached the human error level, with TnT correctly tagging above 98% of test sets on average. Most MSD errors occurred on types with the highest number of candidate tags per word form (nouns, pronouns and adjectives), while errors on PoS, although following the same pattern, were almost insignificant. Detailed insight into tagging, including F-measure for all PoS categories, is provided in the course of the paper along with other facts of interest.