L’existence de modèles universels pour décrire la syntaxe des langues a longtemps été débattue. L’apparition de ressources comme le World Atlas of Language Structures et les corpus des Universal Dependencies rend possible l’étude d’une grammaire universelle pour l’analyse syntaxique en dépendances. Notre travail se concentre sur l’étude de différentes représentations des langues dans des systèmes multilingues appris sur des corpus arborés de 37 langues. Nos tests d’analyse syntaxique montrent que représenter la langue dont est issu chaque mot permet d’obtenir de meilleurs résultats qu’en cas d’un apprentissage sur une simple concaténation des langues. En revanche, l’utilisation d’un vecteur pour représenter la langue ne permet pas une amélioration évidente des résultats dans le cas d’une langue n’ayant pas du tout de données d’apprentissage.
The existence of universal models to describe the syntax of languages has been debated for decades. The availability of resources such as the Universal Dependencies treebanks and the World Atlas of Language Structures make it possible to study the plausibility of universal grammar from the perspective of dependency parsing. Our work investigates the use of high-level language descriptions in the form of typological features for multilingual dependency parsing. Our experiments on multilingual parsing for 40 languages show that typological information can indeed guide parsers to share information between similar languages beyond simple language identification.
This paper describes the Veyn system, submitted to the closed track of the PARSEME Shared Task 2018 on automatic identification of verbal multiword expressions (VMWEs). Veyn is based on a sequence tagger using recurrent neural networks. We represent VMWEs using a variant of the begin-inside-outside encoding scheme combined with the VMWE category tag. In addition to the system description, we present development experiments to determine the best tagging scheme. Veyn is freely available, covers 19 languages, and was ranked ninth (MWE-based) and eight (Token-based) among 13 submissions, considering macro-averaged F1 across languages.
We present a simple and efficient tagger capable of identifying highly ambiguous multiword expressions (MWEs) in French texts. It is based on conditional random fields (CRF), using local context information as features. We show that this approach can obtain results that, in some cases, approach more sophisticated parser-based MWE identification methods without requiring syntactic trees from a treebank. Moreover, we study how well the CRF can take into account external information coming from a lexicon.