Géraldine Walther

2023

Mandarin classifier systems optimize to accommodate communicative pressures
Yamei Wang | Géraldine Walther
Findings of the Association for Computational Linguistics: EMNLP 2023

Previous work on noun classification implies that gender systems are inherently optimized to accommodate communicative pressures on human language learning and processing (Dye et al. 2017, 2018). That work states that languages make use of either grammatical (e.g., gender) or probabilistic (pre-nominal modifiers) noun classification to smooth the entropy of nouns in context. We show that even languages that are considered genderless, like Mandarin Chinese, possess a noun classification device that plays the same functional role as gender markers. Based on close to 1M Mandarin noun phrases extracted from the Leipzig Corpora Collection (Goldhahn et al. 2012) and their corresponding fastText embeddings (Bojanowski et al. 2016), we show that noun-classifier combinations are sensitive to the same frequency, similarity, and co-occurrence interactions that structure gender systems. We also present the first study of the effects of the interaction between grammatical and probabilistic noun classification.

Measure words are measurably different from sortal classifiers
Yamei Wang | Géraldine Walther
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)

Nominal classifiers categorize nouns based on salient semantic properties. Past studies have long debated whether sortal classifiers (related to intrinsic semantic noun features) and mensural classifiers (related to quantity) should be considered the same grammatical category. Suggested diagnostic tests rely on functional and distributional criteria, typically evaluated in terms of isolated example sentences obtained through elicitation. This paper offers a systematic re-evaluation of this long-standing question: using 981,076 nominal phrases from a 489 MB dependency-parsed word corpus, corresponding extracted contextual word embeddings from a Chinese BERT model, and information-theoretic measures of mutual information, we show that mensural classifiers can be distributionally and functionally distinguished from sortal classifiers, justifying the existence of distinct syntactic categories for mensural and sortal classifiers. Our study also has broader implications for the typological study of classifier systems.

2018

2017

Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin
Géraldine Walther | Benoît Sagot
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we present ongoing work on developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are meant to improve the speed and reliability of corpus annotation for noisy data involving large amounts of code-switching, occurrences of child speech, and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling.

2011

Développement de ressources pour le persan : PerLex 2, nouveau lexique morphologique et MEltfa, étiqueteur morphosyntaxique (Development of resources for Persian: PerLex 2, a new morphological lexicon and MEltfa, a morphosyntactic tagger)
Benoît Sagot | Géraldine Walther | Pegah Faghiri | Pollet Samvelian
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

We present a new version of PerLex, a morphological lexicon for Persian, a corrected and partially re-annotated version of the tagged BijanKhan corpus (BijanKhan, 2004), and MEltfa, a new, freely available morphosyntactic tagger for Persian. Having developed a first version of PerLex (Sagot & Walther, 2010), we propose an improved version here. In addition to partial manual validation, PerLex 2 now relies on a linguistically motivated inventory of categories. We have also developed a new version of the BijanKhan corpus: it contains significant corrections to the tokenization as well as a re-tagging with the new categories. Finally, this new version of the corpus was used to train MEltfa, our freely available morphosyntactic tagger for Persian, which relies on this new category inventory, on PerLex 2, and on the MElt tagging system (Denis & Sagot, 2009).

Modélisation et implémentation de phénomènes flexionnels non canoniques [Modeling and implementing non canonical morphological phenomena]
Géraldine Walther | Benoît Sagot
Traitement Automatique des Langues, Volume 52, Numéro 2 : Vers la morphologie et au-delà [Toward Morphology and beyond]

2010

A Morphological Lexicon for the Persian Language
Benoît Sagot | Géraldine Walther
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We introduce PerLex, a large-coverage and freely available morphological lexicon for the Persian language. We describe the main features of Persian morphology and the way we have represented it within the Alexina formalism, on which PerLex is based. We focus on the methodology we used for constructing lexical entries from various sources, as well as the problems related to typographic normalisation. The resulting lexicon shows satisfactory coverage on a reference corpus and should therefore be a good starting point for developing a syntactic lexicon for the Persian language.

Développement de ressources pour le persan: lexique morphologique et chaîne de traitements de surface (Development of resources for Persian: a morphological lexicon and a shallow processing chain)
Benoît Sagot | Géraldine Walther
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We present PerLex, a large-coverage, freely available morphological lexicon for Persian, together with a shallow processing chain for the language. We describe some characteristics of Persian morphology and the way we have represented it in the Alexina lexical formalism, on which PerLex is based. We focus on the methodology we used to build lexical entries from various sources, as well as on the problems related to typographic normalisation. The resulting lexicon has satisfactory coverage on a reference corpus and should therefore be a good starting point for developing a syntactic lexicon for Persian.