Aman Berhe


2023

AliBERT: A Pre-trained Language Model for French Biomedical Text
Aman Berhe | Guillaume Draznieks | Vincent Martenot | Valentin Masdeu | Lucas Davy | Jean-Daniel Zucker
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Over the past few years, domain-specific pretrained language models have been investigated and have shown remarkable achievements in different downstream tasks, especially in the biomedical domain. These achievements stem from the well-known BERT architecture, which uses attention-based self-supervision to learn the context of textual documents. However, these domain-specific biomedical pretrained language models mainly use English corpora. Non-English, domain-specific pretrained models therefore remain quite rare, as meeting both requirements at once is hard. In this work, we propose AliBERT, a biomedical pretrained language model for French, and investigate different learning strategies. AliBERT is trained with a regularized Unigram tokenizer built for this purpose. It achieves state-of-the-art F1 and accuracy scores on different downstream biomedical tasks. Our pretrained model outperforms some non-domain-specific French models such as CamemBERT and FlauBERT on diverse downstream tasks, with less pretraining and training time and with much smaller corpora.
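As a rough illustration of the tokenization setup the abstract describes, the sketch below trains a Unigram tokenizer with the SentencePiece library and applies subword regularization (sampling among probable segmentations). The corpus path, output prefix, vocabulary size, and sampling hyperparameters are illustrative assumptions, not the values used by the authors.

    # Minimal sketch of a regularized Unigram tokenizer with SentencePiece.
    # All file names and hyperparameters are hypothetical.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="french_biomedical_corpus.txt",  # hypothetical training corpus
        model_prefix="alibert_unigram",        # hypothetical output prefix
        model_type="unigram",                  # Unigram LM tokenizer
        vocab_size=32000,                      # assumed vocabulary size
        character_coverage=0.9995,             # common setting for Latin-script text
    )

    sp = spm.SentencePieceProcessor(model_file="alibert_unigram.model")
    text = "Le patient présente une hypertension artérielle."

    # Deterministic (best) segmentation.
    print(sp.encode(text, out_type=str))

    # Subword regularization: sample one of many probable segmentations,
    # which exposes the model to varied tokenizations during pretraining.
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))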

2022

Bazinga! A Dataset for Multi-Party Dialogues Structuring
Paul Lerner | Juliette Bergoënd | Camille Guinaudeau | Hervé Bredin | Benjamin Maurice | Sharleyne Lefevre | Martin Bouteiller | Aman Berhe | Léo Galmant | Ruiqing Yin | Claude Barras
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We introduce a dataset built around a large collection of TV (and movie) series, which are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base that allows the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity-linking information. Along with the dataset, we also provide baselines for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of the tasks and of transfer learning from models trained on mono-speaker audio or written text, which is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a (large) part of Bazinga! is only partially annotated, we also expect this dataset to foster research into self- or weakly-supervised learning methods.
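To make the speaker-diarization task concrete, here is a minimal sketch of running an off-the-shelf diarization pipeline with pyannote.audio on one episode; this is not necessarily the exact baseline reported in the paper, and the checkpoint name and audio file path are illustrative assumptions.

    # Minimal sketch: off-the-shelf speaker diarization with pyannote.audio.
    # Checkpoint and file path are hypothetical, not the paper's exact baseline.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
    diarization = pipeline("episode.wav")  # hypothetical episode audio file

    # Print who speaks when.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

On multi-party dialogue like TV series, overlapping speech and short turns make this output far noisier than on mono-speaker audio, which is the difficulty the abstract highlights.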

Survey on Narrative Structure: from Linguistic Theories to Automatic Extraction Approaches
Aman Berhe | Camille Guinaudeau | Claude Barras
Traitement Automatique des Langues, Volume 63, Numéro 1 : Varia