Jean-Benoit Delbrouck

Also published as: Jean-benoit Delbrouck


ViLMedic: a framework for research at the intersection of vision and language in medical AI
Jean-benoit Delbrouck | Khaled Saab | Maya Varma | Sabri Eyuboglu | Pierre Chambon | Jared Dunnmon | Juan Zambrano | Akshay Chaudhari | Curtis Langlotz
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

There is a growing need to model interactions between data modalities (e.g., vision, language), both to improve AI predictions on existing tasks and to enable new applications. In the recent field of multimodal medical AI, integrating multiple modalities has gained widespread popularity, as multimodal models have been shown to improve performance and robustness, require fewer training samples, and add complementary information. To improve technical reproducibility and transparency for multimodal medical tasks, as well as to speed up progress across medical AI, we present ViLMedic, a Vision-and-Language medical library. As of 2022, the library contains a dozen reference implementations replicating state-of-the-art results for problems that range from medical visual question answering and radiology report generation to multimodal representation learning on widely adopted medical datasets. In addition, ViLMedic hosts a model zoo with more than twenty pretrained models for the above tasks, designed to be extensible by researchers but also simple for practitioners. Ultimately, we hope our reproducible pipelines can enable clinical translation and create real impact. The library is available at


QIAI at MEDIQA 2021: Multimodal Radiology Report Summarization
Jean-Benoit Delbrouck | Cassie Zhang | Daniel Rubin
Proceedings of the 20th Workshop on Biomedical Language Processing

This paper describes the QIAI lab's submission to the Radiology Report Summarization (RRS) challenge at MEDIQA 2021. It investigates whether using multimodality during training improves the model's summarization performance at test time. Our preliminary results show that taking advantage of the visual features from the x-rays associated with the radiology reports leads to higher scores than a text-only baseline system. These improvements are reported on the automatic evaluation metrics METEOR, BLEU and ROUGE. Our experiments can be fully replicated at the following address: https://

MiniVQA - A resource to build your tailored VQA competition
Jean-Benoit Delbrouck
Proceedings of the Fifth Workshop on Teaching NLP

MiniVQA is a Jupyter notebook to build a tailored VQA competition for your students. The notebook generates everything needed to run a classroom competition that engages and inspires your students on the free, self-service Kaggle platform. “InClass competitions make machine learning fun!”


Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition
Jean-Benoit Delbrouck | Noé Tits | Stéphane Dupont
Proceedings of the First International Workshop on Natural Language Processing Beyond Text

This paper aims to bring a new lightweight yet powerful solution to the tasks of Emotion Recognition and Sentiment Analysis. We propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state of the art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performance on the IEMOCAP, MOSI, MOSEI and MELD datasets. The experiments can be directly replicated, and the code is fully open for future research.

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Jean-Benoit Delbrouck | Noé Tits | Mathilde Brousmiche | Stéphane Dupont
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

Understanding expressed sentiment and emotions are two crucial factors in human multimodal language. This paper describes a Transformer-based joint-encoding (TBJE) for the tasks of Emotion Recognition and Sentiment Analysis. In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities. The proposed solution has also been submitted to the ACL20: Second Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI dataset. The code to replicate the presented experiments is open-source.


An empirical study on the effectiveness of images in Multimodal Neural Machine Translation
Jean-Benoit Delbrouck | Stéphane Dupont
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multi-modal tasks, where it becomes possible to focus both on sentence parts and on the image regions that they describe. In this paper, we compare several attention mechanisms on the multi-modal translation task (English, image → German) and evaluate the ability of the model to make use of images to improve translation. Although we surpass state-of-the-art scores on the Multi30k data set, we identify and report several misbehaviors of the machine while translating.
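
The decoding step described in this abstract, where the decoder weights parts of the source before emitting a word, can be illustrated with a minimal sketch. It assumes plain dot-product attention over toy vectors; it is not one of the mechanisms compared in the paper, and the names `softmax`, `attend`, `decoder_state` and `source_states` are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, source_states):
    """Dot-product attention (an assumed, generic variant): score each
    source state by its similarity to the decoder state, normalise the
    scores, and return the weights and the weighted average (context)."""
    scores = [sum(d * s for d, s in zip(decoder_state, state))
              for state in source_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * state[i] for w, state in zip(weights, source_states))
               for i in range(dim)]
    return weights, context

# Toy example with made-up values: three source-word vectors, one decoder state.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, source_states)
```

In a multi-modal setting, the same function could in principle be called a second time with image-region feature vectors in place of `source_states`, giving the decoder one context vector per modality; how those contexts are combined is a design choice this sketch does not cover.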