Bart Desmet


pdf bib
Improving the Generalizability of Depression Detection by Leveraging Clinical Questionnaires
Thong Nguyen | Andrew Yates | Ayah Zirikly | Bart Desmet | Arman Cohan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automated methods have been widely used to identify and analyze mental health conditions (e.g., depression) from various sources of information, including social media. Yet, deployment of such models in real-world healthcare applications faces challenges including poor out-of-domain generalization and lack of trust in black box models. In this work, we propose approaches for depression detection that are constrained to different degrees by the presence of symptoms described in PHQ9, a questionnaire used by clinicians in the depression screening process. In dataset-transfer experiments on three social media datasets, we find that grounding the model in PHQ9’s symptoms substantially improves its ability to generalize to out-of-distribution data compared to a standard BERT-based approach. Furthermore, this approach can still perform competitively on in-domain data. These results and our qualitative analyses suggest that grounding model predictions in clinically-relevant symptoms can improve generalizability while producing a model that is easier to inspect.

pdf bib
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology
Ayah Zirikly | Dana Atzil-Slonim | Maria Liakata | Steven Bedrick | Bart Desmet | Molly Ireland | Andrew Lee | Sean MacAvaney | Matthew Purver | Rebecca Resnik | Andrew Yates
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

pdf bib
A Whole-Person Function Dictionary for the Mobility, Self-Care and Domestic Life Domains: a Seedset Expansion Approach
Ayah Zirikly | Bart Desmet | Julia Porcino | Jonathan Camacho Maldonado | Pei-Shu Ho | Rafael Jimenez Silva | Maryanne Sacco
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Whole-person functional limitations in the areas of mobility, self-care and domestic life affect a majority of individuals with disabilities. Detecting, recording and monitoring such limitations would benefit those individuals, as well as research on whole-person functioning and general public health. Dictionaries of terms related to whole-person function would enable automated identification and extraction of relevant information. However, no such terminologies currently exist, due in part to a lack of standardized coding and their availability mainly in free text clinical notes. In this paper, we introduce terminologies of whole-person function in the domains of mobility, self-care and domestic life, built and evaluated using a small set of manually annotated clinical notes, which provided a seedset that was expanded using a mix of lexical and deep learning approaches.

pdf bib
QA4IE: A Quality Assurance Tool for Information Extraction
Rafael Jimenez Silva | Kaushik Gedela | Alex Marr | Bart Desmet | Carolyn Rose | Chunxiao Zhou
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Quality assurance (QA) is an essential though underdeveloped part of the data annotation process. Although QA is supported to some extent in existing annotation tools, comprehensive support for QA is not standardly provided. In this paper we contribute QA4IE, a comprehensive QA tool for information extraction, which can (1) detect potential problems in text annotations in a timely manner, (2) accurately assess the quality of annotations, (3) visually display and summarize annotation discrepancies among annotation team members, (4) provide a comprehensive statistics report, and (5) support viewing of annotated documents interactively. This paper offers a competitive analysis comparing QA4IE and other popular annotation tools and demonstrates its features, usage, and effectiveness through a case study. The Python code, documentation, and demonstration video are available publicly at


pdf bib
Development of Natural Language Processing Tools to Support Determination of Federal Disability Benefits in the U.S.
Bart Desmet | Julia Porcino | Ayah Zirikly | Denis Newman-Griffis | Guy Divita | Elizabeth Rasch
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

The disability benefits programs administered by the US Social Security Administration (SSA) receive between 2 and 3 million new applications each year. Adjudicators manually review hundreds of evidence pages per case to determine eligibility based on financial, medical, and functional criteria. Natural Language Processing (NLP) technology is uniquely suited to support this adjudication work and is a critical component of an ongoing inter-agency collaboration between SSA and the National Institutes of Health. This NLP work provides resources and models for document ranking, named entity recognition, and terminology extraction in order to automatically identify documents and reports pertinent to a case, and to allow adjudicators to search for and locate desired information quickly. In this paper, we describe our vision for how NLP can impact SSA’s adjudication process, present the resources and models that have been developed, and discuss some of the benefits and challenges in working with large-scale government data, and its specific properties in the functional domain.


pdf bib
Classifying the reported ability in clinical mobility descriptions
Denis Newman-Griffis | Ayah Zirikly | Guy Divita | Bart Desmet
Proceedings of the 18th BioNLP Workshop and Shared Task

Assessing how individuals perform different activities is key information for modeling health states of individuals and populations. Descriptions of activity performance in clinical free text are complex, including syntactic negation and similarities to textual entailment tasks. We explore a variety of methods for the novel task of classifying four types of assertions about activity performance: Able, Unable, Unclear, and None (no information). We find that ensembling an SVM trained with lexical features and a CNN achieves 77.9% macro F1 score on our task, and yields nearly 80% recall on the rare Unclear and Unable samples. Finally, we highlight several challenges in classifying performance assertions, including capturing information about sources of assistance, incorporating syntactic structure and negation scope, and handling new modalities at test time. Our findings establish a strong baseline for this novel task, and identify intriguing areas for further research.


pdf bib
SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions
Arman Cohan | Bart Desmet | Andrew Yates | Luca Soldaini | Sean MacAvaney | Nazli Goharian
Proceedings of the 27th International Conference on Computational Linguistics

Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without the need for manual labelling. We introduce the SMHD (Self-reported Mental Health Diagnoses) dataset and make it available. SMHD is a novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users. We examine distinctions in users’ language, as measured by linguistic and psychological variables. We further explore text classification methods to identify individuals with mental conditions through their language.

pdf bib
RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses
Sean MacAvaney | Bart Desmet | Arman Cohan | Luca Soldaini | Andrew Yates | Ayah Zirikly | Nazli Goharian
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Self-reported diagnosis statements have been widely employed in studying language related to mental health in social media. However, existing research has largely ignored the temporality of mental health diagnoses. In this work, we introduce RSDD-Time: a new dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Furthermore, we include exact temporal spans that relate to the date of diagnosis. This information is valuable for various computational methods to examine mental health through social media because one’s mental health state is not static. We also test several baseline classification and extraction approaches, which suggest that extracting temporal information from self-reported diagnosis statements is challenging.


pdf bib
Noise or music? Investigating the usefulness of normalisation for robust sentiment analysis on social media data
Cynthia Van Hee | Marjan Van de Kauter | Orphée De Clercq | Els Lefever | Bart Desmet | Véronique Hoste
Traitement Automatique des Langues, Volume 58, Numéro 1 : Varia [Varia]


pdf bib
Mental Distress Detection and Triage in Forum Posts: The LT3 CLPsych 2016 Shared Task System
Bart Desmet | Gilles Jacobs | Véronique Hoste
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf bib
The GW/LT3 VarDial 2016 Shared Task System for Dialects and Similar Languages Detection
Ayah Zirikly | Bart Desmet | Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.


pdf bib
UGENT-LT3 SCATE System for Machine Translation Quality Estimation
Arda Tezcan | Veronique Hoste | Bart Desmet | Lieve Macken
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Detection and Fine-Grained Classification of Cyberbullying Events
Cynthia Van Hee | Els Lefever | Ben Verhoeven | Julie Mennes | Bart Desmet | Guy De Pauw | Walter Daelemans | Veronique Hoste
Proceedings of the International Conference Recent Advances in Natural Language Processing


pdf bib
Recognising suicidal messages in Dutch social media
Bart Desmet | Véronique Hoste
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Early detection of suicidal thoughts is an important part of effective suicide prevention. Such thoughts may be expressed online, especially by young people. This paper presents on-going work on the automatic recognition of suicidal messages in social media. We present experiments for automatically detecting relevant messages (with suicide-related content), and those containing suicide threats. A sample of 1357 texts was annotated in a corpus of 2674 blog posts and forum messages from Netlog, indicating relevance, origin, severity of suicide threat and risks as well as protective factors. For the classification experiments, Naive Bayes, SVM and KNN algorithms are combined with shallow features, i.e. bag-of-words of word, lemma and character ngrams, and post length. The best relevance classification is achieved by using SVM with post length, lemma and character ngrams, resulting in an F-score of 85.6% (78.7% precision and 93.8% recall). For the second task (threat detection), a cascaded setup which first filters out irrelevant messages with SVM and then predicts the severity with KNN, performs best: 59.2% F-score (69.5% precision and 51.6% recall).

pdf bib
Towards Shared Datasets for Normalization Research
Orphée De Clercq | Sarah Schulz | Bart Desmet | Véronique Hoste
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluating text normalization approaches. With the combination of text messages, message board posts and tweets, these datasets represent a variety of user generated content. All data was manually normalized to their standard form using newly-developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We focus on both the word and character level and find that we can improve the BLEU score with ca. 20% for both languages. In order for this user generated content data to be released publicly to the research community some issues first need to be resolved. These are discussed in closer detail by focussing on the current legislation and by investigating previous similar data collection projects. With this discussion we hope to shed some light on various difficulties researchers are facing when trying to share social media data.


pdf bib
Normalization of Dutch User-Generated Content
Orphée De Clercq | Sarah Schulz | Bart Desmet | Els Lefever | Véronique Hoste
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013


pdf bib
Towards a Balanced Named Entity Corpus for Dutch
Bart Desmet | Véronique Hoste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.

pdf bib
Towards a Learning Approach for Abbreviation Detection and Resolution.
Klaar Vanopstal | Bart Desmet | Véronique Hoste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The explosion of biomedical literature and with it the -uncontrolled- creation of abbreviations presents some special challenges for both human readers and computer applications. We developed an annotated corpus of Dutch medical text, and experimented with two approaches to abbreviation detection and resolution. Our corpus is composed of abstracts from two medical journals from the Low Countries in which approximately 65 percent (NTvG) and 48 percent (TvG) of the abbreviations have a corresponding full form in the abstract. Our first approach, a pattern-based system, consists of two steps: abbreviation detection and definition matching. This system has an average F-score of 0.82 for the detection of both defined and undefined abbreviations and an average F-score of 0.77 was obtained for the definitions. For our second approach, an SVM-based classifier was used on the preprocessed data sets, leading to an average F-score of 0.93 for the abbreviations; for the definitions an average F-score of 0.82 was obtained.