Shaoxiong Ji


A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert | Graeme Nail | Nikolay Arefyev | Marta Bañón | Jelmer van der Linde | Shaoxiong Ji | Jaume Zaragoza-Bernabeu | Mikko Aulamo | Gema Ramírez-Sánchez | Andrey Kutuzov | Sampo Pyysalo | Stephan Oepen | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
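
The abstract above mentions de-duplication on the document level but does not say how it is done. As a rough illustration only, the following sketch drops exact duplicates by hashing whitespace- and case-normalized text. The function names `normalize` and `dedup_documents` are hypothetical, and this is not the HPLT pipeline, which may use a different (for example, near-duplicate) method.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def dedup_documents(docs):
    """Keep the first occurrence of each document, comparing by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Storing only the hash keeps memory proportional to the number of distinct documents rather than their total size, which matters at web-crawl scale.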

Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?
Shaoxiong Ji | Timothee Mickus | Vincent Segonne | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multilingual pretraining and fine-tuning have remarkably succeeded in various natural language processing tasks. Transferring representations from one language to another is especially crucial for cross-lingual learning. One can expect machine translation objectives to be well suited to fostering such capabilities, as they involve the explicit alignment of semantically equivalent sentences from different languages. This paper investigates the potential benefits of employing machine translation as a continued training objective to enhance language representation learning, bridging multilingual pretraining and cross-lingual applications. We study this question through two lenses: a quantitative evaluation of the performance of existing models and an analysis of their latent representations. Our results show that, contrary to expectations, machine translation as a continued training objective fails to enhance cross-lingual representation learning in multiple cross-lingual natural language understanding tasks. We conclude that explicit sentence-level alignment in the cross-lingual scenario is detrimental to cross-lingual transfer pretraining, which has important implications for future cross-lingual transfer studies. We furthermore provide evidence through similarity measures and investigation of parameters that this lack of positive influence is due to output separability, which we argue is of use for machine translation but detrimental elsewhere.

Knowledge-augmented Graph Neural Networks with Concept-aware Attention for Adverse Drug Event Detection
Ya Gao | Shaoxiong Ji | Pekka Marttinen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Adverse drug events (ADEs) are an important aspect of drug safety. Various texts such as biomedical literature, drug reviews, and user posts on social media and medical forums contain a wealth of information about ADEs. Recent studies have applied word embedding and deep learning-based natural language processing to automate ADE detection from text. However, they did not explore incorporating explicit medical knowledge about drugs and adverse reactions or the corresponding feature learning. This paper adopts the heterogeneous text graph, which describes relationships between documents, words, and concepts, augments it with medical knowledge from the Unified Medical Language System, and proposes a concept-aware attention mechanism that learns features differently for the different types of nodes in the graph. We further utilize contextualized embeddings from pretrained language models and convolutional graph neural networks for effective feature representation and relational learning. Experiments on four public datasets show that our model performs competitively with recent advances, and the concept-aware attention consistently outperforms other attention mechanisms.

MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
Timothee Mickus | Stig-Arne Grönroos | Joseph Attieh | Michele Boggia | Ona De Gibert | Shaoxiong Ji | Niki Andreas Loppi | Alessandro Raganato | Raúl Vázquez | Jörg Tiedemann
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled. The trend is toward modularization, a necessary step in the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future information. The toolkit is publicly available online at

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
Pinzhen Chen | Shaoxiong Ji | Nikolay Bogoychev | Andrey Kutuzov | Barry Haddow | Kenneth Heafield
Findings of the Association for Computational Linguistics: EACL 2024

Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which is then used to tune LLMs through either low-rank adaptation or full-parameter training. Under a controlled computation budget, comparisons show that multilingual tuning is on par with or better than tuning a model for each language. Furthermore, multilingual tuning with downsampled data can be as powerful and more robust. Our findings serve as a guide for expanding language support through instruction tuning.


HPLT: High Performance Language Technologies
Mikko Aulamo | Nikolay Bogoychev | Shaoxiong Ji | Graeme Nail | Gema Ramírez-Sánchez | Jörg Tiedemann | Jelmer van der Linde | Jaume Zaragoza
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).

Patient Outcome and Zero-shot Diagnosis Prediction with Hypernetwork-guided Multitask Learning
Shaoxiong Ji | Pekka Marttinen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Multitask deep learning has been applied to patient outcome prediction from text, taking clinical notes as input and training deep neural networks with a joint loss function of multiple tasks. However, the joint training scheme of multitask learning suffers from inter-task interference, and diagnosis prediction among the multiple tasks generalizes poorly because of rare diseases or unseen diagnoses. To solve these challenges, we propose a hypernetwork-based approach that generates task-conditioned parameters and coefficients of multitask prediction heads to learn task-specific prediction and balance the multitask learning. We also incorporate semantic task information to improve the generalizability of our task-conditioned multitask model. Experiments on early and discharge notes extracted from the real-world MIMIC database show our method can achieve better performance on multitask patient outcome prediction than strong baselines in most cases. Besides, our method can effectively handle the scenario with limited information and improve zero-shot prediction on unseen diagnosis categories.

Towards Interpretable Mental Health Analysis with Large Language Models
Kailai Yang | Shaoxiong Ji | Tianlin Zhang | Qianqian Xie | Ziyan Kuang | Sophia Ananiadou
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The latest large language models (LLMs) such as ChatGPT exhibit strong capabilities in automated mental health analysis. However, existing relevant studies bear several limitations, including inadequate evaluations, lack of prompting strategies, and little exploration of LLMs for explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning ability of LLMs on 11 datasets across 5 tasks. We explore the effects of different prompting strategies with unsupervised and distantly supervised emotional information. Based on these prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We conduct strict human evaluations to assess the quality of the generated explanations, leading to a novel dataset with 163 human-assessed explanations. We benchmark existing automatic evaluation metrics on this dataset to guide future related works. According to the results, ChatGPT shows strong in-context learning ability but still has a significant gap with advanced task-specific methods. Careful prompt engineering with emotional cues and expert-written few-shot examples can also effectively improve performance on mental health analysis. In addition, ChatGPT generates explanations that approach human performance, showing its great potential in explainable mental health analysis.


AaltoNLP at SemEval-2022 Task 11: Ensembling Task-adaptive Pretrained Transformers for Multilingual Complex NER
Aapo Pietiläinen | Shaoxiong Ji
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents the system description of team AaltoNLP for SemEval-2022 shared task 11: MultiCoNER. Transformer-based models have produced high scores on standard Named Entity Recognition (NER) tasks. However, accuracy on complex named entities is still low. Complex and ambiguous named entities have been identified as a major error source in NER tasks. The shared task is about multilingual complex named entity recognition. In this paper, we describe an ensemble approach, which increases accuracy across all tested languages. The system ensembles output from multiple task-adaptive pretrained transformers that share the same architecture but are trained with different random seeds. We notice a large discrepancy between performance on development and test data. Model selection based on limited development data may not yield optimal results on large test data sets.
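
The abstract above does not say how the outputs of the individual models are combined. One common option, shown here purely as an assumption and not as the AaltoNLP system, is per-token majority voting over the label sequences predicted by each model. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several models by majority vote.

    predictions: list of label sequences, one per model, all of equal length.
    Ties are resolved in favor of the label that appears first in model order.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")
    voted = []
    for labels in zip(*predictions):
        # most_common keeps first-seen order among equally frequent labels
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

Voting token by token can produce label sequences that are not valid under a BIO scheme, so a real system might vote over whole entity spans or average model probabilities instead.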

MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare
Shaoxiong Ji | Tianlin Zhang | Luna Ansari | Jie Fu | Prayag Tiwari | Erik Cambria
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Mental health is a critical issue in modern society, and mental disorders could sometimes turn to suicidal ideation without adequate treatment. Early detection of mental disorders and suicidal ideation from social content provides a potential way for effective social intervention. Recent advances in pretrained contextualized language representations have promoted the development of several domain-specific pretrained models and facilitated several downstream applications. However, there are no existing pretrained language models for mental healthcare. This paper trains and releases two pretrained masked language models, i.e., MentalBERT and MentalRoBERTa, to benefit machine learning for the mental healthcare research community. Besides, we evaluate our trained domain-specific models and several variants of pretrained language models on several mental disorder detection benchmarks and demonstrate that language representations pretrained in the target domain improve the performance of mental health detection tasks.

Towards Intention Understanding in Suicidal Risk Assessment with Natural Language Processing
Shaoxiong Ji
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent applications of natural language processing techniques to suicidal ideation detection and risk assessment frame the detection or assessment task as a text classification problem. Recent advances have developed many models, especially deep learning models, to boost predictive performance. Though the performance (in terms of aggregated evaluation scores) is improving, this position paper urges that better intention understanding is required for reliable suicidal risk assessment with computational methods. This paper reflects the state of natural language processing applied to suicide-associated text classification tasks, differentiates suicidal risk assessment and intention understanding, and points out potential limitations of sentiment features and pretrained language models in suicidal intention understanding. Besides, it urges the necessity for sequential intention understanding and risk assessment, discusses some critical issues in evaluation such as uncertainty, and studies the lack of benchmarks.


Medical Code Assignment with Gated Convolution and Note-Code Interaction
Shaoxiong Ji | Shirui Pan | Pekka Marttinen
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text
Shaoxiong Ji | Erik Cambria | Pekka Marttinen
Proceedings of the 3rd Clinical Natural Language Processing Workshop

Medical code assignment, which predicts medical codes from clinical texts, is a fundamental task of intelligent medical information systems. The emergence of deep models in natural language processing has boosted the development of automatic assignment methods. However, recent advanced neural architectures with flat convolutions or multi-channel feature concatenation ignore the sequential causal constraint within a text sequence and may not learn meaningful clinical text representations, especially for lengthy clinical notes with long-term sequential dependency. This paper proposes a Dilated Convolutional Attention Network (DCAN), integrating dilated convolutions, residual connections, and label attention, for medical code assignment. It adopts dilated convolutions to capture complex medical patterns with a receptive field which increases exponentially with dilation size. Experiments on a real-world clinical dataset empirically show that our model improves the state of the art.
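
To make the receptive-field claim in the abstract above concrete, the sketch below computes the receptive field of a stack of 1D convolutions. It assumes one stride-1 convolution per layer with a shared kernel size, which may differ from the actual DCAN block layout, and the function name `receptive_field` is hypothetical. Under these assumptions each layer with dilation d adds (k - 1) * d positions, so doubling the dilation per layer gives exponential growth in depth.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1D convolutions.

    kernel_size: shared kernel width k (assumed identical across layers).
    dilations: dilation factor of each layer, from first to last.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Kernel size 3 with dilations doubling per layer versus no dilation.
dilated = receptive_field(3, [1, 2, 4])    # 1 + 2 * (1 + 2 + 4) = 15
undilated = receptive_field(3, [1, 1, 1])  # 1 + 2 * 3 = 7
```

With dilations 1, 2, ..., 2^(L-1) and kernel size 3, the formula reduces to 2^(L+1) - 1, while the undilated stack grows only linearly as 2L + 1.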