2024
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Tu Anh Dinh | Carlos Mullov | Leonard Bärmann | Zhaolin Li | Danni Liu | Simon Reiß | Jueun Lee | Nathan Lerzer | Jianfeng Gao | Fabian Peller-Konrad | Tobias Röddiger | Alexander Waibel | Tamim Asfour | Michael Beigl | Rainer Stiefelhagen | Carsten Dachsbacher | Klemens Böhm | Jan Niehues
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs’ ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams; (2) multi-modal, containing questions that involve images; and (3) varied, containing different types of free-form questions at different difficulty levels, as is the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves an average exam grade of only 59.4%. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent graders, achieving 0.948 Pearson correlation with expert grading.
The KIT Speech Translation Systems for IWSLT 2024 Dialectal and Low-resource Track
Zhaolin Li | Enes Yavuz Ugan | Danni Liu | Carlos Mullov | Tu Anh Dinh | Sai Koneru | Alexander Waibel | Jan Niehues
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
This paper presents KIT’s submissions to the IWSLT 2024 dialectal and low-resource track. In this work, we build systems for translating into English from speech in Maltese, Bemba, and two Arabic dialects, Tunisian and North Levantine. Under the unconstrained condition, we leverage pre-trained multilingual models by fine-tuning them for the target language pairs to address the data scarcity problems in this track. We build cascaded and end-to-end speech translation systems for different language pairs and show that the cascaded system brings slightly better overall performance. We also find that utilizing additional data resources boosts speech recognition performance but slightly harms machine translation performance in cascaded systems. Lastly, we show that Minimum Bayes Risk is effective in improving speech translation performance by combining the cascaded and end-to-end systems, bringing a consistent improvement of around 1 BLEU point.
Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages
Carlos Mullov | Quan Pham | Alexander Waibel
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multilingual neural machine translation systems learn to map sentences of different languages into a common representation space. Intuitively, with a growing number of seen languages, the encoder sentence representation grows more flexible and easily adaptable to new languages. In this work, we test this hypothesis by zero-shot translating from unseen languages. To deal with unknown vocabularies from unknown languages, we propose a setup where we decouple learning of vocabulary and syntax, i.e., for each language we learn word representations in a separate step (using cross-lingual word embeddings), and then train to translate while keeping those word representations frozen. We demonstrate that this setup enables zero-shot translation from entirely unseen languages. Zero-shot translating with a model trained on Germanic and Romance languages, we achieve scores of 42.6 BLEU for Portuguese-English and 20.7 BLEU for Russian-English on the TED domain. We explore how this zero-shot translation capability develops with a varying number of languages seen by the encoder. Lastly, we explore the effectiveness of our decoupled learning strategy for unsupervised machine translation. By exploiting our model’s zero-shot translation capability for iterative back-translation, we attain near parity with a supervised setting.
2023
KIT’s Multilingual Speech Translation System for IWSLT 2023
Danni Liu | Thai Binh Nguyen | Sai Koneru | Enes Yavuz Ugan | Ngoc-Quan Pham | Tuan Nam Nguyen | Tu Anh Dinh | Carlos Mullov | Alexander Waibel | Jan Niehues
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which focuses on the translation of scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages with varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Christian Huber | Tu Anh Dinh | Carlos Mullov | Ngoc-Quan Pham | Thai Binh Nguyen | Fabian Retkowski | Stefan Constantin | Enes Ugan | Danni Liu | Zhaolin Li | Sai Koneru | Jan Niehues | Alexander Waibel
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. We then compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows automatic evaluation of translation quality as well as latency, and also provides a web interface to show the low-latency model outputs to the user.
2022
Effective combination of pretrained models - KIT@IWSLT2022
Ngoc-Quan Pham | Tuan Nam Nguyen | Thai-Binh Nguyen | Danni Liu | Carlos Mullov | Jan Niehues | Alexander Waibel
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
Pretrained models in acoustic and textual modalities can potentially improve speech translation for both Cascade and End-to-end approaches. In this evaluation, we test this empirically by using the wav2vec, mBART50 and DeltaLM models to improve text and speech translation models. The experiments showed that the presence of these models, together with an advanced audio segmentation method, results in an improvement over the previous end-to-end system of up to 7 BLEU points. More importantly, the experiments showed that, given enough data and modeling capacity to overcome the training difficulty, we can outperform even very competitive Cascade systems. In our experiments, this gap can be as large as 2.0 BLEU points, the same margin by which Cascade systems have often led over the years.
CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022
Peter Polák | Ngoc-Quan Pham | Tuan Nam Nguyen | Danni Liu | Carlos Mullov | Jan Niehues | Ondřej Bojar | Alexander Waibel
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
In this paper, we describe our submission to the Simultaneous Speech Translation task at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being 3x faster than offline in terms of latency on the test set. We also show that the onlinized offline model outperforms the best IWSLT 2021 simultaneous system in the medium and high latency regimes and is almost on par in the low latency regime. We make our system publicly available.