Jan-Christoph Kalo

2025

ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events
Duygu Sezen Islakoglu | Jan-Christoph Kalo
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large Language Models (LLMs) still face significant challenges in reasoning and arithmetic. Although temporal reasoning has attracted increasing research attention, comprehensive testing of Allen’s interval relations (e.g., before, after, during), a fundamental framework for temporal relationships, remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs’ temporal understanding. It includes 16 tasks covering the identification of the Allen relation between two temporal events as well as temporal arithmetic. We assess the performance of seven recent LLMs. The results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models’ low performance highlights the need for improved temporal understanding in LLMs. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.
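
For readers unfamiliar with Allen's interval relations, the following is a minimal sketch of how the relation between two time intervals can be determined. It is an illustration only, not code from the ChronoSense repository; the function name `allen_relation` and the numeric endpoints are hypothetical, and intervals are assumed to satisfy start < end.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation of interval A relative to interval B.

    Hypothetical helper for illustration; assumes start < end for both intervals.
    """
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    # Disjoint intervals, with or without a gap between them.
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    # Intervals that share at least one endpoint.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    # Strict containment in either direction.
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: partial overlap with distinct endpoints.
    return "overlaps" if a_start < b_start else "overlapped-by"
```

The thirteen relations come in inverse pairs (before/after, meets/met-by, and so on) plus equals, which is the sense in which the abstract speaks of symmetrical relations.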

2024

Prompt Tuned Embedding Classification for Industry Sector Allocation
Valentin Buchner | Lele Cao | Jan-Christoph Kalo | Vilhelm Von Ehrenheim
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

We introduce Prompt Tuned Embedding Classification (PTEC) for classifying companies within an investment firm’s proprietary industry taxonomy, supporting their thematic investment strategy. PTEC assigns companies to the sectors they primarily operate in, conceptualizing this process as a multi-label text classification task. Prompt Tuning, usually deployed as a text-to-text (T2T) classification approach, ensures low computational cost while maintaining high task performance. However, T2T classification has limitations on multi-label tasks due to the generation of non-existing labels, permutation invariance of the label sequence, and a lack of confidence scores. PTEC addresses these limitations by utilizing a classification head in place of the Large Language Model’s (LLM’s) language head. PTEC surpasses both baselines and human performance while lowering computational demands. This indicates the continuing need to adapt state-of-the-art methods to domain-specific tasks, even in the era of LLMs with strong generalization abilities.
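
To illustrate why a classification head sidesteps the three T2T limitations named above, here is a minimal sketch of a multi-label head applied to a fixed-size text embedding. It is not the authors' implementation: the function name `classify`, the sector labels, the weights, and the 0.5 threshold are all hypothetical, and a plain per-label sigmoid over a linear layer is assumed.

```python
import math


def classify(embedding, weights, biases, labels, threshold=0.5):
    """Score every label independently with a sigmoid over a linear layer.

    Hypothetical sketch: one weight vector and one bias per label.
    """
    scores = {}
    for label, w, b in zip(labels, weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, embedding)) + b
        scores[label] = 1.0 / (1.0 + math.exp(-logit))
    # Predictions are drawn only from the fixed label set, in a fixed order.
    predicted = [label for label in labels if scores[label] >= threshold]
    return predicted, scores


# Made-up sector names and parameters, purely for demonstration.
labels = ["fintech", "biotech", "energy"]
weights = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0, -1.0]
predicted, scores = classify([1.0, 0.0], weights, biases, labels)
```

Because each label has its own output, the head cannot emit a label outside the taxonomy, the result does not depend on any generated label order, and each score can be read as a confidence value.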

Graph Representations for Machine Translation in Dialogue Settings
Lea Krause | Selene Baez Santamaria | Jan-Christoph Kalo
Proceedings of the Ninth Conference on Machine Translation

In this paper, we present our approach to the WMT24 - Chat Task, addressing the challenge of translating chat conversations. Chat conversations are characterised by their informal, ungrammatical nature and strong reliance on context, posing significant challenges for machine translation systems. To address these challenges, we augment large language models with explicit memory mechanisms designed to enhance coherence and consistency across dialogues. Specifically, we employ graph representations to capture and utilise dialogue context, leveraging concept connectivity as a compressed memory. Our approach ranked second for Dutch and French, and third for Portuguese and German, based on COMET-22 scores and human evaluation.

Retrieval-based Question Answering with Passage Expansion Using a Knowledge Graph
Benno Kruit | Yiming Xu | Jan-Christoph Kalo
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recent advancements in dense neural retrievers and language models have led to large improvements in state-of-the-art approaches to open-domain Question Answering (QA) based on retriever-reader architectures. However, issues stemming from data quality and imbalances in the use of dense embeddings have hindered performance, particularly for less common entities and facts. To tackle these problems, this study explores a multi-modal passage retrieval model’s potential to bolster QA system performance. This study poses three key questions: (1) Can a distantly supervised question-relation extraction model enhance retrieval using a knowledge graph (KG), compensating for dense neural retrievers’ shortcomings with rare entities? (2) How does this multi-modal approach compare to existing QA systems based on textual features? (3) Can this QA system alleviate poor performance on less common entities on common benchmarks? We devise a multi-modal retriever combining entity features and textual data, leading to improved retrieval precision in some situations, particularly for less common entities. Experiments across different datasets confirm enhanced performance for entity-centric questions, but challenges remain in handling complex generalized questions.

2023

Evaluating the Knowledge Base Completion Potential of GPT
Blerta Veseli | Simon Razniewski | Jan-Christoph Kalo | Gerhard Weikum
Findings of the Association for Computational Linguistics: EMNLP 2023

Structured knowledge bases (KBs) are an asset for search engines and other applications but are inevitably incomplete. Language models (LMs) have been proposed for unsupervised knowledge base completion (KBC), yet their ability to do this at scale and with high accuracy remains an open question. Prior experimental studies mostly fall short because they only evaluate on popular subjects, or sample already existing facts from KBs. In this work, we perform a careful evaluation of GPT’s potential to complete the largest public KB: Wikidata. We find that, despite their size and capabilities, models like GPT-3, ChatGPT and GPT-4 do not achieve fully convincing results on this task. Nonetheless, they provide solid improvements over earlier approaches with smaller LMs. In particular, we show that it is feasible to extend Wikidata by 27M facts at 90% precision.

Co-authors

Benno Kruit 1

Simon Razniewski 1

Blerta Veseli 1

Vilhelm Von Ehrenheim 1

Gerhard Weikum 1

Yiming Xu 1

Venues

WMT 1

WS 1
