Itai Mondshine
2024
HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew
Tzuf Paz-Argaman
|
Itai Mondshine
|
Asaf Achi Mordechai
|
Reut Tsarfaty
Findings of the Association for Computational Linguistics ACL 2024
While large language models (LLMs) excel in various natural language tasks in English, their performance in low-resource languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness in Hebrew adds further challenges due to the ambiguity in sentence comprehension and the complexities in meaning construction.In this paper, we address this evaluation and resource gap by introducing HeSum, a novel benchmark dataset specifically designed for Hebrew abstractive text summarization. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals. Linguistic analysis confirms HeSum’s high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties even for state-of-the-art LLMs, establishing it as a valuable testbed for advancing generative language technology in Hebrew, and MRLs generative challenges in general.
HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew
Itai Mondshine
|
Tzuf Paz-Argaman
|
Asaf Achi Mordechai
|
Reut Tsarfaty
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
While large language models (LLMs) excel in various natural language tasks in English, their performance in low-resource languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness in Hebrew adds further challenges due to the ambiguity in sentence comprehension and the complexities in meaning construction. In this paper, we address this evaluation and resource gap by introducing HeSum, a novel benchmark dataset specifically designed for Hebrew abstractive text summarization. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals. Linguistic analysis confirms HeSum’s high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties even for state-of-the-art LLMs, establishing it as a valuable testbed for advancing generative language technology in Hebrew, and MRLs generative challenges in general.
2023
HeGeL: A Novel Dataset for Geo-Location from Hebrew Text
Tzuf Paz-Argaman
|
Tal Bauman
|
Itai Mondshine
|
Itzhak Omer
|
Sagi Dalyot
|
Reut Tsarfaty
Findings of the Association for Computational Linguistics: ACL 2023
The task of textual geolocation — retrieving the coordinates of a place based on a free-form language description — calls for not only grounding but also natural language understanding and geospatial reasoning. Even though there are quite a few datasets in English used for geolocation, they are currently based on open-source data (Wikipedia and Twitter), where the location of the described place is mostly implicit, such that the location retrieval resolution is limited. Furthermore, there are no datasets available for addressing the problem of textual geolocation in morphologically rich and resource-poor languages, such as Hebrew. In this paper, we present the Hebrew Geo-Location (HeGeL) corpus, designed to collect literal place descriptions and analyze lingual geospatial reasoning. We crowdsourced 5,649 literal Hebrew place descriptions of various place types in three cities in Israel. Qualitative and empirical analysis show that the data exhibits abundant use of geospatial reasoning and requires a novel environmental representation.
Search
Co-authors
- Tzuf Paz-Argaman 3
- Reut Tsarfaty 3
- Asaf Achi Mordechai 2
- Tal Bauman 1
- Itzhak Omer 1
- show all...