CUET_SSTM at the GEM’24 Summarization Task: Integration of extractive and abstractive method for long text summarization in Swahili language

Samia Rahman, Momtazul Arefin Labib, Hasan Murad, Udoy Das


Abstract
Swahili, spoken by around 200 million people primarily in Tanzania and Kenya, is the focus of our research for the GEM Shared Task at INLG’24 on Underrepresented Language Summarization. We used the XLSUM dataset and manually summarized 1000 texts from a Swahili news classification dataset. We tested abstractive summarizers (mT5_multilingual_XLSum, t5-small, mBART-50) and an extractive summarizer based on the PageRank algorithm. Our adopted model, however, is an integrated extractive-abstractive model that combines the Bert Extractive Summarizer with abstractive summarizers (t5-small, mBART-50). The integrated model overcomes the drawbacks of both extractive and abstractive summarization systems while retaining the benefits of each. The extractive summarizer shortens paragraphs exceeding 512 tokens, ensuring that no important information is lost before the abstractive models are applied. The abstractive summarizer then uses its pretrained knowledge to generate a context-based summary.
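The control flow described in the abstract (shorten only inputs above 512 tokens, then pass the result to an abstractive model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokens stand in for the model's subword tokens, a simple word-frequency sentence scorer stands in for the Bert Extractive Summarizer, and the abstractive model is any callable supplied by the caller. The names `count_tokens`, `extractive_shorten`, and `summarize` are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, List

MAX_TOKENS = 512  # length threshold named in the abstract


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens approximate the model's subword tokens.
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_shorten(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep the highest-scoring sentences, in original order, within max_tokens.

    Stand-in scorer (assumption): average corpus frequency of a sentence's words.
    A sentence that alone exceeds max_tokens is skipped.
    """
    sentences = split_sentences(text)
    freq = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence)
        return sum(freq[w.lower()] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    kept, used = [], 0
    for i in ranked:
        n = count_tokens(sentences[i])
        if used + n <= max_tokens:
            kept.append(i)
            used += n
    # Restore document order so the shortened text stays readable.
    return " ".join(sentences[i] for i in sorted(kept))


def summarize(text: str,
              abstractive: Callable[[str], str],
              max_tokens: int = MAX_TOKENS) -> str:
    # Only over-length inputs go through the extractive step.
    if count_tokens(text) > max_tokens:
        text = extractive_shorten(text, max_tokens)
    return abstractive(text)
```

In practice `abstractive` would wrap a fine-tuned model such as t5-small or mBART-50, and the token count would come from that model's own tokenizer rather than whitespace splitting.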
Anthology ID:
2024.inlg-genchal.12
Volume:
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges
Month:
September
Year:
2024
Address:
Tokyo, Japan
Editors:
Simon Mille, Miruna-Adriana Clinciu
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
112–117
URL:
https://aclanthology.org/2024.inlg-genchal.12
Cite (ACL):
Samia Rahman, Momtazul Arefin Labib, Hasan Murad, and Udoy Das. 2024. CUET_SSTM at the GEM’24 Summarization Task: Integration of extractive and abstractive method for long text summarization in Swahili language. In Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges, pages 112–117, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
CUET_SSTM at the GEM’24 Summarization Task: Integration of extractive and abstractive method for long text summarization in Swahili language (Rahman et al., INLG 2024)
PDF:
https://aclanthology.org/2024.inlg-genchal.12.pdf