Back-of-the-Book Index Automation for Arabic Documents

Nawal Haidar, Ahmad Kashmar, Fadi Zaraket


Abstract
Back-of-the-book indexes (BoBIs) are crucial for book readability. However, their manual creation is laborious and error-prone. In this paper, we introduce ArBoBIM to automate the BoBI extraction and review processes for Arabic books. Given a book with a corresponding BoBI, ArBoBIM extracts the BoBI terms, identifies their occurrences, and aligns them across several versions of the book. ArBoBIM first defines a pool of candidates for each term by leveraging noun phrases and named entities. It then uses several metrics, including exact matches, morpho-lexical similarity, and semantic similarity, to determine the best candidates. We empirically tuned the thresholds for ArBoBIM and achieved an F1-score of 0.94 (precision = 0.97, recall = 0.91). These results are significantly better than baseline results and the top LLM-based results, at lower computational cost and with no IP risks for publishing houses. Additionally, over 500 books have been processed with ArBoBIM, resulting in the ArBoBIMap dataset, which contains books, their terms, occurrences, and various related metadata, and which will be made publicly available. This dataset is used to train a model that identifies whether a term, given its features, should be added to the back-of-the-book index of a specific book. The model achieves an F1-score of 0.91 (precision = 0.97, recall = 0.85).
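The two reported F1-scores can be checked against their precision and recall values, assuming the standard definition of F1 as the harmonic mean of precision and recall (the abstract does not state the formula). The following is a minimal sketch; the function and variable names are illustrative and do not come from ArBoBIM.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in the abstract.
alignment_f1 = f1_score(0.97, 0.91)  # term extraction and alignment
inclusion_f1 = f1_score(0.97, 0.85)  # term-inclusion model trained on ArBoBIMap

print(round(alignment_f1, 2), round(inclusion_f1, 2))
```

Under this assumption, both values round to the reported scores of 0.94 and 0.91.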
Anthology ID:
2026.abjadnlp-1.29
Volume:
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
AbjadNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
208–217
URL:
https://aclanthology.org/2026.abjadnlp-1.29/
Cite (ACL):
Nawal Haidar, Ahmad Kashmar, and Fadi Zaraket. 2026. Back-of-the-Book Index Automation for Arabic Documents. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 208–217, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Back-of-the-Book Index Automation for Arabic Documents (Haidar et al., AbjadNLP 2026)
PDF:
https://aclanthology.org/2026.abjadnlp-1.29.pdf