Building a Compact Math Corpus

Andrea Ferreira


Abstract
This paper introduces the Compact Math Corpus (CMC), a preliminary resource for natural language processing in the mathematics domain. We process three open-access undergraduate textbooks from distinct mathematical areas and annotate them in the CoNLL-U format using a lightweight pipeline based on the spaCy Small model. The structured output enables the extraction of syntactic bigrams and TF-IDF scores, supporting a syntactic-semantic analysis of mathematical sentences.From the annotated data, we construct a classification dataset comprising bigrams potentially representing mathematical concepts, along with representative example sentences. We combine CMC with the conversational corpus UD English EWT and train a logistic regression model with K-fold cross-validation, achieving a minimum macro-F1 score of 0.989. These results indicate the feasibility of automatic concept identification in mathematical texts.The study is designed for easy replication in low-resource settings and to promote sustainable research practices. Our approach offers a viable path to tasks such as parser adaptation, terminology extraction, multiword expression modeling, and improved analysis of mathematical language structures.
Anthology ID:
2025.naloma-1.5
Volume:
Proceedings of the 5th Workshop on Natural Logic Meets Machine Learning (NALOMA)
Month:
August
Year:
2025
Address:
Bochum, Germany
Editors:
Lasha Abzianidze, Valeria de Paiva
Venues:
NALOMA | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
48–55
Language:
URL:
https://aclanthology.org/2025.naloma-1.5/
DOI:
Bibkey:
Cite (ACL):
Andrea Ferreira. 2025. Building a Compact Math Corpus. In Proceedings of the 5th Workshop on Natural Logic Meets Machine Learning (NALOMA), pages 48–55, Bochum, Germany. Association for Computational Linguistics.
Cite (Informal):
Building a Compact Math Corpus (Ferreira, NALOMA 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.naloma-1.5.pdf