Constructing a Public Meeting Corpus

Koji Tanaka, Chenhui Chu, Haolin Ren, Benjamin Renoust, Yuta Nakashima, Noriko Takemura, Hajime Nagahara, Takao Fujikawa


Abstract
In this paper, we propose a full pipeline for analyzing a large corpus covering about a century of public meetings in historical Australian newspapers, from construction to visual exploration. The corpus construction method is based on image processing and OCR. We digitize and transcribe texts on the specific topic of public meetings. Experiments show that our proposed method achieves an F-score of 87.8% for corpus construction. As a result, we built a content search tool for temporal and semantic content analysis.
Anthology ID:
2020.lrec-1.238
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1934–1940
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.238
Cite (ACL):
Koji Tanaka, Chenhui Chu, Haolin Ren, Benjamin Renoust, Yuta Nakashima, Noriko Takemura, Hajime Nagahara, and Takao Fujikawa. 2020. Constructing a Public Meeting Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1934–1940, Marseille, France. European Language Resources Association.
Cite (Informal):
Constructing a Public Meeting Corpus (Tanaka et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.238.pdf