Preprocessing and lexicon design for parsing technical text

Robert P. Futrelle, Christopher E. Dunn, Debra S. Ellis, Maurice J. Pescitelli, Jr.


Abstract
Technical documents with complex structures and orthography present special difficulties for current parsing technology. These include technical notation such as subscripts, superscripts and numeric and algebraic expressions as well as Greek letters, italics, small capitals, brackets and punctuation marks. Structural elements such as references to figures, tables and bibliographic items also cause problems. We first hand-code documents in Standard Generalized Markup Language (SGML) to specify the document’s logical structure (paragraphs, sentences, etc.) and capture significant orthography. Next, a regular expression analyzer produced by LEX is used to tokenize the SGML text. Then a token-based phrasal lexicon is used to identify the longest token sequences in the input that represent single lexical items. This lookup is efficient because limits on lookahead are precomputed for every item. After this, the Alvey Tools parser with specialized subgrammars is used to discover items such as floating-point numbers. The product of these preprocessing stages is a text that is acceptable to a full natural language parser. This work is directed towards automating the building of knowledge bases from research articles in the field of bacterial chemotaxis, but the techniques should be of wide applicability.
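The abstract's claim that phrasal-lexicon lookup "is efficient because limits on lookahead are precomputed for every item" can be illustrated with a small sketch. The paper does not give its data structures, so the index shape and all function names below are hypothetical: the idea is simply that each first token of a phrase stores the length of the longest phrase beginning with it, bounding how far the matcher must look ahead.

```python
# Hedged sketch of longest-match phrasal lookup with precomputed
# lookahead bounds. All names are illustrative, not from the paper.
from typing import Dict, List, Set, Tuple

Phrase = Tuple[str, ...]
Index = Dict[str, Tuple[int, Set[Phrase]]]

def build_index(phrases: List[Phrase]) -> Index:
    """Map each phrase-initial token to (max phrase length, phrase set).

    The max length is the precomputed lookahead limit for that token.
    """
    index: Index = {}
    for p in phrases:
        max_len, entries = index.get(p[0], (0, set()))
        entries.add(p)
        index[p[0]] = (max(max_len, len(p)), entries)
    return index

def longest_match(tokens: List[str], i: int, index: Index) -> Tuple[Phrase, int]:
    """Return the longest lexicon phrase starting at tokens[i], else the token."""
    if tokens[i] not in index:
        return (tokens[i],), 1
    max_len, entries = index[tokens[i]]
    # Try candidate spans from longest to shortest, never looking
    # further ahead than the precomputed bound for this first token.
    for n in range(min(max_len, len(tokens) - i), 0, -1):
        cand = tuple(tokens[i:i + n])
        if n == 1 or cand in entries:
            return cand, n
    return (tokens[i],), 1

def tokenize_phrases(tokens: List[str], index: Index) -> List[str]:
    """Greedily replace the longest matching token sequences by single items."""
    out, i = [], 0
    while i < len(tokens):
        phrase, n = longest_match(tokens, i, index)
        out.append(" ".join(phrase))
        i += n
    return out
```

For example, with a lexicon containing "in vivo", "wild type", and "wild type strain", the input "the wild type strain swims in vivo" yields the items `the`, `wild type strain`, `swims`, `in vivo`: at "wild" the matcher looks at most three tokens ahead, because that is the longest lexicon phrase beginning with "wild".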
Anthology ID: 1991.iwpt-1.5
Volume: Proceedings of the Second International Workshop on Parsing Technologies
Month: February 13-25
Year: 1991
Address: Cancun, Mexico
Venues: IWPT | WS
SIG: SIGPARSE
Publisher: Association for Computational Linguistics
Pages: 31–40
URL: https://aclanthology.org/1991.iwpt-1.5
PDF: https://aclanthology.org/1991.iwpt-1.5.pdf