2024
Schema-Driven Information Extraction from Heterogeneous Tables
Fan Bai | Junmo Kang | Gabriel Stanovsky | Dayne Freitag | Mark Dredze | Alan Ritter
Findings of the Association for Computational Linguistics: EMNLP 2024
In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess the capabilities of various LLMs on this task, we present a benchmark comprising tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.
2022
SynKB: Semantic Search for Synthetic Procedures
Fan Bai | Alan Ritter | Peter Madrid | Dayne Freitag | John Niekrasz
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
In this paper we present SynKB, an open-source, automatically extracted knowledge base of chemical synthesis protocols. Similar to proprietary chemistry databases such as Reaxys, SynKB allows chemists to retrieve structured knowledge about synthetic procedures. By taking advantage of recent advances in natural language processing for procedural texts, SynKB supports more flexible queries about reaction conditions, and thus has the potential to help chemists search the literature for conditions used in relevant reactions as they design new synthetic routes. Using customized Transformer models to automatically extract information from 6 million synthesis procedures described in U.S. and EU patents, we show that for many queries, SynKB has higher recall than Reaxys, while maintaining high precision. We plan to make SynKB available as an open-source tool; in contrast, proprietary chemistry databases require costly subscriptions.
Accelerating Human Authorship of Information Extraction Rules
Dayne Freitag | John Cadigan | John Niekrasz | Robert Sasseen
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
We consider whether machine models can facilitate the human development of rule sets for information extraction. Arguing that rule-based methods possess a speed advantage in the early development of new extraction capabilities, we ask whether this advantage can be increased further through the machine facilitation of common recurring manual operations in the creation of an extraction rule set from scratch. Using a historical rule set, we reconstruct and describe the putative manual operations required to create it. In experiments targeting one key operation, the enumeration of words occurring in particular contexts, we simulate the process of corpus review and word list creation, showing that several simple interventions greatly improve recall as a function of simulated labor.
Valet: Rule-Based Information Extraction for Rapid Deployment
Dayne Freitag | John Cadigan | Robert Sasseen | Paul Kalmar
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present VALET, a framework for rule-based information extraction written in Python. VALET departs from legacy approaches predicated on cascading finite-state transducers, instead offering direct support for mixing heterogeneous information (lexical, orthographic, syntactic, corpus-analytic) in a succinct syntax that supports context-free idioms. We show how a handful of rules suffices to implement sophisticated matching, and describe a user interface that facilitates exploration for development and maintenance of rule sets. Arguing that rule-based information extraction is an important methodology early in the development cycle, we describe an experiment in which a VALET model is used to annotate examples for a machine learning extraction model. While learning to emulate the extraction rules, the resulting model generalizes them, recognizing valid extraction targets the rules failed to detect.
Proceedings of the Third Workshop on Scholarly Document Processing
Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard | Lucy Lu Wang
Proceedings of the Third Workshop on Scholarly Document Processing
Overview of the Third Workshop on Scholarly Document Processing
Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard | Lucy Lu Wang
Proceedings of the Third Workshop on Scholarly Document Processing
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 3rd Workshop on Scholarly Document Processing (SDP) at COLING as a hybrid event (https://sdproc.org/2022/). The SDP workshop consisted of a research track, three invited talks and five Shared Tasks: 1) MSLR22: Multi-Document Summarization for Literature Reviews, 2) DAGPap22: Detecting automatically generated scientific papers, 3) SV-Ident 2022: Survey Variable Identification in Social Science Publications, 4) SKGG: Scholarly Knowledge Graph Generation, 5) MuP 2022: Multi Perspective Scientific Document Summarization. The program was geared towards NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
2021
Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing
Argument Mining for Scholarly Document Processing: Taking Stock and Looking Ahead
Khalid Al Khatib | Tirthankar Ghosal | Yufang Hou | Anita de Waard | Dayne Freitag
Proceedings of the Second Workshop on Scholarly Document Processing
Argument mining targets structures in natural language related to interpretation and persuasion which are central to scientific communication. Most scholarly discourse involves interpreting experimental evidence and attempting to persuade other scientists to adopt the same conclusions. While various argument mining studies have addressed student essays and news articles, those that target scientific discourse are still scarce. This paper surveys existing work in argument mining of scholarly discourse, and provides an overview of current models, data, tasks, and applications. We identify a number of key challenges confronting argument mining in the scientific domain, and suggest some possible solutions and future directions.
Overview of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 2nd Workshop on Scholarly Document Processing (SDP) at NAACL 2021 as a virtual event (https://sdproc.org/2021/). The SDP workshop consisted of a research track, three invited talks, and three Shared Tasks (LongSumm 2021, SCIVER, and 3C). The program was geared towards the application of NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
2020
Proceedings of the First Workshop on Scholarly Document Processing
Muthu Kumar Chandrasekaran | Anita de Waard | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Eduard Hovy | Petr Knoth | David Konopnicki | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer
Proceedings of the First Workshop on Scholarly Document Processing
Overview of the First Workshop on Scholarly Document Processing (SDP)
Muthu Kumar Chandrasekaran | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Eduard Hovy | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard
Proceedings of the First Workshop on Scholarly Document Processing
Next to keeping up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. To address these challenges, computational work on enhancing search, summarization, and analysis of scholarly documents has flourished. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts and enable shared access to published research, we held the 1st Workshop on Scholarly Document Processing at EMNLP 2020 as a virtual event. The SDP workshop consisted of a research track (including a poster session), two invited talks and three Shared Tasks (CL-SciSumm, Lay-Summ and LongSumm), geared towards easier access to scientific methods and results. Website: https://ornlcda.github.io/SDProc
2017
Discourse-Wide Extraction of Assay Frames from the Biological Literature
Dayne Freitag | Paul Kalmar | Eric Yeh
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017
We consider the problem of populating multi-part knowledge frames from textual information distributed over multiple sentences in a document. We present a corpus constructed by aligning papers from the cellular signaling literature to a collection of approximately 50,000 reference frames curated by hand as part of a decade-long project. We present and evaluate two approaches to the challenging problem of reconstructing these frames, which formalize biological assays described in the literature. One approach is based on classifying candidate records nominated by sentence-local entity co-occurrence. In the second approach, we introduce a novel virtual register machine that traverses an article and generates frames, trained on our reference data. Our evaluations show that success in the task ultimately hinges on an integration of evidence spread across the discourse.
2016
Feature Derivation for Exploitation of Distant Annotation via Pattern Induction against Dependency Parses
Dayne Freitag | John Niekrasz
Proceedings of the 15th Workshop on Biomedical Natural Language Processing
An Annotated Corpus and Method for Analysis of Ad-Hoc Structures Embedded in Text
Eric Yeh | John Niekrasz | Dayne Freitag | Richard Rohwer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We describe a method for identifying and performing functional analysis of structured regions that are embedded in natural language documents, such as tables or key-value lists. Such regions often encode information according to ad hoc schemas and avail themselves of visual cues in place of natural language grammar, presenting problems for standard information extraction algorithms. Unlike previous work in table extraction, which assumes a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of naturally occurring structure types. Our approach has three main parts. First, we collect and annotate a diverse sample of “naturally” occurring structures from several sources. Second, we use probabilistic text segmentation techniques, featurized by skip bigrams over spatial and token category cues, to automatically identify contiguous regions of structured text that share a common schema. Finally, we identify the records and fields within each structured region using a combination of distributional similarity and sequence alignment methods, guided by minimal supervision in the form of a single annotated record. We evaluate the last two components individually, and conclude with a discussion of further work.
2009
Loss-Sensitive Discriminative Training of Machine Transliteration Models
Kedar Bellare | Koby Crammer | Dayne Freitag
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium
Name Transliteration with Bidirectional Perceptron Edit Models
Dayne Freitag | Zhiqiang Wang
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)
2008
Improving NER in Arabic Using a Morphological Tagger
Benjamin Farber | Dayne Freitag | Nizar Habash | Owen Rambow
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We discuss a named entity recognition system for Arabic, and show how we incorporated the information provided by MADA, a full morphological tagger which uses a morphological analyzer. Surprisingly, the relevant features used are the capitalization of the English gloss chosen by the tagger, and the fact that an analysis is returned (that a word is not OOV to the morphological analyzer). The use of the tagger also improves over a third system which just uses a morphological analyzer, yielding a 14% reduction in error over the baseline. We conduct a thorough error analysis to identify sources of success and failure among the variations, and show that by combining the systems in simple ways we can significantly influence the precision-recall trade-off.
2007
A Sequence Alignment Model Based on the Averaged Perceptron
Dayne Freitag | Shahram Khadivi
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
2005
New Experiments in Distributional Representations of Synonymy
Dayne Freitag | Matthias Blume | John Byrnes | Edmond Chow | Sadik Kapadia | Richard Rohwer | Zhiqiang Wang
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)
Morphology Induction from Term Clusters
Dayne Freitag
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)
2004
Towards Full Automation of Lexicon Construction
Richard Rohwer | Dayne Freitag
Proceedings of the Computational Lexical Semantics Workshop at HLT-NAACL 2004
Trained Named Entity Recognition using Distributional Clusters
Dayne Freitag
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing
Toward Unsupervised Whole-Corpus Tagging
Dayne Freitag
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics
A Critical Survey of the Methodology for IE Evaluation
A. Lavelli | M. E. Califf | F. Ciravegna | D. Freitag | C. Giuliano | N. Kushmerick | L. Romano
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
We survey the evaluation methodology adopted in Information Extraction (IE), as defined in the MUC conferences and in later independent efforts applying machine learning to IE. We point out a number of problematic issues that may hamper the comparison between results obtained by different researchers. Some of them are common to other NLP tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Issues specific to IE evaluation include: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an information extraction task, a number of characteristics should be clearly defined. However, in the papers only a few of them are usually explicitly specified. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. The goal is to reach a widespread agreement on such a proposal so that future IE evaluations will adopt the proposed methodology, making comparisons between algorithms fair and reliable. In order to achieve this goal, we will develop and make available to the community a set of tools and resources that incorporate a standardized IE methodology.
1998
Toward General-Purpose Learning for Information Extraction
Dayne Freitag
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1
Toward General-Purpose Learning for Information Extraction
Dayne Freitag
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics