Aidan San
2023
PLAtE: A Large-scale Dataset for List Page Web Extraction
Aidan San
|
Yuan Zhuang
|
Jan Bakus
|
Colin Lockard
|
David Ciemiewicz
|
Sandeep Atluri
|
Kevin Small
|
Yangfeng Ji
|
Heba Elfardy
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items encompassing the tasks of: (1) finding product list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks comparing their strengths and weaknesses both quantitatively and qualitatively.
2020
A Tale of Two Linkings: Dynamically Gating between Schema Linking and Structural Linking for Text-to-SQL Parsing
Sanxing Chen
|
Aidan San
|
Xiaodong Liu
|
Yangfeng Ji
Proceedings of the 28th International Conference on Computational Linguistics
In Text-to-SQL semantic parsing, selecting the correct entities (tables and columns) for the generated SQL query is both crucial and challenging; the parser is required to connect the natural language (NL) question and the SQL query to the structured knowledge in the database. We formulate two linking processes to address this challenge: schema linking which links explicit NL mentions to the database and structural linking which links the entities in the output SQL with their structural relationships in the database schema. Intuitively, the effectiveness of these two linking processes changes based on the entity being generated, thus we propose to dynamically choose between them using a gating mechanism. Integrating the proposed method with two graph neural network-based semantic parsers together with BERT representations demonstrates substantial gains in parsing accuracy on the challenging Spider dataset. Analyses show that our proposed method helps to enhance the structure of the model output when generating complicated SQL queries and offers more explainable predictions.
2018
Random Decision Syntax Trees at SemEval-2018 Task 3: LSTMs and Sentiment Scores for Irony Detection
Aidan San
Proceedings of the 12th International Workshop on Semantic Evaluation
We propose a Long Short Term Memory Neural Network model for irony detection in tweets in this paper. Our model is trained using word embeddings and emoji embeddings. We show that adding sentiment scores to our model improves the F1 score of our baseline LSTM by approximately .012, and therefore show that high-level features can be used to improve word embeddings in certain Natural Language Processing applications. Our model ranks 24/43 for binary classification and 5/31 for multiclass classification. We make our model easily accessible to the research community.
Search
Fix data
Co-authors
- Yangfeng Ji 2
- Sandeep Atluri 1
- Jan Bakus 1
- Sanxing Chen 1
- David Ciemiewicz 1
- show all...