Xinyi He
2023
AnaMeta: A Table Understanding Dataset of Field Metadata Knowledge Shared by Multi-dimensional Data Analysis Tasks
Xinyi He
|
Mengyu Zhou
|
Mingjie Zhou
|
Jialiang Xu
|
Xiao Lv
|
Tianle Li
|
Yijia Shao
|
Shi Han
|
Zejian Yuan
|
Dongmei Zhang
Findings of the Association for Computational Linguistics: ACL 2023
Tabular data analysis is performed everyday across various domains. It requires an accurate understanding of field semantics to correctly operate on table fields and find common patterns in daily analysis. In this paper, we introduce the AnaMeta dataset, a collection of 467k tables with derived supervision labels for four types of commonly used field metadata: measure/dimension dichotomy, common field roles, semantic field type, and default aggregation function. We evaluate a wide range of models for inferring metadata as the benchmark. We also propose a multi-encoder framework, called KDF, which improves the metadata understanding capability of tabular models by incorporating distribution and knowledge information. Furthermore, we propose four interfaces for incorporating field metadata into downstream analysis tasks.
2022
Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of NLP Systems
Jialiang Xu
|
Mengyu Zhou
|
Xinyi He
|
Shi Han
|
Dongmei Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Numerical Question Answering is the task of answering questions that require numerical capabilities. Previous works introduce general adversarial attacks to Numerical Question Answering, while not systematically exploring numerical capabilities specific to the topic. In this paper, we propose to conduct numerical capability diagnosis on a series of Numerical Question Answering systems and datasets. A series of numerical capabilities are highlighted, and corresponding dataset perturbations are designed. Empirical results indicate that existing systems are severely challenged by these perturbations. E.g., Graph2Tree experienced a 53.83% absolute accuracy drop against the “Extra” perturbation on ASDiv-a, and BART experienced 13.80% accuracy drop against the “Language” perturbation on the numerical subset of DROP. As a counteracting approach, we also investigate the effectiveness of applying perturbations as data augmentation to relieve systems’ lack of robust numerical capabilities. With experiment analysis and empirical studies, it is demonstrated that Numerical Question Answering with robust numerical capabilities is still to a large extent an open question. We discuss future directions of Numerical Question Answering and summarize guidelines on future dataset collection and system design.
Search
Co-authors
- Mengyu Zhou 2
- Jialiang Xu 2
- Shi Han 2
- Dongmei Zhang 2
- Mingjie Zhou 1
- show all...