Lingjue Xie

2025

STARQA: A Question Answering Dataset for Complex Analytical Reasoning over Structured Databases
Mounica Maddela | Lingjue Xie | Daniel Preotiuc-Pietro | Mausam
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Our goal is to assess how well current Text2SQL systems support SQL analysts in their primary work of performing complex analytics on specialized relational databases. Although several benchmarks evaluate Text2SQL models, the complexity of questions (and the output SQL queries) in most datasets is inherently limited – they do not focus on intents involving analytics and reasoning. In response, we present STARQA, the first public human-created dataset focused on complex analytical questions and answers (involving nested joins, time series analytics, statistical operations, and more) on three specialized-domain databases. In addition to standard Text2SQL baselines, we also evaluate a novel approach (Text2SQLCode) that decomposes the task through a combination of SQL and Python: SQL is responsible for data fetch, and Python more naturally performs reasoning. Our results demonstrate that both existing Text2SQL systems and our Text2SQLCode approach find STARQA questions quite challenging, even though Text2SQLCode achieves better performance on the more difficult questions. Further analyses assess the typical errors made by existing systems and charts a research path for pushing the capabilities of real-world systems.

2023

pdf bib abs

Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.

pdf bib abs

EntSUMv2: Dataset, Models and Evaluation for More Abstractive Entity-Centric Summarization
Dhruv Mehra | Lingjue Xie | Ella Hofmann-Coyle | Mayank Kulkarni | Daniel Preotiuc-Pietro
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Entity-centric summarization is a form of controllable summarization that aims to generate a summary for a specific entity given a document. Concise summaries are valuable in various real-life applications, as they enable users to quickly grasp the main points of the document focusing on an entity of interest. This paper presents ENTSUMV2, a more abstractive version of the original entity-centric ENTSUM summarization dataset. In ENTSUMV2 the annotated summaries are intentionally made shorter to benefit more specific and useful entity-centric summaries for downstream users. We conduct extensive experiments on this dataset using multiple abstractive summarization approaches that employ supervised fine-tuning or large-scale instruction tuning. Additionally, we perform comprehensive human evaluation that incorporates metrics for measuring crucial facets. These metrics provide a more fine-grained interpretation of the current state-of-the-art systems and highlight areas for future improvement.

pdf bib abs

Towards a Unified Multi-Domain Multilingual Named Entity Recognition Model
Mayank Kulkarni | Daniel Preotiuc-Pietro | Karthik Radhakrishnan | Genta Indra Winata | Shijie Wu | Lingjue Xie | Shaohua Yang
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Named Entity Recognition is a key Natural Language Processing task whose performance is sensitive to choice of genre and language. A unified NER model across multiple genres and languages is more practical and efficient by leveraging commonalities across genres or languages. In this paper, we propose a novel setup for NER which includes multi-domain and multilingual training and evaluation across 13 domains and 4 languages. We explore a range of approaches to building a unified model using domain and language adaptation techniques. Our experiments highlight multiple nuances to consider while building a unified model, including that naive data pooling fails to obtain good performance, that domain-specific adaptations are more important than language-specific ones and that including domain-specific adaptations in a unified model nears the performance of training multiple dedicated monolingual models at a fraction of their parameter count.

pdf bib

Efficient Zero-Shot Cross-lingual Inference via Retrieval
Genta Winata | Lingjue Xie | Karthik Radhakrishnan | Yifan Gao | Daniel Preotiuc-Pietro
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

2022

pdf bib abs

Extractive Entity-Centric Summarization as Sentence Selection using Bi-Encoders
Ella Hofmann-Coyle | Mayank Kulkarni | Lingjue Xie | Mounica Maddela | Daniel Preotiuc-Pietro
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Entity-centric summarization is a type of controllable summarization that aims to produce a summary of a document that is specific to a given target entity. Extractive summaries possess multiple advantages over abstractive ones such as preserving factuality and can be directly used in downstream tasks like target-based sentiment analysis or incorporated into search applications. In this paper, we explore methods to solve this task by recasting it as a sentence selection task, as supported by the EntSUM data set. We use methods inspired by information retrieval, where the input to the model is a pair representing a sentence from the original document and the target entity, in place of the query. We explore different architecture variants and loss functions in this framework with results showing an up to 5.8 F1 improvement over past state-of-the-art and outperforming the competitive entity-centric Lead 3 heuristic by 1.1 F1. In addition, we also demonstrate similarly strong results on the related task of salient sentence selection for an entity.

Co-authors

Venues

Fix author