2024
Learning Personalized Alignment for Evaluating Open-ended Text Generation
Danqing Wang | Kevin Yang | Hanlin Zhu | Xiaomeng Yang | Andrew Cohen | Lei Li | Yuandong Tian
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent research has increasingly focused on evaluating large language models’ (LLMs) alignment with diverse human values and preferences, particularly for open-ended tasks like story generation. Traditional evaluation metrics rely heavily on lexical similarity with human-written references, often showing poor correlation with human judgments and failing to account for alignment with the diversity of human preferences. To address these challenges, we introduce PerSE, an interpretable evaluation framework designed to assess alignment with specific human preferences. It is tuned to infer specific preferences from an in-context personal profile and to evaluate the alignment between the generated content and personal preferences. PerSE enhances interpretability by providing detailed comments and fine-grained scoring, facilitating more personalized content generation. Our 13B LLaMA-2-based PerSE shows a 15.8% increase in Kendall correlation and a 13.7% rise in accuracy with zero-shot reviewers compared to GPT-4. It also outperforms GPT-4 by 46.01% in Kendall correlation on new domains, indicating its transferability.
To the Globe (TTG): Towards Language-Driven Guaranteed Travel Planning
Da Ju | Song Jiang | Andrew Cohen | Aaron Foss | Sasha Mitts | Arman Zharmagambetov | Brandon Amos | Xian Li | Justine T Kao | Maryam Fazel-Zarandi | Yuandong Tian
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Travel planning is a challenging and time-consuming task that aims to find an itinerary satisfying multiple, interdependent constraints regarding flights, accommodations, attractions, and other travel arrangements. In this paper, we propose To the Globe (TTG), a real-time demo system that takes natural language requests from users, translates them into symbolic form via a fine-tuned Large Language Model, and produces optimal travel itineraries with Mixed Integer Linear Programming solvers. The overall system takes ~5 seconds to reply to a user request with guaranteed itineraries. To train TTG, we develop a synthetic data pipeline that generates user requests and flight and hotel information in symbolic form without human annotations, based on the statistics of real-world datasets, and fine-tune an LLM to translate NL user requests into their symbolic form, which is sent to the symbolic solver to compute optimal itineraries. Our NL-to-symbolic translation achieves ~91% exact match on a back-translation metric (i.e., whether the estimated symbolic form of the generated natural language matches the ground truth), and its returned itineraries have a cost ratio of 0.979 relative to the optimal cost of the ground-truth user request. When evaluated by users, TTG achieves consistently high Net Promoter Scores (NPS) of 35-40% on generated itineraries.
The ART of LLM Refinement: Ask, Refine, and Trust
Kumar Shridhar | Koustuv Sinha | Andrew Cohen | Tianlu Wang | Ping Yu | Ramakanth Pasunuru | Mrinmaya Sachan | Jason Weston | Asli Celikyilmaz
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations and self-improve? A popular concept, referred to as *self-refinement*, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a refinement strategy called *ART: Ask, Refine, and Trust*, which *asks* necessary questions to decide when an LLM should *refine* its output, and then affirms or denies *trust* in the refinement by ranking it against the initial prediction. On two multistep reasoning tasks, mathematical word problems (GSM8K) and question answering (StrategyQA), *ART* achieves a performance gain of +5 points over self-refinement baselines while using a much smaller model as the decision maker. We believe that *ART*, with smaller models making the refinement decisions, can be a cost-effective alternative to fine-tuning LLMs.
1995
Developing a Nonsymbolic Phonetic Notation for Speech Synthesis
Andrew Cohen
Computational Linguistics, Volume 21, Number 4, December 1995