Few-shot Adaptation Works with UnpredicTable Data

Prior work on language models (LMs) shows that training on a large number of diverse tasks improves few-shot learning (FSL) performance on new tasks. We take this to the extreme, automatically extracting 413,299 tasks from internet tables - orders of magnitude more than the next-largest public datasets. Finetuning on the resulting dataset leads to improved FSL performance on Natural Language Processing (NLP) tasks, but not proportionally to dataset scale. In fact, we find that narrow subsets of our dataset sometimes outperform more diverse datasets. For example, finetuning on software documentation from support.google.com raises FSL performance by a mean of +7.5% on 52 downstream tasks, which beats training on 40 human-curated NLP datasets (+6.7%). Finetuning on various narrow datasets leads to similar broad improvements across test tasks, suggesting that the gains are not from domain adaptation but from adapting to FSL in general. We do not observe clear patterns between the datasets that lead to FSL gains, leaving open questions about why certain data helps with FSL.


Introduction
Brown et al. (2020) showed that language models (LMs) learn to perform new tasks from a few examples ("few-shot learning"; FSL). Explicitly training LMs for FSL further improves performance (Min et al., 2021; Chen et al., 2021b), and prior work has found that increasing the size and diversity of training tasks improves generalization to new tasks (Sanh et al., 2021; Aribandi et al., 2021; Aghajanyan et al., 2021a; Wang et al., 2022). We push size and diversity to the extreme by finetuning on a large dataset of automatically-curated FSL tasks, and surprisingly find that certain narrow datasets of tasks (e.g. software documentation) outperform much larger and more diverse datasets.

Figure 1: We convert a wide variety of tables into tasks for few-shot learning (FSL), then use these tasks via finetuning to adapt language models for FSL. Unexpected tables lead to strong task transfer results: finetuning GPT2 on software documentation from support.google.com outperforms finetuning on 40 curated NLP datasets on average across 52 test tasks, with strong improvements across diverse tasks including article classification (+47%), sentiment classification (+31%), and scientific question-answering (+23%).
Investigating dataset size and diversity requires a large dataset of FSL tasks. To this end, we explore tables as a naturally-occurring source of diverse FSL tasks. Given a table where each row is a list of fields, we hold out one row as the test example and treat all other rows as task training examples. We apply this idea to automatically convert internet tables into UnpredicTable, a dataset of 413,299 diverse few-shot tasks. We finetune GPT-2 to perform a new task given a few task examples in its context ("MetaICL"; Min et al., 2021). Finetuning on UnpredicTable leads to strong FSL performance on average over 52 NLP test tasks, comparable to finetuning on human-curated NLP datasets. However, the observed gains fall short of expectations for such a large dataset.
To understand why our gains were limited, we perform various ablations on dataset size, diversity, and content. In this process, we find that finetuning on narrow subsets of UnpredicTable outperforms finetuning on our full diverse dataset and on curated NLP data. Surprisingly, the datasets that we handpick according to what we expect to be helpful are not strongly correlated with performance. In fact, the training datasets that lead to strong improvements are often counterintuitive, covering trivia content (e.g. video games on mmo-champion.com and software documentation from support.google.com; see Fig. 1) that is unrelated to downstream test tasks. Finetuning on these narrow datasets causes broad improvements similar to finetuning on curated NLP datasets when compared on the same test tasks. This suggests that these aren't domain- or task-specific improvements, but improvements in general few-shot ability ("few-shot adaptation"). Our work calls into question the common wisdom that adapting LMs to FSL requires diverse, high-quality training data.

Web Tables as a Source of Few-Shot Learning Tasks
We begin by describing FSL, which is the problem of learning from a small number of training examples. We make the case that web tables can be used as a diverse source of few-shot tasks. Then, we introduce our algorithm for converting tables into tasks and apply this to produce UnpredicTable, a dataset of 413,299 few-shot tasks.

Few-Shot Learning Tasks
We define a task T as a set of k input-output pairs {(x_i, y_i) : i = 1, . . . , k} where inputs x_i map to outputs y_i. Task types can be very diverse, from question-answering (Questions → Answers), to summarization (Books → Summaries), to translation (French → English). In FSL, k is small. LMs can be used to perform FSL by providing the k known example pairs in the LM context at inference time. Then, we give the model a new example x_target for which y_target is unknown, and we use the model to predict y_target.
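To make this concrete, below is a minimal sketch of k-shot prompting with a causal LM, assuming the HuggingFace transformers library; the "Input:/Output:" prompt format and the small GPT-2 checkpoint are illustrative assumptions, not the exact setup used in our experiments.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def few_shot_predict(examples, x_target, max_new_tokens=16):
    """examples: the k (x_i, y_i) pairs placed in the LM context."""
    prompt = ""
    for x, y in examples:
        prompt += f"Input: {x}\nOutput: {y}\n\n"
    prompt += f"Input: {x_target}\nOutput:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens as the predicted y_target.
    return tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
```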

Tables Dataset
Motivated by prior work on FSL adaptation (Min et al., 2021; Chen et al., 2021b) and multi-task learning (Sanh et al., 2021; Aribandi et al., 2021; Aghajanyan et al., 2021a), we hypothesize that we can extend the results of multi-task FSL finetuning with an even larger set of few-shot tasks. We make the case that web tables are a large and diverse source of few-shot tasks. Consider a table where each row is an instance of a similar class and columns describe the attributes of an instance. We use each row as an example of a task, where the task is filling in missing attributes of a row. A table with k rows thus becomes a k-shot dataset for a particular task.
As a source of table data, we use tables from the English-language Relational Subset of the WDC Web Table Corpus 2015 (WTC). The WTC dataset was extracted from the July 2015 Common Crawl web corpus, and contains 50M tables from 323K web domains. We focus on relational tables, which describe a set of similar items along with their attributes. For example, a table listing national dishes by country is a relational table; a table describing a single item, where each row describes a different attribute, is not. WTC also provides helpful metadata including the source URL, title, and header rows.

Turning Tables Into Tasks
In practice, there are important design choices for converting a table into a task of input-output pairs. Here, we describe our chosen procedure. We start with the assumption that items in the relational table are listed row-wise (as in Fig. 2) rather than column-wise; where necessary, we transpose tables to meet this requirement. To convert a row into an input-output task pair, we consider a single column as a potential output target y_i and concatenate the remaining columns to form the input x_i. For additional context, we prefix each value with its column header (see Fig. 2). Since any column is a potential output target, we create multiple tasks per table. For example, a table with 3 columns A, B, and C may be cast as three different tasks: P(A | B, C), P(B | A, C), and P(C | A, B).

Figure 2: An algorithm to convert tables into tasks for FSL: given the task "predict this column value given the other column values as input," each row in the table can be used as an example for that task.
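As an illustration, here is a minimal sketch of this row-to-example conversion; the "[header] value" serialization and the dictionary structure are illustrative assumptions rather than the exact format used to build UnpredicTable.

```python
def table_to_tasks(headers, rows):
    """headers: list of column names; rows: list of equal-length cell lists.
    Returns one task (a list of (input, output) examples) per column."""
    tasks = []
    for target_col in range(len(headers)):
        examples = []
        for row in rows:
            # Prefix every non-target cell with its column header.
            input_parts = [f"[{headers[c]}] {row[c]}"
                           for c in range(len(headers)) if c != target_col]
            examples.append((" ".join(input_parts), row[target_col]))
        tasks.append({"target_column": headers[target_col],
                      "examples": examples})
    return tasks

# A 3-column table yields three tasks: P(A|B,C), P(B|A,C), and P(C|A,B).
tasks = table_to_tasks(["Country", "Dish", "Main ingredient"],
                       [["Italy", "Pizza", "Flour"],
                        ["Japan", "Sushi", "Rice"]])
```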
Filtering tables We reject tables with fewer than 2 unique columns (one for the task output and at least one more for the input) or fewer than 6 unique rows (at least 5 examples + 1 target row). We find a large number of tables containing junk data or only numerical values. To remove these, we reject tables with ≥ 20% of tokens tagged as Numeral, Proper Noun, Symbol, Punctuation, or Other by the spaCy part-of-speech classifier. The tables that pass this filtering stage are converted into tasks.
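The POS-based junk filter can be sketched as follows, assuming spaCy's small English pipeline (en_core_web_sm) as a stand-in for the exact pipeline used:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed stand-in pipeline
# Universal POS tags for Numeral, Proper Noun, Symbol, Punctuation, Other.
REJECT_POS = {"NUM", "PROPN", "SYM", "PUNCT", "X"}

def table_passes_pos_filter(cells, threshold=0.20):
    """Reject tables where >= 20% of tokens fall in the junk categories."""
    doc = nlp(" ".join(cells))
    if len(doc) == 0:
        return False
    rejected = sum(token.pos_ in REJECT_POS for token in doc)
    return rejected / len(doc) < threshold
```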
Filtering tasks Given a set of candidate tasks, we require that the output space contains at least two unique answers, and reject tasks with severe class imbalance. To narrow our scope to tasks with a single correct answer, we reject tasks where any input appears more than once with different outputs. Finally, we accept at most 2500 tasks per website to counter imbalance in the source websites of generated tasks. Appendix A shows the breakdown of filtered tables and tasks at each stage. We apply our tables-to-tasks procedure to produce UnpredicTable, a dataset with 413,299 tasks from 23,744 unique websites. We also report results on the CrossFit (Ye et al., 2021) and FLEX (Bragg et al., 2021) benchmarks in Appendix C, to study the generalization of our results across different models, training algorithms, and test tasks.
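The task-level filters above can be sketched as follows; the class-imbalance threshold here is a hypothetical stand-in, since the exact threshold is given in a footnote not reproduced in this text.

```python
from collections import Counter

def task_passes_filters(examples, max_majority_frac=0.5):
    """examples: list of (input, output) pairs for one candidate task.
    max_majority_frac is a hypothetical imbalance threshold."""
    outputs = [y for _, y in examples]
    if len(set(outputs)) < 2:  # require at least two unique answers
        return False
    counts = Counter(outputs)
    if counts.most_common(1)[0][1] / len(outputs) > max_majority_frac:
        return False  # severe class imbalance
    # Reject ambiguous tasks: the same input mapping to different outputs.
    seen = {}
    for x, y in examples:
        if seen.setdefault(x, y) != y:
            return False
    return True
```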

MetaICL
MetaICL (Min et al., 2021) trains LMs to predict the output for a target input, given a few input-output pairs provided in the LM context. On each training iteration, one task T_i is sampled from D_train and k + 1 training examples {(x_1, y_1), . . . , (x_{k+1}, y_{k+1})} are sampled from T_i. MetaICL trains an LM with parameters θ to maximize log P(y_{k+1} | x_1, y_1, . . . , x_k, y_k, x_{k+1}). At test time, for a new task in D_test we draw a set of examples {(x_1, y_1), . . . , (x_k, y_k)} and a query x_{k+1}. Given this context, the LM uses θ to select the most likely y_{k+1} from a discrete set of possible labels.
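A minimal sketch of this objective with HuggingFace transformers is shown below: the k context pairs and the query are concatenated, and the LM loss is computed only on the tokens of y_{k+1} (context tokens are masked with label -100). The plain-text concatenation format is an illustrative assumption.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

def metaicl_loss(pairs, query_x, query_y):
    """pairs: list of k (x, y) tuples; (query_x, query_y): the (k+1)-th example."""
    context = "".join(f"{x} {y}\n" for x, y in pairs) + f"{query_x} "
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(query_y, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # no loss on the context tokens
    return model(input_ids, labels=labels).loss
```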

Experiments
Here, we investigate how finetuning on UnpredicTable compares to finetuning on human-curated NLP datasets. We finetune the 774M-parameter pretrained GPT2-large LM (Radford et al., 2019).

MetaICL methods MetaICL evaluates performance on each task category in two ways. First, they consider an out-of-distribution ("OOD") setting, where they finetune a model on a dataset D_train consisting of tasks from all categories excluding the target task category. Second, for the Class and QA categories, they consider an in-domain ("IID") setting, where they finetune a model on a dataset D_train consisting of only tasks from the same category as the target task category. For each task category, we compute the mean accuracy per task and report the average task accuracy for all tasks in the category. Tab. 1 shows the results. MetaICL finetuning on our table tasks improves FSL performance in all test settings. Furthermore, finetuning on our dataset outperforms finetuning on OOD NLP tasks in 4/5 settings, and on IID NLP tasks in 1/2 settings. Overall, finetuning on our data results in performance comparable to finetuning on curated NLP tasks.
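For classification test tasks, the label-selection step described in the MetaICL section can be sketched as scoring each candidate label's tokens under the model and taking the argmax; the prompt serialization is again an illustrative assumption.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def predict_label(context, candidate_labels):
    """Return the candidate label with the highest log-probability
    under the model, conditioned on the few-shot context."""
    scores = []
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    for label in candidate_labels:
        lab_ids = tokenizer(" " + label, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, lab_ids], dim=1)
        logits = model(input_ids).logits
        # logits at position t predict token t+1, so shift by one.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        positions = torch.arange(ctx_ids.shape[1] - 1, input_ids.shape[1] - 1)
        label_tokens = input_ids[0, ctx_ids.shape[1]:]
        scores.append(log_probs[positions, label_tokens].sum())
    return candidate_labels[int(torch.stack(scores).argmax())]
```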

Why Is UnpredicTable Helpful?
To understand why UnpredicTable is helpful training data, we construct subsets of the dataset varying features we wish to study. For each subdataset, we finetune on that dataset individually following the setup as before (Appendix B) and measure FSL performance on MetaICL test tasks from all categories (52 total). All experiments are repeated for 3 random seeds to minimize the effects of random task sampling in each dataset. We report the mean accuracy from each experiment in Fig. 3. We discuss our results in the following sections.

Figure 3: Each bar represents a GPT2 model finetuned on a different dataset. The y-axis shows the mean improvement of a finetuned LM over the pretrained LM. Comparing dataset helpfulness: datasets made of diverse tasks from UnpredicTable (a) and NLP datasets (b) lead to +5-7% improvement. Narrow clusters (c) and websites (d) within UnpredicTable vary significantly, with the best narrow datasets matching the best multi-task NLP datasets (b).

Next, we study the effect of task diversity on FSL performance. Tasks from the same website tend to be similar in content, so we construct more diverse datasets by sampling tasks from UnpredicTable-unique, a version of UnpredicTable filtered to have a maximum of one task per website (vs. up to 2500 in UnpredicTable). Fig. 3a shows that the difference between UnpredicTable-unique and UnpredicTable at matching sizes is small, suggesting that dataset diversity is not an important factor for our finetuning transfer success.
To examine narrow datasets in contrast to the uniformly-sampled ones, we consider 3 types of datasets grouped by content. First, we sample tasks from 20 websites of different genres, forming a dataset from each website (Fig. 3d). Second, we form datasets of semantically similar tasks by clustering UnpredicTable-unique tasks into 30 clusters using HDBSCAN (McInnes et al., 2017) (Fig. 3c; see Appendix D for details of our clustering setup). Finally, we sample 20 NLP tasks from the 90 MetaICL training tasks and use each task as a separate training dataset (Fig. 3e). Single-website and single-NLP datasets have T × N = 10,000 total examples, while cluster datasets have varying T due to the clustering algorithm.
We find there is significant variance among the narrow datasets. Some single-website or cluster datasets are better than the diverse datasets, such as support.google.com, which is our best dataset overall (even outperforming diverse NLP datasets). This suggests that careful selection of a narrow training dataset matters more for FSL improvement than task diversity.

Can we select good tasks by hand?
Padmakumar et al. (2022) found that some training tasks can negatively impact downstream performance, which could explain why aggregating many random tasks may be less successful than individual tasks. We manually categorize 2,000 tasks from UnpredicTable-unique into High, Mid, and Low quality. We define low-quality tasks as tasks where the content is junk or relies on missing context. High-quality tasks are ones where an annotator could pick the correct answer from a list of options, and which test useful abilities (logic, general knowledge, comprehension, etc.). Mid-quality tasks are the remaining tasks. For each class, we randomly sample T = 200 tasks to form its own dataset.
Surprisingly, our manual annotations of quality are not strongly correlated with downstream task performance (Fig. 3f). Our handpicked dataset of high-quality tasks does not even surpass the scores of randomly-sampled tasks, and the difference in performance between our low- and high-quality datasets is <1%. These results suggest that tasks that look helpful are not necessarily helpful.

We look for features of helpful and unhelpful datasets with examples from cluster, single-website, and single-NLP datasets. 4/5 of the most helpful datasets are software-related: support.google.com, w3.org, and wiki.openmoko.org contain software documentation, and cluster 7 describes information related to internet cookies. Unhelpful datasets are more varied. The two least-helpful datasets are NLP datasets: piqa (a question-answering task for physical knowledge) and yahoo_answers_topics (a topic-classification task) both yield negative transfer results. The least helpful table datasets include highly-repetitive software tables (clusters 2 & 3), tasks classified as noise by the clustering algorithm (cluster -1), college review posts (cappex.com), and music database entries (wkdu.org).
The top datasets appear unrelated to our test tasks (e.g. there are no software-related test tasks). Additional examples highlight this: mmo-champion.com and bulbapedia.bulbagarden.net are video game trivia sites that do not seem useful for other tasks, yet these datasets are on par with UnpredicTable-5k. Conversely, websites containing high-quality question-answer pairs such as cram.com and studystack.com, as well as en.wikipedia.org, which contains many real-world facts, yield subpar improvements. We include examples of helpful and unhelpful tasks in Tab. 2, and more examples in Appendix F.

Which tasks are our datasets helpful for?
The median improvement across test tasks is only +2.8%, though the max is +43.0%. Fig. 5 shows the 10 most-improving test tasks (median improvement across all 90 training datasets in Fig. 4). The tasks are highly varied, spanning topics from news to finance to science, and have binary or multiple-choice (MCQ) output labels. It is difficult to draw a consistent relationship between test tasks and the finetuning datasets that lead to their largest improvement ("Best dataset"). For example, cluster 7 is a dataset about web cookies, yet it is the most helpful finetuning dataset for both ag_news and amazon_polarity, which are news classification and sentiment classification tasks respectively. Our examples of unintuitive task transfer contradict prior work suggesting that domain similarity is key for successful task transfer (Gururangan et al., 2020). Vu et al. (2020) observed that "Out-of-class transfer succeeds in many cases, some of which are unintuitive." In our experiments, unintuitive transfer appears to be the norm rather than the exception.

Do different datasets lead to different improvements?
We wish to understand whether finetuning on different datasets leads to different test-task improvements. Fig. 6 illustrates that the same set of 10 test tasks makes up the majority of the top-10 improving test tasks for each of our best training datasets (the top-performing datasets for each category in Fig. 4). For example, training on wiki.openmoko.org (software documentation) leads to strong improvements on broadly similar tasks as training on lama-trex (factual knowledge). This suggests that the improvements learned from these highly different training datasets are domain-agnostic. However, it remains unclear why these improvements can be learned from these particular training datasets but not others, and why these particular test tasks benefit most from the improvements.

Related Work
Prior work showed that multi-task finetuning on diverse NLP tasks improves generalization to new tasks; Wang et al. (2022) have extended this result to more than 1,000 tasks. We were inspired by these results to obtain a training dataset with 100x more tasks, but found diverse task datasets less helpful than certain narrow datasets. Padmakumar et al. (2022) showed that a poor choice of training task can negatively impact downstream performance, which could explain why mixing diverse tasks underperforms well-chosen narrow tasks. This raises the question of how to select training datasets to improve downstream task performance. Vu et al. (2020) show that domain similarity can be used as a predictor of successful transfer. Our results highlight a gap in this explanation, and suggest that there may be domain-agnostic improvements to be gained from training tasks that are unrelated to the test tasks. Other attempts to understand the effect of training datasets on FSL also struggle to uncover clean rules; these include analyses of pretraining datasets (Shin et al., 2022), varying datasets alongside model architectures (Chan et al., 2022), and influence functions to trace gradient updates to training datapoints (Akyürek et al., 2022).
Our use of structured datasets to generate training tasks is inspired by other work, though prior work has focused on a limited set of task types. Yoran et al. (2021) also turn tables into tasks, using handwritten templates to extract question-answer pairs from tables. Aghajanyan et al. (2021b) train LMs to predict masked spans in HTML webpages, then use HTML markup to prompt language models to do summarization and classification tasks. Chen et al. (2022) transform ordinary (non-table) text into sentence completion, masked phrase prediction, and classification tasks. In contrast, our approach captures any task that occurs in tables.

Conclusion
We produced UnpredicTable, a dataset of 413,299 diverse few-shot learning tasks from internet tables. Finetuning on UnpredicTable improves the FSL ability of LMs. However, the size of our dataset is not the key factor in its success: we find that certain narrow datasets (even ones made of trivia) are more helpful than diverse, curated NLP datasets. Finetuning on these narrow datasets leads to strong improvements on the same test tasks as finetuning on diverse, curated NLP datasets. This suggests that finetuning on these datasets causes domain-agnostic FSL gains, though we were unable to find clear patterns to explain why this happens for some data and not others. Our results call into question the common wisdom that task diversity is necessary for adapting LMs to FSL. We hope our work spurs investigation into what data causes few-shot learning to emerge, both to develop better datasets and to better understand how training data leads to unexpected behaviors or failures.

Dev Tasks (50 tasks) Contains all our Train Tasks except those which are not multiple-choice. These tasks are used for hyperparameter selection. We finetune the GPT2-large model (774M; https://huggingface.co/gpt2-large) on UnpredicTable-5k and sweep over batch sizes {1, 8, 64} and learning rates {5e-5, 5e-6, 5e-7}. We select batch size = 1 and learning rate = 5e-6 based on Dev scores and use this for all MetaICL experiments. We train for 5 epochs and evaluate after each epoch, selecting the checkpoint with the highest mean Dev Tasks score. We report scores of the selected checkpoint evaluated on the Test Tasks. Each training and inference run is done on a single RTX8000 GPU. The duration of training varies by dataset size (training 5 epochs on UnpredicTable-5k takes ∼24 hours).
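The sweep amounts to a small grid search over batch size and learning rate. The sketch below assumes a hypothetical finetune_and_eval_dev helper that wraps MetaICL finetuning and Dev-Task evaluation; only the grid values are taken from the text.

```python
import itertools

best = None
for batch_size, lr in itertools.product([1, 8, 64], [5e-5, 5e-6, 5e-7]):
    # finetune_and_eval_dev is a hypothetical helper: it finetunes for 5
    # epochs, evaluates on the Dev Tasks after each epoch, and returns the
    # best mean Dev score across checkpoints.
    dev_score = finetune_and_eval_dev("gpt2-large", "UnpredicTable-5k",
                                      batch_size=batch_size, lr=lr, epochs=5)
    if best is None or dev_score > best[0]:
        best = (dev_score, batch_size, lr)
# The sweep in the text selects batch_size = 1 and lr = 5e-6.
```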

C Do Other Learning Algorithms Benefit from Table Data?

Our main experiments use the MetaICL algorithm and benchmarks for training and evaluation. To understand how well our findings hold in other settings, we report additional experiments comparing UnpredicTable-5k against NLP datasets using different models, multi-task learning algorithms, and evaluation settings.

C.1 CrossFit
Ye et al. (2021) introduce the Few-Shot Gym, a collection of 160 NLP tasks, and a problem setup called CrossFit. We focus on the Random task partition of CrossFit, where D_train and D_test contain 120 and 20 tasks respectively, sampled IID from the Few-Shot Gym. For our learning algorithm, we adopt the best-performing method in Ye et al. (2021), MTL, which finetunes on D_train followed by finetuning on the few-shot training examples from a given target task in D_test (finetuning a separate model for each target task in D_test). We compare three methods: MTL with D_train from the Few-Shot Gym, MTL with UnpredicTable-5k as D_train, and Direct Finetuning (DF), a baseline without finetuning on any D_train. All experiments finetune BART-Base (Lewis et al., 2019), a pretrained encoder-decoder transformer model (Vaswani et al., 2017).

D Clustering
Here, we describe the clustering procedure used to group UnpredicTable-unique tasks into narrow data subsets based on content. For all examples in all tasks, we concatenate each (x, y) example and obtain their embeddings from a pretrained GPT-2 model. We average the resulting 1024-dimensional embeddings at the task level. We normalize each task embedding and apply a two-stage dimensionality reduction: a PCA transformation to 128 dimensions followed by further reduction to 32 dimensions using UMAP (McInnes et al., 2018; n_neighbors = 4, min_dist = 0.0). We cluster the 32D task embeddings using the HDBSCAN algorithm (McInnes et al., 2017) with a minimum cluster size of 60 and 400 minimum samples. This setup results in 30 task clusters plus an additional cluster (cluster -1) containing tasks that HDBSCAN rejected as noise. The cluster sizes range from T = 61 to T = 5700. We tested several hyperparameters for our clustering pipeline until we arrived at a setup with reasonable in-cluster content similarity (judged by manual inspection).
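A sketch of this pipeline using scikit-learn, umap-learn, and hdbscan, with the hyperparameters stated above; the embedding step is elided, and task_embeddings stands in for the mean-pooled 1024-dimensional GPT-2 embeddings.

```python
import numpy as np
import hdbscan
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def cluster_tasks(task_embeddings):
    x = normalize(task_embeddings)              # unit-norm each task embedding
    x = PCA(n_components=128).fit_transform(x)  # stage 1: PCA to 128 dims
    x = umap.UMAP(n_components=32, n_neighbors=4,
                  min_dist=0.0).fit_transform(x)  # stage 2: UMAP to 32 dims
    clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=400)
    return clusterer.fit_predict(x)             # label -1 = noise cluster

labels = cluster_tasks(np.random.randn(5000, 1024))  # dummy embeddings
```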

E Task Quality Annotation Instructions
Below, we display a condensed version of the instructions given to annotators for annotating the dataset into different task quality levels. The full instructions are available online.
Introduction Thank you for agreeing to contribute annotations to our dataset! Here are some brief instructions to help you successfully complete this work.
Context We have a large number of Tasks created for training language models to learn a variety of skills. A standard example of a task is shown in Tab. 7 as Task 1. This example closely resembles the Question-Answer form that is commonly encountered in human competency tests, but this is not the only valid form. More generally, a Task is simply a set of input-output pairs where the inputs map to outputs in a common and (given knowledge of the mapping) predictable way; given an input, an individual skilled in this task should be able to respond with the correct output. Another example of a valid task is shown in Tab. 7 as Task 2. In this case, the inputs are a set of issues that a user might be having, and the outputs suggest actions to address each issue.
The Problem Our pool of tasks has been curated in an automated way from natural internet content, so they vary greatly in quality and form. It would be valuable to label each task's quality so that we may investigate (1) what is the overall quality in our pool of tasks, and (2) how task quality affects the ability of language models to learn from it.
The Work In this session, you will classify a number of tasks in terms of how feasible and useful they are. Each task should be rated from 0-2, where 0 is "This task is not valid or useful at all" and 2 is "This task demonstrates an interesting and useful skill".

Examples of Tasks for Annotation
Table 7: Example tasks provided with the instructions for the task-quality annotation.

Criteria of Class 0 (low rating) Tasks
• The input-output mapping appears nonsensical and/or arbitrary.
• The task is not in English.
• Would never be useful in any realistic setting / practicing this task does not build any generally-useful skills.
• Tests highly obscure knowledge that is not correlated with the input text (highly context-dependent knowledge, entertainment trivia on fan sites, product specifications, . . . )
• You would not even be able to tell if all output labels had been shuffled.

Criteria of Class 1 (medium rating) Tasks
• This class is a catch-all for tasks that are neither squarely Class 0 nor Class 2.
• The task is quite interesting, but its current form contains flaws that make it confusing, or it lacks enough context to do a good job of the task.
• You could narrow the space of possible options and guess the right answer with better-than-random accuracy (especially with the help of multiple-choice options).
• The task makes sense but is trivial or not interesting enough to be Class 2. For example, the output is just a copy of the input.

Criteria of Class 2 (high rating) Tasks
• The task is well-posed with enough context that an expert could give a reasonably correct answer most of the time.
• Demonstrates a skill that is definitely useful for real-world tasks, i.e. might be tested in an exam or competency test, or part of a job.
• Resembles the type of skill that is tested in typical NLP datasets. See the "Examples from real NLP datasets" section in the full instructions.

Further notes
• These criteria are not a complete set of rules for membership, so based on the above you may make your own judgement regarding a new task that does not perfectly fit any criteria.
• We expect that the majority of our tasks will fall into either Class 0 or Class 1; fewer than 20% of the tasks will meet the standard for Class 2.
• A single input may not always be enough to know what the task expects in the output; this is acceptable (even for Class 2) as long as the input-output mapping is clear after observing several demonstration pairs.
• The "Examples from real NLP datasets" section in the full instructions 13 show the kinds of interesting tasks we would like to see in Class 2, but we expect (and encourage) that our tasks will span a wider variety that are still interesting and valuable.

F Examples of tasks
In the following pages, we provide examples from various datasets discussed in the text:

input: [King James Version] And she lay at his feet until the morning: and she rose up before one could know another. And he said, Let it not be known that a woman came into the floor. So she lay at his feet until morning. She got up before either could know the other. He said, "Don't let it be known that a woman came into the threshing-floor." [Analysis]
output: Boaz wants to avoid scandal.

input: [Verse] 5 [King James Version] And she said unto her, All that thou sayest unto me I will do. Ruth said to her, "I will do everything you say." [Analysis]
output: What Ruth must have thought of these orders, none can speculate.

input: [Verse] 1 [King James Version] Then Naomi her mother in law said unto her, My daughter, shall I not seek rest for thee, that it may be well with thee? Now Naomi, mother-in-law of Ruth, said to her, "My daughter, I should find you a place of rest, that will be good for you. [Analysis]
output: Naomi wants to settle Ruth properly.

quality_annotated: Med (Task 1)
input: They readily give up their two valence electrons to achieve a full outer energy level, which is the most stable arrangement of electrons. As a result, the ... (Truncated)
output: valence electrons

input: Exposure gives an indication of the amount of radiation that travels through the air. Two factors influence the amount of exposure a person may receive - time and intensity. Acute exposure indicates a large amount of radiation received over a short ... (Truncated)
output: chronic exposure

input: Ventricular Systole Ventricular systole (see Figure 19.27) follows the depolarization of the ventricles and is represented by the QRS complex in the ECG. It may be conveniently divided into two phases, lasting a total of 270 ms. At the end of atrial ... (Truncated)
output: pulmonary and aortic semilunar

nlp_test tweet_eval-stance_atheism (52 examples)
input: The worst day of my life so far is here, setting my Nan to rest. Even as a physicist, times like these make you wonder. #SemST
output: none

input: I will dwell in a peaceful habitation, in secure dwellings, and in quiet resting places - Isa. 32:18 #SemST
output: against

input: @user sweet! Congratulations to a rational decision. #SemST
output: none

yelp_polarity (100 examples)
input: Very disappointed in this salon. Set an appt 4 days ahead of time. Area were I for my set put on was dirty from a past client. The mail tech did not talk, I felt rushed through my appt which resulted in me leaving unhappy. I won't be returning.
output: negative

input: Our flight arrived to Vegas earlier than excepted, so we expected our room not to be ready. When we arrived at the hotel on May 19th, the front desk girl offered us a room that was ready on the 28th floor that wasn't facing the Bellagio fountain. I b ... (Truncated)
output: positive

input: My poor children who live out of state, have no idea how cheap and ugly the flowers I just received from Carmel Florist are. They do not resemble the online photo at all. I actually laughed at the gentleman who delivered them to my door. They spent ... (Truncated)
output: negative
When we arrived at the hotel on May 19th, the front desk girl offered us a room that was ready on the 28th floor that wasn't facing the Bellagio fountain. I b ... (Truncated) output positive input My poor children who live out of state, have no idea how cheap and ugly the flowers I just received from Carmel Florist are. They do not resemble the online photo at all. I actually laughed at the gentleman who delivered them to my door. They spent ... (Truncated) output negative