UniHD at TSAR-2022 Shared Task: Is Compute All We Need for Lexical Simplification?

Previous state-of-the-art models for lexical simplification consist of complex pipelines with several components, each of which requires deep technical knowledge and fine-tuned interaction to achieve its full potential. As an alternative, we describe a frustratingly simple pipeline based on prompted GPT-3 responses, beating competing approaches by a wide margin in settings with few training instances. Our best-performing submission to the English language track of the TSAR-2022 shared task consists of an “ensemble” of six different prompt templates with varying context levels. As a late-breaking result, we further detail a language transfer technique that allows simplification in languages other than English. Applied to the Spanish and Portuguese subset, we achieve state-of-the-art results with only minor modification to the original prompts. Aside from detailing the implementation and setup, we spend the remainder of this work discussing the particularities of prompting and implications for future work. Code for the experiments is available online at https://github.com/dennlinger/TSAR-2022-Shared-Task.


Introduction
With recent advancements in Machine Learning (ML) research coming largely from increasing compute budgets, Richard Sutton coined the idea of a "bitter lesson", wherein more computational power will ultimately supersede hand-crafted solutions (Sutton, 2019). More recently, increasing compute power on general-purpose architectures has also been shown to be wildly successful in the Natural Language Processing (NLP) community (Vaswani et al., 2017; Wei et al., 2022). In particular, emergent capabilities in very large language models (vLLMs) have made it possible to approach a variety of tasks where only a few (if any) samples are labeled and no further fine-tuning on task-specific data is required at all. In stark contrast to the complex pipelines in modern lexical simplification systems (Ferrés et al., 2017; Qiang et al., 2020; Štajner et al., 2022), we present a simplistic approach utilizing few-shot prompts based on a vLLM with basic instructions on simplification, which returns frustratingly good results considering the simplicity of the approach: it utilizes a grand total of four hand-labeled instances. We present our results on the TSAR-2022 shared task (Saggion et al., 2022), which evaluates lexical simplification systems in three available languages (English, Spanish, and Portuguese), with ten labeled instances and around 350 unlabeled test samples provided per language. For the English subset, official results rank our model as the best-performing submission, indicating that this approach may be another instance of the bitter lesson. While the initial findings are indeed promising, we want to carefully evaluate erroneous instances on the test set to analyze potential pitfalls, and further detail some of our experiences in hand-crafting prompts. We also acknowledge the technical challenges in reproducing (and deploying) systems based on vLLMs, especially given that suitable models exceed traditional computing budgets.

Prompt-based Lexical Simplification
With the public release of the GPT-3 language model (Brown et al., 2020), OpenAI set off a wave of now-available vLLMs for general-purpose text generation (Thoppilan et al., 2022; BigScience, 2022; Zhang et al., 2022). Across these models, a general trend of scaling beyond a particular parameter size can be observed, while keeping the underlying architectural design close to existing smaller models. Through exhibiting zero-shot transfer capabilities, such models have also become more attractive for lower-resourced tasks; oftentimes, models are able to answer questions formulated in natural language with somewhat sensible results. Particular template patterns (so-called prompts) are frequently used to guide models towards predicting a particularly desirable output or answer format, without requiring dedicated training on labeled examples. Utilizing this paradigm shift, we experimented with different prompts issued to OpenAI's largest available model, text-davinci-002, which totals 175B parameters. Our first approach uses a singular prompt template in a zero-shot setting to obtain predictions for the shared task; we further improve upon these results by combining predictions from different prompt templates later on.

Run 1: Zero-shot Prediction
Upon inspecting the provided trial data, we noted that the required simplification operations call for vastly different degrees of contextualization within the provided sample sentence. Whereas some instances can be solved with pure synonym look-ups (e.g., "compulsory" and "mandatory"), others require a more nuanced look at the context sentence (e.g., replacing "disguised" with "dressed"). To avoid biasing system predictions by providing samples as a prompt template, we provide a baseline that is entirely based on a single zero-shot query; it provides the context sentence and identifies the complex word, asking the model for ten simplified synonyms of the complex word in the given context. Given that no additional knowledge is provided to the model, the zero-shot contextual query also provides a reasonable lower bound for the task setting. A secondary advantage of the minimal provided context in zero-shot settings is the reduced computational cost, which will be discussed in more detail in Section 3.4.
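As an illustration, the contextual zero-shot query can be assembled from a template of roughly the following shape; this is a sketch using the Context:/Question:/Answer: markers and the question wording quoted later in this paper, not necessarily the verbatim template (the exact prompts appear in Table 3):

```python
def build_zero_shot_prompt(sentence: str, complex_word: str) -> str:
    """Assembles a contextual zero-shot prompt asking for ten
    simplified synonyms of the complex word in its context."""
    return (
        f"Context: {sentence}\n"
        f"Question: Given the above context, list ten alternative words "
        f'for "{complex_word}" that are easier to understand.\n'
        f"Answer:"
    )
```

The whole task specification thus fits into three short lines of natural language, with no labeled examples required.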

Filtering Predictions
Model suggestions are returned as free-form text predictions, generally in the form of comma-separated lists or enumerations. This requires the additional step of parsing the output prediction into the more structured ranked predictions required for the shared task, which varies between the models used. In our experience, no clear output pattern can be expected from the model, and the format seems to be non-deterministic even with fixed template structures. We additionally employ a list of simple filters to ensure the quality of predictions, as detailed in Appendix C. The resulting model suggestions are considered in ranked order, and no prediction confidence scores or similar information was used to re-rank single-prompt predictions.
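A minimal parsing sketch covering the two output shapes mentioned above (comma-separated lists and numbered enumerations); the actual submission may handle further variants, and the Appendix C filters run after this step:

```python
import re

def parse_suggestions(completion: str) -> list[str]:
    """Parses a free-form model completion into a ranked suggestion list.
    Handles comma-separated lists ("easy, simple, plain") as well as
    numbered enumerations ("1. easy\n2. simple")."""
    text = completion.strip()
    if re.search(r"^\s*\d+[.)]", text, flags=re.MULTILINE):
        # Enumerated output: keep the remainder of each numbered line.
        items = re.findall(r"^\s*\d+[.)]\s*(.+)$", text, flags=re.MULTILINE)
    else:
        items = text.split(",")
    return [item.strip(" \n\t.;:!?") for item in items if item.strip()]
```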

Run 2: Ensemble Predictions
Upon inspecting the results from the first run, we noticed that in some instances, predictions were almost fully discarded due to filtering. Simultaneously, we had already encountered strong variability in system generations when changing the prompt template or altering the context setting. To this end, an ensemble of predictions from multiple different prompt templates was utilized to broaden the spectrum of possible generations, as well as to ensure that a minimum number of suggestions survives the filtering step.

Prompt Variations
The exact prompts are detailed in Table 3. Utilized variations can be grouped into with context (the context sentence is provided) and without context (synonyms are generated from the complex word alone). Simultaneously, different prompts also contain between zero and two examples taken from the trial data, including their expected outputs. This can be interpreted as a few-shot setting in which the model is shown what correct answers may look like for the particular task. We further vary the generation temperature, where a higher value increases the likelihood of a more creative (but not always correct) prediction, enabling a more diverse candidate set.

Combining Predictions
For each of the six prompts p, we ask the model to suggest ten alternative simplified expressions S_p and filter them with the exact same rules as the single-prompt system in Run 1. In order to combine and re-rank suggestions, we assign a combination score V(s) to each distinct prediction s ∈ ⋃_p S_p:

V(s) = Σ_p w_p / rank_{S_p}(s),

where rank_{S_p}(s) is the ranked position of suggestion s in the resulting ranking from prompt p. If s ∉ S_p, we set rank_{S_p}(s) = ∞, i.e., that prompt contributes nothing to the score. The scaling parameters w_p are chosen arbitrarily and can be adjusted to account for the expected number of suggestions per prompt. We estimate that the biggest performance improvement comes simply from providing enough predictions post filtering. As a secondary gain, we see more consistent behavior in the top-most prediction slots, boosting especially the @1 performance of the ensemble.

Table 1: Results on the English language test set of the TSAR-2022 shared task, ranked by ACC@1 scores. Listed are our own results (Ensemble and Single), the two best-performing competing systems (MANTIS and UoM&MMU), as well as provided baselines (LSBert (Qiang et al., 2020) and TUNER (Ferrés et al., 2017)).
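In code, this rank-based combination can be sketched as follows; the per-prompt weights default to 1 here and stand in for the scaling parameters, and absent suggestions simply contribute nothing, mirroring a rank of infinity:

```python
from typing import Optional

def combine(rankings: list[list[str]],
            weights: Optional[list[float]] = None) -> list[str]:
    """Combines ranked suggestion lists S_p from several prompts into a
    single ranking by summing w_p / rank_{S_p}(s) over all prompts.
    Suggestions missing from a list contribute 0 for that prompt."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: dict[str, float] = {}
    for w_p, ranking in zip(weights, rankings):
        for rank, s in enumerate(ranking, start=1):
            scores[s] = scores.get(s, 0.0) + w_p / rank
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)
```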
Results and Limitations

Results for English
For the official runs, we initially only submitted predictions for the English subset; an excerpt of the results can be seen in Table 1. While the zero-shot single-prompt run achieves consistently better results on most metrics, it does not outperform all systems for large candidate sets; e.g., its Potential@10 is lower than that of competing approaches, including the LSBert baseline. We attribute this to the previously mentioned issue of filtering predictions, and see a consequent improvement, especially for larger k, from the proposed ensemble method. Here, the Potential@10 scores indicate that at least one viable prediction is present in all but three samples.

Results for Spanish and Portuguese
Given the surprisingly good results on the English subset, we decided to extend our experiments to the Spanish and Portuguese tracks as well. Transferring the prompts to Spanish or Portuguese is surprisingly simple. We alter the prompt to: "Given the above context, list ten alternative Spanish words for 'complex_word' that are easier to understand." (bold highlight indicates the change). Without this adaptation, returned suggestions generally tend to be in English, which could be an attractive opportunity to mine cross-lingual simplifications in future work. By adding the output language explicitly, we ensure that the suggestions match the expected results. For Portuguese, the prompt can be adapted accordingly. We find that our system also outperforms all competing submitted approaches in the shared task; result comparisons can be found in Tables 4 and 5 in the Appendix, respectively. Notably, predictions for Portuguese perform slightly better, which goes against intuition, given that Spanish is usually a highly represented language in multilingual corpora. We suspect that a more literal wording of synonyms in Portuguese, compared to multi-word expressions in Spanish, could be the cause.
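The language transfer thus amounts to a one-word change in the template. A sketch, parameterizing the quoted prompt wording by output language (the surrounding Context:/Question:/Answer: framing is assumed, not verbatim from Table 3):

```python
def build_language_prompt(sentence: str, complex_word: str,
                          language: str) -> str:
    """Contextual prompt that names the output language explicitly;
    without it, suggestions tend to come back in English."""
    return (
        f"Context: {sentence}\n"
        f"Question: Given the above context, list ten alternative "
        f'{language} words for "{complex_word}" that are easier to understand.\n'
        f"Answer:"
    )
```

Calling it with language="Spanish" or language="Portuguese" reproduces the adapted prompts for the two tracks.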

Error Analysis
As is common for sequence-to-sequence tasks, crafting an approach centered around an LM requires consideration of the particular challenges that arise. We detail some of the errors we have encountered in our predictions that are unlikely to appear in more stringently designed pipelines. Instances of particular failure cases can be found in Table 2.
Unstable Prompts One of the primary challenges, particularly in zero-shot prompt settings, is the unreasonable variance observed in results based on even slightly altered prompt templates. For example, when removing the explicit mention of Context:, Question:, and Answer: in the prompt template, the model frequently predicts fewer than the ten requested answers. Practical limitations in our computational budget also mean that we have no guarantee that these prompts yield the best possible results; given the variability, multiple runs should be compared to establish a thorough pattern for a "best" prompt.
Lack of Context In instances with longer (or more subtly enforced) context cues, these hints are sometimes not properly recognized. In Table 2, we can see the model changing the term "collision" to a particular mode of transportation, such as "car crash", even though an explicit context clue is given through the word "flight" in the original sentence.
Enforcing Language While the transfer to Spanish and Portuguese is largely successful, the model's capabilities still seem limited when it comes to maintaining the target language across all samples. For instances with particularly rare complex terms, the predictions are sometimes still in English, despite the specific prompt request to return Spanish/Portuguese results.

Error Type | Context (complex word in quotes) | Model Predictions
Lack of Context | Despite the fog, other flights are reported to have landed safely leading up to the "collision". | car crash, train wreck, ...
Hallucinations | The larva grows to about 120-130 mm, and "pupates" in an underground chamber. | Transforms into a pupa, ...
Language | [...] propiciado la "decadencia" de la Revolución francesa. | decline, deterioration, ...
Hallucinations The necessity for post-filtering of suggestions stems largely from the spontaneous occurrence of hallucinations in responses. While hallucinations in vLLMs are less about invalid vocabulary terms, we observe instances where unnecessary multi-word suggestions were chosen over a simple synonymous single-word expression, or random inflections (such as the infinitive form with an additional "to") were generated. Similar to the issues with language enforcement, this occurs more frequently with particularly complex words; in this sense, the system conversely fails at the instances that are most in need of simplification. However, we note that some of the generated multi-word expressions are actually more helpful for understanding, even though the generations do not precisely match the expected outputs.

Computational Limitations
Running a vLLM in practice, even in inference-only settings, is non-trivial and requires compute resources that are far beyond many public institutions' hardware budgets. For the largest models with publicly available checkpoints, a total of around 325GB of GPU memory is required, assuming efficient storage in bfloat16 or a similar precision level. The common alternative is to obtain predictions through a (generally paid) API, as was the case in this work. Especially for the ensemble model, which issues six individual requests to the API per sample, this can further bloat the net cost of a single prediction. To put the total cost in context, we incurred a charge of slightly over $7 for computing predictions across the entire test set of 373 English samples, which comes out to about 1,000 tokens per sample, or around $0.02 per sample at the current OpenAI pricing scheme. For the Spanish subset and language-dependent prompt development, the total cost came to about $10, primarily due to longer sample contexts. Costs for Portuguese processing were around $6.50. While the singular prompt approach is cheaper at around 1/6 of the total cost, even then a continuously deployed model has to be supplied with a large enough budget. Aside from monetary concerns, environmental impacts also have to be considered for larger-scale deployments of this kind (Lacoste et al., 2019).
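The reported figures can be sanity-checked with simple arithmetic:

```python
def estimate_cost(n_samples: int, tokens_per_sample: int,
                  usd_per_1k_tokens: float) -> float:
    """Back-of-the-envelope API cost estimate for a full test set."""
    return n_samples * tokens_per_sample / 1000 * usd_per_1k_tokens

# 373 English samples at roughly 1,000 tokens each and $0.02 per
# 1K tokens comes out to about $7.46, matching the ~$7 reported above.
english_cost = estimate_cost(373, 1000, 0.02)
```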

Conclusion and Future Work
Utilizing prompted responses from vLLMs seems to be a promising direction for lexical simplification; particularly in the constrained setting with pre-identified complex words, the model performs exceptionally well, even when presented with a severely restricted budget of labeled training data.
While the approach also offers promising directions for multi- and cross-lingual settings, obtaining state-of-the-art results in other languages, we are faced with a prohibitive amount of computation per sample instance. It would therefore be an interesting addition to deal with resource-constrained systems, putting the prediction power into a slightly different perspective. Finally, we are reminded of the unstable nature of neural LMs; given similar inputs, quality can vary greatly between samples, including a complete breakdown in performance.
For future work, we are considering approaches to generate static resources from vLLMs (Schick and Schütze, 2021), which may require only a one-time commitment to spending on datasets, which can then be used as training data for cheaper systems. Exploration of prompt tuning approaches for the automated search of suitable prompt templates would also greatly accelerate the development process of domain-specific applications (Lester et al., 2021).

A Prompt Templates
Table 3 provides the exact prompt templates used in the submission. Notably, the zero-shot with context prompt is included twice, but with different generation temperatures; with this, we increase the likelihood of strong candidates being retained. For few-shot prompts, we have taken samples from the previously published trial set for the respective language. In instances where fewer than ten distinct suggestions were provided by annotators, we manually extended the list of examples to exactly ten results based on our own judgment. For instances with more provided suggestions, we limit ourselves to the ten most frequently occurring ones. The reason for this is that GPT-3 otherwise tended to return an inconsistent number of suggestions in our preliminary testing. The exact prompts for the Spanish and Portuguese runs can be found in our code repository.

B Hyperparameters
We use the OpenAI Python package, version 0.23.0, for our experiments. For generation, the function openai.Completion.create() is used, where most hyperparameters remain fixed across all prompts. We explicitly list below those hyperparameters that differ from their respective default values.
1. model="text-davinci-002", which is the latest and biggest available model for text completion.
2. max_tokens=256, to ensure sufficient room for generated outputs. In practice, most completions are vastly below the limit.
3. Repetition penalties (presence_penalty and frequency_penalty), set well below the maximum (values can range from -2 to 2), since individual subword tokens might indeed be present several times across multiple (valid) predictions. A more detailed explanation can be found in the OpenAI documentation.

Outside of the repetition penalties, the most influential parameter choice for generation is the sampling temperature. We generally take a more measured approach than the default (temperature=1.0), but vary the temperature across our ensemble prompts to ensure a more diverse result set overall. We list the used temperatures in Table 3. Zero-shot with context is used twice in the ensemble, once with a more conservative temperature and once with a more "creative" (higher) temperature. For the singular prompt run, we use the conservative zero-shot with context variant.
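Put together, a generation call looks roughly as follows; the penalty values shown here are illustrative placeholders, not the exact settings from our submission, and the temperatures used per prompt are the ones listed in Table 3:

```python
def completion_kwargs(prompt: str, temperature: float) -> dict:
    """Keyword arguments for openai.Completion.create(); penalty
    values are placeholders, not the submitted settings."""
    return {
        "model": "text-davinci-002",
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": temperature,   # varied across ensemble prompts
        "presence_penalty": 0.5,      # placeholder value
        "frequency_penalty": 0.5,     # placeholder value
    }

# Usage (requires the openai package and an API key):
# response = openai.Completion.create(**completion_kwargs(prompt, 0.3))
```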

C Post-Filtering Operations
Given the uncertain nature of predictions by a language model, we employ a series of post-filtering steps to ensure high-quality outputs. This includes stripping newlines/spaces/punctuation (\n :;.?!), lower-casing, removing infinitive forms (in some instances, we observed predictions in the form of "to deploy" instead of simply "deploy"), as well as removing identity predictions (i.e., the prediction being the same as the original complex word) and deduplicating suggestions. Additionally, we noticed that for some instances, generated synonyms resemble a "description" rather than a truly synonymous expression (example: "people that are crazy" as a suggestion for "maniacs"). Given the nature of the provided data, we removed extreme multi-word expressions (for English, any suggestion with more than two words; for Spanish and Portuguese, more than three words in a single expression).
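The steps above can be sketched as a single filtering pass; the word-count cutoff is a parameter, with the English setting of two words as the default here:

```python
def post_filter(suggestions: list[str], complex_word: str,
                max_words: int = 2) -> list[str]:
    """Applies the post-filtering steps described above: strip
    surrounding whitespace/punctuation, lower-case, drop a leading
    "to " (infinitive forms), remove identity predictions, drop long
    multi-word expressions, and deduplicate preserving order."""
    seen, kept = set(), []
    for s in suggestions:
        s = s.strip("\n :;.?!").lower()
        if s.startswith("to "):
            s = s[3:]
        if not s or s == complex_word.lower():
            continue  # empty or identity prediction
        if len(s.split()) > max_words:
            continue  # "description"-style multi-word expression
        if s not in seen:
            seen.add(s)
            kept.append(s)
    return kept
```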

Table 2 :
Instances of observed failure classes in our system's predictions.

Table 4 :
Results on the Spanish language test set of the TSAR-2022 shared task, ranked by ACC@1 scores.

Table 5 :
Results on the Portuguese language test set of the TSAR-2022 shared task, ranked by ACC@1 scores.