InstructPTS: Instruction-Tuning LLMs for Product Title Summarization

E-commerce product catalogs contain billions of items. Most products have lengthy titles, as sellers pack them with product attributes to improve retrieval and highlight key product aspects. This results in a gap between such unnatural product titles and how customers refer to them. It also limits how e-commerce stores can use these seller-provided titles for recommendation, QA, or review summarization. Inspired by recent work on instruction-tuned LLMs, we present InstructPTS, a controllable approach for the task of Product Title Summarization (PTS). Trained using a novel instruction fine-tuning strategy, our approach is able to summarize product titles according to various criteria (e.g., number of words in a summary, inclusion of specific phrases, etc.). Extensive evaluation on a real-world e-commerce catalog shows that, compared to simple fine-tuning of LLMs, our proposed approach can generate more accurate product name summaries, with improvements of over 14 BLEU and 8 ROUGE points, respectively.

1 Introduction

E-commerce product catalogs (e.g., Amazon, Walmart) contain billions of products with lengthy names: 65% of product titles have more than 15 words (Rozen et al., 2021). This is due to sellers overloading titles with extra information about product functionality, colors, sizes, and more, in order to maximize their search rankings for as many queries as possible and to captivate customers.
However, this can lead to poor experiences when these titles need to be used in other contexts such as being read aloud by voice assistants, referenced in narrative text such as product summaries, or rendered in text interfaces with limited display sizes.
This has resulted in the practical task of Product Title Summarization (PTS), which aims to extract a natural representation corresponding to how humans would refer to the product (Sun et al., 2018). As shown by the example in Figure 1, these summarized titles can then be used in other tasks like voice assistant speech, product QA, summarization, recommendation, and query understanding.
Most work thus far has used traditional abstractive and extractive summarization methods to create a single summary. Inspired by recent advances in Large Language Models (LLMs) and instruction-tuning, we present InstructPTS, the first PTS approach to use instruction fine-tuning (IFT) of LLMs to achieve controllable title summarization across different dimensions, such as: (i) desired length, (ii) presence of specific words (e.g., brands, size, etc.), and (iii) summary specificity. Figure 2 shows the supported instructions, which capture various requirements and are automatically generated from a parallel dataset of original product titles and summaries. A key advantage of InstructPTS is that it allows us to utilize a single model for generating multiple titles for different downstream tasks.
Evaluation on a leading real-world e-commerce catalog shows that our InstructPTS approach generates accurate summaries and has high instruction-following capability. Furthermore, the generated summaries are judged by humans as being highly relevant and as capturing the most salient words from the original title. Finally, extrinsic evaluation using a retrieval system shows that the summarized titles retain sufficient unique characteristics of the product to retrieve it with high accuracy.

[Figure 2] Item Name: "Blade Tail Rotor Hub Set B450 330X Fusion 270 BLH1669 Replacement Helicopter Parts"
• Summarize {Item_Name} to contain at most 3 words → "Blade Rotor Hub"
• Summarize {Item_Name} with Low specificity and to contain the words "B450 330X" → "Rotor Hub Set B450 330X"
• Summarize {Item_Name} with Low specificity → "Rotor Hub Set"

Related Work
PTS falls within the broader domain of text summarization techniques (El-Kassas et al., 2021). Both extractive and abstractive summarization approaches have been applied to PTS. For example, Wang et al. (2018) propose a multi-task learning framework, where one network summarizes the product name while another learns to generate search queries. Sun et al. (2018) propose a multi-source pointer network to generate short product names from longer input names and background knowledge. Gong et al. (2019) developed an enhanced feature extraction approach to generate short product names by incorporating external word frequency information and named entities as additional features. A different approach, based on Generative Adversarial Networks that encode multi-modality features (such as product images and attribute tags), is presented by Zhang et al. (2019). Xiao and Munro (2019) adopt Bi-LSTMs to extract key words for product name summaries. Subsequently, Mukherjee et al. (2020) tackled the vocabulary mismatch problem by integrating pretrained embeddings with trainable character-level embeddings as inputs to Bi-LSTMs. An adversarial generation model that can generate personalized short names is proposed by Wang et al. (2020).
Our approach differs from prior work in two aspects. Firstly, previous studies primarily focused on generating a single product name summary, which may not cater to the diverse use cases in e-commerce applications. In contrast, our approach offers the flexibility to generate diverse summary types (e.g., a specific number of words, a specific summary specificity, etc.). Secondly, drawing inspiration from the recent success of LLMs (Ouyang et al., 2022; Longpre et al., 2023), we are the first to propose an instruction-based approach for PTS.

InstructPTS Approach
We now outline our proposed InstructPTS approach: we describe the base model, and provide details about the instruction fine-tuning.

Base Model
The base model for InstructPTS is FLAN-T5 (Chung et al., 2022), an LLM pre-trained on a large set of instruction fine-tuning tasks. We opt for this LLM family given that its models are well suited for instruction fine-tuning (IFT) on our task. We experiment with different model sizes (cf. §4.2), and compare the advantage of IFT over other training strategies.

Ground Truth Dataset
We use a parallel dataset of original product title and summary pairs. The summaries come in two specificity levels, Low and Medium, which control how descriptive the summary is w.r.t. the original title. Low summaries are short (approx. 2 words, SD=±1) and typically do not include brand or other product details, but instead focus on a highly abstract description of the product family. Medium summaries are longer (approx. 4 words, SD=±1.4) and contain brand/model names and aspects that identify the specific product. This gold data is generated using a hybrid approach: a sequence tagger chunks words that need to be included in the summary, and human annotators accept/reject the tagger's decisions, or rewrite the summary entirely. This is an extractive process; the summaries only contain words that appear in the original product title.
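Because the gold data is extractive, its defining property can be checked mechanically. The sketch below (with illustrative titles, not actual catalog data, and a naive whitespace tokenizer of our own choosing) verifies that every summary word occurs in the original title:

```python
# Hypothetical check mirroring the extractive property of the gold data:
# a summary is valid only if all of its words appear in the original title.

def is_extractive(title: str, summary: str) -> bool:
    """Return True if every summary token occurs in the title (case-insensitive)."""
    title_tokens = set(title.lower().split())
    return all(tok in title_tokens for tok in summary.lower().split())

title = "Blade Tail Rotor Hub Set B450 330X Fusion 270 BLH1669 Replacement Helicopter Parts"
print(is_extractive(title, "Blade Rotor Hub"))      # True
print(is_extractive(title, "Toy Helicopter Kit"))   # False
```

A real pipeline would need catalog-consistent tokenization, but the containment check itself is this simple.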
The data is split into train/dev/test sets with 100k/10k/1M product titles, respectively. Summaries of Medium specificity make up 58% of the data; the remaining 42% are of Low specificity. The same products can have both levels, but not always.

Instruction Fine-Tuning
LLM instruction fine-tuning (Ouyang et al., 2022) has been shown to improve generalizability, allowing LLMs to perform better on tasks defined using natural language. IFT allows LLMs to flexibly encode various constraints defined in natural language, enabling robust and controllable performance. We follow a similar approach for generating product name summaries, and fine-tune FLAN-T5 models using instructions that are generated automatically from our parallel dataset of input product names and their corresponding summaries (cf. §3.2). Table 1 shows the instructions used for fine-tuning InstructPTS, as well as for generating product name summaries.
Using the product "Massage Orthopedic Puzzle Floor Mat for Kids Flat Feet Prevention Sea Theme 6 Elements" as a running example, we describe in detail the instructions and the way they are constructed.
Specificity Level Constraints. Instructions 1-2 in Table 1 allow InstructPTS to generate summaries according to the specificity levels introduced in §3.2. These Low and Medium levels allow the model to dynamically determine the summary length based on the desired specificity. Depending on the original title, Low specificity can yield summaries of slightly different lengths for different products.
Our training data has different levels for the same input, which helps the model learn which words are important for each specificity.
Word Count. This instruction allows the model to generate summaries that contain up to a certain number of words. The training instruction is constructed automatically: for a product name and its ground-truth summary with k words, we generate an instruction whose target word count is k′ = k + Δ (Δ is a random integer, 0 ≤ Δ ≤ 3, where k > 3). For instance, in the example below, the ground-truth summary contains 3 words; however, the instruction contains the constraint "at most 5 words". This allows the model to flexibly use 5 words or fewer as it sees fit, because sometimes the most coherent summary may use fewer words due to the presence of multi-word phrases.
Summarize {Item Name} to contain at most 5 words. → Orthopedic Floor Mat

Instructions 3-4 in Table 1 show how the same name is summarized with 1 and 4 words. The choice of words is determined automatically by the InstructPTS model, allowing it to automatically pick the most salient words from the product name.
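The construction above can be sketched as follows. The function name and the exact template string are our own illustrative choices, not the paper's implementation:

```python
import random
import re

def word_count_instruction(title: str, summary: str, max_delta: int = 3):
    """Build an (instruction, target) training pair for the word-count constraint.

    The stated bound is k' = k + delta, with a random 0 <= delta <= max_delta,
    so the model learns that the bound is a maximum, not an exact length.
    """
    k = len(summary.split())
    delta = random.randint(0, max_delta)
    instruction = f'Summarize "{title}" to contain at most {k + delta} words'
    return instruction, summary

instr, target = word_count_instruction(
    "Massage Orthopedic Puzzle Floor Mat for Kids Flat Feet Prevention Sea Theme 6 Elements",
    "Orthopedic Floor Mat",
)
```

Since Δ is sampled per training example, the same title/summary pair can yield several different word-count instructions over training.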
Phrase Inclusion. In real-world settings, depending on the context, certain words may be required in the summary (e.g., brand, size, color). We automatically construct instructions from the parallel dataset by randomly choosing a word or a sequence of words from the ground-truth summary. This allows InstructPTS to learn how to incorporate specific phrases in the resulting summary. We evaluate the instruction-following accuracy in §5.
Summarize {Item Name} with Low specificity and to contain the words "Orthopedic". → Orthopedic Mat

Instructions 5-6 in Table 1 show how the desired words are encoded in conjunction with categorical constraints. This allows the model to generate summaries of different specificity, and additionally enforce the inclusion of desired phrases.
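Generating such phrase-inclusion instructions from the parallel data can be sketched like this; sampling a contiguous span from the gold summary is our own illustrative reading of "a word or a sequence of words":

```python
import random

def phrase_inclusion_instruction(title: str, summary: str,
                                 specificity: str = "Low") -> str:
    """Require a randomly chosen contiguous span of the gold summary
    to appear in the output (cf. instructions 5-6 in Table 1)."""
    words = summary.split()
    i = random.randrange(len(words))
    j = random.randrange(i, len(words))          # span end, at or after i
    phrase = " ".join(words[i : j + 1])
    return (f'Summarize "{title}" with {specificity} specificity '
            f'and to contain the words "{phrase}"')

instr = phrase_inclusion_instruction(
    "Blade Tail Rotor Hub Set B450 330X Fusion 270 BLH1669 Replacement Helicopter Parts",
    "Rotor Hub Set B450 330X",
)
```

Because the span is drawn from the ground-truth summary, the constraint is always satisfiable by the gold target during training.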
Deletion of k Words. Instructions 7-8 in Table 1 allow deleting up to k words. This represents the reverse case of the instructions that allow the model to output summaries of specific lengths. The number of words to delete is inferred automatically from the ground-truth product name summary, to which we additionally add a random integer 0 ≤ Δ ≤ 3.

Automated Evaluation: For specificity constraints, we adopt BLEU and ROUGE metrics to automatically measure summary quality and alignment with the ground truth. For the other instructions, we compute the instruction-following accuracy of InstructPTS, where we only assess whether the model follows the constraints encoded in the instruction. This verifies that the summary has the desired word count, or includes a specific phrase.
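Instruction-following accuracy of this kind reduces to simple per-summary checks. The helper names below are illustrative, not from the paper's codebase:

```python
def follows_word_count(summary: str, max_words: int) -> bool:
    """Does the summary respect the 'at most k words' constraint?"""
    return len(summary.split()) <= max_words

def follows_phrase_inclusion(summary: str, phrase: str) -> bool:
    """Does the summary contain the required phrase (case-insensitive)?"""
    return phrase.lower() in summary.lower()

def instruction_accuracy(pairs, check) -> float:
    """Fraction of (summary, constraint) pairs whose constraint is satisfied."""
    return sum(check(s, c) for s, c in pairs) / len(pairs)

acc = instruction_accuracy(
    [("Orthopedic Floor Mat", 5),   # 3 words <= 5: satisfied
     ("Blade Rotor Hub Set", 3)],   # 4 words  > 3: violated
    follows_word_count,
)  # -> 0.5
```

Note that these checks assess only constraint satisfaction, not summary quality, which is exactly the separation used in the evaluation.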
Human and Extrinsic Evaluation: We conduct human evaluation to assess summary quality ( §6), and assess summary fidelity using retrieval ( §7).
FLAN-T5-SFT: we perform supervised fine-tuning of FLAN-T5 models, with the input being the original product name and the output being the ground-truth summary. This baseline is not controllable (e.g., w.r.t. specificity or number of words).

FLAN-T5-CC:
We use Control Codes (CC) (Keskar et al., 2019) to guide summary generation. Each CC corresponds to a specific summarization instruction, enabling controllable summarization capabilities. We use the following CCs: (i) Low </s> {Item Name}, and (ii) Medium </s> {Item Name}.
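To make the contrast with InstructPTS concrete, a minimal sketch of the two input formats (the CC format follows the templates above; the instruction template is a paraphrase of Table 1):

```python
def control_code_input(item_name: str, level: str) -> str:
    """FLAN-T5-CC baseline: a categorical control code prepended with </s>."""
    return f"{level} </s> {item_name}"

def instruction_input(item_name: str, level: str) -> str:
    """InstructPTS: the same constraint expressed in natural language."""
    return f"Summarize {item_name} with {level} specificity"

print(control_code_input("Ceramic Golden Swan Vase Decor", "Low"))
# Low </s> Ceramic Golden Swan Vase Decor
```

The CC format can only express the fixed categories it was trained with, whereas the natural-language form composes with further constraints (word counts, phrase inclusion).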
Training details: please see Appendix D for a detailed description of the training setup.

Automatic Evaluation Results
Table 2 shows the automated evaluation results on the 1M-title test set. We compare different FLAN-T5 model sizes and the impact of the different training strategies. Output examples from InstructPTS are shown in Appendix A.
Text Generation Performance: A consistent pattern is that as model size increases, so do the BLEU and ROUGE scores. For instance, FLAN-T5-XL improves by roughly 5 BLEU-1 points over FLAN-T5-BASE (for all strategies). We note a similar trend for ROUGE-L.

Impact of Training Strategy:
Training strategy has a significant impact. For the same model size, InstructPTS models obtain the best performance; e.g., InstructPTS with FLAN-T5-XL obtains an improvement of 13.3 BLEU-1 points over the SFT and CC models. Finally, we note a convergence between CC and SFT for the FLAN-T5-XL models, with near-identical performance. Our results show the advantages of instruction tuning for PTS.
Instruction Following: Table 3 shows the instruction-following accuracy for different InstructPTS models, where we measure whether the summary contains the desired number of words specified in the first instruction (I#1) or includes a specific phrase as specified in the second instruction (I#2) from Table 1. We find that the accuracy is significantly impacted by model size. FLAN-T5-XL obtains the highest instruction-following accuracy among the FLAN-T5 models.
Summary Length: Table 4 shows the mean title length (number of words) and standard deviation for summarized titles generated for different summary types using InstructPTS (FLAN-T5-XL) on the entire test set. For specific word counts, we find that the model generally respects the maximum length imposed in the instruction. The categorical constraints show more variance than the specific word counts, and Medium summaries have an average length of 3.80 ±1.28 words.
Compression Ratio: We also analyzed the compression ratios for Low and Medium summaries based on character length. Results show high string compression ratios of 11:1 for Low and 5:1 for Medium summaries. We also observed that the compression ratio varies by product category, as shown in Appendix C.
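The character-level compression ratio used here is a simple quotient; a minimal sketch with synthetic strings (not actual catalog titles):

```python
def compression_ratio(title: str, summary: str) -> float:
    """Character-length ratio of the original title to its summary."""
    return len(title) / max(len(summary), 1)  # guard against empty summaries

# e.g., a 66-character title compressed to a 6-character Low summary -> 11:1
ratio = compression_ratio("x" * 66, "y" * 6)  # 11.0
```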

Human Evaluation Study
To address the known limitations of automatic summarization evaluation, we perform a human study.
We aim to answer the following questions: H1: In a pairwise comparison, which model generates better product name summaries?
H2: Are the generated summaries valid?
H3: What is the preferred summary length by humans for a given product name?

Data: Evaluations are carried out on a sample of 10 popular product types (e.g., Electronics). For each product type we randomly sample 10 products and generate summary titles. The detailed evaluation setup is provided in Appendix B.

H1: Pairwise Summary Comparison
We compare the two best-performing models, InstructPTS and CC, both using FLAN-T5-XL. For the same 100 product titles, we randomly generate either Low or Medium titles, and ask the annotators to choose their preferred summary. To avoid position bias, the summaries are ordered randomly. InstructPTS was preferred by the annotators in 55% of the cases, while the FLAN-T5-XL-CC model was preferred in 29%. In 12% of cases the annotators judged both summaries as equally good, while in 4% of the cases neither title was preferred. Finally, Cohen's inter-rater agreement between the two annotators was substantial, with κ = 0.61.
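The Cohen's κ reported here and in the following studies can be computed directly from the two annotators' labels; a self-contained sketch with toy labels for illustration:

```python
def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two label distributions.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labelling four summary pairs.
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])  # 0.5
```

Values above 0.6 are conventionally read as substantial agreement, which is how the paper characterizes κ = 0.61.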

H2: Validity of the Generated Summaries
Having established that InstructPTS generates the best summaries, two annotators judge whether the summaries are valid. A summary is valid if it is coherent and can be used to identify at least the type of the original product. We generate 7 different summary types per product. Table 5 shows the types and their validity scores. On this sample of 700 titles, Cohen's inter-rater agreement was substantial (κ = 0.69). The lowest scores are obtained by short summaries; the reason is that most products require two or more words for a summary to be meaningful w.r.t. the original product name and to identify the original product. The highest scores are achieved for summaries of Medium specificity and those with 5 Words.

H3: Preferred Summary Length
In this study, we aim to better understand human preferences w.r.t. summary length for the different product categories. This can help determine the summary types InstructPTS should generate for different categories.
Table 6 shows the results in terms of length preferences by human annotators. We omit summaries that were deemed not meaningful by the annotators (about 19%). The summaries are generated using InstructPTS with the FLAN-T5-XL model. We find moderate agreement between annotators, with a Cohen's inter-rater agreement of κ = 0.51.
Across the different product categories, the preferences vary. For instance, for BEAUTY, the preferred summaries are longer, with 5 words. This is intuitive given the large variety of beauty products and brands. On the other hand, for FURNITURE, we see that the ideal summary length is 2 words. Such products, in most cases, can be easily summarized with few words, e.g., "TV Stand".
This study shows that ideal title summarization requires different lengths for different product categories. Our proposed InstructPTS model can robustly summarize products of any type using either Low or Medium summary specificity, which have variable summary lengths across product categories. Additionally, we can encode various constraints in terms of phrase inclusion in the summary. In 82% of cases, Low summaries contain up to two words. Medium summaries, on the other hand, have more than three words in 78% of cases, with 57% having between 3 and 4 words. If we inspect the human preferences for summary length in Table 6, we note that human annotators tend to prefer summaries of 3-5 words, which are similar in length to Medium summaries.

Extrinsic Evaluation with Retrieval
We have shown that InstructPTS can robustly summarize titles, following instructions for length and phrasal inclusion (cf. §3). To assess the fidelity of the summarized titles, we perform a retrieval-based extrinsic evaluation to determine how well the original products can be retrieved using the summary titles. We hypothesize that a good summary will retain enough of the unique characteristics of the original product to be able to retrieve it. Additionally, this evaluation analyzes the trade-offs between summary length and the ranking metrics of a target product under consideration.

Setup: We use a catalog of 5M products as our testbed. The product titles are summarized using InstructPTS (FLAN-T5-XL) with different instructions. The summary titles are then used as queries to retrieve the top-k products from the catalog index using the BM25 algorithm. We also use the original title as an upper bound.
Evaluation: Evaluation is performed with standard IR metrics, Mean Reciprocal Rank (MRR) and Hit@k. Higher values indicate that the summary retains more distinguishing information from the original product title.
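Given the rank of the ground-truth product for each summary query, both metrics reduce to a few lines. This is a generic sketch, not the paper's evaluation code; `None` marks queries where the product was not retrieved at all:

```python
def mrr_and_hit_at_k(ranks, k=20):
    """Compute MRR and Hit@k.

    ranks: 1-based rank of the ground-truth product per query,
    or None if it did not appear in the result list.
    """
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    hit_k = sum(1 for r in ranks if r is not None and r <= k) / n
    return mrr, hit_k

# Toy example: three queries; product ranked 1st, 2nd, and not found.
mrr, hit20 = mrr_and_hit_at_k([1, 2, None])  # (0.5, 2/3)
```

Under this definition, an MRR near 0.4 corresponds to the ground-truth product typically sitting around rank 2-3, matching the interpretation given below.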
Results: Table 7 shows the ranking scores of the different summary types, based on a stratified sample of 100 products from over 800 different product categories (see Appendix C for more details). Intuitively, longer summaries obtain higher ranking scores than shorter summaries, since shorter summaries lose more information, leading to decreased ranking accuracy. Among all instructions, Medium achieves the best ranking scores. As shown in Table 4, Medium summaries are, on average, even longer than 5 Words summaries.
The MRR of 0.398 indicates that, on average, the ground-truth product is ranked between the 2nd and 3rd positions. Furthermore, the Hit@20 score of 0.641 shows that in 64.1% of cases the ground-truth product is featured among the top 20 results. This study shows that our summaries retain key aspects that help identify the product in a set of 5M. It also provides guidance on how much titles can be compressed.

Online Deployment
InstructPTS has been used in a leading global e-commerce service for various downstream shopping tasks. It can be applied to various content generation tasks related to product summarization, comparison, question suggestion, and review summarization. A 4k sample of generated content with embedded product titles from InstructPTS was evaluated for quality, and 96% was found to meet the validity criteria.

Conclusion
We presented InstructPTS, a new approach for Product Title Summarization, and demonstrated the effectiveness of instruction-tuning for this task.
Through IFT we can train a highly accurate and controllable model for generating various types of summaries. Empirical studies using automatic and human evaluation showed that model size has a significant impact on generating reliable and meaningful summaries, while also ensuring the model's ability to follow the requirements specified in the instructions. InstructPTS has been deployed in systems where product titles from a billion-scale catalog are summarized for various downstream applications, such as question answering and summarization. Future work will focus on more fine-grained instructions targeting higher levels of specificity, and on support for handling constraints based on brands/sizes/colors.

Limitations and Future Work
Our proposed approach has some limitations that we aim to address in future work. Namely, although the generated summaries are meaningful and of high quality, they are constructed independently from their downstream applications. This leaves open the question of whether the words most salient for a given application are the ones incorporated in a summary. For instance, for product retrievability, we aim to investigate whether the words to be incorporated in a summary can be chosen by the BM25 ranking method, such that the words with the highest discriminative power are included in the summary. We aim to do this in an end-to-end fashion, where retrievability serves as a critic to the InstructPTS approach, providing feedback on how to change the output summary.
Finally, we also aim to investigate the challenges in summarizing product names in conversational scenarios, where the requirements for product summaries change with every conversation turn.

Figure 1 :
Figure 1: Example of how an original product title is reformulated by InstructPTS for different applications.

Figure 2 :
Figure 2: A sample of product title summaries generated by InstructPTS for different instructions.

Table 1 :
Different instructions used by InstructPTS to generate product title summaries. Each instruction has different requirements that must be satisfied in the generated summary. Example rows: 4 — Summarize {Item Name} to contain at most 4 words → Ceramic Golden Swan Vase; 5 — Summarize {Item Name} with Low specificity and to contain the words "Xbox Series S".

Table 2 :
Text generation performance as measured by BLEU and ROUGE metrics for the different training strategies and FLAN-T5 model sizes. In the case of CC and InstructPTS, we can generate summaries according to the categorical constraints as in the ground truth (either Low or Medium), while for SFT we can only generate a single summary, which is compared against its ground-truth counterpart (either Low or Medium).

Table 3 :
Instruction following accuracy for the different InstructPTS base models using instruction fine-tuning.

Table 6 :
Summary preferences across product categories. Annotators pick their preferred summaries for a sample of 10 product names per product category.

Table 7 :
Ranking results for summaries generated by InstructPTS (FLAN-T5-XL).The first row is the upper bound, with the original product title used as a query.

Table 12 :
MRR scores and compression ratios (CR) for different product categories, with columns Product Category, MRR (Low), MRR (Medium), CR (Low), and CR (Medium). The order of product categories is determined by Eq. 1 in descending order.