How well do you know your summarization datasets?

State-of-the-art summarization systems are trained and evaluated on massive datasets scraped from the web. Despite their prevalence, we know very little about the underlying characteristics (data noise, summarization complexity, etc.) of these datasets, and how these affect system performance and the reliability of automatic metrics like ROUGE. In this study, we manually analyze 600 samples from three popular summarization datasets. Our study is driven by a six-class typology which captures different noise types (missing facts, entities) and degrees of summarization difficulty (extractive, abstractive). We follow with a thorough analysis of 27 state-of-the-art summarization models and 5 popular metrics, and report our key insights: (1) Datasets have distinct data quality and complexity distributions, which can be traced back to their collection process. (2) The performance of models and the reliability of metrics are dependent on sample complexity. (3) Faithful summaries often receive low scores because of the poor diversity of references. We release the code, annotated data and model outputs.


Data Noise
Little is known about the noise in these datasets. In the context of text summarization, noise could be an incomplete or irrelevant reference. At the moment, its quantity and its impact on performance are unknown.
Summarization Complexity
What do we really know about the nature of samples in the dataset? Gigaword is a headline-generation dataset with short sources and references. Does this imply a higher volume of simpler (i.e. more extractive) samples? The degree of summarization complexity, and its impact on model performance, is unknown.
Exploring these open questions is critical for two reasons: (1) Information about the noise could lead to more informed data collection and preprocessing methods: in a recent study, Kryscinski et al. (2019) quantified HTML artefacts in popular summarization datasets, and proposed ways to detect and remove them.
(2) Awareness about the complexity could better explain model performance, metrics, and even lead to new model architectures. In the tasks of machine comprehension and question answering, Chen et al. (2016) and Yatskar (2019) manually inspected random samples and drew insights which led to new state-of-the-art models. Such analysis could also help researchers choose datasets and metrics more carefully.
In this study, we perform intrinsic and model-centric evaluation of three popular summarization datasets (Gigaword, CNN/DM and XSum). We are interested in answering the following questions: Q1. What are the underlying intrinsic properties of summarization datasets? We are interested in (1) identifying and quantifying the different types of "noise" that could occur and could penalize models, and (2) whether samples have different levels of difficulty. Armed with this, we ask the following questions.
Q2 a. How do these properties impact model performance? Specifically, we'd like to know (1) If, and how, the performance varies across the different types of samples discovered from Q1. (2) If the performance is consistent across metrics.
Q2 b. Does the reliability of metrics change with these properties? This is motivated (in part) by prior metric-analysis studies, where researchers have explored inter-metric agreement and alignment with human judgement under different conditions (Peyrard, 2019; Bhandari et al., 2020). Here we are more interested in knowing whether the metrics are more correlated with human judgement for simpler samples than for complex ones.
Large-scale automatic intrinsic dataset evaluation has been explored with some promising results (Bommasani and Cardie, 2020). However, these methods rely on heuristics like content-value, density and compression (Grusky et al., 2018). We are interested in a more fine-grained, interpretable analysis that can only come from manual inspection, much like the analyses by Chen et al. (2016) and Yatskar (2019). To that end, we first define a six-class typology: the first three classes cover types of data noise and the last three cover varying degrees of summarization difficulty. We then proceed to answer the aforementioned research questions, and discuss our key observations, which are summarized below. Key Observations: (1) Datasets have distinct modalities: a mix of simpler samples (which we call Extractive) and complex ones (which we call Paraphrase and Inference). (2) Gigaword is majorly Extractive but suffers from data noise (45% of the targets have some key entity or fact that is absent from the source). (3) CNN/DM is relatively cleaner, and the authors' attempt to create a more abstractive dataset seems to be successful compared with Gigaword (only 18% of samples are Extractive). (4) XSum has no Extractive samples, but also has the greatest fraction of noise: 54% of the test samples have key entities or facts missing from the source. (5) Within the datasets, the broad performance trends between the typology classes are consistent across all metrics: simpler samples score higher than complex ones. (6) Metric reliability is also complexity dependent: on CNN/DM, the agreement with human judgement decreases as summarization complexity increases.
The remainder of the paper is organised as follows: in Section 2 we answer Q1, describe the three datasets, define the typology, and present results from the annotation. In Section 3 we explore Q2 a. and evaluate different models on a variety of metrics (automatic and human-judgement). In Section 4 we explore Q2 b. and investigate metric reliability. In Section 5 we share some learnings from our experience. We conclude with Section 7.

Inference
Three Malaysian and Indonesian seamen kidnapped by Philippine Abu Sayyaf kidnap-for-ransom group allegedly had been executed and the skeletons discovered in the southern Philippines are believed to be their remains , a local television reported Wednesday .

Table 2: Examples for each of the six categories. Text spans with the same colors correspond to the same fact in the source and target. Target spans in RED are missing or unsupported in the source. The last sample is "Inference" because the writer has to understand the concept of hostages, and then generalise from the group to an individual.
XSum or "Extreme Summarization" (Narayan et al., 2018a) was constructed from online news articles for highly abstractive summarization. We consider these datasets because of their popularity, and the difference in the nature of samples. The latter enables a more comprehensive analysis; Table 1 captures the size of source and target documents along with the number of samples.

Typology Definition
The classes are defined below in order of priority. Some examples are in Table 2. Readers may refer to the Appendix B, C, D for more examples.
• Incomplete/Irrelevant: The target summary ends abruptly. Or the source and target are unrelated.
• Entity Missing: The target summary contains entities (names, dates, events, etc.) that are absent from the source.
• Evidence Missing: The target summary is based on concepts which are absent from the source. However, the target is not Incomplete and all entities are present.
• Extractive: The target is constructed by copying tokens from the source, mostly in order of their appearance. Minor modifications, like stemming and abbreviating, are permitted. Word substitutions and additions are limited to a few. No reasoning, conclusion or co-reference resolution is performed as part of the summarization. The complete context of the target should be present in the source.
• Paraphrase: The majority of tokens in the target are substituted, or appear out of order, or both. There is no reasoning, conclusion or co-reference resolution. The complete context of the target should be present in the source.
• Inference: A non-trivial "inference" activity has to be completed to construct the target: some reasoning, conclusion, or complex coreference resolution. The complete context of the target should be present in the source.
We annotate 200 samples from each dataset, on par with similar studies on intrinsic evaluation (Chen et al., 2016; Cao et al., 2017). Two authors annotate samples independently. Annotations matched for 70%, 68% and 73% of Gigaword, CNN/DM and XSum samples, respectively. Disagreements were discussed between all authors before arriving at a consensus for the final label.

Motivation and Advantages
To the best of our knowledge, summarization datasets have not been manually analysed in this manner. A review of the most relevant summarization dataset analysis research shows that the most common form of intrinsic evaluation is to use surface-level heuristics. Most studies only cover a part of our typology, while almost all studies ignore the noise present in datasets. Zhong et al. (2019b) use similar forms of token-level coverage between the source and the reference to measure the extractiveness of the summary. In its simplest form, this is the ratio of the number of overlapping tokens to the reference length. In our definition of Extractive, we first set a meaningful, well-defined criterion, and then manually check for extractive references, while allowing for some relaxations.
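In its simplest form, the token-level coverage heuristic described above can be sketched as follows. This is an illustrative sketch, not the exact implementation used in prior work: tokenization here is naive whitespace splitting, whereas published heuristics use more careful tokenizers and greedy fragment matching.

```python
# Hedged sketch: fraction of reference tokens that also appear in the source.
# A value near 1.0 suggests an extractive reference; near 0.0, an abstractive
# (or unrelated) one. Whitespace tokenization is an assumption.

def coverage(source: str, reference: str) -> float:
    src_tokens = set(source.lower().split())
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for t in ref_tokens if t in src_tokens)
    return overlap / len(ref_tokens)

print(coverage("the cat sat on the mat", "cat on mat"))  # 1.0: fully extractive
print(coverage("the cat sat on the mat", "a dog ran"))   # 0.0: no overlap
```

Note that a heuristic like this cannot distinguish a paraphrased reference from a noisy one, which is exactly the gap our manual annotation addresses.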

Content Compression
In most papers (Grusky et al., 2018; Zhong et al., 2019b; Bommasani and Cardie, 2020), summarization complexity is defined by a compression ratio (usually the normalized word-count ratio of the source and reference). As a standalone metric, this does indeed capture the difficulty of replication. However, token rearrangement, substitution and reformulation are ignored in this measure of "complexity". To combat this, we distinctly defined Paraphrase and Inference. By manually analysing samples, we are able to differentiate between the obviously simple Extractive samples, the relatively tougher Paraphrase samples and the most difficult Inference samples. Together, these three offer a highly intuitive classification of samples. Part of the reason the machine comprehension analysis by Chen et al. (2016) was so effective was the interpretability of their classes. We hope our analysis will also enable researchers to improve summarization models.
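The blindness of the compression ratio to rewriting can be made concrete with a small sketch. The example sentences are toy data (an assumption), but they show the point made above: an extractive and a heavily paraphrased reference can receive very similar "complexity" values.

```python
# Hedged sketch: compression ratio as a word-count ratio of source to
# reference. It captures how much the text shrinks, but not whether tokens
# were reordered or substituted.

def compression_ratio(source: str, reference: str) -> float:
    return len(source.split()) / max(len(reference.split()), 1)

source = "the city council approved the new budget after a long debate"
extractive = "the council approved the budget"
paraphrase = "lawmakers passed a fresh spending plan"

print(compression_ratio(source, extractive))  # 2.2
print(compression_ratio(source, paraphrase))  # ~1.83, despite heavy rewriting
```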
Noise
Prior works have not focused on quantifying the noise in popular datasets. Moreover, none of these metrics are designed to account for noise or factual inconsistencies. A high value for content compression might imply a high degree of summarization complexity, but this ignores the possibility that the source-reference pair is unrelated (like row 1 in Table 2). In addition, the manual analysis allows us to identify factual errors and co-reference errors. This is not to say the typology is perfect and exhaustive; limitations and possible extensions to our typology are discussed in Section 5.

Dataset Analysis
The distribution of classes in the datasets is in Figure 1. We have made the following key observations in our analysis of the labels.
Gigaword is Extractive, but very noisy. 24.5% of summaries are Extractive, but 44.5% of samples belong to Entity Missing, Evidence Missing, or Incomplete. This is not unexpected considering the "headline" nature of the samples.
XSum is Abstractive, but also very noisy. The authors (Narayan et al., 2018a) designed the dataset to be highly abstractive. This is reflected in the distribution: there were no Extractive samples in our analysis, suggesting a significantly higher level of difficulty. However, 55% of samples belong to the Entity Missing, Evidence Missing, or Incomplete classes. The remaining 45% belong to the Paraphrase and Inference categories. Since we found only two incomplete samples, this class is ignored in all further XSum analysis.

CNN/DM is cleaner, and lives up to its design goals. The authors (Hermann et al., 2015) designed CNN/DM to be abstractive in nature, and this is reflected in the distribution: 64% of samples belong to the Paraphrase and Inference categories. Of the three, CNN/DM has the lowest fraction of factual and data noise: there are no Incomplete/Irrelevant samples, and only 18% of samples belong to Entity Missing and Evidence Missing.
The degree to which missing facts affect automatic evaluation varies. In some samples, one or two entities are missing (like Row 2 in Table 2), but in others multiple facts are missing. Empirical analysis of model performance for each class of samples is discussed in Section 3.

Performance on different classes (Q2 a)
In this section, we list the different models and metrics considered for analysis, and then describe how model performance varies across class labels.

Models for evaluation
We collect outputs from 7 systems for Gigaword: (1)

Metrics for evaluation
Existing summarization systems are usually evaluated using automatic metrics or manually using human judgement. We list the popular automatic metrics explored in this work. Except for the last two, all outputs from every model are scored on the following metrics. ROUGE-1/2/L measure the overlap of unigrams, bigrams and longest common subsequences, respectively (Lin, 2004). BERTScore (BS) measures soft overlap between contextual BERT token embeddings of the two texts (Zhang et al., 2020). MoverScore (MS) applies a distance measure to contextualized BERT and ELMo word embeddings (Zhao et al., 2019). FactCC measures the factual consistency between generated summaries and source documents (Kryscinski et al., 2020); due to issues with the setup and training procedure, this metric was only used in the CNN/DM analysis. Human Pyramid (HP) provides a robust technique for evaluating content selection by exhaustively obtaining a set of Semantic Content Units (SCUs) from a set of references, and then scoring system summaries on the number of SCUs that can be inferred (Nenkova and Passonneau, 2004). We use the scores shared by Bhandari et al. (2020) for the first 100 samples of the CNN/DM subset.
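To make the token-overlap family of metrics concrete, here is a minimal ROUGE-N sketch computing n-gram overlap F1. This is a simplification, not the official implementation: it assumes whitespace tokenization and omits stemming, stopword handling and the ROUGE-L subsequence variant.

```python
# Hedged sketch of ROUGE-N: F1 over clipped n-gram counts between a
# reference and a candidate summary.
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref or not cand:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Extractive candidate: perfect precision, partial recall.
print(rouge_n("the cat sat on the mat", "the cat sat", n=1))
```

Even this toy version makes the later observations plausible: a candidate that copies source tokens verbatim scores highly, while a faithful paraphrase with little lexical overlap does not.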

Model Performance
For each dataset, we group the samples by their labels. For all samples in a subset, the model response is scored using a metric. The mean of these sample scores returns a single subset-model-metric score, which is then averaged across all models in the subset, leaving us with a single subset-metric score. This is repeated for all (subset × metric) pairs. The results are captured in Figures 2, 3 and 4 for Gigaword, CNN/DM and XSum respectively. The last column in each group is the average score across all samples.
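The aggregation described above can be sketched as follows; the records here are toy data (an assumption), with model and metric names chosen for illustration only.

```python
# Hedged sketch of the aggregation: per-sample scores -> mean per
# (subset, model, metric) -> mean over models -> one (subset, metric) score.
from collections import defaultdict
from statistics import mean

# (label, model, metric, per-sample score) -- toy records
records = [
    ("Extractive", "m1", "R1", 0.50), ("Extractive", "m1", "R1", 0.60),
    ("Extractive", "m2", "R1", 0.40),
    ("Inference",  "m1", "R1", 0.20), ("Inference",  "m2", "R1", 0.30),
]

per_model = defaultdict(list)
for label, model, metric, score in records:
    per_model[(label, model, metric)].append(score)

# subset-model-metric score: mean over samples in the subset
smm = {key: mean(scores) for key, scores in per_model.items()}

# subset-metric score: mean over models
subset_metric = defaultdict(list)
for (label, model, metric), score in smm.items():
    subset_metric[(label, metric)].append(score)
final = {key: mean(scores) for key, scores in subset_metric.items()}

print(final)  # Extractive ~0.475, Inference ~0.25
```

Averaging per model first (rather than pooling all samples) keeps a model with many outputs in one subset from dominating the subset score.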

Impact of Data Quality and Noise
Incomplete and Irrelevant Of the three datasets, only Gigaword contains Incomplete (or Irrelevant) samples. Across all metrics, the performance on this label is lowest, which is to be expected: high overlap will be rare if the source and target are unrelated or incomplete (like Row 1, Table 2). What's alarming is the volume of such samples in Gigaword: if the distribution is the same for the training set, then the model is being trained on extremely noisy data (almost 14%). In addition, such samples needlessly penalise the model performance during evaluation.
Entity scores more than Evidence in Gigaword! The results for these subsets are a bit surprising. In Gigaword, the Entity Missing subset receives relatively higher scores than the Evidence Missing category. We attribute this to a combination of factors. Consider Row 2 in Table 2: entities are missing, but token overlap is high (more than 50%), which explains the high R1 scores but low R2 scores. In our observations, the impact of missing facts and entities varies with the length of the target, as well as the number of entities. The average summary length of CNN/DM (54 tokens) is about 7 times that of Gigaword (8 tokens). As a result, relative to the complete reference, one or two missing facts amount to a much smaller fraction of the reference in CNN/DM. The high overlap with the remainder leads to higher scores.
Factual Correctness in CNN/DM Automatic metrics only consider the token overlap (or "semantic distance") between the target and the model output. While such metrics exhibit high correlation with human judgement, a low score does not necessarily imply an incorrect generation, as demonstrated by Freitag et al. (2020) for machine translation. Hence we check the factual correctness of model outputs using FactCC. The competitive FactCC scores on the first three categories in Fig. 3 suggest the outputs generated by the models are factually faithful, which points to issues with metric reliability. We discuss this in Section 4.

Impact of Summarization Complexity
For the last three categories (Extractive, Paraphrase and Inference), Gigaword and CNN/DM exhibit a common trend: the highest performance, across all metrics, is on the Extractive subset, followed by the Paraphrase samples, which are more difficult to reproduce. The lowest performance is on the Inference samples. However, concluding that models perform poorly would be incorrect. The last three samples in Table 2 suggest that model outputs are coherent, logical and factually faithful. FactCC scores in Figure 3 also suggest the outputs are factually consistent.
Some metrics are biased towards simpler samples? For the Extractive, Paraphrase and Inference samples, the samples we manually observed (some of which are captured in Table 2) and the FactCC scores indicate a gap in the token-based metrics. However, we cannot fault the metrics entirely. If we had diverse target references for the same sources, some outputs would have found better matches, and thus higher scores! In fact, we see that BERTScore (a more "semantically" oriented metric) is extremely competitive across all categories in all three datasets (Figures 2, 3, 4), suggesting the generations are similar to the references. These results lead us to believe that token-based summarization metrics might also suffer from a "summarization-ese" effect: the metrics could be biased towards simpler, more "extractive" references. Recently, Freitag et al. (2020) arrived at the same conclusion for machine translation and BLEU (Papineni et al., 2002).
In the next section, we continue to explore the reliability of these metrics.
Does the reliability of metrics change with data properties? (Q2 b)

For each document i ∈ {1, ..., n} in a dataset D, we have J system outputs, where the outputs can come from different systems. Let s_ij, j ∈ {1, ..., J}, be the j-th summary of the i-th document, and let m denote a specific metric (including human judgement). The correlation K is calculated for each document, among the different system outputs of that document, and the mean value is reported. Like other meta-evaluation studies, we consider the Pearson and Spearman correlations as measures for K. Due to space constraints, we only show the Pearson plots for some critical results; more plots are available in Appendix A.1.

Inter-metric Correlation
We present a pairwise correlation analysis of the automatic metrics to understand metric agreement in Figure 5. We conjecture that a strong correlation between two vastly different metrics (say, ROUGE and MoverScore) might show that the metrics are more reliable. Overall, we can see in Figure 5 that correlations between token-based metrics (ROUGE) and embedding-distance metrics (BERTScore, MoverScore) are lower in Gigaword compared to CNN/DM and XSum. It is possible that the short summaries of Gigaword lead to this; perhaps there isn't enough context for BERTScore. However, we could not find any results in the original papers to support this claim.
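The per-document correlation protocol can be sketched as follows. Pearson's r is computed directly here (a library call such as scipy.stats.pearsonr would serve the same purpose), and the metric scores are toy data (an assumption).

```python
# Hedged sketch of the meta-evaluation: for each document, correlate two
# metrics' scores across the J system outputs of that document, then report
# the mean correlation over documents.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# metric_a[i][j]: score of metric A on the j-th system output of document i
metric_a = [[0.2, 0.5, 0.9], [0.1, 0.4, 0.6]]
metric_b = [[0.3, 0.4, 0.8], [0.2, 0.5, 0.9]]

per_doc = [pearson(a, b) for a, b in zip(metric_a, metric_b)]
print(mean(per_doc))  # mean per-document inter-metric correlation
```

Grouping the documents by typology label before averaging yields the per-class correlations discussed next.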
Correlation variation with complexity We observe that the correlation is heavily sample dependent. In Figure 5, averaged across all samples, R1 and MoverScore have a Pearson correlation of about 0.68 in Gigaword. This increases to 0.82 for the Extractive samples in Figure 6-(a), which are the simplest to reproduce. As the complexity increases, the correlation scores decrease (in Paraphrase, and then in Inference). The trends for R2 and MoverScore are similar. This is also observed for CNN/DM: in Figure 6-(b), correlations for R1-MoverScore and R1-BERTScore drop from 0.9, 0.85 for Extractive samples to about 0.83, 0.72 for Paraphrase and Inference samples. This suggests that the inter-metric correlation is heavily sample dependent. We cannot comment on XSum, because we did not encounter any Extractive samples in that dataset.
Correlation with Human Judgement For CNN/DM, we also compute the metric correlations with the human pyramid score (HP) in Figure 5 and Figure 6-(b). We observe the highest agreement with the human-judgement for the Extractive subset, and it is significantly lower in Paraphrase and Inference. This suggests that automatic metrics are more reliable when evaluating simpler examples, than complex ones.

Discussion
Limitations of the typology. Forcing samples to have a single label did limit our analysis. In retrospect, the typology could have allowed for two labels: one for quality, one for complexity. In XSum, for instance, most samples labelled Entity Missing could also be labelled Paraphrase or Inference. We also realise that the impact of positional bias could be important. This has been explored by Zhong et al. (2019a,b), and we plan to include similar metrics in our future work.

Collecting better datasets. Our results suggest that current metrics are not equally reliable across all categories of samples. If the quality of the references cannot be controlled, then having a diverse set of references for each source is advised. This will allow for multi-reference evaluation and could offset the "summarization-ese" issues.

Limits of the Pyramid Scores. At the moment, the Pyramid Scores (and judgement criteria in general) only compare the output to the gold reference, assuming the latter is true. As we see from our analysis, ignoring the source is not the right approach, for references from the web could have quality issues. A modified judgement procedure that also accounts for the faithfulness of the gold reference (perhaps by using automatic factuality metrics like FactCC) might be better.

Architecture-specific performance. In this study, we were interested in measuring the broader, averaged trends that summarization models exhibit. However, it would be interesting to see how specific architectural decisions impact individual model performance across different classes. We plan to explore this in the future.

"But what's the best metric for my data?" Specifically for metrics, our objective was to empirically demonstrate that (a) datasets have different modalities, and (b) metrics are not equally reliable across these modalities.
In this process, we also observed some results suggesting possible biases in certain token-based metrics, and a need for diverse reference sets. We'll continue to explore this question.

Related Work
For the task of text-summarization, the data analysis heuristics presented in Zhong et al. (2019a,b); Bommasani and Cardie (2020); Grusky et al. (2018) are most relevant to our work. Their analysis is focused on surface level heuristics which ignores all noise present in the data. This has been discussed in Sections 2.2.1, 5. Researchers have also explored other dataset biases (Jung et al., 2019;Zhong et al., 2019b;Chen et al., 2020). As discussed in Section 5, we plan to include this in our future work.
For metric reliability and meta-analysis, we build on correlation analysis presented in earlier works (Peyrard, 2019;Bhandari et al., 2020;Fabbri et al., 2020). The key difference and novelty is the introduction of our typology and measuring the impact of sample complexity on model performance and metric reliability. To the best of our knowledge, metrics and models have not been evaluated on such a typology. As results in Section 3 and 4 show, sample complexity is indeed very critical for metric reliability.

Conclusion
In this study, we manually analysed 600 samples from three popular datasets, using a typology that captures data quality issues and varying degrees of sample complexity. Our analysis of 27 summarization models reveals that performance under popular metrics is heavily dependent on the type of sample. On closer inspection, we found that the agreement of popular metrics with human judgement also changes with complexity, so the scores might not reflect true model performance. This analysis also led to some suggestions for creating better summarization datasets and highlighted some limitations of current human-judgement procedures.

A.2 Annotation Details
Each sample is annotated by 2-3 annotators independently. Given the limited number of samples, and the laborious nature of the exercise, we chose not to select final labels based on majority vote. For all disagreements, annotators discussed their reasoning and came to a consensus on the final label. For 70% of Gigaword samples, 68% of CNN/DM samples, and 73% of XSum samples, the initial annotations were in agreement.

Appendix B Gigaword
Paraphrase
A woman street cleaner and her three young daughters were killed Saturday when a bomb in a metal container exploded in Bangladesh , police said .
Mother , three daughters die in in Bangladesh blast .
Mother , three daughters killed in Bangladesh blast .

Paraphrase
The UN chief of Eastern Slavonia , the last Serb-held part of Croatia , confirmed Tuesday that key elections would be held here on April 13 as part of local ballots throughout Croatia .
UN chief confirms key elections in Eastern Slavonia .
UN confirms elections to be on April 13 in Eastern Slavonia .

Paraphrase
Business at Taiwan 's theme parks and resorts grew significantly in the first quarter of this year compared to Q1 last year , the Tourism Bureau said Thursday , attributing the growth to the government 's shopping voucher program and other promotion efforts .
Business at Taiwan 's theme parks and resorts grows .
Shopping vouchers help boost theme parks business : tourism bureau .

Inference
Col. Robert E. Lee skirted the unleaded gasoline pit , negotiated a thicket of telephone cords stretched as tight as trip wires and took the center of the New York Mercantile Exchange 's main trading floor just before 3 p.m. last Monday .
New York Mercantile Exchange 's trading floor .

MILITARY STRATEGISTS PRACTICE IN REAL BATTLE ON WALL STREET .
Inference
Finland scored three goals in a 40-second span of the first period Tuesday night for a 7-3 victory over the Czech Republic in their World Cup of Hockey opener .
Finland Routs Czech Republic at World Cup .
Inference
Q. I 've heard that cow manure can be used for energy production , but not human waste .
Cow manure can be used for energy production .
ON NOT WASTING WASTE .

Inference
Following all his inspired charity work, Didier Drogba has been awarded with a Barclays Spirit of the Game trophy. The Chelsea forward set up the 'Didier Drogba Foundation in Africa,' as he hopes to inspire the next generation of footballers in Africa to fall in love with the game. (truncated) He said 'I come from a poor family where I played football in the streets with my friends with no shoes, there was no grass but we still enjoyed it. The 'Didier Drogba Foundation,' contribute financial and material support in education and health including school bags for the school children, as well as a medical clinic in his hometown of Abidjan, Ivory Coast, which will be opening its doors later this year. Chelsea's stars such as Eden Hazard, Petr Cech and Branislav Ivanovic were out in force earlier this month as they raises £400,000 for the foundation at a charity ball. The money raised will be used to complete the medical clinic in Abidjan and help finance mobile clinics that will travel outside of the capital to those who are either to sick or poor to make the journey to the medical centre.
Didier Drogba has been awarded with a Barclays Spirit of the Game trophy . The Chelsea forward set up the ' DidierDrogba Foundation in Africa ' He hopes to inspire the next generation of footballers in Africa to fall in love with the game . The 37-year -old scored the equaliser against Leicester on Wednesday .
Didier Drogba given the Barclays Spirit of the Game award . The 37-year -old 's foundation has done impressive work in Africa . Some of Chelsea 's stars attended a charity ball which raised £ 400,000 . CLICK HERE for all the latest Chelsea news .
Inference
(truncated) Resorts on its Black Sea coast offer the best value in terms of a meal out, buying a cup of coffee and essentials such as sun cream and a cold drink, according to a study. Scroll down for video . Affordable: Bulgaria has been named Europe's cheapest destination, with Black Sea resorts like Sunny Beach (pictured) offering the best value in terms of a meal out and other holiday activities . Hotspot: Bulgaria's most popular resort of Sunny Beach is a carbon copy of those of Spain and Greece . It is one of 13 European hotspots out of 14 where your cash will go far further this summer, largely thanks to rock-bottom exchange rates and higher inflation in some countries. Research into an imaginary shopping basket of ten typical holiday purchases showed a total price of £37.39 for Bulgaria, which is down by 13.6 per cent from last summer. There was a bigger fall of 22 per cent for the Algarve in Portugal, taking the total cost to £44.02, helping it beat Spain's Costa del Sol to become the second cheapest destination. Only in Turkey, where inflation is 7.6 per cent -compared to virtually zero in Britain and the eurozone -will Britons find the cost of a day out much more expensive. The figures, compiled for the annual Post Office Holiday Costs Barometer, show the spending basket in Turkey is up by 21.4 per cent on last year, at £65.70. Bulgaria's most popular resort of (truncated) .
Former Soviet state has gained the most from the strong pound . Resorts on its Black Sea coast offer the best value in terms of a meal out , buying a cup of coffee and essentials such as sun cream and a cold drink . It is one of 13 European hotspots out of 14 where your cash will go far further this summer .
Bulgaria's Black Sea resorts cheaper than hotspots in Italy, Spain and Turkey . Researchers found cheapest destination using 'imaginary shopping basket' Cheap prices are driven by low exchange rates and country's high inflation . Its most popular resort of Sunny Beach copies those of Spain and Greece .

Paraphrase
More than 700,000 employees face unpaid leave due to the shutdown which was triggered after the two houses of Congress did not agree on a new budget. Hyundai said affected employees who currently own its vehicles will be given a payment relief "for as long as they are out of work". Employees looking to buy a new car will be given a 90-day payment deferral. "We recognize the impact on family budgets that the furlough will drive," John Krafcik, chief executive of Hyundai Motor America, said in a statement. Hyundai had offered a similar scheme, the Hyundai Assurance programme, during the peak of the global financial crisis four years ago to help consumers who had lost their jobs. Many analysts have said that the move had helped the South Korean firm win customer loyalty and boosted its sales in recent years. The company said that its latest offer to help the federal employees was an addition to that programme and aimed at "helping workers at a time when they most need it". "Like we did almost four years ago when we launched Hyundai Assurance, this is our way of saying 'We've got your back' during this uncertain time," Mr Krafcik said. Under the latest offer, Hyundai will extend all auto loan and lease payments during the shutdown for current Hyundai owners who are put on unpaid leave. The programme is available to all customers who have financed their purchase or lease through Hyundai Finance America.
US carmaker Hyundai Motor has offered financial help to federal employees who have been affected by the government shutdown .
Hyundai Motor will defer payments due from US federal employees affected by the partial government shutdown .

Paraphrase
Gary Price was suspended from all council duties for five months in November after Powys council's Standards Committee ruled he had breached the code of conduct. His appeal has been dismissed by the Adjudication Panel for Wales following a two-day hearing in Llandrindod Wells. Mr Price has been contacted for comment.
He was found to have sent information which the council said "incorrectly and unfairly" portrayed what happened at a grievance appeal hearing, in which he was a panel member. The Adjudication Panel for Wales unanimously agreed to refer the matter back to the Standards Committee with a recommendation that Mr Price be suspended for three months. Council leader Barry Thomas said the decision "sends out a clear message that those who enter public office have to operate within the members' code of conduct and maintain the highest possible standards".
A Powys council chief executive has lost his appeal against a decision to suspend him.
A decision to suspend a Powys county councillor has been upheld.

Inference
Derby City Council wanted to shut Moorways Pool from April in a bid to save about £350,000 a year. The Labour-led authority, which needs to save £79m over the next three years, said it had found the savings by making cuts in other areas. Campaigners who gathered more than 4,000 signatures on a petition said they were delighted at the news. Ranjit Banwait, leader of the authority, said the council had committed to keep it open for a year. He said the council had identified savings "in back-office areas" and a restructuring of management jobs, which had been "untouched" since 2010. However, he stressed if the authority failed to get a "fair deal" from central government in the future, the pool would still have to close. Campaigners had accepted the pool, which is 33m in length, was in need of repair. There are plans for a new 50m pool to be built by 2018 to replace it. However, closing it would have left only one other public pool in the city - the Queen's Leisure Centre, they said. Doug Whitlam, of the Derbyshire Amateur Swimming Association, said: "One of the main things for me would have been the loss of teaching. "Twelve hundred young people use this facility every week and that would be lost forever."
A council has backed down over plans to close a public swimming pool in a bid to save money.
A Derby swimming pool threatened with closure is to remain open for another year, council bosses have confirmed.

Inference
It is likely to include a scrappage scheme for older diesel cars in areas with high levels of dirty air. Speed bumps could be removed in some cities to cut pollution from cars slowing down and speeding up. Environmental lawyers ClientEarth said they would "thoroughly analyse" the proposals. According to the Royal College of Physicians, air pollution across the UK is linked to around 40,000 premature deaths every year. The UK has struggled to keep within EU limits on some pollutants, particularly nitrogen dioxide (NO2), which is produced by diesel engines and is linked to a range of respiratory diseases including asthma. Some 37 of the 43 regions of the UK are in breach of NO2 limits. Under earlier government plans, some parts of the UK would not have met EU NO2 standards until 2030. The original deadline to achieve these limits was 2010. Exasperated by what they believed was government foot-dragging on the question of cleaner air, ClientEarth mounted a legal challenge to force faster action. In April 2015, the UK Supreme Court ruled the government had to take immediate steps on the issue. Unhappy with the timescales in the plan that was then produced, ClientEarth went to the High Court last November for a judicial review. Once again the court supported the lawyers, telling the government that its scheme was "woefully inadequate" and giving ministers until 24 April this year to produce a new draft. With a general election in the offing, the government last week asked the judge for permission to delay the draft plan. But Mr Justice Garnham disagreed and ordered publication by 9 May. "These steps are necessary in order to safeguard public health," he said. Earlier this week, the government said it would not appeal against the ruling and would publish. In their previous plans, ministers wanted to create "clean air zones" in five cities outside London with high levels of NO2. 
Only the most polluting vehicles would have to pay a charge to enter the zone under that scheme. The new draft plan is expected to create many more such zones. Councils will be given the power to impose fines or restrictions on all polluting vehicles in these areas. In the worst cities, so-called "toxin taxes" could range up to £20 a day but the government is said to be keen not to punish drivers who bought diesels as a result of incentives brought in by a previous Labour administration. This is something that the lawyers at ClientEarth support. "Successive governments have encouraged people to buy diesel. We don't want to see diesel drivers vilified, and we think the plans should also include properly funded incentives to help people move to cleaner forms of transport," said ClientEarth CEO James Thornton. "We will thoroughly analyse the government's draft plans when they are produced. If we do not think they are in line with the court order, to deal with illegal levels of pollution as soon as possible, then we will consider our next steps." According to newspaper reports, the government has agreed to back a "targeted" scrappage scheme for older diesel cars, but limited to vehicles in areas of high pollution. There may also be funding for a retrofitting scheme to help existing diesel car and van owners cut their emissions of NO2. The government is also said to be pushing for councils to use alternatives to charging, including the removal of speed bumps in some places and the better sequencing of traffic lights in others. Both of these measures could limit cars having to slow down and speed up repeatedly, actions that can almost double the amount of NO2 produced. However, the idea that speed bumps which slow down traffic would be sacrificed to help clean up the air we breathe is not a welcome concept according to road safety charity Brake. "We ought not to be made to choose between having cleaner air and safer roads," a spokesman said.
"The evidence shows that air pollution is contributing to the early deaths of thousands of people. It's now clear that there's more than one way a car can kill you." The new proposals will be out for consultation for six weeks before the government produces a final plan at the end of July. Follow Matt on Twitter and on Facebook.
The government is expected to publish a new draft plan to tackle air pollution in the UK later this week.
The UK government is set to publish a draft air pollution plan after a protracted legal battle with environmental campaigners.