LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization

Text summarization is a popular task and an active area of research in the Natural Language Processing community. By definition, it requires accounting for long input texts, a characteristic that poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually rich layouts. This information is highly relevant, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain-text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with layout information and propose four novel datasets, consistently built from scholarly resources, covering French, Spanish, Portuguese, and Korean. Further, we propose new baselines merging layout-aware and long-range models, two orthogonal approaches, and obtain state-of-the-art results, showing the importance of combining both lines of research.


Introduction
Deep learning techniques have enabled remarkable progress in Natural Language Processing (NLP) in recent years (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020). However, the majority of models, benchmarks, and tasks have been designed for unimodal approaches, i.e. focusing exclusively on a single source of information, namely plain text. While it can be argued that for specific NLP tasks, such as textual entailment or machine translation, plain text is all that is needed, there exist several tasks for which disregarding the visual appearance of text is clearly sub-optimal: in a real-world context (business documentation, scientific articles, etc.), text does not naturally come as a sequence of characters, but is rather displayed in a bi-dimensional space containing rich visual information. The layout of e.g. this very paper provides valuable semantics to the reader: in which section are we right now? At the blink of an eye, this information is readily accessible via the salient section title (formatted differently and placed to highlight its role) preceding these words. To emphasize this point, imagine having to scroll through this content in plain text to access the same information.

* Work partially done while at reciTAL.
In the last couple of years, the research community has shown a growing interest in addressing these limitations. Several approaches have been proposed to deal with visually rich documents and integrate layout information into language models, with direct applications to Document Understanding tasks. Joint multimodal pre-training (Xu et al., 2021; Powalski et al., 2021; Appalaraju et al., 2021) has been key to reaching state-of-the-art performance on several benchmarks (Jaume et al., 2019; Graliński et al., 2020; Mathew et al., 2021). Nonetheless, a remaining limitation is that these (Transformer-based) approaches are not suitable for processing long documents, as the quadratic complexity of self-attention constrains their use to short sequences. Such models are hence unable to encode global context (e.g. long-range dependencies among text blocks).
Focusing on compressing the most relevant information from long texts to short summaries, the Text Summarization task naturally lends itself to benefit from such global context. Notice that, in practice, the limitations linked to sequence length are also amplified by the lack of visual/layout information in the existing datasets. Therefore, in this work, we aim at spurring further research on how to incorporate multimodal information to better capture long-range dependencies.
Our contributions can be summarized as follows:
• We extend two popular datasets for long-range summarization, arXiv and PubMed (Cohan et al., 2018), by including visual and layout information, thus allowing direct comparison with previous work;
• We release 4 additional layout-aware summarization datasets (128K documents), covering French, Spanish, Portuguese, and Korean;
• We provide baselines including adapted architectures for multimodal long-range summarization, and report results showing that (1) performance is far from optimal; and (2) layout provides valuable information.
All the datasets are available on HuggingFace.

Related Work

Layout/Visually-rich Datasets
Document Understanding covers problems that involve reading and interpreting visually rich documents (in contrast to plain text), requiring comprehension of the conveyed multimodal information. Hence, several tasks with a central layout aspect have been proposed by the document understanding community. Key Information Extraction tasks consist in extracting the values of a given set of keys, e.g., the total amount in a receipt or the date in a form. In such tasks, documents have a layout structure that is crucial for their interpretation. Notably, unlike tasks (e.g., form understanding, visual QA) in which the placement of text on the page and/or visual components are the main source of information needed to find the desired data (Borchmann et al., 2021), text plays a predominant role in document summarization. However, guidelines for summarizing texts, especially long ones, often recommend roughly previewing them to break them down into their major sections (Toprak and Almacioglu, 2009; Luo et al., 2019). This suggests that NLP systems might leverage multimodal information in documents. Miculicich and Han (2022) propose a two-stage method which detects text segments and incorporates this information in an extractive summarization model. Cao and Wang (2022) collect a new dataset for long and structure-aware document summarization, consisting of 21k documents written in English and extracted from WikiProject Biography.
Although not all documents are explicitly organized into clearly defined sections, the great majority contains layout and visual clues (e.g., a physical organization into paragraphs, bigger headings/subheadings) which help structure their textual contents and facilitate reading. Thus, we argue that layout is crucial to summarize long documents. We propose a corpus of more than 345K long documents with layout information. Furthermore, to address the need for multilingual training data (Chi et al., 2020), we include not only English documents, but also French, Spanish, Portuguese and Korean ones.

Datasets Construction
Inspired by the way the arXiv and PubMed datasets were built (Cohan et al., 2018), we construct our corpus from research papers, with abstracts as ground-truth summaries. As the PDF format allows simultaneous access to textual, visual and layout information, we collect PDF files to construct our datasets, and provide their URLs. For each language, we select a repository that contains a large number of academic articles (in the order of hundreds of thousands) and provides easy access to abstracts: HAL for French, SciELO for Spanish and Portuguese, and KoreaScience for Korean. Further, we provide enhanced versions of the arXiv and PubMed datasets, respectively denoted as arXiv-Lay and PubMed-Lay, for which layout information is provided.

Collecting the Data
Extended Datasets The arXiv and PubMed datasets (Cohan et al., 2018) contain long scientific research papers extracted from the arXiv and PubMed repositories. We augment them by providing their PDFs, allowing access to layout and visual information. As the abstracts contained in the original datasets are all lowercased, we do not reuse them, but rather extract the raw abstracts using the corresponding APIs.
Note that we were unable to retrieve all the original documents. For the most part, we failed to retrieve the corresponding abstracts, as they did not necessarily match the ones contained in the PDF files (due to e.g. PDF-parsing errors). We also found that some PDF files were unavailable, while others were corrupted or scanned documents. In total, about 39% (35%) of the original documents in arXiv (PubMed) were lost.

arXiv-Lay The original arXiv dataset (Cohan et al., 2018) was constructed by converting the LaTeX files to plain text. To be consistent with the other datasets, for which LaTeX files are not available, we instead use the PDF files to extract both text and layout elements. For each document contained in the original dataset, we fetch (when possible) the corresponding PDF file using Google Cloud Storage buckets. As opposed to the original procedure, we do not remove tables nor discard sections that follow the conclusion. We retrieve the corresponding abstracts from a metadata file provided by Kaggle.

PubMed-Lay For PubMed, we use the PMC OAI Service to retrieve abstracts and PDF files.
HAL We use the HAL API to download research papers written in French. To avoid excessively long (e.g. theses) or short (e.g. posters) documents, extraction is restricted to journal and conference papers.
SciELO Using Scrapy, we crawl the following SciELO collections: Ecuador, Colombia, Paraguay, Uruguay, Bolivia, Peru, Portugal, Spain and Brazil. We download documents written either in Spanish or Portuguese, according to the metadata, obtaining two distinct datasets: SciELO-ES (Spanish) and SciELO-PT (Portuguese).
KoreaScience Similarly, we scrape the KoreaScience website to extract research papers. We limit search results to documents whose publishers' names contain the word Korean. This rule was designed after sampling documents in the repository, and is the simplest way to obtain a good proportion of papers written in Korean. Further, search is restricted to papers published between 2012 and 2021, as recent publications are more likely to have digital-born, searchable PDFs. Finally, we download the PDF files of documents that contain an abstract written in Korean.
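The three KoreaScience filters described above (publisher name, publication year, presence of a Korean abstract) can be sketched as a simple predicate. The `Paper` record and its fields are illustrative assumptions, not the paper's actual pipeline code.

```python
# Hedged sketch of the KoreaScience selection rule; the metadata
# structure below is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class Paper:
    publisher: str
    year: int
    has_korean_abstract: bool

def keep(paper: Paper) -> bool:
    """Apply the three filters: publisher name, publication year, Korean abstract."""
    return (
        "Korean" in paper.publisher
        and 2012 <= paper.year <= 2021
        and paper.has_korean_abstract
    )

papers = [
    Paper("Korean Society of Civil Engineers", 2019, True),
    Paper("IEEE", 2019, True),                          # publisher rule fails
    Paper("Korean Mathematical Society", 2008, True),   # year rule fails
    Paper("Korean Chemical Society", 2020, False),      # no Korean abstract
]
selected = [p for p in papers if keep(p)]
```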

Data Pre-processing
For each corpus, we use the 95th percentile of the page distribution as an upper bound to filter out documents with too many pages, while the 5th (1st for HAL and SciELO) percentile of the summary length distribution is used as a minimum threshold to remove documents whose abstracts are too short. As our baselines do not consider visual information, we only extract text and layout from the PDF files. Layout is incorporated by providing the spatial position of each word in a document page image, represented by its bounding box (x0, y0, x1, y1), where (x0, y0) and (x1, y1) respectively denote the coordinates of the top-left and bottom-right corners. Using the PDF rendering library Poppler, text and word bounding boxes are extracted from each PDF, and the sequence order is recovered based on heuristics around the document layout (e.g., tables, columns). Abstracts are then removed by searching for exact matches; when no exact match is found, we use the fuzzysearch (https://pypi.org/project/fuzzysearch/) and regex (https://pypi.org/project/regex/) libraries to find near matches. For the non-English datasets, documents might contain several abstracts, written in different languages. To avoid information leakage, we retrieve the abstract of each document in every available language, according to the API for HAL or the websites for SciELO and KoreaScience, and remove them using the same strategy as for the main language. In case an abstract cannot be found, we discard the document to prevent any unforeseen leakage. The dataset construction process is illustrated in Section A in the Appendix.
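The percentile-based filtering above can be sketched as follows. The nearest-rank percentile and the document record format are illustrative assumptions; the paper does not specify its exact percentile implementation.

```python
# Minimal sketch of the document-level filters described above:
# drop documents above the 95th percentile of page counts and below
# the 5th percentile of abstract length (1st for HAL/SciELO).
import math

def percentile(values, q):
    """Nearest-rank percentile (0 < q <= 100) on a list of numbers."""
    s = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(s)))
    return s[rank - 1]

def filter_corpus(docs, min_summary_percentile=5):
    max_pages = percentile([d["pages"] for d in docs], 95)
    min_abs_len = percentile([d["abstract_len"] for d in docs], min_summary_percentile)
    return [d for d in docs if d["pages"] <= max_pages and d["abstract_len"] >= min_abs_len]

# Synthetic corpus: one over-long document among twenty.
docs = [{"pages": 10, "abstract_len": 150} for _ in range(18)]
docs.append({"pages": 10, "abstract_len": 2})
docs.append({"pages": 500, "abstract_len": 150})
kept = filter_corpus(docs)
```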

Datasets Statistics
The statistics of our proposed datasets, along with those computed on existing summarization datasets of long documents (Cohan et al., 2018; Sharma et al., 2019), are reported in Table 1. We see that document lengths are comparable to or greater than those of the arXiv, PubMed and BigPatent datasets.
For arXiv-Lay and PubMed-Lay, we retain the original train/validation/test splits and reconstruct them as faithfully to the originals as possible. For the new datasets, we order documents based on their publication dates and provide splits following a chronological ordering. For HAL and KoreaScience, we retain 3% of the articles as validation data, 3% as test data, and the remainder as training data. To match the number of validation/test documents in HAL and KoreaScience, we split the data into 90% for training, 5% for validation and 5% for test for both SciELO datasets.
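The chronological splitting above can be sketched as below (shown with the SciELO-style 90/5/5 fractions; the `date` field on each record is an assumption for illustration).

```python
# Sketch of a chronological split: sort by publication date and cut,
# so the test set contains the most recent papers.

def chronological_split(docs, val_frac=0.05, test_frac=0.05):
    ordered = sorted(docs, key=lambda d: d["date"])
    n = len(ordered)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = ordered[: n - n_val - n_test]
    val = ordered[n - n_val - n_test : n - n_test]
    test = ordered[n - n_test :]
    return train, val, test

# Synthetic corpus with ten publication years, 2010-2019.
docs = [{"id": i, "date": f"20{10 + i % 10}-01-01"} for i in range(100)]
train, val, test = chronological_split(docs)
```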

Models
For reproducibility purposes, we make the models' implementation, along with the fine-tuning and evaluation scripts, publicly available. We do not explore the use of visual information in long document summarization, as the focus is on evaluating baseline performance using state-of-the-art summarization models augmented with layout information. While visual features might provide a better understanding of structures such as tables and figures, we do not expect substantial gains with respect to layout-aware models. Indeed, the information provided in figures (i.e., information that cannot be captured by layout or text) is commonly described in the caption or related paragraphs.
Text-only models with standard input size We use Pegasus (Zhang et al., 2020) as a text-only baseline for arXiv-Lay and PubMed-Lay. Pegasus is an encoder-decoder model pre-trained using gap-sentence generation, making it a state-of-the-art model for abstractive summarization. For the non-English datasets, we rely on a fine-tuned MBART as our baseline. MBART (Liu et al., 2020) is a multilingual sequence-to-sequence model pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). We use its extension, MBART-50 (Tang et al., 2020), which is created from the original MBART by extending its embedding layers and pre-training it on a total of 50 languages. Both Pegasus and MBART are limited to a maximum sequence length of 1,024 tokens, which is well below the median length of each dataset.

Layout-aware models with standard input size
We introduce layout-aware extensions of Pegasus and MBART, respectively denoted as Pegasus+Layout and MBART+Layout. Following LayoutLM (Xu et al., 2020), which is state-of-the-art on several document understanding tasks, we encode the spatial position of each token with embedding layers over its bounding box coordinates and bounding box size (width and height). The layout representation of a token is formed by summing the resulting embedding representations. The final representation of a token is then obtained through point-wise summation of its textual, 1D-positional and layout embeddings.
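The summation of textual, 1D-positional and layout embeddings can be sketched with plain lookup tables. The table sizes and the 0-1000 coordinate normalization (as in LayoutLM) are assumptions; the actual models use trained PyTorch embedding layers.

```python
# Illustrative sketch of the layout-aware input representation:
# one lookup table per box feature (x0, y0, x1, y1, width, height),
# summed with token and 1D-position embeddings.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, max_pos, n_coords = 64, 100, 512, 1001

tok_emb = rng.normal(size=(vocab, d_model))
pos_emb = rng.normal(size=(max_pos, d_model))
box_embs = [rng.normal(size=(n_coords, d_model)) for _ in range(6)]

def embed(token_ids, boxes):
    """token_ids: (seq,); boxes: (seq, 4) with coords normalized to 0..1000."""
    boxes = np.asarray(boxes)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    feats = [boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3], w, h]
    layout = sum(table[f] for table, f in zip(box_embs, feats))
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + layout

out = embed([5, 17, 42], [[10, 20, 110, 40], [120, 20, 180, 40], [10, 50, 90, 70]])
```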
Long-range, text-only models To process longer sequences, we leverage BigBird (Zaheer et al., 2020), a sparse-attention-based Transformer which reduces the quadratic dependency of self-attention to a linear one. For arXiv-Lay and PubMed-Lay, we initialize BigBird from Pegasus (Zaheer et al., 2020), and for the non-English datasets, we use the weights of MBART. The resulting models are referred to as BigBird-Pegasus and BigBird-MBART. For both models, BigBird sparse attention is used only in the encoder. Both models can handle up to 4,096 input tokens, which is greater than the median length in PubMed-Lay, HAL and KoreaScience.
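BigBird's sparsity pattern (sliding window, global tokens, random links) can be illustrated with a toy attention mask; the sizes below are arbitrary and the real model uses blocked attention for efficiency, not a dense boolean mask.

```python
# Toy BigBird-style sparse attention mask: each query attends to a local
# window, a few random keys, and global tokens that attend everywhere.
import numpy as np

def sparse_attention_mask(seq_len, window=3, n_global=2, n_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                           # local sliding window
        mask[i, rng.choice(seq_len, n_random)] = True   # random links
    mask[:n_global, :] = True                           # global tokens attend everywhere
    mask[:, :n_global] = True                           # and everyone attends to them
    return mask

m = sparse_attention_mask(64)
density = m.mean()  # fraction of attended pairs; full attention would be 1.0
```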
Long-range, layout-aware models We also include layout information in long-range text-only models. Similarly to layout-aware models with standard input size, we integrate layout information into our long-range models by encoding each token's spatial position in the page. The resulting models are denoted as BigBird-Pegasus+Layout and BigBird-MBART+Layout.
Additional State-of-the-Art Baselines We further consider additional state-of-the-art baselines for summarization: i) the text-only T5 (Raffel et al., 2019) with standard input size, ii) the long-range Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020), and iii) the layout-aware, long-range LED+Layout, which we implement similarly to the previous layout-aware models.

Implementation Details
We initialize our Pegasus-based and MBART-based models with, respectively, the google/pegasus-large and facebook/mbart-large-50 checkpoints shared through the Hugging Face Model Hub. As for T5 and LED, we use the weights from t5-base and allenai/led-base-16384, respectively (the large versions of T5 and LED did not fit into GPU memory due to their size). Following Zhang et al. (2020) and Zaheer et al. (2020), we fine-tune our models for up to 74k (100k) steps on arXiv-Lay (PubMed-Lay). On HAL, the total number of steps is set to 100k, while it is decreased for the remaining non-English datasets. For BigBird-Pegasus models, we follow Zaheer et al. (2020) and set the maximum input length at 3,072 tokens. As the median input length is much greater in almost every non-English dataset, we increase the maximum input length to 4,096 tokens for BigBird-MBART models. Output length is restricted to 256 tokens for all models, which is enough to fully capture at least 50% of the summaries in each dataset. For evaluation, we use beam search and report a single run for each model and dataset. Following Zhang et al. (2020) and Zaheer et al. (2020), we set the number of beams to 8 for Pegasus-based models, and 5 for BigBird-Pegasus-based models. For the non-English datasets, we set it to 5 for all models, for fair comparison. For all experiments, we use a length penalty of 0.8. For more implementation details, see Section B.1 in the Appendix.

General Results
In Table 3, we report the ROUGE-L scores obtained on the arXiv and PubMed datasets (as reported by Zaheer et al. (2020)), as well as on the corresponding layout-augmented counterparts we release; detailed results are given in Section C.1 in the Appendix. Note that, for MBART, we tested different numbers of fine-tuning steps (10k, 25k, 50k, 100k) and chose the one giving the best validation scores. On arXiv-Lay and PubMed-Lay, we observe that, while the addition of layout to Pegasus does not improve the ROUGE-L scores, there are gains in integrating layout information into BigBird-Pegasus. To assess whether these gains are significant, we perform significance analysis at the 0.05 level using bootstrap, and estimate a ROUGE-L threshold that predicts when improvements are significant. ROUGE-L improvements between each pair of models are reported in Table 11 in the Appendix.
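For reference, ROUGE-L is the F1 score derived from the longest common subsequence (LCS) between candidate and reference. The sketch below is a toy whitespace-tokenized version; real evaluations typically apply stemming and proper tokenization.

```python
# Self-contained toy ROUGE-L (LCS-based F1).

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

score = rouge_l("the model summarizes long documents",
                "the model summarizes very long documents well")
```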
On arXiv-Lay, we compute a threshold of 1.48 ROUGE-L, showing that BigBird-Pegasus+Layout significantly outperforms all Pegasus-based models. In particular, we find a 1.56 ROUGE-L improvement between BigBird-Pegasus and its layout-augmented counterpart, demonstrating that the addition of layout to long-range modeling significantly improves summarization. On PubMed-Lay, we compute a threshold of 1.77. Hence, the 0.96 ROUGE-L improvement from BigBird-Pegasus to its layout-augmented counterpart is not significant. However, the variance in font sizes in PubMed-Lay is much smaller compared to arXiv-Lay (see Table 12 in the Appendix), reflecting an overall simpler layout. Therefore, we argue that layout integration has a lesser impact on PubMed-Lay, which can explain the non-significance of the results.
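A paired bootstrap test of the kind used above can be sketched as follows. The per-document scores here are synthetic and the exact resampling protocol used in the paper may differ; this is only meant to illustrate the mechanics.

```python
# Hedged sketch of a paired bootstrap significance test: resample
# documents with replacement and count how often system B fails to beat A.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples where mean(B) <= mean(A); a small value
    (< 0.05) indicates B significantly outperforms A."""
    rng = random.Random(seed)
    n, losses = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) <= sum(scores_a[i] for i in idx):
            losses += 1
    return losses / n_resamples

random.seed(1)
base = [random.gauss(40.0, 5.0) for _ in range(200)]
layout = [s + random.gauss(1.5, 1.0) for s in base]  # layout model ~1.5 points better
p_value = paired_bootstrap(base, layout)
```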
In addition, we find that BigBird-Pegasus significantly outperforms Pegasus and Pegasus+Layout only when augmented with layout, with an improvement of, respectively, 2.3 and 2.2 points. This demonstrates the importance of combining layout and long-range modeling. While T5 and LED obtain competitive results, we find that the gain in adding layout to LED is minor. However, the models we consider have all been pre-trained only on plain text. As a result, the layout representations are learnt from scratch during fine-tuning. Similarly to us, Borchmann et al. (2021) show that their layout-augmented T5 does not necessarily improve the scores, and that performance is significantly enhanced only when the model has been pre-trained on layout-rich data.
Note that results on arXiv-Lay and PubMed-Lay are not directly comparable to those on the original datasets. First, our datasets contain less training data, due to the inability to process all original documents. Secondly, the settings are different: while the original arXiv and PubMed datasets contain clear discourse information (e.g., each section is delimited by markers) obtained from LaTeX files, documents in our extended versions are built by parsing raw PDF files. Therefore, the task is more challenging for text-only baselines, as they have no access to the discourse structure of documents, which further underlines the importance of taking into account the structural information brought by visual cues.

Table 4 presents the ROUGE-L scores reported on the non-English datasets. On HAL, we note that BigBird-MBART does not benefit from layout. After investigation, we hypothesize that this is due to the larger presence of single-column and simple layouts, which makes layout integration less needed. On both SciELO datasets, we notice that combining layout with long-range modeling brings substantial improvements over MBART. Further, we find that the plain-text BigBird models do not improve over the layout-aware Pegasus and MBART on arXiv-Lay and SciELO-ES, demonstrating that simply capturing more context does not always suffice. Regarding performance on KoreaScience, we see a significant drop in performance for every model w.r.t. the other non-English datasets. At first glance, we notice a high amount of English segments (e.g., tables, figure captions, scientific concepts) in KoreaScience documents. To investigate this, we use the cld2 library to detect the language in each non-English document. We consider the percent confidence of the top-1 matching language as an indicator of the presence of the main language (i.e., French, Spanish, Portuguese or Korean) in a document, and average the results to obtain a score for the whole dataset. Table 5 reports the average percent confidence obtained on each split, for each dataset.
We find that the percentage of text written in the main language (i.e., Korean) is smaller in KoreaScience than in the other datasets. As the MBART-based models expect only one language per document (the information is encoded using a special token), we claim the strong presence of non-Korean segments in KoreaScience causes them to suffer from interference problems. Therefore, we highlight that KoreaScience is a more challenging dataset, and we hope our work will boost research on better long-range, multimodal and multilingual models. (cld2: https://github.com/GregBowyer/cld2-cffi)
Overall, results show a clear benefit of integrating layout information for long document summarization.

Table 6: Average human judgement scores obtained by comparing ground-truth abstracts and summaries generated by BigBird and BigBird+Layout from 50 documents sampled from arXiv-Lay and HAL. Inter-rater agreement is computed using Krippendorff's alpha coefficient, and enclosed in parentheses.

Human Evaluation
To gain more insight into the effect of document layout for summarizing long textual content, we conduct a human evaluation of summaries generated by BigBird-Pegasus/BigBird-MBART and their layout-aware counterparts. We choose the BigBird-based models over the LED ones, as the gain in augmenting BigBird with layout is much more apparent. We evenly sample 50 documents from the arXiv-Lay and HAL test sets, filtering documents by their topic (computer science) to match the judgment capabilities of the three human annotators. We design an evaluation interface (see Section C.2 in the Appendix). For each sentence s_i in the generated summary, we ask the annotators to highlight the relevant tokens in s_i, along with the equivalent parts in the ground-truth abstract (denoted h_i). Further, we ask them to rate the summary in terms of coherence and fluency, on a scale of 0 to 5, following the DUC quality guidelines (Dang, 2005). Finally, annotators are asked to penalize summaries with hallucinated facts. The highlighting process allows us to compute precision and recall as the percentage of highlighted information in the generated summary and the ground-truth abstract, respectively. Moreover, we can compute an overlap ratio as the percentage of highlighted information that appears several times in the generated summary. Lastly, we calculate a flow percentage that evaluates how well the order of the ground-truth information is preserved, by computing the percentage of times where the highlighted text h_i in the gold summary for one generated sentence s_i follows the highlighted text h_{i-1} for the previous sentence s_{i-1} (i.e. where any token from h_i occurs after a token in h_{i-1}). Table 6 reports the scores for each metric and model, averaged over all 50 documents, along with inter-rater agreements, computed using Krippendorff's alpha coefficient.
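The highlight-based metrics above can be sketched with token indices. Representing highlights as index sets/spans is an assumption for illustration; the paper does not specify its annotation format.

```python
# Toy sketch of the highlight-based human-evaluation metrics.

def precision_recall(summary_len, abstract_len, summary_highlights, abstract_highlights):
    """Fraction of highlighted tokens in the generated summary (precision)
    and in the ground-truth abstract (recall)."""
    return (len(summary_highlights) / summary_len,
            len(abstract_highlights) / abstract_len)

def flow(abstract_highlight_spans):
    """Share of consecutive sentence pairs (h_{i-1}, h_i) where some token
    of h_i occurs after some token of h_{i-1} in the abstract."""
    ok = sum(1 for prev, cur in zip(abstract_highlight_spans, abstract_highlight_spans[1:])
             if cur and prev and max(cur) > min(prev))
    pairs = len(abstract_highlight_spans) - 1
    return ok / pairs if pairs else 1.0

p, r = precision_recall(50, 80, set(range(30)), set(range(40)))
f = flow([[0, 1, 2], [5, 6], [3, 4]])  # second pair is out of order
```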
We find that adding layout to the models significantly improves precision and recall, results in less overlap (repetition), and follows the ground-truth order more closely. Further, annotators did not encounter any hallucinated facts in the 50 generated summaries. To conclude, the reported results show that human annotators strongly agree that adding layout yields better summaries, further validating our claim that layout provides vital information for summarization tasks.

Case Studies
To better understand the previous results, we focus on uncovering the cases in which layout is most helpful. To this end, we identify features that relate to the necessity of having layout: 1) article length, as longer texts are intuitively easier to understand with layout; 2) summary length, as longer summaries are likely to cover more salient information; and 3) variance in font sizes (using the height of the bounding boxes), and, as such, the complexity of the layout. The benefit of using layout is measured as the difference in ROUGE-L scores between BigBird-Pegasus+Layout and its purely textual counterpart, on arXiv-Lay and PubMed-Lay. We compute quartiles from the distributions of article lengths, ground-truth summary lengths, and variance in the height of bounding boxes. Based on the aforementioned factors, the scores obtained by each model are then grouped by quartile range and averaged over each range (see Figure 1). On arXiv-Lay, we find that layout brings the most improvement when dealing with the 25% longest documents and summaries, while, for both datasets, layout is least beneficial for the shortest documents and summaries. These results corroborate our claim that layout can bring important information about long-range context. Concerning the third factor, we see, on PubMed-Lay, that layout is most helpful for documents with the widest ranges of font sizes, showcasing the advantage of using layout to capture salient information.
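The quartile analysis above can be sketched as bucketing per-document score differences by a feature's quartiles and averaging within each bucket. The data below is synthetic, built only to illustrate the mechanics.

```python
# Sketch of the quartile analysis: group per-document ROUGE-L deltas
# (layout minus text-only) by a feature's quartiles and average.
import random

def quartile_means(features, deltas):
    s = sorted(features)
    n = len(s)
    q1, q2, q3 = s[n // 4], s[n // 2], s[3 * n // 4]
    buckets = [[], [], [], []]
    for feat, d in zip(features, deltas):
        k = 0 if feat < q1 else 1 if feat < q2 else 2 if feat < q3 else 3
        buckets[k].append(d)
    return [sum(b) / len(b) if b else 0.0 for b in buckets]

random.seed(0)
doc_lengths = [random.randint(1000, 20000) for _ in range(400)]
# Synthetic effect: layout helps more on longer documents.
gains = [0.0002 * length + random.gauss(0, 0.5) for length in doc_lengths]
means = quartile_means(doc_lengths, gains)
```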

Conclusion
We have presented LoRaLay, a set of large-scale datasets for long-range and layout-aware text summarization. LoRaLay provides the research community with 4 novel multimodal corpora covering French, Spanish, Portuguese, and Korean languages, built from scientific articles. Furthermore, it includes additional layout and visual information for existing long-range summarization datasets (arXiv and PubMed). We provide adapted architectures merging layout-aware and long-range models, and show the importance of layout information in capturing long-range dependencies.

Limitations
The proposed corpus is limited to a single domain, that of scientific literature. Such limitation arguably extends to the layout diversity of documents. In terms of risks, we acknowledge the presence of Personally Identifiable Information such as author names and affiliations; nonetheless, such informa-

A.1 Extended Datasets -Lost Documents
Figure 3 provides details on the amount of original documents lost in the process of augmenting arXiv and PubMed with layout/visual information. We observe four types of failures, and provide numbers for each type:
• The link to the document's PDF file is not provided (Unavailable PDF);
• The PDF file is corrupted, i.e. cannot be opened (Corrupted PDF);
• The document is not digital-born, making it impossible to parse with PDF parsing tools (Scanned PDF);
• The document's abstract cannot be found in the PDF (Irretrievable Abstract).

A.2 KoreaScience -Extraction Rule
Korean documents in KoreaScience are extracted by restricting search results to documents containing the word "Korean" in the publisher's name. We show that this rule does not bias the sample towards a specific research area. We compute the distribution of topics covered by all publishers, and compare it to the distribution of topics covered by publishers whose name contains the word Korean. Figure 4 shows that the distribution obtained using our rule remains roughly the same as the original.
Figure 4: Distribution of topics covered by all publishers (red) vs distribution of topics covered by publishers whose name contains the word Korean (blue).

A.3 Samples
We provide samples of documents from each dataset in Figure 5.

A.4 Datasets Statistics
The distributions of research areas in arXiv-Lay and HAL are provided in Figures 6 and 7, respectively. Such distributions are not available for the other datasets, as we did not have access to topic information during extraction.

B.1 Implementation Details
Models were implemented in Python using the PyTorch (Paszke et al., 2017) and Hugging Face (Wolf et al., 2019) libraries. In all experiments, we use Adafactor (Shazeer and Stern, 2018), a stochastic optimization method based on Adam (Kingma and Ba, 2014) that reduces memory usage while retaining the empirical benefits of adaptivity. We use a learning rate warmup over the first 10% of steps, except on arXiv-Lay where it is set to 10k steps, consistently with Zaheer et al. (2020), and a square-root decay of the learning rate. All our experiments were run on four Nvidia V100 GPUs with 32GB of memory each.
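The warmup-then-square-root-decay schedule above can be sketched as a pure function of the step count. The peak learning rate is an illustrative assumption; the paper does not state it here.

```python
# Sketch of the learning-rate schedule: linear warmup to a peak value,
# then decay proportional to 1/sqrt(step).
import math

def lr(step, warmup_steps=1000, peak=1e-3):
    if step < warmup_steps:
        return peak * step / warmup_steps          # linear warmup
    return peak * math.sqrt(warmup_steps / step)   # inverse square-root decay

schedule = [lr(s) for s in (0, 500, 1000, 4000, 16000)]
```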

Figure 5: Sample documents. Excerpts shown include an arXiv-Lay paper on photon structure function measurements, a PubMed-Lay article on anthocyanins and their biomedicinal properties, and a HAL paper written in French ("ZEP teachers' representations of the school/family relationship through the prism of high-achieving pupils").