Robust Open-Vocabulary Translation from Visual Text Representations

Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an ‘open vocabulary.’ This approach relies on consistent and correct underlying unicode sequences, and makes models susceptible to degradation from common types of noise and variation. Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created by processing visually rendered text with sliding windows. We show that models using visual text representations approach or match performance of traditional text models on small and larger datasets. More importantly, models with visual embeddings demonstrate significant robustness to varied types of noise, achieving e.g., 25.9 BLEU on a character permuted German–English task where subword models degrade to 1.9.


Introduction
Machine translation models degrade quickly in the presence of noise, such as character swaps or misspellings (Belinkov and Bisk, 2018; Khayrallah and Koehn, 2018; Eger et al., 2019). Part of the reason for this brittleness is the reliance of MT systems on subword segmentation (Sennrich et al., 2016) as the solution for the open-vocabulary problem, since it can cause even minor variations in text to result in very different token sequences, needlessly fragmenting the data (Table 1). These issues can be mitigated with techniques including normalization, adding synthetic noisy training data (Vaibhav et al., 2019), or often simply moving to larger data settings. However, it is impossible to anticipate all kinds of noise in light of their combinatorics, and in any case, attempts to do so add complexity to the model training process.

Phenomena                      Word        BPE (5k)
Misspelling                    language    language (1)
                               langauge    la · ng · au · ge (4)
Visually Similar Characters    really      really (1)
                               rea11y      re · a · 1 · 1 · y (5)
Shared Character Components

Humans, in contrast, are remarkably robust to all kinds of text permutations (Rayner et al., 2006), including extremes such as "l33tspeak" (Perea et al., 2008). It stands to reason that one source of this robustness is that humans process text, not from discrete unicode representations, but visually, and that modeling this kind of information might yield more humanlike robustness. Drawing on this, we propose to translate from visual text representations. Our model still consumes text, but instead of creating an embedding matrix from subword tokens, we render the raw, unsegmented text as images, divide it into overlapping slices, and produce representations using techniques from optical character recognition (OCR). The rest of the architecture remains unchanged. These models therefore contain both visual and distributional information about the input, allowing them to potentially provide robust representations of the input even in the presence of various kinds of noise.
After presenting the visual text embedder (Section 2), we demonstrate the potential of visual representations for machine translation across a range of languages, scripts, and training data sizes (Section 4). We then look at a variety of types of noise, and show significant improvements in model robustness with visual text models (Section 5).

Figure 1: Visual text architecture combines network components from OCR and NMT, trained end-to-end.

Rendering text as images
Our architecture is summarized in Figure 1. The first step is to transform text into an image. We render the raw text of each input sentence into a grayscale (single color channel) image; no subword processing is used at all. The image height h is a function of the maximum height of the characters given the font and font size, while the image width w is variable based on the font and sentence length. We extract slices using sliding windows, similar to feature extraction for speech processing. Each window is of a specified length w and full height h, extracted at intervals s determined by a set stride. We experimentally tune each of these parameters per language pair (see Section 3.4).
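The slicing step can be illustrated with a minimal sketch. It assumes the rendered image is available as a list of pixel rows; the function names and the handling of the right edge (adding one final window flush with the edge) are illustrative assumptions, not details taken from the released code.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values (hypothetical layout)


def window_starts(image_width: int, window: int, stride: int) -> List[int]:
    """Left edges of the sliding windows over an image of the given width."""
    if window < stride:
        # With window < stride, pixels between windows would be skipped.
        raise ValueError("window length must be >= stride so no text is dropped")
    if image_width <= window:
        return [0]
    starts = list(range(0, image_width - window + 1, stride))
    # Assumption: cover any uncovered tail with one window flush with the edge.
    if starts[-1] + window < image_width:
        starts.append(image_width - window)
    return starts


def extract_slices(image: Image, window: int, stride: int) -> List[Image]:
    """Cut a rendered sentence into full-height, overlapping slices."""
    width = len(image[0])
    return [
        [row[x:x + window] for row in image]
        for x in window_starts(width, window, stride)
    ]
```

With a window of 30 and a stride of 10, neighbouring slices share 20 pixel columns, which is the overlap the text refers to.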

Visual representations
The slices output from the rendering stage are analogous to subword text tokens. The next step produces "embeddings" from these slices. Embeddings typically refer to entries in a fixed size weight matrix, with the vocabulary ID as an index. Our image slices are not drawn from a predetermined set, so we cannot work with normal embeddings. Instead, we use the outputs of 2D convolutional blocks run over the image slices, projected to the model hidden size, as a continuous vocabulary.
While OCR models for tasks such as handwriting recognition require depth that impacts training and inference speed, our task differs significantly. OCR tasks contend with varied image backgrounds, varied horizontal spacing, and varied character 'fonts,' sizes, colors, and saliency. Visually rendered text is uniform along each of these characteristics by construction. Accordingly, we can use simpler image processing and model architectures without performance impact.
Our core experiments use a single convolutional block (c = 1) followed by a linear projection to produce flattened 1D representations as used by typical text-to-text Transformer models, but here the representations are drawn from a continuous space rather than a predetermined number of embeddings. A convolutional block comprises three pieces: a 2D convolution followed by 2D batch normalization and a ReLU layer. The 2D convolution uses only one color channel, and padding of 1, kernel size of 3, and stride of 1, which results in no change in dimensions between the block inputs and outputs. We contrast the c = 1 setting with two others: c = 0 and c = 7. When c = 0, the model is akin to the Vision Transformer (Dosovitskiy et al., 2021) from image classification, where attentional layers are applied directly to image slices after a flattening linear transformation. With c = 7, we compare against the depth of VistaOCR (Rawls et al., 2017), a competitive OCR model, but without its additional color channels and subsequent max-pooling.
After replacing text embeddings with visual representations, the standard MT architecture remains the same. The full model is trained end-to-end with the typical cross-entropy objective. All models are trained using a modified version of fairseq (Ott et al., 2019), which we release with the paper.
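The claim that a kernel size of 3 with padding 1 and stride 1 leaves the slice dimensions unchanged follows from the standard convolution output-size formula. The sketch below checks it and applies a naive single-channel convolution followed by ReLU. It is a dependency-free illustration only: batch normalization is omitted, the function names are hypothetical, and it is not the fairseq implementation.

```python
from typing import List

Matrix = List[List[float]]


def conv_output_size(size: int, kernel: int = 3, padding: int = 1, stride: int = 1) -> int:
    """Standard output-size formula for one spatial dimension of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3x3_relu(image: Matrix, kernel: Matrix) -> Matrix:
    """Single-channel 3x3 convolution (zero padding 1, stride 1), then ReLU.

    Batch normalization, which sits between the two in the block described
    above, is left out to keep the sketch short.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += image[yy][xx] * kernel[ky][kx]
            out[y][x] = max(acc, 0.0)  # ReLU
    return out
```

Because the output has the same height and width as the input, blocks can be stacked (c = 7) without changing the shape that the final linear projection expects.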

Training data
We experiment with two data scenarios, a small one (MTTT) and a larger one (WMT).

MTTT.
We use the MTTT dataset to compare traditional text models with visual text models across a range of languages and scripts, using similarly sized data. We use the Multitarget TED Talks Task (MTTT), a collection of TED datasets with ∼200k training sentences (Duh, 2018). Specifically, we use the data for the Arabic (ar), Chinese (zh), Japanese (ja), Korean (ko), Russian (ru), French (fr), and German (de) to English (en) tasks.
WMT. We also experiment with two larger datasets derived from the 2020 shared task in news translation from the Conference on Machine Translation (WMT20). For German-English, we use all provided data except Paracrawl and Commoncrawl. We filter out sentence pairs that don't match on language ID as reported by fasttext (Joulin et al., 2016b,a), pairs with a raw length ratio of more than 3 to 1, pairs where raw source or target length is greater than 100, and all duplicate pairs, leaving 4.9M sentence pairs. We train a joint unigram SentencePiece model of size 10k.
For Chinese, we use all provided data except UNv1.0 and the backtranslations. We filter in the same way, except that we do not apply ratio filtering. This yields 8.7M sentence pairs. We train separate source and target unigram SentencePiece models of sizes 20k and 10k, respectively. More details can be found in Table 13 in the Appendix.
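The length, ratio, and duplicate filters described above can be sketched as follows. This is an illustration under stated assumptions: "raw length" is taken to mean whitespace-separated tokens, the first copy of a duplicate pair is kept, and the language-ID check is left out because it depends on an external fasttext model. The function name and parameters are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def filter_pairs(
    pairs: Iterable[Pair],
    max_len: int = 100,
    max_ratio: float = 3.0,
    apply_ratio: bool = True,  # set to False for the Chinese setup described above
) -> List[Pair]:
    """Drop over-long, badly length-mismatched, empty, and duplicate pairs."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # duplicate pair
        seen.add((src, tgt))
        # Assumption: length is counted in whitespace-separated tokens.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue
        if n_src > max_len or n_tgt > max_len:
            continue
        if apply_ratio and max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept
```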

Test sets
MTTT. Our main results are on the 1,982-segment multiway parallel MTTT test sets.
MTNT. To evaluate model robustness on data with naturally occurring noise, we use the Machine Translation of Noisy Text (MTNT) test sets (Michel and Neubig, 2018). The MTNT test sets used were created from comments from Reddit in French, German, and Japanese which have been professionally translated from English. By virtue of their domain, these test sets contain "noisy" text with natural typos, semantic use of visually similar characters, abbreviations, grammatical errors, emojis, and more. MTNT has recently been used for evaluation in the WMT'19 and '20 Robustness tasks (Li et al., 2019; Specia et al., 2020).

Baseline text models
All baseline text models are trained using fairseq. For our 7 language pairs from the MTTT TED dataset, we follow the recommended fairseq architecture and optimization parameters for IWSLT'14 de-en, which is of the same size and domain: 6 layers each for encoder and decoder, with 4 attention heads per layer, with slight modifications to batch size, vocabulary, and label smoothing p = 0.2.
We tune the subword vocabulary for each language pair and dataset. We saw no difference between joint/disjoint vocabularies, so use separate vocabularies to create a direct comparison with the visual text models: the same target vocabulary is used for both and only the source representations are varied. We tuned ∼5k BPE intervals from 2.5k-35k to optimize source language BPE granularity with the target vocabulary held constant at 10k BPE. We additionally compare character-level and word-level models; to produce word-level segmentations for Chinese, we use jieba, and for Japanese, we use kytea (Neubig et al., 2011). The character vocabulary for Chinese is greater than 2.5k so we do not have a BPE model of this size. Our best performing BPE models used source vocabularies of approximately 5k (see Figure 2).
We jointly tuned batch size and subword vo
For the larger data settings, we train Transformer base models with dropout 0.1 and learning rate 4e-4. We use a batch size of 40k tokens, and train until held-out validation fails to improve for ten epochs. For German, we use a shared unigram subword vocabulary of size 10k. For Chinese, we train separate models of size 20k and 10k, respectively. No other preprocessing was used.

Visual text models
Our visual text models replace the source embedding matrix in the text models with the visual text embedder from Section 2. The model architecture otherwise remains unchanged: we use the same Transformer settings, and the target language vocabulary is 10k BPE. We experiment with parameters for the visual text embedder to find which are significant for this new task in Section 4, with hyperparameter sweeps in Appendix A.
We use the pygame Python package with the Google Noto font family to render text. For Latin and Cyrillic scripts, we use NotoSans; for Arabic, NotoNaskhArabic; and for the ideographic languages, NotoSansCJK JP. No preprocessing is applied before rendering.
While our visual text models remove the source embedding matrix, they may add parameters from convolution blocks if used to compute representations. Our best models typically reduce the number of model parameters, and in the worst case increase overall parameters by less than 1% (from 36.7M to 36.9M), determined by window size and number of convolutional blocks. Computation time increases compared to BPE due to longer source sequences, but our best performing models are faster (with shorter sequences) than character models (Table 2). Time to render text during inference is negligible: it is comparable to subword segmentation, at fractions of a second.

Chasing Translation Parity
State-of-the-art translation models use subword vocabularies, which yield best performance when tuned per language pair and task (Salesky et al., 2018; Ding et al., 2019). Our visual text approach avoids predetermining a fixed model vocabulary.
On the one hand, this allows us to represent even unanticipated characters; on the other, optimizing a finite model vocabulary per task may improve performance. Our first question, therefore, is whether visual text can recover scores produced by baselines with optimized subword vocabularies. On the smaller MTTT dataset, we can nearly recover the best results from the best-tuned BPE segmentation without explicit input segmentation, solely from visual representations with a sliding window. Table 3 compares our best visual text models to our best text baselines on MTTT. The best visual text results use c = 1 convolutional block, which adds some structural biases from convolutions without excessive visual depth. We show c = 0 and c = 7 for comparison, which represent no convolutional blocks and the depth of recent state-of-the-art OCR models, respectively. We find greater visual capacity through a larger number of convolutional blocks does not improve results for our task. Increased convolutional depth also comes at a cost: compared to c = 1, c = 7 adds 2.6M additional parameters and 5× longer training time. In this setting, c = 0 is consistently below c = 1. Our analysis focuses on c = 1.

                   Visual Text
         Text    c = 0   c = 1   c = 7
de-en    33.9    32.9    32.5
zh-en    20.2    21.3    20.5    -
On our larger data scenarios, we see our best visual text models approach (de-en) or exceed (zh-en) the text-based baselines (Table 4). This suggests our approach scales and its efficacy is not limited to lower-resource settings. With more data, c = 0 slightly outperforms the c = 1 model, suggesting this 'direct' model may simply require more training data.
Because the approach is new, it is not known from the outset which hyperparameters for visual representations may affect performance. We conducted experiments to determine significant hyperparameters and best parameter ranges for visual text experiments: namely, for window length, stride, font size, batch size, and CNN kernel size. We see similar hyperparameter trends across language pairs. We find font size is not significant as long as it is sufficiently large to not affect image resolution for more visually dense scripts (at least 10pt; see Table 12 in the Appendix), and CNN kernel size of 3 × 3 and batch size of 20k to be consistently best. We always use a window length greater than or equal to stride length so that no text is dropped. Table 5 shows varied window length and stride values for de-en; additional language pairs and parameter combinations can be found in Appendix A. As stride length increases (creating less overlap between windows) performance typically decreases: our best results typically use stride 10. Optimal window length exhibited the biggest difference between languages. We show ablation experiments isolating the role of sliding window segmentation in Appendix B.

Table 5: German-English BLEU scores on MTTT, tuning stride and window length with fixed batch size.

Robustness to Noise
We hypothesize that without a fixed vocabulary and with associations between visually similar character spans, our visual text models will be more robust to noise than text-based representations, where noise causes divergent subword representations (see Table 1 for motivating examples).
To test this, we evaluate on two different settings: induced synthetic noise, and naturally occurring noise from sources such as Reddit. Synthetic noise allows us to test various settings for all language pairs, while natural noise is limited by dataset availability. Examples of induced noise, and the resulting model inputs and outputs for both text and visual text models, can be found in Table 6.

Synthetic noise
Inducing noise enables us to control the type and frequency with which noise occurs. We compare two types of synthetic noise: visually similar characters (e.g., l33tspeak, unicode codepoints which are visually similar) and character permutations (e.g., Cmabrigde). For all synthetic noise experiments, we induce noise at the token level on the source side of our baseline dataset, MTTT TED. Each token may be noised with probability p from p = 0.1 to 1.0 by intervals of 0.1.
Visually similar characters. Different unicode characters may share visually similar characteristics. Such characters may be substituted intentionally, such as in l33tspeak where characters such as numbers are used in place of visually similar Roman alphabet letters, or unintentionally, where characters from another script appear in place of the expected unicode codepoints for a given language and script due to e.g., use of multiple keyboards or OCR errors (Rijhwani et al., 2020). For some languages without a unicode standard, multiple unicode sequences which render the same are all in common use (e.g., Pashto). As shown in Figure 3, such errors can be very inconspicuous.
We induce noise in the form of Latin characters which are visually similar to Cyrillic characters for Russian (unicode), diacritization for Arabic (diacritics), and l33tspeak for French and German (l33tspeak). We use CAMeL Tools (Obeid et al., 2020) for Arabic diacritization. Figure 4 shows that the visual text model has almost no degradation in performance with unicode noise, even when 100% of characters with a mapping to another visually similar unicode codepoint have been substituted. However, the text model quickly degrades towards 0 as substitutions cause mismatches with BPE vocabularies. Character-based models are similarly unable to handle OOV codepoints, and characters in extremely novel contexts, as found with this type of noise: at p = 0.5, our character model has a disappointing 0.2 BLEU.
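Token-level noising with probability p can be sketched as below. The l33tspeak mapping is a small hypothetical example, not the substitution table used in the experiments, and the choice to sample only among tokens the noise function would actually change is likewise an assumption.

```python
import random
from typing import Callable

# Hypothetical l33tspeak mapping for illustration only.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def leet(token: str) -> str:
    """Replace every character that has a l33tspeak counterpart."""
    return "".join(LEET.get(ch.lower(), ch) for ch in token)


def add_noise(sentence: str, noise_fn: Callable[[str], str], p: float,
              rng: random.Random) -> str:
    """Noise each applicable whitespace token independently with probability p."""
    out = []
    for token in sentence.split():
        noised = noise_fn(token)
        # A token is "applicable" only if the noise function changes it.
        if noised != token and rng.random() < p:
            out.append(noised)
        else:
            out.append(token)
    return " ".join(out)
```

Sweeping p from 0.1 to 1.0 in steps of 0.1 with a fixed seed would give one noised test set per noise level.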
The substitution of visually indistinct code points is perfectly suited to our approach, and it is unsurprising that it does so well. But what about noise that does produce visual variation? Visually, Arabic diacritization represents an addition of a small number of pixels (+15%) which generally do not affect the spatial relationship between base characters. However, at the unicode level, diacritization inserts codepoints that break up adjacent character sequences required for subword matches (see Table 1). While visual text representations are relatively robust to diacritized text, text-based models are significantly negatively impacted: Figure 4 shows decreases of at most 4 BLEU with visual text but up to 31 BLEU with BPE.
Finally, we look at l33tspeak. Here, a reader understands from the unexpected presence of a number that a substitution has been made, and is able to form a mapping to a similar alphabetic letter. However, '4' and 'a' are not necessarily more visually similar in many fonts than say '7' and 'z'; conventional use often dictates l33tspeak substitutions more than visual similarity does. Figure 5 shows that while both visual text models and text models are negatively affected by induced l33tspeak, the visual text models for both language pairs significantly outperform the text models in these conditions. With up to 30% of tokens containing l33tspeak mappings, the visual text models for both German and French perform >5 BLEU better than the text models.

[Figure caption: ∆BLEU is shown for readability; absolute BLEU can be found in Figure 9 in Appendix D. For l33tspeak, improvements with visual text diminish with higher levels of noise.]
Normalization cannot fully address these challenges for text models; see Appendix C for results.
Character permutations are challenging both for subword models, which necessarily back off to smaller units in the presence of OOVs (Table 1), and character-based models (Belinkov and Bisk, 2018). Here we experiment with two types of synthetic noise used by Belinkov and Bisk to compare visual text models to traditional text models.
Swap: Swapping adjacent characters (e.g., language→langauge) is common when typing quickly. We perform one swap per selected word. This noise can be applied to words of length ≥2.
Cam: The purported Cambridge spelling experiment of spam mail fame illustrates the remarkable robustness of humans to character permutations [11] when the first and last character are unchanged (e.g., language→lnagauge). To enable word-medial permutations, this noise can be applied to words of length ≥4.
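The two permutation types can be sketched as follows. The length thresholds follow the description above; choosing the swap position uniformly at random and shuffling the word-medial characters uniformly are assumptions made for illustration.

```python
import random


def swap(token: str, rng: random.Random) -> str:
    """Swap one pair of adjacent characters (tokens of length >= 2)."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)  # assumption: uniform swap position
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def cam(token: str, rng: random.Random) -> str:
    """Shuffle word-medial characters, keeping first and last (length >= 4)."""
    if len(token) < 4:
        return token
    middle = list(token[1:-1])
    rng.shuffle(middle)
    return token[0] + "".join(middle) + token[-1]
```

Note that either function can return the input unchanged by chance (for example, swapping two identical adjacent letters), so a stricter implementation might resample until the token differs.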
We do not apply character permutations to Chinese or Japanese text, since most tokens contain two or fewer characters after word segmentation.
Visual text representations result in significant improvements for character permutations, particularly at higher levels of noise. Figure 6 shows the stark contrast in relative performance between the two models: though a slight gap in performance remains for some of our models on clean text, with even 10% induced noise this gap has been closed. Improvements of up to 24 BLEU on German-English concretely mean that our visual text model achieves 25.9 BLEU on a task where the subword-based model has degraded to 1.9 BLEU. Figure 7 in Appendix D shows absolute degradation in performance for each model and permutation type.
Character permutations exhibit the opposite trend of visual noise: while improvements over text models decreased as more tokens contained visual noise, for permutations, improvements strongly increased with greater levels of noise. This may be because visual noise often involves character substitutions rather than permutations. Permutations affect a greater percentage of the character sequence for a given token, which shatters subword representations. While subword models can use context to recover when only 10% of tokens contain permutations, at higher levels of noise, they cannot. When 100% of tokens contain swaps, for example, the German 5k BPE model backs off to 2.25× more subwords (most words are characters only) than for non-noised text.

[11] With a cost to reading speed (McCusker et al., 1981; Rayner et al., 2006).

Natural noise
Natural noise, as found in informal sources such as Reddit, contains many additional types of noise, including keyboard typos (where nearby keys are substiyuted), substitutions of phonetically similar characterz or worts, unconventional s p a c e s and repetitionsss for effect or error, natural mispelling, and noisy spans which extend beyond individual tokens, among others. Parallel text created from 'found' data (MTNT: Reddit; WIPO: patents) contains such noise in natural contexts.
Table 7 compares visual text models to text models using both subword and character-level representations on MTNT and WIPO test sets. We continue to test our MTTT-trained models in a zero-shot setting, which makes domain a confounding variable for these test sets. The domain mismatch proved challenging for all models. We see that character-level models are in some cases more robust than subwords, but are unable to recover from the variation in others (ja-en) where the visual text model does best. The visual text models improve over subword models and perform competitively with character-level models for German-English, where we have reached parity on our clean data case (Table 3), and Russian-English, where the WIPO patent data has a significant number of unicode OCR errors (as illustrated in Figure 3) and occasional Roman alphabet characters (e.g., for chemical formulas): 3% of source characters in the Russian WIPO test set are outside the Cyrillic unicode codepoint range.

Related Work
Visual representations of text have previously been explored for other NLP tasks, primarily for Chinese, with mixed results. Liu et al. (2017)
In machine translation, visual information was also first used for Chinese. Initial work improved translation models by initializing character embeddings with linearized bitmaps of each character

Conclusion & Future Work
We introduced visually rendered text for continuous open-vocabulary translation. We showed that our models, trained on seven language pairs and in two data settings, approach or match the performance of traditional text models. Further, we showed that visual text models are more robust to many kinds of induced noise, including the substitution of visually similar characters and character permutations. An important benefit of our approach is that it operates on raw text, doing away with the standard preprocessing routines that include normalization, tokenization, and subword segmentation.
We believe our approach opens many avenues for future work. Standard data techniques from OCR (such as varied font and font size) and training on noise would likely further improve robustness. There are many possible visual architectures, and visual pretraining has benefited vision tasks (Dosovitskiy et al., 2021). There is nothing to preclude our approach from working on larger datasets. While effective, it is not clear that sliding window segmentation is optimal; improving segmentation could close remaining performance gaps. Since our approach does away with discrete vocabularies, visual text models could be used to transfer to new languages and scripts without requiring transliteration or normalization, or retraining models from scratch. Finally, it is appealing to consider this approach for additional tasks such as language identification (Caswell et al., 2020, Table 2) or spam detection. Any NLP task that requires robust, open-vocabulary representations could benefit from our approach.

A Parameter tuning
Additional parameter tuning results by language pair for MTTT; Table 9 results for DE-EN can be found in the main text (Table 5). Window length is always greater than or equal to stride length so that no text is dropped. With longer sequence lengths (shorter strides) and smaller batch sizes, we observe occasional instability (similar to character-based text models), which increased batch sizes generally stabilize. w = window size, s = stride, c = number of convolutional blocks.

B Ablation experiments
To what degree are our results due to our implicit model of segmentation through overlapping sliding windows, or the use of visual representations themselves? To disentangle these two factors, we run ablation experiments to separate these two components of the visual embedder.
Sliding window segmentation only. To evaluate our approach to segmentation without visual rendering, we apply sliding window vocabularies to text, creating overlapping character 3-grams: this corresponds to a window size of approximately 30 with font size 10. Character n-grams of a fixed order are not commonly used for NMT, likely due to the large resulting vocabulary and the fact that they do not solve the OOV problem.
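A minimal sketch of the text-only ablation segmentation is shown below. The one-character stride is an assumption, since the ablation's stride is not stated here, and the function name is hypothetical.

```python
from typing import List


def char_ngrams(text: str, n: int = 3, stride: int = 1) -> List[str]:
    """Overlapping character n-grams: a sliding window applied to raw text."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(0, len(text) - n + 1, stride)]


clean = char_ngrams("language")   # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
noisy = char_ngrams("langauge")   # ['lan', 'ang', 'nga', 'gau', 'aug', 'uge']
shared = set(clean) & set(noisy)  # only 'lan' and 'ang' survive one swap
```

The example shows why this segmentation does not solve the OOV problem on its own: a single adjacent swap changes four of the six discrete 3-gram IDs, whereas the corresponding image slices would remain visually similar.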
For languages with (more) uniform character n-gram frequencies (Arabic, German, French, Russian), results with sliding window segmentation but no visual representations (w/o visrep, char n-grams ablation) are similar to the text BPE models' results. For these four languages, the sliding window approach to segmentation does not affect performance (and for German, the sliding windows in fact provide a +1 BLEU improvement over BPE). For French, we see slight degradation with the visual representations compared to the ablation (0.2 BLEU), suggesting that the visual embedder itself has slight room for improvement. For the other languages (Chinese, Japanese, Korean), there is a significant drop w/o visrep due to the higher proportion of infrequently observed vocabulary, leading to a greater proportion of insufficiently trained embeddings. This is a problem that our visual text embedder removes in the full visual text models, because exact lexical matches are not required to train visual representations.
When we add noise, the ablation experiments (sliding window segmentation without visual representations, w/o visrep) degrade below the BPE baselines; this suggests that the visual text embedder (combined with the resulting open vocabulary) is the primary reason for our visual text models' robustness, not our sliding window segmentation.

C Normalization as preprocessing for robustness
A natural question is whether preprocessing can address the robustness issues demonstrated here with traditional text models using e.g., BPE subword segmentation. To evaluate this setting, we apply a spellchecker to each of our noise-induced test sets; for this task, we use the test sets with noise induced with p = 1.0, where 100% of applicable tokens have induced noise. We use the Google Docs spellchecker, which was the best of the options evaluated in recent work (Näther, 2020) that cover all of our tested languages (unlike e.g., Grammarly, which currently supports English only), and which significantly outperformed common open-source alternatives such as Hunspell. We evaluate both the text (BPE) and visual text (visrep) models on the spellchecked test sets; results are shown in Table 11.
It is clear that spellchecking can help the BPE models, in some cases dramatically (up to 20 BLEU). However, it does not close the gap with our method, and in some cases performance degrades; for all induced noise, visual text representations still outperform the BPE models, and often by a large margin.
Spellcheckers are language-specific and, as shown below in Table 11, can be more adept at certain types of noise which were taken into consideration in their construction. For example, while first spellchecking the French swap test set improves the BPE model by more than 20 BLEU, it does not change the l33tspeak performance at all. Similarly, BPE models were only slightly improved for Arabic diacritization and Russian unicode noise, while the visrep model performs strongly for both without spellcheck. Further, like translation models, spellcheckers often rely on context for disambiguation, and so with noisy context may either have lower recall or introduce cascading errors when the correction made is not correct (illustrated below in lower performance for some conditions with spellcheck). A denoising autoencoder may also be able to address many of these phenomena, but requires training and knowledge of the types of noise expected, whereas our approach is a single model and performance is zero-shot. Possible noise grows exponentially as it can appear in combination; it is not feasible to expect normalization to fully address this problem.

D Additional robustness figures
Here we show character permutation results isolated by model and noise type, and absolute BLEU for figures shown with ∆BLEU for readability in the main text.

D.1 Isolated character permutation results
Each plot in Figure 7 shows the degradation in performance of a given model with different proportions of induced noise, relative to the performance of the same model on the uncorrupted text. As more noise is added, the visual text models degrade at a significantly lower pace. Average number of tokens per sentence and average token length affect the amount of noise; for cmabirdge (cam), Korean appears to be an outlier because this noise can be applied to fewer words than in our other languages, as there are fewer words of length ≥ 4 in the data.

D.2 Absolute BLEU
Here we show absolute BLEU for figures shown with ∆BLEU for readability in the main text.
[Figure: absolute BLEU counterpart of Figure 5, which shows ∆BLEU for readability. For l33tspeak, improvements with visual text diminish with higher levels of noise.]

E Pixel Density
Below we show the average pixel density (the average pixel value, normalized to be between 0 and 1, where 0 is white and 1 is black) and percentage of non-white pixels for rendered text. We find that pixel density is not necessarily indicative of performance, but that those languages with lower pixel densities are less sensitive to differences in font size (see parameter grids in Appendix A). Arabic diacritization yields an approximately 2% increase in pixel density.
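The two statistics can be computed as in the sketch below. It assumes 8-bit grayscale input in which 255 is white and 0 is black, so values are inverted to match the convention above (0 = white, 1 = black); the function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grayscale values; assumption: 255 = white


def pixel_density(image: Image) -> float:
    """Mean pixel value after normalizing so that 0 is white and 1 is black."""
    values = [1.0 - v / 255 for row in image for v in row]
    return sum(values) / len(values)


def nonwhite_percentage(image: Image) -> float:
    """Percentage of pixels that are not pure white."""
    pixels = [v for row in image for v in row]
    return 100.0 * sum(1 for v in pixels if v < 255) / len(pixels)
```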