SOTASTREAM: A Streaming Approach to Machine Translation Training

Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.


Introduction
A cumbersome component of training machine translation systems is working with large amounts of data. Modern high-resource parallel datasets are often on the order of hundreds of millions of parallel sentences, and backtranslation easily doubles that (Kocmi et al., 2022, Appendix A). Because this data is too large to fit into main memory, toolkits such as FAIRSEQ (Ott et al., 2019) and SOCKEYE (Hieber et al., 2022) make use of a preprocessing step, which transforms the training data from its raw state into a static sequence of tensors. These tensors can then be read in via an index and memory-mapped shards, allowing for quick assembly into batches at training time.
While this offline preprocessing prevents data loading from becoming a bottleneck in training, it creates a number of other problems: • it breaks an abstraction: the tensorized data is tied to specific modeling decisions, such as the vocabulary; • it is cumbersome: the tensorized data cannot be changed, and even minor variations of the data must be processed separately and then managed; • it is time-consuming: pre-processing can take considerable time and must be completed before training can start; and • it is wasteful: each data variant replicates the original's disk space.
These problems exist for construction of any model, but are exacerbated in research settings, which often explore variations of the training data.
We describe an alternative that factors generation of data from the consumption of that data by the training toolkit. This view presents the training data as an (infinite) stream of permutations of the raw training samples. This stream is then consumed by the training toolkit, which tensorizes it on the fly, consuming data into a buffer from which it can assemble batches. This framework eliminates all the problems above: variants of the data are independent of any model; arbitrary manipulations can be applied on the fly; preprocessing time is amortized over training, which can start as soon as the first batch can be constructed; and no extra disk space or management is required.
We release an open-source implementation of the proposed data generation framework called SOTASTREAM. SOTASTREAM is written in Python and uses Infinibatch to provide a stream of data over permutations of data sources. It additionally provides an easily-extendable set of mixers, augmentors, and filters that allow data to be probabilistically manipulated on the fly. A particular configuration of manipulators is provided by the user in the form of a dynamically-loadable pipeline, which defines a parameterizable recipe that can be used for training. SOTASTREAM uses multiprocessing to reach high throughput levels that prevent starvation of the training toolkit. And finally, it employs a standard UNIX API, writing data to STDOUT.
After presenting this framework ( § 2), we conduct a quality comparison to demonstrate that it does not reduce model quality ( § 4). We then investigate stream bandwidth under various pipelines as well as necessary toolkit consumption needs ( § 5). We conclude by demonstrating a number of use cases ( § 6).

Training from data streams
The core idea underlying SOTASTREAM is to cleanly separate data generation from consumption of that data during training. The data generator is responsible for producing training samples, and the trainer consumes them. This factorization allows us to separate properties of the data (such as their sources, mixing ratios, and augmentations) from properties of training and the model (such as tensor format, batch size, and so on).
The current approach relies on standard UNIX I/O pipes as an interface between these two pieces. However, SOTASTREAM could also be used to generate data for offline uses, or modified to allow consumption through some other API, such as a library call that returns a generator.

Data generation
SOTASTREAM is a data generator. At a high level, it works by defining a pipeline. This pipeline reads from a set of zero or more input data sources, applies any augmentations, and produces a single output stream.
Pipelines Pipelines are implemented by inheriting from the base Pipeline class. The class implementation is responsible for defining the input data sources, reading from them, applying augmentations, and returning a single output stream. These are depicted in Figure 2, a simplified presentation that elides other support features, such as providing the mixing weights for the input data sources.
The pipeline has three basic components: 1. build a stream for each input data source; 2. apply a sequence of augmentors; and 3. merge the streams into a single output stream.
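As a schematic of these three steps (not SOTASTREAM's actual API; build_stream, tag, and mix are illustrative names), streams can be sketched as composed Python generators:

```python
import random

def build_stream(samples):
    """Step 1: an (infinite) stream over permutations of one data source."""
    while True:
        epoch = list(samples)
        random.shuffle(epoch)
        yield from epoch

def tag(stream, token):
    """Step 2: an example augmentor, prepending a tag to each sample."""
    for line in stream:
        yield f"{token} {line}"

def mix(streams, weights):
    """Step 3: merge streams into one, choosing a source per sample."""
    while True:
        stream = random.choices(streams, weights=weights)[0]
        yield next(stream)

parallel = build_stream(["a b", "c d"])
backtrans = tag(build_stream(["e f"]), "[BT]")
merged = mix([parallel, backtrans], weights=[1, 1])
samples = [next(merged) for _ in range(4)]
```

Because every stage is a generator, augmentors compose freely and nothing is materialized on disk.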
Data sources SOTASTREAM uses Infinibatch to return a generator over a permutation of the samples in a data source. Each DataSource object receives two key arguments: a file path to the data source on disk, d, and a processor function, f, to read it. This can be seen in Figure 2 in the call to create_data_stream(d, f).
The data is received as a path to a directory of compressed TSV file shards; Infinibatch requires that data be presented in this way. For each data epoch, Infinibatch produces a permutation of these shards. The shards are then passed, in turn, to the function f, which is responsible for opening, reading, and processing the shard. It is important to note that Infinibatch provides an infinite stream of data; that is, it will present an infinite stream over its input data, subject to the constraint that no shard (within a data source) will be seen n + 1 times until all shards have been seen n times. (See the Multiprocessing paragraph below for important caveats related to multiprocessing and MPI training.)

Augmentations The second argument to create_data_stream is a generator function, f, an Infinibatch primitive whose task is to open each shard and produce an output data stream. The output is in the form of Line objects (Figure 3), each of which is a class representation of the TSV input. By convention in machine translation, fields 0 and 1 are treated as source and target segments, respectively, but the code itself makes no such assumptions. The function is not limited to just reading and returning the data. A key feature of SOTASTREAM is augmentations, which are arbitrary manipulations of a data stream that are easy to stack and accumulate. This is accomplished by composing generators. Figure 2 contains a number of examples in the Augment function. It first opens a stream on a path (passed from Infinibatch, containing a path to a sharded file name). It then applies lowercasing and title-casing to the input stream probabilistically, using a Mixer class to select among them with specified weights. Finally, it prepends a tag to the data, if requested by the caller.
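A minimal sketch of this kind of probabilistic case augmentation, using illustrative stand-ins for SOTASTREAM's Mixer and case augmentors (the real classes differ in interface):

```python
import random

def identity(line):
    return line

def lowercase_source(line):
    """Lowercase only the source side of a (source, target) pair."""
    src, tgt = line
    return (src.lower(), tgt)

def titlecase(line):
    """Title-case both sides of a (source, target) pair."""
    src, tgt = line
    return (src.title(), tgt.title())

def mixer(stream, augmentors, weights):
    """Apply one augmentor per line, chosen with the given weights."""
    for line in stream:
        f = random.choices(augmentors, weights=weights)[0]
        yield f(line)

# 95% unchanged, 4% source-lowercased, 1% title-cased, as in Figure 2
data = [("Hello world", "Hallo Welt")] * 1000
out = list(mixer(iter(data), [identity, lowercase_source, titlecase],
                 weights=[95, 4, 1]))
```

Stacking further augmentors is just wrapping the generator again, which is why such manipulations are easy to accumulate.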
Outputting the stream Finally, at the top level, the (augmented) streams from different data sources are merged into a single stream. This works in the same way as the Mixer class example above. One additional feature is that the Pipeline class provides the ability to set these top-level data weights from the command line (--mix-weights).

Data consumption
The main requirements for the trainer are to consume data into a pool, apply subword processing, organize into batches, and run backpropagation against the training objective. Because these are done on the fly, rather than in preprocessing, special considerations must be implemented to ensure that this extra processing does not become a bottleneck for training.
In Section 5, we experiment with an implementation in the Marian toolkit (Junczys-Dowmunt et al., 2018). Marian makes use of multiple worker threads, which pre-fetch data from STDIN into an internal memory pool, where the data is tokenized and integerized. When the pool is filled, it is sorted and batched (according to run-time settings). In the meantime, prefetching continues into a second pool. As training proceeds, these two pools are used alternately for filling via prefetching and batch generation.
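The pool-then-batch logic can be sketched sequentially as follows (illustrative only; Marian's actual implementation overlaps filling and batching across two pools with worker threads, and sorts by tokenized length rather than character length):

```python
def _drain(pool, batch_size):
    """Sort a full pool by length so batches are homogeneous, then batch."""
    pool.sort(key=len)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

def pooled_batches(stream, pool_size, batch_size):
    """Fill a pool from the stream, then emit length-sorted batches."""
    pool = []
    for line in stream:
        pool.append(line)
        if len(pool) == pool_size:
            yield from _drain(pool, batch_size)
            pool = []
    if pool:  # flush whatever remains at end of stream
        yield from _drain(pool, batch_size)

batches = list(pooled_batches(iter(["aaa", "a", "aaaa", "aa", "b"]),
                              pool_size=4, batch_size=2))
```

Length-sorting within the pool keeps padding waste low while the pool size bounds how far batch order can deviate from stream order.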

Multiprocessing
In order to sustain sufficient throughput, SOTASTREAM makes use of multiprocessing. This is increasingly important if the augmentations applied are expensive to compute. We quantify the effects of multiprocessing for generation under a handful of pipelines of varying complexity in Section 5.
Internally, this is accomplished with the multiprocessing library. We create separate subprocesses, each of which is provided with independent access to the data sources. The parent process maintains a pipe to each subprocess, and queries them in sequence, reading a fixed number of lines from each in turn, and passing them to the standard output.
An important issue arises when working with subprocesses. If each subprocess were to return an independent permutation over the input data, merging the subprocess outputs would not itself result in a permutation. To address this, each of n subprocesses is initialized with 1/n of the data shards, assigned in round-robin order across the subprocesses. In this way, we guarantee a permutation in settings where the number of processes evenly divides the number of shards.
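The round-robin shard assignment and the parent's interleaved reads can be sketched as follows (illustrative code, not SOTASTREAM's implementation; lists stand in for subprocess pipes):

```python
import itertools

def assign_shards(shards, n):
    """Round-robin assignment: subprocess i reads shards i, i+n, i+2n, ..."""
    return [shards[i::n] for i in range(n)]

def interleave(streams, k=1):
    """Parent process: read k lines from each subprocess stream in turn."""
    iters = [iter(s) for s in streams]
    while iters:
        remaining = []
        for it in iters:
            chunk = list(itertools.islice(it, k))
            yield from chunk
            if len(chunk) == k:  # stream not yet exhausted
                remaining.append(it)
        iters = remaining

shards = [f"shard{i}" for i in range(8)]
workers = assign_shards(shards, 4)       # each worker sees 2 of the 8 shards
merged = list(interleave(workers, k=1))  # one pass over a disjoint split
```

Because the split is disjoint and exhaustive, one pass over the merged output covers every shard exactly once, preserving the permutation property when n divides the shard count.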
When working over MPI, no such coordination takes place. Each MPI instantiation will receive a different randomly-seeded shard permutation.

Experimental setup
Our experimental goal is to demonstrate that the many advantages of SOTASTREAM do not come at a cost in accuracy ( § 4) or speed ( § 5). We do this by comparing to a number of other data loading methods. In order to isolate the effects of changing the data loader, we conduct all of our experiments within the Marian toolkit. Marian does not support offline data preprocessing; instead, we compare a number of different streaming settings that cover best-case scenarios.

Streaming variations
We compare the following data-loading variations.
• Full loading. In this scenario, the trainer has direct memory access to the entire data source.
For our experiments, Marian loads the complete datasets into main memory.There is some startup cost, after which all access to the data is immediate.
• Sequential streaming. In this approach, the training data is read sequentially, in a loop over the entire training set. Data is prefetched into a pool of a specified size, from which mini-batches are assembled. Since data is read sequentially, there is no randomization across data epochs. The pool size determines an upper bound on memory usage.
• Randomized sequential streaming.In this variant of sequential streaming, the lines in each data source are randomly permuted prior to being read.
For toolkits that support preprocessing, it is typical to construct an index, which organizes the presorted and tensorized data into memory-mappable shards. Marian does not have a preprocessing option, which means that we have no comparison to a setting where tensorization is done offline. We thus consider full loading to be the closest equivalent, since preprocessing is in fact a stand-in for full loading. This can only possibly affect speed comparisons ( § 5).

Model parameters
We conduct experiments in a large-data and a small-data setting. Our large-data setting is English-German. We train on 297m lines of Paracrawl v9 (Bañón et al., 2020) from WMT22 (Kocmi et al., 2022). We use a 32k shared unigram subword model (Kudo, 2018) using SentencePiece (Kudo and Richardson, 2018), trained jointly over both sides. We train a standard base Transformer model (Vaswani et al., 2017) with 6/6 encoder/decoder layers, an embedding size of 1024, a feed-forward size of 4096, and 8 attention heads. The large model is trained for 20 virtual epochs. Since there are roughly 7.4 billion target-side tokens after tokenizing the data, this equates to roughly three passes over the data.
For the small-data setting, we train on Czech-Ukrainian, also from WMT22. This dataset has roughly 12m parallel lines. We use the same model and parameter settings, but train for only five virtual epochs, or roughly 30 data epochs, since the model converges by then.
• COMET20/22 (Rei et al., 2020), using model wmt20-comet-da (EN-DE) or wmt22-comet-da (CS-UK).

Sequential streaming is the only data generation approach with no randomization, and with no filtering, curriculum effects may be more pronounced. Among the approaches that permute the data, SOTASTREAM is on par with the others.

Speed
Next we ask whether SOTASTREAM has a negative effect on speed. The short answer is that it does not. We examine speed in three settings: generation speed ( § 5.1), Marian's consumption speed ( § 5.2), and total runtime ( § 5.3).

Data generation
We first examine how fast SOTASTREAM can write data to STDOUT.
Our benchmark consists of a producer and a consumer connected by a UNIX pipe. The producer varies among the tools we compare (described below), while the consumer is a lightweight script whose sole purpose is to count records from STDIN and report the yield rate (the number of lines per second). All benchmarks are run one at a time, on the same machine, with no other CPU- or I/O-intensive processes competing for resources. We run each benchmark multiple times and report the average.
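Such a consumer can be sketched in a few lines of Python (a hypothetical stand-in for the actual benchmark script; in the benchmark it would wrap sys.stdin rather than a generator):

```python
import time

def count_rate(stream):
    """Count records from a stream and report lines per second."""
    start = time.monotonic()
    n = 0
    for _ in stream:  # in the benchmark, stream would be sys.stdin
        n += 1
    elapsed = time.monotonic() - start
    return n, (n / elapsed if elapsed > 0 else float("inf"))

n, rate = count_rate(f"line {i}" for i in range(100_000))
```

Keeping the consumer this trivial ensures the pipe's yield rate measures the producer, not the consumer.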
We compare the following generation tools: • zcat: A wrapper to GNU gzip that decompresses and outputs lines. This serves as the best-case scenario, where the producer is implemented in an efficient way (e.g., C/C++) and has no time-consuming augmentations.
• zcat.py: Similar to zcat, but based on the gzip API from Python's standard library. • default pipeline: SOTASTREAM's default, returning lines from a single data source ( § 2.1) with no augmentations.
• case augmentor pipeline: the pipeline from Figure 2. It mixes two data sources, applies case transformations, and prepends a "[BT]" tag to the backtranslated data.
                        EN-DE                               CS-UK
Full loading            55.9 ± 0.4  34.9 ± 0.1  62.0 ± 0.0  85.5 ± 0.2  27.9 ± 0.4  55.6 ± 0.2
Sequential streaming    56.1 ± 0.2  35.0 ± 0.2  62.1 ± 0.0  86.4 ± 0.1  28.7 ± 0.3  56.6 ± 0.2
Randomized streaming    55.8 ± 0.2  35.1 ± 0.0  62.2 ± 0.0  85.6 ± 0.1  27.8 ± 0.0  55.6 ± 0.2
SOTASTREAM              55.9 ± 0.1  34.9 ± 0.1  62.1 ± 0.1  85.7 ± 0.2  28.5 ± 0.4  56.2 ± 0.2

Table 1: Mean over three runs (three metrics per language pair) for our high- and low-resource scenarios. The best constrained system is WeChat-AI (Zeng et al., 2021) for EN-DE and AMU (Nowakowski et al., 2022) for CS-UK.

Consumption speed

We use a smaller model than for Table 1, since smaller models train faster and therefore have higher data consumption needs. We use 6/6 encoder/decoder layers, 512-dimensional embeddings, and feedforward sublayers of size 2048. We report consumption rate for six settings: one vs. eight GPUs, and using one, four, or eight prefetching worker threads. As shown in Figure 4, a trainer with a single GPU consumes about 1523 lines/s, and with eight GPUs, the consumption rate increases to 8957 lines/s. Even in the best-case scenario (smaller model, more GPUs, and more prefetcher threads), the consumption rate of the training process is lower than SOTASTREAM's production rate. We recommend running multiple workers when augmentations are slow in order to maintain sufficient output rates. We do not experiment with them here, but in multi-node training settings coordinated with MPI, one (multiprocess) instance of SOTASTREAM should be run per node.

Total time to run
Table 2 verifies that SOTASTREAM's amortized approach is neither slower nor faster than other approaches when total runtime is considered.

Example Use Cases
In this section we present example use cases showing how SOTASTREAM can simply and easily modify data on the fly. This provides the advantages of training for robustness without the cumbersome task of generating and managing data preprocessed in many different forms, whose combinatorics impose high costs on the complexity of managing training runs.

Mixing multiple streams of data
Training machine translation models often requires combining different datasets in desired proportions in order to balance their size, quality, or other properties. The example in Figure 2 demonstrates that combining original parallel data and backtranslated data can be efficiently achieved in SOTASTREAM by mixing multiple data streams with specific data weighting. The weights for each data stream can then be specified using command-line options:

    sotastream robust-case \
        parallel.tsv.gz backtrans.tsv.gz \
        --mix-weights 1 1

The weights are normalized and used as probabilities with the Mixer augmentor. Compared to traditional offline preparation of the data, this approach is much simpler and more scalable, saves disk space, and does not require complicated ratio computation or data over- or downsampling.

Data augmentation for robustness
SOTASTREAM's augmentors provide a flexible framework for different methods of data augmentation, for example, case manipulation for robustness against different casing variants of the input text. This is demonstrated in the example in Figure 2, where LowerCase is an augmentor that lowercases the source text, and TitleCase converts both source and target sides to the English title-cased format. The frequency of each variant is easily controlled with the same Mixer used to join separate data sources. The on-the-fly approach simplifies experiments when testing multiple variations, which is often needed in order to find optimal augmentation methods and ratios; it minimizes the burden of experiment management.
Many other types of robustness augmentation (Li et al., 2019), such as source-side punctuation removal, spelling errors generation, etc., can be implemented in a similar way.

Filtering bad data examples
In SOTASTREAM it is straightforward to filter data on the fly. This type of filtering is especially useful in scenarios in which external data that cannot be manually filtered in a controlled way is used for model training or fine-tuning.
For example, a URLFilter that removes lines with unmatched URLs between the source and target fields can be implemented using the provided MatchFilter:

    def URLFilter(stream):
        pattern = r'\bhttps?:\S+[a-z]\b'
        return MatchFilter(stream, pattern)
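For illustration, a hypothetical implementation of such a MatchFilter (the actual SOTASTREAM primitive may differ in details) could compare the pattern's matches on the source and target fields and drop lines where they disagree:

```python
import re

def MatchFilter(stream, pattern):
    """Yield only (source, target) lines whose pattern matches agree.
    Hypothetical sketch of the filter described above."""
    regex = re.compile(pattern)
    for line in stream:
        src, tgt = line
        if sorted(regex.findall(src)) == sorted(regex.findall(tgt)):
            yield line

pattern = r'\bhttps?:\S+[a-z]\b'
data = [
    ("see https://x.example/a", "siehe https://x.example/a"),  # URLs match: kept
    ("see https://x.example/a", "siehe"),                      # unmatched: dropped
    ("no urls here", "keine Links hier"),                      # no URLs: kept
]
kept = list(MatchFilter(iter(data), pattern))
```

Because the filter is itself a generator, it stacks with augmentors in the same pipeline.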

Subword tokenization sampling
The boundary separating data generation from consumption can be blurred. A wrapper around this function could merge the source and target sides of the Line object, perhaps subject to parameters such as a maximum sequence length, a maximum number of sentences, and structural tokens to be used as affixes.
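SentencePiece exposes subword sampling at encoding time (its encode call accepts sampling parameters), so segmentation can move into the data stream. As a self-contained toy illustration of the idea, the following samples among all segmentations of a word under a fixed subword vocabulary (uniformly here, where a unigram LM would weight them):

```python
import random

def segmentations(word, vocab):
    """All ways to split `word` into pieces drawn from `vocab`."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            out.extend([piece] + rest
                       for rest in segmentations(word[i:], vocab))
    return out

def sample_segmentation(word, vocab):
    """Pick one segmentation at random; a new draw on every epoch
    exposes the model to different subword splits of the same text."""
    return random.choice(segmentations(word, vocab))

vocab = {"un", "f", "u", "n", "fun", "unfun"}
segs = segmentations("unfun", vocab)
seg = sample_segmentation("unfun", vocab)
```

Doing this in the stream, rather than in preprocessing, means each data epoch sees fresh subword splits at no extra disk cost.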

Alignments and other data types
SOTASTREAM has been primarily designed for machine translation, which requires providing source and target texts as separate fields. Other data types or metadata can be generated on the fly or provided as additional fields in the input stream. By design the existing augmentors pass forward the unused fields, which makes introducing new fields that are used only by a subset of augmentors simple.
The example below demonstrates on-the-fly generation of word alignments using SimAlign (Jalili Sabet et al., 2020). The word alignments can be used directly by the trainer, e.g., for guided alignment training (Chen et al., 2016), or by subsequent augmentors that may require them, e.g., constrained terminology translation annotations (Bergmanis and Pinnis, 2021).
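A sketch of such an augmentor, with a stub aligner standing in for SimAlign's embedding-based aligner (the tuple layout and "i-j" alignment format here are illustrative, not SOTASTREAM's exact conventions):

```python
def align_stream(stream, aligner):
    """Append a word-alignment field ("i-j" pairs) to each (src, tgt) line."""
    for src, tgt in stream:
        pairs = aligner(src.split(), tgt.split())
        field = " ".join(f"{i}-{j}" for i, j in pairs)
        yield (src, tgt, field)

def diagonal_aligner(src_tokens, tgt_tokens):
    """Stub: align position-for-position; SimAlign would align using
    multilingual embeddings instead."""
    return [(i, i) for i in range(min(len(src_tokens), len(tgt_tokens)))]

out = list(align_stream(iter([("a b c", "x y")]), diagonal_aligner))
```

Since augmentors pass unused fields forward, the new alignment field is visible only to downstream consumers that look for it.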

Integration with data collection tools
SOTASTREAM can integrate tools like MTData, which automates the collection and preparation of machine translation data sets (Gowda et al., 2021).

Generating data sets for offline use
If the training tool does not support consuming training data from the standard input, SOTASTREAM can be used for static data generation.
While the real advantages of SOTASTREAM accrue when making use of its on-the-fly data manipulations, this approach retains some of its benefits.

Other uses
The SOTASTREAM approach to factoring out data generation, as well as SOTASTREAM itself, could also be used for generating non-textual content. The benefits of not writing data to disk would be greater in settings where inputs occupy more disk space than plain text, such as translation from visual representations (Salesky et al., 2021). Nor does the approach need to be limited to sequence-to-sequence settings; we imagine it could be useful for training LLMs.

Related Work
To our knowledge, SOTASTREAM is novel in presenting a framework for the generation of training data as a distinct component in the model training pipeline. It emphasizes a clean separation between data generation and training, multithreading for throughput, and the use of the standard UNIX pipeline interface.
There are many libraries focused on data augmentation. A number of these are focused just on text augmentations, including nlpaug (Ma, 2019), TextAttack (Morris et al., 2020), and TextFlint (Gui et al., 2021). Another tool is AugLy (Papakipos and Bitton, 2022), a multimodal tool for text, audio, images, and video that provides robust training against adversarial perturbations. Many of these libraries would be useful within SOTASTREAM's general framework.

Conclusion
Working with large datasets can be difficult. It is time-consuming to copy and process data, and expensive to store it on disk. The data-preprocessing approach that is common in machine translation model training makes it possible to work with increasingly large datasets, but this ability does not come without costs. If data is compiled with model-specific parameters that tie the data to a particular model training, it prevents or at least complicates reusability. This problem is further exacerbated in research settings where one of the experimental parameters is manipulation of the training data, since each variant must be written to disk and then managed.
We have described an approach that separates data generation from data consumption, and shared SOTASTREAM, an implementation that makes use of the standard UNIX pipeline. The requirement is that preprocessing must now be computed on the fly. Our experiments show that this does not slow down training, nor does it affect the accuracy of the models trained. The approach provides flexibility, saves processing time and disk space, and simplifies experiment management.

Limitations
We have only investigated data consumption rates in a single toolkit, Marian, written in C++. It is possible that the online preprocessing requirements may be too demanding for toolkits written in languages without a proper thread implementation.

Figure 1: The SOTASTREAM approach separates data generation from consumption. Whereas offline tensorization requires model-specific parameters such as the vocabulary, which ties processed data to a particular training run, SOTASTREAM produces data on the fly, avoiding time-consuming production and space-wasting storage of preprocessed data.

Figure 2: A simplified pipeline. Streams are built by composing generator functions over input data sources (here, parallel and backtranslated data). This example tags the backtranslated stream, then mixes it with the parallel stream using weights provided on the command line (defaulting to 1:1). It then applies random source-lowercasing (4%) and title-casing (1%).

Figure 3: The (simplified) Line object, a lightweight wrapper around a single row of tab-separated input data.

Table 2: End-to-end training time.