ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System

This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate from them an answer in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against an ensemble of re-rankers based on multilingual pretrained language models (PLMs), as well as variants of the shared task baseline, re-training it from scratch using a recently introduced contrastive loss that maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language- and domain-specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with automatically generated question-answer pairs created from Wikipedia passages to mitigate the issue of data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language- and domain-specialization as well as data augmentation help, especially for low-resource languages.


Introduction
Open-retrieval Question Answering (OQA), where the agent helps users retrieve answers from large-scale document collections given open questions, has arguably been one of the most challenging natural language processing (NLP) applications in recent years (e.g., Izacard and Grave, 2021). As is the case with the vast majority of NLP tasks, much of the work on OQA has focused on English, relying on a pipeline that crucially depends on a neural passage retriever, i.e., a (re-)ranking model trained on large-scale English QA datasets, to find evidence passages in English for answer generation. Unlike in many other retrieval-based tasks, such as ad-hoc document retrieval (Craswell et al., 2021), parallel sentence mining (Zweigenbaum et al., 2018), or entity linking, progress toward Cross-lingual Open-retrieval Question Answering (COQA) has been hindered by the lack of efficient integration and consolidation of knowledge expressed in different languages (Loginova et al., 2021). COQA is especially relevant for opinionated information, such as news, blogs, and social media. In the era of fake news and deliberate misinformation, training on only (or predominantly) English texts is likely to lead to more biased and less reliable NLP models. Further, an Anglo- and Indo-European-centric NLP (Joshi et al., 2020) is unrepresentative of the needs of the majority of the world's population (e.g., Mandarin and Spanish have more native speakers than English, and Hindi and Arabic come close) and contributes to the widening of the digital language divide. 1 Developing solutions for cross-lingual open QA (COQA) thus contributes towards the goal of global equity of information access.
Figure 1: The proposed pipeline for the cross-lingual QA problem. The pipeline is composed of two stages: (i) the retrieval of documents containing a possible answer (red box) and (ii) the generation of an answer (blue box). For the retrieval part, we exploited different methods, based both on training mDPR variants and on the ensembling of blackbox models. For training the mDPR variants, we enlarge the original training dataset with samples from our data augmentation pipeline. For the generation part, we enrich the existing baseline method with data augmentation and masked language modeling.

COQA is the task of automatic question answering where the answer is to be found in a large multilingual document collection. It is a challenging NLP task: questions are written in a user's preferred language, the system needs to find evidence in a large-scale document collection written in different languages, and the answer then needs to be returned to the user in their preferred language (i.e., the language of the question).
More formally, the goal of a COQA system is to find an answer $a$ to the query $q$ in a collection of documents $\{D_i\}_{i=1}^N$. In the cross-lingual setting, $q$ and $\{D_i\}_{i=1}^N$ are, in general, in different languages. For example, a user can ask a question in Japanese, and the system can search an English document for an answer that then needs to be returned to the user in Japanese.
In this work, we propose data augmentation for specialized models that correspond to the two main components of a standard COQA system, passage retrieval and answer generation: (1) we first extract the passages from all documents in all languages, exploiting both unsupervised and supervised (e.g., mDPR variants) passage retrieval methods; (2) we then use the retrieved passages from the previous step and further conduct intermediate training of a pretrained language model (e.g., mT5 (Xue et al., 2021)) on the extracted augmented data in order to inject language-specific knowledge into the model, and then generate the answers for each question from the different language portions. As a result, we obtain a specialized model trained on the augmented data for the COQA system. The overall process is illustrated in Figure 1.

Data Augmentation
We use a language model to generate question-answer (QA) pairs from English texts, which we then filter according to a number of heuristics and translate into the other 15 languages. 2 An example can be seen in Figure 2 in the Appendix.

Question-Answer Generation
For generating the question-answer pairs, we use the provided Wikipedia passages as the input to a language model, which then generates questions and answers based on the input text. We based our choice of model on the findings of Dong et al. (2019), who showed that language models fine-tuned jointly on Question Answering and Question Generation outperform individual models fine-tuned independently on those tasks. More specifically, we use the model by Dugan et al. (2022) with slight modifications. Dugan et al. (2022) used a T5 model fine-tuned on SQuAD and further fine-tuned it on three tasks simultaneously: Question Generation (QG), Question Answering (QA), and Answer Extraction (AE). They also included a summarization module to create lexically diverse question-answer pairs. We found that using this module sometimes leads to factually incorrect passages, and leave its use to future work. Similar to Dugan et al. (2022), we split the original passages into sub-passages of whole sentences that are shorter than 512 tokens. 3 We then generate the pairs using the first three sub-passages.
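The sentence-packing step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes sentences have already been split (the paper uses pySBD) and approximates token counts by whitespace splitting rather than model tokenization.

```python
def split_into_subpassages(sentences, max_tokens=512):
    """Greedily pack whole sentences into sub-passages whose
    approximate (whitespace) token count stays below max_tokens."""
    subpassages, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # start a new sub-passage if adding this sentence overflows
        if current and current_len + n > max_tokens:
            subpassages.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        subpassages.append(" ".join(current))
    return subpassages
```

The first three sub-passages returned by such a function would then be fed to the QA-pair generator.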

Filtering
Before translating question-answer pairs, to ensure better translations, we require each pair to satisfy at least one of a number of heuristics, which we determined through manual evaluation of the generated pairs. Each pair is evaluated on whether one of the following is true (checked in the given order): the answer is a number, the question starts with the word who, the question starts with the words how many, or the answer contains a number or a date. After filtering, we are left with roughly 339,000 question-answer pairs.
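The heuristics above can be expressed as a simple predicate. This is an illustrative sketch, not the paper's exact code: in particular, the date check is simplified here to "contains a digit", since the precise date-detection rule is not specified.

```python
import re

WHO_RE = re.compile(r"^\s*who\b", re.IGNORECASE)
HOW_MANY_RE = re.compile(r"^\s*how many\b", re.IGNORECASE)
DIGIT_RE = re.compile(r"\d")

def keep_pair(question, answer):
    """Return True if a generated QA pair passes any heuristic,
    checked in the order given in the text."""
    if answer.strip().isdigit():       # the answer is a number
        return True
    if WHO_RE.match(question):         # question starts with "who"
        return True
    if HOW_MANY_RE.match(question):    # question starts with "how many"
        return True
    if DIGIT_RE.search(answer):        # answer contains a number/date-like digit
        return True
    return False
```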

Translation
We use the Google Translate API provided by translatepy 4 to translate the filtered question-answer pairs from English into the relevant 15 languages. Each language has an equal number of question-answer pairs. In total, we generate about 5.4 million pairs for all languages combined.

Methodology
Following the approach described in Asai et al. (2021b), we decompose the COQA problem into two sub-components: the retrieval of documents containing a possible answer and the generation of an answer. Figure 1 summarizes the proposed methods. This section is organized as follows: we present the proposed retrieval methods in §3.1 and demonstrate the language-specialized methods for answer generation in §3.2.

Passage Retrieval
For the passage retrieval phase, we explored the approaches described in the following sections, which fall into three main categories: the enhancement of the training procedure of the mDPR baseline, the ensembling of blackbox models (i.e., retrieval using multilingual sentence encoders trained for semantic similarity), and lexical retrieval using BM25. While the first category is a supervised approach, which uses QA datasets to inject task knowledge into pretrained models, the others rely on general linguistic knowledge for retrieval (i.e., they are unsupervised).

3 We use a different sentence splitting method, namely pySBD.
4 https://github.com/Animenosekai/translate
Baseline: mDPR We take as a baseline the method proposed in Asai et al. (2021b). They propose mDPR (Multilingual Dense Passage Retriever), a model that extends the Dense Passage Retriever (DPR) (Karpukhin et al., 2020) to a multilingual setting. It consists of two mBERT-based encoders, one for the question and one for the passages. The training approach proceeds over two subsequent stages: (i) parameter updates and (ii) cross-lingual data mining.
In the first phase, both mDPR and mGEN ( §3.2) are trained one after the other. mDPR processes a dataset $D = \{(q_i, p_i^+, \{p_{i,j}^-\}_{j=1}^n)\}$ made of tuples containing a question $q_i$, the passage $p_i^+$ containing an answer (called positive or gold), and a set $\{p_{i,j}^-\}_{j=1}^n$ of negative passages. For every question, negatives consist of the positive passages of the other questions, passages extracted at random, or passages produced by the subsequent data mining phase. Training uses a contrastive loss ($\mathcal{L}_{\text{mdpr}}$) that moves the embedding of the question close to its positive passage while repelling the representations of the negative passages:

$$\mathcal{L}_{\text{mdpr}} = -\log \frac{\exp(e_{q_i} \cdot e_{p_i^+})}{\exp(e_{q_i} \cdot e_{p_i^+}) + \sum_{j=1}^{n} \exp(e_{q_i} \cdot e_{p_{i,j}^-})}$$

In the second stage, the training set is expanded by finding new positive and negative passages, using Wikipedia language links and mGEN ( §3.2) to automatically label passages. This two-stage training pipeline is repeated $T$ times.

mDPR variants One of our approaches is to substitute the loss function presented above with the contrastive loss described in Wang et al. (2022), named MixCSE. In that work, the authors tackle a common problem of out-of-the-box BERT sentence embeddings, called anisotropy, which causes all sentence representations to be distributed in a narrow cone. Contrastive learning has already proven effective in alleviating this issue by distributing embeddings in a larger space (Gao et al., 2021). Wang et al. (2022) show that hard negatives, i.e., data points that are hard to distinguish from the selected anchor, are key for keeping a strong gradient signal; however, as learning proceeds, they become orthogonal to the anchor, and the gradient signal approaches zero. For this reason, the key idea of MixCSE is to continually generate hard negatives by mixing positive and negative examples, which maintains a strong gradient signal throughout training. Adapting this concept to our retrieval scenario, we construct mixed negative passage representations as follows:

$$\tilde{e}_i = \frac{\lambda\, e_{p_i^+} + (1-\lambda)\, e_{p_{i,j}^-}}{\lVert \lambda\, e_{p_i^+} + (1-\lambda)\, e_{p_{i,j}^-} \rVert}$$

where $p_{i,j}^-$ is a negative passage chosen at random and $\lambda$ is a mixing coefficient.
We provide the full equation of the loss in Appendix B. The main differences with respect to $\mathcal{L}_{\text{mdpr}}$ are the addition of a mixed negative in the denominator and the similarity function used (exponential of temperature-scaled cosine similarity instead of dot product).
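The mixed-negative construction and the resulting loss can be sketched numerically as follows. This is an illustrative NumPy sketch, not the training implementation: it operates on fixed embedding vectors, so the stop-gradient of the paper has no effect here, and the default $\lambda$ and $\tau$ simply mirror the values reported in the experimental setup.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mixed_negative(e_pos, e_neg, lam=0.2):
    """Mix a positive and a negative passage embedding and re-normalize,
    following the MixCSE construction (Wang et al., 2022)."""
    m = lam * e_pos + (1.0 - lam) * e_neg
    return m / np.linalg.norm(m)

def mixcse_loss(e_q, e_pos, e_negs, lam=0.2, tau=0.05):
    """Contrastive loss with one extra mixed hard negative in the
    denominator, using temperature-scaled cosine similarity."""
    e_mix = mixed_negative(e_pos, e_negs[np.random.randint(len(e_negs))], lam)
    pos = np.exp(cos(e_q, e_pos) / tau)
    denom = pos + sum(np.exp(cos(e_q, n) / tau) for n in e_negs)
    denom += np.exp(cos(e_q, e_mix) / tau)
    return -np.log(pos / denom)
```

As intended, the loss is small when the question embedding is aligned with its positive passage and large when it is not.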
We train mDPR with the original loss and with the MixCSE loss on the concatenation of the provided training set for mDPR and the augmented data obtained via the methods described in §2. We refer to these two variants as mDPR(AUG) and mDPR(AUG) with MixCSE, respectively.
Ensembling "blackbox" models Following the approaches presented in Litschko et al. (2022), we also ensemble the rankings of several blackbox models that directly produce a semantic embedding of the input text. We provide a brief overview of the models included in our ensemble below.
• DISTIL (Reimers and Gurevych, 2020) is a teacher-student framework for injecting the knowledge obtained through specialization for semantic similarity from a specialized monolingual transformer (e.g., BERT) into a non-specialized multilingual transformer (e.g., mBERT). It first specializes a monolingual (English) teacher encoder using the available semantic sentence-matching datasets for supervision. In the second, knowledge distillation step, a pretrained multilingual student encoder is trained to mimic the output of the teacher model. We benchmark different DISTIL models:
  - DISTIL use: instantiates the student as a pretrained m-USE instance;
  - DISTIL xlmr: initializes the student model with the pretrained XLM-R transformer;
  - DISTIL dmbert: distills the knowledge from the Sentence-BERT (Reimers and Gurevych, 2019) teacher into a multilingual version of DistilBERT (Sanh et al., 2019), a 6-layer transformer pre-distilled from mBERT.
• LaBSE (Language-agnostic BERT Sentence Embeddings; Feng et al., 2020) is a neural dual-encoder framework trained with parallel data. LaBSE training starts from a pretrained mBERT instance and additionally uses the standard self-supervised objectives used in the pretraining of mBERT and XLM (Conneau and Lample, 2019): masked and translation language modelling (MLM and TLM).
• MiniLM is a student model trained by deeply mimicking the self-attention behavior of the last Transformer layer of the teacher, which allows a flexible number of layers for the student and alleviates the effort of finding the best layer mapping.
• MPNet (Song et al., 2020) is based on a pretraining method that leverages the dependency among the predicted tokens through permuted language modeling and lets the model see auxiliary position information to reduce the discrepancy between pretraining and fine-tuning.
We produce an ensemble of the blackbox models by simply averaging, for each retrieved document, its rank across models; we denote this method EnsembleRank.
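Rank averaging can be sketched as below. This is a minimal illustration under one stated assumption: a document missing from a model's ranked list is assigned that list's length as a pessimistic rank, a detail the paper does not specify.

```python
from collections import defaultdict

def ensemble_rank(rankings):
    """rankings: list of ranked doc-id lists, one per blackbox model.
    Returns all doc ids sorted by their average rank across models."""
    scores = defaultdict(float)
    docs = {d for r in rankings for d in r}
    for ranking in rankings:
        pos = {d: i for i, d in enumerate(ranking)}
        for d in docs:
            # unranked documents get the worst possible rank for this model
            scores[d] += pos.get(d, len(ranking))
    return sorted(docs, key=lambda d: scores[d] / len(rankings))
```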
Oracle Monolingual BM25 (Sparck Jones et al., 2000) This approach consists of two phases: first, we automatically detect the language of the question; then, we query the index in the detected language. As the weighting scheme in the vector space model, we choose BM25, which is based on a probabilistic interpretation of how terms contribute to a document's relevance. It uses exact term matching, and the score is a sum of contributions from each query term that appears in the document. We use an oracle BM25 approach: the name derives from the fact that, at training time, we query the index with the answer rather than the question. This increases the probability that the answer is contained in the passages consumed by mGEN, so that the generation model hopefully learns to extract the answer from its input rather than generating it from the question alone. At inference time, we query the index using the question.
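For reference, a minimal BM25 scorer over tokenized documents can be sketched as follows. This is an illustrative implementation of the standard Okapi BM25 formula, not the retrieval stack used in the paper; the parameter values k1=1.5 and b=0.75 are common defaults and are not reported in the text.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    with BM25: a sum over query terms of IDF times a saturated,
    length-normalized term frequency."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()
    for d in docs:
        df.update(set(d))            # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue             # exact term matching only
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores
```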

Answer Generation
Our answer generation modules take a concatenation of the question and the related documents retrieved by the retrieval module as input and generate an answer. In this section, we first explain the baseline system, which is the basis of our proposed approaches, and then present our specialization method.
Baseline: mGEN We use mGEN (Multilingual Answer Generator; Asai et al., 2021b) as the baseline for the answer generation phase. mGEN takes mT5 (Xue et al., 2021), a multilingual version of a pretrained transformer-based encoder-decoder model (Raffel et al., 2020), and fine-tunes it for multilingual answer generation. The pretraining of mT5 is based on a variant of masked language modeling named span corruption, in which the objective is to reconstruct contiguous spans of masked tokens in an input sentence (Xue et al., 2021). For fine-tuning, the model is trained on a sequence-to-sequence (seq2seq) task:

$$p(a^L \mid q^L, P^N) = \prod_{i} p(a^L_i \mid a^L_{<i}, q^L, P^N)$$

That is, at each time step $i$ the model predicts a probability distribution over its vocabulary, conditioned on the previously generated answer tokens ($a^L_{<i}$), the input question ($q^L$), and the $N$ retrieved passages ($P^N$). Because of a possible language mismatch between the answer and the passages, it is not possible to extract answers as in existing work on monolingual QA tasks; for this reason, mGEN opts for directly generating answers instead.
Masked Language Modeling (MLM) Following successful work on language-specialized pretraining via language modeling (Glavaš et al., 2020;Hung et al., 2022), we investigate the effect of running MLM on the language-specific portions of Wikipedia passages (Asai et al., 2021b) and CCNet with mT5 (Xue et al., 2021). Of the extracted texts for all 16 languages, 14 languages come from the released Wikipedia passages, and the two missing surprise languages (Tamil, Tagalog) come from CCNet. We additionally clean all language portions by removing e-mail addresses, URLs, extra emojis, and punctuation, and select 7K passages for training and 0.7K for validation for each language. In this way, we inject both domain-specific (i.e., Wikipedia) and language-specific (i.e., 16 languages) knowledge into the multilingual pretrained language model via MLM-ing as an intermediate specialization step.
Augmentation Data Variants To further investigate the model's capability of (1) extracting answers from English passages or (2) extracting answers from translated passages, while keeping the question-answer pairs in non-English languages, we experiment with two augmentation data variants: AUG-QA and AUG-QAP. AUG-QA keeps the English passage together with the translated question-answer pairs, while AUG-QAP also translates the English passage into the same language as the translated question-answer pairs. Detailed examples are shown in Table 1.

Experimental Setup
We demonstrate the effectiveness of our proposed COQA systems by comparing them to the baseline models and thoroughly comparing the different specialization methods described in §3.
Evaluation Task and Measures Our proposed approaches are evaluated in 16 languages, 8 of which are not covered in the training data. 5 The training and evaluation data are originally from Natural Questions (Kwiatkowski et al., 2019), XOR-TyDi QA (Asai et al., 2021a), and MKQA (Longpre et al., 2020). Data size statistics for each resource and language are shown in Tables 2 and 3.
The evaluation results are measured on the competition platform hosted at eval.ai. 6 The systems are evaluated on two COQA datasets, XOR-TyDi QA (Asai et al., 2021a) and MKQA (Longpre et al., 2020), using token-level F1 (F1), as is common evaluation practice for open QA systems. For non-spacing languages, we follow the token-level tokenizers 7 for both predictions and ground-truth answers. The overall score is calculated by macro-averaging scores on the XOR-TyDi QA and MKQA datasets, and then averaging the F1 scores of the two datasets.
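Token-level F1 over whitespace-tokenized strings can be sketched as follows; this is the standard open-QA formulation, with the caveat that for non-spacing languages a language-specific tokenizer would replace the `split()` call.

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall between a predicted
    answer and a ground-truth answer."""
    pred, gold = prediction.split(), ground_truth.split()
    common = Counter(pred) & Counter(gold)   # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```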
Data We explicitly state that we did not train on the development data or on the subsets of Natural Questions and TyDi QA that are used to create the MKQA and XOR-TyDi QA datasets. This places all of our proposed approaches in the constrained setup proposed by the organizers. For training the mDPR variants, we exploit the organizer's dataset obtained from DPR Natural Questions and XOR-TyDi QA gold paragraph data. More specifically, for training and validation, we always use the version of the dataset containing augmented positive and negative passages obtained from the top 50 retrieval results of the organizer's mDPR. We merge this dataset with the augmented data, filtering the latter to get 100k samples for each of the 16 languages.
We base our training data for answer generation models on the organizer's datasets with the top 15 retrieved documents from the coupled retriever. To use the automatically generated question-answer pairs for each language from §2 for fine-tuning, we align the format with the retrieved results by randomly sampling passages from English Wikipedia as negative contexts, 8 while we keep the seed documents as positive ones. We explore two ways of merging the positive and negative passages: in the "shuffle" style, the positive passage appears in one of the top 3 documents; in the "non-shuffle" style, the positive passage always appears on top. However, since these two configurations did not show large differences, we only report the former in this paper. We also investigated whether translating passages into the 16 different languages 9 is beneficial compared to keeping all the passages in English (AUG-QA). Due to computational limitations, in our data-augmented setting for generation model fine-tuning, we use 2K question-answer pairs with positive/negative passages for each language for our final results.
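The two merging styles can be sketched as follows. This is an illustrative helper, with hypothetical names, rather than the paper's data-construction code; it only shows where the positive passage is placed relative to the negatives.

```python
import random

def build_context(positive, negatives, shuffle=True, top_k=3, rng=None):
    """Place the positive passage among the contexts. In the 'shuffle'
    style its position is random within the top k; in the 'non-shuffle'
    style it is always first."""
    rng = rng or random.Random()
    contexts = list(negatives)
    pos = rng.randrange(min(top_k, len(contexts) + 1)) if shuffle else 0
    contexts.insert(pos, positive)
    return contexts
```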
Hyperparameters and Optimization For multilingual dense passage retrieval, we mostly follow the setup provided by the organizers: learning rate 1e−5 with AdamW (Loshchilov and Hutter, 2019), linear scheduling with warm-up for 300 steps and dropout rate 0.1. For mDPR(AUG) with MixCSE (Wang et al., 2022), we use λ = 0.2 and τ = 0.05 for the loss (see Appendix B). We train with a batch size of 16 on 1 GPU for at most 40 epochs, using average rank on the validation data to pick the checkpoints. The training is done independently of mGEN, in a non-iterative fashion.
For retrieving the passages, we use cosine similarity between question and passage across all proposed retrieval models, returning the top 100 passages for each of the questions.
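Retrieving the top passages by cosine similarity amounts to normalizing the embeddings and taking a top-k over dot products; a minimal sketch with NumPy, assuming precomputed question and passage embeddings:

```python
import numpy as np

def top_k_passages(q_emb, passage_embs, k=100):
    """Return indices of the k passages with the highest cosine
    similarity to the question embedding, best first."""
    q = q_emb / np.linalg.norm(q_emb)
    P = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sims = P @ q                      # cosine similarity after normalization
    k = min(k, len(sims))
    top = np.argpartition(-sims, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-sims[top])]        # sort the top-k by similarity
```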
For language-specialized pretraining via MLM, we use AdaFactor (Shazeer and Stern, 2018) with learning rate 1e-5 and linear scheduling with warm-up for 2000 steps, for up to 20 epochs. For multilingual answer generation fine-tuning, we also mostly keep the setup from the organizers: learning rate 3e-5 with AdamW (Loshchilov and Hutter, 2019), linear scheduling with warm-up for 500 steps, and dropout rate 0.1. We take the top 15 documents from the retrieved results as our input and truncate the input sequence after 16,000 tokens to fit the model into the memory constraints of our available infrastructure.

Results and Discussion
Results Overview Results in Table 4 show the comparison between the baseline and our proposed methods on XOR-TyDi QA, while Table 5 shows the results on MKQA. While we can see that the additional pretraining of the answer generation model (mDPR+MLM-14) helps to outperform the baseline on XOR-TyDi QA, the same approach leads to a degradation on MKQA. None of the proposed methods for the retrieval module improved over the baseline mDPR on either dataset, as shown in Table 6.

Unsupervised vs Supervised Retrieval
In all evaluation settings, unsupervised retrieval methods underperform supervised methods by a large margin (see Tables 4 and 5). This might be due to the nature of the task, which is to find a document containing an answer, rather than simply a document similar to the input question. Such an objective might not align well with models specialized in semantic similarity (Litschko et al., 2022). Fine-tuning mBERT, however, makes the model learn to focus on retrieving answer-containing documents rather than documents merely similar to the question.

Language Specialization
We compare the evaluation results for the fine-tuned answer generation model with and without language specialization (i.e., MLM-ing): for XORQA-ar and XORQA-te, we observe improvements of +2.0 and +1.1 percentage points, respectively, compared to the baseline model (with mT5 trained on 100+ languages). We further distinguish MLM-14 and MLM-16, where the former is trained on the released Wikipedia passages for 14 languages and the latter on the concatenation of Wikipedia passages and CCNet, to which we resort for the two surprise languages (Tamil and Tagalog) missing from the Wikipedia data. Overall, MLM-14 performs better than MLM-16: we hypothesize that this is due to the domain difference between text coming from Wikipedia and from CCNet; the latter is not strictly aligned with the structured (i.e., clean) text of Wikipedia passages and causes a slight drop in performance even though we train on 2 additional languages.
Data Augmentation Data augmentation is considered a way to mitigate the performance gap of low-resource languages, aiming at performance on par with high-resource languages (Kumar et al., 2019;Riabi et al., 2021;Shakeri et al., 2021). We consider two variants: AUG-QA, which concatenates the XOR-TyDi QA training set with the additional augmented data containing translated question-answer pairs, and AUG-QAP, which concatenates the XOR-TyDi QA training set with translated question-answer-passage triples. 10 We assume that by also translating passages, the setting is closer to test time, when the retrieval module can retrieve passages in any of the 14 languages (without the two surprise languages); in the AUG-QA setting, by contrast, the input passages for answer generation are always in English. Models trained with additional AUG-QA data see more data for unseen languages, while AUG-QAP may further enhance the model's ability to generate answers from translated passages. As expected, models trained with additional augmented data perform better than those trained without. Encouragingly, especially for the two surprise languages, the language-specialized models fine-tuned on both XOR-TyDi QA and AUG-QAP drastically improve performance on these unseen, low-resource languages.
mDPR variants results As shown in Tables 4 and 5, the mDPR variants we trained are considerably worse than the baseline. We believe this is mainly caused by the limited batch size (16) imposed by our infrastructure. The number of samples in a batch is critical for contrastive training, as larger batches provide a stronger signal through a higher number of negatives. For this reason, we believe the mDPR variants could not be thoroughly investigated and might still prove beneficial when trained with larger batches.

Related Work

Passage Retrieval and Answer Generation
To improve information accessibility, open-retrieval question answering systems have been attracting much attention in NLP (Chen et al., 2017). Rajpurkar et al. (2016) presented one of the early benchmarks that require systems to understand a passage to produce an answer to a given question. Kwiatkowski et al. (2019) presented a more challenging and realistic dataset with questions collected from a search engine. To tackle these complex and knowledge-demanding QA tasks, retrieval-augmented approaches first retrieve documents related to a given question and use them as additional evidence to predict an answer; in particular, a general-purpose fine-tuning recipe for retrieval-augmented generation models has been explored, combining pretrained parametric and non-parametric memory for language generation. Izacard and Grave (2021) solved the problem in two steps, first retrieving support passages and then processing them with a seq2seq model, and Sun et al. (2021) further extended this setup to the cross-lingual conversational domain. Some works explore a translate-then-answer approach, in which texts are translated into English, making the task monolingual (Ture and Boschee, 2016;Asai et al., 2021a). While this approach is conceptually simple, it is known to suffer from error propagation, in which translation errors are amplified in the answer generation stage (Zhu et al., 2019). To mitigate this problem, Asai et al. (2021b) proposed to use multilingual models for both passage retrieval and answer generation (Xue et al., 2021).
Data Augmentation Data augmentation is a common approach to reduce data sparsity for deep learning models in NLP (Feng et al., 2021). For Question Answering (QA), data augmentation has been used to generate paraphrases via back-translation (Longpre et al., 2019), to replace parts of the input text with translations (Singh et al., 2019), and to generate novel questions or answers (Riabi et al., 2021;Shakeri et al., 2021;Dugan et al., 2022). In the cross-lingual setting, available data have been translated into different languages (Singh et al., 2019;Kumar et al., 2019;Riabi et al., 2021;Shakeri et al., 2021), and language models have been used to train question and answer generation models (Kumar et al., 2019;Chi et al., 2020;Riabi et al., 2021;Shakeri et al., 2021). Our approach differs from previous work on cross-lingual question answering in that it only requires English passages to augment the training data, as question-answer pairs are generated automatically by the trained model of Dugan et al. (2022). In addition, our filtering heuristics remove incorrectly generated question-answer pairs, keeping only pairs whose answers are more likely to be translated correctly and thus limiting the problem of error propagation.

Reproducibility
To ensure full reproducibility of our results and to further fuel research on COQA systems, we release our models within the Huggingface repository as publicly available multilingual pretrained language models specialized in 14 and 16 languages. 11 We also release our code and data, which makes our approach completely transparent and fully reproducible. All resources developed as part of this work are publicly available at: https://github.com/umanlp/ZusammenQA.

Conclusion
We introduced a framework for a cross-lingual open-retrieval question answering system, using data augmentation with specialized models in a constrained setup. Given a question, we first retrieve the top relevant documents and then generate the answer with the specialized models (i.e., MLM-ing on Wikipedia passages) together with the augmented data variants. We demonstrated the effectiveness of data augmentation techniques with language- and domain-specialized additional training, especially for resource-lean languages. However, challenges remain, especially in training the retrieval model with limited computational resources. Our future efforts will focus on more efficient approaches to both multilingual passage retrieval and multilingual answer generation (Abdaoui et al., 2020), together with the investigation of further data augmentation techniques (Zhu et al., 2019). We hope that our generated QA language resources and released models can catalyze research on resource-lean languages for COQA systems.

A Data Augmentation Example

Figure 2 shows an example of a Wikipedia passage about An American in Paris. The passage in orange is the set of sentences whose length does not exceed 512 tokens, which is the first of the three sub-passages used for generating question-answer pairs. The generated pairs can be seen at the bottom of the figure. Questions and answers highlighted in red are those that satisfy the filtering heuristics detailed in §2.2. These are then translated into the other languages.
Title: An American in Paris URL: https://en.wikipedia.org/wiki?curid=309 Text: An American in Paris is a jazz-influenced orchestral piece by American composer George Gershwin written in 1928. It was inspired by the time that Gershwin had spent in Paris and evokes the sights and energy of the French capital in the 1920s. Gershwin composed "An American in Paris" on commission from conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophones, and automobile horns. He brought back some Parisian taxi horns for the New York premiere of the composition, which took place on December 13, 1928 in Carnegie Hall, with Damrosch conducting the New York Philharmonic. He completed the orchestration on November 18, less than four weeks before the work's premiere. He collaborated on the original program notes with critic and composer Deems Taylor. Gershwin was attracted by Maurice Ravel's unusual chords, and Gershwin went on his first trip to Paris in 1926 ready to study with Ravel. After his initial student audition with Ravel turned into a sharing of musical theories, Ravel said he could not teach him, saying, "Why be a second-rate Ravel when you can be a first-rate Gershwin?" While the studies were cut short, that 1926 trip resulted in a piece entitled "Very Parisienne", the initial version of "An American in Paris", written as a 'thank you note' to Gershwin's hosts, Robert and Mabel Shirmer. Gershwin called it "a rhapsodic ballet"; it is written freely and in a much more modern idiom than his prior works. Gershwin strongly encouraged Ravel to come to the United States for a tour. To this end, upon his return to New York, Gershwin joined the efforts of Ravel's friend Robert Schmitz, a pianist Ravel had met during the war, to urge Ravel to tour the U.S. 
Schmitz was the head of Pro Musica, promoting Franco-American musical relations, and was able to offer Ravel a $10,000 fee for the tour, an enticement Gershwin knew would be important to Ravel. ...

B MixCSE Loss
The MixCSE loss described in §3.1 is given by:

$$\mathcal{L}_{\text{mixcse}} = -\log \frac{\exp(\cos(e_{q_i}, e_{p_i^+})/\tau)}{\exp(\cos(e_{q_i}, e_{p_i^+})/\tau) + \sum_{j=1}^{n} \exp(\cos(e_{q_i}, e_{p_{i,j}^-})/\tau) + \exp(\cos(e_{q_i}, \mathrm{SG}(\tilde{e}_i))/\tau)}$$

where $\tau$ is a fixed temperature and $\mathrm{SG}$ is the stop-gradient operator, which prevents backpropagation from flowing into the mixed negative ($\tilde{e}_i$).