DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect

Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual reality. Binaural audio helps users orient themselves and establish immersion by providing the brain with interaural time differences that reflect spatial information. However, existing BAS methods are limited in terms of phase estimation, which is crucial for spatial hearing. In this paper, we propose DopplerBAS, a method that explicitly addresses the Doppler effect of the moving sound source. Specifically, we calculate the radial relative velocity of the moving speaker in spherical coordinates, which further guides the synthesis of binaural audio. This simple method introduces no additional hyper-parameters, does not modify the loss functions, and is plug-and-play: it scales well to different types of backbones. DopplerBAS distinctly improves the representative WarpNet and BinauralGrad backbones in the phase error metric and reaches a new state of the art (SOTA): 0.780 (versus the current SOTA of 0.807). Experiments and ablation studies demonstrate the effectiveness of our method.


Introduction
Binaural audio synthesis (BAS), which aims to render binaural audio from its monaural counterpart, has become a prominent technology in artificial spaces (e.g., augmented and virtual reality) (Richard et al., 2021; Leng et al., 2022; Parida et al., 2022; Zhu et al., 2022; Park and Kim, 2022). Binaural rendering provides users with an immersive spatial and social presence (Hendrix and Barfield, 1996; Gao and Grauman, 2019; Huang et al., 2022; Zheng et al., 2022) by producing stereophonic sounds with accurate spatial information. Unlike traditional single-channel audio synthesis (Chen et al., 2021), BAS places more emphasis on accuracy than on sound quality, since humans need accurate spatial cues to locate objects and to sense movements consistent with visual input (Richard et al., 2021). Currently, there are three types of neural network (NN) approaches to synthesizing binaural audio. Richard et al. (2021) collect a paired monaural-binaural speech dataset and provide an end-to-end baseline with geometric and neural warping technologies. To simplify the task, Leng et al. (2022) decompose the synthesis into two stages: synthesis of the averaged received audio, followed by BAS conditioned on that average. They also propose to use the generative model DDPM (Ho et al., 2020) to improve audio naturalness. To increase generalization to out-of-distribution audio, a third line of work renders the speech in the Fourier space. These nonlinear NN-based methods surpass traditional digital signal processing systems (Savioja et al., 1999; Zotkin et al., 2004; Sunder et al., 2015), which rely on a linear time-invariant model. However, these methods still have room for improvement in accuracy, especially phase accuracy; prior work claims that correct phase estimation is crucial for binaural rendering. In fact, previous works tend to view the scene "statically", taking into account only the series of positions and head orientations.
This motivates DopplerBAS, which facilitates phase estimation by explicitly introducing the Doppler effect (Gill, 1965; Giordano, 2009) into neural networks. Specifically, 1) we calculate the 3D velocity vector of the moving sound source in Cartesian coordinates and then decompose it, in spherical coordinates relative to the listener, into its components; 2) following the Doppler effect, we use the radial relative velocity as an additional condition of the neural network, making the model aware of the source's motion. We also analyze the phenomena caused by different types of velocity conditions through extensive experiments.
Naturally, DopplerBAS can be applied to different neural binaural renderers without tuning hyperparameters. To demonstrate the effectiveness of our method, we pick two typical recent backbones: 1) WarpNet (Richard et al., 2021), a traditional network optimized by reconstruction losses; and 2) BinauralGrad (Leng et al., 2022), a diffusion model optimized by maximizing the evidence lower bound of the data likelihood. Experiments on WarpNet and BinauralGrad are representative, and gains on these two models indicate that the proposed conditions should generalize to other backbones. The contributions of this work can be summarized as:
• We propose DopplerBAS, which distinctly improves WarpNet and BinauralGrad in the phase metric and produces a new state of the art: 0.780 (vs. the current state of the art of 0.807).
• We conduct analytical experiments under various velocity conditions and discover that: 1) NNs do not explicitly learn the derivative of position with respect to time (velocity); 2) the velocity condition is beneficial to binaural audio synthesis, even the absolute velocity in Cartesian coordinates; 3) the radial relative velocity is the effective velocity component, in accordance with the theory of the Doppler effect.

Method
We consider the most basic BAS scenario, where only the monaural audio and the series of positions and head orientations are provided (Leng et al., 2022), rather than scenarios where extra modalities (Xu et al., 2021) are present.
In this section, we introduce the Doppler effect as preliminary knowledge and then present the proposed DopplerBAS. We describe how to calculate and decompose the velocity vector, and how to apply it to two different backbones.

Doppler Effect
The Doppler effect (Gill, 1965) is the change in the frequency of a wave perceived by an observer when the wave source is moving relative to the observer. The effect was originally exploited in radar systems to reveal characteristics of interest of moving targets (Chen et al., 2006). It can be formulated as:

f = (c / (c − v_r)) f_0, (1)

where c, v_r, f_0 and f are the propagation speed of the wave, the radial relative velocity of the moving sound source (positive when the source approaches the observer), the original frequency of the wave, and the received frequency of the wave, respectively.
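As a concrete illustration of Eq. (1) (the function name and defaults are ours, not from the paper):

```python
def doppler_shift(f0, v_r, c=343.0):
    """Received frequency under Eq. (1).

    f0: emitted frequency (Hz); v_r: radial relative velocity (m/s),
    positive when the source approaches the observer; c: propagation
    speed of the wave (~343 m/s for sound in air at 20 degrees C).
    """
    return c / (c - v_r) * f0
```

A source approaching the listener is thus heard at a slightly higher frequency (f > f_0), a receding one at a lower frequency.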
! " #$ right ear Figure 1: We illustrate the top view where the height dimension is omitted for simplicity. The sound source is moving in the x-y plane with the velocity v xy . This velocity is decomposed into the radial velocity v r relative to the right ear.

DopplerBAS
We do not directly apply Eq. (1) in the frequency domain of the audio, because previous work shows that modeling binaural audio in the frequency domain benefits generalization but reduces accuracy. Instead, we calculate the velocity of interest and use it as a condition to guide the network to synthesize binaural audio consistent with the motion. In the receiver-centric Cartesian coordinate system, we define p_s and p_e as the 3D positions of the moving sound source s and of one ear of the receiver e, respectively (e.g., the right ear, as shown in Figure 1). The position vector p = (p_x, p_y, p_z) of s relative to e is:

p = p_s − p_e.

Then the velocity of s can be calculated as:

v = (v_x, v_y, v_z) = (dp_x/dt, dp_y/dt, dp_z/dt).

Next, we build a spherical coordinate system with the ear as the origin, and decompose v into the radial relative velocity v_r by:

v_r = v · r̂, (2)

where r̂ = p/||p|| is the radial unit vector. Finally, we add v_r as an additional condition. The original conditions in monaural-to-binaural speech synthesis are C_o ∈ R^7 = (x, y, z, qx, qy, qz, qw), of which the first 3 entries represent the position and the last 4 the head orientation. We define the new condition C ∈ R^9 = (x, y, z, qx, qy, qz, qw, v_{r-left}, v_{r-right}), where v_{r-left} and v_{r-right} are the radial velocities of the source relative to the left and right ear respectively, derived from Eq. (2). We apply C to WarpNet and BinauralGrad as follows:
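The computation above can be sketched with finite differences over the tracked positions (a simplified illustration; the names are ours and the released implementations may differ):

```python
import numpy as np

def radial_velocity(p_source, p_ear, fps=120.0):
    """Radial relative velocity v_r of the source w.r.t. one ear.

    p_source, p_ear: (T, 3) arrays of 3D positions sampled at `fps` Hz.
    Returns (T,): v_r = v . r_hat, with p = p_s - p_e, v = dp/dt and
    r_hat = p / ||p|| the radial unit vector.
    """
    p = p_source - p_ear                    # relative position, (T, 3)
    v = np.gradient(p, 1.0 / fps, axis=0)   # velocity dp/dt, (T, 3)
    r_hat = p / np.linalg.norm(p, axis=1, keepdims=True)
    return np.sum(v * r_hat, axis=1)        # radial component v_r

def build_condition(pos, quat, v_r_left, v_r_right):
    """Stack the 9-dim condition C = (x, y, z, qx, qy, qz, qw, v_r_left, v_r_right)."""
    return np.concatenate(
        [pos, quat, v_r_left[:, None], v_r_right[:, None]], axis=1)
```

For a source moving radially away from the ear at constant speed, `radial_velocity` returns that speed (positive); a purely tangential motion yields v_r ≈ 0, matching the decomposition in Figure 1.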

WarpNet
WarpNet consists of two blocks: 1) a Neural Time Warping block, which learns a warp from the source position to the listener's left and right ears while respecting physical properties (Richard et al., 2021); this block is composed of a geometric warp and a parameterized neural warp. 2) A Temporal ConvNet block, which models subtle effects such as room reverberation and outputs the final binaural audio; this block is composed of a stack of hyper-convolution layers. We replace the original C_o with C both as the input of the parameterized neural warp and as the condition of the hyper-convolution layers.

BinauralGrad
BinauralGrad consists of two stages: 1) a "Common Stage", which generates the average of the binaural audio. The conditions for this stage include the monaural audio, the average of the binaural audio produced by the geometric warp in WarpNet (Richard et al., 2021), and C_o. 2) A "Specific Stage", which generates the final binaural audio. The conditions for this stage include the binaural audio produced by the geometric warp, the output of the "Common Stage", and C_o. BinauralGrad adopts a diffusion model for both stages, based on non-causal WaveNet blocks with a conditioner block composed of a series of 1D-convolutional layers. We replace C_o with C as the input of the conditioner block.

Experiments
In this section, we first introduce the commonly used binaural dataset, and then describe the training details for the WarpNet-based and BinauralGrad-based models. After that, we describe the metrics used to evaluate the baselines and our methods. Finally, we provide the main results along with analytical experiments on BAS.

Setup
Dataset We evaluate our methods on the standard binaural dataset released by Richard et al. (2021). It contains 2 hours of paired monaural and binaural data at 48 kHz from eight different speakers, who were asked to walk around a listener equipped with binaural microphones. An OptiTrack system tracks the positions and orientations of the speaker and listener at 120 Hz, aligned with the audio. We follow the original train-validation-test splits of Richard et al. (2021) and Leng et al. (2022) for a fair comparison.
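Since the tracking runs at 120 Hz while the audio is sampled at 48 kHz, the positional conditions must be brought to the audio rate before conditioning the network. A simple linear-interpolation sketch (ours, for illustration; the released data pipeline may align the streams differently):

```python
import numpy as np

def upsample_conditions(cond, track_hz=120, audio_hz=48000):
    """Linearly interpolate (T, C) tracking conditions to the audio rate."""
    T, C = cond.shape
    n_samples = int(T * audio_hz / track_hz)
    t_track = np.arange(T) / track_hz       # time stamps of tracker frames
    t_audio = np.arange(n_samples) / audio_hz  # time stamps of audio samples
    return np.stack(
        [np.interp(t_audio, t_track, cond[:, c]) for c in range(C)], axis=1)
```

Each 120 Hz condition channel (positions, orientations, and the velocity features of Section 2.2) is thus stretched to one value per audio sample.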
Training Details We apply DopplerBAS to two open-source BAS systems, WarpNet and BinauralGrad (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad). We train 1) WarpNet and WarpNet+DopplerBAS on 2 NVIDIA V100 GPUs with a batch size of 32 for 300k steps, and 2) BinauralGrad and BinauralGrad+DopplerBAS on 8 NVIDIA A100 GPUs with a batch size of 48 for 300k steps, following the recommended training steps in the official repositories.
Metrics Following previous works (Leng et al., 2022), we adopt 5 metrics to evaluate the baselines and our methods: 1) Wave L2: the mean squared error between waveforms; 2) Amplitude L2: the mean squared error between the synthesized speech and the ground truth in amplitude; 3) Phase L2: the mean squared error between the synthesized speech and the ground truth in phase; 4) PESQ: the perceptual evaluation of speech quality; 5) MRSTFT: the multi-resolution spectral loss.
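As a simplified sketch of how the amplitude and phase terms can be computed (single-resolution STFT with parameter choices of our own; the benchmark implementations may differ, e.g. by weighting phase errors by amplitude):

```python
import numpy as np

def amplitude_phase_l2(pred, target, n_fft=512, hop=128):
    """MSE between two equal-length waveforms in STFT amplitude and phase."""
    def stft(x):
        win = np.hanning(n_fft)
        frames = [x[i:i + n_fft] * win
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.fft.rfft(np.stack(frames), axis=1)
    P, T = stft(pred), stft(target)
    amp_l2 = np.mean((np.abs(P) - np.abs(T)) ** 2)
    dphi = np.angle(P) - np.angle(T)
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    phase_l2 = np.mean(dphi ** 2)
    return amp_l2, phase_l2
```

The phase difference is wrapped before squaring so that, e.g., phases of −3.1 and +3.1 rad count as a small error rather than a large one.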

Main Results and Analysis
In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed method.

Main Results
We compare the following systems: 1) DSP, which utilizes the room impulse response (Lin and Lee, 2006) to model room reverberation and head-related transfer functions (Cheng and Wakefield, 2001) to model the acoustical influence of the human head; 2) WaveNet (Richard et al., 2021; Leng et al., 2022), which utilizes the WaveNet (Oord et al., 2016) model to generate binaural speech; 3) NFS, which proposes to model the binaural audio in the Fourier space; 4) WarpNet (Richard et al., 2021), which proposes a combination of a geometric warp and a neural warp to produce coarse binaural audio from the monaural one, plus a stack of hyper-convolution layers to refine it; 5) WarpNet + DopplerBAS, which applies DopplerBAS to WarpNet; 6) BinauralGrad (Leng et al., 2022), which proposes to use a diffusion model to improve audio naturalness; 7) BinauralGrad + DopplerBAS, which applies DopplerBAS to BinauralGrad.
The results are shown in Table 1. "+ DopplerBAS" improves both WarpNet and BinauralGrad on all metrics, especially the phase metric. WarpNet + DopplerBAS performs best on the Phase L2 metric and reaches a new state of the art of 0.780 (vs. the previous 0.807).
Analysis We conduct analytical experiments under the following 4 velocity conditions. "Spherical v": the velocity conditions introduced in Section 2.2, calculated in the spherical coordinate system; "Cartesian v": the velocity conditions calculated in the Cartesian coordinate system; "Zeros": the provided conditions are two sequences of zeros; "Time series": the provided conditions are two sequences of time stamps. The results are shown in Table 2, where WarpNet is placed in the first row as the reference. We discover that: 1) the radial relative velocity is the effective velocity component, in accordance with the theory of the Doppler effect (row 2 vs. row 1); 2) the velocity condition is beneficial to binaural audio synthesis, even the absolute velocity in Cartesian coordinates (row 3 vs. row 1); 3) merely increasing the channel number of the condition C_o introduced in Section 2.2 (and hence the number of network parameters) without providing meaningful information does not change the results (row 4 vs. row 1); 4) the neural networks do not explicitly learn the derivative of position with respect to time (row 5 vs. row 1). These points verify the rationality of our proposed design.

Conclusion
In this work, we proposed DopplerBAS to address the Doppler effect of the moving sound source in binaural audio synthesis. To aid previous neural BAS methods, in which the Doppler effect was not explicitly taken into account, we calculate the radial relative velocity of the moving source in the spherical coordinate system and use it as an additional condition for the BAS task. The main experiments show that DopplerBAS scales well to different types of backbones and reaches a new state of the art in the phase metric.