Multimodal Multi-Speaker Merger & Acquisition Financial Modeling: A New Task, Dataset, and Neural Baselines

Risk prediction is an essential task in financial markets. Merger and Acquisition (M&A) calls provide key insights into the claims made by company executives about the restructuring of the financial firms. Extracting vocal and textual cues from M&A calls can help model the risk associated with such financial activities. To aid the analysis of M&A calls, we curate a dataset of conference call transcripts and their corresponding audio recordings for the period from 2016 to 2020. We introduce M3ANet, a baseline architecture that takes advantage of the multimodal multi-speaker input to forecast the financial risk associated with the M&A calls. Empirical results show that the task is challenging, with the proposed architecture performing marginally better than strong BERT-based baselines. We release the M3A dataset and benchmark models to motivate future research on this challenging problem domain.


Introduction
Mergers and Acquisitions (M&A) conference calls are events preceding financial transactions involving two or more entities, in which either one of the participating companies takes over the other(s) and establishes itself as the owner (termed an "acquisition") or two companies combine to become a joint entity (termed a "merger"). In these M&A conference calls, the participating companies' management makes a presentation to the call participants, such as market analysts, media personnel, and other stakeholders, explaining the rationale for the deal and possible roadblocks to deal completion (Dasgupta et al., 2020). Following the presentation segment, there is a Q&A segment in which the call participants ask questions to which the management responds.

Figure 1: A schematic of our proposed approach (M3A) that leverages three types of input modalities: text utterances from the call transcripts, audio clips, and speaker-specific input, for financial modeling tasks.
Given the important information that M&As provide, they receive a great deal of attention from academic research, the financial press, and other media. A principal aspect of these discussions lies in how the deals may affect the company's valuation (Moeller et al., 2003; Fraunhoffer et al., 2018) and future growth. A significant focus in the financial and economic literature has been on understanding whether M&As create or destroy value. Consequently, shareholders critically analyze the deals to estimate the potential stock price and stock price volatility following the M&A conference call.
Identifying the lack of resources in the natural language processing (NLP) literature for studying M&A conference calls through their text transcripts and audio recordings, we take the first step toward multimodal financial modeling in the M&A space. Such data allows academicians to study M&A calls further, especially given its rich multimodal nature. It enables studies that focus not only on the words spoken in a call but also on the manner in which they were spoken, a relatively unexplored direction in financial forecasting, as shown in Figure 1.
A salient aspect of conference calls is that, unlike text reports, the company's management interacts with external stakeholders and responds to their questions. This interaction presents an opportunity to analyze not just the management's claims but also the way they express them. In Figure 2, we highlight the various components of a short Q&A interaction. Often, both the transcripts and the audio recordings of the calls are available to the public.

Figure 2: M&A calls have a Q&A session where financial stakeholders can ask questions to the company executives. In such sessions, company executives have to be impromptu with their responses, allowing informal words to seep in. This example Q&A session is from the call regarding the acquisition of 21st Century Fox by Disney, dated June 20, 2018. In the example, an analyst poses a few questions to the company executives (depicted in yellow). The CEO of Disney responds to these questions, and we notice some cases of informal speech (depicted in purple). The executive's response, however, mainly focuses on specific objects or entities (depicted in red), intermixed with some time-based information (depicted in green).
Vocal cues play a critical role in verbal communication as they can support or discredit the verbal message being spoken (Jiang and Pell, 2017). For example, if the CEO of the acquiring company exhibits confidence in the statement "we are confident that this acquisition will bring us profits," but displays nervousness while justifying the technical details of the deal, we may infer a contradiction in the claims of a successful M&A. Vocal cues have been shown to be indicators of deceit and nervousness (Belin et al., 2017; Sporer and Schwandt, 2006). Past research (Qin and Yang, 2019; Sawhney et al., 2020c) shows that the addition of vocal cues helps with financial prediction tasks and enriches the learned representations.
Our contributions can be summarized as follows: • We curate a public dataset, M3A (Multimodal Multi-Speaker Merger & Acquisition Call Financial Forecasting Dataset), that consists of 816 M&A conference calls spanning over 545 hours between 2016 and 2020, with their transcripts and audio recordings segmented by utterances and aligned with the audio. The source code, processed features, and details on acquiring the raw data are available at https://github.com/midas-research/m3a-acl.
• We accompany the dataset with neural baseline architectures that use the multimodal multi-speaker input to predict stock volatility and price movement.
• To the best of our knowledge, no such M&A conference call dataset exists in academia, and our proposed methodology, M3ANet, is the first deep learning approach for financial predictions on M&A conference calls.

Related Work
M&A Conference Calls Financial reports and conference calls have been shown to correlate with the stock market and to improve financial predictions (Bowen et al., 2001; Kogan et al., 2009). Studies have also been carried out specifically for M&A calls, showing their effect on the market (Dasgupta et al., 2020; Hu et al., 2018). However, there exists a gap in leveraging neural predictive modeling over the verbal and vocal cues of M&A calls for financial forecasting.
Financial Forecasting Research has shown historical pricing data to be useful for financial risk modeling (Kristjanpoller et al., 2014; Zheng et al., 2019; Dumas et al., 2009). It also considers volatility as an indicator of uncertainty, which helps in making investment decisions (Heston, 1993; Johnson and Shanno, 1987; Scott, 1987). Previous work often uses numerical features (Nikou et al., 2019) in approaches like neural networks (Kim et al., 2019; Luo et al., 2017), graph neural networks (Sawhney et al., 2020b), and time-series models (Bollerslev, 1986; Engle, 1981). In contrast, we are interested in analyzing multimodal data like text and audio, which can hold very different information for predictive models.

Natural Language Processing and Finance
For any system using human interactions to determine financial risk or stock movements, it is necessary to model the relationships between words to capture the speaker's sentiment. Advances in NLP have been utilized in many approaches, showing that financial text significantly improves performance in forecasting tasks like volatility and stock price prediction (Wang et al., 2013; Ding et al., 2015; Mittermayer and Knolmayer, 2007). Research has also shown that social media affects the stock market (Bollen et al., 2010; Oliveira et al., 2017; Sawhney et al., 2020a). Machine learning methods using simple bag-of-words features to represent financial documents, as used in previous research (Kogan et al., 2009; Rekabsaz et al., 2017), largely ignore the inter-dependencies between sentences. To fill this gap, recent approaches have moved towards newer models such as transformers (Yang et al., 2020) and reinforcement learning (Sawhney et al., 2021b) over natural language data for financial forecasting.
Multimodality and Financial Forecasting Research shows that psychological and behavioral elements are often indicators of stock price movement (Malkiel, 2003). Vocal cues have been proven effective in portraying these elements (Wurm et al., 2010;Hobson et al., 2011;Jiang and Pell, 2017). Thus, it is no surprise that multimodal architectures that use these cues for financial predictions have seen significant improvements in their performances (Yang et al., 2020;Sawhney et al., 2020d).
Speaker Context Encoding Past research (Zhang et al., 2019; Li et al., 2020) in fields like emotion recognition has shown improved performance on prediction tasks with the addition of speaker context. Models over spoken-text data benefit when the input is enriched with information about who spoke what.

Problem Formulation
Consider an M&A call χ ∈ {χ_1, χ_2, . . . , χ_M}, which comprises multimodal components χ = [t; a]. Here, t is the sequence of textual utterances (sentences) of the call transcript and can be represented as t = [t_1, t_2, . . . , t_N], where t_i is the i-th utterance of the call and N is the maximum number of utterances in any call. Similarly, a is the sequence of corresponding audio clips for the textual utterances and can be represented as a = [a_1, a_2, . . . , a_N], where a_i is the audio of the i-th utterance. The call's utterances are annotated with speaker information s = [s_1, s_2, . . . , s_N], where s_i is the speaker of the i-th utterance; each speaker in the call may have spoken one or more utterances. Each M&A conference call may have two or more participating companies, with at least one publicly-traded company with publicly available stock price information. We limit the scope of the problem by forecasting predictions for just one participant company: the one with the larger market valuation (in the case of a merger) or the acquiring company (in the case of an acquisition). We now describe the two prediction tasks on which we train M3ANet.
Measuring stock volatility Following Kogan et al. (2009), we formulate stock volatility prediction as a regression problem. For a given stock with a close price of p_k on trading day k, we calculate the average log volatility as the natural log of the standard deviation of return prices r in a window of τ days:

v_[0,τ] = ln( sqrt( (1/τ) Σ_{k=1}^{τ} (r_k − r̄)² ) )

where r_k = (p_k − p_{k−1}) / p_{k−1} is the return price on day k for the given stock, and r̄ is the average return price over the period of τ days.

Formalizing price movement prediction Following Xu and Cohen (2018), we define price movement y_{d−τ,d} over a period of τ days as a binary classification task. For a given stock, we employ its close price, which can either rise or fall on day d compared to a previous day d − τ:

y_{d−τ,d} = 1 if p_d > p_{d−τ}, and y_{d−τ,d} = 0 otherwise.

Given an M&A conference call χ, our learning objective is to predict the average log volatility v_[0,τ] and the price movement y_{d−τ,d} using the conference call data χ = [t; a].
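To make the two targets concrete, the sketch below computes them from a toy series of adjusted close prices; the function names and example prices are illustrative and not part of the released codebase.
```python
import numpy as np

def log_volatility(close_prices):
    """Average log volatility: ln of the std. of daily returns over the window."""
    p = np.asarray(close_prices, dtype=float)
    r = (p[1:] - p[:-1]) / p[:-1]        # r_k = (p_k - p_{k-1}) / p_{k-1}
    return float(np.log(np.std(r)))      # ln( sqrt( (1/tau) * sum (r_k - r_bar)^2 ) )

def price_movement(close_prices, tau):
    """Binary label: 1 if the close price on day d rose compared to day d - tau."""
    p = np.asarray(close_prices, dtype=float)
    return int(p[-1] > p[-1 - tau])

# Illustrative 3-day window following a call: p_0 ... p_3
prices = [100.0, 101.5, 99.8, 102.3]
print(log_volatility(prices), price_movement(prices, tau=3))
```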

Data Acquisition
We curate our dataset, M3A, by acquiring audio recordings and text transcripts from the Bloomberg Terminal. Since the conference calls were reliably available from 2016, we filter and list all M&A calls between 2016 and 2020. To limit the scope, we ensured the calls were in English, had their domicile as the U.S.A., and had 'merger' or 'acquisition' in their title. The Bloomberg Terminal often only provides the stock ticker for the acquiring company (in the case of an acquisition) or the company with the more prominent market valuation (in the case of a merger). To maintain uniformity, we decide to use only the given stock information. We pull the adjusted closing price data from Yahoo Finance. The dataset comprises 816 conference calls. The mean number of speakers across the calls is 10.68 ± 4.17, with a maximum of 31 speakers. The mean number of utterances across the calls is 100.54 ± 38.32, with a maximum of 284 utterances in a call. The mean length of the audio recordings is 40.15 ± 15.15 minutes, with a maximum of 98.15 minutes. We provide further statistics in Figure 3. Looking at year-wise trends, we see that acquisitions are consistently more frequent than mergers every year. Further, we note that mergers show a decreasing trend in the number of utterances, while acquisitions have a consistent number of speakers. We also note that acquisition conference calls seem to be increasing in length as the years progress.
We chronologically divide our dataset into a train, validation, and test set in the ratio of 70 : 10 : 20, respectively. Such a split ensures that future data is not used for forecasting past data.
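As an illustration, a chronological split of this kind can be implemented as below; the record schema (a "date" key per call) is an assumption made for the sketch.
```python
def chronological_split(calls, train_frac=0.70, val_frac=0.10):
    """Chronological 70/10/20 split: validation and test calls always postdate training calls.

    `calls` is assumed to be a list of dicts with a "date" key (illustrative schema).
    """
    calls = sorted(calls, key=lambda c: c["date"])
    n_train = int(train_frac * len(calls))
    n_val = int(val_frac * len(calls))
    return calls[:n_train], calls[n_train:n_train + n_val], calls[n_train + n_val:]

train, val, test = chronological_split([{"date": "2016-03-01"}, {"date": "2018-07-12"},
                                        {"date": "2019-01-30"}, {"date": "2020-05-04"}])
```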

Call Segmentation and Alignment
Each transcript in the dataset begins with the details of the company with the larger market valuation (in the case of a merger) or the acquiring company (in the case of an acquisition). These details include the company's name, stock ticker, and the date of the call. The transcript then lists the speakers in the call and their positions in the companies, if any. The call contents follow the list of speakers. The contents are separated by utterances and are annotated with the utterances' speakers.
Given our dataset, we have the option to choose between transcript-level, utterance-level, and word-level embeddings. We decide to use utterance-level embeddings. We parse the transcripts to extract all valid utterances, selecting utterances with at least ten words to ensure better parsing of the transcript.
Since we are working with audio files, it is essential that we segment them such that they align with their corresponding utterances in the text transcript. To achieve this alignment, we use the Aeneas library to perform forced alignment. The forced alignment algorithm takes as input a text file divided into fragments and an unfragmented audio file. It processes the input to output a synchronization map, which automatically associates a time interval in the audio file with its corresponding text fragment. Aeneas uses the Sakoe-Chiba Band Dynamic Time Warping (DTW) (Sakoe and Chiba, 1978) forced alignment algorithm, which has been shown to improve discrimination between words and to outperform other conventional algorithms.
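For reference, the sketch below shows how such a forced alignment can be run with the Aeneas Python API, following the library's documented usage; the file paths and configuration values are placeholders, not those used to build M3A.
```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One fragment (utterance) per line in the plain-text file; the output is a JSON
# synchronization map that associates each fragment with a [begin, end] audio interval.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/path/to/call_audio.mp3"        # placeholder paths
task.text_file_path_absolute = "/path/to/call_utterances.txt"
task.sync_map_file_path_absolute = "/path/to/syncmap.json"

ExecuteTask(task).execute()
task.output_sync_map_file()
```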

Text and Audio Encoding
Text Encoding We compute an utterance's textual encoding as the arithmetic mean of all its word vectors. BERT is well known as an effective pretrained language model for extracting word embeddings (Biswas et al., 2020) for a variety of language modeling tasks. We use Uncased Base BERT (Devlin et al., 2019) to extract the word embeddings. For each call, we represent the text utterances as [t_1, t_2, . . . , t_N]. As shown in Figure 4, we embed each text utterance t_i to get its corresponding 768-dimensional text encoding g_i using BERT such that g_i = BERT(t_i) for each i ∈ [1, N].

Audio Encoding We use the OpenSMILE library (https://pypi.org/project/opensmile/) to extract audio features at a sampling rate of 10 ms and choose the set of 62 GeMAPS features described in (Eyben et al., 2016). This set includes features like pitch, jitter, and loudness, which have proven effective in audio analysis tasks (Chao et al., 2015). For each call, we represent the audio clips of the utterances as [a_1, a_2, . . . , a_N]. We embed each audio clip a_i to its corresponding 62-dimensional encoding h_i using OpenSMILE such that h_i = OpenSMILE(a_i) for each i ∈ [1, N].
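The sketch below illustrates both encoders using the HuggingFace transformers and opensmile Python packages; it is not necessarily the exact pipeline used to build M3A, and the precise GeMAPS feature-set variant is an assumption.
```python
import torch
import opensmile
from transformers import BertModel, BertTokenizer

# Text encoder: 768-d utterance encoding g_i as the mean of BERT token vectors
# (a simple proxy for the word-vector mean described above).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_text(utterance):
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        word_vectors = bert(**inputs).last_hidden_state[0]   # (num_tokens, 768)
    return word_vectors.mean(dim=0).numpy()                  # arithmetic mean over tokens

# Audio encoder: 62 GeMAPS functionals per utterance clip (exact variant is an assumption).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def encode_audio(wav_path):
    return smile.process_file(wav_path).to_numpy().squeeze()  # (62,)

g_i = encode_text("We are confident that this acquisition will bring us profits.")
```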

Motivation for Speaker Information Infusion
The audio encodings help decipher the vocal cues in the context of the text transcript, supporting or discrediting the speaker's claims. However, it is critical for the system to recognize the importance of an utterance's speaker to gauge its impact on financial predictions. This requires augmenting the input with information about the speaker of each utterance. Prior research (Zhang et al., 2019; Li et al., 2020) shows that the addition of speaker context helps improve prediction performance on tasks involving datasets with spoken texts.
M&A calls have utterances spoken by the company's management (the decision-making force of the company), by analysts (who want to gauge the risk in the company's decisions), or even just the operator (often an impartial person). Capturing this speaker context will allow us to decide how much impact a specific utterance can have on a company's stock price. Thus, we extract the speaker information for each utterance. We parse the list of speakers from the transcripts and assign an ID to each of the speakers. The IDs start from 1 and are assigned incrementally to each speaker in the order in which they are listed. The operator of the call is assigned the ID 0. We then annotate each of the utterances based on who spoke it. Finally, we use one-hot encoding to represent the speaker encoding s of each utterance in the call.
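A minimal sketch of this speaker-ID and one-hot encoding scheme follows; the function and argument names are illustrative, and the one-hot width is an assumption based on the dataset's maximum of 31 speakers plus the operator.
```python
import numpy as np

def speaker_one_hot(listed_speakers, utterance_speakers, max_speakers=32):
    """One-hot speaker encodings: IDs follow the transcript's listing order, operator = 0."""
    ids = {"Operator": 0}
    for name in listed_speakers:                 # speakers in the order they are listed
        ids.setdefault(name, len(ids))
    s = np.zeros((len(utterance_speakers), max_speakers), dtype=np.float32)
    for j, name in enumerate(utterance_speakers):
        s[j, ids[name]] = 1.0
    return s

s = speaker_one_hot(["Jane Doe (CEO)", "John Roe (Analyst)"],
                    ["Operator", "Jane Doe (CEO)", "John Roe (Analyst)", "Jane Doe (CEO)"])
```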

M3ANet: Speaker Transformer
The Transformer (Vaswani et al., 2017) uses multi-head attention and position embeddings to learn the relationships between different utterances. The multimodal input requires the model to learn the inter-dependencies between the audio and text features. M3ANet can then use the audio cues to affirm or discredit the spoken message and make an informed prediction. The idea behind M3ANet is to use attention to weigh the importance of each modality at different timestamps. We then augment the data with the speaker encoding and allow the Transformer to extract the multimodal inter-dependencies for performing the prediction tasks.
Attention-Fusion Before we can fuse the inputs, we linearly transform the text embeddings so that the sizes of the multimodal embeddings match. We then extract attention weights to calculate the attended inputs, similar to (Hori et al., 2017). These attention weights describe the importance of a specific modality with respect to the other modality. We multiply the text and audio features by their attention weights W_T and W_A, respectively, to get the attended inputs, and then fuse them. The following equations formalize the attention mechanism used:

W_T = softmax(W_wt g′ + b_wt),   W_A = softmax(W_wa h + b_wa)

X_fused = (W_T ⊙ g′) + (W_A ⊙ h)

where g′ denotes the linearly transformed text encodings, h the audio encodings, W_wt and b_wt represent the text attention layer, W_wa and b_wa represent the audio attention layer, ⊙ represents element-wise multiplication, and + represents addition.
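The Keras sketch below illustrates one way to realize this attention fusion; the layer sizes and the softmax normalization are assumptions for the sketch, not the exact released implementation.
```python
import tensorflow as tf
from tensorflow.keras import layers

N, D_TEXT, D_AUDIO = 284, 768, 62          # max utterances, BERT dim, GeMAPS dim

text_in = layers.Input(shape=(N, D_TEXT))
audio_in = layers.Input(shape=(N, D_AUDIO))

text_proj = layers.Dense(D_AUDIO)(text_in)                        # align text to the audio dimension

w_text = layers.Dense(D_AUDIO, activation="softmax")(text_proj)   # text attention layer (W_wt, b_wt)
w_audio = layers.Dense(D_AUDIO, activation="softmax")(audio_in)   # audio attention layer (W_wa, b_wa)

x_fused = layers.Add()([
    layers.Multiply()([w_text, text_proj]),                       # attended text
    layers.Multiply()([w_audio, audio_in]),                       # attended audio
])
fusion = tf.keras.Model([text_in, audio_in], x_fused)
```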

Sentence-Level Transformer
To model the sequence of textual and audio embeddings of the M&A calls, we augment the fused multimodal embeddings X_fused with position embeddings pos by addition and with the speaker information s by concatenation (represented by ⊕). pos has the same dimensions as X_fused, and pos_{j,ind} represents the value of the positional embedding for the j-th utterance at index ind, following the sinusoidal formulation of Vaswani et al. (2017):

pos_{j,2i} = sin(j / 10000^{2i/d}),   pos_{j,2i+1} = cos(j / 10000^{2i/d})

The augmentation is summarized as:

X_aug = (X_fused + pos) ⊕ s

The Transformer block uses the augmented feature set for further processing, after which the intermediate tensors are passed through two consecutive dense layers to output the task prediction:

O_1 = ReLU(W_l1 I_1 + b_l1),   y = σ(W_l2 O_1 + b_l2)

where W_l1 and b_l1 represent the first linear layer, W_l2 and b_l2 represent the second linear layer, I_1 and O_1 represent the inputs to the first and second dense layers after the sentence transformer, σ represents the final activation function, and y represents the final prediction corresponding to the task. We use ReLU as the final activation for the volatility prediction task and sigmoid for the price prediction task. We then use Mean Squared Error (MSE) and Binary Cross-Entropy (BCE) losses to train the model for volatility prediction and stock price movement prediction, respectively.
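The following Keras sketch puts the sentence-level stage together: sinusoidal position embeddings are added, speaker one-hot vectors are concatenated, a single Transformer encoder block processes the sequence, and two dense layers produce the task output. Layer sizes, pooling, and the joint two-head layout are illustrative assumptions; the two tasks are trained with MSE and BCE losses, respectively, as described above.
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

N, D_FUSED, MAX_SPEAKERS = 284, 62, 32            # illustrative sizes

def sinusoidal_positions(n, d):
    pos = np.arange(n)[:, None].astype(np.float32)
    i = np.arange(d)[None, :].astype(np.float32)
    angles = pos / np.power(10000.0, (2.0 * np.floor(i / 2.0)) / d)
    return tf.constant(np.where(i % 2 == 0, np.sin(angles), np.cos(angles)), dtype=tf.float32)

fused_in = layers.Input(shape=(N, D_FUSED))          # X_fused from the attention-fusion stage
speaker_in = layers.Input(shape=(N, MAX_SPEAKERS))   # one-hot speaker encodings s

pos = sinusoidal_positions(N, D_FUSED)
x = layers.Lambda(lambda t: t + pos)(fused_in)       # X_fused + pos
x = layers.Concatenate(axis=-1)([x, speaker_in])     # (X_fused + pos) ⊕ s

# One Transformer encoder block: multi-head self-attention + feed-forward, with residuals.
attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
x = layers.LayerNormalization()(layers.Add()([x, attn]))
ff = layers.Dense(128, activation="relu")(x)
x = layers.LayerNormalization()(layers.Add()([x, layers.Dense(D_FUSED + MAX_SPEAKERS)(ff)]))

# Two dense layers on a pooled call representation; separate final activations per task.
pooled = layers.GlobalAveragePooling1D()(x)
hidden = layers.Dense(64, activation="relu")(pooled)        # O_1 = ReLU(W_l1 I_1 + b_l1)
volatility = layers.Dense(1, activation="relu")(hidden)     # volatility head, trained with MSE
movement = layers.Dense(1, activation="sigmoid")(hidden)    # price-movement head, trained with BCE

model = tf.keras.Model([fused_in, speaker_in], [volatility, movement])
```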
Experimental Setup

Baselines
We compare M3ANet against modern baselines across modalities for both tasks. We employ GloVe (Pennington et al., 2014), FinBERT (Araci, 2019), and RoBERTa (Liu et al., 2019) to embed the text, and choose an LSTM + Dense layer architecture as a benchmark for both volatility and price movement prediction. We also use all three variants (text, audio, and multimodal) of the Multimodal Deep Regression Model (MDRM) (Qin and Yang, 2019) as baselines.

Training Setup
We tune M3ANet's hyper-parameters using grid search and summarize the range of hyperparameters searched. We implement all methods with Keras and Google Colab, using ReLU as our hidden layer activation function, and optimize using Adam. We choose the model that performs best on our validation set under the chosen evaluation metrics as our best model. We zero-pad the calls that have fewer than the maximum number of utterances/speakers for efficient batching. We experiment with trading periods τ ∈ {3, 7, 15} days, allowing experimentation across both short- and medium-term periods.

Table 1: Mean τ-day volatility MSE and price movement prediction results (mean and stdev. of 5 runs for each approach). * indicates that the result is significantly better than MDRM (T+A). Bold denotes best performance.

Table 2: Effect of multimodality and multi-speaker inputs (mean and stdev. of 5 runs for each approach).

Similar to prior work (Sawhney et al., 2020d; Theil et al., 2019; Yang et al., 2020), we evaluate predicted volatility using the mean squared error (MSE) for each hold period τ ∈ {3, 7, 15}. For the classification task, we report the F1 score and the Matthews Correlation Coefficient (MCC) (Matthews, 1975). We use MCC because, unlike the F1 score, it avoids bias due to any data skew, as it does not depend on the choice of the positive class. For a given confusion matrix with entries tp, fn, fp, tn:

MCC = (tp · tn − fp · fn) / √((tp + fp)(tp + fn)(tn + fp)(tn + fn))
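For completeness, a small helper computing MCC from the confusion-matrix entries (the counts are illustrative):
```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(tp=40, fn=10, fp=15, tn=35))
```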

Results and Analysis

Performance Comparison
As shown in Table 1, M3ANet achieves the best performance on both the volatility prediction and the price prediction tasks. We observe improvements with M3ANet (Table 2), which leverages the text and audio modalities along with speaker information. This improvement can be attributed to the attention mechanism, which emphasizes the importance of each modality throughout the series of utterances. We also observe that the improvements our architecture yields are not large in magnitude. We attribute this to the inherent difficulty of the task. Further research into more sophisticated models may result in greater improvements on M3A.

Multimodal and Multi-Speaker Learning
From Table 1 and Table 2, we see that in both the MDRM and Transformer models, the multimodal models performed much better than the unimodal counterparts. This performance improvement follows from previous research (Qin and Yang, 2019) with respect to volatility prediction. Similar observations validate our hypothesis that audio cues provide additional information that helps make a better prediction. It is also apparent from Table 2 that adding speaker context improves the prediction result consistently. Thus, we infer that speaker information does play an essential part in forecasting and adds to the data's richness.

Ablation Study: Fusion
We experiment with fusion by concatenation and fusion by attention for the Transformer and find that the latter performs better in most cases (Table 2). We believe this happens because simple fusion techniques cannot produce features that effectively capture the individual modalities' importance. Attention fusion, in contrast, uses weights for both modalities, learned by the architecture, to determine the importance of each modality with respect to its counterpart. Using these weights to perform a weighted addition gives a much better representation of both modalities and their relative importance in a fused vector.

Performance Drift over Time
As observed in previous work using earnings calls (Sawhney et al., 2020d), Figure 6 shows that short-term stock volatility prediction is more difficult, possibly due to the erratic price fluctuations after an M&A call. We hypothesize that these price fluctuations settle as more time elapses, similar to the phenomenon of Post Earnings Announcement Drift (PEAD) (Bernard and Thomas, 1989; Bhushan, 1994; Sadka, 2006). The saturation in performance improvement can be attributed to the dilution of cues extracted from the calls as we 'drift' away from them. However, a similar trend does not necessarily hold for price movement prediction.

Merger & Acquisition Transfer
We experiment by training M3ANet on merger and acquisition calls separately and testing both models on each set of calls. From Table 3, we observe that, as expected, each model predicts price movement better for its respective set. Surprisingly, both models predict the volatility of acquisition calls better than that of merger calls. This suggests that acquisition conference calls lead to volatility that is relatively easier to predict, which seems to be an avenue for further research.

Qualitative Analysis
Call 1: Acquisition of Shape Security by F5 Networks Inc Following the call, F5 Networks Inc suffered a price drop of up to 5.2% within the next month. Studying the call's vocal cues, we notice (Figure 5a) that the CEO had sudden peaks in the mean pitch of his audio while answering questions. Similar peaks occurred when a participant asked the CEO about their fraud protection compared to their competitors. Prior research on audio analysis (Jiang and Pell, 2017) shows that a high mean pitch may indicate a lack of confidence in the speaker. It was later ascertained that F5 had overpaid to acquire Shape Security without proper due diligence of the fraud protection plans sold by Shape Security. We observe that M3ANet successfully predicts the decrease in price for all choices of τ, while the unimodal models fail to do so each time.
Though the text reveals no lack of confidence, the audio cues likely allow the model to make a successful prediction.
Call 2: Merger of AK Steel Holding Corporation and Cleveland-Cliffs Inc Following the merger call, Cleveland-Cliffs Inc saw an increase in their stock price of up to 17.9% over the next five days. Similar to the first call, we notice spikes and sudden increases in the mean pitch of the audio in Figure 5b. However, the difference is that these high-pitch patterns come from an analyst on the call and not from someone holding an influential position in the companies involved. M3ANet can differentiate between the speakers and correctly predicts the price going up, unlike the Transformer variant without speaker embeddings. This shows how augmenting the multimodal data with the speaker embedding likely benefits the predictive power of M3ANet.
Call 3: Acquisition of Plateau Excavation Inc by Sterling Construction Company Inc We now analyze this acquisition as an error case in which M3ANet predicts incorrectly. The text transformer performs well on this example and accurately predicts the increase in the stock price of Sterling Construction Company Inc. On the other hand, our multimodal multi-speaker model is unable to do the same. Observing the audio cues (Figure 5c), we find a great deal of variance in the mean audio pitch. We attribute the erroneous performance to potential overfitting of the model or noise in the audio cues.

Conclusion
We present a dataset of M&A calls that can be utilized to predict financial risk following M&A calls. We also present a strong baseline model that uses multimodal multi-speaker inputs from the M&A calls to perform financial forecasting. M3ANet uses attention-based fusion to leverage the interdependency between the verbal message and the vocal cues. Further, the approach uses speaker information to enrich the input data, helping determine whether speakers' vocal cues or verbal messages conflict with one another and accounting for such conflicts. Experiments on M3A display the effectiveness of M3ANet. We hope M3A can enable further academic progress in the field of financial forecasting.

Ethical Considerations and Limitations
Examining a speaker's tone and speech in conference calls is a well-studied task in past literature (Qin and Yang, 2019; Chariri, 2009). Our work focuses only on calls for which companies publicly release transcripts and audio recordings. The data used in our study corresponds to M&A conference calls of companies listed on the NASDAQ stock exchange. We acknowledge the presence of gender bias in our study, given the imbalance in the gender ratio of speakers in the calls. We also acknowledge the demographic bias (Sawhney et al., 2021a) in our study, as the companies are organizations within the public stock market of the United States of America, and our findings may not generalize directly to non-native speakers.