BUSTER: a “BUSiness Transaction Entity Recognition” dataset

Although Natural Language Processing has seen major breakthroughs in the last few years, transferring such advances into real-world business cases can be challenging. One of the reasons is the gap between popular benchmarks and actual data. Lack of supervision, unbalanced classes, noisy data and long documents often affect real problems in vertical domains such as finance, law and health. To support industry-oriented research, we present BUSTER, a BUSiness Transaction Entity Recognition dataset. The dataset consists of 3779 manually annotated documents on financial transactions. We establish several baselines exploiting both general-purpose and domain-specific language models. The best performing model is also used to automatically annotate 6196 documents, which we release as an additional silver corpus to BUSTER.


Introduction
Natural Language Processing (NLP) is a field potentially beneficial to a broad span of language-intensive domains, such as law and health. Whilst much financial data is tabular, crucial information is also stored in reports, news, transaction agreements, etc.
The rapid developments in NLP (Vaswani et al., 2017) are favouring its adoption in assistance tools for human experts in many tasks, ranging from Document Classification (Chalkidis et al., 2019) to Information Extraction (Alvarado et al., 2015; Loukas et al., 2022) and even Text Summarization (Bhattacharya et al., 2019). However, transferring the emerging technologies into industry applications can be non-trivial. Adapting Large Language Models (LLMs) to vertical domains usually requires fine-tuning on domain-specific annotated data. Labeling is often a time-consuming, expensive process, especially when experts in the field are involved. Recently, several benchmarks and datasets have been constructed for law (Chalkidis et al., 2022), health (Li et al., 2016) and finance (Loukas et al., 2022). In this work, we support the industry-oriented research community by presenting BUSTER: a BUSiness Transaction Entity Recognition dataset. As the title suggests, BUSTER is an Entity Recognition (ER) benchmark that focuses on the main actors involved in a business transaction. After collecting about ten thousand business transaction documents from EDGAR company acquisition reports, we constructed a dataset with 3779 manually annotated documents (the Gold corpus), from which we trained an LLM to automatically annotate the remaining 6196 documents (the Silver corpus). We analyze the properties of the proposed dataset and also evaluate the performance of some baselines. The dataset will be public and free to download as a benchmark for the NLP community.
The paper is organized as follows. First, in Section 2, we review previous related works on financial NER and document-level datasets. Then, we describe the data collection process and annotation methodologies in Sections 3 and 4, respectively. A detailed description of BUSTER and its statistics follows in Section 5. In Section 6 we establish baselines with different LLMs. Finally, in Section 7 we draw our conclusions and outline possible future research directions.

SELLING_COMPANY: The company which is selling the target.
ACQUIRED_COMPANY: The company target of the transaction.
LEGAL_CONSULTING_COMPANY: A law firm providing advice on the transaction, such as: government regulation, litigation, anti-trust, structured finance, tax, etc.
GENERIC_CONSULTING_COMPANY: A general firm providing any other type of advice, such as: financial, accountability, due diligence, etc.
ANNUAL_REVENUES (Generic_Info family): The past or present annual revenues of any company or asset involved in the transaction.

Table 1: Description of the tag-set defined in BUSTER.

Related works
Several document datasets in the financial domain have been proposed in the literature, but few of them are dedicated to the Entity Recognition (ER) task. Furthermore, these few are mainly intended for the standard Named Entity Recognition (NER) task, such as (Alvarado et al., 2015; Francis et al., 2019; Hampton et al., 2016; Kumar et al., 2016).
Alvarado et al. (2015) present FIN, a corpus of eight SEC documents manually annotated with the four standard NER types: person, organization, location and miscellaneous. Unlike that dataset, in BUSTER we decided to focus on the entities involved in a financial transaction. FiNER-139 (Loukas et al., 2022), instead, consists of a large corpus of SEC documents annotated via gold XBRL tags, with a label set of 139 numerical entities over about 1.1M sentences. There, tag attribution mostly depends on context rather than the token itself, as is the case in BUSTER. Besides the completely different tag set, the main difference between BUSTER and FiNER-139 is that we release a document-level benchmark. Indeed, detecting roles like the buyer company can require scopes wider than a single sentence. Moreover, documents come from files with heterogeneous layouts, extensions and structure, which can sometimes hinder the segmentation of the document into single sentences.
Outside the financial domain, a variety of document-level datasets for NER have been proposed. DocRED (Yao et al., 2019) is a NER and Relation Extraction (RE) corpus built from Wikidata and Wikipedia short text passages, while BioCreative (Li et al., 2016) is a dataset for NER/RE in the health domain. In (Quirk and Poon, 2016), the authors propose a dataset for NER in the medical area.

Data Collection
Our goal was to create a highly business-oriented dataset for recognizing relevant entities involved in financial transactions. Unlike standard NER tasks, we focused on the problem of entity-role recognition, where the goal is to identify a set of entities only where they appear with specific roles in a context, such as companies involved in an acquisition or consultants assisting in an operation.

Target documents
To collect such documents, we exploited the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system of the U.S. Securities and Exchange Commission (SEC). The SEC's mission is to maintain fair, orderly, and efficient markets. In particular, the organization aims to give transparency to business activities and provide investors with more security on the companies in which they invest, facilitating capital formation. For this purpose, domestic and foreign companies conducting business in the US are required to provide regular reports to the SEC through EDGAR. Reports are filed based on a list of forms that correspond to certain filing types. The EDGAR service provides more than 150 different form types and, of these, the Form 8-K type deserves particular attention.

Table 2: The quality assessment results of the output of the annotation process.
A Form 8-K provides investors with timely notification of significant changes at listed companies, such as acquisitions, bankruptcy, the resignation of directors, or changes in the fiscal year. Optionally, but very frequently, the Form 8-K includes a document called Exhibit 99.1 (often abbreviated as EX-99.1). It is a disclosure document which summarizes all the details of the operation announced in the form, designed to provide investors with a complete and detailed view of the operation.

Crawling, filtering and processing
To collect the EX-99.1 disclosure documents from EDGAR reporting company acquisitions, ownership changes and share purchases, we made use of the full index tool of the EDGAR site. Limiting ourselves to 2021, we downloaded about 120,000 EX-99.1 disclosure documents in HTML format. After parsing, cleaning and removing empty or overly short documents, we selected the relevant documents using transaction-related keywords (acquisition, acquire, ownership, etc.), obtaining a final raw dataset of about 10,000 text files.
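The filtering step can be sketched as follows; the keyword list below is illustrative (the paper does not give the full list) and the minimum-length cut-off is an assumed value, not the one actually used:

```python
import re

# Transaction-related keywords; illustrative, not the paper's full list.
KEYWORDS = ["acquisition", "acquire", "ownership"]

def is_relevant(text, min_words=100):
    """Keep a document only if it is long enough and mentions at least one
    transaction-related keyword (case-insensitive, matched at word start)."""
    if len(text.split()) < min_words:
        return False
    lowered = text.lower()
    return any(re.search(rf"\b{k}", lowered) for k in KEYWORDS)
```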

Annotation
For data labeling, we used a double-blind manual procedure. Specifically, we employed two annotators (ann_1 and ann_2), who were trained on the topic of financial transactions and who were provided with a tag-set and specific guidelines to follow in the entity tagging procedure. The annotation was performed using the expert.ai natural language platform, an integrated environment for deep language understanding that provides a complete natural language workflow with end-to-end support for annotation, labeling, model training, testing and workflow orchestration.

Tag-set
In designing the tag-set, we identified three families of tags: (a) Parties, which groups tags used to identify the entities directly involved in the transaction; (b) Advisors, which groups tags identifying any external facilitator and advisor of the transaction; and (c) Generic_Info, which groups tags reporting any other information about the transaction. For each family, we defined a set of related tags. The tag-set is reported in Table 1.

Guidelines and General instructions
In order to improve annotation coherency, the schema definitions outlined in Table 1 were provided to the annotators as guidelines. Moreover, the following general instructions were given:
• Annotate linguistically apparent instances only - Tag only instances of entities where the class is linguistically evident. Do not tag a string just because you know that it is an instance of an entity: the context must make it obvious that it is an instance of such class.
• Evaluate sentence context only - Tag only instances of entities for which there is evidence within a sentence that the instance is of that entity. Each sentence should be evaluated for entities in isolation from the rest of the document context.

Annotation Procedure
To monitor the annotation procedure, the dataset was divided into "sprints" which were provided sequentially to the annotators. Each sprint consists of a pair of document batches submitted independently to the two annotators. Additionally, we designed each sprint so that its two batches shared a certain percentage of documents; in this way, in each sprint, a portion of documents is tagged by both annotators. Although this choice reduces the number of documents processed over time, it allows subsequent estimation of the annotation quality in each sprint.

Table 3: The statistics of the 5 folds Gold and Silver data.
We set the size of each sprint to 500 documents, 100 of which were shared between the two annotators (20%).The two annotators processed 8 sprints, thus obtaining 4000 annotated documents, 800 of which were labeled by both annotators.Finally, after removing documents without any labels, the resulting dataset was composed of 3779 labeled documents.

Validation
To evaluate the quality of the annotation process output, we exploited the shared set of documents that had been tagged by both annotators. In particular, denoting by L_1 and L_2 the two sets of annotations inserted respectively by the two annotators ann_1 and ann_2 in the shared documents, we calculated several standard indexes: (a) the Joint Probability of Agreement, which measures the chance of having a match between the two annotators; (b) the Conditional Probability of Agreement of ann_k, which measures the naive probability that annotations inserted by annotator k have a match with annotations entered by the other; (c) the Coverage of ann_k, which measures the probability that a randomly selected annotation was entered by annotator k; and (d) Cohen's kappa (κ), which extends the Joint Probability of Agreement by taking into account that agreement may occur by chance (Cohen, 1960), through a term that estimates the probability of a random agreement. Here N = #(L_1 ∪ L_2) is the total number of inserted annotations.
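These indexes can be computed directly from the two annotation sets. The sketch below assumes the standard set-based definitions (the paper's exact formulas are not reproduced here) with annotations represented as hashable tuples; Cohen's kappa is omitted because it additionally requires estimating the chance-agreement term:

```python
def agreement_indexes(l1, l2):
    """Compute annotation-agreement indexes for two annotators.

    l1, l2: sets of hashable annotations, e.g. (doc_id, start, end, tag)
    tuples. The representation is illustrative, not the paper's own.
    """
    matches = len(l1 & l2)      # annotations inserted by both annotators
    n = len(l1 | l2)            # N: total number of inserted annotations
    joint = matches / n         # (a) Joint Probability of Agreement
    cond1 = matches / len(l1)   # (b) Conditional agreement of ann_1
    cond2 = matches / len(l2)   # (b) Conditional agreement of ann_2
    cov1 = len(l1) / n          # (c) Coverage of ann_1
    cov2 = len(l2) / n          # (c) Coverage of ann_2
    return joint, cond1, cond2, cov1, cov2
```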
The results are reported in Table 2, and the values of Cohen's kappa (κ) show a substantial agreement between the two annotators (Landis and Koch, 1977).

Managing annotations in shared documents
In creating the final dataset, we had to manage the shared sets annotated by both annotators. Firstly, we accepted all non-overlapping annotations from both annotators. Secondly, we fixed overlapping, incoherent annotations by involving a third annotator, who manually assigned the correct label. Moreover, pairs of overlapping annotations with boundaries l_1 = [s_1, e_1] and l_2 = [s_2, e_2] were merged into a new annotation l = [s, e] = [min(s_1, s_2), max(e_1, e_2)].
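The reconciliation rules above can be sketched as follows; the `resolve_label` callback is a hypothetical stand-in for the third annotator who settles label conflicts, and the span representation is assumed for illustration:

```python
def overlaps(a, b):
    """True if half-open spans a = (s1, e1) and b = (s2, e2) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def reconcile(l1, l2, resolve_label):
    """Reconcile two annotators' spans per the rules described above.

    l1, l2: lists of (start, end, label) triples. `resolve_label` is a
    hypothetical callback standing in for the third annotator.
    """
    final, matched = [], set()
    for s1, e1, lab1 in l1:
        partner = None
        for j, (s2, e2, lab2) in enumerate(l2):
            if j not in matched and overlaps((s1, e1), (s2, e2)):
                partner = (j, s2, e2, lab2)
                break
        if partner is None:
            final.append((s1, e1, lab1))   # non-overlapping: accept as-is
        else:
            j, s2, e2, lab2 = partner
            matched.add(j)
            label = lab1 if lab1 == lab2 else resolve_label(lab1, lab2)
            # merge boundaries into the union of the two overlapping spans
            final.append((min(s1, s2), max(e1, e2), label))
    # annotations inserted only by the second annotator are also accepted
    final += [ann for j, ann in enumerate(l2) if j not in matched]
    return final
```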

The BUSTER dataset
The final BUSTER dataset is composed of 3779 labeled documents. In Figure 1, we show an example of an annotated text passage inside a document. As explained, those documents were manually annotated and represent the "gold" BUSTER corpus. We randomly split the data into 5 folds to yield a statistically robust benchmark. Indeed, such a division allows the use of a standard k-fold cross-validation approach.

Table 4: Micro (µ-) and macro (M-) scores of the four baseline models evaluated using 5-Fold Cross Validation.
The dataset has been used as a benchmark for 4 state-of-the-art ER models (described in Section 6), and the best performing model has been used to automatically annotate the remaining 6196 documents. The resulting annotated data is released as a "silver" extra corpus in the BUSTER benchmark. The details of the 5 folds and of the silver extra corpus are reported in Table 3.
The full BUSTER benchmark is publicly available and free to download from the expert.ai website and on HuggingFace, and we are confident that it can become a point of reference in the field of Entity Recognition, in particular for the financial sector.

Figure 2 shows the distribution of document lengths: documents have an average length of around 700 words, most fall into the 500-1000 word range, and documents with more than 2000 words are extremely rare.

Statistics
In Figure 3, we report the distribution of the three tag families based on their position within the documents. We can observe that the tags belonging to the Parties family (in orange) are concentrated in the initial parts of the documents, while the remaining tags are distributed more uniformly and, in any case, located towards the second part of documents. However, no tags occur beyond the 1500th word.

Experiments
To establish baselines, we performed several experiments using both generic and domain-specific language models.

Experimental Setup
In the experiments, we followed a 5-fold cross-validation approach using the folds in Table 3.
Metrics. We adopt traditional NER metrics for evaluation, i.e. micro and macro F1 scores, referred to as µ-F1 and M-F1, respectively. True positives are counted in a strict sense, i.e. an entity is considered correctly predicted if and only if all of its constituent tokens are correctly identified and no additional tokens belong to the entity.
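Under this strict criterion, an entity matches only when boundaries and type coincide exactly, so micro F1 reduces to a set comparison. A minimal sketch, where the (doc_id, start, end, tag) representation is an assumption for illustration:

```python
def strict_micro_f1(gold, pred):
    """Micro F1 with strict matching: a predicted entity is a true positive
    only if its boundaries and type exactly match a gold entity.

    gold, pred: sets of (doc_id, start, end, tag) tuples.
    """
    tp = len(gold & pred)                       # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```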
Dealing with long documents. As shown in Figure 2, the vast majority of documents in BUSTER have more than 500 words, which typically exceeds the maximum sequence length that LLMs (e.g. BERT (Devlin et al., 2018)) can take as input. Truncation would discard most of the document and cause a significant loss of information. Therefore, we split documents into contiguous chunks of text. Chunking is done such that no token is truncated, and we fill each chunk sequence as much as possible. All the baselines are trained and tested on chunks, with the exception of Longformer, since it is capable of processing longer sequences of up to 4096 tokens.
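A minimal sketch of such a chunking strategy, assuming a word-to-subword tokenizer; the function names are illustrative, not the paper's actual implementation:

```python
def chunk_by_words(words, tokenize, max_len):
    """Greedily pack whole words into chunks whose subword length stays
    within max_len, so no word is ever split across chunks and each chunk
    is filled as much as possible.

    `tokenize` maps a word to its subword tokens (e.g. a WordPiece
    tokenizer); assumed interface for illustration.
    """
    chunks, current, cur_len = [], [], 0
    for word in words:
        n = len(tokenize(word))
        if cur_len + n > max_len and current:
            chunks.append(current)      # current chunk is as full as possible
            current, cur_len = [], 0
        current.append(word)
        cur_len += n
    if current:
        chunks.append(current)
    return chunks
```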

Baseline Models
We considered several transformer-based models that report state-of-the-art performance in NLP. In particular, we selected the following 4 models.
BERT. BERT (Devlin et al., 2018) constitutes a standard baseline, since it is one of the most popular LLMs nowadays.
RoBERTa. Similarly to BERT, RoBERTa (Liu et al., 2019) is a widely-used language model in the NLP community. The model is an optimized version of BERT and generally outperforms it.
SEC-BERT. We also consider a domain-specific LLM: SEC-BERT (Loukas et al., 2022), a model pre-trained from scratch on EDGAR-CORPUS, a large collection of financial documents (Loukas et al., 2021).
Longformer. Longformer (Beltagy et al., 2020) is a transformer architecture equipped with a self-attention mechanism that scales linearly with the sequence length. Longformer was specifically designed to deal with long documents, hence it is a natural candidate for processing BUSTER.

Results
The baselines' performance is presented in Table 4. RoBERTa turned out to be the best performing model, with Longformer achieving similar levels of accuracy. BERT-base, instead, underperformed with respect to the other baselines. However, when adapting BERT to the financial domain (SEC-BERT) there is a clear F1 improvement.
Inspecting the per-tag scores obtained by the best model, i.e. RoBERTa (Table 5), we can observe that the Advisors family is generally well captured by the model. For the Parties and Generic_Info families, instead, the results differ. The model performs very well on BUYING_COMPANY, while ACQUIRED_COMPANY, SELLING_COMPANY and ANNUAL_REVENUES appear harder to discriminate, especially in terms of precision. In our analysis, this depends on some structural characteristics of these entities. The first two tags (ACQUIRED_COMPANY and SELLING_COMPANY) are strongly related to each other and are often not easy to disambiguate even for human annotators, as confirmed by the quality assessment outlined in Table 2. The definition of ANNUAL_REVENUES, instead, is very specific and detailed (Section 4), which makes it hard to distinguish from occurrences of other economic figures in the text, e.g. EBITDA. Finally, this inherent complexity inevitably increases the noise in the gold annotations, thus affecting the training of the model itself.

Conclusions and future works
In this work, we presented BUSTER, an Entity Recognition (ER) benchmark for business transaction-related entities. It consists of a corpus of 3779 manually annotated documents on financial transactions (the Gold data), which has been randomly divided into 5 folds, plus an additional set of 6196 automatically annotated documents (the Silver data) created with the fine-tuned RoBERTa model.
The full BUSTER benchmark is publicly available and free to download from the expert.ai website and on HuggingFace, and we are confident that it can become a point of reference in the field of Entity Recognition, in particular for the financial sector.
In the future, we intend to work in two directions. On the one hand, we plan to increase the amount of manually labeled data and to extend the label set with more transaction-related tags. On the other hand, we aim to introduce specific types of relations between entities in order to extend the dataset to Relation Extraction.

Figure 1: An annotated example extracted from BUSTER.

Figure 2: The distribution of document lengths in BUSTER.

Figure 3: Distribution of tag families inside the documents.