We study a new problem setting of information extraction (IE), referred to as text-to-table. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We consider text-to-table as an inverse problem of the well-studied table-to-text, and make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods of using relation extraction and named entity extraction. The results also show that our method can further boost the performances of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data are available at https://github.com/shirley-wu/text_to_table.
mixSeq: A Simple Data Augmentation Methodfor Neural Machine Translation
Xueqing Wu | Yingce Xia | Jinhua Zhu | Lijun Wu | Shufang Xie | Yang Fan | Tao Qin
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
Data augmentation, which refers to manipulating the inputs (e.g., adding random noise,masking specific parts) to enlarge the dataset,has been widely adopted in machine learning. Most data augmentation techniques operate on a single input, which limits the diversity of the training corpus. In this paper, we propose a simple yet effective data augmentation technique for neural machine translation, mixSeq, which operates on multiple inputs and their corresponding targets. Specifically, we randomly select two input sequences,concatenate them together as a longer input aswell as their corresponding target sequencesas an enlarged target, and train models on theaugmented dataset. Experiments on nine machine translation tasks demonstrate that such asimple method boosts the baselines by a non-trivial margin. Our method can be further combined with single input based data augmentation methods to obtain further improvements.