PDTB XML: the XMLization of the Penn Discourse TreeBank 2.0

Xuchen Yao, Irina Borisova, Mehwish Alam


Abstract
The current study presents a conversion and unification of the Penn Discourse TreeBank 2.0 (PDTB) and the Penn TreeBank (PTB) under XML format. The main goal of the PDTB XML is to create a tool for efficient and broad querying of the syntax and discourse information simultaneously. The key stages of the project are developing proper cross-references between different data types and their representation in the modified TIGER-XML format, and then writing the required declarative languages (XML Schema). PTB XML is compatible with TIGER-XML format. The PDTB XML is developed as a unified format for the convenience of XQuery users; it integrates discourse relations and XML structures into one unified hierarchy and builds the cross references between the syntactic trees and the discourse relations. The syntactic and discourse elements are assigned with unique IDs in order to build cross-references between them. The converted corpus allows for a simultaneous search for syntactically specified discourse information based on the XQuery standard, which is illustrated with a simple example in the article.
Anthology ID:
L10-1230
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/336_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Xuchen Yao, Irina Borisova, and Mehwish Alam. 2010. PDTB XML: the XMLization of the Penn Discourse TreeBank 2.0. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
PDTB XML: the XMLization of the Penn Discourse TreeBank 2.0 (Yao et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/336_Paper.pdf