Jian Wu


Extractive Research Slide Generation Using Windowed Labeling Ranking
Athar Sefid | Prasenjit Mitra | Jian Wu | C Lee Giles
Proceedings of the Second Workshop on Scholarly Document Processing

Presentation slides generated from original research papers provide an efficient form to present research innovations. Manually generating presentation slides is labor-intensive. We propose a method to automatically generate slides for scientific articles based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods including SummaRuNNer by a significant margin in terms of ROUGE score.


SmartCiteCon: Implicit Citation Context Extraction from Academic Literature Using Supervised Learning
Chenrui Guo | Haoran Cui | Li Zhang | Jiamin Wang | Wei Lu | Jian Wu
Proceedings of the 8th International Workshop on Mining Scientific Publications

We introduce SmartCiteCon (SCC), a Java API for extracting both explicit and implicit citation context from academic literature in English. The tool is built on a Support Vector Machine (SVM) model trained on a set of 7,058 manually annotated citation context sentences, curated from 34,000 papers from the ACL Anthology. The model with 19 features achieves F1=85.6%. SCC supports PDF, XML, and JSON files out of the box, provided that they conform to certain schemas. The API supports single document processing and batch processing in parallel. It takes about 12–45 seconds on average, depending on the format, to process a document on a dedicated server with 6 multithreaded cores. Using SCC, we extracted 11.8 million citation context sentences from ~33.3k PMC papers in the CORD-19 dataset, released on June 13, 2020. We will continue to contribute supplementary data to CORD-19 and other datasets. The source code is released at https://gitee.com/irlab/SmartCiteCon.

Acknowledgement Entity Recognition in CORD-19 Papers
Jian Wu | Pei Wang | Xin Wei | Sarah Rajtmajer | C. Lee Giles | Christopher Griffin
Proceedings of the First Workshop on Scholarly Document Processing

Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, AckExtract, based on open-source text mining software, and evaluate our method using manually labeled data. AckExtract uses the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F_1=0.92. We built a supplementary database by linking CORD-19 papers with acknowledgement entities extracted by AckExtract, including persons and organizations, and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract.


Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation
Minghua Nuo | Huidan Liu | Congjun Long | Jian Wu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)


Zipf’s Law and Statistical Data on Modern Tibetan
Huidan Liu | Minghua Nuo | Jian Wu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers


Tibetan Base Noun Phrase Identification Framework Based on Chinese-Tibetan Sentence Aligned Corpus
Ming Hua Nuo | Hui Dan Liu | Wei Na Zhao | Long Long Ma | Jian Wu | Zhi Ming Ding
Proceedings of COLING 2012

Building Large Scale Text Corpus for Tibetan Natural Language Processing by Extracting Text from Web Pages
Huidan Liu | Minghua Nuo | Jian Wu | Yeping He
Proceedings of the 10th Workshop on Asian Language Resources


Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field
Huidan Liu | Minghua Nuo | Longlong Ma | Jian Wu | Yeping He
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

Compression Methods by Code Mapping and Code Dividing for Chinese Dictionary Stored in a Double-Array Trie
Huidan Liu | Minghua Nuo | Longlong Ma | Jian Wu | Yeping He
Proceedings of 5th International Joint Conference on Natural Language Processing


Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation
Huidan Liu | Weina Zhao | Minghua Nuo | Li Jiang | Jian Wu | Yeping He
Coling 2010: Posters