Mohammad Golam Sohrab


2022

BiomedCurator: Data Curation for Biomedical Literature
Mohammad Golam Sohrab | Khoa N.A. Duong | Ikeda Masami | Goran Topić | Yayoi Natsume-Kitatani | Masakata Kuroda | Mari Nogami Itoh | Hiroya Takamura
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

We present BiomedCurator, a web application that extracts structured data from scientific articles in PubMed and ClinicalTrials.gov. BiomedCurator uses state-of-the-art natural language processing techniques to fill the fields pre-selected by domain experts in the relevant biomedical area. The BiomedCurator web application includes: a text-generation-based model for relation extraction, entity detection and recognition, a text classification model for extracting several fields, information retrieval from an external knowledge base to retrieve IDs, and a pattern-based extraction approach that can extract several fields using regular expressions over the PubMed and ClinicalTrials.gov datasets. Evaluation results show that the different approaches of the BiomedCurator web application system are effective for automatic data curation in the biomedical domain.

2020

BENNERD: A Neural Named Entity Linking System for COVID-19
Mohammad Golam Sohrab | Khoa Duong | Makoto Miwa | Goran Topić | Ikeda Masami | Takamura Hiroya
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a biomedical entity linking (EL) system BENNERD that detects named entities in text and links them to the unified medical language system (UMLS) knowledge base (KB) entries to facilitate the corona virus disease 2019 (COVID-19) research. BENNERD mainly covers biomedical domain, especially new entity types (e.g., coronavirus, viral proteins, immune responses) by addressing CORD-NER dataset. It includes several NLP tools to process biomedical texts including tokenization, flat and nested entity recognition, and candidate generation and ranking for EL that have been pre-trained using the CORD-NER corpus. To the best of our knowledge, this is the first attempt that addresses NER and EL on COVID-19-related entities, such as COVID-19 virus, potential vaccines, and spreading mechanism, that may benefit research on COVID-19. We release an online system to enable real-time entity annotation with linking for end users. We also release the manually annotated test set and CORD-NERD dataset for leveraging EL task. The BENNERD system is available at https://aistairc.github.io/BENNERD/.

mgsohrab at WNUT 2020 Shared Task-1: Neural Exhaustive Approach for Entity and Relation Recognition Over Wet Lab Protocols
Mohammad Golam Sohrab | Anh-Khoa Duong Nguyen | Makoto Miwa | Hiroya Takamura
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

We present a neural exhaustive approach that addresses named entity recognition (NER) and relation recognition (RE) for the entity and relation recognition over wet-lab protocols shared task. We introduce a BERT-based neural exhaustive approach that enumerates all possible spans as potential entity mentions and classifies them into entity types or no entity with deep neural networks to address NER. To solve the relation extraction task, based on the NER predictions or given gold mentions, we create all possible trigger-argument pairs and classify them into relation types or no relation. In the NER task, we achieved an F-score of 76.60%, ranking third among the participating systems. In the relation extraction task, we achieved an F-score of 80.46%, the top system in the relation extraction or recognition task. Besides, we compare our model based on the wet lab protocols corpus (WLPC) with the WLPC baseline and dynamic graph-based information extraction (DyGIE) systems.
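
The trigger-argument pairing step described in this abstract can be illustrated with a minimal sketch. It only builds the candidate pairs; the neural classifier that assigns a relation type or "no relation" is not shown. The `Mention` type, the `candidate_pairs` function, the default trigger label `"Action"`, and the choice to pair a trigger with every other mention (including other triggers) are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """A predicted or gold entity mention (hypothetical container type)."""
    start: int   # index of the first token
    end: int     # index one past the last token
    label: str   # entity type


def candidate_pairs(
    mentions: List[Mention], trigger_label: str = "Action"
) -> List[Tuple[Mention, Mention]]:
    """Pair every trigger mention with every other mention.

    Assumption: mentions whose label equals `trigger_label` act as triggers.
    Each returned pair would then be classified into a relation type or
    "no relation" by a separate model.
    """
    triggers = [m for m in mentions if m.label == trigger_label]
    return [(t, a) for t in triggers for a in mentions if a != t]


mentions = [
    Mention(0, 1, "Action"),
    Mention(1, 2, "Reagent"),
    Mention(3, 4, "Location"),
]
pairs = candidate_pairs(mentions)
```

With one trigger and two other mentions, `pairs` holds two candidates, one per potential argument.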

2019

A Neural Pipeline Approach for the PharmaCoNER Shared Task using Contextual Exhaustive Models
Mohammad Golam Sohrab | Minh Thang Pham | Makoto Miwa | Hiroya Takamura
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

We present a neural pipeline approach that performs named entity recognition (NER) and concept indexing (CI), which links the recognized entities to concept unique identifiers (CUIs) in a knowledge base, for the PharmaCoNER shared task on pharmaceutical drugs and chemical entities. We propose a neural NER model that captures the surrounding semantic information of a given sequence by capturing the forward and backward context of the bidirectional LSTM (Bi-LSTM) output of a target span, using a contextual span representation-based exhaustive approach. The NER model enumerates all possible spans as potential entity mentions and classifies them into entity types or no entity with deep neural networks. For representing spans, we compare several different neural network architectures and their ensembling for the NER model. We then perform dictionary matching for CI and, if there is no match, we further compute similarity scores between a mention and CUIs using entity embeddings to assign the CUI with the highest score to the mention. We evaluate our approach on the two sub-tasks in the shared task. Among the five submitted runs, the best run for each sub-task achieved an F-score of 86.76% on Sub-task 1 (NER) and an F-score of 79.97% (strict) on Sub-task 2 (CI).
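
The concept indexing step in this abstract (dictionary matching first, embedding similarity as a fallback) can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the lowercase normalization, the use of cosine similarity, the two-dimensional vectors, and the identifiers `CUI-A` and `CUI-B` are all hypothetical, since the abstract does not specify the similarity measure or the embedding model.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(
    mention: str,
    dictionary: Dict[str, str],
    mention_vec: List[float],
    cui_vecs: Dict[str, List[float]],
) -> str:
    """Return a concept ID: exact dictionary hit, else most similar embedding."""
    key = mention.lower()  # assumption: case-insensitive dictionary lookup
    if key in dictionary:
        return dictionary[key]
    # Fallback: pick the concept whose embedding scores highest.
    return max(cui_vecs, key=lambda cui: cosine(mention_vec, cui_vecs[cui]))


# Toy data with made-up identifiers and vectors.
dictionary = {"aspirin": "CUI-A"}
cui_vecs = {"CUI-A": [1.0, 0.0], "CUI-B": [0.0, 1.0]}

hit = link_mention("Aspirin", dictionary, [0.0, 1.0], cui_vecs)
fallback = link_mention("unknown drug", dictionary, [0.1, 0.9], cui_vecs)
```

Here `hit` comes from the dictionary even though its vector is closer to `CUI-B`, which reflects the order described in the abstract: similarity scoring is used only when no dictionary match exists.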

2018

Deep Exhaustive Model for Nested Named Entity Recognition
Mohammad Golam Sohrab | Makoto Miwa
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a simple deep neural model for nested named entity recognition (NER). Most NER models have focused on flat entities and ignored nested entities, and thus fail to fully capture the underlying semantic information in texts. The key idea of our model is to enumerate all possible regions or spans as potential entity mentions and classify them with deep neural networks. To reduce the computational costs and capture the information of the contexts around the regions, the model represents the regions using the outputs of a shared underlying bidirectional long short-term memory. We evaluate our exhaustive model on the GENIA and JNLPBA corpora in the biomedical domain, and the results show that our model outperforms state-of-the-art models on nested and flat NER, achieving 77.1% and 78.4% respectively in terms of F-score, without any external knowledge resources.
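
The enumeration step at the heart of the exhaustive model can be sketched in a few lines. This sketch covers only span generation; the shared bidirectional LSTM representation and the neural classifier are not shown. The function name `enumerate_spans`, the `max_len` cap on span length, and the example tokens are assumptions for illustration.

```python
from typing import List, Tuple


def enumerate_spans(tokens: List[str], max_len: int) -> List[Tuple[int, int]]:
    """Return every (start, end) token span, end exclusive, of length <= max_len.

    Because overlapping and contained spans are all generated, a nested
    mention and the mention that contains it both appear as candidates.
    """
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


tokens = ["human", "interleukin-2", "receptor", "gene"]
spans = enumerate_spans(tokens, max_len=3)
```

For these four tokens with `max_len=3`, nine candidate spans are produced. Both `(1, 2)` and the span `(1, 3)` that contains it are among them, so a classifier applied to each candidate independently can label nested mentions.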