emrQA: A Large Corpus for Question Answering on Electronic Medical Records

Authors

Anusri Pampari
Preethi Raghavan
Jennifer Liang
Jian Peng

Published on

09/03/2018

Categories

AI in Healthcare, Natural Language Processing

We propose a novel methodology for generating large-scale, domain-specific question answering (QA) datasets by re-purposing existing annotations created for other NLP tasks. We demonstrate an instance of this methodology by generating a large-scale QA dataset for electronic medical records, leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
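To make the idea of re-purposing annotations concrete, the following is a minimal sketch, not the emrQA generation pipeline itself. It assumes a simplified setup in which each expert annotation carries an entity type, the entity text, and the note sentence it came from, and in which question templates with `|slot|` placeholders are paired with logical-form templates. The `Annotation` and `Template` classes, the `generate_qa` function, the template strings, and the example note sentence are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One expert annotation on a clinical note (hypothetical, simplified schema)."""
    entity_type: str  # e.g. "medication"
    text: str         # surface form of the annotated entity
    evidence: str     # note sentence that contains the entity


@dataclass(frozen=True)
class Template:
    """A question template paired with a logical-form template (illustrative)."""
    entity_type: str   # which annotation type fills the slot
    question: str      # contains a |entity_type| placeholder
    logical_form: str  # contains the same placeholder


# Illustrative templates; the wording and logical-form syntax are assumptions.
TEMPLATES = [
    Template("medication",
             "What is the dosage of |medication|?",
             "MedicationEvent(|medication|) [dosage=x]"),
    Template("medication",
             "Why is the patient on |medication|?",
             "MedicationEvent(|medication|) given ConditionEvent(x)"),
]


def generate_qa(annotations, templates):
    """Fill every matching template with every annotation of the same type."""
    pairs = []
    for ann in annotations:
        for tpl in templates:
            if tpl.entity_type != ann.entity_type:
                continue
            slot = f"|{tpl.entity_type}|"
            pairs.append({
                "question": tpl.question.replace(slot, ann.text),
                "logical_form": tpl.logical_form.replace(slot, ann.text),
                # The annotated sentence doubles as the answer evidence.
                "evidence": ann.evidence,
            })
    return pairs


# Synthetic example annotation (not taken from any real clinical note).
annotations = [
    Annotation("medication", "metformin",
               "Started metformin 500 mg twice daily for diabetes."),
]

pairs = generate_qa(annotations, TEMPLATES)
for pair in pairs:
    print(pair["question"], "->", pair["logical_form"])
```

In this sketch a single annotation yields one question-logical form pair and one question-answer evidence pair per matching template, which illustrates how a modest set of templates applied to a large annotated corpus can produce a dataset at the scale described above.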

Please cite our work using the BibTeX below.


@article{DBLP:journals/corr/abs-1809-00732,
  author    = {Anusri Pampari and
               Preethi Raghavan and
               Jennifer J. Liang and
               Jian Peng},
  title     = {emrQA: {A} Large Corpus for Question Answering on Electronic Medical
               Records},
  journal   = {CoRR},
  volume    = {abs/1809.00732},
  year      = {2018},
  url       = {http://arxiv.org/abs/1809.00732},
  archivePrefix = {arXiv},
  eprint    = {1809.00732},
  timestamp = {Tue, 04 Jun 2019 21:08:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1809-00732.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}