ABSTRACT
The aim of Area 4 of the Strategic Healthcare IT Advanced Research Project (SHARP 4) is to facilitate secondary use of data stored in Electronic Medical Records (EMRs) through high-throughput phenotyping. Clinical Natural Language Processing (NLP) plays an important role in transforming information in clinical text into a standard representation that is comparable and interoperable. To meet the NLP requirements of different secondary use cases of EMR data, accommodate different NLP approaches, and enable interoperability between structured and unstructured data generated in different clinical settings, we define a common type system for clinical NLP for SHARP 4 that integrates a comprehensive model of clinical semantics with language processing types. The type system has been implemented in UIMA (Unstructured Information Management Architecture), which allows flexible passing of input and output data types among NLP components. The implementation is available at the SHARP 4 website.
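The idea of a shared type system through which components pass typed inputs and outputs can be illustrated with a minimal sketch. It is written in plain Python and uses neither UIMA nor the SHARP 4 type definitions. Every name in it (Annotation, Token, MedicationMention, Document, Component, run_pipeline, and the toy lexicon with its made-up code) is a hypothetical stand-in chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Annotation:
    """Base type: a span of text identified by character offsets."""
    begin: int
    end: int


@dataclass
class Token(Annotation):
    """Hypothetical language-processing type."""
    pos: str = ""


@dataclass
class MedicationMention(Annotation):
    """Hypothetical clinical-semantics type with a normalized code."""
    code: str = ""


@dataclass
class Document:
    """Shared container that every component reads from and writes to."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, kind: type) -> list:
        """Return all annotations that are instances of the given type."""
        return [a for a in self.annotations if isinstance(a, kind)]


@dataclass
class Component:
    """A processing step that declares the types it consumes and produces."""
    name: str
    inputs: Set[type]
    outputs: Set[type]
    run: Callable[[Document], None]


def tokenize(doc: Document) -> None:
    """Add one Token per whitespace-separated word."""
    offset = 0
    for word in doc.text.split():
        begin = doc.text.index(word, offset)
        end = begin + len(word)
        doc.annotations.append(Token(begin, end))
        offset = end


# Toy lexicon; the code value is invented and belongs to no real vocabulary.
TOY_LEXICON = {"aspirin": "TOY:0001"}


def find_medications(doc: Document) -> None:
    """Add a MedicationMention for each Token found in the toy lexicon."""
    for tok in doc.select(Token):
        surface = doc.text[tok.begin:tok.end].lower()
        if surface in TOY_LEXICON:
            doc.annotations.append(
                MedicationMention(tok.begin, tok.end, TOY_LEXICON[surface])
            )


TOKENIZER = Component("tokenizer", set(), {Token}, tokenize)
MEDICATION_FINDER = Component(
    "medication-finder", {Token}, {MedicationMention}, find_medications
)


def run_pipeline(doc: Document, components: List[Component]) -> Document:
    """Run components in order, checking declared input types are available."""
    available: Set[type] = set()
    for comp in components:
        missing = comp.inputs - available
        if missing:
            names = ", ".join(sorted(t.__name__ for t in missing))
            raise ValueError(f"{comp.name} is missing input types: {names}")
        comp.run(doc)
        available |= comp.outputs
    return doc
```

Because each component only names the types it reads and writes, a different tokenizer could replace the one above without changes to the medication finder, as long as it still produces Token annotations. This is the kind of interchangeability a common type system is meant to support.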