Natural language processing improves computerized disease tagging

Researchers in the Netherlands report success with a computer system that applies natural language processing (NLP) to biomedical text. The system links concepts mentioned in the text to sources that contain further information about them, a step needed to help computers extract useful information from free text.

In their paper, published this week in the Journal of the American Medical Informatics Association, the researchers measured the effect of adding an NLP module to two biomedical concept normalization systems, MetaMap and Peregrine, both of which were run on the Arizona Disease Corpus.

"Concept or named-entity recognition aims at finding text strings that refer to entities, and marking each entity with a semantic type, like 'gene', 'drug', or 'disease'," the authors said. "Concept normalization goes beyond entity recognition. It assigns a unique identifier to the recognized concept, which links it to a source that contains further information about the concept, such as its definition, its preferred name and synonyms, and its relationships with other concepts."
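The distinction the authors draw can be sketched in a few lines of Python. This is a minimal illustration only, not how MetaMap or Peregrine work: the dictionary, the identifiers ("D001", "D002") and the function names are all made up for the example.

```python
import re

# Hypothetical mini-dictionary: surface form -> (concept id, preferred name).
# The identifiers are made-up placeholders, not codes from a real vocabulary.
LEXICON = {
    "breast cancer": ("D001", "breast carcinoma"),
    "breast carcinoma": ("D001", "breast carcinoma"),
    "prostate cancer": ("D002", "prostate carcinoma"),
}


def recognize(text):
    """Entity recognition: find strings that refer to a disease and
    return (start, end, matched string, semantic type) for each one."""
    found = []
    lowered = text.lower()
    for term in LEXICON:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, lowered):
            found.append((m.start(), m.end(), text[m.start():m.end()], "disease"))
    return sorted(found)


def normalize(mention):
    """Concept normalization: attach a unique identifier (and the
    preferred name it points to) to an already recognized mention."""
    start, end, string, sem_type = mention
    concept_id, preferred = LEXICON[string.lower()]
    return {
        "span": (start, end),
        "text": string,
        "type": sem_type,
        "id": concept_id,
        "preferred_name": preferred,
    }


def tag(text):
    """Run recognition, then normalization, over a piece of free text."""
    return [normalize(m) for m in recognize(text)]


for concept in tag("Breast carcinoma and prostate cancer were studied."):
    print(concept)
```

In this toy setup, "breast cancer" and "breast carcinoma" are different strings but resolve to the same placeholder identifier, which is the point of normalization: synonyms collapse onto one concept.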

Without the NLP module, MetaMap's accuracy was 61 percent and Peregrine's 63.9 percent for exact matching of text strings within set boundaries, and 55.1 percent and 56.9 percent, respectively, for matching concept identifiers. With the aid of the NLP module, MetaMap's accuracy improved to 73.3 percent for boundary matching and 66.2 percent for concept identifier matching, while Peregrine improved to 78 percent for boundary matching and 69.8 percent for concept identifier matching.

When inexact boundary matching was allowed, performance rose further, to 85.5 percent for MetaMap and 85.4 percent for Peregrine, and to 73.6 percent and 73.3 percent, respectively, for concept identifier matching.
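The difference between exact and inexact boundary matching, and between boundary and concept identifier matching, can be illustrated with a short Python sketch. Several things here are assumptions rather than details from the paper: "inexact" is taken to mean that the predicted and reference spans overlap by at least one character, concept identifier matching is taken to require a boundary match plus the same identifier, and the score is computed as an F-score, since the article does not say how its percentages were calculated. The annotations are invented.

```python
def overlaps(a, b):
    """True if two (start, end, ...) annotations share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def count_hits(gold, predicted, exact=True, check_id=False):
    """Count predicted annotations that match a not-yet-used gold annotation.

    Annotations are (start, end, concept_id) tuples.
    exact=True   -> boundaries must be identical
    exact=False  -> boundaries only need to overlap (assumed definition)
    check_id     -> the concept identifier must also agree
    """
    unused = list(gold)
    hits = 0
    for p in predicted:
        for g in unused:
            boundary_ok = (p[:2] == g[:2]) if exact else overlaps(p, g)
            id_ok = (p[2] == g[2]) if check_id else True
            if boundary_ok and id_ok:
                unused.remove(g)  # each gold annotation can be matched once
                hits += 1
                break
    return hits


def f_score(gold, predicted, exact=True, check_id=False):
    """Harmonic mean of precision and recall (an assumed choice of metric)."""
    hits = count_hits(gold, predicted, exact=exact, check_id=check_id)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented reference and system annotations for one sentence.
gold = [(0, 16, "D001"), (21, 36, "D002")]
predicted = [(0, 16, "D001"), (30, 36, "D003")]

for exact in (True, False):
    for check_id in (False, True):
        print(exact, check_id, f_score(gold, predicted, exact=exact, check_id=check_id))
```

In this invented example the second prediction covers only part of the reference span and carries the wrong identifier, so it counts as a hit under inexact boundary matching but not under exact matching or concept identifier matching. That mirrors the pattern in the reported figures, where inexact boundary scores are the highest and concept identifier scores are lower.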

It remains to be determined whether such improvements could be made on text beyond just diseases, the authors wrote.

NLP might be used to pick up missing diagnoses from free text, and perhaps even predict problems before physicians spot them, according to a FierceHealthIT commentary published earlier this year.

NLP is already finding an array of applications in healthcare, including flagging breast and prostate cancer and de-identifying clinical health data for use in research.
