Natural-language processing tool flags breast, prostate cancer

By Susan D. Hall Aug 3, 2012 10:11am

Researchers claimed success using an SAS-based natural-language processing (NLP) tool to detect breast and prostate cancers from pathology reports, according to a study published this week in the Journal of the American Medical Informatics Association.

Results from the SAS-based coding, extraction, and nomenclature tool (SCENT) were compared with manual classifications for a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California.

SCENT successfully identified 51 of 54 primary and 60 of 61 recurrent cancers. It flagged only three false positives from 793 known benign results. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94 percent in both cancer groups.
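For readers unfamiliar with the four measures, here is a minimal Python sketch of how they are derived from a table of true and false positives and negatives. It pools the counts quoted above into a single table, which is an assumption made only for illustration: the study reports the measures separately for each cancer group, and the article does not give the per-group breakdown. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of tool results against the manual classification."""
    true_pos: int   # cancers the tool flagged
    false_neg: int  # cancers the tool missed
    false_pos: int  # benign results the tool flagged
    true_neg: int   # benign results the tool left unflagged


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the four standard measures from the counts."""
    return {
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        "ppv": c.true_pos / (c.true_pos + c.false_pos),
        "npv": c.true_neg / (c.true_neg + c.false_neg),
    }


# Assumption: primary and recurrent cancers are pooled into one table,
# using only the counts quoted in the article.
pooled = ConfusionCounts(
    true_pos=51 + 60,                 # identified primary + recurrent
    false_neg=(54 - 51) + (61 - 60),  # missed primary + recurrent
    false_pos=3,
    true_neg=793 - 3,
)

metrics = diagnostic_metrics(pooled)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under that pooling assumption all four values come out above 94 percent, which is consistent with the per-group figures reported in the study.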

The authors noted that previous research has found natural-language processing tools useful in coding and extracting information from clinical text. Indeed, a recent paper published in Radiology cited success with a tool that parses data from records to build a repository aimed at improving dosing recommendations.

They attributed the slow rate of adoption, however, to difficulty in integration with clinical data systems, technical complexity, and habitual use of medical claims data.

Creating a common dictionary of terms can be painstaking work, as illustrated in a study on the use of NLP in breast cancer research published recently in the Journal of Pathology Informatics. It reported 124 ways of saying "invasive ductal carcinoma," 95 ways of saying "invasive lobular carcinoma," and more than 4,000 ways of saying that invasive ductal carcinoma was not present.
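To give a sense of what such a dictionary involves, the following Python sketch maps a handful of variant phrasings to one canonical term and checks for a preceding negation cue. It is a toy illustration only: the variant lists, the negation cues, and the function name are assumptions chosen for the example, not the vocabulary or method used by SCENT or by the Journal of Pathology Informatics study, and a real dictionary would be far larger.

```python
import re

# Hypothetical variant phrasings, keyed by canonical term.
CONCEPT_VARIANTS = {
    "invasive ductal carcinoma": [
        "invasive ductal carcinoma",
        "infiltrating ductal carcinoma",
        "invasive duct carcinoma",
    ],
    "invasive lobular carcinoma": [
        "invasive lobular carcinoma",
        "infiltrating lobular carcinoma",
    ],
}

# Hypothetical negation cues; real reports use many more.
NEGATION = re.compile(r"\b(?:no evidence of|negative for|free of|without|no)\b")


def find_concepts(sentence: str) -> dict[str, bool]:
    """Map each canonical term mentioned in the sentence to True if it is
    asserted as present, or False if a negation cue precedes it."""
    text = re.sub(r"\s+", " ", sentence.lower())
    found: dict[str, bool] = {}
    for concept, variants in CONCEPT_VARIANTS.items():
        for variant in variants:
            pos = text.find(variant)
            if pos == -1:
                continue
            # Crude rule: any cue earlier in the sentence negates the term.
            negated = NEGATION.search(text[:pos]) is not None
            found[concept] = not negated
            break
    return found


print(find_concepts("Infiltrating ductal carcinoma, grade 2."))
print(find_concepts("No evidence of invasive ductal carcinoma."))
```

Even this small example shows why the work is painstaking: every new phrasing or negation pattern found in real reports has to be added by hand.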

The authors noted the system probably would require tweaking to work with clinical text in other medical areas and outside of Kaiser Permanente, but the system "has the potential to provide significant value to clinical and epidemiologic researchers, particularly when statistical NLP is infeasible due to resource or other constraints."

A second article on natural language processing in JAMIA found the technology effective for de-identifying clinical health data for research, promising relief from the tedium of manually coding records from which personal information had been scrubbed.

While much data for research remains outside machine-readable formats, platforms enabling comparative-effectiveness research too often lack natural-language processing capabilities, according to a Medical Care study.

To learn more:
- read the research
- check out the Journal of Pathology Informatics study