Natural-language processing effective for de-identifying clinical health data

Using natural-language processing programs to quickly scrub clinical narrative text of personal health information for trend analysis is just as effective as time-consuming record annotation by humans, a new study finds.

Electronic health records contain huge stores of information that researchers can use to analyze who gets sick and why, and how treatment regimens affect outcomes. The records also can be used to predict emerging health problems among large populations or smaller demographic subsets, or to analyze business practices. But before the data can be analyzed, HIPAA privacy rules require either that personal health information be removed or that patients consent to the use of their records.

The study, published today in the Journal of the American Medical Informatics Association, compared manual and NLP-based de-identification of about 3,500 clinical notes extracted from 5 million documents from a large pediatric hospital in Cincinnati. Performance was measured by the ability to identify and remove 18 elements of protected health information as defined under HIPAA without also removing non-PHI.

"NLP-based de-identification shows excellent performance that rivals the performance of human annotators," the authors concluded. "Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively."

The study, "Large-scale evaluation of automated clinical note de-identification and its impact on information extraction," measured both sensitivity to PHI and precision in removing it. In both cases, NLP systems performed slightly better than human annotators.
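For readers unfamiliar with the two measures: sensitivity is the share of true PHI items that a reviewer or program catches, and precision is the share of flagged items that really are PHI. The snippet below is a minimal sketch of that arithmetic, assuming PHI is marked as character spans and counted only on an exact match. The function name and the sample spans are hypothetical and are not taken from the study's scoring method.

```python
def sensitivity_and_precision(gold, predicted):
    """Score flagged PHI spans against a reference annotation.

    gold, predicted: sets of (start, end) character offsets.
    Assumption: a span counts as correct only on an exact match.
    """
    true_positives = len(gold & predicted)
    # Sensitivity: fraction of real PHI spans that were found.
    sensitivity = true_positives / len(gold) if gold else 0.0
    # Precision: fraction of flagged spans that are real PHI.
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Hypothetical note: four real PHI spans; the system flags five, three correctly.
gold = {(0, 8), (15, 25), (40, 52), (60, 70)}
predicted = {(0, 8), (15, 25), (40, 52), (80, 85), (90, 95)}

sensitivity, precision = sensitivity_and_precision(gold, predicted)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Low sensitivity means PHI leaks through; low precision means useful clinical text is scrubbed along with it, which is why the study reports both.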

Meanwhile, concern is growing that de-identified PHI could be re-identified, although there is little evidence that this has been happening. A June article in JAMIA, "Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data," noted concern among privacy groups that there is no accountability if PHI is re-identified. Once information has been de-identified, it is no longer regulated under HIPAA.

Failure to address those concerns could pose obstacles to widespread EHR data mining, the authors contended.

To learn more:
- read the study abstract
