Algorithm to redact PHI takes opposite tack

Rather than training an algorithm to spot personal health information (PHI) for de-identification in physician notes, new research focused on identifying words and phrases that are not PHI, according to a study published in BMC Medical Informatics and Decision Making.

The method achieved 98 percent recall of PHI across 220 discharge summaries, the authors report. All patient names, phone numbers, and home addresses were at least partially redacted.

Doctors' notes hold a wealth of information that could be used for disease research, but have been of limited use because they contain protected information. Previous attempts to automate the process of de-identifying PHI in notes typically focused on rules-based algorithms for identifying data such as names, addresses and Social Security numbers, then "training" the algorithm to be more accurate. 

This research took a different tack: comparing part-of-speech tags between journal publications and physician notes. Patient identifiers are generally nouns and numbers that appear infrequently in journal articles, the authors note, offering a different way of thinking about the problem.
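The idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag set, the `min_count` threshold, the function names and the toy data are all assumptions made for illustration, and the tokens are assumed to arrive already tagged with Penn Treebank-style part-of-speech labels.

```python
from collections import Counter

# Assumption: Penn Treebank-style tags for nouns and numbers.
CANDIDATE_TAGS = {"NN", "NNS", "NNP", "NNPS", "CD"}

def build_journal_counts(tagged_journal_tokens):
    """Count how often each (lowercased word, tag) pair appears in journal text."""
    return Counter((word.lower(), tag) for word, tag in tagged_journal_tokens)

def redact(tagged_note_tokens, journal_counts, min_count=2, mask="[REDACTED]"):
    """Mask nouns and numbers that are rare in the journal corpus.

    min_count is a hypothetical threshold, not a value from the study.
    """
    out = []
    for word, tag in tagged_note_tokens:
        rare = journal_counts[(word.lower(), tag)] < min_count
        out.append(mask if tag in CANDIDATE_TAGS and rare else word)
    return " ".join(out)

# Toy, invented data: a tiny "journal" corpus and one line of a note.
journal = [("patient", "NN"), ("patient", "NN"), ("received", "VBD"),
           ("aspirin", "NN"), ("aspirin", "NN"), ("daily", "RB")]
note = [("Smith", "NNP"), ("received", "VBD"), ("aspirin", "NN"),
        ("on", "IN"), ("555-0100", "CD")]

counts = build_journal_counts(journal)
print(redact(note, counts))  # [REDACTED] received aspirin on [REDACTED]
```

In this sketch the medical term "aspirin" survives because it is common in the journal text, while the surname and the phone number, which never appear there, are masked.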

They designed the method to favor recall over precision, to reach levels that hospital privacy boards have previously deemed "good enough" with minimal investment, and to run on a single computer using commodity hardware.

Compared with seven competitors in the National Institutes of Health-funded i2b2 de-identification challenge, this method would have placed first in recall, but last in overall precision, the authors say.
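The trade-off between recall and precision can be shown with a short Python sketch. The token positions below are invented for illustration and are not figures from the study; `recall_precision` is a hypothetical helper name.

```python
def recall_precision(true_phi, redacted):
    """Return (recall, precision) for sets of token positions.

    recall    = share of true PHI tokens that were redacted
    precision = share of redacted tokens that were truly PHI
    """
    true_phi, redacted = set(true_phi), set(redacted)
    hits = true_phi & redacted
    recall = len(hits) / len(true_phi) if true_phi else 1.0
    precision = len(hits) / len(redacted) if redacted else 1.0
    return recall, precision

# Invented example: 3 tokens are PHI; the redactor masks 6 tokens,
# catching all 3 PHI tokens plus 3 harmless ones.
true_phi = {0, 4, 9}
redacted = {0, 2, 4, 5, 7, 9}

r, p = recall_precision(true_phi, redacted)
print(f"recall={r:.2f} precision={p:.2f}")  # recall=1.00 precision=0.50
```

A redactor that masks aggressively misses little PHI, which raises recall, but it also masks harmless words, which lowers precision.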

Despite the potential value of doctors' notes, the U.S. Department of Health & Human Services' Office for Civil Rights has warned that there is no fail-safe way to de-identify patient data. Researchers from the Data Privacy Lab at Harvard University, among others, have shown that patients can easily be re-identified with just a few pieces of information.

Stanford University researchers, however, are claiming success in using real-time analysis of free-text notes in electronic health records to flag drug interactions.

To learn more:
- find the research (.pdf)