PMID- 25106934
OWN - NLM
STAT- MEDLINE
DCOM- 20150512
LR  - 20231110
IS  - 1872-8243 (Electronic)
IS  - 1386-5056 (Print)
IS  - 1386-5056 (Linking)
VI  - 83
IP  - 10
DP  - 2014 Oct
TI  - De-identification of clinical narratives through writing complexity measures.
PG  - 750-67
LID - S1386-5056(14)00137-3 [pii]
LID - 10.1016/j.ijmedinf.2014.07.002 [doi]
AB  - PURPOSE: Electronic health records contain a substantial quantity of clinical 
      narrative, which is increasingly reused for research purposes. To share data on a 
      large scale and respect privacy, it is critical to remove patient identifiers. 
      De-identification tools based on machine learning have been proposed; however, 
      model training is usually based on either a random group of documents or a 
      pre-existing document type designation (e.g., discharge summary). This work 
      investigates whether inherent features, such as writing complexity, can identify 
      document subsets to enhance de-identification performance. METHODS: We applied an 
      unsupervised clustering method to group two corpora based on writing complexity 
      measures: a collection of over 4500 documents of varying document types (e.g., 
      discharge summaries, history and physical reports, and radiology reports) from 
      Vanderbilt University Medical Center (VUMC) and the publicly available i2b2 
      corpus of 889 discharge summaries. We compared the performance (via recall, 
      precision, and F-measure) of de-identification models trained on such clusters 
      with models trained on documents grouped randomly or by VUMC document type. RESULTS: 
      For the Vanderbilt dataset, it was observed that training and testing 
      de-identification models on the same stylometric cluster (with an average 
      F-measure of 0.917) tended to outperform models based on clusters of random 
      documents (with an average F-measure of 0.881). It was further observed that 
      increasing the size of a training subset sampled from a specific cluster could 
      yield improved results (e.g., for subsets from a certain stylometric cluster, the 
      F-measure rose from 0.743 to 0.841 when training size increased from 10 to 50 
      documents, and the F-measure reached 0.901 when the size of the training subset 
      reached 200 documents). For the i2b2 dataset, training and testing on the same 
      clusters based on complexity measures (average F-score 0.966) did not 
      significantly surpass randomly selected clusters (average F-score 0.965). 
      CONCLUSIONS: Our findings illustrate that, in environments consisting of a 
      variety of clinical documentation, de-identification models trained on document 
      clusters derived from writing complexity measures tend to outperform models 
      trained on random groups and, in many instances, on document types.
CI  - Copyright (c) 2014 Elsevier Ireland Ltd. All rights reserved.
FAU - Li, Muqun
AU  - Li M
AD  - Department of Electrical Engineering & Computer Science, Vanderbilt University, 
      Nashville, TN, United States. Electronic address: muqun.li@vanderbilt.edu.
FAU - Carrell, David
AU  - Carrell D
AD  - Group Health Research Institute, Seattle, WA, United States.
FAU - Aberdeen, John
AU  - Aberdeen J
AD  - The MITRE Corporation, Bedford, MA, United States.
FAU - Hirschman, Lynette
AU  - Hirschman L
AD  - The MITRE Corporation, Bedford, MA, United States.
FAU - Malin, Bradley A
AU  - Malin BA
AD  - Department of Electrical Engineering & Computer Science, Vanderbilt University, 
      Nashville, TN, United States; Department of Biomedical Informatics, Vanderbilt 
      University, Nashville, TN, United States.
LA  - eng
GR  - R13 LM011411/LM/NLM NIH HHS/United States
GR  - R01 LM011366/LM/NLM NIH HHS/United States
GR  - R01 LM009989/LM/NLM NIH HHS/United States
GR  - U01 HG006378/HG/NHGRI NIH HHS/United States
GR  - R01LM011366/LM/NLM NIH HHS/United States
GR  - R01LM009989/LM/NLM NIH HHS/United States
GR  - U01 HG006385/HG/NHGRI NIH HHS/United States
GR  - U01HG006385/HG/NHGRI NIH HHS/United States
GR  - U01HG006378/HG/NHGRI NIH HHS/United States
PT  - Journal Article
PT  - Research Support, N.I.H., Extramural
PT  - Research Support, U.S. Gov't, Non-P.H.S.
DEP - 20140724
PL  - Ireland
TA  - Int J Med Inform
JT  - International journal of medical informatics
JID - 9711057
SB  - IM
MH  - Cluster Analysis
MH  - *Electronic Health Records
MH  - *Narration
MH  - *Writing
PMC - PMC4215974
MID - NIHMS616466
OTO - NOTNLM
OT  - Electronic medical records
OT  - Natural language processing
OT  - Privacy
COIS- Competing interest: No conflict of interest exists in this paper.
EDAT- 2014/08/12 06:00
MHDA- 2015/05/13 06:00
PMCR- 2015/10/01
CRDT- 2014/08/10 06:00
PHST- 2014/02/19 00:00 [received]
PHST- 2014/05/27 00:00 [revised]
PHST- 2014/07/16 00:00 [accepted]
PHST- 2014/08/10 06:00 [entrez]
PHST- 2014/08/12 06:00 [pubmed]
PHST- 2015/05/13 06:00 [medline]
PHST- 2015/10/01 00:00 [pmc-release]
AID - S1386-5056(14)00137-3 [pii]
AID - 10.1016/j.ijmedinf.2014.07.002 [doi]
PST - ppublish
SO  - Int J Med Inform. 2014 Oct;83(10):750-67. doi: 10.1016/j.ijmedinf.2014.07.002. 
      Epub 2014 Jul 24.