Publication Date:
2023
Abstract:
Facing the COVID-19 pandemic has required a collective global effort. The need to exchange clinical information rapidly in order to advance medical research has highlighted the importance of clinical de-identification techniques, which make protected health information in electronic health records shareable and publishable while fully complying with privacy regulations. This study provides a comparative analysis of the performance of language models on the Italian language, using the SIRM COVID-19 de-identification corpus, a data set built in accordance with the US HIPAA regulations and labeled consistently with the 2014 English i2b2 data set, for the purpose of applying named entity recognition to the de-identification of electronic medical records. The language models compared have achieved state-of-the-art results in the literature and are based on deep neural architectures: bidirectional long short-term memory with a conditional random field (BiLSTM-CRF) in combination with different word representations, i.e., embeddings capable of capturing morphosyntactic variations and polysemy of words, and bidirectional encoder representations from transformers (BERT). In addition, the study tests the effectiveness of different transfer learning mechanisms for improving performance in Italian using English-language data. The analysis highlights the advantages and disadvantages of the different approaches, shows how to push performance in low-resource scenarios such as Italian, and identifies key design points to consider in the analyzed scenarios.
Iris type:
02.01 Contribution in volume (Chapter or Essay)
Keywords:
Clinical de-identification; named entity recognition; deep learning; sub-document level analysis; COVID-19 annotated Italian data set
List of contributors:
Esposito, Massimo
Book title:
Artificial Intelligence in Healthcare and COVID-19