Extracting medical entities from social media

Scepanovic, Sanja; Martin-Lopez, Enrique; Quercia, Daniele; Baykaner, Khan

doi:10.1145/3368555.3384467

Accurately extracting medical entities from social media is challenging because people use informal language, express the same concept in different ways, and make spelling mistakes. Previous work either focused on specific diseases (e.g., depression) or drugs (e.g., opioids) or, when working with a wide set of medical entities, tackled only individual, small-scale benchmark datasets (e.g., AskaPatient). In this work, we first demonstrated how to accurately extract a wide variety of medical entities, such as symptoms, diseases, and drug names, on three benchmark datasets from varied social media sources, and then validated this approach on a large-scale Reddit dataset. First, we implemented a deep-learning method using contextual embeddings that outperformed existing state-of-the-art methods on two existing benchmark datasets: one containing annotated AskaPatient posts (CADEC) and the other containing annotated tweets (Micromed). Second, we created an additional benchmark dataset by annotating medical entities in 2K Reddit posts (made publicly available under the name MedRed) and showed that our method also performs well on this new dataset. Finally, to demonstrate that our method accurately extracts a wide variety of medical entities on a large scale, we applied the model pre-trained on MedRed to half a million Reddit posts. The posts came from disease-specific subreddits, so we could categorise them into 18 diseases based on the subreddit. We then trained a machine-learning classifier to predict each post's category solely from the extracted medical entities. The average F1 score across categories was 0.87. These results open up new cost-effective opportunities for modeling, tracking, and even predicting health behavior at scale.
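The "average F1 score across categories" reported above can be read as a macro-averaged F1: compute F1 separately for each disease category, then take the unweighted mean. The sketch below illustrates that calculation under this assumption; the abstract does not state the averaging scheme or the classifier used, and the function names and toy labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical toy labels standing in for subreddit-derived disease categories.
y_true = ["asthma", "asthma", "diabetes", "diabetes"]
y_pred = ["asthma", "diabetes", "diabetes", "diabetes"]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro averaging gives every category equal weight regardless of how many posts it contains, which matters when subreddit sizes are uneven.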

Extracting medical entities from social media / Scepanovic, Sanja; Martin-Lopez, Enrique; Quercia, Daniele; Baykaner, Khan. - (2020), pp. 170-181. (ACM CHIL '20: ACM Conference on Health, Inference, and Learning, Toronto, Ontario (CAN), April 2-4, 2020) [10.1145/3368555.3384467].


Year: 2020
ISBN: 978-1-4503-7046-2
Type: 4.1 Contribution in conference proceedings

Files in this record:

File	Size	Format
extracting_20-2.pdf (open access; type: 2a Post-print, publisher's version / Version of Record; license: Public - All rights reserved)	1.16 MB	Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2996141
