Machine learning in medicine: a practical introduction to natural language processing

Sidey-Gibbons, Chris J.

Posted on 2021-08-01 - 03:20

Abstract

Background: Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text data, using freely available software.

Methods: We performed three NLP experiments using publicly available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity.

Results: Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664, 0.776] for the SVM.

Conclusions: In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.
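
To make the first experiment more concrete, the following is a minimal sketch of lexicon-based sentiment analysis. It is not the authors' code: the tiny `LEXICON`, the function names, and the example reviews are all hypothetical and chosen only for illustration. A real analysis would use a published sentiment lexicon and the review data described above. The idea is to look up each word of a review in a lexicon of positive and negative terms, then report the proportion of matched words that are positive.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only):
# +1 marks a positive word, -1 a negative word.
LEXICON = {
    "good": 1, "great": 1, "effective": 1, "helped": 1,
    "bad": -1, "worse": -1, "nausea": -1, "pain": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_counts(review):
    """Count the positive and negative lexicon words in one review."""
    counts = Counter()
    for token in tokenize(review):
        score = LEXICON.get(token, 0)  # words not in the lexicon score 0
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
    return counts


def positive_proportion(reviews):
    """Proportion of matched sentiment words that are positive.

    Returns None when no word in any review matches the lexicon.
    """
    total = Counter()
    for review in reviews:
        total.update(sentiment_counts(review))
    matched = total["positive"] + total["negative"]
    return total["positive"] / matched if matched else None


if __name__ == "__main__":
    # Made-up example reviews, not taken from the study data.
    example_reviews = ["Great drug, it helped", "bad nausea"]
    print(positive_proportion(example_reviews))
```

Computing this proportion separately for the reviews of each drug gives a simple basis for comparing drugs, in the spirit of the comparison reported in the Results.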

Harrison, Conrad J.; Sidey-Gibbons, Chris J. (2021). Machine learning in medicine: a practical introduction to natural language processing. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.5537977.v1
