Skip to content

A comparative study of deep learning models to correctly identify the cancer a patient has, as a means to creating a more streamlined process when making a post on the Cancer Survivors Network website.,

Notifications You must be signed in to change notification settings

thomasdurkin/Classification-of-Cancer-Discussion-Posts

Repository files navigation

Abstract

In this study, we aimed to construct a deep learning classification model to recommend which discussion board a post should go to after a user has written it on the Cancer Survivors Network, which is a cancer-related public discussion forum. Additionally, we explored multiple types of models and compared their performance on this natural language processing task. We concluded that a stacked model, which was a combination of the Bidirectional LSTM and the transformer encoder outputs, provided the best results with an accuracy of 70.7%.

Data

All data was pulled from the Cancer Survivors Network using BeautifulSoup which resulted in a total of 27 classes. Classes with less than 1000 posts were dropped from the data set as the final accuracy was greatly affected by the classes with small amounts of data. Data was padded to a max sequence length of 75.

Modeling

Models were created using PyTorch. Each model was trained using an 80/20 train-test split with 8 epochs and a learning rate of 0.001

model accuracy
Decision Tree (Baseline) 56.0%
CNN 63.1%
RNN 39.5%
Bi-LSTM 68.7%
Transformer 67.9%
Stacked Model 70.7%

Results

Below is the confusion matrix for the best model.

About

A comparative study of deep learning models to correctly identify the cancer a patient has, as a means to creating a more streamlined process when making a post on the Cancer Survivors Network website.,

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published