Kaggle competition with the goal of creating the best NLP model to classify COVID-19 related tweets by country. Placed 10th out of 57 teams.

thomasdurkin/Capstone-Kaggle-Competition

COVID-19 Tweets Country of Origin Classification Kaggle Competition

Dataset

This dataset consists of COVID-19 related tweets posted by users from six English-speaking countries: Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States. Six columns were provided, but only the tweet text and country were used to train the models.

The dataset was extended to improve results by replacing emojis with their corresponding words and expanding shortened URLs to their original links so that relevant words could be extracted from them.

text                                                 country
Remember the #WuhanCoronaVirus? The pandemic w...    us
While we hit 150,000 in #COVID19 deaths, the P...    new_zealand
🇺🇸 Pandémie de #coronavirus: 30 pasteurs améri...    us
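
The two preprocessing steps described above (emoji replacement and URL expansion) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from the notebook. The function names `replace_emojis`, `expand_url`, and `url_words` are hypothetical, Unicode character names stand in for whatever emoji-to-word mapping was actually used, and following HTTP redirects is only one assumed way of expanding a shortened link.

```python
import re
import unicodedata
import urllib.request
from urllib.parse import urlparse


def replace_emojis(text):
    """Replace symbol characters (Unicode category 'So') with their names.

    Assumption: the Unicode name is an acceptable stand-in for the emoji's
    "word". Category 'So' also covers non-emoji symbols such as the
    copyright sign, so this is deliberately approximate.
    """
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            pieces.append(f" {name.lower()} " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around the inserted names.
    return " ".join("".join(pieces).split())


def expand_url(short_url, timeout=5.0):
    """Follow redirects and return the final URL (hypothetical helper).

    Falls back to the input if the request fails for any reason.
    """
    try:
        request = urllib.request.Request(short_url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError):
        return short_url


def url_words(url):
    """Extract lowercase alphabetic tokens from the path of a URL."""
    path = urlparse(url).path.lower()
    return re.findall(r"[a-z]+", path)
```

For example, `replace_emojis("stay home 😷")` gives `stay home face with medical mask`, and `url_words("https://www.example.com/news/covid-vaccine-update")` gives `['news', 'covid', 'vaccine', 'update']`. Flag emojis are made of regional indicator symbols, so they come out as letter names rather than country names; a dedicated emoji mapping would handle them better.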

Analysis

The notebook contains an extensive descriptive analysis that applies several NLP techniques in an attempt to extract useful information from the dataset.

  • Finding top ten hashtags
  • Calculating statistics based on words, characters, and hashtags within the tweets
  • Using LDA (Latent Dirichlet Allocation) to extract topics from the dataset
  • Performing Non-negative Matrix Factorization for topic analysis
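
The first two items in the list above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the notebook's code: the helper names `top_hashtags` and `tweet_stats`, the hashtag pattern `#\w+`, and the choice of mean values as the reported statistics are all assumptions.

```python
import re
from collections import Counter
from statistics import mean

# Assumed definition of a hashtag: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags, counted case-insensitively."""
    counts = Counter(
        tag.lower() for tweet in tweets for tag in HASHTAG_RE.findall(tweet)
    )
    return counts.most_common(n)


def tweet_stats(tweets):
    """Return mean word, character, and hashtag counts per tweet."""
    return {
        "mean_words": mean(len(tweet.split()) for tweet in tweets),
        "mean_chars": mean(len(tweet) for tweet in tweets),
        "mean_hashtags": mean(len(HASHTAG_RE.findall(tweet)) for tweet in tweets),
    }


# Made-up sample tweets, for illustration only.
sample = [
    "Stay safe #COVID19 #lockdown",
    "New cases today #covid19",
    "No tags here",
]
print(top_hashtags(sample, n=2))
print(tweet_stats(sample))
```

Grouping the same statistics by country would be a natural next step, since differences between countries are what a classifier can exploit.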

Modeling

Model                      Accuracy
Ensemble Model             51.3%
Logistic Regression        45.2%
Linear SVC                 48.7%
Multinomial Naive Bayes    49.4%

The ensemble model combined the results of a CNN built with Keras and a Multinomial Naive Bayes model built with scikit-learn.
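
The README does not say how the two models' outputs were combined. One common approach is soft voting, which averages the class probabilities from each model and picks the highest-scoring class. The sketch below assumes that approach. The `soft_vote` function, the `weight_a` parameter, the label order in `COUNTRIES`, and the probability values are all hypothetical; only the `us` and `new_zealand` labels appear in the dataset sample.

```python
# Assumed label order; only "us" and "new_zealand" are confirmed by the sample.
COUNTRIES = ["australia", "canada", "ireland", "new_zealand", "uk", "us"]


def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Blend two models' per-class probabilities and return predicted labels.

    probs_a and probs_b are lists of rows, one row per tweet, each row
    holding one probability per country in COUNTRIES order.
    """
    labels = []
    for row_a, row_b in zip(probs_a, probs_b):
        blended = [
            weight_a * a + (1.0 - weight_a) * b for a, b in zip(row_a, row_b)
        ]
        # Index of the largest blended probability.
        best = max(range(len(blended)), key=blended.__getitem__)
        labels.append(COUNTRIES[best])
    return labels


# Made-up probabilities standing in for CNN and Naive Bayes outputs.
cnn_probs = [[0.1, 0.1, 0.1, 0.1, 0.1, 0.5], [0.3, 0.2, 0.1, 0.2, 0.1, 0.1]]
nb_probs = [[0.1, 0.1, 0.05, 0.3, 0.05, 0.4], [0.1, 0.1, 0.1, 0.5, 0.1, 0.1]]
print(soft_vote(cnn_probs, nb_probs))
```

In the second made-up row the models disagree: the CNN prefers `australia` and Naive Bayes prefers `new_zealand`. Equal weighting picks `new_zealand` because Naive Bayes is more confident. Tuning `weight_a` on a validation set would be one way to favor the stronger model.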
