Adam Sorrenti
Sagar Punn
Eisa Keramatinejad
See reference [7] for the orginals dataset. Then,
- Download the the categories as defined in Creating_Main_Dataset.ipynb
- Run Creating_Main_Dataset.ipynb
- Run each desired dataset variaton on the generated data with the 'others' files as described below.
- Run visuals__plots.ipynb to generate PCA and word cloud.
- Run OneFiveStar.ipynb and OneTwoFourFive.ipynb to generate post-processing results.
**Please note results may vary as a result of the random sampling done in Creating_Main_Dataset.ipynb.
Creating_Main_Dataset.ipynb - Randomly samples orginial dataset
OneFiveStar.ipynb - Post-processing on 1 and 5 star reviews only
OneTwoFourFive.ipynb - Post-processing ommiting 3 star reviews
getAllData.py - Feature selection implementations and helper funcitons
negative-word.txt - Negative word lexicon
positive-word.txt - Positive word lexicon
Sentimental_Analysis_Report.pdf - Final IEEE conference paper
visuals__plots.ipynb - PCA and word cloud visualizations
others - All other files contain specific dataset variations and feature selections techniques that can be run individually.
We classified Amazon reviews with a positive or negative sentiment, exclusively. For example, given the following review: ’Super comfortable and extremely lightweight. Great for crossfit!’ Using machine learning and natural language processing is a must to identify whether the review implies a positive or negative sentiment.
Following are the classification models that were used
in this report:
- Logistic Regression: Baseline version with max iterations set in range 1000-10000 in order for model to converge.
- Multinomial Naive Bayes: Baseline version used.
- Support Vector Machine: Baseline version used with max iterations set in the range of 1000-20000 and, if failed to converge, the dual formulation parameter was set to false.
- K Nearest Neighbours: Three variations of this model were used where the number of neighbours was set to 1, 3, and 5.
- Decision Trees: First baseline version is employed after which criterion of split is set to entropy with depth of tree being 3.
- Regular: This is untouched review without any filtering applied.
- Stemmed: In this case, Porter Stemmer is used to stemming the original review text where stemmer removes morphological affixes from words, leaving only the word stem.
- Filtered: Each review is filtered to only contain positive and negative words using an opinion lexicon list [3]. Before employing the filtering process, reviews are passed through the Mark Negation method [4] which appends NEG on words between negation and punctuation mark. Furthermore, all words with NEG are considered as one single word in order to reduce noise. This process significantly reduced the number of features and BOW and TFIDF are applied at the end.
- Filtered Stemmed: Here, reviews are filtered (as explained above) first and then stemmed. Finally, BOW and TFIDF are applied.
[1] Amueller, “amueller/wordcloud,” GitHub. [Online]. Available:
https://github.com/amueller/wordcloud.
[2] A. Sorrenti, E. Keramatinejad, and S. Punn, “Sentimental Analysis of Amazon Reviews,” 2020. [Online]. Available: https:
//github.com/mbrotos/ML-Project.
[3] Bing Liu. ”Opinion Mining.” Invited contribution to Encyclopedia of Database Systems, 2008.
[4] Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language
toolkit. O’Reilly Media, Inc.
[5] C. R. Harris, K. J. Millman, S. J. V. D. Walt, et al. “Array
programming with NumPy,” Nature, vol. 585, no. 7825, pp.
357–362, 2020.
[6] Fabian Pedregosa, et al. ”Scikit-learn: Machine Learning in
Python”. Journal of Machine Learning Research 12. 85(2011):
2825-2830.
[7] Jianmo Ni, Jiacheng Li, Julian McAuley. ”Justifying recommendations using distantly-labeled reviews and fined-grained
aspects.” Empirical Methods in Natural Language Processing
(EMNLP), 2019.
[8] J. Reback, W. McKinney, Jbrockmendel, et al. “pandasdev/pandas: Pandas 1.1.4,” Zenodo, 30-Oct-2020. [Online].
Available: https://doi.org/10.5281/zenodo.3509134.
[9] J. D. Hunter, ”Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
[10] Kluyver, T. et al., 2016. Jupyter Notebooks – a publishing
format for reproducible computational workflows. In F. Loizides
& B. Schmidt, eds. Positioning and Power in Academic Publishing: Players, Agents and Agendas. pp. 87–90.
[11] Rathor, A. S., Agarwal, A., & Dimri, P. ”Comparative Study of
Machine Learning Approaches for Amazon Reviews.” 2018.