The paper "Emotion Intensities in Tweets", by Saif M. Mohammad and Felipe Bravo-Marquez, describes how the authors created their dataset and built their system in Weka. Two key things came out of reading this paper:
- I started thinking about creating my own dataset
- I now know which filters they used to build their model
It would be interesting to gather data the same way they did. Even though my project is not as large a research effort, the techniques involved (best-worst scaling, etc.) would be worth learning for future projects.
More importantly, now that I know the filters they used for their model, I can recreate it in Weka and have a baseline system to test against when building my own regression model and deep learning model.
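To make best-worst scaling a bit more concrete, here is a minimal sketch of the counting step as I understand it: annotators see a small set of items, pick the one with the most and the one with the least of some property, and each item's score is the fraction of times it was picked as best minus the fraction of times it was picked as worst. The function name, the input format, and the example labels are my own assumptions for illustration, not something taken from the paper's code.

```python
from collections import Counter


def best_worst_scores(annotations):
    """Score items from best-worst annotations.

    `annotations` is an iterable of (items, best, worst) tuples, where
    `items` is the set shown to the annotator (hypothetical format).
    Returns a dict mapping each item to a score between -1 and 1.
    """
    appearances = Counter()
    best_counts = Counter()
    worst_counts = Counter()

    for items, best, worst in annotations:
        appearances.update(items)   # how often each item was shown
        best_counts[best] += 1      # how often it was picked as "most"
        worst_counts[worst] += 1    # how often it was picked as "least"

    return {
        item: (best_counts[item] - worst_counts[item]) / appearances[item]
        for item in appearances
    }


# Made-up annotations over made-up tweet IDs, just to exercise the function.
example = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t5"), "t1", "t3"),
]
scores = best_worst_scores(example)
print(scores)
```

Scores computed this way fall between -1 and 1, so they would need to be rescaled if a 0 to 1 intensity range is wanted.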
NLTK is a really great library for doing natural language processing. Instead of working on my model, I decided to take the time and learn about NLTK, so I can use it (and perhaps use it with GraphLab Create later).
NLTK comes with a variety of tools, such as tokenization and part-of-speech (POS) tagging. Tokenization can be done at the sentence level, which splits a paragraph into individual sentences, or at the word level, which splits a sentence (or longer text) into words that can then be analyzed.
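Here is a small sketch of both kinds of tokenization. The usual entry points are `sent_tokenize` and `word_tokenize`, but those need the Punkt data to be downloaded first, so this sketch assumes two tokenizers that work without any downloads: an untrained `PunktSentenceTokenizer` for sentences and `TweetTokenizer` for words (which seems like a reasonable fit, since my data is tweets). The helper function names are my own.

```python
from nltk.tokenize import TweetTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer


def split_sentences(text):
    # Untrained Punkt tokenizer: splits on sentence-ending punctuation,
    # but has not learned any abbreviations, so "Dr." may be split wrongly.
    return PunktSentenceTokenizer().tokenize(text)


def split_words(sentence):
    # TweetTokenizer keeps hashtags, mentions, and emoticons as single tokens.
    return TweetTokenizer().tokenize(sentence)


paragraph = "I am so happy today. The weather is great!"
for sentence in split_sentences(paragraph):
    print(split_words(sentence))
```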
There are a couple of reasons for me to use NLTK. First, it will make a lot of preprocessing easy for me, such as filtering out stop words (common words that carry little meaning for the analysis, such as “as”, “what”, and “that”).
from nltk.corpus import stopwords
With this import, you can build a set of all the English stop words (at least the ones NLTK recognizes). From there, you can go through a sentence word by word and drop any word that is found in the stop word set.
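The filtering step could look something like the sketch below. The function names are my own, and `load_english_stop_words` assumes the stop word corpus has already been downloaded (for example with `nltk.download("stopwords")`). The filter itself takes any set of words, so it can be tried with a small hand-made set too.

```python
from nltk.corpus import stopwords


def load_english_stop_words():
    # Assumes the "stopwords" corpus is already downloaded.
    return set(stopwords.words("english"))


def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so "The" is treated the same as "the".
    return [token for token in tokens if token.lower() not in stop_words]


# Tiny hand-made stop word set, just for illustration.
sample_stop_words = {"this", "is", "a", "the"}
tokens = ["This", "is", "a", "happy", "tweet"]
print(remove_stop_words(tokens, sample_stop_words))
```

To use NLTK's full list instead, pass `load_english_stop_words()` as the second argument.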
Additionally, NLTK has methods for building a parse tree, which shows how the words and phrases in a sentence relate to one another. This is important for establishing the relationships between objects in a sentence, so that you (the computer) can figure out what the sentence is about.
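As a rough illustration of a parse tree, the sketch below uses a tiny hand-written grammar with NLTK's `CFG` and `ChartParser`. The grammar and the sentence are toy examples I made up. A real system would need a much larger grammar or a trained parser, so this only shows the shape of the output.

```python
import nltk

# Toy grammar (made up for this example): a sentence is a noun phrase
# followed by a verb phrase.
GRAMMAR = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")


def parse_sentence(tokens, grammar=GRAMMAR):
    # Return every parse tree the grammar allows for the token list.
    parser = nltk.ChartParser(grammar)
    return list(parser.parse(tokens))


trees = parse_sentence("the dog chased the cat".split())
for tree in trees:
    print(tree)
```

In the printed tree, "the dog" is grouped as the subject noun phrase and "the cat" sits inside the verb phrase as the object of "chased", which is the kind of relationship I want to extract.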
I haven’t gotten through the entire “course” of what can be done with NLTK, but once I have, I can begin trying to use GraphLab Create’s linear regression model alongside NLTK to create my system. Until then, I plan to continue researching other libraries I could use for my system, and reading the paper that Dr. Carpuat wants me to read (link in Resources).