McNulty // w7d4

Winter 2015
02/26/2015

Planned schedule and activities

9:00 am: Good morning

9:15 am: NLP with Python

11:20 am: Challenges & Fletcher work

12:00 pm: Lunch

1:30 pm: Work like you've never worked before

6:00 pm: Midcourse Destress Party. Fancy stuff.

Lecture Notes

NLTK.ipynb (39.4 KB)

NLP Challenges

Challenge 1
Challenge 2

Reading

All-encompassing yet still short NLTK tutorial
The official NLTK book

TextBlob tutorial
(Awesome, easy to read, very short but to the point. Check this out.)
TextBlob full documentation
(Great documentation.)
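
If you want a taste before diving into the tutorial, here's a minimal sketch of the TextBlob basics. It assumes TextBlob is installed and its corpora have been downloaded (`python -m textblob.download_corpora`); the example sentence is just made up:

```python
from textblob import TextBlob

blob = TextBlob("TextBlob makes simple NLP tasks painless. I really like it!")

print(blob.sentences)     # sentence tokenization
print(blob.words)         # word tokenization
print(blob.tags)          # POS tags, e.g. ('makes', 'VBZ')
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
```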

List of part-of-speech tags
What does VBZ mean? etc.
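
To see tags like VBZ in action, you can tag a sentence yourself. This quick sketch assumes the relevant NLTK data packages (tokenizer, tagger, tagsets) have been downloaded via `nltk.download()`:

```python
import nltk

tokens = nltk.word_tokenize("She sells seashells by the seashore.")
print(nltk.pos_tag(tokens))
# e.g. [('She', 'PRP'), ('sells', 'VBZ'), ('seashells', 'NNS'), ...]

# Ask NLTK what a tag means:
nltk.help.upenn_tagset('VBZ')  # VBZ: verb, present tense, 3rd person singular
```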

Chunking tutorial
MIT slides on chunking
If you want to learn more about chunking and build your own chunking classifiers, these will help; there's a small regex-chunker sketch below.
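
A minimal noun-phrase chunker using NLTK's RegexpParser. The grammar here is just an illustrative assumption, not something taken from the slides:

```python
import nltk

# Chunking works on POS-tagged tokens, so tag a sentence first.
tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog."))

# NP = optional determiner, any adjectives, then one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
# Prints a tree with chunks like (NP The/DT quick/JJ brown/JJ fox/NN)
```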

Demo for different tokenizers
Demo for different stemmers
These let you get a feel for what the different classes do.
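
In the same spirit as those demos, here's a small sketch comparing a couple of NLTK tokenizers and three stemmers (example words chosen arbitrarily):

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

text = "Python's tokenizers aren't all identical."
print(word_tokenize(text))       # splits contractions: ['Python', "'s", ..., 'are', "n't", ...]
print(wordpunct_tokenize(text))  # splits on punctuation: ['Python', "'", 's', ...]

# The stemmers differ in how aggressively they strip suffixes.
words = ["running", "flies", "generously"]
for stemmer in (PorterStemmer(), SnowballStemmer("english"), LancasterStemmer()):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])
```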

tf-idf on Wikipedia
tf-idf tutorial with TextBlob
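
Along the lines of that tutorial, here is a plain-Python tf-idf sketch. The function definitions paraphrase the usual formulas rather than copying the post, and the three toy documents are made up:

```python
import math
from textblob import TextBlob

def tf(word, blob):
    # Term frequency: share of the document's words that are `word`.
    return blob.words.count(word) / len(blob.words)

def idf(word, bloblist):
    # Inverse document frequency: penalize words that appear in every document.
    n_containing = sum(1 for blob in bloblist if word in blob.words)
    return math.log(len(bloblist) / (1 + n_containing))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

docs = [TextBlob("the cat sat on the mat"),
        TextBlob("the dog sat on the log"),
        TextBlob("cats and dogs make good pets")]
print(tfidf("cat", docs[0], docs))
```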

Stanford slides on text classification with naive Bayes
Naive Bayes on Wikipedia
Naive Bayes Spam Filtering on Wikipedia
Empirically, naive Bayes works really well on text classification.
One example is spam filtering, a classic two-class problem (spam/not spam).
A bag-of-words model (treating each word as an independent feature), with tf-idf values as the feature weights fed into MultinomialNB, works wonders; see the sketch below.
That said, naive Bayes is not necessarily the ultimate text classifier. It works reasonably well when you don't have ginormous amounts of data, but with enough data, generally successful classifiers like SVMs can surpass it (though they may take longer to train). So don't be shy about trying other classifiers.
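
As a concrete sketch of that tf-idf + MultinomialNB combo, here's a toy scikit-learn pipeline. The four-document corpus is invented for illustration; a real spam filter needs a real labeled dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus.
texts = ["win cash now", "cheap pills online", "meeting at noon", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

# tf-idf features -> multinomial naive Bayes, exactly the combo described above.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free cash and pills", "see you at lunch"]))
```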