Notes from the book “Foundations of Statistical Natural Language Processing” by Manning and Schütze

I thought I would share my notes from this classic NLP book. I really enjoy the examples, quotes and narration used in it. It takes you through the absolute basics of probability and linguistics before moving on to complex language modelling.

Preliminaries

  1. Questions relevant to Linguistics
    • What kind of things do people say?
    • What do these things say/ask/request about the world?
  2. Lexical resources
    • Brown Corpus (American English)
    • Lancaster-Oslo-Bergen (LOB) Corpus (British English)
    • Susanne Corpus (130,000-word subset of the Brown Corpus)
    • Penn Treebank (Wall Street Journal articles)
    • Canadian Hansards (Canadian Parliament Proceedings - Bilingual Corpus)
    • WordNet (dictionary, hierarchy of synsets of words, meronymy: part-whole relations)
  3. Zipf's law (principle of least effort)
    • f \cdot r = k {f: frequency, r: rank (position in the frequency-sorted list), k: constant}
    • Number of meanings of a word: m \propto \sqrt{f}
  4. Collocation
    • Phrasal verbs, compound nouns, idioms
    • Frequent bigrams filtered by a particular part-of-speech (POS) pattern (still noisy, e.g. “next year”)
  5. Concordance
    • KWIC - Keyword in Context
    • Verb frames
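
To make the Zipf's law item concrete, here is a minimal sketch that ranks words by frequency and prints the product f \cdot r for each. The function name rank_frequency and the toy sentence are made up for illustration; on such a tiny text the product will not look constant, so treat this only as the shape of the computation you would run on a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy text, invented for illustration; a real check needs a large corpus.
text = "the cat sat on the mat and the dog sat on the log"
for row in rank_frequency(text.split()):
    print(row)
```

If Zipf's law holds for a corpus, the last column should stay roughly flat across ranks, apart from the very top and the long tail.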
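
The collocation item (frequent bigrams plus a POS pattern) can be sketched as below. Everything here is an assumption for illustration: the tokens are hand-tagged with a made-up tag set, the allowed patterns are adjective-noun and noun-noun, and candidate_collocations is a hypothetical helper, not an API from any library.

```python
from collections import Counter

# Hypothetical hand-tagged tokens (word, tag); a real pipeline would use a POS tagger.
tagged = [
    ("next", "A"), ("year", "N"), ("the", "D"), ("stock", "N"), ("market", "N"),
    ("fell", "V"), ("and", "C"), ("the", "D"), ("stock", "N"), ("market", "N"),
    ("rose", "V"), ("next", "A"), ("year", "N"),
]

# Assumed tag patterns: adjective-noun and noun-noun.
PATTERNS = {("A", "N"), ("N", "N")}

def candidate_collocations(tagged_tokens, patterns=PATTERNS, min_count=2):
    """Count adjacent word pairs whose tags match a pattern; keep the frequent ones."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return [(bigram, n) for bigram, n in counts.most_common() if n >= min_count]

print(candidate_collocations(tagged))
```

Both “stock market” and “next year” pass the filter in this toy data, which is exactly the kind of noise mentioned above: the POS pattern removes pairs like “the stock” but cannot tell a true collocation from a merely frequent phrase.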
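
A KWIC concordance is easy to sketch as well. The helper kwic and the example sentence are invented for illustration; the idea is simply to line up every occurrence of a keyword with a few tokens of context on each side, which makes the different frames a verb appears in easy to see.

```python
def kwic(tokens, keyword, width=3):
    """Return one line per occurrence of keyword, with `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{token}] {right}")
    return lines

# Toy sentence, made up for illustration.
tokens = "she showed him the door and he showed off his new coat".split()
for line in kwic(tokens, "showed", width=2):
    print(line)
```

Reading down the keyword column, the two uses of “showed” (with an indirect object, and with the particle “off”) appear side by side, which is the starting point for listing verb frames.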