
Word Embeddings

A Journey to Low-Dimensional Word Vector Space

Alexander.OConnor@dcu.ie

@uberalex

You can see my notes and details here. Please do get in touch if you have any comments or questions.

The quick, brown fox jumped over the lazy dog.

Before we start, some very simple basics about language as data. Normally when we think about language data, it's a corpus of documents. Each document has paragraphs, which are groups of sentences. The sentences themselves are made up of words and the punctuation between them.
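As a minimal sketch of that structure (the two-document corpus and the crude sentence/token splitting below are purely illustrative assumptions), you can break text down into documents, sentences, and word-plus-punctuation tokens in a few lines of Python:

```python
import re

# A toy "corpus" of two short documents (purely illustrative).
corpus = [
    "The quick, brown fox jumped over the lazy dog. The dog slept.",
    "Banks can create new money when they make a loan.",
]

for document in corpus:
    # Crude sentence split on full stops; real corpora need a proper segmenter.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    for sentence in sentences:
        # Keep words and punctuation as separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        print(tokens)
```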

Word frequency data follows the curve of a broken power law: the vast majority of word occurrences come from a very small group of words. For a variety of reasons, we use the same words repeatedly ('I', 'the', 'is', etc.). https://en.wikipedia.org/wiki/Zipf%27s_law
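If you want to see that distribution on your own data, counting tokens and ranking them is enough; in the sketch below, 'corpus.txt' is a placeholder for any plain-text file:

```python
from collections import Counter

# 'corpus.txt' is a placeholder path to any plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

counts = Counter(tokens)

# Zipf's law predicts frequency falls off roughly as 1/rank:
# a handful of words account for most of the occurrences.
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(rank, word, freq)
```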

A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias (dumps from October 2015) in a log-log scale.

The Distributional Hypothesis

"a word is characterized by the company it keeps"

  1. Banks can create new money when they make a loan.

  2. The boat struck the bank full tilt.

We can only learn the meaning of words from their association with other words.

Firth, J.R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. Oxford: Philological Society. Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952-1959, London: Longman (1968).
Harris, Z. (1954). Distributional structure. Word, 10(23): 146-162.

A Word Embedding is a parameterised function that maps words from a vocabulary to lower-dimensional vectors of real-valued weights.
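Concretely, the parameterised function is usually just a lookup into a learned weight matrix: one row per vocabulary word, one column per embedding dimension. A minimal numpy sketch, where the tiny vocabulary, the dimensionality and the random weights are all placeholder assumptions (a trained model would have learned these weights):

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 4  # real models typically use 100-300 dimensions
rng = np.random.default_rng(0)

# The embedding matrix is the parameter set: |V| rows, embedding_dim columns.
# In a trained model these weights are learned; here they are random placeholders.
E = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    """Map a word to its low-dimensional real-valued vector."""
    return E[word_to_index[word]]

print(embed("fox"))
```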

Image Source: https://www.tensorflow.org/versions/0.6.0/tutorials/word2vec/index.html

So we can begin to see that the distributional hypothesis tells us something more than just how words are used. It begins to tell us how underlying relationships between concepts might exist.

Word2Vec (Mikolov et al.)

The quick brown fox jumped (Vin) over (Vout) the lazy dog.



Maximise P(Vout|Vin) by learning the softmax probability via gradient descent.
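As a rough sketch of that objective (not Mikolov et al.'s implementation), the skip-gram softmax is just a dot product between an input (centre-word) vector and an output (context-word) vector, normalised over the vocabulary; the toy vocabulary and random weights below are placeholders:

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 4
rng = np.random.default_rng(0)

# Skip-gram keeps two parameter matrices: input (centre-word) vectors
# and output (context-word) vectors. Random placeholders here.
V_in = rng.normal(size=(len(vocab), embedding_dim))
V_out = rng.normal(size=(len(vocab), embedding_dim))

def p_context_given_centre(context, centre):
    """Softmax P(Vout | Vin): exp(v_out . v_in) normalised over the vocabulary."""
    v_in = V_in[word_to_index[centre]]
    scores = V_out @ v_in                  # dot product with every output vector
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs[word_to_index[context]]

# Training adjusts V_in and V_out by gradient descent so that observed
# (centre, context) pairs from the corpus get high probability.
print(p_context_given_centre("over", "jumped"))
```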

Operations on the embedding vector space

King + Woman - Man = Queen

Obama + Russia - USA = Putin

Breakfast, Cereal, Lunch, Dinner

Bad is to Worse as Good is to Better

Use it to expand queries, find synonyms, disambiguate terms, and group items.
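With a trained model, those operations are one-liners in Gensim. The sketch below trains on a tiny placeholder corpus purely to show the API; meaningful analogies and synonyms only emerge from models trained on large corpora:

```python
from gensim.models import Word2Vec  # Gensim 4.x API

# Placeholder corpus: a real model needs millions of sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "king", "and", "the", "queen", "ruled", "the", "land"],
    ["the", "man", "and", "the", "woman", "ate", "breakfast", "cereal", "lunch", "and", "dinner"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# King + Woman - Man ~= Queen (only meaningful on a large training corpus)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Related-term lookup for query expansion / synonym finding
print(model.wv.most_similar("dog", topn=3))

# Odd-one-out grouping
print(model.wv.doesnt_match(["breakfast", "cereal", "lunch", "dinner"]))
```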

Doesn't always work predictably, and sometimes produces complete nonsense. It's also often not bi-directional.
  1. Learn monolingual word embeddings for each language from lots of text

  2. Use a small bilingual dictionary to learn a linear projection from the source space to the target space (see the sketch after this list)

  3. Project unknown words from the source embedding into the target space
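A minimal sketch of steps 2 and 3, where the toy source/target vectors and the tiny dictionary are placeholders (in practice they would come from two trained monolingual models and a real bilingual lexicon); the projection is fit by ordinary least squares:

```python
import numpy as np

# Placeholder monolingual embeddings; in practice these come from two trained models.
rng = np.random.default_rng(0)
dim = 50
source_vectors = {w: rng.normal(size=dim) for w in ["hund", "katze", "haus", "wasser"]}
target_vectors = {w: rng.normal(size=dim) for w in ["dog", "cat", "house", "water"]}

# Small bilingual dictionary of (source, target) training pairs.
dictionary = [("hund", "dog"), ("katze", "cat"), ("haus", "house")]

X = np.vstack([source_vectors[s] for s, _ in dictionary])  # source side
Y = np.vstack([target_vectors[t] for _, t in dictionary])  # target side

# Learn the linear projection W that minimises ||X W - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Project an unseen source word into the target space and find its nearest neighbour.
projected = source_vectors["wasser"] @ W

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(max(target_vectors, key=lambda w: cosine(projected, target_vectors[w])))
```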

Over to You

Need lots of pre-processed text, so pre-trained models can be a good place to start.

Gensim (Python), GloVe, WEM (R)

'Making Sense of Word2Vec'; Levy & Goldberg, 'Linguistic Regularities in Sparse and Explicit Word Representations'.

A lot of people use the pre-trained Google News vectors.
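For example, loading those pre-trained vectors with Gensim takes a couple of lines; the file path below is a placeholder for the locally downloaded GoogleNews-vectors-negative300 binary (several gigabytes):

```python
from gensim.models import KeyedVectors

# Placeholder path: the binary file has to be downloaded separately.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```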
