Natural Language Processing with Deep Learning

YouTube Video List


Differential, Gradient, Jacobian, Chain Rule

Review of differential calculus theory


Some word representations:

Idea is to predict surrounding words in a window of length m of every word.

Bag of words, Hierarchical softmax

Counte-based distributional models vs Neural network-based models

The objective funcion is to maximize the log probability of any context word given the current center word:

\[J(\theta)=\frac{1}{T}\sum\limits_{t=1}^T\sum\limits_{-m\leq j\leq m, j\neq 0} \log p(w_{t+j}|w_t)\]

where $\theta$ represents all variables and $T$ is the total number of words (not vocabulary).

The simplest formulation for $p(w_{t+j}|w_t)$ is (o for outside, c for center):


Here, every word has two vectors $p(w|\cdot), p(\cdot|w)$

To maximize $J(\theta)$, we calculate the gradient:

\[\frac{\partial}{\partial v_c} J(\theta) = \frac{1}{T}\sum\limits_{t=1}^T\sum\limits_{-m\leq j\leq m, j\neq 0} \frac{\partial}{\partial v_c}\log p(w_{t+j}|w_t)\] \[\begin{align*} \frac{\partial}{\partial v_c}\log p(w_o|w_c) &= \frac{\partial}{\partial v_c}\log \frac{\exp(u_o^Tv_c)}{\sum\limits_{w=1}^V\exp(u_w^Tv_c)}\\ &= \frac{\partial}{\partial v_c}u_o^Tv_c - \frac{\partial}{\partial v_c}\log\sum\limits_{w=1}^V\exp(u_w^Tv_c)\\ &= u_o - \frac{\sum\limits_{x=1}^V\exp(u_x^Tv_c)u_x}{\sum\limits_{w=1}^V\exp(u_w^Tv_c)}\\ &= u_o - \sum\limits_{x=1}^V\frac{\exp(u_x^Tv_c)}{\sum\limits_{w=1}^V\exp(u_w^Tv_c)}u_x\\ &= u_o - \sum\limits_{x=1}^Vp(x|c)u_x \end{align*}\]

We can use Stochastic Gradient Descent to accelerate the optimization : update parameters after each window t.

Skip-gram model

Word2Vec Tutorial - The Skip-Gram Model

Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.

Since it’s expensive to calculate the sum $\sum\limits_{w=1}^V\exp(u_w^Tv_c)$, we change the objective function as follows:

\[J(\theta) = \frac{1}{T}\sum\limits_{t=1}^T J_t(\theta)\] \[J_t(\theta)=\log \sigma(u_o^Tv_c) + \sum\limits_{j\sim P(w)}\log \sigma(-u_j^Tv_c )\]

where $\sigma(x)=\frac{1}{1+e^{-x}}$, $P(w)=\frac{U(w)^{3/4}}{Z}$, $U(w)$ is unigram distribution, the frequency of the word $w$ and the 3/4 power make less frequent words be sampled more often.

Negative sampling

We choose k negative samples (instead of choosing all words) and the objectif is to

Here k is 5-20 for small training sets and 2-5 for large datasets.

Subsampling of frequent words

Some words occur frequently, like “the”, “in”, “a”, etc. So we can apply a subsampling system:

Each word $w_i$ in the training set is discarded with probability $P(w_i)$:

\[P(w_i) = 1-\sqrt{\frac{t}{f(w_i)}}\]


Word pairs

Since “New York” has a much different meaning than the individual words “New” and “York”, it makes sense to treat “New York”, wherever it occurs in the text, as a single word with its own word vector representation.

Continuous bag of words model (CBOW)

Predict center word from sum of surrounding word vectos instead of predicting surrounding single words from center word as in skip-gram model.


Pennington, Jeffrey, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

\[J(\theta) = \frac{1}{2}\sum\limits_{i,j=1}^W f(P_{ij}) (u_i^Tv_j-\log P_{ij})^2\]



Intrinsic Evaluation : Word Vector Analogies

a:b :: c:?

i.e. man:women :: king: ?, the relation between “man” and “women” is like the relation between “king” and which word ?

This can acutally be calculated by:

\[d=\arg\max\limits_i \frac{(x_b-x_a+x_c)^Tx_i}{\|x_b-x_a+x_c\|}\]

Some analogies:

What about ambiguity ?

Improving Word Representaions Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

Cluster word windows around words, retrain with each word assigned to multiple different clusters bank1, bank2, etc.

Extrinsic Evaluation

Find a person, organization or location

Word Window Classification

Cross entropy

Cross-entropy can be re-written in terms of the entropy and Kullback-Leibler divergence between the two distributions

Assuming a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else, $p = [0,\cdots,0,1,0,\cdots,0]$ and our computed probability is q, then the cross entropy is:

\[H(p,q)=-\sum\limits_c p(c)\log q(c)\]

Since p is one-hot, the only term left is the negative log probability of the true class y:

\[H(p,q)=-\log q(y)\]

Logistic regression

For a word vector $x$, the probability can be like

\[p(y|x) = \frac{\exp(f_y)}{\sum\limits_{c=1}^C\exp(f_c)}\]


We want to maximize the probability of the correct class y, so we minimize $J=-\log p(y|x)$.


The corresponding gradients are:

For the full dataset $(x_i,y_i)$, the cross entropy loss function (with regularization) is:


where $f_{y_i}$ is the $y_i$-th element of $Wx_i$.

In the classification with word vectors, the parameters are not only $W$ but also the word vectors $x$. So the size of parameters are $Cd+Vd$ ($V$ is the size of vocabulary) and if there isn’t enough data we may have overfitting problems.

Window classification

The context is very important for classifying words. For example, “To sanction” can mean “to permit” or “to punish” and all depends on the context.

So we can classify a word in its context window of neighboring words. For example with window length 2, $x_{\text{window}}=x\in\mathbb{R}^{5d}$. We can use the same way as before to calculate the gradient. Just now we have $\nabla_x J\in\mathbb{R}^{5d}$ and we need to update the word vectors seperately.

Neural networks

Now we use a single layer and an unnormalized score:

\[\begin{align*} z &= Wx+b\\ a &= f(z)\\ s &= U^Ta \end{align*}\]

Max-margin loss

Idea is to make score of true window larger and corrupt window’s score lower (until they are good enough):

\[\text{minimize}\quad J=\max(0,1-s+s_c)\]

where, for example, if we’d like to build a location classifier

Now the parameters are $U,W,b,x$. By noting ($\circ$ means element-wise multiplication)

\[\delta = U \circ f'(z)\]

The gradients are

Recurrent Neural Networks

Recurrent Neural Network

Given list of word vectors $x_1,\cdots,x_{t-1},x_t,x_{t+1},\cdots,x_T$.

At a single time step,

\[\begin{align*} h_t &= f(W^{(hh)}h_{t-1}+W^{(hx)}x_{[t]})\\ \hat{y}_t &= \text{softmax}(W^{(S)}h_t) \end{align*}\]


The corresponding loss function is

\[J^{(t)}(\theta)=-\sum\limits_{j=1}^{|V|}y_{t,j}\log \hat{y}_{t,j}\]

The total evaluation is (to minimize)

\[J=2^{\frac{1}{T}\sum\limits_{t=1}^T J^{(t)}}\]

Simpler RNN

We now consider a simpler RNN:

\[\begin{align*} h_t &= Wf(h_{t-1})+W^{(hx)}x_{[t]}\\ \hat{y}_t &= W^{(S)}f(h_t) \end{align*}\]

The total error is the sum of each error at time steps $t$,

\[\frac{\partial E}{\partial W} = \sum\limits_{t=1}^T\frac{\partial E_t}{\partial W}\]

and we have the chain rule application:

\[\frac{\partial E_t}{\partial W}=\sum\limits_{k=1}^t \frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}\]

as well as

\[\frac{\partial h_t}{\partial h_k}=\prod\limits_{j=t}^{k+1}\frac{\partial h_j}{\partial h_{j-1}}\]


\[\frac{\partial h_j}{\partial h_{j-1}}=W\text{diag}(f'(h_{j-1}))\]

The vanishing/exploding gradient problem

We can analyze the norms of the Jacobians

\[\|\frac{\partial h_j}{\partial h_{j-1}}\|\leq \|W\| \|\text{diag}(f'(h_{j-1}))\|\leq \beta_W \beta_h\]

where $\beta_W,\beta_h$ are upper bounds of the norms.


\[\|\frac{\partial h_t}{\partial h_k}\|\leq\prod\limits_{j=t}^{k+1}\|\frac{\partial h_j}{\partial h_{j-1}}\|\leq (\beta_W \beta_h)^{t-k}\]

where the right side can vanish or explode quickly.

Bidirectional RNN

Bidirectional Recurrent Neural Network

\[\begin{align*} \overrightarrow{h}_t &= \sigma(\overrightarrow{W}x_t+\overrightarrow{V}\overrightarrow{h}_{t-1}+\overrightarrow{b})\\ \overleftarrow{h}_t &= \sigma(\overleftarrow{W}x_t+\overleftarrow{V}\overleftarrow{h}_{t-1}+\overleftarrow{b})\\ \hat{y}_t &= g(U[\overrightarrow{h}_t;\overleftarrow{h}_t] + c) \end{align*}\]

Deep Bidirectional RNN

Deep Bidirectional Recurrent Neural Network

\[\begin{align*} \overrightarrow{h}_t^{(i)} &= f(\overrightarrow{W}^{(i)}h_t^{(i-1)}+\overrightarrow{V}\overrightarrow{h}_{t-1}^{(i)}+\overrightarrow{b}^{(i)})\\ \overleftarrow{h}_t^{(i)} &= f(\overleftarrow{W}^{(i)}h_t^{(i-1)}+\overleftarrow{V}\overleftarrow{h}_{t-1}^{(i)}+\overleftarrow{b}^{(i)})\\ \hat{y}_t &= g(U[\overrightarrow{h}_t^{(L)};\overleftarrow{h}_t^{(L)}] + c) \end{align*}\]

Machine Translation

RNN Translation Model Extensions

Machine Translation Recurrent Neural Network

Gated Recurrent Units

Machine Translation Gated Recurrent Units

\[\begin{align*} z_t &= \sigma(W^{(z)}x_t+U^{(z)}h_{t-1}) \\ r_t &= \sigma(W^{(r)}x_t+U^{(r)}h_{t-1}) \\ \tilde{h}_t &= \tanh(Wx_t+U(r_t\circ h_{t-1})) \\ h_t &= z_t\circ h_{t-1}+(1-z_t)\circ \tilde{h}_t \end{align*}\]

where $\circ$ means element-wise multiplication.

Long-short-term-memories (LSTMs)

Understanding LSTM Networks


\[\begin{align*} i_t &= \sigma(W^{(i)}x_t+U^{(i)}h_{t-1}) \\ f_t &= \sigma(W^{(f)}x_t+U^{(f)}h_{t-1}) \\ o_t &= \sigma(W^{(o)}x_t+U^{(o)}h_{t-1}) \\ \tilde{c}_t &= \tanh(W^{(c)}x_t+U^{(c)}h_{t-1}) \\ c_t &= f_t\circ c_{t-1}+i_t\circ \tilde{c}_t \\ h_t &= o_t \circ \tanh(c_t) \end{align*}\]


Pointer Sentinel Mixture Models

Not finished

Notes until lecture 8