An old classic, revisited.
The first lecture notes of CS224n are very good: comprehensive and clear. The treatment below follows the Speech and Language Processing December 30, 2020 draft, which covers only skip-gram with negative sampling and is the most concise.
As for BERT, Dive into Deep Learning has a very clear explanation here.
Word2Vec does exactly what its name suggests: it maps each word to a vector in a finite-dimensional Euclidean space.
It turns out that dense vectors work better in every NLP task than sparse vectors. While we don’t completely understand all the reasons for this, we have some intuitions. This is the usual ailment of neural-network methods: we can only guess at the reasons. (1) Fewer parameters; (2) the vectors are no longer mutually orthogonal, so they can capture semantics better.
The intuition of word2vec is that instead of counting how often each word $w$ occurs near, say, apricot, we’ll instead train a classifier on a binary prediction task: “Is word $w$ likely to show up near apricot?” We don’t actually care about this prediction task; instead we’ll take the learned classifier weights as the word embeddings. This is where the word vectors come from.
The revolutionary intuition here is that we can just use running text as implicitly supervised training data for such a classifier; a word $c$ that occurs near the target word apricot acts as gold ‘correct answer’ to the question posed above. At its core this is a shallow binary classifier (logistic regression), and the labels are supplied automatically by the text.
The classifier
For example, given the text
blah [c_1 c_2 t c_3 c_4] blah
the task is to predict the context words given the center word $t$.
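As a rough sketch of how the training pairs fall out of running text (the window size of 2, the toy sentence, and the helper name `skipgram_pairs` are all illustrative assumptions, not from the book):

```python
def skipgram_pairs(tokens, window=2):
    """Yield positive (target, context) pairs for each word and its neighbours."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

text = "blah blah lemon a tablespoon of apricot jam a pinch blah blah"
print(list(skipgram_pairs(text.split(), window=2))[:6])
```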
Our goal is to train a classifier such that, given a tuple $(t, c)$ of a target word $t$ paired with a candidate context word $c$, it will return the probability that $c$ is a real context word, $\mathbb P(+ \vert t, c)$; the probability that it is not is written $\mathbb P(- \vert t, c)$.
The intuition of the skipgram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent. The dot product $t \cdot c$ measures similarity, and the sigmoid function $\sigma$ turns it into a probability:
\[\mathbb P(+ | t, c_{1:m}) = \prod_{i=1}^m \frac{1}{1+\mathrm e^{-t\cdot c_i}},\]where $c_{1:m} = (c_1, \dots, c_m)$ and the context window consists of $m$ words.
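A tiny numerical sketch of this formula, with made-up 3-dimensional embeddings (numpy assumed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up embeddings, purely to evaluate the formula above.
t = np.array([0.5, -0.1, 0.3])                  # target embedding
C = np.array([[0.4, 0.0, 0.2],                  # context embeddings c_1..c_m
              [-0.2, 0.1, 0.5]])

per_word = sigmoid(C @ t)      # P(+ | t, c_i) = sigma(t . c_i)
p_window = per_word.prod()     # independence => product over the window
print(per_word, p_window)
```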
Learning the embeddings
For each of these $(t,c)$ training instances we’ll create $k$ (a hyperparameter) negative samples, drawn from the vocabulary while excluding the pair $(t, t)$.
Sampling distribution: the noise words are chosen according to their weighted unigram frequency
\[p_\alpha(w) = \frac{\operatorname{count}^\alpha(w)} {\sum_{v}\operatorname{count}^\alpha(v)},\]where $\alpha$ is a hyperparameter, commonly set to 0.75. Taking it below 1 has the benefit of boosting the probability of rare words.
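A sketch of this weighted unigram sampler, assuming the counts arrive as a plain word-to-count mapping (the helper name `noise_sampler` and the toy corpus are made up):

```python
import numpy as np
from collections import Counter

def noise_sampler(counts, alpha=0.75, seed=0):
    """Build a sampler over p_alpha from a word -> count mapping."""
    words = list(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** alpha
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)

    def sample(k, exclude=None):
        drawn = []
        while len(drawn) < k:
            w = str(rng.choice(words, p=probs))
            if w != exclude:               # never pair the target with itself
                drawn.append(w)
        return drawn

    return sample

counts = Counter("a a a a of of of apricot jam jam tablespoon".split())
sample = noise_sampler(counts, alpha=0.75)
print(sample(k=3, exclude="apricot"))
```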
Our objective is to maximize the similarity of the word pairs $(t, c)$ in the positive samples and minimize the similarity of the pairs in the negative samples:
\[L(\theta) = \sum_{(t, c)\in +} \log \mathbb P(+ | t, c) + \sum_{(t, n)\in -} \log \mathbb P(- | t, n).\]Recall that $1 - \sigma(x) = \sigma(-x)$, so the second sum indeed minimizes the similarity of the negative pairs. The motivation for negative sampling is that otherwise we would have to compute and sum over every word in the vocabulary, which is far too expensive.
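A sketch of this objective for a single positive pair and its $k$ noise words, again with made-up vectors (the function name `sgns_objective` is just for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(t, c_pos, C_neg):
    """log sigma(t . c) + sum_k log sigma(-t . n_k), to be maximized."""
    pos = np.log(sigmoid(t @ c_pos))
    neg = np.log(sigmoid(-(C_neg @ t))).sum()   # uses 1 - sigma(x) = sigma(-x)
    return pos + neg

t = np.array([0.5, -0.1, 0.3])                  # target embedding
c_pos = np.array([0.4, 0.0, 0.2])               # one positive context word
C_neg = np.array([[0.1, 0.2, -0.3],             # k = 2 sampled noise words
                  [-0.4, 0.5, 0.0]])
print(sgns_objective(t, c_pos, C_neg))
```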
Note that the skip-gram model thus actually learns two separate embeddings for each word $w$: the target embedding $t$ and the context embedding $c$. These embeddings are stored as matrices, which together constitute the parameters $\theta$. Usually only the target embeddings are kept, though one can also try summing or concatenating the two embeddings, and so on.
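A sketch of those post-training choices, with illustrative matrix shapes (here `W` holds the target embeddings and `C` the context embeddings, one row per word):

```python
import numpy as np

vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, dim))   # target embeddings, one row per word
C = rng.normal(size=(vocab_size, dim))   # context embeddings

emb_keep_target = W                      # the usual choice
emb_sum = W + C                          # or add the two embeddings
emb_concat = np.hstack([W, C])           # or concatenate them
print(emb_keep_target.shape, emb_sum.shape, emb_concat.shape)
```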
In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.