Enable analysis of meaning beyond Bag-of-Words
(e.g., frequency counts, dictionaries, topic models)
“You shall know a word by the company it keeps.”
— Firth (1957)
Sentence:
I like doing text analysis with Chris
target = text
context (±1) = doing, analysis
Word | I | like | doing | text | analysis | with | Chris |
---|---|---|---|---|---|---|---|
text | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
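Below is a minimal Python sketch of this step (the tokenizer, window size, and variable names are illustrative, not from any particular library): it extracts (target, context) pairs with a ±1 window and reproduces the co-occurrence row for "text" shown above.

```python
# Extract (target, context) pairs with a +/-1 window and rebuild the
# co-occurrence row for the target "text" from the example sentence.
sentence = "I like doing text analysis with Chris".split()
vocab = {w: i for i, w in enumerate(sentence)}  # each word occurs once here

window = 1  # +/-1 context window, as in the example above
pairs = []
for pos, target in enumerate(sentence):
    for offset in range(-window, window + 1):
        ctx_pos = pos + offset
        if offset != 0 and 0 <= ctx_pos < len(sentence):
            pairs.append((target, sentence[ctx_pos]))

# Co-occurrence counts for the target "text"
row = [0] * len(vocab)
for target, context in pairs:
    if target == "text":
        row[vocab[context]] += 1
print(row)  # [0, 0, 1, 0, 1, 0, 0] -> "doing" and "analysis"
```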
Let \(\mathbf{1}_c\in\mathbb{R}^{V}\) be a one-hot vector for token \(c\).
\[ \underbrace{\mathbf{1}_c^\top}_{1\times V} \;\underbrace{E_{\text{in}}}_{V\times d} \;=\; \underbrace{e_c^\top}_{1\times d}, \qquad \underbrace{\mathbf{1}_o^\top}_{1\times V} \;\underbrace{E_{\text{out}}}_{V\times d} \;=\; \underbrace{c_o^\top}_{1\times d}. \]
Visual: \[ \left[\begin{array}{c} \vdots\\ \color{red}{\mathbf{1}_c^\top}\\ \vdots \end{array}\right] \!\! \left[\begin{array}{ccc} \cdot&\cdots&\cdot\\[-2pt] \vdots& &\vdots\\[-2pt] \cdot&\cdots&\cdot \end{array}\right] = \color{red}{\big[e_c^\top\big]} \qquad \left[\begin{array}{c} \vdots\\ \color{purple}{\mathbf{1}_o^\top}\\ \vdots \end{array}\right] \!\! \left[\begin{array}{ccc} \cdot&\cdots&\cdot\\[-2pt] \vdots& &\vdots\\[-2pt] \cdot&\cdots&\cdot \end{array}\right] = \color{purple}{\big[c_o^\top\big]} \]
Let \(V=5\) and \(d=3\), and suppose the 4th row of \(E_{\text{in}}\) is \([\,0.2,\,-1.0,\,0.5\,]\).
One-hot row for token \(c\) (the 4th token): \[ \mathbf{1}_c^\top = \big[\,0\ \ 0\ \ 0\ \ 1\ \ 0\,\big]. \]
Row selection via matrix multiply: \[ \underbrace{\big[\,0\ 0\ 0\ 1\ 0\,\big]}_{\mathbf{1}_c^\top} \; \underbrace{\begin{bmatrix} \cdot & \cdot & \cdot\\[-2pt] \cdot & \cdot & \cdot\\[-2pt] \cdot & \cdot & \cdot\\[-2pt] \color{teal}{0.2} & \color{teal}{-1.0} & \color{teal}{0.5}\\[-2pt] \cdot & \cdot & \cdot \end{bmatrix}}_{E_{\text{in}}} \;=\; \underbrace{\big[\,\color{teal}{0.2}\ \ \color{teal}{-1.0}\ \ \color{teal}{0.5}\,\big]}_{e_c^\top}. \]
The same multiplication with \(E_{\text{out}}\) picks out the context vector \(c_o^\top\) (both embedding matrices are initialized at random).
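A small numpy sketch of this lookup, assuming the same \(V=5\), \(d=3\) and the 4th row fixed to \([0.2, -1.0, 0.5]\); the random seed and the other rows are arbitrary. In practice libraries never materialize the one-hot vector and simply index the row.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E_in = rng.normal(size=(V, d))      # random "input" embedding table
E_in[3] = [0.2, -1.0, 0.5]          # the 4th token (index 3)

one_hot = np.zeros(V)
one_hot[3] = 1.0

e_c = one_hot @ E_in                # matrix multiply selects row 3
print(e_c)                          # [0.2, -1.0, 0.5]
print(np.allclose(e_c, E_in[3]))    # True: identical to direct row indexing
```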
Dot product score: \[ s_{co}=e_c^\top c_o \in \mathbb{R}. \]
Interpretation: large \(s_{co}\) means “\(o\) is a plausible context for \(c\).”
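A quick numpy sketch of the score for one pair; the indices \(c\), \(o\) and the random embedding tables are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E_in = rng.normal(size=(V, d))     # target ("input") embeddings
E_out = rng.normal(size=(V, d))    # context ("output") embeddings

c, o = 3, 1                        # illustrative target and context indices
s_co = E_in[c] @ E_out[o]          # scalar dot-product score
print(s_co)                        # larger values: "o is a plausible context for c"
```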
For each positive pair \((c,o)\), draw \(K\) negatives \(n_1,\dots,n_K\) i.i.d. from \[ p_{\text{neg}}(w)\;\propto\; f(w)^{\,0.75}, \] where \(f(w)\) is the unigram count. This down-weights stopwords while keeping them in play.
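A sketch of drawing \(K\) negatives from the smoothed distribution; the counts below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([500, 300, 40, 10, 5], dtype=float)  # f(w): unigram counts
p_neg = counts ** 0.75
p_neg /= p_neg.sum()             # normalize to a probability distribution

K = 5
negatives = rng.choice(len(counts), size=K, p=p_neg)
print(negatives)                 # K negative word indices for one positive pair
```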
\[ \underbrace{\color{teal}{ \left[\begin{array}{cccc} \cdots & \cdots & \cdots & \cdots\\ \cdots & \boxed{\,s_{co}\,} & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots \end{array}\right]}}_{S\,=\,E_{\text{in}}E_{\text{out}}^\top} \;=\; \underbrace{\color{red}{ \left[\begin{array}{c} \vdots\\ e_c^\top\\ \vdots \end{array}\right]}}_{\text{row $c$ of }E_{\text{in}}} \underbrace{\color{purple}{ \left[\begin{array}{ccc} \cdots & c_o & \cdots \end{array}\right]}}_{\text{column $o$ of }E_{\text{out}}^\top} \]
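The same picture in numpy: the full \(V\times V\) score matrix and a check that its \((c,o)\) entry equals the single dot product above (dimensions and indices are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E_in = rng.normal(size=(V, d))
E_out = rng.normal(size=(V, d))

S = E_in @ E_out.T                              # (V x V) matrix of all pair scores
c, o = 3, 1
print(np.isclose(S[c, o], E_in[c] @ E_out[o]))  # True: S[c, o] = e_c . c_o
```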
Given two vectors \(x,y\in\mathbb{R}^d\),
\[ \operatorname{cos}(x,y) \;=\; \frac{x^\top y}{\|x\|\,\|y\|} \;=\; \cos\theta, \]
where \(\theta\) is the angle between \(x\) and \(y\).
Let \(x=(1,0)\) and \(y=(0.6,0.8)\): \[ x^\top y = 0.6,\quad \|x\|=1,\ \|y\|=1 \Rightarrow \operatorname{cos}(x,y)=0.6. \]
Scale \(y' = 2y=(1.2,1.6)\): \[ x^\top y' = 1.2\ \ (\text{doubles}),\quad \operatorname{cos}(x,y')=\frac{1.2}{1\cdot 2}=0.6\ \ (\text{unchanged}). \]
Cosine measures directional similarity, not magnitude.
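A numerical check of the worked example in numpy; the `cosine` helper is a hand-rolled illustration, not a library call.

```python
import numpy as np

def cosine(x, y):
    # cos(x, y) = x.y / (||x|| ||y||)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0])
y = np.array([0.6, 0.8])

print(x @ y, cosine(x, y))             # 0.6, 0.6
print(x @ (2 * y), cosine(x, 2 * y))   # 1.2 (doubles), 0.6 (unchanged)
```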
Let \(E\in\mathbb{R}^{V\times d}\) be the embedding table.
Pick two tokens with indices \(a\) and \(b\).
\[ \underbrace{\color{black}{ \left[\begin{array}{ccc} \cdot & \cdots & \cdot\\ \vdots & & \vdots\\ \cdot & \cdots & \cdot \end{array}\right]}}_{\textbf{$E$ }(V\times d)} \quad\Rightarrow\quad \color{red}{e_a^\top}=\big[E\big]_{a:},\qquad \color{blue}{e_b^\top}=\big[E\big]_{b:}. \]
Row-wise L2 normalization makes vectors comparable by angle.
\[ \hat{E}_{i:}=\frac{E_{i:}}{\lVert E_{i:}\rVert}\quad\Longleftrightarrow\quad \hat{E}=D^{-1}E,\ \ D=\operatorname{diag}\!\big(\lVert E_{1:}\rVert,\dots,\lVert E_{V:}\rVert\big). \]
So our two normalized vectors are \(\ \color{red}{\hat e_a},\ \color{blue}{\hat e_b}\).
\[ \operatorname{cos}(\color{red}{e_a},\color{blue}{e_b}) =\frac{\color{red}{e_a}^\top\color{blue}{e_b}}{\lVert e_a\rVert\,\lVert e_b\rVert} =\color{red}{\hat e_a}^\top\color{blue}{\hat e_b}\in[-1,1]. \]
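A numpy sketch of row-wise L2 normalization and the resulting cosine; \(V\), \(d\), and the token indices \(a\), \(b\) are illustrative and the table is random.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E = rng.normal(size=(V, d))                        # embedding table

norms = np.linalg.norm(E, axis=1, keepdims=True)   # ||E_i:|| for every row
E_hat = E / norms                                  # E_hat = D^{-1} E

a, b = 0, 2
cos_ab = E_hat[a] @ E_hat[b]                       # in [-1, 1]
direct = (E[a] @ E[b]) / (np.linalg.norm(E[a]) * np.linalg.norm(E[b]))
print(np.isclose(cos_ab, direct))                  # True: same cosine either way
```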
“Semantics derived automatically from language corpora contain human-like biases.”
— Aylin Caliskan, Joanna J. Bryson, & Arvind Narayanan, Science (2017)
“The geometry of culture: Analyzing the meanings of class through word embeddings.”
— Austin C. Kozlowski, Matt Taddy, & James A. Evans, American Sociological Review (2019)