Enable analysis of meaning beyond Bag-of-Words
(e.g., frequency counts, dictionaries, topic models)
“You shall know a word by the company it keeps.”
— Firth (1957)
Sentence:
I like doing text analysis with Chris
target = text
context (±1) = doing, analysis
Word | I | like | doing | text | analysis | with | Chris |
---|---|---|---|---|---|---|---|
text | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
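Below is a minimal Python sketch of this step (the tokenizer, window size, and variable names are illustrative, not from any particular library): it extracts (target, context) pairs with a ±1 window and reproduces the co-occurrence row for "text" shown above.

```python
# Extract (target, context) pairs with a +/-1 window and rebuild the
# co-occurrence row for the target "text" from the example sentence.
sentence = "I like doing text analysis with Chris".split()
vocab = {w: i for i, w in enumerate(sentence)}  # each word occurs once here

window = 1  # +/-1 context window, as in the example above
pairs = []
for pos, target in enumerate(sentence):
    for offset in range(-window, window + 1):
        ctx_pos = pos + offset
        if offset != 0 and 0 <= ctx_pos < len(sentence):
            pairs.append((target, sentence[ctx_pos]))

# Co-occurrence counts for the target "text"
row = [0] * len(vocab)
for target, context in pairs:
    if target == "text":
        row[vocab[context]] += 1
print(row)  # [0, 0, 1, 0, 1, 0, 0] -> "doing" and "analysis"
```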
Let \(\mathbf{1}_c\in\mathbb{R}^{V}\) be a one-hot vector for token \(c\).
\[ \underbrace{\mathbf{1}_c^\top}_{1\times V} \;\underbrace{E_{\text{in}}}_{V\times d} \;=\; \underbrace{e_c^\top}_{1\times d}, \qquad \underbrace{\mathbf{1}_o^\top}_{1\times V} \;\underbrace{E_{\text{out}}}_{V\times d} \;=\; \underbrace{c_o^\top}_{1\times d}. \]
Visual: \[ \left[\begin{array}{c} \vdots\\ \color{red}{\mathbf{1}_c^\top}\\ \vdots \end{array}\right] \!\! \left[\begin{array}{ccc} \cdot&\cdots&\cdot\\[-2pt] \vdots& &\vdots\\[-2pt] \cdot&\cdots&\cdot \end{array}\right] = \color{red}{\big[e_c^\top\big]} \qquad \left[\begin{array}{c} \vdots\\ \color{purple}{\mathbf{1}_o^\top}\\ \vdots \end{array}\right] \!\! \left[\begin{array}{ccc} \cdot&\cdots&\cdot\\[-2pt] \vdots& &\vdots\\[-2pt] \cdot&\cdots&\cdot \end{array}\right] = \color{purple}{\big[c_o^\top\big]} \]
Let \(V=5\) and \(d=3\), and suppose the 4th row of \(E_{\text{in}}\) is \([\,0.2,\,-1.0,\,0.5\,]\).
One-hot row for token \(c\) (the 4th token): \[ \mathbf{1}_c^\top = \big[\,0\ \ 0\ \ 0\ \ 1\ \ 0\,\big]. \]
Row selection via matrix multiply: \[ \underbrace{\big[\,0\ 0\ 0\ 1\ 0\,\big]}_{\mathbf{1}_c^\top} \; \underbrace{\begin{bmatrix} \cdot & \cdot & \cdot\\[-2pt] \cdot & \cdot & \cdot\\[-2pt] \cdot & \cdot & \cdot\\[-2pt] \color{teal}{0.2} & \color{teal}{-1.0} & \color{teal}{0.5}\\[-2pt] \cdot & \cdot & \cdot \end{bmatrix}}_{E_{\text{in}}} \;=\; \underbrace{\big[\,\color{teal}{0.2}\ \ \color{teal}{-1.0}\ \ \color{teal}{0.5}\,\big]}_{e_c^\top}. \]
The same multiplication with \(E_{\text{out}}\) picks out the context vector \(c_o^\top\) (both embedding matrices are initialized at random).
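A small numpy sketch of this lookup, assuming the same \(V=5\), \(d=3\) and the 4th row fixed to \([0.2, -1.0, 0.5]\); the random seed and the other rows are arbitrary. In practice libraries never materialize the one-hot vector and simply index the row.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E_in = rng.normal(size=(V, d))      # random "input" embedding table
E_in[3] = [0.2, -1.0, 0.5]          # the 4th token (index 3)

one_hot = np.zeros(V)
one_hot[3] = 1.0

e_c = one_hot @ E_in                # matrix multiply selects row 3
print(e_c)                          # [0.2, -1.0, 0.5]
print(np.allclose(e_c, E_in[3]))    # True: identical to direct row indexing
```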
Dot product score: \[ s_{co}=e_c^\top c_o \in \mathbb{R}. \]
Interpretation: large \(s_{co}\) means “\(o\) is a plausible context for \(c\).”
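A quick numpy sketch of the score for one pair; the indices \(c\), \(o\) and the random embedding tables are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E_in = rng.normal(size=(V, d))     # target ("input") embeddings
E_out = rng.normal(size=(V, d))    # context ("output") embeddings

c, o = 3, 1                        # illustrative target and context indices
s_co = E_in[c] @ E_out[o]          # scalar dot-product score
print(s_co)                        # larger values: "o is a plausible context for c"
```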
For each positive pair \((c,o)\), draw \(K\) negatives \(n_1,\dots,n_K\) i.i.d. from \[ p_{\text{neg}}(w)\;\propto\; f(w)^{\,0.75}, \] where \(f(w)\) is the unigram count. This down-weights stopwords while keeping them in play.
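A sketch of drawing \(K\) negatives from the smoothed distribution; the counts below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([500, 300, 40, 10, 5], dtype=float)  # f(w): unigram counts
p_neg = counts ** 0.75
p_neg /= p_neg.sum()             # normalize to a probability distribution

K = 5
negatives = rng.choice(len(counts), size=K, p=p_neg)
print(negatives)                 # K negative word indices for one positive pair
```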
\[ \underbrace{\color{teal}{ \left[\begin{array}{cccc} \cdots & \cdots & \cdots & \cdots\\ \cdots & \boxed{\,s_{co}\,} & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots \end{array}\right]}}_{S\,=\,E_{\text{in}}E_{\text{out}}^\top} \;=\; \underbrace{\color{red}{ \left[\begin{array}{c} \vdots\\ e_c^\top\\ \vdots \end{array}\right]}}_{\text{row $c$ of }E_{\text{in}}} \underbrace{\color{purple}{ \left[\begin{array}{ccc} \cdots & c_o & \cdots \end{array}\right]}}_{\text{column $o$ of }E_{\text{out}}^\top} \]
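The same picture in numpy: the full \(V\times V\) score matrix and a check that its \((c,o)\) entry equals the single dot product above (dimensions and indices are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E_in = rng.normal(size=(V, d))
E_out = rng.normal(size=(V, d))

S = E_in @ E_out.T                              # (V x V) matrix of all pair scores
c, o = 3, 1
print(np.isclose(S[c, o], E_in[c] @ E_out[o]))  # True: S[c, o] = e_c . c_o
```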
Given two vectors \(x,y\in\mathbb{R}^d\),
\[ \operatorname{cos}(x,y) \;=\; \frac{x^\top y}{\|x\|\,\|y\|} \;=\; \cos\theta, \]
where \(\theta\) is the angle between \(x\) and \(y\).
Let \(x=(1,0)\) and \(y=(0.6,0.8)\): \[ x^\top y = 0.6,\quad \|x\|=1,\ \|y\|=1 \Rightarrow \operatorname{cos}(x,y)=0.6. \]
Scale \(y' = 2y=(1.2,1.6)\): \[ x^\top y' = 1.2\ \ (\text{doubles}),\quad \operatorname{cos}(x,y')=\frac{1.2}{1\cdot 2}=0.6\ \ (\text{unchanged}). \]
Cosine measures directional similarity, not magnitude.
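A numerical check of the worked example in numpy; the `cosine` helper is a hand-rolled illustration, not a library call.

```python
import numpy as np

def cosine(x, y):
    # cos(x, y) = x.y / (||x|| ||y||)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0])
y = np.array([0.6, 0.8])

print(x @ y, cosine(x, y))             # 0.6, 0.6
print(x @ (2 * y), cosine(x, 2 * y))   # 1.2 (doubles), 0.6 (unchanged)
```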
Let \(E\in\mathbb{R}^{V\times d}\) be the embedding table.
Pick two tokens with indices \(a\) and \(b\).
\[ \underbrace{\color{black}{ \left[\begin{array}{ccc} \cdot & \cdots & \cdot\\ \vdots & & \vdots\\ \cdot & \cdots & \cdot \end{array}\right]}}_{\textbf{$E$ }(V\times d)} \quad\Rightarrow\quad \color{red}{e_a^\top}=\big[E\big]_{a:},\qquad \color{blue}{e_b^\top}=\big[E\big]_{b:}. \]
Row-wise L2 normalization makes vectors comparable by angle.
\[ \hat{E}_{i:}=\frac{E_{i:}}{\lVert E_{i:}\rVert}\quad\Longleftrightarrow\quad \hat{E}=D^{-1}E,\ \ D=\operatorname{diag}\!\big(\lVert E_{1:}\rVert,\dots,\lVert E_{V:}\rVert\big). \]
So our two normalized vectors are \(\ \color{red}{\hat e_a},\ \color{blue}{\hat e_b}\).
\[ \operatorname{cos}(\color{red}{e_a},\color{blue}{e_b}) =\frac{\color{red}{e_a}^\top\color{blue}{e_b}}{\lVert e_a\rVert\,\lVert e_b\rVert} =\color{red}{\hat e_a}^\top\color{blue}{\hat e_b}\in[-1,1]. \]
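A numpy sketch of row-wise L2 normalization and the resulting cosine; \(V\), \(d\), and the token indices \(a\), \(b\) are illustrative and the table is random.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E = rng.normal(size=(V, d))                        # embedding table

norms = np.linalg.norm(E, axis=1, keepdims=True)   # ||E_i:|| for every row
E_hat = E / norms                                  # E_hat = D^{-1} E

a, b = 0, 2
cos_ab = E_hat[a] @ E_hat[b]                       # in [-1, 1]
direct = (E[a] @ E[b]) / (np.linalg.norm(E[a]) * np.linalg.norm(E[b]))
print(np.isclose(cos_ab, direct))                  # True: same cosine either way
```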
“Semantics derived automatically from language corpora contain human-like biases.”
— Aylin Caliskan, Joanna J. Bryson, & Arvind Narayanan, Science (2017)
“The geometry of culture: Analyzing the meanings of class through word embeddings.”
— Austin C. Kozlowski, Matt Taddy, & James A. Evans, American Sociological Review (2019)