Note: How this differs from word2vec
word2vec trains two lookup tables (input and output embeddings) to score (center, context) pairs with a dot product; once trained, a word’s vector is fixed.
Transformers recompute a vector for each token using the other tokens in the same sequence at every layer. The representation is context‑conditioned.
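A minimal sketch of that difference (assuming distilbert-base-uncased and picking the word's first sub-token position; the helper vector_for is only for illustration): the same word gets a different vector in each sentence.
import torch
from transformers import AutoTokenizer, AutoModel

demo_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
demo_model = AutoModel.from_pretrained("distilbert-base-uncased")

def vector_for(word, sentence):
    enc = demo_tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = demo_model(**enc).last_hidden_state[0]   # (seq_len, dim), one vector per token
    pos = (enc["input_ids"][0] == demo_tok.convert_tokens_to_ids(word)).nonzero()[0].item()
    return hidden[pos]                                    # contextual vector of `word` in this sentence

v1 = vector_for("bank", "she sat on the river bank")
v2 = vector_for("bank", "he deposited cash at the bank")
print(torch.cosine_similarity(v1, v2, dim=0))             # clearly below 1: context changes the vector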
Note: Relation to word2vec
In SGNS, similarity is a dot product between pretrained center/context vectors.
In self‑attention, similarity is a dot product inside the sequence between projected token features, so the “context” is sequence‑specific and position‑aware.
Effectively: attention learns a data‑dependent co‑occurrence matrix for each sequence and layer.
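A minimal sketch of that sequence-internal dot product (single head; the random matrices stand in for the learned projections):
import torch

seq_len, d_model = 5, 16
X = torch.randn(seq_len, d_model)                # token features of ONE sequence

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v              # projected token features
scores = Q @ K.T / d_model ** 0.5                # dot-product similarity inside the sequence
weights = torch.softmax(scores, dim=-1)          # each row: a data-dependent "co-occurrence" distribution
out = weights @ V                                # context-conditioned token vectors
print(weights.shape, out.shape)                  # torch.Size([5, 5]) torch.Size([5, 16])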
Note: Analogy to word2vec
Think of heads as learning multiple subspaces of “co‑occurrence,” instead of one shared space.
Multiple heads ≈ capturing different neighborhood logics simultaneously.
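Shapes only, as a sketch: the model dimension is split into heads so each head attends in its own subspace (random matrices again stand in for the learned projections).
import torch

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
X = torch.randn(seq_len, d_model)

def heads(M):                                     # project, then split into n_heads subspaces
    return (X @ M).view(seq_len, n_heads, d_head).transpose(0, 1)

Q, K, V = (heads(torch.randn(d_model, d_model)) for _ in range(3))
weights = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)  # (heads, seq, seq)
out = (weights @ V).transpose(0, 1).reshape(seq_len, d_model)             # concatenate heads again
print(weights.shape, out.shape)                   # torch.Size([4, 5, 5]) torch.Size([5, 16])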
Note: Key difference: in word2vec, the vector you analyze is the same regardless of the sentence; in Transformers, it depends on the sentence and the layer.
For classification, the model pools the sequence into a single sentence vector, typically the [CLS] token (or a mean pool over token vectors), and feeds it to a linear layer plus softmax/sigmoid. In this walkthrough we fine-tune such a head to classify short texts into three classes: {protest, discrimination, solidarity}.
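A minimal sketch of that head, assuming distilbert-base-uncased and an untrained 3-class linear layer (the fine-tuning below lets AutoModelForSequenceClassification wire this up and train it):
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

enc_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
head = nn.Linear(encoder.config.dim, 3)                    # hidden size → 3 class logits

enc = enc_tok("Thousands gathered in front of parliament.", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state              # (1, seq_len, dim)
cls_vec = hidden[:, 0]                                     # first token ([CLS]) as sentence vector
# mean_vec = hidden.mean(dim=1)                            # alternative: mean pooling
print(torch.softmax(head(cls_vec), dim=-1))                # untrained head → roughly uniform probabilities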
Dataset
label_names = ["protest", "discrimination", "solidarity"]
label2id = {n:i for i,n in enumerate(label_names)}
id2label = {i:n for n,i in label2id.items()}
train_texts = [...]; train_labels = [...]
val_texts = [...]; val_labels = [...]
Note: Why this matters: these mappings are stored in the model config (label2id, id2label), which gives cleaner logs and keeps the class order correct.
The dataset converts each text into model inputs (input_ids, attention_mask) and attaches its label.
import torch
from transformers import AutoTokenizer
from torch.utils.data import Dataset
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
class SimpleTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts, self.labels, self.tok = texts, labels, tokenizer
        self.max_length = max_length
    def __len__(self): return len(self.texts)
    def __getitem__(self, i):
        # Pad/truncate to a fixed length so the default DataLoader collate can stack a batch
        enc = self.tok(self.texts[i], truncation=True, padding="max_length",
                       max_length=self.max_length, return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}        # drop the batch dimension
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.long)
        return item
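The training and evaluation loops below expect DataLoaders over this dataset; a minimal sketch of that wiring (the batch sizes are arbitrary choices, not from the original setup):
from torch.utils.data import DataLoader

train_ds = SimpleTextDataset(train_texts, train_labels, tok)
val_ds = SimpleTextDataset(val_texts, val_labels, tok)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)   # shuffle for training
val_loader = DataLoader(val_ds, batch_size=32)                     # no shuffling needed for evaluation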
import torch
from transformers import AutoModelForSequenceClassification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label_names),
    id2label=id2label, label2id=label2id,
).to(device)
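A quick sanity check, tying back to the note above, that the label mappings travel with the model:
print(model.config.id2label)    # e.g. {0: 'protest', 1: 'discrimination', 2: 'solidarity'}
print(model.config.num_labels)  # 3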
AdamW is the standard optimizer choice for Transformers.
Tip: A small learning rate (e.g., 2e-5) is typical when fine-tuning pretrained encoders.
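The optimizer construction itself is not shown above; a minimal sketch that matches the training loop below:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)   # small LR, as recommended above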
def evaluate():
    model.eval(); correct = total = 0
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            out = model(**batch)                  # labels are in the batch, but we only use the logits here
            preds = out.logits.argmax(-1)         # predicted class id per example
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].numel()
    return correct / max(1, total)
epochs = 3
for epoch in range(1, epochs + 1):
    model.train()                                 # re-enable dropout (evaluate() switches to eval mode)
    running_loss = 0.0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        out = model(**batch)                      # labels included → out.loss is set
        out.loss.backward()
        optimizer.step(); optimizer.zero_grad()
        running_loss += out.loss.item()
    print(f"Epoch {epoch}: loss={running_loss/len(train_loader):.4f} val_acc={evaluate():.3f}")
test_samples = [
    "Thousands gathered in front of parliament.",
    "Volunteers cleaned the park and cooked for neighbors.",
    "He yelled slurs at a woman on the tram.",
]
enc = tok(test_samples, padding=True, truncation=True, return_tensors="pt").to(device)
model.eval()
with torch.no_grad():
logits = model(**enc).logits
probs = torch.softmax(logits, dim=-1).cpu()
for text, p in zip(test_samples, probs):
    pred_id = int(torch.argmax(p))
    print("\nTEXT:", text)
    print("PRED:", id2label[pred_id])
    for name, prob in zip(label_names, p.tolist()):
        print(f" {name:15s} {prob:.3f}")