Prompt Stability Scoring for Text Annotation with Large Language Models
LLMs are now being used for everything.
Let’s say we want to classify texts into a set of categories.
A prompt is “all you need”
Parallels with human coders and inter-rater reliability (IRR): when human coders annotate, we check their agreement with measures like Krippendorff’s Alpha.
But we lack a dedicated routine for doing the same with LLM annotators…
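With human coders, agreement is routinely quantified before trusting the labels. A minimal sketch of that routine using the krippendorff package (the coders and labels here are made up for illustration):

# Sketch: Krippendorff's Alpha for two human coders (illustrative data)
import numpy as np
import krippendorff

# Rows are coders, columns are coded units; np.nan marks a missing annotation
reliability_data = np.array([
    [1, 0, 1, 1, 0, np.nan],  # coder A
    [1, 0, 1, 0, 0, 1],       # coder B
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's Alpha: {alpha:.2f}")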
# Function: intra_prompt_stability
import pandas as pd

def intra_prompt_stability(original_text, prompt_postfix, iterations, bootstrap_samples):
    """
    Calculate intra-prompt stability by annotating repeatedly with the same
    prompt and bootstrapping Krippendorff's Alpha (KA) over iterations.
    Args:
        original_text (str): The base text for the prompt.
        prompt_postfix (str): Additional text appended to the prompt.
        iterations (int): Number of annotation runs with the identical prompt.
        bootstrap_samples (int): Number of bootstrap resamples for KA.
    Returns:
        dict: KA scores (mean_alpha, ci_lower, ci_upper) for each accumulated
        set of iterations.
    """
    # Combine the base text and the postfix into the full prompt
    prompt = f"{original_text} {prompt_postfix}"
    # Initialize storage for annotations and KA scores
    all_annotations = []
    ka_scores = {}
    # Annotate the full dataset once per iteration, always with the same prompt
    for i in range(1, iterations + 1):
        for d in data:  # Assumes 'data' is defined elsewhere
            annotation = annotate(d, prompt)  # Assumes an 'annotate' function exists
            all_annotations.append({
                "id": d["id"],
                "text": d["text"],
                "annotation": annotation,
                "iteration": i,
            })
    # Convert annotations to a DataFrame
    all_annotated = pd.DataFrame(all_annotations)
    # Bootstrap KA over the accumulated iterations; each iteration acts as a
    # "coder", so KA is only defined once at least two iterations exist
    for i in range(2, iterations + 1):
        subset_annotated = all_annotated[all_annotated["iteration"] <= i]
        mean_alpha, ci_lower, ci_upper = bootstrap_krippendorff(
            subset_annotated, bootstrap_samples
        )  # Assumes a 'bootstrap_krippendorff' function exists
        ka_scores[i] = {
            "mean_alpha": mean_alpha,
            "ci_lower": ci_lower,
            "ci_upper": ci_upper,
        }
    return ka_scores
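Both routines assume a bootstrap_krippendorff helper. One hedged sketch of how it could work, resampling units with replacement and scoring each resample with the krippendorff package (the pivoting logic and interval choice are our assumptions; only the column names come from the DataFrames built above):

# Sketch: bootstrap_krippendorff (one possible implementation)
import numpy as np
import krippendorff

def bootstrap_krippendorff(df, bootstrap_samples, seed=0):
    """Bootstrap Krippendorff's Alpha over coded units (assumes numeric labels)."""
    rng = np.random.default_rng(seed)
    # Treat prompts (inter-prompt) or iterations (intra-prompt) as "coders"
    coder_col = "prompt_id" if "prompt_id" in df.columns else "iteration"
    # Pivot to a coders-x-units reliability matrix
    matrix = df.pivot_table(index=coder_col, columns="id",
                            values="annotation", aggfunc="first")
    alphas = []
    for _ in range(bootstrap_samples):
        # Resample units (columns) with replacement and score the resample
        cols = rng.choice(matrix.columns, size=len(matrix.columns), replace=True)
        sample = matrix[cols].to_numpy(dtype=float)
        alphas.append(krippendorff.alpha(reliability_data=sample,
                                         level_of_measurement="nominal"))
    alphas = np.array(alphas)
    # Mean alpha with a 95% percentile interval
    return alphas.mean(), np.percentile(alphas, 2.5), np.percentile(alphas, 97.5)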
# Function: interprompt_stability
import pandas as pd

def interprompt_stability(original_text, prompt_postfix, nr_variations,
                          temperatures, iterations, bootstrap_samples):
    """
    Calculate inter-prompt stability by annotating with paraphrased prompts
    and bootstrapping Krippendorff's Alpha (KA) per temperature.
    Args:
        original_text (str): Base text for the prompt.
        prompt_postfix (str): Modifier text added to each paraphrase.
        nr_variations (int): Number of paraphrases to generate.
        temperatures (list): Temperatures used when generating paraphrases.
        iterations (int): Number of annotation runs per paraphrase.
        bootstrap_samples (int): Number of bootstrap resamples for KA.
    Returns:
        dict: Per-temperature lists of KA scores (mean_alpha, ci_lower, ci_upper).
    """
    # Dictionary for KA scores and a list for all annotated data
    ka_scores = {}
    all_annotated = []
    # For each temperature, generate paraphrased prompts and annotate with each
    for temp in temperatures:
        paraphrases = generate_paraphrases(
            original_text, prompt_postfix, nr_variations, temp
        )  # Assumes a 'generate_paraphrases' function exists
        annotated = []
        for i in range(1, iterations + 1):
            for paraphrase in paraphrases:
                for d in data:  # Assumes 'data' is defined elsewhere
                    annotation = annotate(d, paraphrase)  # Assumes 'annotate' exists
                    # Store each annotation with its prompt and run metadata
                    annotated.append({
                        "id": d["id"],
                        "text": d["text"],
                        "annotation": annotation,
                        "prompt_id": paraphrase["id"],
                        "prompt": paraphrase["text"],
                        "original": original_text,
                        "temperature": temp,
                        "iteration": i,
                    })
        all_annotated.append(pd.DataFrame(annotated))
    # Combine all annotated data
    all_annotated_df = pd.concat(all_annotated, ignore_index=True)
    # Bootstrap KA per temperature over the accumulated iterations; here the
    # paraphrased prompts act as the "coders"
    for temp in temperatures:
        temp_data = all_annotated_df[all_annotated_df["temperature"] == temp]
        ka_scores[temp] = []
        for i in range(1, iterations + 1):
            subset_data = temp_data[temp_data["iteration"] <= i]
            mean_alpha, ci_lower, ci_upper = bootstrap_krippendorff(
                subset_data, bootstrap_samples
            )  # Assumes a 'bootstrap_krippendorff' function exists
            ka_scores[temp].append({
                "iteration": i,
                "mean_alpha": mean_alpha,
                "ci_lower": ci_lower,
                "ci_upper": ci_upper,
            })
    return ka_scores
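For illustration, a hypothetical call, assuming data and the helper functions above are all defined (every value here is made up):

# Hypothetical usage (all values illustrative)
data = [{"id": 1, "text": "Example tweet one."},
        {"id": 2, "text": "Example tweet two."}]

intra_scores = intra_prompt_stability(
    original_text="Classify the following text as positive (1) or negative (0).",
    prompt_postfix="Respond with a single number.",
    iterations=10,
    bootstrap_samples=500,
)

inter_scores = interprompt_stability(
    original_text="Classify the following text as positive (1) or negative (0).",
    prompt_postfix="Respond with a single number.",
    nr_variations=5,
    temperatures=[0.5, 1.0, 2.0, 3.0],
    iterations=10,
    bootstrap_samples=500,
)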
To generate variants we use PEGASUS, iterating through temperatures.
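A minimal sketch of the assumed generate_paraphrases helper; the tuner007/pegasus_paraphrase checkpoint and the decoding settings are our assumptions, since only PEGASUS itself is named:

# Sketch: generate_paraphrases with PEGASUS (checkpoint is an assumption)
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_paraphrases(original_text, prompt_postfix, nr_variations, temperature):
    """Paraphrase the instruction text at a given temperature; keep the postfix fixed."""
    batch = tokenizer([original_text], truncation=True, padding="longest",
                      max_length=60, return_tensors="pt")
    outputs = model.generate(**batch, max_length=60, do_sample=True,
                             temperature=temperature,
                             num_return_sequences=nr_variations)
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Re-attach the fixed postfix and give each variant an id
    return [{"id": k, "text": f"{t} {prompt_postfix}"} for k, t in enumerate(texts)]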
To annotate data we use gpt-3.5 from OpenAI.
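A minimal sketch of the assumed annotate helper via the OpenAI chat completions API (the model string and message layout are illustrative):

# Sketch: annotate with the OpenAI chat completions API (illustrative)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(d, prompt):
    """Ask the model to label a single data point with the given prompt."""
    # 'prompt' is a string in intra-prompt runs and a paraphrase dict otherwise
    prompt_text = prompt["text"] if isinstance(prompt, dict) else prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": prompt_text},
            {"role": "user", "content": d["text"]},
        ],
    )
    # Expect a single label in the reply
    return response.choices[0].message.content.strip()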
We call this:
“Prompt stability scoring”
And the score:
“Prompt stability scores (PSS)”
We provide a Python package, promptstability, to do this.
Christopher Barrie (NYU), Elli Palaiologou, Petter Törnberg (Amsterdam).